Hi, I'm Carrie Anne, and welcome to Crash Course Computer Science!
Last episode, we discussed a few example classic algorithms, like sorting a list of numbers
and finding the shortest path in a graph.
What we didn't talk much about, is how the data the algorithms ran on was stored in computer
memory.
You don't want your data to be like John Green's college dorm room, with food, clothing
and papers strewn everywhere.
Instead, we want our data to be structured, so that it's organized, allowing things
to be easily retrieved and read.
For this, computer scientists use Data Structures!
INTRO
We already introduced one basic data structure last episode, Arrays, also called lists or
Vectors in some languages.
These are a series of values stored in memory.
So instead of just a single value being saved into a variable, like 'j equals 5', we
can define a whole series of numbers, and save that into an array variable.
To be able to find a particular value in this array, we have to specify an index.
Almost all programing languages start arrays at index 0, and use a square bracket syntax
to denote array access.
So, for example, if we want to add the values in the first and third spots of our array
'j', and save that into a variable 'a', we would write a line of code like this.
How an array is stored in memory is pretty straightforward.
For simplicity, let's say that the compiler chose to store ours at memory location 1,000.
The array contains 7 numbers, and these are stored one after another in memory, as seen here.
So when we write "j index of 0", the computer goes to memory location 1,000, with an offset
of 0, and we get the value 5.
If we wanted to retrieve "j index of 5", our program goes to memory location 1000,
plus an offset of 5, which in this case, holds a value of 4.
It's easy to confuse the fifth number in the array with the number at index 5.
They are not the same.
Remember, the number at index 5 is the 6th number in the array because the first number
is at index 0.
Arrays are extremely versatile data structures, used all the time, and so there are many functions
that can handle them to do useful things.
For example, pretty much every programming language comes with a built-in sort function,
where you just pass in your array, and it comes back sorted.
So there's no need to write that algorithm from scratch.
Very closely related are Strings, which are just arrays of characters, like letters, numbers,
punctuation and other written symbols.
We talked about how computers store characters way back in Episode 4.
Most often, to save a string into memory, you just put it in quotes, like so.
Although it doesn't look like an array, it is.
Behind the scenes, the memory looks like this.
Note that the string ends with a zero in memory.
It's not the character zero, but the binary value 0.
This is called the null character, and denotes the end of the string in memory.
This is important because if I call a function like "print quote", which writes the string
to the screen, it prints out each character in turn starting at the first memory location,
but it needs to know when to stop!
Otherwise, it would print out every single thing in memory as text.
The zero tells string functions when to stop.
Because computers work with text so often, there are many functions that specifically
handle strings.
For example, many programming languages have a string concatenation function, or "strcat",
which takes in two strings, and copies the second one to the end of the first.
We can use arrays for making one dimensional lists, but sometimes you want to manipulate
data that is two dimensional, like a grid of numbers in a spreadsheet, or the pixels
on your computer screen.
For this, we need a Matrix.
You can think of a Matrix as an array of arrays!
So a 3 by 3 matrix is really 2 an array of size 3, with each index storing an array of
size 3.
We can initialize a matrix like so.
In memory, this is packed together in order like this.
To access a value, you need to specify two indexes, like "J index of 2, then index
of 1" - this tells the computer you're looking for the item in subarray 2 at position 1.
And this would give us the value 12.
The cool thing about matrices is we're not limited to 3 by 3 -- we can make them any
size we want -- and we can also make them any number of dimensions we want.
For example, we can create a five dimensional matrix and access it like this.
That's right, you now know how to access a five dimensional matrix -- tell your friends!
So far, we've been storing individual numbers or letters into our arrays or matrices.
But often it's useful to store a block of related variables together.
Like, you might want to store a bank account number along with its balance.
Groups of variables like these can be bundled together into a Struct.
Now we can create variables that aren't just single numbers, but are compound data
structures, able to store several pieces of data at once.
We can even make arrays of structs that we define, which are automatically bundled together
in memory.
If we access, for example, J index of 0, we get back the whole struct stored there, and
we can pull the specific account number and balance data we want.
This array of structs, like any other array, gets created at a fixed size that can't
be enlarged to add more items.
Also, arrays must be stored in order in memory, making it hard to add a new item to the middle.
But, the struct data structure can be used for building more complicated data structures
that avoid these restrictions.
Let's take a look at this struct that's called a "node".
It stores a variable, like a number, and also a pointer.
A pointer is a special variable that points, hence the name, to a location in memory.
Using this struct, we can create a linked list, which is a flexible data structure that
can store many nodes.
It does this by having each node point to the next node in the list.
Let's imagine we have three node structs saved in memory, at locations 1000, 1002 and 1008.
They might be spaced apart, because they were created at different times, and other data
can sit between them.
So, you see that the first node contains the value 7, and the location 1008 in its "next"
pointer.
This means that the next node in the linked list is located at memory location 1008.
Looking down the linked list, to the next node, we see it stores the value 112 and points
to another node at location 1002.
If we follow that, we find a node that contains the value 14 and points back to the first
node at location 1000.
So this linked list happened to be circular, but it could also have been terminated by
using a next pointer value of 0 -- the null value -- which would indicate we've reached
the end of the list.
When programmers use linked lists, they rarely look at the memory values stored in the next
pointers.
Instead, they can use an abstraction of a linked list, that looks like this, which is
much easier to conceptualize.
Unlike an array, whose size has to be pre-defined, linked lists can be dynamically extended or
shortened.
For example, we can allocate a new node in memory, and insert it into this list, just
by changing the next pointers.
Linked Lists can also easily be re-ordered, trimmed, split, reversed, and so on.
Which is pretty nifty!
And pretty useful for algorithms like sorting, which we talked about last week.
Owing to this flexibility, many more-complex data structures are built on top of linked lists
The most famous and universal are queues and stacks.
A queue – like the line at your post office – goes in order of arrival.
The person who has been waiting the longest, gets served first.
No matter how frustrating it is that all you want to do is buy stamps and the person in
front of you seems to be mailing 23 packages.
But, regardless, this behavior is called First-In First-Out, or FIFO.
That's the first part.
Not the 23 packages thing.
Imagine we have a pointer, named "post office queue", that points to the first node in
our linked list.
Once we're done serving Hank, we can read Hank's next pointer, and update our "post
office queue" pointer to the next person in the line.
We've successfully dequeued Hank -- he's gone, done, finished.
If we want to enqueue someone, that is, add them to the line, we have to traverse down
the linked list until we hit the end, and then change that next pointer to point to
the new person.
With just a small change, we can use linked lists as stacks, which are LIFO…
Last-In First-Out.
You can think of this like a stack of pancakes... as you make them, you add them to the top
of stack.
And when you want to eat one, you take them from the top of the stack.
Delicious!
Instead of enqueueing and dequeuing, data is pushed onto the stack and popped from the stacks.
Yep, those are the official terms!
If we update our node struct to contain not just one, but two pointers, we can build trees,
another data structure that's used in many algorithms.
Again, programmers rarely look at the values of these pointers, and instead conceptualize
trees like this: The top most node is called the root.
And any nodes that hang from other nodes are called children nodes.
As you might expect, nodes above children are called parent nodes.
Does this example imply that Thomas Jefferson is the parent of Aaron Burr?
I'll leave that to your fanfiction to decide.
And finally, any nodes that have no children -- where the tree ends -- are called Leaf Nodes.
In our example, nodes can have up to two children, and for that reason, this particular data
structure is called a binary tree.
But you could just as easily have trees with three, four or any number of children by modifying
the data structure accordingly.
You can even have tree nodes that use linked lists to store all the nodes they point to.
An important property of trees – both in real life and in data structures – is that
there's a one-way path from roots to leaves.
It'd be weird if roots connected to leaves, that connected to roots.
For data that links arbitrarily, that include things like loops, we can use a graph data
structure instead.
Remember our graph from last episode of cities connected by roads?
This can be stored as nodes with many pointers, very much like a tree, but there is no notion
of roots and leaves, and children and parents…
Anything can point to anything!
So that's a whirlwind overview of pretty much all of the fundamental data structures
used in computer science.
On top of these basic building blocks, programmers have built all sorts of clever variants, with
slightly different properties -- data structures like red-black trees and heaps, which we don't
have time to cover.
These different data structures have properties that are useful for particular computations.
The right choice of data structure can make your job a lot easier, so it pays off to think
about how you want to structure your data before you jump in.
Fortunately, most programming languages come with libraries packed full of ready-made data structures.
For example, C++ has its Standard Template Library, and Java has the Java Class Library.
These mean programmers don't have to waste time implementing things from scratch, and
can instead wield the power of data structures to do more interesting things, once again
allowing us to operate at a new level of abstraction!
I'll see you next week.
Không có nhận xét nào:
Đăng nhận xét