Algorithms and Data Structures (Part 2)Article index:
Searching and Sorting
Computers are very good at performing certain repetitive tasks—if they are programmed correctly. That can be a big “if.” It is important to realize that any mistake that is made in code can be amplified exponentially by the power of a computer. If your algorithm has a tiny flaw that makes it less efficient that flaw will be magnified as the task is repeated millions and billions of times by a computer.
The tasks of searching and sorting are both comparable to everyday tasks completed by humans. Humans search and sort objects, papers and things, whereas computers sort data. It is easy to compare these type of tasks to the human perspective to explain them. For instance, we search for things all the time. “Where did my keys go?” you may have asked for the hundredth time in your life. We try to find our favorite shirt, our wallet, our cup of coffee: “I swear it was just here a minute ago!” Usually there is a method to our searching for things. Sometimes we want to find something that is kept with other things of the same type. This is why we have things like a silverware drawer. We know that in this drawer, we keep our silverware. And what helps us find the fork? The silverware is sorted by type. So we also sort things and as it turns out, sorting things makes it easier for us to search for things. This is true for computers as well.
We began this unit with a discussion of Google search—the search of all searches in the biggest database ever made, the internet. Usually programmers don’t have such a daunting search space, they have a simple data set, something like a linked list or an array.
Gathering and Sorting Data
When data is input into a computer, the process depends on the type of data collected. It’s similar to real life—sometimes things come to you randomly, but sometimes they are already in order. Or, as things often go, they are in a kind of order, but not perfectly arranged. It may be that once data is input into a computer that it is happily already sorted, but most times this is not the case. We must do some additional processing of the data: a sort.
Let’s say you have a box full of kids toys and you want to sort them in some kind of order. First you have to choose what the sort criteria will be. What will you use to sort these toys? You could sort them alphabetically by name, but what if the stuffed “bear” is also called “Peter”? Do you use “b” or “p”? It can be ambiguous. You could sort by size of the toy—but would you be using length or total volume? Sorting criteria must be exact. Even when sorting alphabetically decisions must be made—is there a different between capital and lowercase? Which is first?
Computers cannot work with ambiguity—everything must be exact. Fortunately, all data in a computer is represented in binary, and we can sort by the value of these binary numbers. The most common sorts, then, would be by numeric value or by alpha-numeric codes using one of the common computer encoding systems such as Unicode. Since each letter and symbol has a binary value attached to it, that value can be used to sort.
Of course, the sort must be either ascending or descending; in other words, smallest to largest or vice versa. Once you have figured out your sorting method, it’s time to sort.
Looking back at our box of toys, imagine what you would do first if we sorted by largest toy. Your eye would quickly scan all the toys for the one taking up the most space. Then you’d take it and put it first (or last) in the order. You are making order out of chaos. Each time when you pick the next toy, you will scan them all—and each time you do this, the number of toys left is smaller so it should take slightly less time. The last toy will take almost no time since it is the only one left and no visual scanning is required. Some steps will be easier since the next toy choice might be obvious, but some steps might take longer because two toys might be very close in size (or identical, which also requires a rule).
As you can see, sorting is a bit of a complex task that takes time and also must have a very specific set of criteria or rules. In fact, we could call our set of rules for searching an algorithm.
Searching for Data
Let’s imagine a different kind of scenario. Today, we live in a very digital world, but not long ago everyone in the United States and many other countries had their own phone book. This enormous tome listed the phone numbers of everyone (with permission) in the local area, along with the numbers of businesses.
Today, phonebooks are still to be found littering the landscape, an echo of the age of paper. However, phonebooks serve as a perfect example of sorting and searching. The list of all of the phone numbers was first sorted. Of course, the publishers could have chosen to sort by number, but most people were looking for the number and already knew the name. The sorting algorithm for a phone book was decided as the last name followed by the first name. Once all the names were sorted, the phone numbers and addresses were printed next to them in the book.
Once the books were delivered, a person searching for the name could open the book and look. Let’s say the person in our example is Ken, and he is looking up Josh Rodriguez. How would it be best for him to search? You might say that he should open the book near the end, since “R” is farther down through the alphabet. However, there may be many names that start with “S” and so the bulk of the last part of the book would be those names. However, Ken decides to choose his starting point, he will then have to decide whether to go forward or backward in the book. If he opens about three quarters of the way through and sees the letter “S,” he will go backwards. However, if he opens to “O,” he knows he must go forward in the book. He also can search through one page at a time, looking at the letters, or simply skip a number of pages to see what letter he finds next.
Obviously, leafing through every page will take more time than jumping to a new page, but once you are close to the answer, leafing might become faster. It really depends on the data. We can pick a search algorithm for Ken and make it very simple. We tell him, “Ken, this is what you’re going to do. You open the book halfway. Then, you decide to go forward or backward. Once you decide that, you will split the remaining pages in half and open to that page. You will then decide again and keep splitting the pages in half until you have arrived at Mr. Rodriguez.” This is actually what we call a binary search, since it continually splits the data size into two.
There are many ways we could tell Ken, or a computer, to search a data set.
Let’s take a look at some common searching rules. Some are more efficient than others. Many of them depend on the data itself, and data could be random, sorted, partially sorted, or in some other unusual order.
This kind of search is pretty simple. It starts at the beginning and looks at each element in order. This would be like telling Ken to start at page one of the phonebook and read every single entry until he found Josh Rodriguez. If he was lucky, Josh would be near the beginning, but with a last name of “R” it seems unlikely.
For this one, we need a little more information. Let’s say the book is 2,500 pages long. We calculate the square root of n, the number of pages, which gives us 50, which we will use for the jump size, or m. Ken will first jump to page 50 and see if Josh is higher or lower, then keep jumping until he is past Josh. Then, he will go back to the previous jump and do a linear search (page by page) on the next 50 pages to find him. As you can see, this is definitely an improvement over a straight linear search.
This is the one we have already described in our first example. Ken would start at page 1,250 of our 2,500-page book, and if Josh was higher in the list then jump half the remaining space, 625 pages, and keep splitting it until he arrived on the proper page.
Depending on the data itself, any of these techniques could be faster or slower than the others. The last two are also dependent on the data being sorted first. Since a linear search simply reads every entry until it finds the correct one, it will work on sorted or unsorted data lists.
We have seen that there is an advantage to searching data if it is first sorted. But how is the data sorted? There are also several different methods we can instruct a computer to use to organize our data.
If we go back to our toy box example, we can imagine a computer doing something similar. However, our brains work a bit differently. We can look at a collection of objects and pick out the largest one without pointing our eye at each one separately—it’s one of the amazing things our brain does for us. However, a computer has to look at each element individually—it can’t look at an entire array or linked list from “above” like we do to pick. For a computer it is more like looking through a phone book where the pages are hidden: it can’t see them until it opens each one.
One of the simplest types of sorting algorithms is the bubble sort. You can imagine how it works by thinking of the largest values “bubbling” to the top, like air bubbles in your drink. The rules for bubble search are this: starting at the beginning, compare the first two values. If the first one is larger, move it to the second. If it is not, they are already in order. Then compare the next two values and do the same until you reach the end. Once this is done, do it again for every element except the last one, since it will be in order now. Each pass you complete will guarantee one more element is in order. Others may happen to be in order, but the bubble sort will check them all regardless.
Let’s use a simple array for our example.
(Kayli, Katie, John, Nathan, Anna)
As you can see it has five elements. The first step is to compare Kayli and Katie. Since “y” is after “t” they are out of order, so bubble search will switch them.
(Katie, Kayli, John, Nathan, Anna)
Then Kayli and John are compared. Kayli comes after John, so they are switched.
(Katie, John, Kayli, Nathan, Anna)
Kayli and Nathan are compared and found to be in the correct order, so the list is unchanged for this step.
(Katie, John, Kayli, Nathan, Anna)
Finally, Nathan and Anna are compared, and Nathan is switched with Anna.
(Katie, John, Kayli, Anna, Nathan)
With this method we now know that the last element, “Nathan” is in its proper place. Now we must compare the remaining four elements. After that, then the remaining three, then the last two. A regular bubble sort will always take the same number of steps based on the list size. However, a modified bubble sort could be instructed to stop if the list is sorted just by detecting if there were no swaps made. If no swaps are needed, the list is already in order. You can see that the number of comparisons in each pass is n-1, where n is the array size. In this case there were four comparisons on five elements.
Binary insertion sort
The binary insertion sort is based on the same method we used for a binary search—by splitting the data in half repeatedly. To sort the data, it builds a new output list and takes items from the input list. The input list are items yet to be sorted, and the output are the ones already sorted. Imagine you’re sorting your toys in the toy box.
First you dump them all on the floor—that’s your input list. Then you place them back in the box one at a time in order, that’s the output list. This example will make more sense with a larger list. Let’s use numerical data this time and assume that the first five elements have already been sorted. So far, we have:
Input (66, 3, 23, 76, 27, 1, 11): yet to be sorted
Output (13, 35, 37, 44, 56): sorted
The original list, then, had 12 elements. We now reach down for the next toy or element in the list, 66. Remember a computer can’t see a top down view like we do of all the items, it needs to check them individually. However, we can speed this up by splitting the ordered list in half, just like Ken did with the phonebook. The computer looks at the middle element out of the five, which is the third element, 37. 66 is higher, so now it jumps to the top half of the data. It then splits the difference again and looks at the fourth element, which is 44. 66 is larger, so it splits the two remaining into a space of one—now we are at the last comparison. 66 is larger than 56 so it is added to the end of the list. We now have:
Input (3, 23, 76, 27, 1, 11)
Output (13, 35, 37, 44, 56, 66)
This took three comparisons for a five-element list. A bubble sort would take four comparisons for each pass, so generally speaking, this algorithm is an improvement.
The next step would continue with the input of 3. Since there are an even number of elements (six) we have to split the middle with either the third or fourth. Let’s say the algorithm chooses the third. We compare 3 to 37. It is lower, so we split the remaining space and compare it to 35. 3 is lower again and we compare it to 13. It is lowest again, so 3 is placed at the beginning of the list. The advantage of the binary insertion sort is that it can at the very least eliminate comparing to half of the data elements in the list in the first step. Now we have:
Input (23, 76, 27, 1, 11)
Output (3, 13, 35, 37, 44, 56, 66)
This process would continue as each element is added to the output list in order, so when the last element is added, the list will be sorted.
Search algorithms get different results based on the data set, but Quicksort is regarded as one of the most efficient algorithms overall. Quicksort is more complex than the previous algorithms, but the explanation of how it works is simple. Imagine a teacher with a stack of graded papers. The teacher wants to sort all of the papers by grade. The pile is split into all the grades higher than 55 (this number is the pivot and can be chosen by different methods) and those less than 55. Then the two piles are split around another pivot bringing the total number of piles to four. The teacher then sorts each pile separately and when they are done, can simply put all the piles back together in order from lowest pile to highest. It can be thought of as a “divide and conquer” type of method. In this example, the stack was split into four piles but, in reality, the data determines how many stacks it is split into for sorting.