Information Representation (Part 1)
Number Representation: Formats
Human beings have ten fingers. It is a common reason given for why we use a base 10 numbering system: decimal. Decimal uses ten symbols, 0 through 9, to represent all possible numbers. Computers, however, don’t have fingers. What early computers did have were switches. These could be turned on or off, giving two possibilities and a base 2 numbering system: binary. There are two symbols, 0 and 1, that are used in binary. Since humans and machines use different numbering systems, translation must occur.
In decimal, the number 806,412 is easily decipherable to a human. We recognize that each successive number to the left of the first one represents the next power of ten. In other words, 800,000 + 6,000 + 400 + 10 + 2 = 806,412. Most of us learned this very early in school. Each time a digit is added, there are ten new combinations of possibilities, which is why we multiply by a power of ten. In binary, there are only two possibilities for each digit, which means adding a digit creates a power of 2 more possibilities.
The eight-digit binary number 00101101 represents a numeric value to a computer. A human would have to translate it this way: 32 + 8 + 4 + 1 = 45. There is a 1 in the 32's place, the 8's place, the 4's place, and the 1's place. There are no 16s or 2s. This demonstrates how to translate binary into decimal. Since a byte is a common unit of data in computing, it is useful to know that 2^8 is 256, meaning that 8 digits in binary can represent 256 combinations, or 0 to 255. Just as we can keep adding digits in decimal to get larger numbers, we can add more digits in binary to increase the value.
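The place-value translation described above can be sketched in a few lines of Python. This is only an illustration; the function name binary_to_decimal is a made-up helper, not something from the article.

```python
def binary_to_decimal(bits: str) -> int:
    """Add up the place value (a power of 2) of every 1 in the string."""
    total = 0
    # Walk the digits from right to left: positions 0, 1, 2, ...
    for position, digit in enumerate(reversed(bits)):
        if digit == "1":
            total += 2 ** position  # place values 1, 2, 4, 8, 16, 32, ...
    return total


print(binary_to_decimal("00101101"))  # 45  (32 + 8 + 4 + 1)
print(binary_to_decimal("11111111"))  # 255 (largest value in one byte)
print(2 ** 8)                         # 256 combinations in 8 binary digits
```

Python's built-in int("00101101", 2) performs the same conversion; the loop is spelled out here only to mirror the reasoning in the text.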
It is simple to convert from decimal to binary—the process is to keep dividing the decimal number by 2 while recording the remainder. This will reveal the binary number (in reverse order of digits). For instance, let’s look at the number 97 in decimal and change it to binary.
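As a rough sketch of the repeated-division method, the Python below converts 97 to binary. The helper name decimal_to_binary is hypothetical and chosen only for this example.

```python
def decimal_to_binary(number: int) -> str:
    """Divide by 2 repeatedly, recording each remainder."""
    if number == 0:
        return "0"
    remainders = []
    while number > 0:
        number, remainder = divmod(number, 2)
        remainders.append(str(remainder))
    # The remainders come out lowest digit first, so reverse them.
    return "".join(reversed(remainders))


# Steps for 97:
#   97 / 2 = 48 remainder 1
#   48 / 2 = 24 remainder 0
#   24 / 2 = 12 remainder 0
#   12 / 2 =  6 remainder 0
#    6 / 2 =  3 remainder 0
#    3 / 2 =  1 remainder 1
#    1 / 2 =  0 remainder 1
print(decimal_to_binary(97))  # 1100001
```

Reading the remainders from the last step back up to the first gives 1100001, which is 64 + 32 + 1 = 97.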
Bytes and Data Measurement
Since a single bit only represents two possibilities, a larger grouping of bits, the byte (8 bits), is often used for measuring data. A byte can represent simple data like a letter from a certain alphabet, a color, or a musical note. Different types of data can take up varying amounts of space, which are measured in bytes or multiples of bytes.
For instance, a Word document might take up around 28KB of storage, a JPG photograph about 3MB, and a DVD movie around 4GB. Most modern hard drives are capable of storing around 1TB of data. Video, applications (especially video games), and operating systems take up the largest amounts of data storage. Notice that each naming convention above is 1,000 times the previous.
You may have observed that it takes quite a large number of digits to represent numbers in binary in comparison to decimal. A good method to abbreviate binary is to use the octal number system, which, as you may have surmised from the name, is a base 8 system. This gives us eight symbols, 0 through 7, as possible values for each digit. Each digit's place value is a power of 8, and since 8 is itself a power of 2 (2^3), it maps nicely: one digit of octal can represent exactly three digits in binary.
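The three-to-one mapping between binary and octal digits can be illustrated with a short Python sketch. The function name binary_to_octal is an assumption made for this example.

```python
def binary_to_octal(bits: str) -> str:
    """Group binary digits in threes; each group becomes one octal digit."""
    # Pad with leading zeros so the length is a multiple of 3.
    padded_length = (len(bits) + 2) // 3 * 3
    padded = bits.zfill(padded_length)
    groups = [padded[i:i + 3] for i in range(0, len(padded), 3)]
    return "".join(str(int(group, 2)) for group in groups)


# 00101101 -> 000 101 101 -> 0 5 5
print(binary_to_octal("00101101"))  # 055
print(int("055", 8))                # 45, the same value as binary 00101101
```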
As you can see, though we are using some of the same symbols in all three systems, they mean different things. When we write “10” in decimal, to us humans it means 10. In octal, “10” means 8 to us, and “10” means 2 in binary. In fact, there is an old joke which illustrates this concept.
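To see how the same written symbols take on different values, this small Python sketch reads the string "10" in each of the three bases discussed so far. The variable name readings is illustrative only.

```python
# Interpret the same two symbols, "10", in decimal, octal, and binary.
readings = {base: int("10", base) for base in (10, 8, 2)}

for base, value in readings.items():
    print(f'"10" read in base {base} is {value} in decimal')
# "10" read in base 10 is 10 in decimal
# "10" read in base 8 is 8 in decimal
# "10" read in base 2 is 2 in decimal
```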
Though octal may be useful in some situations, hexadecimal is the preferred intermediate between decimal and binary. One digit of hexadecimal (or "hex") can represent four digits of binary. Hexadecimal is base 16, which means there are sixteen possible symbols per digit. Since there are only ten decimal digits to use as symbols, hex recruits letters from the alphabet for the remaining six.
In computer applications, hexadecimal numbers are usually written in groups of two, since two hex digits represent eight digits in binary (a byte). You will often see a leading zero before a hex number to represent this, as in the table above. Hex is often used to view raw data in a file; it is a sort of shorthand for binary. It represents 16 different possibilities with one digit, 0 through F. "A" represents decimal 10, "B" represents 11, and so on up to "F", which represents 15.
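A minimal Python sketch of how one byte becomes two hex digits follows. The helper byte_to_hex is hypothetical; it splits the byte into its upper and lower four bits and looks each half up in the symbol list 0 through F.

```python
def byte_to_hex(value: int) -> str:
    """Write one byte (0 through 255) as exactly two hex digits."""
    if not 0 <= value <= 255:
        raise ValueError("a byte holds values 0 through 255")
    digits = "0123456789ABCDEF"
    # Each hex digit covers four binary digits: upper half, then lower half.
    high, low = divmod(value, 16)
    return digits[high] + digits[low]


print(byte_to_hex(45))   # 2D  (binary 0010 1101)
print(byte_to_hex(10))   # 0A  (note the leading zero)
print(byte_to_hex(255))  # FF
```

Python's format(45, "02X") gives the same result; the manual version is shown only to make the four-bit grouping visible.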
Often, computer scientists may want to analyze the contents of a file. Viewing it in binary is very difficult and nearly meaningless for a human. “Hex editors” are often used that show a view of the hex translated content, as are ASCII decoders.
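The kind of view a hex editor offers can be approximated with the Python sketch below. The function hex_dump and its layout (offset, hex bytes, then printable characters) are assumptions for illustration, not the format of any particular tool.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Show raw bytes as hex pairs next to their printable ASCII characters."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{byte:02X}" for byte in chunk)
        # Bytes outside the printable ASCII range are shown as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08X}  {hex_part:<{width * 3 - 1}}  {text_part}")
    return "\n".join(lines)


print(hex_dump(b"Hello, world!\n"))
```

To inspect a real file, the same function could be given the result of reading the file in binary mode, for example open(path, "rb").read(), where path is whatever file you want to examine.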
Computing and Numbering
Since computers use binary, people must use different ways to represent data that are more compatible with human thinking and concepts. These ways include different codes, such as ASCII, and different numbering systems such as octal and hexadecimal. Hexadecimal is often used as a go-between for humans since: a) it has an exact 1:4 ratio of digits with binary, and b) it has more symbols, so it is easier for humans to read. Displaying data in hex takes one quarter of the space that binary does.
Representation of Non-Numerical Information
We have demonstrated how decimal is converted to binary and stored in a computer. This will work for any data set that consists only of numbers. However, human data are also stored in words, images, and even sound. This means that to store this information on a computer we must also have a system to convert all of this to ones and zeroes.
Everything is Binary
Think of your favorite song right now. On a computer, it is stored in binary. The entirety of all the subtle notes and sounds are captured as zeroes and ones. What about a beautiful painting? This, too, is broken down into zeroes and ones and then displayed to you as light on a monitor screen. From Handel’s greatest works to The Rolling Stones, from DaVinci to Dali, a computer sees everything, stores everything, as binary numbers. The Bible, the writings of Goethe, the sayings of Confucius—all become digital data when stored on a computer. How is this done? It all depends on context.
One of the simpler ways to represent data is to convert alphabetical letters into numbers. Looking back to our decoder ring from the 1930s, we can see that it is a simple matter to assign a specific number to a letter. But what about upper and lowercase? Then we create separate numbers for that. Punctuation? Once again, they get unique numbers. Once we have decided how many letters and symbols we have to represent (including things like spaces) then we can create a 1-to-1 assignment of numbers. It goes without saying, of course, that these numbers will all be in binary. We have discussed ASCII before, but this concept was expanded for more modern computers and international languages with an encoding standard called Unicode.
ASCII can only display English characters (with a few exceptions), while Unicode was designed to represent most of the alphabets across the entire globe. It includes world currency symbols and italics. In its original 16-bit form, it could represent a total of 2^16, or 65,536, different symbols; the standard has since grown beyond 16 bits to make room for more. Unicode is often used for internet communication, from email to webpages. With Unicode we are able to represent raw, unformatted characters and symbols in many languages, but to create something like a Word document, another layer of complexity is needed.
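The one-to-one assignment of numbers to characters can be sketched in Python as follows. The helper to_bits is hypothetical, and the choice of the "utf-16-be" encoding for the euro sign is an assumption made here to mirror the 16-bit description; it is only one of several ways Unicode text is stored in practice.

```python
def to_bits(text: str, encoding: str) -> str:
    """Encode text and show each resulting byte as eight binary digits."""
    return " ".join(f"{byte:08b}" for byte in text.encode(encoding))


# Upper and lowercase letters get separate numbers.
print(ord("A"), to_bits("A", "ascii"))  # 65 01000001
print(ord("a"), to_bits("a", "ascii"))  # 97 01100001

# The euro sign (a currency symbol) has no ASCII number but does have a
# Unicode one; in a 16-bit encoding it takes two bytes.
euro = "\u20ac"
print(hex(ord(euro)), to_bits(euro, "utf-16-be"))  # 0x20ac 00100000 10101100
```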
In 2007, Microsoft released an update of their Office software suite which included a new way to store its data files. This new method was based on XML, or Extensible Markup Language. A document, Word or otherwise, must have other information besides letters and symbols: this includes margins, page size, fonts, sizes, and colors, for example. Put simply, Word documents and other documents use binary codes to “tag” different sections of text with different attributes such as bold or a certain indent. This, of course, is also all stored in binary in a very specific order. When a Word document is read by an application, the binary is interpreted according to the rules for the file type and it is displayed on your screen.
A Word document (or any other data file) is a meaningless string of zeroes and ones without context. When you use the Word application to open a Word document, it knows the specific context, and therefore exactly what all those numbers mean.
Photos and other graphics that are displayed on a screen are broken down into pixels, or very small squares (sometimes rectangles). For instance, right now you are viewing a document that is stored as text and formatting; these must be converted into pixels on your screen. When a human uses a computer, several layers of conversion and translation are continually occurring, carried out by the operating system and the application. Once we developed large enough data storage, humans were able to design computers with very small pixels so that millions of them could fit on your display screen, giving you a high resolution, which makes photographic images look realistic and text very readable.
Many displays now use 24-bit color. This means there are 8 bits each for red, blue, and green. These colors of light, combined in different intensities, can reproduce a wide range of the colors in visible light. To store any image as binary, it is broken up into pixels which are each given a binary value that shows how much red, blue, and green light intensity to display it with.
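One way to picture a 24-bit pixel is as three bytes packed side by side. The Python sketch below assumes a red-green-blue byte order, which is common but not the only layout in use; pack_rgb and unpack_rgb are hypothetical helper names.

```python
def pack_rgb(red: int, green: int, blue: int) -> int:
    """Combine three 8-bit intensities into one 24-bit pixel value."""
    for channel in (red, green, blue):
        if not 0 <= channel <= 255:
            raise ValueError("each channel must fit in 8 bits (0 to 255)")
    # Assumed layout: red in the top 8 bits, then green, then blue.
    return (red << 16) | (green << 8) | blue


def unpack_rgb(pixel: int) -> tuple:
    """Split a 24-bit pixel value back into its three intensities."""
    return (pixel >> 16) & 0xFF, (pixel >> 8) & 0xFF, pixel & 0xFF


orange = pack_rgb(255, 128, 0)
print(f"{orange:024b}")    # 111111111000000000000000
print(f"{orange:06X}")     # FF8000
print(unpack_rgb(orange))  # (255, 128, 0)
```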
Since 8 bits (a byte) can store values from 0 to 255, each color can have that many degrees of intensity. Absence of a color would be zero, while the most intense is 255. Using 255 for all colors will display white, while using 0 for all colors would be black. This requires enough memory to store 24 bits multiplied by the number of pixels on the screen. This means that to store an HD image (1920 x 1080 pixels) with 24-bit (3 byte) color, you would need:
1920 x 1080 x 3 bytes = 6,220,800 bytes
Which, in other words, is about 6MB. Older computers from the 1990s, with only 1MB of memory, had no means of displaying anything with close to this many pixels and colors, which is why images on those screens looked very pixelated.
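The storage estimate for an uncompressed image can be checked with a short Python sketch. The function name image_bytes is an assumption for this example, and the calculation ignores compression and any file header.

```python
def image_bytes(width: int, height: int, bits_per_pixel: int = 24) -> int:
    """Bytes needed to store every pixel of an uncompressed image."""
    return width * height * bits_per_pixel // 8


hd = image_bytes(1920, 1080)
print(hd)                            # 6220800
print(f"{hd / 1_000_000:.1f} MB")    # 6.2 MB
print(hd > 1_000_000)                # True: too large for 1MB of memory
```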
Audio and Music
Light and sound in the real world exist in waves. Information represented in waves (e.g., an audio cassette tape) is referred to as analog. When we store it on a computer it must be converted into digital data. Does this mean that some of the information is lost? Definitely. Analog waves have an infinite continuum of values, whereas digital representations of waves must simulate them with sampling.
Analog data are sliced into very small time samples and a digital value is assigned for that time slice. The smaller the slices, the more accurate a representation. However, we cannot have infinitely small samples, just as we cannot have infinite pixels.
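Sampling can be sketched in Python by measuring a sine wave at regular time slices and storing each measurement as a whole number. Everything here is an illustrative assumption: the function name sample_wave, the choice of a pure sine wave as the "analog" signal, and the 8-bit default for each stored value.

```python
import math


def sample_wave(frequency_hz: float, sample_rate_hz: int,
                duration_s: float, bit_depth: int = 8) -> list:
    """Measure a sine wave at regular time slices and store whole numbers."""
    levels = 2 ** bit_depth            # 256 possible values for 8 bits
    count = int(sample_rate_hz * duration_s)
    samples = []
    for n in range(count):
        t = n / sample_rate_hz         # time of this slice, in seconds
        amplitude = math.sin(2 * math.pi * frequency_hz * t)  # -1.0 to 1.0
        # Map the continuous amplitude onto the nearest available level.
        samples.append(round((amplitude + 1) / 2 * (levels - 1)))
    return samples


coarse = sample_wave(frequency_hz=1, sample_rate_hz=8, duration_s=1)
fine = sample_wave(frequency_hz=1, sample_rate_hz=64, duration_s=1)
print(len(coarse), len(fine))  # 8 64
print(coarse)
```

Raising the sample rate produces more, smaller slices of the same second of sound, which follows the wave more closely at the cost of more data to store.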
When we look at the world with our eyes, we are not seeing pixels, but continuous waves of light. However, when a computer makes the pixels on the screen small enough our eye simply cannot tell the difference. This is the same principle with audio sampling. If we cut up an audio wave into small enough slices and reassemble it, the human ear cannot tell the difference.
Even before digital video, the same concept was used for film. A movie viewed in the theater is really a series of still images shown rapidly enough that to the human eye (and brain) it seems like real motion.
Whether the audio is music, singing, sound effects, or someone speaking, it is sampled and stored as binary data, where the context tells the computer to reproduce a sound through a speaker so humans can listen.
Context is King
Computers need assistance in deciphering the message we are sending with binary. There are two primary ways that a file (a discrete unit of data) is labeled to contextualize the data. The first, simple way is through the file extension. The three or four letters appended to the filename tell the operating system what type of application is able to read the data in this file properly. Not only that, but files also have headers (or file signatures) at the beginning with specific binary sequences that indicate that the data are an image, a document, an executable program, or something else.
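Checking a file signature can be sketched in Python by comparing the first few bytes of the data against known starting sequences. The table below is a small illustrative subset of well-known signatures, and the names SIGNATURES and identify are made up for this example.

```python
# A few well-known signatures (the bytes a file of that type starts with).
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"\xff\xd8\xff": "JPEG image",
    b"%PDF-": "PDF document",
    b"PK\x03\x04": "ZIP container (the format .docx files are built on)",
}


def identify(data: bytes) -> str:
    """Guess the type of some data from its opening bytes."""
    for magic, description in SIGNATURES.items():
        if data.startswith(magic):
            return description
    return "unknown"


print(identify(b"\xff\xd8\xff\xe0" + bytes(16)))  # JPEG image
print(identify(b"%PDF-1.7\n"))                    # PDF document
print(identify(b"just some text"))                # unknown
```

Because the check looks only at the content, it still works when a file has been given the wrong extension.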
For any specialized type of data, a standard must first be created for storing all of the information in binary. That standard will then be used by any application that wants to read or write these types of data. This is where we get file types such as .jpg, .docx, .exe, .dmg, etc. New file types and data contexts are continually being created.