Computer Science

Information Representation (Part 2)


Data Types

Files, or large blocks of data, need to have context. On a more granular level, individual units of data must also have context. When we work with a data point, it must be defined to the computer as a number, a character, a memory location, or something else. For instance, a Word document contains all kinds of different data: some numerical, some character, some formatting. Each separate character would need a data type to identify it correctly.

Variables

In programming, the coder creates variables to store individual units of data. A variable is a container for information. It has three important defining factors: a name, a type, and a memory location. The name is how the variable is referred to in code, just as if you wrote a label on a box. The memory location is where the computer can find this box in RAM.

Let’s say you create a variable called “AGE.” The computer labels a box with that name and stores it somewhere in memory. But there’s nothing in the box yet, so we need to store a value in it. When the value 37 is assigned to the variable “AGE,” the computer can later find the box, open it, and discover the value 37 inside. Just like in real life, a box can be reused; we can put other values in it at different times. This is why it is called a variable: the value inside it can vary. A constant is different: its value is set once, when it is declared, and cannot change afterward. The data inside must also have a “type.” Just as a paper bag is made to hold solids, not liquids, a variable is designed to hold a specific type of data.

Integers

An integer is a whole number; it contains no fractional component. Integers include the natural numbers (1, 2, 3, etc.), their negatives, and zero. This definition matters to a computer because it does not have to set aside any memory for a fractional part, only for the whole number. In the previous case of the variable “AGE,” an integer would be a good data type to use, since many forms and records list a person’s age as a whole number without a fraction or decimal.

In Java, the integer data type is abbreviated “int.” A declaration that creates a variable for age looks like this:

int age = 0;

When declaring a variable, in some languages a starting value must be given; this is referred to as initializing. In this case, the variable is initialized to zero. It presumes that an actual age will be calculated or input by the user later.
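The ideas so far (declaring, initializing, reusing a variable, and the contrast with a constant) can be sketched in a few lines of Java. This is only an illustration: the class name, the helper nextBirthday, and the constant BIRTH_YEAR are hypothetical names chosen for this example.

```java
class VariableDemo {
    // Hypothetical constant: 'final' means the value can be assigned only once.
    static final int BIRTH_YEAR = 1987;

    // Hypothetical helper: returns the age after one more birthday.
    static int nextBirthday(int age) {
        return age + 1;
    }

    public static void main(String[] args) {
        int age = 0;              // declared and initialized to zero
        age = 37;                 // the box now holds 37
        age = nextBirthday(age);  // the same box is reused for a new value
        System.out.println("age = " + age);
        System.out.println("birth year = " + BIRTH_YEAR);
        // BIRTH_YEAR = 1990;     // would not compile: a final variable cannot be reassigned
    }
}
```

The commented-out line shows the difference in practice: the compiler rejects any attempt to reassign a final variable, while age can be reassigned freely.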

Numerical data types

Other standard data types include longer integers and decimals. The following are some example numerical data types in the programming language Java.

byte: integer from -128 to 127 (8 bits)

int: integer from -2147483648 to 2147483647 (32 bits)

float: a decimal number with a magnitude from about 1.4e-45 to 3.4e+38 (32 bits)

double: a decimal number with a magnitude from about 4.9e-324 to 1.8e+308 (64 bits)

As you can see, each data type has a fixed size in bits, and this is important to programmers. You don’t need to use a huge box to store a pair of sunglasses; it’s a waste of space. Likewise, coders shouldn’t use a large type such as int when a byte will do. Note that a “byte” data type can store 256 different values: 128 of them are negative, 127 are positive, and one is zero. To declare these variables in Java, you’d do something like this:

byte age = 37;

int jellybeans = 2243;

float fluidOunces = 28.73f;

double solarDistance = 932842983.2340273; 

A decimal is called a “float” because the decimal point can float to different places in the number. This is an important distinction for a computer: where is the decimal point? Its position must be stored somewhere in the variable itself. It is something a human would take for granted, but it has to be specified for a computer. Many programming languages are also case-sensitive; this means that the variable “age” is not the same as “Age” or even “AGE.” Good coders should follow a naming convention for all of their variables for consistency.
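To see these sizes and ranges directly, Java exposes them as constants on its wrapper classes. The sketch below prints a few of them and shows what happens when a value does not fit in a byte. The class name and the toByte helper are hypothetical; the constants (Byte.MIN_VALUE, Integer.MAX_VALUE, and so on) are part of the standard library.

```java
class NumericTypesDemo {
    // Casting an int to a byte keeps only the low 8 bits,
    // so values outside -128..127 wrap around.
    static byte toByte(int value) {
        return (byte) value;
    }

    public static void main(String[] args) {
        System.out.println("byte: " + Byte.MIN_VALUE + " to " + Byte.MAX_VALUE
                + " (" + Byte.SIZE + " bits)");
        System.out.println("int: " + Integer.MIN_VALUE + " to " + Integer.MAX_VALUE
                + " (" + Integer.SIZE + " bits)");
        System.out.println("float max: " + Float.MAX_VALUE + " (" + Float.SIZE + " bits)");
        System.out.println("double max: " + Double.MAX_VALUE + " (" + Double.SIZE + " bits)");

        float fluidOunces = 28.73f;   // the 'f' suffix marks a float literal
        double solarDistance = 932842983.2340273;
        System.out.println(fluidOunces + " and " + solarDistance);

        System.out.println(toByte(127));  // fits in a byte
        System.out.println(toByte(128));  // does not fit: wraps to -128
    }
}
```

The wrap-around in the last line is one reason the choice of type matters: a smaller type saves memory, but only if every value the program will ever store actually fits in it.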

Non-numeric data types

Most languages, including Java, have a basic data type to store one “character.” This represents one symbol in a language, so it includes letters, digits, spaces, and punctuation. You might think that we have already covered numbers, but remember that computers are very specific. The numeric value 2 is a separate concept from the character “2.” Humans automatically handle the different contexts where we see the symbol “2,” but computers need to be told which one is meant. Consider that displaying the number 2 on the screen as a specific group of pixels is a completely different operation from multiplying a value by 2.

In English and many other languages, words and sentences are represented by groups of characters strung together. This is the reason computers have a data type called a string: a group of characters strung together. A string is therefore a complex data type, built by combining several simple (or “primitive”) values. Here are some non-numeric data types from Java (Oracle, 2020):
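The difference between the character “2” and the number 2 can be made concrete in Java. In this sketch (class and method names are hypothetical), the character is stored as a character code, and it takes an explicit conversion to get the numeric value it depicts. Character.getNumericValue is a standard library method.

```java
class CharVersusNumberDemo {
    // The internal code the computer stores for a character.
    static int codeOf(char c) {
        return (int) c;
    }

    // The numeric value that a digit character depicts.
    static int digitValue(char c) {
        return Character.getNumericValue(c);
    }

    public static void main(String[] args) {
        char symbol = '2';  // a character: something to display
        int number = 2;     // a number: something to calculate with

        System.out.println("code stored for '2': " + codeOf(symbol));     // 50
        System.out.println("value depicted by '2': " + digitValue(symbol)); // 2
        System.out.println("2 multiplied by 2: " + (number * 2));           // 4
    }
}
```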

  • char (a character): letter, digit, or symbol (16 bits)
  • boolean: represents “true or false” (one bit of information; Java does not define its exact storage size)
  • String: a sequence of characters (varies in size)
  • Array: an ordered sequence of values of one data type (varies in size)

A Boolean variable (named after mathematician George Boole) represents either a TRUE or FALSE value, and therefore has only two possible states. In principle these can be stored as one bit, with zero used for false and one for true.

An array is an ordered arrangement of a simpler data type. You could, for instance, have an array of integers, doubles, or even Booleans. However, an array of characters is so common that a string data type exists in most languages for storing text. Let’s look at how these data types would be declared as Java variables:


char middleInitial = 'A';

String firstName = "Joseph";

int[] lottoNumbers = {5, 22, 11, 7, 16};

The array lottoNumbers holds five elements that represent different lottery number picks. Array positions are counted from zero, so to reference the second number, you would type “lottoNumbers[1]” and it would return the value 22.

Arrays are often used when you have a collection of data of the same type that you want to store in order. You can search through them sequentially, pick out singular items that you want to work with, and even sort the list of elements in order. Uses of arrays include lists of options to choose from, arranging data groupings, or collecting user input, to name a few.
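The array operations mentioned above (picking out one element, searching sequentially, and sorting) might look like the following in Java. This is a minimal sketch: the class and the helpers indexOf and sortedCopy are hypothetical names, while Arrays.copyOf, Arrays.sort, and Arrays.toString come from the standard java.util.Arrays class.

```java
import java.util.Arrays;

class LottoArrayDemo {
    // Sequential search: check each element in order.
    // Returns the index of the target, or -1 if it is not present.
    static int indexOf(int[] values, int target) {
        for (int i = 0; i < values.length; i++) {
            if (values[i] == target) {
                return i;
            }
        }
        return -1;
    }

    // Returns a sorted copy so the original order is preserved.
    static int[] sortedCopy(int[] values) {
        int[] copy = Arrays.copyOf(values, values.length);
        Arrays.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        int[] lottoNumbers = {5, 22, 11, 7, 16};

        System.out.println(lottoNumbers[1]);                            // second element: 22
        System.out.println(indexOf(lottoNumbers, 11));                  // found at index 2
        System.out.println(Arrays.toString(sortedCopy(lottoNumbers))); // [5, 7, 11, 16, 22]
    }
}
```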

Using Variables Correctly

When creating variables, consider the correct data type as well as the memory usage. Variables should also be given descriptive names that follow the standard naming convention for the language. Many new programmers begin by naming their variables “x” and “y” in simple programs, but these names are non-descriptive and become a problem in code of any reasonable length. In code with hundreds of variables, non-descriptive names leave the program unreadable to anyone, including the original coder. Variables should not be a mystery; they should be named in a way that describes their use in code.

Computers have very specific data types and expect them to be used in the correct context. You cannot “add” together two characters and expect the same result as adding two integers; the computer sees these as two different things. If you have the characters “4” and “2” and combine them as text, in many computer languages you will get “42” instead of 6. The characters are treated as symbols and simply placed side by side; if they were numeric variables (such as int), adding them would produce a numeric answer rather than a character-based one. Computers do exactly what they are told. This is why data types and variables must be used with precision.
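How “adding” behaves for different types depends on the language. As a sketch of what Java specifically does (class and method names are hypothetical): adding two int values gives arithmetic, joining characters as text gives “42”, and applying + directly to two char values adds their underlying character codes, which gives neither 6 nor “42”.

```java
class AddingTypesDemo {
    // Numeric addition.
    static int addInts(int a, int b) {
        return a + b;
    }

    // Treat the characters as text and place them side by side.
    static String joinChars(char a, char b) {
        return String.valueOf(a) + b;
    }

    // In Java, + on two chars promotes them to int and adds their character codes.
    static int addChars(char a, char b) {
        return a + b;
    }

    public static void main(String[] args) {
        System.out.println(addInts(4, 2));        // 6
        System.out.println(joinChars('4', '2'));  // 42
        System.out.println(addChars('4', '2'));   // 102, because '4' is 52 and '2' is 50
    }
}
```

All three results are “correct” from the computer’s point of view; which one you get is decided entirely by the data types involved.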

Redundancy and Error Tolerance

Imagine writing a 100,000-word novel. You carefully read the manuscript after your first draft and check for mistakes, but did you really find all of them? There is still a chance that errors slipped through. You might do a second check or hire an editor. Are you sure you got all of them this time? It is very difficult to be sure of perfection in such a large volume of words.

Computers process words and information at a much higher rate than human beings. A simple image on your screen of 800 x 600 pixels has 480,000 pixels. Are they all exactly the right color? Did some bits get confused in transmission? A computer, though highly accurate in its calculations, can still suffer from errors in data transmission. Whenever data is moved from one location to another, there is a chance that the electrical signal may be garbled, interfered with, or interrupted. Data can move from RAM to the hard drive, from the hard drive to the internet, then from the internet to your friend’s house in Argentina. So just sending a photo of a cat to Mari Rubi in Buenos Aires involves several points where errors can occur. Therefore, computers must have built-in precautions against data corruption.

Storage Error Checking

Some errors are caused not by transmission but by corruption of the storage media. Hard drives typically last about three to five years. Some drives experience a gradual corruption, where parts of the physical surface become unstable and will no longer store data. An early method for detecting corruption in stored files is the checksum.

Recall that any file on a computer is a series of binary numbers; in fact, it could be considered one very long binary number. One way to check whether it has been changed (corrupted) would be to store an exact copy somewhere and then compare the two. But because some files are megabytes, or even gigabytes, in size, this would be very inefficient. Instead, since the file is one large binary number, it is fed into an algorithm. This algorithm is like a mathematical function: it produces a numeric result. The checksum is that result; it is calculated and appended to the file when it is saved. When the file is opened, the calculation is run again to see if it matches the number stored in the file. If it does, then the chances that the file is uncorrupted are extremely high. (There is a small chance that a changed file might give the same checksum, but it is remote enough that the error checking is considered reliable.) Usually, corruption in files is very minor and easy to detect with a checksum.

In a typical example, changing the text only slightly produces a very different checksum. If the checksum is a 10-digit number, there are 10,000,000,000 different possibilities, making the random chance that two texts have the same checksum about 1 in 10 billion.
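The save-then-verify cycle described above can be sketched in Java. The article does not say which checksum algorithm its example used, so this sketch assumes CRC-32 from the standard java.util.zip package as a stand-in; the class name, method names, and sample text are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

class ChecksumDemo {
    // Compute a checksum over the bytes of the text
    // (CRC-32 is an assumption made for this sketch).
    static long checksum(String text) {
        CRC32 crc = new CRC32();
        crc.update(text.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    // Re-run the calculation and compare it with the stored value.
    static boolean isIntact(String text, long storedChecksum) {
        return checksum(text) == storedChecksum;
    }

    public static void main(String[] args) {
        String original = "The quick brown fox";
        long stored = checksum(original);          // saved along with the file

        String corrupted = "The quick brown fix";  // one character changed

        System.out.println("stored checksum:    " + stored);
        System.out.println("corrupted checksum: " + checksum(corrupted));
        System.out.println("original intact?  " + isIntact(original, stored));
        System.out.println("corrupted intact? " + isIntact(corrupted, stored));
    }
}
```

Only the short checksum needs to be stored alongside the file, not a second full copy, which is what makes the approach practical for large files.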

Transmission Data Fault Tolerance

The problem of corrupted transmission of data has existed since before computer networks were invented. Radio transmission of secret codes by different militaries risked this problem, as did communication by telegram, and even a letter written by hand could be damaged by water or other environmental hazards.

Parity

One of the earliest methods of error checking for binary data is called parity. Communication with telephone-based modems was serial, or one bit at a time. For every seven bits, an eighth parity bit was added for error checking. The parity bit worked like a very simple checksum; in this case we can even do the math in our heads. Communications via modem were set to either “even” or “odd” parity, which meant that every group of 8 bits had to contain an even or odd number of ones. If it did not match, then it was assumed there was an error, and the receiving computer requested retransmission.

In this scheme, the parity bit is sent first in each group of eight bits. Under odd parity, its value is chosen so that the total number of ones in the byte is odd. In the accompanying example, the five bytes contain 5, 5, 3, 5, and 1 ones respectively, all odd counts. The first bit is not part of the primary data; it is used for error checking and then discarded, just like the checksum at the end of a data file. If a computer set to odd parity received 8 bits with an even number of ones, it would assume an error had occurred. Of course, if two bits happened to flip during transmission, the byte would still pass parity. The assumption is that the most common error is a single corrupted bit in an 8-bit sequence. Parity also cannot tell whether the parity bit itself was the one corrupted, but it served as a basic method of error checking for many years in telecommunications.
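Odd parity is simple enough to sketch in a few lines of Java. In this illustration (class and method names are hypothetical), the parity bit is placed in the highest of the eight bits to stand for “sent first”; Integer.bitCount is a standard library method that counts the ones in a value.

```java
class ParityDemo {
    // Prepend a parity bit to 7 data bits so the total number of ones is odd.
    static int addOddParity(int sevenBits) {
        int data = sevenBits & 0x7F;
        int ones = Integer.bitCount(data);
        int parityBit = (ones % 2 == 0) ? 1 : 0;  // add a one only if the count is even
        return (parityBit << 7) | data;
    }

    // The receiver's check: an odd number of ones passes.
    static boolean passesOddParity(int eightBits) {
        return Integer.bitCount(eightBits & 0xFF) % 2 == 1;
    }

    public static void main(String[] args) {
        int sent = addOddParity(0b1000001);     // two ones, so the parity bit is 1
        System.out.println(Integer.toBinaryString(sent));   // 11000001

        int oneBitFlipped = sent ^ 0b00000100;  // a single corrupted bit
        int twoBitsFlipped = sent ^ 0b00000110; // two corrupted bits

        System.out.println(passesOddParity(sent));           // true
        System.out.println(passesOddParity(oneBitFlipped));  // false: error detected
        System.out.println(passesOddParity(twoBitsFlipped)); // true: error missed
    }
}
```

The last line demonstrates the weakness noted above: two flipped bits cancel each other out, and the corrupted byte passes the check.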

Redundancy Check

A more advanced type of error detection is the Cyclic Redundancy Check, or CRC. It works on a principle similar to the checksum. A block of data is processed through a mathematical algorithm and the result is appended to the data. After the data is sent, the receiver repeats the calculation to check its integrity. The CRC can be applied to any length of binary data and will always return a code of the exact same length. This makes CRC a “hashing” algorithm, one that returns a value of consistent length.
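To give a feel for what happens inside a CRC, here is a minimal bit-by-bit sketch of the widely used CRC-32 variant (the one found in ZIP files and Ethernet). The article does not specify a particular CRC, so the choice of CRC-32 and all names below are assumptions for illustration. Production code would normally use a library implementation such as java.util.zip.CRC32.

```java
import java.nio.charset.StandardCharsets;

class Crc32Sketch {
    // Bitwise CRC-32 (reflected form, polynomial constant 0xEDB88320).
    static long crc32(byte[] data) {
        int crc = 0xFFFFFFFF;                      // initial value: all ones
        for (byte b : data) {
            crc ^= (b & 0xFF);                     // mix the next byte into the low bits
            for (int i = 0; i < 8; i++) {          // process it one bit at a time
                if ((crc & 1) != 0) {
                    crc = (crc >>> 1) ^ 0xEDB88320;
                } else {
                    crc = crc >>> 1;
                }
            }
        }
        return (~crc) & 0xFFFFFFFFL;               // final inversion, as an unsigned 32-bit value
    }

    // Always 8 hexadecimal digits, whatever the input length.
    static String toHex(long crc) {
        return String.format("%08X", crc);
    }

    public static void main(String[] args) {
        byte[] shortInput = "cat".getBytes(StandardCharsets.UTF_8);
        byte[] longInput = "a much longer block of data to transmit".getBytes(StandardCharsets.UTF_8);

        System.out.println(toHex(crc32(shortInput)));
        System.out.println(toHex(crc32(longInput)));
    }
}
```

Both inputs produce a code of the same length, which is the fixed-size property described above.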

TCP/IP Error Detection and Correction

The main protocol, or set of rules, for internet communication is TCP/IP. This network standard has several layers of protection against data corruption. The first is one we’re already familiar with: the checksum.

TCP/IP divides the data to be sent over a network into pieces called segments or datagrams. Each of these has a “header” at the beginning which defines the sending computer, the receiving computer, transmission settings, and also a 16-bit checksum to verify the data. The maximum size of a datagram is 65,535 bytes (Stevens, 2004). When each segment arrives at the destination computer, it is checked against the checksum value; if the values don’t match, a retransmit request is sent to the source computer.
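The 16-bit checksum used in TCP/IP headers is a ones’ complement sum of 16-bit words. The sketch below follows that calculation over a plain byte array. It is a simplification: a real TCP checksum also covers the header and an extra pseudo-header, which are left out here, and the class and method names are hypothetical.

```java
class InternetChecksumDemo {
    // Ones' complement sum of 16-bit words, then inverted.
    static int internetChecksum(byte[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i += 2) {
            int high = data[i] & 0xFF;
            int low = (i + 1 < data.length) ? (data[i + 1] & 0xFF) : 0; // pad odd length with zero
            sum += (high << 8) | low;
        }
        // Fold any carry bits above 16 bits back into the low 16 bits.
        while ((sum >> 16) != 0) {
            sum = (sum & 0xFFFF) + (sum >> 16);
        }
        return (int) (~sum & 0xFFFF);
    }

    public static void main(String[] args) {
        byte[] payload = {0x00, 0x01, (byte) 0xF2, 0x03, (byte) 0xF4, (byte) 0xF5, (byte) 0xF6, (byte) 0xF7};
        int checksum = internetChecksum(payload);
        System.out.println(String.format("%04X", checksum));  // 220D

        // Receiver side: recomputing over data plus checksum gives zero if nothing changed.
        byte[] received = java.util.Arrays.copyOf(payload, payload.length + 2);
        received[payload.length] = (byte) (checksum >> 8);
        received[payload.length + 1] = (byte) checksum;
        System.out.println(internetChecksum(received) == 0);  // true
    }
}
```

Because the checksum is only 16 bits wide, it is cheap to compute on every segment, which suits a check that runs on all traffic.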

Along with a checksum, TCP/IP also checks for transmission errors on other levels. The receiving computer must also send an acknowledgement (ACK) for each datagram received. This is accomplished by giving each datagram a sequence number in the header. The recipient machine sends an ACK for each sequence number—if one of the numbers is not acknowledged after a certain period of time, the source computer will retransmit.
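The acknowledgement idea can be illustrated with a toy model in Java. This is not how a real TCP stack is written: real TCP numbers bytes and not whole segments, and it relies on timers and cumulative acknowledgements. The sketch only shows the bookkeeping of “sent but not yet acknowledged,” and every name in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class AckTrackerDemo {
    // Given how many segments were sent (numbered 0..segmentCount-1) and which
    // sequence numbers were acknowledged, list the ones that need retransmission.
    static List<Integer> needRetransmit(int segmentCount, Set<Integer> acked) {
        List<Integer> missing = new ArrayList<>();
        for (int seq = 0; seq < segmentCount; seq++) {
            if (!acked.contains(seq)) {
                missing.add(seq);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Five segments were sent; ACKs arrived only for 0, 1, and 3.
        Set<Integer> acked = Set.of(0, 1, 3);
        System.out.println("retransmit: " + needRetransmit(5, acked));  // [2, 4]
    }
}
```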

At another level, TCP/IP will also detect broken routes over the network and re-route to new ones. This is one reason why TCP/IP has been used for decades on the internet: it is fault-tolerant. Part of the network can stop working without taking the entire network offline. Routers and devices that stop forwarding network traffic can be worked around by checking for alternate routes to the destination.


