Relational Database BasicsArticle contents:
Basic Concepts of the Relational Data Model
The data to be processed by industrial information systems are usually stored in databases, which are made available within database systems. Databases have a number of advantages over simple files that make them particularly relevant for large amounts of data. Specifically, databases:
- Provide specific and relatively easy-to-use functions for the evaluation of data, making even complex evaluations across different data sets possible with manageable effort.
- Are optimized for efficiency when dealing with complex evaluations and large amounts of data—only the data that are actually required will be read and written, not the complete file.
- Allow simultaneous access to the data by multiple users.
- Support fine-grained access protection, i.e., the data users are allowed to access can be defined for the particular user.
- Increase the reliability of systems in the sense that in the event of a write error or crash, a defined, consistent state can normally be restored and it will be possible to trace which changes have already been implemented and which have not.
In addition to the standard database management systems (DBMSs) described here, a number of DBMSs have recently become widespread that take the form of NoSQL systems, which have a slightly different focus and de-emphasize data consistency in order to increase efficiency. However, our focus should first be on standard DBMSs.
Let us now introduce three closely related terms (see illustration):
- Databases themselves contain data stored in an efficient format on a storage medium, e.g., a hard disk.
- Database management systems (DBMSs) manage access to databases and provide corresponding functions. An application normally only accesses a database indirectly via the DBMS.
- Database system describes the combination of a DBMS and database.
Depending upon the author, sometimes applications are also counted within the database system. (Similarly, there is often no distinction between the database and the database system, and both are referred to as “the database.”)
Structure of Relational Databases
In order for the information stored in a database to be processed by information systems, the individual data records must be arranged in a predefined structure (Schubert, M., 2007). If you did not have a database, the most important data could be saved as text, such as in a Word document, as in this example of a customer management system of an online shop:
“Mr. Emil Schulze has the email address firstname.lastname@example.org and is listed under customer number 00200.”
“Ms. Silke Bauer with the customer number 00300 can be reached by email at email@example.com.”
However, customer data stored in this form cannot be automatically read and modified by IT systems. Therefore, the data records are reduced to the most important information and brought into a uniform structure. The most important characteristics of the customers just described are customer number, title, last name, first name, and email address. One way to structure these customer data records is to display them in a table whose columns each store a specific property of the data record and whose rows each contain exactly one data record.
All data records contained in Figure 1 have the same attributes and therefore the same structure. A data set is also called an object. A set of objects with the same structure—i.e., the same attributes—is stored in a relation. The Customer relation shown in Figure 1 has the structure (CUSTOMER NUMBER, TITLE, LAST NAME, FIRST NAME, EMAIL). A relation, in turn, is stored in a table named after the relation.
Objects in relations must be able to be uniquely identified at all times, so two data sets in a relation may never have exactly the same values in all attributes. However, this required uniqueness is not always possible from the domain point of view. Therefore, in addition to the domain attributes, a technical attribute that can uniquely identify a data record is often stored in the relation. Such an attribute is called an identifier (ID attribute). In relational databases, identifiers are referred to as “primary keys.”
Table 1 contains the customer relation already introduced above, which has been extended to include the Customer ID attribute. This attribute serves as the primary key in this relation and is shown in gray in the table.
|Table 1: Customer Data of an Online Shop|
In almost all practical applications, a single relation (table) is not sufficient to store all the required information. In order to operate an online shop, it must be possible to store not only the most important customer data for each customer but also their registration address and various delivery and billing addresses. In this case, a relation Address with the attributes (ADDRESSID, STREET, HOUSENUMBER, POSTALCODE, CITY, COUNTRY) will be used, which is represented in Table 2.
|Table 2: Table with ADDRESS Relation|
With the two relations CUSTOMER and ADDRESS, it is now possible to store customer data and address data in a structured way. However, the classification of objects from the CUSTOMER relation to objects from the ADDRESS relation is still missing. This classification is made using relationships. Every relationship has a unique name. The following types of relationships are defined by their cardinality (or by arity or multiplicity): 1:1 relationship, 1:N relationship, and N:M relationship.
Figure 2 illustrates the 1:1 relationship “primary residence” by way of example. This relationship connects a data set from the relation CUSTOMER with exactly one data set from the relation ADDRESS.
Figure 3 shows the relation CUSTOMER extended by the attribute ADDRESSID. The value stored in the ADDRESSID attribute must uniquely identify an ADDRESS record. Therefore, the data model requires that only the existing identifier (primary key) of the relation ADDRESS be stored as the value of ADDRESSID. The identifier of ADDRESS is the attribute ADDRESSID. Therefore, the concrete values of ADDRESSID in the objects of type CUSTOMER must be values from the column ADDRESSID of the relation ADDRESS. The primary key of ADDRESS is thus also used in the “foreign” relation CUSTOMER. The primary keys of one table used in another table are called foreign keys; a foreign key references the primary key of another table related to the current table, thus linking them.
Figure 4 shows the 1:N relationship “delivery address,” with which exactly one object of the CUSTOMER relation is linked to several objects of the ADDRESS relation.
Figure 5 shows the relation ADDRESS that has been extended by the attribute CUSTOMERID. Values from the CUSTOMERID column of the CUSTOMER table are stored in this attribute. The attribute CUSTOMERID is the primary key of the relation CUSTOMERID. This attribute can be used to uniquely identify the actual customers for the data records in the table ADDRESS.
Figure 6 illustrates the N:M relationship “billing address.” This relationship establishes a connection between M objects of the CUSTOMER relation and N objects of the ADDRESS relation.
Mapping N:M relationships in the relational data model requires “relationship tables.” Figure 7 shows an example of a relationship table that stores the N:M relationship between records of the CUSTOMER relation and records of the ADDRESS relation. A relationship table has as attributes the identifiers of the correlating relations. In the concrete example, this is ADDRESSID of the relation ADDRESS and CUSTOMERID of the relation CUSTOMER. Data records in this relationship table consist only of foreign keys.
The primary key of the ACCOUNT ADDRESS table results from the combination of the ADDRESSID and the CUSTOMERID. This is a composite primary key.
Entering Data Records into Relational Databases
In general, data records are entered into relational databases according to the following scheme:
- Select the relations (tables) in which the desired information is to be entered;
- Insert the object (data record) with its corresponding attribute values;
- Make sure that the primary keys are set uniquely and that no foreign keys are included that do not have corresponding primary keys.
The following facts are to be stored in the relational data schema already presented above:
“Mrs. Lisa Wagner, with the email address firstname.lastname@example.org, should be stored under the customer number 00700. Her primary residence is Domstr. 18 in Hamburg (20095) Germany. Almost all her shipments and invoices have been sent to her home so far. However, the last order, including invoice, was shipped to the following address: Stuttgarter Str. 73, Dresden, Germany 01189.”
To enter this data, three tables must be changed: CLIENT, ADDRESS, and BILLINGADDRESS. These tables are shown below. The new objects are added to the end of each table: one object in the CUSTOMER relation, two objects in the ADDRESS relation, and two objects in the BILLINGADDRESS relation. The values inserted in the example are marked in bold. In addition to the uniqueness of all primary keys, the following foreign key relationships must be taken into account:
- Save the primary residence in the attribute ADDRESSID of the CUSTOMER table;
- Save the delivery addresses in the attribute CUSTOMERID in the ADDRESS table;
- Save the billing addresses in the BILLINGADDRESS table with the respective foreign keys CUSTOMERID and ADDRESSID.
|Table 3: ADDRESS Relation Table|
|Table 4: ADDRESS Table Including Sample Data|
|Table 5: ADDRESS Table Including Sample Data|
Terms and Definitions
The following terms are central to dealing with relational databases and are frequently used in the broader field. Many of the following definitions are based on those from Steiner (2014), Elmasri and Navathe (2016), Unterstein and Mathiessen (2012), and Foster and Godbole (2016).
A relation R on the quantities A1, A2, …, An is a subset of the Cartesian product A1 X A2 … X An, thus R ⊆ A1 X A2 X … X An. The structure of Table 4 can be noted as the following relationship: ADDRESS ⊆ ADDRESSID X STREET X HOUSENUMBER X POSTALCODE X CITY X COUNTRY X CUSTOMERID, i.e., as a seven-tuple with the respective columns as components. A value range can be assigned to each component.
A table is an illustration of a relation that contains the values that belong to the relation.
All data in a table row are referred to as a data record. A data record is an element of the relation illustrated by the table.
An entity is something separate and distinct that can be uniquely identified among other similar entities, such as a business object. Each record stored in the database is an entity.
Attribute (or Property)
Concrete properties of entities are described as attributes. As in object orientation, each attribute has a name (or identifier) and a precisely-defined value range (or domain). Each column in a table stores the value for a specific attribute according to its name.
Entities with the same properties are combined into entity sets. All records stored in a table form an entity set.
NULL value (or NULL)
If no value is stored in a database for an attribute in a table, this is indicated by the attribute value NULL. This means that no information has been saved for this attribute. Saving a NULL value usually requires as much memory space as a valid value from the value range of the attribute. Tables with many NULL values should therefore be avoided when designing the database.
A key is an attribute or set of attributes of an entity that can be used to uniquely distinguish an entity from another similar entity. There must not be two entities with the same values in all attributes. If there were, they would be indistinguishable. A key that uniquely identifies an entity is called a primary key (or key attribute, identifier).
A primary key formed by exactly one attribute is a simple primary key. It is also possible to form a composite primary key where several attributes form the primary key. In this case, the combinations of all attributes involved in the primary key must be unique. A primary key must never be a NULL value.
Artificial key (Surrogate key)
An artificial key is an additional attribute added to a relation that is used as a primary key if there is no natural and unique primary key. This additional attribute has no meaning in the real world outside the database and is only used for the unique referencing of data records. An integer type is often used as an artificial key, the value of which is automatically determined by the database.
An attribute of a table is called a foreign key if it references a primary key of another table. In the data record of a database, it must be ensured that there is a corresponding value as a primary key in the referenced table for each value of a foreign key attribute. This condition (that there must be an actual primary key for each foreign key) is called referential integrity.
The data records stored in a database must be technically correct and consistent, and technical conditions such as referential integrity must be fulfilled. Maintaining data consistency and preventing inconsistencies is one of the central tasks of a database management system.
Database management system (DBMS)
A software system for operating databases is called a database management system (DBMS). Typical relational database management systems include MySQL, Oracle, Microsoft SQL Server, and IBM DB2.
Finding and Deleting Records in the Database
Finding and returning data from a database is the most common operation performed on a dataset.
The basic procedure is as follows:
- Determine the relations (tables) in the database in which the searched information is stored. This includes all tables that may not directly contain the information you are looking for but are needed to identify the concrete data records. Relationship tables are a typical example of this.
- Next, identify the concrete attributes in the relations selected in step 1 that are to be returned as the result.
- Restrict the output. Frequently, not all data records with the required attributes should be output, but only those data records whose attribute values meet predefined criteria. Therefore, the value set identified in step 2 may have to be restricted in step 3.
- Sort the result set determined in step 3 according to defined criteria.
As a result of a query or search operation to a relational database, a set of objects is returned that have only the requested attributes and whose values meet the restrictions specified in the query.
Using the data stored in Table 3, Table 4, and Table 5, determine all locations and their postal codes to which the customer David Schwarz has already had shipments delivered.
Step 1: CUSTOMER and ADDRESS are the relevant relations, the DELIVERYADDRESS relationship between CUSTOMER and ADDRESS is stored in the attribute CUSTOMERID in the relation ADDRESS.
Step 2: Determine Mr. Schwarz’s CUSTOMERID, determine all addresses whose CUSTOMERID value match Mr. Schwarz’s CUSTOMERID.
Step 3: Restrict the addresses found to the POSTAL CODE and LOCATION attributes.
Result: (60311 Frankfurt), (10117 Berlin).
Obviously, for the formulation of search queries to a database, an exact knowledge of the data model is necessary. In particular, the technical meaning of foreign keys is often not directly recognizable from the data model, but can only be found in the documentation. For example, it is not directly apparent from the relations that the attribute ADDRESSID in CUSTOMER is the residential address. The table also does not clearly show the meaning of the foreign key CUSTOMERID in the relation ADDRESS.
|Table 6: CUSTOMER Table|
|Table 7: ADDRESS Table|
|Table 8: ADDRESS Table|
Deleting Records from Relational Databases
Unlike searching in a database (which does not change the information stored in it), deleting records is a critical operation. Especially in the case of data records that are distributed over several relations, it must be ensured that all other data records remain complete and stored consistently, even after deletion. The following steps can be identified for deleting data:
- Select the relations (tables) in the database in which the objects to be deleted are stored. This also includes all tables that may not directly contain the information you are looking for but are needed to identify the concrete data records. Relationship tables are a typical example of this.
- Identify the data records to be deleted based on the criteria formulated in the deletion request.
- Ensure that there are no foreign keys that cannot be resolved after deleting the objects from the database (referential integrity).
Refer to the data set shown in Table 9, Table 10, and Table 11. Delete the customer Silke Bauer from the database, along with all other address data not required by other customer records.
Step 1: Identify the relations and relationships that are affected. CUSTOMER and ADDRESS as well as all relationship tables are relevant relations that must be taken into account when deleting the data for Silke Bauer.
Step 2: Determine the CUSTOMERID of Ms. Bauer, then determine and remove all records in the relationship tables whose value in CUSTOMERID matches the CUSTOMERID of Ms. Bauer. Remove the data record of the customer Ms. Bauer in the table CUSTOMER. Remove all records in ADDRESS that are no longer used in a relationship table. This applies to addresses 5 and 6 in Table 10, while addresses 2, 3, and 4 may not be deleted as they are also used by other customers.
Step 3: Make sure that after removing all data, all foreign keys in the relationship tables reference a valid record. In addition, there is the entry (3; 098483) in the table of invoice addresses, independent of the deletion of customer Silke Bauer, for which there is no suitable entry in the customer file. This means that this data record must also be deleted.
The cells to be deleted in Tables 9, 10, and 11 are marked in bold.
|Table 9: CUSTOMER Table before Deletion|
|Table 10: ADDRESS Table before Deletion|
|Table 11: Relationship Tables MAINRESIDENCE, DELIVERYADDRESS, and INVOICEADDRESS before Deletion|
SQL and Relational Database Systems
In order to operate and use relational databases, a database management system (DBMS) must be available and there must also be an interface for querying and manipulating the data stored in the DBMS.
Structured Query Language
The Structured Query Language (SQL) is an internationally standardized programming language for data queries (ISO/IEC 9075). The SQL commands can each be assigned to one of the following sublanguages:
- Data Manipulation Language (DML): commands for reading and changing stored data;
- Data Definition Language (DDL): commands for creating and changing the data schema, i.e., the logical structure of a database;
- Transaction Control Language (TCL): commands for the controlled execution and monitoring of complex DML commands; and
- Data Control Language (DCL): commands for managing access rights.
As such, the SQL language represents a universal interface for relational database systems. It can be used to create the database, access the data in a controlled manner, and control who has the right to access the database. The standardization of SQL has created a basis that should make it possible to develop and operate an application independently of a DBMS. No matter which DBMS is used, the SQL commands should not differ. Nevertheless, the concrete implementations of DBMSs by their manufacturers vary, so a simple exchange of DBMSs is not possible. Because the syntax specified by SQL is the same in almost all implementations and the SQL commands are also standardized, it is not surprising that the SQL syntax is the same in almost all implementations. However, there are differences in the sets of available data types for attributes and the sets of data types supported by the DBMS where functions are concerned. In addition, the execution speeds of individual DBMSs vary. Therefore, changing the database layer of an application is a complex task because every SQL command used must be tested in the new environment.
Typical Relational Database Management Systems
There are many different DBMSs available on the market. Therefore, only a small selection of the best-known will be mentioned.
MariaDB is a relational DBMS that is widely used for small and medium-sized applications. It was created by a spin-off from MySQL after MySQL separated from Oracle (MySQL, 2019). MariaDB is available as open-source software for many operating systems, meaning there are no license fees for its commercial use.
Typical representatives of commercial relational DBMSs are:
- Oracle Database
- IBM DB2
- Microsoft SQL Server