There is an overwhelming amount of data in the world. We need a way to manage that data. The most popular method for serious data storage and manipulation is to use a relational database, along with Structured Query Language (SQL). In these next posts, we're going to learn how to effectively use SQL to store, retrieve, and manipulate data. The mini-tutorial will be split into four key sections:
Other tutorials will typically start by having you download a 'toy database' to get started. This is because in real-world applications it is overwhelmingly likely that you will be querying and modifying an existing database, and rarely creating a new one from scratch. That being said, this tutorial will start with us creating a tiny database from scratch, so that we can be intimately familiar with the data and learn SQL with a 'bottom-up' approach.
The process for setting up an SQL will vary depending on which system you choose, as well as your operating system. Instead of exhaustively covering every possibility, I will simply refer you to the official installation instructions for the two most popular systems: PostgreSQL and MySQL. While I will be using PostgreSQL personally, all statements within this tutorial will also work in a MySQL database, so feel free to setup whichever one you prefer.
The table is the fundamental building block of relational databases. A relational database must have at least one table (though there will almost always be many more). A table can be created as:
CREATE TABLE us_presidents (
first_name VARCHAR(30),
last_name VARCHAR(30),
political_party VARCHAR(30)
);
Here we've created a table and given it the name "us_presidents". We've given the table three columns: first_name, last_name, and political_party. In SQL, we must specify the datatype of columns, be it INT, DATE, etc. In this case, we've set the datatype of each column to VARCHAR(30), meaning a string of variable length, up to 30 characters.
So we have a table, but it's currently devoid of any data. We can insert data into the table like so:
INSERT INTO us_presidents (first_name, last_name, political_party)
VALUES ('John', 'Adams', 'Federalist');
We've INSERTed a row INTO the us_presidents table here. When calling the INSERT INTO statement, we must specify the table name as well as the columns of that table that correspond with the data we are inserting.
Notice how when calling INSERT INTO, we specify the columns first, then the data using the VALUES keyword. We are not required to specify the columns in the same order they appear in the table. For example:
INSERT INTO us_presidents (last_name, political_party, first_name)
VALUES ('Thomas', 'Jefferson', 'Democratic-Republican');
We just need to make sure the order of the data in the VALUES statement aligns with the order of the columns we specify.
We can also leave a column completely blank should we wish:
INSERT INTO us_presidents (first_name, last_name)
VALUES ('George', 'Washington');
We didn't specify a political party for George Washington (because he never had one, technically), so the political_party cell for him will simply be blank.
We've been INSERTing data into a table, but up until now we haven't been able to see what we've been doing! Let's address that with the SELECT statement:
SELECT first_name, last_name, political_party
FROM us_presidents;
first_name | last_name | political_party |
---|---|---|
John | Adams | Federalist |
Thomas | Jefferson | Democratic-Republican |
George | Washington |
(3 rows)
When INSERTing, we could specify the columns in whichever order we want (or even choose to omit columns altogether), the same is true with the SELECT statement:
SELECT political_party, last_name
FROM us_presidents;
political_party | last_name |
---|---|
Federalist | Adams |
Democratic-Republican | Jefferson |
Washington |
(3 rows)
There is a lot more we can do with the SELECT statement. To show more of its capabilities, let's add more data to our us_presidents table. We can INSERT multiple rows at once like so:
INSERT INTO us_presidents (first_name, last_name, political_party)
VALUES ('James', 'Madison', 'Democratic-Republican'),
('James', 'Monroe', 'Democratic-Republican'),
('John', 'Adams', 'Democratic-Republican'),
('Andrew', 'Jackson', 'Democratic'),
('Martin', 'Buren', 'Democratic'),
('William', 'Harrison', 'Whig');
SELECT first_name, last_name, political_party
FROM us_presidents;
first_name | last_name | political_party |
---|---|---|
John | Adams | Federalist |
Thomas | Jefferson | Democratic-Republican |
George | Washington | |
James | Madison | Democratic-Republican |
James | Monroe | Democratic-Republican |
John | Adams | Democratic-Republican |
Andrew | Jackson | Democratic |
Martin | Buren | Democratic |
William | Harrison | Whig |
(9 rows)
We can filter our query using clauses, the first of which we'll examine is WHERE:
SELECT first_name, last_name, political_party
FROM us_presidents
WHERE first_name = 'James';
first_name | last_name | political_party |
---|---|---|
James | Madison | Democratic-Republican |
James | Monroe | Democratic-Republican |
(2 rows)
We can even use the WHERE clause to filter based on a column we don't want to actually see in our output table:
SELECT first_name, last_name
FROM us_presidents
WHERE political_party = 'Democratic-Republican';
first_name | last_name |
---|---|
Thomas | Jefferson |
James | Madison |
James | Monroe |
John | Adams |
(4 rows)
Have you noticed anything strange?
SELECT first_name, last_name, political_party
FROM us_presidents
WHERE last_name = 'Adams';
first_name | last_name | political_party |
---|---|---|
John | Adams | Federalist |
John | Adams | Democratic-Republican |
(2 rows)
What's going on here? Well the first row is referring to John Adams, second President of the United States, while the other row is for John Quincy Adams, sixth President of the United States. As a temporary fix to this issue, we can add a middle name column to the table (we will discuss better methods for handling situations such as these in the next post).
ALTER TABLE us_presidents
ADD middle_name VARCHAR(30);
SELECT first_name, middle_name, last_name
FROM us_presidents;
(9 rows)
Now we have a blank middle_name column, we can UPDATE John Adams to give him his Quincy middle name:
UPDATE us_presidents
SET middle_name = 'Quincy'
WHERE last_name = 'Adams';
SELECT first_name, middle_name, last_name
FROM us_presidents;
first_name | middle_name | last_name | political_party |
---|---|---|---|
Thomas | Jefferson | Democratic-Republican | |
George | Washington | ||
James | Madison | Democratic-Republican | |
James | Monroe | Democratic-Republican | |
Andrew | Jackson | Democratic | |
Martin | Buren | Democratic | |
William | Harrison | Whig | |
John | Quincy | Adams | Federalist |
John | Quincy | Adams | Democratic-Republican |
(9 rows)
Oh no! That statement updated both John Adams'! But it we look back at our query this makes perfect sense. We simply asked to make the middle_name of each row WHERE last_name = 'Adams' equal to 'Quincy', We can write a statement that specifically targets just on John Adams by using the AND clause. Let's go ahead an update another president commonly known by his first, middle, and last names as well.
UPDATE us_presidents
SET middle_name = NULL
WHERE last_name = 'Adams' AND political_party = 'Federalist';
UPDATE us_presidents
SET middle_name = 'Van'
WHERE first_name = 'Martin' AND last_name = 'Buren';
UPDATE us_presidents
SET middle_name = 'Henry'
WHERE first_name = 'William' AND last_name = 'Harrison';
SELECT first_name, middle_name, last_name, political_party
FROM us_presidents;
first_name | middle_name | last_name | political_party |
---|---|---|---|
Thomas | Jefferson | Democratic-Republican | |
George | Washington | ||
James | Madison | Democratic-Republican | |
James | Monroe | Democratic-Republican | |
Andrew | Jackson | Democratic | |
Martin | Van | Buren | Democratic |
William | Henry | Harrison | Whig |
John | Quincy | Adams | Democratic-Republican |
John | Adams | Federalist |
(9 rows)
We've updated both John Adams (the first one) to get rid of the Quincy middle name, and we've given the "Harrison" middle name to William Henry Harrison (We've also updated Martin van Buren while we were at it). If that name sounds strangely familiar, it's probably because William Henry Harrison is the man who died 31 days after becoming president. A tenure so short that one might argue he shouldn't even be included in a list of presidents. We can entertain such an argument with:
DELETE FROM us_presidents
WHERE first_name = 'William' AND middle_name = 'Henry' AND last_name = 'Harrison';
SELECT first_name, middle_name, last_name
FROM us_presidents;
first_name | middle_name | last_name |
---|---|---|
Thomas | Jefferson | |
George | Washington | |
James | Madison | |
James | Monroe | |
Andrew | Jackson | |
Martin | Van | Buren |
John | Quincy | Adams |
John | Adams |
(8 rows)
In addition to the AND clause, there's also the OR and NOT clauses, and more.
As mentioned before, a relational database will almost always have more than one table, so let's create a second one now:
CREATE TABLE us_states (
number SERIAL,
name VARCHAR(30),
region VARCHAR(30)
);
PostgreSQL uses the SERIAL keyword whereas other databases, such as MySQL use the AUTO_INCREMENT keyword. Here is how you would create the same table in MySQL:
CREATE TABLE us_states (
number INT AUTO_INCREMENT,
name VARCHAR(30),
region VARCHAR(30)
);
Here, we've created a table called us_states and given it three columns: number, name, and region. Let's fill our new table with some data by INSERTing states based on the order they joined the union:
INSERT INTO us_states (name, region)
VALUES ('Delaware', 'South'),
('Pennsylvania', 'Northeast'),
('New Jersey', 'Northeast'),
('Georgia', 'South'),
('Connecticut', 'Northeast'),
('Massachusetts', 'Northeast'),
('Maryland', 'South'),
('South Carolina', 'South'),
('New Hampshire', 'Northeast'),
('Virginia', 'South'),
('New York', 'Northeast'),
('North Carolina', 'South'),
('Rhode Island', 'Northeast');
SELECT number, name, region
FROM us_states;
number | name | region |
---|---|---|
1 | Delaware | South |
2 | Pennsylvania | Northeast |
3 | New Jersey | Northeast |
4 | Georgia | South |
5 | Connecticut | Northeast |
6 | Massachusetts | Northeast |
7 | Maryland | South |
8 | South Carolina | South |
9 | New Hampshire | Northeast |
10 | Virginia | South |
11 | New York | Northeast |
12 | North Carolina | South |
13 | Rhode Island | Northeast |
(13 rows)
Notice we did not specify any values for the number column, yet the column is populated anyway. This is because we've created this column to start at 1 and automatically increment as each new row is added.
Now let's add a new column to us_presidents to house the home state of each president. While we're at it, let's also give each president a number, to indicate the order of presidential succession. You'll notice that accomplishing this task requires a very verbose statement. This is because while adding new columns to tables in a relational database is possible (we'll make it slightly less painful using the CASE keyword, which allows us to implement a switch-like statement), it is very impractical. Should you ever find yourself designing a database, you'll need to put careful thought and consideration into the layout of your tables. Adding new columns to a database that has already been deployed, and populating them with data is no easy task:
UPDATE us_presidents
SET number =
CASE
WHEN last_name = 'Washington' THEN 1
WHEN last_name = 'Adams' AND middle_name IS NULL THEN 2
WHEN last_name = 'Jefferson' THEN 3
WHEN last_name = 'Madison' THEN 4
WHEN last_name = 'Monroe' THEN 5
WHEN last_name = 'Adams' AND middle_name = 'Quincy' THEN 6
WHEN last_name = 'Jackson' THEN 7
WHEN last_name = 'Buren' THEN 8
END;
UPDATE us_presidents
SET home_state =
CASE
WHEN last_name = 'Washington' THEN 'Virginia'
WHEN last_name = 'Adams' AND middle_name IS NULL THEN 'Massachusetts'
WHEN last_name = 'Jefferson' THEN 'Virginia'
WHEN last_name = 'Madison' THEN 'Virginia'
WHEN last_name = 'Monroe' THEN 'Virginia'
WHEN last_name = 'Adams' AND middle_name = 'Quincy' THEN 'Massachusetts'
WHEN last_name = 'Jackson' THEN 'South Carolina'
WHEN last_name = 'Buren' THEN 'New York'
END;
SELECT first_name, middle_name, last_name, home_state
FROM us_presidents
ORDER BY number;
first_name | middle_name | last_name | home_state |
---|---|---|---|
George | Washington | Virginia | |
John | Adams | Massachusetts | |
Thomas | Jefferson | Virginia | |
James | Madison | Virginia | |
James | Monroe | Virginia | |
John | Quincy | Adams | Massachusetts |
Andrew | Jackson | South Carolina | |
Martin | Buren | New York |
(8 rows)
Here, we SELECTed and included the ORDER BY clause to sort our results, so we'd have the first president in the first row (like when we used WHERE in the last post, we can ORDER BY a column even if we didn't SELECT that column!)
SQL is case-insensitive, meaning that
SELECT id FROM People;
select id from people;
SeLeCt iD fRoM pEoPlE;
are all valid and equivalent statements. SQL statements are delimited with a semicolon (;), regardless of white-space, meaning that
SELECT
id
FROM
People
;
Is also valid and equivalent to the statements above. All that being said, don't allow yourself to become so preoccupied with whether or not you can, that you forget to stop and think if you should. In fact, when writing anything at all, I'd generally recommend finding a style guide and sticking to it. Virtually all SQL style guides agree that you should capitalize every letter in reserved keywords (e.g., SELECT and FROM). I will be following Simon Holywell's style guide for all statements in this tutorial, so statements will follow the form:
SELECT id FROM people;
In the next tutorial, we will learn about joins and relationships.
Posted: 2023-05-10