In the last post we created a table for some early U.S. Presidents and states, and learned we can view our tables with:
SELECT first_name, middle_name, last_name, home_state
FROM us_presidents;
SELECT name, region
FROM us_states;
first_name | middle_name | last_name | home_state |
---|---|---|---|
Thomas | Jefferson | Virginia | |
James | Madison | Virginia | |
James | Monroe | Virginia | |
John | Adams | Massachusetts | |
George | Washington | Virginia | |
John | Quincy | Adams | Massachusetts |
Andrew | Jackson | Tennessee | |
Martin | Van | Buren | New York |
(8 rows)
name | region |
---|---|
Delaware | South |
Pennsylvania | Northeast |
New Jersey | Northeast |
Georgia | South |
Connecticut | Northeast |
Massachusetts | Northeast |
Maryland | South |
South Carolina | South |
New Hampshire | Northeast |
Virginia | South |
New York | Northeast |
North Carolina | South |
Rhode Island | Northeast |
(13 rows)
As you've likely gathered from this posts' title, we're going to be JOINing these tables together. We're going to be making the use of aliases by using the AS keyword to keep the statement length under control. We'll also add the ORDER BY clause so that we can sort the presidents by the order of which they were elected. Notice how just like with the WHERE clause, we can ORDER BY a column even if we aren't SELECTing that column.
SELECT p.first_name, p.middle_name, p.last_name, p.home_state, s.region
FROM us_presidents AS p
JOIN us_states AS s ON p.home_state = s.name
ORDER BY p.number;
first_name | middle_name | last_name | home_state | region |
---|---|---|---|---|
George | Washington | Virginia | South | |
John | Adams | Massachusetts | Northeast | |
Thomas | Jefferson | Virginia | South | |
James | Madison | Virginia | South | |
James | Monroe | Virginia | South | |
John | Quincy | Adams | Massachusetts | Northeast |
Martin | Van | Buren | New York | Northeast |
(8 rows)
Now we have a single table comprised of columns from both the us_presidents and the us_states tables.
When calling JOIN, you must specify a column from each table to be JOINed ONto. In the above example, we JOINed the home_state column from us_presidents ONto the name column of us_states. This caused SQL to stitch together rows where those columns match. Take a look at George Washington's row in us_presidents, and Virginia's row in us_states. Since the value of home_state matches the value of name, the data in these rows across these tables were combined into one.
When we use the JOIN keyword by itself, we're technically performing an inner join, meaning all rows that do not have a 'match' are automatically filtered out.
You've probably noticed that not every state from us_states made it into this new table, in fact, only three states, Virginia, Massachusetts, and New York are present. You may have also noticed that not every president made it to this list as well... Andrew Jackson, who hails from Tennessee is missing, the reason being is Tennessee is not yet in the us_states table.
We can keep the unmatched rows, and populate the missing columns with NULL by specifying we'd like a FULL JOIN:
SELECT p.first_name, p.middle_name, p.last_name, p.home_state, s.region
FROM us_presidents AS p
FULL JOIN us_states AS s ON p.home_state = s.name;
ORDER BY p.number;
first_name | middle_name | last_name | home_state | region |
---|---|---|---|---|
George | Washington | Virginia | South | |
John | Adams | Massachusetts | Northeast | |
Thomas | Jefferson | Virginia | South | |
James | Madison | Virginia | South | |
James | Monroe | Virginia | South | |
John | Quincy | Adams | Massachusetts | Northeast |
Andrew | Jackson | Tennessee | ||
Martin | Van | Buren | New York | Northeast |
Northeast | ||||
South | ||||
Northeast | ||||
South | ||||
Northeast | ||||
South | ||||
Northeast | ||||
Northeast | ||||
South | ||||
Northeast |
(18 rows)
Using FULL joins are pretty rare (after all, it's not clear what the above table even means*), more commonly you will see the LEFT JOIN and the RIGHT JOIN, which I think of as 'partial FULL JOINs'. A LEFT JOIN will keep the rows from the first specified (i.e., 'left') table, even if it does not match the second ('right') table on the JOIN column. Vice versa for RIGHT JOIN:
SELECT p.first_name, p.middle_name, p.last_name, p.home_state, s.region
FROM us_presidents as p
LEFT JOIN us_states as s ON p.home_state = s.name
ORDER BY p.number;
first_name | middle_name | last_name | home_state | region |
---|---|---|---|---|
George | Washington | Virginia | South | |
John | Adams | Massachusetts | Northeast | |
Thomas | Jefferson | Virginia | South | |
James | Madison | Virginia | South | |
James | Monroe | Virginia | South | |
John | Quincy | Adams | Massachusetts | Northeast |
Andrew | Jackson | Tennessee | ||
Martin | Van | Buren | New York | Northeast |
(8 rows)
Notice here in the LEFT JOIN that Andrew Jackson is present, with the region column simply being left blank for him. Because this was a LEFT JOIN, rows from the RIGHT table that do not match any rows on the LEFT table on the join column are absent.
Take a second to consider what the output of a RIGHT JOIN will be:
SELECT p.first_name, p.middle_name, p.last_name, p.home_state, s.region
FROM us_presidents as p
RIGHT JOIN us_states as s ON p.home_state = s.name
ORDER BY p.number;
first_name | middle_name | last_name | home_state | region |
---|---|---|---|---|
George | Washington | Virginia | South | |
John | Adams | Massachusetts | Northeast | |
Thomas | Jefferson | Virginia | South | |
James | Madison | Virginia | South | |
James | Monroe | Virginia | South | |
John | Quincy | Adams | Massachusetts | Northeast |
Martin | Van | Buren | New York | Northeast |
South | ||||
Northeast | ||||
Northeast | ||||
Northeast | ||||
Northeast | ||||
South | ||||
Northeast | ||||
South | ||||
South | ||||
Northeast |
(17 rows)
Wrapping your head around the different types of JOINs can be a bit tricky, this visualization (credit: w3schools) can help make sense of it.
We'll wrap up this post by fleshing out our querying knowledge, starting with the aggregate function COUNT(), which simply returns the number of matching rows:
SELECT COUNT(home_state)
FROM us_presidents
WHERE home_state = 'Virginia';
count |
---|
4 |
(1 row)
This query simply COUNTed the number of rows where home_state is 'Virginia'. Using aggregate functions introduces some limitations into how we write our queries. For example, this query fails:
SELECT last_name, COUNT(home_state)
FROM us_presidents
WHERE home_state = 'Virginia';
ERROR: column "us_presidents.last_name" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT last_name, COUNT(home_state)
We can intuitively understand this error by looking at the output of the previous query. By adding the COUNT() aggregate function, we've 'smashed' all rows WHERE home_state = 'Virginia' into one. We can't SELECT a last_name from this one row because there's no way for SQL to determine which last_name ('Washington', 'Jefferson', 'Madison', or 'Monroe') we could be referring to!
The other aggregate functions are SUM() and AVG(). They're not particularly applicable to this database, but are incredibly useful in others.
So if we want more columns we'll need to either use more aggregate functions or the GROUP BY clause. Here is a query that will output the number of presidents from each state:
SELECT home_state, COUNT(home_state)
FROM us_presidents
GROUP BY home_state
ORDER BY count DESC;
home_state | count |
---|---|
Virginia | 4 |
Massachusetts | 2 |
New York | 1 |
Tennessee | 1 |
(4 rows)
And finally, we of course can combine JOINs, aggregate functions, and the GROUP BY clause. Let's come up with a query that will tell use home many presidents came from each region of the United States:
SELECT s.region, COUNT(s.region)
FROM us_presidents AS p
JOIN us_states As s ON p.home_state = s.name
GROUP BY s.region;
region | count |
---|---|
South | 4 |
Northeast | 3 |
(2 rows)
You would be forgiven if you think this is a strange query. After all, why are we JOINing us_presidents ONto us_states, if we're only SELECTing columns from us_states? What must be realized is that when executing this query, the JOIN executes first, meaning we're recreating something like the table created by our very first JOIN, all the way back at the start of this post. Afterwards, we're using the COUNT() aggregate function and GROUP BY clause to smash the rows whose regions are equal to each other together, leaving us with just two rows containing 'South' and 'Northeast', along with the respective COUNTs.
In the next post I'm going to introduce Primary and Foreign keys, a topic I've been carefully avoiding up until now, despite many tutorials leading with it.
* We can actually make this table a little less confusing by re-thinking our query. The home_state column is blank for states that haven't yet sent someone to the White House, but we could instead SELECT the name column from the states table:
SELECT p.first_name, p.middle_name, p.last_name, s.name, s.region
FROM us_presidents as p
FULL JOIN us_states as s ON p.home_state = s.name
ORDER BY p.number;
first_name | middle_name | last_name | name | region |
---|---|---|---|---|
George | Washington | Virginia | South | |
John | Adams | Massachusetts | Northeast | |
Thomas | Jefferson | Virginia | South | |
James | Madison | Virginia | South | |
James | Monroe | Virginia | South | |
John | Quincy | Adams | Massachusetts | Northeast |
Andrew | Jackson | |||
Martin | Van | Buren | New York | Northeast |
New Hampshire | Northeast | |||
Maryland | South | |||
Connecticut | Northeast | |||
Georgia | South | |||
New Jersey | Northeast | |||
North Carolina | South | |||
Rhode Island | Northeast | |||
Delaware | Northeast | |||
South Carolina | South | |||
Pennsylvania | Northeast |
(18 rows)
That's somewhat better, at least we can now see the names of the states with no matching president. However, it isn't great that we now have columns called first_name, middle_name, last_name, and... name..., the last of which referring to a completely different entity than the first three. We can again simply use an ALIAS to resolve this:
SELECT p.first_name, p.middle_name, p.last_name, s.name AS state, s.region
FROM us_presidents as p
FULL JOIN us_states as s ON p.home_state = s.name
ORDER BY p.number;
first_name | middle_name | last_name | state | region |
---|---|---|---|---|
George | Washington | Virginia | South | |
John | Adams | Massachusetts | Northeast | |
Thomas | Jefferson | Virginia | South | |
James | Madison | Virginia | South | |
James | Monroe | Virginia | South | |
John | Quincy | Adams | Massachusetts | Northeast |
Andrew | Jackson | |||
Martin | Van | Buren | New York | Northeast |
New Hampshire | Northeast | |||
Maryland | South | |||
Connecticut | Northeast | |||
Georgia | South | |||
New Jersey | Northeast | |||
North Carolina | South | |||
Rhode Island | Northeast | |||
Delaware | Northeast | |||
South Carolina | South | |||
Pennsylvania | Northeast |
(18 rows)
Posted: 2023-05-10