In this post, we're going to learn how to simplify our queries and save time writing them.
Throughout this tutorial, we've written the following query many, many times:
SELECT first_name, middle_name, last_name, home_state
FROM us_presidents
ORDER BY number;
Even a simple query such as this is rather verbose, not to mention a big pain to type out. The good news is we can save ourselves a lot of time in the future by creating a VIEW:
CREATE VIEW sorted_presidents AS
SELECT first_name, middle_name, last_name, home_state
FROM us_presidents
ORDER BY number;
And now we can get the same effect by simply running:
SELECT * FROM sorted_presidents;
first_name | middle_name | last_name | home_state |
---|---|---|---|
George | Washington | Virginia | |
John | Adams | Massachusetts | |
Thomas | Jefferson | Virginia | |
James | Madison | Virginia | |
James | Monroe | Virginia | |
John | Quincy | Adams | Massachusetts |
Andrew | Jackson | Tennessee | |
Martin | Van | Buren | New York |
William | Henreigh | Harrison |
(9 rows)
Here, once we've created our VIEW, we SELECTed everything (using the wildcard * operator) from that VIEW. A VIEW can be thought of as a 'saved query'. Note that it is the query that is being saved here, not the results. For example let's rectify the silliness we've created with William 'Henreigh' Harrison while experimenting earlier:
BEGIN TRANSACTION;
INSERT INTO us_states (name, region)
VALUES ('Ohio', 'Midwest');
UPDATE us_presidents
SET middle_name = 'Henry',
home_state = 'Ohio'
WHERE first_name = 'William'
AND last_name = 'Harrison';
SELECT * FROM sorted_presidents;
COMMIT;
first_name | middle_name | last_name | home_state |
---|---|---|---|
George | Washington | Virginia | |
John | Adams | Massachusetts | |
Thomas | Jefferson | Virginia | |
James | Madison | Virginia | |
James | Monroe | Virginia | |
John | Quincy | Adams | Massachusetts |
Andrew | Jackson | Tennessee | |
Martin | Van | Buren | New York |
William | Henry | Harrison | Ohio |
(9 rows)
There are two important things to note here: First is that we nested our UPDATE within a TRANSACTION, and that we were able to first SELECT FROM sorted_presidents to ensure that the UPDATEs did what we were aiming to do, and second is that the results SELECTing * (everything) FROM sorted_presidents did indeed change to reflect the UPDATEs we made.
We aren't limited to simple SELECT queries when creating VIEWs. For example:
CREATE VIEW southern_presidents AS
SELECT p.first_name, p.middle_name, p.last_name, p.home_state
FROM us_presidents AS p
JOIN us_states AS s on p.home_state = s.name
WHERE s.region = 'South'
ORDER BY p.number;
SELECT * FROM southern_presidents;
first_name | middle_name | last_name | home_state |
---|---|---|---|
George | Washington | Virginia | |
Thomas | Jefferson | Virginia | |
James | Madison | Virginia | |
James | Monroe | Virginia | |
Andrew | Jackson | Tennessee |
(5 rows)
Brownie points if you thought "Hang on, didn't we just CREATE a VIEW to save us from having to type SELECT first_name, middle_name, last_name, home_state ...?" Because we can absolutely make use of old VIEWs when creating new VIEWs!
CREATE VIEW northeastern_presidents AS
SELECT * FROM sorted_presidents AS p
JOIN us_states AS s on p.home_state = s.name
WHERE s.region = 'Northeast';
SELECT * FROM northeastern_presidents;
first_name | middle_name | last_name | home_state | number | name | region |
---|---|---|---|---|---|---|
John | Adams | Massachusetts | 6 | Massachusetts | Northeast | |
John | Quincy | Adams | Massachusetts | 6 | Massachusetts | Northeast |
Martin | Van | Buren | New York | 11 | New York | Northeast |
(3 rows)
However as you can see there is some strange behavior here. We have the number, name, and region columns from us_states even though we never SELECTed them...
... or did we?
Well actually we did SELECT all these columns. SQL follows an order of operations (also known as the logical query processing phase)*. So even though we wrote the query as SELECT * FROM sorted_presidents, then JOIN that ONTO us_states, it was processed as JOIN sorted_presidents ONTO us_states, then SELECT * FROM the result. This is one of the potential pitfalls to watch out for when using the wildcard operator, and why it's generally considered a best practice to avoid it.
* For reference, the SQL order of operations is:
The differences between database management systems (e.g., MySQL, PostgreSQL, etc.) are more significant when writing PROCEDUREs and FUNCTIONs. The following may not be strictly true for all database management systems, however the principles explored within are widely considered best practices no matter which system you're using.
Earlier we learned we can think of VIEWs as stored SELECT queries. PROCEDUREs and FUNCTIONs are similar, except instead of storing SELECT queries, we're organizing and storing multiple statements into a single call. PROCEDUREs and FUNCTIONs are very similar to each other, however there are two key differences: 1. PROCEDUREs can manage TRANSACTIONs, while FUNCTIONs cannot. 2. FUNCTIONs must always RETURN something, PROCEDUREs never return anything.
If your goal is to fetch information from the database, you should write a FUNCTION. If you need to somehow change the database, you should write a PROCEDURE.
Earlier we experienced some minor headaches trying to INSERT presidents when their home_state didn't already exist in the us_states table. We can bundle some statements together to automatically INSERT a president's home state into us_states if it doesn't already exist. Since this will be modifying the database, we'll write it as a PROCEDURE:
CREATE PROCEDURE add_president (first_name VARCHAR, last_name VARCHAR,
political_party VARCHAR,
home_state VARCHAR)
LANGUAGE plpgsql
AS $$
BEGIN
INSERT INTO us_states (name)
VALUES (home_state)
ON CONFLICT (name)
DO NOTHING;
INSERT INTO us_presidents (first_name, last_name,
political_party,
home_state)
VALUES (first_name, last_name,
home_state);
COMMIT;
END;
$$;
CALL add_president('John', 'Tyler', 'Independent', 'Virginia');
CALL add_president('Abraham', 'Lincoln', 'Republican', 'Illinois');
SELECT * FROM sorted_presidents;
SELECT name AS State_name, region FROM us_states ORDER BY number;
first_name | middle_name | last_name | home_state |
---|---|---|---|
George | Washington | Virginia | |
John | Adams | Massachusetts | |
Thomas | Jefferson | Virginia | |
James | Madison | Virginia | |
James | Monroe | Virginia | |
John | Quincy | Adams | Massachusetts |
Andrew | Jackson | Tennessee | |
Martin | Van | Buren | New York |
William | Henry | Harrison | Ohio |
John | Tyler | Virginia | |
Abraham | Lincoln | Illinois |
(11 rows)
state_name | region |
---|---|
Delaware | Northeast |
Pennsylvania | Northeast |
New Jersey | Northeast |
Georgia | South |
Connecticut | Northeast |
Massachusetts | Northeast |
Maryland | South |
South Carolina | South |
New Hampshire | Northeast |
Virginia | South |
New York | Northeast |
North Carolina | South |
Rhode Island | Northeast |
Vermont | Northeast |
Kentucky | South |
Tennessee | South |
Ohio | Midwest |
Illinois |
(18 rows)
Okay there's a bit going on in our PROCEDURE. We start by giving the PROCEDURE a name and defining the arguments we'd like it to take (first_name, last_name, home_state). Afterwards, we're setting the language to plpgsql, which stands for Procedural Language/PostgreSQL, this is simply a language that provides additional features that are lacking in base SQL. The PROCEDURE body is enclosed in 'dollar quotes' ($$)** and BEGIN and END; calls. We've seen most of what remains before, however we've added an ON CONFLICT clause and specified that should there be a CONFLICT on a states' name we shall DO NOTHING. This prevents duplicate entries from being added if that state already exists in us_states (e.g., when CALLed the PROCEDURE with 'John', 'Tyler', Virginia'). When our PROCEDURE is CALLed and a home_state is supplied that is not already in us_states, it's added there first to prevent a foreign key conflict (as was done for Illinois when we CALLed with 'Abraham', 'Lincoln', 'Illinois'). Finally, we end with the COMMIT statement. Notice we did not need to include BEGIN TRANSACTION; because a PROCEDURE is already by definition a TRANSACTION.
Note that this PROCEDURE left the region of Illinois blank. This is of course because we never specified the region when INSERTing. In times such as these, we have a tough decision to make. We can:
All have their pros and cons, and the 'best' option will depend entirely on your specific needs and the role of your database.
Now that we've taken a crack at PROCEDUREs, let's write a FUNCTION. We could write one that will count the number of presidents from different regions of the United States:
CREATE FUNCTION count_presidents_by_region (us_region VARCHAR)
RETURNS INT
LANGUAGE plpgsql
AS $$
DECLARE
result INT;
BEGIN
SELECT COUNT(*) INTO result
FROM us_presidents AS p
JOIN us_states AS s on p.home_state = s.name
WHERE s.region = us_region;
RETURN result;
END;
$$;
SELECT count_presidents_by_region ('South') AS num_southerns;
num_southerns |
---|
6 |
(1 row)
** SQL expects the body of PROCEDUREs and FUNCTIONs to be a single string, which will then be interpreted and executed by the database engine. It's possible to enclose the body in regular quotes (e.g., 'BEGIN RETURN i + 1;'), however you'll run into issues if your function body contains strings within itself. The dollar quotes ($$) automatically converts everything enclosed within them to a string, including strings within the string. Demonstrating how this is useful is beyond the scope of this tutorial, but now you know what the meaning of this symbol!
In conclusion, we can encapsulate multiple SQL statements into one using VIEWs, PROCEDUREs, and FUNCTIONs. We use VIEWs to store the result of a SELECT query, FUNCTIONs to store statements that fetch information from the database, and PROCEDUREs to store statements that will somehow modify the database.
And with that we've finished this SQL tutorial! The next post will feature some exercises to test your understanding of what we've learned, and your ability to further develop your skills on your own.
Posted: 2023-05-23