SQL Basic Data Manipulation

There is an overwhelming amount of data in the world. We need a way to manage that data. The most popular method for serious data storage and manipulation is to use a relational database, along with Structured Query Language (SQL). In these next posts, we're going to learn how to effectively use SQL to store, retrieve, and manipulate data. The mini-tutorial will be split into four key sections:

Basic Data Manipulation (this post)

Joins and Relationships

Primary and Foreign Keys

Transactions and ACID properties

Views Procedures and Functions

Exercises)

Other tutorials will typically start by having you download a 'toy database' to get started. This is because in real-world applications it is overwhelmingly likely that you will be querying and modifying an existing database, and rarely creating a new one from scratch. That being said, this tutorial will start with us creating a tiny database from scratch, so that we can be intimately familiar with the data and learn SQL with a 'bottom-up' approach.

Setup

The process for setting up an SQL will vary depending on which system you choose, as well as your operating system. Instead of exhaustively covering every possibility, I will simply refer you to the official installation instructions for the two most popular systems: PostgreSQL and MySQL. While I will be using PostgreSQL personally, all statements within this tutorial will also work in a MySQL database, so feel free to setup whichever one you prefer.

Creating tables and inserting data

The table is the fundamental building block of relational databases. A relational database must have at least one table (though there will almost always be many more). A table can be created as:

CREATE TABLE us_presidents (
    first_name VARCHAR(30),
    last_name VARCHAR(30),
    political_party VARCHAR(30)
);

Here we've created a table and given it the name "us_presidents". We've given the table three columns: first_name, last_name, and political_party. In SQL, we must specify the datatype of columns, be it INT, DATE, etc. In this case, we've set the datatype of each column to VARCHAR(30), meaning a string of variable length, up to 30 characters.

So we have a table, but it's currently devoid of any data. We can insert data into the table like so:

INSERT INTO us_presidents (first_name, last_name, political_party)
VALUES ('John', 'Adams', 'Federalist');
Despite being on two lines, the above is treated as a single statement, delimited by the semicolon. See the note below for more info.*
It's worth noting that despite being three separate keywords, for all intents and purposes, INSERT, INTO, and VALUES are all just part of one command. You will never use one without the other two.

We've INSERTed a row INTO the us_presidents table here. When calling the INSERT INTO statement, we must specify the table name as well as the columns of that table that correspond with the data we are inserting.

Notice how when calling INSERT INTO, we specify the columns first, then the data using the VALUES keyword. We are not required to specify the columns in the same order they appear in the table. For example:

INSERT INTO us_presidents (last_name, political_party, first_name)
VALUES ('Thomas', 'Jefferson', 'Democratic-Republican');

We just need to make sure the order of the data in the VALUES statement aligns with the order of the columns we specify.

We can also leave a column completely blank should we wish:

INSERT INTO us_presidents (first_name, last_name)
VALUES ('George', 'Washington');

We didn't specify a political party for George Washington (because he never had one, technically), so the political_party cell for him will simply be blank.

Querying data

We've been INSERTing data into a table, but up until now we haven't been able to see what we've been doing! Let's address that with the SELECT statement:

SELECT first_name, last_name, political_party
  FROM us_presidents;
first_name last_name political_party
John Adams Federalist
Thomas Jefferson Democratic-Republican
George Washington  

(3 rows)

In cases such as this, where we are SELECTing every column from a table, you can use the wildcard (*) operator like so: SELECT * FROM us_presidents;. I'm going to avoid doing so in this tutorial to be as explicit as possible, but using it isn't "wrong" by any means.

When INSERTing, we could specify the columns in whichever order we want (or even choose to omit columns altogether), the same is true with the SELECT statement:

SELECT political_party, last_name
  FROM us_presidents;
political_party last_name
Federalist Adams
Democratic-Republican Jefferson
  Washington

(3 rows)

There is a lot more we can do with the SELECT statement. To show more of its capabilities, let's add more data to our us_presidents table. We can INSERT multiple rows at once like so:

INSERT INTO us_presidents (first_name, last_name, political_party)
VALUES ('James', 'Madison', 'Democratic-Republican'),
       ('James', 'Monroe', 'Democratic-Republican'),
       ('John', 'Adams', 'Democratic-Republican'),
       ('Andrew', 'Jackson', 'Democratic'),
       ('Martin', 'Buren', 'Democratic'),
       ('William', 'Harrison', 'Whig');

SELECT first_name, last_name, political_party
  FROM us_presidents;
first_name last_name political_party
John Adams Federalist
Thomas Jefferson Democratic-Republican
George Washington  
James Madison Democratic-Republican
James Monroe Democratic-Republican
John Adams Democratic-Republican
Andrew Jackson Democratic
Martin Buren Democratic
William Harrison Whig

(9 rows)

We can filter our query using clauses, the first of which we'll examine is WHERE:

SELECT first_name, last_name, political_party
  FROM us_presidents
 WHERE first_name = 'James';
first_name last_name political_party
James Madison Democratic-Republican
James Monroe Democratic-Republican

(2 rows)

Notice the single equals sign here is checking for equality, not assigning as it would in languages such as Python and Java.

We can even use the WHERE clause to filter based on a column we don't want to actually see in our output table:

SELECT first_name, last_name
  FROM us_presidents
 WHERE political_party = 'Democratic-Republican';
first_name last_name
Thomas Jefferson
James Madison
James Monroe
John Adams

(4 rows)

Have you noticed anything strange?

SELECT first_name, last_name, political_party
  FROM us_presidents
 WHERE last_name = 'Adams';
first_name last_name political_party
John Adams Federalist
John Adams Democratic-Republican

(2 rows)

What's going on here? Well the first row is referring to John Adams, second President of the United States, while the other row is for John Quincy Adams, sixth President of the United States. As a temporary fix to this issue, we can add a middle name column to the table (we will discuss better methods for handling situations such as these in the next post).

ALTER TABLE us_presidents
  ADD middle_name VARCHAR(30);

SELECT first_name, middle_name, last_name
  FROM us_presidents;
first_name middle_name last_name John   Adams Thomas   Jefferson George   Washington James   Madison James   Monroe John   Adams Andrew   Jackson Martin   Buren William   Harrison

(9 rows)

Now we have a blank middle_name column, we can UPDATE John Adams to give him his Quincy middle name:

UPDATE us_presidents
   SET middle_name = 'Quincy'
 WHERE last_name = 'Adams';

SELECT first_name, middle_name, last_name
  FROM us_presidents;
first_name middle_name last_name political_party
Thomas   Jefferson Democratic-Republican
George   Washington  
James   Madison Democratic-Republican
James   Monroe Democratic-Republican
Andrew   Jackson Democratic
Martin   Buren Democratic
William   Harrison Whig
John Quincy Adams Federalist
John Quincy Adams Democratic-Republican

(9 rows)

Oh no! That statement updated both John Adams'! But it we look back at our query this makes perfect sense. We simply asked to make the middle_name of each row WHERE last_name = 'Adams' equal to 'Quincy', We can write a statement that specifically targets just on John Adams by using the AND clause. Let's go ahead an update another president commonly known by his first, middle, and last names as well.

UPDATE us_presidents
   SET middle_name = NULL
 WHERE last_name = 'Adams' AND political_party = 'Federalist';

UPDATE us_presidents
   SET middle_name = 'Van'
 WHERE first_name = 'Martin' AND last_name = 'Buren';

UPDATE us_presidents
   SET middle_name = 'Henry'
 WHERE first_name = 'William' AND last_name = 'Harrison';

SELECT first_name, middle_name, last_name, political_party
  FROM us_presidents;
first_name middle_name last_name political_party
Thomas   Jefferson Democratic-Republican
George   Washington  
James   Madison Democratic-Republican
James   Monroe Democratic-Republican
Andrew   Jackson Democratic
Martin Van Buren Democratic
William Henry Harrison Whig
John Quincy Adams Democratic-Republican
John   Adams Federalist

(9 rows)

We've updated both John Adams (the first one) to get rid of the Quincy middle name, and we've given the "Harrison" middle name to William Henry Harrison (We've also updated Martin van Buren while we were at it). If that name sounds strangely familiar, it's probably because William Henry Harrison is the man who died 31 days after becoming president. A tenure so short that one might argue he shouldn't even be included in a list of presidents. We can entertain such an argument with:

DELETE FROM us_presidents
 WHERE first_name = 'William' AND middle_name = 'Henry' AND last_name = 'Harrison';

SELECT first_name, middle_name, last_name
  FROM us_presidents;
Notice we can chain multiple AND clauses together.
first_name middle_name last_name
Thomas   Jefferson
George   Washington
James   Madison
James   Monroe
Andrew   Jackson
Martin Van Buren
John Quincy Adams
John   Adams

(8 rows)

In addition to the AND clause, there's also the OR and NOT clauses, and more.

As mentioned before, a relational database will almost always have more than one table, so let's create a second one now:

CREATE TABLE us_states (
number SERIAL,
name VARCHAR(30),
region VARCHAR(30)
);

PostgreSQL uses the SERIAL keyword whereas other databases, such as MySQL use the AUTO_INCREMENT keyword. Here is how you would create the same table in MySQL:

CREATE TABLE us_states (
number INT AUTO_INCREMENT,
name VARCHAR(30),
region VARCHAR(30)
);

Here, we've created a table called us_states and given it three columns: number, name, and region. Let's fill our new table with some data by INSERTing states based on the order they joined the union:

INSERT INTO us_states (name, region)
VALUES ('Delaware', 'South'),
       ('Pennsylvania', 'Northeast'),
       ('New Jersey', 'Northeast'),
       ('Georgia', 'South'),
       ('Connecticut', 'Northeast'),
       ('Massachusetts', 'Northeast'),
       ('Maryland', 'South'),
       ('South Carolina', 'South'),
       ('New Hampshire', 'Northeast'),
       ('Virginia', 'South'),
       ('New York', 'Northeast'),
       ('North Carolina', 'South'),
       ('Rhode Island', 'Northeast');

SELECT number, name, region
  FROM us_states;
number name region
1 Delaware South
2 Pennsylvania Northeast
3 New Jersey Northeast
4 Georgia South
5 Connecticut Northeast
6 Massachusetts Northeast
7 Maryland South
8 South Carolina South
9 New Hampshire Northeast
10 Virginia South
11 New York Northeast
12 North Carolina South
13 Rhode Island Northeast

(13 rows)

Notice we did not specify any values for the number column, yet the column is populated anyway. This is because we've created this column to start at 1 and automatically increment as each new row is added.

Now let's add a new column to us_presidents to house the home state of each president. While we're at it, let's also give each president a number, to indicate the order of presidential succession. You'll notice that accomplishing this task requires a very verbose statement. This is because while adding new columns to tables in a relational database is possible (we'll make it slightly less painful using the CASE keyword, which allows us to implement a switch-like statement), it is very impractical. Should you ever find yourself designing a database, you'll need to put careful thought and consideration into the layout of your tables. Adding new columns to a database that has already been deployed, and populating them with data is no easy task:

UPDATE us_presidents
   SET number =
  CASE
  WHEN last_name = 'Washington' THEN 1
  WHEN last_name = 'Adams' AND middle_name IS NULL THEN 2
  WHEN last_name = 'Jefferson' THEN 3
  WHEN last_name = 'Madison' THEN 4
  WHEN last_name = 'Monroe' THEN 5
  WHEN last_name = 'Adams' AND middle_name = 'Quincy' THEN 6
  WHEN last_name = 'Jackson' THEN 7
  WHEN last_name = 'Buren' THEN 8
END;

UPDATE us_presidents
   SET home_state =
  CASE
  WHEN last_name = 'Washington' THEN 'Virginia'
  WHEN last_name = 'Adams' AND middle_name IS NULL THEN 'Massachusetts'
  WHEN last_name = 'Jefferson' THEN 'Virginia'
  WHEN last_name = 'Madison' THEN 'Virginia'
  WHEN last_name = 'Monroe' THEN 'Virginia'
  WHEN last_name = 'Adams' AND middle_name = 'Quincy' THEN 'Massachusetts'
  WHEN last_name = 'Jackson' THEN 'South Carolina'
  WHEN last_name = 'Buren' THEN 'New York'
END;

SELECT first_name, middle_name, last_name, home_state
  FROM us_presidents
 ORDER BY number;
Notice we had to be careful of John (Quincy)? Adams again
first_name middle_name last_name home_state
George   Washington Virginia
John   Adams Massachusetts
Thomas   Jefferson Virginia
James   Madison Virginia
James   Monroe Virginia
John Quincy Adams Massachusetts
Andrew   Jackson South Carolina
Martin   Buren New York

(8 rows)

Here, we SELECTed and included the ORDER BY clause to sort our results, so we'd have the first president in the first row (like when we used WHERE in the last post, we can ORDER BY a column even if we didn't SELECT that column!)

*A brief word on SQL syntax and style

SQL is case-insensitive, meaning that

SELECT id FROM People;
select id from people;
SeLeCt iD fRoM pEoPlE;

are all valid and equivalent statements. SQL statements are delimited with a semicolon (;), regardless of white-space, meaning that

SELECT
id
FROM
People
;

Is also valid and equivalent to the statements above. All that being said, don't allow yourself to become so preoccupied with whether or not you can, that you forget to stop and think if you should. In fact, when writing anything at all, I'd generally recommend finding a style guide and sticking to it. Virtually all SQL style guides agree that you should capitalize every letter in reserved keywords (e.g., SELECT and FROM). I will be following Simon Holywell's style guide for all statements in this tutorial, so statements will follow the form:

SELECT id FROM people;

In the next tutorial, we will learn about joins and relationships.


Posted: 2023-05-10

Older Newer Newest