SQL

Still Quite Laughable

Vitalstatistix

Name
Dave Jones
Occupation
Database Frobnicator
Twitter
@waveform80
GitHub
github.com/waveform80

Now available online!

www.waveform.org.uk/presentations/sql/

(my apologies, this doesn't seem to work well in Firefox!)

SQL Introduction

  • It's a standard … which no-one really follows
  • It's old, and therefore very weird in places
  • Despite numerous valiant (and some not so valiant) attempts to replace it, it's still the de facto standard data language

Using SQL from Python

  • PEP-249 (sqlite3, psycopg2, MySQLdb, etc.)
  • Badly designed in several places
  • Too variable / low-level to be seriously useful (see the raw sketch below)
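
For the curious, raw PEP-249 usage looks something like this (a minimal sketch using the standard library's sqlite3 against a throwaway in-memory database):


          import sqlite3

          # Raw PEP-249: manual connections, manual cursors, and a
          # driver-specific parameter style ('?' here, '%s' elsewhere)
          conn = sqlite3.connect(':memory:')   # throwaway in-memory database
          cur = conn.cursor()
          cur.execute("SELECT ? + ?", (1, 1))
          print(cur.fetchone())                # (2,)
          conn.close()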

Using SQL from Python

  • SQLAlchemy to the rescue!
  • … and Pandas!
  • … or whatever your preferences are!

Connect!


          >>> import sqlalchemy as sa
          >>> import pandas as pd
          >>> engine = sa.create_engine('sqlite:///flight.db')
          >>> c = engine.connect()
          

Tables

flight  timestamp  temp  pressure  humidity
hab1    09:00:00   25.5  1020      40
hab1    09:01:00   25.6  1019      41
hab1    09:02:00   25.4  1020      42

Tables

SQL “tables” have:

  • No order
  • No duplicates (well … sort of)

They are sets of tuples (rows)
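
A rough Python analogy for the mental model (not how any engine actually stores things):


          # A "table" as a set of tuples: no inherent order, no duplicate rows
          readings = {
              ('hab1', '09:00:00', 25.5, 1020, 40),
              ('hab1', '09:01:00', 25.6, 1019, 41),
              ('hab1', '09:02:00', 25.4, 1020, 42),
          }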

Creating tables

Creating tables

column-def:: (syntax diagram)

Notes

  • Columns are nullable by default!
  • The default default is NULL
  • … even for NOT NULL columns (see the sketch below)
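
A quick illustration, using a hypothetical throwaway table and the connection from earlier:


          # Hypothetical throwaway table
          c.execute("CREATE TABLE demo (a INTEGER, b INTEGER NOT NULL)")

          # 'a' is nullable, so leaving it out works and stores NULL
          c.execute("INSERT INTO demo(b) VALUES (1)")

          # 'b' is NOT NULL but has no DEFAULT, so its "default" is still
          # NULL and leaving it out fails with a constraint violation
          c.execute("INSERT INTO demo(a) VALUES (1)")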

Creating tables

constraint-def:: (syntax diagram)

Notes

Probably the most important bit of a relational database, and what really sets it apart from your spreadsheet!

Intermission

columns:: (syntax diagram)

expressions:: (syntax diagram)

predicate:: (syntax diagram)

Intermission

test:: (syntax diagram)

Creating tables

unique-key:: (syntax diagram)

check:: (syntax diagram)

Creating tables

foreign-key:: (syntax diagram)

rule:: (syntax diagram)

Creating tables: example


          CREATE TABLE readings (
              flight    VARCHAR(10) NOT NULL,
              ts        TIMESTAMP NOT NULL,
              temp      NUMERIC(3,1) NOT NULL,
              pressure  NUMERIC(4,0) NOT NULL,
              humidity  NUMERIC(3,0) NOT NULL,
              accel_x   REAL DEFAULT 0 NOT NULL,
              accel_y   REAL DEFAULT 0 NOT NULL,
              accel_z   REAL DEFAULT 0 NOT NULL,

              CONSTRAINT readings_pk PRIMARY KEY (flight, ts),
              CONSTRAINT temp_ck CHECK (temp BETWEEN -70 AND 70),
              CONSTRAINT pres_ck CHECK (pressure BETWEEN 0 AND 2000),
              CONSTRAINT hum_ck CHECK (humidity BETWEEN 0 AND 100)
          );
          

Some notes

  • Nothing wrong with composite keys!
    • but keys should be immutable for the entity's life
    • don't be afraid to use surrogates (sketch below)
    • but don't blindly use them everywhere
  • Use proper types for things!
    • avoid TEXT unless you really mean it
  • Use constraints liberally!
    • including NOT NULL
    • and name them!
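
For instance, a hypothetical table sketching a surrogate key alongside a named constraint on the natural value (the table and names here are made up purely for illustration):


          c.execute("""
              -- hypothetical table, purely for illustration
              CREATE TABLE operators (
                  operator_id  INTEGER PRIMARY KEY,    -- surrogate, immutable
                  callsign     VARCHAR(20) NOT NULL,   -- natural value, may change
                  CONSTRAINT operators_callsign_uk UNIQUE (callsign)
              )
          """)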

Running SQL in Python


          >>> sql = """ ... paste SQL here ... """
          >>> c.execute(sql)
          

Yup … that's it

Filling tables

data:: (syntax diagram)

Filling tables: examples


          INSERT INTO readings(flight, ts, temp, pressure, humidity)
          VALUES ('hab1', '2015-01-01 09:00:00', 25.5, 1020, 40);

          -- Inserting multiple rows in a single statement
          INSERT INTO readings(flight, ts, temp, pressure, humidity)
          VALUES
            ('hab1', '2015-01-01 09:01:00', 25.5, 1019, 40),
            ('hab1', '2015-01-01 09:02:00', 25.5, 1019, 41);

          -- Inserting the results of a query
          INSERT INTO readings
          SELECT flight, DATETIME(ts, '+3 minutes') AS ts, temp,
                 pressure, humidity, accel_x, accel_y, accel_z
          FROM readings;
          

Syntax Notes

  • 'I am a string'
  • "I am NOT a string" (in most engines)
  • -- I am a single line comment
  • /* I am a multi-line comment */
  • No trailing commas (awww)
  • Case insensitive … sort of

Viewing results


          >>> pd.set_option('display.width', 120)
          >>> pd.read_sql('readings', c)
            flight                  ts  temp  pressure  humidity  accel_x  accel_y  accel_z
          0   hab1 2015-01-01 09:00:00  25.5      1020        40        0        0        0
          1   hab1 2015-01-01 09:01:00  25.5      1019        40        0        0        0
          2   hab1 2015-01-01 09:02:00  25.5      1019        41        0        0        0
          3   hab1 2015-01-01 09:03:00  25.5      1020        40        0        0        0
          4   hab1 2015-01-01 09:04:00  25.5      1019        40        0        0        0
          5   hab1 2015-01-01 09:05:00  25.5      1019        41        0        0        0
          

Transactions

Atomicity
– everything either happens or it doesn't
Consistency
– you don't need to think about this one
Isolation
– nobody else sees your changes until COMMIT
Durability
– when COMMIT says it's complete, it's safe to take an axe to the power cable (well … for your data anyway)

Transactions

  • Not unique to relational databases, but (almost) universal among relational engines
  • COMMIT and ROLLBACK are the usual terms
  • But you don't need to remember that because auto-commit is (sadly) often the default
  • And within an explicit transaction, COMMIT / ROLLBACK is usually implicit
  • Nested transactions are (almost) universally supported
  • Beware engine peculiarities (DDL especially)!

… in Python


          >>> t = c.begin()
          >>> c.execute(""" ... some SQL ... """)
          >>> t.rollback()
          >>> t = c.begin()
          >>> c.execute(""" ... some SQL ... """)
          >>> t.commit()
          

This is fine at the REPL, but …

… in Python


          with c.begin():
              c.execute(""" ... some SQL ... """)
              raise Exception('Something went wrong!')

          with c.begin():
              c.execute(""" ... some SQL ... """)
          

This is more normal in code (implicit transaction end)

Clearing tables

Clearing tables


          -- Delete a row
          DELETE FROM readings WHERE ts = '2015-01-01 09:05:00';

          -- Delete some rows
          DELETE FROM readings WHERE ts > '2015-01-01 09:02:00';

          -- Delete EVERYTHING! (mwuhahahaha)
          DELETE FROM readings;
          

Bulk Loading

Download this (≈70MB): www.waveform.org.uk/presentations/sql/flight-data.csv

This is a dataset (≈200k rows) of sensor readings from a high-altitude balloon (HAB) flight with a Raspberry Pi Sense HAT.

Bound Parameters


          def load_data(filename):
              c.execute("DELETE FROM readings")
              sql = """
                  INSERT INTO readings
                      (flight, ts, temp, pressure, humidity,
                      accel_x, accel_y, accel_z)
                  VALUES
                      ('hab1', ?, ?, ?, ?, ?, ?, ?)
              """
              data = pd.read_csv(filename)
              for row in data.itertuples():
                  c.execute(sql, (
                      row.timestamp, row.temp_h, row.pressure,
                      row.humidity, row.accel_x, row.accel_y,
                      row.accel_z
                  ))
          

Bound Parameters

  • The only way you should be inserting user values into your database!
  • One (standard) form: ? (positional parameter)
  • You're better off using SQLAlchemy's SQL generation for this, so you don't need to deal with PEP-249's insane variants (sketch below)
  • Each parameter can only represent a single value
  • You cannot use parameters to represent objects (tables and columns); if you think you need this, you're doing it wrong!
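
A rough sketch of what that looks like with SQLAlchemy Core (the Table definition here is deliberately partial, just the columns being inserted, and the row values are made up):


          from datetime import datetime

          # Partial Table definition: only the columns being inserted
          # (the accel_* columns fall back to their DEFAULT 0)
          meta = sa.MetaData()
          readings_t = sa.Table(
              'readings', meta,
              sa.Column('flight', sa.String(10)),
              sa.Column('ts', sa.DateTime),
              sa.Column('temp', sa.Numeric(3, 1)),
              sa.Column('pressure', sa.Numeric(4, 0)),
              sa.Column('humidity', sa.Numeric(3, 0)),
          )

          # SQLAlchemy picks the right parameter style for the driver
          c.execute(readings_t.insert(), [
              {'flight': 'hab1', 'ts': datetime(2015, 1, 1, 9, 6),
               'temp': 25.5, 'pressure': 1019, 'humidity': 41},
          ])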

Error!


          def load_data(filename):
              insert = """
                  INSERT INTO readings
                      (flight, ts, temp, pressure, humidity,
                      accel_x, accel_y, accel_z)
                  VALUES
                      ('hab1', ?, ?, ?, ?, ?, ?, ?)
              """
              data = pd.read_csv(filename)
              c.execute("DELETE FROM readings")
              for row in data.itertuples():
                  c.execute(insert, (
                      row.timestamp, row.temp_h, row.pressure,
                      min(100, max(0, row.humidity)),
                      row.accel_x, row.accel_y, row.accel_z
                  ))
          

Twiddle your thumbs…

  • Transactions are very expensive!
  • Mostly because they sync to disk
  • Long-running transactions are bad (especially in concurrent situations)
  • Pack as much into a transaction as you reasonably can in a short space of time

The Need for Speed


          def load_data(filename):
              insert = """
                  INSERT INTO readings
                      (flight, ts, temp, pressure, humidity,
                      accel_x, accel_y, accel_z)
                  VALUES
                      ('hab1', ?, ?, ?, ?, ?, ?, ?)
              """
              data = pd.read_csv(filename)
              with c.begin():
                  c.execute("DELETE FROM readings")
                  for row in data.itertuples():
                      c.execute(insert, (
                          row.timestamp, row.temp_h, row.pressure,
                          min(100, max(0, row.humidity)),
                          row.accel_x, row.accel_y, row.accel_z
                      ))
          

This is ≈100× faster

Round-trip cost


          def load_data(filename):
              insert = """
                  INSERT INTO readings
                      (flight, ts, temp, pressure, humidity,
                      accel_x, accel_y, accel_z)
                  VALUES
                      ('hab1', ?, ?, ?, ?, ?, ?, ?)
              """
              data = pd.read_csv(filename)
              data = [
                  (row.timestamp, row.temp_h, row.pressure,
                   min(100, max(0, row.humidity)),
                   row.accel_x, row.accel_y, row.accel_z)
                  for row in data.itertuples()
              ]
              with c.begin():
                  c.execute("DELETE FROM readings")
                  c.execute(insert, data)
          

This is ≈2× faster. See also: to_sql() in Pandas
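
For instance, the Pandas route might look something like this (a sketch only, assuming an empty readings table and the same CSV column names as in the loader above):


          def load_data_pandas(filename):
              # Sketch: reshape the CSV to match the table's columns, then
              # let Pandas generate and execute the INSERTs itself
              data = pd.read_csv(filename)
              data = data.rename(columns={'timestamp': 'ts', 'temp_h': 'temp'})
              data['flight'] = 'hab1'
              data['humidity'] = data['humidity'].clip(0, 100)
              cols = ['flight', 'ts', 'temp', 'pressure', 'humidity',
                      'accel_x', 'accel_y', 'accel_z']
              data[cols].to_sql('readings', c, if_exists='append', index=False)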

Modifying tables

Modifying tables: examples


          -- Increase the temperature for all records
          UPDATE readings SET temp = temp + 1;

          -- Shift all hab1 readings to hab2
          UPDATE readings SET flight = 'hab2' WHERE flight = 'hab1';

          -- ... or something more complex
          UPDATE readings SET temp = temp * CASE
              WHEN pressure BETWEEN 0 AND 100 THEN 0.9
              WHEN pressure BETWEEN 101 AND 800 THEN 1.0
              WHEN pressure BETWEEN 801 AND 1000 THEN 1.1
          END
          WHERE pressure <= 1000;
          

Upsert?

  • There's a standard: MERGE
    • all the commercial engines implement it
    • … to varying degrees
    • … none of the open-source engines use it
  • None of the open-source engines quite agree on what to use instead:
    • REPLACE
    • WITH (UPDATE…)
    • INSERT … ON CONFLICT
    • INSERT … IGNORE
    • etc.

Upsert?

  • I'm not touching this one with a barge pole
  • If you need this: read your engine's docs
  • … and preferably some recent blog posts too (because your engine may have changed approach recently)
  • Don't think this is a trivial / easy operation
  • … this is only true in trivial circumstances!

Moar syntax!

Moar tables!


          CREATE TABLE flights (
              flight   VARCHAR(10) NOT NULL,
              lat      REAL NOT NULL DEFAULT 0.0,
              long     REAL NOT NULL DEFAULT 0.0,

              CONSTRAINT flights_pk PRIMARY KEY (flight),
              CONSTRAINT lat_ck CHECK (lat BETWEEN -90 AND 90),
              CONSTRAINT long_ck CHECK (long BETWEEN -180 AND 180)
          );

          INSERT INTO flights(flight) VALUES ('hab1');

          ALTER TABLE readings
              ADD CONSTRAINT flights_fk FOREIGN KEY (flight)
                  REFERENCES flights(flight) ON DELETE CASCADE;
          

But for SQLite…


          CREATE TABLE readings2 (
              flight    VARCHAR(10) NOT NULL,
              ts        TIMESTAMP NOT NULL,
              temp      NUMERIC(3,1) NOT NULL,
              pressure  NUMERIC(4,0) NOT NULL,
              humidity  NUMERIC(3,0) NOT NULL,
              accel_x   REAL DEFAULT 0 NOT NULL,
              accel_y   REAL DEFAULT 0 NOT NULL,
              accel_z   REAL DEFAULT 0 NOT NULL,

              CONSTRAINT readings_pk PRIMARY KEY (flight, ts),
              CONSTRAINT temp_ck CHECK (temp BETWEEN -70 AND 70),
              CONSTRAINT pres_ck CHECK (pressure BETWEEN 0 AND 2000),
              CONSTRAINT hum_ck CHECK (humidity BETWEEN 0 AND 100),
              CONSTRAINT flights_fk FOREIGN KEY (flight)
                  REFERENCES flights(flight) ON DELETE CASCADE
          );
          

But for SQLite…


          PRAGMA foreign_keys=on;
          INSERT INTO flights(flight) VALUES ('hab1');
          INSERT INTO readings2 SELECT * FROM readings;
          DROP TABLE readings;
          ALTER TABLE readings2 RENAME TO readings;
          

Notes

  • Foreign keys are not automatically indexed on the child side (see the sketch below)
  • SQLite doesn't support FOREIGN KEY by default (compatibility)
  • SQLite doesn't support ALTER … DROP (argh)
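
So if you join or cascade-delete on the child side a lot, consider creating that index yourself. A sketch (the index name is made up, and in this particular schema readings.flight already leads the primary key, so it is purely illustrative):


          # Hypothetical index on the child side of a foreign key; not
          # actually needed for readings.flight, which already leads the PK
          c.execute("CREATE INDEX readings_flight_ix ON readings(flight)")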

Querying tables

Welcome to the stupidest syntax in SQL:

Query execution order

  1. FROM
  2. WHERE
  3. GROUP BY
  4. HAVING
  5. SELECT
  6. ORDER BY
  7. LIMIT

Notes

The output of each clause is a single table; yes, even the FROM clause!
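
For example, reading this query in execution order (FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY, LIMIT) makes far more sense than reading it top to bottom; the thresholds here are made up:


          pd.read_sql("""
              SELECT flight,                 -- 5. project / compute aggregates
                     COUNT(*) AS samples,
                     AVG(temp) AS avg_temp
              FROM readings                  -- 1. pick the source table
              WHERE pressure > 500           -- 2. filter the rows
              GROUP BY flight                -- 3. one group per flight
              HAVING COUNT(*) > 10           -- 4. filter the groups
              ORDER BY avg_temp DESC         -- 6. sort the result
              LIMIT 5                        -- 7. keep at most five rows
          """, c)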

Moar data!


          INSERT INTO flights(flight) VALUES ('hab2'), ('hab3');
          INSERT INTO readings
          SELECT
              'hab2', ts, temp, pressure, humidity,
              accel_x, accel_y, accel_z
          FROM readings
          WHERE flight = 'hab1';
          INSERT INTO readings
          SELECT
              'hab3', ts, temp, pressure, humidity,
              accel_x, accel_y, accel_z
          FROM readings
          WHERE flight = 'hab1';
          

FKs in action!


          SELECT COUNT(*) FROM readings;
          DELETE FROM flights WHERE flight = 'hab3';
          SELECT COUNT(*) FROM readings;
          

Joining tables

joins:: (syntax diagram)

table-ref:: (syntax diagram)

Join semantics

All joins start life as a cross product:

foo
  id  value_a
  A   1
  B   2
  C   3

bar
  id  value_b
  A   4
  B   5

foo JOIN bar
  foo.id  bar.id  foo.value_a  bar.value_b
  A       A       1            4
  A       B       1            5
  B       A       2            4
  B       B       2            5
  C       A       3            4
  C       B       3            5

Join semantics

Then the predicate is applied to filter rows:

foo JOIN bar ON foo.id = bar.id
  foo.id  bar.id  foo.value_a  bar.value_b
  A       A       1            4
  B       B       2            5

Notes

Technically, the result set has two columns named “id”; this is normal in SQL and is why column order matters
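
If the duplicate names bother you (Pandas, for one, will cheerfully hand you two “id” columns), alias them explicitly; a sketch assuming the toy foo and bar tables above actually exist somewhere:


          pd.read_sql("""
              SELECT foo.id AS foo_id, bar.id AS bar_id,
                     foo.value_a, bar.value_b
              FROM foo JOIN bar ON foo.id = bar.id
          """, c)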

Join semantics

All joins form a tree which is evaluated to form the output of the FROM clause:


          SELECT *
          FROM a
              JOIN b ON a.id = b.ref_id
              JOIN (
                  c JOIN d ON c.id = d.ref_id
              ) ON b.cid = c.ref_id
          

FKs in action!


          SELECT f.flight, r.ts, r.temp
          FROM flights f JOIN readings r ON f.flight = r.flight
          

Notes

Some engines will perform join elimination when foreign keys permit them to, as in the above case (unfortunately SQLite is not one of them)

Views

Yup … that's it

Views: example


          CREATE VIEW flight_readings AS
          SELECT
              r.flight, f.lat, f.long, r.ts,
              r.temp, r.pressure, r.humidity,
              r.accel_x, r.accel_y, r.accel_z
          FROM
              flights AS f
              JOIN readings AS r ON f.flight = r.flight;
          

SQL compilers are smarter than most people realize!


          SELECT flight, ts, temp FROM flight_readings;
          

On a smart engine this will become:


          SELECT flight, ts, temp FROM readings;
          

Thank You

Questions?