Cheatsheet - SQL

Count

SELECT count(*) from classics;

Distinct

select distinct author from classics;

Delete

DELETE FROM classics WHERE title='Little Dorrit';

Query

SELECT author,title FROM classics;

Exact Match:

SELECT author,title FROM classics WHERE author="Mark Twain";

Approx Match:

SELECT author,title FROM classics WHERE author LIKE "Charles%";
SELECT author,title FROM classics WHERE title LIKE "%Species";
SELECT author,title FROM classics WHERE title LIKE "%and%";

Limit the number of results:

SELECT author,title FROM classics LIMIT 3;
SELECT author,title FROM classics LIMIT 1,2;
SELECT author,title FROM classics LIMIT 3,1; (omit first 3, starting from 4, select 1)

MATCH...AGAINST: search in FULLTEXT, could contain multiple words return nothing since 'and' is a stopword:

SELECT author,title FROM classics
WHERE MATCH(author,title) AGAINST('and');
SELECT author,title FROM classics
WHERE MATCH(author,title) AGAINST('old shop');
SELECT author,title FROM classics
WHERE MATCH(author,title) AGAINST('tom sawyer');

SELECT author,title FROM classics
WHERE MATCH(author,title)
AGAINST('+charles -species' IN BOOLEAN MODE);
SELECT author,title FROM classics
WHERE MATCH(author,title)
AGAINST('"origin of"' IN BOOLEAN MODE);

Order by:

SELECT author,title FROM classics ORDER BY author;
SELECT author,title FROM classics ORDER BY title DESC;
SELECT author,title,year FROM classics ORDER BY author,year DESC;
SELECT author,title,year FROM classics ORDER BY author ASC,year DESC;

Group by:

SELECT category,COUNT(author) FROM classics GROUP BY category;

Join:

SELECT name,author,title from customers,classics
WHERE customers.isbn=classics.isbn;

SELECT name,author,title FROM customers
JOIN classics ON customers.isbn=classics.isbn;

SELECT name,author,title FROM customers NATURAL JOIN classics;

As:

SELECT name,author,title from
customers AS cust, classics AS class WHERE cust.isbn=class.isbn;

SELECT author,title FROM classics WHERE
author LIKE "Charles%" AND author LIKE "%Darwin";
SELECT author,title FROM classics WHERE
author LIKE "%Mark Twain%" OR author LIKE "%Samuel Langhorne Clemens%";
SELECT author,title FROM classics WHERE
author LIKE "Charles%" AND author NOT LIKE "%Darwin";

Find MIN

select MIN(DATE) from SomeTable;

Join array to string

SELECT ARRAY_TO_STRING(arr, ",")
FROM ...

What are the SQL Standards and Dialects?

The latest SQL standard is SQL

Systems that support standard SQL:

Presto: used by Facebook, open source, written in Java, successor of Hive (Presto is based on ANSI SQL but HiveQL is not)
BigQuery / Spanner, used by Google, both use Google's Standard SQL, based on ANSI SQL

BigQuery and Presto are optimized for batch/read queries.

Dialects / Extensions:

PL/SQL: Oracle's procedural extension for SQL
PL/pgSQL: PostgreSQL's version.
Transact-SQL (T-SQL): Microsoft's extension to SQL

What are DQL, DDL, DCL and DML?

What are different types of joins?

INNER JOIN: returns matched rows in both tables
OUTER JOIN:
- LEFT JOIN: returns all rows from the left table and the matched rows from the right table.
- RIGHT JOIN: returns all rows from the right table and the matched rows from the left table.
- FULL JOIN: returns all the rows from both joined tables, whether they have a matching row or not.
CROSS JOIN: a cartesian product of the two tables.

How to speed up queries?

Avoid using SELECT *. Applying a LIMIT clause to a SELECT * query does not affect the amount of data read
Keep data denormalized. "The storage savings from normalized data are less of a concern in modern systems"
"Trim the data as early in the query as possible, before the query performs a JOIN"
JOIN order: place the largest table first, followed by the smallest, and then by decreasing size.

What is a primary key?

A primary key is a column that can uniquely identify the rows.

What are contraints?

E.g. NOT NULL, UNIQUE, PRIMARY KEY, FOREIGN KEY.

2 Types: column level constraints and table level constraints.

What is a Trigger in SQL?

A trigger is a database object that is associated with a table. It will be activated when a defined action is executed for the table. E.g. when you run INSERT, UPDATE and DELETE.

What's the difference between `WHERE` and `HAVING`?

WHERE: filter all the rows.
HAVING: used after GROUP BY, filter grouped results.

What is normalization?

Normalization: decompose tables to remove redundant data. It devides large tables into smaller tables linked by keys.

Normal forms:

1NF (First Normal Form)
- Each table cell should contain a single value.
- Each record needs to be unique.
2NF (Second Normal Form)
- 1NF + Single Column Primary Key that does not functionally dependant on any subset of candidate key relation.
3NF (Third Normal Form)
- 2NF + NO transitive functional dependencies.
BCNF (Boyce-Codd Normal Form), a.k.a. 3.5NF, to address certain types of anomalies not dealt with by 3NF as originally defined
- 3NF + if a trivial functional dependency X -> Y exist for a relation; then X must be a super key.
4NF (Fourth Normal Form)
- no database table instance contains two or more, independent and multivalued data describing the relevant entity
5NF (Fifth Normal Form)
- 4NF + it cannot be decomposed into any number of smaller tables without loss of data.
6NF (Sixth Normal Form)

What is Denormalization?

The reverse process of normalization: use redundancy to improve performance.

What is transitive functional dependency?

A transitive dependency exists when you have the following functional dependency pattern: A->B and B->C; therefore A->C.

For example: Book -> Author, Author -> Author age, therefore Book -> Author age.

Transitive functional dependency basically means when changing a non-key column, it may cause any of the other non-key columns to change.

What is a Primary Key?

A primary key uniquely identifies a row in a table.

What is a Composite Key?

Composite Key: consists of multiple columns that together uniquely identify a row.

What is a Foreign Key?

A foreign key referneces the primary key of another table.

it can have a different name from the primary key.
they do not have to be unique.
foreign key can be null; primary key cannot.

What is the difference between UDF and TVF?

Both are user-defined. TVF (table-valued function) returns a table. They are different in different systems. For example:

In BigQuery, UDF returns the result as a value (SQL UDFs are created by CREATE FUNCTION or CREATE TEMP FUNCTION). TVF returns a table (created by CREATE TABLE FUNCTION).
In SQL Server, there are 2 kinds of UDF: SVF (Scalar-valued Function) and TVF (Table-valued Function).

What is the difference between `DELETE`, `TRUNCATE` and `DROP`?

DELETE: delete rows from the table based on the condition specified by WHERE.
TRUNCATE: delete all the rows and free the space.
DROP: drop the table.

What is Pattern Matching in SQL?

Wildcards:

% matches 0 or more characters.
_ matches exactly one character.

-- matches any string that contains `bar`
SELECT * FROM Foo WHERE field LIKE '%bar%';

-- matches any string with exact 3 characters
SELECT * FROM Foo WHERE field LIKE '___';

How to create a table?

It might be different in different engines. For example in Presto:

CREATE TABLE IF NOT EXISTS
table_name (
    xxx_names ARRAY<VARCHAR>,
    ds VARCHAR
)

How to insert values?

INSERT INTO table_name(column1, column2, ...)
VALUES (value1, value2, ...);

How to use `CASE` / `WHEN`?

Create is_not_null column

CASE WHEN <column> IS NOT NULL THEN 1 ELSE 0 END AS is_not_null

How to calculate MAX_BY / MIN_BY?

Presto provides MAX_BY and MIN_BY, but BigQuery does not, here's an alternative:

ARRAY_AGG(field1 ORDER BY field2)[OFFSET(0)]

How to calculate Cumulative sum?

SUM(a) OVER (ORDER BY a DESC ROWS UNBOUNDED PRECEDING)

How to calculate % of overall?

SUM(a) / SUM(SUM(a)) OVER ()

How to calculate Conditional SUM?

If bool_val is true, add up the val.

SUM(IF(bool_val, val, 0))

How to Join ARRAY to STRING

SELECT ARRAY_TO_STRING(arr, ",")
FROM ...

How to transpose a table?

Most of SQL engines do not have built-in functions to convert rows into columns, (a.k.a "transpose" the table). In order to convert a table like this:

id   | key   | value
-----|-------|------------
id_0 | key_0 | value_1
id_0 | key_1 | value_10
id_1 | key_1 | value_100
id_2 | key_0 | value_1000

To this, so each key is a separate column:

id   | key_0      | key_1
-----|------------|------------
id_0 | value_1    | value_10
id_1 | null       | value_100
id_2 | value_1000 | null

This can be achieved by GROUP(by id) and ARRAY_AGG and filter out un-related info:

-- prep the table
WITH data AS (
  SELECT 'id_0' AS id, 'key_0' AS key, 'value_1' AS value
  UNION ALL SELECT 'id_0' AS id, 'key_1' AS key, 'value_10' AS value
  UNION ALL SELECT 'id_1' AS id, 'key_1' AS key, 'value_100' AS value
  UNION ALL SELECT 'id_2' AS id, 'key_0' AS key, 'value_1000' AS value
)

-- actual pivot logic
SELECT
  id,
  ARRAY_AGG(CASE key WHEN 'key_0' THEN value ELSE NULL END IGNORE NULLS)[SAFE_OFFSET(0)] AS key_0,
  ARRAY_AGG(CASE key WHEN 'key_1' THEN value ELSE NULL END IGNORE NULLS)[SAFE_OFFSET(0)] AS key_1
FROM data
GROUP BY 1

Where CASE key WHEN 'key_0' THEN value ELSE NULL END will nullify values not related to this key, and IGNORE NULLS will make sure only relevant values goes into ARRAY_AGG.

What is `UNNEST`?

UNNEST: expands an ARRAY into a relation (i.e. table)

Presto: supports ARRAY to 1 column and MAP to 2 columns table
BigQuery: only supports ARRAY to 1 column table

For example:

SELECT ARRAY[1, 2, 3]

Result is one row

f0_
---
[1, 2, 3]

Use UNNEST

SELECT * FROM UNNEST(ARRAY[1, 2, 3]) AS a

The result is a table of one column

a
---
1
2
3

Equivalent to a map function:

SELECT a*a FROM UNNEST(ARRAY[1, 2, 3]) AS a

a
---
1
4
9

SELECT ARRAY(SELECT a*a FROM UNNEST(ARRAY[1, 2, 3]) AS a)

f0_
---
[1,4,9]

SELECT ARRAY_TO_STRING(
  ARRAY(SELECT CAST(a * a AS STRING) FROM UNNEST(ARRAY[1, 2, 3]) AS a), ","
)

f0_
---
1,4,9

What is Rollup?

ROLLUP(YEAR, MONTH, DAY)

With a ROLLUP, it will have the following outputs:

YEAR, MONTH, DAY
YEAR, MONTH
YEAR
()

What is Cube?

With CUBE, it will have the following:

YEAR, MONTH, DAY
YEAR, MONTH
YEAR, DAY
YEAR
MONTH, DAY
MONTH
DAY
()

CUBE essentially contains every possible rollup scenario for each node whereas ROLLUP will keep the hierarchy in tact (so it won't skip MONTH and show YEAR/DAY, whereas CUBE will)

How to calculate SUM of a repeated field?

message TopLevelMessage {
  repeated FooMessage foos = 1;
  ...
}

message FooMessage {
 int64 bar = 1;
 int64 baz = 2;
...
}

SELECT
  ...
  (SELECT SUM(bar) FROM UNNEST(foos)) AS sum_bar
FROM table

How to Change password?

For example in PostgreSQL:

ALTER ROLE username
WITH PASSWORD 'password';