Databases

Over the years, different kinds of databases were developed for different use cases. DB-Engines is tracking more than 300 databases. Choose wisely.

Relational Databases

The traditional type.

relations: between rows and columns(a relation is a table, not between tables)
a unique key identifying each row.

It is a safe bet to start with PostgreSQL or MySQL. Many mobile apps rely on SQLite.

Cloud services:

AWS, Azure and GCP have their home-grown relational database, and also provide hosted services for popular databases.

Amazon Aurora: Amazon's own relational database
Amazon RDS: Amazon managed MySQL / PostgreSQL / MariaDB
Google Cloud SQL: Google managed MySQL / PostgreSQL
Azure Database for MySQL / PostgreSQL / MariaDB: Microsoft managed db.

NewSQL

NewSQL = Online scalability of NoSQL databases + relational data model / ACID consistency + SQL interface

Google Cloud Spanner: Google's own relational database

Versus Traditional RDBMS:

In a traditional vertically scaled RDBMS, there is a single node that manages all of the tables. (Depending on the installation, there may be replicas that can be used for reading or for failover). This single node therefore has full control over the table row locks, and the issuing of unique keys from a numeric sequence.

new sql: with many nodes reading and writing to the database at any one time.

Traditional vertical RDBMS have integrated sequence generators, which create new integer keys from a sequence during a transaction. Cloud Spanner cannot have this due to its distributed architecture, as there would either be race conditions between the split leader nodes when inserting new keys, or the table would have to be globally locked when generating a new key, both of which would reduce performance.

Data Warehouse

Not for online serving, but for offline analytics.

Time Series Database

Especially useful for:

DevOps Monitoring
IoT Applications
Real-time analytics

Examples:

InfluxDB: https://www.influxdata.com The most popular Time Series db tracked on db-engines.
Prometheus: a CNCF project.
Graphite
TimescaleDB
Amazon Timestream

NoSQL (non-relational)

Several different types:

document: stores data as JSON documents.
wide-column: stores data as tables but the columns are flexible, not all the rows have all the columns. (Data is actually stored in tuples like (row, column, timestamp).)
key-value: often used for caching.
graph: stores data in nodes and relationships in edges.

Document

MongoDB is probably the most popular open source NoSQL database. DynamoDB is the one shining in the cloud.

Cloud services:

MongoDB Atlas: MongoDB's own cloud offering https://www.mongodb.com/cloud/atlas/
Amazon DynamoDB

Wide-column

Based on the design of Google's Bigtable:

Cloud services:

Google Cloud Bigtable

Caching (key-value)

Choose Redis for cache for all new projects. Though Memcache is used extensively at Facebook.

Redis: not often used as a primary data store, but for storing and accessing ephemeral data(loss can be tolerated) – metrics, session state, caching.

Some databases are optimized for flash, making them cheaper alternatives to caches. E.g. Aerospike.

Key-value stores, by design, have trouble linking values together (in other words, they have no foreign keys).

Graph

Relational databases can function as general purpose graph databases, but they are not very efficient to traverse the graphs. Multiple queries and joins are often needed.

E.g. Facebook created its own huge social graph, every entity is a node(like a person, a business, a page, a group), and the different types of relationships are the edges. It is backed by TAO, which is actually a caching layer over MySQL.

Search

Solr is losing popularity to Elasticsearch.

Cloud Services:

Amazon CloudSearch
Microsoft Azure Search
Google Search Appliance

Notes of Popular databases

PostgreSQL
MySQL
MongoDB
Wide Column Stores: including Bigtable, Cassandra, HBase, Accumulo
Spanner: including Spanner, CockroachDB, YugaByteDB

SQLite

Often used in web browsers and mobile apps.

Pros

the whole database is in one single file on disk, can be embedded inside the application, very portable.
zero-config, easy to setup (e.g. no GRANT, use file permissions)

Cons

not a client–server database, no network capabilities, cannot be used over a network.
not for write-intensive deployments, not for high concurrency use case, since it relies on file-system locks, versus server-based databases handle locks in daemons
no type checking, the type of a value is dynamic and not strictly constrained by the schema
no user management; no way to tune for higher performance
not that reliable, comparing to other RDBMS like MySQL

Riak

Uses vector lock to resolve write conflicts.

LevelDB and RocksDB

LevelDB (created by Google): inspired by Bigtable, also use SSTable.
RocksDB (created by Facebook): a fork of LevelDB
write-optimized
embedded
better compression
less write-amplification
http://rocksdb.blogspot.com/2013/11/the-history-of-rocksdb.html
https://code.facebook.com/posts/666746063357648/under-the-hood-building-and-open-sourcing-rocksdb/

How it works:

SSTables are arranged in several levels.
SSTables are non-overlapping within a level, e.g. 2 levels:
- level 1: 2 SSTables, one with key space [a, b), and another [b,c)
- level 2: single SSTable with key space [a, c)
- the query with a string starting with a will look at [a,c) on level 2 and [a, b) on level 1. Since the strings are sorted, it takes log(n)

Aerospike

Data model:

namespaces(databases): contain sets (tables) of records
sets: table
records:
- key: identify records
- metadata: generation tracks record modification, time-to-live (TTL) specifies record expiration
- bin: name value pairs.

Traits:

Data is sharded and balanced between servers using a Paxos-based membership algorithm.
Aerospike makes a dangerous assumption for a distributed datastore: it assumes the network is reliable

Comparing to in memory cache:

optimized for flash storage
cheaper than the in memory cache
no need to re-load data into memory after outrage

CouchDB

append only: data is virtually incorruptible and easy to replicate, back up, and restore.
not for ad hoc queries
multiple masters

Amazon RDS

Data is continuously backed up to S3 in real time, with no performance impact.

CosmosDB

Instead of creating an index tree for each column, Cosmos DB employs one index for the whole database account

Comparisons

InnoDB vs RocksDB

InnoDB: B-tree
RocksDB: LSM(log-structured merge tree)

LSM vs B-tree: http://smalldatum.blogspot.com/2016/01/summary-of-advantages-of-lsm-vs-b-tree.html

We have found RocksDB, when compared to InnoDB, uses less storage space to store an UDB instance and generates less write traffic overall
A B-Tree wastes space when pages fragment. An LSM doesn't fragment.

Timescale vs InfluxDB

https://blog.timescale.com/timescaledb-vs-influxdb-for-time-series-data-timescale-influx-sql-nosql-36489299877

as cardinality moderately increases, InfluxDB performance drops dramatically due to its reliance on time-structured merge trees (which, similar to the log-structured merge trees it is modeled after, suffers with higher-cardinality datasets). This of course should be no surprise, as high cardinality is a well known Achilles heel for InfluxDB

Pipeline that responds to database writes

Use cases:

Extraction: Stream data to key-value maps or other simplified forms, so that serving systems or others can consume them.
Transformation: Do additional computation and write the augmented data back to the same database.