Databases - Spanner and Related Databases
Spanner
Spanner is a globally-scalable database system used internally by Google; it is the successor of the BigTable database. (Link to the paper)
Cloud Spanner is the managed database on Google Cloud Platform.
Like BigTable, Spanner stores data in SSTables; however, it has started migrating to a columnar format.
Spanner does not have an auto-increment key. Do not use monotonically increasing values, including timestamps, as keys: Spanner is distributed and sharded by key range, so such keys send all new writes to the same split, which creates hotspots and hurts performance.
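One common way to follow the key advice above is to use random keys (such as UUIDv4), or to prefix a naturally increasing key with a hash-derived shard id. The sketch below is illustrative only: `NUM_SHARDS`, `sharded_key` and `random_key` are hypothetical names, and the shard count is an arbitrary assumption rather than a Spanner setting.

```python
import hashlib
import uuid
from collections import Counter
from typing import Tuple

NUM_SHARDS = 8  # hypothetical shard count, chosen only for illustration


def sharded_key(natural_key: str, num_shards: int = NUM_SHARDS) -> Tuple[int, str]:
    """Prefix a monotonically increasing key with a stable, hash-derived shard id."""
    digest = hashlib.sha256(natural_key.encode("utf-8")).digest()
    shard_id = int.from_bytes(digest[:4], "big") % num_shards
    return (shard_id, natural_key)


def random_key() -> str:
    """A UUIDv4 key: random, so consecutive inserts are not adjacent in key order."""
    return str(uuid.uuid4())


# Zero-padded sequence numbers sort next to each other, so as plain keys
# they would all be appended to the same end of the key space.
sequential = [f"{n:010d}" for n in range(1000, 2000)]

# With a shard prefix, the same inserts are spread over NUM_SHARDS key ranges.
spread = Counter(sharded_key(k)[0] for k in sequential)
print(sorted(spread.items()))
```

The trade-off of the shard prefix is that a range scan over the natural key has to fan out over every shard id.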
Spanner natively supports Protocol Buffers (ProtoBuf).
Cloud Spanner manages splits using Paxos.
Data model
- A Spanner database has one or more tables.
- Tables look like those in other relational databases: rows, columns and values. Data is strongly typed.
- Each table has a primary key made up of one or more columns.
- A table can define one or more secondary indexes.
- Table interleaving and foreign keys are supported.
When you commit writes to a Spanner database, the system versions each item of data and associates it with a specific commit timestamp. This means that the next time you update an item of data, the old version of that data can still be read (subject to garbage collection limits), and the new version is assigned a timestamp that is guaranteed to be greater than the timestamp of the old version. This allows Spanner clients to read current values of the data (aka "strong reads") as well as older values (using a read at a timestamp or a bounded stale read) within a certain bound (e.g. the past ~4 hours).
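The versioning behaviour described above can be sketched with a toy in-memory multi-version store. This is not Spanner's implementation: `VersionedStore` and its methods are hypothetical names, timestamps are plain integers, and garbage collection is reduced to a single horizon.

```python
import bisect
from typing import Any, Dict, List, Optional, Tuple


class VersionedStore:
    """Toy multi-version store: each key keeps (commit_ts, value) pairs in timestamp order."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[Tuple[int, Any]]] = {}
        self._last_commit_ts = 0

    def commit(self, key: str, value: Any, now: int) -> int:
        # The commit timestamp is strictly greater than every earlier one.
        commit_ts = max(now, self._last_commit_ts + 1)
        self._last_commit_ts = commit_ts
        self._versions.setdefault(key, []).append((commit_ts, value))
        return commit_ts

    def read_at(self, key: str, read_ts: int) -> Optional[Any]:
        """Return the newest value with commit_ts <= read_ts, or None if there is none."""
        versions = self._versions.get(key, [])
        timestamps = [ts for ts, _ in versions]
        idx = bisect.bisect_right(timestamps, read_ts)
        return versions[idx - 1][1] if idx else None

    def strong_read(self, key: str) -> Optional[Any]:
        """Read the latest committed value."""
        return self.read_at(key, self._last_commit_ts)

    def garbage_collect(self, horizon: int) -> None:
        """Drop versions that are no longer visible to reads at or after `horizon`."""
        for key, versions in self._versions.items():
            timestamps = [ts for ts, _ in versions]
            idx = bisect.bisect_right(timestamps, horizon)
            # Keep the newest version at or before the horizon so reads there still work.
            self._versions[key] = versions[max(idx - 1, 0):]


store = VersionedStore()
t1 = store.commit("user:1", "alice", now=100)
t2 = store.commit("user:1", "alicia", now=100)  # same wall time, later timestamp
print(store.strong_read("user:1"), store.read_at("user:1", t1))
```

After `garbage_collect` runs with a horizon past `t1`, a read at `t1` no longer finds the old version, which mirrors the "subject to garbage collection limits" caveat.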
TrueTime
TrueTime is a highly available, distributed clock that is provided to applications on all Google servers. TrueTime enables applications to generate monotonically increasing timestamps.
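As a rough illustration of how a clock with bounded uncertainty can yield increasing timestamps, the sketch below follows the interval-plus-commit-wait idea described in the published Spanner paper. It is a simplification under stated assumptions: `ToyTrueTime`, `TTInterval` and `commit_with_wait` are hypothetical names, and the uncertainty `epsilon` is a constant passed in by the caller rather than something measured.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


class ToyTrueTime:
    """Clock that reports an interval assumed to contain the true time."""

    def __init__(self, epsilon: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._epsilon = epsilon  # assumed bound on clock uncertainty, in seconds
        self._clock = clock

    def now(self) -> TTInterval:
        t = self._clock()
        return TTInterval(t - self._epsilon, t + self._epsilon)

    def after(self, ts: float) -> bool:
        """True once `ts` has definitely passed."""
        return self.now().earliest > ts


def commit_with_wait(tt: ToyTrueTime, sleep: Callable[[float], None] = time.sleep) -> float:
    """Pick a commit timestamp, then wait until it is guaranteed to be in the past."""
    commit_ts = tt.now().latest
    while not tt.after(commit_ts):
        sleep(0.001)
    return commit_ts
```

Because each commit waits out the uncertainty before returning, a commit that starts after an earlier one has returned always picks a larger timestamp in this model.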
Achieving Eventual Consistency Performance
Spanner provides stale reads, which offer performance benefits similar to those of eventual consistency but with much stronger consistency guarantees. A stale read returns data as of an "old" timestamp; it cannot block writes, because old versions of data are immutable.
Proto
Spanner supports a PROTO<...> type, which allows for the storage of structured data using a user-defined protocol buffer type.
- with BLOB: Opaque protos, not validated at write time; queries might return null at query time for data that does not match the proto definition.
- without BLOB: CREATE PROTO BUNDLE is required to validate message contents on writes; it also enables use of fields in that proto type in Spanner SQL queries.
Index
In Cloud Spanner, indexes are actually implemented using tables, which allows them to be distributed and enables the same degree of scalability and performance as normal tables.
However, because of this implementation, reading table rows through an index is less efficient than in a traditional RDBMS: it is effectively an inner join with the original table. Using an index in Cloud Spanner is therefore always a trade-off between improved read performance and reduced write performance.
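The read and write costs described above can be made concrete with a toy model in which the index is just another table keyed by (indexed value, base primary key). The table and column names (`singers`, `singers_by_last_name`, `last_name`) are hypothetical, and plain dictionaries stand in for distributed tables.

```python
from typing import Dict, List, Tuple

# Base table: primary key -> row.
singers: Dict[int, Dict[str, str]] = {}
# Index "table": keyed by (indexed column value, base primary key).
singers_by_last_name: Dict[Tuple[str, int], None] = {}


def insert_singer(singer_id: int, first: str, last: str) -> int:
    """Insert a row and return how many table writes it took."""
    singers[singer_id] = {"first_name": first, "last_name": last}
    singers_by_last_name[(last, singer_id)] = None  # extra write caused by the index
    return 2


def find_by_last_name(last: str) -> List[Dict[str, str]]:
    # Step 1: scan the index table for matching entries.
    matching_ids = [pk for (value, pk) in sorted(singers_by_last_name) if value == last]
    # Step 2: join back to the base table to fetch the remaining columns.
    return [singers[pk] for pk in matching_ids]


insert_singer(1, "Marc", "Richards")
insert_singer(2, "Catalina", "Smith")
insert_singer(3, "Alice", "Richards")
print(find_by_last_name("Richards"))
```

Every insert touches two tables, and every indexed lookup pays for the second step, which is the join back to the base table.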
Spanner Inspired Databases
- CockroachDB
- YugabyteDB
CockroachDB
https://www.cockroachlabs.com/
An open source database inspired by Google Spanner. CockroachDB is a distributed database architected for modern cloud applications. It is wire compatible with PostgreSQL.
CockroachDB is backed by RocksDB, an embedded key-value store, or by Pebble, a purpose-built derivative. Although RocksDB comes from Facebook, it is based on LevelDB, which came from Google.
CockroachDB is implemented in Go.
CockroachDB deprecated interleaved tables and indexes in v20.2, saying that scans over interleaved tables and indexes are much slower than scans over tables and indexes with no child objects, and that schema changes are slower for interleaved objects. https://www.cockroachlabs.com/docs/v21.1/interleave-in-parent#deprecation
Pebble is a key-value storage engine that replaced RocksDB as the default storage engine in CockroachDB. Like RocksDB, it is built on an LSM tree. Pebble does not aim to be a complete replacement for RocksDB, only a replacement for the subset of RocksDB functionality used by CockroachDB. https://www.cockroachlabs.com/blog/pebble-rocksdb-kv-store/ https://github.com/cockroachdb/pebble
Tablet vs Group
TL;DR: A group is a logical set of splits, whereas a tablet is a physical replica of that group. (A tablet is a replica, a group is a replicaset.)
They are related but represent different concepts in Spanner's architecture:
- Tablet:
  - What it is: The fundamental unit of data storage and partitioning in Spanner.
  - Analogy: Think of a very large table in your database. Spanner automatically splits this table (based on size or load) into manageable, contiguous ranges of rows, sorted by primary key. Each of these ranges is a tablet.
  - Purpose: To break down large tables/indexes into smaller chunks that can be independently moved, managed, and served by different Spanner servers (nodes). This enables horizontal scaling.
  - Content: Contains a specific range of rows for a table or index.
  - Lifecycle: Spanner automatically splits tablets when they get too large or busy, and merges them when they become too small or inactive.
- Spanner Group (usually means Paxos Group):
  - What it is: The fundamental unit of replication and consensus for the data within a single tablet.
  - Analogy: For each tablet (chunk of data), Spanner creates multiple copies (replicas) and places them in different physical locations (zones) for high availability and durability. This collection of replicas for one tablet forms a Paxos Group.
  - Purpose: To ensure data consistency (via the Paxos consensus algorithm) for writes and strongly consistent reads affecting the data range managed by its associated tablet. It also provides fault tolerance: if one replica (or even its zone) becomes unavailable, the group can continue operating using the remaining replicas.
  - Composition: Consists of several replicas of a single tablet. One replica acts as the leader, coordinating writes, while others are followers (or sometimes witnesses).
  - Relationship to Tablet: Every tablet has an associated Paxos group responsible for managing its replication and consistency.
Here's a table summarizing the key differences:
Feature | Tablet | Spanner Group (Paxos Group) |
---|---|---|
Unit of | Data Storage & Partitioning | Replication & Consensus |
Represents | A contiguous range of rows from a table/index | A set of replicas for a single tablet |
Purpose | Horizontal scaling, data distribution | Consistency, High Availability, Fault Tolerance |
Composition | Actual user data (a subset of a table) | Multiple copies (replicas) of one tablet's data |
Managed By | Spanner (splitting/merging) | Paxos consensus algorithm (among replicas) |
In simple terms:
- Spanner chops your big table into smaller pieces called Tablets.
- For each Tablet, Spanner makes several copies (replicas) and puts them in different places. This set of copies for one tablet is managed as a Paxos Group to keep them all consistent and available.
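The relationship between key ranges, replicas and groups can be sketched as a small data model. This is an illustration, not Spanner's design: `Replica`, `PaxosGroup` and `route` are hypothetical names, and a simple majority check stands in for the actual Paxos protocol.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Replica:
    """One physical copy of a key range (a tablet, in the TL;DR's terms)."""
    zone: str
    data: Dict[str, str] = field(default_factory=dict)
    available: bool = True


@dataclass
class PaxosGroup:
    """All replicas of one contiguous key range."""
    start_key: str  # inclusive
    end_key: str    # exclusive; "" means unbounded
    replicas: List[Replica]

    def owns(self, key: str) -> bool:
        return key >= self.start_key and (self.end_key == "" or key < self.end_key)

    def write(self, key: str, value: str) -> bool:
        """Apply the write only if a majority of replicas can acknowledge it."""
        live = [r for r in self.replicas if r.available]
        if len(live) <= len(self.replicas) // 2:
            return False  # no quorum, so the write is rejected
        for replica in live:
            replica.data[key] = value
        return True


def route(groups: List[PaxosGroup], key: str) -> PaxosGroup:
    """Find the group whose key range contains `key`."""
    return next(g for g in groups if g.owns(key))


def make_replicas() -> List[Replica]:
    return [Replica("zone-a"), Replica("zone-b"), Replica("zone-c")]


# Two key ranges, each replicated across three zones.
groups = [
    PaxosGroup("", "m", make_replicas()),
    PaxosGroup("m", "", make_replicas()),
]
group = route(groups, "alice")
group.replicas[0].available = False   # lose one zone
print(group.write("alice", "v1"))     # a majority (2 of 3) is still available
```

Losing a second replica in the same group makes `write` return `False`, while the other group keeps serving its own key range independently.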