Databases - Spanner and Related Databases

Spanner

Spanner is a globally scalable database system used internally by Google; it is the successor to the BigTable database. (Link to the paper)

Cloud Spanner is the managed version of Spanner offered on Google Cloud Platform.

Like BigTable, Spanner stores data in SSTables; however, it has started migrating to a columnar format.

Spanner does not have an auto-increment key. Do not use values that increase monotonically, including timestamps, as keys: Spanner is distributed and sharded by key, so such keys send every new row to the same shard, which creates hotspots and hurts performance.
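A minimal sketch of two common ways to avoid monotonically increasing keys: reversing the bits of a sequential number, or using a random UUID. The helper names `bit_reverse_64` and `random_key` are hypothetical and exist only for this illustration; this is not Spanner client code.

```python
import uuid


def bit_reverse_64(n: int) -> int:
    """Reverse the bit order of an unsigned 64-bit integer.

    Consecutive inputs (1, 2, 3, ...) map to values that are far apart,
    so they no longer sort next to each other in the key space.
    """
    if not 0 <= n < 1 << 64:
        raise ValueError("expected an unsigned 64-bit value")
    return int(format(n, "064b")[::-1], 2)


def random_key() -> str:
    """Return a random (version 4) UUID as a string key."""
    return str(uuid.uuid4())


# Sequential ids become widely separated keys.
sequential_ids = [1, 2, 3, 4]
spread_keys = [bit_reverse_64(n) for n in sequential_ids]
```

Bit reversal keeps the mapping reversible (applying it twice returns the original number), while a random UUID needs no counter at all.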

Spanner natively supports Protocol Buffers (ProtoBuf).

Cloud Spanner manages splits using Paxos.

Data model

  • A Spanner database has one or more tables.
  • Tables look like tables in other relational databases: rows, columns and values. Data is strongly typed.
  • Each table has a primary key, made up of one or more columns.
  • A table can define one or more secondary indexes.
  • Table interleaving and foreign keys are supported.

When you commit writes to a Spanner database, the system versions each item of data and associates it with a specific commit timestamp. This means the next time you update an item of data, the old version of that data can still be read (subject to garbage collection limits), and the new version is assigned a timestamp that is guaranteed to be greater than the timestamp of the old version. This allows Spanner clients to read current values of the data (aka "strong reads") as well as older values (using a read at a timestamp or a bounded stale read) within a certain bound (e.g. the past ~4 hours).
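The versioning described above can be sketched with a toy in-memory cell. This is only an illustration of the idea, not how Spanner stores data: the class `VersionedCell` and its methods are hypothetical, timestamps are plain integers, and garbage collection of old versions is not modelled.

```python
import bisect


class VersionedCell:
    """One item of data that keeps every committed version."""

    def __init__(self):
        self._timestamps = []  # commit timestamps, strictly increasing
        self._values = []      # value committed at the matching timestamp

    def write(self, commit_ts, value):
        # A new version must get a timestamp greater than all older versions.
        if self._timestamps and commit_ts <= self._timestamps[-1]:
            raise ValueError("commit timestamp must exceed earlier versions")
        self._timestamps.append(commit_ts)
        self._values.append(value)

    def read(self, read_ts=None):
        """Strong read if read_ts is None, otherwise read at a timestamp."""
        if not self._timestamps:
            return None
        if read_ts is None:
            return self._values[-1]
        # Find the newest version committed at or before read_ts.
        i = bisect.bisect_right(self._timestamps, read_ts)
        return self._values[i - 1] if i else None


cell = VersionedCell()
cell.write(10, "v1")
cell.write(20, "v2")
latest = cell.read()      # strong read: newest version
older = cell.read(15)     # read at timestamp 15: sees the version from 10
```

Because existing versions are never modified, a read at an older timestamp always returns the same answer regardless of later writes.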

TrueTime

TrueTime is a highly available, distributed clock that is provided to applications on all Google servers. TrueTime enables applications to generate monotonically increasing timestamps.

Achieving Eventual Consistency Performance

Spanner provides stale reads, which offer performance benefits similar to those of eventual consistency but with much stronger consistency guarantees. A stale read returns data as of an "old" timestamp; because old versions of data are immutable, such a read cannot block writes.

Proto

Spanner supports a PROTO<...> type, which allows for the storage of structured data using a user-defined protocol buffer type.

  • with BLOB: opaque protos that are not validated at write time; at query time, queries might return null for data that does not match the proto definition.
  • without BLOB: CREATE PROTO BUNDLE is required so that message contents are validated on writes; it also enables the use of fields of that proto type in Spanner SQL queries.

Index

In Cloud Spanner, indexes are actually implemented using tables, which allows them to be distributed and enables the same degree of scalability and performance as normal tables.

However, because of this implementation, using an index to read data from the table row is less efficient than in a traditional RDBMS: it is effectively an inner join with the original table. Every write to an indexed column must also update the index table, so using an index in Cloud Spanner is always a trade-off between improved read performance and reduced write performance.
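A toy model of an index stored as its own table may make the trade-off concrete. Everything here is hypothetical (the class `IndexedTable`, the `email` column, the write counter), and the index is assumed unique for simplicity; it sketches the idea, not Spanner's implementation.

```python
class IndexedTable:
    """Base table plus one secondary index kept as a separate 'table'."""

    def __init__(self):
        self.base = {}       # primary key -> row (dict of column values)
        self.by_email = {}   # index table: email -> primary key
        self.writes = 0      # number of table writes performed

    def insert(self, pk, row):
        # One logical insert touches two tables: the base table and the index.
        self.base[pk] = row
        self.writes += 1
        self.by_email[row["email"]] = pk
        self.writes += 1

    def find_by_email(self, email):
        # Step 1: look up the primary key in the index table.
        pk = self.by_email.get(email)
        if pk is None:
            return None
        # Step 2: join back to the base table to fetch the full row.
        return self.base[pk]


users = IndexedTable()
users.insert(1, {"name": "Ada", "email": "ada@example.com"})
users.insert(2, {"name": "Bob", "email": "bob@example.com"})
row = users.find_by_email("bob@example.com")
```

The lookup avoids scanning the base table, but it costs a second read (the join back), and each insert costs a second write.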

Spanner Inspired Databases

  • CockroachDB
  • YugabyteDB

CockroachDB

https://www.cockroachlabs.com/

An open-source database inspired by Google Spanner (not a release of Spanner itself). CockroachDB is a distributed database architected for modern cloud applications. It is wire compatible with PostgreSQL.

CockroachDB is backed by RocksDB, an embedded key-value store, or by Pebble, a purpose-built derivative. Although RocksDB comes from Facebook, it is based on LevelDB, which came from Google.

CockroachDB is implemented in Go.

CockroachDB deprecated interleaved tables and indexes in v20.2, stating that scans over interleaved tables and indexes are much slower than scans over tables and indexes with no child objects, and that schema changes are slower for interleaved objects. https://www.cockroachlabs.com/docs/v21.1/interleave-in-parent#deprecation

Pebble is a key-value engine that replaced RocksDB as the default storage engine in CockroachDB. Like RocksDB, it is built on an LSM-tree, but it implements only a subset of RocksDB: Pebble does not aim to be a complete replacement for RocksDB, only a replacement for the RocksDB functionality that CockroachDB uses. https://www.cockroachlabs.com/blog/pebble-rocksdb-kv-store/ https://github.com/cockroachdb/pebble

Tablet vs Group

TL;DR: A group is a logical set of splits, whereas a tablet is a physical replica of that group. (A tablet is a replica; a group is a replica set.)

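They are related but represent different concepts in Spanner's architecture. Note that the longer explanation that follows uses "tablet" loosely for the range of rows itself (what Cloud Spanner documentation calls a split), of which each replica in the group holds a copy: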

  1. Tablet:

    • What it is: The fundamental unit of data storage and partitioning in Spanner.
    • Analogy: Think of a very large table in your database. Spanner automatically splits this table (based on size or load) into manageable, contiguous ranges of rows, sorted by primary key. Each of these ranges is a tablet.
    • Purpose: To break down large tables/indexes into smaller chunks that can be independently moved, managed, and served by different Spanner servers (nodes). This enables horizontal scaling.
    • Content: Contains a specific range of rows for a table or index.
    • Lifecycle: Spanner automatically splits tablets when they get too large or busy, and merges them when they become too small or inactive.
  2. Spanner Group (usually meaning a Paxos Group):

    • What it is: The fundamental unit of replication and consensus for the data within a single tablet.
    • Analogy: For each tablet (chunk of data), Spanner creates multiple copies (replicas) and places them in different physical locations (zones) for high availability and durability. This collection of replicas for one tablet forms a Paxos Group.
    • Purpose: To ensure data consistency (via the Paxos consensus algorithm) for writes and strongly consistent reads affecting the data range managed by its associated tablet. It also provides fault tolerance – if one replica (or even its zone) becomes unavailable, the group can continue operating using the remaining replicas.
    • Composition: Consists of several replicas of a single tablet. One replica acts as the leader, coordinating writes, while others are followers (or sometimes witnesses).
    • Relationship to Tablet: Every tablet has an associated Paxos group responsible for managing its replication and consistency.

Here's a table summarizing the key differences:

Feature     | Tablet                                        | Spanner Group (Paxos Group)
Unit of     | Data storage and partitioning                 | Replication and consensus
Represents  | A contiguous range of rows from a table/index | A set of replicas for a single tablet
Purpose     | Horizontal scaling, data distribution         | Consistency, high availability, fault tolerance
Composition | Actual user data (a subset of a table)        | Multiple copies (replicas) of one tablet's data
Managed by  | Spanner (splitting/merging)                   | Paxos consensus algorithm (among replicas)

In simple terms:

  • Spanner chops your big table into smaller pieces called Tablets.
  • For each Tablet, Spanner makes several copies (replicas) and puts them in different places. This set of copies for one tablet is managed as a Paxos Group to keep them all consistent and available.
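The two ideas in this summary, splitting a table by key range and replicating each range across zones, can be sketched as a toy model. All names here (`Group`, `split_table`, `group_for_key`, the zone names) are hypothetical, splitting is done by a fixed row count rather than by size or load, and the quorum check is a plain majority count, not an implementation of Paxos.

```python
import bisect
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Group:
    """One key range plus the zones that each hold a replica of it."""
    start_key: str                  # first key of the contiguous range
    rows: List[str]                 # keys stored in this range
    zones: List[str]                # one replica per zone
    unavailable: Set[str] = field(default_factory=set)

    def has_quorum(self) -> bool:
        # The group keeps operating while a majority of replicas is up.
        healthy = [z for z in self.zones if z not in self.unavailable]
        return len(healthy) > len(self.zones) // 2


def split_table(sorted_keys, max_rows, zones):
    """Chop a sorted table into contiguous ranges of at most max_rows keys."""
    groups = []
    for i in range(0, len(sorted_keys), max_rows):
        chunk = sorted_keys[i:i + max_rows]
        groups.append(Group(chunk[0], chunk, list(zones)))
    return groups


def group_for_key(groups, key):
    """Find the range whose start key is the greatest one <= key."""
    starts = [g.start_key for g in groups]
    i = bisect.bisect_right(starts, key) - 1
    return groups[max(i, 0)]


keys = sorted(["apple", "banana", "cherry", "date", "fig", "grape"])
groups = split_table(keys, max_rows=2, zones=["zone-a", "zone-b", "zone-c"])
target = group_for_key(groups, "cherry")
target.unavailable.add("zone-a")    # one of three replicas is lost
still_serving = target.has_quorum()
```

With three replicas, losing one zone leaves a majority, so the range stays available; losing a second zone would not.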