Multigres Architecture Overview
Principles
Multigres will follow a set of principles suited for large-scale distributed systems. They are as follows:
Scalability
In a distributed system, scalability is mainly achieved by removing all possible bottlenecks. Among them, the most challenging one is the database. Multigres will be designed to scale the database horizontally by sharding it across multiple Postgres instances. Multigres will also provide an additional scalability option by managing read replicas.
High Availability
Multigres strives for enterprise grade availability. To achieve this, it will use a combination of techniques:
- A consensus protocol for leader election and failover management.
- Fully automated cluster management.
- No disruption of service during upgrades or maintenance.
Data Durability
Multigres will ensure that data is durable using a consensus protocol. It will guarantee that a write acknowledged as successful to a client is never lost.
Additionally, it will provide a backup and restore mechanism to ensure that data can be recovered in case of catastrophic failures.
All other metadata will be stored in a distributed key-value store like etcd, which can also be backed up regularly, or can be manually reconstructed.
Resilience
Multigres will provide resilience against spikes and overloads by employing queuing and load shedding mechanisms.
To protect from cascading failures, it will implement adaptive timeouts and exponential backoffs on retries.
Observability
In spite of all precautions, incidents can happen. Multigres will provide a comprehensive set of metrics and logs to help diagnose issues.
Features
A primary goal of Multigres is to provide full Postgres compatibility while enhancing scalability, availability, and performance. Its key features will include:
- Proxy layer and Connection pooling
- Performance and High Availability
- Cluster management across multiple zones
- Indefinite scaling through sharding
Multigres can be deployed to suit different needs. We will introduce you to the key components as we illustrate various deployment scenarios.
Single database deployment
In a single database deployment, Multigres will act as a proxy layer in front of a single PostgreSQL instance. This setup will be ideal for small applications or development environments where simplicity is key.
The main components involved will be:
- MultiGateway: The MultiGateway will speak the Postgres protocol and route queries to the MultiPooler through a single multiplexed gRPC connection (see the connection sketch after this list).
- MultiPooler: The MultiPooler will be connected to a single Postgres server, and will manage a pool of connections to the database. The MultiPooler and its Postgres server will both run on the same host, which will typically be a Kubernetes pod.
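Because MultiGateway speaks the Postgres wire protocol, an application connects to it with any standard Postgres driver, exactly as it would to a plain Postgres server. The sketch below illustrates this in Go with `database/sql` and the `lib/pq` driver; the host name, port, user, and database name are hypothetical placeholders.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	// Any standard Postgres driver works, since MultiGateway speaks the wire protocol.
	_ "github.com/lib/pq"
)

func main() {
	// Hypothetical MultiGateway endpoint and database name; the application
	// treats it exactly like a regular Postgres server.
	db, err := sql.Open("postgres",
		"host=multigateway.example port=5432 dbname=app1 user=app sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var one int
	if err := db.QueryRow("SELECT 1").Scan(&one); err != nil {
		log.Fatal(err)
	}
	fmt.Println("query answered via MultiGateway:", one)
}
```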
In this scenario, Multigres will not address the durability of the underlying data. Therefore, it is recommended to use a resilient form of cloud storage to ensure data safety.
For a Multigres cluster to operate, two other components will be required:
- Provisioner: This will typically be a Kubernetes operator that handles provisioning of resources for the cluster. For example, a `CREATE DATABASE` command will be redirected to the Provisioner, which will allocate the necessary resources and launch the MultiPooler along with its associated Postgres instance.
- Topo Server: This will typically be an etcd cluster. The Provisioner will store the existence of the newly created database in the Topo Server. The MultiPooler will also register itself in the Topo Server to allow MultiGateway to discover it (see the sketch after this list).
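To make the discovery flow concrete, here is a minimal sketch against etcd using the official Go client. The key layout and record contents are purely illustrative assumptions; the actual topo schema used by Multigres is not specified in this document.

```go
// Illustrative only: the key layout and record format are invented for this
// sketch. It just shows the registration/discovery flow against an etcd-based
// Topo Server.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	ctx := context.Background()

	// Provisioner: record that the new database exists (hypothetical key/value).
	if _, err := cli.Put(ctx, "/multigres/databases/app1", `{"cells":["default"]}`); err != nil {
		log.Fatal(err)
	}

	// MultiPooler: register itself so gateways can discover it (hypothetical key/value).
	if _, err := cli.Put(ctx, "/multigres/cells/default/poolers/app1-0",
		`{"addr":"10.0.0.12:15200"}`); err != nil {
		log.Fatal(err)
	}

	// MultiGateway: discover the poolers registered in this cell.
	resp, err := cli.Get(ctx, "/multigres/cells/default/poolers/", clientv3.WithPrefix())
	if err != nil {
		log.Fatal(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s -> %s\n", kv.Key, kv.Value)
	}
}
```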
Multiple database deployment
Unlike a traditional Postgres server, every Multigres database will be created in a brand new Postgres instance coupled with its own MultiPooler.
The MultiGateways will be scalable independently based on resource needs. The application will connect to any MultiGateway, which will route the queries to the appropriate MultiPooler based on the database name.
This deployment style will allow for a large number of databases to be deployed under a single Multigres cluster.
The figure does not show the Topo Server and Provisioner components, but they will still be required for the cluster to operate.
Performance and High Availability
Multigres can be configured to add replicas as standbys. In this setup, we introduce the MultiOrch component, which will manage the health of replication across replicas: it will monitor replication streams, repair broken ones, and coordinate failovers, ensuring that replicas remain in sync with the primary database.
MultiOrch will implement a distributed consensus algorithm that will provide the following benefits:
- High Availability: by promoting one of the replicas to be the new primary in case of a failure.
- Data Durability: by ensuring that all writes are acknowledged by a quorum of replicas before being considered successful.
- Performance: because the data can be stored on a local NVMe for faster access.
MultiOrch will operate on an unmodified Postgres engine by using full sync replication. For a better experience, we recommend using the two-phase sync plug-in (more details on this later).
MultiOrch will be configurable to use a RAFT-style majority quorum. It will also be configurable to support more advanced durability policies that don't depend on the quorum size. This will be achieved by using a new generalized consensus approach (covered later).
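One way to picture durability policies that don't depend on the quorum size is as predicates over the set of acknowledging replicas. The Go sketch below is hypothetical (all type and field names are invented for illustration): a RAFT-style majority and a cross-cell requirement (which reappears in the Cluster management section) are simply two implementations of the same interface.

```go
// Hypothetical sketch of configurable durability policies: a write is durable
// once the set of acknowledging nodes satisfies the policy, which need not be
// a simple majority. Names are illustrative, not the Multigres API.
package durability

// Node describes a replica that acknowledged a write.
type Node struct {
	ID   string
	Cell string
}

// Policy decides whether a set of acknowledgments makes a write durable.
type Policy interface {
	Satisfied(acked []Node) bool
}

// MajorityQuorum is a RAFT-style policy: more than half of all nodes must ack.
type MajorityQuorum struct {
	TotalNodes int
}

func (p MajorityQuorum) Satisfied(acked []Node) bool {
	return len(acked) > p.TotalNodes/2
}

// CrossCell requires at least one acknowledgment from a cell other than the
// primary's, regardless of how many nodes acked in total.
type CrossCell struct {
	PrimaryCell string
}

func (p CrossCell) Satisfied(acked []Node) bool {
	for _, n := range acked {
		if n.Cell != p.PrimaryCell {
			return true
		}
	}
	return false
}
```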
MultiGateway will make use of the replicas to scale reads in situations where the application can tolerate eventual consistency. It will also be configurable to support consistent reads from replicas, at the cost of waiting for writes to be transmitted to and applied by the replicas.
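The following is a minimal sketch, not the Multigres mechanism, of how a consistent read from a replica can be built on stock Postgres: record the primary's WAL position after the write, then poll the replica until it has replayed past it. The connection strings, table, and polling interval are illustrative assumptions.

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq"
)

func main() {
	// Hypothetical endpoints for the primary and a replica.
	primary, err := sql.Open("postgres", "host=primary.example dbname=app1 sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	replica, err := sql.Open("postgres", "host=replica.example dbname=app1 sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}

	// Perform the write on the primary, then note its current WAL position.
	if _, err := primary.Exec("UPDATE accounts SET balance = balance + 1 WHERE id = 42"); err != nil {
		log.Fatal(err)
	}
	var lsn string
	if err := primary.QueryRow("SELECT pg_current_wal_lsn()").Scan(&lsn); err != nil {
		log.Fatal(err)
	}

	// Wait until the replica has replayed at least up to that position;
	// a read issued afterwards is guaranteed to observe the write.
	for {
		var caughtUp bool
		err := replica.QueryRow(
			"SELECT pg_last_wal_replay_lsn() >= $1::pg_lsn", lsn).Scan(&caughtUp)
		if err != nil {
			log.Fatal(err)
		}
		if caughtUp {
			break
		}
		time.Sleep(10 * time.Millisecond)
	}

	var balance int64
	if err := replica.QueryRow("SELECT balance FROM accounts WHERE id = 42").Scan(&balance); err != nil {
		log.Fatal(err)
	}
	log.Println("consistent read from replica:", balance)
}
```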
Cluster management
A Multigres cluster will be deployable across multiple zones or geographical regions. In the previous examples, the components were all deployed in a single zone. Under the covers, they were placed in the `default` cell, which will be implicitly created for every new database. In the case of a multi-zone deployment, you will have to explicitly create cells and deploy the components in those cells. You will not have to preserve the original `default` cell.
In Multigres parlance, a cell will be a user-defined grouping of components. It will represent a zone or a region.
Topo Servers
In a multi-cell deployment, the Topo Server will be splittable into multiple instances. It is recommended that the Global Topo Server be deployed with nodes in multiple cells. For every cell, a cell-specific topo server will be deployable.
The Global Topo Server will contain the list of databases and the cells in which their replicas are deployed. This information will be accessed only sparingly.
The cell-specific topo servers will contain the list of components deployed in that cell, such as MultiGateways and MultiPoolers. The purpose of this design is to ensure that a cell that is partitioned from the rest of the system can continue to operate independently for as long as the data is not stale.
Single Primary
Irrespective of the number of cells, there will exist only one primary database at any given time. The MultiGateways will route all requests meant for the primary to the current primary even if it is not in the same cell.
However, read traffic directed at replicas will be served from the local cell.
MultiOrch
It is recommended that one MultiOrch be deployed per cell to ensure that failovers can be successfully performed even if the network is partitioned.
The consensus protocol will ensure safety even if the MultiOrchs are not able to communicate with each other.
The durability policies will be settable to survive network partitions. For example, you may request a cross-cell durability policy that will require an acknowledgment from a replica in a different cell before considering a write successful.
You will also be able to request MultiOrch to prefer appointing a primary within the same cell as the previous primary to avoid unnecessary churn.
Backup and Restore
Multigres will perform regular backups of the databases. These backups will be restored when new replicas are brought online.
Sharding
In the previous examples, where the databases were unsharded, all the tables would have been stored in a single Postgres instance. In this situation, there will be a one-to-one mapping between a Multigres database and its Postgres instance.
In reality, a Multigres database will be distributable across multiple Postgres instances. These will be known as TableGroups. Additionally, each TableGroup will be shardable independently, which will result in more Postgres instances within a TableGroup. When a Multigres database is created, a `default` TableGroup will be created, which will be an unsharded Postgres instance. This will be where all the initial tables are created. When you decide to shard a set of tables, you will be able to create a new sharded TableGroup and migrate those tables to it. From the application's perspective, the tables will appear as if they are part of a single database.
You will also be able to create separate unsharded TableGroups.
In the above example:
- `t1` is stored in the original `default` unsharded TableGroup. The single shard in this TableGroup is named `default:0-inf`.
- `t2` is split into the two shards of `tg1`. The shards are named `tg1:0-8` and `tg1:8-inf`. Note that the digits are hexadecimal (see the sketch after this list).
- `t3` is stored in TableGroup `tg2`. The single shard in this TableGroup is named `tg2:0-inf`.
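To make the shard naming concrete, here is a hypothetical Go sketch of how a row of `t2` could be routed to `tg1:0-8` or `tg1:8-inf` by hashing its sharding key and comparing the first hex digit of the hash against the keyrange boundary. The hash function and sharding key are assumptions made for illustration; the actual Multigres sharding model is covered in a separate document.

```go
// Illustrative routing by hex keyrange, as named above. Not the actual
// Multigres hashing scheme.
package main

import (
	"crypto/sha256"
	"fmt"
)

// shardFor returns the shard name for a row of t2, assuming (hypothetically)
// that the sharding key is hashed and the first hex digit of the hash selects
// the keyrange: [0, 8) -> tg1:0-8, [8, inf) -> tg1:8-inf.
func shardFor(key string) string {
	h := sha256.Sum256([]byte(key))
	if h[0]>>4 < 0x8 { // first hex digit of the hash
		return "tg1:0-8"
	}
	return "tg1:8-inf"
}

func main() {
	for _, k := range []string{"user-1", "user-2", "user-3"} {
		fmt.Println(k, "->", shardFor(k))
	}
}
```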
The rest of the Multigres features like cluster management, HA, etc. will be designed to work seamlessly with these sharded TableGroups.
We will cover the Multigres sharding model in more detail in a separate document.
Upcoming Topics
We will cover the following topics in more detail in the future:
- Two-phase sync replication
- Generalized consensus for durability policies
- Multigres sharding model