OceanBase

Tags: Database, Distributed
Published: September 20, 2023

Design Overview

Goal

  1. Fast scale-out (and scale-in) on commodity hardware to achieve high performance and low TCO (total cost of ownership)
  2. Cross-region deployment and fault tolerance
  3. Compatibility with classical RDBMSs

Infrastructure

OceanBase adopts a shared-nothing architecture; its overall architecture is shown in Figure 1. Multiple servers in a distributed OceanBase cluster concurrently provide database services with high availability. In Figure 1, the application layer sends a request to the proxy layer; after routing by the proxy service, the request is forwarded to the database node (OBServer) that actually serves the data, and the execution result returns to the application layer along the reverse path.
[Figure 1: overall architecture of OceanBase]
Each OceanBase cluster consists of several zones. These zones can be restricted to one region or spread over multiple regions. Within each zone, OceanBase is deployed in a shared-nothing manner.
Database tables, especially large ones, are partitioned explicitly by the user, and these partitions are the basic units of data distribution and load balancing. For every partition, there is a replica in each zone, and these replicas form a Paxos group.
 
Among all Paxos groups of an OceanBase cluster, one Paxos group is in charge of cluster management, e.g., load balancing, adding (removing) nodes, node failure detection, and fault tolerance for failed nodes.
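The zone/partition/replica layout described above can be summarized in a small data model. The following is only a minimal sketch, not OceanBase code; all class and field names (Replica, Cluster, add_partition) are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Replica:
    partition_id: int
    zone: str
    role: str  # "leader" or "follower" within the partition's Paxos group

@dataclass
class Cluster:
    zones: list                                        # e.g. ["zone1", "zone2", "zone3"]
    paxos_groups: dict = field(default_factory=dict)   # partition_id -> [Replica]

    def add_partition(self, partition_id: int, leader_zone: str):
        # One replica of the partition is placed in every zone; together
        # these replicas form the partition's Paxos group.
        group = [
            Replica(partition_id, z, "leader" if z == leader_zone else "follower")
            for z in self.zones
        ]
        self.paxos_groups[partition_id] = group
        return group

# Example: a 3-zone cluster where partition 42's Paxos leader lives in zone1.
cluster = Cluster(zones=["zone1", "zone2", "zone3"])
print(cluster.add_partition(42, leader_zone="zone1"))
```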

SQL Engine

The SQL engine of OceanBase is the data-computing hub of the entire database. When it receives a SQL request, the request goes through a series of steps such as syntax analysis, semantic analysis, query rewriting, and query optimization, after which the executor is responsible for execution. If the SQL statement involves a large amount of data, the query execution engine of OceanBase applies techniques such as distributed query execution, data reshuffling, vertical (horizontal) parallel execution, dynamic join filters, dynamic partition pruning, and global queuing.
Since many systems implement plan-cache-like components, OceanBase uses a fast parser, a super-lightweight framework that performs only lexical analysis, and then attempts to match an existing plan in the plan cache. The reasoning is that no grammar/syntax check is needed as long as the statement matches one already in the plan cache. This approach is about ten times faster than a normal parser.
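A rough sketch of this idea, under the assumption that a lexical-only pass replaces literals with placeholders so that statements differing only in constants share one cached plan. The names (lexical_normalize, get_plan, full_compile) are invented for illustration and are not OceanBase APIs.

```python
import re

def lexical_normalize(sql: str):
    """Lexical-only pass: extract constants and replace them with '?' placeholders.
    No grammar/syntax check is performed."""
    params = []
    def grab(match):
        params.append(match.group(0))
        return "?"
    # Very simplified tokenizer: string and numeric literals become parameters.
    normalized = re.sub(r"'[^']*'|\b\d+(\.\d+)?\b", grab, sql.strip())
    return " ".join(normalized.split()), params

plan_cache = {}

def get_plan(sql: str, full_compile):
    """Try the plan cache first; fall back to the full parser/optimizer."""
    key, params = lexical_normalize(sql)
    plan = plan_cache.get(key)
    if plan is None:
        plan = full_compile(sql)          # parse + rewrite + optimize
        plan_cache[key] = plan
    return plan, params

# Example: the second statement reuses the cached plan of the first.
compile_calls = []
def full_compile(sql):
    compile_calls.append(sql)             # stands in for parse/rewrite/optimize
    return ("physical-plan-for", lexical_normalize(sql)[0])

plan1, p1 = get_plan("SELECT * FROM t WHERE id = 1", full_compile)
plan2, p2 = get_plan("SELECT * FROM t WHERE id = 2", full_compile)
print(len(compile_calls), plan1 is plan2, p1, p2)   # 1 True ['1'] ['2']
```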
The logic of the Executor differs greatly for different types of execution plans. (1) For a local execution plan, the Executor merely calls the operator at the top of the execution plan; the entire execution process is driven by the operators' own logic, and the results are returned. (2) For a distributed plan, the Executor divides the execution tree into multiple schedulable threads according to the preselected divisions and sends them to the relevant nodes for execution through RPC.
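The local/distributed split might look roughly like the following. The plan representation and the rpc_send callback are hypothetical stand-ins for illustration, not OceanBase's executor interface.

```python
def execute(plan, rpc_send):
    """Dispatch a physical plan.

    plan: {"type": "local", "root": operator}             -> drive the top operator directly
          {"type": "distributed", "tasks": [(node, subtree), ...]}
                                                           -> ship each subtree to its node via RPC
    """
    if plan["type"] == "local":
        # Local plan: just pull rows from the root operator; the operator tree
        # drives the whole execution by itself.
        return list(plan["root"]())
    # Distributed plan: the execution tree was already divided into schedulable
    # units; send each unit to the node that owns the data it touches.
    results = []
    for node, subtree in plan["tasks"]:
        results.extend(rpc_send(node, subtree))
    return results

# Example with toy operators: a local scan vs. two remote scans merged here.
local_plan = {"type": "local", "root": lambda: iter([1, 2, 3])}
dist_plan = {"type": "distributed",
             "tasks": [("observer-1", "scan t partition p0"),
                       ("observer-2", "scan t partition p1")]}
fake_rpc = lambda node, subtree: [f"{node}: {subtree}"]
print(execute(local_plan, fake_rpc))
print(execute(dist_plan, fake_rpc))
```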

Multi-tenancy

For OceanBase, multi-tenancy is an important feature; it forms the basis of database object management and resource management. OceanBase includes two categories of tenants: system tenants and ordinary tenants.
  • System tenants: the system tenant manages the cluster itself, hosting OceanBase's internal (system) tables and cluster-level services.
  • Ordinary tenants: ordinary (user) tenants are created for applications; each behaves as an independent database instance with its own resources and data, isolated from other tenants.

Resource Isolation

Features

  1. High performance: the storage engine adopts a read-write separation architecture, the computing engine is thoroughly optimized, and the system delivers quasi-in-memory database performance.
  2. Low cost: OceanBase runs on commodity PC servers; a high storage compression ratio reduces storage cost, the engine efficiently reduces computing cost, and multi-tenant deployment makes full use of system resources.
  3. High availability: each partition keeps multiple replicas that are synchronized through Paxos, so the failure of a minority of replicas does not interrupt service.

Storage Engine

OceanBase has a Log-Structured Merge-tree (LSM-tree) storage system, similar to Bigtable.
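As a reminder of how an LSM-tree serves reads and writes, here is a minimal, generic sketch (not OceanBase's storage code): writes go to an in-memory MemTable, reads consult the MemTable before falling back to the on-disk baseline (SSTable), and a major compaction merges the two. The TinyLSM class is purely illustrative.

```python
class TinyLSM:
    """Generic LSM-style table: in-memory MemTable over an immutable baseline."""

    def __init__(self, baseline):
        self.baseline = dict(baseline)   # stands in for the on-disk SSTable
        self.memtable = {}               # recent mutations live only in memory

    def put(self, key, value):
        self.memtable[key] = value       # writes never touch the baseline

    def get(self, key):
        # Newer data in the MemTable shadows the baseline.
        if key in self.memtable:
            return self.memtable[key]
        return self.baseline.get(key)

    def major_compact(self):
        # Merge mutations into a new baseline and start a fresh MemTable.
        self.baseline.update(self.memtable)
        self.memtable = {}

t = TinyLSM({"a": 1, "b": 2})
t.put("b", 20)
print(t.get("a"), t.get("b"))   # 1 20
t.major_compact()
print(t.memtable, t.baseline)   # {} {'a': 1, 'b': 20}
```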

Asymmetric Read and Write

There are often heavy read and write operations in a database. Similar to a classical RDBMS, the basic read unit of OceanBase, called a microblock, has a relatively small size, e.g., 4KB or 8KB, and it is configurable by the database administrator. In contrast, the write unit of OceanBase, called a macroblock, is 2MB. The macroblock is also the basic unit of allocation and garbage collection in the storage system. Many microblocks are packed into one macroblock, which makes disk utilization more efficient at the cost of larger write amplification.
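The asymmetry can be illustrated with a toy packer that fills 2MB macroblocks with variable-size microblocks. Sizes and function names here are illustrative assumptions, not OceanBase internals.

```python
MACROBLOCK_SIZE = 2 * 1024 * 1024      # 2MB write/allocation unit
MICROBLOCK_SIZE = 8 * 1024             # e.g. 8KB read unit (DBA-configurable)

def pack_microblocks(microblocks, macro_size=MACROBLOCK_SIZE):
    """Greedily pack (id, size) microblocks into macroblocks.

    Returns a list of macroblocks, each a list of microblock ids.  Writing is
    done at macroblock granularity, which is why a small change can be
    amplified into a 2MB write."""
    macroblocks, current, used = [], [], 0
    for mb_id, size in microblocks:
        if used + size > macro_size and current:
            macroblocks.append(current)
            current, used = [], 0
        current.append(mb_id)
        used += size
    if current:
        macroblocks.append(current)
    return macroblocks

# 600 microblocks of 8KB fit into 3 macroblocks (256 per 2MB macroblock).
blocks = [(i, MICROBLOCK_SIZE) for i in range(600)]
print([len(m) for m in pack_microblocks(blocks)])   # [256, 256, 88]
```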

Daily Incremental Major Compaction

  • During a major compaction, if there is any data modification (insert, update, delete) within a macroblock, the macroblock is rewritten; otherwise, the macroblock is reused in the new baseline data without any IO cost (see the sketch after this list).
  • OceanBase staggers normal service and merge time through a round-robin compaction mechanism, thereby shielding normal user requests from the interference of the compaction operation.
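A sketch of the "rewrite only modified macroblocks" idea from the first bullet. The dictionary-based data structures are hypothetical and greatly simplified (deletes are not modeled).

```python
def incremental_major_compact(baseline, mutations):
    """baseline: {macroblock_id: rows}, mutations: {macroblock_id: changed_rows}.

    Macroblocks untouched by mutations are reused (no IO); only the modified
    ones are rewritten into the new baseline."""
    new_baseline, reused, rewritten = {}, [], []
    for block_id, rows in baseline.items():
        if block_id in mutations:
            merged = {**rows, **mutations[block_id]}   # apply inserts/updates
            new_baseline[block_id] = merged
            rewritten.append(block_id)
        else:
            new_baseline[block_id] = rows              # reuse as-is, no IO cost
            reused.append(block_id)
    return new_baseline, reused, rewritten

baseline = {0: {"a": 1}, 1: {"b": 2}, 2: {"c": 3}}
mutations = {1: {"b": 20}}
new_baseline, reused, rewritten = incremental_major_compact(baseline, mutations)
print(reused, rewritten, new_baseline[1])   # [0, 2] [1] {'b': 20}
```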

Replica Type

  • Full replica: a complete replica of a partition or table consists of the baseline data, the mutation increment, and the redo log; such a complete replica is called a full replica.
  • Data replica: a data replica consists of the baseline and the redo log. It copies minor compactions (minor-compacted mutations) from a full replica on demand. A data replica can be upgraded to a full replica once it finishes replaying the redo log received after the last minor compaction from a full replica. Data replicas reduce both CPU and memory cost by eliminating redo-log replay and the MemTable.
  • Log replica: a log replica consists of the redo log only. A log replica is a member of the corresponding Paxos group, though it has neither a MemTable nor SSTables. By deploying two full replicas and one log replica instead of three full replicas, a system retains quite high availability while significantly reducing storage and memory cost (see the sketch below).
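The composition of the three replica types, and why two full replicas plus one log replica still form a fault-tolerant three-member Paxos group, can be summarized as follows. Names are hypothetical, not OceanBase code.

```python
# Which components each replica type keeps, as described above.
REPLICA_COMPONENTS = {
    "full": {"baseline", "mutation_increment", "redo_log"},  # SSTable + MemTable + log
    "data": {"baseline", "redo_log"},                        # no MemTable / log replay
    "log":  {"redo_log"},                                    # Paxos member only
}

def tolerates_one_failure(group):
    """A Paxos group of n members stays available as long as a majority survives."""
    majority = len(group) // 2 + 1
    return len(group) - 1 >= majority

# Two full replicas + one log replica: still a 3-member Paxos group, so losing
# any single replica leaves a majority of 2 and the partition stays available,
# while storage/memory cost is much lower than with three full replicas.
group = ["full", "full", "log"]
print(tolerates_one_failure(group))     # True
print(REPLICA_COMPONENTS["log"])        # {'redo_log'}
```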

Transaction Processing Engine

Partition and Paxos Group

A table partition is the basic unit of data distribution, load balancing, and Paxos synchronization. Typically, there is one Paxos group per partition.

Timestamp Service

To provide a highly available timestamp service, a dedicated timestamp Paxos group is used. The Paxos leader of the timestamp group is usually placed in the same region as the Paxos leaders of the table partitions. Each OceanBase node retrieves the timestamp from the timestamp Paxos leader periodically.
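Roughly, the access pattern is that each node refreshes a locally cached timestamp from the timestamp leader at intervals rather than issuing one RPC per request. This is a simplified sketch with invented class names and an illustrative refresh interval; the real service batches and caches far more carefully.

```python
import itertools, time

class TimestampLeader:
    """Stand-in for the leader of the timestamp Paxos group."""
    def __init__(self):
        self._counter = itertools.count(1)
    def get_timestamp(self):
        return next(self._counter)          # strictly increasing timestamps

class NodeTimestampCache:
    """Each database node refreshes its timestamp from the leader periodically
    instead of contacting the leader for every statement."""
    def __init__(self, leader, refresh_interval=0.005):
        self.leader = leader
        self.refresh_interval = refresh_interval
        self.cached_ts = leader.get_timestamp()
        self.last_refresh = time.monotonic()
    def current_timestamp(self):
        now = time.monotonic()
        if now - self.last_refresh >= self.refresh_interval:
            self.cached_ts = self.leader.get_timestamp()
            self.last_refresh = now
        return self.cached_ts

leader = TimestampLeader()
node = NodeTimestampCache(leader)
print(node.current_timestamp())
time.sleep(0.01)
print(node.current_timestamp())   # refreshed from the leader after the interval
```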

Transaction Processing

OceanBase combines the Paxos distributed consensus protocol with two-phase commit (2PC), enabling distributed transactions with automatic fault tolerance.
  • Fault tolerance: each participant in the two-phase commit consists of multiple replicas, and the replicas are kept highly available through the Paxos protocol. When a participant node fails, the Paxos protocol quickly elects another replica to replace the original participant, continue providing service, and restore the state of the original participant.
  • High performance: in OceanBase, the first participant of each distributed transaction acts as the coordinator of the two-phase commit. Furthermore, the coordinator does not persist its own two-phase-commit state; it reconstructs that state dynamically from the local states of all participants during disaster recovery. This results in two, instead of three, Paxos synchronizations per two-phase commit, and the latency of a two-phase commit is further reduced to one Paxos synchronization, as shown in Figure 6(b) (see the sketch after this list).
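A highly simplified sketch of the commit flow described above: every participant is backed by its own Paxos group (so its prepare/commit records survive failures), the first participant doubles as coordinator, and the client can be answered once all participants have persisted their prepare records, leaving one Paxos synchronization on the critical path. Class and function names are illustrative, not OceanBase's protocol code.

```python
class PaxosGroupParticipant:
    """A 2PC participant whose state is made durable by its own Paxos group."""
    def __init__(self, name):
        self.name = name
        self.log = []                     # stands in for the Paxos-replicated log

    def paxos_persist(self, record):
        # In the real system this is one Paxos synchronization across replicas.
        self.log.append(record)
        return True

    def prepare(self, txn_id):
        return self.paxos_persist(("prepare", txn_id))

    def commit(self, txn_id):
        return self.paxos_persist(("commit", txn_id))

def two_phase_commit(txn_id, participants):
    """First participant acts as coordinator (no separate coordinator log).

    The transaction can be reported committed as soon as every participant has
    persisted its prepare record; commit records are persisted afterwards,
    off the client's critical path."""
    coordinator = participants[0]
    if not all(p.prepare(txn_id) for p in participants):
        return "aborted"
    # The client can be answered here; the second phase continues in background.
    for p in participants:
        p.commit(txn_id)
    return f"committed (coordinator: {coordinator.name})"

parts = [PaxosGroupParticipant("p1"), PaxosGroupParticipant("p2")]
print(two_phase_commit("txn-42", parts))
print(parts[0].log)    # [('prepare', 'txn-42'), ('commit', 'txn-42')]
```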

Isolation Level

For compatibility and performance considerations, OceanBase supports both read committed and snapshot isolation, with read committed as the default isolation level.

Replicated Table

In OceanBase, a replicated table is replicated on each OceanBase node. There are two kinds of replicated tables, viz., synchronously replicated and asynchronously replicated tables.
  • For a synchronously replicated table, a mutation is committed only after it has been applied by every replica of the table.
    • This ensures that any transaction on any OceanBase node sees the same content of a synchronously replicated table.
    • However, synchronous mutation of all replicas degrades the write performance of the replicated table.
  • For an asynchronously replicated table, a mutation is committed once its redo log has been persisted on a majority of the table's Paxos group. Thereafter, the mutation spreads to all replicas of the table in a cascading manner.
    • Asynchronous mutation preserves the write performance of the replicated table, at the cost that replicas other than the Paxos leader may hold a slightly old version of the data.
    • If a transaction encounters a version of an asynchronously replicated table that is too old, it retries on a remote replica (see the sketch below).
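A toy model of the two commit rules above, with invented names: a synchronous mutation waits for every replica, while an asynchronous one only waits for a majority of the Paxos group and lets the remaining replicas catch up later.

```python
def commit_replicated_mutation(acks, total_replicas, mode):
    """Decide whether a mutation to a replicated table can commit.

    acks: number of replicas that have durably applied/logged the mutation.
    mode: "sync"  -> every replica must acknowledge before commit;
          "async" -> a majority of the Paxos group is enough, remaining
                     replicas receive the mutation later (cascading)."""
    if mode == "sync":
        return acks == total_replicas
    majority = total_replicas // 2 + 1
    return acks >= majority

# 5 replicas: 3 acks suffice asynchronously, but not synchronously.
print(commit_replicated_mutation(acks=3, total_replicas=5, mode="async"))  # True
print(commit_replicated_mutation(acks=3, total_replicas=5, mode="sync"))   # False
```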

TPC-C Benchmark Test

Lessons in Building OceanBase