Database Built on Go

Database Built on Go

Tags
Database
Golang
Published
July 3, 2023

04. B-Tree

Our B-tree will be persisted to the disk eventually, so we need to design the wire format for the B-tree nodes first. Without the format, we won’t know the size of a node and when to split a node.
A node consists of:
  1. A fixed-sized header containing the type of the node (leaf node or internal node) and the number of keys.
  1. A list of pointers to the child nodes. (Used by internal nodes).
  1. A list of offsets pointing to each key-value pair.
  1. Packed KV pairs.
| type | nkeys | pointers | offsets | key-values | 2B | 2B | nkeys * 8B | nkeys * 2B | ...
This is the format of the KV pair. Lengths followed by data.
| klen | vlen | key | val | | 2B | 2B | ... | ... |
To keep things simple, both leaf nodes and internal nodes use the same format.

4.2 Data Types

Since we’re going to dump our B-tree to the disk eventually, why not use an array of bytes as our in-memory data structure as well?
type BNodestruct { data []byte// can be dumped to the disk } const ( BNODE_NODE = 1// internal nodes without values BNODE_LEAF = 2// leaf nodes with values )
And we can’t use the in-memory pointers, the pointers are 64-bit integers referencing disk pages instead of in-memory nodes. We’ll add some callbacks to abstract away this aspect so that our data structure code remains pure data structure code.
type BTreestruct { // pointer (a nonzero page number) root uint64// callbacks for managing on-disk pages getfunc(uint64) BNode// dereference a pointer newfunc(BNode) uint64// allocate a new page delfunc(uint64)// deallocate a page }
The page size is defined to be 4K bytes. A larger page size such as 8K or 16K also works.
We also add some constraints on the size of the keys and values. So that a node with a single KV pair always fits on a single page. If you need to support bigger keys or bigger values, you have to allocate extra pages for them and that adds complexity.
const HEADER = 4const BTREE_PAGE_SIZE = 4096 const BTREE_MAX_KEY_SIZE = 1000 const BTREE_MAX_VAL_SIZE = 3000 func init() { node1max := HEADER + 8 + 2 + 4 + BTREE_MAX_KEY_SIZE + BTREE_MAX_VAL_SIZE assert(node1max <= BTREE_PAGE_SIZE) }

4.3 Decoding the B-tree Node

Since a node is just an array of bytes, we’ll add some helper functions to access its content.
// header func (node BNode) btype() uint16 { return binary.LittleEndian.Uint16(node.data) } func (node BNode) nkeys() uint16 { return binary.LittleEndian.Uint16(node.data[2:4]) } func (node BNode) setHeader(btype uint16, nkeys uint16) { binary.LittleEndian.PutUint16(node.data[0:2], btype) binary.LittleEndian.PutUint16(node.data[2:4], nkeys) }
// pointers func (node BNode) getPtr(idx uint16) uint64 { assert(idx < node.nkeys()) pos := HEADER + 8*idx return binary.LittleEndian.Uint64(node.data[pos:]) } func (node BNode) setPtr(idx uint16, val uint64) { assert(idx < node.nkeys()) pos := HEADER + 8*idx binary.LittleEndian.PutUint64(node.data[pos:], val) }
Some details about the offset list:
  • The offset is relative to the position of the first KV pair.
  • The offset of the first KV pair is always zero, so it is not stored in the list.
  • We store the offset to the end of the last KV pair in the offset list, which is used to determine the size of the node.
// offset list func offsetPos(node BNode, idx uint16) uint16 { assert(1 <= idx && idx <= node.nkeys()) return HEADER + 8*node.nkeys() + 2*(idx-1) } func (node BNode) getOffset(idx uint16) uint16 { if idx == 0 { return 0 } return binary.LittleEndian.Uint16(node.data[offsetPos(node, idx):]) } func (node BNode) setOffset(idx uint16, offset uint16) { binary.LittleEndian.PutUint16(node.data[offsetPos(node, idx):], offset) }
The offset list is used to locate the nth KV pair quickly.
// key-values func (node BNode) kvPos(idx uint16) uint16 { assert(idx <= node.nkeys()) return HEADER + 8*node.nkeys() + 2*node.nkeys() + node.getOffset(idx) } func (node BNode) getKey(idx uint16) []byte { assert(idx < node.nkeys()) pos := node.kvPos(idx) klen := binary.LittleEndian.Uint16(node.data[pos:]) return node.data[pos+4:][:klen] } func (node BNode) getVal(idx uint16) []byte { assert(idx < node.nkeys()) pos := node.kvPos(idx) klen := binary.LittleEndian.Uint16(node.data[pos+0:]) vlen := binary.LittleEndian.Uint16(node.data[pos+2:]) return node.data[pos+4+klen:][:vlen] }
And to determine the size of the node.
// node size in bytes func (node BNode) nbytes() uint16 { return node.kvPos(node.nkeys()) }

4.4 The B-Tree Insertion

The code is broken down into small steps.

Step 1: Look Up the Key

To insert a key into a leaf node, we need to look up its position in the sorted KV list.
// returns the first kid node whose range intersects the key. (kid[i] <= key) //TODO: bisect func nodeLookupLE(node BNode, key []byte) uint16 { nkeys := node.nkeys() found := uint16(0) // the first key is a copy from the parent node, // thus it's always less than or equal to the key. for i := uint16(1); i < nkeys; i++ { cmp := bytes.Compare(node.getKey(i), key)if cmp <= 0 { found = i }if cmp >= 0 { break } } return found }
The lookup works for both leaf nodes and internal nodes. Note that the first key is skipped for comparison, since it has already been compared from the parent node.

Step 2: Update Leaf Nodes

After looking up the position to insert, we need to create a copy of the node with the new key in it.
// add a new key to a leaf node func leafInsert( new BNode, old BNode, idx uint16, key []byte, val []byte,) { new.setHeader(BNODE_LEAF, old.nkeys()+1) nodeAppendRange(new, old, 0, 0, idx) nodeAppendKV(new, idx, 0, key, val) nodeAppendRange(new, old, idx+1, idx, old.nkeys()-idx) }
The nodeAppendRange function copies keys from an old node to a new node.
// copy multiple KVs into the position func nodeAppendRange( new BNode, old BNode, dstNew uint16, srcOld uint16, n uint16,) { assert(srcOld+n <= old.nkeys()) assert(dstNew+n <= new.nkeys()) if n == 0 { return }// pointers for i := uint16(0); i < n; i++ { new.setPtr(dstNew+i, old.getPtr(srcOld+i)) }// offsets dstBegin := new.getOffset(dstNew) srcBegin := old.getOffset(srcOld) for i := uint16(1); i <= n; i++ {//NOTE: the range is [1, n] offset := dstBegin + old.getOffset(srcOld+i) - srcBegin new.setOffset(dstNew+i, offset) }// KVs begin := old.kvPos(srcOld) end := old.kvPos(srcOld + n) copy(new.data[new.kvPos(dstNew):], old.data[begin:end]) }
The nodeAppendKV function copies a KV pair to the new node.
// copy a KV into the position func nodeAppendKV(new BNode, idx uint16, ptr uint64, key []byte, val []byte) {// ptrs new.setPtr(idx, ptr)// KVs pos := new.kvPos(idx) binary.LittleEndian.PutUint16(new.data[pos+0:], uint16(len(key))) binary.LittleEndian.PutUint16(new.data[pos+2:], uint16(len(val))) copy(new.data[pos+4:], key) copy(new.data[pos+4+uint16(len(key)):], val)// the offset of the next key new.setOffset(idx+1, new.getOffset(idx)+4+uint16((len(key)+len(val)))) }

Step 3: Recursive Insertion

The main function for inserting a key.
// insert a KV into a node, the result might be split into 2 nodes. // the caller is responsible for deallocating the input node // and splitting and allocating result nodes. func treeInsert(tree *BTree, node BNode, key []byte, val []byte) BNode { // the result node. // it's allowed to be bigger than 1 page and will be split if so new := BNode{data: make([]byte, 2*BTREE_PAGE_SIZE)}// where to insert the key? idx := nodeLookupLE(node, key)// act depending on the node type switch node.btype() { case BNODE_LEAF: // leaf, node.getKey(idx) <= key if bytes.Equal(key, node.getKey(idx)) {// found the key, update it. leafUpdate(new, node, idx, key, val) }else {// insert it after the position. leafInsert(new, node, idx+1, key, val) }case BNODE_NODE: // internal node, insert it to a kid node. nodeInsert(tree, new, node, idx, key, val)default: panic("bad node!") } return new }
The leafUpdate function is similar to the leafInsert function.

Step 4: Handle Internal Nodes

Now comes the code for handling internal nodes.
// part of the treeInsert(): KV insertion to an internal node func nodeInsert( tree *BTree, new BNode, node BNode, idx uint16, key []byte, val []byte,) {// get and deallocate the kid node kptr := node.getPtr(idx) knode := tree.get(kptr) tree.del(kptr)// recursive insertion to the kid node knode = treeInsert(tree, knode, key, val)// split the result nsplit, splited := nodeSplit3(knode)// update the kid links nodeReplaceKidN(tree, new, node, idx, splited[:nsplit]...) }

Step 5: Split Big Nodes

Inserting keys into a node increases its size, causing it to exceed the page size. In this case, the node is split into multiple smaller nodes.
The maximum allowed key size and value size only guarantee that a single KV pair always fits on one page. In the worst case, the fat node is split into 3 nodes (one large KV pair in the middle).
// split a bigger-than-allowed node into two. // the second node always fits on a page. func nodeSplit2(left BNode, right BNode, old BNode) { // code omitted... } // split a node if it's too big. the results are 1~3 nodes. func nodeSplit3(old BNode) (uint16, [3]BNode) { if old.nbytes() <= BTREE_PAGE_SIZE { old.data = old.data[:BTREE_PAGE_SIZE] return 1, [3]BNode{old} } left := BNode{make([]byte, 2*BTREE_PAGE_SIZE)}// might be split later right := BNode{make([]byte, BTREE_PAGE_SIZE)} nodeSplit2(left, right, old)if left.nbytes() <= BTREE_PAGE_SIZE { left.data = left.data[:BTREE_PAGE_SIZE] return 2, [3]BNode{left, right} }// the left node is still too large leftleft := BNode{make([]byte, BTREE_PAGE_SIZE)} middle := BNode{make([]byte, BTREE_PAGE_SIZE)} nodeSplit2(leftleft, middle, left) assert(leftleft.nbytes() <= BTREE_PAGE_SIZE) return 3, [3]BNode{leftleft, middle, right} }

Step 6: Update Internal Nodes

Inserting a key into a node can result in either 1, 2 or 3 nodes. The parent node must update itself accordingly. The code for updating an internal node is similar to that for updating a leaf node.
// replace a link with multiple links func nodeReplaceKidN( tree *BTree, new BNode, old BNode, idx uint16, kids ...BNode,) { inc := uint16(len(kids)) new.setHeader(BNODE_NODE, old.nkeys()+inc-1) nodeAppendRange(new, old, 0, 0, idx)for i, node :=range kids { nodeAppendKV(new, idx+uint16(i), tree.new(node), node.getKey(0), nil) } nodeAppendRange(new, old, idx+inc, idx+1, old.nkeys()-(idx+1)) }
We have finished the B-tree insertion. Deletion and the rest of the code will be introduced in the next chapter.