B+-Tree Indexes

Storage structure of records

All records of a table in a RDBMS are stored on disk to guarantee their permanence in case of software and/or hardware failures. The records in a table are organized in files that are managed by the DBMS -- not by the OS. Whenever a query is submitted to the RDBMS, the query is processed and the records in the relational database table(s) that satisfy the query are returned by the DBMS. The way in which the query is processed depends on the particular storage structure of the records in the file. If the records are organized in no particular order then the storage structure is referred to as a heap. Whereas when the records are ordered according to some of its attributes' values, then the storage structure of the file is said to be sorted.

To retrieve records in a heap organized file, all that the RDBMS can do is a sequential scan of the disk pages where the records are stored. Typically the pages that make up the file are contiguous on disk.

If, on the other hand, the records are in a sorted file, then a binary search on the disk pages of the file can be used to retrieve records as long as the query's criteria uses the same attributes as the ones used to maintain sorted order of the records in the file. For example, if a file that stores the records for the Students table is sorted on last name and first name (in that order), then binary search can be used to process queries that search for students by last name only or queries that search for students by their full name. A query that searches for students by their date of birth must be processed using a sequential scan of the sorted file; similarly if the query searches for students by their first name.

A table index is a data structure and its associated algorithms that provide a mechanism by which records of the table can be searched for more efficiently than either sequential scan or binary search. The most common index structures used by RDBMS are B+-trees -- another index structure is the Hash table. The rest of this webpage describes the B+-tree index structure.

Balanced multiway search trees

Retrieval time from external memory (e.g. disk) is thousands of times greater than high-speed memory. The goal in external searching is to minimize the number of disk accesses, since disk access takes longer than computations. This is different than the goal of minimizing the number of computations (i.e. comparisons) in searching data that is in the faster main memory. For this reason, typical access of a record from external memory reads an entire page or block of data at once. By fetching a disk page that contains many records, we take advantage of the locality of reference -- if a record is being retrieved and we have organized the records based on how they are accessed, then it is likely that the records that are in the fetched block will be retrieved next.

A B+-tree is one in a family of multiway search trees (others are: B-tree, 2-3-4 Tree, B*-Trees). These trees were first proposed by Bayer and McCreight in 1972 and in just a few years had replaced almost all large file access methods other than hashing. These multiway balanced search trees are now the standard file organization for applications requiring insertion, deletion, and key range searches. They have the following advantages:

B+-tree Structure

A B+-tree in certain aspects is a generalization of a binary search tree (BST). The main difference is that nodes of a B+-tree will point to many children nodes rather than being limited to only two. Since our goal is to minimize disk accesses whenever we are trying to locate records, we want to make the height of the multiway search tree as small as possible. This goal is achieved by having the tree branch in large amounts at each node.

A B+-tree of order m is a tree where each internal node contains up to m branches (children nodes) and thus store up to m-1 search key values -- in a BST, only one key value is needed since there are just two children nodes that an internal node can have. m is also known as the branching factor or the fanout of the tree.

  1. The B+-tree stores records (or pointers to actual records) only at the leaf nodes, which are all found at the same level in the tree, so the tree is always height balanced.
  2. All internal nodes, except the root, have between Ceiling(m/2) and m children
  3. The root is either a leaf or has at least two children.
  4. Internal nodes store search key values, and are used only as placeholders to guide the search. The number of search key values in each internal node is one less than the number of its non-empty children, and these keys partition the keys in the children in the fashion of a search tree. The keys are stored in non-decreasing order (i.e. sorted in lexicographical order).
  5. Depending on the size of a record as compared to the size of a key, a leaf node in a B+-tree of order m may store more or less than m records. Typically this is based on the size of a disk block, the size of a record pointer, etcetera. The leaf pages must store enough records to remain at least half full.
  6. The leaf nodes of a B+-tree are linked together to form a linked list. This is done so that the records can be retrieved sequentially without accessing the B+-tree index. This also supports fast processing of range-search queries as will be described later.

The WikiPedia page on B+-trees has a good, small example of a B+-tree of order 4.

The simplest B+-tree occurs when m = 3: every internal node then has either 2 or 3 children. In practice, the size of a node in the B+-tree is chosen to fill a disk page. A B+-tree node implementation typically allows between 100 to 200 children, the actual number depends on the memory requirements of the keys and memory addresses. It's been reported (by Ramakrishnan and Gehrke) that the typical fill-factor of the nodes is 67%, giving an average branching factor or fanout of 133 (i.e. each internal nodes points to 133 children nodes). In a B+-tree of height 4 that has an average fanout of 133 can be used to index a table of over 300million records (to be exact, that's 1334 = 312,900,700). A B+-tree of height 3 can index a table of 1333 = 2,352,637 records. By accessing a mere five disk pages (the number of levels traversed is one more than the height of the tree), the RDBMS can find one of the 300+ million records. Furthermore, it is common for the top levels of the B+-tree to be cached in main memory. The memory requirements are not that high. Using a typical page size of 8Kbytes, the first two levels of the B+-tree (up to 134 disk pages) can be cached in about 1Mbyte. Now the RDBMS needs to only access three disk pages to find one of the 300+ million records! This clearly illustrates why B+-trees have been so successful in their application to indexing data in databases.

B+-tree operations

To understand the B+-tree operations more clearly, assume, without loss of generality, that there is a table whose primary is a single attribute and that it has a B+-tree index organized on the PK attribute of the table.

Searching for records that satisfy a simple condition

To retrieve records, queries are written with conditions that describe the values that the desired records are to have. The most basic search on a table to retrieve a single record given its PK value K.

Search in a B+-tree is an alternating two-step process, beginning with the root node of the B+-tree. Say that the search is for the record with key value K -- there can only be one record because we assume that the index is built on the PK attribute of the table.

  1. Perform a binary search on the search key values in the current node -- recall that the search key values in a node are sorted and that the search starts with the root of the tree. We want to find the key Ki such that Ki ≤ K < Ki+1.
  2. If the current node is an internal node, follow the proper branch associated with the key Ki by loading the disk page corresponding to the node and repeat the search process at that node.
  3. If the current node is a leaf, then:
    1. If K=Ki, then the record exists in the table and we can return the record associated with Ki
    2. Otherwise, K is not found among the search key values at the leaf, we report that there is no record in the table with the value K.

Inserting into a B+-tree

Insertion in a B+-tree is similar to inserting into other search trees, a new record is always inserted at one of the leaf nodes. The complexity added is that insertion could overflow a leaf node that is already full. When such overflow situations occur a brand new leaf node is added to the B+-tree at the same level as the other leaf nodes. The steps to insert into a B+-tree are:

  1. Follow the path that is traversed as if a Search is being performed on the key of the new record to be inserted.
  2. The leaf page L that is reached is the node where the new record is to be indexed.
  3. If L is not full then an index entry is created that includes the seach key value of the new row and a reference to where new row is in the data file. We are done; this is the easy case!
  4. If L is full, then a new leaf node Lnew is introduced to the B+-tree as a right sibling of L. The keys in L along with the an index entry for the new record are distributed evenly among L and Lnew. Lnew is inserted in the linked list of leaf nodes just to the right of L. We must now link Lnew to the tree and since Lnew is to be a sibling of L, it will then be pointed to by the partent of L. The smallest key value of Lnew is copied and inserted into the parent of L -- which will also be the parent of Lnew. This entire step is known as commonly referred to as a split of a leaf node.
    1. If the parent P of L is full, then it is split in turn. However, this split of an internal node is a bit different. The search key values of P and the new inserted key must still be distributed evenly among P and the new page introduced as a sibling of P. In this split, however, the middle key is moved to the node above -- note, that unlike splitting a leaf node where the middle key is copied and inserted into the parent, when you split an internal node the middle key is removed from the node being split and inserted into the parent node. This splitting of nodes may continue upwards on the tree.
    2. When a key is added to a full root, then the root splits into two and the middle key is promoted to become the new root. This is the only way for a B+-tree to increase in height -- when split cascades the entire height of the tree from the leaf to the root.

Deletion

Deletion from a B+-tree again needs to be sure to maintain the property that all nodes must be at least half full. The complexity added is that deletion could underflow a leaf node that has only the minimum number of entries allowed. When such underflow situations take place, adjacent sibling nodes are examined; if one of them has more than the minimum entries required then some of its entries are taken from it to prevent a node from underflowing. Otherwise, if both adjacent sibling nodes are also at their minimum, then two of these nodes are merged into a single node. The steps to delete from a B+-tree are:

  1. Perform the search process on the key of the record to be deleted. This search will end at a leaf L.
  2. If the leaf L contains more than the minimum number of elements (more than m/2 - 1), then the index entry for the record to be removed can be safely deleted from the leaf with no further action.
  3. If the leaf contains the minimum number of entries, then the deleted entry is replaced with another entry that can take its place while maintaining the correct order. To find such entries, we inspect the two sibling leaf nodes Lleft and Lright adjacent to L -- at most one of these may not exist.
    1. If one of these leaf nodes has more than the minimum number of entries, then enough records are transferred from this sibling so that both nodes have the same number of records. This is a heuristic and is done to delay a future underflow as long as possible; otherwise, only one entry need be transferred. The placeholder key value of the parent node may need to be revised.
    2. If both Lleft and Lright have only the minimum number of entries, then L gives its records to one of its siblings and it is removed from the tree. The new leaf will contain no more than the maximum number of entries allowed. This merge process combines two subtrees of the parent, so the separating entry at the parent needs to be removed -- this may in turn cause the parent node to underflow; such an underflow is handled the same way that an underflow of a leaf node.
    3. If the last two children of the root merge together into one node, then this merged node becomes the new root and the tree loses a level.

Range Search

It is common to request records from a table that fall in a specified range. For example, we may want to retrieve all employees making salaries between $50,000 and $60,000. B+-trees are exceptionally good for range queries. Once the first record in the range has been found using the search algorithm described above, the rest of the records in the range can be found by sequential processing the remaining records in the leaf node, and then continuing down (actually right of the current leaf node) the linked list of leaf nodes as far as necessary. Once a record is found that has a search key value above the upper bound of the requested range, then the search completes. Note that to use a B+-tree index to retrieve the employees described above, the B+-tree index must exist that has been organized using the salary attribute of the employees table.

Determining the size of a B+-tree

Consider a relation of 2,000,000 tuples (that's 2million tuples) stored in a heap file structure. How large (in number of disk pages) would a B+-tree index on this relation be? We will assume that we are building a B+-tree that is filled up as much as possible. To determine this, we need to know the following:

Now that we have the information above, we can proceed with determining the size of our B+-tree index. We will determine this by working from bottom to the top of the tree:

  1. Since a page is 1,000bytes, it can only store a limited number of index entries. Also, we are building a B+-tree that is filled up as much as possible so we will store as many index entries as can fit in a disk page. An index entry is a pair of values: [search key value, reference to tuple], and in our case, size(search key value)=40 and size(reference to tuple)=10. So, to determine the number of index entries that can fit in a leaf page we can use this calculation:
    1. Number of index entries per leaf page = Floor( Page_size / (size(search key value) + size(reference to tuple)))
    2. Number of index entries per leaf page = Floor( 1000 / (40 + 10)) = 200
  2. Now we must determine the number of leaf pages that are necessary. Since the table we are indexing is stored as a heap, the tuples are not ordered in the data file. So, our B+-tree index must contain one index enter per tuple in the table; this is known as a dense index. So, there will be 2million index entries collectively stored in the leaf pages. The number of leaf pages necessary is given by:
    1. Number of leaf pages = Ceiling ( (Number of index entries to be stored) / (Number of index entries per page) )
    2. Number of leaf pages = Ceiling ( 2,000,000 / 200 ) = 10,000
  3. Next, we must determine the branching factor, m, of our B+-tree. This can be done by repeating step 1 above with some slight modifications. In a B+-tree of order m, each internal node must be able to store up to m branches (i.e. pointers to nodes in the level below) and m-1 search key values (usually called separator values when they appear in the internal nodes).
    1. Page_size =(m-1) * size(search key value) + m * size(pointer)
    2. In our case, we have:
      1,000 =(m-1) * 40 + m * 10
    3. Solving for m, we get:
      m = Floor( 1,000 + 40 ) / 50 = 200
    4. So, each internal node will store 200 pointers to nodes in the level below and a corresponding 199 search key values.
  4. Now that we have m, we can determine the number of pages at each level in the B+-tree. Since every leaf in our tree needs to be referenced from the level above, collectively we will need 10,000 pointers (i.e. branches), one per leaf page. Therefore the number of internal pages in the level above the leaves is:
    1. Number of pages in level above leaves = Ceiling( Number of leaf pages / branching factor)
    2. Number of pages in level above leaves = Ceiling( 10,000 / 200 ) = 50
    3. We need to continue doing this computation until we reach the point where we have just one internal node which will be the root of our tree
    4. Number of pages in the next level above = Ceiling( 50 / 200 ) = 1
    5. Notice that we have repeated this process two times, and that in fact:
      Height of B+-tree = Ceiling(logmNumberOfLeafPages)
      And in our case, Height of B+-tree = Ceiling(log20010000) = 2.
  5. We can finally answer the question as to the size of our B+-tree:
    1. Total number of pages = 10,000 + 50 + 1 = 10,051 pages
    2. We have a 3-level B+-tree indexing 2million tuples, so by accessing 3 disk pages of the B+-tree followed by 1 page of the file, we can find any one of these tuples.

Finally, if we were asked to find the size of a B+-tree whose pages are filled to their minimum capacity we can follow the process above with one additional consideration. Pages of a B+-tree must have a minimum of Ceiling(m/2) children nodes (i.e. half of the page must be full). So, the computations above would need to take this into account to find the size of such a tree.

Interesting bits

B*-trees

The B*-tree adds better disk page utilization by requiring that each disk page be at least 66% full. This is achieved by splitting two nodes into three when overflow occurs during insertion and merging three nodes into two when underflow occurs during deletion, rather than splitting one node into two and merging two nodes into one as is done in a B+-tree.

References

  1. The paper that originally proposed the B-tree family of trees: Organization and Maintenance of Large Ordered Indexes by R. Bayer and E. McCreight In: Acta Informatica, Vol. 1, Fasc. 3, 1972, pp. 173-189.
  2. B+-tree page on WikiPedia.
  3. A Practical Introduction to Data Structures and Algorithm Analysis, 2nd edition, 2001, by Clifford A. Shaffer. Published by Pearson Prentice Hall, http://vig.prenhall.com/catalog/academic/product/0,1144,0130284467,00.html.
  4. Database Management Systems, 3rd edition by Raghu Ramakrishnan and Johannes Gehrke. Published by McGraw-Hill, 2002. ISBN:0072465638. http://catalogs.mhhe.com/mhhe/viewProductDetails.do?isbn=0072465638
  5. Data Structures and Algorithm Analysis in C++ by Mark Allen Weiss, Published by Addison-Wesley, 1999. ISBN: 0-201-36122-1. http://www.aw-bc.com/catalog/academic/product/0,1144,0201361221,00.html
  6. Introduction to Algorithms, Second Edition by Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Clifford Stein, published by MIT Press. http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=8570

Database Resources