B+-Tree Indexes
Storage structure of records
All records of a table in a RDBMS are stored on disk to
guarantee their permanence in case of software and/or
hardware failures. The records in a table are organized in
files that are managed by the DBMS -- not by the OS.
Whenever a query is submitted to the RDBMS, the query is
processed and the records in the relational database
table(s) that satisfy the query are returned by the DBMS.
The way in which the query is processed depends on the
particular storage structure of the records in the file. If
the records are organized in no particular order then the
storage structure is referred to as a heap.
Whereas when the records are ordered according to some of
its attributes' values, then the storage structure of the
file is said to be sorted.
To retrieve records in a heap organized file, all that
the RDBMS can do is a sequential scan of the disk pages
where the records are stored. Typically the pages that make
up the file are contiguous on disk.
If, on the other hand, the records are in a sorted file,
then a binary search on the disk pages of the file can be
used to retrieve records as long as the query's criteria
uses the same attributes as the ones used to maintain sorted
order of the records in the file. For example, if a file
that stores the records for the Students table is sorted on
last name and first name (in that order), then binary search
can be used to process queries that search for students by
last name only or queries that search for students by their
full name. A query that searches for students by their date
of birth must be processed using a sequential scan of the
sorted file; similarly if the query searches for students by
their first name.
A table index is a data structure and its associated
algorithms that provide a mechanism by which records of the
table can be searched for more efficiently than either
sequential scan or binary search. The most common index
structures used by RDBMS are B+-trees -- another
index structure is the Hash table. The rest of this webpage
describes the B+-tree index structure.
Balanced multiway search trees
Retrieval time from external memory (e.g. disk) is
thousands of times greater than high-speed memory. The goal
in external searching is to minimize the number of disk
accesses, since disk access takes longer than computations.
This is different than the goal of minimizing the number of
computations (i.e. comparisons) in searching data that is in
the faster main memory. For this reason, typical access of a
record from external memory reads an entire page or block of
data at once. By fetching a disk page that contains many
records, we take advantage of the locality of reference --
if a record is being retrieved and we have organized the
records based on how they are accessed, then it is likely
that the records that are in the fetched block will be
retrieved next.
A B+-tree is one in a family of multiway
search trees (others are: B-tree, 2-3-4 Tree,
B*-Trees). These trees were first proposed by
Bayer and McCreight in 1972 and in just a few years had
replaced almost all large file access methods other than
hashing. These multiway balanced search trees are now the
standard file organization for applications requiring
insertion, deletion, and key range searches. They have the
following advantages:
- B-trees are always height balanced, with all leaf nodes at the same level
- Update and search operations affect only a few disk pages, so performance is good.
- B-trees keep related records on the same disk page,
which takes advantage of locality of reference.
- B-trees guarantee that every node in the tree will
be full at least to a certain minimum percentage. This
improves space efficiency while reducing the typical
number of disk fetches necessary during a search or
update operation over many thousands of records.
B+-tree Structure
A B+-tree in certain aspects is a
generalization of a binary search tree (BST). The main
difference is that nodes of a B+-tree will point
to many children nodes rather than being limited to only
two. Since our goal is to minimize disk accesses whenever we
are trying to locate records, we want to make the height of
the multiway search tree as small as possible. This goal is
achieved by having the tree branch in large amounts at each
node.
A B+-tree of order m is a tree where
each internal node contains up to m branches
(children nodes) and thus store up to m-1 search
key values -- in a BST, only one key value is needed since
there are just two children nodes that an internal node can
have. m is also known as the branching factor or
the fanout of the tree.
- The B+-tree stores records (or pointers
to actual records) only at the leaf nodes, which are all
found at the same level in the tree, so the tree is
always height balanced.
- All internal nodes, except the root, have between
Ceiling(m/2) and m children
- The root is either a leaf or has at least two children.
- Internal nodes store search key values, and are used
only as placeholders to guide the search. The number of
search key values in each internal node is one less than
the number of its non-empty children, and these keys
partition the keys in the children in the fashion of a
search tree. The keys are stored in non-decreasing order
(i.e. sorted in lexicographical order).
- Depending on the size of a record as compared to the size of a key, a
leaf node in a B+-tree of order m may store
more or less than m records. Typically this is based on
the size of a disk block, the size of a record pointer,
etcetera. The leaf pages must store enough records to
remain at least half full.
- The leaf nodes of a B+-tree are linked
together to form a linked list. This is done so that the
records can be retrieved sequentially without accessing
the B+-tree index. This also supports fast
processing of range-search queries as will be described
later.
The WikiPedia page on B+-trees has a good, small example of a
B+-tree of order 4.
The simplest B+-tree occurs when m = 3:
every internal node then has either 2 or 3 children.
In practice, the size of a node in the B+-tree is
chosen to fill a disk page. A B+-tree node
implementation typically allows between 100 to 200 children,
the actual number depends on the memory requirements of the
keys and memory addresses. It's been reported (by
Ramakrishnan and Gehrke) that the typical fill-factor of the
nodes is 67%, giving an average branching factor or fanout
of 133 (i.e. each internal nodes points to 133 children
nodes). In a B+-tree of height 4 that has an
average fanout of 133 can be used to index a table of over
300million records (to be exact, that's 1334 =
312,900,700). A B+-tree of height 3 can index a
table of 1333 = 2,352,637 records. By accessing a
mere five disk pages (the number of levels traversed is one
more than the height of the tree), the RDBMS can find one of
the 300+ million records. Furthermore, it is common for the
top levels of the B+-tree to be cached in main
memory. The memory requirements are not that high. Using a
typical page size of 8Kbytes, the first two levels of the
B+-tree (up to 134 disk pages) can be cached in
about 1Mbyte. Now the RDBMS needs to only access three disk
pages, corresponding to the three bottom levels of the B+-tree index,
to find one of the 300+ million records! This clearly
illustrates why B+-trees have been so successful
in their application to indexing data in databases.
B+-tree operations
To understand the B+-tree operations more
clearly, assume, without loss of generality, that there is a
table whose primary is a single attribute and that it has a
B+-tree index organized on the PK attribute of
the table.
Searching for records that satisfy a simple condition
To retrieve records, queries are written with conditions
that describe the values that the desired records are to
have. The most basic search on a table to retrieve a single
record given its PK value K.
Search in a B+-tree is an alternating two-step
process, beginning with the root node of the
B+-tree. Say that the search is for the record
with key value K -- there can only be one record because we
assume that the index is built on the PK attribute of the
table.
- Perform a binary search on the search key values in
the current node -- recall that the search key values in
a node are sorted and that the search starts with the
root of the tree. We want to find the key Ki
such that Ki ≤ K < Ki+1.
- If the current node is an internal node, follow the
proper branch associated with the key Ki by
loading the disk page corresponding to the node and
repeat the search process at that node.
- If the current node is a leaf, then:
- If K=Ki, then the record exists
in the table and we can return the record
associated with Ki
- Otherwise, K is not found among the search
key values at the leaf, we report that there is
no record in the table with the value K.
Inserting into a B+-tree
Insertion in a B+-tree is similar to inserting
into other search trees, a new record is always inserted at
one of the leaf nodes. The complexity added is that
insertion could overflow a leaf node that is already full.
When such overflow situations occur a brand new leaf node is
added to the B+-tree at the same level as the
other leaf nodes. The steps to insert into a
B+-tree are:
- Follow the path that is traversed as if a Search is
being performed on the key of the new record to be
inserted.
- The leaf page L that is reached is the node where
the new record is to be indexed.
- If L is not full
then an index entry is created that includes the seach
key value of the new row and a reference to where new
row is in the data file. We are done; this is the easy
case!
- If L is full, then a new leaf node
Lnew is introduced to the B+-tree
as a right sibling of L. The keys in L along with the
an index entry for the new record are distributed evenly
among L and Lnew. Lnew is inserted
in the linked list of leaf nodes just to the right of L.
We must now link Lnew to the tree and since
Lnew is to be a sibling of L, it will then be
pointed to by the partent of L. The smallest key value
of Lnew is copied and
inserted into the parent of L -- which will also be the
parent of Lnew. This entire step is known as
commonly referred to as a split of a leaf
node.
- If the parent P of L is full, then it is
split in turn. However, this split of
an internal node is a bit different.
The search key values of P and the new inserted
key must still be distributed evenly among P and
the new page introduced as a sibling of P. In
this split, however, the middle key is
moved to the node above -- note, that
unlike splitting a leaf node where the middle
key is copied and inserted into the parent, when
you split an internal node the middle key is
removed from the node being split and inserted
into the parent node. This splitting of nodes
may continue upwards on the tree.
- When a key
is added to a full root, then the root splits
into two and the middle key is promoted to
become the new root. This is the only way for a
B+-tree to increase in height -- when
split cascades the entire height of the tree
from the leaf to the root.
Deletion
Deletion from a B+-tree again needs to be sure
to maintain the property that all nodes must be at least
half full. The complexity added is that deletion could
underflow a leaf node that has only the minimum number of
entries allowed. When such underflow situations take place,
adjacent sibling nodes are examined; if one of them has more
than the minimum entries required then some of its entries
are taken from it to prevent a node from underflowing.
Otherwise, if both adjacent sibling nodes are also at their
minimum, then two of these nodes are merged into a single
node. The steps to delete from a B+-tree are:
- Perform the search process on the key of the record
to be deleted. This search will end at a leaf L.
- If the leaf L contains more than the minimum number
of elements (more than m/2 - 1), then the index entry
for the record to be removed can be safely deleted from
the leaf with no further action.
- If the leaf contains the minimum number of entries,
then the deleted entry is replaced with another entry
that can take its place while maintaining the correct
order. To find such entries, we inspect the two sibling
leaf nodes Lleft and Lright
adjacent to L -- at most one of these may not exist.
- If one of these leaf nodes has more than the
minimum number of entries, then enough records
are transferred from this sibling so that both
nodes have the same number of records. This is a
heuristic and is done to delay a future
underflow as long as possible; otherwise, only
one entry need be transferred. The placeholder
key value of the parent node may need to be
revised.
- If both Lleft and
Lright have only the minimum number
of entries, then L gives its records to one of
its siblings and it is removed from the tree.
The new leaf will contain no more than the
maximum number of entries allowed. This merge
process combines two subtrees of the parent, so
the separating entry at the parent needs to be
removed -- this may in turn cause the parent
node to underflow; such an underflow is handled
the same way that an underflow of a leaf node.
- If the last two children of the root merge
together into one node, then this merged node
becomes the new root and the tree loses a level.
Range Search
It is common to request records from a table that fall in
a specified range. For example, we may want to retrieve all
employees making salaries between $50,000 and $60,000.
B+-trees are exceptionally good for range
queries. Once the first record in the range has been found
using the search algorithm described above, the rest of the
records in the range can be found by sequential processing
the remaining records in the leaf node, and then continuing
down (actually right of the current leaf node) the linked
list of leaf nodes as far as necessary. Once a record is
found that has a search key value above the upper bound of
the requested range, then the search completes. Note that to
use a B+-tree index to retrieve the employees
described above, the B+-tree index must exist
that has been organized using the salary attribute of the
employees table.
Determining the size of a B+-tree
Consider a relation of 2,000,000 tuples (that's 2million
tuples) stored in a heap file structure. How large (in
number of disk pages) would a B+-tree index on
this relation be? We will assume that we are building a
B+-tree that is filled up as much as possible. To
determine this, we need to know the following:
- The size of one disk page. While pages are usually
4KBytes or 8KBytes, to make our calculations easier we
will use a page size of 1,000 bytes
- The size of the search key value. Say that we index
the relation by using an attribute that is a 40-byte
string and which is a candidate key of the relation.
- The size of physical page addresses. This is needed
in our calculations since every node in our
B+-tree will store search key values as well
as references to disk pages -- i.e. the address of the
disk page being referenced. Again, for ease of
calculations, we use a 10-byte page address.
Now that we have the information above, we can proceed
with determining the size of our B+-tree index.
We will determine this by working from bottom to the top of
the tree:
- Since a page is 1,000bytes, it can only store a
limited number of index entries. Also, we are building a
B+-tree that is filled up as much as possible
so we will store as many index entries as can fit in a
disk page. An index entry is a pair of values: [search
key value, reference to tuple], and in our case,
size(search key value)=40 and size(reference to
tuple)=10. So, to determine the number of index entries
that can fit in a leaf page we can use this calculation:
- Number of index entries per leaf page = Floor( Page_size / (size(search key value) + size(reference to tuple)))
- Number of index entries per leaf page = Floor( 1000 / (40 + 10)) = 20
The leaf pages of a B+-tree are linked together, these links are
stored as the addresses of the adjacent disk pages. Again, for simplicity,
the disk space used to store these links is not taken into account in the
calculations shown here.
- Now we must determine the number of leaf pages that are necessary. Since the table we are indexing is stored as a heap, the tuples are not ordered in the data file. So, our B+-tree index must contain one index enter per tuple in the table; this is known as a dense index. So, there will be 2million index entries collectively stored in the leaf pages. The number of leaf pages necessary is given by:
- Number of leaf pages = Ceiling ( (Number of index entries to be stored) / (Number of index entries per leaf page) )
- Number of leaf pages = Ceiling ( 2,000,000 / 20 ) = 100,000
- Next, we must determine the branching factor,
m, of our B+-tree. This can be done
by repeating step 1 above with some slight
modifications. In a B+-tree of order
m, each internal node must be able to store up
to m branches (i.e. pointers to nodes in the
level below) and thus, up to m-1 search key
values (usually called separator values when they appear
in the internal nodes).
- Page_size >= (m-1) * size(search key value) + m * size(pointer)
- In our case, we have:
1,000 >= (m-1) * 40 + m * 10
- Solving for m, which must be a whole number, we get:
m = Floor( ( 1,000 + 40 ) / 50 ) = 20
- So, each internal node will store 20
pointers to nodes in the level below and a
corresponding 19 search key values. Note the
slight but important difference between internal
nodes and leaf nodes.
- Now that we have m, we can determine the
number of pages at each level in the B+-tree.
Since every leaf in our tree needs to be referenced from
the level above, collectively we will need 100,000
pointers (i.e. branches), one per leaf page. Therefore
the number of internal pages in the level above the
leaves is:
- Number of pages in level above leaves = Ceiling( Number of leaf pages / branching factor)
- Number of pages in level above leaves = Ceiling( 100,000 / 20 ) = 5,000
- We need to continue doing this computation until we reach the point where we have just one internal node that will be the root of our tree
- Number of pages in the next level above = Ceiling( 5000 / 20 ) = 250
- Number of pages in the next level above = Ceiling( 250 / 20 ) = 13
- Number of pages in the next level above = Ceiling( 13 / 20 ) = 1
- Notice that we have repeated this process 5 times, and that in fact:
Height of B+-tree = Ceiling(logmNumberOfLeafPages)
And in our case, Height of B+-tree = Ceiling(log20100000) = 4.
- We can finally answer the question as to the size of our B+-tree:
- Total number of pages = 100,000 + 5,000 + 250 + 13 + 1 = 105,264 pages
- We have a 5-level B+-tree
indexing 2million tuples, so by accessing 5 disk
pages of the B+-tree followed by 1
page of the file, we can find any one of these
tuples.
- Note that database systems use caching and it is common for the top two levels
of a B+-tree to be in the cache, thus saving two disk accesses in this
example.
- It is finally worth noting that this example used a small page size of 1,000 bytes
for easier calculations. Common page sizes are 4Kbytes, 8Kbytes, or even 16Kbytes.
Larger page sizes lead to larger branching factors and thus shallower trees. There
is of course a trade-off between the page size and page utilization. Occupying only
half of a 4Kbyte page is not as significant as occupying only half of a 16Kbyte
page. B*-trees can address this issue, see next section.
Finally, if we were asked to find the size of a
B+-tree whose pages are filled to their minimum
capacity we can follow the process above with one additional
consideration. Pages of a B+-tree must have a
minimum of Ceiling(m/2) children nodes (i.e. half
of the page must be full). So, the computations above would
need to take this into account to find the size of such a
tree.
Interesting bits
B*-trees
The B*-tree adds better disk page utilization
by requiring that each disk page be at least 66% full. This
is achieved by splitting two nodes into three when overflow
occurs during insertion and merging three nodes into two
when underflow occurs during deletion, rather than splitting
one node into two and merging two nodes into one as is done
in a B+-tree.
B+-trees in database systems
Most database systems use Indexes built on some form of
a B+-tree due to its many advantages, in particular
its support for range queries.
A web page descrigin Apache Derby's
B+-tree implementation.
References
- The paper that originally proposed the B-tree family
of trees: Organization and Maintenance of Large
Ordered Indexes by R. Bayer and E. McCreight In:
Acta Informatica, Vol. 1, Fasc. 3, 1972, pp. 173-189.
- B+-tree page on WikiPedia.
- A Practical Introduction to Data Structures and
Algorithm Analysis, 2nd edition, 2001, by Clifford A.
Shaffer. Published by Pearson Prentice Hall,
http://vig.prenhall.com/catalog/academic/product/0,1144,0130284467,00.html.
- Database Management Systems, 3rd edition
by Raghu Ramakrishnan and Johannes Gehrke. Published by
McGraw-Hill, 2002. ISBN:0072465638.
http://catalogs.mhhe.com/mhhe/viewProductDetails.do?isbn=0072465638
- Data Structures and Algorithm Analysis in C++ by
Mark Allen Weiss, Published by Addison-Wesley, 1999. ISBN: 0-201-36122-1.
http://www.aw-bc.com/catalog/academic/product/0,1144,0201361221,00.html
- Introduction to Algorithms, Second Edition by Thomas
H. Cormen, Charles E. Leiserson, Ronald L. Rivest and
Clifford Stein, published by MIT Press.
http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=8570