Machineboy空

Ordered Data Structures # WEEK 3 : B-Trees 본문

Computer/자료구조

Ordered Data Structures # WEEK 3 : B-Trees

안녕도라 2024. 2. 19. 12:04

3.2.1 B-Tree Introduction

 

지금까지, 이진트리, AVL Tree, 배열, 리스트 등의 알고리즘을 살펴봤는데 이건 빅오 표기법에 따라 엄청난 런타임 퍼포먼스를 보여준다.

 

but, Big O notation doesn't explain everything. In fact, Big O notation assumes a uniform access time for all of our data. But in reality, uniform access time for all of our data isn't actually the case.

 

Real Application

Imagine storing Facebook profiles for everyone in the US.

  • How many records?
    • For example, if we consider all of Facebook profiles on Facebook, give a conservative estimate that there are 500 milion users on facebook.  And let's think about every single Facebook profile. total amount of data that's on each profile is going to be at least 5 megabytes
  • How much data in total
    • Again another conservative estimate, because likely all the photos and all of the information we're storing on Facebook's probably going to far exceed five megabytes. Thinking of the entire amount of space that we take to store all these data, we can multiply 500 milion by 5 megabytes. Doing that's going to result in us having 500 billion times 5 kilobytes, 500 trillion times 5 bytes.
    • And I'm moving this over that's going to be 2500 trillion bytes ar 2.5 petabytes of data. This is incredible amount of data, 2.5 peta bytes of data it is far more than you can store in any main memory on any system. They're going to have to store some of that data on disk, and by storing on disk we know that operations going to be somewhat slower than in main memory
  • How deep is the AVL tree?

 

BTree Motivations

 

The goal of a BTree is to create data structure that's going to perform extremely well in both main memory, as well as in on disk.  Specifically we want to optimize the idea, that we want to make as few disk seek as possible, so this algorithm can be as optimal as possible. And in fact, I decided the term disk seeks, but what we really can think about is anywhere the data stored somewhere else. This might be on the cloud somewhere, or this might be anywhere else. So let's look at how we might construct such a data structure.

 

Knowing that we have large seek times for data, we want to:

 

BTrees is constructed by having a node contains a number of keys.

 

BTree (of order m)

 

Goal : Minimize the number of reads!

Build a tree that uses 1 network packet,1 disk block / node

 

The order of a B-Tree refers to the size of the nodes, not the fact that the keys are in sorted order. The order of a B-Tree is the maximum number of keys that a given node can have, plus one.

So for example, here's an example of a B-Tree of order 9. having 8 keys in this node. The goal of every single B-Tree is to minimize the number of network packets, disk seeks, or whatever operation we have to get at the data, so we want to minimize the seeks to reach our data.

 

빅오표기는 모든 데이터에 대한 접근 시간이 동일하다는 가정하에 계산하는 것인데, 

저장해야할 데이터가 엄청나게 늘어나면, 실제로는 메인메모리와 보조기억장치 등 여러 군데 저장이된다.

그래서 메모리별 접근 시간에 차이가 날 수 있다.

이러한 차이를 최소화하고 거의 동일한 접근시간을 보장하기 위해 설계할 데이터 구조가 B-Tree.

BTree의 노드는 여러 개의 key값으로 구성되며 다음 위치(?)를 가리키는 포인터를 하나씩 가지고 있음


3.2.2 B-Tree Insert

 

B-tree of order m (차수가 m) is a data structure that we introduced earlier that's going to optimize an algorithm for having a minimum number of disk seeks on network seeks to get at our data.

 

*order m (차수가 m) 인 b-tree : 자식을 m개 가질 수 있음을 의미하나 봄

 

B-Tree Insertion

 

For a B-tree "of order m":

  • All keys within a node are in sorted order.
    • These are two different meanings for the word "order" 순서를 의미하는 order가 아님
  • Each node contains no more than m-1 keys.
  • Each internal node can have at most m children, so a B-tree of order m is like an m-way tree.

 

recursive nature of going down a B-tree to find the insertion point of a value, and then, only if the node is full, splitting that node in half and throwing up the middle value to the parent node, until we find a node that hasn't been filled to capacity. So we have this understanding of exactly how this algorithm works, and now we can look at properties inside of the B-Tree  itself.

 

B-Tree Properites(4)

For a B-tree "of order m":

  1. All keys within a node are in sorted order
  2. Each node contains no more than m-1 keys.
  3. Each internal node has exactly one more child than key (at most m children, so a B-tree of order m is like an m-way tree).
    • A root node can be a leaf or have [2,m] children.
    • Each non-root, internal node has [ceil(m/2),m] children.
  4. All leaves are on the same level.

 

we've seen a broad instruction of how to find a node in a B-tree, as well as how to insert that node to create a larger and larger B-tree. You saw the recursive nature of this, where we find the spot to insert, and then once we found the insertion point of the B-tree, we're going to recursively run this algorithm so keep throwing nodes up higher and higher in the tree.

 

해당 키 값의 삽입 지점을 찾고, 반으로 분할하여 중간 값을 부모노드로 던지는 식의 재귀적 수행을 통해

점점 커지는 Btree의 특성!


3.2.3 B-Tree Search

 

BTree Analysis

 

The height of the BTree determines maximum number of seeks possible in search data.

...and the height of the structure is : logn(n)

Therefore: The number of seeks is no more than logn(n)

 

So, if you have three trillion pieces of data, you've got five different commas in that number, which means you're only going to have to do five seeks to look through all three trillion pieces of data. This is absolutely amazing given how much data you can store and in this small amount of disk seeks you need to do.

 

In comparison, if you had that much data in an AVL tree, you'd have to do over 30 seeks . So, a BTree is an optimized algorithm to optimize around the idea doing as minimal disk seeks as possible when all of our data is can't fit in main memory . It's a little bit of a hybrid between a classical tree and a pure disk algorithm.

 

AVL은 균형하지 않은 이진탐색트리를 회전으로 균형하게 만들었고,

Btree는 삽입 위치를 찾아가며 분할하고 부모노드로 올라가는 재귀의 방식으로 균형을 이룸.

특히나, 많은 데이터를 저장해야할 때, 저장 공간에 따른 접근 속도 차이가 성능에 영향을 주지 않게끔 할때 btree를 이용해 설계하나 보다.

노드에 저장할 수 있는 Key값이 여러개 이고, 이들의 포인터를 이용해 탐색, 저장 등을 함.

 

https://www.cs.usfca.edu/~galles/visualization/BTree.html

 

B-Tree Visualization

 

www.cs.usfca.edu