A complete explanation of the Unibase by DMAC Btree++ index algorithm approach

Unibase by DMAC, Release 7.47i dated March 1, 2000 and later, has some new sort features. Three goals drove these features:

  1. sort speed
  2. sort predictability
  3. sort customization

Background :

Unibase by DMAC uses the sort verb to create Btree++ indexes. These indexes are used in the AID language to rapidly retrieve data.

Speed :

Unibase by DMAC clients want both speed of retrieval from an index and speed of building one. Over the years both speeds have improved. Sort speed (index creation), measured in records per second, varies depending upon how many total records are going to be sorted in a single pass. Some sort specialty vendors study the file to be sorted before sorting so that an optimum approach can be used. Today sort (indexing) speeds of over 10,000 records per second are common in Unibase by DMAC.

Sort Predictability :

DMAC regularly gets asked how long it will take to sort (build an index on) a given number of records. In the past the variables were so many that the equation to predict the answer was beyond most clients and DMAC people. Today a simple means of predicting the performance of Unibase by DMAC on a machine is possible by obtaining one or two sort speeds on that particular machine configuration.

Sort Customization :

The next question to the above is, “What does one do to get better sort speeds?” The answer to this question has been expanded to four answers: 1. no network traffic (a standalone PC), 2. more RAM, 3. a faster CPU, and 4. no screen displays while sorting.

What Happened :

In the Microsoft Windows versions and the UNIX/LINUX versions, basic changes allow the user to take advantage of memory added to the client.

Some of the current recommendations come from practical experience. The sort time analysis comes from reading and studying the book, “A Practical Introduction to Data Structures and Algorithm Analysis” by Clifford A. Shaffer of Virginia Polytechnic Institute and State University.

Network Server / Standalone PC Changes :

As network servers strive to protect users, standalone computers become faster, and graphics become more elaborate, the sort speed on a standalone PC with no screen interface has surpassed the sort speed of a server/client workstation configuration. Thus the code changes to Unibase by DMAC described below are enhanced by using a standalone computer.

Code Changes to Unibase by DMAC :

One feature coded into Unibase was to set up the index so that it can reference up to 4 terabytes (4,000 gigabytes) of data stored in individual files of less than 2 gigabytes each.
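One way to picture this is to split a large logical offset into a file number and an offset within that file. The sketch below is illustrative Python only; SEGMENT_BYTES and locate are invented names, not part of Unibase by DMAC.

  # Hypothetical sketch: keep each physical file just under 2 gigabytes and
  # map a large logical offset to (segment file number, offset in that file).
  SEGMENT_BYTES = 2**31 - 1          # just under 2 gigabytes per physical file

  def locate(logical_offset):
      """Map a logical byte offset to (segment file number, offset within it)."""
      return divmod(logical_offset, SEGMENT_BYTES)

  # Example: an offset 3 terabytes (binary) into the logical index
  print(locate(3 * 2**40))           # -> (1536, 1536)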

Another, more subtle, feature was to arrange things so that as more RAM memory becomes available, bigger files are sorted efficiently in RAM memory automatically.

Improving Predictability :

The code changes improved the predictability of Unibase by DMAC. While the sort verb can handle files of all sizes, there is a tremendous speed decrease when anything outside of Random Access Memory (RAM) is accessed.

Users can use the following equation to determine how much RAM is needed for various sizes of files:

RAM needed = A + 2 * (n * (36 + key length))
where “A” is an empirical constant (use 4,000,000, for example) and n is the number of keys to be indexed.

Some RAM needed numbers:

Key Length 10:

n             2 * n * (36 + 10)    RAM needed
1,000         92,000               4,092,000
5,000         460,000              4,460,000
10,000        920,000              4,920,000
50,000        4,600,000            8,600,000
100,000       9,200,000            13,200,000
500,000       46,000,000           50,000,000
1,000,000     92,000,000           96,000,000
5,000,000     460,000,000          464,000,000
10,000,000    920,000,000          924,000,000

Key Length 80:

n             2 * n * (36 + 80)    RAM needed
1,000         232,000              4,232,000
5,000         1,160,000            5,160,000
10,000        2,320,000            6,320,000
50,000        11,600,000           15,600,000
100,000       23,200,000           27,200,000
500,000       116,000,000          120,000,000
1,000,000     232,000,000          236,000,000
5,000,000     1,160,000,000        1,164,000,000
10,000,000    2,320,000,000        2,324,000,000
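For other key lengths, the tables are easy to reproduce. The following is a minimal sketch in Python, for illustration only, assuming A = 4,000,000 bytes as suggested above:

  # Sketch of the RAM estimate: RAM needed = A + 2 * (n * (36 + key length)).
  def ram_needed(n, key_length, a=4_000_000):
      """Return the estimated RAM in bytes for sorting n keys of a given length."""
      return a + 2 * n * (36 + key_length)

  for key_length in (10, 80):
      print(f"Key length {key_length}:")
      for n in (1_000, 10_000, 100_000, 1_000_000, 10_000_000):
          print(f"  {n:>12,}  {ram_needed(n, key_length):>15,}")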

The Unibase code breaks down at 2 gigabytes of RAM (as does nearly everything today). So above that number, we need to sort in strings and merge the strings, up to the 4 terabytes we can index.
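The sort-in-strings-and-merge idea can be pictured with a generic external merge sort, sketched below. This is an illustration of the general technique, not DMAC's actual code, and the run size is an arbitrary assumption.

  # Generic external merge sort sketch: sort RAM-sized runs ("strings"),
  # spill each run to a temporary file, then merge the sorted runs.
  import heapq
  import tempfile

  def _keys_in(f):
      for line in f:
          yield line.rstrip("\n")

  def external_sort(keys, run_size=1_000_000):
      runs = []
      for start in range(0, len(keys), run_size):
          run = sorted(keys[start:start + run_size])   # one in-RAM string
          f = tempfile.TemporaryFile(mode="w+t")
          f.writelines(k + "\n" for k in run)
          f.seek(0)
          runs.append(f)
      # merge all of the sorted strings into one ordered stream of keys
      yield from heapq.merge(*[_keys_in(f) for f in runs])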

Sort Time :

If we ignore sorts that must use storage media other than RAM, we can predict the time required quite accurately. This means that users can build themselves a sort (index generating) machine which responds according to an asymptotic algorithm analysis, with the sort time computable by the following equation:

time = A + B(n * log n)
where:
A is an empirical constant
B is an empirical constant
n is the number of keys to be indexed

Actually, if we are measuring time in seconds, A is normally about one or two seconds, so most of the time you see the above equation written:

time = B (n * log n)
where:
B is an empirical constant
n is the number of keys to be indexed

What I am saying is that if sufficient RAM (up to 2 gigabytes) is available, the above equation works for the Unibase by DMAC sort verb on files of up to 10,000,000 records.
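The “one or two sort speeds” mentioned earlier are enough to pin down A and B for a particular machine. Below is a minimal sketch, in illustrative Python with invented names, of solving for both constants from two measured build times; base-10 logarithms are assumed, matching the worked table that follows.

  # Solve time = A + B * (n * log n) for A and B from two measurements.
  from math import log10

  def fit_constants(n1, t1, n2, t2):
      x1, x2 = n1 * log10(n1), n2 * log10(n2)
      b = (t2 - t1) / (x2 - x1)
      a = t1 - b * x1
      return a, b

  # Example with two of the Pentium measurements quoted later in this note:
  a, b = fit_constants(4_250, 13, 108_100, 420)
  print(a, b)   # roughly 1.1 seconds and 0.00077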

Another feature developed in Unibase by DMAC will allow merging of indexes so that indexes of up to 4 terabytes can be created for billions and billions of records. The merge process works, but no times for the merge process have been collected yet.

Some Log Examples:

We can make things very easy by constructing a table of the value of n * log n. Most users really only want a range of values, so they can pick from the following:

n          log n    n * log n
1,000      3.00     3,000
5,000      3.70     18,500
10,000     4.00     40,000
50,000     4.70     235,000
100,000    5.00     500,000
500,000    5.70     2,850,000

Let us look at some sample data from a 200 MHz Pentium running Microsoft Windows 95, building an index with a key size of 80 characters:

n          log n    build time (secs)    records/second
4,250      3.63     13                   346
21,250     4.33     72                   295
51,000     4.71     173                  295
60,700     4.78     211                  288
108,100    5.03     420                  258

From this data, A is about 1.2 seconds and B is about 0.00077.

So we can predict for whatever number of records we want. A million records will take about 4,600 seconds (about 76 minutes) and 5,000,000 records will take about 25,650 seconds (about 7 hours).
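Plugging the fitted constants back into the model is a quick way to check these figures. The sketch below, again illustrative Python, reproduces the measured build times above and closely matches the million-record and five-million-record estimates.

  # time = A + B * (n * log n) with the constants fitted above.
  from math import log10

  def predicted_build_time(n, a=1.2, b=0.00077):
      return a + b * n * log10(n)

  for n in (4_250, 108_100, 1_000_000, 5_000_000):
      print(f"{n:>10,} records -> about {predicted_build_time(n):,.0f} seconds")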