Static Double Array Trie (DASTrie)

Introduction

Trie is a data structure of ordered tree that implements an associative array. Looking up a record key (usually a string) is very efficient, which takes O(1) with respect to the number of stored records n. Trie is also known for efficient prefix matching, where the retrieved key strings are the prefixes of a given query string.

Double-array trie, which was proposed by Jun-ichi Aoe in the late 1980s, represents a trie in two parallel arrays (BASE and CHECK). Reducing the storage usage drastically, double array tries have been used in practical applications such as morphological analysis, spelling correction, and Japanese Kana-Kanji convertion.

Static Double Array Trie (DASTrie) is an implementation of static double-array trie. For the simplicity and efficiency, DASTrie focuses on building a static double array from a list of records sorted by dictionary order of keys. DASTrie does not provide the functionality for updating an existing trie, whereas the original framework of double array considers dynamic updates. DASTrie provides several features:

Download

DASTrie is distributed under the term of the modified BSD license.

History

Sample code

#include <fstream>
#include <iostream>
#include <string>
#include <dastrie.h>

int main(int argc, char *argv[])
{
    typedef dastrie::builder<char*, int> builder_type;
    typedef dastrie::trie<int> trie_type;
    typedef builder_type::record_type record_type;

    // Records sorted by dictionary order of keys.
    record_type records[] = {
        {"eight", 8},   {"five", 5},    {"four", 4},    {"nine", 9},
        {"one", 1},     {"seven", 7},   {"six", 6},     {"ten", 10},
        {"three", 3},   {"two", 2},
    };

    try {
        // Build a double-array trie from the records.
        builder_type builder;
        builder.build(records, records + sizeof(records)/sizeof(records[0]));

        // Store the double-array trie to a file.
        std::ofstream ofs("sample.db", std::ios::binary);
        builder.write(ofs);
        ofs.close();

    } catch (const builder_type::exception& e) {
        // Abort if something went wrong...
        std::cerr << "ERROR: " << e.what() << std::endl;
        return 1;
    }

    // Open the trie file.
    std::ifstream ifs("sample.db", std::ios::binary);
    if (ifs.fail()) {
        std::cerr << "ERROR: Failed to open a trie file." << std::endl;
        return 1;
    }

    // Read the trie.
    trie_type trie;
    if (trie.read(ifs) == 0) {
        std::cerr << "ERROR: Failed to read a trie file." << std::endl;
        return 1;
    }

    /*
      Note that, although this sample program uses a file, a trie class can
      also receive a double-array trie directly from a builder,
        trie.assign(builder.doublearray(), builder.tail(), builder.table());
    */

    // Get the values of keys or the default value if the key does not exist.
    std::cout << trie.get("one", -1) << std::endl;              // 1
    std::cout << trie.get("other", -1) << std::endl;            // -1

    // Check the existence of a key and obtain its value.
    int value;
    if (trie.find("two", value)) {
        std::cout << value << std::endl;                        // 2
    }

    // Check the existence of keys.
    std::cout << trie.in("ten") << std::endl;                   // 1 (true)
    std::cout << trie.in("eleven") << std::endl;                // 0 (false)

    // Get records whose keys are prefixes of "eighteen".
    trie_type::prefix_cursor pfx = trie.prefix("eighteen");
    while (pfx.next()) {
        std::cout
            << pfx.query.substr(0, pfx.length) << " "           // eight
            << pfx.value << std::endl;                          // 8
    }

    return 0;
}

Documentation

Performance

This section reports results of performance comparison of different trie implementations. The experiments used two text corpora, Google Web 1T corpus and SPECIALIST Lexicon. For each text corpus, the experiments measured the elapsed time for constructing a trie (build time), the total time for finding all of keys in the corpus (access time), and the size of the trie database generated.

Google Web 1T corpus

In this experiment, 13,588,391 unigrams (125,937,836 bytes) in the Google Web 1T corpus were inserted to a trie as keys (without frequency information). TinyDA was not used in this experiment because the corpus is too large to store keys within 0x007FFFFF double-array elements.

ImplementationParametersBuild [sec]Access [sec]Database [bytes]
DASTrie 1.0 Default 1821.72131,542,283
darts 0.32 Default 22.31.25406,358,432
DynDA 0.01 Default 3352.53195,374,108
Tx 0.12 Default 28.326.652,626,805

LRWD table of UMLS SPECIALIST Lexicon

In this experiment, 351,006 lexicon (4,026,389 bytes) in the LRWD table of UMLS SPECIALIST Lexicon were inserted to a trie as keys. DASTrie was configured to represent a double-array element in 5 bytes (default) and 4 bytes (compact).

ImplementationParametersBuild [sec]Access [sec]Database [bytes]
DASTrie 1.0 Default 0.300.054,534,065
DASTrie 1.0 Compact (-c) 0.270.063,783,249
darts 0.32 Default 0.650.0412,176,328
DynDA 0.01 Default 0.280.077,095,226
TinyDA 1.23 Default 0.360.064,575,520
Tx 0.12 Default 0.680.701,646,558

Acknowledgements

The data structure of the (static) double-array trie is described in:

The DASTrie distribution contains "a portable stdint.h", which is released by Paul Hsieh under the term of the modified BSD license, for addressing the compatibility issue of Microsoft Visual Studio 2008. The original code is available at: http://www.azillionmonkeys.com/qed/pstdint.h

References


Copyright (c) 2002-2008 by Naoaki Okazaki
Mon Nov 10 12:28:34 2008