DASTrie: Static Double Array Trie

O(1)

Double-array trie, which was proposed by Jun-ichi Aoe in the late 1980s, represents a trie in two parallel arrays (BASE and CHECK). Reducing the storage usage drastically, double array tries have been used in practical applications such as morphological analysis, spelling correction, and Japanese Kana-Kanji convertion.

Static Double Array Trie (DASTrie) is an implementation of static double-array trie. For the simplicity and efficiency, DASTrie focuses on building a static double array from a list of records sorted by dictionary order of keys. DASTrie does not provide the functionality for updating an existing trie, whereas the original framework of double array considers dynamic updates. DASTrie provides several features:

Associative array. DASTrie is designed to store associations between key strings and their values (similarly to std::map). Thus, it is very straightforward to put and obtain the record values for key strings, while some implementations just return unique integer identifiers for key strings. It is also possible to omit record values so that DASTrie behaves as a string set (e.g., std::set).

Configurable value type. A type of record values is configurable by a template argument of dastrie::trie and dastrie::builder. Basic types (e.g., int, double) and strings (char* and std::string) can be used as a value type of records without any additional effort. User-defined types can also be used as record values only if two operators for serialization (operator<<() and operator>>()) are implemented.

Flexible key type. A type of record keys is configurable by a template argument for dastrie::builder. One can choose either null-terminated C strings (char*) or C++ strings (std::string).

Fast look-ups. Looking up a key takes O(1) with respect to the number of records n.

Prefix match. DASTrie supports prefix matching, where the retrieved key strings are prefixes of a given query string. One can enumerate records of prefixes by using dastrie::trie::prefix_cursor.

Compact double array. DASTrie implements double arrays whose each element is only 4 or 5 bytes long, whereas most implementations consume 8 bytes for an double-array element. The size of double-array elements is configurable by trait classes, dastrie::doublearray4_traits and dastrie::doublearray5_traits.

Minimal prefix double-array. DASTrie manages what is called a tail array so that non-branching suffixes do not waste the storage space of double array. This feature makes tries compact, improveing the storage utilization greatly.

Simple write interface. DASTrie can serialize a trie data structure to C++ output streams (std::ostream) with dastrie::builder::write() function. Serialized data can be embedded into files with other arbitrary data.

Simple read interface. DASTrie can prepare a double-array trie from an input stream (std::istream) (with dastrie::trie::read() function) or from a memory block (with dastrie::trie::assign() function) to which a serialized data is read or memory-mapped from a file.

Cross platform. The source code can be compiled on Microsoft Visual Studio 2008, GNU C Compiler (gcc), etc.

Simple C++ implementation. Following the good example of Darts, DASTrie is implemented in a single header file (dastrie.h); one can use the DASTrie API only by including dastrie.h in a source code.

Download

History

Sample code

#include <fstream>
#include <iostream>
#include <string>
#include <dastrie.h>

int main(int argc, char *argv[])
{
    typedef dastrie::builder<char*, int> builder_type;
    typedef dastrie::trie<int> trie_type;
    typedef builder_type::record_type record_type;

    // Records sorted by dictionary order of keys.
    record_type records[] = {
        {"eight", 8},   {"five", 5},    {"four", 4},    {"nine", 9},
        {"one", 1},     {"seven", 7},   {"six", 6},     {"ten", 10},
        {"three", 3},   {"two", 2},
    };

    try {
        // Build a double-array trie from the records.
        builder_type builder;
        builder.build(records, records + sizeof(records)/sizeof(records[0]));

        // Store the double-array trie to a file.
        std::ofstream ofs("sample.db", std::ios::binary);
        builder.write(ofs);
        ofs.close();

    } catch (const builder_type::exception& e) {
        // Abort if something went wrong...
        std::cerr << "ERROR: " << e.what() << std::endl;
        return 1;
    }

    // Open the trie file.
    std::ifstream ifs("sample.db", std::ios::binary);
    if (ifs.fail()) {
        std::cerr << "ERROR: Failed to open a trie file." << std::endl;
        return 1;
    }

    // Read the trie.
    trie_type trie;
    if (trie.read(ifs) == 0) {
        std::cerr << "ERROR: Failed to read a trie file." << std::endl;
        return 1;
    }

    /*
      Note that, although this sample program uses a file, a trie class can
      also receive a double-array trie directly from a builder,
        trie.assign(builder.doublearray(), builder.tail(), builder.table());
    */

    // Get the values of keys or the default value if the key does not exist.
    std::cout << trie.get("one", -1) << std::endl;              // 1
    std::cout << trie.get("other", -1) << std::endl;            // -1

    // Check the existence of a key and obtain its value.
    int value;
    if (trie.find("two", value)) {
        std::cout << value << std::endl;                        // 2
    }

    // Check the existence of keys.
    std::cout << trie.in("ten") << std::endl;                   // 1 (true)
    std::cout << trie.in("eleven") << std::endl;                // 0 (false)

    // Get records whose keys are prefixes of "eighteen".
    trie_type::prefix_cursor pfx = trie.prefix("eighteen");
    while (pfx.next()) {
        std::cout
            << pfx.query.substr(0, pfx.length) << " "           // eight
            << pfx.value << std::endl;                          // 8
    }

    return 0;
}

Documentation

Performance

Google Web 1T corpus

Implementation	Parameters	Build [sec]	Access [sec]	Database [bytes]
DASTrie 1.0	Default	182	1.72	131,542,283
darts 0.32	Default	22.3	1.25	406,358,432
DynDA 0.01	Default	335	2.53	195,374,108
Tx 0.12	Default	28.3	26.6	52,626,805

LRWD table of UMLS SPECIALIST Lexicon

Implementation	Parameters	Build [sec]	Access [sec]	Database [bytes]
DASTrie 1.0	Default	0.30	0.05	4,534,065
DASTrie 1.0	Compact (-c)	0.27	0.06	3,783,249
darts 0.32	Default	0.65	0.04	12,176,328
DynDA 0.01	Default	0.28	0.07	7,095,226
TinyDA 1.23	Default	0.36	0.06	4,575,520
Tx 0.12	Default	0.68	0.70	1,646,558

Acknowledgements

The DASTrie distribution contains "a portable stdint.h", which is released by Paul Hsieh under the term of the modified BSD license, for addressing the compatibility issue of Microsoft Visual Studio 2008. The original code is available at: http://www.azillionmonkeys.com/qed/pstdint.h

Static Double Array Trie (DASTrie)

Introduction