CRFsuite - CRF Benchmark test

Introduction

This page describes benchmark results of different implementations and algorithms of CRFs.

Implementations for the benchmark

This is the list of CRF implementations used for the experiments.

Experimental settings

The experiments use the training and test sets of CoNLL 2000 chunking shared task. We employ the common feature set for different CRF implementations; state (unigram) and transition (bigram) features are generated from the training and test sets by applying the feature template bundled in CRF++. For CRFsuite, FlexCRF, and MALLET, the state observations (attributes) are generated by script programs (see the tutorial for more information). The training data contains 8936 sentences (sequences), 338,574 attributes, and 22 labels. The test data contains 2012 sentences.

All of the experiments were run on CentOS (with gcc 3.4.6), Dual core Xeon 3.0GHz x 2 processors, 64GB main memory.

Training speed

In general, it is difficult to compare different implementations in terms of training speed since the implementations may produce different numbers of features, use different algorithms for updating feature weights, and finish training processes in different criteria (e.g., convergence criterion of L-BFGS/SGD). In addition, we need to check whether CRF models obtained from different implementations perform the similar accuracy on the test data. For this reason, this benchmark reports various aspects of the implementations.

Training speed (for one iteration)

Figure 1 shows seconds elapsed for one L-BFGS/SGD iteration. Although it depends on implementations what has done in one iteration (e.g., computing forward-backward scores, observation/model expectations of features, Viterbi probabilities, and updates of feature weights), these figures assess the fundamental speed of implementations in training CRFs. The training speed is also affected by the number of features (or strategy of feature generation) used in training. Therefore, Figure 1 presents two groups of implementations: CRFsuite with sparse features, MALLET, and FlexCRF that generate 450,000-600,000 features; and CRFsuite with dense features, sgd, and CRF++ that generate ca. 7,500,000 features.

Training time

Figure 1. Training time

In summary, CRFsuite is 2.2 times faster than sgd, 5.3 times faster than CRF++, 31.6 times faster than MALLET, and 56.4 times faster than FlexCRFs on training. L-BFGS and SGD implemented in CRFsuite are roughly comparable in terms of the computational costs for updating feature weights in one iteration. However, this does not mean that L-BFGS and SGD are equally fast; as we will see, SGD improves the log-likelihood of the model more drastically than L-BFGS does in one iteration. In other words, SGD approaches to the optimal weights in less iterations than L-BFGS.

Focussing more on the comparison of L-BFGS and SGD in CRFsuite, we notice that L-BFGS is slightly faster than SGD with less features while L-BFGS is slower with more features. This is because the computational costs of L-BFGS and SGD are O(adn + bk) and O(cdn), respectively. Here:

  • d is the average number of features that are relevant to an instance
  • n is the number of instances
  • k is the number of features
  • a is the constant factor for computing gradients of an instance
  • b is the constant factor in which L-BFGS updates feature weights in one iteration
  • c is the constant factor in which SGD updates feature weights for an instance

Usually, a < c; thus, L-BFGS may run faster than SGD while the number of features (k) is small.

Training speed (full results)

Table 1 reports the full result of the experiment. All implementations roughly achieve the similar performance in terms of accuracy. SGD implementations stop after 100 iterations since sgd 1.3 has no functionality to test the convergence of log-likelihood. In general, SGD appears to be more efficient than L-BFGS.

Table 1. Performance of training

Software Solver Feature # Features Time # Iters Update Mem-L Model LL Acc Log
        [s]   [s/ev] [MB] [MB]   [%]  
CRFsuite 0.6 L-BFGS sparse 456,480 436.5 143 2.3 201 27 -8967.1 96.00 log
CRFsuite 0.6 L-BFGS dense 7,448,562 926.5 126 5.8 1,414 187 -7756.0 96.06 log
CRFsuite 0.6 SGD sparse 456,480 271.3 100 2.6 140 27 -9057.4 95.98 log
CRFsuite 0.6 SGD dense 7,448,562 564.3 100 5.1 427 187 -7799.6 96.06 log
sgd 1.3 SGD dense 7,448,606 1212.3 100 11.6 179 58 -7721.4 96.01 log
CRF++ 0.51 L-BFGS dense 7,448,606 4038.0 131 30.6 946 43 -7713.5 96.07 log
FlexCRFs 0.3 L-BFGS sparse 456,436 18424.3 142 129.8 438 9 -11687.8 95.94 log
MALLET 2.0RC2 L-BFGS sparse? 597,442 7039.6 88 72.7 1,292 13 ? 95.84 log
# Features
Number of features generated by a training program. These numbers vary depending especially on the treatment of negative features. For example, CRF++ generates all of possible features, whereas FlexCRF does positive features only. In this experiment, we call the former strategy for feature generation as dense and the latter as sparse. CRFsuite can use the both strategies by configuring options for training. Since the speed and memory space for training are greatly affected by the number of features, we need to compare implementations within each group, i.e., CRFsuite (sparse) vs FlexCRFs vs MALLET and CRFsuite (dense) vs sgd vs CRF++.
Time [s]
The total time, in seconds, elapsed while a training process alives. We cannot compare these figures between implementations, which depend on the number of features and iterations. In general, shorter is better.
# Iters
The number of iterations that an implementation used for L-BFGS/SGD. The definition of 'iteration' of CRFsuite (L-BFGS) is different from the one of CRF++ and FlexCRFs. CRFsuite does not count objective-function evaluations invoked by the line-search trials as iterations, whereas CRF++ and FlexCRFs include them. If we count the numbers of iterations of CRFsuite (sparse) and CRFsuite (dense) based on their standard, these numbers will be 168 and 143 respectively.
Update [s/ev]
The averaged elapsed time, in seconds, for an iteration. These figures are identical to ones shown in Figure 1. We can use these figures as an indicator for the 'goodness' of the implementations.
Mem [MB]
The memory space consumed while each implementation runs the training algorithm (e.g., L-BFGS and SGD).
Model [MB]
The file size of a CRF model stored by each implementation.
LL
The final value of the regularized log-likelihood of each implementation. In general, a larger value indicates that the model fits to the training data better. However, the log-likelihood that is very close to zero implies that the model might overfit to the training data. The common approach to the overfit problem is to add (subtract) a regularization term to the log-likelihood, which penalizes large feature weights. Thus, it is unlikely that a regularized log-likelihood reaches zero; we cannot compare the performance of different implementations with regularized log-likelihood values. In this experiment, we chose a regularization parameter for each implementation so that all implementations obtain the comparable effect of regularization. We did not show the LL value for MALLET, which seems to report the unregularized LL.
Acc [%]
Item accuracy of a CRF model on the test data. These figures are only for validating the correctness of this experiment; we can say that this experiment is fair since CRF models with each feature set (sparse or dense) performed roughly at the same accuracy.

Tagging speed

Tagging speed

Figure 2 measures the elapsed time for tagging all of the test data. This figure includes everything while a tagger process is alive, e.g., the time for starting up a tagger process, reading a model file, and tagging the data.

Tagging time

Figure 2. Tagging time

In summary, CRFsuite is 10.1 times faster than FlexCRF, 13.4 times faster than sgd, and 24.1 times faster than MALLET; CRF++ is slightly (1.08 times) faster than CRFsuite.

Tagging speed (full results)

Table 2 reports the detail of the tagging performance. The major cause of some slow taggers is in their preparation (startup). A tagger usually requires an associative array from features and their weights so that scores of input sequences are computed efficiently. CRFsuite and CRF++ can prepare these associative arrays from model files very quickly because the model files directly represent the data structure of associative arrays. In contrast, other implementations need to build associative arrays at run time since the model files are not designed to store associative arrays but mere lists of features and their weights. We did not measure the start-up time of MALLET, but we guess that the tagger also consumes most of time for preparation.

Table 2 also shows that CRFsuite tags input sequences faster than CRF++. However, the preparation of CRFsuite (0.25 sec) is slower than that of CRF++ (0.02 sec). This is probably because CRFsuite reads the larger model file (187MB) than CRF++ does (47MB). Therefore, the total tagging time of CRFsuite (Figure 2) would be improved by making the model file compact.

Table 2. Performance of tagging

Implementation Feature Startup Tagging Speed Mem
    [s] [s] [seq/s] [MB]
CRFsuite 0.6 sparse 0.06 0.80 2515 35
CRFsuite 0.6 dense 0.25 1.00 2012 194
sgd 1.3 dense 13.77 1.81 1163 76
CRF++ 0.51 dense 0.02 1.14 1764 60
FlexCRFs 0.3 sparse 14.05 8.69 232 275
MALLET 2.0RC2 sparse? ? ? ? 1277
Startup [s]
The elapsed time, in seconds, required for a tagger to read a CRF model and to finish necessary preparations for tagging. We modified the source codes of some implementations to measure the start-up time. We did not measure the startup time for MALLET.
Tagging [s]
The elapsed time, in seconds, required for a tagger to apply the CRF model and tag all of the test sequences.
Speed [seq/s]
The tagging speed, or the number of sequences that a tagger can process per second.
Mem [MB]
The memory space consumed by a tagger.

Comparison of training algorithms (L-BFGS vs SGD)

In the section called “Training speed”, we showed that L-BFGS and SGD are comparable in terms of the computational cost for updating feature weights in one iteration. However, this does not mean that these algorithms are equally fast.

Figure 3 plots regularized log-likelihood values at each iteration. The difference of L-BFGS and SGD is obvious from this figure; L-BFGS approaches to the solution more slowly (at iterations 1 to 50) than SGD.

L2-regularized log-likelihood and iterations

Figure 3. L2-regularized log-likelihood and iterations

Figure 4 shows accuracy values measured on the test data at each iteration of the training data. SGD achieved as high as 95.05% accuracy after the first iteration whereas L-BFGS obtained only 62.79% and required 20 iterations to reach the level of 95% accuracy.

Item accuracy and iterations

Figure 4. Item accuracy and iterations

Figure 5 presents the L2-norm of feature weights at each iteration. In the end, L-BFGS and SGD expanded feature weights to the same size (around 95.4 in L2-norm) because of the regularization effect.

Feature L2-norm and iterations

Figure 5. Feature L2-norm and iterations