CRFsuite - Documentation

Installation

Using a binary distribution

The easiest way to install CRFsuite is to use a binary distribution. Currently, binaries for Win32 and Linux (Intel 32-bit and 64-bit architectures) are distributed.

Building from a source distribution

As of CRFsuite 0.5, the source package no longer includes libLBFGS. To build CRFsuite, you need to download and build libLBFGS first.

In Windows environments, open the Visual Studio solution file (lbfgs.sln) of libLBFGS and build it. The solution file builds a static-link library, lbfgs.lib (release build) or lbfgs_debug.lib (debug build), in the Release or Debug directory. Because the solution file (crfsuite.sln) of CRFsuite assumes that the header and library files of libLBFGS exist in the win32/lbfgs directory, create this directory and copy lbfgs.h, lbfgs.lib and/or lbfgs_debug.lib into it. Then open the solution file (crfsuite.sln) and build it.

In Linux environments, download the source package of libLBFGS and build it. If you do not want to install libLBFGS system-wide, pass the "--prefix" option to the configure script.

$ ./configure [--prefix=/path/to/a/temporary/directory]
$ make
$ make install

Now you are ready to build CRFsuite. If you installed libLBFGS to a non-default directory, specify that directory as the argument of the "--with-liblbfgs" option.

$ ./configure [--with-liblbfgs=/path/to/a/temporary/directory]
$ make
$ make install

Usage

The CRFsuite utility expects the first command-line argument to be a command name:

learn
Train a CRF model from a training set.
tag
Tag sequences using a CRF model.
dump
Dump a CRF model in plain-text format.

To see the command-line options, use the -h (--help) option.

$ crfsuite -h
CRFsuite 0.10  Copyright (c) 2007-2010 Naoaki Okazaki

USAGE: crfsuite <COMMAND> [OPTIONS]
    COMMAND     Command name to specify the processing
    OPTIONS     Arguments for the command (optional; command-specific)

COMMAND:
    learn       Obtain a model from a training set of instances
    tag         Assign suitable labels to given instances by using a model
    dump        Output a model in a plain-text format

For the usage of each command, specify -h option in the command argument.

Training

To train a CRF model from a training set, enter the following command:

$ crfsuite learn [OPTIONS] [DATA]

If the argument DATA is omitted or '-', this utility reads the training data from STDIN. To see the usage of the learn command, specify the -h (--help) option.

$ crfsuite learn -h
CRFsuite 0.10  Copyright (c) 2007-2010 Naoaki Okazaki

USAGE: crfsuite learn [OPTIONS] [DATA]
Obtain a model from a training set of instances given by a file (DATA).
If argument DATA is omitted or '-', this utility reads a data from STDIN.

OPTIONS:
    -m, --model=MODEL   Store the obtained model in a file (MODEL)
    -t, --test=TEST     Report the performance of the model on a data (TEST)
    -p, --param=NAME=VALUE  Set the parameter NAME to VALUE
    -h, --help          Show the usage of this command and exit

The following options are available for training.

-m, --model=MODEL
A filename to which CRFsuite stores an obtained CRF model. The default value is "crfsuite.model".
-t, --test=TEST
The filename of test data for holdout evaluation during training. With this option specified, CRFsuite evaluates the current CRF model on the holdout data and reports the performance.
-p, --param=NAME=VALUE
Configure a parameter for the training. CRFsuite sets the parameter (NAME) to VALUE. Available parameters are:
algorithm=ALGORITHM
Use ALGORITHM for training. Currently, CRFsuite supports "lbfgs" (L-BFGS) and "sgd" (SGD). The default value is "lbfgs".
feature.minfreq=VALUE
Cut-off threshold for occurrence frequency of a feature. CRFsuite will ignore features whose frequencies of occurrences in the training data are no greater than VALUE. The default value is 0 (i.e., no cut-off).
feature.possible_states=BOOL
Specify whether CRFsuite generates state features that do not occur in the training data (i.e., negative state features). When BOOL is set to 1, CRFsuite generates state features that associate all possible combinations of attributes and labels. If the numbers of attributes and labels are A and L respectively, this function generates (A * L) features. Enabling this function may improve the labeling accuracy because the CRF model can learn the condition where an item is not assigned certain labels in the presence of specific attributes. However, this function may also increase the number of features and slow down the training process drastically. This function is disabled by default.
feature.possible_transitions=BOOL
Specify whether CRFsuite generates transition features that do not occur in the training data (i.e., negative transition features). When BOOL is set to 1, CRFsuite generates transition features that associate all possible label pairs. If the number of labels in the training data is L, this function generates (L * L) transition features. This function is disabled by default.
feature.bos_eos=BOOL
Specify whether CRFsuite generates begin-of-sequence (BOS) and end-of-sequence (EOS) features. When BOOL is set to 1, CRFsuite generates transition features that describe the event where a sequence begins or ends with specific labels. This function is enabled by default.
regularization=TYPE
The type of regularization: "L1" (L1 regularization, Laplacian prior), "L2" (L2 regularization, Gaussian prior), or "" (no regularization). The default value is "L2". SGD does not support L1 regularization at this moment.
regularization.sigma=VALUE
The regularization parameter sigma, i.e., variance of feature weights. The default value is 10.
lbfgs.max_iterations=VALUE
The maximum number of iterations for L-BFGS optimization. The L-BFGS routine terminates if the iteration count exceeds this value. The default value is set to the maximum value of integer on the machine (INT_MAX).
lbfgs.epsilon=VALUE
The epsilon parameter that determines the condition of convergence. The default value is 1e-5.
lbfgs.stop=VALUE
The duration of iterations to test the stopping criterion. The default value is 10.
lbfgs.delta=VALUE
The threshold for the stopping criterion; an L-BFGS iteration stops when the improvement of the log likelihood over the last ${lbfgs.stop} iterations is no greater than this threshold. The default value is 1e-5.
lbfgs.num_memories=VALUE
The number of limited memories that L-BFGS uses for approximating the inverse Hessian matrix. The default value is 6.
lbfgs.linesearch=METHOD
The line search method in the L-BFGS algorithm. The possible methods are: "MoreThuente" (the method proposed by More and Thuente), "Backtracking" (backtracking method with the strong Wolfe condition), and "LooseBacktracking" (backtracking method with the regular Wolfe condition). The default method is "MoreThuente".
lbfgs.linesearch.max_iterations=NUM
The maximum number of trials for the line search algorithm. The default value is 20.
sgd.max_iterations=VALUE
The maximum number of iterations (epochs) for SGD. The SGD routine terminates if the iteration count exceeds this value. The default value is 1000.
sgd.period=VALUE
The duration of iterations to test the stopping criterion. The default value is 10.
sgd.delta=VALUE
The threshold for the stopping criterion; an SGD iteration stops when the improvement of the log likelihood over the last ${sgd.period} iterations is no greater than this threshold. The default value is 1e-6.
sgd.calibration.eta=VALUE
The initial value of learning rate (eta) used for calibration. The default value is 0.1.
sgd.calibration.rate=VALUE
The rate of increase/decrease of learning rate for calibration. The default value is 2.
sgd.calibration.samples=VALUE
The number of instances used for calibration. The calibration routine randomly chooses at most VALUE instances. The default value is 1000.
sgd.calibration.candidates=VALUE
The number of candidates of learning rate. The calibration routine tries VALUE candidates of learning rates that can increase log-likelihood.
-h, --help
Show the usage of this command and exit.
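
The (A * L) count quoted for feature.possible_states can be checked on a small data set before enabling the option. The following Python sketch is only an illustration, not part of CRFsuite: the function name count_state_features and the in-memory representation of items are assumptions made for this example. It compares the number of attribute/label pairs that actually occur with the number of all possible pairs.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical in-memory form: an item is (label, {attribute_name: value}),
# and a sequence is a list of items.
Item = Tuple[str, Dict[str, float]]


def count_state_features(sequences: List[List[Item]]) -> Tuple[int, int]:
    """Return (observed, possible) counts of attribute/label pairs."""
    attributes: Set[str] = set()
    labels: Set[str] = set()
    observed: Set[Tuple[str, str]] = set()
    for sequence in sequences:
        for label, attrs in sequence:
            labels.add(label)
            for name in attrs:
                attributes.add(name)
                observed.add((name, label))
    # With feature.possible_states=1, every one of the A * L pairs
    # would become a state feature.
    return len(observed), len(attributes) * len(labels)


toy = [
    [("B-NP", {"w=the": 1.0}), ("I-NP", {"w=cat": 1.0}), ("O", {"w=sat": 1.0})],
    [("B-NP", {"w=the": 1.0}), ("I-NP", {"w=dog": 1.0})],
]
observed, possible = count_state_features(toy)
print(observed, possible)
```

On realistic data the second number is usually far larger than the first, which is why enabling the option can slow down training.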

Here are some examples of CRFsuite command-lines for training.

Train a CRF model from train.txt with the default parameters.
$ crfsuite learn train.txt
Train a CRF model from train.txt and store the model to CRF.model. During the training, test the model with the holdout data test.txt.
$ crfsuite learn -m CRF.model -t test.txt train.txt
Train a CRF model from train.txt by using SGD with L2 regularization (sigma=1).
$ crfsuite learn -p algorithm=sgd -p regularization.sigma=1 train.txt
Train a CRF model from train.txt with L1 regularization (sigma=1).
$ crfsuite learn -p regularization=L1 -p regularization.sigma=1 train.txt
Train a CRF model from train.txt, generating all of possible features.
$ crfsuite learn -p feature.possible_states=1 -p feature.possible_transitions=1 train.txt
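
When training is driven from a script, the command lines above can be assembled programmatically. The Python sketch below is a hypothetical helper (build_learn_command is not part of CRFsuite) that uses only the options documented in this section: -m, -t, and one -p NAME=VALUE pair per parameter. It builds the argument list and prints it; it does not run crfsuite.

```python
import shlex
from typing import Dict, List, Optional


def build_learn_command(
    data: str = "-",
    model: Optional[str] = None,
    test: Optional[str] = None,
    params: Optional[Dict[str, object]] = None,
) -> List[str]:
    """Assemble an argument list for 'crfsuite learn' (helper name is an assumption)."""
    cmd = ["crfsuite", "learn"]
    if model is not None:
        cmd += ["-m", model]
    if test is not None:
        cmd += ["-t", test]
    # Each training parameter is passed as its own -p NAME=VALUE pair.
    for name, value in (params or {}).items():
        cmd += ["-p", f"{name}={value}"]
    # DATA comes last; '-' means the training data is read from STDIN.
    cmd.append(data)
    return cmd


cmd = build_learn_command(
    "train.txt",
    model="CRF.model",
    test="test.txt",
    params={"algorithm": "sgd", "regularization.sigma": 1},
)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list can be handed to subprocess.run if crfsuite is on the PATH; keeping the arguments as a list avoids shell quoting problems with unusual filenames.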

Tagging

To tag data using a CRF model, enter the following command:

$ crfsuite tag [OPTIONS] [DATA]

If the argument DATA is omitted or '-', CRFsuite reads the data from STDIN. To see the usage of the tag command, specify the -h (--help) option.

$ crfsuite tag -h
CRFsuite 0.10  Copyright (c) 2007-2010 Naoaki Okazaki

USAGE: crfsuite tag [OPTIONS] [DATA]
Assign suitable labels to the instances in the data set given by a file (DATA).
If the argument DATA is omitted or '-', this utility reads a data from STDIN.
Evaluate the performance of the model on labeled instances (with -t option).

OPTIONS:
    -m, --model=MODEL   Read a model from a file (MODEL)
    -t, --test          Report the performance of the model on the data
    -r, --reference     Output the reference labels in the input data
    -q, --quiet         Suppress tagging results (useful for test mode)
    -h, --help          Show the usage of this command and exit

The following options are available for tagging.

-m, --model=MODEL
A filename from which CRFsuite reads a CRF model. The default value is "crfsuite.model".
-t, --test
Evaluate the performance (accuracy, precision, recall, f1 measure) of the CRF model, assuming that the input data is labeled. This function is disabled by default.
-r, --reference
Output the reference labels in parallel with predicted labels, assuming that the input data is labeled. This function is disabled by default.
-q, --quiet
Suppress the output of tagged labels. This function is useful for evaluating a CRF model with -t option.
-h, --help
Show the usage of this command and exit.

Here are some examples of CRFsuite command-lines for tagging.

Tag the data test.txt using a CRF model CRF.model.
$ crfsuite tag -m CRF.model test.txt
Evaluate a CRF model CRF.model on the labeled data test.txt.
$ crfsuite tag -m CRF.model -qt test.txt
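
For readers who want to reproduce the kind of figures that the -t option reports, the following Python sketch computes item accuracy and per-label precision, recall, and F1 from two parallel label lists. It uses the standard textbook definitions; the function name evaluate is hypothetical, and the exact layout and rounding of CRFsuite's own report may differ.

```python
from typing import Dict, List, Tuple


def evaluate(
    reference: List[str], predicted: List[str]
) -> Tuple[float, Dict[str, Tuple[float, float, float]]]:
    """Return (accuracy, {label: (precision, recall, f1)})."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    correct = sum(r == p for r, p in zip(reference, predicted))
    accuracy = correct / len(reference) if reference else 0.0
    per_label: Dict[str, Tuple[float, float, float]] = {}
    for label in sorted(set(reference) | set(predicted)):
        tp = sum(r == label and p == label for r, p in zip(reference, predicted))
        n_pred = predicted.count(label)  # items the model tagged with this label
        n_ref = reference.count(label)   # items that truly carry this label
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_ref if n_ref else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_label[label] = (precision, recall, f1)
    return accuracy, per_label


ref = ["B", "I", "O", "O", "B"]
pred = ["B", "O", "O", "O", "B"]
accuracy, scores = evaluate(ref, pred)
print(accuracy)
for label, (p, r, f) in scores.items():
    print(label, round(p, 3), round(r, 3), round(f, 3))
```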

Model dump

To dump a CRF model in plain-text format, enter the following command:

$ crfsuite dump <MODEL>

Format of training/tagging data

CRFsuite accepts text files in a specific format as training and untagged data. A data set consists of a set of sequences, each of which is represented by consecutive lines and terminated by an empty line. A sequence consists of a series of items, each described on one line. An item line begins with its label, followed by its attributes separated by tab characters. An attribute specifies an attribute name and its value separated by a colon character (':'). An attribute name can include escape sequences; "\:" and "\\" represent ':' and '\', respectively. If an attribute value is omitted (no colon character), CRFsuite assumes the attribute value to be one. A line starting with a sharp character ('#') is ignored as a comment. Label fields are required for both training and untagged data; CRFsuite ignores label fields in the input data when tagging.
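
As an illustration of the rules above, here is a minimal Python sketch that writes one sequence in this format. The helper names (escape_name, format_item, format_sequence) are made up for this example and are not part of CRFsuite. The sketch escapes ':' and '\' in attribute names, omits values equal to one, and terminates the sequence with an empty line. Labels are written unchanged, since the text only describes escaping for attribute names.

```python
from typing import Dict, List, Tuple


def escape_name(name: str) -> str:
    """Escape '\\' first, then ':', so added backslashes are not escaped twice."""
    return name.replace("\\", "\\\\").replace(":", "\\:")


def format_item(label: str, attributes: Dict[str, float]) -> str:
    """Format one item line: label, then tab-separated attributes."""
    fields = [label]
    for name, value in attributes.items():
        field = escape_name(name)
        if value != 1:
            # A value of one may be omitted; other values follow a colon.
            field += f":{value:g}"
        fields.append(field)
    return "\t".join(fields)


def format_sequence(items: List[Tuple[str, Dict[str, float]]]) -> str:
    """Format a sequence: one line per item, then an empty line."""
    return "".join(format_item(label, attrs) + "\n" for label, attrs in items) + "\n"


text = format_sequence([
    ("B-NP", {"w=the": 1, "pos=DT": 1}),
    ("I-NP", {"w=10:30": 0.5}),
])
print(text, end="")
```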

This is the BNF notation representing the data format.

<line>           ::= <comment> | <item> | <eos>
<comment>        ::= '#' <character>+ <br>
<item>           ::= <label> ('\t' <feature>)+ <br>
<eos>            ::= <br>
<label>          ::= <string>
<feature>        ::= <name> | <name> ':' <weight>
<name>           ::= (<letter> | "\:" | "\\")+
<weight>         ::= <numeric>
<br>             ::= '\n'
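
The grammar above can be turned into a small reader. The following Python sketch is one possible interpretation, not CRFsuite's own parser: the names split_feature and read_sequences are hypothetical, a backslash is assumed to make the next character literal, and consecutive empty lines are assumed not to produce empty sequences.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Item = Tuple[str, Dict[str, float]]


def split_feature(field: str) -> Tuple[str, float]:
    """Split <name> or <name>:<weight>, honoring the "\\:" and "\\\\" escapes."""
    name: List[str] = []
    i = 0
    while i < len(field):
        ch = field[i]
        if ch == "\\" and i + 1 < len(field):
            name.append(field[i + 1])  # escaped character is taken literally
            i += 2
        elif ch == ":":
            # First unescaped colon separates the name from the weight.
            return "".join(name), float(field[i + 1:])
        else:
            name.append(ch)
            i += 1
    return "".join(name), 1.0  # omitted weight defaults to one


def read_sequences(lines: Iterable[str]) -> Iterator[List[Item]]:
    """Yield sequences; each sequence is a list of (label, attributes)."""
    sequence: List[Item] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("#"):  # <comment>
            continue
        if line == "":            # <eos>
            if sequence:
                yield sequence
                sequence = []
            continue
        label, *fields = line.split("\t")  # <item>
        sequence.append((label, dict(split_feature(f) for f in fields)))
    if sequence:  # tolerate a missing final empty line
        yield sequence


sample = "# comment\nB-NP\tw=the\tpos=DT\nI-NP\tw=10\\:30:0.5\n\nO\tw=.\n"
parsed = list(read_sequences(sample.splitlines()))
print(parsed)
```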