The easiest way to install CRFsuite is to use a binary distribution. Currently, binaries for Win32 and Linux (Intel 32-bit and 64-bit architectures) are distributed.
As of CRFsuite 0.5, the source package no longer includes the libLBFGS source code. To build CRFsuite, you need to download and build libLBFGS first.
In Windows environments, open the Visual Studio solution file (lbfgs.sln) of libLBFGS and build it. The solution file builds a static-link library, lbfgs.lib (release build) or lbfgs_debug.lib (debug build), in the Release or Debug directory.
Because the solution file (crfsuite.sln) of CRFsuite assumes that the header and library files of libLBFGS exist in the win32/lbfgs directory, create this directory and copy lbfgs.h, lbfgs.lib and/or lbfgs_debug.lib into it.
Then open the solution file (crfsuite.sln) and build it.
In Linux environments, download the source package of libLBFGS and build it. If you do not want to install libLBFGS system-wide, pass the "--prefix" option to the configure script.
$ ./configure [--prefix=/path/to/a/temporary/directory]
$ make
$ make install
Now you are ready to build CRFsuite. If you installed libLBFGS to a non-default directory, specify that directory with the "--with-liblbfgs" option.
$ ./configure [--with-liblbfgs=/path/to/a/temporary/directory]
$ make
$ make install
The CRFsuite utility expects the first command-line argument to be a command name:
- learn
- Train a CRF model from a training set.
- tag
- Tag sequences using a CRF model.
- dump
- Dump a CRF model in plain-text format.
To see the command-line options, use the -h (--help) option.
$ crfsuite -h
CRFsuite 0.10 Copyright (c) 2007-2010 Naoaki Okazaki
USAGE: crfsuite <COMMAND> [OPTIONS]
COMMAND Command name to specify the processing
OPTIONS Arguments for the command (optional; command-specific)
COMMAND:
learn Obtain a model from a training set of instances
tag Assign suitable labels to given instances by using a model
dump Output a model in a plain-text format
For the usage of each command, specify -h option in the command argument.
To train a CRF model from a training set, enter the following command:
$ crfsuite learn [OPTIONS] [DATA]
If the argument DATA is omitted or '-', this utility reads the training data from STDIN. To see the usage of the learn command, specify the -h (--help) option.
$ crfsuite learn -h
CRFsuite 0.10 Copyright (c) 2007-2010 Naoaki Okazaki
USAGE: crfsuite learn [OPTIONS] [DATA]
Obtain a model from a training set of instances given by a file (DATA).
If argument DATA is omitted or '-', this utility reads a data from STDIN.
OPTIONS:
-m, --model=MODEL Store the obtained model in a file (MODEL)
-t, --test=TEST Report the performance of the model on a data (TEST)
-p, --param=NAME=VALUE Set the parameter NAME to VALUE
-h, --help Show the usage of this command and exit
The following options are available for training.
- -m, --model=MODEL
- A filename to which CRFsuite stores an obtained CRF model. The default value is "crfsuite.model".
- -t, --test=TEST
- The filename of test data for holdout evaluation during training. With this option specified, CRFsuite evaluates the current CRF model on the holdout data and reports the performance.
- -p, --param=NAME=VALUE
- Configure a parameter for training. CRFsuite sets the parameter (NAME) to VALUE. Available parameters are:
- algorithm=ALGORITHM
- Use ALGORITHM for training. Currently, CRFsuite supports "lbfgs" (L-BFGS) and "sgd" (SGD). The default value is "lbfgs".
- feature.minfreq=VALUE
- Cut-off threshold for occurrence frequency of a feature. CRFsuite will ignore features whose frequencies of occurrences in the training data are no greater than VALUE. The default value is 0 (i.e., no cut-off).
- feature.possible_states=BOOL
- Specify whether CRFsuite generates state features that do not occur in the training data (i.e., negative state features). When BOOL is 1, CRFsuite generates state features for all possible combinations of attributes and labels. If the numbers of attributes and labels are A and L, respectively, this option generates (A * L) features. Enabling it may improve labeling accuracy because the CRF model can learn that an item tends not to receive certain labels when specific attributes are present. However, it may also greatly increase the number of features and slow down training drastically. This option is disabled by default.
- feature.possible_transitions=BOOL
- Specify whether CRFsuite generates transition features that do not occur in the training data (i.e., negative transition features). When BOOL is 1, CRFsuite generates transition features for all possible label pairs. If the number of labels in the training data is L, this option generates (L * L) transition features. This option is disabled by default.
- feature.bos_eos=BOOL
- Specify whether CRFsuite generates begin-of-sequence (BOS) and end-of-sequence (EOS) features. When BOOL is 1, CRFsuite generates transition features that describe the event where a sequence begins or ends with a specific label. This option is enabled by default.
- regularization=TYPE
- The type of regularization: "L1" (L1 regularization, Laplacian prior), "L2" (L2 regularization, Gaussian prior), or "" (no regularization). The default value is "L2". SGD does not support L1 regularization at this moment.
- regularization.sigma=VALUE
- The regularization parameter sigma, i.e., variance of feature weights. The default value is 10.
- lbfgs.max_iterations=VALUE
- The maximum number of iterations for L-BFGS optimization. The L-BFGS routine terminates if the iteration count exceeds this value. The default value is the maximum value of an integer on the machine (INT_MAX).
- lbfgs.epsilon=VALUE
- The epsilon parameter that determines the condition of convergence. The default value is 1e-5.
- lbfgs.stop=VALUE
- The number of iterations over which the stopping criterion is tested. The default value is 10.
- lbfgs.delta=VALUE
- The threshold for the stopping criterion; an L-BFGS iteration stops when the improvement of the log likelihood over the last ${lbfgs.stop} iterations is no greater than this threshold. The default value is 1e-5.
- lbfgs.num_memories=VALUE
- The number of limited memories that L-BFGS uses for approximating the inverse Hessian matrix. The default value is 6.
- lbfgs.linesearch=METHOD
- The line search method in the L-BFGS algorithm. The possible methods are: "MoreThuente" (the method proposed by More and Thuente), "Backtracking" (backtracking method with the strong Wolfe condition), and "LooseBacktracking" (backtracking method with the regular Wolfe condition). The default method is "MoreThuente".
- lbfgs.linesearch.max_iterations=NUM
- The maximum number of trials for the line search algorithm. The default value is 20.
- sgd.max_iterations=VALUE
- The maximum number of iterations (epochs) for SGD. The SGD routine terminates if the iteration count exceeds this value. The default value is 1000.
- sgd.period=VALUE
- The number of iterations over which the stopping criterion is tested. The default value is 10.
- sgd.delta=VALUE
- The threshold for the stopping criterion; an SGD iteration stops when the improvement of the log likelihood over the last ${sgd.period} iterations is no greater than this threshold. The default value is 1e-6.
- sgd.calibration.eta=VALUE
- The initial value of learning rate (eta) used for calibration. The default value is 0.1.
- sgd.calibration.rate=VALUE
- The rate of increase/decrease of learning rate for calibration. The default value is 2.
- sgd.calibration.samples=VALUE
- The number of instances used for calibration. The calibration routine randomly chooses at most VALUE instances. The default value is 1000.
- sgd.calibration.candidates=VALUE
- The number of candidate learning rates. The calibration routine tries VALUE candidate learning rates that can increase the log-likelihood.
- -h, --help
- Show the usage of this command and exit.
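The stopping criterion shared by lbfgs.stop/lbfgs.delta and sgd.period/sgd.delta can be illustrated with a minimal Python sketch. It assumes the improvement is the plain difference between the current log-likelihood and the value recorded the given number of iterations earlier; the text above does not say whether CRFsuite measures the improvement in absolute or relative terms, so treat that as an assumption. The function name should_stop and its arguments are hypothetical and not part of CRFsuite.

```python
def should_stop(history, period, delta):
    """Decide whether training should stop.

    history: log-likelihood values, one per completed iteration, oldest first.
    period:  number of iterations to look back (lbfgs.stop or sgd.period).
    delta:   improvement threshold (lbfgs.delta or sgd.delta).
    """
    # The test cannot be applied until more than `period` iterations are recorded.
    if len(history) <= period:
        return False
    # Assumption: improvement is the absolute gain in log-likelihood.
    improvement = history[-1] - history[-1 - period]
    return improvement <= delta


# Still improving by a large margin over the last 2 iterations: keep going.
print(should_stop([-100.0, -50.0, -20.0, -10.0], period=2, delta=1e-5))
# No change over the last 3 iterations: stop.
print(should_stop([-10.0, -10.0, -10.0, -10.0], period=3, delta=1e-5))
```

A larger period makes the test less sensitive to a single slow iteration, while a smaller delta makes training run longer before it is considered converged.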
Here are some examples of CRFsuite command-lines for training.
- Train a CRF model from train.txt with the default parameters.
$ crfsuite learn train.txt
- Train a CRF model from train.txt and store the model to CRF.model. During the training, test the model with the holdout data test.txt.
$ crfsuite learn -m CRF.model -t test.txt train.txt
- Train a CRF model from train.txt by using SGD with L2 regularization (sigma=1).
$ crfsuite learn -p algorithm=sgd -p regularization.sigma=1 train.txt
- Train a CRF model from train.txt with L1 regularization (sigma=1).
$ crfsuite learn -p regularization=L1 -p regularization.sigma=1 train.txt
- Train a CRF model from train.txt, generating all possible features.
$ crfsuite learn -p feature.possible_states=1 -p feature.possible_transitions=1 train.txt
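When training is driven from a script, the command lines above can be assembled programmatically. The following Python sketch builds the argument list for "crfsuite learn" using only the -m, -t, and -p options documented here. The helper name build_learn_command is hypothetical, and the sketch only constructs the arguments; it does not run CRFsuite.

```python
def build_learn_command(data=None, model=None, test=None, params=None):
    """Return the argument list for a `crfsuite learn` invocation."""
    cmd = ["crfsuite", "learn"]
    if model is not None:
        cmd += ["-m", model]          # file that receives the trained model
    if test is not None:
        cmd += ["-t", test]           # holdout data evaluated during training
    for name, value in (params or {}).items():
        cmd += ["-p", f"{name}={value}"]  # one -p option per NAME=VALUE pair
    if data is not None:
        cmd.append(data)              # omitted: CRFsuite reads from STDIN
    return cmd


cmd = build_learn_command(
    "train.txt",
    params={"algorithm": "sgd", "regularization.sigma": 1},
)
print(" ".join(cmd))
```

If the crfsuite binary is on the PATH, the resulting list can be passed to subprocess.run(cmd, check=True) from the Python standard library.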
To tag data using a CRF model, enter the following command:
$ crfsuite tag [OPTIONS] [DATA]
If the argument DATA is omitted or '-', CRFsuite reads the data from STDIN. To see the usage of the tag command, specify the -h (--help) option.
$ crfsuite tag -h
CRFsuite 0.10 Copyright (c) 2007-2010 Naoaki Okazaki
USAGE: crfsuite tag [OPTIONS] [DATA]
Assign suitable labels to the instances in the data set given by a file (DATA).
If the argument DATA is omitted or '-', this utility reads a data from STDIN.
Evaluate the performance of the model on labeled instances (with -t option).
OPTIONS:
-m, --model=MODEL Read a model from a file (MODEL)
-t, --test Report the performance of the model on the data
-r, --reference Output the reference labels in the input data
-q, --quiet Suppress tagging results (useful for test mode)
-h, --help Show the usage of this command and exit
The following options are available for tagging.
- -m, --model=MODEL
- A filename from which CRFsuite reads a CRF model. The default value is "crfsuite.model".
- -t, --test
- Evaluate the performance (accuracy, precision, recall, F1 measure) of the CRF model, assuming that the input data is labeled. This function is disabled by default.
- -r, --reference
- Output the reference labels in parallel with predicted labels, assuming that the input data is labeled. This function is disabled by default.
- -q, --quiet
- Suppress the output of tagged labels. This function is useful for evaluating a CRF model with -t option.
- -h, --help
- Show the usage of this command and exit.
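The measures named for the -t option can be illustrated with a short Python sketch that computes item accuracy and per-label precision, recall, and F1 from parallel lists of reference and predicted labels. This uses the standard definitions of these measures and is not taken from the CRFsuite source; the function name evaluate is hypothetical, and the exact report that CRFsuite prints may differ.

```python
def evaluate(reference, predicted):
    """Return (accuracy, {label: (precision, recall, f1)})."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    pairs = list(zip(reference, predicted))
    correct = sum(r == p for r, p in pairs)
    accuracy = correct / len(pairs) if pairs else 0.0

    per_label = {}
    for label in sorted(set(reference) | set(predicted)):
        tp = sum(r == label and p == label for r, p in pairs)
        n_pred = predicted.count(label)   # items the model assigned this label
        n_ref = reference.count(label)    # items that truly carry this label
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_ref if n_ref else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_label[label] = (precision, recall, f1)
    return accuracy, per_label


accuracy, scores = evaluate(["B", "I", "O", "O"], ["B", "O", "O", "O"])
print(accuracy)
for label, (p, r, f) in scores.items():
    print(label, p, r, f)
```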
Here are some examples of CRFsuite command-lines for tagging.
- Tag the data test.txt using the CRF model CRF.model.
$ crfsuite tag -m CRF.model test.txt
- Evaluate the CRF model CRF.model on the labeled data test.txt.
$ crfsuite tag -m CRF.model -qt test.txt
CRFsuite accepts text files in a specific format as training and untagged data. The data consist of a set of sequences, each of which is represented by consecutive lines and terminated by an empty line. A sequence consists of a series of items, each described on one line. An item line begins with its label, followed by its attributes separated by tab characters. An attribute consists of an attribute name and its value separated by a colon character (':'). An attribute name can include escape sequences: "\:" and "\\" represent ':' and '\', respectively. If an attribute value is omitted (no colon character), CRFsuite assumes the attribute value to be one. A line starting with a sharp character ('#') is ignored as a comment. Label fields are required for both training and untagged data; when tagging, CRFsuite ignores the label fields in the input data.
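As an illustration of these rules, the following Python sketch writes item lines in this format, escaping ':' and '\' in attribute names and omitting values equal to one. The helper names escape_name, format_item, and format_sequence are hypothetical, and the label is written unescaped on the assumption that it contains no tab characters.

```python
def escape_name(name):
    # Escape backslashes first so the backslash added for ':' is not doubled.
    return name.replace("\\", "\\\\").replace(":", "\\:")


def format_item(label, attributes):
    """Format one item as 'label<TAB>name[:value]<TAB>...' (no newline)."""
    fields = [label]
    for name, value in attributes.items():
        field = escape_name(name)
        if value != 1:
            field += f":{value}"   # a value of one may be omitted
        fields.append(field)
    return "\t".join(fields)


def format_sequence(items):
    """Format a list of (label, attributes) pairs, ending with an empty line."""
    return "".join(format_item(label, attrs) + "\n" for label, attrs in items) + "\n"


print(format_sequence([
    ("B-NP", {"w[0]=Hello": 1, "pos:NN": 0.5}),
    ("O", {"w[0]=.": 1}),
]), end="")
```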
This is the BNF notation representing the data format; <feature>, <name>, and <weight> correspond to an attribute, its name, and its value.
<line> ::= <comment> | <item> | <eos>
<comment> ::= '#' <character>+ <br>
<item> ::= <label> ('\t' <feature>)+ <br>
<eos> ::= <br>
<label> ::= <string>
<feature> ::= <name> | <name> ':' <weight>
<name> ::= (<letter> | "\:" | "\\")+
<weight> ::= <numeric>
<br> ::= '\n'
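A reader for this grammar can be sketched in a few lines of Python. The sketch assumes that the first unescaped colon separates the name from the weight, that a missing weight means 1.0, and that consecutive empty lines do not produce empty sequences; these choices are assumptions made for illustration, not a description of CRFsuite's own reader. The function names split_attribute and parse_sequences are hypothetical.

```python
def split_attribute(field):
    """Split 'name' or 'name:weight', honoring the \\: and \\\\ escapes."""
    name = []
    i = 0
    while i < len(field):
        ch = field[i]
        if ch == "\\" and i + 1 < len(field) and field[i + 1] in ":\\":
            name.append(field[i + 1])   # escaped ':' or '\' belongs to the name
            i += 2
        elif ch == ":":
            return "".join(name), float(field[i + 1:])
        else:
            name.append(ch)
            i += 1
    return "".join(name), 1.0           # no colon: weight defaults to one


def parse_sequences(lines):
    """Return a list of sequences; each is a list of (label, {name: weight})."""
    sequences, current = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("#"):        # comment line
            continue
        if line == "":                  # empty line terminates a sequence
            if current:
                sequences.append(current)
                current = []
            continue
        label, *fields = line.split("\t")
        current.append((label, dict(split_attribute(f) for f in fields)))
    if current:                         # input ended without a final empty line
        sequences.append(current)
    return sequences


text = "# comment\nB\tw=a\tpos\\:x:0.5\nI\tw=b\n\nO\tw=c\n"
for sequence in parse_sequences(text.splitlines(True)):
    print(sequence)
```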