This tutorial demonstrates the use of CRFsuite for text chunking, the task of dividing a text into syntactically correlated groups of words.
We use the training and testing data distributed by the CoNLL 2000 shared task.
The scripts needed for this tutorial are included under the example/CoNLL2000 directory in the CRFsuite distribution.
First, change the current directory to the example directory and download the training and testing data from the shared task website:
$ cd example/CoNLL2000/
$ wget http://www.cnts.ua.ac.be/conll2000/chunking/train.txt.gz
$ wget http://www.cnts.ua.ac.be/conll2000/chunking/test.txt.gz
$ less train.txt.gz
... (snip) ...
London JJ B-NP
shares NNS I-NP
closed VBD B-VP
moderately RB B-ADVP
lower JJR I-ADVP
in IN B-PP
thin JJ B-NP
trading NN I-NP
. . O

At IN B-PP
Tokyo NNP B-NP
, , O
the DT B-NP
Nikkei NNP I-NP
index NN I-NP
of IN B-PP
225 CD B-NP
selected VBN I-NP
issues NNS I-NP
was VBD B-VP
up IN B-ADVP
112.16 CD B-NP
points NNS I-NP
to TO B-PP
35486.38 CD B-NP
. . O
... (snip) ...
The data consists of a set of sentences (sequences), each of which contains a series of words (e.g., 'London', 'shares'), part-of-speech tags (e.g., 'JJ', 'NNS'), and chunk labels (e.g., 'B-NP', 'I-NP') separated by space characters. In this tutorial, we would like to construct a CRF model that assigns a sequence of chunk labels, given a sequence of words and part-of-speech tags. Please refer to the CoNLL 2000 shared task website for more information about this task.
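To make the data layout concrete, here is a minimal Python sketch that groups such lines into sentences. It assumes one "word POS chunk" triple per line and an empty line between sentences, which is how the CoNLL 2000 files are laid out; the function name read_conll and the shortened sample are only for illustration and are not part of the CRFsuite distribution.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, str, str]  # (word, part-of-speech tag, chunk label)

def read_conll(lines: Iterable[str]) -> List[List[Token]]:
    """Group 'word pos chunk' lines into sentences; an empty line ends a sentence."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in lines:
        fields = line.split()
        if not fields:            # empty line: close the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, chunk = fields
        current.append((word, pos, chunk))
    if current:                   # last sentence without a trailing empty line
        sentences.append(current)
    return sentences

# Shortened sample taken from the excerpt of train.txt.gz.
sample = """London JJ B-NP
shares NNS I-NP
closed VBD B-VP

At IN B-PP
Tokyo NNP B-NP
"""
sentences = read_conll(sample.splitlines())
print(len(sentences), sentences[0][0])
```

For the real files, the same function could be fed the lines of gzip.open("train.txt.gz", "rt") from the Python standard library instead of the sample string.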
The next step is to preprocess the training and testing data to extract attributes that express the characteristics of words (items) in the data. In general, this is the most important step in machine-learning applications because it strongly affects the labeling accuracy. In this tutorial, we express the word at position t (an offset from the beginning of a sequence) with the following 19 kinds of attributes:
${x[t-2].token}, ${x[t-1].token}, ${x[t].token}, ${x[t+1].token}, ${x[t+2].token}
${x[t-1].token}/${x[t].token}, ${x[t].token}/${x[t+1].token}
${x[t-2].pos}, ${x[t-1].pos}, ${x[t].pos}, ${x[t+1].pos}, ${x[t+2].pos}
${x[t-2].pos}/${x[t-1].pos}, ${x[t-1].pos}/${x[t].pos}, ${x[t].pos}/${x[t+1].pos}, ${x[t+1].pos}/${x[t+2].pos}
${x[t-2].pos}/${x[t-1].pos}/${x[t].pos}, ${x[t-1].pos}/${x[t].pos}/${x[t+1].pos}, ${x[t].pos}/${x[t+1].pos}/${x[t+2].pos}
In this list, ${x[t].token} and ${x[t].pos} represent the word and the part-of-speech tag, respectively, at position t in a sequence.
These attributes express the characteristics of the word at position t by using information from surrounding words, e.g., ${x[t-1].token} and ${x[t+1].pos}.
This attribute set is compatible with the feature template for CoNLL 2000 in the CRF++ distribution.
It is easy to implement the conversion from the training/testing data to CRFsuite data.
The CRFsuite distribution includes a Python script to_crfsuite.py that generates attributes from the CoNLL 2000 data.
The procedure below converts train.txt.gz and test.txt.gz into train.crfsuite.txt and test.crfsuite.txt that are compatible with the CRFsuite data format.
$ zcat train.txt.gz | ./to_crfsuite.py > train.crfsuite.txt
$ zcat test.txt.gz | ./to_crfsuite.py > test.crfsuite.txt
$ less train.crfsuite.txt
... (snip) ...
B-NP U00= U01= U02=London U03=shares U04=closed U05=/London U06=London/shares U10= U11= U12=JJ U13=NNS U14=VBD U15=/ U16=/JJ U17=JJ/NNS U18=NNS/VBD U20=//JJ U21=/JJ/NNS U22=JJ/NNS/VBD
I-NP U00= U01=London U02=shares U03=closed U04=moderately U05=London/shares U06=shares/closed U10= U11=JJ U12=NNS U13=VBD U14=RB U15=/JJ U16=JJ/NNS U17=NNS/VBD U18=VBD/RB U20=/JJ/NNS U21=JJ/NNS/VBD U22=NNS/VBD/RB
B-VP U00=London U01=shares U02=closed U03=moderately U04=lower U05=shares/closed U06=closed/moderately U10=JJ U11=NNS U12=VBD U13=RB U14=JJR U15=JJ/NNS U16=NNS/VBD U17=VBD/RB U18=RB/JJR U20=JJ/NNS/VBD U21=NNS/VBD/RB U22=VBD/RB/JJR
B-ADVP U00=shares U01=closed U02=moderately U03=lower U04=in U05=closed/moderately U06=moderately/lower U10=NNS U11=VBD U12=RB U13=JJR U14=IN U15=NNS/VBD U16=VBD/RB U17=RB/JJR U18=JJR/IN U20=NNS/VBD/RB U21=VBD/RB/JJR U22=RB/JJR/IN
I-ADVP U00=closed U01=moderately U02=lower U03=in U04=thin U05=moderately/lower U06=lower/in U10=VBD U11=RB U12=JJR U13=IN U14=JJ U15=VBD/RB U16=RB/JJR U17=JJR/IN U18=IN/JJ U20=VBD/RB/JJR U21=RB/JJR/IN U22=JJR/IN/JJ
B-PP U00=moderately U01=lower U02=in U03=thin U04=trading U05=lower/in U06=in/thin U10=RB U11=JJR U12=IN U13=JJ U14=NN U15=RB/JJR U16=JJR/IN U17=IN/JJ U18=JJ/NN U20=RB/JJR/IN U21=JJR/IN/JJ U22=IN/JJ/NN
B-NP U00=lower U01=in U02=thin U03=trading U04=. U05=in/thin U06=thin/trading U10=JJR U11=IN U12=JJ U13=NN U14=. U15=JJR/IN U16=IN/JJ U17=JJ/NN U18=NN/. U20=JJR/IN/JJ U21=IN/JJ/NN U22=JJ/NN/.
I-NP U00=in U01=thin U02=trading U03=. U04= U05=thin/trading U06=trading/. U10=IN U11=JJ U12=NN U13=. U14= U15=IN/JJ U16=JJ/NN U17=NN/. U18=./ U20=IN/JJ/NN U21=JJ/NN/. U22=NN/./
O U00=thin U01=trading U02=. U03= U04= U05=trading/. U06=./ U10=JJ U11=NN U12=. U13= U14= U15=JJ/NN U16=NN/. U17=./ U18=/ U20=JJ/NN/. U21=NN/./ U22=.//

B-PP U00= U01= U02=At U03=Tokyo U04=, U05=/At U06=At/Tokyo U10= U11= U12=IN U13=NNP U14=, U15=/ U16=/IN U17=IN/NNP U18=NNP/
... (snip) ...
Note that "U00=", "U01=", ... are prefixes to prevent name collisions of different kinds of attributes.
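The following Python sketch shows one way the 19 attributes could be generated for a position t. It is not the code of to_crfsuite.py: the function name attributes is hypothetical, the mapping from the names U00 ... U22 to the templates is inferred from the sample output above, and joining the label and attributes with TAB characters reflects our understanding of the CRFsuite data format. Positions outside the sentence yield empty strings, as in the sample ("U00=" at the first word).

```python
from typing import List, Tuple

# (prefix, field, offsets): the field values at the given offsets from t
# are joined with "/". Mapping inferred from the sample output (assumption).
TEMPLATES = [
    ("U00", "token", (-2,)), ("U01", "token", (-1,)), ("U02", "token", (0,)),
    ("U03", "token", (1,)), ("U04", "token", (2,)),
    ("U05", "token", (-1, 0)), ("U06", "token", (0, 1)),
    ("U10", "pos", (-2,)), ("U11", "pos", (-1,)), ("U12", "pos", (0,)),
    ("U13", "pos", (1,)), ("U14", "pos", (2,)),
    ("U15", "pos", (-2, -1)), ("U16", "pos", (-1, 0)),
    ("U17", "pos", (0, 1)), ("U18", "pos", (1, 2)),
    ("U20", "pos", (-2, -1, 0)), ("U21", "pos", (-1, 0, 1)),
    ("U22", "pos", (0, 1, 2)),
]

def attributes(sentence: List[Tuple[str, str, str]], t: int) -> List[str]:
    """Return the 19 prefixed attributes for the item at position t."""
    def value(field: str, i: int) -> str:
        if 0 <= i < len(sentence):
            return sentence[i][0 if field == "token" else 1]
        return ""  # outside the sentence: empty value
    return [
        name + "=" + "/".join(value(field, t + o) for o in offsets)
        for name, field, offsets in TEMPLATES
    ]

# (word, part-of-speech tag, chunk label) for the start of the sample sentence.
sentence = [("London", "JJ", "B-NP"), ("shares", "NNS", "I-NP"),
            ("closed", "VBD", "B-VP")]
attrs0 = attributes(sentence, 0)
print("\t".join([sentence[0][2]] + attrs0))
```

For t = 0 the sketch reproduces the attributes of the first item in the listing above, for example U05=/London and U20=//JJ.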
Now we are ready to use CRFsuite for training.
Simply type the following command to train a CRF model from train.crfsuite.txt.
CRFsuite will read the training data, generate necessary state and transition features based on the data, maximize the log-likelihood of the conditional probability distribution, and store the model into CoNLL2000.model.
$ crfsuite learn -m CoNLL2000.model train.crfsuite.txt
CRFsuite 0.6 Copyright (c) 2007-2009 Naoaki Okazaki
Start time of the training: 2009-03-07T15:50:07Z
Reading the training data
0....1....2....3....4....5....6....7....8....9....10
Number of instances: 8936
Total number of items: 211727
Number of attributes: 338547
Number of labels: 22
Seconds required: 5.410
Training first-order linear-chain CRFs (trainer.crf1m)
Feature generation
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 0
feature.bos_eos: 1
0....1....2....3....4....5....6....7....8....9....10
Number of features: 456480
Seconds required: 1.900
L-BFGS optimization
regularization: L2
regularization.sigma: 10.000000
lbfgs.num_memories: 6
lbfgs.max_iterations: 2147483647
lbfgs.epsilon: 0.000010
lbfgs.stop: 10
lbfgs.delta: 0.000010
lbfgs.linesearch: MoreThuente
lbfgs.linesearch.max_iterations: 20
***** Iteration #1 *****
Log-likelihood: -264449.110672
Feature norm: 5.000000
Error norm: 42832.056705
Active features: 456480
Line search trials: 2
Line search step: 0.000048
Seconds required for this iteration: 6.310
***** Iteration #2 *****
Log-likelihood: -163057.244350
Feature norm: 8.506562
Error norm: 26117.210073
Active features: 456480
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 2.180
... (snip) ...
***** Iteration #89 *****
Log-likelihood: -704.485807
Feature norm: 331.586428
Error norm: 19.138697
Active features: 456480
Line search trials: 3
Line search step: 0.038164
Seconds required for this iteration: 6.450
L-BFGS terminated with error code (-1002)
Total seconds required for L-BFGS: 274.920
Storing the model
Number of active features: 456480 (456480)
Number of active attributes: 338547 (338547)
Number of active labels: 22 (22)
Writing labels
Writing attributes
Writing feature references for transitions
Writing feature references for attributes
Seconds required: 0.530
End time of the training: 2009-03-07T15:54:51Z
Although the training process ended with the message "L-BFGS terminated with error code (-1002)", you do not have to worry about this error: as the log lines after the message show, the model was still stored.
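The "Feature generation" step in the log can be pictured with the rough Python sketch below. It rests on an assumption about what the settings feature.possible_states: 0 and feature.possible_transitions: 0 mean, namely that only (attribute, label) pairs and (previous label, label) pairs observed in the training data become features. It is not CRFsuite's implementation, the function name generate_features is hypothetical, and the BOS/EOS features enabled by feature.bos_eos are left out.

```python
from typing import List, Set, Tuple

Item = Tuple[str, List[str]]  # (label, attributes of the item)

def generate_features(
    sequences: List[List[Item]],
) -> Tuple[Set[Tuple[str, str]], Set[Tuple[str, str]]]:
    """Collect state and transition features observed in the data."""
    state: Set[Tuple[str, str]] = set()       # (attribute, label)
    transition: Set[Tuple[str, str]] = set()  # (previous label, label)
    for sequence in sequences:
        previous = None
        for label, attrs in sequence:
            for attr in attrs:
                state.add((attr, label))
            if previous is not None:
                transition.add((previous, label))
            previous = label
    return state, transition

# Toy sequence with two attributes per item (a small subset of the real 19).
toy = [[
    ("B-NP", ["U02=London", "U12=JJ"]),
    ("I-NP", ["U02=shares", "U12=NNS"]),
    ("B-VP", ["U02=closed", "U12=VBD"]),
]]
state, transition = generate_features(toy)
print(len(state), len(transition))
```

Under this reading, the number of features grows with the number of distinct attribute/label combinations in the data rather than with the full product of attributes and labels.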
You can also train a CRF model with the -t option, which lets you watch its performance (accuracy, precision, recall, F1 score) on the test data after each iteration. It should be exciting to see your model improve as the training process advances!
$ crfsuite learn -m CoNLL2000.model -t test.crfsuite.txt train.crfsuite.txt
CRFsuite 0.6 Copyright (c) 2007-2009 Naoaki Okazaki
Start time of the training: 2009-03-07T15:58:21Z
Reading the training data
0....1....2....3....4....5....6....7....8....9....10
Number of instances: 8936
Total number of items: 211727
Number of attributes: 338547
Number of labels: 22
Seconds required: 5.370
Reading the evaluation data
0....1....2....3....4....5....6....7....8....9....10
Number of instances: 2012
Number of total items: 47377
Seconds required: 1.260
Training first-order linear-chain CRFs (trainer.crf1m)
Feature generation
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 0
feature.bos_eos: 1
0....1....2....3....4....5....6....7....8....9....10
Number of features: 456482
Seconds required: 1.920
L-BFGS optimization
regularization: L2
regularization.sigma: 10.000000
lbfgs.num_memories: 6
lbfgs.max_iterations: 2147483647
lbfgs.epsilon: 0.000010
lbfgs.stop: 10
lbfgs.delta: 0.000010
lbfgs.linesearch: MoreThuente
lbfgs.linesearch.max_iterations: 20
***** Iteration #1 *****
Log-likelihood: -268663.973857
Feature norm: 5.000000
Error norm: 43686.795219
Active features: 456482
Line search trials: 2
Line search step: 0.000048
Seconds required for this iteration: 6.580
Performance by label (#match, #model, #ref) (precision, recall, F1):
B-NP: (8282, 10425, 12422) (0.7944, 0.6667, 0.7250)
B-PP: (3842, 5775, 4811) (0.6653, 0.7986, 0.7259)
I-NP: (14133, 27651, 14376) (0.5111, 0.9831, 0.6726)
B-VP: (0, 0, 4658) (0.0000, 0.0000, 0.0000)
I-VP: (0, 0, 2646) (0.0000, 0.0000, 0.0000)
B-SBAR: (0, 0, 535) (0.0000, 0.0000, 0.0000)
O: (3483, 3526, 6180) (0.9878, 0.5636, 0.7177)
B-ADJP: (0, 0, 438) (0.0000, 0.0000, 0.0000)
B-ADVP: (0, 0, 866) (0.0000, 0.0000, 0.0000)
I-ADVP: (0, 0, 89) (0.0000, 0.0000, 0.0000)
I-ADJP: (0, 0, 167) (0.0000, 0.0000, 0.0000)
I-SBAR: (0, 0, 4) (0.0000, 0.0000, 0.0000)
I-PP: (0, 0, 48) (0.0000, 0.0000, 0.0000)
B-PRT: (0, 0, 106) (0.0000, 0.0000, 0.0000)
B-LST: (0, 0, 5) (0.0000, 0.0000, 0.0000)
B-INTJ: (0, 0, 2) (0.0000, 0.0000, 0.0000)
I-INTJ: (0, 0, 0) (******, ******, ******)
B-CONJP: (0, 0, 9) (0.0000, 0.0000, 0.0000)
I-CONJP: (0, 0, 13) (0.0000, 0.0000, 0.0000)
I-PRT: (0, 0, 0) (******, ******, ******)
B-UCP: (0, 0, 0) (******, ******, ******)
I-UCP: (0, 0, 0) (******, ******, ******)
I-LST: (0, 0, 2) (0.0000, 0.0000, 0.0000)
Macro-average precision, recall, F1: (0.128637, 0.130956, 0.123527)
Item accuracy: 29740 / 47377 (0.6277)
Instance accuracy: 37 / 2012 (0.0184)
... (snip) ...
***** Iteration #82 *****
Log-likelihood: -826.646530
Feature norm: 366.823234
Error norm: 28.532190
Active features: 456482
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 2.480
Performance by label (#match, #model, #ref) (precision, recall, F1):
B-NP: (12000, 12403, 12422) (0.9675, 0.9660, 0.9668)
B-PP: (4699, 4854, 4811) (0.9681, 0.9767, 0.9724)
I-NP: (13931, 14444, 14376) (0.9645, 0.9690, 0.9668)
B-VP: (4459, 4668, 4658) (0.9552, 0.9573, 0.9563)
I-VP: (2526, 2656, 2646) (0.9511, 0.9546, 0.9528)
B-SBAR: (452, 518, 535) (0.8726, 0.8449, 0.8585)
O: (5941, 6149, 6180) (0.9662, 0.9613, 0.9637)
B-ADJP: (313, 403, 438) (0.7767, 0.7146, 0.7444)
B-ADVP: (702, 861, 866) (0.8153, 0.8106, 0.8130)
I-ADVP: (49, 75, 89) (0.6533, 0.5506, 0.5976)
I-ADJP: (110, 156, 167) (0.7051, 0.6587, 0.6811)
I-SBAR: (2, 15, 4) (0.1333, 0.5000, 0.2105)
I-PP: (34, 46, 48) (0.7391, 0.7083, 0.7234)
B-PRT: (79, 103, 106) (0.7670, 0.7453, 0.7560)
B-LST: (0, 0, 5) (0.0000, 0.0000, 0.0000)
B-INTJ: (1, 2, 2) (0.5000, 0.5000, 0.5000)
I-INTJ: (0, 0, 0) (******, ******, ******)
B-CONJP: (5, 9, 9) (0.5556, 0.5556, 0.5556)
I-CONJP: (10, 13, 13) (0.7692, 0.7692, 0.7692)
I-PRT: (0, 0, 0) (******, ******, ******)
B-UCP: (0, 0, 0) (******, ******, ******)
I-UCP: (0, 2, 0) (******, ******, ******)
I-LST: (0, 0, 2) (0.0000, 0.0000, 0.0000)
Macro-average precision, recall, F1: (0.567818, 0.571426, 0.564693)
Item accuracy: 45313 / 47377 (0.9564)
Instance accuracy: 1134 / 2012 (0.5636)
L-BFGS terminated with error code (-1002)
Total seconds required for L-BFGS: 271.120
Storing the model
Number of active features: 456482 (456482)
Number of active attributes: 338547 (390781)
Number of active labels: 23 (23)
Writing labels
Writing attributes
Writing feature references for transitions
Writing feature references for attributes
Seconds required: 0.530
End time of the training: 2009-03-07T16:03:02Z
The log reports that the CRF model obtained from the training data achieved 95.6% item accuracy (45313 of 47377 items) on the test data.
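The per-label figures in the log can be recomputed from the three counts (#match, #model, #ref). The sketch below uses the standard definitions of precision, recall, and F1; the function name is ours, and mapping zero denominators to 0.0 is a simplification that differs from the "******" placeholders CRFsuite prints.

```python
from typing import Tuple

def precision_recall_f1(match: int, model: int, ref: int) -> Tuple[float, float, float]:
    """match: correct predictions; model: predictions made; ref: gold occurrences."""
    precision = match / model if model else 0.0
    recall = match / ref if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Counts for B-NP at iteration #82 and the item accuracy from the same log.
p, r, f1 = precision_recall_f1(12000, 12403, 12422)
item_accuracy = 45313 / 47377
print(f"{p:.4f} {r:.4f} {f1:.4f} {item_accuracy:.4f}")
```

Rounded to four digits, these values agree with the B-NP line (0.9675, 0.9660, 0.9668) and the item accuracy (0.9564) reported for iteration #82.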
You can apply the CRF model to assign chunk labels to the test data. Even though the test data distributed by the CoNLL 2000 shared task has chunk labels annotated (for evaluation purposes), CRFsuite ignores the existing labels and outputs the label sequences predicted by the model (one label per line; sequences delimited by empty lines).
$ cat test.crfsuite.txt
B-NP U00= U01= U02=Rockwell U03=International U04=Corp. U05=/Rockwell U06=Rockwell/International U10= U11= U12=NNP U13=NNP U14=NNP U15=/ U16=/NNP U17=NNP/NNP U18=NNP/NNP U20=//NNP U21=/NNP/NNP U22=NNP/NNP/NNP
I-NP U00= U01=Rockwell U02=International U03=Corp. U04='s U05=Rockwell/International U06=International/Corp. U10= U11=NNP U12=NNP U13=NNP U14=POS U15=/NNP U16=NNP/NNP U17=NNP/NNP U18=NNP/POS U20=/NNP/NNP U21=NNP/NNP/NNP U22=NNP/NNP/POS
I-NP U00=Rockwell U01=International U02=Corp. U03='s U04=Tulsa U05=International/Corp. U06=Corp./'s U10=NNP U11=NNP U12=NNP U13=POS U14=NNP U15=NNP/NNP U16=NNP/NNP U17=NNP/POS U18=POS/NNP U20=NNP/NNP/NNP U21=NNP/NNP/POS U22=NNP/POS/NNP
B-NP U00=International U01=Corp. U02='s U03=Tulsa U04=unit U05=Corp./'s U06='s/Tulsa U10=NNP U11=NNP U12=POS U13=NNP U14=NN U15=NNP/NNP U16=NNP/POS U17=POS/NNP U18=NNP/NN U20=NNP/NNP/POS U21=NNP/POS/NNP U22=POS/NNP/NN
I-NP U00=Corp. U01='s U02=Tulsa U03=unit U04=said U05='s/Tulsa U06=Tulsa/unit U10=NNP U11=POS U12=NNP U13=NN U14=VBD U15=NNP/POS U16=POS/NNP U17=NNP/NN U18=NN/VBD U20=NNP/POS/NNP U21=POS/NNP/NN U22=NNP/NN/VBD
I-NP U00='s U01=Tulsa U02=unit U03=said U04=it U05=Tulsa/unit U06=unit/said U10=POS U11=NNP U12=NN U13=VBD U14=PRP U15=POS/NNP U16=NNP/NN U17=NN/VBD U18=VBD/PRP U20=POS/NNP/NN U21=NNP/NN/VBD U22=NN/VBD/PRP
B-VP U00=Tulsa U01=unit U02=said U03=it U04=signed U05=unit/said U06=said/it U10=NNP U11=NN U12=VBD U13=PRP U14=VBD U15=NNP/NN U16=NN/VBD U17=VBD/PRP U18=PRP/VBD U20=NNP/NN/VBD U21=NN/VBD/PRP U22=VBD/PRP/VBD
... (snip) ...
$ crfsuite tag -m CoNLL2000.model test.crfsuite.txt
B-NP
I-NP
I-NP
B-NP
I-NP
I-NP
B-VP
B-NP
B-VP
B-NP
I-NP
I-NP
B-VP
B-NP
I-NP
B-PP
B-NP
I-NP
B-VP
I-VP
B-NP
I-NP
B-PP
... (snip) ...
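The predicted labels use the B-/I-/O scheme, so turning them into chunks takes one more small step. The sketch below is our own illustration, not part of CRFsuite: the function names are hypothetical, and an I- label that does not continue a chunk of the same type is leniently treated as the start of a new chunk.

```python
from typing import Iterable, List, Tuple

def split_sequences(lines: Iterable[str]) -> List[List[str]]:
    """Split tagger output (one label per line) into sequences at empty lines."""
    sequences: List[List[str]] = []
    current: List[str] = []
    for line in lines:
        label = line.strip()
        if label:
            current.append(label)
        elif current:
            sequences.append(current)
            current = []
    if current:
        sequences.append(current)
    return sequences

def bio_to_chunks(labels: List[str]) -> List[Tuple[str, int, int]]:
    """Convert B-/I-/O labels into (type, start, end) spans, end exclusive."""
    chunks: List[Tuple[str, int, int]] = []
    start, kind = None, None
    for i, label in enumerate(labels + ["O"]):  # sentinel closes the last chunk
        prefix, _, name = label.partition("-")
        continues = prefix == "I" and name == kind
        if kind is not None and not continues:
            chunks.append((kind, start, i))
            start, kind = None, None
        if prefix == "B" or (prefix == "I" and kind is None):
            start, kind = i, name
    return chunks

# The first seven predicted labels, aligned with the first seven test items.
tokens = ["Rockwell", "International", "Corp.", "'s", "Tulsa", "unit", "said"]
output = "B-NP\nI-NP\nI-NP\nB-NP\nI-NP\nI-NP\nB-VP\n"
labels = split_sequences(output.splitlines())[0]
chunks = bio_to_chunks(labels)
for kind, start, end in chunks:
    print(kind, " ".join(tokens[start:end]))
```

With these seven labels the sketch yields two noun phrases ("Rockwell International Corp." and "'s Tulsa unit") followed by the verb phrase "said".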
CRFsuite can also evaluate the CRF model on labeled test data with the "-qt" options.
$ crfsuite tag -qt -m CoNLL2000.model test.crfsuite.txt
CRFsuite 0.6 Copyright (c) 2007-2009 Naoaki Okazaki
Performance by label (#match, #model, #ref) (precision, recall, F1):
B-NP: (11997, 12400, 12422) (0.9675, 0.9658, 0.9666)
B-PP: (4699, 4854, 4811) (0.9681, 0.9767, 0.9724)
I-NP: (13931, 14444, 14376) (0.9645, 0.9690, 0.9668)
B-VP: (4459, 4668, 4658) (0.9552, 0.9573, 0.9563)
I-VP: (2526, 2656, 2646) (0.9511, 0.9546, 0.9528)
B-SBAR: (452, 518, 535) (0.8726, 0.8449, 0.8585)
O: (5941, 6149, 6180) (0.9662, 0.9613, 0.9637)
B-ADJP: (313, 403, 438) (0.7767, 0.7146, 0.7444)
B-ADVP: (702, 861, 866) (0.8153, 0.8106, 0.8130)
I-ADVP: (49, 75, 89) (0.6533, 0.5506, 0.5976)
I-ADJP: (110, 156, 167) (0.7051, 0.6587, 0.6811)
I-SBAR: (2, 15, 4) (0.1333, 0.5000, 0.2105)
I-PP: (34, 46, 48) (0.7391, 0.7083, 0.7234)
B-PRT: (79, 103, 106) (0.7670, 0.7453, 0.7560)
B-LST: (0, 0, 5) (0.0000, 0.0000, 0.0000)
B-INTJ: (1, 2, 2) (0.5000, 0.5000, 0.5000)
I-INTJ: (0, 0, 0) (******, ******, ******)
B-CONJP: (5, 9, 9) (0.5556, 0.5556, 0.5556)
I-CONJP: (10, 13, 13) (0.7692, 0.7692, 0.7692)
I-PRT: (0, 0, 0) (******, ******, ******)
B-UCP: (0, 1, 0) (******, ******, ******)
I-UCP: (0, 4, 0) (******, ******, ******)
I-LST: (0, 0, 2) (0.0000, 0.0000, 0.0000)
Macro-average precision, recall, F1: (0.567817, 0.571415, 0.564688)
Item accuracy: 45310 / 47377 (0.9564)
Instance accuracy: 1131 / 2012 (0.5621)
Elapsed time: 0.840000 [sec] (2395.2 [instance/sec])
When we try to improve the accuracy of a CRF model by tweaking the feature set, it may be useful to see the feature weights assigned by the trainer. You cannot simply read the model file because CRFsuite stores models in a binary format for efficiency. Instead, use the dump command to print a model in plain text format.
$ crfsuite dump CoNLL2000.model
FILEHEADER = {
magic: lCRF
size: 28242501
type: FOMC
version: 100
num_features: 0
num_labels: 23
num_attrs: 338547
off_features: 0x30
off_labels: 0x8B4EE4
off_attrs: 0x8B5A0C
off_labelrefs: 0x169C145
off_attrrefs: 0x169C515
}
LABELS = {
0: B-NP
1: B-PP
2: I-NP
3: B-VP
4: I-VP
5: B-SBAR
6: O
7: B-ADJP
8: B-ADVP
9: I-ADVP
10: I-ADJP
11: I-SBAR
12: I-PP
13: B-PRT
14: B-LST
15: B-INTJ
16: I-INTJ
17: B-CONJP
18: I-CONJP
19: I-PRT
20: B-UCP
21: I-UCP
22: I-LST
}
ATTRIBUTES = {
0: U00=
1: U01=
2: U02=Confidence
3: U03=in
4: U04=the
5: U05=/Confidence
6: U06=Confidence/in
7: U10=
... (snip) ...
}
TRANSITIONS = {
(1) B-NP --> B-NP: 2.327985
(1) B-NP --> B-PP: 4.391125
(1) B-NP --> I-NP: 30.372649
(1) B-NP --> B-VP: 7.725525
(1) B-NP --> B-SBAR: 1.821388
(1) B-NP --> O: 3.805715
(1) B-NP --> B-ADJP: 4.801651
(1) B-NP --> B-ADVP: 3.842473
... (snip) ...
}
TRANSITIONS_FROM_BOS = {
(2) BOS --> B-NP: 17.875605
(2) BOS --> B-PP: -0.318745
(2) BOS --> I-NP: -4.387101
(2) BOS --> B-VP: -0.383031
(2) BOS --> I-VP: -1.163315
(2) BOS --> B-SBAR: 1.368176
(2) BOS --> O: 2.783132
... (snip) ...
}
TRANSITIONS_TO_EOS = {
(3) B-NP --> EOS: 16.156051
(3) B-PP --> EOS: -1.045312
(3) I-NP --> EOS: -2.762051
(3) B-VP --> EOS: -0.767247
(3) I-VP --> EOS: -1.113502
(3) B-SBAR --> EOS: -2.407145
(3) O --> EOS: 4.131429
... (snip) ...
}
STATE_FEATURES = {
(0) U00= --> B-NP: -2.622045
(0) U00= --> B-PP: -1.562976
(0) U00= --> I-NP: -2.555526
(0) U00= --> B-VP: -1.329829
(0) U00= --> I-VP: -1.152970
(0) U00= --> B-SBAR: -2.590170
(0) U00= --> O: -1.584688
(0) U00= --> B-ADJP: -1.526879
... (snip) ...
}
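Because the dump is plain text, it is easy to post-process, for example to find the features with the largest weights. The Python sketch below parses feature lines of the form shown above; the regular expression is an assumption based only on this excerpt (the number in parentheses is kept as an opaque type code), and the function name parse_features is hypothetical.

```python
import re
from typing import List, Tuple

# "(type) source --> target: weight", as in the dump excerpt (assumed layout).
FEATURE_RE = re.compile(r"^\((\d+)\)\s+(.+?)\s+-->\s+(\S+):\s+(-?\d+\.\d+)$")

def parse_features(dump_text: str) -> List[Tuple[int, str, str, float]]:
    """Extract (type code, source, target, weight) from dump output."""
    features = []
    for line in dump_text.splitlines():
        m = FEATURE_RE.match(line.strip())
        if m:  # section headers, braces, and "(snip)" lines are skipped
            kind, source, target, weight = m.groups()
            features.append((int(kind), source, target, float(weight)))
    return features

# A few lines copied from the dump above.
excerpt = """TRANSITIONS = {
(1) B-NP --> B-NP: 2.327985
(1) B-NP --> I-NP: 30.372649
(1) B-NP --> B-VP: 7.725525
}
STATE_FEATURES = {
(0) U00= --> B-NP: -2.622045
(0) U00= --> B-PP: -1.562976
}
"""
features = parse_features(excerpt)
top = max(features, key=lambda f: abs(f[3]))
print(top)
```

On this excerpt the feature with the largest absolute weight is the transition from B-NP to I-NP. For a full model, the text could be obtained by capturing the output of "crfsuite dump CoNLL2000.model", e.g. with subprocess.run from the Python standard library.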