Classias

A collection of machine-learning algorithms for classification

Introduction

Classias is a collection of machine-learning algorithms for classification. Currently, it supports the following formalizations:

  • L1/L2-regularized logistic regression (a.k.a. Maximum Entropy)
  • L1/L2-regularized L1-loss linear-kernel Support Vector Machine (SVM)
  • Averaged perceptron
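
As a rough illustration of the first two formalizations in the binary case, the sketch below evaluates the logistic loss, the L1-loss (hinge) SVM loss, and the logistic-regression probability for a label in {-1, +1} and a linear score. The function names are hypothetical and are not part of the Classias API; multi-class and regularization terms are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helpers for illustration; not the Classias API.
// y is the gold label (+1 or -1); score is the inner product of the
// weight vector and the feature vector.

// Logistic loss: log(1 + exp(-y * score)), computed without overflow.
inline double logistic_loss(int y, double score)
{
    assert(y == 1 || y == -1);
    const double m = y * score;  // margin
    // For negative margins exp(-m) can overflow, so use the identity
    // log(1 + exp(-m)) = -m + log(1 + exp(m)).
    return (m >= 0) ? std::log1p(std::exp(-m))
                    : -m + std::log1p(std::exp(m));
}

// L1-loss (hinge) SVM loss: max(0, 1 - y * score).
inline double hinge_loss(int y, double score)
{
    assert(y == 1 || y == -1);
    return std::max(0.0, 1.0 - y * score);
}

// Conditional probability P(y = +1 | x) under binary logistic regression.
inline double probability_positive(double score)
{
    return 1.0 / (1.0 + std::exp(-score));
}
```

The hinge loss is exactly zero once the margin reaches 1, whereas the logistic loss only approaches zero; this is also why a probability estimate is meaningful only for the logistic model.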

It implements several algorithms for training classifiers:

  • Averaged perceptron
  • Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) [Nocedal80]
  • Orthant-Wise Limited-memory Quasi-Newton (OWL-QN) [Andrew07]
  • Primal Estimated sub-GrAdient SOlver (Pegasos) [Shalev-Shwartz07]
  • Truncated Gradient [Langford09], also known as FOrward LOoking Subgradient (FOLOS) [Duchi09] specialized for L1 regularization
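
For intuition about the last item, the simplest form of an online L1-regularized update can be sketched as a gradient step followed by shrinking each weight toward zero and clipping it at zero. This is a minimal sketch under assumptions: the function name, the dense weight vector, and the parameters eta (learning rate) and lambda (regularization coefficient) are hypothetical, and the code does not reflect how Classias itself implements the algorithm.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration; not the Classias implementation.
// One online update with L1 regularization:
//   1. (sub)gradient step:  w_i <- w_i - eta * g_i
//   2. truncation:          move w_i toward zero by eta * lambda,
//                           setting it to zero if it would cross zero.
inline void l1_truncated_step(std::vector<double>& w,
                              const std::vector<double>& grad,
                              double eta, double lambda)
{
    assert(w.size() == grad.size());
    const double shrink = eta * lambda;
    for (std::size_t i = 0; i < w.size(); ++i) {
        double v = w[i] - eta * grad[i];
        if (v > shrink)       v -= shrink;
        else if (v < -shrink) v += shrink;
        else                  v = 0.0;  // weight becomes exactly zero
        w[i] = v;
    }
}
```

Because small weights are set to exactly zero rather than merely made small, updates of this kind tend to produce sparse models.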

The core library of Classias has the following features:

  • C++ header implementation. Core algorithms are implemented in C++ header files; the Classias library can be used simply by including a few headers in your source code.
  • Simple design. The source code of the core implementation is well structured and reusable; it provides components such as loss functions, instance data structures, feature generators, online training algorithms, batch training algorithms, performance counters, and parameter exchangers. Applications can be written easily on top of these components. Refer to the programming documentation for some sample programs.

The command-line utilities have the following features:

  • Simple data I/O format. The data format is compatible with that of existing SVM tools (e.g., libsvm and liblinear). In addition to integer feature identifiers, Classias supports strings as feature identifiers. The token separator (space by default) and item-value separator (colon by default) are configurable.
  • Support for gzip/bzip2/xz compressed files. Classias can read data sets compressed by gzip, bzip2, and xz.
  • Performance evaluation. Classias can output the precision, recall, and F1 scores of a model evaluated on test data. Classias also supports automatic data splitting for n-fold cross validation.
  • Probability estimate. Classias can compute the conditional probability of the label for a given instance (valid only for models trained by logistic regression).
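
To make the data format described above concrete, the following sketch parses one line consisting of a label followed by feature tokens, with a configurable token separator and item-value separator. It is an illustration only: the Instance type and parse_line function are hypothetical and not part of Classias, and the handling of a feature without a value (treated here as 1) is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdlib>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types and function for illustration; not the Classias API.
struct Instance {
    std::string label;
    std::vector<std::pair<std::string, double> > features;  // (name, value)
};

// Parses one line such as "+1 word=good:1 3:0.5".
// token_sep separates fields; value_sep separates a feature name
// from its value. Feature names may be integers or arbitrary strings.
inline Instance parse_line(const std::string& line,
                           char token_sep = ' ', char value_sep = ':')
{
    assert(token_sep != value_sep);
    Instance inst;
    std::istringstream stream(line);
    std::string token;
    bool first = true;
    while (std::getline(stream, token, token_sep)) {
        if (token.empty()) continue;   // skip repeated separators
        if (first) {                   // the first field is the label
            inst.label = token;
            first = false;
            continue;
        }
        // Split the token at the last value separator.
        const std::string::size_type pos = token.rfind(value_sep);
        if (pos == std::string::npos) {
            // Assumption of this sketch: a missing value means 1.
            inst.features.push_back(std::make_pair(token, 1.0));
        } else {
            const double value =
                std::strtod(token.c_str() + pos + 1, nullptr);
            inst.features.push_back(
                std::make_pair(token.substr(0, pos), value));
        }
    }
    return inst;
}
```

With the defaults, parse_line("+1 word=good:1 3:0.5") yields the label "+1" and two features; passing '\t' and '=' as the separators would instead read a tab-separated line whose values follow an equals sign.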

For more information about Classias, please refer to these pages.

Download

The current release is Classias version 1.1.

Classias is distributed under the modified BSD license.

Please use the following BibTeX entry when you cite Classias in your papers.

@misc{Classias,
	author = {Naoaki Okazaki},
	title = {Classias: a collection of machine-learning algorithms for classification},
	url = {http://www.chokkan.org/software/classias/},
	year = {2009}
}

Change log

Classias 1.1 (2009-12-28)
  • [classias-tag] Implemented false analyses (-f).
  • [classias-tag] Added an option (-r) that outputs the reference labels in the given data together with predicted labels.
  • [classias-tag] Added an option (-a) that outputs all candidates for each instance.
  • [classias-tag] Fixed a crash problem with specific data (Many thanks to Hiromi Wakaki).
  • [classias-train] Fixed a crash problem when an input file does not exist (Thanks to He Tan).
  • Numerous minor fixes and tunings.
Classias 1.0 (2009-09-27)
  • Initial release.

References

[Andrew07] Galen Andrew and Jianfeng Gao. “Scalable training of L1-regularized log-linear models”. Proceedings of the 24th International Conference on Machine Learning (ICML 2007). 33-40. 2007.

[Duchi09] John Duchi and Yoram Singer. “Online and Batch Learning using Forward Looking Subgradients”. (to appear). 2009.

[Langford09] John Langford, Lihong Li, and Tong Zhang. “Sparse Online Learning via Truncated Gradient”. Journal of Machine Learning Research. 10. Mar. 777-801. 2009.

[Nocedal80] Jorge Nocedal. “Updating Quasi-Newton Matrices with Limited Storage”. Mathematics of Computation. 35. 151. 773-782. 1980.

[Shalev-Shwartz07] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. “Pegasos: Primal Estimated sub-GrAdient SOlver for SVM”. Proceedings of the 24th International Conference on Machine Learning (ICML 2007). 807-814. 2007.