crfChain
Mark Schmidt, Kevin Swersky (2008)
This package is a set of Matlab functions for chain-structured conditional
random fields (CRFs) with categorical features. The code implements decoding (with the Viterbi algorithm), inference (with the forwards-backwards algorithm), sampling (with the forwards-filter backwards-sample algorithm), and parameter estimation (with a limited-memory quasi-Newton algorithm) in these models. Several of the functions have been implemented in C as mex files to speed up calculations.
The functions use a data matrix X and a label vector y. Each element i of y gives the label at position i, among the values 0-k (for some k). The label 0 is special: it indicates the boundary between 'sentences'. Each element (i,j) of the data matrix X gives the value of feature j at position i, among the values 0-f (for some f that can be different for each feature). A feature value of 0 is also special: it indicates that the feature is ignored at that position (in natural language applications this is used to represent words that do not occur very often). The features at positions whose label is 0 are ignored.
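As a concrete illustration of this format, here is a small Python sketch (the package itself is Matlab; the array contents and the helper name split_sentences are made up for illustration, and indices here are 0-based) showing how label 0 separates sentences:

```python
import numpy as np

# Hypothetical toy data in the crfChain format: 6 positions, 2 features.
# Label 0 marks the boundary between sentences; feature value 0 means
# "feature ignored at this position".
y = np.array([1, 2, 2, 0, 1, 3])          # two sentences: [1,2,2] and [1,3]
X = np.array([[1, 2],
              [3, 0],                      # feature 2 ignored here
              [2, 1],
              [0, 0],                      # boundary row (label 0, ignored)
              [1, 1],
              [2, 2]])

def split_sentences(y):
    """Recover the (start, end) index pairs of each sentence, where
    sentences are separated by label 0 (analogous to what
    crfChain_initSentences computes in the Matlab package)."""
    sentences, start = [], 0
    for i, label in enumerate(y):
        if label == 0:
            if i > start:
                sentences.append((start, i))
            start = i + 1
    if start < len(y):
        sentences.append((start, len(y)))
    return sentences

print(split_sentences(y))  # [(0, 3), (4, 6)]
```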
The model uses four sets of parameters:
- w(i,j,k): the potential of state j given that feature i takes the value k.
- v_start(j): the potential of state j at the start of a sentence.
- v_end(j): the potential of state j at the end of a sentence.
- v(i,j): the potential of a transition from state i to state j.
The specific model for a single sentence of length s where each word has
f features is:
p(y_1:s | x_1:s, w, v_start, v_end, v) =
(1/Z) exp( Σ_{i=1:s} Σ_{j=1:f} w(j, y_i, x_{i,j}) + v_start(y_1) + v_end(y_s) + Σ_{i=1:s-1} v(y_i, y_{i+1}) )
The value of Z is chosen so that the sum over all assignments of y_1:s is equal to one. To add a bias for each label, you can append a column of ones to X.
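To make the role of Z concrete, here is an illustrative Python sketch (not part of the package; the parameter values are made up, and state/feature indices are 0-based) that evaluates the potential of every label sequence for a tiny chain by brute force and checks that the normalized probabilities sum to one:

```python
import itertools
import math

# Made-up parameters for 2 states and a single feature acting as a bias.
nStates = 2
w = [[0.5, -0.5]]                # w[feature][state]
v_start = [0.2, -0.2]            # start potentials
v_end = [0.1, -0.1]              # end potentials
v = [[1.0, -1.0], [-1.0, 1.0]]   # v[i][j]: transition potentials

def unnorm_logpot(labels):
    # node potentials + start/end potentials + transition potentials,
    # mirroring the exponent of the model above
    s = sum(w[0][yi] for yi in labels)
    s += v_start[labels[0]] + v_end[labels[-1]]
    s += sum(v[labels[i]][labels[i + 1]] for i in range(len(labels) - 1))
    return s

s_len = 3
Z = sum(math.exp(unnorm_logpot(ys))
        for ys in itertools.product(range(nStates), repeat=s_len))

def prob(labels):
    return math.exp(unnorm_logpot(labels)) / Z

# Probabilities over all 2^3 label sequences sum to one by construction.
total = sum(prob(ys) for ys in itertools.product(range(nStates), repeat=s_len))
print(round(total, 10))  # 1.0
```

Brute-force enumeration is exponential in sentence length; the package instead uses the forwards-backwards algorithm to obtain Z efficiently.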
Usage
Given X and y in the appropriate format, the data structures used by the code can be initialized using:
nStates = max(y);
nFeatures = max(X);
[w,v_start,v_end,v] = crfChain_initWeights(nFeatures,nStates,'zeros');
featureStart = cumsum([1 nFeatures(1:end)]);
sentences = crfChain_initSentences(y);
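As a sketch of what this initialization computes (in Python, for illustration only; the data values are made up): nFeatures holds the number of values each feature can take, and featureStart gives, for each feature, the offset of its first parameter in the flattened weight vector.

```python
import numpy as np

# Toy data matrix: 3 positions, 2 categorical features.
X = np.array([[1, 2],
              [3, 1],
              [2, 2]])

# Per-column maxima, analogous to max(X) on a matrix in Matlab.
nFeatures = X.max(axis=0)

# Offset of each feature's parameter block, analogous to
# cumsum([1 nFeatures]) in the Matlab snippet above.
featureStart = np.concatenate(([1], 1 + np.cumsum(nFeatures)))

print(nFeatures, featureStart)  # [3 2] [1 4 6]
```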
We can make the potential functions, do decoding, do inference, and
generate 10 conditional samples for a sentence s using:
[nodePot,edgePot]=crfChain_makePotentials(X,w,v_start,v_end,v,nFeatures,featureStart,sentences,s);
yViterbi = crfChain_decode(nodePot,edgePot);
[nodeBel,edgeBel,logZ] = crfChain_infer(nodePot,edgePot);
samples = crfChain_sample(nodePot,edgePot,10);
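For intuition about the decoding step, here is an illustrative Python sketch of Viterbi decoding from node and edge potentials (analogous to what crfChain_decode computes; the variable names mirror the Matlab interface but this code is not part of the package, and the potential values below are made up):

```python
import numpy as np

def viterbi_decode(nodePot, edgePot):
    """Return the most probable state sequence given nodePot
    (nNodes x nStates) and edgePot (nStates x nStates)."""
    nNodes, nStates = nodePot.shape
    delta = np.zeros((nNodes, nStates))      # best score ending in each state
    back = np.zeros((nNodes, nStates), dtype=int)
    delta[0] = nodePot[0]
    for t in range(1, nNodes):
        scores = delta[t - 1][:, None] * edgePot   # (previous, current)
        back[t] = scores.argmax(axis=0)            # best predecessor per state
        delta[t] = scores.max(axis=0) * nodePot[t]
    # Backtrack from the best final state.
    path = np.zeros(nNodes, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(nNodes - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

nodePot = np.array([[1.0, 4.0], [2.0, 1.0], [3.0, 1.0]])
edgePot = np.array([[2.0, 1.0], [1.0, 2.0]])
print(viterbi_decode(nodePot, edgePot))  # [1 0 0]
```

This multiplies raw potentials directly; for long chains an implementation would work in log-space (or rescale) to avoid underflow.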
We can compute the negative log-likelihood of a set of training sentences trainNdx using:
crfChain_loss([w(:);v_start;v_end;v(:)],X,y,nStates,nFeatures,featureStart,sentences(trainNdx,:));
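The loss above requires the log-partition function logZ of each sentence, which crfChain_infer returns. Here is an illustrative Python sketch (not part of the package; the potential values are made up) of the forward algorithm that computes logZ from node and edge potentials, with a brute-force check on a tiny chain:

```python
import itertools
import numpy as np

def forward_logZ(nodePot, edgePot):
    """Compute logZ by the forward recursion, rescaling alpha at each
    step to avoid underflow on long chains."""
    alpha = nodePot[0].astype(float)
    logZ = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for t in range(1, nodePot.shape[0]):
        alpha = (alpha @ edgePot) * nodePot[t]
        logZ += np.log(alpha.sum())
        alpha = alpha / alpha.sum()
    return logZ

def brute_logZ(nodePot, edgePot):
    """Exponential-time check: sum the potential of every sequence."""
    nNodes, nStates = nodePot.shape
    Z = 0.0
    for ys in itertools.product(range(nStates), repeat=nNodes):
        p = nodePot[0, ys[0]]
        for t in range(1, nNodes):
            p *= edgePot[ys[t - 1], ys[t]] * nodePot[t, ys[t]]
        Z += p
    return np.log(Z)

nodePot = np.array([[1.0, 2.0], [3.0, 1.0], [2.0, 2.0]])
edgePot = np.array([[1.0, 2.0], [2.0, 1.0]])
print(np.isclose(forward_logZ(nodePot, edgePot), brute_logZ(nodePot, edgePot)))  # True
```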
Subsequently we can train the model using this as an objective function:
[wv] = minFunc(@crfChain_loss,[w(:);v_start;v_end;v(:)],[],X,y,nStates,nFeatures,featureStart,sentences(trainNdx,:));
[w,v_start,v_end,v] = crfChain_splitWeights(wv,featureStart,nStates);
Example
Running 'example_crfChain' illustrates the usage of crfChain on a synthetic data set.
Download
The complete set of .m and .c files is available here.
Note that crfChain contains many sub-directories that must be present on the Matlab path for the files to work. You can add these sub-directories to the Matlab path by typing (in Matlab) 'addpath(genpath(crfChain_dir))', where 'crfChain_dir' is the directory that the zip file is extracted to.
The mex files for several architectures are included in the archive. For systems where the mex files are not included, you can compile the mex files by calling the 'mexAll' function (on Windows, this requires adding the MinGW add-on).
Citations
If you use this software in a publication, please cite the work using the following information:
- M. Schmidt, K. Swersky. crfChain: Matlab code for chain-structured conditional random fields with categorical features. http://www.cs.ubc.ca/~schmidtm/Software/crfChain.html, 2008.
Using crfChain with L1-Regularization
A demo of using the crfChain code with the L1General code to do L1-regularized parameter estimation (to encourage setting parameters to exactly 0) is available here.
Faster training with Stochastic Average Gradient
The SAG4CRF software contains several methods that can be substantially faster than L-BFGS for training CRFs.
More general features/structures
The UGM package has support for CRFs with continuous features, features on the edges, parameter tying, and general graph structures. The specific differences between UGM and crfChain are as follows:
- The 'Y' variable in UGM is a nSamples by nNodes matrix, and any graph structure can be assumed for the samples. In crfChain the graph structure is forced to be a chain, and the 'Y' variable is a vector that concatenates all the chains (requiring a '0' at the 'position' between chains). By always requiring the graph structure to be a chain, crfChain saves on some of the overhead required to deal with the graph structure.
- In UGM, the features 'X' are assumed to be real-valued, whereas in crfChain they are assumed categorical. In UGM you can simulate categorical variables: to simulate a categorical feature that takes 's' states, simply use 's' binary indicator variables. By requiring that categorical features are used, crfChain allows a more efficient representation when the number of categories is very large.
- In crfChain there are special potentials for the start and the end of the chain. This can be simulated in UGM by adding two extra indicator features, and UGM offers even more flexibility to do things of this nature.
- In crfChain we assume a full node and edge potential parameterization and that the edge features only depend on the labels. In contrast, UGM allows you to share parameters within the potentials, as well as have non-bias features on the edges.
- In crfChain we assume that all nodes and all edges share the same parameter, whereas UGM gives you the option to allow different nodes/edges to have different parameters.
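The indicator-variable simulation mentioned above can be sketched as follows (in Python, for illustration; the helper name one_hot and the data values are made up): a categorical feature taking 's' states becomes 's' binary columns of the kind a real-valued design matrix expects.

```python
import numpy as np

def one_hot(column, s):
    """Expand a categorical column with values in 1..s (0 = ignored)
    into s binary indicator columns; value 0 maps to an all-zero row."""
    out = np.zeros((len(column), s))
    for i, val in enumerate(column):
        if val > 0:
            out[i, val - 1] = 1.0
    return out

# One categorical feature over 3 states, with an ignored position (0).
print(one_hot([2, 0, 1, 3], 3))
```

Note that the expanded representation uses s columns per feature, which is why the categorical representation in crfChain is more compact when s is large.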