Conditional Random Fields (CRFs)

$$\hat{Y} = \arg\max_{Y \in \mathcal{Y}} P(Y|X)$$

However, the CRF does not compute a probability for each tag at each time step. Instead, at each time step the CRF computes log-linear functions over a set of relevant features, and these local features are aggregated and normalized to produce a global probability for the whole sequence.

In a CRF, the function $F$ maps an entire input sequence $X$ and an entire output sequence $Y$ to a feature vector. Let's assume we have $K$ features, with a weight $w_k$ for each feature $F_k$:

$$p(Y|X) = \frac{\exp\left(\sum_{k=1}^{K} w_k F_k(X,Y)\right)}{\sum_{Y' \in \mathcal{Y}} \exp\left(\sum_{k=1}^{K} w_k F_k(X,Y')\right)}$$

Here $\mathcal{Y}$ is the set of all possible output sequences, so the denominator normalizes over every candidate sequence $Y'$.

We'll call these $K$ functions $F_k(X,Y)$ global features, since each one is a property of the entire input sequence $X$ and output sequence $Y$. We compute them by decomposing into a sum of local features for each position $i$ in $Y$:

$$F_k(X,Y) = \sum_{i=1}^{n} f_k(y_{i-1}, y_i, X, i)$$

This restriction, that each local feature depends only on the current and previous output tokens $y_i$ and $y_{i-1}$, is what characterizes a linear-chain CRF. A general CRF allows a feature to make use of any output token, and is thus necessary for tasks in which the decision depends on distant output tokens. General CRFs require more complex inference, and are less commonly used for language processing.
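
To make the formulas concrete, the following is a minimal Python sketch of a linear-chain CRF scored by brute force. Everything specific in it is an assumption made for illustration and does not come from the text: the two-tag set, the `<s>` start symbol standing in for $y_0$, the two local feature functions, and the hand-picked weights. It shows how local features $f_k$ sum into global features $F_k$, and how the weighted score is normalized over all candidate sequences to give $p(Y|X)$.

```python
import itertools
import math

# Hypothetical tag set and start symbol (not from the text).
TAGS = ["N", "V"]
START = "<s>"  # plays the role of y_0 for the first position


# Two illustrative local features f_k(y_prev, y, X, i).
# Positions are 0-indexed here, unlike the 1-indexed formula.
def f_capitalized_noun(y_prev, y, X, i):
    """1 if the current word is capitalized and tagged N."""
    return 1.0 if X[i][0].isupper() and y == "N" else 0.0


def f_noun_then_verb(y_prev, y, X, i):
    """1 if the previous tag is N and the current tag is V."""
    return 1.0 if y_prev == "N" and y == "V" else 0.0


LOCAL_FEATURES = [f_capitalized_noun, f_noun_then_verb]
WEIGHTS = [1.5, 0.8]  # hand-picked w_k; a real CRF would learn these


def global_features(X, Y):
    """F_k(X, Y): sum each local feature f_k over all positions."""
    feats = []
    for f in LOCAL_FEATURES:
        total = 0.0
        for i in range(len(X)):
            y_prev = Y[i - 1] if i > 0 else START
            total += f(y_prev, Y[i], X, i)
        feats.append(total)
    return feats


def score(X, Y):
    """Unnormalized log score: sum_k w_k * F_k(X, Y)."""
    return sum(w * F for w, F in zip(WEIGHTS, global_features(X, Y)))


def all_sequences(X):
    """Enumerate every tag sequence of the same length as X."""
    return itertools.product(TAGS, repeat=len(X))


def sequence_probability(X, Y):
    """p(Y|X): exponentiated score divided by the sum over all Y'."""
    z = sum(math.exp(score(X, Yp)) for Yp in all_sequences(X))
    return math.exp(score(X, Y)) / z


def best_sequence(X):
    """argmax over Y; the normalizer is constant, so compare raw scores."""
    return max(all_sequences(X), key=lambda Y: score(X, Y))


if __name__ == "__main__":
    X = ["Janet", "runs"]
    for Y in all_sequences(X):
        print(Y, round(sequence_probability(X, Y), 3))
    print("best:", best_sequence(X))
```

Enumerating every sequence is exponential in the sentence length, so this is only workable for toy inputs. Because each local feature looks only at $y_{i-1}$ and $y_i$, a linear-chain CRF can instead use dynamic programming (the forward algorithm for the normalizer, Viterbi for the argmax).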

Updated 2026-05-10

Tags

Data Science