Using CRF in TensorFlow

Background: CRF was one of the mainstream approaches to NLP sequence modeling, especially before deep learning; even now that deep learning dominates, CRF still shows up in models such as BiLSTM+CRF and BERT+CRF. For the theory behind CRF there are plenty of resources elsewhere; this article gives a brief overview of how to use CRF in TensorFlow.

Main APIs for using CRF

  • crf_log_likelihood

    def crf_log_likelihood(inputs,
                           tag_indices,
                           sequence_lengths,
                           transition_params=None):
      """Computes the log-likelihood of tag sequences in a CRF.
      Args:
        inputs: A [batch_size, max_seq_len, num_tags] tensor of unary potentials
          to use as input to the CRF layer.
        tag_indices: A [batch_size, max_seq_len] matrix of tag indices for which we
          compute the log-likelihood.
        sequence_lengths: A [batch_size] vector of true sequence lengths.
        transition_params: A [num_tags, num_tags] transition matrix, if available.
      Returns:
        log_likelihood: A [batch_size] `Tensor` containing the log-likelihood of
          each example, given the sequence of tag indices.
        transition_params: A [num_tags, num_tags] transition matrix. This is either
          provided by the caller or created in this function.
      """

    Inputs:

    • inputs: the unary potential scores, i.e. the predicted score of every tag at every word position; a tensor of shape [batch_size, max_seq_len, num_tags]
    • tag_indices: the gold tag sequence, a [batch_size, max_seq_len] tensor
    • sequence_lengths: the actual length of each sequence, a [batch_size] vector
    • transition_params: the tag transition matrix, a [num_tags, num_tags] parameter matrix that is learned; it can also be supplied in advance

    Outputs:

    • log_likelihood: the log-likelihood of the gold tag sequence for each example, a [batch_size] tensor
    • transition_params: the learned transition matrix
  • Two APIs available for decoding:

    • tf.contrib.crf.viterbi_decode(tf_unary_scores, tf_transition_params)
    • tf.contrib.crf.crf_decode(unary_scores, transition_params, sequence_lengths)
  • A concrete example:

    import numpy as np
    import tensorflow as tf

    # Data settings.
    num_examples = 10
    num_words = 20
    num_features = 100
    num_tags = 5

    # Random features.
    x = np.random.rand(num_examples, num_words, num_features).astype(np.float32)
    # Random tag indices representing the gold sequence.
    y = np.random.randint(num_tags, size=[num_examples, num_words]).astype(np.int32)
    # All sequences in this example have the same length, but they can be variable in a real model.
    sequence_lengths = np.full(num_examples, num_words - 1, dtype=np.int32)

    # Train and evaluate the model.
    with tf.Graph().as_default():
      with tf.Session() as session:
        # Add the data to the TensorFlow graph.
        x_t = tf.constant(x)
        y_t = tf.constant(y)
        sequence_lengths_t = tf.constant(sequence_lengths)

        # Compute unary scores from a linear layer.
        weights = tf.get_variable("weights", [num_features, num_tags])
        matricized_x_t = tf.reshape(x_t, [-1, num_features])
        matricized_unary_scores = tf.matmul(matricized_x_t, weights)
        unary_scores = tf.reshape(matricized_unary_scores,
                                  [num_examples, num_words, num_tags])

        # Compute the log-likelihood of the gold sequences and keep the transition
        # params for inference at test time.
        log_likelihood, transition_params = tf.contrib.crf.crf_log_likelihood(
            unary_scores, y_t, sequence_lengths_t)

        # Compute the viterbi sequence and score.
        viterbi_sequence, viterbi_score = tf.contrib.crf.crf_decode(
            unary_scores, transition_params, sequence_lengths_t)

        # Add a training op to tune the parameters.
        loss = tf.reduce_mean(-log_likelihood)
        train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

        session.run(tf.global_variables_initializer())

        mask = (np.expand_dims(np.arange(num_words), axis=0) <
                np.expand_dims(sequence_lengths, axis=1))
        total_labels = np.sum(sequence_lengths)

        # Train for a fixed number of iterations.
        for i in range(1000):
          tf_viterbi_sequence, _ = session.run([viterbi_sequence, train_op])
          if i % 100 == 0:
            correct_labels = np.sum((y == tf_viterbi_sequence) * mask)
            accuracy = 100.0 * correct_labels / float(total_labels)
            print("Accuracy: %.2f%%" % accuracy)

CRF Encoding/Training Logic

  • Get the representation: take the raw model output, e.g. the representation produced by an LSTM or BERT, a tensor of shape <sequence length, hidden size>.

  • Compute the score of each word / the global unary potential: add a projection layer that computes a score for each word; the output has shape <sequence length, number of tags> (see the sketch after this list).

  • Train: run maximum likelihood training with crf_log_likelihood. The tag transition matrix is a parameter matrix that is learned here (it can also be supplied in advance); the call returns the log_likelihood and the tag transition matrix (used for decoding):

    log_likelihood, transition_params = tf.contrib.crf.crf_log_likelihood(
        unary_scores, gold_tags, sequence_lengths, trans)

  • Compute the loss as the mean negative log-likelihood over the batch:

    loss = tf.reduce_mean(-log_likelihood)

  • Pick a concrete optimization algorithm to learn the parameters:

    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
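
The following is a minimal sketch that ties these steps together, assuming a BiLSTM encoder; the placeholder names and sizes (embeddings, lstm_hidden, num_tags, etc.) are illustrative assumptions, not part of the original example:

import tensorflow as tf

# Assumed sizes (illustrative only).
max_seq_len, embed_dim, lstm_hidden, num_tags = 50, 128, 100, 5

embeddings = tf.placeholder(tf.float32, [None, max_seq_len, embed_dim])
gold_tags = tf.placeholder(tf.int32, [None, max_seq_len])
sequence_lengths = tf.placeholder(tf.int32, [None])

# Step 1: get the representation, here from a BiLSTM.
cell_fw = tf.nn.rnn_cell.LSTMCell(lstm_hidden)
cell_bw = tf.nn.rnn_cell.LSTMCell(lstm_hidden)
(out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
    cell_fw, cell_bw, embeddings,
    sequence_length=sequence_lengths, dtype=tf.float32)
hidden = tf.concat([out_fw, out_bw], axis=-1)  # [batch, max_seq_len, 2*lstm_hidden]

# Step 2: projection layer, giving per-word unary scores
# of shape [batch, max_seq_len, num_tags].
unary_scores = tf.layers.dense(hidden, num_tags)

# Remaining steps: CRF log-likelihood (the transition matrix is created
# and learned inside the call), loss, and a training op.
log_likelihood, transition_params = tf.contrib.crf.crf_log_likelihood(
    unary_scores, gold_tags, sequence_lengths)
loss = tf.reduce_mean(-log_likelihood)
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)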

CRF Decoding/Testing Logic

There are two options: decode with crf_decode using TF code inside the graph, or decode with viterbi_decode using NumPy code outside the graph.

The NumPy decoding style:

tf_unary_scores, tf_sequence_lengths, tf_transition_params, _ = session.run(
    [unary_scores, sequence_lengths, transition_params, train_op])
for tf_unary_scores_, tf_sequence_length_ in zip(tf_unary_scores, tf_sequence_lengths):
  # Remove padding.
  tf_unary_scores_ = tf_unary_scores_[:tf_sequence_length_]
  # Compute the highest score and its tag sequence.
  tf_viterbi_sequence, tf_viterbi_score = tf.contrib.crf.viterbi_decode(
      tf_unary_scores_, tf_transition_params)

The TF style:

viterbi_sequence, viterbi_score = tf.contrib.crf.crf_decode(
    unary_scores, transition_params, sequence_lengths)
tf_viterbi_sequence, tf_viterbi_score, _ = session.run(
    [viterbi_sequence, viterbi_score, train_op])

Main logic:

  • Obtain the tag transition parameters transition_params from the CRF forward pass.

    log_likelihood, transition_params = tf.contrib.crf.crf_log_likelihood(
        unary_scores, gold_tags, sequence_lengths)

  • Use either decoding style introduced above; in both cases the inputs are transition_params and the unary potential scores unary_scores, and the output is the decoded tag sequence.

Key points:

1. transition_params always comes from the second return value of crf_log_likelihood:

log_likelihood, transition_params = tf.contrib.crf.crf_log_likelihood(
    unary_scores, gold_tags, sequence_lengths)

2. unary_scores is the input score for each word (the unary potential score), with shape <sequence length, number of tags>. It usually comes out of a projection/fully connected layer, for example (see the sketch after this list):

  • BiLSTM: the output is <sequence length, 2x hidden size>, so a projection layer of <2x hidden size, number of tags> is needed to turn it into a per-word <sequence length, number of tags> tensor.
  • BERT: the output is <sequence length, hidden size 768>, so a projection layer of <hidden size, number of tags> is needed to turn it into a per-word <sequence length, number of tags> tensor.
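
A minimal sketch of the BERT case, assuming an encoder output tensor named bert_output (the name and sizes are assumptions for illustration):

import tensorflow as tf

num_tags = 5
# Assumed BERT-style encoder output: [batch, seq_len, 768].
bert_output = tf.placeholder(tf.float32, [None, 128, 768])

# Projection layer <hidden size, number of tags>: 768 -> num_tags,
# yielding per-word unary scores of shape [batch, seq_len, num_tags].
unary_scores = tf.layers.dense(bert_output, num_tags)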

3. Both the training process and the testing/decoding process need the actual sequence lengths as a parameter. This is mainly because the inputs are padded to the maximum length, and the padded part has to be excluded when computing evaluation statistics, as sketched below.
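
A minimal NumPy sketch of recovering the actual lengths from padded inputs and building the evaluation mask (pad_id and word_ids are illustrative assumptions):

import numpy as np

pad_id = 0
# Padded word-id matrix [batch, max_seq_len]; 0 marks padding (assumption).
word_ids = np.array([[4, 2, 7, 0, 0],
                     [3, 9, 0, 0, 0]])

# Actual length of each sequence = number of non-pad positions.
sequence_lengths = np.sum(word_ids != pad_id, axis=1).astype(np.int32)  # [3, 2]

# Mask that is True only inside the real sequence; multiplying the
# per-position correctness by this mask drops the padded part, as in
# the accuracy computation of the full example above.
max_seq_len = word_ids.shape[1]
mask = (np.expand_dims(np.arange(max_seq_len), axis=0) <
        np.expand_dims(sequence_lengths, axis=1))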