Using CRF in TensorFlow

Background: CRF was one of the mainstream approaches to NLP sequence modeling, especially before deep learning; even now that deep learning dominates, CRF still shows up in models such as BiLSTM+CRF and BERT+CRF. For the theory behind CRF there are plenty of resources elsewhere; this article gives a brief overview of how to use CRF in TensorFlow.

Main APIs for using CRF

  • crf_log_likelihood

    def crf_log_likelihood(inputs,
                           tag_indices,
                           sequence_lengths,
                           transition_params=None):
      """Computes the log-likelihood of tag sequences in a CRF.
      Args:
        inputs: A [batch_size, max_seq_len, num_tags] tensor of unary potentials
          to use as input to the CRF layer.
        tag_indices: A [batch_size, max_seq_len] matrix of tag indices for which we
          compute the log-likelihood.
        sequence_lengths: A [batch_size] vector of true sequence lengths.
        transition_params: A [num_tags, num_tags] transition matrix, if available.
      Returns:
        log_likelihood: A [batch_size] `Tensor` containing the log-likelihood of
          each example, given the sequence of tag indices.
        transition_params: A [num_tags, num_tags] transition matrix. This is either
          provided by the caller or created in this function.
      """

    Inputs:

    • inputs: the unary potential scores, i.e. the predicted score of every tag at every word position; a tensor of shape [batch_size, max_seq_len, num_tags]
    • tag_indices: the gold tag sequence, a [batch_size, max_seq_len] tensor
    • sequence_lengths: the actual length of each sequence, a [batch_size] vector
    • transition_params: the tag transition matrix, a [num_tags, num_tags] parameter matrix that is learned; it can also be supplied in advance

    Outputs:

    • log_likelihood: the log-likelihood of the gold tag sequence for each example, a [batch_size] tensor
    • transition_params: the learned transition matrix
  • Two APIs available for decoding:

    • tf.contrib.crf.viterbi_decode(tf_unary_scores, tf_transition_params)
    • tf.contrib.crf.crf_decode(unary_scores, transition_params, sequence_lengths)
  • A concrete example:

    import numpy as np
    import tensorflow as tf

    # Data settings.
    num_examples = 10
    num_words = 20
    num_features = 100
    num_tags = 5

    # Random features.
    x = np.random.rand(num_examples, num_words, num_features).astype(np.float32)
    # Random tag indices representing the gold sequence.
    y = np.random.randint(num_tags, size=[num_examples, num_words]).astype(np.int32)
    # All sequences in this example have the same length, but they can be variable in a real model.
    sequence_lengths = np.full(num_examples, num_words - 1, dtype=np.int32)

    # Train and evaluate the model.
    with tf.Graph().as_default():
      with tf.Session() as session:
        # Add the data to the TensorFlow graph.
        x_t = tf.constant(x)
        y_t = tf.constant(y)
        sequence_lengths_t = tf.constant(sequence_lengths)

        # Compute unary scores from a linear layer.
        weights = tf.get_variable("weights", [num_features, num_tags])
        matricized_x_t = tf.reshape(x_t, [-1, num_features])
        matricized_unary_scores = tf.matmul(matricized_x_t, weights)
        unary_scores = tf.reshape(matricized_unary_scores,
                                  [num_examples, num_words, num_tags])

        # Compute the log-likelihood of the gold sequences and keep the transition
        # params for inference at test time.
        log_likelihood, transition_params = tf.contrib.crf.crf_log_likelihood(
            unary_scores, y_t, sequence_lengths_t)

        # Compute the viterbi sequence and score.
        viterbi_sequence, viterbi_score = tf.contrib.crf.crf_decode(
            unary_scores, transition_params, sequence_lengths_t)

        # Add a training op to tune the parameters.
        loss = tf.reduce_mean(-log_likelihood)
        train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

        session.run(tf.global_variables_initializer())

        mask = (np.expand_dims(np.arange(num_words), axis=0) <
                np.expand_dims(sequence_lengths, axis=1))
        total_labels = np.sum(sequence_lengths)

        # Train for a fixed number of iterations.
        for i in range(1000):
          tf_viterbi_sequence, _ = session.run([viterbi_sequence, train_op])
          if i % 100 == 0:
            correct_labels = np.sum((y == tf_viterbi_sequence) * mask)
            accuracy = 100.0 * correct_labels / float(total_labels)
            print("Accuracy: %.2f%%" % accuracy)

CRF Encoding/Training Logic

  • Get the representation: take the raw model output, e.g. the representation produced by an LSTM or BERT, a tensor of shape <sequence length, hidden size>.

  • Compute the score of each word / the global unary potential: add a projection layer that computes a score for each word; the output has shape <sequence length, number of tags> (see the sketch after this list).

  • Train: run maximum likelihood training with crf_log_likelihood. The tag transition matrix is a parameter matrix that is learned here (it can also be supplied in advance); the call returns the log_likelihood and the tag transition matrix (used for decoding):

    log_likelihood, transition_params = tf.contrib.crf.crf_log_likelihood(
        unary_scores, gold_tags, sequence_lengths, trans)

  • Compute the loss as the mean negative log-likelihood over the batch:

    loss = tf.reduce_mean(-log_likelihood)

  • Pick a concrete optimization algorithm to learn the parameters:

    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
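
The following is a minimal sketch that ties these steps together, assuming a BiLSTM encoder; the placeholder names and sizes (embeddings, lstm_hidden, num_tags, etc.) are illustrative assumptions, not part of the original example:

import tensorflow as tf

# Assumed sizes (illustrative only).
max_seq_len, embed_dim, lstm_hidden, num_tags = 50, 128, 100, 5

embeddings = tf.placeholder(tf.float32, [None, max_seq_len, embed_dim])
gold_tags = tf.placeholder(tf.int32, [None, max_seq_len])
sequence_lengths = tf.placeholder(tf.int32, [None])

# Step 1: get the representation, here from a BiLSTM.
cell_fw = tf.nn.rnn_cell.LSTMCell(lstm_hidden)
cell_bw = tf.nn.rnn_cell.LSTMCell(lstm_hidden)
(out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
    cell_fw, cell_bw, embeddings,
    sequence_length=sequence_lengths, dtype=tf.float32)
hidden = tf.concat([out_fw, out_bw], axis=-1)  # [batch, max_seq_len, 2*lstm_hidden]

# Step 2: projection layer, giving per-word unary scores
# of shape [batch, max_seq_len, num_tags].
unary_scores = tf.layers.dense(hidden, num_tags)

# Remaining steps: CRF log-likelihood (the transition matrix is created
# and learned inside the call), loss, and a training op.
log_likelihood, transition_params = tf.contrib.crf.crf_log_likelihood(
    unary_scores, gold_tags, sequence_lengths)
loss = tf.reduce_mean(-log_likelihood)
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)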

CRF Decoding/Testing Logic

There are two options: decode with crf_decode using TF code inside the graph, or decode with viterbi_decode using NumPy code outside the graph.

The NumPy decoding style:

tf_unary_scores, tf_sequence_lengths, tf_transition_params, _ = session.run(
    [unary_scores, sequence_lengths, transition_params, train_op])
for tf_unary_scores_, tf_sequence_length_ in zip(tf_unary_scores, tf_sequence_lengths):
  # Remove padding.
  tf_unary_scores_ = tf_unary_scores_[:tf_sequence_length_]
  # Compute the highest score and its tag sequence.
  tf_viterbi_sequence, tf_viterbi_score = tf.contrib.crf.viterbi_decode(
      tf_unary_scores_, tf_transition_params)

The TF style:

viterbi_sequence, viterbi_score = tf.contrib.crf.crf_decode(
    unary_scores, transition_params, sequence_lengths)
tf_viterbi_sequence, tf_viterbi_score, _ = session.run(
    [viterbi_sequence, viterbi_score, train_op])

Main logic:

  • Obtain the tag transition parameters transition_params from the CRF forward pass.

    log_likelihood, transition_params = tf.contrib.crf.crf_log_likelihood(
        unary_scores, gold_tags, sequence_lengths)

  • Use either decoding style introduced above; in both cases the inputs are transition_params and the unary potential scores unary_scores, and the output is the decoded tag sequence.

Key points:

1. transition_params always comes from the second return value of crf_log_likelihood:

log_likelihood, transition_params = tf.contrib.crf.crf_log_likelihood(
    unary_scores, gold_tags, sequence_lengths)

2. unary_scores is the input score for each word (the unary potential score), with shape <sequence length, number of tags>. It usually comes out of a projection/fully connected layer, for example (see the sketch after this list):

  • BiLSTM: the output is <sequence length, 2x hidden size>, so a projection layer of <2x hidden size, number of tags> is needed to turn it into a per-word <sequence length, number of tags> tensor.
  • BERT: the output is <sequence length, hidden size 768>, so a projection layer of <hidden size, number of tags> is needed to turn it into a per-word <sequence length, number of tags> tensor.
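
A minimal sketch of the BERT case, assuming an encoder output tensor named bert_output (the name and sizes are assumptions for illustration):

import tensorflow as tf

num_tags = 5
# Assumed BERT-style encoder output: [batch, seq_len, 768].
bert_output = tf.placeholder(tf.float32, [None, 128, 768])

# Projection layer <hidden size, number of tags>: 768 -> num_tags,
# yielding per-word unary scores of shape [batch, seq_len, num_tags].
unary_scores = tf.layers.dense(bert_output, num_tags)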

3. Both the training process and the testing/decoding process need the actual sequence lengths as a parameter. This is mainly because the inputs are padded to the maximum length, and the padded part has to be excluded when computing evaluation statistics, as sketched below.
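
A minimal NumPy sketch of recovering the actual lengths from padded inputs and building the evaluation mask (pad_id and word_ids are illustrative assumptions):

import numpy as np

pad_id = 0
# Padded word-id matrix [batch, max_seq_len]; 0 marks padding (assumption).
word_ids = np.array([[4, 2, 7, 0, 0],
                     [3, 9, 0, 0, 0]])

# Actual length of each sequence = number of non-pad positions.
sequence_lengths = np.sum(word_ids != pad_id, axis=1).astype(np.int32)  # [3, 2]

# Mask that is True only inside the real sequence; multiplying the
# per-position correctness by this mask drops the padded part, as in
# the accuracy computation of the full example above.
max_seq_len = word_ids.shape[1]
mask = (np.expand_dims(np.arange(max_seq_len), axis=0) <
        np.expand_dims(sequence_lengths, axis=1))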