How to implement and use a Linear Chain CRF in TensorFlow?

rocketstar31 · September 30, 2021, 12:14pm

Hi, I’m currently working on my first machine learning project - using neural networks to try and syllabify words using the Moby Hyphenator II dataset.

I am treating this as a multi-label classification problem in which words and their syllables are encoded in the following format:

t e n - s o r - f l o w
0 0 1   0 0 1   0 0 0 0

I have been padding all inputs to a length of 15 characters, so tensorflow would be encoded as 001001000000000.
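
As a minimal sketch of the encoding described above: it assumes that hyphens mark the syllable breaks in the source word and that a 1 is placed on the last character before each break. The helper name `encode_syllabified` and its `max_len` parameter are hypothetical and do not come from any library.

```python
def encode_syllabified(word, max_len=15):
    """Split a hyphenated word into plain characters and 0/1 break labels.

    A 1 marks the character immediately before a syllable break; the labels
    are right-padded with 0 up to max_len.
    """
    chars = []
    labels = []
    for ch in word:
        if ch == "-":
            # A hyphen flags the previous character as ending a syllable.
            if labels:
                labels[-1] = 1
        else:
            chars.append(ch)
            labels.append(0)
    if len(chars) > max_len:
        raise ValueError(f"word longer than {max_len} characters: {word!r}")
    labels += [0] * (max_len - len(labels))
    return "".join(chars), labels


plain, labels = encode_syllabified("ten-sor-flow")
print(plain)                      # tensorflow
print("".join(map(str, labels)))  # 001001000000000
```

The same function could be mapped over the whole dataset before converting the label lists to an array.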

I want to use a linear chain conditional random field (CRF) as my classifier because the online guides that I have based my project on suggest that including one can greatly boost accuracy: this guide achieves 96.89% validation accuracy after hyperparameter tuning without one, but this model achieves near 100% accuracy when including a linear chain CRF output layer.

I have seen a guide that implements a linear chain CRF in PyTorch, but I am unsure how to recreate this in TensorFlow. This guide also includes special characters which are checked for in order to avoid padding being included in the computations, but this isn’t a problem that I am currently concerned with - my main problem is being able to implement a linear chain CRF in TensorFlow as the final output layer.

I looked at the official TensorFlow CRF layer implementation as well as the TFA module but I have no idea as to how to use them with the form of data that I have nor do I understand which specific functions to use. The second example model I referenced uses this CRF implementation but I again do not know how to use it - I tried to use it in my model as per the comment in the code:

    # As the last layer of sequential layer with
    # model.output_shape == (None, timesteps, nb_classes)
    crf = ChainCRF()
    model.add(crf)
    # now: model.output_shape == (None, timesteps, nb_classes)

However, using this leads to an output shape of (None, 15, 64) - this is different from my currently working dense output layer applied after global max pooling which has an output shape of (None, 15) and I am unsure of how to remedy this, as I believe that I need the output shape to be (None, 15) for the model to work.
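
One possible reading of the shape mismatch, not confirmed anywhere in this thread: a linear chain CRF consumes one score per tag at every timestep, i.e. potentials of shape (timesteps, nb_classes), and only the decoded tag sequence has shape (timesteps,). The plain-Python Viterbi sketch below illustrates that relationship for a single word with two tags (0 = no break, 1 = break). It is not the ChainCRF or TFA implementation, and the names `viterbi_decode`, `potentials` and `transitions` are chosen here for illustration only.

```python
def viterbi_decode(potentials, transitions):
    """Return the highest-scoring tag sequence for one word.

    potentials:  list of length timesteps, each a list of num_tags scores.
    transitions: num_tags x num_tags list; transitions[i][j] scores tag i -> tag j.
    """
    num_tags = len(transitions)
    # best[j] = best score of any path ending in tag j at the current step.
    best = list(potentials[0])
    backpointers = []
    for scores in potentials[1:]:
        step_best, step_back = [], []
        for j in range(num_tags):
            # Pick the previous tag that maximises the path score into tag j.
            prev = max(range(num_tags), key=lambda i: best[i] + transitions[i][j])
            step_back.append(prev)
            step_best.append(best[prev] + transitions[prev][j] + scores[j])
        best = step_best
        backpointers.append(step_back)
    # Walk the backpointers from the best final tag to recover the path.
    tag = max(range(num_tags), key=lambda j: best[j])
    path = [tag]
    for step_back in reversed(backpointers):
        tag = step_back[tag]
        path.append(tag)
    path.reverse()
    return path


# Four timesteps, two tags: the input is (4, 2), the decoded output is (4,).
potentials = [[2.0, 0.0], [1.0, 0.5], [0.0, 3.0], [1.5, 0.0]]
transitions = [[0.0, 0.0], [0.0, -5.0]]  # discourage two breaks in a row
print(viterbi_decode(potentials, transitions))  # [0, 0, 1, 0]
```

Under this reading, the layer feeding the CRF would keep the time dimension and emit 2 scores per character, and the (None, 15) labels would be compared with the decoded tags rather than with the potentials.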

Bhack · September 30, 2021, 3:52pm

Have you checked:

We now have a new codeowner for CRF

XiaoquanKong · September 30, 2021, 4:39pm

Thank you @Bhack for the introduction. Hi @rocketstar31, I am the new codeowner of CRF. I think the (ongoing) CRF tutorial may be helpful to you: see Pull Request #2552 in tensorflow/addons ("add CRF tutorial" by howl-anderson) and the notebook addons/layers_crf.ipynb on the add_crf_tutorial branch of howl-anderson/addons. It gives detailed code samples showing how to use the CRF layer and compute the loss. If you still have any questions about the issue or the tutorial, please let me know.

rocketstar31 · October 3, 2021, 2:23pm

Hi @XiaoquanKong, thank you for this response. I have looked at the notebook and have used the CRFModelWrapper method to try and incorporate a CRF into my project, using a sample weight of approximately 4.2 in unpack_training_data() as there are far fewer 1s, indicating syllable breaks, than 0s. In addition, instead of using tf.data.Dataset for my input data, I have been using NumPy arrays of inputs and outputs for my training and validation data - this also appears to work fine as I can train the model without errors and the loss and CRF loss do decrease. The final modification that I have made is that I changed the input shape in the base model from (None,) to (15,) as all inputs in my dataset are padded to length 15, and I presume that this will not lead to any errors.

However, I have run into a problem when trying to add the binary_accuracy metric. It seems appropriate because my labels are sequences of 0s and 1s, and I have used it successfully in my model without the CRF. I’m not sure where or how to add this metric - I tried to add it like this:

# This is referring to the full model (base + CRF).
model.compile(optimizer=tf.keras.optimizers.Adam(0.02), metrics=['binary_accuracy'])

When I do this, I get the error:

ValueError: Can not squeeze dim[1], expected a dimension of 1, got 15 for '{{node Squeeze_1}} = Squeeze[T=DT_FLOAT, squeeze_dims=[-1]](Cast_7)' with input shapes: [?,15].

This is a concern for me as the main metric I want to measure is binary accuracy and how the model performs with predicting syllable breaks - any help with how I can integrate this would be much appreciated. When I received this error I did also look at Method one: Using the CRF layer in a custom training loop; in the code there is a reference to Define optimizer, metrics and train_step function, but again I’m not sure as to how I would include binary accuracy as a metric to measure.

XiaoquanKong · October 3, 2021, 3:29pm

Hi @rocketstar31, I am very happy to see that the tutorial (partially) works for you. Regarding the metrics issue, can you provide some sample code (with fake data) that reproduces this error? I need the code to debug it at runtime, which is an efficient way to find the root cause.

rocketstar31 · October 3, 2021, 4:28pm

Hello @XiaoquanKong, I have worked out how to include the metrics when using Method one: Using the CRF layer in a custom training loop:

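import tensorflow as tf

# model, crf_loss_func, train_dataset and validation_dataset are assumed to be
# defined beforehand (for example, as in the tutorial notebook).
optimizer = tf.keras.optimizers.Adam(0.02)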
train_loss = tf.keras.metrics.Mean(name="train_loss")
train_acc_metric = tf.keras.metrics.BinaryAccuracy()
val_acc_metric = tf.keras.metrics.BinaryAccuracy()

@tf.function(experimental_relax_shapes=True)
def train_step(x, y):
    with tf.GradientTape() as tape:
        decoded_sequence, potentials, sequence_length, kernel = model(x)
        crf_loss = crf_loss_func(potentials, sequence_length, kernel, y)
        loss = crf_loss + tf.reduce_sum(model.losses)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    
    train_acc_metric.update_state(y, decoded_sequence)
    train_loss(loss)

EPOCHS = 10
for epoch in range(EPOCHS):
    # Reset the metrics at the start of the next epoch
    train_loss.reset_states()
    train_acc_metric.reset_states()

    for x, y in train_dataset:
        train_step(x, y)

    print(f"Epoch {epoch + 1}, " f"Loss: {train_loss.result()}")
    print(f"Epoch {epoch + 1}, " f"Accuracy: {train_acc_metric.result()}")

    for x, y in validation_dataset:
        decoded_sequence, potentials, sequence_length, kernel = model(x, training=False)
        val_acc_metric.update_state(y, decoded_sequence)
    val_acc = val_acc_metric.result()
    val_acc_metric.reset_states()
    print("Validation acc: %.4f" % (float(val_acc),))

This works well although there is some slowness, which is probably to be expected given the complexities of CRFs, and the training and validation accuracies both tend to drop after a few epochs of training - I am not sure if the latter phenomenon is due to the CRF’s inclusion. However, I am still unsure of how and where to add these metrics when using the CRF layer via model subclassing, which is the method I would prefer to use as with verbose=1, I can more easily monitor model performance.

XiaoquanKong · October 4, 2021, 2:11am

Hi @rocketstar31, could you give me some training and validation samples (with the sample weight), or the shape and datatype of the data (and of the sample weight)? Then I can generate some fake data myself and debug it at runtime.

rocketstar31 · November 13, 2021, 9:41am

Hi @XiaoquanKong, I’m sorry for the delay - I have not been working on this project for a while due to other commitments. A sample looks like this:

[[ 3  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 3  4  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 3  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 3  3  9 18  6  5 15  0  0  0  0  0  0  0  0]
 [ 3  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 3  3  9  7  6  0  0  0  0  0  0  0  0  0  0]
 [ 3  3  5  3 11  0  0  0  0  0  0  0  0  0  0]
 [ 3  4  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 3  3  5 16 11  8  0  0  0  0  0  0  0  0  0]
 [ 3  4  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 3  3  5  6  4  0  0  0  0  0  0  0  0  0  0]
 [ 3  3  2 11  4  0  0  0  0  0  0  0  0  0  0]
 [ 3  4  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 3  3  5  6  4  0  0  0  0  0  0  0  0  0  0]
 [ 3  3 10 16  1  4  0  0  0  0  0  0  0  0  0]]

Above is an example of the training input - the integer values range from 1-37 and are padded with zeroes to the right side.

[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]]

This is an example of the training (expected) output - the 1s mark syllable breaks, and as you can see they are fairly infrequent. It is important to note that if an input word has length n, a syllable break cannot be present at any entry i where i >= n, as this would mean that the syllable break lies outside of the word or on the very last character.
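
The constraint above, and the effect of right-padding on an accuracy figure, can be sketched in plain Python. This assumes 0-based indices (so the forbidden region becomes i >= n - 1, covering the last character and everything after it) and that id 0 is used only for padding. All helper names are hypothetical.

```python
def sequence_length(token_ids, pad_id=0):
    """Number of leading non-padding ids in a right-padded row."""
    length = 0
    for token in token_ids:
        if token == pad_id:
            break
        length += 1
    return length


def breaks_are_valid(token_ids, labels, pad_id=0):
    """True if no syllable break sits on the last character or on padding."""
    n = sequence_length(token_ids, pad_id)
    # With 0-based indices a break is only meaningful for i < n - 1.
    return all(label == 0 for label in labels[max(n - 1, 0):])


def masked_binary_accuracy(token_rows, true_rows, pred_rows, pad_id=0):
    """Fraction of non-padding positions where the prediction equals the label."""
    correct = total = 0
    for tokens, y_true, y_pred in zip(token_rows, true_rows, pred_rows):
        n = sequence_length(tokens, pad_id)
        for t, p in zip(y_true[:n], y_pred[:n]):
            correct += int(t == p)
            total += 1
    return correct / total if total else 0.0


# One row taken from the sample arrays above, with an all-zero prediction.
tokens = [3, 3, 9, 18, 6, 5, 15] + [0] * 8
y_true = [0, 0, 1] + [0] * 12
y_pred = [0] * 15

print(sequence_length(tokens))           # 7
print(breaks_are_valid(tokens, y_true))  # True
# 6 of the 7 real positions match, whereas 14 of all 15 positions match.
print(masked_binary_accuracy([tokens], [y_true], [y_pred]))
```

On this row an all-zero prediction scores 14/15 when padding is counted but 6/7 when it is masked out, which is one reason an unmasked binary accuracy can look better than the model really is.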

dada_Lai · November 18, 2021, 3:45am

Hi @XiaoquanKong,

Thank you for the great CRF tutorial, which helped me a lot. I applied it to my case and found that the ground truth (y) may need to be converted to a 2-D form (in the one-hot encoded case) before being used for the loss calculation (in the compute_crf_loss function), because y_true should be 2-D for the crf_log_likelihood function. Thanks.
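
A small sketch of the conversion described here, written in plain Python so the shapes are easy to follow. It assumes the labels are one-hot with layout [batch][timesteps][num_tags]; the helper name `one_hot_to_tag_indices` is hypothetical. On tensors, the usual equivalent would be `tf.argmax(y, axis=-1)`.

```python
def one_hot_to_tag_indices(y_one_hot):
    """Convert [batch][timesteps][num_tags] one-hot labels to [batch][timesteps] tag ids."""
    return [
        # For each timestep, keep the index of the largest entry (the "hot" tag).
        [max(range(len(step)), key=lambda k: step[k]) for step in sequence]
        for sequence in y_one_hot
    ]


# One sequence of three timesteps with two tags: 3-D in, 2-D out.
y_one_hot = [[[1, 0], [0, 1], [1, 0]]]
print(one_hot_to_tag_indices(y_one_hot))  # [[0, 1, 0]]
```

The 2-D result matches the [batch_size, max_seq_len] integer tag matrix that `tfa.text.crf_log_likelihood` documents for its `tag_indices` argument.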

XiaoquanKong · November 23, 2021, 8:09am

Hi @dada_Lai, Thank you for the good news. I am glad that my work is useful to you!

XiaoquanKong · November 23, 2021, 8:15am

Hi @rocketstar31, sorry for the late reply, and thank you for your sample data. I will try to reproduce the bug on my computer. If I find the root cause or need your help, I will let you know.

shayue · January 2, 2022, 5:38pm

Hi, I’m learning about CRFs and am trying to implement a linear chain CRF myself.

I have a working version, but I am having some trouble. When training with my implementation, the loss becomes negative as the iterations proceed, although inference with the model works normally.
My implementation does not account for sentence boundaries, and I don’t know whether that causes the negative loss. Also, the most common way I have seen to introduce boundaries is to add <START> and <END> tokens, but the tensorflow-addons implementation appears to just add left_boundary and right_boundary energies to the potentials.

So, I have two questions:

Does ignoring sentence boundaries cause the negative loss?
Is the tf-addons implementation equivalent to adding <START> and <END> tokens?

I would greatly appreciate your help and look forward to your reply.
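
For reference on the first question, here is a brute-force sketch of the linear chain CRF negative log-likelihood, usable only on tiny inputs. It is not the tf-addons code, and the names `path_score`, `crf_nll`, `left` and `right` are chosen here for illustration. Because the partition function sums over every path, including the gold one, log Z is at least the gold score, so a consistently computed loss cannot be negative whether or not boundary energies are used.

```python
import itertools
import math


def path_score(potentials, transitions, tags, left=None, right=None):
    """Unnormalised score of one tag path, with optional boundary energies."""
    score = sum(potentials[t][tag] for t, tag in enumerate(tags))
    score += sum(transitions[a][b] for a, b in zip(tags, tags[1:]))
    if left is not None:
        score += left[tags[0]]    # plays the role of a START -> first tag score
    if right is not None:
        score += right[tags[-1]]  # plays the role of a last tag -> END score
    return score


def crf_nll(potentials, transitions, gold, left=None, right=None):
    """Negative log-likelihood by enumerating every path (tiny inputs only)."""
    num_tags = len(transitions)
    all_scores = [
        path_score(potentials, transitions, path, left, right)
        for path in itertools.product(range(num_tags), repeat=len(potentials))
    ]
    # log-sum-exp for the partition function, shifted for numerical stability.
    top = max(all_scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in all_scores))
    return log_z - path_score(potentials, transitions, gold, left, right)


potentials = [[0.5, 1.0], [2.0, -1.0], [0.0, 0.3]]
transitions = [[0.1, -0.2], [0.4, 0.0]]
gold = (1, 0, 1)

print(crf_nll(potentials, transitions, gold) >= 0)  # True
print(crf_nll(potentials, transitions, gold,
              left=[0.3, -0.3], right=[-0.1, 0.2]) >= 0)  # True
```

A negative training loss therefore more likely points to the gold score and the partition function being computed from different terms than to the missing boundaries themselves. In terms of scores, per-tag boundary energies added at the first and last positions behave like START and END transition scores; whether the tf-addons implementation matches that construction in every detail is not verified here.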

Lance_N · January 4, 2022, 11:08pm

Please start a new topic for this question. Thank you!