Is there an existing tokenizer model for Chinese to English translation?

I am following this tutorial for the transformer model:

I want to try Chinese to English translation, so I used this configuration:

config = tfds.translate.wmt.WmtConfig(
    description="WMT 2019 translation task dataset.",
    version="0.0.3",
    language_pair=("zh", "en"),
    subsets={
        tfds.Split.TRAIN: ["newscommentary_v13"],
        tfds.Split.VALIDATION: ["newsdev2017"],
    }
)

builder = tfds.builder("wmt_translate", config=config)

In the pt-en translation tutorial, a corresponding tokenizer model is already available for download:

model_name = "ted_hrlr_translate_pt_en_converter"
tf.keras.utils.get_file(
    f"{model_name}.zip",
    f"https://storage.googleapis.com/download.tensorflow.org/models/{model_name}.zip",
    cache_dir='.', cache_subdir='', extract=True)

Is there a tokenizer similar to ted_hrlr_translate_pt_en_converter that can be used directly for Chinese to English translation? I ask because I want to reuse the tutorial code, such as the following:

def tokenize_pairs(pt, en):
    
    pt = tokenizers.pt.tokenize(pt)
    # Convert from ragged to dense, padding with zeros.
    pt = pt.to_tensor()

    en = tokenizers.en.tokenize(en)
    # Convert from ragged to dense, padding with zeros.
    en = en.to_tensor()
    return pt, en

In tokenize_pairs, 'pt' is a Tensor. A tokenizer usually takes a string object and returns a list. Why does tokenizers.pt.tokenize() take a tensor as input? How can I make such a tokenizer work with this tutorial? I imported the transformers tokenizer, but it didn't work: it complains about receiving a Tensor as the tokenize input:

from transformers import BertTokenizer
tokenizer_en = BertTokenizer.from_pretrained("bert-base-cased")
tokenizer_zh = BertTokenizer.from_pretrained("bert-base-chinese")

def tokenize_pairs(zh, en):
    zh = tokenizer_zh.tokenize(zh)
    # Convert from ragged to dense, padding with zeros.
    zh = zh.to_tensor()
    en = tokenizer_en.tokenize(en)
    # Convert from ragged to dense, padding with zeros.
    en = en.to_tensor()
    return zh, en

BUFFER_SIZE = 20000
BATCH_SIZE = 64
def make_batches(ds):
  return (
      ds
      .cache()
      .shuffle(BUFFER_SIZE)
      .batch(BATCH_SIZE)
      .map(tokenize_pairs, num_parallel_calls=tf.data.experimental.AUTOTUNE)
      .prefetch(tf.data.experimental.AUTOTUNE))

train_batches = make_batches(train_examples)
val_batches = make_batches(val_examples)

So the question is: how can I adapt the pretrained transformers tokenizers to work with the tutorial?

maybe @markdaoust can help here

I hope so, I did write these tutorials.

In tokenize_pairs, 'pt' is a Tensor. A tokenizer usually takes a string object and returns a list. Why does tokenizers.pt.tokenize() take a tensor as input?

Here I built the tokenizer in TensorFlow. This way, when you export the model at the end, the resulting saved model just takes a tensor of strings as input. The caller doesn't need to think about the tokenizer.
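To illustrate what "a tokenizer in TensorFlow" means in practice, here is a minimal sketch. It does not reproduce the tutorial's tokenizers: the in_graph_tokenize helper is a hypothetical stand-in that simply treats every Unicode code point as one token. It only shows the same calling pattern as tokenize_pairs: a tensor of strings goes in, a RaggedTensor comes out, and to_tensor() pads it with zeros.

```python
import tensorflow as tf


def in_graph_tokenize(sentences):
    """Toy in-graph 'tokenizer' (hypothetical): one token per code point."""
    # The op accepts a string tensor directly and returns a RaggedTensor,
    # because the sentences in a batch have different lengths.
    return tf.strings.unicode_decode(sentences, input_encoding="UTF-8")


sentences = tf.constant(["你好", "hello"])
ragged = in_graph_tokenize(sentences)

# Convert from ragged to dense, padding with zeros (as in tokenize_pairs).
dense = ragged.to_tensor()
print(dense.shape)
```

Because every step is a TensorFlow op, a function like this can be traced into the exported graph, which is why the saved model can accept raw strings.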

The process to create those tokenizers is shown in the subwords_tokenizer tutorial. But that approach is not appropriate for Chinese. You’d need to use sentencepiece.

The simplest approach to getting this to work for Chinese would be to grab a pretrained Chinese segmentation model from tfhub, like:

How can I make such a tokenizer work with this tutorial? I imported the transformers tokenizer, but it didn't work: it complains about receiving a Tensor as the tokenize input:

from transformers import BertTokenizer

Right, Hugging Face's tokenizers operate in Python. To use a Python tokenizer in TensorFlow, you'll need to call it with tf.py_function or tf.numpy_function:

def py_wrap_tokenize_pairs(zh, en):
  # Tout is required: it declares the dtypes that tokenize_pairs returns.
  return tf.numpy_function(tokenize_pairs, [zh, en], [tf.int64, tf.int64])
...

ds.map(py_wrap_tokenize_pairs,  num_parallel_calls=tf.data.experimental.AUTOTUNE)

That way the function runs in regular Python and receives numpy arrays as input.
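As a rough, self-contained sketch of that wrapping pattern (not the tutorial's code): py_tokenize below is a hypothetical stand-in for a Python-only tokenizer that emits one id per character, and tf_tokenize is an illustrative wrapper name. The sketch assumes the function is mapped before batching, so each call sees a single sentence.

```python
import numpy as np
import tensorflow as tf


def py_tokenize(text):
    """Runs as plain Python; stand-in for a Python-only tokenizer."""
    # A scalar string may arrive as bytes or wrapped in a 0-d numpy array.
    if isinstance(text, np.ndarray):
        text = text.item()
    if isinstance(text, bytes):
        text = text.decode("utf-8")
    # Hypothetical tokenization: one id per character.
    return np.array([ord(ch) for ch in text], dtype=np.int64)


def tf_tokenize(text):
    # Tout declares the dtype of the array that py_tokenize returns.
    ids = tf.numpy_function(py_tokenize, [text], tf.int64)
    # numpy_function drops static shape information; restore the rank
    # so that padded_batch knows it is padding 1-D sequences.
    ids.set_shape([None])
    return ids


ds = tf.data.Dataset.from_tensor_slices(["你好", "hello"])
batches = ds.map(tf_tokenize).padded_batch(2)
first = next(iter(batches)).numpy()
print(first.shape)
```

The padded_batch call plays the role that to_tensor() plays in the tutorial: it pads the variable-length id sequences with zeros when forming a batch.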

Thanks, @markdaoust. I will try your suggestion.

@markdaoust I tried this with the transformers tokenizer but got a new error. The full code is below:

import tensorflow_datasets as tfds
import numpy as np
import tensorflow as tf
import logging
import tensorflow_text as text
from transformers import BertTokenizer

config = tfds.translate.wmt.WmtConfig(
    description="WMT 2019 translation task dataset.",
    version="0.0.3",
    language_pair=("zh", "en"),
    subsets={
        tfds.Split.TRAIN: ["newscommentary_v13"],
        tfds.Split.VALIDATION: ["newsdev2017"],
    }
)

builder = tfds.builder("wmt_translate", config=config)
print(builder.info.splits)
builder.download_and_prepare()
datasets = builder.as_dataset(as_supervised=True)
print('datasets is {}'.format(datasets))

train_examples=datasets["train"]
val_examples=datasets["validation"]
train_examples = train_examples.take(128)
#1. Get train, validation and test text data
for zh_examples, en_examples in train_examples.batch(3).take(1):
  for zh in zh_examples.numpy():
    print(zh.decode('utf-8'))

  print()

  for en in en_examples.numpy():
    print(en.decode('utf-8'))

print('Start building tokenizer ...')
tokenizer_en = BertTokenizer.from_pretrained("bert-base-cased")
tokenizer_zh = BertTokenizer.from_pretrained("bert-base-chinese")
print('End building tokenizer ...')

def py_wrap_tokenize_pairs(zh, en):
  return tf.numpy_function(tokenize_pairs, [zh, en],[tf.int64,tf.int64])

def tokenize_pairs(zh, en):
    zh = tokenizer_zh.tokenize(zh)
    zh = zh.to_tensor()
    en = tokenizer_en.tokenize(en)
    en = en.to_tensor()
    return zh, en

# 4. Make batches
BUFFER_SIZE = 20000
BATCH_SIZE = 64
def make_batches(ds):
  return (
      ds
      .cache()
      .shuffle(BUFFER_SIZE)
      .batch(BATCH_SIZE)
      .map(py_wrap_tokenize_pairs, num_parallel_calls=tf.data.experimental.AUTOTUNE)
      .prefetch(tf.data.experimental.AUTOTUNE))

train_batches = make_batches(train_examples)
val_batches = make_batches(val_examples)

# Error occurs in this function
for (batch, (inp, tar)) in enumerate(train_batches):
  print(batch, inp, tar)

And the error is:
  File "/Users/cong/nlp/study/transformer/data_zh.py", line 31, in tokenize_pairs
    zh = tokenizer_zh.tokenize(zh)

  File "/Users/cong/.venv/tf2/lib/python3.8/site-packages/transformers/tokenization_utils.py", line 362, in tokenize
    tokenized_text = split_on_tokens(no_split_token, text)

  File "/Users/cong/.venv/tf2/lib/python3.8/site-packages/transformers/tokenization_utils.py", line 336, in split_on_tokens
    if not text.strip():

AttributeError: 'numpy.ndarray' object has no attribute 'strip'

It looks like the problem is still in the tokenize_pairs function. I first loaded the pretrained English and Chinese tokenizers, defined the py_wrap_tokenize_pairs wrapper, and then modified tokenize_pairs and make_batches. I think I have misunderstood some of your comments above.

It’s calling the tokenize_pairs function correctly. You’re close.

There’s just one little conversion missing.

  File "/Users/cong/.venv/tf2/lib/python3.8/site-packages/transformers/tokenization_utils.py", line 336, in split_on_tokens
    if not text.strip():

AttributeError: 'numpy.ndarray' object has no attribute 'strip'

The function is getting called with a numpy array as input, but it looks like it’s expecting an actual python string.

Right now you’re running map after batch:

      .batch(BATCH_SIZE)
      .map(py_wrap_tokenize_pairs,...)

So your tokenize_pairs function is getting batches of strings as a numpy array.

You either need to loop over the array and unpack the strings, or map then batch, so that you get scalar arrays and can unpack those.

It looks like str(a) is how you unpack a scalar string from a numpy array.
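For the first option (looping over the batch), a minimal numpy-only sketch might look like the following. The names decode_batch, toy_tokenize and pad_ids are hypothetical helpers, and toy_tokenize stands in for the real tokenizer; with a BertTokenizer one would presumably call something like tokenizer_zh.encode(text) in its place. Note that calling str() on a Python bytes value gives its repr (for example "b'abc'"), so the sketch decodes explicitly instead.

```python
import numpy as np


def decode_batch(batch):
    """Turn a numpy array of UTF-8 bytes into a list of Python str."""
    return [item.decode("utf-8") for item in batch.ravel().tolist()]


def toy_tokenize(text):
    # Hypothetical stand-in for a real tokenizer: one id per character.
    return [ord(ch) for ch in text]


def pad_ids(id_lists, pad_id=0):
    """Stack variable-length id lists into a dense, zero-padded array."""
    max_len = max(len(ids) for ids in id_lists)
    out = np.full((len(id_lists), max_len), pad_id, dtype=np.int64)
    for row, ids in enumerate(id_lists):
        out[row, :len(ids)] = ids
    return out


# A batch of strings, shaped like what the Python function receives.
batch = np.array(["你好".encode("utf-8"), b"hello"], dtype=object)
texts = decode_batch(batch)
dense = pad_ids([toy_tokenize(t) for t in texts])
print(texts, dense.shape)
```

Returning a dense int64 array like this matches the tf.int64 dtype declared in Tout, so TensorFlow can convert the result back into a tensor.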

In the original tutorial, the type of pt and en in tokenize_pairs is:

<class 'tensorflow.python.framework.ops.Tensor'>

In my code above, the type of zh and en in tokenize_pairs is:
<class 'numpy.ndarray'>

I took the map-then-batch approach:

def make_batches(ds):
      return (
          ds
          .cache()
          .shuffle(BUFFER_SIZE)
          .map(py_wrap_tokenize_pairs, num_parallel_calls=tf.data.experimental.AUTOTUNE)
          .batch(BATCH_SIZE)
          .prefetch(tf.data.experimental.AUTOTUNE))

But now the input type changes from 'numpy.ndarray' to 'bytes', and this produces the error:

  File "/Users/cong/.venv/tf2/lib/python3.8/site-packages/transformers/tokenization_utils.py", line 285, in split_on_token
    split_text = text.split(tok)

TypeError: a bytes-like object is required, not 'str'

If I add str(zh), it produces a str object, not a Tensor. Since the original tokenize_pairs expects a Tensor of strings as input, how can I make 'zh' a tensor before it is fed into:

zh = tokenizer_zh.tokenize(zh)
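
For what it's worth, the TypeError above can be reproduced in plain Python, which suggests the input should be decoded to a str rather than turned back into a Tensor. The sketch below is only an illustration under that assumption; to_text is a hypothetical helper, not part of transformers or TensorFlow.

```python
def to_text(value):
    """Return a Python str whether `value` arrives as bytes or str."""
    if isinstance(value, bytes):
        return value.decode("utf-8")
    return value


raw = "你好 世界".encode("utf-8")  # what the mapped function receives

# Splitting bytes with a str separator raises the same kind of TypeError
# as in the traceback: the text is bytes, but the separator is a str.
try:
    raw.split(" ")
    split_failed = False
except TypeError:
    split_failed = True

# Decoding first gives an ordinary str that Python tokenizers can handle.
pieces = to_text(raw).split(" ")
print(split_failed, pieces)
```

Inside a function wrapped with tf.numpy_function, nothing needs to be a Tensor: the function can work with ordinary Python strings and return numpy arrays of ids, which TensorFlow converts back to tensors. BertTokenizer.tokenize returns a Python list of token strings, so calling .to_tensor() on its result will not work; encode returns integer ids instead.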