Transformer model for language understanding with another Dataset

Hello all!

I have been reading the official guide here (Modello Transformer per la comprensione del linguaggio  |  Text  |  TensorFlow) to try and recreate the Vanilla Transformer in Tensorflow. I notice the dataset used is quite specific, and at the end of the guide, it says to try with a different dataset.

But that is where I have been stuck for a long time! I am trying to use the WMT14 dataset (as used in the original paper, Vaswani et. al.) here: wmt14_translate  |  TensorFlow Datasets.

I have also tried Multi30k and IWSLT dataset from spacy, but are there any guides on how I can fit the dataset to what the model requires? Specifically, to tokenize it. The official TF guide uses a pretrained tokenizer, which is specific to the PR-EN dataset given.

model_name = "ted_hrlr_translate_pt_en_converter"

I am wondering, how I can use the TF (bert) tokenizer to tokenize the Spacy dataset? I have the code for PyTorch, unfortunately I do not know how to adapt it for Tensorflow. Any help would be greatly appreciated!

 import spacy
    spacy_de = spacy.load('de')
    spacy_en = spacy.load('en')

    def tokenize_de(text):
        return [tok.text for tok in spacy_de.tokenizer(text)]

    def tokenize_en(text):
        return [tok.text for tok in spacy_en.tokenizer(text)]

    BOS_WORD = '<s>'
    EOS_WORD = '</s>'
    BLANK_WORD = "<blank>"
    SRC = data.Field(tokenize=tokenize_de, pad_token=BLANK_WORD)
    TGT = data.Field(tokenize=tokenize_en, init_token = BOS_WORD, 
                     eos_token = EOS_WORD, pad_token=BLANK_WORD)

    MAX_LEN = 100
    train, val, test = datasets.IWSLT.splits(
        exts=('.de', '.en'), fields=(SRC, TGT), 
        filter_pred=lambda x: len(vars(x)['src']) <= MAX_LEN and 
            len(vars(x)['trg']) <= MAX_LEN)
    MIN_FREQ = 2
    SRC.build_vocab(train.src, min_freq=MIN_FREQ)
    TGT.build_vocab(train.trg, min_freq=MIN_FREQ)