Dataset Preprocessing: Preprocessing a Corpus for GPT Training

Suppose I have some very simple Python code like this:

from tqdm import tqdm

# 'file' is an open file handle; token_start, token_end and gpt_input
# are assumed to be defined elsewhere.
corpus = file.read()
file_contents = corpus.split()[token_start:token_end]

# Slide over the word list: each window of gpt_input words is an input,
# and the word that follows the window is the corresponding output.
input_tokens, output_tokens = [], []
for i in tqdm(range(len(file_contents) - gpt_input - 1)):
    input_tokens.append(file_contents[i : i + gpt_input])
    output_tokens.append(file_contents[i + gpt_input])

# Join each window of words back into a single string.
X = [' '.join(words) for words in tqdm(input_tokens)]
Y = output_tokens

The code does three things:

  1. Load the file into RAM and split its contents into words, i.e. we get a list of the file's words in their original order.
  2. Build two lists, input_tokens and output_tokens: for each position i, append the window of gpt_input words from i to i + gpt_input - 1 to input_tokens, and append the single word at position i + gpt_input to output_tokens. Each input window is thus paired with the word that immediately follows it.
  3. Join each window of gpt_input words in input_tokens back into a space-separated string, i.e. we reconstruct short text fragments from the word lists.

Example:

If the file has contents like this:

Hello World, I'm writing a new cool code in TensorFlow, please don't forget to check it!

The end result:
input_tokens for gpt_input = 3:

Hello World, I'm
World, I'm writing
I'm writing a
writing a new
a new cool
...

output_tokens for gpt_input = 3:

writing
a
new
cool
code
...
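
For completeness, here is a minimal self-contained version of the loop above, run on this sample sentence (it reproduces exactly these windows):

text = ("Hello World, I'm writing a new cool code in TensorFlow, "
        "please don't forget to check it!")
gpt_input = 3

words = text.split()
n = len(words) - gpt_input - 1  # same loop bound as in the snippet above
input_tokens = [words[i : i + gpt_input] for i in range(n)]
output_tokens = [words[i + gpt_input] for i in range(n)]

X = [' '.join(w) for w in input_tokens]
print(X[:4])              # ["Hello World, I'm", "World, I'm writing", ...]
print(output_tokens[:4])  # ['writing', 'a', 'new', 'cool']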

So, now the problem is that the text corpus needed to train a GPT model can be very large, up to 200-300 GB, and can't be loaded into RAM directly. TensorFlow offers the tf.data API, a set of tools for loading, caching and training from very large datasets. But I don't see any way in the documentation to load and preprocess a text corpus with tf.data, and to me it seems pretty much impossible. If there is any way to load corpus fragments with a window size defined in words, kindly let me know.
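
For reference, what I want can be expressed as a plain-Python generator like the sketch below (just a sketch: the open() call stands in for a proper chunked reader, and corpus.txt is a placeholder path). It streams (input string, target word) pairs lazily instead of materialising the whole corpus:

def word_windows(path, gpt_input):
    # Yield (input_string, target_word) pairs one at a time, keeping
    # only the current window plus one line's worth of words in memory.
    buffer = []
    with open(path) as f:
        for line in f:
            buffer.extend(line.split())
            while len(buffer) > gpt_input:
                yield ' '.join(buffer[:gpt_input]), buffer[gpt_input]
                buffer.pop(0)

for X, Y in word_windows("corpus.txt", 3):
    ...  # feed each (X, Y) pair to training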

Thank you in advance.

Hi @Abhas_Kumar,

Can you please check the following articles from the TensorFlow documentation: Load text, Window, Better performance with the tf.data API, and TextLineDataset. They cover the dataset preprocessing and training steps you describe above.

Hope this helps you solve your problem.

Thanks.

Thank you for your response,

But these don't provide enough functionality for my task.
TextLineDataset loads text as tensors line by line, and some lines may even be empty, with no words at all. This doesn't solve the problem I need solved, which is loading text in windows of tokens. There is a tensor string split method for splitting a line into words, but streaming the results from the file into one column (a 1-D tensor of words) seems impossible.
The window method you've mentioned is useful, but again, the preprocessing required before windowing, i.e. flattening the file into a 1-D stream of words, seems impossible.
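
The closest I could put together from those articles is the sketch below (untested; corpus.txt is a placeholder path), but I'm not sure the flattening step really streams the words lazily the way I need:

import tensorflow as tf

gpt_input = 3  # window length in words, as in the example above

# Stream lines from disk without loading the whole file.
lines = tf.data.TextLineDataset("corpus.txt")

# Drop empty lines, split each line into words, then flatten the
# per-line word tensors into a single 1-D stream of words.
words = (
    lines.filter(lambda line: tf.strings.length(line) > 0)
         .map(tf.strings.split)
         .unbatch()
)

# Slide a window of gpt_input + 1 words, shifting one word at a time,
# and materialise each window as a single 1-D tensor.
windows = words.window(gpt_input + 1, shift=1, drop_remainder=True)
windows = windows.flat_map(lambda w: w.batch(gpt_input + 1))

# The first gpt_input words (joined with spaces) form X; the last word is Y.
pairs = windows.map(
    lambda w: (tf.strings.reduce_join(w[:-1], separator=" "), w[-1])
)

If this is on the right track, pairs could then be batched and fed to model.fit, but I'd still appreciate confirmation that this is the intended way to window by words.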

Thanks for the reply.