Suppose I have some very simple Python code like this:

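from tqdm import tqdm

corpus =
file_contents = corpus.split()[token_start : token_end]

input_tokens, output_tokens = [], []
for i in tqdm(range(len(file_contents) - gpt_input - 1)):
    # gpt_input consecutive words form one input sample ...
    input_tokens.append(file_contents[i : i + gpt_input])
    # ... and the word right after them is the target.
    output_tokens.append(file_contents[i + gpt_input])
# Join each window of words back into a single string.
X = [' '.join(tokens) for tokens in tqdm(input_tokens)]
Y = output_tokens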

The code does three things:

  1. Load a file into RAM and split its contents into words, so that we have a list of the file's words in their original order.
  2. Next, build two lists, input_tokens and output_tokens. For each position i, append the gpt_input words from i to i + gpt_input - 1 to input_tokens as one list, and append the word at position i + gpt_input to output_tokens. Each entry of input_tokens is therefore a window of gpt_input consecutive words, and the matching entry of output_tokens is the word that follows that window.
  3. Finally, rebuild text from input_tokens by joining the gpt_input words of each window back into a single string.


If the file has contents like this:

Hello World, I'm writing a new cool code in TensorFlow, please don't forget to check it!

The end result:
input_tokens for gpt_input = 3:

Hello World, I'm
World, I'm writing
I'm writing a
writing a new
a new cool

output_tokens for gpt_input = 3:

writing
a
new
cool
code

So, now the problem is that the file or text corpus needed to train a GPT model can be very large, up to 200-300 GB, and can't be loaded into RAM directly. TensorFlow offers the tf.data.Dataset class, with a set of tools for loading, caching and training from very large datasets. But I don't see any way in the documentation to create and pre-process a text-file corpus using that class. To me, it seems pretty much impossible to do. If there is any way to load corpus fragments with a window size defined in words, kindly let me know.
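
To make the requirement concrete, the streaming behaviour being asked for can be sketched in plain Python. This is only a minimal sketch: stream_words and sliding_windows are hypothetical helper names, not TensorFlow functions, and sample_lines stands in for a real file. It consumes the text line by line and keeps only gpt_input + 1 words in memory at a time.

```python
from collections import deque
from typing import Iterable, Iterator, Tuple


def stream_words(lines: Iterable[str]) -> Iterator[str]:
    """Yield whitespace-separated words one at a time; empty lines yield nothing."""
    for line in lines:
        yield from line.split()


def sliding_windows(words: Iterable[str], gpt_input: int) -> Iterator[Tuple[str, str]]:
    """Yield (input_text, next_word) pairs from a window of gpt_input words."""
    # The deque holds the current context plus the target word, nothing more.
    window = deque(maxlen=gpt_input + 1)
    for word in words:
        window.append(word)
        if len(window) == gpt_input + 1:
            *context, target = window
            yield " ".join(context), target


# Any iterable of lines works here, including an open file object,
# which Python reads lazily line by line.
sample_lines = [
    "Hello World, I'm writing a new cool code in TensorFlow,",
    "",
    "please don't forget to check it!",
]
pairs = list(sliding_windows(stream_words(sample_lines), gpt_input=3))
```

Unlike the loop at the top, which stops one window early because of the extra - 1 in its range, this emits a pair for every complete window. A generator like this could presumably be wrapped with tf.data.Dataset.from_generator, though its performance at this scale is untested here.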

Could you please check the following articles from the TensorFlow documentation: Load text, Window, Better performance with the tf.data API, and TextLineDataset? They may help with the dataset preprocessing and training of your model described above.

But for my task, these don't provide enough functionality to work with.
TextLineDataset: loads text line by line as string tensors, and some of those lines may be empty. This doesn't solve my problem, which is to load text in windows of tokens. There is a tensor string split method for this, but streaming the resulting words from the file into a single column (a 1D tensor) seems impossible.
window: the method you've mentioned is useful, but the preprocessing required before windowing, i.e. flattening the text into a 1D stream of words, seems impossible.
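
For what it's worth, one way these pieces might fit together is to split each line into words and then call unbatch() to flatten them into a 1D stream before windowing. The following is a minimal sketch under that assumption, not a tested recipe for a 200-300 GB corpus; word_window_dataset is a hypothetical helper name, and file_path is assumed to point to a plain-text file.

```python
import tensorflow as tf


def word_window_dataset(file_path, gpt_input):
    """Stream (input_text, next_word) pairs from a text file, line by line."""
    lines = tf.data.TextLineDataset(file_path)
    # Split each line on whitespace, then flatten to a 1D stream of words.
    # Empty lines split into zero words, so they vanish after unbatch().
    words = lines.map(tf.strings.split).unbatch()
    # Sliding window of gpt_input context words plus one target word.
    windows = words.window(gpt_input + 1, shift=1, drop_remainder=True)
    # window() yields nested datasets; batch each one back into a tensor.
    windows = windows.flat_map(
        lambda w: w.batch(gpt_input + 1, drop_remainder=True)
    )
    # Join the context words into one string; the last word is the label.
    return windows.map(
        lambda w: (tf.strings.reduce_join(w[:-1], separator=" "), w[-1])
    )
```

Iterating over word_window_dataset(path, 3) should then give pairs such as ("Hello World, I'm", "writing") for the example sentence, and the result could be shuffled, batched and prefetched as described in the performance guide.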

