Suppose I have some very simple Python code like this:
```python
from tqdm import tqdm

corpus = file.read()
file_contents = corpus.split()[token_start : token_end]

input_tokens, output_tokens = [], []
for i in tqdm(range(len(file_contents) - gpt_input - 1)):
    input_tokens += [file_contents[i : i + gpt_input]]
    output_tokens += [file_contents[i + gpt_input]]

X = [' '.join(input_tokens[i]) for i in tqdm(range(len(input_tokens)))]
Y = output_tokens
```
The code does three things:
- Load the file into RAM and split its contents into words, i.e. we get a list of the file's words in their original order.
- Next, initialise two empty lists, `input_tokens` and `output_tokens`. For each position `i`, append the list of `gpt_input` consecutive words starting at `i` to `input_tokens`, and the single word at position `i + gpt_input` to `output_tokens`. This ensures that for every `i` we have a window of `gpt_input` words in `input_tokens` and the one word that follows it in `output_tokens`, for all `i` from `0` up to roughly `total_tokens - gpt_input - 1`.
- Finally, reconstruct text snippets from the words in `input_tokens`, i.e. we join each window of `gpt_input` words back into a space-separated string.
If the file has contents like this:
Hello World, I'm writing a new cool code in TensorFlow, please don't forget to check it!
The end result:
`input_tokens` (joined into `X`) for `gpt_input = 3`:

```
Hello World, I'm
World, I'm writing
I'm writing a
writing a new
a new cool
...
```
`output_tokens` for `gpt_input = 3`:

```
writing
a
new
cool
code
...
```
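The steps above can be reproduced with a self-contained sketch (the toy corpus is inlined, `tqdm` is omitted, and the loop bound is `len(words) - gpt_input` so the final window is kept):

```python
# Toy corpus standing in for the file contents.
corpus = "Hello World, I'm writing a new cool code in TensorFlow"
words = corpus.split()
gpt_input = 3  # window size in words

input_tokens, output_tokens = [], []
# Every window of gpt_input words that still has a following word.
for i in range(len(words) - gpt_input):
    input_tokens.append(words[i : i + gpt_input])   # gpt_input consecutive words
    output_tokens.append(words[i + gpt_input])      # the single word that follows

X = [' '.join(window) for window in input_tokens]   # windows joined back to strings
Y = output_tokens

print(X[0], '->', Y[0])  # Hello World, I'm -> writing
```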
So, now the problem: the file or text corpus needed to train a GPT model can be very large, up to 200-300 GB, and can't be loaded into RAM directly. TensorFlow offers the `tf.data` module, a set of tools for loading, caching and training from very large datasets. But from the documentation I don't see any way to create and pre-process a text-file corpus like this using `tf.data`; to me it seems pretty much impossible. If there is any way to load corpus fragments with a window size defined by words, kindly let me know.
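For reference, here is a rough pure-Python sketch of the behaviour I am after (not `tf.data`, just a plain generator; the function name, `chunk_size` parameter and its default are arbitrary choices for illustration): it streams `(window, next_word)` pairs from a file by reading fixed-size chunks, so the whole corpus never has to fit in memory.

```python
def window_generator(path, gpt_input, chunk_size=1 << 20):
    """Yield (input_window, next_word) pairs from a large text file
    without loading the whole corpus into RAM."""
    buffer = []  # the current sliding window of words
    leftover = ''  # a partial word cut off at a chunk boundary
    with open(path, 'r') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            text = leftover + chunk
            words = text.split()
            if not text[-1].isspace():
                # Last word may be cut off mid-word; save it for the next chunk.
                leftover = words.pop()
            else:
                leftover = ''
            for word in words:
                buffer.append(word)
                if len(buffer) == gpt_input + 1:
                    yield ' '.join(buffer[:gpt_input]), buffer[gpt_input]
                    buffer.pop(0)  # slide the window forward by one word
    # Flush the final word left at end of file, if any.
    if leftover:
        buffer.append(leftover)
        if len(buffer) == gpt_input + 1:
            yield ' '.join(buffer[:gpt_input]), buffer[gpt_input]
```

A small `chunk_size` in the usage below just exercises the chunk-boundary handling; in practice it would be megabytes.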
Thank you in advance.