Fine-tuning GPT-2 for text summarization

Hello y’all

I’m trying to build a text summarization model by fine-tuning GPT-2. Here is my current code:

import tensorflow as tf
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = TFGPT2LMHeadModel.from_pretrained("gpt2")

documents = ["Hello, World", "Hello, World", "Hello, World"]
summaries = ["Good Morning", "Good Evening", "Good Day"]

# 1. tokenize each string in those two lists
# no preprocessing such as padding or truncation here, since OpenAI didn't do that either
documents_toknized = list(map(tokenizer, documents))
summaries_toknized = list(map(tokenizer, summaries))

# 2. convert those lists into TensorFlow tensor format
documents_tensor = ### I'm stuck here
summaries_tensor = ### I'm stuck here

optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer, loss=model.compute_loss)
model.fit(x=documents_tensor, y=summaries_tensor, epochs=1)

As written in this tutorial doc, I’m trying to convert documents_toknized and summaries_toknized to TensorFlow tensor type so I can pass them as x and y to model.fit(). What exactly should I put into x and y?

I know I can convert each one of them like this:

input_ids = [item['input_ids'] for item in documents_toknized]
attention_mask = [item['attention_mask'] for item in documents_toknized]
input_ids_tensor = tf.convert_to_tensor(input_ids)
attention_mask_tensor = tf.convert_to_tensor(attention_mask)
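One caveat I noticed with this approach: tf.convert_to_tensor only works if every token-id list has the same length, so without padding during tokenization it can fail on real data. A minimal sketch of padding first, with made-up token ids (using 0 as a stand-in pad id, since GPT-2 has no pad token by default):

```python
import tensorflow as tf

# Token-id lists of unequal length, as the tokenizer produces without padding
# (the ids here are made up for illustration)
ragged_ids = [[15496, 11, 2159], [10248, 14410]]

# Pad to a rectangular shape so tf.convert_to_tensor accepts it
padded = tf.keras.preprocessing.sequence.pad_sequences(
    ragged_ids, padding="post", value=0)
ids_tensor = tf.convert_to_tensor(padded)
print(ids_tensor.shape)
```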

But I’m wondering whether I should just pass each input_ids_tensor as x and y, or do it some other way.
Also, how can I apply batching here?
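For the batching part, one option I’m considering is wrapping the tensors in a tf.data.Dataset, which model.fit can consume directly. A minimal sketch with hypothetical, already-padded token ids standing in for the real tokenizer output:

```python
import tensorflow as tf

# Hypothetical padded token-id tensors (shape: [num_examples, seq_len]);
# the ids are made up for illustration
documents_tensor = tf.constant([[15496, 11, 2159],
                                [15496, 11, 2159],
                                [15496, 11, 2159]])
summaries_tensor = tf.constant([[10248, 14410, 0],
                                [10248, 38138, 0],
                                [10248, 3596, 0]])

# Pair inputs with targets, shuffle, and batch
dataset = (tf.data.Dataset
           .from_tensor_slices((documents_tensor, summaries_tensor))
           .shuffle(buffer_size=3)
           .batch(2))

for x_batch, y_batch in dataset:
    print(x_batch.shape, y_batch.shape)
```

With a dataset like this, model.fit(dataset, epochs=1) would replace the separate x and y arguments.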

Can anyone help me with this? Thanks