Best way to go about loading a large model with limited memory?

Hi,

I’ve been working on a TensorFlow project, and I have a lot of data. So much that if I train on the full data file, it spits out this error:

numpy.core._exceptions._ArrayMemoryError: Unable to allocate 135. GiB for an array with shape (2243467, 16155) and data type float32
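If I’m reading that shape right, I believe the array it can’t allocate is the one-hot target matrix that tf.keras.utils.to_categorical builds in my train() below: one float32 per (sample, vocabulary word). The arithmetic matches the error exactly:

# 2,243,467 samples * 16,155 vocabulary words * 4 bytes per float32
>>> 2243467 * 16155 * 4 / 2**30
135.01...  # GiB, matching the error above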

So I split the training data into 600 files (each is roughly 65 lines of data), and train it like this:

model = TextGenerator()
model.load()
for i in range(600):  # files are named data1.txt through data600.txt
    model.train(f"training/data{i + 1}.txt")
    model.save()
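One wrinkle I already spotted: on the very first run there is no weights file yet, so the initial load would fail. I was planning to guard it, something like this (sketch, using the same filename my save() writes):

import os

model = TextGenerator()
if os.path.exists("trained_text_generator_model.h5"):
    model.load()  # resume only if an earlier run saved weights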

I haven’t started the training run yet because I’m concerned about my current implementation of saving/loading. I’m most worried it will fail to save or load the model once it exceeds the 16 GB the machine is allocated (no, I can’t increase it). This is my current implementation of the model:

import numpy as np
import tensorflow as tf

class TextGenerator:
    def __init__(self, sequence_length=100, batch_size=128, embedding_dim=256, rnn_units=1024):
        self.sequence_length = sequence_length
        self.batch_size = batch_size
        self.embedding_dim = embedding_dim
        self.rnn_units = rnn_units
        self.model = None
        self.tokenizer = None

    def train(self, file_path):
        # Load and preprocess text data (close the file handle when done)
        with open(file_path, 'rb') as f:
            text = f.read().decode(encoding='utf-8', errors='ignore')

        self.tokenizer = tf.keras.preprocessing.text.Tokenizer(char_level=False)
        self.tokenizer.fit_on_texts([text])
        total_words = len(self.tokenizer.word_index) + 1

        # Create training sequences
        sequences = []
        for i in range(self.sequence_length, len(text)):
            seq = text[i - self.sequence_length:i]
            sequences.append(seq)

        input_sequences = self.tokenizer.texts_to_sequences(sequences)
        # pad_sequences already returns a numpy array
        input_sequences = tf.keras.preprocessing.sequence.pad_sequences(
            input_sequences, maxlen=self.sequence_length, padding='pre')

        inputs, targets = input_sequences[:, :-1], input_sequences[:, -1]
        targets = tf.keras.utils.to_categorical(targets, num_classes=total_words)

        # (re)build the model -- note this happens on every train() call
        self.model = tf.keras.Sequential([
            tf.keras.layers.Embedding(total_words, self.embedding_dim, input_length=self.sequence_length-1),
            tf.keras.layers.LSTM(self.rnn_units),
            tf.keras.layers.Dense(total_words, activation='softmax')
        ])
        self.model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
        self.model.fit(inputs, targets, epochs=10, batch_size=self.batch_size)

        print(self.generate_text("hello")) # test it

    def generate_text(self, seed_text, num_words=50):
        for _ in range(num_words):
            token_list = self.tokenizer.texts_to_sequences([seed_text])[0]
            token_list = tf.keras.preprocessing.sequence.pad_sequences(
                [token_list], maxlen=self.sequence_length - 1, padding='pre')
            # predict once per word (was predicting twice) and sample from it
            probabilities = self.model.predict(token_list, verbose=0)[0]
            predicted_index = np.random.choice(len(probabilities), p=probabilities)
            # index 0 is padding and has no entry in index_word
            output_word = self.tokenizer.index_word.get(predicted_index, "")
            seed_text += " " + output_word
        return seed_text
    
    def save(self):
        # overwrite=True: with overwrite=False, Keras prompts for
        # confirmation once the file exists, which would stall the loop
        self.model.save_weights("trained_text_generator_model.h5", overwrite=True)

    def load(self):
        # NOTE: self.model is still None until train() builds it,
        # so calling load() first raises AttributeError
        self.model.load_weights("trained_text_generator_model.h5")

Please forgive me if it’s messy or unoptimized; this is my first project using TF.
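On the save/load side specifically, two things I’m unsure about: load() runs while self.model is still None, and the tokenizer is never persisted, so a fresh process would rebuild a different vocabulary from whichever file it sees first. I was thinking of pickling the tokenizer next to the weights, roughly like this (untested sketch; tokenizer.pkl is just a name I made up):

import pickle

# replacement save()/load() methods for the class above
def save(self):
    self.model.save_weights("trained_text_generator_model.h5", overwrite=True)
    with open("tokenizer.pkl", "wb") as f:
        pickle.dump(self.tokenizer, f)

def load(self):
    with open("tokenizer.pkl", "rb") as f:
        self.tokenizer = pickle.load(f)
    # the model still has to be rebuilt with the same vocabulary size
    # before load_weights can restore anything

Does that sound like a reasonable direction?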

With all that being said, what’s the best way to run this given that I don’t have 150 GB of memory to spare? I’d like to keep the full training set (and add to it later), but save_weights and load_weights seem like they may cause the error to crop up again. Thank you!
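One idea I’ve been weighing (not sure if it fully solves it): drop to_categorical and use sparse_categorical_crossentropy, so the targets stay a 1-D integer array instead of that (samples, vocab) float32 matrix. Roughly, inside train():

inputs, targets = input_sequences[:, :-1], input_sequences[:, -1]
# targets stay integer word indices; no one-hot expansion needed

self.model.compile(loss='sparse_categorical_crossentropy',
                   optimizer='adam', metrics=['accuracy'])

Would that alone bring the allocation down to something sane, or do I still need the file splitting on top of it?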

Going to bump this quickly; I wanted to get training started kinda soon, and I can’t continue until I know it won’t crash an hour or two in.
