Issue with TPUs on Google Colab when training BERT

Hi there,

I’m pretty new at this, so I’m not sure whether this is a bug or I’m doing something wrong.

I’m trying to train BERT on a TPU in Google Colab; however, I’m getting an error message, which can be seen here.

TensorFlow version: 2.8.0

The code I’m using to set up the TPU is largely based on Google’s original T5 pre-training code, taken from here:

print("Installing dependencies...")
%tensorflow_version 2.x

import functools
import os
import time
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

import tensorflow.compat.v1 as tf
import tensorflow_datasets as tfds

BASE_DIR = "gs://bucket-xx" #@param { type: "string" }
if not BASE_DIR or BASE_DIR == "gs://":
  raise ValueError("You must enter a BASE_DIR.")
DATA_DIR = os.path.join(BASE_DIR, "data/text.csv")
MODELS_DIR = os.path.join(BASE_DIR, "models/bert")
ON_CLOUD = True


if ON_CLOUD:
  print("Setting up GCS access...")
  import tensorflow_gcs_config
  from google.colab import auth
  # Set credentials for GCS reading/writing from Colab and TPU.
  TPU_TOPOLOGY = "v2-8"
  try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  # TPU detection
    TPU_ADDRESS = tpu.get_master()
    print('Running on TPU:', TPU_ADDRESS)
  except ValueError:
    raise BaseException('ERROR: Not connected to a TPU runtime; please see the previous cell in this notebook for instructions!')
  auth.authenticate_user()
  tf.enable_eager_execution()
  tf.config.experimental_connect_to_host(TPU_ADDRESS)
  tensorflow_gcs_config.configure_gcs_from_colab_auth()

tf.disable_v2_behavior()

# Improve logging.
from contextlib import contextmanager
import logging as py_logging

if ON_CLOUD:
  tf.get_logger().propagate = False
  py_logging.root.setLevel('INFO')

@contextmanager
def tf_verbosity_level(level):
  og_level = tf.logging.get_verbosity()
  tf.logging.set_verbosity(level)
  try:
    yield
  finally:
    # Restore the original verbosity even if the block raises.
    tf.logging.set_verbosity(og_level)
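
For reference, my understanding from the TF2 documentation is that the native (non-compat) way to connect to the Colab TPU looks roughly like the sketch below; I’m not sure whether mixing the v1-compat setup above with TensorFlow 2.8 is part of my problem. (The resolver/strategy names are just mine.)

import tensorflow as tf

# TF2-native TPU setup (sketch): detect the TPU, connect this runtime to it,
# and initialize the TPU system before building any models.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

# Model creation and training would then happen under this strategy's scope.
strategy = tf.distribute.TPUStrategy(resolver)
print("TPU devices:", tf.config.list_logical_devices("TPU"))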

This is the command I’m using to train BERT:

!python /content/scripts/run_mlm.py \
--model_name_or_path bert-base-cased \
--tpu_num_cores 8 \
--validation_split_percentage 20 \
--line_by_line \
--learning_rate 2e-5 \
--per_device_train_batch_size 128 \
--per_device_eval_batch_size 256 \
--num_train_epochs 4 \
--output_dir {MODELS_DIR} \
--train_file /content/text.csv

The run_mlm.py script is taken from the original transformers repo and can be seen here.
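
In case it’s relevant: I’m relying on IPython expanding {...} Python expressions inside ! shell commands, so {MODELS_DIR} should resolve to the GCS path before the script runs. A quick sanity check I would use (purely illustrative):

# Both should show the same GCS path defined in the setup cell above.
print(MODELS_DIR)
!echo --output_dir {MODELS_DIR}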

I found a very similar issue, which can be seen here.

Any help is much appreciated. Thanks!