Connection timeout Error while training a TF Lite Model

Hi,
I am training a TF Lite model on an HPC cluster. While executing the script, I get a URLError: <urlopen error [Errno 110] Connection timed out>.

The main challenge is that the cluster is not connected to the internet. Is there any way to resolve this error without connecting to the internet, such as installing some packages offline?

Below is the output that I get

Traceback (most recent call last):
File “/users/analysis/e40070822/anaconda3/lib/python3.10/urllib/request.py”, line 1348, in do_open
h.request(req.get_method(), req.selector, req.data, headers,
File “/users/analysis/e40070822/anaconda3/lib/python3.10/http/client.py”, line 1282, in request
self._send_request(method, url, body, headers, encode_chunked)
File “/users/analysis/e40070822/anaconda3/lib/python3.10/http/client.py”, line 1328, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File “/users/analysis/e40070822/anaconda3/lib/python3.10/http/client.py”, line 1277, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File “/users/analysis/e40070822/anaconda3/lib/python3.10/http/client.py”, line 1037, in _send_output
self.send(msg)
File “/users/analysis/e40070822/anaconda3/lib/python3.10/http/client.py”, line 975, in send
self.connect()
File “/users/analysis/e40070822/anaconda3/lib/python3.10/http/client.py”, line 1447, in connect
super().connect()
File “/users/analysis/e40070822/anaconda3/lib/python3.10/http/client.py”, line 941, in connect
self.sock = self._create_connection(
File “/users/analysis/e40070822/anaconda3/lib/python3.10/socket.py”, line 845, in create_connection
raise err
File “/users/analysis/e40070822/anaconda3/lib/python3.10/socket.py”, line 833, in create_connection
sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File “/users/analysis/e40070822/SeatCV/src/train_test_model.py”, line 249, in <module>
main()
File “/users/analysis/e40070822/SeatCV/src/train_test_model.py”, line 233, in main
_, output_dir, model_fp = train_test(dataset_dir)
File “/users/analysis/e40070822/SeatCV/src/train_test_model.py”, line 191, in train_test
model, model_name = tune(train_data, validation_data, True)
File “/users/analysis/e40070822/SeatCV/src/train_test_model.py”, line 167, in tune
model = object_detector.create(train_data=train_data,
File “/users/analysis/e40070822/anaconda3/lib/python3.10/site-packages/tensorflow_examples/lite/model_maker/core/task/object_detector.py”, line 260, in create
object_detector.train(train_data, validation_data, epochs, batch_size)
File “/users/analysis/e40070822/anaconda3/lib/python3.10/site-packages/tensorflow_examples/lite/model_maker/core/task/object_detector.py”, line 118, in train
self.create_model()
File “/users/analysis/e40070822/anaconda3/lib/python3.10/site-packages/tensorflow_examples/lite/model_maker/core/task/object_detector.py”, line 74, in create_model
self.model = self.model_spec.create_model()
File “/users/analysis/e40070822/anaconda3/lib/python3.10/site-packages/tensorflow_examples/lite/model_maker/core/task/model_spec/object_detector_spec.py”, line 238, in create_model
return train_lib.EfficientDetNetTrainHub(
File “/users/analysis/e40070822/anaconda3/lib/python3.10/site-packages/tensorflow_examples/lite/model_maker/third_party/efficientdet/keras/train_lib.py”, line 862, in __init__
self.base_model = hub.KerasLayer(hub_module_url, trainable=True)
File “/users/analysis/e40070822/anaconda3/lib/python3.10/site-packages/tensorflow_hub/keras_layer.py”, line 153, in __init__
self._func = load_module(handle, tags, self._load_options)
File “/users/analysis/e40070822/anaconda3/lib/python3.10/site-packages/tensorflow_hub/keras_layer.py”, line 449, in load_module
return module_v2.load(handle, tags=tags, options=set_load_options)
File “/users/analysis/e40070822/anaconda3/lib/python3.10/site-packages/tensorflow_hub/module_v2.py”, line 92, in load
module_path = resolve(handle)
File “/users/analysis/e40070822/anaconda3/lib/python3.10/site-packages/tensorflow_hub/module_v2.py”, line 47, in resolve
return registry.resolver(handle)
File “/users/analysis/e40070822/anaconda3/lib/python3.10/site-packages/tensorflow_hub/registry.py”, line 51, in __call__
return impl(*args, **kwargs)
File “/users/analysis/e40070822/anaconda3/lib/python3.10/site-packages/tensorflow_hub/compressed_module_resolver.py”, line 67, in __call__
return resolver.atomic_download(handle, download, module_dir,
File “/users/analysis/e40070822/anaconda3/lib/python3.10/site-packages/tensorflow_hub/resolver.py”, line 418, in atomic_download
download_fn(handle, tmp_dir)
File “/users/analysis/e40070822/anaconda3/lib/python3.10/site-packages/tensorflow_hub/compressed_module_resolver.py”, line 63, in download
response = self._call_urlopen(request)
File “/users/analysis/e40070822/anaconda3/lib/python3.10/site-packages/tensorflow_hub/resolver.py”, line 522, in _call_urlopen
return urllib.request.urlopen(request)
File “/users/analysis/e40070822/anaconda3/lib/python3.10/urllib/request.py”, line 216, in urlopen
return opener.open(url, data, timeout)
File “/users/analysis/e40070822/anaconda3/lib/python3.10/urllib/request.py”, line 519, in open
response = self._open(req, data)
File “/users/analysis/e40070822/anaconda3/lib/python3.10/urllib/request.py”, line 536, in _open
result = self._call_chain(self.handle_open, protocol, protocol +
File “/users/analysis/e40070822/anaconda3/lib/python3.10/urllib/request.py”, line 496, in _call_chain
result = func(*args)
File “/users/analysis/e40070822/anaconda3/lib/python3.10/urllib/request.py”, line 1391, in https_open
return self.do_open(http.client.HTTPSConnection, req,
File “/users/analysis/e40070822/anaconda3/lib/python3.10/urllib/request.py”, line 1351, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 110] Connection timed out>

Thanks in advance!!!

The URLError: <urlopen error [Errno 110] Connection timed out> occurs because TensorFlow Hub is trying to download a pre-trained model from the internet, which fails because your HPC cluster has no internet access. In your traceback, the download is triggered by hub.KerasLayer(hub_module_url, trainable=True) in Model Maker's train_lib.py.

To work around this issue without requiring an internet connection, you can manually download the necessary TensorFlow Hub modules or any other dependencies on a machine with internet access and then transfer them to your HPC Cluster. Here’s a step-by-step guide on how to do this:

1.	Identify Required Modules: First, determine which TensorFlow Hub modules or other resources your code is attempting to download. This information is usually present in the code where the TensorFlow Hub model is being loaded (e.g., hub.KerasLayer(hub_module_url, trainable=True) in your traceback).
2.	Manual Download: On a machine with internet access, download the required modules or resources. For TensorFlow Hub modules, you can find the module on the TensorFlow Hub website and download it from its page. A tfhub.dev handle can also be fetched as a .tar.gz archive by appending ?tf-hub-format=compressed to the module URL.
3.	Transfer to HPC Cluster: Once you’ve downloaded the necessary files, transfer them to your HPC Cluster using a method appropriate for your environment (e.g., SCP, SFTP).
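4.	Local Loading: Modify your code to load the module from the local filesystem instead of downloading it, by replacing the URL passed to hub.KerasLayer (or a similar function) with the local path to the extracted module. Note that in your traceback this call sits inside Model Maker's train_lib.py rather than in your own script, so the local path has to be supplied through whatever sets hub_module_url (it appears to come from the model spec) instead of by editing the call itself. For example: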

# Instead of using a URL, use the local path to the module
import tensorflow_hub as hub

hub_module_path = '/path/to/downloaded/module'
model = hub.KerasLayer(hub_module_path, trainable=True)

5.	Dependency Management: If your script needs other dependencies from the internet (e.g., Python packages), you'll need to download and transfer those as well. On the connected machine, run pip download -d /path/to/downloaded/packages <package> to fetch a package and its dependencies, then install on the HPC cluster with pip install --no-index --find-links=/path/to/downloaded/packages <package>. The download machine should match the cluster's operating system and Python version so that the downloaded wheels are compatible.
6.	Verification: After setting up everything locally, run your script again to ensure that all dependencies are correctly resolved and that your model can be trained without internet access.
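
Steps 2 to 4 can be sketched roughly as follows. This is a minimal sketch, not the only way to do it: the function names (compressed_url, download_module, extract_module) are made up for illustration, and it assumes the module is served from a tfhub.dev-style URL that accepts ?tf-hub-format=compressed and unpacks to a directory containing saved_model.pb.

```python
import os
import tarfile
import urllib.request


def compressed_url(handle: str) -> str:
    """Build the URL that serves a hub handle as a .tar.gz archive."""
    separator = "&" if "?" in handle else "?"
    return f"{handle}{separator}tf-hub-format=compressed"


def download_module(handle: str, archive_path: str) -> str:
    """Run on a machine WITH internet access: save the module archive to disk."""
    urllib.request.urlretrieve(compressed_url(handle), archive_path)
    return archive_path


def extract_module(archive_path: str, module_dir: str) -> str:
    """Run on the cluster after copying the archive over (e.g. with scp)."""
    os.makedirs(module_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(module_dir)
    # Assumption: a usable module directory has saved_model.pb at its top level.
    if not os.path.isfile(os.path.join(module_dir, "saved_model.pb")):
        raise FileNotFoundError(f"no saved_model.pb found in {module_dir}")
    return module_dir
```

On the connected machine you would call download_module with the exact URL your code passes to hub.KerasLayer, copy the resulting archive to the cluster, call extract_module there, and then use the returned directory as the local path in step 4.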

This approach requires manual intervention, but it’s a common workaround for environments without direct internet access. Keep in mind that you’ll need to repeat this process if you change your model or require additional TensorFlow Hub modules or other dependencies.
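
If you would rather not touch the code that builds the model, another option that may work is to pre-populate the TensorFlow Hub download cache. The sketch below assumes that tensorflow_hub looks for a cached module in a subdirectory of TFHUB_CACHE_DIR named after the SHA-1 hex digest of the handle string. That is an implementation detail of the library rather than a documented contract, so please verify it against your installed version. The helper name cached_module_dir is made up for illustration.

```python
import hashlib
import os


def cached_module_dir(handle: str, cache_dir: str) -> str:
    """Directory where tensorflow_hub is assumed to look for a cached `handle`."""
    # Assumption: the cache subdirectory name is sha1(handle) as a hex string.
    digest = hashlib.sha1(handle.encode("utf8")).hexdigest()
    return os.path.join(cache_dir, digest)
```

To use it, extract the downloaded archive into cached_module_dir(handle, cache_dir) on the cluster and set the TFHUB_CACHE_DIR environment variable to cache_dir before the model is created. The handle must be exactly the URL string the library passes to hub.KerasLayer, and the cache directory should be writable.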

Thank you… After going through the steps I was able to run the code successfully