Train a model on another machine using data from the main machine

Saish · August 11, 2021, 7:38am

Setup

A server (server A) with 150 TB of storage.
An NVIDIA DGX with 4 GPUs, connected to the same Wi-Fi network as server A but a separate machine.

Problem we are facing:

All of the training data is on server A, and the DGX has the computing power.
We are trying to find a way to train a model on the DGX while reading the data from server A.
We tried TensorFlow distributed training and specified the DGX's IP as the cluster worker. The code ran from server A, reported a gRPC server running (tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:411] Started server with target: grpc://ip_address:port), and the model started training.
But there is no GPU utilization on the DGX, nor is any process running there.
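
For reference, a cluster like the one described above is usually declared through the TF_CONFIG environment variable, which TensorFlow's multi-worker strategies read at startup. The sketch below only shows that kind of configuration and is not necessarily the exact code we ran: the addresses, the port, and the helper name build_tf_config are placeholders chosen for illustration. It uses only the Python standard library, so it shows the shape of the configuration rather than starting any training.

```python
import json
import os


def build_tf_config(worker_addresses, task_index):
    """Return the TF_CONFIG JSON string for one worker in the cluster."""
    if not 0 <= task_index < len(worker_addresses):
        raise ValueError("task_index must refer to an entry in worker_addresses")
    return json.dumps({
        # Every machine lists the same workers, in the same order.
        "cluster": {"worker": list(worker_addresses)},
        # Each machine identifies itself by its position in that list.
        "task": {"type": "worker", "index": task_index},
    })


# Placeholder addresses (assumptions, not the real values from our setup).
WORKERS = ["192.0.2.10:12345", "192.0.2.20:12345"]  # [server A, DGX]

# Example: the value the second machine in the list would set for itself
# before launching the training script.
os.environ["TF_CONFIG"] = build_tf_config(WORKERS, task_index=1)
print(os.environ["TF_CONFIG"])
```

With this style of configuration, each listed worker is expected to launch the training script itself with its own task index. Listing a machine's address in the cluster does not start a process on that machine.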

Is there a good resource that explicitly covers training on a remote GPU machine with the data held on a local storage server, without using cloud pipelines like GCP? Or are there any changes we should make to the configuration above?