Distributed ParameterServer setup

Hello,
I managed to execute a simple distributed training run using the code below. I followed the documentation and several blogs, and I documented it here.

But I couldn’t find any instructions on using separate VMs to set up truly distributed training. I understand some cost is involved, but my goal is to experiment with a simple setup. Am I right in assuming that the subprocess approach in the code below is not a full-fledged distributed setup?

Others have set this up. Can you help?

import os
import json
import subprocess
import tensorflow as tf

# Allow memory growth so the processes don't grab all GPU memory up front.
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    for gpu in gpus:
      tf.config.experimental.set_memory_growth(gpu, True)
  except RuntimeError as e:
    print(e)

# The cluster spec is a dictionary with one key per job,
# and the values are lists of task addresses (IP:port).
cluster_spec = {"worker": ["127.0.0.1:9901",
                           "127.0.0.1:9902"]}

# Set the TF_CONFIG environment variable before starting TensorFlow:
# a JSON-encoded dictionary containing the cluster specification (under the
# "cluster" key) and the type and index of the current task (under the "task" key).
for index, worker_address in enumerate(cluster_spec["worker"]):
  # Pin each worker process to its own GPU.
  os.environ['CUDA_VISIBLE_DEVICES'] = str(index)
  os.environ["TF_CONFIG"] = json.dumps({"cluster": cluster_spec,
                                        "task": {"type": "worker",
                                                 "index": index}})
  subprocess.Popen("python /home/jupyter/task.py",
                   shell=True)
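
For what it's worth, my guess is that the same TF_CONFIG mechanism applies on separate VMs, except that each machine sets only its own task entry and the addresses are reachable IPs rather than 127.0.0.1. Something roughly like the sketch below, where the addresses and the "chief"/"ps" jobs are placeholders I made up, not something I have tested:

import json
import os

# Hypothetical cluster spread over separate VMs (addresses are placeholders).
# A parameter-server setup normally has "chief", "worker" and "ps" jobs.
cluster_spec = {
    "chief":  ["10.0.0.10:9901"],
    "worker": ["10.0.0.11:9901", "10.0.0.12:9901"],
    "ps":     ["10.0.0.20:9901"],
}

# On each VM, TF_CONFIG identifies which task that machine runs,
# e.g. on the first worker VM:
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": cluster_spec,
    "task": {"type": "worker", "index": 0},
})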

Hi @mohan_radhakrishnan ,

Could you please check the distributed_training and ParameterServerStrategy tutorials? They should help you set up a truly distributed ParameterServer configuration.

I hope these tutorials will help you to complete your setup.
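
As a rough illustration of what the ParameterServerStrategy tutorial walks through, the coordinator (chief) script usually looks something like this minimal sketch. It is untested, the model and dataset are placeholders, and each worker/ps VM would separately run a tf.distribute.Server as described in the tutorial:

import tensorflow as tf

# Read the cluster layout from TF_CONFIG (same JSON format as in your script,
# but with "chief", "worker" and "ps" jobs pointing at the separate VMs).
cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()

strategy = tf.distribute.experimental.ParameterServerStrategy(cluster_resolver)

with strategy.scope():
  # Variables created here are placed on the parameter servers.
  model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
  model.compile(optimizer="sgd", loss="mse")

# Placeholder dataset just to keep the sketch self-contained.
def dataset_fn(input_context):
  ds = tf.data.Dataset.from_tensor_slices(([[1.0]] * 64, [[2.0]] * 64))
  return ds.repeat().batch(8)

# With ParameterServerStrategy, Model.fit dispatches training steps to the workers.
model.fit(tf.keras.utils.experimental.DatasetCreator(dataset_fn),
          epochs=2, steps_per_epoch=10)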

Thanks.