Distributed ParameterServer setup

I managed to execute a simple distributed training run using the code below. I followed the documentation and several blog posts, and I have documented it here.

But I couldn’t find any instructions for using separate VMs to set up truly distributed training. I understand some cost is involved, but my goal is just to experiment with a simple setup. Am I right in assuming that the subprocesses in the code do not constitute a full-fledged distributed setup?

Others have set this up. Can you help?

import os
import json
import subprocess

import tensorflow as tf

# Allow GPU memory to grow on demand instead of reserving it all up front
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        # Memory growth must be set before the GPUs are initialized
        print(e)

# The cluster spec is a dictionary with one key per job,
# and the values are lists of task addresses (IP:port)
cluster_spec = {"worker": ["127.0.0.1:9901",   # placeholder local addresses
                           "127.0.0.1:9902"]}

# Set the TF_CONFIG environment variable before starting TensorFlow:
# a JSON-encoded dictionary containing a cluster specification (under the
# "cluster" key) and the type and index of the current task (under the
# "task" key)
for index, worker_address in enumerate(cluster_spec["worker"]):
    os.environ["TF_CONFIG"] = json.dumps({"cluster": cluster_spec,
                                          "task": {"type": "worker",
                                                   "index": index}})
    subprocess.Popen("python /home/jupyter/task.py",
                     shell=True)
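You are right that the subprocess loop above only simulates a cluster on one machine. On separate VMs there is no loop at all: each VM exports its own `TF_CONFIG` with the shared cluster spec and its own task index, then starts `task.py` once. A minimal sketch, assuming hypothetical internal IPs (`10.0.0.2`, `10.0.0.3`) that you would replace with your own VMs' addresses:

```python
import json
import os

# Hypothetical addresses: replace with the internal IPs of your own VMs.
cluster_spec = {"worker": ["10.0.0.2:9901", "10.0.0.3:9901"]}

def make_tf_config(task_index):
    # Every VM shares the same cluster spec but reports its OWN index.
    # No subprocess loop is needed: each machine starts exactly one task.
    return json.dumps({"cluster": cluster_spec,
                       "task": {"type": "worker", "index": task_index}})

# On the first VM you would run:
os.environ["TF_CONFIG"] = make_tf_config(0)
# On the second VM you would instead run:
# os.environ["TF_CONFIG"] = make_tf_config(1)

print(os.environ["TF_CONFIG"])
```

The training script itself stays the same as in the single-machine experiment; only the environment variable differs per VM, and the VMs must be able to reach each other on the listed ports.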

Hi @mohan_radhakrishnan ,

Could you please check the distributed_training and ParameterServerStrategy guides? They might help you with the distributed parameter server setup.

I hope these tutorials help you complete your setup.
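To make the link to those guides concrete: a parameter-server cluster typically has three job types — a chief (the coordinator), workers, and parameter servers — and each machine identifies itself through `TF_CONFIG`, just as in your single-machine experiment. A minimal sketch of that layout, with hypothetical IPs you would replace with your own VMs':

```python
import json

# Hypothetical cluster layout for a parameter-server setup: one chief
# (the coordinator), two workers, and one parameter server. Replace the
# addresses with the internal IPs of your own VMs.
cluster_spec = {
    "chief":  ["10.0.0.1:9900"],   # runs the coordinator program
    "worker": ["10.0.0.2:9901", "10.0.0.3:9901"],
    "ps":     ["10.0.0.4:9902"],   # holds the model variables
}

def tf_config_for(task_type, task_index):
    # Each VM exports this as its TF_CONFIG before starting TensorFlow.
    return json.dumps({"cluster": cluster_spec,
                       "task": {"type": task_type, "index": task_index}})

# For example, the first worker VM would export:
print(tf_config_for("worker", 0))
```

With that in place, the tutorials have the worker and ps machines run a `tf.distribute.Server`, while the chief reads the spec via `tf.distribute.cluster_resolver.TFConfigClusterResolver` and builds the `ParameterServerStrategy` from it — see the linked guides for the exact calls on your TensorFlow version.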