Distributed Training in TensorFlow with AI Platform & Docker

Hi folks,

I am pleased to share my latest blog post with you: Distributed Training in TensorFlow with AI Platform & Docker.

It will walk you through the steps of running distributed training in TensorFlow with AI Platform training jobs and Docker. Below, I explain the motivation behind this blog post:
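For context, the TensorFlow side of such a setup usually boils down to building the model inside a tf.distribute.MultiWorkerMirroredStrategy scope, with the same script running in the Docker container on every worker. The snippet below is only a minimal sketch, not code from the blog post; the dataset and model are placeholders:

```python
import tensorflow as tf

# Each worker runs this same script inside the container; the training
# service injects the TF_CONFIG environment variable, which the strategy
# reads to discover the other workers.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

def make_dataset():
    # Placeholder data; in practice you would read from Cloud Storage.
    x = tf.random.normal((1024, 10))
    y = tf.random.uniform((1024, 1), maxval=2, dtype=tf.int32)
    return tf.data.Dataset.from_tensor_slices((x, y)).batch(64).repeat()

with strategy.scope():
    # Model creation and compilation must happen inside the strategy scope
    # so variables are mirrored across workers.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")

model.fit(make_dataset(), epochs=5, steps_per_epoch=16)
```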

If you are conducting large-scale training, you are likely using a powerful remote machine over SSH, for example a virtual machine in the cloud. Even if you are not using Jupyter Notebooks, problems like broken SSH pipes and network drops can easily occur. The situation gets far worse when the connection is lost and you forget to turn the virtual machine off: unless you have set up alerts and some fault tolerance, you keep getting billed for resources that are doing practically nothing.

To address these kinds of problems, we would want the following in the training pipeline (see the sketch after this list):

  • A training workflow that is fully managed by a secure and reliable service with high availability.
  • The service should automatically provision and de-provision the resources we ask for, so that we are only charged for what is actually consumed.
  • The service should also be very flexible. It must not introduce too much technical debt into our existing pipelines.
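AI Platform training jobs fit these requirements: you submit a job that points at your Docker image, the service spins up the machines, runs the container on each of them, and tears everything down when training finishes. As a rough illustration (not taken from the blog post; the project ID, image URI, and machine types are made-up placeholders), a job with one chief and two workers can be submitted through the Python API client like this:

```python
from googleapiclient import discovery

# Placeholder values; replace with your own project, image, and region.
PROJECT_ID = "my-project"
IMAGE_URI = "gcr.io/my-project/trainer:latest"

job_spec = {
    "jobId": "distributed_training_job_001",
    "trainingInput": {
        "region": "us-central1",
        "scaleTier": "CUSTOM",
        "masterType": "n1-standard-8",
        "masterConfig": {"imageUri": IMAGE_URI},
        "workerType": "n1-standard-8",
        "workerCount": 2,
        "workerConfig": {"imageUri": IMAGE_URI},
    },
}

# The service provisions the machines, runs the container on each of them,
# and de-provisions everything once the job succeeds or fails.
ml = discovery.build("ml", "v1")
ml.projects().jobs().create(
    parent=f"projects/{PROJECT_ID}", body=job_spec
).execute()
```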

Happy to address any feedback.


Nice! Another option is TensorFlow Cloud (tensorflow-cloud).
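For reference, the typical tensorflow-cloud entry point is a single tfc.run() call that packages your script into a Docker image and launches it as an AI Platform training job. This is only an illustrative sketch; the entry point path and machine settings are placeholders:

```python
import tensorflow_cloud as tfc

# Packages train.py into a Docker image, pushes it, and launches an
# AI Platform training job; the settings below are just examples.
tfc.run(
    entry_point="train.py",
    requirements_txt="requirements.txt",
    distribution_strategy="auto",
    worker_count=1,
)
```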


Sure, I should add it to the post. But then again, the moment one tries to switch to a different framework, tensorflow-cloud would likely break.


Hi there!
I'd also like to share an article that may be useful to you, on Cloud Agnostic vs Cloud Native: This is How to Get the Most out of Your Cloud Adoption Approach.
Here is the link - https://www.avenga.com/magazine/cloud-agnostic-vs-cloud-native/