Adopting Open-Source Dockerfiles for Official tf-nightly CI

The TensorFlow OSS DevInfra Team and TF SIG Build are developing new Dockerfiles in the SIG Build GitHub repo that we want to be used for all of TensorFlow’s official build and test environments. They are published to SIG Build on DockerHub. Our first milestone is to use the Dockerfiles to build the TF Nightly packages with the following goals:

  • Container-built packages are functionally identical to the current package
  • Developers (you!) can build the same packages that we do with minimal effort

That milestone is ready for verification. I’ve set up internal CI jobs that use the containers to build tf-nightly packages that are very similar to the current ones, and I’d like your help to evaluate them for functional differences. Starting on Monday the 30th, we’ve been using the containers to build our official tf-nightly packages.

Here is a set of packages we built at the same commits for advance comparison. There are minor cosmetic differences but we’d like your help to find out if there are any functional differences between packages on the same row of the table below.

Short Git Hash   Old Non-Docker Builds   New Docker Builds
5af3afc559       GPU Python 3.9          GPU Python 3.9
5af3afc559       GPU Python 3.8          GPU Python 3.8
5af3afc559       GPU Python 3.7          GPU Python 3.7
1d51452b18       CPU Python 3.9          CPU Python 3.9
1d51452b18       CPU Python 3.8          CPU Python 3.8
1d51452b18       CPU Python 3.7          CPU Python 3.7

Here’s how you can help us make the containers useful for you:

  • Install and compare the sample packages above. If you compare the two wheels for any of the rows, do they have any differences that would affect your workflow?
  • Check out the containers on DockerHub and the tf-nightly build instructions at the SIG Build repository. Are you able to build TensorFlow with them? If you use the same git hashes as above, how is your package different?
  • With the new packages that came out starting on Nov. 30, is anything different about them in a way that affects your workflow?
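
For a first pass at comparing the two wheels in a row, here is a minimal sketch that lists which files exist in only one wheel and which shared files differ in size or checksum. It is not part of the official tooling; the function names are hypothetical. Compiled libraries will often differ byte-for-byte between builds even when they behave the same, so a "changed" entry is only a hint about where to look, not proof of a functional difference.

```python
import zipfile


def wheel_manifest(path):
    """Map each file name inside a wheel to its (size, CRC-32) pair."""
    # A wheel is a zip archive, so the standard zipfile module can read it.
    with zipfile.ZipFile(path) as wheel:
        return {info.filename: (info.file_size, info.CRC)
                for info in wheel.infolist()}


def diff_wheels(old_path, new_path):
    """Summarize file-level differences between two wheels (hypothetical helper)."""
    old = wheel_manifest(old_path)
    new = wheel_manifest(new_path)
    shared = old.keys() & new.keys()
    return {
        "only_in_old": sorted(old.keys() - new.keys()),
        "only_in_new": sorted(new.keys() - old.keys()),
        "changed": sorted(name for name in shared if old[name] != new[name]),
    }
```

Calling `diff_wheels("old.whl", "new.whl")` (placeholder file names) on the two downloads from one row returns three sorted lists that you can paste into this thread.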

Please give all feedback in this thread. Thank you for your help!


If you have Docker (and nvidia-docker if you want to run GPU TensorFlow) set up already, here’s how to test out one of the packages linked in the OP from inside the containers:

CPU:

docker pull tensorflow/build:latest-python3.9
docker run -it --rm tensorflow/build:latest-python3.9 bash
wget https://storage.googleapis.com/tensorflow-nightly/prod/tensorflow/nightly_release/ubuntu_tfdkr/cpu_py39/6/20211117-000455/pkg/tf_nightly_cpu-2.8.0.dev20211117-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
pip install ./tf_nightly*
python
import tensorflow as tf

GPU with nvidia-docker:

docker pull tensorflow/build:latest-python3.9
docker run --gpus=all -it --rm tensorflow/build:latest-python3.9 bash
wget https://storage.googleapis.com/tensorflow-nightly/prod/tensorflow/nightly_release/ubuntu_tfdkr/gpu_py39/6/20211117-000458/pkg/tf_nightly-2.8.0.dev20211117-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
pip install ./tf_nightly*
python
import tensorflow as tf
tf.config.list_physical_devices('GPU')

Here’s a parallel topic for SIG Build contributors that we can also discuss here: container extensibility and collaboration. With regard to “what should be in the containers?”, I am strongly in favor of saying that the “Officially Supported” Dockerfiles should only contain code that the DevInfra team has pledged to support. We still need to decide exactly what this is, but here are some of the user stories I’ve considered and my own thoughts on whether they’ll get official support, based on matching our current testing needs:

  • Yes: Build and test on x86 (test targets need better definition)
  • Yes: Contributor utilities like pylint, clang-tidy rules
  • Yes: Support currently-receiving-security-upgrades release branches
  • Needs decision: custom-op functions for SIG Addons, IO, etc. (I want to get backing from leadership to guarantee support)
  • Needs decision: TF Lite tests / mobile builds (DevInfra is disconnected from this)
  • No: Other platforms like ARM, PowerPC, etc. DevInfra can’t support this.

I want the Dockerfiles to be good enough such that interested parties could copy the directory into a separate project that they maintain for their special needs (for example: minimized containers, containers for non-x86 platforms, containers for accelerators other than CUDA).

Any thoughts on any of this?


I’ve relaunched the TensorFlow Addons (WIP) GitHub Actions CI with these images at:

We currently have some issues because the custom-op images are unmaintained in TF. See more at:

What I see is that with these images we lose the small runtime and devel image sizes of the current “officially” published images (~400/600 MB), because the GPU/CUDA layers are not optional.

I suppose (and hope) that we also want to offer these images as a reproducible environment to prepare, compile, lint, and test PRs for community contributions. If the TF CI uses these same images, the local environment can be as close as possible to the CI environment, so there is some hope that we could be almost exactly on the same page.

If this is the case, I think low-overhead images would still be valuable when you want to contribute a quick PR. Often you don’t need a GPU, or a large GPU image, to contribute to TF, or you just want to minimize cloud costs and bootstrap waiting time when requesting cloud resources for a new development environment, e.g. GitHub Codespaces or a Kubernetes devcontainer pod.

As a side note, I still hope that we can integrate pre-commit hooks from:

Our PRs are full of formatting-request comments that require manual intervention. I hope we can enforce formatting more on the local dev machine with pre-commit hooks, to reduce linting round trips in PR comments and CI run cycles.

I hope that these new reproducible environments will enable read-only cache sharing from TensorFlow master, so that we can build TF in them in a reasonable amount of time on local dev hardware or cloud resources when we want to contribute a PR to TensorFlow.


If we check the layer size distribution, we have a single, very large layer:

[image: layer size distribution]

Every time step 25 is invalidated, we will pay a large new download and extraction time overhead (and probably disk space too?).


I’ve also tried a build with the CPU recipe plus the remote cache inside these images, at commit f3c361931fe449302953596e8f99fcd737b9390c (master):

bazel --bazelrc=/usertools/cpu.bazelrc build --config=sigbuild_remote_cache tensorflow/tools/pip_package:build_pip_package

I already see that we are getting many cache misses. How frequently is the remote cache updated? Is it only updated by the CI job orchestration for the nightly release?

I cannot see the orchestration and scheduling scripts as I could with GitHub Actions, but I suppose that if we only update the cache nightly we will be strongly affected by:

Would it be feasible to schedule your internal job (I mean the one without --remote_upload_local_results=false) on every master commit, or at least after every LLVM update/sync?

If not:

  • Can we publish the current nightly master commit somewhere, so we know which commit the remote cache corresponds to?
  • Is it safe enough to contribute a PR starting from the “nightly commit”?

Edit:
One last point from my first quick pass.

For a developer image, or if we want to derive an official developer image, it is quite annoying to find root-owned files on the host, created from inside the container in your host-mounted TF source path. It is probably fine if you always create a temporary source checkout, or if we suggest keeping the TF source in a named volume, but I suppose this will not be the main use case.

We already discussed this on the Build repository some months ago, and we have since introduced a default user in the official Keras developer image.

There aren’t many upstream solutions, so I think we could introduce a default user.


the containers gain 4GB from CUDA

I’m punting this until we have usage data and rationale. Splitting the containers preemptively would add a lot of maintenance burden. Our internal CI, which I’m targeting first, doesn’t need it.

the remote cache is not consistently useful yet

I have been wondering about how this will turn out. Right now, the cache gets pushed once per day with an inconsistent commit. Using nightly won’t come until the next milestone of work is done. I think what we’ll probably do in the future is to make sure we push the cache with every day’s nightly tag and encourage developers to start from there. Most of the time, I think that should give a good improvement over the current situation.

the container creates root-owned files

This is a low-priority task for the moment. For my work I am currently focusing on our internal CI, where the permissions are not a problem. Feel free to work on a PR, though. I don’t want to accept the hassle of maintaining a user in the image unless it can very easily match the user and group inside the container to the permissions on the volumes, e.g. if my user is 396220:89939 then I shouldn’t end up with files owned by 1000:1000.
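
To illustrate the kind of UID/GID matching meant here, the sketch below builds a `docker run` command that starts the container as the calling host user via Docker's `--user uid:gid` flag, so files written to the mounted source tree keep the host owner. This is a run-time workaround, not a change to the image, and it rests on assumptions: a POSIX host (for `os.getuid`), the `/tf/tensorflow` mount point used elsewhere in this thread, and a hypothetical helper name. The chosen UID usually has no passwd entry or home directory inside the container, which some tools may not tolerate.

```python
import os
import shlex


def docker_run_as_host_user(image, source_dir, mount_point="/tf/tensorflow"):
    """Build a docker run command that runs as the calling host user (sketch)."""
    uid, gid = os.getuid(), os.getgid()  # POSIX only
    return [
        "docker", "run", "--rm", "-it",
        # Run as the host UID:GID so new files on the volume match the host owner.
        "--user", f"{uid}:{gid}",
        # Bind-mount the host checkout into the container.
        "-v", f"{os.path.abspath(source_dir)}:{mount_point}",
        "-w", mount_point,
        image, "bash",
    ]


# Print a copy-pasteable command for the current directory.
print(shlex.join(docker_run_as_host_user("tensorflow/build:latest-python3.9", ".")))
```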

formatting and precommit hooks aren’t available yet

Those are still on the roadmap, but not until Q1 at the earliest.


I’m punting this until we have usage data and rationale. Splitting the containers preemptively would add a lot of maintenance burden. Our internal CI, which I’m targeting first, doesn’t need it.

If we still think that these two needs must conflict instead of converging, we are probably missing a great opportunity in this refactoring. I hope we can find the right balance, so that the local environment we distribute, in which contributors prepare PRs, stays on the same page as the automation that validates them (CI).

I think this CI vs. dev-env split could create friction quite soon, because the CI use case will dominate once it is “in production”. IMHO it is better to co-design early if possible.

I have been wondering about how this will turn out. Right now, the cache gets pushed once per day with an inconsistent commit. Using nightly won’t come until the next milestone of work is done. I think what we’ll probably do in the future is to make sure we push the cache with every day’s nightly tag and encourage developers to start from there. Most of the time, I think that should give a good improvement over the current situation.

Not all commits are the same, as you can see in the forum thread mentioned above. From the small experiments @mihaimaruseac reported in that thread, it also seems that the LLVM “daily syncs” invalidate many targets (and so the cache).

Working on GitHub PRs on top of “the last” nightly could be OK, but we really need to understand what happens when we need to resolve conflicts, or when we ask a developer to rebase or to merge master into a PR under review.

I don’t want to accept the hassle of maintaining a user in the image unless it can very easily match the user and group inside the container to the permissions on the volumes, e.g. if my user is 396220:89939 then I shouldn’t end up with files owned by 1000:1000.

I suppose you know that this is not possible, as it was explored in a quite long upstream thread:

If we really don’t want a standard 1000 user, it would probably be better, once we have an official devel image again, not to suggest mounting a host path with the TF source into the container.
We could instead suggest checking out the source directly in a named volume, so that we don’t mix host path permissions with root permissions in the container.

EDIT:
An alternative:
We could suggest that users run rootless Docker with UID mapping, since it is no longer experimental.

I still think a default user would cover the more frequent use cases, including the standard rootful Docker installation.

But if we don’t want to support this, we could at least strongly suggest rootless Docker, as it avoids the permission problems on new files created in the container on a host-shared path that we have with the current setup/docs.

Note:
These two alternatives are mutually incompatible, because with rootless Docker the host user is currently mapped only to root in the container.
See more:


In the meantime I have prepared two PRs and a suggestion for the ctrl+c issue in the Docs:


Update: I switched TensorFlow’s tf-nightly build process over last night, and the resulting tf-nightly packages were built with Docker.


Can we tag the nightly commit in the GitHub repository?

git clone cannot shallow clone a specific commit hash, but it can shallow clone a tag. A tag would also make it quick to identify the latest nightly wheel that corresponds to a specific tag/commit in the repo.

Currently we need to clone the whole TF repository every time and then hard reset to the specific commit.
It is also not clear how to quickly identify the commit hash of a nightly other than by installing the nightly wheel.

This is why in the new Docker docs we claim:

The nightly tag on GitHub is not related to the tf-nightly packages.
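
Until such a tag exists, one way to recover the commit from an installed nightly wheel is to parse `tf.__git_version__`, which looks like `git describe` output ending in `-g<short hash>`. The sketch below is a small, hypothetical helper that does only the string parsing, so it runs without TensorFlow; the sample version string is made up for illustration (only the hash is taken from this thread), and the exact format is an assumption.

```python
import re


def commit_from_git_version(git_version):
    """Extract the short commit hash from a git-describe style string, or None."""
    # Assumed shape: <tag>-<commits since tag>-g<hash>, optionally "-dirty".
    match = re.search(r"-g([0-9a-f]+)(?:-dirty)?$", git_version)
    return match.group(1) if match else None


# Made-up example string; with TensorFlow installed you would pass
# tf.__git_version__ instead.
print(commit_from_git_version("v1.12.1-68000-g13adf6272a4"))  # 13adf6272a4
```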

I don’t know if the commit is correct, but I tried this locally:

docker run --rm  -it tensorflow/build:latest-python3.9 /bin/bash -c "git clone  https://github.com/tensorflow/tensorflow.git --single-branch /tf/tensorflow && cd /tf/tensorflow/ && git reset --hard 13adf6272a4 && bazel --bazelrc=/usertools/cpu.bazelrc build  --config=sigbuild_remote_cache tensorflow/tools/pip_package:build_pip_package"

At some point, with 13790 processes done, I got:

INFO: Analyzed target //tensorflow/tools/pip_package:build_pip_package (434 packages loaded, 27055 targets configured).
INFO: Found 1 target...
WARNING: Reading from Remote Cache:
BulkTransferException
ERROR: /tf/tensorflow/tensorflow/compiler/mlir/tensorflow/BUILD:572:11: C++ compilation of rule '//tensorflow/compiler/mlir/tensorflow:tensorflow_ops' failed (Exit 4): crosstool_wrapper_driver_is_not_gcc failed: error executing command external/ubuntu18.04-gcc7_manylinux2010-cuda11.2-cudnn8.1-tensorrt7.2_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -MD -MF ... (remaining 196 argument(s) skipped)
gcc: internal compiler error: Killed (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See <https://gcc.gnu.org/bugs/> for instructions.
Target //tensorflow/tools/pip_package:build_pip_package failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 4408.405s, Critical Path: 1140.54s
INFO: 13790 processes: 8985 remote cache hit, 3772 internal, 1033 local.
FAILED: Build did NOT complete successfully
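
To make runs like this easier to compare, here is a rough sketch that turns the `processes:` summary line into a remote cache hit rate. The helper names are hypothetical, and it assumes the summary keeps the comma-separated "<count> <kind>" shape shown above. Actions reported as "internal" are left out of the denominator on the assumption that they are never served from the remote cache.

```python
import re


def parse_process_summary(line):
    """Parse a Bazel 'N processes: ...' summary line into (total, counts)."""
    match = re.search(r"(\d+) processes: (.+?)\.?\s*$", line)
    if not match:
        return None
    counts = {}
    for part in match.group(2).split(","):
        # Each part looks like "8985 remote cache hit".
        number, _, kind = part.strip().partition(" ")
        counts[kind] = int(number)
    return int(match.group(1)), counts


def remote_hit_rate(total, counts):
    """Share of non-internal actions that were remote cache hits."""
    executed = total - counts.get("internal", 0)
    return counts.get("remote cache hit", 0) / executed if executed else 0.0


line = "INFO: 13790 processes: 8985 remote cache hit, 3772 internal, 1033 local."
total, counts = parse_process_summary(line)
print(round(remote_hit_rate(total, counts), 3))  # 0.897
```

By this measure the failed run above still took roughly 90% of its non-internal actions from the cache before the compiler was killed.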

I’ve tried again with today’s nightly, using a fully reproducible command (I suppose you write the cache at the same commit as the available nightly wheel):

docker run --rm -it tensorflow/build:latest-python3.9 /bin/bash -c "git clone  --depth 200 https://github.com/tensorflow/tensorflow.git --single-branch  /tf/tensorflow && pip install tf-nightly-cpu && python -c \"import tensorflow as tf; import re; print(re.search('g(\S+)',tf.__git_version__).group(1))\" | GIT_DIR=/tf/tensorflow/.git xargs git reset --hard && cd /tf/tensorflow && bazel --bazelrc=/usertools/cpu.bazelrc build --config=sigbuild_remote_cache  --verbose_failures tensorflow/tools/pip_package:build_pip_package"

But I start missing the cache after around 5000 to 6000 actions, e.g.:

 tensorflow/compiler/xla/service/*

I’ve added an exec log PR to debug these cache misses, both on your side in the CI between two runs of the same commit build, and on our side on a different machine with the same Docker environment.

I hope you can make this log available from the CI at some point.

Thanks