LLVM updates and Bazel cache

Oh, I definitely agree that we need to go deeper still. I’m certain that removing the monolithic core:lib and core:framework targets (and similar ones) will help a lot, as will cc_shared_library and splitting the kernels into smaller libraries. We’ll basically need a year-long roadmap to untangle all of this and make development easier.

The 20-minute build is on a specialist machine: no RBE, but a lot of power. I tried to reproduce the same build on my personal laptop (see stats below; it was in the top percentile for performance some 4-5 years ago, afaik) and I gave up after almost 9 hours of compilation (almost twice as long as it would have taken to compile the Linux kernel). I think at that point Bazel’s JVM memory overhead caused too much slowdown for the experiment to be meaningful.

...$ cat /proc/cpuinfo      # 8 CPUs
...
processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 94
model name : Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
stepping : 3
microcode : 0xe2
cpu MHz : 800.052
cache size : 6144 KB
...
...$ free -h
              total        used        free      shared  buff/cache   available
Mem:          7.2Gi       954Mi       2.8Gi       181Mi       3.4Gi       5.7Gi
Swap:         7.3Gi       1.4Gi       5.9Gi
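
For what it’s worth, on a machine like that I would first try capping Bazel’s own resource usage so the JVM and the compiler jobs don’t fight over the 8 GB of RAM. A minimal sketch; the heap size, job count, and target are illustrative guesses, not tuned recommendations:

$ # Cap the Bazel server JVM heap and limit the parallelism/RAM claimed by actions.
$ bazel --host_jvm_args=-Xmx2g build \
    --jobs=4 \
    --local_ram_resources=4096 \
    //tensorflow/tools/pip_package:build_pip_package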

Regarding who is affected by this: I think users who develop on multiple branches will be. It’s unlikely that they created all those branches at the same commit, so whenever they switch from one branch to another (and given our PR review times, this is frequent) their cache gets invalidated.
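
A persistent local disk cache softens this a bit, since switching back to a branch you have already built can reuse old action outputs (it does not help across the LLVM bumps discussed below). A minimal sketch; the cache path is arbitrary:

$ # Keep build outputs across branch switches in a local content-addressed cache;
$ # only actions whose inputs are unchanged will get cache hits.
$ bazel build --disk_cache=$HOME/.cache/tf-bazel-cache //tensorflow/...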

Then there are users who are told in PR review to rebase onto master and then run additional tests. API golden generation, for example, has been an issue: I recall at least 3-4 PRs where someone at Google had to regenerate the goldens after the manual import because the external contributor could not compile the generator in a reasonable time.

I think over 50% of PRs are not run through CI locally at all; they rely only on the presubmits we run on Kokoro.

And then there are two other use cases for speeding up the build and reducing the cache invalidation rate: we could add a remote cache and GitHub Actions-based presubmits for faster turnaround and less flakiness, and we could finally enable presubmits for older release branches. These are low-priority/low-frequency scenarios, though.


Is there a Colab notebook example of checking out and building TF?

I tried building TF on my work laptop but could not get Bazel working due to certificate hassles.

No, and I suppose that the notebook would hit a timeout, just like the small test with a public GitHub Action to build TF inside our Docker container did.

I’ve recently contributed a few PRs again to the TF main repo.

I can confirm that the daily LLVM updates are currently one of the major usability bottlenecks for contributing to the TensorFlow (core) repository, as they continuously invalidate the Bazel disk cache even in a controlled and reproducible environment like a Docker container.
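
You can see the pace of these invalidations directly from the history of the LLVM pin in the tree. A small sketch, assuming the pin still lives in third_party/llvm/workspace.bzl (the file that defines LLVM_COMMIT):

$ # Show the most recent bumps of the pinned LLVM commit; each bump invalidates
$ # the cached actions for everything that depends on LLVM/MLIR.
$ git log --oneline -5 -- third_party/llvm/workspace.bzl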

At the same time, it is still hard to know from which specific commit we should start a PR in order to reuse (read-only) the public cache produced while building the official TF nightly wheels.
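
One rough way to recover the commit a given nightly wheel was built from is the git version string baked into the package; this assumes tf.version.GIT_VERSION is populated in tf-nightly, which I believe it is:

$ pip install tf-nightly
$ # GIT_VERSION ends with the short SHA the wheel was built from (e.g. ...-gabcdef123);
$ # branching from that commit maximizes the chance of cache reuse.
$ python -c "import tensorflow as tf; print(tf.version.GIT_VERSION)"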

/cc @yarri-oss @thea

Unfortunately we cannot add a stable interface between LLVM and TF; attempts to approach this have run into issues in the past.

Hence, we have to live with daily LLVM updates until TF can be properly modularized.

At the same time, it is still hard to know from which specific commit we should start a PR in order to reuse (read-only) the public cache produced while building the official TF nightly wheels.

There is also this part of the post to explore, as I think TF modularization is a very long-term task, if and when it gets scheduled as a priority. We have already discussed the status of many approved but old RFCs in this thread.

Given the recent long thread on the LLVM/MLIR Discourse forum, we are never going to rely on a stable release of LLVM, partly because of Google policy and partly because of the nature of LLVM/MLIR.

Also, the recently announced OpenXLA will not change the status quo too much, as it is still going to maintain a rolling LLVM dependency between TF and OpenXLA.

So the only concrete solution we have here is to put resources into finalizing the sharing, with non-enterprise contributors, of the cache we produce daily when building the nightly TF wheels in our reproducible Docker images.
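
Concretely, that could look like letting contributors point their builds at the cache populated by the nightly Docker jobs, in read-only mode. A minimal sketch; the cache URL is a placeholder, not an endpoint that exists today:

$ # Hypothetical read-only reuse of the nightly build cache: fetch hits from the
$ # shared cache but never upload local results to it.
$ bazel build \
    --remote_cache=https://storage.googleapis.com/<tf-nightly-build-cache> \
    --remote_upload_local_results=false \
    //tensorflow/tools/pip_package:build_pip_package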

/Cc @thea @Rostam_Dinyari