LLVM updates and Bazel cache

Oh, I definitely agree that we need to go deeper still. I’m certain that removing the monolithic core:lib and core:framework targets (and similar ones) will help a lot, as will cc_shared_library and splitting the kernels into smaller libraries. We’ll basically need a year-long roadmap to untangle all of this and make development easier.

The 20-minute build is on a specialist machine: no RBE, but a lot of power. I tried to reproduce the same build on my personal laptop (see stats below; it was in the top percentile for performance some 4-5 years ago, afaik) and I gave up after almost 9 hours of compilation (almost twice as long as it would have taken to compile the Linux kernel). I think at that point Bazel’s JVM memory overhead caused too much slowdown for the experiment to be meaningful.

...$ cat /proc/cpuinfo      # 8 CPUs
...
processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 94
model name : Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
stepping : 3
microcode : 0xe2
cpu MHz : 800.052
cache size : 6144 KB
...
...$ free -h
              total        used        free      shared  buff/cache   available
Mem:          7.2Gi       954Mi       2.8Gi       181Mi       3.4Gi       5.7Gi
Swap:         7.3Gi       1.4Gi       5.9Gi
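
For what it’s worth, on a machine like that I would first try capping Bazel’s own resource usage so the JVM and the compiler jobs don’t fight over the 8 GB of RAM. A minimal sketch; the heap size, job count, and target are illustrative guesses, not tuned recommendations:

$ # Cap the Bazel server JVM heap and limit the parallelism/RAM claimed by actions.
$ bazel --host_jvm_args=-Xmx2g build \
    --jobs=4 \
    --local_ram_resources=4096 \
    //tensorflow/tools/pip_package:build_pip_package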

Regarding who is affected by this: I think users who develop on multiple branches will be. It’s unlikely that they created all those branches at the same commit, so whenever they switch from one branch to another (and given our PR review times, this is frequent) their cache gets invalidated.
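
A persistent local disk cache softens this a bit, since switching back to a branch you have already built can reuse old action outputs (it does not help across the LLVM bumps discussed below). A minimal sketch; the cache path is arbitrary:

$ # Keep build outputs across branch switches in a local content-addressed cache;
$ # only actions whose inputs are unchanged will get cache hits.
$ bazel build --disk_cache=$HOME/.cache/tf-bazel-cache //tensorflow/...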

Then there are users who are told in PR review to rebase onto master and then run additional tests. API golden generation, for example, has been an issue: I recall at least 3-4 PRs where someone at Google had to regenerate the goldens after the manual import because the external contributor could not compile the generator in a reasonable time.

I think over 50% of PRs are not run through CI locally at all; they rely only on the presubmits we run on Kokoro.

And then there are two other use cases for speeding up the build and reducing the cache invalidation rate: we could add a remote cache and GitHub Actions-based presubmits for faster turnaround and less flakiness, and we could finally enable presubmits for older release branches. These are low-priority/low-frequency scenarios, though.


Is there a Colab notebook example of checking out and building TF?

I tried building TF on my work laptop but could not get Bazel working due to certificate hassles.

No, and I suppose that the notebook would hit a timeout, just like the small test with a public GitHub Action to build TF inside our Docker container did.

I’ve recently contributed a few PRs again to the TF main repo.

I can confirm that the daily LLVM updates are currently one of the major usability bottlenecks for contributing to the TensorFlow (core) repository, as they continuously invalidate the Bazel disk cache even in a controlled and reproducible environment like a Docker container.
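
You can see the pace of these invalidations directly from the history of the LLVM pin in the tree. A small sketch, assuming the pin still lives in third_party/llvm/workspace.bzl (the file that defines LLVM_COMMIT):

$ # Show the most recent bumps of the pinned LLVM commit; each bump invalidates
$ # the cached actions for everything that depends on LLVM/MLIR.
$ git log --oneline -5 -- third_party/llvm/workspace.bzl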

At the same time, it is still hard to know from which specific commit we should start a PR in order to reuse (read-only) the public cache produced while building the official TF nightly wheels.
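
One rough way to recover the commit a given nightly wheel was built from is the git version string baked into the package; this assumes tf.version.GIT_VERSION is populated in tf-nightly, which I believe it is:

$ pip install tf-nightly
$ # GIT_VERSION ends with the short SHA the wheel was built from (e.g. ...-gabcdef123);
$ # branching from that commit maximizes the chance of cache reuse.
$ python -c "import tensorflow as tf; print(tf.version.GIT_VERSION)"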

/cc @yarri-oss @thea

Unfortunately we cannot add a stable interface between LLVM and TF; attempts to approach this have run into issues in the past.

Hence, we have to live with daily LLVM updates until TF can be properly modularized.

At the same time, it is still hard to know from which specific commit we should start a PR in order to reuse (read-only) the public cache produced while building the official TF nightly wheels.

There is also this part of the post to explore, as I think TF modularization is a very long-term task, if and when it gets scheduled as a priority. We have already discussed the status of many approved but old RFCs in this thread.

Given the recent long thread on the LLVM/MLIR Discourse forum, we are never going to rely on a stable release of LLVM, partly because of Google policy and partly because of the nature of LLVM/MLIR.

Also, the recently announced OpenXLA will not change the status quo too much, as it is still going to maintain a rolling LLVM dependency between TF and OpenXLA.

So the only concrete solution we have here is to put resources into finalizing the sharing, with non-enterprise contributors, of the cache we produce daily when building the nightly TF wheels in our reproducible Docker images.
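
Concretely, that could look like letting contributors point their builds at the cache populated by the nightly Docker jobs, in read-only mode. A minimal sketch; the cache URL is a placeholder, not an endpoint that exists today:

$ # Hypothetical read-only reuse of the nightly build cache: fetch hits from the
$ # shared cache but never upload local results to it.
$ bazel build \
    --remote_cache=https://storage.googleapis.com/<tf-nightly-build-cache> \
    --remote_upload_local_results=false \
    //tensorflow/tools/pip_package:build_pip_package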

/Cc @thea @Rostam_Dinyari