LLVM updates and Bazel cache

As we are updating LLVM twice a day, I’ve tried to query Bazel with:

bazel aquery "rdeps(//tensorflow:*,//third_party/llvm:*)" --include_aspects 2>/dev/null | grep Compiling | wc -l

I am not a Bazel ninja, so the query could probably be wrong or improved, but I currently see 9938 files on master (CPU only).

What is the effect of this twice-daily rolling update on the average community contributor’s compile workflow/environment and their Bazel cache?


I’d assume that Bazel is smart enough to only recompile the pieces that actually changed in there, so the impact will vary depending on the actual updates we’re picking up.

As a workflow, when I used to develop LLVM on a laptop, I had a cron script that would run a git pull and build at 7am, before I showed up at the office, so that when I arrived I had the most recent copy of the code with the build cache up to date :slight_smile:
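
For illustration, a similar nightly pre-warm adapted to a TensorFlow checkout could be a crontab entry like the one below (the path, schedule, and target are my assumptions, not the actual script):

# m h dom mon dow  command
0 7 * * 1-5  cd ~/src/tensorflow && git pull --rebase && bazel build --disk_cache=~/bazel_cache //tensorflow/tools/pip_package:build_pip_package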

If you execute the query on your laptop you will see that we have many targets depending on this. Are we going to invalidate the cache for all these targets?

This could work for you or any other TF team member as a daily routine.

But I think that is really not the case for an occasional/episodic contributor who invests their sparse time to contribute a TF PR.

What do you think?

Bazel is supposed to cache things based on the content hash of the files, so updating LLVM matters only for the actual files changed in there.

Yes, but if you’re an occasional contributor, whether we update some code twice a day or every other week shouldn’t matter: you’ll have to (re)build it, won’t you?

Without an available nightly cache, probably.
But it really depends on how many targets an LLVM update invalidates, in the case where the contributions are not too sparse.

Is an LLVM update going to invalidate all the targets depending on LLVM?

LLVM isn’t monolithic; it depends on what changed in LLVM. Bazel tracks finer-grained dependencies. If LLVM’s LibSupport is changed, then most things will get rebuilt, but if there is a fix in the X86 backend I would expect only the JIT dependencies to be rebuilt, for example.

I am not worried about LLVM itself but about all the targets in its dependency chain, if the query was correct.

Is a small change in LLVM going to invalidate all these targets?

bazel aquery "rdeps(//tensorflow:*,//third_party/llvm:*)" --include_aspects 2>/dev/null | grep Compiling

So as Mehdi was saying, you can’t just look at all TF dependencies on anything in LLVM: it’s going to matter what in LLVM changes. A small change in LLVM is going to invalidate things that depend on the specific target in LLVM that changed. That said, TF’s monolithic structure is pretty likely to create unnecessary dependencies and cause unnecessary recompiles.

Your query also doesn’t work for me. Maybe because of recent changes to use the upstream LLVM Bazel build files? //third_party/llvm is empty; I think you want @llvm-project. Everything in the repo would be @llvm-project//.... Similarly, your query for //tensorflow:* is, I believe, only capturing rules directly in that package, and you’d need //tensorflow/... to get everything. But for reasons that aren’t really clear to me, doing an aquery with either of those wildcards fails in different ways. Fun. Anyway, we’ll limit ourselves to individual packages for now.

If llvm:Support changes:

$ bazel aquery "rdeps(//tensorflow:*, @llvm-project//llvm:Support)" --include_aspects 2>/dev/null | grep Compiling | wc -l
4930

but if llvm:Symbolize changes, then nothing needs to be recompiled

$ bazel aquery "rdeps(//tensorflow:*, @llvm-project//llvm:Symbolize)" --include_aspects 2>/dev/null | grep Compiling | wc -l
0

for an in-between example:

$ bazel aquery "rdeps(//tensorflow:*, @llvm-project//llvm:ProfileData)" --include_aspects 2>/dev/null | grep Compiling | wc -l
2818

and don’t forget about MLIR:

$ bazel aquery "rdeps(//tensorflow:*, @llvm-project//mlir:Support)" --include_aspects 2>/dev/null | grep Compiling | wc -l
1924

Another important component is that if you’re using a cache like --disk_cache, Bazel will only rerun a compile command if the inputs actually change because it’s hashing the content. So if you have a change to llvm:Support that adds something hidden behind a preprocessor macro but that doesn’t actually result in a different compiled output, then Bazel will not recompile things that depend on it, instead using the cache hit.
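
As a side note, a quick way to see the disk cache at work (just a sketch; the commands below keep the cache but drop local state, and the summary line is only illustrative of the output shape, not measured):

$ bazel clean   # removes local outputs but leaves ~/bazel_cache untouched
$ bazel build --disk_cache=~/bazel_cache //tensorflow/tools/pip_package:build_pip_package
...
INFO: 11114 processes: 9825 disk cache hit, 1289 internal.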

Yes, probably: the query was tested in June.

This is true, but I am not sure how such low-level components as LLVM and MLIR, without versioned releases, can really be isolated as modular dependencies.

Keeping TF itself frozen, can you just analyze the cache hits and misses before and after a few of the LLVM update commits?

I’m not really sure. But you could also look at the command profile after an LLVM update: https://source.cloud.google.com/results/invocations/db44e648-2768-4e83-85fc-e63e092c880b/artifacts/

Just running bazel analyze-profile on that doesn’t offer any insight into the cache hit rate, but perhaps extracting the data with --dump=raw would.
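
A rough sketch of what I mean, assuming the profile from that invocation is downloaded locally as profile.gz (the file name is an assumption, and I haven’t checked whether --dump=raw still applies to the JSON profile format):

$ bazel analyze-profile profile.gz                        # phase summary only
$ bazel analyze-profile --dump=raw profile.gz > raw.txt   # raw per-task data for scripting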

Just ran an experiment checking a commit before the recent LLVM bump on a newly created disk cache:

...$ git checkout f696b566f40baa87d776c92a12b03cca1d83bfd1
...$ bazel clean --expunge
...$ time bazel build --disk_cache=~/bazel_cache //tensorflow/tools/pip_package:build_pip_package
...
INFO: Elapsed time: 1176.812s, Critical Path: 251.94s
INFO: 11114 processes: 1289 internal, 9825 local.
INFO: Build completed successfully, 11114 total actions

real    19m36.954s
user    0m0.359s
sys     0m0.380s

Next, switch to the LLVM bump and compile again:

...$ git checkout a868b0d057b34dbd487a1e3d2b08d5489651b3ff
...$ time bazel build --disk_cache=~/bazel_cache //tensorflow/tools/pip_package:build_pip_package
...
INFO: Elapsed time: 523.303s, Critical Path: 208.88s
INFO: 3273 processes: 166 internal, 3107 local.
INFO: Build completed successfully, 3273 total actions

real    8m43.377s
user    0m0.166s
sys     0m0.171s

Note: this is on a fast machine; OSS contributors likely won’t have access to these specs:

...$ cat /proc/cpuinfo    # 56 CPUs
...
processor       : 55
vendor_id       : GenuineIntel
cpu family      : 6
model           : 79
model name      : Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
stepping        : 1
microcode       : 0xb00003e
cpu MHz         : 3192.602
cache size      : 35840 KB
...
...$ free -h              # 126 GB RAM
               total        used        free      shared  buff/cache   available
Mem:           125Gi        42Gi        61Gi       1.5Gi        21Gi        80Gi
Swap:          127Gi       741Mi       127Gi

This experiment could be repeated for other LLVM bumps (or TFRT ones too).

So, the most recent LLVM bump caused a recompile of almost 50% (time-wise) of a full compile. This impacts anyone developing in OSS but also prevents us from using caching and GitHub Actions for GitHub presubmits. Maybe both of our teams can invest some time to reduce this (assuming this happens on other LLVM bumps)?


Thank you @mihaimaruseac for this small test on a random LLVM update.

Occasional contributors will not reach this resource configuration even with the top Codespaces profile (32 cores).


A different set of commits, a new test:

...$ git checkout b836deac35cd58c271aebbebdc6b0bd13a058585
...$ rm -rf ~/bazel_cache/
...$ bazel clean --expunge
...$ time bazel build --disk_cache=~/bazel_cache //tensorflow/tools/pip_package:build_pip_package
...
INFO: Elapsed time: 1164.101s, Critical Path: 308.54s
INFO: 11114 processes: 1289 internal, 9825 local.
INFO: Build completed successfully, 11114 total actions

real	19m24.243s
user	0m0.360s
sys	0m0.336s
...$ git checkout b71106370c45bd584ffbdde02be21d35b882d9ee
Previous HEAD position was b836deac35c Remove TensorShape dependency from ScopedMemoryDebugAnnotation.
HEAD is now at b71106370c4 Integrate LLVM at llvm/llvm-project@bd7ece4e063e
...$ time bazel build --disk_cache=~/bazel_cache //tensorflow/tools/pip_package:build_pip_package
...
INFO: Elapsed time: 262.158s, Critical Path: 120.47s
INFO: 1567 processes: 149 internal, 1418 local.
INFO: Build completed successfully, 1567 total actions

real	4m22.240s
user	0m0.095s
sys	0m0.092s

This is around the LLVM bump from 23 hours ago, so the two tests run so far cover the bumps that occur in a single day.

This is slightly faster than before, at only ~20% of a full compile.


I think that all these bumps, with this cache miss rate, are also going to impact the disk cache size on disk, as it could be quite common to have more than one branch/PR waiting for review, so you need to switch back and forth between branches on different bumps (or do we need to constantly rebase/merge when possible?).

As we know, cleaning up the disk cache over time is really not such a straightforward task:
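
One workaround I could imagine (an assumption on my side, not a built-in Bazel feature) is to prune cache files by access time, which only works if the filesystem actually updates atimes:

$ find ~/bazel_cache -type f -atime +30 -delete   # drop entries not read in the last 30 days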

Actually, let’s run a longer experiment and track changes since the first commit of this week:

...$ bazel clean --expunge
...$ rm -rf ~/bazel_cache/
...$ for commit in $(git log --first-parent --pretty=oneline ...623ed300b593b368e665899be3cf080c5a0e3ebe | tac | cut -d' ' -f1)
> do
>   git checkout ${commit} &> /dev/null
>   echo "Building at ${commit}"
>   (time bazel build --disk_cache=~/bazel_cache //tensorflow/tools/pip_package:build_pip_package >/dev/null) 2>&1 | grep real
> done | tee ~/build_log

This is 273 commits. At the end of the job, the build cache is 124GB (something most people in OSS cannot afford either):

...$ du -sh ~/bazel_cache/
124G	/usr/local/google/home/mihaimaruseac/bazel_cache/

Anyway, let’s look at the timing info from the log:

# concatenate the lines, transform xmys time into (60*x+y) seconds
...$ awk '!(NR%2) { print p, $2} { p = $3 }' ~/build_log | sed -e 's/m/ /' -e 's/s$//' | awk '{ print $1, ($2 * 60 + $3) }' > ~/times_in_seconds
# get a histogram binned for every 10 seconds
...$ sort -rnk2 ~/times_in_seconds | cut -d. -f1 | cut -d' ' -f2 | sed -e 's/.$//' | uniq -c | perl -lane 'print $F[1], "0-", $F[1], "9\t", "=" x ($F[0] / 2)'
1180-1189	
580-589	
570-579	
550-559	=
520-529	=
430-439	
360-369	
280-289	
270-279	
240-249	
230-239	
210-219	=
170-179	=
160-169	=
140-149	
120-129	==
110-119	=
90-99	
80-89	=
70-79	===
60-69	==
50-59	===
40-49	==========
30-39	=====================================================================================================
# also print the values instead of just ====s
...$ sort -rnk2 ~/times_in_seconds | cut -d. -f1 | cut -d' ' -f2 | sed -e 's/.$//' | uniq -c 
      1 118
      1 58
      1 57
      2 55
      2 52
      1 43
      1 36
      1 28
      1 27
      1 24
      1 23
      2 21
      2 17
      3 16
      1 14
      4 12
      3 11
      1 9
      3 8
      7 7
      5 6
      7 5
     20 4
    202 3

As you can see, most incremental builds take 30-40 seconds (202 out of 273!) but there are some that take much longer. Let’s look into them.

# longest 20 times
...$ sort -rnk2 ~/times_in_seconds | head -n20
699c63cf6b0136a330ae8c5f56a2087361f6701e 1184.46
b836deac35cd58c271aebbebdc6b0bd13a058585 585.35
abcced051cb1bd8fb05046ac3b6023a7ebcc4578 574.188
b49d731332e5d9929acc9bfc9aed88ace61b6d18 556.711
42f72014a24e218a836a87452885359919866b0b 553.296
982608d75d1493b4e351ef84d58bc0fdf78203c8 527.231
a868b0d057b34dbd487a1e3d2b08d5489651b3ff 523.162
c432f62159879d83e62d72afc9ef80cb6cdbe1e5 433.18
8b05b58c7c9cb8d1ed838a3157ddda8694c028f4 366.548
36931bae2a36efda71f96c9e879e91b087874e89 280.591
b71106370c45bd584ffbdde02be21d35b882d9ee 272.807
86fb36271f9068f84ddcecae74fe0b7df9ce83ee 242.273
1848375d184177741de4dfa4b65e497b868283cd 239.788
9770c84ea45587524e16de233d3cf8b258a9bd77 219.21
61bcb9df099b3be7dfbbbba051ca007032bfb777 214.006
d3a17786019d534fb7a112dcda5583b8fd6e7a62 172.092
e8dc63704c88007ee4713076605c90188d66f3d2 170.582
ddcc48f003e6fe233a6d63d3d3f5fde9f17404f1 169.959
2035c4acc478b475c149f9be4f2209531d3d2d0d 169.84
3edbbc918a940162fc9ae4d69bba0fff86db9ca2 167.948
# what are the commits for each one
...$ for commit in $(sort -rnk2 ~/times_in_seconds | head -n20 | awk '{ print $1 }'); do git log -n1 --pretty=oneline ${commit}; done
699c63cf6b0136a330ae8c5f56a2087361f6701e use tensorflow==2.5.0 to temporarily solve the failure of `evaluate_tflite` function.
b836deac35cd58c271aebbebdc6b0bd13a058585 Remove TensorShape dependency from ScopedMemoryDebugAnnotation.
abcced051cb1bd8fb05046ac3b6023a7ebcc4578 Prevent crashes when loading tensor slices with unsupported types.
b49d731332e5d9929acc9bfc9aed88ace61b6d18 Integrate LLVM at llvm/llvm-project@955b91c19c00
42f72014a24e218a836a87452885359919866b0b Remove experimental flag `fetch_remote_devices_in_multi_client`.
982608d75d1493b4e351ef84d58bc0fdf78203c8 Switched to OSS llvm build rules instead of scripts imported from third_party.
a868b0d057b34dbd487a1e3d2b08d5489651b3ff Integrate LLVM at llvm/llvm-project@fe611b1da84b
c432f62159879d83e62d72afc9ef80cb6cdbe1e5 Integrate LLVM at llvm/llvm-project@b52171629f56
8b05b58c7c9cb8d1ed838a3157ddda8694c028f4 Integrate LLVM at llvm/llvm-project@8c3886b0ec98
36931bae2a36efda71f96c9e879e91b087874e89 Integrate LLVM at llvm/llvm-project@4b4bc1ea16de
b71106370c45bd584ffbdde02be21d35b882d9ee Integrate LLVM at llvm/llvm-project@bd7ece4e063e
86fb36271f9068f84ddcecae74fe0b7df9ce83ee Integrate LLVM at llvm/llvm-project@fda176892e64
1848375d184177741de4dfa4b65e497b868283cd Merge pull request #51511 from PragmaTwice:patch-1
9770c84ea45587524e16de233d3cf8b258a9bd77 Integrate LLVM at llvm/llvm-project@cc4bfd7f59d5
61bcb9df099b3be7dfbbbba051ca007032bfb777 Integrate LLVM at llvm/llvm-project@8e284be04f2c
d3a17786019d534fb7a112dcda5583b8fd6e7a62 Fix and resubmit subgroup change
e8dc63704c88007ee4713076605c90188d66f3d2 Add BuildTensorSlice for building from unvalidated TensorSliceProtos.
ddcc48f003e6fe233a6d63d3d3f5fde9f17404f1 [XLA:SPMD] Improve partial manual sharding handling.  - Main change: make sharding propagation work natively with manual subgroup sharding. There were some problems when propagating with tuple shapes. This also avoids many copies, which is important for performance since the pass runs multiple times.  - Normalize HloSharding::Subgroup() to merge the same type of subgroup dims.  - Handle tuple-shaped ops (e.g., argmax as reduce, sort) in SPMD partitioner.  - Make SPMD partitioner to handle pass-through ops (e.g., tuple) natively, since they can mix partial and non-partial elements in a tuple.
2035c4acc478b475c149f9be4f2209531d3d2d0d Legalizes GatherOp via canonicalization to GatherV2Op; i.e. Providing default values of 0 for the axis parameter and the batch_dims attribute.
3edbbc918a940162fc9ae4d69bba0fff86db9ca2 Internal change

10 of these 20 commits are LLVM hash bumps. In total, there are 11 such commits in the 273 considered:

...$ for commit in $(cat ~/times_in_seconds | awk '{ print $1 }'); do git log -n1 --pretty=oneline ${commit}; done | grep LLVM | wc -l
11

So, almost all LLVM commits result in large compile times. Half of the top 20 longest compile times are LLVM hash bumps.

I’d say this is quite costly and we need to find a plan to handle this in a way that helps OSS users.

Edit: actually ALL LLVM hash bumps are included in the longest compiles; the missing one is just the conversion to the upstream build files:

...$ for commit in $(cat ~/times_in_seconds | awk '{ print $1 }'); do git log -n1 --pretty=oneline ${commit}; done | grep LLVM 
b49d731332e5d9929acc9bfc9aed88ace61b6d18 Integrate LLVM at llvm/llvm-project@955b91c19c00
3487b91d529f2cbc412121d60845cda014e0db7d Integrate LLVM at llvm/llvm-project@9cdd4ea06f09
c432f62159879d83e62d72afc9ef80cb6cdbe1e5 Integrate LLVM at llvm/llvm-project@b52171629f56
86fb36271f9068f84ddcecae74fe0b7df9ce83ee Integrate LLVM at llvm/llvm-project@fda176892e64
36931bae2a36efda71f96c9e879e91b087874e89 Integrate LLVM at llvm/llvm-project@4b4bc1ea16de
e624ad903f9c796a98bd309268ccfca5e7a9c19a Use upstream LLVM Bazel build rules
8b05b58c7c9cb8d1ed838a3157ddda8694c028f4 Integrate LLVM at llvm/llvm-project@8c3886b0ec98
9770c84ea45587524e16de233d3cf8b258a9bd77 Integrate LLVM at llvm/llvm-project@cc4bfd7f59d5
b71106370c45bd584ffbdde02be21d35b882d9ee Integrate LLVM at llvm/llvm-project@bd7ece4e063e
a868b0d057b34dbd487a1e3d2b08d5489651b3ff Integrate LLVM at llvm/llvm-project@fe611b1da84b
61bcb9df099b3be7dfbbbba051ca007032bfb777 Integrate LLVM at llvm/llvm-project@8e284be04f2c

Thank you for this extended analysis.
Can you limit the number of cores/RAM on one of these builds, just to understand what we are talking about for a typical OSS user’s hardware configuration?

I can run the experiment at home on a different computer, but only compiling before and after an LLVM bump; I’ll pick one that was already considered in the thread.


I don’t know if you can constrain it with these args; would that be enough even on your standard build machine:

https://docs.bazel.build/versions/main/user-manual.html#flag--local_{ram,cpu}_resources
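
Something like the following, if I read that page correctly (the values are just an example for a mid-range machine; --local_ram_resources is in MB):

$ bazel build --local_cpu_resources=8 --local_ram_resources=16384 --disk_cache=~/bazel_cache //tensorflow/tools/pip_package:build_pip_package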

I think this is a nice start, but we need to go deeper: how much of the recompilation is due to which target dependencies changing? E.g., if we didn’t have the monolithic :core target, what would have needed recompilation vs. what is recompiled today? Caching only works if the dependencies are disentangled; if independent parts are intermingled for historical reasons, then there’ll be a lot of rebuilding. Also, how does the proposed work on proper shared library support affect this? (These may be intermingled, as to get the best shared libraries one would need better deps, but if everything is statically linked together then cache hits will be lower and link times higher.)

But I think that goes to the above question: who is the user that this is affecting? E.g., using the git development model, I’d be on a branch getting my work done up until sending it for review, and would hit this only if I manually pull (so I choose the frequency). At the point where I do that (and I think Mihai’s numbers may be with RBE, as I can’t build TF in under an hour on my machine without a populated cache), I context switch and work on something else. So updating affects a user that is pulling frequently for some reason, but not frequently enough to get enough cache reuse?


I think it will affect external contributors who have more than one PR waiting for review, or who need to rebase because of conflicts against master.
These two cases really depend on how fast our review process is versus the number of bumps and the internal “direct commit/copybara” activity.

It will also affect me when I contribute a new PR, e.g. just a week or a few days after the last merged one.

Then, though this is a separate topic, there is also the question of what we are asking of the external contributor: executing a clean-room build to populate the first cache, and how to manage garbage collection for the --disk_cache, as it is currently unmanaged in Bazel, especially if the disk size grows too fast with these bumps.
