Opened 4 weeks ago

Last modified 32 hours ago

#32053 new defect

Tor Browser bundles based on Firefox 68 ESR are not reproducible (LLVM optimization issue)

Reported by: gk
Owned by: tbb-team
Priority: Immediate
Milestone:
Component: Applications/Tor Browser
Version:
Severity: Critical
Keywords: TorBrowserTeam201911, tbb-9.0-must, tbb-9.0-issues, tbb-regression, tbb-9.0.1-can, GeorgKoppen201911
Cc: boklm, manishearth@…, acrichton@…
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description (last modified by gk)

For some reason boklm and I got different macOS bundles when building our rc for 9.0a8. Linux bundles are affected, too (see #32052), as are other platforms.

Child Tickets

Attachments (1)

macOS_diff_1000 (64.2 KB) - added by gk 4 weeks ago.


Change History (43)

comment:1 Changed 4 weeks ago by gk

Priority: Medium → Immediate
Severity: Normal → Critical

The diff is too large to attach here but it's contained to XUL.

comment:2 Changed 4 weeks ago by boklm

After rebuilding the firefox part, the issue disappeared.

It is still unclear if the issue is the same as #32052, or a different one.

comment:3 Changed 4 weeks ago by boklm

Reproducing this issue seems to be more difficult than #32052. I have been running a build with this patch since yesterday, but so far did not hit the issue:
https://gitweb.torproject.org/user/boklm/tor-browser-build.git/commit/?h=bug_32053&id=8ca29105d5495fcf9e97620c4c3a60b542a75a9b

comment:4 Changed 4 weeks ago by gk

Cc: manishearth@… acrichton@… added
Keywords: tbb-9.0-must added; tbb.9.0-must removed

I got differences for macOS after a while (but I agree with boklm that this seems to be harder to achieve). libgkrust.a does not match. More specifically gkrust-6f8221aa429c2389.gkrust.41si33dt-cgu.0.rcgu.o. I am not convinced yet that this is a duplicate of #32052 as the diff looks different enough. And it is considerably larger (way over 1 GiB!). I'll add the first thousand lines in case it helps.

Changed 4 weeks ago by gk

Attachment: macOS_diff_1000 added

comment:5 Changed 3 weeks ago by gk

Keywords: tbb-9.0-issues tbb-regression tbb-9.0.1-can added

comment:6 Changed 3 weeks ago by gk

I spent some time on this issue and here is what I have so far:

If I exclude servo from gkrust_features I don't see the bug after rebuilding libgkrust.a 50 times twice on two different machines.

However, just building servo (and the necessary bindgen) hits the problem, and fairly early.

FTR: I've modified gkrust_features like so:

-RustLibrary('gkrust', gkrust_features)
+RustLibrary('gkrust', ['bindgen', 'quantum_render', 'cubeb-remoting', 'moz_memory', 'moz_places', 'gecko_profiler'])
-RUST_TEST_FEATURES = gkrust_features
+RUST_TEST_FEATURES = ['bindgen', 'quantum_render', 'cubeb-remoting', 'moz_memory', 'moz_places', 'gecko_profiler']

While that's no resurrection of #26475, my gut tells me LTO might still be involved somehow. So, next I am testing with LTO disabled to check whether that changes things.

comment:7 Changed 3 weeks ago by gk

Okay, time to give an update here. bug_32053_v2 (https://gitweb.torproject.org/user/gk/tor-browser-build.git/log/?h=bug_32053_v2) contains two commits that reduce the build time while still being able to reproduce the bug. First of all, I am not 100% sure yet that LTO is not introducing a second reproducibility issue here, but disabling it does not solve the bug I am hunting. It has the nice side effect, though, that without LTO the build time of libgkrust.a goes down by roughly another 2 minutes on my faster machine.

I don't blow the whole obj dir away anymore. Rather, I build everything the first time and, if it's matching, I just remove libstyle-*.rlib. After a while I get different Stylo .rlib files. Keeping those .rlib files and checking whether geckoservo or even gkrust builds trigger the bug (by deleting just their respective artifacts and checking whether libgkrust.a changes) turns up nothing. So, I am fairly confident that building Stylo is the problem here.

That moves me to phase 2 in this exciting process: I'll start bisecting the Rust compiler to figure out where this bug started (while avoiding #26475 :) ) and I'll try to save even a bit more build time by not caring about libgkrust.a but doing the SHA-256 check against the Stylo .rlib directly.
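The rebuild-and-compare loop described above can be sketched roughly as follows. This is only an illustration, not the script actually used for this ticket: `rebuild` is a hypothetical callable that deletes the old artifact, rebuilds it, and returns its path (for example the Stylo .rlib).

```python
import hashlib
from pathlib import Path
from typing import Callable, Optional


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def first_divergent_run(rebuild: Callable[[], Path], max_runs: int) -> Optional[int]:
    """Call rebuild() up to max_runs times.

    Returns the 1-based index of the first run whose artifact hash differs
    from the hash of the first run, or None if every run matched.
    `rebuild` is a hypothetical hook; it must return the artifact path.
    """
    reference = None
    for run in range(1, max_runs + 1):
        current = sha256_of(rebuild())
        if reference is None:
            reference = current  # first build defines the expected hash
        elif current != reference:
            return run
    return None
```

Hashing only the .rlib instead of libgkrust.a keeps each iteration as short as possible, which matters when the mismatch only shows up after many rebuilds.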

comment:8 in reply to:  7 Changed 2 weeks ago by boklm

Replying to gk:

Okay, time to give an update here. bug_32053_v2 (https://gitweb.torproject.org/user/gk/tor-browser-build.git/log/?h=bug_32053_v2) contains two commits that reduce the build time while still being able to reproduce the bug. First of all, I am not 100% sure yet that LTO is not introducing a second reproducibility issue here, but disabling it does not solve the bug I am hunting. It has the nice side effect, though, that without LTO the build time of libgkrust.a goes down by roughly another 2 minutes on my faster machine.

FWIW, I tried the same for the linux32 build for #32052 and was able to reproduce the issue with servo.patch (I did not try no_lto.patch yet) while cleaning only the libstyle-* and *.a files after each build.

comment:9 Changed 2 weeks ago by gk

Okay, bisecting Rust is hard due to Mozilla's Rust version requirements: I can repro the issue with 1.38.0 and 1.32.0. 1.30 seems to be too old. However, switching to esr60 and trying there does not work either as 1.30 and above are too new.

So, I guess the next plan is to check Firefox commits between esr60 and esr68 to find some that can be compiled with older Rust versions...

Last edited 2 weeks ago by gk

comment:10 Changed 12 days ago by gk

Okay, while still bisecting my way down to the Rust commit causing this, I looked a bit closer at where the differences are showing up. It turns out that the libstyle rlib is already the problem, and extracting that archive shows me that

a) rust.metadata.bin matches,
b) the bc.z files differ, and
c) the .o files differ.

Alex, Manish: Does that give a hint in which direction we need to look? Like, is b) an indication that this is a clang issue? Or do the results give some other clues? For instance would it be helpful analyzing the bc.z files, if so how?

comment:11 Changed 12 days ago by alexcrichton

*.bc.z files in archives are a semi-custom compression format for LLVM IR files (we really should just use *.gz...) The *.o files are the codegen'd versions of those. Given that the LLVM IR is changing that means one of a few things:

  • Something in the source is changing, causing different IR to be produced
  • Rustc is non-deterministically producing IR
  • LLVM is non-deterministically optimizing IR

It's great to narrow this down to just one crate! I've found that minimization tends to make bisection much much easier. Given that this is related to rlibs this probably isn't related to LTO since those object files are all pre-LTO. In terms of minimizing this further, are the object files similarly named? If so are there "obvious diffs" within them? Otherwise if the object files have completely different names that'd be more worrisome!

If you can I'd recommend whacking away at the style crate's source code, deleting swaths of it as you can to see if you can get non-reproducible builds on one compiler. Basically at this point it's just a game of minimization to find the bug. If you've got a set of semi-digestable instructions to reproduce where you're at as well, we could try to pass this around and see if others can help chip in too to diagnose the bug.

comment:12 in reply to:  11 ; Changed 12 days ago by gk

Replying to alexcrichton:

*.bc.z files in archives are a semi-custom compression format for LLVM IR files (we really should just use *.gz...) The *.o files are the codegen'd versions of those. Given that the LLVM IR is changing that means one of a few things:

  • Something in the source is changing, causing different IR to be produced
  • Rustc is non-deterministically producing IR
  • LLVM is non-deterministically optimizing IR

It's great to narrow this down to just one crate! I've found that minimization tends to make bisection much much easier. Given that this is related to rlibs this probably isn't related to LTO since those object files are all pre-LTO. In terms of minimizing this further, are the object files similarly named? If so are there "obvious diffs" within them? Otherwise if the object files have completely different names that'd be more worrisome!

The object files have the same name but alas there are no obvious diffs. The diff file I am getting after running

diff -u <(xxd 1/style-73fdc83c00a82101.style.dqyyj3ie-cgu.0.rcgu.o) <(xxd 2/style-73fdc83c00a82101.style.dqyyj3ie-cgu.0.rcgu.o)

is 300 MiB (!) large and skimming it nothing really sticks out.

One thing that's been interesting during all this bisecting is that there is not a variety of different results one can get when compiling the style crate. In fact, there are only two different .rlib files I've got so far per tested Rust version (if the Rust version contained the reproducibility bug).

If you can I'd recommend whacking away at the style crate's source code, deleting swaths of it as you can to see if you can get non-reproducible builds on one compiler. Basically at this point it's just a game of minimization to find the bug. If you've got a set of semi-digestable instructions to reproduce where you're at as well, we could try to pass this around and see if others can help chip in too to diagnose the bug.

Thanks. boklm has been working on minimizing the code that gets built when building the style crate. He might have some update on that.

comment:13 Changed 12 days ago by alexcrichton

In terms of diffing you may have better mileage diffing the *.bc.z files. While there's no standalone tool to extract those, the [format is documented](https://github.com/rust-lang/rust/blob/87cbf0a547aaf9e8a7fc708851ecf4bc2adab5fd/src/librustc_codegen_llvm/back/bytecode.rs#L1-L23) and you may be able to write a small program manually using flate2 to extract the *.bc file, which you can then feed through llvm-dis. That textual representation may be a bit more diffable? (no offsets and whatnot).

Barring that though I suspect more progress will need to be made with further reductions.
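For reference, a minimal sketch of such an extractor, written in Python with `zlib` rather than Rust with flate2. The layout it assumes (an ASCII `RUST_OBJECT` magic, a little-endian u32 version, a little-endian u32 identifier length, the identifier, a little-endian u64 payload length, then raw-deflate data) is my reading of the linked comment and should be checked against it before relying on the result.

```python
import struct
import zlib
from typing import Tuple

MAGIC = b"RUST_OBJECT"


def extract_bitcode(blob: bytes) -> Tuple[int, str, bytes]:
    """Decode one *.bc.z member and return (version, identifier, bitcode).

    The field layout is an assumption taken from the format comment linked
    above, not something verified against every rustc version.
    """
    if not blob.startswith(MAGIC):
        raise ValueError("missing RUST_OBJECT magic")
    pos = len(MAGIC)
    version, ident_len = struct.unpack_from("<II", blob, pos)
    pos += 8
    identifier = blob[pos:pos + ident_len].decode("utf-8")
    pos += ident_len
    (data_len,) = struct.unpack_from("<Q", blob, pos)
    pos += 8
    compressed = blob[pos:pos + data_len]
    # A negative window size tells zlib to expect raw deflate data
    # (no zlib header or checksum), which is assumed here.
    bitcode = zlib.decompress(compressed, -15)
    return version, identifier, bitcode
```

Writing the returned bytes to a file should give something llvm-dis can read, provided the assumptions about the layout hold.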

comment:14 Changed 9 days ago by gk

Keywords: TorBrowserTeam201911 GeorgKoppen201911 added; TorBrowserTeam201910 removed

comment:15 in reply to:  12 Changed 9 days ago by gk

Replying to gk:

Replying to alexcrichton:

*.bc.z files in archives are a semi-custom compression format for LLVM IR files (we really should just use *.gz...) The *.o files are the codegen'd versions of those. Given that the LLVM IR is changing that means one of a few things:

  • Something in the source is changing, causing different IR to be produced
  • Rustc is non-deterministically producing IR
  • LLVM is non-deterministically optimizing IR

While struggling with reducing libstyle size I got wondering whether there is a way to easily dump the output of those steps. For instance, is there a rustc option I could use to dump the IR *before* LLVM is optimizing it so that we can narrow further down where the issue in the toolchain lies? I guess if we go the route you mentioned in comment:13 we would get the LLVM optimized IR? If not I'd be interested in dumping that as well with some compilation setting, if possible.

If there aren't any such options to dump intermediate output yet, could you point me to the place in the compiler where I could hack this up?

comment:16 Changed 9 days ago by alexcrichton

Ah that's a good point! I should probably have mentioned that earlier too... In any case you can set RUSTFLAGS=-Csave-temps and that'll spray a massive amount of files all over the place (*.bc, *.o, etc). You should be able to basically run a diff of all those files between builds, and you can probably pick the smallest one which has a difference in it. The *.bc files should also be natively disassemble-able by llvm-dis
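Picking the smallest file that differs between two build trees could look roughly like this. It is a sketch only; the two directory arguments stand for two obj dirs produced with -Csave-temps, and the function name is made up for illustration.

```python
import filecmp
from pathlib import Path
from typing import List, Tuple


def differing_files(tree_a: Path, tree_b: Path) -> List[Tuple[int, str]]:
    """Return (size, relative path) for files present in both trees whose
    contents differ, sorted so the smallest candidate comes first."""
    results = []
    for file_a in tree_a.rglob("*"):
        if not file_a.is_file():
            continue
        rel = file_a.relative_to(tree_a)
        file_b = tree_b / rel
        # shallow=False forces a byte-by-byte comparison instead of
        # trusting size and modification time.
        if file_b.is_file() and not filecmp.cmp(file_a, file_b, shallow=False):
            results.append((file_a.stat().st_size, rel.as_posix()))
    return sorted(results)
```

The first entry of the result is then the most convenient file to disassemble and diff.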

comment:17 Changed 9 days ago by gk

Okay, here come some promising results:

sha256sum test1/obj-macos/x86_64-apple-darwin/release/deps/style-eb257c29b0562cc6.*
710f7c33fdf586735275ac82bae0e857dbbda4e4a9e2b95498b18231bb347a23  test1/obj-macos/x86_64-apple-darwin/release/deps/style-eb257c29b0562cc6.4rg6haojvyp4bm84.rcgu.bc
e0a5bc12a4a2933e4386a25263c148e9c9c0798adcf3a5da83c19e2897eca09d  test1/obj-macos/x86_64-apple-darwin/release/deps/style-eb257c29b0562cc6.4rg6haojvyp4bm84.rcgu.o
eeaec097f1e7170a6229e575edee88ae04cfab4d878650bd6b9f00fe6dc7ed75  test1/obj-macos/x86_64-apple-darwin/release/deps/style-eb257c29b0562cc6.d
73066b1c98f66cb36ae83096f411910b55c41494c2d808738ade3d4f27c97847  test1/obj-macos/x86_64-apple-darwin/release/deps/style-eb257c29b0562cc6.style.5crbtq6r-cgu.0.rcgu.bc
50a8624628f83c5e5e522b1389bb649f59ea0b162aed8f3dd961dc363d0d68f3  test1/obj-macos/x86_64-apple-darwin/release/deps/style-eb257c29b0562cc6.style.5crbtq6r-cgu.0.rcgu.bc.z
d1ef8bb757d3958cbab4a29a900e105ce0cc3a515af01103ceec1502a4281fe0  test1/obj-macos/x86_64-apple-darwin/release/deps/style-eb257c29b0562cc6.style.5crbtq6r-cgu.0.rcgu.no-opt.bc
b1ae81ef931a8de03edc613d6685ca263585484d65a5a73c1195115541f452cd  test1/obj-macos/x86_64-apple-darwin/release/deps/style-eb257c29b0562cc6.style.5crbtq6r-cgu.0.rcgu.o

sha256sum test2/obj-macos/x86_64-apple-darwin/release/deps/style-eb257c29b0562cc6.*
710f7c33fdf586735275ac82bae0e857dbbda4e4a9e2b95498b18231bb347a23  test2/obj-macos/x86_64-apple-darwin/release/deps/style-eb257c29b0562cc6.4rg6haojvyp4bm84.rcgu.bc
e0a5bc12a4a2933e4386a25263c148e9c9c0798adcf3a5da83c19e2897eca09d  test2/obj-macos/x86_64-apple-darwin/release/deps/style-eb257c29b0562cc6.4rg6haojvyp4bm84.rcgu.o
eeaec097f1e7170a6229e575edee88ae04cfab4d878650bd6b9f00fe6dc7ed75  test2/obj-macos/x86_64-apple-darwin/release/deps/style-eb257c29b0562cc6.d
061ac74571a2f59ec2f656f0c625093949300a80010d16fce04ac876498ff9d1  test2/obj-macos/x86_64-apple-darwin/release/deps/style-eb257c29b0562cc6.style.5crbtq6r-cgu.0.rcgu.bc
e98eb35706f10ce5559d63fa1fdd25de3673b45bf9113d14800db42187a7d4c8  test2/obj-macos/x86_64-apple-darwin/release/deps/style-eb257c29b0562cc6.style.5crbtq6r-cgu.0.rcgu.bc.z
d1ef8bb757d3958cbab4a29a900e105ce0cc3a515af01103ceec1502a4281fe0  test2/obj-macos/x86_64-apple-darwin/release/deps/style-eb257c29b0562cc6.style.5crbtq6r-cgu.0.rcgu.no-opt.bc
c3a1f4c3180f41ba5031c85d502c368ea9352fd2d4abb7e93bc2e5a3ed77d783  test2/obj-macos/x86_64-apple-darwin/release/deps/style-eb257c29b0562cc6.style.5crbtq6r-cgu.0.rcgu.o

The important part here is that the style-eb257c29b0562cc6.style.5crbtq6r-cgu.0.rcgu.no-opt.bc files *are* matching while the style-eb257c29b0562cc6.style.5crbtq6r-cgu.0.rcgu.bc ones do not. Assuming no-opt means not-optimized then the problem happens in the optimization step(?) for the bytecode, so I guess the "LLVM is non-deterministically optimizing IR" step you mentioned above.

So, it seems to be an LLVM problem. Do you know whether we could work around that? Like using no-opt.bc for now? More importantly, though, do you have an idea what a small repro test could look like based on that information? Back in the day when working on #26475 you saved my day when you came up with a test snippet for an unrelated issue, so I could avoid staring at libstyle. I have the same hope for this issue. :)

I can look at the actual diff in the .bc files tomorrow if you think that would be helpful.
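Comparing two checksum listings like the ones above by eye is error-prone, so here is a small sketch that does it mechanically. It assumes plain sha256sum output (digest, whitespace, path) and matches entries by file name only; the function names are illustrative.

```python
from pathlib import PurePosixPath
from typing import Dict, List


def parse_listing(text: str) -> Dict[str, str]:
    """Map file name -> digest for each line of sha256sum output."""
    digests = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        digest, path = line.split(maxsplit=1)
        # sha256sum marks binary mode with a leading '*' on the path.
        name = PurePosixPath(path.strip().lstrip("*")).name
        digests[name] = digest
    return digests


def mismatched_artifacts(listing_a: str, listing_b: str) -> List[str]:
    """Return the sorted file names whose digests differ between listings."""
    a, b = parse_listing(listing_a), parse_listing(listing_b)
    return sorted(name for name in a.keys() & b.keys() if a[name] != b[name])
```

For the two listings above this should report the optimized .bc, the .bc.z and the .o of the style codegen unit, while the no-opt.bc file is absent from the result.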

comment:18 Changed 9 days ago by alexcrichton

Nice!

I agree with that conclusion as well in that it looks like LLVM may have a nondeterministic optimization somewhere in it. Can you upload the *.bc files so I could poke around at them? Both the no-opt and optimized versions if you can. Also, what rustc commit are you using? I'll try to get the same set of LLVM tools used on that commit.

In terms of how to keep minimizing, I think the first step is to use 100% pure LLVM tools to reproduce this. For example "run this command 1000 times and I get different results between runs". Given that, the next best step would probably be to use bugpoint from LLVM to help reduce the input IR file into something smaller. Historically bugpoint has been a massive pain to use, but http://blog.llvm.org/2015/11/reduce-your-testcases-with-bugpoint-and.html was somewhat helpful to me in the past. The general idea is that you'll write a script which says whether an input module is "interesting", and in this case "interesting" means "I ran some LLVM optimizations a few times and it didn't always produce the same result".

In any case I can try to help with the bugpoint process once I'm able to reproduce with the input LLVM modules.
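The "interesting" check could be sketched like this. It is not a ready-made bugpoint script, just the core idea: run a command several times and see whether its output is always the same. The command is assumed to write its result to stdout; how to wire this into bugpoint is left out.

```python
import hashlib
import subprocess
from typing import Sequence, Set


def output_digests(command: Sequence[str], runs: int) -> Set[str]:
    """Run `command` `runs` times and collect the distinct SHA-256 digests
    of whatever it writes to stdout."""
    digests = set()
    for _ in range(runs):
        completed = subprocess.run(command, stdout=subprocess.PIPE, check=True)
        digests.add(hashlib.sha256(completed.stdout).hexdigest())
    return digests


def is_interesting(command: Sequence[str], runs: int = 20) -> bool:
    """A module is "interesting" if repeated runs disagree with each other."""
    return len(output_digests(command, runs)) > 1


# Hypothetical usage (the exact opt invocation is an assumption):
#   is_interesting(["opt", "-O3", "style.no-opt.bc", "-o", "-"], runs=50)
```

Since the nondeterminism only shows up now and then, a result of False after a small number of runs proves little; the run count has to be high enough for the case at hand.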

comment:19 in reply to:  18 Changed 8 days ago by gk

Replying to alexcrichton:

Nice!

I agree with that conclusion as well in that it looks like LLVM may have a nondeterministic optimization somewhere in it. Can you upload the *.bc files so I could poke around at them? Both the no-opt and optimized versions if you can. Also, what rustc commit are you using? I'll try to get the same set of LLVM tools used on that commit.

I uploaded the files to https://people.torproject.org/~gk/misc/32053/ with the sha256sums as given in comment:17. We are using the 1.34.2 tag but if things are easier for you I can recreate the .bc files with the currently stable Rust as the issue is still there. BTW: thanks for your help, that's really appreciated.

In terms of how to keep minimizing, I think the first step is to use 100% pure LLVM tools to reproduce this. For example "run this command 1000 times and I get different results between runs".

That's my current plan. I am not familiar with the LLVM tools, both in terms of which to use best and which parameters to deploy, but ideally I'd like to take the no-opt.bc file from above, run just the optimization with some LLVM tool N times and hit the bug at some point. That should be fast enough to allow a meaningful Rust/LLVM bisect if needed.

Given that, the next best step would probably be to use bugpoint from LLVM to help reduce the input IR file into something smaller. Historically bugpoint has been a massive pain to use, but http://blog.llvm.org/2015/11/reduce-your-testcases-with-bugpoint-and.html was somewhat helpful to me in the past. The general idea is that you'll write a script which says whether an input module is "interesting", and in this case "interesting" means "I ran some LLVM optimizations a few times and it didn't always produce the same result".

In any case I can try to help with the bugpoint process once I'm able to reproduce with the input LLVM modules.

Thanks!

comment:20 Changed 8 days ago by gk

Oh, something I forgot and which might be important: so far we only see this issue in a cross-compilation context. Assuming bugs like #32052 are actually the same problem (I'll verify that one later today) we've essentially seen this issue for any of our platforms we cross-compile for (Linux32, macOS, Windows, Android) but not for doing native Linux64 builds. Might be coincidence, though.

comment:21 Changed 8 days ago by gk

While digging into how the optimization actually works and how Rust uses it, I realized that we might be able to play with the optimization flags to narrow things down further if we don't find a better approach (librustc_codegen_llvm/back/write.rs has the optimize() function which is a good start:

    if config.opt_level.is_some() {

). There are more options that might play a role here (see: with_llvm_pmb() as well).

comment:22 Changed 8 days ago by alexcrichton

Thanks for the files! Also wow those are huge.

I've managed to reproduce this and it indeed looks like an LLVM issue! (yay?) I ran opt -O3 style.no-opt.bc -o foo.bc && md5sum foo.bc and I've gotten two different checksums after running a few times. I also just checked out the most recent LLVM trunk and I can see the same issue there.

I don't think there's anything else needed from rustc here, with these bitcode files it should be enough to just run these through LLVM's opt tool to find a reduction that is smaller than 90MB to report :). That being said I suspect that an LLVM bug could go ahead and get opened for this and LLVM folks might be able to help with the reduction here.

It's good to point out the cross-compile aspect, although I suspect that likely just happens to tickle the right portion of LLVM, and it's actually a bug for all platforms. We'll see though!

Do you want me to file the LLVM bug, or would you like to do so?

comment:23 in reply to:  22 Changed 8 days ago by gk

Replying to alexcrichton:

Thanks for the files! Also wow those are huge.

I've managed to reproduce this and it indeed looks like an LLVM issue! (yay?) I ran opt -O3 style.no-opt.bc -o foo.bc && md5sum foo.bc and I've gotten two different checksums after running a few times. I also just checked out the most recent LLVM trunk and I can see the same issue there.

I don't think there's anything else needed from rustc here, with these bitcode files it should be enough to just run these through LLVM's opt tool to find a reduction that is smaller than 90MB to report :). That being said I suspect that an LLVM bug could go ahead and get opened for this and LLVM folks might be able to help with the reduction here.

It's good to point out the cross-compile aspect, although I suspect that likely just happens to tickle the right portion of LLVM, and it's actually a bug for all platforms. We'll see though!

Do you want me to file the LLVM bug, or would you like to do so?

Awesome, thanks! Would you mind filing the bug, mentioning all the necessary info for the LLVM folks to look at (I am not sure which component to put it in, whom to Cc, etc.)? You can link to my files; I'll keep them there at least until the issue is resolved (not sure whether the LLVM bug tracker allows attaching such big files). Please Cc me, if possible (gk [@] torproject . org).

comment:24 Changed 8 days ago by alexcrichton

I've opened up https://bugs.llvm.org/show_bug.cgi?id=43909 and will track that, I'm attempting to use LLVM's automatic test case reduction tools but it's likely going to take quite some time due to how large the module is.

comment:25 Changed 8 days ago by gk

Description: modified (diff)
Summary: macOS bundles for Tor Browser 9.0a8 are not reproducible → Tor Browser bundles based on Firefox 68 ESR are not reproducible (LLVM optimization issue)

Closed #32052 as an actual duplicate after inspecting the intermediate compilation output of non-matching results.

comment:26 in reply to:  24 Changed 8 days ago by gk

Replying to alexcrichton:

I've opened up https://bugs.llvm.org/show_bug.cgi?id=43909 and will track that, I'm attempting to use LLVM's automatic test case reduction tools but it's likely going to take quite some time due to how large the module is.

Thanks! Let me know whether/how I can help here. I don't know much about opt and its options/flags, so there is some learning curve for me, but I could spend some cycles tomorrow and the coming days given how important that bug is for us. Maybe I should just start bisecting to figure out where this got introduced. That might help track the optimization issue down. Either way, let me know.

comment:27 Changed 8 days ago by alexcrichton

I just posted a comment on the bug report with a much more minimal test case (only a few hundred KB!), it only took many cpu hours to extract :)

From here I'm still trying to reduce it further to increase the likelihood that someone from LLVM can help fix (I'm not so good at LLVM internals). This test case is small enough though that it may be pretty reasonable to bisect LLVM itself with. Dealing with a bitcode file across that many LLVM revisions may be pretty difficult though, so bisection likely won't be trivial.

comment:28 in reply to:  27 Changed 7 days ago by gk

Replying to alexcrichton:

I just posted a comment on the bug report with a much more minimal test case (only a few hundred KB!), it only took many cpu hours to extract :)

From here I'm still trying to reduce it further to increase the likelihood that someone from LLVM can help fix (I'm not so good at LLVM internals). This test case is small enough though that it may be pretty reasonable to bisect LLVM itself with. Dealing with a bitcode file across that many LLVM revisions may be pretty difficult though, so bisection likely won't be trivial.

Okay, it seems the optimization causing the problem here is in -O1, which is unfortunate because I had some hope that reducing the current -O2 to it could be a workaround... I am not sure whether -O0 is worth it. But it might be an option if we don't solve the bug before the next planned release.

I'll set up some bisecting in parallel to your efforts and see whether that gets us anywhere. I think I narrowed the problem down Rust version-wise quite a bit before (1.32 is still broken while I think 1.30 is good), which might help. If you get to the problem with bugpoint or some LLVM dev is helping meanwhile even better. :)

Last edited 7 days ago by gk

comment:29 Changed 7 days ago by alexcrichton

The bug seems to be in the -jump-threading pass which I suspect is included in the O1 optimizations, yeah, but this technically only arose during O3 when presumably enough inlining had happened to then trigger the bug. I'm not really sure what the best way to avoid this bug would be unfortunately, but I suspect that an -O1 build should be reproducible (albeit slow).

comment:30 in reply to:  29 Changed 7 days ago by gk

Replying to alexcrichton:

The bug seems to be in the -jump-threading pass which I suspect is included in the O1 optimizations, yeah, but this technically only arose during O3 when presumably enough inlining had happened to then trigger the bug. I'm not really sure what the best way to avoid this bug would be unfortunately, but I suspect that an -O1 build should be reproducible (albeit slow).

Actually -O2 is already enough. I can't trigger the issue with O1 nor with just jump-threading (and I tried pretty hard today). So, from those results I would say "something in -O2 is the problem", which brings me to the thought that we might be hunting different bugs. :) But on the positive side of things I think I have a setup ready now for actually bisecting LLVM, which I will pick up tomorrow.

Last edited 7 days ago by gk

comment:31 Changed 7 days ago by gk

Okay, so before I speculate further I'll double-check your results using -opt-bisect-limit, to at least figure out which optimization is the culprit for the tests I am currently running.

comment:32 Changed 6 days ago by gk

Keywords: TorBrowserTeam201911R added; TorBrowserTeam201911 removed
Status: new → needs_review

Let's test -O1 in our upcoming alpha build to get a feeling whether that would be an acceptable workaround or not and to check whether it actually resolves our build issues: bug_32053_workaround (https://gitweb.torproject.org/user/gk/tor-browser.git/commit/?h=bug_32053_workaround&id=f1f9fa0286982d1fa486880ad6037e1e7a46457d) has a patch for review.

FWIW: It might have been enough to just patch toolchain.configure but that would result in different opt-level options passed to rustc and I was not exactly sure what would happen in that case.

comment:33 in reply to:  32 ; Changed 6 days ago by boklm

Replying to gk:

Let's test -O1 in our upcoming alpha build to get a feeling whether that would be an acceptable workaround or not and to check whether it actually resolves our build issues: bug_32053_workaround (https://gitweb.torproject.org/user/gk/tor-browser.git/commit/?h=bug_32053_workaround&id=f1f9fa0286982d1fa486880ad6037e1e7a46457d) has a patch for review.

Testing this in the next alpha sounds like a good idea. And the patch looks good to me.

comment:34 Changed 6 days ago by gk

Keywords: TorBrowserTeam201911 added; TorBrowserTeam201911R removed
Status: needs_review → new

That made it onto tor-browser-68.2.0esr-9.5-1 (commit f1f9fa0286982d1fa486880ad6037e1e7a46457d) and will hopefully be available in 9.5a2.

Last edited 6 days ago by gk

comment:35 in reply to:  33 Changed 6 days ago by gk

Replying to boklm:

Replying to gk:

Let's test -O1 in our upcoming alpha build to get a feeling whether that would be an acceptable workaround or not and to check whether it actually resolves our build issues: bug_32053_workaround (https://gitweb.torproject.org/user/gk/tor-browser.git/commit/?h=bug_32053_workaround&id=f1f9fa0286982d1fa486880ad6037e1e7a46457d) has a patch for review.

Testing this in the next alpha sounds like a good idea. And the patch looks good to me.

That strategy does not fly, alas, as using -O1 is causing build bustage on Linux at least (due to the current defense we have against proxy bypasses of Rust code), see: #32426.

comment:36 in reply to:  31 Changed 2 days ago by gk

Replying to gk:

Okay, so before I speculate further I'll double-check your results using -opt-bisect-limit, to at least figure out which optimization is the culprit for the tests I am currently running.

Yeah, I can confirm that this is the -jump-threading operation here, too, good. Then let's get the LLVM bisecting going.

Alex: So, I tried to extract the problematic function name with llvm-extract but I failed so far due to my lack of knowledge of LLVM tools. How do I properly demangle the function name so that llvm-extract likes it? I tried llvm-cxxfilt but no dice. The opt output for the problematic function I get is:

BISECT: running pass (1208271) Jump Threading on function (_ZN83_$LT$style..values..specified..box_..Appearance$u20$as$u20$style..parser..Parse$GT$5parse17ha60227de7ee101e5E)

comment:37 Changed 2 days ago by alexcrichton

Oh for llvm-extract I used the -rfunc argument which is a regex instead of an exact name, like so: llvm-extract before.bc -rfunc=17h5949677e2a2fd343E -o before-extract.bc
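To go from a BISECT line to something usable with -rfunc, a small parser along these lines might help. It is a sketch that only handles "running pass ... on function" lines, and it assumes the legacy Rust mangling where a symbol ends in `17h` plus 16 hex digits plus `E`, as in the examples in this ticket.

```python
import re
from typing import NamedTuple, Optional


class BisectEntry(NamedTuple):
    index: int      # pass counter printed by -opt-bisect-limit
    pass_name: str  # e.g. "Jump Threading"
    function: str   # mangled function name


_BISECT_LINE = re.compile(
    r"^BISECT: running pass \((\d+)\) (.+?) on function \((.+)\)$"
)
# Trailing hash of a legacy-mangled Rust symbol (an assumption, see above).
_LEGACY_HASH = re.compile(r"17h[0-9a-f]{16}E$")


def parse_bisect_line(line: str) -> Optional[BisectEntry]:
    """Parse one function-level BISECT line; return None for other lines."""
    match = _BISECT_LINE.match(line.strip())
    if match is None:
        return None
    return BisectEntry(int(match.group(1)), match.group(2), match.group(3))


def rfunc_pattern(mangled: str) -> Optional[str]:
    """Return the hash suffix of a mangled name for use with -rfunc."""
    match = _LEGACY_HASH.search(mangled)
    return match.group(0) if match else None
```

Using only the hash suffix sidesteps the `$` and `.` characters in the full mangled name, which would otherwise need escaping in a regex.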

comment:38 in reply to:  37 Changed 43 hours ago by gk

Replying to alexcrichton:

Oh for llvm-extract I used the -rfunc argument which is a regex instead of an exact name, like so: llvm-extract before.bc -rfunc=17h5949677e2a2fd343E -o before-extract.bc

Thanks, that helped. However, I've tried to repro by running just the -jump-threading pass thousands of times on the same machine (kernel, glibc etc.), with the same clang version and essentially the same script that reproduces the bug when running *all* the passes up to and including the problematic -jump-threading one, but I don't hit the bug that way, which matches my results from comment:30. I wonder what we are missing here.

comment:39 Changed 41 hours ago by alexcrichton

Oh so for just -jump-threading to work you'll need to do:

  1. Start with foo.bc
  2. Figure out the smallest N where opt -O3 foo.bc -opt-bisect-limit=N is non-deterministic
  3. Run opt -O3 -o input.bc -opt-bisect-limit=N-1 foo.bc
  4. Use llvm-extract on input.bc to extract the function
  5. Run opt -jump-threading over the extracted *.bc file

You won't be able to run -jump-threading over the original module, you'll need to run it over the module just before the output becomes nondeterministic.
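Step 2 of the list above is a search over N, which can be sketched as a binary search. This assumes the behaviour is monotone (once a limit is high enough to include the bad pass, every higher limit is non-deterministic too) and that the predicate is reliable, which in practice means running opt many times per N. The predicate itself is a hypothetical callable supplied by the caller.

```python
from typing import Callable, List


def opt_command(limit: int, source: str, output: str) -> List[str]:
    """Build the opt invocation from step 3 for a given bisect limit."""
    return ["opt", "-O3", f"-opt-bisect-limit={limit}", source, "-o", output]


def smallest_bad_limit(is_nondeterministic: Callable[[int], bool], upper: int) -> int:
    """Return the smallest N in [1, upper] for which the predicate is True.

    Assumes is_nondeterministic(upper) is True and that the predicate is
    monotone in N; both are assumptions, not guarantees from LLVM.
    """
    low, high = 1, upper
    while low < high:
        mid = (low + high) // 2
        if is_nondeterministic(mid):
            high = mid      # bad pass is at or before mid
        else:
            low = mid + 1   # bad pass comes after mid
    return low
```

With N found, `opt_command(N - 1, "foo.bc", "input.bc")` gives the command for step 3, whose output is the module to feed to llvm-extract.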

comment:40 Changed 33 hours ago by alexcrichton

I think Eli from LLVM found a fix at https://reviews.llvm.org/D70103, or at least that fixes the test cases for me locally. Can y'all patch LLVM locally to test out on your end?

comment:41 in reply to:  39 Changed 32 hours ago by gk

Replying to alexcrichton:

Oh so for just -jump-threading to work you'll need to do:

  1. Start with foo.bc
  2. Figure out the smallest N where opt -O3 foo.bc -opt-bisect-limit=N is non-deterministic
  3. Run opt -O3 -o input.bc -opt-bisect-limit=N-1 foo.bc
  4. Use llvm-extract on input.bc to extract the function
  5. Run opt -jump-threading over the extracted *.bc file

You won't be able to run -jump-threading over the original module, you'll need to run it over the module just before the output becomes nondeterministic.

Yeah, that's what I did. :)

comment:42 in reply to:  40 Changed 32 hours ago by gk

Replying to alexcrichton:

I think Eli from LLVM found a fix at https://reviews.llvm.org/D70103, or at least that fixes the test cases for me locally. Can y'all patch LLVM locally to test out on your end?

Yeah, I am on it.
