Opened 3 months ago

Closed 6 weeks ago

Last modified 5 weeks ago

#26514 closed defect (fixed)

intermittent updater failures on Win64 (Error 19)

Reported by: mcs Owned by: tbb-team
Priority: Very High Milestone:
Component: Applications/Tor Browser Version:
Severity: Normal Keywords: TorBrowserTeam201808R
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

When testing updates from Tor Browser 8.0a8 (ESR52) to the 8.0a9 release candidate (ESR60), Kathy and I encountered a failure when trying to update a Win64 browser that was using an obfs4 transport. The error was 19 (CERT_VERIFY_ERROR). We cannot reproduce this error on other platforms, and the mar file signature does in fact seem to be correct.

When testing with the same mar file and running updater.exe manually via a CMD shell, we saw the same error. However, after several attempts the update succeeded. It seems to fail about 4/5 of the time. Georg pointed out that this sounds a lot like a problem that was reported on tor-talk a earlier this year:

https://lists.torproject.org/pipermail/tor-talk/2018-January/043930.html

Note that this is a bug in our ESR52-based Tor Browser; we will need to determine if the bug still exists in an ESR60-based browser (and look for the cause).

Child Tickets

Attachments (1)

0001-Bug-26514-intermittent-updater-failures-on-Win64-Err.patch (1.5 KB) - added by mcs 2 months ago.
patch (workaround)

Download all attachments as: .zip

Change History (24)

comment:1 Changed 3 months ago by cypherpunks

Oh shit! It happens exactly 4 times and succeeds then. The only difference in logs is:

AUS:SVC readStatusFile - status: succeeded, path: \Tor Browser\Browser\TorBrowser\UpdateInfo\updates\0\update.status

comment:2 Changed 3 months ago by gk

Keywords: TorBrowserTeam201806 added
Priority: MediumHigh

comment:3 Changed 3 months ago by gk

https://blog.torproject.org/comment/275988#comment-275988 mentions this seems to be a race condition as this is not visible on slower machines.

comment:4 Changed 3 months ago by gk

Priority: HighVery High

comment:5 Changed 3 months ago by gk

Keywords: TorBrowserTeam201807 added; TorBrowserTeam201806 removed

Moving first batch of tickets to July 2018

comment:6 Changed 2 months ago by mcs

So far, after many attempts, we have not reproduced this bug in an ESR60-based build. However, since it seems to be timing related, maybe we are just getting lucky (that no failures occur).

Investigating the failures with the ESR52-based Tor Browser, we found that it is easy to reproduce in an optimized build. But we found that we cannot reproduce the problem after adding some logging to the libmar code. We are now going to be smarter about the logging, e.g., we will not execute any additional code (and especially we will not log anything) until after an error occurs within libmar's signature verification code.

comment:7 Changed 2 months ago by mcs

Adding logging that occurred after the signature verification failed was helpful and allowed Kathy and me to pin down the cause of this bug in an ESR52-based Tor Browser (but so far not the root cause).

The reason the signature verification fails is because a byte from the MAR file that should be skipped when computing the SHA-512 hash is included in the stream of bytes that is provided as input to the NSS hashing code.

The reason the extra byte is included is because (it seems) there is a bug in the stdio functions that are used inside modules/libmar/verify/mar_verify.c to read the MAR file. The bug is that the file offset is incorrect (functions such as ftello() return a value that is off by one), and when the code inside mar_verify_signatures_for_fp() uses fseeko() to skip past the embedded signature, the resulting file position is wrong.

The wrong file position in turn causes one byte from the signature (which should have been skipped) to be used as input to the SHA-512 hashing code. Game over.

My Windows debugging skills are limited, especially when working with optimized builds. Ideally someone who has some experience with mingw-w64 internals would investigate further. I am not sure if this bug occurs with our ESR60-based Tor Browser builds; we could not reproduce it there but it is intermittent. The libmar code has not been modified much at all but maybe something has been fixed inside mingw-w64. Kathy and I implemented a workaround; I will attach the patch (which we can make available in a git repo if we decide to use it).

Changed 2 months ago by mcs

patch (workaround)

comment:8 in reply to:  7 ; Changed 2 months ago by gk

Replying to mcs:

Adding logging that occurred after the signature verification failed was helpful and allowed Kathy and me to pin down the cause of this bug in an ESR52-based Tor Browser (but so far not the root cause).

The reason the signature verification fails is because a byte from the MAR file that should be skipped when computing the SHA-512 hash is included in the stream of bytes that is provided as input to the NSS hashing code.

The reason the extra byte is included is because (it seems) there is a bug in the stdio functions that are used inside modules/libmar/verify/mar_verify.c to read the MAR file. The bug is that the file offset is incorrect (functions such as ftello() return a value that is off by one), and when the code inside mar_verify_signatures_for_fp() uses fseeko() to skip past the embedded signature, the resulting file position is wrong.

The wrong file position in turn causes one byte from the signature (which should have been skipped) to be used as input to the SHA-512 hashing code. Game over.

My Windows debugging skills are limited, especially when working with optimized builds. Ideally someone who has some experience with mingw-w64 internals would investigate further. I am not sure if this bug occurs with our ESR60-based Tor Browser builds; we could not reproduce it there but it is intermittent. The libmar code has not been modified much at all but maybe something has been fixed inside mingw-w64.

Aewsome work! I think we could test the latter hypothesis by recompiling an ESR52-based Tor Browser with the mingw-w64 toolchain we use for ESR60, no? If so, let's do that.

comment:9 in reply to:  8 ; Changed 2 months ago by mcs

Replying to gk:

Aewsome work! I think we could test the latter hypothesis by recompiling an ESR52-based Tor Browser with the mingw-w64 toolchain we use for ESR60, no? If so, let's do that.

That is a good idea. I tried to do it today but got an error during the firefox compile step. I started with the tbb-8.0a8-build1 tag and applied the patch from #25554 (4c1f78c312aa10ff54e5b07de4ef3278ab2ed01e). Here is a portion of the log; has anyone seen this kind of error before? Maybe I need other changes from the ESR60 era?

x86_64-w64-mingw32-widl: /var/tmp/build/mingw-w64-ee9fc3d0b8c8/mingw-w64-tools/widl/src/typetree.h:274: type_alias_get_aliasee: Assertion `type_is_alias(type)' failed.
Aborted
Makefile:109: recipe for target 'typelib_done' failed
make[5]: *** [typelib_done] Error 134
make[5]: *** Waiting for unfinished jobs....

Looking in Bugzilla, https://bugzilla.mozilla.org/show_bug.cgi?id=1390583#c53 seems relevant... but I am out of time for now to try to decipher the conversation in the bug.

comment:10 in reply to:  9 ; Changed 2 months ago by gk

Replying to mcs:

Replying to gk:

Aewsome work! I think we could test the latter hypothesis by recompiling an ESR52-based Tor Browser with the mingw-w64 toolchain we use for ESR60, no? If so, let's do that.

That is a good idea. I tried to do it today but got an error during the firefox compile step. I started with the tbb-8.0a8-build1 tag and applied the patch from #25554 (4c1f78c312aa10ff54e5b07de4ef3278ab2ed01e). Here is a portion of the log; has anyone seen this kind of error before? Maybe I need other changes from the ESR60 era?

x86_64-w64-mingw32-widl: /var/tmp/build/mingw-w64-ee9fc3d0b8c8/mingw-w64-tools/widl/src/typetree.h:274: type_alias_get_aliasee: Assertion `type_is_alias(type)' failed.
Aborted
Makefile:109: recipe for target 'typelib_done' failed
make[5]: *** [typelib_done] Error 134
make[5]: *** Waiting for unfinished jobs....

Looking in Bugzilla, https://bugzilla.mozilla.org/show_bug.cgi?id=1390583#c53 seems relevant... but I am out of time for now to try to decipher the conversation in the bug.

You need to compile with --disable-accessibility as there is a bug in newer mingw-w64 versions that is not figured out yet if accessibility is enabled.

comment:11 in reply to:  10 Changed 8 weeks ago by mcs

Replying to gk:

You need to compile with --disable-accessibility as there is a bug in newer mingw-w64 versions that is not figured out yet if accessibility is enabled.

Thanks! Adding --disable-accessibility allowed me to build Tor Browser 8.0a8 with the newer mingw-w64 toolchain. The result is that this bug still occurs. At this point, I think it makes sense to try again to reproduce this problem in an ESR60-based Tor Browser, since I cannot think of a reason why we would not have the same problem there. When I tested before (see comment:6) I may have included extra logging which Kathy and I learned later can cause this bug to not occur.

comment:12 Changed 8 weeks ago by mcs

Status: newneeds_information

I can reproduce the problem with my own ESR60-based build. I started with the 8.0a9 tag, changed app.update.url to point to a test server that I control, and replaced the udater's MAR signing certificates with one that Kathy and I generated. So — minimal changes.

Kathy and I think we should merge the workaround that we developed (see comment:7) until a proper fix inside mingw-w64 is available. If we agree on that approach, I will test the workaround in an ESR60-based build and make it available in a git repo.

Is there someone we can ask about the underlying mingw-w64 bug? I am concerned about spending too much time on this ticket at the expense of other tasks (which is not to say that working updates are not important).

comment:13 in reply to:  12 Changed 8 weeks ago by gk

Status: needs_informationnew

Replying to mcs:

I can reproduce the problem with my own ESR60-based build. I started with the 8.0a9 tag, changed app.update.url to point to a test server that I control, and replaced the udater's MAR signing certificates with one that Kathy and I generated. So — minimal changes.

Kathy and I think we should merge the workaround that we developed (see comment:7) until a proper fix inside mingw-w64 is available. If we agree on that approach, I will test the workaround in an ESR60-based build and make it available in a git repo.

Sounds good to me.

Is there someone we can ask about the underlying mingw-w64 bug? I am concerned about spending too much time on this ticket at the expense of other tasks (which is not to say that working updates are not important).

Yes, we should try to get some help. There are three venues I currently can think of. We can mail Jacek and ask, ask on the mingw-w64 mailing list, and ask folks on #mingw-w64 in OFTC's network. I'll start with asking Jacek and might ask around in #mingw-w64.

comment:14 Changed 8 weeks ago by jacek

Could you provide your preprocessed mar_verify.c? Something like:
make -C obj_dir/modules/libmar/verify/ mar_verify.i
should do the trick.

comment:15 Changed 8 weeks ago by gk

Keywords: TorBrowserTeam201808 added; TorBrowserTeam201807 removed

Move our tickets to August.

comment:16 in reply to:  14 Changed 8 weeks ago by mcs

Replying to jacek:

Could you provide your preprocessed mar_verify.c? Something like:
make -C obj_dir/modules/libmar/verify/ mar_verify.i
should do the trick.

I will do that once I figure out how to get a shell within the container for the "firefox" build.

boklm: can you tell me how to do that with rbm? I know I can get a shell after a build failure, but is there way to get a shell within the container without forcing a build failure, e.g., after a successful build of the firefox component or after the firefox configure step is done? I can probably figure it out, but I suspect you or gk my know how to do this already.

comment:17 Changed 8 weeks ago by gk

I don't know how to get a shell within the container during the build. What I usually do if I need to get at preprocessed files is compile with -save-temps and do a exit 1 after the compilation step making sure I get a debug shell.

comment:18 in reply to:  17 Changed 7 weeks ago by mcs

Replying to gk:

I don't know how to get a shell within the container during the build. What I usually do if I need to get at preprocessed files is compile with -save-temps and do a exit 1 after the compilation step making sure I get a debug shell.

Very helpful; thanks! I used -save-temps and have now placed a portion of the build tree here: https://people.torproject.org/~mcs/volatile/bug-26514/

comment:19 Changed 6 weeks ago by jacek

libmar uses in fact _ftelli64/_fseeki64 due to defines in mar_private.h. Those are not available on old msvcrt.dll, so mingw-w64 has its own implementation:
https://github.com/mirror/mingw-w64/blob/master/mingw-w64-crt/stdio/fseeko64.c#L133
I can't see anything obviously wrong there, but in general it does a lot of stuff that really should be done by MS crt in fact. An ideal solution would be to use ucrt-based toolchain (ucrtbase.dll always provides those functions) and disable mingw-w64 implementation for them. I may take a look at mingw-w64 part of it, but changing to ucrt toolchain is an invasive change for Tor build system.

As a less radical solution could be using fseek64 and ftello64. Those are much thinner wrappers in mingw-64 around msvcrt.dll. You could try that by changing defines in mar_private.h

comment:20 in reply to:  19 Changed 6 weeks ago by mcs

Keywords: TorBrowserTeam201808R added; TorBrowserTeam201808 removed
Status: newneeds_review

Replying to jacek:

As a less radical solution could be using fseek64 and ftello64. Those are much thinner wrappers in mingw-64 around msvcrt.dll. You could try that by changing defines in mar_private.h

Thank you for your analysis and for this suggestion. It appears to solve the problem nicely. After reading the other code (ftelli64 and _fseeki64), it is difficult to say where the problem is but some aspects seem to rely on internal Microsoft flags, etc. and certainly there is complexity there (as you said). Anyway, here is a patch that works for Kathy and me:
https://gitweb.torproject.org/user/brade/tor-browser.git/commit/?h=bug26514-01&id=c2720b2274e51eb76b91e5f6a5f2a82bdfaf8013

We tested on an up-to-date Windows 10 system. 200 incremental updates and 200 complete updates all completed without any signature verification failures.

comment:21 Changed 6 weeks ago by gk

Resolution: fixed
Status: needs_reviewclosed

Great! Looks good to me. Applied to tor-browser-60.1.0esr-8.0-1 (commit c2720b2274e51eb76b91e5f6a5f2a82bdfaf8013).

I think it is okay to be conservative and apply the fix to 32bit Windows as well, not knowing where exactly the bug lies. We should test the 32bit case however, making sure we did not regress here.

Jacek: big thanks for your help!

comment:22 in reply to:  21 Changed 5 weeks ago by mcs

Replying to gk:

I think it is okay to be conservative and apply the fix to 32bit Windows as well, not knowing where exactly the bug lies. We should test the 32bit case however, making sure we did not regress here.

Kathy and I tested using a 32-bit Windows build and did not find any problems. I think we are done here!

comment:23 in reply to:  19 Changed 5 weeks ago by heaslr

Replying to jacek:

libmar uses in fact _ftelli64/_fseeki64 due to defines in mar_private.h. Those are not available on old msvcrt.dll, so mingw-w64 has its own implementation:
https://github.com/mirror/mingw-w64/blob/master/mingw-w64-crt/stdio/fseeko64.c#L133
I can't see anything obviously wrong there, but in general it does a lot of stuff that really should be done by MS crt in fact.

Especially, giving the fact that msvcr100.dll supports that, actually.

Note: See TracTickets for help on using tickets.