Opened 3 years ago

Closed 2 years ago

Last modified 20 months ago

#15901 closed defect (not a bug)

apparent memory corruption -- very difficult to isolate

Reported by: starlight
Owned by:
Priority: Very High
Milestone: Tor: 0.2.7.x-final
Component: Core Tor/Tor
Version: Tor: 0.2.6.10
Severity:
Keywords: PostFreeze027, 2016-bug-retrospective
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

I have encountered what appears to be a memory
corruption bug. Have reproduced it two times
since the initial incident.

Do not believe this is a remotely exploitable
bug as it only happens on the next consensus
download after a pair of locally written
control-channel scripts are run manually.

Am looking for advice on how to isolate the
issue after spending significant effort on it
thus far.

1) first occurred with 0.2.4.26 shortly after
building OpenSSL 1.0.1m shared library to
replace 1.0.1j; perhaps is related to
this but concurrently had substantially
increased relay bandwidth, which also might
have led to the conditions that triggered
the issue

2) saw start-up warning regarding difference
between OpenSSL build and runtime versions.
Never believed this was an issue and it
proved to be irrelevant

3) tried running 0.2.4.26 built with ASAN
but did not reproduce

4) built 0.2.5.12 and the bug recurred

5) ran 0.2.5.12 built with ASAN along
with OpenSSL 1.0.1m ASAN and libevent 2.0.21
ASAN and did not reproduce; tried with
libraries non-ASAN and did not reproduce

6) back to 0.2.5.12 standard build, but with
minor patch to enable core files and have
stdout+stderr directed to files; reproduced
the problem again and obtained a good
core file via SIGSEGV; core file is
fully intact and accessible with 'gdb';
thus far have chosen not to delve into
the core

7) tried running again with
MALLOC_CHECK_=3 MALLOC_PERTURB_=85
but did not reproduce the problem

8) have gone back to (6) configuration,
but so far no bug; relay has received
ever higher consensus bandwidth and
traffic since (6) and so perhaps the
sweet spot for producing the bug is
no longer present

Have a vague suspicion that the problem
is tied to a race condition between the
main thread and the crypto thread.

The setup here is unique in a variety of
ways and I believe these differences
are the reason I see this problem where
others have not. Seems prudent at this
point to not include much detail.

At some point I deleted the cached-*
files--possibly this has prevented
reproducing the issue since but
I can't remember offhand. Have daily
network backups and can recover
various generations of these files.

In the two events where a "clean"
shutdown was obtained, unparsable-desc
files were written and these were
retained. In the case of the
SIGSEGV termination this file did
not appear.

Please note the

ISO time "2015-XX-XX v6Dp0:05" was unparseable

messages in the first incident. It appears to
me that four bytes of memory were overwritten
here. Unfortunately I was not patient enough
to wait for the second consensus attempt the
second and third times this happened so it's
unclear if this happens consistently. Perhaps
instrumenting this string with debug code
might lead to isolating the problem.

I would appreciate any advice or help
that might lead to isolating the bug,
ideally by triggering it with the
ASAN build running and thus getting
directly to the problem. Hopefully
someone closely familiar with the
relay code might notice something
indicating a direction to pursue.

Child Tickets

Attachments (4)

tor_relay_bug_from_hell.txt (21.1 KB) - added by starlight 3 years ago.
redacted message log entries
unparseable-desc-event00 (497.6 KB) - added by starlight 3 years ago.
first occurrence unparsable-desc
event03_messages.txt (6.9 KB) - added by starlight 2 years ago.
messages from log for third event
event04_messages.txt (4.7 KB) - added by starlight 2 years ago.
yet another -- some different errors


Change History (46)

Changed 3 years ago by starlight

Attachment: tor_relay_bug_from_hell.txt added

redacted message log entries

comment:1 Changed 3 years ago by starlight

Attached the event log messages from the
three incidents to the ticket, but Trac
did not forward an indication of that
via email.

The messages are somewhat redacted. Can
provide any of the materials mentioned
in original form via private communication.

During some of the tests, logging verbosity
was elevated to 'info' in the hope it would
help, but doing so seems to inhibit memory
corruption events.

Appears the issue is exquisitely sensitive to
code path timing.

comment:2 Changed 3 years ago by starlight

Went back looking at the cached-* files.

Turns out I zapped them right after the last
event. Stands out that the old
cached-microdescs is ~450kb whereas
the rebuilt one is ~720kb. The modify date on
this file tends to be older than the others.
Perhaps this affects the layout of the
heap in such a way that the bug is more
likely to impact consensus data and result
in obvious failure.

Put back the old files and restarted with the
configuration known to fail.

comment:3 Changed 3 years ago by nickm

Milestone: Tor: 0.2.7.x-final
Priority: normal → critical

comment:4 Changed 3 years ago by nickm

A copy of the unparseable-desc files might be good for tracking this down, if you can share them.

Also helpful would be a backtrace from gdb, if you still have the core and the binary that generated it.

Changed 3 years ago by starlight

Attachment: unparseable-desc-event00 added

first occurrence unparsable-desc

comment:5 Changed 3 years ago by starlight

Uploaded unparsable-desc from first occurrence.

I suspect the only useful information may be the
number of entries as the contents look perfectly
ok. The bug seems to be about corrupting the
openssl objects employed in verifying the
descriptors rather then the descriptors
themselves.

Will prepare a package of the core and the system
files that can be used with gdb's "set sysroot"
command for a fully examinable core. Want to
share that privately so contact me at

starlight dot YYYYqQ at binnacle dot cx

where YYYY is the year and Q is the quarter
(1-4) and I'll provide a link where it can
be downloaded.

But be aware that the core is mainly providing
"effect" rather than "cause" unless it's
examined by someone intimate with the code
and who gets lucky and sees something they
recognize that leads to the code illegally
overwriting memory. This is why I have not
dug into it.

The stack trace is apparent with 'gdb's
'where' command, but says nothing of value
as I manually killed the relay process with
"pkill -SEGV tor" after the corruption
occurred--the trace is of the signal
termination handler.

comment:6 Changed 3 years ago by starlight

I realized that possibly the first event showed
visibly corrupt "ISO time" strings due to the
0.2.4.26 version memory layout rather than due
to timing or luck.

So I have built 0.2.4.27 with the core/stdio
patch and put that live in the hope that
the the "ISO time" flavor of the event can
be reproduced. If this happens, I intend
to add some code to set one or more of the
x86 debug registers to trap on a write to
the time string, i.e. a hard-coded "watchpoint"
without using 'gdb'. This could result in a
core file where the stack trace leads directly
to the code path causing memory corruption.
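
Something along these lines should work on Linux without
touching the DR registers by hand, since the kernel will
program them through perf_event_open(). This is only a
sketch, not Tor code, and it assumes the address of the
suspect time string is known at the call site; with no
SIGTRAP handler installed the default action dumps core,
which is exactly what I want:

#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>
#include <linux/hw_breakpoint.h>

/* Arm a hardware write-watchpoint on 8 bytes at 'addr'.  Each hit
 * makes the kernel deliver SIGTRAP to this process; without a handler
 * the process dies with a core whose stack trace points at (or very
 * near) the offending store. */
static int arm_write_watchpoint(const void *addr)
{
  struct perf_event_attr attr;
  memset(&attr, 0, sizeof(attr));
  attr.type = PERF_TYPE_BREAKPOINT;
  attr.size = sizeof(attr);
  attr.bp_type = HW_BREAKPOINT_W;
  attr.bp_addr = (uintptr_t)addr;
  attr.bp_len = HW_BREAKPOINT_LEN_8;
  attr.sample_period = 1;            /* every hit is an "overflow" */
  attr.wakeup_events = 1;
  attr.exclude_kernel = 1;
  attr.exclude_hv = 1;

  int fd = syscall(__NR_perf_event_open, &attr, 0 /* this process */,
                   -1 /* any cpu */, -1, 0);
  if (fd < 0)
    return -1;

  /* Route overflow notifications to us as SIGTRAP. */
  fcntl(fd, F_SETOWN, getpid());
  fcntl(fd, F_SETSIG, SIGTRAP);
  fcntl(fd, F_SETFL, O_ASYNC);
  return fd;
}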

It appears that any use of an alternate
malloc() such as with ASAN perturbs the
memory layout such that the bug will not
appear.

comment:7 Changed 3 years ago by starlight

Should have included these earlier:
The two scripts I run by hand that
appear to trigger the bug:

nc [scrubbed] 9151 <<EOF
AUTHENTICATE "[scrubbed]"
GETCONF RelayBandwidthRate
GETCONF RelayBandwidthBurst
getinfo ns/id/[scrubbed]
getinfo ns/name/[scrubbed]
QUIT
EOF
:
nc [scrubbed] 9151 <<EOF
AUTHENTICATE "[scurbbed]"
getinfo dir/server/authority
QUIT
EOF
:

The bug seems to show up on the first consensus
download after the above are run manually in
close succession, somewhere between 24 and
48 hours after restarting the relay.

I must also admit that the system where the problem
occurs is non-ECC, so a remote possibility exists
that the cause is some type of physical memory
corruption. However I believe this is unlikely
as the behavior of this issue is highly specific
and somewhat reproducible. If the system were
experiencing random memory corruption one would
expect to see additional failures. The box is
a low-end business-grade Dell with an AMD CPU
and I've never had a Dell that was anything but
rock solid reliable.

comment:8 Changed 3 years ago by starlight

Summary: apparent memory corruption from control channel request processing -- very difficult to isolate → apparent memory corruption -- very difficult to isolate

Have to backtrack a bit on the cause. Looking carefully
at the first event, I see that it was two hours from
the time the control channel scripts were run and
the corruption occurred. Unfortunately this opens
it up to virtually any possible cause.

Also note that wherever

Got a bad signature on a networkstatus vote

appears in the log it is followed by

last message repeated 8 times

This message was eliminated by the ' Tor\['
filter used when grep'ing the messages from
the original log.

comment:9 Changed 2 years ago by starlight

Bam!

Just happened again. Is presently in the
bad-consensus state. Any suggestions
anyone before I take the core and restart?

Unfortunately I'm not seeing the corrupt
ISO time warning this time, which I had
hoped would return so I could perhaps
cook up some built-in watchpoint code.

comment:10 Changed 2 years ago by starlight

I left the relay running and this time it seems
as though it may have recovered.

Appears that 'urras' fell over and that triggered
the 0 of 8 good signatures event. Four of the
remaining authorities showed up as bad, but
this implies that the other four were ok?

Is this somehow a normal and generally experienced
phenomenon? Even if so, the original event
with corrupt ISO times and hourly recurrence
looks wrong. And why would the consensus of
four authorities go bad due to a single
authority failure?

Changed 2 years ago by starlight

Attachment: event03_messages.txt added

messages from log for third event

comment:11 Changed 2 years ago by starlight

Nope! Not recovered.

Same bad messages (uploaded to ticket) repeated
after two hours and an additional

parse error: Too few arguments to r
Error tokenizing router status

which has appeared in some earlier incidents.

the unparseable-desc file has appeared as
well with a relatively small count of 825;
curiously these descriptors also appear
in cached-consensus and in a copy of the
consensus retrieved with

wget http://<relay ip>/tor/status-vote/current/consensus

Does anyone have any suggestions before
I kill the relay for the core and restart it?

I can't say that I am enjoying this bug.

comment:12 Changed 2 years ago by starlight

Again!

I noticed this earlier, but with this incident
it appears nearly certain that a requirement
for the failure is that the consensus weight
increase sharply. Happened with these recent
two incidents (consensus weight is oscillating
between ~10000 and ~18000) and I noticed it
with at least one other. Might go back and
look at the consensus record for all five
failures.

Perhaps the bug is in some code that runs
only when the consensus goes up. That would
certainly make it difficult to reproduce.

Changed 2 years ago by starlight

Attachment: event04_messages.txt added

yet another -- some different errors

comment:13 Changed 2 years ago by starlight

Saw a new message with this one:

Unable to load microdesc consensus directory downloaded from server '154.35.175.225:80'. I'll try again soon.
Error decoding microdescriptor digest "no3r7DN7eA+Thw8cstgD9CTAyrGV0CRhNjSqR050,Xs"

Detailed log attached to ticket.

Once again, if anyone familiar with the code
can suggest something to instrument, please
let me know. Is not too difficult to put in
a hard-coded debug-register watchpoint if
one can be identified.

Have switched it to running ASAN+UBSAN in the
hope something will shake out, but am not
overly hopeful as it didn't work the last
time. Including UBSAN perhaps could make a
difference.

comment:14 Changed 2 years ago by nickm

I'm hoping so too, or hoping that valgrind will be possible (watch out; it's expensive!) and turn something up. My main guess right now is a memory corruption issue of some kind, though, since I haven't heard anybody else run into this, and since it seems to manifest in a different way each time. But that doesn't explain why it would concentrate itself in consensus parsing.

Anybody else have some ideas of what to test?

comment:15 Changed 2 years ago by starlight

zero chance it would run live with Valgrind,
which is about a 20x slowdown

I've used memory checkers for years and
Google's ASAN is genius--2x for most
code and happily the multiplier for
tor is more like 1.3, at least for
steady traffic; circuit setup is probably
more like 2x or 3x (when openssl is
instrumented)

Adding UBSAN this time did not appear to
make much difference to CPU and with
luck it will catch it

worst case I'm planning to pick up
a SkyLake later this year for which
gcc 5.1 supports MPX in UBSAN
--this hardware memory checking
feature can catch intra-structure
boundary violations, something
ASAN presently does not detect

comment:16 Changed 2 years ago by starlight

A theory as to the possible cause came to mind:

This could be a race-condition bug in OpenSSL.

They just fixed a race with 1.0.2b, though this is
not likely the one as session tickets are disabled
in tor. But it gave me the idea.

Have held back with this, but now seems reasonable
to note that one big difference in the setup here
is that OpenSSL is compiled with gcc 4.9.2 and
-flto (link-time-optimization). I believe this
difference is why the bug shows up here and not
elsewhere.

When building with LTO, gcc blurs the boundaries
between functions by treating separate modules as
one big ball of code and inlining functions that
normally would not be eligible. This can expose
race conditions where a pointer should be declared
volatile or protected by a mutex and is not, by
causing the pointer to be cached for long
intervals in a register across newly inlined
function boundaries. So such a pointer, instead
of going out of scope and being forced back to
memory by a function return, stays in a register,
possibly through a large iteration loop, and the
result is a latent race-condition bug raising its
evil antenna.
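
A toy illustration of the hazard (made-up C, not a
fragment of OpenSSL or tor):

#include <stddef.h>

/* Another thread swaps 'shared_buf' under a lock this reader never
 * takes.  Once LTO has inlined get_buf() across what used to be a
 * module boundary, the compiler is entitled to load the pointer once
 * and keep it in a register for the whole loop, so later stores land
 * in memory that has since been freed and reused -- exactly the kind
 * of stray write described above. */
static char *shared_buf;

static char *get_buf(void)
{
  return shared_buf;
}

void fill_loop(size_t n)
{
  size_t i;
  for (i = 0; i < n; i++)
    get_buf()[i] = 'x';  /* may write through the stale, cached pointer */
}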

To be fair the bug could be in tor and not
OpenSSL, but I suspect OpenSSL on an instinct.

I'll be disabling the second crypto thread in
tor to test this, though this check can only be
conclusive if it fails.

A more direct way to finding such issues is to run
tor+libssl.so +libcrypto.so compiled with
TSAN (thread sanitizer). Unfortunately TSAN is
a massive CPU hog and, though not as bad as the
Valgrind Helgrind tool, still makes it impossible
to run a relay live this way.

So I'm hoping someone who works with the Tor test
network and ASAN builds can spend some time trying
it with TSAN. I may still take a crack at it at
some point and run the relay for a few minutes
with a TSAN build, but the test environment seems
a better approach on this.

comment:17 Changed 2 years ago by cypherpunks

happens on the next consensus download after a pair of locally written control-channel scripts are run manually.

What kind of control-channel scripts, do they communicate over Tor's control port?

comment:18 Changed 2 years ago by starlight

That was a case of false correlation rather than
causation. Control channel has nothing to do
with it. Read recent updates here rather
than the initial report.

At this point I believe the problem is a
race-condition bug in OpenSSL that is exposed
by compiling and linking OpenSSL libraries
with LTO.

Now running for awhile with non-LTO OpenSSL
and with NumCPUs 1 and so far no corruption
events. However this does not prove anything
and at some point I may go back and run
TSAN and try and shake out the bug.

NOTE: NumCPUs 1 does not eliminate threading,
what it does is make the crypto-worker pool
have one thread. The bug could be either
two crypto-workers colliding or a crypto-
worker colliding with the main thread.
Perhaps TSAN will show which.

comment:19 Changed 2 years ago by starlight

The scripts are in comment 7 above.

Is advisable to read long bug
reports before commenting on them.

comment:20 Changed 2 years ago by starlight

Again!

Error decoding identity digest "rnTn"
Got a bad signature on a networkstatus vote
last message repeated 7 times
Consensus does not include configured authority 'urras' at 208.83.223.34:443 (identity 80550987E1D626E3EBA5E5E75A458DE0626D088C)
A consensus needs 5 good signatures from recognized authorities for us to accept it. This one has 0 (). 8 of the signatures on it didn't verify correctly.
Not enough good signatures on networkstatus consensus
Unable to load ns consensus directory downloaded from server '199.254.238.52:80'. I'll try again soon.
Error decoding identity digest "rnTn"
Got a bad signature on a networkstatus vote
last message repeated 7 times
Consensus does not include configured authority 'urras' at 208.83.223.34:443 (identity 80550987E1D626E3EBA5E5E75A458DE0626D088C)
A consensus needs 5 good signatures from recognized authorities for us to accept it. This one has 0 (). 8 of the signatures on it didn't verify correctly.
Not enough good signatures on networkstatus consensus
Unable to load ns consensus directory downloaded from server '131.188.40.189:80'. I'll try again soon.
Error decoding identity digest "rnTn"
Got a bad signature on a networkstatus vote
last message repeated 7 times
Consensus does not include configured authority 'urras' at 208.83.223.34:443 (identity 80550987E1D626E3EBA5E5E75A458DE0626D088C)
A consensus needs 5 good signatures from recognized authorities for us to accept it. This one has 0 (). 8 of the signatures on it didn't verify correctly.
Not enough good signatures on networkstatus consensus
Unable to load ns consensus directory downloaded from server '194.109.206.212:80'. I'll try again soon.

Made some progress on understanding what is happening. Using the first line 'Error decoding identity digest "rnTn"' and examining the core file plus the archived consensus document, found that a single stray 32-bit store is corrupting the consensus document image in memory:

original:

r Toughmudder rnTUAd5Dh4212x1HlRV3jVzTBgU Ib9ENWJVQ6+5OJlyxe5b/VFLeq8 2015-08-17 18:55:56 5.199.242.60 9001 9030
s Fast HSDir Running Stable V2Dir Valid
v Tor 0.2.4.27

corrupted:

r Toughmudder rnTn rHDXjA12x1HlRV3jVzTBgU Ib9ENWJVQ6+5OJlyxe5b/VFLeq8 2015-08-17 18:55:56 5.199.242.60 9001 9030
s Fast HSDir Running Stable V2Dir Valid
v Tor 0.2.4.27

This leads me to the idea that perhaps mprotect() can be used to mark the consensus document memory read-only immediately after being stored and until the memory is released. This might result in a memory protection violation at the time of a stray pointer write and produce a core with a stack trace precisely identifying the errant code.

Could someone familiar with this area of the code comment on the feasibility of the above and perhaps offer a suggestion or two regarding proper placement of the mprotect() calls?

comment:21 Changed 2 years ago by starlight

Version: Tor: 0.2.5.12 → Tor: 0.2.6.10

comment:22 Changed 2 years ago by starlight

The corrupt store consists of "n rHDXjA" and does
not appear to be a memory address or typical integer.

However this fragment appears in another descriptor:

r FBIPartyTrain rHDXjAZho0VTaXtMkOjCyGF1QW0 vtGOeTZ3wDFo4Dx70WWeGMlO4gs 2015-08-17 08:56:42 24.214.139.69 9001 25565
s Running Stable V2Dir Valid
v Tor 0.2.5.12
w Bandwidth=21
p reject 1-65535

The two values are exactly 13372 bytes distant from each
other, with the source (above) appearing at the lower
memory address. The distance is such that one can
imagine a pointer manipulation bug.

comment:23 Changed 2 years ago by someone_else

mprotect operates on memory pages (4096 bytes). You need to align the memory allocation for the consensus document at a page boundary and pad the allocation to completely include the last page (round up the allocation to a multiple of 4096).

This may change the memory layout enough so that something else gets hit by the bug.

comment:24 Changed 2 years ago by starlight

I thought about that. Figure I'll just start the
mprotect() at a rounded-up page boundary and end
it, likewise, at a rounded-down page boundary.
The consensus document is about 340 pages so
statistically that's good-enough to catch it.
Takes about three-to-four weeks after a
rm cached-* restart for memory to arrange
itself such that the bug hits the WIP
consensus document. I now realize that
the signature validation of the consensus
is acting as a honeypot for detecting the
memory access bug. Probably happens all
the time but doesn't break anything else
in an obvious way.
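
Something like this minimal sketch is what I have in
mind (names are illustrative, not Tor's API; POSIX only
promises mprotect() on mmap()ed memory, but glibc gets
blocks this large from mmap() so in practice it should
be fine):

#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Change protection on the whole pages inside [buf, buf+len).  Partial
 * pages at either end stay writable, so a stray store can still slip
 * past in the first or last ~4kb -- acceptable odds for a ~340-page
 * document. */
static int protect_interior_pages(const void *buf, size_t len, int prot)
{
  uintptr_t pagesz = (uintptr_t)sysconf(_SC_PAGESIZE);
  uintptr_t start = ((uintptr_t)buf + pagesz - 1) & ~(pagesz - 1);
  uintptr_t end = ((uintptr_t)buf + len) & ~(pagesz - 1);
  if (end <= start)
    return 0;                 /* nothing page-sized to protect */
  return mprotect((void *)start, end - start, prot);
}

Call it with PROT_READ right after the uncompressed
document is stored, and again with PROT_READ|PROT_WRITE
just before the buffer is modified or freed.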

Here's some more core-file analysis. I'm
beginning to suspect that the low-order two
bytes of a pointer are being overwritten.

corruption value "n rHDXjA" or endian-adjusted 0x416A58444872206E

load3 from core file, large segment near end, heap per "size -A"
   "load3         67067904        1_____________6"

(gdb) find 0x00007f__98000000, 0x00007f__9bff6000, 0x416A58444872206E
0x7___9ba35d44   FBIPartyTrain   1 of 10
0x7___9bc423cc   FBIPartyTrain   2
0x7___9bda5b62   FBIPartyTrain   3
0x7___9bf0004b   FBIPartyTrain   4
warning: Unable to access 15534 bytes of target memory at 0x7___9bff2353, halting search.


load49 from core file, large segment near end, heap per "size -A"
   "load49       299806720        1_____________6"

(gdb) find 0x00007f__aa638000, 0x00007f__bc423000, 0x416A58444872206E
0x7___bb52c0b4   FBIPartyTrain   5
0x7___bb52f4f0 * ToughMudder     6 of 10   seg offset 0x10EF74F0 of 284128496, 94.7%
0x7___bb71be2e   FBIPartyTrain   7
0x7___bb85ee2a   FBIPartyTrain   8
0x7___bb9cc864   FBIPartyTrain   9
0x7___bbbab2c1   FBIPartyTrain  10
warning: Unable to access 15416 bytes of target memory at 0x7___bc41f3c9, halting search.

display with 

(gdb) x/32c 0x7___bb52f4f0-16

comment:25 Changed 2 years ago by someone_else

Actually looking at the source code, it doesn't look like random memory corruption or a threading issue. The consensus document signatures are checked as soon as the new document is downloaded:

src/or/directory.c::connection_dir_client_reached_eof(dir_connection_t *conn)

...
// networkstatus_set_current_consensus performs signature check
if ((r=networkstatus_set_current_consensus(body, flavname, 0))<0) {
...

Looking at your log files, you are downloading the consensus from multiple different directories and it gets corrupted the same way each time. In your 'event04_messages.txt' the same descriptor is corrupted. This points to a potential gzip or zlib decoding issue (maybe the decompression code is getting miscompiled).

The best chance of tracking this down is probably to instrument the function mentioned above and save both the compressed and uncompressed versions of the consensus to disk.
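
Something as simple as this would do (plain stdio, no Tor helpers; the file-name scheme is made up for illustration):

#include <stdio.h>
#include <time.h>

/* Dump a downloaded body to a uniquely named file, tagged e.g.
 * "compressed" or "inflated", just before it is handed to
 * networkstatus_set_current_consensus(). */
static void debug_dump_body(const char *body, size_t len, const char *tag)
{
  char fname[256];
  FILE *f;
  snprintf(fname, sizeof(fname), "/var/tmp/consensus-%s-%ld.bin",
           tag, (long)time(NULL));
  f = fopen(fname, "wb");
  if (!f)
    return;
  fwrite(body, 1, len, f);
  fclose(f);
}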

Edit: BTW, what leads me to suspect gzip/zlib is that as part of the decompression a dictionary of previously encountered strings is used. Your encountering the same corrupted "n rHDXjA" value twice could be explained by a corrupted entry in that dictionary. There may be some very rarely encountered corner case in the decompression algorithm that gets miscompiled (LTO is a very dangerous compiler option).

Did you encounter the bug with a distributed binary?

Last edited 2 years ago by someone_else (previous) (diff)

comment:26 Changed 2 years ago by starlight

Thank you for the fresh look!

Very interesting, so I should be looking for corruption
of the DEFLATE dictionary. Can't be LTO because ZLIB
was not compiled this way in the earlier instances,
and in this last instance OpenSSL was carefully
built non-LTO.

The LTO progression was ssl-only libraries shared,
ssl+event+tor with libraries shared, ssl+zlib+
event+tor static build, zlib+event+tor not ssl
static build.

I don't see it as a zlib bug because it runs fine
for three weeks before the problem occurs and is
clearly related to the gradual shifting around
of heap allocations. Also, different zlibs
have been in use at the time of the event
a) 1.2.8 shared library non-LTO,
b) 1.2.8 static link LTO.

I just put up an image where ASAN+UBSAN are
applied for only the directory/consensus
logic, but I did not compile ZLIB with
instrumentation. I'll go back and add
that and restart the relay. Instrumenting
just the selected modules is handy because
the relay's forwarding performance is not
impacted. Had to purchase more RAM so
it can run this way for more than a few
days.

comment:27 Changed 2 years ago by starlight

A counter-argument exists to the idea that
zlib INFLATE is the cause. The corruption
is exactly eight bytes and is 64-bit aligned
in memory. I'm going back and looking at
all the core files to see if this pattern
is the same.

INFLATE works with arbitrary length dictionary
strings and the probability that the bad
one is eight bytes on an eight-byte memory
boundary is not especially high. Edge
of the string is in the middle of a 160-bit
SHA1 hash and one would expect uniqueness
to extend further. I'll also try running
a debug zlib deflate/inflate of this
consensus document to see what the dictionary
and token stream looks like.

If the other cores look the same and zlib
compression boundaries don't match the corruption,
I'm back at "semi-random memory pointer corruption"
and will proceed with write-protecting the
consensus document, making sure to do it
incrementally as it is uncompressed.

comment:28 in reply to:  26 ; Changed 2 years ago by someone_else

Replying to starlight:

I don't see it as a zlib bug because it runs fine
for three weeks before the problem occurs and is
clearly related to the gradual shifting around
of heap allocations.

That's rather unlikely. Look at your 'event04_messages.txt'. Tor downloads the consensus document via '128.31.0.34:9131'. Descriptor "no3r7DN7eA+Thw8cstgD9CTAyrGV0CRhNjSqR050,Xs" is corrupt. A couple of minutes later it downloads the consensus from '194.109.206.212:80' and exactly the same descriptor is corrupted (and so on for the other directories that are tried). This is not some random heap corruption, something deterministic is happening with the compressed input document.

Running fine for 3 weeks doesn't mean anything. My theory is that some very specific and extremely unlikely sequence in the compressed file is responsible.

Also different zlibs

have been in use at the time of the event
a) 1.2.8 shared library non-LTO,
b) 1.2.8 static link LTO.

Did you compile all of them yourself?

I just put up an image where ASAN+UBSAN are
applied for only the directory/consensus
logic, but I did not compile ZLIB with
instrumentation.

If there is a miscompilation of zlib this more than likely won't catch it.

comment:29 Changed 2 years ago by starlight

Ran infgen on a couple of different zlib compressed
copies of the specific consensus document, including
one that was downloaded from another relay at the time
of the event. Does not look like corruption
of zlib inflate is the specific problem here since the
corruption occurs in the middle of a string that
zlib consistently determines as too unique to place
in the dictionary.

literal 'pidsFRonline3 rGb/pKs1pZ
match 3 141
literal 'b9McACL8k2Kelw ZJ4/u/tLCi38okB0elroXaGaKgM
match 16 20174
match 5 7141
match 4 13902
match 5 17841
literal '6
match 4 773
literal '3
match 66 1208
match 16 1417
literal '52
match 22 969
literal 'FBIPartyTrain rHDXjAZho0VTaXtMkOjCyGF1QW0 vtGOeTZ3wDFo4Dx70WWeGMlO4gs
match 16 32244
match 4 5143
match 5 12599
match 4 17643
literal '39.
match 8 28994
literal '2
match 3 704
literal '5
match 40 3307
match 39 3109
match 7 979
match 4 18037
match 3 98
literal 'rHF6AbjjwA52F+9lEXpOmcAtx6A db5OxPMAy53jJf3zHgkQg20Z1HI
literal 'w VoQVaXe4yfxn8QllcFhdf4PEwk
match 16 3972
match 4 27288
literal '9
match 6 17937
literal '5
match 5 28863
match 5 15883
match 52 19906
literal '5
match 18 5579
match 22 11060
literal 'ughmudd
match 4 3136
literal 'nTUAd5Dh4212x1HlRV3jVzTBgU Ib9ENWJVQ6+5OJlyxe5b/VFLeq
match 16 24397
match 4 24963
literal '6
match 5 1521
match 4 11060
match 3 529
literal '0
match 78 18130
literal '4
match 22 17938
literal 'Fore
match 5 21104
literal 'rn/S9LXVEwN8E2fnISY3ZhP
match 3 573
literal 'c UMz8kgLbX3E3KvsXCHxDztT6jN0

also

literal 'nQPPi1x9yiWdOI62lr1V5J
match 3 3394
literal 'w VoQVaXe4yfxn8QllcFhdf4PEwk
match 16 3972
match 4 27288
literal '9
match 6 17937
literal '5
match 5 28863
match 5 15883
match 52 19906
match 17 8266
match 22 5578
literal 'Toughmudd
match 4 3136
literal 'nTUAd5Dh4212x1HlRV3jVzTBgU Ib9ENWJVQ6+5OJlyxe5b/VFLeq
match 16 24397
match 4 24963
literal '6
match 5 1521
match 4 11060
match 3 529
match 62 28438
match 18 11082
match 22 17938
literal 'Fore
match 5 21104
literal 'rn/S9LXVEwN8E2fnISY3ZhP
match 3 573
literal 'c UMz8kgLbX3E3KvsXCHxDztT6jN0

The corruption does not correspond with
any of it. As I suspected, the SHA1
digests are random enough (one would hope!)
to be, for the most part, entirely unique.

comment:30 Changed 2 years ago by starlight

I allow that the problem is deterministic and once
the bad state sets in, results in consistent
corruption of the zlib inflate of successive
copies of the document either as part of the
operation or immediately after. But reject that
gcc is generating bad code for zlib either
LTO or non-LTO. This is gcc 4.9.3. I have
not seen a material GCC bug in ten years.
Every single time since the mid-2000s I thought
I had a compiler bug it turned out to be
something else. That a GCC code generation
bug would affect zlib 1.2.8 is beyond
improbable.

Possibly some corruption in memory is leading
to consistently repeated bad zlib inflates.
Is strangest problem I've come across
--ever I think.

comment:31 Changed 2 years ago by starlight

If the pre-inflate data still exists in the
core file I'll extract it and see what
infgen makes of it. However my guess
is that it was inflated-on-the-fly so that
may not be possible.

While the theory that the actual compressed
data is triggering the problem has appeal,
I think either this isn't the problem
or that the result is a persistent
broken state. Once the relay gets in the
bad state it never recovers. Have let it
run for two-or-three consensus cycles at
times so this supports a bad memory state
over a bad document--unless of course
it's a bad document that results in a
bad memory state.

So I'll go ahead and build zlib with
ASAN+UBSAN in the hope this will catch
the problem. Have not run with zlib
instrumented thus far.

comment:32 in reply to:  28 Changed 2 years ago by starlight

Replying to someone_else:

Running fine for 3 weeks doesn't mean anything. My theory is that some very specific and extremely unlikely sequence in the compressed file is responsible.

Also different zlibs

have been in use at the time of the event
a) 1.2.8 shared library non-LTO,
b) 1.2.8 static link LTO.

Did you compile all of them yourself?

Yes. First instance of the bug occurred with
one built by GCC 4.8.1, non-LTO -O3 (configure
default) shared library. Since then reproduced
with 4.9.3 and maybe 4.9.2. Have daily network
backups so can reproduce exact everything for
each occurrence.

comment:33 Changed 2 years ago by starlight

My thought is to run it this way (partial ASAN+UBSAN)
for about six weeks. If the bug does not appear
I'll go back and try both ideas, which are to
modify the code to write a copy of every
pre-zlib-inflate consensus downloaded to a unique file and to
mprotect(READONLY) the consensus document memory
while running uninstrumented.

If I have time and am inspired between now and six
weeks, I'll write the debug patches and maybe
put them up sooner.

comment:34 Changed 2 years ago by someone_else

The 'unparsable-desc' file is written when the consensus document can't be parsed. In many cases, the corruption will probably still allow the file to be parsed and you only get the signature error.

'unparsable-desc' will have the consensus document from the point of the parsing error till the end of the document. The file you attached to the bug corresponds to '2015-04-06-03-00-00-consensus'. A tarball with the archived consensus docs is available here:
https://collector.torproject.org/archive/relay-descriptors/consensuses/consensuses-2015-04.tar.xz

This is the error:

< r Unnamed rHlP4BZR0IGz1QovA1SCpKq7RjI pZlK1ZidWxoxe/ec49ZGhy8ekjI 2015-04-05 16:43:39 83.251.83.140 9001 9030
---
> r Unnamed rHlP4BZR0IGz1QovA1SCpKq7RjI pZlK1ZidWxoxe/ec49ZGhy8ekjI 2015-04-05 16:-06 01:2.251.83.140 9001 9030

At the end of the consensus document is a list of dir auth signatures (you may be able to find them in your core dumps). You can grep for one of the lines after '-----BEGIN SIGNATURE-----' against the consensus document collection from
https://collector.torproject.org/archive/relay-descriptors/consensuses
to find the original one.

comment:35 Changed 2 years ago by starlight

Yes, I've been over this. Have several 'unparseable-desc'
files but what is more interesting is that I have
perfect openable-with-gdb core images for four of
the five events.

Back in comment 20 the basis for the analysis was
opening a 'strings' of the core file in emacs and
comparing the memory version of the consensus document
against both the archived one and one I pulled in .z
form from another relay (identical). This is where
the confident assertion that a _single_ stray 64-bit
store corrupted the document derives from (erroneously
described as 32-bit in comment 20).

I'm planning to review the other three core
images in similar fashion, but am not in a big
hurry. This bug is epic, but real and reproducible
given enough time so I'm settled into a long-haul
mindset.

Just wrote a script to pull the current consensus
from all the authorities. Turns out that the .z
rendering is byte-for-byte identical for all
authorities and that the relay I used to pull
the consensus.z at the time of the latest event
matches them. So probably the above infgen
analysis is correct and the corruption does
not correspond with the zlib stream boundaries.
I'm not ruling out a zlib related cause, but
it's not going to be an obvious one if it is.

comment:36 in reply to:  34 ; Changed 2 years ago by starlight

Replying to someone_else:

This is the error:

< 2015-04-05 16:43:39 83.251.83.140 9001 9030
---
> 2015-04-05 16:-06 01:2.251.83.140 9001 9030

Right, _exactly_ _eight_ _bytes_ corrupted
in the entire document. My bet says
it corresponds with a 64-bit-aligned
memory address in the core image, just
like my analysis above of EVENT 05
(though I will verify it).

I'm sticking with my theory that a
memory pointer somewhere is getting
the low-order 16-bits overwritten.
This is not a confining theory--limitless
possible ways for it to happen either
in the zlib code or elsewhere.

comment:37 in reply to:  36 Changed 2 years ago by someone_else

Replying to starlight:

Replying to someone_else:

This is the error:

< 2015-04-05 16:43:39 83.251.83.140 9001 9030
---
> 2015-04-05 16:-06 01:2.251.83.140 9001 9030

Right, _exactly_ _eight_ _bytes_ corrupted
in the entire document.

It looks like in each case those 8 bytes are a copy of another 8 bytes earlier in the document. '-06 01:2' appears several times before. Just like with 'n rHDXjA'.

comment:38 Changed 2 years ago by someone_else

Have you checked whether your Tor build(s) can parse the consensuses which showed the errors?

If you copy a consensus from the archives to cached-microdesc-consensus, Tor will try to parse it, even if it's extremely old. You need to delete the first line '@type network-status-consensus-3 1.0' from the archived ones.

Unfortunately, I don't see an easy way to feed in a compressed version.

comment:39 Changed 2 years ago by nickm

Keywords: PostFreeze027 added

I'd merge patches for these for 0.2.7 if they come in on time. In some cases, that will require figuring out an as-yet-unsolved bug.

comment:40 Changed 2 years ago by starlight

Resolution: not a bug
Status: new → closed

Found the problem.

Have been viciously bitten by a case of lazy
stubbornness, which was the refusal to spend
a few hours updating the distro of the system
where the Tor daemon runs. It harbors an
ancient version of Fedora, and I've been dropping
in a current kernel and openssl and thinking
that would be good enough.

The glibc on this box has an evil realloc()
bug that's so nasty it could never be reliably
reproduced, was fixed through pure desk-analysis,
and was never corrected in older Fedora releases:

Bug 457508 realloc doesn't fully copy the old data

https://bugzilla.redhat.com/show_bug.cgi?id=457508

Figured this out after going over all the core
images and verifying that (a) the problem has
absolutely nothing to do with zlib, and (b) in every
case it does indeed involve the random corruption
of precisely eight bytes of data aligned on an
eight-byte boundary. Two of the four cores have
a corruption value of medium-size integers instead
of duplicated strings from earlier in the consensus
document, supporting both (a) and (b).

In addition, the observed 8-byte corruption size
is exactly the corruption size that results
from the realloc() bug.

The function tor_gzip_uncompress() starts with
a 2-to-1 compression ratio guess that should
instead be 4-to-1 for consensus documents, so every
time a consensus is downloaded tor_realloc() gets
invoked and thereby realloc() is called, enabling
the potential for the aforementioned critter to raise
its ugly antenna.
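
To make the mechanism concrete, here is a self-contained
sketch of that grow-with-realloc() pattern. It is a
paraphrase of the shape of the code, not a copy of
tor_gzip_uncompress():

#include <stdlib.h>
#include <string.h>
#include <zlib.h>

/* Inflate 'in' into a buffer that starts at a 2-to-1 guess and doubles
 * via realloc() whenever zlib runs out of output room.  A consensus
 * compresses closer to 4-to-1, so the realloc() path runs on every
 * download, and the buggy glibc can silently fail to copy the tail of
 * the old block when the new one lands in a different arena. */
static int uncompress_growing(const char *in, size_t in_len,
                              char **out, size_t *out_len)
{
  size_t cap = in_len * 2;            /* initial 2-to-1 guess */
  char *buf = malloc(cap);
  z_stream s;
  memset(&s, 0, sizeof(s));
  if (!buf || inflateInit(&s) != Z_OK) {
    free(buf);
    return -1;
  }
  s.next_in = (Bytef *)in;
  s.avail_in = in_len;
  s.next_out = (Bytef *)buf;
  s.avail_out = cap;
  for (;;) {
    int r = inflate(&s, Z_FINISH);
    if (r == Z_STREAM_END)
      break;
    if ((r != Z_OK && r != Z_BUF_ERROR) || s.avail_out > 0) {
      inflateEnd(&s);                 /* corrupt or truncated input */
      free(buf);
      return -1;
    }
    /* Out of output space: double the buffer.  This is the realloc()
     * that gives the glibc bug its opening. */
    {
      size_t used = cap - s.avail_out;
      char *nbuf = realloc(buf, cap * 2);
      if (!nbuf) {
        inflateEnd(&s);
        free(buf);
        return -1;
      }
      buf = nbuf;
      cap *= 2;
      s.next_out = (Bytef *)(buf + used);
      s.avail_out = cap - used;
    }
  }
  *out = buf;
  *out_len = cap - s.avail_out;
  inflateEnd(&s);
  return 0;
}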

It now makes sense that the problem is sensitive
to how old cached-* files are since the .new
differential logic impacts the layout of the
heap. The realloc() bug requires that the
destination block be in a different heap arena
than the source block--a scenario requiring
a good deal of exact arrangement and
coincidence.

Also makes sense that running with MEMCHECK
or ASAN inhibits the issue since glibc
realloc() is overlaid in both cases.

comment:41 Changed 2 years ago by nickm

Thanks for taking all the time to track this down, starlight! Excellent analysis.

comment:42 Changed 20 months ago by nickm

Keywords: 2016-bug-retrospective added

Marking for bug retrospective based on Priority.
