I have encountered what appears to be a memory
corruption bug. Have reproduced it two times
since the initial incident.
Do not believe this is a remotely exploitable
bug as it only happens on the next consensus
download after a pair of locally written
control-channel scripts are run manually.
Am looking for advice on how to isolate the
issue after spending significant effort on it
thus far.
(1) First occurred with 0.2.4.26 shortly after
    building an OpenSSL 1.0.1m shared library to
    replace 1.0.1j; perhaps this is related, but
    relay bandwidth had concurrently increased
    substantially, which also might have created
    the conditions that triggered the issue.

(2) Saw a start-up warning regarding a difference
    between the OpenSSL build and runtime versions.
    Never believed this was an issue and it
    proved to be irrelevant.

(3) Tried running 0.2.4.26 built with ASAN
    but did not reproduce.

(4) Built 0.2.5.12 and the bug recurred.

(5) Ran 0.2.5.12 built with ASAN along
    with OpenSSL 1.0.1m ASAN and libevent 2.0.21
    ASAN and did not reproduce; tried with the
    libraries non-ASAN and did not reproduce.

(6) Back to the 0.2.5.12 standard build, but with
    a minor patch to enable core files and direct
    stdout+stderr to files; reproduced the problem
    again and obtained a good core file via SIGSEGV;
    the core file is fully intact and accessible
    with 'gdb'; thus far have chosen not to delve
    into it.

(7) Tried running again with
    MALLOC_CHECK_=3 MALLOC_PERTURB_=85
    but did not reproduce the problem.

(8) Have gone back to the (6) configuration,
    but so far no bug; the relay has received
    ever higher consensus bandwidth and
    traffic since (6), so perhaps the
    sweet spot for producing the bug is
    no longer present.
Have a vague suspicion that the problem
is tied to a race condition between the
main thread and the crypto thread.
The setup here is unique in a variety of
ways and I believe these differences
are the reason I see this problem where
others have not. Seems prudent at this
point to not include much detail.
At some point I deleted the cached-*
files--possibly this has prevented
reproducing the issue since but
I can't remember offhand. Have daily
network backups and can recover
various generations of these files.
In the two events where a "clean"
shutdown was obtained, unparseable-desc
files were written and these were
retained. In the case of the
SIGSEGV termination this file did
not appear.
Please note the
ISO time "2015-XX-XX v6Dp0:05" was unparseable
messages in the first incident. It appears to
me that four bytes of memory were overwritten
here. Unfortunately I was not patient enough
to wait for the second consensus attempt the
second and third times this happened so it's
unclear if this happens consistently. Perhaps
instrumenting this string with debug code
might lead to isolating the problem.
I would appreciate any advice or help
that might lead to isolating the bug,
ideally by triggering it with the
ASAN build running and thus getting
directly to the problem. Hopefully
someone closely familiar with the
relay code might notice something
indicating a direction to pursue.
Turns out I zapped them right after the last
event. Stands out that the old
cached-microdescs is ~450kb whereas
the rebuilt one is ~720kb. The modify date on
this file tends to be older than the others.
Perhaps this affects the layout of the
heap in such a way that the bug is more
likely to impact consensus data and result
in obvious failure.
Put back the old files and restarted with the
configuration known to fail.
I suspect the only useful information may be the
number of entries as the contents look perfectly
ok. The bug seems to be about corrupting the
openssl objects employed in verifying the
descriptors rather than the descriptors
themselves.
Will prepare a package of the core and the system
files that can be used with gdb's "set sysroot"
command for a fully examinable core. Want to
share that privately so contact me at
starlight dot YYYYqQ at binnacle dot cx
where YYYY is the year and Q is the quarter
(1-4) and I'll provide a link where it can
be downloaded.
But be aware that the core is mainly providing
"effect" rather than "cause" unless it's
examined by someone intimate with the code
and who gets lucky and sees something they
recognize that leads to the code illegally
overwriting memory. This is why I have not
dug into it.
The stack trace is apparent with 'gdb's
'where' command, but says nothing of value
as I manually killed the relay process with
"pkill -SEGV tor" after the corruption
occurred--the trace is of the signal
termination handler.
I realized that possibly the first event showed
visibly corrupt "ISO time" strings due to the
0.2.4.26 version memory layout rather than due
to timing or luck.
So I have built 0.2.4.27 with the core/stdio
patch and put that live in the hope that
the the "ISO time" flavor of the event can
be reproduced. If this happens, I intend
to add some code to set one or more of the
x86 debug registers to trap on a write to
the time string, i.e. a hard-coded "watchpoint"
without using 'gdb'. This could result in a
core file where the stack trace leads directly
to the code path causing memory corruption.
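Here is a minimal sketch of the kind of hard-coded watchpoint I have in mind (illustrative only, not tested code). It uses perf_event_open() to have the kernel program a hardware write breakpoint, i.e. the x86 debug registers, from inside the process, and abort()s in the signal handler so the core captures the offending store. The address passed in is a placeholder for wherever the time string lives.
{{{
/* Sketch only: arm a hardware write watchpoint on an 8-byte,
 * 8-byte-aligned region from inside the process and dump core
 * when something writes to it.  The kernel programs the x86
 * debug registers on our behalf via perf_event_open(). */
#define _GNU_SOURCE                     /* for F_SETSIG */
#include <fcntl.h>
#include <linux/hw_breakpoint.h>
#include <linux/perf_event.h>
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static void wp_handler(int sig)
{
  (void)sig;
  abort();                              /* core points at the stray write */
}

int arm_write_watchpoint(void *watch_addr)
{
  struct perf_event_attr attr;
  memset(&attr, 0, sizeof(attr));
  attr.type = PERF_TYPE_BREAKPOINT;
  attr.size = sizeof(attr);
  attr.bp_type = HW_BREAKPOINT_W;       /* trap on writes only */
  attr.bp_addr = (uintptr_t)watch_addr; /* must be 8-byte aligned */
  attr.bp_len = HW_BREAKPOINT_LEN_8;
  attr.sample_period = 1;               /* notify on every hit */
  attr.wakeup_events = 1;
  attr.exclude_kernel = 1;
  attr.exclude_hv = 1;

  int fd = syscall(__NR_perf_event_open, &attr,
                   0 /* this process */, -1 /* any cpu */,
                   -1 /* no group */, 0 /* no flags */);
  if (fd < 0)
    return -1;

  signal(SIGTRAP, wp_handler);
  fcntl(fd, F_SETFL, O_ASYNC);          /* deliver a signal on each hit */
  fcntl(fd, F_SETSIG, SIGTRAP);
  fcntl(fd, F_SETOWN, getpid());
  return fd;
}
}}}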
It appears that any use of an alternate
malloc() such as with ASAN perturbs the
memory layout such that the bug will not
appear.
The bug seems to show up on the first consensus
download after the above are run manually in
close succession, somewhere between 24 and
48 hours after restarting the relay.
I must also admit that the system where the problem
occurs is non-ECC, so a remote possibility exists
that the cause is some type of physical memory
corruption. However I believe this is unlikely
as the behavior of this issue is highly specific
and somewhat reproducible. If the system were
experiencing random memory corruption one would
expect to see additional failures. The box is
a low-end business-grade Dell with an AMD CPU
and I've never had a Dell that was anything but
rock solid reliable.
Have to backtrack a bit on the cause. Looking carefully
at the first event, I see that it was two hours from
the time the control channel scripts were run until
the corruption occurred. Unfortunately this opens
it up to virtually any possible cause.
Also note that wherever
Got a bad signature on a networkstatus vote
appears in the log it is followed by
last message repeated 8 times
This message was eliminated by the ' Tor['
filter used when grep'ing the messages from
the original log.
Trac: Summary changed from "apparent memory corruption from control channel request processing -- very difficult to isolate" to "apparent memory corruption -- very difficult to isolate"
Just happened again. Is presently in the
bad-consensus state. Any suggestions
anyone before I take the core and restart?
Unfortunately I'm not seeing the corrupt
ISO time warning this time, which I had
hoped would return so I could perhaps
cook up some built-in watchpoint code.
I left the relay running and this time it seems
as though it may have recovered.
Appears that 'urras' fell over and that triggered
the 0 of 8 good signatures event. Four of the
remaining authorities showed up as bad, but
this implies that the other four were ok?
Is this somehow a normal and generally experienced
phenomenon? Even if so, the original event
with corrupt ISO times and hourly recurrence
looks wrong. And why would the consensus of
four authorities go bad due to a single
authority failure?
Same bad messages (uploaded to ticket) repeated
after two hours and an additional
{{{
parse error: Too few arguments to r
Error tokenizing router status
}}}
which has appeared in some earlier incidents.
the unparseable-desc file has appeared as
well with a relatively small count of 825;
curiously these descriptors also appear
in cached-consensus and in a copy of the
consensus retrieved with
wget http:///tor/status-vote/current/consensus
Does anyone have any suggestions before
I kill the relay for the core and restart it?
I noticed this earlier, but with this incident
it appears nearly certain that a requirement
for the failure is that the consensus weight
increase sharply. Happened with these recent
two incidents (consensus weight is oscillating
between ~10000 and ~18000) and I noticed it
with at least one other. Might go back and
look at the consensus record for all five
failures.
Perhaps the bug is in some code that runs
only when the consensus goes up. That would
certainly make it difficult to reproduce.
{{{
Unable to load microdesc consensus directory downloaded from server '154.35.175.225:80'. I'll try again soon.
Error decoding microdescriptor digest "no3r7DN7eA+Thw8cstgD9CTAyrGV0CRhNjSqR050,Xs"
}}}
Detailed log attached to ticket.
Once again, if anyone familiar with the code
can suggest something to instrument, please
let me know. Is not too difficult to put in
a hard-coded debug-register watchpoint if
one can be identified.
Have switched it to running ASAN+UBSAN in the
hope something will shake out, but am not
overly hopeful as it didn't work the last
time. Including UBSAN perhaps could make a
difference.
I'm hoping so too, or hoping that valgrind will be possible (watch out; it's expensive!) and turn something up. My main guess right now is a memory corruption issue of some kind, though, since I haven't heard anybody else run into this, and since it seems to manifest in a different way each time. But that doesn't explain why it would concentrate itself in consensus parsing.
zero chance it would run live with Valgrind,
which is about a 20x slowdown
I've used memory checkers for years and
Google's ASAN is genius--2x for most
code and happily the multiplier for
tor is more like 1.3, at least for
steady traffic; circuit setup is probably
more like 2x or 3x (when openssl is
instrumented)
Adding UBSAN this time did not appear to
make much difference to CPU and with
luck it will catch it
worst case I'm planning to pick up
a SkyLake later this year for which
gcc 5.1 supports MPX in UBSAN
--this hardware memory checking
feature can catch intra-structure
boundary violations, something
ASAN presently does not detect
They just fixed a race with 1.0.2b, though this is
not likely the one as session tickets are disabled
in tor. But it gave me the idea.
Have held back with this, but now seems reasonable
to note that one big difference in the setup here
is that OpenSSL is compiled with gcc 4.9.2 and
-flto (link-time-optimization). I believe this
difference is why the bug shows up here and not
elsewhere.
When building with LTO, gcc blurs the boundaries
between functions by treating separate modules as
one big ball of code and inlining functions that
normally would not be eligible. This can expose
race conditions where a pointer should be declared
volatile or protected by a mutex and is not, by
causing the pointer to be cached for long
intervals in a register across newly inlined
function boundaries. So such a pointer, instead
of going out of scope and being forced back to
memory by a function return, stays in a register,
possibly through a large iteration loop and the
result is a latent race-condition bug raising its
evil antenna.
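To make the effect concrete, here is a contrived, self-contained illustration (not tor or OpenSSL code; in the real scenario the accessor lives in a separate translation unit, so that only LTO can inline it, whereas in this single file any optimizing build may do the same):
{{{
/* Contrived illustration of a latent race exposed by inlining.
 * 'done' is shared between threads but is not _Atomic, volatile,
 * or mutex-protected. */
#include <pthread.h>

static int done;                /* should be _Atomic or lock-protected */

static int is_done(void)        /* imagine this in another module */
{
  return done;
}

static void *worker(void *arg)
{
  (void)arg;
  /* With is_done() opaque (separate object file, no LTO), the call
   * forces 'done' to be re-read from memory on every iteration.
   * Once the compiler can inline it, nothing in the loop appears to
   * modify 'done', so the load may be hoisted out and the thread
   * spins on a stale register value -- with a pointer instead of a
   * flag, it keeps dereferencing a stale pointer. */
  while (!is_done())
    ;                           /* spin */
  return NULL;
}

int main(void)
{
  pthread_t t;
  pthread_create(&t, NULL, worker, NULL);
  done = 1;                     /* data race: may never be observed */
  pthread_join(&t, NULL);
  return 0;
}
}}}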
To be fair the bug could be in tor and not
OpenSSL, but I suspect OpenSSL on an instinct.
I'll be disabling the second crypto thread in
tor to test this, though this check can only be
conclusive if it fails.
A more direct way to finding such issues is to run
tor+libssl.so +libcrypto.so compiled with
TSAN (thread sanitizer). Unfortunately TSAN is a
massive CPU hog and though not as bad as the
Valgrind Helgrind tool, still makes it impossible
to run a relay live this way.
So I'm hoping someone who works with the Tor test
network and ASAN builds can spend some time trying
it with TSAN. I may still take a crack at it at
some point and run the relay for a few minutes
with a TSAN build, but the test environment seems
a better approach on this.
That was a case of false correlation rather than
causation. Control channel has nothing to do
with it. Read recent updates here rather
than the initial report.
At this point I believe the problem is a
race-condition bug in OpenSSL that is exposed
by compiling and linking OpenSSL libraries
with LTO.
Now running for awhile with non-LTO OpenSSL
and with NumCPUs 1 and so far no corruption
events. However this does not prove anything
and at some point I may go back and run
TSAN and try and shake out the bug.
NOTE: NumCPUs 1 does not eliminate threading;
what it does is make the crypto-worker pool
have one thread. The bug could be either
two crypto-workers colliding or a crypto-
worker colliding with the main thread.
Perhaps TSAN will show which.
{{{
Error decoding identity digest "rnTn"
Got a bad signature on a networkstatus vote
last message repeated 7 times
Consensus does not include configured authority 'urras' at 208.83.223.34:443 (identity 80550987E1D626E3EBA5E5E75A458DE0626D088C)
A consensus needs 5 good signatures from recognized authorities for us to accept it. This one has 0 (). 8 of the signatures on it didn't verify correctly.
Not enough good signatures on networkstatus consensus
Unable to load ns consensus directory downloaded from server '199.254.238.52:80'. I'll try again soon.
Error decoding identity digest "rnTn"
Got a bad signature on a networkstatus vote
last message repeated 7 times
Consensus does not include configured authority 'urras' at 208.83.223.34:443 (identity 80550987E1D626E3EBA5E5E75A458DE0626D088C)
A consensus needs 5 good signatures from recognized authorities for us to accept it. This one has 0 (). 8 of the signatures on it didn't verify correctly.
Not enough good signatures on networkstatus consensus
Unable to load ns consensus directory downloaded from server '131.188.40.189:80'. I'll try again soon.
Error decoding identity digest "rnTn"
Got a bad signature on a networkstatus vote
last message repeated 7 times
Consensus does not include configured authority 'urras' at 208.83.223.34:443 (identity 80550987E1D626E3EBA5E5E75A458DE0626D088C)
A consensus needs 5 good signatures from recognized authorities for us to accept it. This one has 0 (). 8 of the signatures on it didn't verify correctly.
Not enough good signatures on networkstatus consensus
Unable to load ns consensus directory downloaded from server '194.109.206.212:80'. I'll try again soon.
}}}
Made some progress on understanding what is happening. Using the first line 'Error decoding identity digest "rnTn"' and examining the core file plus the archived consensus document, found that a single stray 32-bit store is corrupting the consensus document image in memory:
original:
{{{
r Toughmudder rnTUAd5Dh4212x1HlRV3jVzTBgU Ib9ENWJVQ6+5OJlyxe5b/VFLeq8 2015-08-17 18:55:56 5.199.242.60 9001 9030
s Fast HSDir Running Stable V2Dir Valid
v Tor 0.2.4.27
}}}
corrupted:
{{{
r Toughmudder rnTn rHDXjA12x1HlRV3jVzTBgU Ib9ENWJVQ6+5OJlyxe5b/VFLeq8 2015-08-17 18:55:56 5.199.242.60 9001 9030
s Fast HSDir Running Stable V2Dir Valid
v Tor 0.2.4.27
}}}
This leads me to the idea that perhaps mprotect() can be used to mark the consensus document memory read-only immediately after being stored and until the memory is released. This might result in a memory protection violation at the time of a stray pointer write and produce a core with a stack trace precisely identifying the errant code.
Could someone familiar with this area of the code comment on the feasibility of the above and perhaps offer a suggestion or two regarding proper placement of the mprotect() calls?
The corrupt store consists of "n rHDXjA" and does
not appear to be a memory address or typical integer.
However this fragment appears in another descriptor:
{{{
r FBIPartyTrain rHDXjAZho0VTaXtMkOjCyGF1QW0 vtGOeTZ3wDFo4Dx70WWeGMlO4gs 2015-08-17 08:56:42 24.214.139.69 9001 25565
s Running Stable V2Dir Valid
v Tor 0.2.5.12
w Bandwidth=21
p reject 1-65535
}}}
The two values are exactly 13372 bytes distant from each
other, with the source (above) appearing at the lower
memory address. The distance is such that one can
imagine a pointer manipulation bug.
mprotect operates on memory pages (4096 bytes). You need to align the memory allocation for the consensus document at a page boundary and pad the allocation to completely include the last page (round up the allocation to a multiple of 4096).
This may change the memory layout enough so that something else gets hit by the bug.
I thought about that. Figure I'll just start the
mprotect() at a rounded-up page boundary and end
it, likewise, at a rounded-down page boundary.
The consensus document is about 340 pages so
statistically that's good-enough to catch it.
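For what it's worth, a sketch of the rounding I have in mind (protect_consensus() is a hypothetical helper, not existing tor code; the real calls would go wherever the consensus buffer is stored and released):
{{{
/* Sketch: write-protect only the whole pages inside the consensus
 * buffer.  The start is rounded UP to a page boundary and the end
 * rounded DOWN, so partial pages at the edges (possibly shared with
 * unrelated allocations) stay writable. */
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

static int protect_consensus(const char *buf, size_t len, int prot)
{
  uintptr_t page  = (uintptr_t)sysconf(_SC_PAGESIZE);   /* typically 4096 */
  uintptr_t start = ((uintptr_t)buf + page - 1) & ~(page - 1);
  uintptr_t end   = ((uintptr_t)buf + len) & ~(page - 1);
  if (end <= start)
    return 0;                     /* buffer smaller than one full page */
  return mprotect((void *)start, end - start, prot);
}

/* Intended usage:
 *   protect_consensus(consensus, consensus_len, PROT_READ);
 *     ... a stray store into these pages now raises SIGSEGV and the
 *         core's stack trace identifies the errant code ...
 *   protect_consensus(consensus, consensus_len, PROT_READ | PROT_WRITE);
 *     before the buffer is legitimately modified or freed. */
}}}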
Takes about three-to-four weeks after a
rm cached-* restart for memory to arrange
itself such that the bug hits the WIP
consensus document. I now realize that
the signature validation of the consensus
is acting as a honeypot for detecting the
memory access bug. Probably happens all
the time but doesn't break anything else
in an obvious way.
Here's some more core-file analysis. I'm
beginning to suspect that the low-order two
bytes of a pointer are being overwritten.
corruption value "n rHDXjA" or endian-adjusted 0x416A58444872206E
{{{
load3 from core file, large segment near end, heap per "size -A"
"load3 67067904 1_____________6"
(gdb) find 0x00007f__98000000, 0x00007f__9bff6000, 0x416A58444872206E
0x7___9ba35d44    FBIPartyTrain   1 of 10
0x7___9bc423cc    FBIPartyTrain   2
0x7___9bda5b62    FBIPartyTrain   3
0x7___9bf0004b    FBIPartyTrain   4
warning: Unable to access 15534 bytes of target memory at 0x7___9bff2353, halting search.

load49 from core file, large segment near end, heap per "size -A"
"load49 299806720 1_____________6"
(gdb) find 0x00007f__aa638000, 0x00007f__bc423000, 0x416A58444872206E
0x7___bb52c0b4    FBIPartyTrain   5
0x7___bb52f4f0  * ToughMudder     6 of 10   seg offset 0x10EF74F0 of 284128496, 94.7%
0x7___bb71be2e    FBIPartyTrain   7
0x7___bb85ee2a    FBIPartyTrain   8
0x7___bb9cc864    FBIPartyTrain   9
0x7___bbbab2c1    FBIPartyTrain  10
warning: Unable to access 15416 bytes of target memory at 0x7___bc41f3c9, halting search.

display with (gdb) x/32c 0x7___bb52f4f0-16
}}}
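For reference, the "endian-adjusted" value is just those eight bytes read as a little-endian 64-bit integer, which is how they look to gdb's find on x86-64:
{{{
/* The eight corrupt bytes viewed as the little-endian 64-bit value
 * handed to gdb's "find" above. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
  const char frag[8] = { 'n', ' ', 'r', 'H', 'D', 'X', 'j', 'A' };
  uint64_t v;
  memcpy(&v, frag, sizeof(v));                   /* little-endian on x86-64 */
  printf("0x%016llX\n", (unsigned long long)v);  /* 0x416A58444872206E */
  return 0;
}
}}}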
Actually looking at the source code, it doesn't look like random memory corruption or a threading issue. The consensus document signatures are checked as soon as the new document is downloaded.
Looking at your log files, you are downloading the consensus from multiple different directories and it gets corrupted the same way each time. In your 'event04_messages.txt' the same descriptor is corrupted. This points to a potential gzip or zlib decoding issue (maybe the decompression code is getting miscompiled).
The best chance of tracking this down is probably to instrument the function mentioned above and save both the compressed and uncompressed versions of the consensus to disk.
Edit: BTW, what leads me to suspect gzip/zlib is that as part of the decompression a dictionary of previously encountered strings is used. Your encountering the same corrupted "n rHDXjA" value twice could be explained by a corrupted entry in that dictionary. There may be some very rarely encountered corner case in the decompression algorithm that gets miscompiled (LTO is a very dangerous compiler option).
Did you encounter the bug with a distributed binary?
Very interesting, so I should be looking for corruption
of the DEFLATE dictionary. Can't be LTO because ZLIB
was not compiled this way in the earlier instances
and in this last instance, OpenSSL was carefully
built non-LTO.
The LTO progression was ssl-only libraries shared,
ssl+event+tor with libraries shared, ssl+zlib+
event+tor static build, zlib+event+tor not ssl
static build.
I don't see it as a zlib bug because it runs fine
for three weeks before the problem occurs and is
clearly related to the gradual shifting around
of heap allocations. Also different zlibs
have been in use at the time of the event
a) 1.2.8 shared library non-LTO,
b) 1.2.8 static link LTO.
I just put up an image where ASAN+UBSAN are
applied for only the directory/consensus
logic, but I did not compile ZLIB with
instrumentation. I'll go back and add
that and restart the relay. Instrumenting
just the selected modules is handy because
the relay's forwarding performance is not
impacted. Had to purchase more RAM so
it can run this way for more than a few
days.
A counter-argument exists to the idea that
zlib INFLATE is the cause. The corruption
is exactly eight bytes and is 64-bit aligned
in memory. I'm going back and looking at
all the core files to see if this pattern
is the same.
INFLATE works with arbitrary length dictionary
strings and the probability that the bad
one is eight bytes on an eight-byte memory
boundary is not especially high. Edge
of the string is in the middle of a 160-bit
SHA1 hash and one would expect uniqueness
to extend further. I'll also try running
a debug zlib deflate/inflate of this
consensus document to see what the dictionary
and token stream looks like.
If the other cores look the same and zlib
compression boundaries don't match the corruption,
I'm back at "semi-random memory pointer corruption"
and will proceed with write-protecting the
consensus document, making sure to do it
incrementally as it is uncompressed.
I don't see it as a zlib bug because it runs fine
for three weeks before the problem occurs and is
clearly related to the gradual shifting around
of heap allocations.
That's rather unlikely. Look at your 'event04_messages.txt'. Tor downloads the consensus document via '128.31.0.34:9131'. Descriptor "no3r7DN7eA+Thw8cstgD9CTAyrGV0CRhNjSqR050,Xs" is corrupt. A couple of minutes later it downloads the consensus from '194.109.206.212:80' and exactly the same descriptor is corrupted (and so on for the other directories that are tried). This is not some random heap corruption, something deterministic is happening with the compressed input document.
Running fine for 3 weeks doesn't mean anything. My theory is that some very specific and extremely unlikely sequence in the compressed file is responsible.
Also different zlibs
have been in use at the time of the event
a) 1.2.8 shared library non-LTO,
b) 1.2.8 static link LTO.
Did you compile all of them yourself?
I just put up an image where ASAN+UBSAN are
applied for only the directory/consensus
logic, but I did not compile ZLIB with
instrumentation.
If there is a miscompilation of zlib this more than likely won't catch it.
Ran infgen on a couple of different zlib compressed
copies of the specific consensus document, including
one that was downloaded from another relay at the time
of the event. Does not look like corruption
of zlib inflate is the specific problem here since the
corruption occurs in the middle of a string that
zlib consistently determines as too unique to place
in the dictionary.
The corruption does not correspond with
any of it. As I suspected, the SHA1
digests are random enough (one would hope!)
to be, for the most part, entirely unique.
I allow that the problem is deterministic and once
the bad state sets in, results in consistent
corruption of the zlib inflate of successive
copies of the document either as part of the
operation or immediately after. But I reject the idea
that gcc is generating bad code for zlib, either
LTO or non-LTO. This is gcc 4.9.3. I have
not seen a material GCC bug in ten years.
Every single time since the mid-2000s that I thought
I had a compiler bug, it turned out to be
something else. That a GCC code generation
bug would affect zlib 1.2.8 is beyond
improbable.
Possibly some corruption in memory is leading
to consistently repeated bad zlib inflates.
Is strangest problem I've come across
--ever I think.
If the pre-inflate data still exists in the
core file I'll extract it and see what
infgen makes of it. However my guess
is that it was inflated-on-the-fly so that
may not be possible.
While the theory that the actual compressed
data is triggering the problem has appeal,
I think either this isn't the problem
or that the result is a persistent
broken state. Once the relay gets in the
bad state it never recovers. Have let it
run for two-or-three consensus cycles at
times so this supports a bad memory state
over a bad document--unless of course
it's a bad document that results in a
bad memory state.
So I'll go ahead and build zlib with
ASAN+UBSAN in the hope this will catch
the problem. Have not run with zlib
instrumented thus far.
Running fine for 3 weeks doesn't mean anything. My theory is that some very specific and extremely unlikely sequence in the compressed file is responsible.
Also different zlibs
have been in use at the time of the event
a) 1.2.8 shared library non-LTO,
b) 1.2.8 static link LTO.
Did you compile all of them yourself?
Yes. First instance of the bug occurred with
one built by GCC 4.8.1, non-LTO -O3 (configure
default) shared library. Since then reproduced
with 4.9.3 and maybe 4.9.2. Have daily network
backups so can reproduce everything exactly for
each occurrence.
My thought is to run it this way (partial ASAN+UBSAN)
for about six weeks. If the bug does not appear
I'll go back and try both ideas, which is to
modify the code to write a copy of every pre-zlib
-inflate consensus downloaded to a unique file and to
mprotect(READONLY) the consensus document memory
while running uninstrumented.
If I have time and am inspired between now and six
weeks, I'll write the debug patches and maybe
put them up sooner.
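For the first of those two ideas, roughly the following is what I have in mind (dump_consensus_copy() and the path are hypothetical; the real hook would go in the directory-download path while the compressed body is still available, and again after inflation):
{{{
/* Hypothetical debug helper: write a downloaded consensus body to a
 * uniquely named file so the compressed and inflated copies can be
 * compared after a corruption event. */
#include <stdio.h>
#include <time.h>

static void dump_consensus_copy(const char *tag,   /* "raw" or "inflated" */
                                const char *body, size_t body_len)
{
  char path[256];
  snprintf(path, sizeof(path), "/var/tmp/consensus-%s-%ld.bin",
           tag, (long)time(NULL));
  FILE *f = fopen(path, "wb");
  if (!f)
    return;                     /* debug-only code: ignore failures */
  fwrite(body, 1, body_len, f);
  fclose(f);
}
}}}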
The 'unparsable-desc' file is written when the consensus document can't be parsed. In many cases, the corruption will probably still allow the file to be parsed and you only get the signature error.
{{{
< r Unnamed rHlP4BZR0IGz1QovA1SCpKq7RjI pZlK1ZidWxoxe/ec49ZGhy8ekjI 2015-04-05 16:43:39 83.251.83.140 9001 9030
---
> r Unnamed rHlP4BZR0IGz1QovA1SCpKq7RjI pZlK1ZidWxoxe/ec49ZGhy8ekjI 2015-04-05 16:-06 01:2.251.83.140 9001 9030
}}}
At the end of the consensus document is a list of dir auth signatures (you may be able to find them in your core dumps). You can grep for one of the lines after '-----BEGIN SIGNATURE-----' against the consensus document collection from
https://collector.torproject.org/archive/relay-descriptors/consensuses
to find the original one.
Yes, I've been over this. Have several 'unparseable-desc'
files but what is more interesting is that I have
perfect openable-with-gdb core images for four of
the five events.
Back in comment 20 the basis for the analysis was
opening a 'strings' of the core file in emacs and
comparing the memory version of the consensus document
against both the archived one and one I pulled in .z
form from another relay (identical). This is where
the confident assertion that a single stray 64-bit
store corrupted the document derives from (erroneously
described as 32-bit in comment 20).
I'm planning to review the other three core
images in similar fashion, but am not in a big
hurry. This bug is epic, but real and reproducible
given enough time so I'm settled into a long-haul
mindset.
Just wrote a script to pull the current consensus
from all the authorities. Turns out that the .z
rendering is byte-for-byte identical for all
authorities and that the relay I used to pull
the consensus.z at the time of the latest event
matches them. So probably the above infgen
analysis is correct and the corruption does
not correspond with the zlib stream boundaries.
I'm not ruling out a zlib related cause, but
it's not going to be an obvious one if it is.
This is the error:
{{{
< 2015-04-05 16:43:39 83.251.83.140 9001 9030
> 2015-04-05 16:-06 01:2.251.83.140 9001 9030
}}}
Right, exactly eight bytes corrupted
in the entire document. My bet says
it corresponds with a 64-bit-aligned
memory address in the core image, just
like my analysis above of EVENT 05
(though I will verify it).
I'm sticking with my theory that a
memory pointer somewhere is getting
the low-order 16-bits overwritten.
This is not a confining theory--limitless
possible ways for it to happen either
in the zlib code or elsewhere.
This is the error:
{{{
< 2015-04-05 16:43:39 83.251.83.140 9001 9030
> 2015-04-05 16:-06 01:2.251.83.140 9001 9030
}}}
Right, exactly eight bytes corrupted
in the entire document.
It looks like in each case those 8 bytes are a copy of another 8 bytes earlier in the document. '-06 01:2' appears several times before. Just like with 'n rHDXjA'.
Have you tried whether your Tor build(s) can parse the consensuses which showed the errors?
If you copy a consensus from the archives to cached-microdesc-consensus, Tor will try to parse it, even if it's extremely old. You need to delete the first line '@type network-status-consensus-3 1.0' from the archived ones.
Unfortunately, I don't see an easy way to feed in a compressed version.
Have been viciously bitten by a case of lazy
stubbornness, which was the refusal to spend
a few hours updating the distro of the system
where the Tor daemon runs. It harbors an
ancient version of Fedora, and I've been dropping
in a current kernel and openssl and thinking
that would be good enough.
The glibc on this box has an evil realloc()
bug that's so nasty it could never be reliably
reproduced, was fixed through pure desk-analysis,
and was never corrected in older Fedora releases:
Bug 457508: realloc doesn't fully copy the old data
Figured this out after going over all the core
images and verifying that, a) no the problem has
absolutely nothing to do with zlib, b) in every
case it does indeed involve the random corruption
of precisely eight bytes of data aligned on an
eight-byte boundary. Two of the four cores have
a corruption value of medium-size integers instead
of duplicated strings from earlier in the consensus
document, supporting both (a) and (b).
In addition, the observed 8-byte corruption size
is exactly the corruption size that results
from the realloc() bug.
The function tor_gzip_uncompress() starts with
a 2-to-1 compression ratio guess that should
instead be 4-to-1 for consensus documents, so every
time a consensus is downloaded tor_realloc() gets
invoked and thereby realloc() is called, enabling
the potential for the aforementioned critter to raise
its ugly antenna.
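Roughly, the growth pattern looks like the sketch below (heavily simplified, not the real tor_gzip_uncompress()). The only point is that a consensus which compresses better than 2-to-1 guarantees at least one realloc() of the output buffer on every download:
{{{
/* Heavily simplified sketch of the growth pattern described above
 * (not the real tor_gzip_uncompress()).  The output buffer starts at
 * twice the compressed size and is grown with realloc() whenever
 * inflate runs out of room. */
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

int uncompress_with_growth(char **out, size_t *out_len,
                           const char *in, size_t in_len)
{
  size_t cap = in_len * 2;                 /* initial 2:1 guess */
  char *buf = malloc(cap);
  z_stream s;
  memset(&s, 0, sizeof(s));
  if (!buf || inflateInit2(&s, 15 + 32) != Z_OK)   /* zlib or gzip */
    { free(buf); return -1; }
  s.next_in  = (Bytef *)in;  s.avail_in  = (uInt)in_len;
  s.next_out = (Bytef *)buf; s.avail_out = (uInt)cap;
  for (;;) {
    int r = inflate(&s, Z_FINISH);
    if (r == Z_STREAM_END)
      break;
    if ((r == Z_OK || r == Z_BUF_ERROR) && s.avail_out == 0) {
      /* Out of output space: double the buffer.  This realloc() can
       * move the block into a different heap arena -- exactly the
       * case the buggy glibc mishandles. */
      size_t used = cap;
      cap *= 2;
      char *nbuf = realloc(buf, cap);
      if (!nbuf) { free(buf); inflateEnd(&s); return -1; }
      buf = nbuf;
      s.next_out  = (Bytef *)(buf + used);
      s.avail_out = (uInt)(cap - used);
      continue;
    }
    free(buf); inflateEnd(&s); return -1;  /* corrupt/truncated input */
  }
  *out = buf;
  *out_len = cap - s.avail_out;
  inflateEnd(&s);
  return 0;
}
}}}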
It now makes sense that the problem is sensitive
to how old cached-* files are since the .new
differential logic impacts the layout of the
heap. The realloc() bug requires that the
destination block be in a different heap arena
than the source block--a scenario requiring
a good deal of exact arrangement and
coincidence.
Also makes sense that running with MEMCHECK
or ASAN inhibits the issue since glibc
realloc() is overlaid in both cases.
Trac: Status: new to closed; Resolution: N/A to not a bug