apparent memory corruption -- very difficult to isolate
I have encountered what appears to be a memory corruption bug. Have reproduced it two times since the initial incident.
Do not believe this is a remotely exploitable bug as it only happens on the next consensus download after a pair of locally written control-channel scripts are run manually.
Am looking for advice on how to isolate the issue after spending significant effort on it thus far.
1. first occurred with 0.2.4.26 shortly after building OpenSSL 1.0.1m shared library to replace 1.0.1j; perhaps related to this, but I had concurrently and substantially increased relay bandwidth, which also might have led to the conditions that triggered the issue
2. saw start-up warning regarding difference between OpenSSL build and runtime versions; never believed this was an issue and it proved to be irrelevant
3. tried running 0.2.4.26 built with ASAN but did not reproduce
4. built 0.2.5.12 and the bug recurred
5. ran 0.2.5.12 built with ASAN along with OpenSSL 1.0.1m ASAN and libevent 2.0.21 ASAN and did not reproduce; tried with libraries non-ASAN and did not reproduce
6. back to 0.2.5.12 standard build, but with minor patch to enable core files and have stdout+stderr directed to files; reproduced the problem again and obtained a good core file via SIGSEGV; core file is fully intact and accessible with 'gdb'; thus far have chosen not to delve into the core
7. tried running again with MALLOC_CHECK_=3 MALLOC_PERTURB_=85 but did not reproduce the problem
8. have gone back to (6) configuration, but so far no bug; relay has received ever higher consensus bandwidth and traffic since (6) and so perhaps the sweet spot for producing the bug is no longer present
Have a vague suspicion that the problem is tied to a race condition between the main thread and the crypto thread.
The setup here is unique in a variety of ways and I believe these differences are the reason I see this problem where others have not. Seems prudent at this point to not include much detail.
At some point I deleted the cached-* files--possibly this has prevented reproducing the issue since but I can't remember offhand. Have daily network backups and can recover various generations of these files.
In the two events where a "clean" shutdown was obtained, unparsable-desc files were written and these were retained. In the case of the SIGSEGV termination this file did not appear.
Please note the
ISO time "2015-XX-XX v6Dp0:05" was unparseable
messages in the first incident. It appears to me that four bytes of memory were overwritten here: "v6Dp" sits where the "HH:M" of the time field should be. Unfortunately I was not patient enough to wait for the second consensus attempt the second and third times this happened, so it's unclear if this happens consistently. Perhaps instrumenting this string with debug code might lead to isolating the problem.
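To make the "four bytes" reading concrete, here is a minimal Python sketch that checks the logged string against the "YYYY-MM-DD HH:MM:SS" layout and reports which offsets break it. It is only an offline illustration, not code from the relay: the function names and the template are my own, and the "X" characters are treated as redacted digits and skipped.

```python
# Expected layout of the logged timestamp: "YYYY-MM-DD HH:MM:SS".
# 'd' marks a digit position; every other character must match literally.
ISO_TEMPLATE = "dddd-dd-dd dd:dd:dd"


def violating_offsets(stamp, wildcard="X"):
    """Return the offsets at which `stamp` breaks the ISO layout.

    Characters equal to `wildcard` are assumed to be redacted and are skipped.
    """
    if len(stamp) != len(ISO_TEMPLATE):
        raise ValueError("unexpected length %d" % len(stamp))
    bad = []
    for i, (want, got) in enumerate(zip(ISO_TEMPLATE, stamp)):
        if got == wildcard:
            continue
        ok = got.isdigit() if want == "d" else got == want
        if not ok:
            bad.append(i)
    return bad


def damaged_span(stamp):
    """Return (start_offset, length) covering all violations, or None."""
    bad = violating_offsets(stamp)
    if not bad:
        return None
    return bad[0], bad[-1] - bad[0] + 1


logged = "2015-XX-XX v6Dp0:05"
print(violating_offsets(logged))      # offsets that cannot be valid
print(damaged_span(logged))           # smallest span covering them
print(logged[11:15].encode().hex())   # raw bytes of the suspect region
```

Offsets 11, 13 and 14 are provably wrong; offset 12 ("6") happens to be a digit, so it cannot be told apart from legitimate data, but the covering span is four bytes starting at offset 11. The hex dump of those bytes may be worth searching for in the core file.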
I would appreciate any advice or help that might lead to isolating the bug, ideally by triggering it with the ASAN build running and thus getting directly to the problem. Hopefully someone closely familiar with the relay code might notice something indicating a direction to pursue.