I spent some time trying to diagnose some very strange micro-descriptor behaviour that seemed to be causing my directory authority to oom. After some investigation involving truncated/embedded microdescriptors, it turned out to most likely be due to my box ooming and corrupting the microdescriptor cache which then triggered more ooms, and not the other way around.
Removing the corrupted cache stopped the ooming and the warns about large keys and malloc failures. (Or, it could have just been that the exploit attempt stopped sometime during all the ooming, and then removing the cache ended the issue?).
The microdesc code seems to be pretty sparse when it comes to logs, dumping weird microdescs, and other diagnostics. This took much longer than it should have to get to the bottom of it. But perhaps it was just due to byzantine failures and won't happen again.
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information
Child items ...
Show closed items
Linked items 0
Link issues together to show that they're related.
Learn more.
Apr 09 07:33:59.585 [warn] Unparseable microdescriptor
Apr 09 07:33:59.905 [warn] parse error: Malformed object: mismatched end tag RSA PUBLIC KEY
Apr 09 07:33:59.585 [warn] parse error: Malformed object: missing object end line
Apr 09 07:33:59.585 [warn] Unparseable microdescriptor
Apr 09 07:33:59.905 [warn] parse error: Malformed object: mismatched end tag RSA PUBLIC KEY
Apr 09 07:33:59.905 [warn] Unparseable microdescriptor
Apr 09 07:34:00.143 [warn] parse error: Malformed object: missing object end line
Apr 09 07:34:00.143 [warn] Unparseable microdescriptor
Apr 09 07:34:05.296 [err] Out of memory on realloc(). Dying.
Apr 09 09:25:57.799 [warn] Unparseable microdescriptor
Apr 09 09:25:57.799 [warn] crypto error while reading public key from string: malloc failure (in bignum routines:BN_EXPAND_INTERNAL)
Apr 09 09:25:57.799 [warn] crypto error while reading public key from string: nested asn1 error (in asn1 encoding routines:ASN1_TEMPLATE_NOEXP_D2I)
Apr 09 09:25:57.799 [warn] crypto error while reading public key from string: ASN1 lib (in PEM routines:PEM_ASN1_read_bio)
Apr 09 09:25:57.799 [warn] parse error: Couldn't parse public key.
Apr 09 09:25:57.799 [warn] Unparseable microdescriptor
Apr 09 10:41:06.278 [warn] parse error: Malformed object: missing object end line
Apr 09 10:41:06.279 [warn] Unparseable microdescriptor: @last-listed 2010-02-04 01:50:01
Apr 09 10:41:07.486 [warn] parse error: Malformed object: missing object end line
Apr 09 10:41:07.486 [warn] Unparseable microdescriptor: @last-listed 2010-02-06 05:50:01
Apr 09 10:41:09.900 [warn] parse error: Malformed object: missing object end line
The microdesc code apparently does not log anything below warn, nor does it log unparseable descriptors. Inspecting the microdesc cache revealed that several microdescs appeared to be just running into the next without proper termination, perhaps a side effect of earlier crashes/ooms.
(ides emitted these log lines while loading microdescriptors from its cached-microdescs.new file.)
Notice the @last-listed dates -- ides had been corrupting its microdesc cache for over a year, but didn't OOM in the process of trying to parse the entire tail of its MD cache until this month, when the file had become much longer.
Here is a longer piece of one of those ‘Unparseable microdescriptor’s:
Apr 09 10:41:14.550 [warn] Unparseable microdescriptor: @last-listed 2010-08-13 07:50:01onion-key-----BEGIN RSA PUBLIC KEY-----MIGJAoGBAMeiFlr3EKP5qVMthV8Mi6NYvONH1ZlWNrg3947qNQj6OOE57hK/qT61Ovx717sEtdfuksSXxxVVd8K1ym5gMP4ffAZWFYc5Z3PxusNEs+0EjwyVLxrrwnY/hKG+XjXdW48TWQoad3HyRMMdQUfm+sSf6nEusEeRgg9gv+JHF1G/AgMBAAE=---@last-listed 2010-08-18 04:20:01onion-key-----BEGIN RSA PUBLIC KEY-----MIGJAoGBALOgBn1u7gQCEIiowkX0cMVi20yZNoUXFbEn2HKreqGO/ZssPEcdAXbS1QdONiazdwVC7oFmdJ0OtS+OPyKPkoBqw0lR9CtOBXlJ45n+r7X2Yks0BHCt68XxuqnP/1jODPsex2hxaa5WU0HXIh7idsIdJCrfZPw39V/Abw4mllKNAgMBAAE=-----END RSA PUBLIC KEY-----family slippy@last-listed 2010-08-18 04:20:01onion-key-----BEGIN RSA PUBLIC KEY-----MIGJAoGBALTs9+vmYkA4VIlzbeRydehhMVEYyifxCm1dibfv9A93we8QM/UvUkSk
The microdesc code apparently does not log anything below warn, nor does it log unparseable descriptors.
(Mike had to modify the ‘Unparseable microdescriptor’ log_warn call to dump the descriptor into the log file.)
Inspecting the microdesc cache revealed that several microdescs appeared to be just running into the next without proper termination, perhaps a side effect of earlier crashes/ooms.
microdescs_add_list_to_cache and dump_microdescriptor are scary. Perhaps we should be prefixing each item in the cached-*.new files with a line containing the cached item's length and a short (32 or fewer bits) hash, and trying to resynchronize if we read a damaged item.
I'm reopening this ticket because I see no evidence that the underlying bug has been fixed. In particular, git blame shows that nothing relevant in src/or/microdesc.c or src/common/util.c has been changed since 2010-01-25, and microdescs were still being written improperly months later.
Mike: Did you keep a copy of your cached-microdescs* files, or just delete them?
I've moved Mike's ‘agile’ markings for his investigation of the April 9, 2011 ides crash to ticket #2957 (closed), so we can use this ticket to describe the actual bug.
Trac: Summary: Weird dirauth microdesc malloc failures, warns, ooms, exploit attempts? to ides corrupted its cached-microdescs.new file Owner: mikeperry to nickm Keywords: MikePerryIterationFires20110417 deleted, N/Aadded Points: 2 toN/A Status: reopened to assigned Actualpoints: 2 toN/A
Notice the @last-listed dates -- ides had been corrupting its microdesc cache for over a year, but didn't OOM in the process of trying to parse the entire tail of its MD cache until this month, when the file had become much longer.
Again, I speculated rather than RTFSing... It didn't try to parse the entire tail of the MD cache, just the text until just before the next line that begins with “@” or “onion-key”.
Inspecting the microdesc cache revealed that several microdescs appeared to be just running into the next without proper termination, perhaps a side effect of earlier crashes/ooms.
microdescs_add_list_to_cache and dump_microdescriptor are scary. Perhaps we should be prefixing each item in the cached-*.new files with a line containing the cached item's length and a short (32 or fewer bits) hash, and trying to resynchronize if we read a damaged item.
This doesn't seem to be necessary in order to prevent a runaway parser, but it would help us recover more items from the cache when something does go wrong.
Decreasing priority, because I am now convinced that the cache corruption didn't increase ides's memory consumption significantly.
I think for this one, the problem might have been (as I think cpunks was saying above?) the lack of error handling code in microdescs_add_list_to_cache() interacting badly with some other bug. That would explain why this hasn't recurred, in spite of there not having been anything obvious to have fixed the corruption. The fix in that case would be to address the XXXs in microdescs_add_list_to_cache(). This leads to the question: what's the right thing to do if we can't write microdesciptors to the cache? I lean towards, "Abort, don't make the change, and log a warning."