When we log unparseable descriptor material for our vote, we proceed to overwrite it with the invalid consensus we produced. The vote gets logged at log level notice, but only in truncated form, which makes it impossible to analyze this bug.
I think the simplest solution here is to modify dump_desc() in routerparse.c to include the hash of the offending descriptor in the filename and then mention the hash in the log message about the parse failure. This potentially allows someone to exhaust disk space by uploading a lot of bogus descriptors, though, so we'll need to either have a way to clean up old logged descriptors or otherwise limit that, or restrict it only to cases where the unparseable material can't have come from an arbitrary source.
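As a minimal sketch of the proposed filename scheme, a helper might hex-encode the descriptor's digest into the dump filename. The helper name and prefix here are illustrative only (the real implementation would use Tor's own digest and path helpers):

```c
#include <stdio.h>
#include <stddef.h>

/* Illustrative sketch: build "unparseable-desc.<hex-digest>" from a raw
 * digest. Not Tor's actual code; Tor would use its crypto_digest helpers. */
static void
digest_to_filename(const unsigned char *digest, size_t digest_len,
                   char *out, size_t out_len)
{
  size_t pos = (size_t)snprintf(out, out_len, "unparseable-desc.");
  for (size_t i = 0; i < digest_len && pos + 2 < out_len; ++i, pos += 2)
    snprintf(out + pos, out_len - pos, "%02x", digest[i]);
}
```

The same hex digest would then appear in the parse-failure log message, letting an operator match the message to the dumped file.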
Proposed fix in my bug18322 branch; this introduces a new configuration option for the maximum size of dumped descriptors. When this option is non-zero, the filename for dumped descriptors includes their hex-encoded SHA-256, which is referenced in the log message at the time of the dump.
Hello. An initial review follows. I actually expected this feature to require less code, but it seems like fears of DoS made it more complicated. My current main concerns are the UX of this change, and the lack of tests.
I think this feature is disabled by default. Why is that? Is there any way the old behavior is better than this feature with appropriate defaults?
Also, when this feature is disabled, I think we fall back to the old behavior. Why would we ever want to do that? Why not enable this by default, with a small size allowance (e.g. 10MB), and remove all the old behavior code?
The torrc option is called DetailedLogForUnparseableDescriptors but it actually requires a size input to work. With such a name, I would expect it to be a boolean option that enables or disables the feature. If we enable this feature by default as suggested above, maybe we can repurpose the torrc option to something that simply controls the maximum size?
I feel that this FIFO functionality is quite complex (lots of code added, asserts and length calculations) without any tests whatsoever. The general logic is also undocumented.
I wonder if there is something simpler we could do instead of FIFO. Like just refuse to log more descriptors if the maximum size has been reached, and log this to the operator. Or maybe enforce limits based on the number of descriptors instead of the exact size.
Alternatively, if we like the FIFO approach, maybe some tests would help to bring more confidence here.
  if (emit_prefix)
    filelen = 50 + strlen(type) + len;

What is 50 here?
  if (debugfile_base != NULL)
    tor_free(debugfile_base);
tor_free() should be able to handle NULL as input, so the if check is not necessary. Also, maybe initializing debugfile and debugfile_base to NULL could be more defensive.
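To illustrate the point: free(NULL) is a defined no-op in standard C, so a free-and-reset wrapper like tor_free() needs no NULL guard at the call site. The tor_free_() macro below is a stand-in for Tor's actual macro, used here only for the sketch:

```c
#include <stdlib.h>

/* Stand-in for Tor's tor_free(): free the pointer and reset it to NULL. */
#define tor_free_(p) do { free(p); (p) = NULL; } while (0)

/* Frees twice to show the second call is harmless; returns 1 on success. */
static int
demo_double_free_safe(void)
{
  char *debugfile_base = NULL;   /* defensive NULL initialization */
  debugfile_base = malloc(16);
  tor_free_(debugfile_base);     /* frees and resets to NULL */
  tor_free_(debugfile_base);     /* safe: free(NULL) is a no-op */
  return debugfile_base == NULL;
}
```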
Setting this to needs_revision for now. I'd like to do some more review and testing here eventually.
The torrc option is called DetailedLogForUnparseableDescriptors but it actually requires a size input to work. With such a name, I would expect it to be a boolean option that enables or disables the feature. If we enable this feature by default as suggested above, maybe we can repurpose the torrc option to something that simply controls the maximum size?
Yeah, it can be renamed to MaxUnparseableDescSizeToLog or something like that.
I think this feature is disabled by default. Why is that? Is there any way the old behavior is better than this feature with appropriate defaults?
Also, when this feature is disabled, I think we fall back to the old behavior. Why would we ever want to do that? Why not enable this by default, with a small size allowance (e.g. 10MB), and remove all the old behavior code?
Okay, sure.
I wonder if there is something simpler we could do instead of FIFO. Like just refuse to log more descriptors if the maximum size has been reached, and log this to the operator. Or maybe enforce limits based on the number of descriptors instead of the exact size.
The former option would be acceptable; the latter would be a bad move: one of the descriptor types dump_desc() can be called on is a new router descriptor uploaded to a dirauth, and I believe limiting by count rather than size would let anyone force a dirauth to consume as much disk space as it can allocate memory for, simply by uploading huge unparseable descriptors. We really do want to watch how much we log here.
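The size-based FIFO described here can be sketched as follows. This is a hedged illustration under assumed names (dump_entry, fifo_add), not Tor's actual data structures: evict the oldest dumps until the new one fits under the byte cap.

```c
#include <stdint.h>
#include <stdlib.h>

/* Illustrative size-limited FIFO of dumped-descriptor records. */
typedef struct dump_entry {
  size_t len;               /* size of the dumped descriptor on disk */
  struct dump_entry *next;
} dump_entry;

static dump_entry *head = NULL, *tail = NULL;
static uint64_t total = 0;  /* total bytes currently dumped */

static void
fifo_add(size_t len, uint64_t max_bytes)
{
  /* Evict oldest entries until the new dump fits under the cap. */
  while (head && total + len > max_bytes) {
    dump_entry *old = head;
    head = old->next;
    if (!head)
      tail = NULL;
    total -= old->len;
    free(old);
  }
  if (len > max_bytes)
    return;                 /* too big to ever store; drop it */
  dump_entry *e = calloc(1, sizeof(*e));
  e->len = len;
  if (tail)
    tail->next = e;
  else
    head = e;
  tail = e;
  total += len;
}
```

Note the eviction is by byte total, not entry count, which is exactly the property argued for above.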
  if (emit_prefix)
    filelen = 50 + strlen(type) + len;

What is 50 here?
See bug18322_v2 branch, which removes the fallback to the old-style logging, renames the config variable and checks for repeated identical unparseable descriptors when logging. Current testing methodology is to run a private network and curl --data-binary @test-desc.txt -v http://127.0.0.1:19030/tor/ with a suitably permuted bogus router descriptor in test-desc.txt and one dirauth's dirport at 127.0.0.1:19030. I'm working on some unit tests for this.
I'm actually going to call it merge_ready (pending my review) -- since fixing up changes stuff is no big deal for me, and otherwise I'll forget this. :)
Don't hate me for this, but: we can't use size_t for the total size of a bunch of these files, I think. On 32-bit systems, size_t is 4 bytes, but file sizes can often be 8 bytes.
We don't have to say "if (foo) tor_free(foo);". Just say "tor_free(foo)."
It looks like the dumped descriptor limit is only applied over a number
Should these files go into a subdirectory of the datadir? That would give sysadmins the ability to stick it on another volume, give it different permissions, etc.
What do we do about NUL characters?
Shouldn't we scan the unparseable-desc files at startup, and count how big they are and when they were last modified? Otherwise we will still fill up the disk eventually if Tor stops and starts enough times.
This approach will break with the linux seccomp2 sandbox, since we can't predict the filenames in advance. But IMO it'll be fine to just say "if the sandbox is on, don't store these to disk."
Don't hate me for this, but: we can't use size_t for the total size of a bunch of these files, I think. On 32-bit systems, size_t is 4 bytes, but file sizes can often be 8 bytes.
Files we generate here can't be, since they had to fit in process memory to get written, but I guess you can make up some weird case where we also do the scanning for existing files thing, and the user switches between running a 32-bit and a 64-bit Tor on the same directory. I'll switch it to uint64_t, though, just to be safe.
We don't have to say "if (foo) tor_free(foo);". Just say "tor_free(foo)."
Okay.
It looks like the dumped descriptor limit is only applied over a number
What does this mean?
Should these files go into a subdirectory of the datadir? That would give sysadmins the ability to stick it on another volume, give it different permissions, etc.
Okay.
What do we do about NUL characters?
If we decided something with a NUL in the middle of it constitutes a descriptor, we broke somewhere before we ever got to dump_desc(), since that function takes the bad descriptor as a NUL-terminated string. It's worth checking what actually happens in that case, but that's orthogonal to this ticket. If it can happen, we'd handle it by passing a length argument into dump_desc() and using write_mem_to_file() instead of write_str_to_file() for output, I suppose.
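The string-vs-memory distinction above can be shown with plain stdio stand-ins (not Tor's util.c functions): a strlen()-based write truncates at the first embedded NUL, while a (buffer, length) write keeps every byte. The file name is illustrative.

```c
#include <stdio.h>
#include <string.h>

/* Write the same buffer two ways and report how many bytes each wrote. */
static void
demo_nul_write(size_t *str_bytes, size_t *mem_bytes)
{
  const char desc[] = "router\0garbage";  /* 14 bytes + terminator */
  FILE *f = fopen("nul-demo.bin", "wb");
  if (!f)
    return;
  /* strlen() stops at the embedded NUL: only "router" is written. */
  *str_bytes = fwrite(desc, 1, strlen(desc), f);
  /* An explicit length preserves the bytes after the NUL. */
  *mem_bytes = fwrite(desc, 1, sizeof(desc) - 1, f);
  fclose(f);
}
```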
Shouldn't we scan the unparseable-desc files at startup, and count how big they are and when they were last modified? Otherwise we will still fill up the disk eventually if Tor stops and starts enough times.
Okay.
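The startup rescan agreed to above might look roughly like this POSIX sketch: walk the dumps subdirectory, summing regular-file sizes (the mtimes would order the rebuilt FIFO oldest-first). Directory name and helper name are illustrative, not Tor's actual code.

```c
#include <dirent.h>
#include <sys/stat.h>
#include <stdint.h>
#include <stdio.h>

/* Sum the sizes of regular files in a dumps directory; the caller would
 * also record each st_mtime to rebuild the FIFO in eviction order. */
static uint64_t
scan_dump_dir(const char *dirname)
{
  uint64_t total = 0;
  DIR *d = opendir(dirname);
  if (!d)
    return 0;
  struct dirent *ent;
  while ((ent = readdir(d)) != NULL) {
    char path[4096];
    struct stat st;
    if (ent->d_name[0] == '.')
      continue;  /* skip ".", "..", and hidden files */
    snprintf(path, sizeof(path), "%s/%s", dirname, ent->d_name);
    if (stat(path, &st) == 0 && S_ISREG(st.st_mode))
      total += (uint64_t)st.st_size;
  }
  closedir(d);
  return total;
}
```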
This approach will break with the linux seccomp2 sandbox, since we can't predict the filenames in advance. But IMO it'll be fine to just say "if the sandbox is on, don't store these to disk."
Files we generate here can't be, since they had to fit in process memory to get written, but I guess you can make up some weird case where we also do the scanning for existing files thing, and the user switches between running a 32-bit and a 64-bit Tor on the same directory. I'll switch it to uint64_t, though, just to be safe.
Also there's the case where we get some number of objects that fit into address space one at a time, but when you add them up, the sum overflows.
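A tiny demonstration of the 32-bit concern: summing a few ~2 GB file sizes wraps a 32-bit accumulator (a stand-in for size_t on a 32-bit build) while a uint64_t total stays correct.

```c
#include <stdint.h>

/* Accumulate three 2 GB sizes into both a 32-bit and a 64-bit total.
 * sum32 stands in for size_t on a 32-bit system and wraps past 2^32. */
static void
sum_sizes(uint32_t *sum32, uint64_t *sum64)
{
  *sum32 = 0;
  *sum64 = 0;
  for (int i = 0; i < 3; ++i) {
    *sum32 += 2000000000u;  /* overflows: 6e9 mod 2^32 */
    *sum64 += 2000000000u;  /* correct 64-bit total */
  }
}
```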
What does this mean?
I have no idea. I'll let you know if I remember. Sometimes I start writing sentences and they just
Updated version implements subdirectory and reconstructing FIFO from it on startup, together with unit tests for FIFO reconstruction code, in my bug18322_v3_squashed branch.