When we log unparseable descriptor material for our vote, we proceed to overwrite it with the invalid consensus we produced. The vote gets logged at log level notice, but only in truncated form, which makes it impossible to analyze this bug.
I think the simplest solution here is to modify dump_desc() in routerparse.c to include the hash of the offending descriptor in the filename and then mention the hash in the log message about the parse failure. This potentially allows someone to exhaust disk space by uploading a lot of bogus descriptors, though, so we'll need to either have a way to clean up old logged descriptors or otherwise limit that, or restrict it only to cases where the unparseable material can't have come from an arbitrary source.
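As a minimal sketch of the proposed filename scheme, a helper might hex-encode the descriptor's digest into the dump filename. The helper name and prefix here are illustrative only (the real implementation would use Tor's own digest and path helpers):

```c
#include <stdio.h>
#include <stddef.h>

/* Illustrative sketch: build "unparseable-desc.<hex-digest>" from a raw
 * digest. Not Tor's actual code; Tor would use its crypto_digest helpers. */
static void
digest_to_filename(const unsigned char *digest, size_t digest_len,
                   char *out, size_t out_len)
{
  size_t pos = (size_t)snprintf(out, out_len, "unparseable-desc.");
  for (size_t i = 0; i < digest_len && pos + 2 < out_len; ++i, pos += 2)
    snprintf(out + pos, out_len - pos, "%02x", digest[i]);
}
```

The same hex digest would then appear in the parse-failure log message, letting an operator match the message to the dumped file.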
Proposed fix in my bug18322 branch; this introduces a new configuration option for the maximum size of dumped descriptors. When this option is non-zero, the filename for dumped descriptors includes their hex-encoded SHA-256, which is referenced in the log message at the time of the dump.
Hello. An initial review follows. I actually expected this feature to require less code, but it seems like fears of DoS made it more complicated. My current main concerns are the UX of this change, and the lack of tests.
I think this feature is disabled by default. Why is that? Is there any way the old behavior is better than this feature with appropriate defaults?
Also, when this feature is disabled, I think we fall back to the old behavior. Why would we ever want to do that? Why not enable this by default, with a small size allowance (e.g. 10MB), and remove all the old behavior code?
The torrc option is called DetailedLogForUnparseableDescriptors but it actually requires a size input to work. With such a name, I would expect it to be a boolean option that enables or disables the feature. If we enable this feature by default as suggested above, maybe we can repurpose the torrc option to something that simply controls the maximum size?
I feel that this FIFO functionality is quite complex (lots of code added, asserts and length calculations) without any tests whatsoever. The general logic is also undocumented.
I wonder if there is something simpler we could do instead of FIFO. Like just refuse to log more descriptors if the maximum size has been reached, and log this to the operator. Or maybe enforce limits based on the number of descriptors instead of the exact size.
Alternatively, if we like the FIFO approach, maybe some tests would help to bring more confidence here.
  if (emit_prefix)
    filelen = 50 + strlen(type) + len;

What is 50 here?
  if (debugfile_base != NULL)
    tor_free(debugfile_base);
tor_free() should be able to handle NULL as input, so the if check is not necessary. Also, maybe initializing debugfile and debugfile_base to NULL could be more defensive.
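To illustrate the point: free(NULL) is a defined no-op in standard C, so a free-and-reset wrapper like tor_free() needs no NULL guard at the call site. The tor_free_() macro below is a stand-in for Tor's actual macro, used here only for the sketch:

```c
#include <stdlib.h>

/* Stand-in for Tor's tor_free(): free the pointer and reset it to NULL. */
#define tor_free_(p) do { free(p); (p) = NULL; } while (0)

/* Frees twice to show the second call is harmless; returns 1 on success. */
static int
demo_double_free_safe(void)
{
  char *debugfile_base = NULL;   /* defensive NULL initialization */
  debugfile_base = malloc(16);
  tor_free_(debugfile_base);     /* frees and resets to NULL */
  tor_free_(debugfile_base);     /* safe: free(NULL) is a no-op */
  return debugfile_base == NULL;
}
```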
Setting this to needs_revision for now. I'd like to do some more review and testing here eventually.
The torrc option is called DetailedLogForUnparseableDescriptors but it actually requires a size input to work. With such a name, I would expect it to be a boolean option that enables or disables the feature. If we enable this feature by default as suggested above, maybe we can repurpose the torrc option to something that simply controls the maximum size?
Yeah, it can be renamed to MaxUnparseableDescSizeToLog or something like that.
I think this feature is disabled by default. Why is that? Is there any way the old behavior is better than this feature with appropriate defaults?
Also, when this feature is disabled, I think we fall back to the old behavior. Why would we ever want to do that? Why not enable this by default, with a small size allowance (e.g. 10MB), and remove all the old behavior code?
Okay, sure.
I wonder if there is something simpler we could do instead of FIFO. Like just refuse to log more descriptors if the maximum size has been reached, and log this to the operator. Or maybe enforce limits based on the number of descriptors instead of the exact size.
The former option would be acceptable; the latter would be a bad move: one of the descriptor types dump_desc() can be called on is a new router descriptor uploaded to a dirauth, and I believe limiting by count rather than size would let anyone force a dirauth to consume as much disk space as it can allocate memory for, simply by uploading huge unparseable descriptors. We really do want to watch how much we log here.
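The size-based FIFO described here can be sketched as follows. This is a hedged illustration under assumed names (dump_entry, fifo_add), not Tor's actual data structures: evict the oldest dumps until the new one fits under the byte cap.

```c
#include <stdint.h>
#include <stdlib.h>

/* Illustrative size-limited FIFO of dumped-descriptor records. */
typedef struct dump_entry {
  size_t len;               /* size of the dumped descriptor on disk */
  struct dump_entry *next;
} dump_entry;

static dump_entry *head = NULL, *tail = NULL;
static uint64_t total = 0;  /* total bytes currently dumped */

static void
fifo_add(size_t len, uint64_t max_bytes)
{
  /* Evict oldest entries until the new dump fits under the cap. */
  while (head && total + len > max_bytes) {
    dump_entry *old = head;
    head = old->next;
    if (!head)
      tail = NULL;
    total -= old->len;
    free(old);
  }
  if (len > max_bytes)
    return;                 /* too big to ever store; drop it */
  dump_entry *e = calloc(1, sizeof(*e));
  e->len = len;
  if (tail)
    tail->next = e;
  else
    head = e;
  tail = e;
  total += len;
}
```

Note the eviction is by byte total, not entry count, which is exactly the property argued for above.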
  if (emit_prefix)
    filelen = 50 + strlen(type) + len;

What is 50 here?
See bug18322_v2 branch, which removes the fallback to the old-style logging, renames the config variable and checks for repeated identical unparseable descriptors when logging. Current testing methodology is to run a private network and curl --data-binary @test-desc.txt -v http://127.0.0.1:19030/tor/ with a suitably permuted bogus router descriptor in test-desc.txt and one dirauth's dirport at 127.0.0.1:19030. I'm working on some unit tests for this.
I'm actually going to call it merge_ready (pending my review) -- since fixing up changes stuff is no big deal for me, and otherwise I'll forget this. :)
Don't hate me for this, but: we can't use size_t for the total size of a bunch of these files, I think. On 32-bit systems, size_t is 4 bytes, but file sizes can often be 8 bytes.
We don't have to say "if (foo) tor_free(foo);". Just say "tor_free(foo)."
It looks like the dumped descriptor limit is only applied over a number
Should these files go into a subdirectory of the datadir? That would give sysadmins the ability to stick it on another volume, give it different permissions, etc.
What do we do about NUL characters?
Shouldn't we scan the unparseable-desc files at startup, and count how big they are and when they were last modified? Otherwise we will still fill up the disk eventually if Tor stops and starts enough times.
This approach will break with the linux seccomp2 sandbox, since we can't predict the filenames in advance. But IMO it'll be fine to just say "if the sandbox is on, don't store these to disk."
Don't hate me for this, but: we can't use size_t for the total size of a bunch of these files, I think. On 32-bit systems, size_t is 4 bytes, but file sizes can often be 8 bytes.
Files we generate here can't be, since they had to fit in process memory to get written, but I guess you can make up some weird case where we also do the scanning for existing files thing, and the user switches between running a 32-bit and a 64-bit Tor on the same directory. I'll switch it to uint64_t, though, just to be safe.
We don't have to say "if (foo) tor_free(foo);". Just say "tor_free(foo)."
Okay.
It looks like the dumped descriptor limit is only applied over a number
What does this mean?
Should these files go into a subdirectory of the datadir? That would give sysadmins the ability to stick it on another volume, give it different permissions, etc.
Okay.
What do we do about NUL characters?
If we decided something with a NUL in the middle of it constitutes a descriptor, we broke somewhere before we ever got to dump_desc(), since that function takes the bad descriptor as a NUL-terminated string. It's worth checking what actually happens in that case, but that's orthogonal to this ticket. If it can happen, we'd handle it by passing a length argument into dump_desc() and using write_mem_to_file() instead of write_str_to_file() for output, I suppose.
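The string-vs-memory distinction above can be shown with plain stdio stand-ins (not Tor's util.c functions): a strlen()-based write truncates at the first embedded NUL, while a (buffer, length) write keeps every byte. The file name is illustrative.

```c
#include <stdio.h>
#include <string.h>

/* Write the same buffer two ways and report how many bytes each wrote. */
static void
demo_nul_write(size_t *str_bytes, size_t *mem_bytes)
{
  const char desc[] = "router\0garbage";  /* 14 bytes + terminator */
  FILE *f = fopen("nul-demo.bin", "wb");
  if (!f)
    return;
  /* strlen() stops at the embedded NUL: only "router" is written. */
  *str_bytes = fwrite(desc, 1, strlen(desc), f);
  /* An explicit length preserves the bytes after the NUL. */
  *mem_bytes = fwrite(desc, 1, sizeof(desc) - 1, f);
  fclose(f);
}
```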
Shouldn't we scan the unparseable-desc files at startup, and count how big they are and when they were last modified? Otherwise we will still fill up the disk eventually if Tor stops and starts enough times.
Okay.
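The startup rescan agreed to above might look roughly like this POSIX sketch: walk the dumps subdirectory, summing regular-file sizes (the mtimes would order the rebuilt FIFO oldest-first). Directory name and helper name are illustrative, not Tor's actual code.

```c
#include <dirent.h>
#include <sys/stat.h>
#include <stdint.h>
#include <stdio.h>

/* Sum the sizes of regular files in a dumps directory; the caller would
 * also record each st_mtime to rebuild the FIFO in eviction order. */
static uint64_t
scan_dump_dir(const char *dirname)
{
  uint64_t total = 0;
  DIR *d = opendir(dirname);
  if (!d)
    return 0;
  struct dirent *ent;
  while ((ent = readdir(d)) != NULL) {
    char path[4096];
    struct stat st;
    if (ent->d_name[0] == '.')
      continue;  /* skip ".", "..", and hidden files */
    snprintf(path, sizeof(path), "%s/%s", dirname, ent->d_name);
    if (stat(path, &st) == 0 && S_ISREG(st.st_mode))
      total += (uint64_t)st.st_size;
  }
  closedir(d);
  return total;
}
```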
This approach will break with the linux seccomp2 sandbox, since we can't predict the filenames in advance. But IMO it'll be fine to just say "if the sandbox is on, don't store these to disk."
Files we generate here can't be, since they had to fit in process memory to get written, but I guess you can make up some weird case where we also do the scanning for existing files thing, and the user switches between running a 32-bit and a 64-bit Tor on the same directory. I'll switch it to uint64_t, though, just to be safe.
Also there's the case where we get some number of objects that fit into address space one at a time, but when you add them up, the sum overflows.
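A tiny demonstration of the 32-bit concern: summing a few ~2 GB file sizes wraps a 32-bit accumulator (a stand-in for size_t on a 32-bit build) while a uint64_t total stays correct.

```c
#include <stdint.h>

/* Accumulate three 2 GB sizes into both a 32-bit and a 64-bit total.
 * sum32 stands in for size_t on a 32-bit system and wraps past 2^32. */
static void
sum_sizes(uint32_t *sum32, uint64_t *sum64)
{
  *sum32 = 0;
  *sum64 = 0;
  for (int i = 0; i < 3; ++i) {
    *sum32 += 2000000000u;  /* overflows: 6e9 mod 2^32 */
    *sum64 += 2000000000u;  /* correct 64-bit total */
  }
}
```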
What does this mean?
I have no idea. I'll let you know if I remember. Sometimes I start writing sentences and they just
Updated version implements subdirectory and reconstructing FIFO from it on startup, together with unit tests for FIFO reconstruction code, in my bug18322_v3_squashed branch.