Opened 13 months ago

Last modified 5 days ago

#23512 merge_ready defect

Bandwidth stats info leak upon close of circuits with queued cells

Reported by: asn Owned by:
Priority: Medium Milestone: Tor: 0.3.4.x-final
Component: Core Tor/Tor Version:
Severity: Normal Keywords: tor-bug-bounty, congestion-attack, research, watermark, tor-stats, guard-discovery-stats, 034-triage-20180328, 034-removed-20180328
Cc: arma, pastly Actual Points:
Parent ID: Points:
Reviewer: dgoulet Sponsor: SponsorQ

Description

We received a tor bug bounty report from jaym about a congestion attack variant that can cause bandwidth stats watermark.

The bug uses the fact that Tor increments the read bytes counter before adding the cell to the output buffer: If the circuit gets killed before the cell gets relayed to the next hop, then the write bytes counter will never be updated, making the read bytes counter having a higher value than the write bytes counter. The attacker could exploit this assymetry to find relays using their bandwidth graph.

The attacker can kill the circuit using the OOM killer by saturating its output queue with cells until circuits_handle_oom() gets called and kills the circuit.

We should figure out whether this attack is practical (the paper claims it is) and whether it's worthwhile fixing it. Just fixing this issue won't solve the general issue of congestion attacks, and it might even allow other kinds of attacks.

The most practical fix right now seem to be to hack circuit_handle_oom()` to actually decrement the read counters before killing a circuit. However, that's a very specific fix that might solve this very specific bug, but leave the rest of the bug class open.

Another approach would be removing the bandwidth graphs, or aggregating them over a greater period of time, or adding noise. We should consider these approaches carefully since bandwidth graphs see great use by academic papers and also by relay operators (to gauge their contribution).

Child Tickets

Attachments (1)

excess_bytes.png (45.0 KB) - added by Dhalgren 3 weeks ago.

Download all attachments as: .zip

Change History (35)

comment:1 Changed 13 months ago by mikeperry

My current thinking is that this ticket should be about the specific fix, and we should include more general fixes in #22729 or subtickets.

comment:2 Changed 13 months ago by Jaym

Hi guys.

I think we should also look into other counters and correct their values even if they are not published. The control protocol can be used to dump stats about circuits, including counter of relayed cells, etc. They are subjected to the same kind of bug.

comment:3 Changed 12 months ago by nickm

Sponsor: SponsorQ

comment:4 Changed 12 months ago by mikeperry

Cc: arma added

During Montreal, Roger mentioned there is a second piece to this. I tried to save it here but trac ate my changes. Roger, can you say what that was again?

comment:5 Changed 12 months ago by mikeperry

Keywords: guard-discovery-stats added; guard-discovery removed

comment:6 Changed 12 months ago by arma

Mike: I talked to Jaym about this topic at PETS. I think at the time I was under the impression that we are incrementing the *write* counter, that is, the outbound connection, even though we never actually push the bytes out to the network because we kill the outbuf in the oom killer.

So I had thought that the bug was "we say that we wrote out the bytes even though we didn't", which allows an attacker to queue up a bunch of bytes for us to outbuf, and generate an artificially large signal.

If that is the issue (which contradicts asn's summary above I think), then there would be a second piece to the fix, since if we only decremented the write count to stop including the bytes we didn't actually write, then we would have an asymmetry between the (still much higher) read count and the (now corrected) write count. So the second part of the fix would be to decrement the read count by the corresponding amount too, so things match up.

comment:7 in reply to:  6 ; Changed 12 months ago by Jaym

Replying to arma:

Mike: I talked to Jaym about this topic at PETS. I think at the time I was under the impression that we are incrementing the *write* counter, that is, the outbound connection, even though we never actually push the bytes out to the network because we kill the outbuf in the oom killer.

So I had thought that the bug was "we say that we wrote out the bytes even though we didn't", which allows an attacker to queue up a bunch of bytes for us to outbuf, and generate an artificially large signal.

Yes, because of the first version of the paper you reviewed I considered the wrong counter. This counter actually behaves exactly like you describe above but this was not the counter used to report bandwidth in the extra-info descriptor. With your review, you made me realize that this problem actually happens with the read counter.

The up-to-date version of the paper I sent you in the beginning of September corrects the explanation and gives more details (given to asn too, which explains why you see difference in asn's description with you original thoughts). I also added chutney experiments to observe the difference in read/write that can be reproduced with this code (https://github.com/popets18dropping/code_and_data/tree/master/hs_drop_attack).

If that is the issue (which contradicts asn's summary above I think), then there would be a second piece to the fix, since if we only decremented the write count to stop including the bytes we didn't actually write, then we would have an asymmetry between the (still much higher) read count and the (now corrected) write count. So the second part of the fix would be to decrement the read count by the corresponding amount too, so things match up.

Yep. If needed I can help testing the fix with the code linked above.

comment:8 Changed 8 months ago by asn

Milestone: Tor: 0.3.3.x-finalTor: 0.3.4.x-final

comment:9 Changed 7 months ago by nickm

Keywords: 034-triage-20180328 added

comment:10 Changed 7 months ago by nickm

Keywords: 034-removed-20180328 added

Per our triage process, these tickets are pending removal from 0.3.4.

comment:11 Changed 6 months ago by nickm

Milestone: Tor: 0.3.4.x-finalTor: unspecified

These tickets, tagged with 034-removed-*, are no longer in-scope for 0.3.4. We can reconsider any of them, if time permits.

comment:12 in reply to:  7 Changed 2 months ago by arma

Replying to Jaym:

I also added chutney experiments to observe the difference in read/write that can be reproduced with this code (https://github.com/popets18dropping/code_and_data/tree/master/hs_drop_attack).

Looks like this url is 404? Is there a more permanent url? Thanks.

comment:13 Changed 2 months ago by mikeperry

So, for reference, the final version of the paper, at https://petsymposium.org/2018/files/papers/issue2/popets-2018-0011.pdf, says this on page 7, Section 4.3.1:

This buffering fills the memory until circuit_handle_oom () is called to kill the circuits and recover the memory from its queues. In the mean- time, the cell payload is counted in the read statistics and ignored in the write statistics.

This makes it sound like asn's proposed fix in the description should actually be correct, and that arma's comment may have been based on an earlier paper revision.

Sure would be nice to get that github url back to actually reproduce some science here..

comment:14 Changed 2 months ago by Jaym

Hello,

Sorry about that. Previous url was used for the anonymous submission. Here is the code: https://github.com/frochet/dropping_on_the_edge

comment:15 Changed 5 weeks ago by mikeperry

Cc: dgoulet added
Status: newneeds_review

Ok what do we think about this: https://github.com/torproject/tor/pull/324

I made that branch off of 0.3.2, because yesterday dgoulet told me that the network is still experiencing continuous OOM attacks, trigging circuit oomkiller. This very well could be (one of) the reasons for such attacks. So I think we should backport. Certainly plenty of relays are experiencing circuit OOMs and reporting asymmetric stats.

Instead of trying to guess when the bytes arrived and subtract them from the appropriate read totals, I just report that we wrote them instead. Much simpler, easier to backport, etc.

Downsides of this fix (and probably any other fix): We don't know how many bytes the TLS headers took up. For this reason, I also didn't bend over backwards to count bytes for var cells, wide circ ids, etc. Do we think that is sufficient? Should we lie and add ~1 TLS header of bytes per cell?

Are there other places where we kill circuits like this?

Dgoulet - what about the DoS/DESTROY queue handling?

comment:16 Changed 5 weeks ago by nickm

Milestone: Tor: unspecifiedTor: 0.3.5.x-final

Not sure this can get into 0.3.5 by Friday, but I think we need to have a look at it.

comment:17 Changed 5 weeks ago by dgoulet

After discussion with mike on IRC, we'll go back in revision mode for a series of fixes.

comment:18 Changed 5 weeks ago by dgoulet

Status: needs_reviewneeds_revision

comment:19 Changed 5 weeks ago by mikeperry

Summary: Bandwidth stats watermark can be induced using OOM killerBandwidth stats info leak upon close of circuits with queued cells

Updating the title because this vuln is more general than the oomkiller. It can be triggered many, many ways.

An updated fix for the general issue (based on discussion with dgoulet) is at https://github.com/mikeperry-tor/tor/commits/bug23512-v2-032

I am going to spend a bit seeing if I can use the tests in test_relay.c to exercise that code.

I am ok with this missing 0.3.5.1 for now, but I really think we should backport this far enough for relay operators to pick up, though.

comment:20 Changed 5 weeks ago by mikeperry

Status: needs_revisionneeds_review

Ok, finally have a test for it: https://github.com/torproject/tor/pull/330

comment:21 Changed 5 weeks ago by dgoulet

Cc: dgoulet removed
Reviewer: dgoulet
Status: needs_reviewneeds_revision

Reviewed, minor things, no show stopper, just possible simplification and comment upgrade ;).

@mikeperry, there is a possible argument for this to be based on 029 ... and if not, we should base it on 033 since 032 is EOL in 3 weeks.

comment:22 Changed 5 weeks ago by dgoulet

Hmmm also we might want to fix the CLANG issue pointed out by Travis:

src/or/rephist.c:1381:20:error:no previous extern declaration for non-static
      variable 'write_array' [-Werror,-Wmissing-variable-declarations]
STATIC bw_array_t *write_array = NULL;

comment:23 Changed 5 weeks ago by mikeperry

Status: needs_revisionneeds_review

Ok, new rebased+squashed PR against 0.2.9: https://github.com/torproject/tor/pull/340

I decided to make a branch against 0.3.2 also, because our relay metrics still show 1/3 of our relays still using it: https://metrics.torproject.org/versions.html

Merge to 0.3.2 required a test fix to mock assert_circuit_ok. New PR: https://github.com/torproject/tor/pull/341

0.3.3 and 0.3.4 have memory leaks in the tests due to unrelated circuitmux changes. Chasing those down.

If we like this, I will fix up 0.3.3 and 0.3.4, and do the merge commit to master.

comment:24 Changed 5 weeks ago by mikeperry

0.3.3 and 0.3.4 don't leak with this branch: https://github.com/torproject/tor/pull/342

comment:25 Changed 4 weeks ago by dgoulet

Status: needs_reviewmerge_ready

lgtm!

I would *really* want us to not forget to document #define TLS_PER_CELL_OVERHEAD 29. I know we are waiting on pastly to do the impossible job of recalling how 29 came about during the KIST development process.

In any case, now KIST and this will *at least* use the same value.

comment:26 Changed 4 weeks ago by mikeperry

Status: merge_readyneeds_revision

The unit test sometimes fails (approx 1 time out of 10) due (I think) to our use of approx_time() in the patch, but time() in rephist.c I will redo the branches for this fix, merge forward as before, and then make a master merge commit.

comment:27 Changed 4 weeks ago by mikeperry

Status: needs_revisionmerge_ready

Ok I changed the patch's use of approx_time() to time() for consistency with rephist.c (which fixes the flapping test for me), and I now check the exact byte count in the tests. Otherwise the following branches are identical to v3. Setting back to merge_ready.

0.2.9: https://github.com/torproject/tor/pull/348
0.3.2: https://github.com/torproject/tor/pull/349
0.3.3: https://github.com/torproject/tor/pull/350
0.3.4: https://github.com/torproject/tor/pull/351 (same as 0.3.3, but PR against 0.3.4 for CI tests)
master: https://github.com/torproject/tor/pull/352

comment:28 Changed 4 weeks ago by nickm

Milestone: Tor: 0.3.5.x-finalTor: 0.3.4.x-final

Merged to master; marking for backport (assuming nothing breaks).

comment:29 Changed 4 weeks ago by mikeperry

Cc: pastly added

Adding pastly to comment on how he arrived at KIST's TLS_PER_CELL_OVERHEAD value when he returns.

comment:30 Changed 4 weeks ago by nickm

bug23512-v4-029-fixes fixes some compilation errors that we ran into on master.

comment:31 Changed 3 weeks ago by Dhalgren

Seeing a new and unusual traffic pattern with bytes written higher than read on exit and I wonder if it is related.

Last edited 3 weeks ago by Dhalgren (previous) (diff)

Changed 3 weeks ago by Dhalgren

Attachment: excess_bytes.png added

comment:32 Changed 5 days ago by mikeperry

Dhalgren - What produced that graph, and for what Tor relay version? That does not look like a metrics graph.

Exits can have asymmetry between bytes read and bytes written due to cell packing. If there is a lot of intermittent small-packet activity (eg SSH) being read from a public internet IP by your exit, your exit will package will package those small packets into their own cells, and end up writing much more than it has been reading as it sends those mostly empty cells to the middle node. So this could be unrelated.

comment:33 Changed 5 days ago by mikeperry

Pastly pointed out that TLS_PER_CELL_OVERHEAD is exactly half of the TLS header size for TLS as used in Tor. He and Rob observed that exactly 2 Tor cells got packed into TLS frames whenever there were 2 or more in queue.

This means that if the number of cells in the circuit queue is odd, we should probably add one extra TLS_PER_CELL_OVERHEAD to the written bytes.

I can do that in another fixup, and comment the TLS_PER_CELL_OVERHEAD define.

comment:34 in reply to:  32 Changed 5 days ago by Dhalgren

Replying to mikeperry:

Dhalgren - What produced that graph, and for what Tor relay version? That does not look like a metrics graph.

The graph is an obscured snip of the high-res view provided by ISP.

Seeing a bit of this lately, investigated and it appears mostly DIR response traffic--visible in published statistics. Perhaps some new form of dubious behavior from at-large.

. . . So this could be unrelated.

Agree, probably not related.

Note: See TracTickets for help on using tickets.