Opened 20 months ago

Closed 6 weeks ago

#23512 closed defect (fixed)

Bandwidth stats info leak upon close of circuits with queued cells

Reported by: asn Owned by:
Priority: Medium Milestone: Tor: 0.3.4.x-final
Component: Core Tor/Tor Version:
Severity: Normal Keywords: tor-bug-bounty, congestion-attack, research, watermark, tor-stats, guard-discovery-stats, 034-triage-20180328, 034-removed-20180328, 029-backport, 034-backport
Cc: arma, pastly Actual Points:
Parent ID: Points:
Reviewer: dgoulet Sponsor: SponsorQ

Description

We received a Tor bug bounty report from jaym about a congestion attack variant that can induce a watermark in a relay's bandwidth stats.

The bug exploits the fact that Tor increments the read-bytes counter before adding the cell to the output buffer: if the circuit gets killed before the cell is relayed to the next hop, the write-bytes counter is never updated, leaving the read-bytes counter with a higher value than the write-bytes counter. The attacker could exploit this asymmetry to find relays using their bandwidth graphs.
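A minimal sketch of the counting asymmetry (names here are illustrative only; Tor's real counters live in rephist.c and are wired up differently):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical per-relay bandwidth counters, standing in for the
 * real ones in rephist.c. */
static uint64_t bytes_read_total = 0;
static uint64_t bytes_written_total = 0;

/* A cell is counted as "read" as soon as it arrives, before it is
 * queued on the circuit for later delivery... */
void cell_received(size_t cell_len) {
  bytes_read_total += cell_len;
}

/* ...but only counted as "written" once it is actually flushed to
 * the outbound connection. */
void cell_flushed(size_t cell_len) {
  bytes_written_total += cell_len;
}

/* If the circuit is killed while cells are still queued, the queued
 * bytes appear in the read total but never in the write total,
 * producing the observable read > write gap. */
uint64_t stats_asymmetry(void) {
  return bytes_read_total - bytes_written_total;
}
```

Under this model, each cell the attacker gets queued and then killed widens the read/write gap by one cell's worth of bytes.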

The attacker can kill the circuit using the OOM killer by saturating its output queue with cells until circuits_handle_oom() gets called and kills the circuit.

We should figure out whether this attack is practical (the paper claims it is) and whether it's worth fixing. Just fixing this issue won't solve the general problem of congestion attacks, and it might even enable other kinds of attacks.

The most practical fix right now seems to be to change circuits_handle_oom() to actually decrement the read counters before killing a circuit. However, that's a very specific fix: it might solve this particular bug but leave the rest of the bug class open.

Another approach would be to remove the bandwidth graphs, aggregate them over a longer period of time, or add noise. We should consider these approaches carefully, since bandwidth graphs are widely used by academic papers and by relay operators (to gauge their contribution).

Child Tickets

Attachments (2)

excess_bytes.png (45.0 KB) - added by Dhalgren 7 months ago.
23512_test.png (27.2 KB) - added by Jaym 6 months ago.


Change History (43)

comment:1 Changed 20 months ago by mikeperry

My current thinking is that this ticket should be about the specific fix, and we should include more general fixes in #22729 or subtickets.

comment:2 Changed 20 months ago by Jaym

Hi guys.

I think we should also look into the other counters and correct their values even if they are not published. The control protocol can be used to dump stats about circuits, including counters of relayed cells, etc. They are subject to the same kind of bug.

comment:3 Changed 18 months ago by nickm

Sponsor: SponsorQ

comment:4 Changed 18 months ago by mikeperry

Cc: arma added

During Montreal, Roger mentioned there is a second piece to this. I tried to save it here but trac ate my changes. Roger, can you say what that was again?

comment:5 Changed 18 months ago by mikeperry

Keywords: guard-discovery-stats added; guard-discovery removed

comment:6 Changed 18 months ago by arma

Mike: I talked to Jaym about this topic at PETS. I think at the time I was under the impression that we are incrementing the *write* counter, that is, the outbound connection, even though we never actually push the bytes out to the network because we kill the outbuf in the oom killer.

So I had thought that the bug was "we say that we wrote out the bytes even though we didn't", which allows an attacker to queue up a bunch of bytes for us to outbuf, and generate an artificially large signal.

If that is the issue (which contradicts asn's summary above I think), then there would be a second piece to the fix, since if we only decremented the write count to stop including the bytes we didn't actually write, then we would have an asymmetry between the (still much higher) read count and the (now corrected) write count. So the second part of the fix would be to decrement the read count by the corresponding amount too, so things match up.

comment:7 in reply to:  6 ; Changed 18 months ago by Jaym

Replying to arma:

Mike: I talked to Jaym about this topic at PETS. I think at the time I was under the impression that we are incrementing the *write* counter, that is, the outbound connection, even though we never actually push the bytes out to the network because we kill the outbuf in the oom killer.

So I had thought that the bug was "we say that we wrote out the bytes even though we didn't", which allows an attacker to queue up a bunch of bytes for us to outbuf, and generate an artificially large signal.

Yes, because of the first version of the paper you reviewed, I considered the wrong counter. That counter actually behaves exactly as you describe above, but it was not the counter used to report bandwidth in the extra-info descriptor. Your review made me realize that this problem actually happens with the read counter.

The up-to-date version of the paper I sent you at the beginning of September corrects the explanation and gives more details (it went to asn too, which explains the difference between asn's description and your original thoughts). I also added chutney experiments to observe the read/write difference, which can be reproduced with this code (https://github.com/popets18dropping/code_and_data/tree/master/hs_drop_attack).

If that is the issue (which contradicts asn's summary above I think), then there would be a second piece to the fix, since if we only decremented the write count to stop including the bytes we didn't actually write, then we would have an asymmetry between the (still much higher) read count and the (now corrected) write count. So the second part of the fix would be to decrement the read count by the corresponding amount too, so things match up.

Yep. If needed I can help testing the fix with the code linked above.

comment:8 Changed 15 months ago by asn

Milestone: Tor: 0.3.3.x-final → Tor: 0.3.4.x-final

comment:9 Changed 13 months ago by nickm

Keywords: 034-triage-20180328 added

comment:10 Changed 13 months ago by nickm

Keywords: 034-removed-20180328 added

Per our triage process, these tickets are pending removal from 0.3.4.

comment:11 Changed 13 months ago by nickm

Milestone: Tor: 0.3.4.x-final → Tor: unspecified

These tickets, tagged with 034-removed-*, are no longer in-scope for 0.3.4. We can reconsider any of them, if time permits.

comment:12 in reply to:  7 Changed 9 months ago by arma

Replying to Jaym:

I also added chutney experiments to observe the difference in read/write that can be reproduced with this code (https://github.com/popets18dropping/code_and_data/tree/master/hs_drop_attack).

Looks like this url is 404? Is there a more permanent url? Thanks.

comment:13 Changed 9 months ago by mikeperry

So, for reference, the final version of the paper, at https://petsymposium.org/2018/files/papers/issue2/popets-2018-0011.pdf, says this on page 7, Section 4.3.1:

This buffering fills the memory until circuit_handle_oom() is called to kill the circuits and recover the memory from its queues. In the meantime, the cell payload is counted in the read statistics and ignored in the write statistics.

This makes it sound like asn's proposed fix in the description should actually be correct, and that arma's comment may have been based on an earlier paper revision.

Sure would be nice to get that github url back to actually reproduce some science here..

comment:14 Changed 9 months ago by Jaym

Hello,

Sorry about that. Previous url was used for the anonymous submission. Here is the code: https://github.com/frochet/dropping_on_the_edge

comment:15 Changed 7 months ago by mikeperry

Cc: dgoulet added
Status: new → needs_review

Ok what do we think about this: https://github.com/torproject/tor/pull/324

I made that branch off of 0.3.2 because yesterday dgoulet told me that the network is still experiencing continuous OOM attacks, triggering the circuit oomkiller. This could very well be (one of) the reasons for such attacks. So I think we should backport. Certainly plenty of relays are experiencing circuit OOMs and reporting asymmetric stats.

Instead of trying to guess when the bytes arrived and subtracting them from the appropriate read totals, I just report that we wrote them instead. Much simpler, easier to backport, etc.
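A rough sketch of that approach (hypothetical names and a simplified signature; the actual change is in the PR): when a circuit is killed with cells still queued, credit those cells to the write totals as if they had been flushed, so the published read and write counters stay symmetric.

```c
#include <stdint.h>
#include <stddef.h>

/* Wire size of a fixed-length cell; 514 bytes is assumed here for
 * illustration. */
#define CELL_MAX_NETWORK_SIZE 514

static uint64_t bytes_written_total = 0;

/* Called just before the circuit (and its queue) is freed: count the
 * still-queued cells as written, rather than trying to retroactively
 * subtract them from the read totals. */
void circuit_count_queue_as_written(size_t n_queued_cells) {
  bytes_written_total += (uint64_t)n_queued_cells * CELL_MAX_NETWORK_SIZE;
}
```

The design trade-off is explicit in the comment thread: this slightly over-reports writes rather than under-reporting them, which is simpler and safer to backport than reconstructing when each queued cell was read.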

Downsides of this fix (and probably any other fix): We don't know how many bytes the TLS headers took up. For this reason, I also didn't bend over backwards to count bytes for var cells, wide circ ids, etc. Do we think that is sufficient? Should we lie and add ~1 TLS header of bytes per cell?

Are there other places where we kill circuits like this?

Dgoulet - what about the DoS/DESTROY queue handling?

comment:16 Changed 7 months ago by nickm

Milestone: Tor: unspecified → Tor: 0.3.5.x-final

Not sure this can get into 0.3.5 by Friday, but I think we need to have a look at it.

comment:17 Changed 7 months ago by dgoulet

After discussion with mike on IRC, we'll go back in revision mode for a series of fixes.

comment:18 Changed 7 months ago by dgoulet

Status: needs_review → needs_revision

comment:19 Changed 7 months ago by mikeperry

Summary: Bandwidth stats watermark can be induced using OOM killer → Bandwidth stats info leak upon close of circuits with queued cells

Updating the title because this vuln is more general than the oomkiller. It can be triggered many, many ways.

An updated fix for the general issue (based on discussion with dgoulet) is at https://github.com/mikeperry-tor/tor/commits/bug23512-v2-032

I am going to spend a bit of time seeing if I can use the tests in test_relay.c to exercise that code.

I am OK with this missing 0.3.5.1 for now, but I really think we should backport this far enough for relay operators to pick it up.

comment:20 Changed 7 months ago by mikeperry

Status: needs_revision → needs_review

Ok, finally have a test for it: https://github.com/torproject/tor/pull/330

comment:21 Changed 7 months ago by dgoulet

Cc: dgoulet removed
Reviewer: dgoulet
Status: needs_review → needs_revision

Reviewed, minor things, no show stopper, just possible simplification and comment upgrade ;).

@mikeperry, there is a possible argument for this to be based on 029 ... and if not, we should base it on 033 since 032 is EOL in 3 weeks.

comment:22 Changed 7 months ago by dgoulet

Hmmm also we might want to fix the CLANG issue pointed out by Travis:

src/or/rephist.c:1381:20: error: no previous extern declaration for non-static
      variable 'write_array' [-Werror,-Wmissing-variable-declarations]
STATIC bw_array_t *write_array = NULL;

comment:23 Changed 7 months ago by mikeperry

Status: needs_revision → needs_review

Ok, new rebased+squashed PR against 0.2.9: https://github.com/torproject/tor/pull/340

I decided to make a branch against 0.3.2 also, because our relay metrics show that 1/3 of our relays are still running it: https://metrics.torproject.org/versions.html

Merge to 0.3.2 required a test fix to mock assert_circuit_ok. New PR: https://github.com/torproject/tor/pull/341

0.3.3 and 0.3.4 have memory leaks in the tests due to unrelated circuitmux changes. Chasing those down.

If we like this, I will fix up 0.3.3 and 0.3.4, and do the merge commit to master.

comment:24 Changed 7 months ago by mikeperry

0.3.3 and 0.3.4 don't leak with this branch: https://github.com/torproject/tor/pull/342

comment:25 Changed 7 months ago by dgoulet

Status: needs_review → merge_ready

lgtm!

I *really* don't want us to forget to document #define TLS_PER_CELL_OVERHEAD 29. I know we are waiting on pastly to do the impossible job of recalling how 29 came about during the KIST development process.

In any case, now KIST and this will *at least* use the same value.

comment:26 Changed 7 months ago by mikeperry

Status: merge_ready → needs_revision

The unit test sometimes fails (approximately 1 time in 10), due (I think) to our use of approx_time() in the patch but time() in rephist.c. I will redo the branches for this fix, merge forward as before, and then make a master merge commit.

comment:27 Changed 7 months ago by mikeperry

Status: needs_revision → merge_ready

Ok I changed the patch's use of approx_time() to time() for consistency with rephist.c (which fixes the flapping test for me), and I now check the exact byte count in the tests. Otherwise the following branches are identical to v3. Setting back to merge_ready.

0.2.9: https://github.com/torproject/tor/pull/348
0.3.2: https://github.com/torproject/tor/pull/349
0.3.3: https://github.com/torproject/tor/pull/350
0.3.4: https://github.com/torproject/tor/pull/351 (same as 0.3.3, but PR against 0.3.4 for CI tests)
master: https://github.com/torproject/tor/pull/352

comment:28 Changed 7 months ago by nickm

Milestone: Tor: 0.3.5.x-final → Tor: 0.3.4.x-final

Merged to master; marking for backport (assuming nothing breaks).

comment:29 Changed 7 months ago by mikeperry

Cc: pastly added

Adding pastly to comment on how he arrived at KIST's TLS_PER_CELL_OVERHEAD value when he returns.

comment:30 Changed 7 months ago by nickm

bug23512-v4-029-fixes fixes some compilation errors that we ran into on master.

comment:31 Changed 7 months ago by Dhalgren

Seeing a new and unusual traffic pattern, with bytes written higher than bytes read, on an exit, and I wonder if it is related.

Last edited 7 months ago by Dhalgren (previous) (diff)

Changed 7 months ago by Dhalgren

Attachment: excess_bytes.png added

comment:32 Changed 6 months ago by mikeperry

Dhalgren - What produced that graph, and for what Tor relay version? That does not look like a metrics graph.

Exits can have asymmetry between bytes read and bytes written due to cell packing. If there is a lot of intermittent small-packet activity (e.g. SSH) being read from a public internet IP by your exit, your exit will package those small packets into their own cells, and end up writing much more than it reads as it sends those mostly empty cells to the middle node. So this could be unrelated.
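A back-of-the-envelope sketch of that packing effect (assuming 514-byte wire cells; the function names are illustrative, not Tor code):

```c
#include <stdint.h>
#include <stddef.h>

/* Assumed wire size of a fixed-length cell. */
#define CELL_NETWORK_SIZE 514

/* Bytes the exit reads from the public internet for n small packets
 * of packet_len bytes each. */
uint64_t exit_bytes_read(size_t n_packets, size_t packet_len) {
  return (uint64_t)n_packets * packet_len;
}

/* Bytes the exit writes toward the middle node if each small packet
 * is packaged into its own full-size (mostly empty) cell. */
uint64_t exit_bytes_written(size_t n_packets) {
  return (uint64_t)n_packets * CELL_NETWORK_SIZE;
}
```

For example, 100 forty-byte keystroke packets read off the wire would turn into roughly 100 full cells written, more than ten times the bytes read, with no attack involved.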

comment:33 Changed 6 months ago by mikeperry

Pastly pointed out that TLS_PER_CELL_OVERHEAD is exactly half of the TLS header size for TLS as used in Tor. He and Rob observed that exactly 2 Tor cells got packed into TLS frames whenever there were 2 or more in queue.

This means that if the number of cells in the circuit queue is odd, we should probably add one extra TLS_PER_CELL_OVERHEAD to the written bytes.

I can do that in another fixup, and comment the TLS_PER_CELL_OVERHEAD define.
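Under those observations (cells packed two per TLS record, with TLS_PER_CELL_OVERHEAD being half a record header), the correction might look something like this sketch (function name is hypothetical):

```c
#include <stdint.h>
#include <stddef.h>

/* Half a TLS record header, per the KIST measurements cited above. */
#define TLS_PER_CELL_OVERHEAD 29

/* Estimated TLS overhead for n queued cells, assuming cells are
 * packed two per TLS record: each cell accounts for half a header,
 * and an odd cell out occupies a whole record by itself, so it gets
 * one extra half-header. */
uint64_t estimated_tls_overhead(size_t n_cells) {
  uint64_t overhead = (uint64_t)n_cells * TLS_PER_CELL_OVERHEAD;
  if (n_cells % 2 == 1)
    overhead += TLS_PER_CELL_OVERHEAD;
  return overhead;
}
```

So 2 cells map to one record (58 bytes of overhead), and 3 cells map to two records (116 bytes), matching the "add one extra TLS_PER_CELL_OVERHEAD when odd" rule.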

comment:34 in reply to:  32 Changed 6 months ago by Dhalgren

Replying to mikeperry:

Dhalgren - What produced that graph, and for what Tor relay version? That does not look like a metrics graph.

The graph is an obscured snip of the high-res view provided by my ISP.

Seeing a bit of this lately; investigated, and it appears to be mostly DIR response traffic, visible in the published statistics. Perhaps some new form of dubious behavior from the network at large.

. . . So this could be unrelated.

Agree, probably not related.

Changed 6 months ago by Jaym

Attachment: 23512_test.png added

comment:35 Changed 6 months ago by Jaym

Hello,

I've tested the 0.3.5.3-alpha tagged commit and obtained the attached result (23512_test.png).

It seems it writes a little bit more than it reads.

Here is also the state file of the onion service:

# Tor state file last generated on 2018-10-26 14:53:22 local time
# Other times below are in UTC
# You *do not* need to edit this file.

Guard in=default rsa_id=2B3219A5C66CC7ECCFA55804C465B7CCBEAF027C nickname=test036r sampled_on=2018-10-18T04:20:23 sampled_by=0.3.5.3-alpha unlisted_since=2018-10-23T18:34:51 listed=0
Guard in=default rsa_id=BA6632FA2F2A213FCB6E087C0F22052A0B62812B nickname=test020r sampled_on=2018-10-16T05:28:18 sampled_by=0.3.5.3-alpha listed=1
Guard in=default rsa_id=730DBE47049E36CFCF683F1309D5B0A925D8DECF nickname=test017r sampled_on=2018-10-23T03:39:25 sampled_by=0.3.5.3-alpha listed=1
Guard in=default rsa_id=31A83FFA0F6B0BB742EF2C4122EF9F20140281D7 nickname=test001a sampled_on=2018-10-16T10:39:44 sampled_by=0.3.5.3-alpha listed=1
Guard in=default rsa_id=DBF56FB1117C471BF0190A7374A6CB2291182EA2 nickname=test004a sampled_on=2018-10-18T23:54:35 sampled_by=0.3.5.3-alpha listed=1
Guard in=default rsa_id=E1727EB54ABCEA12AE92F5BCC48189A9C1097EE5 nickname=test030r sampled_on=2018-10-17T04:19:16 sampled_by=0.3.5.3-alpha listed=1
Guard in=default rsa_id=457AE0016470CC5D419BC24DB0ECA976F72D840D nickname=test022r sampled_on=2018-10-21T23:09:21 sampled_by=0.3.5.3-alpha listed=1
Guard in=default rsa_id=715E1FD56CAD6F679685DAEB2D28A3FBCF707F59 nickname=test009r sampled_on=2018-10-23T17:40:06 sampled_by=0.3.5.3-alpha listed=1
Guard in=default rsa_id=FD18458EA5620B3B20A347DB03B01F26BF01C10F nickname=test006r sampled_on=2018-10-18T15:28:12 sampled_by=0.3.5.3-alpha listed=1
Guard in=default rsa_id=12FF90989AB07018CB36B76CED915FEDF09BCE66 nickname=test029r sampled_on=2018-10-18T14:07:29 sampled_by=0.3.5.3-alpha listed=1
Guard in=default rsa_id=D770B6F40A11658BDCD89C94C75AD92A2A80292E nickname=test025r sampled_on=2018-10-15T19:02:38 sampled_by=0.3.5.3-alpha listed=1
Guard in=default rsa_id=C5F1BEDA4B33823E1679EDE5D54C9458096D5DA3 nickname=test011r sampled_on=2018-10-17T07:58:18 sampled_by=0.3.5.3-alpha listed=1 confirmed_on=2018-10-20T14:39:33 confirmed_idx=1 pb_use_attempts=6.000000 pb_use_successes=6.000000 pb_circ_attempts=19.000000 pb_circ_successes=19.000000 pb_successful_circuits_closed=19.000000
Guard in=default rsa_id=C83F9431C4BC3148E3CAC94C4E47A1096A0F0932 nickname=test034r sampled_on=2018-10-26T06:24:44 sampled_by=0.3.5.3-alpha listed=1
Guard in=default rsa_id=093FD73B2F6AD9CB265D45C23BB603A8D9F0ED8F nickname=test000a sampled_on=2018-10-22T01:08:39 sampled_by=0.3.5.3-alpha listed=1 confirmed_on=2018-10-23T18:31:03 confirmed_idx=2 pb_circ_attempts=1.000000 pb_circ_successes=1.000000 pb_successful_circuits_closed=1.000000
Guard in=default rsa_id=2AE442FFA549D6C5EE1208D1FB75E5751DBB3992 nickname=test035r sampled_on=2018-10-23T05:00:23 sampled_by=0.3.5.3-alpha unlisted_since=2018-10-25T11:09:47 listed=0
Guard in=default rsa_id=44711053A24B2C5AF71383525CBCFDEE20BEEDFD nickname=test010r sampled_on=2018-10-23T11:41:25 sampled_by=0.3.5.3-alpha listed=1
Guard in=default rsa_id=6F999ACBBDC74290E7CF6CA8E7FCCC83262F31FA nickname=test031r sampled_on=2018-10-23T14:34:54 sampled_by=0.3.5.3-alpha listed=1
Guard in=default rsa_id=BD7181E1F07B6861C9364C1EAED80A4D1F3165C2 nickname=test005r sampled_on=2018-10-15T13:55:08 sampled_by=0.3.5.3-alpha listed=1 confirmed_on=2018-10-25T03:33:03 confirmed_idx=0 pb_use_attempts=18.000000 pb_use_successes=18.000000 pb_circ_attempts=41.000000 pb_circ_successes=38.000000 pb_successful_circuits_closed=37.000000 pb_collapsed_circuits=1.000000 pb_timeouts=7.000000
Guard in=default rsa_id=38EA4662196889A9F00C17BA013FEBAD3BB79402 nickname=test003a sampled_on=2018-10-23T08:15:56 sampled_by=0.3.5.3-alpha listed=1
Guard in=default rsa_id=3FE678DE44C4BF8979353670240385045FCB1155 nickname=test014r sampled_on=2018-10-19T04:15:31 sampled_by=0.3.5.3-alpha listed=1
TorVersion Tor 0.3.5.3-alpha (git-444e9b37c53b0246)
LastWritten 2018-10-26 12:53:22
TotalBuildTimes 108
CircuitBuildTimeBin 25 15
CircuitBuildTimeBin 75 49
CircuitBuildTimeBin 125 26
CircuitBuildTimeBin 175 4
CircuitBuildTimeBin 225 2
CircuitBuildTimeBin 275 3
CircuitBuildTimeBin 325 3
CircuitBuildTimeBin 425 2
CircuitBuildTimeBin 475 2
CircuitBuildTimeBin 575 2

comment:36 Changed 6 months ago by mikeperry

Jaym - how did you test this? What type of node did you test with (guard, middle, exit)? And how large was this write discrepancy?

It is surprising that it is reporting more written than read -- we made conservative assumptions about TLS overhead (we report as if there are always 2 cells written per TLS payload, which is the minimum possible written overhead), which should mean that the reported write totals are always <= the read totals.

Last edited 6 months ago by mikeperry (previous) (diff)

comment:37 Changed 7 weeks ago by teor

0.2.9 fixes are in:
https://github.com/torproject/tor/pull/766

0.3.4 fixes are in:
https://github.com/torproject/tor/pull/767
There was a trivial merge conflict, which I resolved.

comment:38 in reply to:  37 Changed 7 weeks ago by teor

Replying to teor:

0.2.9 fixes are in:
https://github.com/torproject/tor/pull/766

0.3.4 fixes are in:
https://github.com/torproject/tor/pull/767
There was a trivial merge conflict, which I resolved.

The CI failed on a missing typedef, so I removed that commit.

0.2.9:
https://github.com/torproject/tor/pull/770

I'll do 0.3.4 once 0.2.9 passes.

comment:39 Changed 7 weeks ago by teor

comment:40 Changed 6 weeks ago by teor

Keywords: 029-backport 034-backport added

comment:41 Changed 6 weeks ago by teor

Resolution: fixed
Status: merge_ready → closed

#27073, #28096, #25773, #23681, and #23512 were merged to 0.2.9 and later after testing the new maint-0.2.9, maint-0.3.4, and master on CI.
