Opened 7 months ago

Closed 7 months ago

#24666 closed defect (fixed)

sched: Store the circuit ID instead of the full DESTROY cell in the destroy queue

Reported by: dgoulet Owned by: nickm
Priority: Very High Milestone: Tor: 0.3.2.x-final
Component: Core Tor/Tor Version:
Severity: Normal Keywords: tor-sched, tor-cell, backport-031, backport-030, backport-029, backport-028, backport-025
Cc: Actual Points:
Parent ID: Points:
Reviewer: dgoulet Sponsor:

Description

Tor currently keeps the DESTROY cells it needs to relay on the cmux->destroy_queue but it keeps the entire packed_cell_t which is a full 514 bytes usually.

Instead, we should keep the circid_t because this is really only what we need which would shrink down the used memory by a factor of 128x.

We've observed this on very loaded relays getting DoS with CREATE/DESTROY cells at high rate by many clients which is filling the DESTROY queue while tor is struggling to flush them on the network towards a relay that is also struggling.

This needs to be backported as far as we can in order to avoid relays being memory DoS too hard. The next step is to make our OOM consider those DESTROY cells but that is a bit more involving.

Child Tickets

Change History (8)

comment:1 Changed 7 months ago by dgoulet

Status: assignedneeds_review

Nickm's branch are:

destroy_cells_025
destroy_cells_031_tmp
destroy_cells_032_tmp
destroy_cells_033_tmp

comment:2 Changed 7 months ago by nickm

the destroy_cells_025 one is the one to review; the others are just merging it forward for testing.

comment:3 Changed 7 months ago by nickm

I've updated destroy_cells_025 to include a changes file, and to use channel_write_packed_cell() instead of channel_write_cell().

comment:4 Changed 7 months ago by dgoulet

Reviewer: dgoulet
Status: needs_reviewneeds_revision
  • destroy_cell_to_packed_cell() is supposed to free the destroy_cell_t but doesn't leading to memory leak.

Seems a simple tor_free(inp) is required.

  • The inserted_time seems not used at all. I bet it will be useful later especially with the OOM but as a fix that we will be backported, I think we could reduce the size even more by removing it?
  • This snippet I think could use a comment that says that the destroy queue can NOT be empty since the pop() can return NULL and circuitmux_get_first_active_circuit() only returns a queue with > 0 value.
+      dcell = destroy_cell_queue_pop(destroy_queue);
+      tor_assert(dcell);

Rest lgtm;

comment:5 Changed 7 months ago by nickm

Status: needs_revisionneeds_review

Branch updated as destroy_cells_025.

I addressed the first and third points above, but didn't remove the timer, since I think we'll need it for the OOM handler.

comment:6 Changed 7 months ago by dgoulet

lgtm; Currently testing, will let you know and move it to merge_ready once there.

comment:7 Changed 7 months ago by dgoulet

Status: needs_reviewmerge_ready

Ready for merge in my opinion. Been ~20h of testing, RAM has been stable and not leaking anymore.

comment:8 Changed 7 months ago by nickm

Resolution: fixed
Status: merge_readyclosed

okay. I made a bug24666_squashed_025 branch, then merged it forward.

Merge notes:

  • I added cd1f708a7f44ab305c9fcda0060f55f075b98362 to free the destroy cell in the unit test more correctly. Otherwise, coverity would have complained that there were paths where it didn't get freed.
  • 7d845976e3897fac8e78a4a26688ac57b660151b was a trivial merge conflict in the unit tests in 0.2.8.
  • b8a3602b2a7114f2027d4a3af27520b109762efd had a simple but nontrivial merge conflict in channel_flush_from_first_active_circuit.
  • We've changed the way that we handle timestamps twice since 0.2.5. So in 0.2.9, I shifted the (as yet unused) timestamp to monotonic msec in 79a50afa0e3dd44fc5ef80806ccda501fab5a718, and in 0.3.3 I shifted it to monotonic timestamp units in 1eeb505e6f08591c39e0a000efab3948ef1ef5b5
Note: See TracTickets for help on using tickets.