Opened 7 years ago

Closed 7 years ago

#6816 closed defect (fixed)

{connection_or,channel}_flush_from_first_active_circuit() should pick a new circuit for each cell

Reported by: andrea Owned by:
Priority: Medium Milestone: Tor: 0.2.4.x-final
Component: Core Tor/Tor Version: Tor: 0.2.4.2-alpha
Severity: Keywords: tor-relay
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

Before channels, this function was always called with max == 1; during the course of testing channels I discovered that it is broken with max > 1.

We pick cell_ewma once per invocation:

https://gitweb.torproject.org/tor.git/blob/75c9ccd4f851bac6d32cb08ded557ac207bc8002:/src/or/relay.c#l2412

Then once per cell, we pop the head of the pqueue and re-add it (to get it to its new correct position):

https://gitweb.torproject.org/tor.git/blob/75c9ccd4f851bac6d32cb08ded557ac207bc8002:/src/or/relay.c#l2479

... but if the new position is not the head of the queue, we don't change cell_ewma and we continue sending from the same circuit - and then the assert fails the next time around.

This function should be refactored to not do this and to separate the policy choice of what priority order to impose on circuits from the mechanism of picking a circuit, pulling cells off it and sending them down the wire.

Child Tickets

Attachments (1)

tor-circuitmux-perf-report.txt (187.4 KB) - added by andrea 7 years ago.
Profiling results

Download all attachments as: .zip

Change History (15)

comment:1 Changed 7 years ago by nickm

You should know that I merged my branch "bug6341_a_v2", which does affect some of this code a little. You should probably make sure that your fixed version has the same fix, or that the bug6341_a_v2 fix is no longer needed.

comment:2 Changed 7 years ago by nickm

Keywords: tor-relay added

comment:3 Changed 7 years ago by nickm

Component: Tor RelayTor

comment:4 Changed 7 years ago by andrea

This is now finished in my bug6816 branch; tested and known to work with EWMA disabled; now testing operation as a relay with EWMA enabled.

comment:5 Changed 7 years ago by nickm

Status: newneeds_review

For my reference, this appears to be based on andrea/bug6465_squashed. Let me know if that isn't right.

comment:6 Changed 7 years ago by nickm

Hi! Again, a review in installments. As before, please expect at least 1/3 of the comments below to be unreasonable/offbase/already discussed on IRC.

PART 1:

  • channel_set_cmux_policy_everywhere (and other functions): should arg be constant?
  • Should we even allow changing this while Tor is running? I don't think it's something we ever really want to do very badly; would avoiding it save us complexity?
  • The "detach all circuits early so they can find the channel" comment doesn't make sense to me. How do they "find" the channel? When? For what?
  • The dance we need to do when we allocate a circuitmux is a little bit worrisome; it seems like maybe instead we should arrange the API so that it gets a policy as an argument when it's constructed.
  • Does the the 'circ->marked_for_close' check belong in circuit_set_circid_chan_helper? It seems that if the circuit is marked at that point, then the code we're about to execute would add it to the new channel, and mess up referential integrity, right?
    • Similar question to circuit_set_circid_chan_helper.
  • why are these circuitmux_detach_circuit calls newly necessary in _circuit_mark_for_close?
  • circuitmux_flush_cells() sounds from its description in the file comment like it should be called circuitmux_extract_cells_to_flush() or something.
  • I'm not sure that the chanid part of chanid_circid_muxinfo_map_t makes sense. If one circuitmux exists per channel, then it has an implicit chanid, right?
  • Would it make sense to have this info be information kept as part of the circuit, opaque to the circuit?
  • Documentation for active_circuits_(head,tail) needs to explain how the circuits are linked in their rings; it's IIRC nontrivial.
  • The nesting in circuitmux_detach_all_circuits is pretty gruesomely deep. Can this function be split up or refactored or anything? There's no shame in a for loop with continues.
  • I'm a little worried about memory management issues from allocating this many little new objects per circuit. I guess we'll see if it's a problem in practice, and if it is, we can roleplay accordingly.
  • I wonder why the round-robin policy should be a built-in default rather than just one other policy. Would that simplify stuff?
  • circuitmux_attach_circuit feels kind of swiss-army-knife-ish: it behaves differently depending on whether the circuit is already attached, whether it has cells or not, whether we knew it had cells, etc. But FWICT, we call it in only one place. Are all of those situations really possible when circuitmux_attach_circuit is called?
  • For ewma_policy, I think that's a C99 syntax you're using for initializing it; we don't assume C99, alas.
  • re ewma_alloc_circ_data: In the rest of the code, the way we tell the compiler that an argument is unused is (void)cell_count.

comment:7 Changed 7 years ago by nickm

PART 2, and I'm out of comments:

  • re comment on ewma_notify_xmit_cells: we try not to have our comments talk about closed tickets or what changed when. By convention, the code and the documentation should be all the documentation you need. You shouldn't need the bug tracker to figure out what Tor is doing and why, and if you want to know the history, you should be able to use the bugtracker.
  • Please let me know once you've got profiling result.

comment:8 in reply to:  6 Changed 7 years ago by arma

Replying to nickm:

  • I'm a little worried about memory management issues from allocating this many little new objects per circuit. I guess we'll see if it's a problem in practice, and if it is, we can roleplay accordingly.

In the past we've been really bad at recognizing, and then really bad at tracking down, memory fragmentation issues. The best plan we've found so far is to diligently not put them in in the first place.

(I am reminded of this topic because mikeperry just said 0.2.3 has memory fragmentation issues that were bloating his tor relays to oom. And there's no easy way to find out what/where/whether.)

comment:9 in reply to:  5 Changed 7 years ago by andrea

Replying to nickm:

For my reference, this appears to be based on andrea/bug6465_squashed. Let me know if that isn't right.

Yeah, that's right.

Changed 7 years ago by andrea

Profiling results

comment:10 in reply to:  7 Changed 7 years ago by andrea

Replying to nickm:

  • Please let me know once you've got profiling result.

Report from perf is attached; that's from a two hour run as a middle relay with 120 kB/sec bandwidth burstable to 160, which used about 30 s of CPU time in total. It looks pretty dominated by crypto operations; summing the obviously cryptographic OpenSSL/bignum operations gives me about 95%. The heartbeat output said I got about 20M through it during the test run, so that makes approx. 41,000 cells, so on this hardware we're looking at about 700 microseconds average CPU required per cell, amortizing startup costs. I think we're probably around 35 microseconds per cell for non-cryptographic processing. It does look like malloc is our single most expensive non-cryptographic operation, so your points about mallocing small blocks may be relevant here.

Think I should do a test run with master for comparison?

comment:11 Changed 7 years ago by andrea

This is rebased against bug6465_rebased_3, and thus the latest master, in bug6816_rebased

comment:12 Changed 7 years ago by andrea

The '[warn] circuit was already inactive' bug and an assert observed before turned out to be the same bug in circuitmux_detach_all_circuits(), now fixed. I'm still testing to be sure I got all the asserts.

comment:13 Changed 7 years ago by nickm

This is merged now, right? Shall we close it?

comment:14 in reply to:  13 Changed 7 years ago by andrea

Resolution: fixed
Status: needs_reviewclosed

Replying to nickm:

This is merged now, right? Shall we close it?

Yeah, fully merged.

Note: See TracTickets for help on using tickets.