Opened 7 years ago

Closed 7 years ago

Last modified 7 years ago

#6341 closed defect (fixed)

connection_or_flush_from_first_active_circuit() does wrong thing when ewma_enabled

Reported by: arma
Owned by:
Priority: Medium
Milestone: Tor: 0.2.3.x-final
Component: Core Tor/Tor
Version:
Severity:
Keywords: tor-relay
Cc: robgjansen
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

Here are our cryptic hints:

connection_or_flush_from_first_active_circuit(): "if (circ != conn->active_circuits)" is wrong, especialy with ewma. read comments, that means it show how stuff is broken.

that means random crashes for "connection_write_to_buf(NULL->body" if ewma enabled.

just remove that condition block, no need any goto before every stream started reads and all circs with empty queue marked as inactive.

"&& queue->head" saves from crashes, but that condition is wrong anyway. you can't unblock streams when you need it.

why need "re-add the cell_ewma_t" inside for{} ? what purpose for{} if you never can flush more than 1 cell for ewma anyway? Why there are no consistent logic for just one small func?

Child Tickets

Attachments (13)

censored.patch (490 bytes) - added by asn 7 years ago.
20120808-ec2-brokenewma-combined.pdf (453.0 KB) - added by robgjansen 7 years ago.
client performance, patch and no patch
20120813-ec2-brokenewma-combined.pdf (509.0 KB) - added by robgjansen 7 years ago.
new ewma unpatched results (the blue lines)
webclient5.log.xz (64.6 KB) - added by robgjansen 7 years ago.
bootstrapping problems - webclient5
nonexit11.log.xz (118.7 KB) - added by robgjansen 7 years ago.
bootstrapping problems - nonexit11
webclient71.log.xz (59.1 KB) - added by robgjansen 7 years ago.
fetch early - webclient71
webclient74.log.xz (54.7 KB) - added by robgjansen 7 years ago.
fetch early - webclient74
client.torrc (690 bytes) - added by robgjansen 7 years ago.
relay.torrc (797 bytes) - added by robgjansen 7 years ago.
time-to-bootstrap.pdf (131.1 KB) - added by robgjansen 7 years ago.
updated-time-to-bootstrap.pdf (132.2 KB) - added by robgjansen 7 years ago.
updated-full-time-to-bootstrap.pdf (132.9 KB) - added by robgjansen 7 years ago.
20120909-ec2-brokenewma-combined.pdf (799.7 KB) - added by robgjansen 7 years ago.

Change History (76)

comment:1 in reply to:  description Changed 7 years ago by arma

Replying to arma:

"if (circ != conn->active_circuits)"

Looks like that code went in as git commit 6589ea2a

that means random crashes for "connection_write_to_buf(NULL->body" if ewma enabled.

I assume this refers to

    connection_write_to_buf(cell->body, CELL_NETWORK_SIZE, TO_CONN(conn));

just remove that condition block, no need any goto before every stream started reads and all circs with empty queue marked as inactive.

So if conn->active_circuits changed, we should still call conn->active_circuits = *next_circ_on_conn_p(circ, conn); ?
That seems unlikely to be right.

"&& queue->head" saves from crashes, but that condition is wrong anyway. you can't unblock streams when you need it.

why need "re-add the cell_ewma_t" inside for{} ? what purpose for{} if you never can flush more than 1 cell for ewma anyway? Why there are no consistent logic for just one small func?

comment:2 Changed 7 years ago by arma

Summary: "if (circ != conn->active_circuits)" in connection_or_flush_from_first_active_circuit() is wrong?connection_or_flush_from_first_active_circuit() wants refactoring

Apparently no code in that path can change conn->active_circuits, so checking whether it's changed makes no sense. That's what this is about.

Our friendly bug reporter also points out that circ is conn->active_circuits in the case of !ewma_enabled, but it's based on conn->active_circuit_pqueue in the case of ewma_enabled. Sure seems like some of this code didn't get cleaned up when we added the ewma patch.

comment:3 Changed 7 years ago by arma

Summary: connection_or_flush_from_first_active_circuit() wants refactoring → connection_or_flush_from_first_active_circuit() does wrong thing when ewma_enabled

We *are* triggering the "if" when ewma is on. We just shouldn't be. Specifically, in the case where circ is based on conn->active_circuit_pqueue, it's very likely that circ != conn->active_circuits, and thus we're skipping checks like "Is the cell queue low enough to unblock all the streams that are waiting" and "Did we just run out of cells on this circuit's queue?".

We can't just remove the "if" clause, since then we'd run lines like "conn->active_circuits = *next_circ_on_conn_p(circ, conn);" which should only happen when !ewma_enabled.
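
To make the shape of the problem concrete, here is a rough paraphrase of the flow being described (not verbatim 0.2.3 code; pick_lowest_ewma_circuit() is an invented name standing in for the pqueue lookup):

    /* Rough paraphrase, not actual Tor source. */
    if (ewma_enabled)
      circ = pick_lowest_ewma_circuit(conn->active_circuit_pqueue); /* invented helper name */
    else
      circ = conn->active_circuits;          /* round-robin list head */

    /* ... flush cells from circ's queue ... */

    /* This check was written with the round-robin case in mind.  Under ewma,
     * circ is usually not the list head, so it fires almost every time and
     * the unblock-streams / make-circuit-inactive bookkeeping at the end of
     * the function never runs. */
    if (circ != conn->active_circuits)
      goto done;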

I'm beginning to think this is a "0.2.3 milestone" thing.

comment:4 Changed 7 years ago by arma

Milestone: Tor: 0.2.3.x-final

We should do at least the simple fix for 0.2.3, since I'm unclear on whether always skipping those clauses at the end of the function (when ewma is on) is harming our performance, and we should find out.

comment:5 Changed 7 years ago by arma

Though that said, if Nick wants 0.2.3 to be more frozen than that, I'd be amenable to doing the fix in 0.2.4 and later backporting iff we become convinced it *is* important.

comment:6 Changed 7 years ago by nickm

What exactly do you consider "the real fix" and "the simple fix" here?

comment:7 Changed 7 years ago by hellais

I think this is the "simple fix" submitted by <censored> on IRC: http://paste.debian.net/plain/178153

Changed 7 years ago by asn

Attachment: censored.patch added

comment:8 Changed 7 years ago by nickm

Status: new → needs_review

Man, that's some messy code.

I'm in favor of making the above change in 0.2.3, and doing some radical refactoring in 0.2.4.

Possible radical refactoring: Having the non-ewma case around seems more or less pointless, and ugly. We should get rid of it if we can, and refactor stuff until the whole ewma logic can have unit tests.

comment:9 Changed 7 years ago by nickm

I've pushed it as a git commit in branch bug6341 in my public repo. Please review for 0.2.3; backport candidate for 0.2.2.

comment:10 Changed 7 years ago by nickm

Oops. That should have said branch "bug6341_a".

comment:11 Changed 7 years ago by nickm

And "bug6341_a_v2" is against 0.2.3, as it should be.

comment:12 Changed 7 years ago by arma

I argue against a backport to 0.2.2, since it seems to not be actively breaking there.

comment:13 Changed 7 years ago by arma

Hm. I think the description of the bug and patch is wrong.

If we check !ewma_enabled, we're actually causing ourselves to run *more* code than we ran before.

Specifically, we now run all this:

  tor_assert(*next_circ_on_conn_p(circ,conn));
  assert_active_circuits_ok_paranoid(conn);
  conn->active_circuits = *next_circ_on_conn_p(circ, conn);

  /* Is the cell queue low enough to unblock all the streams that are waiting
   * to write to this circuit? */
  if (streams_blocked && queue->n <= CELL_QUEUE_LOWWATER_SIZE)
    set_streams_blocked_on_circ(circ, conn, 0, 0); /* unblock streams */

  /* Did we just run out of cells on this circuit's queue? */
  if (queue->n == 0) {
    log_debug(LD_GENERAL, "Made a circuit inactive.");
    make_circuit_inactive_on_conn(circ, conn);
  }

Setting conn->active_circuits should be irrelevant but harmless. We should look very carefully to make sure *next_circ_on_conn_p(circ, conn) is never going to be NULL (or we're *introducing* an assert trigger case).

Being willing to call set_streams_blocked_on_circ() is the main feature of this bugfix I believe. Right now we don't unblock streams when we should. So this could be a performance fix. I have no idea if it fixes performance in any noticeable way; I've been toying with the idea of asking Rob to run a Shadow simulation with and without this patch.
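
For reference, the shape of the simple fix being discussed is roughly the following; this is a sketch assuming the patch merely guards the early-out with !ewma_enabled (see the attached censored.patch and nickm's bug6341_a_v2 branch for the actual change):

    /* Sketch only: take the "active-circuit list changed under us" early-out
     * only in the round-robin case, so that the cleanup below also runs when
     * ewma picked the circuit from the pqueue. */
    if (!ewma_enabled && circ != conn->active_circuits) {
      assert_active_circuits_ok_paranoid(conn);
      goto done;
    }

    /* ... after which the block quoted above (advance the list head, unblock
     * streams at the low-water mark, mark an empty circuit inactive) runs in
     * both scheduling modes. */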

comment:14 in reply to:  13 Changed 7 years ago by arma

Replying to arma:

I have no idea if it fixes performance in any noticeable way; I've been toying with the idea of asking Rob to run a Shadow simulation with and without this patch.

Specifically, I have a distant hope that fixing this bug will change Rob's "ewma is useless" simulation conclusions. Wouldn't that be neat.

comment:15 Changed 7 years ago by arma

Ok, I mailed Rob to ask him to simulate it. As a bonus, if his simulated Tor crashes we will learn something further. ;)

comment:16 Changed 7 years ago by arma

Another hint from our irc person (pasting here since I'm in the middle of doing n things): "you should to find answer, could connection_write_to_buf() make circuit inactive or it's impossible? if it's possible then no simple fix able, if it's impossible then simple fix or just remove that all goto stuff."

comment:17 Changed 7 years ago by robgjansen

Cc: robgjansen added

comment:18 in reply to:  15 ; Changed 7 years ago by robgjansen

Replying to arma:

Ok, I mailed Rob to ask him to simulate it. As a bonus, if his simulated Tor crashes we will learn something further. ;)

I'll try soon, after completing #6401 (in progress).

comment:19 in reply to:  18 Changed 7 years ago by robgjansen

Replying to robgjansen:

I'll try soon, after completing #6401 (in progress).

I am currently running simulations on EC2 with CircuitPriorityHalflife 30, one with and one without the patch applied.

comment:20 Changed 7 years ago by arma

Looking forward to seeing the graphs! :)

Changed 7 years ago by robgjansen

Attachment: 20120808-ec2-brokenewma-combined.pdf added

client performance, patch and no patch

comment:21 Changed 7 years ago by robgjansen

I just uploaded the graphs. You can find the Tor network and client model described in #6401.

There are 3 simulations here, all done with tor-0.2.3.16-alpha.

Load distribution in vanilla Tor:

TYPE	#XFERS	GiB	%
im	34735	0.033	0.075
web	85779	26.178	59.376
bulk	1586	7.744	17.565
p2p	596397	9.100	20.641
perf50k	1896	0.090	0.205
perf1m	965	0.942	2.138
TOTAL	721358	44.088	100.000

Load distribution with CircuitPriorityHalflife 30, not patched:

TYPE	#XFERS	GiB	%
im	6610	0.006	0.042
web	23264	7.100	47.495
bulk	1451	7.085	47.397
p2p	12919	0.197	1.319
perf50k	585	0.028	0.187
perf1m	545	0.532	3.560
TOTAL	45374	14.948	100.000

Load distribution with CircuitPriorityHalflife 30, patched:

TYPE	#XFERS	GiB	%
im	18485	0.018	0.045
web	60669	18.515	46.973
bulk	2770	13.525	34.315
p2p	416375	6.353	16.119
perf50k	1650	0.079	0.200
perf1m	948	0.926	2.349
TOTAL	500897	39.416	100.000

I have not analyzed the log file, but it looks like some crazy shit happened in the middle graph, when the patch was not applied. Was this somewhat expected? Or is this backwards, i.e. were you thinking the patch would make things explode?

comment:22 in reply to:  21 ; Changed 7 years ago by arma

Replying to robgjansen:

it looks like some crazy shit happened in the middle graph, when the patch was not applied. Was this somewhat expected?

The patch was thought to fix some behavior. We weren't sure if fixing the behavior would make a noticeable change. It looks like it is a significant change.

Let me see if I'm interpreting the results correctly. The graphs show that ewma-unpatched outperforms the other two -- their requests get to the first byte quicker and also get to the last byte quicker. But the data in your comment show that far fewer requests are made. Is it fair to say that the clients in the ewma-unpatched case put less load on the network because they're not making as many requests total? Is this 'not making as many requests' correlated to 'not finishing some of their requests in a timely fashion'?

comment:23 in reply to:  22 ; Changed 7 years ago by robgjansen

Replying to arma:

Let me see if I'm interpreting the results correctly. The graphs show that ewma-unpatched outperforms the other two -- their requests get to the first byte quicker and also get to the last byte quicker. But the data in your comment show that far fewer requests are made. Is it fair to say that the clients in the ewma-unpatched case put less load on the network because they're not making as many requests total?

This is absolutely correct. Only ~15 GiB worth of completed downloads happened in the unpatched case as opposed to ~40 GiB in the patched case.

Is this 'not making as many requests' correlated to 'not finishing some of their requests in a timely fashion'?

Could be. Though, the performance graphs show that the downloads that did complete were fairly fast. We can't really draw conclusions directly from the unpatched performance because the load is so different.

I wonder if clients were simply 'crashing' in the unpatched network, causing them to no longer make requests. In one of your previous comments, you mention that you expected some simulated Tors to crash. Did you mean in the unpatched or patched case?

comment:24 in reply to:  23 ; Changed 7 years ago by arma

Replying to robgjansen:

I wonder if clients were simply 'crashing' in the unpatched network, causing them to no longer make requests. In one of your previous comments, you mention that you expected some simulated Tors to crash. Did you mean in the unpatched or patched case?

I meant in either case, but we weren't expecting crashes in the unpatched case.

By crash, you mean a client that seg faults or asserts and stops running? This seems very useful to teach Shadow to notice, yes? :)

comment:25 in reply to:  24 Changed 7 years ago by robgjansen

Replying to arma:

By crash, you mean a client that seg faults or asserts and stops running? This seems very useful to teach Shadow to notice, yes? :)

No I don't mean that. If a client seg faults or fails an assertion, the entire Shadow process dies. That means no simulation results. By crash I meant getting stuck in a different way that causes it to be no longer useful.

comment:26 Changed 7 years ago by arma

Rob and I decided that this ticket is the one to follow-up on the "why do clients stop fetching stuff" bug, since it's the simplest change. He's going to rerun it with info-level logs for everybody and send me the logs. He'll then race me to find an answer, by stepping through things in Eclipse when he gets a chance.

Changed 7 years ago by robgjansen

Attachment: 20120813-ec2-brokenewma-combined.pdf added

new ewma unpatched results (the blue lines)

comment:27 Changed 7 years ago by robgjansen

So I re-ran CircuitPriorityHalflife 30, not patched, with info level logs. Looks like the problem did not appear this time:

TYPE	#XFERS	GiB	%
im	22531	0.021	0.048
web	73838	22.534	50.862
bulk	2655	12.964	29.261
p2p	496719	7.579	17.108
perf50k	1832	0.087	0.197
perf1m	1145	1.118	2.524
TOTAL	598720	44.304	100.000

I re-drew the client performance graph too, just for fun. The blue lines contain the new results.

I'll send the logs, though I'm not sure they will be useful.

comment:28 Changed 7 years ago by nickm

15:28 < armadev> re 6341, we have either a huge bug in shadow that sometimes 
                 ruins shadow's results, or a deadlocking bug in tor client. 
                 i'd say 60% / 40% if i had to guess right now. but i oscillate.
15:28 < armadev> also maybe /sometimes//
15:28 < nickm> armadev: so, merge the patch or not or wait for something else?
15:29 < armadev> nickm: i think it's fair to say that the simulations for 6341 
                 provided no usable hints about whether to merge it.
15:30 < nickm> armadev: ok
15:30 < nickm> armadev: I think I want to put 6341 into needs_information then
15:30 < nickm> agreed?
15:31 < armadev> sure. i feel a bit bad potentially leaving it out of debian's 
                 0.2.3. but we shouldn't be too rushed.
15:31 < nickm> you simultaneously  don't know if we should merge it and feel 
               bad about not merging it?
15:32 < nickm> Actually, I know the feeling
15:32 < armadev> i think it's an actual bug. i think our fix likely fixes it. i 
                 wonder if our fix breaks something.
15:32 < nickm> could merge in 0.2.4 and contemplate a backport eventually?
15:33 < armadev> sounds good to me
15:33 < nickm> may I copy-and-paste the above?
15:33 < armadev> sure

comment:29 in reply to:  27 ; Changed 7 years ago by arma

Replying to robgjansen:

I re-drew the client performance graph too, just for fun. The blue lines contain the new results.

From this graph, it looks like the patch basically has no (broad-scale performance) effect. Which is plausible too.

I'll send the logs, though I'm not sure they will be useful.

Agreed, they weren't useful -- I'm hoping to examine each individual client and see if one stops fetching stuff partway through the experiment. (There's now an email thread exploring how to fix shadow's handling of logs. Fun.)

Thanks!

comment:30 in reply to:  29 ; Changed 7 years ago by robgjansen

Replying to arma:

Replying to robgjansen:

I'll send the logs, though I'm not sure they will be useful.

Agreed, they weren't useful -- I'm hoping to examine each individual client and see if one stops fetching stuff partway through the experiment. (There's now an email thread exploring how to fix shadow's handling of logs. Fun.)

I redirected the email thread here, and the Shadow log problem is fixed. The examination you have planned will be much easier with the new logs that I'll post tomorrow.

comment:31 Changed 7 years ago by arma

I'm increasingly thinking that the Shadow client model, when doing performance comparison tests, should be "each client fetches a set number of things."

That way the total load on the network is fixed between comparisons, and the only question is "how well does the network handle this set of users doing that set of activities?"

I don't deny that it's interesting to consider the "if it works better, they'll use it more, putting more total load on it" dynamism.

But the current client model makes it hard to say things like "clients in both cases fetched all 100 of the files, but in case 1 they got their files faster on average" and "clients in case 1 successfully fetched fewer of their files".

If we want to get fancy, we could have the web browsing users fetch a set number of things, and the filesharers just keep downloading. Lots of permutations, each of which expose some properties and brush others under the rugs. Hm.

comment:32 in reply to:  31 Changed 7 years ago by robgjansen

Replying to arma:

I'm increasingly thinking that the Shadow client model, when doing performance comparison tests, should be "each client fetches a set number of things."

That way the total load on the network is fixed between comparisons, and the only question is "how well does the network handle this set of users doing that set of activities?"

It may be worth considering a model like this, if nothing else because we haven't given it much attention so far and we may be able to learn something we otherwise would not.


I don't deny that it's interesting to consider the "if it works better, they'll use it more, putting more total load on it" dynamism.

But the current client model makes it hard to say things like "clients in both cases fetched all 100 of the files, but in case 1 they got their files faster on average" and "clients in case 1 successfully fetched fewer of their files".

There is no panacea.

In your new client model, the network may start out looking like a Tor network we think we understand, but the network will slowly transform into something entirely different as more and more clients finish their set number of downloads. By the time 99% of the clients finish, I'd argue that we are so far from any real Tor network in terms of load, congestion, etc, that the results are no longer meaningful. And I'd guess the efficiency of the network probably doesn't scale linearly with the load either, meaning we may get a skewed impression of how things are working. In our current model, and assuming no software bugs, any given download will occur on a 'properly' loaded Tor network (for various definitions of 'properly').

RE the questions you stated above: perhaps we should be analyzing the data slightly differently than we have been? Instead of aggregating all the client download times into one CDF, would it help to look at per-client details: e.g., how many downloads each client finished and what the download times were for each client? (I'm thinking of an average and a five-number summary for these, but you could also draw a CDF for each client ;) But you're right, we won't be able to say "he was supposed to download X things, and he finished in Y seconds" when there is no fixed X.

If we want to get fancy, we could have the web browsing users fetch a set number of things, and the filesharers just keep downloading. Lots of permutations, each of which expose some properties and brush others under the rugs. Hm.

Yeah, I think we want a hybrid here: something like our current model to produce the background traffic that we think is characteristic of a real Tor network, with the addition of some other clients that have a deterministic set of downloads to perform. I guess this is somewhat similar to Tor+TorPerf. In fact, we have already built TorPerf-like clients into our model, so it may be worth analyzing their data separately.

Of course it would help if we could be sure all the clients could download successfully through their SOCKS connection ;) We're still working on that. Wheeee.

comment:33 in reply to:  30 Changed 7 years ago by robgjansen

Replying to robgjansen:

The examination you have planned will be much easier with the new logs that I'll post tomorrow.

The new logs are here.

I reduced it from 120 GB to 4.5 GB by filtering out lines containing 'torrent client encountered a non-asynch-io error'. The P2P plug-in code doesn't differentiate that error beyond knowing it's not EAGAIN and it's not coming from a send or recv call. (Apparently, it also doesn't rate-limit its reconnection attempts :P)

I suspect it's coming from the attempted SOCKS connection, because our file-download plug-in has similar problems (and the file plug-in does differentiate: you can grep the log for FG_ERR_SOCKSCON for those errors). The new browser plug-in has these problems as well, though we are not currently using that plug-in in our model.

I hope you can find the problem in Tor, especially since three of our separate plug-ins are seeing this issue. I suggest searching for the first occurrence of FG_ERR_SOCKSCON and working backwards from there.

comment:34 Changed 7 years ago by robgjansen

I've been looking into the SOCKS connection problems. The distribution of the number of SOCKS connection errors over a recent set of 24 experiments is:
[1, 3, 5, 7, 7, 7, 9, 12, 16, 17, 17, 18, 18, 25, 25, 26, 27, 39, 57, 109, 110, 120, 143, 580]

From what I can tell, it seems to mostly be happening when the client has yet to 'bootstrap' to 100% (usually stuck at 50%: "Loading relay descriptors"). In this state clients seem to be getting several HTTP 404 errors because the directory mirrors they selected to contact do not yet have the requested info. Circuits can't be built without an acceptable fraction of descriptors (75% of all descriptors and 50% of exit descriptors). The SOCKS connect then fails after the 2 minute SocksTimeout.

To increase confidence in my suspicion, I ran a small test that resulted in 254 total SOCKS connection errors and 1833 total HTTP 404s. The same exact test after reducing FRAC_USABLE_NEEDED from .75 to .4 and FRAC_EXIT_USABLE_NEEDED from .5 to .2 resulted in 3 total SOCKS connection errors and 138 total HTTP 404s.
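
For anyone following along, the gate being described is roughly of this form; FRAC_USABLE_NEEDED and FRAC_EXIT_USABLE_NEEDED are the constants named above, but the function below is an illustrative sketch, not Tor's actual code:

    /* Illustrative sketch of the "do we have enough descriptors?" gate. */
    #define FRAC_USABLE_NEEDED      0.75 /* fraction of all listed descriptors */
    #define FRAC_EXIT_USABLE_NEEDED 0.50 /* fraction of exit descriptors */

    static int
    have_enough_descriptors(int n_have, int n_listed,
                            int n_exit_have, int n_exit_listed)
    {
      return n_have >= FRAC_USABLE_NEEDED * n_listed &&
             n_exit_have >= FRAC_EXIT_USABLE_NEEDED * n_exit_listed;
    }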

So is it really a problem with the distribution of descriptor information to the directory mirrors? I wonder: are there any options I should be setting to improve bootstrapping in testing Tor networks (other than TestingTorNetwork of course)?

Is there a way to force everyone in the network to always contact the directory authority for directory information and descriptors instead of directory mirrors? Then, it would also be useful if I can make my directory authority only serve that information and not participate as circuit relays, just to ensure it is as stable as possible for bootstrapping. Does this make sense and is there a way to do this? (It wasn't obviously possible with the current options listed in the Tor manual.)

comment:35 in reply to:  34 Changed 7 years ago by arma

Replying to robgjansen:

Is there a way to force everyone in the network to always contact the directory authority for directory information and descriptors instead of directory mirrors?

Set "FetchDirInfoEarly 1" in the torrc for each Tor that should fetch that way. This will not only cause them to ask the authorities directly, but they'll also ask on the "dir mirror" schedule (i.e. within the hour after the consensus is made) rather than the "client" schedule (within a few hours). See directory_fetches_from_authorities() for the gory details.

Then, it would also be useful if I can make my directory authority only serve that information and not participate as circuit relays, just to ensure it is as stable as possible for bootstrapping. Does this make sense and is there a way to do this? (It wasn't obviously possible with the current options listed in the Tor manual.)

There's no easy way to do it.

comment:36 in reply to:  34 Changed 7 years ago by arma

Replying to robgjansen:

From what I can tell, it seems to mostly be happening when the client has yet to 'bootstrap' to 100% (usually stuck at 50%: "Loading relay descriptors"). In this state clients seem to be getting several HTTP 404 errors because the directory mirrors they selected to contact do not yet have the requested info. Circuits can't be built without an acceptable fraction of descriptors (75% of all descriptors and 50% of exit descriptors). The SOCKS connect then fails after the 2 minute SocksTimeout.

To increase confidence in my suspicion, I ran a small test that resulted in 254 total SOCKS connection errors and 1833 total HTTP 404s. The same exact test after reducing FRAC_USABLE_NEEDED from .75 to .4 and FRAC_EXIT_USABLE_NEEDED from .5 to .2 resulted in 3 total SOCKS connection errors and 138 total HTTP 404s.

Good find. I've opened #6752 for what I believe is the issue.

comment:37 in reply to:  34 ; Changed 7 years ago by arma

Replying to robgjansen:

From what I can tell, it seems to mostly be happening when the client has yet to 'bootstrap' to 100% (usually stuck at 50%: "Loading relay descriptors"). In this state clients seem to be getting several HTTP 404 errors because the directory mirrors they selected to contact do not yet have the requested info.

As usual, it would be great if you could get me info-level logs from one of these clients that doesn't bootstrap, and ideally from a directory mirror that doesn't bootstrap too.

I just used nickm's chutney tool to launch a self-contained tor network, and for the first 5 minute period, all the clients were getting 404's in response to their consensus fetches. After about 15 minutes though, everything had settled down and they all worked nicely. I wonder what is different in your situation.

Changed 7 years ago by robgjansen

Attachment: webclient5.log.xz added

bootstrapping problems - webclient5

Changed 7 years ago by robgjansen

Attachment: nonexit11.log.xz added

bootstrapping problems - nonexit11

Changed 7 years ago by robgjansen

Attachment: webclient71.log.xz added

fetch early - webclient71

Changed 7 years ago by robgjansen

Attachment: webclient74.log.xz added

fetch early - webclient74

comment:38 in reply to:  37 ; Changed 7 years ago by robgjansen

Replying to arma:

As usual, it would be great if you could get me info-level logs from one of these clients that doesn't bootstrap, and ideally from a directory mirror that doesn't bootstrap too.

Right on: relay log here, client log here.

A note on the log format: the first column is the real (wall-clock) time, and the third column is the virtual time (returned by any time()-style functions) as hours:minutes:seconds:nanoseconds.

The relay starts at just over the 1 virtual-minute mark and finally reaches 100% at just under 28 minutes. The client starts at the 15 virtual-minute mark and reaches 100% at just over 37 minutes. After reaching 100%, the client sees no additional socks connection errors.

I just used nickm's chutney tool to launch a self-contained tor network, and for the first 5 minute period, all the clients were getting 404's in response to their consensus fetches. After about 15 minutes though, everything had settled down and they all worked nicely. I wonder what is different in your situation.

Me too. I ran a third experiment using 'FetchDirInfoEarly 1' and 'FetchDirInfoExtraEarly 1' for the relays, and 'FetchDirInfoEarly 1' for the clients. There were no 404s for any of the nodes after the 5 virtual-minute mark.

I'm still seeing a few socks errors though.

For webclient71 in this log file, the client started up at the 15-minute-7-second mark and was bootstrapped to 100% 24 seconds later. The 'FG_ERR_SOCKSCONN' occurs at minute 43. It appears the circuit-building process may have been legitimately slow here because of exit8. Perhaps increasing the SocksTimeout to 3 minutes would help in this case.

For webclient74 in this log file, the client didn't bootstrap to 100% until almost 36 minutes. There were 3 'FG_ERR_SOCKSCONN' errors before that, but none after. The client again appeared to be attempting to download descriptors.

Does any of this shed light on the problem? I'm curious whether you can catch something in the logs that I didn't notice.

comment:39 in reply to:  38 ; Changed 7 years ago by arma

Replying to robgjansen:

Right on: relay log here, client log here.

Can you post the torrc files for these two Tors too?

comment:40 in reply to:  39 ; Changed 7 years ago by arma

Replying to arma:

Can you post the torrc files for these two Tors too?

I ask because it seems your webclient is fetching both the ns consensus and the microdesc consensus. If it's really just a client, it has no business wanting the ns consensus (or the descriptors that it points to).

Changed 7 years ago by robgjansen

Attachment: client.torrc added

Changed 7 years ago by robgjansen

Attachment: relay.torrc added

comment:41 in reply to:  40 ; Changed 7 years ago by robgjansen

Replying to arma:

Replying to arma:

Can you post the torrc files for these two Tors too?

I ask because it seems your webclient is fetching both the ns consensus and the microdesc consensus. If it's really just a client, it has no business wanting the ns consensus (or the descriptors that it points to).

Client here, relay here.

comment:42 in reply to:  40 ; Changed 7 years ago by arma

Replying to arma:

I ask because it seems your webclient is fetching both the ns consensus and the microdesc consensus. If it's really just a client, it has no business wanting the ns consensus (or the descriptors that it points to).

Looks like you're falling victim to bug #6507 (to be fixed in 0.2.4.1-alpha) -- it interprets an explicit "DirPort 0" as the same as "DirPort is enabled". Same with "ORPort 0" I believe. It was introduced in 0.2.3.3-alpha.

Fix is either to upgrade to git master, or stop writing "FooPort 0" in your torrcs.

comment:43 in reply to:  41 Changed 7 years ago by arma

Replying to robgjansen:

Replying to arma:

Replying to arma:

Can you post the torrc files for these two Tors too?

I ask because it seems your webclient is fetching both the ns consensus and the microdesc consensus. If it's really just a client, it has no business wanting the ns consensus (or the descriptors that it points to).

Client here, relay here.

Wait, is that client torrc for webclient5? It has FetchDirInfoEarly 1 in it. If not, can I have the torrc files that go with webclient5 and nonexit11? :)

comment:44 Changed 7 years ago by arma

Ok, the problem with nonexit11 is that it failed repeatedly to get a key certificate for the authority. Search the info-level log for "certificate ":

0:0:11:574284 [thread-0] 0:6:5:106061958 [scallion-info] [nonexit11-111.1.0.0] [intercept_logv] [tor-info] Looks like we need to download a new certificate from authority '4uthority' at 73.1.0.0:9112 (contact jansen@cs.umn.edu; identity 919CAD4BB055CE7EC5169F19BCA731F3D60EFB19)
0:0:11:574390 [thread-0] 0:6:5:106061958 [scallion-info] [nonexit11-111.1.0.0] [intercept_logv] [tor-info] authority_certs_fetch_missing() We're missing a certificate from authority with signing key A432966AA4970BECD68C198843EED4CD6F3F02F3: launching request.
0:0:13:002654 [thread-0] 0:7:10:063025730 [scallion-info] [nonexit11-111.1.0.0] [intercept_logv] [tor-info] Looks like we need to download a new certificate from authority '4uthority' at 73.1.0.0:9112 (contact jansen@cs.umn.edu; identity 919CAD4BB055CE7EC5169F19BCA731F3D60EFB19)
0:0:16:458881 [thread-0] 0:11:6:000000000 [scallion-info] [nonexit11-111.1.0.0] [intercept_logv] [tor-info] connection_dir_request_failed() Giving up on certificate fetch from directory server at '73.1.0.0'; retrying
0:0:16:458902 [thread-0] 0:11:6:000000000 [scallion-info] [nonexit11-111.1.0.0] [intercept_logv] [tor-info] authority_certs_fetch_missing() We're missing a certificate from authority with signing key A432966AA4970BECD68C198843EED4CD6F3F02F3: launching request.
0:1:30:915266 [thread-0] 0:16:7:000000000 [scallion-info] [nonexit11-111.1.0.0] [intercept_logv] [tor-info] connection_dir_request_failed() Giving up on certificate fetch from directory server at '73.1.0.0'; retrying
0:1:43:358079 [thread-0] 0:17:20:000000000 [scallion-info] [nonexit11-111.1.0.0] [intercept_logv] [tor-info] authority_certs_fetch_missing() We're missing a certificate from authority with signing key A432966AA4970BECD68C198843EED4CD6F3F02F3: launching request.
0:2:28:409714 [thread-0] 0:22:21:000000000 [scallion-info] [nonexit11-111.1.0.0] [intercept_logv] [tor-info] connection_dir_request_failed() Giving up on certificate fetch from directory server at '73.1.0.0'; retrying
0:3:42:342784 [thread-0] 0:26:29:103970803 [scallion-warning] [nonexit11-111.1.0.0] [intercept_logv] [tor-warn] Looks like we need to download a new certificate from authority '4uthority' at 73.1.0.0:9112 (contact jansen@cs.umn.edu; identity 919CAD4BB055CE7EC5169F19BCA731F3D60EFB19)
0:4:20:480911 [thread-0] 0:27:30:000000000 [scallion-info] [nonexit11-111.1.0.0] [intercept_logv] [tor-info] authority_certs_fetch_missing() We're missing a certificate from authority with signing key A432966AA4970BECD68C198843EED4CD6F3F02F3: launching request.
0:4:20:549334 [thread-0] 0:27:30:080227503 [scallion-warning] [nonexit11-111.1.0.0] [intercept_logv] [tor-warn] Looks like we need to download a new certificate from authority '4uthority' at 73.1.0.0:9112 (contact jansen@cs.umn.edu; identity 919CAD4BB055CE7EC5169F19BCA731F3D60EFB19)
0:4:20:598334 [thread-0] 0:27:30:157543249 [scallion-info] [nonexit11-111.1.0.0] [intercept_logv] [tor-info] trusted_dirs_load_certs_from_string() Adding downloaded certificate for directory authority authority with signing key A432966AA4970BECD68C198843EED4CD6F3F02F3

I wonder why the requests kept timing out.

comment:45 in reply to:  42 ; Changed 7 years ago by robgjansen

Replying to arma:

Looks like you're falling victim to bug #6507 (to be fixed in 0.2.4.1-alpha) -- it interprets an explicit "DirPort 0" as the same as "DirPort is enabled". Same with "ORPort 0" I believe. It was introduced in 0.2.3.3-alpha.

Fix is either to upgrade to git master, or stop writing "FooPort 0" in your torrcs.

OK. I wonder if "ClientOnly 1" takes precedence, since it follows the "DirPort 0" and "ORPort 0"?

Replying to arma:

Wait, is that client torrc for webclient5? It has FetchDirInfoEarly 1 in it. If not, can I have the torrc files that go with webclient5 and nonexit11? :)

Yes, client.torrc is used by webclient5 and relay.torrc is used by nonexit11. Are you implying that "FetchDirInfoEarly 1" is only for dir mirrors? That wasn't clear from your comment above:

Set "FetchDirInfoEarly 1" in the torrc for each Tor that should fetch that way.

I thought this meant the clients would try to fetch whatever info they needed earlier. Are you telling me I should take these lines back out for regular clients?

Replying to arma:

Ok, the problem with nonexit11 is that it failed repeatedly to get a key certificate for the authority. Search the info-level log for "certificate ":

[...]

I wonder why the requests kept timing out.

Not sure. I checked my logs, and this was the only node in the entire experiment that had problems. I also checked a few other experiments and didn't notice those warnings.

Also, it looks like my cert is not yet expired:

dir-key-certificate-version 3
fingerprint 919CAD4BB055CE7EC5169F19BCA731F3D60EFB19
dir-key-published 2011-06-20 16:31:50
dir-key-expires 2013-06-20 16:31:50
dir-identity-key
[...]

comment:46 in reply to:  42 Changed 7 years ago by arma

Replying to arma:

Replying to arma:

I ask because it seems your webclient is fetching both the ns consensus and the microdesc consensus. If it's really just a client, it has no business wanting the ns consensus (or the descriptors that it points to).

Looks like you're falling victim to bug #6507 (to be fixed in 0.2.4.1-alpha) -- it interprets an explicit "DirPort 0" as the same as "DirPort is enabled". Same with "ORPort 0" I believe. It was introduced in 0.2.3.3-alpha.

Fix is either to upgrade to git master, or stop writing "FooPort 0" in your torrcs.

The fix will also be included in Tor 0.2.3.21-rc, coming out today.

comment:47 in reply to:  45 Changed 7 years ago by arma

Replying to robgjansen:

OK. I wonder if "ClientOnly 1" has precedence since it follows the "DirPort 0" and "ORPort 0"?

Order in the torrc file doesn't matter (except for LINELIST and LINELIST_* things).

Also, while ClientOnly works to disable actually opening the DirPort or ORPort, it doesn't provide a workaround for bug #6507. Good thing we're fixing it.

comment:48 in reply to:  45 Changed 7 years ago by arma

Replying to robgjansen:

Set "FetchDirInfoEarly 1" in the torrc for each Tor that should fetch that way.

I thought this meant the clients would try to fetch whatever info they needed earlier. Are you telling me I should take these lines back out for regular clients?

If they work, by all means keep them for now. Eventually I'd like to get to the point where we don't need them, since they make our simulation a bit less realistic and might introduce other issues at scale: not only do they make the directory fetching happen earlier, but they also make everybody only ask your directory authority when fetching dir stuff.

As another fun example of bizarre behavior, if you set FetchDirInfoEarly on a client, that makes directory_fetches_from_authorities() return true, which makes directory_command_should_use_begindir() return false, which makes use_begindir false in directory_initiate_command_rend(), which causes the dir fetches to be done via vanilla http rather than tunneled inside TLS. And with only one dir auth, due to #6761, all but one directory fetch for descriptors or microdescriptors will be silently discarded. Not the end of the world (everything still mostly works), but certainly not on the normal path. :)
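
A condensed sketch of that chain for readers skimming (the function names are the ones listed above; the surrounding code is paraphrased, not verbatim):

    /* Paraphrase of the chain described above, not verbatim Tor source. */
    if (directory_fetches_from_authorities(options)) {
      /* true when FetchDirInfoEarly is set; directory_command_should_use_begindir()
       * then returns 0, so directory_initiate_command_rend() sets: */
      use_begindir = 0;  /* fetch via plain HTTP to the DirPort, not a begindir
                          * request tunneled inside the TLS connection */
    }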

Changed 7 years ago by robgjansen

Attachment: time-to-bootstrap.pdf added

comment:49 Changed 7 years ago by robgjansen

Replying to arma:

If they work, by all means keep them for now.

It turns out it made things worse.

I ran 24 experiments with "FetchDirInfoEarly 1", and 24 without. Then I graphed the time that each node in each experiment reached 100% bootstrap status as a CDF. The results are here. It seems not to have affected relay bootstrapping times, and to have made them worse for clients. (Note that clients in my experiments don't begin until around 15 minutes = 900 seconds, hence the large jump in the client plots.)

As another fun example of bizarre behavior, [...]

I guess we won't have to worry about that. So it seems that, after all of this, the fix isn't in the config and we don't have any answers except to shorten the retry schedule (#6752)? Do you expect the fixes in 0.2.3.21-rc to improve bootstrapping?

comment:50 in reply to:  49 Changed 7 years ago by robgjansen

Replying to robgjansen:

So it seems that, after all of this, the fix isn't in the config and we don't have any answers except to lower the retry schedule (#6752)? Do you expect the fixes in 0.2.3.21-rc to improve bootstrapping?

Or, perhaps I'm asking for too much and I should give my networks more time to get situated? Maybe I shouldn't start any downloads until the 1 virtual hour mark?

Changed 7 years ago by robgjansen

comment:51 Changed 7 years ago by robgjansen

Great news! I spent many hours doing some hardcore debugging, and think I have fixed most of our bootstrapping issues. Once I noticed the abnormal behavior from a combination of Shadow debug logs (ouch!) and Tor info logs, the fix was fairly straightforward.

The new time to bootstrap graph looks much better now, right? :-)

I'm re-running my big suite of experiments now so we can count the errors in a larger sample. If we are relatively happy, we can finally get back to the main reason this ticket exists ;) I'll post those results when I get them.

comment:52 Changed 7 years ago by arma

Great!

Changed 7 years ago by robgjansen

comment:53 Changed 7 years ago by robgjansen

Replying to robgjansen:

I'm re-running my big suite of experiments now so we can count the errors in a larger sample.

I've attached the full time to bootstrap graph that aggregates all 24 experiments from the 'suite' I've been running. I've also counted the number of socks errors from each variation we ran:

VARIATION	NOTES	TOTAL SOCKS ERRORS (24 EXPERIMENTS)
default	before bugfix	1398
fetchearly	FetchDirInfoEarly 1	2896
bugfix	default, after bugfix	1

comment:54 Changed 7 years ago by arma

Well, that sure does look promising.

It makes sense that setting FetchDirInfoEarly tickled the bug more, since it switches to direct dirport fetches rather than tunneled ORPort fetches.

You still have a few percent of relays who don't bootstrap for a while. Another smaller bug, or what?

I wonder what the one socks failure was. :) I guess it could have simply been a normal 120 second timeout.

Does that mean it's time to try the actual #6341 experiment again?

Changed 7 years ago by robgjansen

comment:55 in reply to:  54 ; Changed 7 years ago by robgjansen

Replying to arma:

You still have a few percent of relays who don't bootstrap for a while. Another smaller bug, or what?

After checking through some log files, there doesn't seem to be any trend for the relays who take a while to bootstrap (e.g., it's not mostly from relays with high link latency, or from slow relays). Sadly, I think there might be another smaller bug somewhere.

I wonder what the one socks failure was. :) I guess it could have simply been a normal 120 second timeout.

It was a timeout, but the notice level logs don't give me enough information to tell which relay in the circuit was to blame.

[tor-notice] connection_ap_expire_beginning() We tried for 15 seconds to connect to '70.1.0.0' using exit $A011BB1F2B1359E989ACF13C1DEB8768C6919734~exit20 at 140.1.0.0. Retrying on a new circuit.
[tor-notice] connection_ap_expire_beginning() We tried for 15 seconds to connect to '70.1.0.0' using exit $A011BB1F2B1359E989ACF13C1DEB8768C6919734~exit20 at 140.1.0.0. Retrying on a new circuit.
[tor-notice] connection_ap_expire_beginning() We tried for 15 seconds to connect to '70.1.0.0' using exit $A011BB1F2B1359E989ACF13C1DEB8768C6919734~exit20 at 140.1.0.0. Retrying on a new circuit.
[tor-notice] connection_ap_expire_beginning() We tried for 15 seconds to connect to '70.1.0.0' using exit $A011BB1F2B1359E989ACF13C1DEB8768C6919734~exit20 at 140.1.0.0. Retrying on a new circuit.
[tor-notice] connection_ap_expire_beginning() Tried for 120 seconds to get a connection to 70.1.0.0:80. Giving up. (waiting for circuit)

I checked exit20, but it seemed to be functioning correctly. Looks like some additional debugging fun is in order.

Does that mean it's time to try the actual #6341 experiment again?

Not sure, but I already did anyway. Does this provide any more answers than the last set of graphs?

comment:56 Changed 7 years ago by nickm

So I'm getting a bit caught up in the details here. Should I be merging branch "bug6341" or not, or doing something else entirely, or don't we know yet?

comment:57 in reply to:  56 Changed 7 years ago by arma

Replying to nickm:

So I'm getting kind of caught up on details a bit here. Should I be merging branch "bug6341" or not, or doing something else entirely, or don't we know yet?

Sure, go ahead and merge it. Either 0.2.3.x or 0.2.4.x is fine with me.

The simulations aren't definitive about performance impact, but I think they show that it doesn't a) crash or b) totally break things.

comment:58 in reply to:  55 ; Changed 7 years ago by arma

Replying to robgjansen:

Does that mean it's time to try the actual #6341 experiment again?

Not sure, but I already did anyway. Does this provide any more answers than the last set of graphs?

It does provide more answers, insofar as the 0813 graphs look fantastic but only because there's less load on the system overall. The 0909 graphs look less impressive, but the load seems more realistic.

The p2p-client time-to-first-byte (and time-to-last-byte) is worse with the patched ewma case. That's fine by me. Actually, it looks like time-to-first-byte is crummy for p2p clients in all three cases. Perhaps that's because they're blurting out many connection requests in parallel.

The legend in 'network data read/written over time' is particularly poorly placed: what happens to the blue line underneath it? :)

I guess my main question from considering these graphs is: what is the variance on these results? If you run with the same network but a different random seed, do you get the same general results or are they different by a bit? It looks like for the topic at hand, 'different by a bit' could easily swing things from 'helps' to 'hurts' or vice versa. Does that mean we should close #6341 (now that we fixed the major Shadow bug) and move on to simulating something that we think ought to make a huge difference?

comment:59 Changed 7 years ago by nickm

Status: needs_review → new

Merged bug6341_a_v2. Leaving this open so arma and rob can talk; please turn off the lights on your way out of the ticket. :)

comment:60 in reply to:  58 Changed 7 years ago by robgjansen

Replying to arma:

The legend in 'network data read/written over time' is particularly poorly placed: what happens to the blue line underneath it? :)

There's not much room for the legend in the data-over-time plots, and it's not worth the effort to adjust the legend for every simulation run I do. I guess I could place the legend in the upper left for the "Read" graph and in the lower right for the "Write" graph...

I guess my main question from considering these graphs is: what is the variance on these results? If you run with the same network but a different random seed, do you get the same general results or are they different by a bit? It looks like for the topic at hand, 'different by a bit' could easily swing things from 'helps' to 'hurts' or vice versa.

Sounds like a good discussion for #5398.

Does that mean we should close #6341 (now that we fixed the major Shadow bug) and move on to simulating something that we think ought to make a huge difference?

Sounds like a good plan.

Regarding bootstrapping: this Shadow patch has improved things further. It turns out our attempt to auto-tune TCP was creating tiny [virtual-]kernel buffers when the end-to-end bandwidth-delay product was tiny. After a quick test, the remaining small percentage of relays now finish bootstrapping at the same time as the others.

I thought our original auto-tuning was correct, and I still wonder whether it was. I'm afraid it's not clear to me what the 'correct' auto-tuning behavior is, and I'd rather not browse the kernel source to find out...
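
For anyone hitting the same issue, the failure mode described above can be avoided by flooring the buffer size; a minimal sketch of that idea (illustrative only, not Shadow's actual auto-tuning code, and the floor value is an assumption):

    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative sketch: size the virtual kernel buffer from the
     * bandwidth-delay product, but never below a floor, so that a tiny BDP
     * cannot produce an unusably small buffer. */
    static size_t
    autotune_buffer_size(uint64_t bytes_per_second, double rtt_seconds)
    {
      const size_t min_buffer = 65536; /* assumed floor, e.g. 64 KiB */
      uint64_t bdp = (uint64_t)(bytes_per_second * rtt_seconds);
      return bdp > min_buffer ? (size_t)bdp : min_buffer;
    }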

comment:61 Changed 7 years ago by robgjansen

Resolution: fixed
Status: new → closed

Turning off the lights.

comment:62 Changed 7 years ago by nickm

Keywords: tor-relay added

comment:63 Changed 7 years ago by nickm

Component: Tor Relay → Tor