Opened 5 years ago

Closed 4 years ago

Last modified 4 years ago

#9969 closed enhancement (implemented)

We launch 50 microdesc requests, spread out over just three guards?

Reported by: arma
Owned by:
Priority: High
Milestone: Tor: 0.2.6.x-final
Component: Core Tor/Tor
Version:
Severity:
Keywords: tor-client, nickm-patch
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

Our choice of how many microdescs to fetch per directory request is totally irrational now; the network has grown so much since we chose those parameters.

Oct 11 13:32:44.779 [info] launch_descriptor_downloads(): Launching 50 requests for 4574 microdescs, 92 at a time

Seriously, we're making 50 requests and spreading them over 3 guards? I guess it's nice that we can get partial answers, but generally partial answers don't help us.

We should fetch many more at a time if it's going over a begindir channel.

Child Tickets

Attachments (1)

0001-Increase-objs-requested-tunneled-dir-connection.patch (9.6 KB) - added by arlolra 5 years ago.


Change History (28)

comment:1 Changed 5 years ago by nickm

Right. The original reasons to request no more than X objects per directory connection were:

  • Because sending a huge list of digests before we get any reply seemed questionable.
  • Because some HTTP proxies limited the URL length.
    • Incidentally, _we_ limit the URL length implicitly with the MAX_HEADERS_LEN field.

I think we should increase the number of objects we're willing to request per tunneled directory connection. Additionally, we should avoid having multiple simultaneous in-flight connections downloading the same info from the same directory guard.

comment:2 Changed 5 years ago by nickm

Priority: normal → major

comment:3 Changed 5 years ago by karsten

A few thoughts:

  • Would it help if we implemented /tor/micro/all which is mentioned in dir-spec section 5.2 "Downloading router descriptors or microdescriptors" but which is not implemented yet? Of course, then clients would download the bulk of microdescriptors from a single directory.
  • Do we have to include full digests in requests, or would it be sufficient to ask for the first few digest bytes? Assuming that clients would only accept descriptors matching locally stored full digests. For example, requests could contain only the first 4 (or 8) base64 chars representing the first 3 (or 6) digest bytes. Directories could accept any multiple of 4 base64 chars.
  • Mixing the two ideas, how about we add a way to ask for 1/2, 1/4, etc. of all microdescriptors in a single request? The request could be /tor/micro/all/<base64-prefix>/<bits>, so that /tor/micro/all/A/1 means all digests starting with 0 binary, /tor/micro/all/w/2 means all digests starting with 11 binary, etc. Clients could decide how many requests to send from the number of descriptors they need, which may change over time.

Each of these ideas requires us to upgrade authorities and caches before clients will be able to use them.

comment:4 Changed 5 years ago by nickm

Those are cool ideas that will need new proposals. I'm thinking that the second two ideas are more or less equivalent. I'm opening a new ticket (#10871) for this, since the issues are nontrivial, and I'd like this one to be about hunting for things we can do with the existing caches in order to make clients behave less nuttily.

comment:5 Changed 5 years ago by arlolra

Status: new → needs_review

Sorry, forgot to submit this at the hack day. Steer me in the right direction.

comment:6 Changed 5 years ago by nickm

Will look soon! Please poke me if I haven't done so by the middle of the week.

comment:7 Changed 5 years ago by nickm

Status: needs_review → needs_revision

In dir_fetch_type:

  • I don't like those complicated conditional expressions. In particular, every time I see something like the following, I need to go look at a table of operator precedence. I think that in this case it is wrong: bitwise OR binds more tightly than ?, so the result of the conditional will always be "MICRODESC_DIRINFO"... but the problem here is probably that conditional expressions are Just Too Error-Prone.
          return V3_DIRINFO |
                 (resource && !strcmp(resource, "microdesc") ? MICRODESC_DIRINFO :
                                                               NO_DIRINFO);
    

In directory_send_command:

  • Not all directory requests are tunneled over Tor. (Some connections from nodes to authorities in particular are not tunneled.) Should we perhaps retain this warning in the non-tunneled case?

In initiate_descriptor_downloads:

  • Changing from 8 to 4 seems wrong; you've left off the space for the terminating NUL character.
  • Perhaps we should refactor this thing to use a smartlist and smartlist_join_strings(), to avoid the possibility of further such mistakes in the code.

In get_min_requests:

  • This function probably needs a more descriptive name
  • Why is the minimum higher now? We want to be launching fewer requests. Two per guard still seems like a high minimum. This feels more like a maximum than a minimum

Overall:

  • This needs unit tests; let me know if you need any help writing them.
  • Are we doing anything to make sure that all of the requests don't all go to the same guard?
  • Are we doing anything to lower the _maximum_ number of concurrent requests?

comment:8 in reply to:  7 Changed 5 years ago by arlolra

Replying to nickm:

In dir_fetch_type:

  • I don't like those complicated conditional expressions. In particular, every time I see something like the following, I need to go look at a table of operator precedence. I think that in this case it is wrong: bitwise OR binds more tightly than ?, so the result of the conditional will always be "MICRODESC_DIRINFO"... but the problem here is probably that conditional expressions are Just Too Error-Prone.
          return V3_DIRINFO |
                 (resource && !strcmp(resource, "microdesc") ? MICRODESC_DIRINFO :
                                                               NO_DIRINFO);
    

Ummm, no, the parentheses guard against the situation you described. But I wholeheartedly agree that if you had to think about it, it's already too complicated.

In initiate_descriptor_downloads:

  • Changing from 8 to 4 seems wrong; you've left off the space for the terminating NUL character.

I only counted 4 characters. "d/" + ".z"
The NUL terminator is accounted for when you backtrack to remove the last separator, so it's included in the +1 per digest.
memcpy(cp-1, ".z", 3);

  • Why is the minimum higher now? We want to be launching fewer requests. Two per guard still seems like a high minimum. This feels more like a maximum than a minimum

Hmm, I thought that's what we had agreed on but I admit to being very tired during that talk.

Thanks for reviewing. I'll take your questions back to the drawing board.

comment:9 Changed 5 years ago by andrea

Milestone: Tor: 0.2.5.x-final → Tor: 0.2.6.x-final

Triage for 0.2.5.x: this should wait for 0.2.6.

comment:10 Changed 5 years ago by nickm

(I'd like to get it early in 0.2.6 though, so please let us know when the patch is updated)

comment:11 Changed 5 years ago by arlolra

Status: needs_revision → needs_review

comment:12 Changed 5 years ago by nickm

Hm. That patch is more complicated than I feared, since it has a few other things in it that are beyond what this ticket needs. I think I should split it up into logical parts before we merge it.

  • I should check whether there's a faster mechanism here than connection_get_by_type_addr_port_purpose. If there is, we should use it. If not, we shouldn't necessarily add one unless this shows up in profiles.
  • In router_pick_directory_server_impl, I think n_busy needs to get reset to 0 before the "goto retry_without_exclude", or things get double-counted.
  • The "busy_out" field for router_pick_directory_server_impl() needs documentation.
  • I like the changes in initiate_descriptor_downloads()... but I'm not so sure about "digest_len += 1". What "NULL" are we talking about? 'digest' isn't NUL-terminated.
  • Gotta make sure it compiles with --enable-gcc-warnings on gcc and clang.
  • How long does it take us to retry the directory fetch after we decide not to launch one because the server is busy? It would be bad to re-check too fast, and it would be bad to re-check too slow. We'd better look at this.
  • Can max_dl_per_request return a higher number for TunnelDirConns for SERVERDESC too?
  • Does max_dl_per_request need to look at whether a particular connection will be tunneled, or does options->TunnelDirConns ensure that all these dir conns actually will be tunneled? I should check.

This is still on my plate, unless somebody else wants to pick up some/all of the above.

comment:13 Changed 5 years ago by arlolra

I think I should split it up into logical parts before we merge it.

I can rebase and split up the patch if it's helpful.

I like the changes in initiate_descriptor_downloads()... but I'm not so sure about "digest_len += 1". What "NULL" are we talking about? 'digest' isn't NUL-terminated.

The difference is that before you were building up one long string so when you base16_encode'd, you didn't want its NUL terminator. In the patch, we're making a smartlist of individual NUL terminated base16_encode'd strings, and then calling smartlist_join_strings on it, which will drop all the NULs.

Hope that makes sense. It's been 6 months since I've looked at this, so still paging it back in.

comment:14 in reply to:  13 Changed 5 years ago by nickm

Replying to arlolra:

I think I should split it up into logical parts before we merge it.

I can rebase and split up the patch if it's helpful.

Up to you; I'd be happy to do it too.

I like the changes in initiate_descriptor_downloads()... but I'm not so sure about "digest_len += 1". What "NULL" are we talking about? 'digest' isn't NUL-terminated.

The difference is that before you were building up one long string so when you base16_encode'd, you didn't want its NUL terminator. In the patch, we're making a smartlist of individual NUL terminated base16_encode'd strings, and then calling smartlist_join_strings on it, which will drop all the NULs.

Hope that makes sense. It's been 6 months since I've looked at this, so still paging it back in.

But digest_len is the length of the input, not the length of the output, right? It gets used like this:

+      base16_encode(cp, enc_digest_len, smartlist_get(digests, lo), digest_len);

If anything, it's enc_digest_len that should get incremented by 1, I think

comment:15 Changed 5 years ago by arlolra

If anything, it's enc_digest_len that should get incremented by 1, I think

Sounds like it.

This should have caught it,

tor_assert(destlen >= srclen*2+1);

But I guess the tests (and my testing) didn't exercise that code path when purpose wasn't DIR_PURPOSE_FETCH_MICRODESC.

Up to you; I'd be happy to do it too.

Go for it. But if anything needs coding/fixing, feel free to ping me.

comment:16 Changed 5 years ago by nickm

The split-up version is now in my branch ticket9969. I'll start making changes on top of it.

comment:17 Changed 5 years ago by nickm

Fixed most of the above issues. In response to my questions:

  • I should check whether there's a faster mechanism here than connection_get_by_type_addr_port_purpose. If there is, we should use it. If not, we shouldn't necessarily add one unless this shows up in profiles.

There isn't a faster one. Let's leave it for now.

  • How long does it take us to retry the directory fetch after we decide not to launch one because the server is busy? It would be bad to re-check too fast, and it would be bad to re-check too slow. We'd better look at this.

It looks like we do a reasonable job here.

  • Can max_dl_per_request return a higher number for TunnelDirConns for SERVERDESC too?

I think we can. Making the fix.

  • Does max_dl_per_request need to look at whether a particular connection will be tunneled, or does options->TunnelDirConns ensure that all these dir conns actually will be tunneled? I should check.

This one is still up in the air. It looks like directory_command_should_use_begindir() can return 0 for several reasons other than having TunnelDirConns==0.

comment:18 Changed 5 years ago by nickm

arlolra -- how do my changes look to you so far? Do you have any ideas for the remaining issue in my last comment?

comment:19 Changed 4 years ago by nickm

This one is still up in the air. It looks like directory_command_should_use_begindir() can return 0 for several reasons other than having TunnelDirConns==0.

I *think* this is better now? The only cases we don't handle from directory_command_should_use_begindir() that are actually reachable are:

  • You have configured a directory authority with no dirport.
  • You're trying to reach someplace whose ORPort you are firewalled from, but whose dirport you can reach.

I think that's okay, though. IIRC, the side effect of sending too many items through squid is that it gets truncated, not rejected?

(And does squid still even act this way?)

Anyways, see branch "ticket9969" in my public repository for this fix, and the rest of Arlo's patch split up. What do you think, Arlo and others?

comment:20 Changed 4 years ago by arlolra

And does squid still even act this way?

Seems to be 8k these days, but does that mean people upgraded?
https://bazaar.launchpad.net/~squid/squid/3-trunk/view/head:/src/defines.h#L141

The calculation 4096 - strlen(http://255.255.255.255/tor/server/d/.z) == 4058 might want to consider an IPv6 address, but it doesn't make any practical difference.

You have configured a directory authority with no dirport.

Did you mean or_port?

What do you think, Arlo and others?

The changes you made look good. It's a little unsatisfying that there are some leftover cases, but they don't look too bad.

comment:21 Changed 4 years ago by nickm

Keywords: nickm-patch added

Add the nickm-patch keyword to some needs_review tickets where I wrote or substantially revised the patch. This helps me find which tickets I should review and which I should find reviewers for.

comment:22 Changed 4 years ago by nickm

Keywords: andrea-review added

comment:23 Changed 4 years ago by andrea

Keywords: andrea-review removed

2e168566654957cb708d7484c28778e110384ae4:

  • Simple typo fix; looks fine

c00b397992edefb4507f2c1408e289243f5c7916:

  • This one looks good to me

f591a4d94cfa6a8ad17fd126e9736196b10a266a:

  • This looks good to me

f752093e16a8a492f2b9b14255211f68548dc060:

  • This seems okay

29f15a97edb05d175b97154e0b1c96fd04485ee2:

  • Hmm, we use if (busy) and infer that "the reason that we got no server is that servers are busy", but what if router_pick_directory_server_impl() only skipped some possible servers for being busy? Are there other reasons it might fail while having seen some busy servers along the way, and should the non-zero busy count take precedence in that case?

21d5dbd474d5dad10a2bfa800df078f7fdc8c40b:

  • I think this looks okay given subsequent fixes

5ed5ac185bf6f30438af1638f30e04418ed27aff:

  • This looks good to me.

bb137e23c1c30f7e9f469d4924bbce2bb9b2d2ed:

  • This looks fine

cae0e7b06bcb75494f75cb29fc0f9a356c284bf4:

  • Looks fine

06bda506003826bf9a28aec3afe0b7b1ae6cc9c0:

  • Looks fine

02464694b2fa901e04b8a8f1c467a0773e6d4c27:

  • Yay standards compliance! :)

482e3cfa0969775233d3f903639c44f32ddaf820:

  • Looks fine to me

55b21b366c4a8c237dda0a967c0c499e18fb0b4c:

  • Good catch; I think this is different from that potential issue I mentioned for 29f15a97edb05d175b97154e0b1c96fd04485ee2.

0fdfdae7e3a88a9172a51f36ac6b536b5687d401:

  • This looks fine

055ad9c5fb0de4295ccb05357d6a2d8ad29d03e4:

  • Looks okay to me

6523eff9b3b5e06521da010e238b4cd23ed24e82:

  • Looks okay to me

fa80983e52cd5213ca2019024eafdc846a720f99:

  • Looks okay to me

ac9b0a3110ea4eea63133c6d2e3572b2cfd22bd6:

  • This looks okay, but why 500 vs. 1000?

comment:24 in reply to:  23 Changed 4 years ago by arlolra

29f15a97edb05d175b97154e0b1c96fd04485ee2:

  • Hmm, we use if (busy) and infer that "the reason that we got no server is that servers are busy", but what if router_pick_directory_server_impl() only skipped some possible servers for being busy? Are there other reasons it might fail while having seen some busy servers along the way, and should the non-zero busy count take precedence in that case?

busy only seems to be used to avoid the call to mark_all_dirservers_up, which still seems reasonable even in those cases, no?

comment:25 Changed 4 years ago by nickm

Resolution: implemented
Status: needs_review → closed
Type: defect → enhancement

Just noticed a thing: the "goto retry_without_exclude" in router_pick_directory_server needs to be conditional.

Merged, cleaned up. The merge was complicated; somebody should probably have a look at it.

The merge is 5d4bb6f61f360ddcbee83476ff2f93a13f3a19c0.

The tweak is 6c443e987d318aabb92fc8acc25cf6ce82692118.

Changes file is 0b121433464804889240a6884179fa4631f0d7b0

comment:26 Changed 4 years ago by nickm

WRT 500 vs 1000: I think that 500 seemed safer to me because I wanted to avoid a situation where we send a Huuuuuuge request, and the response gets cut off partway.

comment:27 Changed 4 years ago by arlolra

lgtm

time to find a new bug to work on
