Opened 5 years ago

Closed 5 years ago

#8273 closed defect (fixed)

Set flag thresholds and flags based on measured bandwidth

Reported by: nickm Owned by:
Priority: Very High Milestone: Tor: 0.2.4.x-final
Component: Core Tor/Tor Version:
Severity: Keywords: tor-auth
Cc: karsten, mikeperry, aagbsn Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

When an authority has measured bandwidth values, it should set its thresholds for assigning the various flags, and the flags themselves, based on those values, not on advertised values.

Basically, every invocation of router_get_advertised_bandwidth*() in dirserv.c should be replaced if possible.

This is related to #2286 .

Child Tickets

Change History (18)

comment:1 Changed 5 years ago by andrea

Status: newneeds_review

Please review proposed fix in my bug8273 branch.

comment:2 Changed 5 years ago by nickm

Status: needs_reviewneeds_revision

Quick notes:

  • The "/* Check if we cleared the whole thing and free if so */" seems unnecessary. dirserv_clear_measured_bw_cache will free this eventually, right?
  • In case time is unsigned, the expiry check should be something like "e->as_of < now - MX_MEASUREMENT_TIME". This might also help the compiler raise the computation. (Not that it matters.)
  • We should be calling dirserv_clear_measured_bw_cache from dirserv_free_all.
  • Would DIGESTMAP_FOREACH_MODIFY and DIGESTMAP_FOREACH_MODIFY_END make the expiry function easier to write?
  • The node_id argument for dirserv_query_measured_bw_cache should probably be const. The documentation for that function should explain what the return value means, and what exactly happens to the *_out variables.
  • dirserv_get_bandwidth_for_router should take a const routerinfo
  • I think dirserv_cache_measured_bw should take a const parsed_line.

And an issue of medium trickiness:

  • It would be cool to have some unit tests for the new machinery here.

And a trickier issue:

  • When we have a sufficient number/fraction of nodes with measured bandwidth, I think we should stop believing nodes' advertised bandwidths. That is, if the number/fraction of measured nodes is high enough, then we should *never* call unmeasured nodes Fast, or Guard, or HSDir, or anything else that depends on bandwidth. (Plausible?)

comment:3 Changed 5 years ago by nickm

Oh, one thing I should check before the merge: I didn't check whether there were any additional router_get_advertised_bandwidths that should get changed into dirserv_get_bandwidth_for_router.

comment:4 Changed 5 years ago by arma

I just filed #8356 for some questions on the analysis side of this idea.

comment:5 Changed 5 years ago by nickm

Cc: karsten mikeperry added

Karsten (on #8356) IIUC seems to think that "Run it on one authority and see what happens" will be an acceptable way to test this patch once it's revised.

The remaining question that you had (for Mike perhaps) was whether this is likely to set up some kind of horrible feedback loop. I don't *think* it would, assuming it just does what's on the carton and changes how flags are assigned, but a second opinion would sure be nice.

comment:6 Changed 5 years ago by mikeperry

Hrmm.. I just checked the bandwidth authority source, and right now it appears we only bother to scan nodes already marked as "Fast". If we deploy this in combination with the "Unmeasured" cap from #2286, it might be possible that new nodes never get the Fast flag, and thus stay Unmeasured forever. This would be bad...

I suppose an even more important question right now is can this already happen with the current Fast cutoff vs the Unmeasured cap?

comment:7 Changed 5 years ago by mikeperry

Cc: aagbsn added

aagbsn: This might require a bw auth fix. See my comment above.

I don't think it will be too disasterous to change the bw auths to also scan non-Fast nodes, but it might slow down the scanning quite a bit. Is there a better way to handle this problem? One approach might be adding an additional OrNodeRestriction to select all nodes with either Fast or Unmeasured...

comment:8 in reply to:  7 Changed 5 years ago by nickm

Replying to mikeperry:

aagbsn: This might require a bw auth fix. See my comment above.

I don't think it will be too disasterous to change the bw auths to also scan non-Fast nodes, but it might slow down the scanning quite a bit. Is there a better way to handle this problem? One approach might be adding an additional OrNodeRestriction to select all nodes with either Fast or Unmeasured...

I think that's a fine idea. Another option is to scan all nodes which are Fast, plus all nodes which will meet the current Fast threshold if their advertised values turn out to be accurate. I don't have any sense which is better though.

comment:9 in reply to:  6 Changed 5 years ago by arma

Replying to mikeperry:

Hrmm.. I just checked the bandwidth authority source, and right now it appears we only bother to scan nodes already marked as "Fast".

Fun! While that remains true, that makes #8247 a bigger deal.

comment:10 Changed 5 years ago by aagbsn

I pushed the following to my pytorctl and torflow repos:
https://gitweb.torproject.org/user/aagbsn/pytorctl.git/shortlog/refs/heads/8273-add-unmeasured-path-selection-support

https://gitweb.torproject.org/user/aagbsn/torflow.git/shortlog/refs/heads/8273-add-unmeasured-path-selection-support

untested, but I'll find a machine somewhere and set it up shortly. hopefully I dodged all the landmines :-)

comment:11 Changed 5 years ago by aagbsn

I went with mikeperry's suggestion because it was easier; but it would be good to know how much longer the network takes to scan. Does anyone have current numbers on how long a complete measurement takes?

comment:12 Changed 5 years ago by mikeperry

One problem with my suggestion vs Nick's is that mine might flap. I just came down with a cold, so my brain might be misfiring, but here's what I think can happen: If you're scanning Fast or Unmeasured, you see a slow unmeasured node and then scan it. If it turns out to be not-Fast, then the *next* time around, you don't measure it, and it becomes unmeasured again, at which point you measure it.

It's not as bad as never measuring, of course. But I wonder if it will be a substantial problem? My guess is "Probably not", *except* for the fact that this means there will be a *lot* of non-Fast nodes at that 1.0 ratio that we assign to Unmeasured nodes. This could mess up the measurement of nodes that are actually perfectly average (arithmetic mean) in terms of measuring them, because they will now be more likely to be paired with slow nodes that fell out of the Fast status and lost their measured flag.

So maybe an even simpler fix of scanning non-Fast nodes is the best one? K.I.S.S.? The one thing I would suggest if we decide to go that route is to perhaps make an extra entry in the data/bwfiles URL list. (Where the hell is this file btw? It's not in my git checkout? Did we forget to add it?)

comment:13 Changed 5 years ago by aagbsn

Ok, here's a branch without the Fast flag requirements:
https://gitweb.torproject.org/user/aagbsn/pytorctl.git/shortlog/refs/heads/8273-remove-fast-flag

It seems that the bwfiles entry is missing. Do you have the original file?

comment:14 in reply to:  2 ; Changed 5 years ago by andrea

Status: needs_revisionneeds_review

Replying to nickm:

Quick notes:

  • The "/* Check if we cleared the whole thing and free if so */" seems unnecessary. dirserv_clear_measured_bw_cache will free this eventually, right?

It saves one extra block of memory for the map itself when it's empty.

  • In case time is unsigned, the expiry check should be something like "e->as_of < now - MX_MEASUREMENT_TIME". This might also help the compiler raise the computation. (Not that it matters.)

Done.

  • We should be calling dirserv_clear_measured_bw_cache from dirserv_free_all.

Done.

  • Would DIGESTMAP_FOREACH_MODIFY and DIGESTMAP_FOREACH_MODIFY_END make the expiry function easier to write?

Done.

  • The node_id argument for dirserv_query_measured_bw_cache should probably be const. The documentation for that function should explain what the return value means, and what exactly happens to the *_out variables.

Done.

  • dirserv_get_bandwidth_for_router should take a const routerinfo

Done.

  • I think dirserv_cache_measured_bw should take a const parsed_line.

Done.

And an issue of medium trickiness:

  • It would be cool to have some unit tests for the new machinery here.

Done; not really that tricky.

And a trickier issue:

  • When we have a sufficient number/fraction of nodes with measured bandwidth, I think we should stop believing nodes' advertised bandwidths. That is, if the number/fraction of measured nodes is high enough, then we should *never* call unmeasured nodes Fast, or Guard, or HSDir, or anything else that depends on bandwidth. (Plausible?)

That sounds like a good idea potentially; what about other circumstances we use the advertised bandwidth? Should we consider those cases or just concern ourselves with the flags?

comment:15 in reply to:  14 ; Changed 5 years ago by nickm

Replying to andrea:

Replying to nickm:

Quick notes:

  • The "/* Check if we cleared the whole thing and free if so */" seems unnecessary. dirserv_clear_measured_bw_cache will free this eventually, right?

It saves one extra block of memory for the map itself when it's empty.

Okay. I don't think this will make a big difference, but it can't hurt to leave it in.

[...]

And a trickier issue:

  • When we have a sufficient number/fraction of nodes with measured bandwidth, I think we should stop believing nodes' advertised bandwidths. That is, if the number/fraction of measured nodes is high enough, then we should *never* call unmeasured nodes Fast, or Guard, or HSDir, or anything else that depends on bandwidth. (Plausible?)

That sounds like a good idea potentially; what about other circumstances we use the advertised bandwidth? Should we consider those cases or just concern ourselves with the flags?

I think we should, for this ticket, just look at the flags. Other users of advertised bandwidth can get other tickets if they don't have them already. (Shall we open those tickets as we work on this one?)

I'll review the changes you mentioned above Friday (I hope); are you okay with implementing the "trickier issue" thing here? Let's talk online and open the new tickets together, unless you're feeling psyched to do them yourself.

comment:16 Changed 5 years ago by nickm

Status: needs_reviewneeds_revision

(I've reviewed the other changes so far, up to 8027ebb5fdf43b5aa63e498c1f8e3c6b2c87bbb7. They look good to me.

comment:17 in reply to:  15 Changed 5 years ago by andrea

Replying to nickm:

Replying to andrea:

Replying to nickm:

Quick notes:

  • The "/* Check if we cleared the whole thing and free if so */" seems unnecessary. dirserv_clear_measured_bw_cache will free this eventually, right?

It saves one extra block of memory for the map itself when it's empty.

Okay. I don't think this will make a big difference, but it can't hurt to leave it in.

[...]

And a trickier issue:

  • When we have a sufficient number/fraction of nodes with measured bandwidth, I think we should stop believing nodes' advertised bandwidths. That is, if the number/fraction of measured nodes is high enough, then we should *never* call unmeasured nodes Fast, or Guard, or HSDir, or anything else that depends on bandwidth. (Plausible?)

That sounds like a good idea potentially; what about other circumstances we use the advertised bandwidth? Should we consider those cases or just concern ourselves with the flags?

I think we should, for this ticket, just look at the flags. Other users of advertised bandwidth can get other tickets if they don't have them already. (Shall we open those tickets as we work on this one?)

I'll review the changes you mentioned above Friday (I hope); are you okay with implementing the "trickier issue" thing here? Let's talk online and open the new tickets together, unless you're feeling psyched to do them yourself.

I just created 8435; want me to go ahead and implement it? It seems like something we should do, but also like more of a feature than a bugfix, and the deadline for those is past.

comment:18 Changed 5 years ago by nickm

Resolution: fixed
Status: needs_revisionclosed

Fixed in the branch of #8435.

Note: See TracTickets for help on using tickets.