Opened 6 months ago

Closed 4 months ago

#28868 closed defect (fixed)

best_priority() can starve the worker threads of good relays

Reported by: teor Owned by: juga
Priority: Medium Milestone: sbws: 1.0.x-final
Component: Core Tor/sbws Version: sbws: 1.0.2
Severity: Normal Keywords:
Cc: juga, teor Actual Points:
Parent ID: #28663 Points:
Reviewer: ahf Sponsor:

Description (last modified by teor)

best_priority() tries to measure unmeasured and failing relays first.

But if fraction_relays or min_relays always fail, those relays will always end up first in the priority queue. (More precisely, those relays will end up first in the priority queue, until the results of the good relays time out are discarded for being too old.)

Thinking about starvation is complicated, because of the freshness_reduction_factor on some errors.

Here's a very simple algorithm that avoids starving good relays for failed relays:

  1. Count the number of times that sbws has attempted to get a result from each relay.
  2. Test the relays with the lowest number of attempts first. (Don't check if the attempt succeeded or failed.)

For this priority rule to work, every time a relay is queued, it must get a result. Here's how we can make that happen"

  • Modify result_putter_error() to store an error result to the queue.
  • Make sure timeouts store an error result to the queue.
  • Add a unit test and integration test that makes sure every queued relay has a result.

Here's an alternative that might be simpler to implement:

  • before a relay is queued using pool.apply_async() in run_speedtest(), store a ResultAttempt to the queue
  • only count ResultAttempts when prioritising relays

Child Tickets

Change History (13)

comment:1 Changed 6 months ago by teor

Description: modified (diff)

comment:2 Changed 6 months ago by teor

There's one problem with this scheme: if many new relays join the network every hour, then they will starve older relays. But that's a problem for the bad relays people, not sbws.

To avoid this problem, we could have two queues/pools: one for unmeasured relays, and one for measured relays. (Torflow does something like this, by having ~8 measured partitions, and an unmeasured partition.)

comment:3 in reply to:  description Changed 6 months ago by juga

Replying to teor:

best_priority() tries to measure unmeasured and failing relays first.

But if fraction_relays or min_relays always fail, those relays will always end up first in the priority queue. (More precisely, those relays will end up first in the priority queue, until the results of the good relays time out are discarded for being too old.)

Thinking about starvation is complicated, because of the freshness_reduction_factor on some errors.

Here's a very simple algorithm that avoids starving good relays for failed relays:

  1. Count the number of times that sbws has attempted to get a result from each relay.

This is already done when writing the results: ResultError and ResultSuccess keep that.

  1. Test the relays with the lowest number of attempts first. (Don't check if the attempt succeeded or failed.)

This's what i was proposing by commenting the part where it prioritizes ResultError measurements.

For this priority rule to work, every time a relay is queued, it must get a result. Here's how we can make that happen"

  • Modify result_putter_error() to store an error result to the queue.

result_putter already writes ResultError.

Here there're two other bugs, result_putter_error, only happens when:

  1. The relay being measured, doesn't have a descriptor (#28870)
  2. The operator hits KeyboardInterrupt (#28869)

AFAIK, there're no other cases where the error callback is called.

  • Make sure timeouts store an error result to the queue.
  • Add a unit test and integration test that makes sure every queued relay has a result.

Testing this is hard, but i'll see.

Here's an alternative that might be simpler to implement:

  • before a relay is queued using pool.apply_async() in run_speedtest(), store a ResultAttempt to the queue
  • only count ResultAttempts when prioritising relays

I don't see this easier. I'll evaluate after other changes has been made in #28663

comment:4 Changed 6 months ago by juga

Owner: set to juga
Status: newassigned

I've this branch https://github.com/juga0/sbws/commits/bug28868, but it's missing the test.

comment:5 Changed 6 months ago by juga

Status: assignedneeds_review

Add a unit test and integration test that makes sure every queued relay has a result.

Maybe this could be done as part of #28566 instead?.

#28933 runs the actual scanner. It is not counting that all the relays get measured, though in the test network this is the case.

Created PR without the tests: https://github.com/torproject/sbws/pull/328

comment:6 in reply to:  5 ; Changed 5 months ago by teor

Replying to juga:

Replying to teor:

best_priority() tries to measure unmeasured and failing relays first.

But if fraction_relays or min_relays always fail, those relays will always end up first in the priority queue. (More precisely, those relays will end up first in the priority queue, until the results of the good relays time out are discarded for being too old.)

Thinking about starvation is complicated, because of the freshness_reduction_factor on some errors.

Here's a very simple algorithm that avoids starving good relays for failed relays:

  1. Count the number of times that sbws has attempted to get a result from each relay.

This is already done when writing the results: ResultError and ResultSuccess keep that.

But some failures do not write a ResultError.

  1. Test the relays with the lowest number of attempts first. (Don't check if the attempt succeeded or failed.)

This's what i was proposing by commenting the part where it prioritizes ResultError measurements.

I don't understand what you mean here.
Can you link to the comment, or quote it?

For this priority rule to work, every time a relay is queued, it must get a result. Here's how we can make that happen"

  • Modify result_putter_error() to store an error result to the queue.

result_putter already writes ResultError.

But result_putter_error() is called when there is an exception in apply_async(), and it does not write ResultError.

Here there're two other bugs, result_putter_error, only happens when:

  1. The relay being measured, doesn't have a descriptor (#28870)
  2. The operator hits KeyboardInterrupt (#28869)

AFAIK, there're no other cases where the error callback is called.

The code is complicated, so it could throw other exceptions that you haven't seen yet. Future code changes could also add more exceptions.

  • Make sure timeouts store an error result to the queue.
  • Add a unit test and integration test that makes sure every queued relay has a result.

Testing this is hard, but i'll see.

Replying to juga:

Add a unit test and integration test that makes sure every queued relay has a result.

Maybe this could be done as part of #28566 instead?.

#28933 runs the actual scanner. It is not counting that all the relays get measured, though in the test network this is the case.

Created PR without the tests: https://github.com/torproject/sbws/pull/328

You'll also need to update the documentation:
https://github.com/torproject/sbws/blob/master/docs/source/specification.rst#simple-relay-prioritization

comment:7 Changed 5 months ago by asn

Reviewer: ahf

comment:8 Changed 5 months ago by juga

Status: needs_reviewneeds_revision

Doing a revision on this before is reviewed, to address teor's comments

comment:9 in reply to:  6 Changed 5 months ago by juga

Replying to teor:

Here's a very simple algorithm that avoids starving good relays for failed relays:

  1. Count the number of times that sbws has attempted to get a result from each relay.

This is already done when writing the results: ResultError and ResultSuccess keep that.

But some failures do not write a ResultError.

  1. Test the relays with the lowest number of attempts first. (Don't check if the attempt succeeded or failed.)

This's what i was proposing by commenting the part where it prioritizes ResultError measurements.

I don't understand what you mean here.
Can you link to the comment, or quote it?

sorry, i don't remember now where i said that, but i think i missunderstand you.
I think this adds more complexity but might help to get more eligible relays.
What if we open a new ticket for that?

For this priority rule to work, every time a relay is queued, it must get a result. Here's how we can make that happen"

  • Modify result_putter_error() to store an error result to the queue.

result_putter already writes ResultError.

But result_putter_error() is called when there is an exception in apply_async(), and it does not write ResultError.

Ah, i get you now, you're right.
This might need some more changes.
What if we also open a new ticket for this?.

You'll also need to update the documentation:
https://github.com/torproject/sbws/blob/master/docs/source/specification.rst#simple-relay-prioritization

ok, updated

comment:10 Changed 5 months ago by teor

New tickets are a good idea. I get lost in big comment threads, and in big tickets.

comment:11 Changed 5 months ago by juga

Status: needs_revisionneeds_review

I created two children tickets, but there're still more things in the ticket description that i didn't implemented in the PR ​https://github.com/torproject/sbws/pull/328.
What i implemented was basically not prioritizing relays that failed to be measured, which is one of the two things (the other is #28897) i believe makes sbws stall.
Setting to needs_review again.

comment:12 Changed 4 months ago by ahf

Status: needs_reviewmerge_ready

The code in PR328 looks reasonable to me. I added a minor comment to a boolean expression, but nothing blocking.

I still feel that I don't know the codebase well enough to say if changes are net positive/negative for the overall codebase, but I trust juga to get those details right.

comment:13 Changed 4 months ago by juga

Resolution: fixed
Status: merge_readyclosed

Thanks, merged!

Note: See TracTickets for help on using tickets.