best_priority() can starve the worker threads of good relays

Trac:
Parent Ticket: #28663 (moved)

added component::core tor/sbws milestone::sbws: 1.0.x-final owner::juga parent::28663 priority::medium resolution::fixed reviewer::ahf severity::normal status::closed type::defect version::sbws 1.0.2 labels

Trac:
Description:

best_priority() tries to measure unmeasured and failing relays first.

But if fraction_relays or min_relays always fail, those relays will always end up first in the priority queue. (More precisely, those relays will end up first in the priority queue, until the results of the good relays ~~time out~~ are discarded for being too old.)

Thinking about starvation is complicated, because of the freshness_reduction_factor on some errors.

Here's a very simple algorithm that avoids starving good relays for failed relays:

Count the number of times that sbws has attempted to get a result from each relay.
Test the relays with the lowest number of attempts first. (Don't check if the attempt succeeded or failed.)
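
The two steps above can be sketched in a few lines of Python. This is only an illustration of the rule, not sbws code: `prioritize_by_attempts`, the fingerprint strings, and the `(fingerprint, succeeded)` tuples are all hypothetical names and shapes chosen for the example.

```python
from collections import Counter


def prioritize_by_attempts(relays, attempts):
    """Order relays by how many measurement attempts each one has had.

    `relays` is a list of relay fingerprints and `attempts` is a list of
    (fingerprint, succeeded) tuples. Both shapes are assumptions made for
    this sketch, not sbws data structures.
    """
    # Count every attempt, ignoring whether it succeeded or failed.
    counts = Counter(fp for fp, _succeeded in attempts)
    # sorted() is stable, so relays with equal counts keep their input order.
    return sorted(relays, key=lambda fp: counts[fp])


relays = ["A", "B", "C"]
# "A" has failed three times, "B" succeeded once, "C" was never tried.
attempts = [("A", False), ("A", False), ("A", False), ("B", True)]
order = prioritize_by_attempts(relays, attempts)
```

Because the sort key ignores success or failure, the repeatedly failing relay "A" moves to the back instead of staying at the front of the queue.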

For this priority rule to work, every time a relay is queued, it must get a result. Here's how we can make that happen:

Modify result_putter_error() to store an error result to the queue.
Make sure timeouts store an error result to the queue.
Add a unit test and integration test that makes sure every queued relay has a result.
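
A minimal sketch of the first item, assuming a thread pool such as `multiprocessing.pool.ThreadPool` and a plain `queue.Queue` for results. The `measure` function, the relay names, and the tuple-shaped results are hypothetical stand-ins; the real `result_putter_error()` and result classes may look quite different. Timeouts are not modelled here.

```python
import queue
from multiprocessing.pool import ThreadPool

results = queue.Queue()


def measure(relay):
    # Hypothetical stand-in for a measurement: relays named "bad*" raise.
    if relay.startswith("bad"):
        raise RuntimeError("measurement failed for " + relay)
    return (relay, "success")


def make_error_callback(relay):
    # Bind the relay so the error callback can store a result for it.
    def on_error(exc):
        results.put((relay, "error: " + str(exc)))
    return on_error


queued = ["good1", "bad1", "good2"]
with ThreadPool(2) as pool:
    pending = [
        pool.apply_async(measure, (relay,), callback=results.put,
                         error_callback=make_error_callback(relay))
        for relay in queued
    ]
    # Wait for every job; callbacks run before a job is marked as ready.
    for job in pending:
        job.wait()

# Drain the queue into a dict keyed by relay.
stored = {}
while not results.empty():
    relay, outcome = results.get()
    stored[relay] = outcome
```

The final dictionary is also roughly what a unit test for the third item could check: every relay that was queued has exactly one stored outcome, whether success or error.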

Here's an alternative that might be simpler to implement:

before a relay is queued using pool.apply_async() in run_speedtest(), store a ResultAttempt to the queue
only count ResultAttempts when prioritising relays
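
This alternative could look roughly like the sketch below. `ResultAttempt` here is a hypothetical dataclass written for the example, and `queue_relay`, `measure`, and the relay names are likewise assumptions, not existing sbws code.

```python
import queue
import time
from collections import Counter
from dataclasses import dataclass, field
from multiprocessing.pool import ThreadPool


@dataclass
class ResultAttempt:
    # Hypothetical record written when a relay is queued for measurement.
    fingerprint: str
    queued_at: float = field(default_factory=time.time)


result_queue = queue.Queue()


def measure(fingerprint):
    # Hypothetical stand-in: the "flaky" relay never produces a result.
    if fingerprint == "flaky":
        raise RuntimeError("no result")
    return fingerprint


def queue_relay(pool, fingerprint):
    # Record the attempt before handing the relay to the pool, so it is
    # counted even if the measurement raises or never returns.
    result_queue.put(ResultAttempt(fingerprint))
    return pool.apply_async(measure, (fingerprint,))


with ThreadPool(2) as pool:
    pending = [queue_relay(pool, fp) for fp in ["ok", "flaky", "flaky"]]
    for job in pending:
        job.wait()

# Only ResultAttempt entries are counted when prioritising.
entries = []
while not result_queue.empty():
    entries.append(result_queue.get())
attempt_counts = Counter(
    e.fingerprint for e in entries if isinstance(e, ResultAttempt)
)
```

The attempt count no longer depends on the error path storing anything, which is why this variant might be simpler to get right.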

There's one problem with this scheme: if many new relays join the network every hour, then they will starve older relays. But that's a problem for the bad relays people, not sbws.

To avoid this problem, we could have two queues/pools: one for unmeasured relays, and one for measured relays. (Torflow does something like this, by having ~8 measured partitions, and an unmeasured partition.)
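
One possible reading of the two-queue idea, as a sketch only: split the relays into an unmeasured group and a measured group, then alternate between them so that neither group can starve the other. The strict one-to-one alternation, the function names, and the fingerprints are assumptions for illustration; Torflow's partitioning is not modelled here.

```python
from itertools import zip_longest


def split_relays(relays, measured):
    """Split relays into (unmeasured, measured) lists, keeping input order."""
    unmeasured_group = [fp for fp in relays if fp not in measured]
    measured_group = [fp for fp in relays if fp in measured]
    return unmeasured_group, measured_group


def interleave(unmeasured_group, measured_group):
    """Alternate between the two groups until both are exhausted."""
    order = []
    for new_fp, old_fp in zip_longest(unmeasured_group, measured_group):
        if new_fp is not None:
            order.append(new_fp)
        if old_fp is not None:
            order.append(old_fp)
    return order


relays = ["new1", "new2", "new3", "old1", "old2"]
measured = {"old1", "old2"}
unmeasured_group, measured_group = split_relays(relays, measured)
order = interleave(unmeasured_group, measured_group)
```

With this ordering, a burst of new relays delays each older relay by at most one slot per new relay ahead of it, rather than pushing all older relays to the end.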

Replying to teor:

best_priority() tries to measure unmeasured and failing relays first.

But if fraction_relays or min_relays always fail, those relays will always end up first in the priority queue. (More precisely, those relays will end up first in the priority queue, until the results of the good relays ~~time out~~ are discarded for being too old.)

Thinking about starvation is complicated, because of the freshness_reduction_factor on some errors.

Here's a very simple algorithm that avoids starving good relays for failed relays:

Count the number of times that sbws has attempted to get a result from each relay.

This is already done when writing the results: ResultError and ResultSuccess keep that.

Test the relays with the lowest number of attempts first. (Don't check if the attempt succeeded or failed.)

This is what I was proposing by commenting the part where it prioritizes ResultError measurements.

For this priority rule to work, every time a relay is queued, it must get a result. Here's how we can make that happen:

Modify result_putter_error() to store an error result to the queue.

result_putter already writes ResultError.

There are two other bugs here. result_putter_error() is only called when:

The relay being measured doesn't have a descriptor (#28870 (moved))
The operator hits KeyboardInterrupt (#28869 (moved))

AFAIK, there are no other cases where the error callback is called.

Make sure timeouts store an error result to the queue.

Add a unit test and integration test that makes sure every queued relay has a result.

Testing this is hard, but I'll see.

Here's an alternative that might be simpler to implement:

before a relay is queued using pool.apply_async() in run_speedtest(), store a ResultAttempt to the queue

only count ResultAttempts when prioritising relays

I don't see this as easier. I'll evaluate it after other changes have been made in #28663 (moved).

I have this branch https://github.com/juga0/sbws/commits/bug28868, but it's missing the test.

Trac:
Owner: N/A to juga
Status: new to assigned

Add a unit test and integration test that makes sure every queued relay has a result. Maybe this could be done as part of #28566 (moved) instead?

#28933 (moved) runs the actual scanner. It does not check that all the relays get measured, though in the test network they are.

Created PR without the tests: https://github.com/torproject/sbws/pull/328

Trac:
Status: assigned to needs_review

Replying to juga:

Replying to teor:

best_priority() tries to measure unmeasured and failing relays first.

But if fraction_relays or min_relays always fail, those relays will always end up first in the priority queue. (More precisely, those relays will end up first in the priority queue, until the results of the good relays ~~time out~~ are discarded for being too old.)

Thinking about starvation is complicated, because of the freshness_reduction_factor on some errors.

Here's a very simple algorithm that avoids starving good relays for failed relays:

Count the number of times that sbws has attempted to get a result from each relay.

This is already done when writing the results: ResultError and ResultSuccess keep that.

But some failures do not write a ResultError.

Test the relays with the lowest number of attempts first. (Don't check if the attempt succeeded or failed.)

This is what I was proposing by commenting the part where it prioritizes ResultError measurements.

I don't understand what you mean here. Can you link to the comment, or quote it?

For this priority rule to work, every time a relay is queued, it must get a result. Here's how we can make that happen:

Modify result_putter_error() to store an error result to the queue.

result_putter already writes ResultError.

But result_putter_error() is called when there is an exception in apply_async(), and it does not write ResultError.

There are two other bugs here. result_putter_error() is only called when:

The relay being measured doesn't have a descriptor (#28870 (moved))

The operator hits KeyboardInterrupt (#28869 (moved))

AFAIK, there are no other cases where the error callback is called.

The code is complicated, so it could throw other exceptions that you haven't seen yet. Future code changes could also add more exceptions.

Make sure timeouts store an error result to the queue.

Add a unit test and integration test that makes sure every queued relay has a result.

Testing this is hard, but I'll see.

Replying to juga:

Add a unit test and integration test that makes sure every queued relay has a result. Maybe this could be done as part of #28566 (moved) instead?

#28933 (moved) runs the actual scanner. It does not check that all the relays get measured, though in the test network they are.

Created PR without the tests: https://github.com/torproject/sbws/pull/328

You'll also need to update the documentation: https://github.com/torproject/sbws/blob/master/docs/source/specification.rst#simple-relay-prioritization

Trac:
Reviewer: N/A to ahf

Doing a revision on this before it is reviewed, to address teor's comments.

Trac:
Status: needs_review to needs_revision

Replying to teor:

Here's a very simple algorithm that avoids starving good relays for failed relays:

Count the number of times that sbws has attempted to get a result from each relay.

This is already done when writing the results: ResultError and ResultSuccess keep that.

But some failures do not write a ResultError.

Test the relays with the lowest number of attempts first. (Don't check if the attempt succeeded or failed.)

This is what I was proposing by commenting the part where it prioritizes ResultError measurements.

I don't understand what you mean here. Can you link to the comment, or quote it?

Sorry, I don't remember now where I said that, but I think I misunderstood you. I think this adds more complexity but might help to get more eligible relays. What if we open a new ticket for that?

For this priority rule to work, every time a relay is queued, it must get a result. Here's how we can make that happen:

Modify result_putter_error() to store an error result to the queue.

result_putter already writes ResultError.

But result_putter_error() is called when there is an exception in apply_async(), and it does not write ResultError.

Ah, I get you now, you're right. This might need some more changes. What if we also open a new ticket for this?

You'll also need to update the documentation: https://github.com/torproject/sbws/blob/master/docs/source/specification.rst#simple-relay-prioritization

OK, updated.

New tickets are a good idea. I get lost in big comment threads, and in big tickets.

I created two child tickets, but there are still more things in the ticket description that I didn't implement in the PR https://github.com/torproject/sbws/pull/328. What I implemented was basically not prioritizing relays that failed to be measured, which is one of the two things (the other is #28897 (moved)) that I believe make sbws stall. Setting to needs_review again.

Trac:
Status: needs_revision to needs_review

The code in PR328 looks reasonable to me. I added a minor comment to a boolean expression, but nothing blocking.

I still feel that I don't know the codebase well enough to say if changes are net positive/negative for the overall codebase, but I trust juga to get those details right.

Trac:
Status: needs_review to merge_ready

Thanks, merged!

Trac:
Resolution: N/A to fixed
Status: merge_ready to closed

mentioned in issue #29156 (moved)

mentioned in issue #29157 (moved)

moved to tpo/network-health/sbws#28868 (closed)

mentioned in issue tpo/network-health/sbws#29156 (closed)

mentioned in issue tpo/network-health/sbws#29157 (closed)
