Opened 3 months ago

Closed 13 days ago

#28864 closed defect (fixed)

sbws AsyncResults have no timeout

Reported by: teor Owned by: juga
Priority: Medium Milestone: sbws: 1.1.x-final
Component: Core Tor/sbws Version: sbws: 1.0.2
Severity: Normal Keywords:
Cc: Actual Points:
Parent ID: #28663 Points: 1
Reviewer: nickm Sponsor:

Description

After sbws queues an AsyncResult, it will wait forever for the result to be ready:
https://github.com/torproject/sbws/blob/ee64d76df54ceb3a3c9e1e2a797fd70d68bb0035/sbws/core/scanner.py#L359-L364

If at least one result hangs, then sbws will hang, because AsyncResult.ready() does not have a timeout.

Instead, sbws should call AsyncResult.wait([timeout]) on each result, after calling pool.apply_async() on a large number of results.

See https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.AsyncResult.wait

Child Tickets

Change History (13)

comment:1 Changed 3 months ago by teor

This ticket probably needs to be done at the same time as #28865.

comment:2 in reply to:  description Changed 3 months ago by juga

Replying to teor:

After sbws queues an AsyncResult, it will wait forever for the result to be ready:
https://github.com/torproject/sbws/blob/ee64d76df54ceb3a3c9e1e2a797fd70d68bb0035/sbws/core/scanner.py#L359-L364

If at least one result hangs, then sbws will hang, because AsyncResult.ready() does not have a timeout.

In theory this won't happen, since both circuits and requests have a timeout. Also the sleep happen in the main thread, not the thread that is getting the results

Instead, sbws should call AsyncResult.wait([timeout]) on each result, after calling pool.apply_async() on a large number of results.

See https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.AsyncResult.wait

I'll try this anyway

comment:3 Changed 3 months ago by juga

I looked more at this.
wait is the method of an event (https://github.com/python/cpython/blob/master/Lib/multiprocessing/pool.py#L662), and it blocks (https://docs.python.org/3/library/threading.html?highlight=threading#threading.Event.wait)
sleep is not blocking because happen in the main thread.
Probably the cause (or at least the main one) is #28897.

comment:4 in reply to:  3 Changed 3 weeks ago by juga

Cc: juga teor removed
Milestone: sbws: 1.0.x-finalsbws: 1.1.x-final
Status: newneeds_review

Replying to juga:

I looked more at this.
wait is the method of an event (https://github.com/python/cpython/blob/master/Lib/multiprocessing/pool.py#L662), and it blocks (https://docs.python.org/3/library/threading.html?highlight=threading#threading.Event.wait)

a maximum of timeout.
get with timeout uses wait, and that's what we want since pending results that didn't trigger either callback or callback error have not been gotten, and get give us either the exception or the value.

After merging #28932, this is an improvement, can wait to 1.1 milestone.

https://github.com/torproject/sbws/pull/341

comment:5 Changed 3 weeks ago by juga

Owner: set to juga
Status: needs_reviewassigned

reassigning so that the ticket has owner

comment:6 Changed 3 weeks ago by juga

Status: assignedneeds_review

comment:7 Changed 3 weeks ago by juga

Points: 1

comment:8 Changed 3 weeks ago by juga

Status: needs_reviewneeds_revision

Because of #28865, this needs to be re-thought, to don't make it yet more complicated.

comment:9 Changed 2 weeks ago by juga

Status: needs_revisionneeds_review

Solving also #28865:
https://github.com/torproject/sbws/pull/342
Moving to 1.1 milestone since #28932 check progress and logs error.

comment:10 Changed 2 weeks ago by nickm

Reviewer: nickm

comment:11 Changed 2 weeks ago by nickm

Status: needs_reviewmerge_ready

This code looks reasonable to me. If you have tested it and works for you, then I'd say "merge it!" :)

For my own curiosity: why is the time cutoff set to "the number of releys to measure multiplied by TIMEOUT_MEASUREMENTS (around 90mins)"?

comment:12 in reply to:  11 Changed 2 weeks ago by juga

Replying to nickm:

For my own curiosity: why is the time cutoff set to "the number of releys to measure multiplied by TIMEOUT_MEASUREMENTS (around 90mins)"?

Well spotted. Actually when measuring 300 relays it takes around 90min. In the case a relay doesn't get measured in 3min, there's no need to wait that 300 relays take each of them 3min, it can just finish when there hasn not been any relay measured in that time.
I commited a fixup to do that, tested the whole loop in the public network and works.


comment:13 Changed 13 days ago by juga

Resolution: fixed
Status: merge_readyclosed

Merged.

Note: See TracTickets for help on using tickets.