Opened 5 months ago

Closed 3 months ago

#28866 closed defect (fixed)

ResultDump.queue.put() can hang if the queue is full

Reported by: teor Owned by: juga
Priority: Medium Milestone: sbws: 1.0.x-final
Component: Core Tor/sbws Version: sbws: 1.0.2
Severity: Normal Keywords:
Cc: juga, teor Actual Points: 1
Parent ID: #28663 Points: 1
Reviewer: mikeperry Sponsor:

Description

sbws calls ResultDump.queue.put() in blocking mode:
https://github.com/torproject/sbws/blob/ee64d76df54ceb3a3c9e1e2a797fd70d68bb0035/sbws/core/scanner.py#L303

But multiprocessing callbacks need to return immediately:
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool.apply_async

So sbws should call put() without blocking, or with a (very small) timeout:
https://docs.python.org/3/library/queue.html#queue.Queue.put

Child Tickets

Change History (15)

comment:1 in reply to:  description Changed 5 months ago by juga

Replying to teor:

sbws calls ResultDump.queue.put() in blocking mode:
https://github.com/torproject/sbws/blob/ee64d76df54ceb3a3c9e1e2a797fd70d68bb0035/sbws/core/scanner.py#L303

But multiprocessing callbacks need to return immediately:
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool.apply_async

So sbws should call put() without blocking, or with a (very small) timeout:
https://docs.python.org/3/library/queue.html#queue.Queue.put

I'm not sure we can make this callback not blocking, since it'll be accessing the same part of memory and needs to write to disk. Not sure either about the small timeout.
I'll try, but dunno how i can test all of this.

comment:2 in reply to:  description ; Changed 5 months ago by juga

After looking more at this.

Replying to teor:

sbws calls ResultDump.queue.put() in blocking mode:
https://github.com/torproject/sbws/blob/ee64d76df54ceb3a3c9e1e2a797fd70d68bb0035/sbws/core/scanner.py#L303

put is not blocking because it happens in result_dump's thread (result_dump launches its own thread before the the pool is instantiated)

But multiprocessing callbacks need to return immediately:
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool.apply_async

So sbws should call put() without blocking, or with a (very small) timeout:
https://docs.python.org/3/library/queue.html#queue.Queue.put

With put timeout it will always return Full unless the slot is inmediatly available (https://docs.python.org/3/library/queue.html#queue.Queue.put)
It's also very unlikely that the queue will be full, since results happen with a difference of seconds and worst case there're only 6000.
A way to solve this without a different thread, queue and lock in result_putter, would be to use a deque, but i don't think it's necessary, since i think the bug is in #28897.

comment:3 in reply to:  2 Changed 4 months ago by juga

Status: newneeds_review

Replying to juga:

So sbws should call put() without blocking, or with a (very small) timeout:
https://docs.python.org/3/library/queue.html#queue.Queue.put

With put timeout it will always return Full unless the slot is immediatly available (https://docs.python.org/3/library/queue.html#queue.Queue.put)

i miss-interpreted this, it doesn't return Full.

Implemented the timeout in check whether the queue is full:
https://github.com/torproject/sbws/pull/325

comment:4 Changed 4 months ago by asn

Reviewer: mikeperry

comment:5 Changed 3 months ago by juga

Owner: set to juga
Status: needs_reviewassigned

Changing to assigned because i discovered a bug in the PR, i'll change again to needs_review when i fix it.

comment:6 Changed 3 months ago by juga

Points: 1
Status: assignedneeds_review

I think i solved the bug.

comment:7 Changed 3 months ago by mikeperry

Status: needs_reviewneeds_revision

Hrmm. I might need to look at a lot more of the sbws code, but I don't think the patch is the best way to address the possibility that the queue may be full. If the queue can also be added from another thread, then there still can be a race where it ends up full after the loop check but before the next statement. I would instead make it do something like:

while True:
  try:
    result_dump.queue.put(measurement_result, timeout=1)
  except Full as e:
    log.info('The results queue is full, after 1 second it should not be full')

That way, if the timeout hits because the queue is full for a second, the exception is thrown, and then you log, and then you try again for another second (at which point it should succeed, but in case it does not, you just log and try again rather than raise an uncaught Full exception).

Last edited 3 months ago by mikeperry (previous) (diff)

comment:8 in reply to:  7 Changed 3 months ago by juga

Replying to mikeperry:

Hrmm. I might need to look at a lot more of the sbws code, but I don't think the patch is the best way to address the possibility that the queue may be full. If the queue can also be added from another thread, then there still can be a race where it ends up full after the loop check but before the next statement.

I've never seen the case where the queue is full and it's very unlikely that it'll get full by other thread, since measurement threads take more than 5 seconds to complete, and the the ResultDump thread is getting measurements from the queue every second (https://github.com/torproject/sbws/blob/ada70d4f85ac7ce6f9847eeae832ec7620106470/sbws/lib/resultdump.py#L589)

I would instead make it do something like:

while True:
  try:
    result_dump.queue.put(measurement_result, timeout=1)
  except Full as e:
    log.info('The results queue is full, after 1 second it should not be full')

That way, if the timeout hits because the queue is full for a second, the exception is thrown, and then you log, and then you try again for another second (at which point it should succeed, but in case it does not, you just log and try again rather than raise an uncaught Full exception).

That was my first though, but the while True will make this call to block forever, and the callback has to return immediately. From https://docs.python.org/3.5/library/multiprocessing.html#multiprocessing.pool.Pool.apply_async:

"Callbacks should complete immediately since otherwise the thread which handles the results will get blocked."

So something like that will be only possible if there was yet another thread handling the callback to don't block the main process (like ResultDump does), but that would add unneeded complexity.

comment:9 Changed 3 months ago by juga

Status: needs_revisionneeds_review

comment:10 Changed 3 months ago by juga

I added a new commit with more documentation on how the results are gotten, after i tried (unsuccessfully) to create a test that would raise Full: https://github.com/torproject/sbws/pull/325/commits/d8bbb66f2d429927651a7bcad8f5381a94ba34f4
Note also the comment in https://github.com/torproject/sbws/pull/325/commits/ada70d4f85ac7ce6f9847eeae832ec7620106470#diff-3b84dbf9a215f274316faeccffc94f21R329.

comment:11 Changed 3 months ago by juga

Merged in a branch with current master and current needs_review tickets and tested a whole loop with the public Tor network

comment:12 Changed 3 months ago by mikeperry

Status: needs_reviewneeds_revision

Ok I now understand a bit more of what's going on here. I am still not in favor of this solution, for the following reasons:

  1. The only way that the closure callback could block in the old code was if the result getter thread died/stalled/deadlocked and did not manage to actually pull a result to make the queue non-full.
  2. The new code has a while loop which will still run forever if the queue is full and nothing pulls from it. It will just log in that case.

It seems like #1 and #2 both block and fail to return in the exact same way. What about:

try:
  result_dump.queue.put(measurement_result, timeout=3)
except Full as e:
  log.warn("The results queue is staying full. Is the getter thread dead?")

This way, the closure will only block for at most 3 seconds, which should only happen if the getter is dead/hung, since the getter should have pulled in just 1 second. Hence the warn in that case.

comment:13 Changed 3 months ago by juga

Status: needs_revisionneeds_review

I like last solution, i added a commit with it.
Changed slightly the commit message.

comment:14 Changed 3 months ago by mikeperry

Status: needs_reviewmerge_ready

Ok this looks good. Sorry for the delay.

comment:15 Changed 3 months ago by juga

Actual Points: 1
Resolution: fixed
Status: merge_readyclosed

Thanks, merged

Note: See TracTickets for help on using tickets.