Opened 8 months ago

Closed 5 months ago

#28663 closed defect (fixed)

sbws stops accumulating, silently

Reported by: stefani Owned by: juga
Priority: Medium Milestone: sbws: 1.1.x-final
Component: Core Tor/sbws Version: sbws: 1.0.2
Severity: Major Keywords: 10-proposed
Cc: Actual Points:
Parent ID: #28639 Points:
Reviewer: Sponsor:

Description

sbws deb 1.0.2-1 stops accumulating data.
The scanner log does not indicate anything when this happens, just the last write/access time.
Restarting sbws mitigates the problem.

E.g. most recently, the last log line was "ResultSuccess" for one relay.
No Python errors were reported, no traceback.

Child Tickets

Ticket   Status  Owner  Summary                                                                                    Component
#28736   closed         Make info level logs more clear and consistent                                             Core Tor/sbws
#28864   closed  juga   sbws AsyncResults have no timeout                                                          Core Tor/sbws
#28865   closed  juga   sbws keeps the number of AsyncResults less than the number of threads                      Core Tor/sbws
#28866   closed  juga   ResultDump.queue.put() can hang if the queue is full                                       Core Tor/sbws
#28868   closed  juga   best_priority() can starve the worker threads of good relays                               Core Tor/sbws
#28869   closed         KeyboardInterrupt will cause a callback error and does not close the thread pool cleanly   Core Tor/sbws
#28870   closed         Stop asserting when there's not a descriptor for a relay being measured                    Core Tor/sbws
#28897   closed  juga   Stop running twice destination usability tests                                             Core Tor/sbws
#28932   closed         Detect when sbws stalls and log backtrace                                                  Core Tor/sbws
#28933   closed  juga   Run one sbws main loop using the test network as part of integration tests                 Core Tor/sbws
#29156   closed         Measure first relays with the lowest number of attempts first.                             Core Tor/sbws
#29157   closed         Modify the threads callback error to store the error                                       Core Tor/sbws

Change History (17)

comment:1 Changed 8 months ago by gk

Component: Applications → Core Tor/sbws

comment:2 Changed 8 months ago by teor

Milestone: sbws: 1.0.x-final

This bug should be fixed before we deploy any more sbws instances.

comment:3 Changed 8 months ago by teor

Parent ID: #28639

This could be #28639 or a separate bug.

comment:4 Changed 8 months ago by juga

stefani, could you copy-paste the exact last line you saw?

comment:5 in reply to:  4 Changed 8 months ago by juga

Replying to juga:

stefani, could you copy-paste the exact last line you saw?

Sorry, no need to do so; I'm quite sure now the line is "<relaynick> <8firstdigitsfp> finished measurement with ResultSuccess".

comment:6 Changed 8 months ago by stefani

Correct, the last line was resultdump.py:544 - handle_result - <relaynick> <digitsfp> finished ...

comment:7 Changed 8 months ago by anadahz

This bug is triggered after ~8 hours of sbws uptime.
I can always reproduce this bug on sbws version 1.0.3-dev0.

comment:8 in reply to:  7 Changed 8 months ago by juga

Replying to anadahz:

This bug is triggered after ~8 hours of sbws uptime.
I can always reproduce this bug on sbws version 1.0.3-dev0.

If you feel like it and have time, you could:

  1. check if you have commit ca7856b6bbfc7d73063b65d99cf903a5ca16cb1d in your version
  2. if you don't have it, update the git repo to the latest master, restart and wait until sbws stalls again
  3. if you have it, say which tor version you're using, paste the tor logs somewhere and link them in #28639

comment:9 in reply to:  7 Changed 8 months ago by juga

Replying to anadahz:

This bug is triggered after ~8 hours of sbws uptime.

8h is the time it would take to complete the first prioritization loop.
This further confirms my theory that sbws might be getting stalled at that point, maybe because of race conditions (#27996).
5 days ago, I commented out these lines in longclaw's sbws and it hasn't stalled anymore:

  1. the line where it returns only a subset of the relays [0], so that it returns all of them instead (the loop then takes around 1 day, but measures all the relays found in the network in the prioritization loop, not only around 300); see the sketch after this list.
  2. the condition where sbws gives more priority to the relays that fail [1]. This is probably not the reason, but it made sbws faster, because there seem to be unstable relays that would get prioritized to be measured but would fail again.
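
For illustration, a minimal sketch of what change 1 amounts to; best_priority, fraction and min_relays are hypothetical names for this sketch, not sbws's actual API (the real code is in relayprioritizer.py, see [0]):

  def best_priority(relays, fraction=0.05, min_relays=50):
      """Hypothetical sketch: return the relays to measure next, most urgent first."""
      ordered = sorted(relays, key=lambda r: r.priority)
      # Original behaviour: hand back only the best-priority subset,
      # roughly a few hundred relays per prioritization loop.
      upper = max(min_relays, int(len(ordered) * fraction))
      return ordered[:upper]
      # Change 1 is equivalent to returning the whole list instead:
      # return ordered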

I'd uncomment those lines one by one and see if it stalls again.

There could still be other reasons, such as the one in #28639.

[0] https://github.com/torproject/sbws/blob/c2b85b3f81456779f5882addbf78713acf8d66c9/sbws/lib/relayprioritizer.py#L89
[1] https://github.com/torproject/sbws/blob/c2b85b3f81456779f5882addbf78713acf8d66c9/sbws/lib/relayprioritizer.py#L66

comment:10 Changed 7 months ago by teor

As I said in https://trac.torproject.org/projects/tor/ticket/28639#comment:12, I think that:

This issue is an instance of #27049: sbws does not open any 3-hop circuits, so it stalls when all 3-hop circuits time out.

I am waiting for someone to confirm by either:

  • having sbws stall, and getting info-level or debug-level logs of the hour before sbws starts stalling, then analysing the log, or
  • having sbws not stall when this fix is applied:

One quick fix is to enable predicted circuits (#28701).
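
A minimal sketch of that quick fix, assuming the scanner's tor is controlled through stem, listens on control port 9051, and had predicted circuits disabled via the __DisablePredictedCircuits option; all three are assumptions about the local setup, not facts from this ticket:

  from stem.control import Controller

  # Hypothetical sketch: re-enable predicted circuits on the tor instance
  # the scanner talks to (port and option usage are assumptions, see above).
  with Controller.from_port(port=9051) as controller:
      controller.authenticate()
      controller.set_conf('__DisablePredictedCircuits', '0')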

comment:11 in reply to:  10 Changed 7 months ago by juga

Replying to teor:

  • having sbws not stall when this fix is applied:

I can confirm that enabling predicted circuits does not solve the problem. (https://trac.torproject.org/projects/tor/ticket/28639#comment:20)
I can also confirm that commenting out the lines (https://trac.torproject.org/projects/tor/ticket/28663#comment:9) (either both or just one of them) solves the problem.
However, it seems to me that there's an underlying problem with threads and locks, and maybe it's just a coincidence that sbws doesn't stall when those lines are commented out.

Trying to figure out what's happening, I realized:

  • in relaylist, self._refresh_lock = Lock(): while it's instantiated, it doesn't seem to be explicitly acquired or released. It's run every 5 min.
  • in resultdump, the directory is locked before writing the result, but the Result itself is not locked
  • in scanner, measure_relay():
    • it first calls Destination.next(), which instantiates an RLock and acquires it, then calls Destination.is_usable, then calls connect_destination_over_circuit(), which instantiates another RLock, and then releases the first RLock (see the sketch after this list).
    • a few lines later, it calls Destination.is_usable again.
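
For illustration, a minimal, self-contained sketch of the locking pattern described above (the class and method names are illustrative, not sbws's actual code): an RLock can be re-acquired by the same thread, so nested usability checks don't deadlock by themselves, while a lock instantiated fresh inside each call provides no mutual exclusion at all:

  import threading

  class Destination:
      """Illustrative only, not sbws's Destination class."""
      def __init__(self):
          # One lock shared by all callers protects the destination state...
          self._usability_lock = threading.RLock()

      def is_usable(self):
          # ...and because it's an RLock, the same thread may re-enter
          # (e.g. next() -> is_usable() -> is_usable()) without deadlocking.
          with self._usability_lock:
              return True  # placeholder for the real usability check

  def broken_is_usable():
      # By contrast, a new Lock created inside each call is never contended
      # by any other caller, so it protects nothing.
      with threading.Lock():
          return True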

So, things I could do to fix this:

  • add locking in the prioritization loop, and maybe in resultdump
  • ensure that locks are acquired and released
  • fix the double locking in the usability tests, but here there's an additional problem.

I've observed that sometimes sbws fails several times in a row to perform the usability test (while the destination is actually alive), which blocks sbws for 5 min.
Some solutions to this:

  • don't perform the usability test; instead, count how many relays fail to be measured in a row and, after some number, warn that maybe it's the destination that is down. I prefer to try this; I think it would remove a lot of complexity from the code (see the sketch after this list).
  • perform the usability test outside of Tor
  • if the usability test fails, the time until a new one is performed could be different (bigger) than the time it sleeps (smaller).
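
For illustration, a minimal sketch of the first option (counting consecutive failures instead of running the usability test); the class name, threshold and method names are hypothetical, not sbws code:

  import threading

  class DestinationFailureCounter:
      """Hypothetical sketch: warn when many measurements fail in a row,
      which suggests the destination itself is down."""

      def __init__(self, max_consecutive_failures=10):  # threshold is arbitrary
          self._lock = threading.Lock()
          self._consecutive_failures = 0
          self._max = max_consecutive_failures

      def record(self, success):
          """Return True when the failure threshold has been reached."""
          with self._lock:
              if success:
                  self._consecutive_failures = 0
                  return False
              self._consecutive_failures += 1
              return self._consecutive_failures >= self._max

A scanner thread would call record() after each measurement and log a warning when it returns True.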

Before changing any code, and because there might be additional complexity I've not spotted yet, I'm going to look at all the calls that a thread passes through and which ones read/write the same memory. Hopefully I'll get to understand why commenting out those lines makes sbws not stall.

comment:12 Changed 7 months ago by juga

Owner: set to juga
Status: new → assigned

comment:13 Changed 7 months ago by juga

The locks that are instantiated without explicit acquire and release calls use the context manager, so that might not be a failure.
The point where sbws stalls is in this loop: https://github.com/torproject/sbws/blob/ee64d76df54ceb3a3c9e1e2a797fd70d68bb0035/sbws/core/scanner.py#L362, which was added in #28061 in order to stop two threads from measuring the same relay when a new loop starts.
It turned out that there's 1 thread that is not measuring relays (it's the one storing the results), so that could be solved by just changing "> 0" to "> 1"; I'll try that.
However, there's still the mystery of why commenting out the lines in relayprioritizer solves the problem too.
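
For reference, a minimal sketch of the kind of wait loop being discussed, with hypothetical names (the real loop is at the scanner.py link above); changing "> 0" to "> 1" corresponds to raising the threshold below from 0 to 1:

  import time

  def wait_for_pending(pending_results, threshold=0, poll_interval=1):
      """Illustrative only: block until at most `threshold` async results
      remain unfinished. If one result never completes, this never returns,
      which matches the stall described in this ticket."""
      while len(pending_results) > threshold:
          pending_results = [r for r in pending_results if not r.ready()]
          time.sleep(poll_interval)
      return pending_results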

comment:14 in reply to:  13 ; Changed 7 months ago by teor

Replying to juga:

The locks that are instantiated without explicit acquire and release calls use the context manager, so that might not be a failure.
The point where sbws stalls is in this loop: https://github.com/torproject/sbws/blob/ee64d76df54ceb3a3c9e1e2a797fd70d68bb0035/sbws/core/scanner.py#L362, which was added in #28061 in order to stop two threads from measuring the same relay when a new loop starts.

This code can block forever at a few points for a few different reasons. I will open a new child ticket for every location that can block.

It turned out that there's 1 thread that is not measuring relays (it's the one storing the results), so that could be solved by just changing "> 0" to "> 1"; I'll try that.

pending_results is a list of AsyncResults, not threads:
https://docs.python.org/2/library/multiprocessing.html#multiprocessing.pool.AsyncResult

So this change doesn't do what you say it does.
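
For context, a short self-contained example of what an AsyncResult is; the pool size and the measure() function are made up for this sketch, and the thread-backed Pool is only an assumption about how sbws runs its workers. Child ticket #28864 is about get() having no timeout:

  from multiprocessing.dummy import Pool  # thread-backed pool (assumption, see above)

  def measure(relay):
      return len(relay)  # placeholder for a real measurement

  with Pool(processes=3) as pool:
      # apply_async() returns immediately with an AsyncResult, not a thread.
      pending_results = [pool.apply_async(measure, (r,)) for r in ("A", "BB", "CCC")]
      # Without a timeout, get() blocks forever if a task never finishes.
      values = [r.get(timeout=60) for r in pending_results]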

However, there's still the mystery of why commenting out the lines in relayprioritizer solves the problem too.

It might solve the problem by changing where the code blocks.

Last edited 7 months ago by teor

comment:15 in reply to:  14 Changed 7 months ago by juga

Replying to teor:

The point where sbws stalls is in this loop: https://github.com/torproject/sbws/blob/ee64d76df54ceb3a3c9e1e2a797fd70d68bb0035/sbws/core/scanner.py#L362, which was added in #28061 in order to stop two threads from measuring the same relay when a new loop starts.

This code can block forever at a few points for a few different reasons. I will open a new child ticket for every location that can block.

It turned out that there's 1 thread that is not measuring relays (it's the one storing the results), so that could be solved by just changing "> 0" to "> 1"; I'll try that.

pending_results is a list of AsyncResults, not threads:
https://docs.python.org/2/library/multiprocessing.html#multiprocessing.pool.AsyncResult

So this change doesn't do what you say it does.

That's the logical thing to think, but I couldn't find enough documentation.
First of all, Pool manages tasks, jobs and results (https://github.com/python/cpython/blob/master/Lib/multiprocessing/pool.py#L186), and has at least 3 queues: _inqueue, _outqueue and _taskqueue.
I left sbws running with extra logging of the "jobs" and the queues, and at the moment it stalled, pending_results was 1, as was the difference between _inqueue and _outqueue, and the "job" I could see was thread 1, resultdump, waiting for more results to get.
So while I'd agree with you, it's still not clear to me that that's the case.
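
For reference, a sketch of the kind of extra logging described above; it peeks at CPython's private Pool attributes (_cache, _taskqueue), so it's a debugging hack that may break across Python versions, not a supported API:

  import logging

  log = logging.getLogger(__name__)

  def log_pool_state(pool, pending_results):
      # _cache maps job ids to AsyncResults not yet delivered; _taskqueue
      # holds tasks not yet handed to a worker. Both are private internals,
      # used here only to debug the stall.
      log.debug("pending_results=%d pool._cache=%d pool._taskqueue=%d",
                len(pending_results), len(pool._cache), pool._taskqueue.qsize())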

comment:16 Changed 5 months ago by juga

Keywords: 10-proposed added
Milestone: sbws: 1.0.x-final → sbws: 1.1.x-final

Moving to milestone 1.1 the tickets for which we don't know whether they need to be implemented after 1.0.3 has been released, or that can only be done after 1.0 has been released.

comment:17 Changed 5 months ago by juga

Resolution: fixed
Status: assigned → closed

All the children have been closed.
