Opened 8 years ago

Closed 8 years ago

#2947 closed defect (fixed)

BwScanner does not clear stream data between slices

Reported by: aagbsn Owned by: mikeperry
Priority: Medium Milestone:
Component: Core Tor/Torflow Version:
Severity: Keywords:
Cc: aagbsn@… Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

The output files ( ./data/scanner.{n}/scan-data/bws-* ) contain entries from previous slices.

The default relays-per-slice is 50, but the number of entries per output file grows by about 50 with each slice.

Example: the output file generated for the third slice would contain the entries for the first and second slices as well.

It looks like the sqlite databases are not being cleared between speedrace() runs.
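The symptom can be reproduced in miniature. The following is only a sketch using the standard sqlite3 module, not the torctl code: the table name `stream_stats`, the function `entries_per_output`, and the `clear_between` flag are all made up for illustration. It assumes the output writer dumps every row it finds, which matches the growth of about 50 entries per slice described above.

```python
import sqlite3

def entries_per_output(num_slices, relays_per_slice=50, clear_between=False):
    """Return how many rows each slice's output file would contain.

    Hypothetical model: one in-memory table stands in for the scanner's
    sqlite database, and each "output file" is a dump of the whole table.
    """
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE stream_stats (slice INTEGER, relay INTEGER)")
    counts = []
    for slice_no in range(num_slices):
        rows = [(slice_no, relay) for relay in range(relays_per_slice)]
        db.executemany("INSERT INTO stream_stats VALUES (?, ?)", rows)
        # The writer counts every row in the table, not just this slice's rows.
        counts.append(db.execute("SELECT COUNT(*) FROM stream_stats").fetchall()[0][0])
        if clear_between:
            # Clearing only the per-slice data between runs stops the growth.
            db.execute("DELETE FROM stream_stats")
    db.close()
    return counts

print(entries_per_output(3))                      # grows slice by slice
print(entries_per_output(3, clear_between=True))  # stays flat
```

Without clearing, the third output carries the first two slices' rows as well; with clearing, every output holds only its own slice.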


Change History (5)

comment:1 Changed 8 years ago by aagbsn

Cc: aagbsn@… added

comment:2 in reply to:  1 Changed 8 years ago by aagbsn

Replying to aagbsn:

https://github.com/aagbsn/torctl/tree/ticket2947

per commit msg:
Tests show that SQLSupport.reset_all() may clear too much because
if BwAuthority calls Scansupport.reset_stats() after each speedrace()
run only the first slice is properly recorded; the rest are empty.

comment:3 Changed 8 years ago by aagbsn

Update:

  1. We suspected that this bug was the cause of BwAuthority's memory leak problems, as items were not being cleared after each slice completed. However, after fixing this issue the memory leaks persisted.
  2. mikeperry and I decided that refactoring BwAuthority as a parent-child pair of processes would ensure that memory leaks would not persist past each run. Basically, rather than one long-running process, we split bwauthority.py into a pair of processes: a parent process that is responsible for passing slice parameters to a child process, and the child process that actually scans that slice.
  3. After refactoring BwAuthority, we discovered more issues. In some cases circuit_status_events referenced Router objects that were not stored in the database (SQLAlchemy raised sqlalchemy.orm.exc.NoResultFound) even though update_consensus() and _update_db() had supposedly inserted the referenced Router objects. To restate: objects that were supposedly stored in the database were failing in queries only seconds later.
  4. This issue happens pretty rarely; my best estimate is about once every 2 weeks or so...
  5. This is likely caused by a race between the sessions bound to Elixir models and another shared session (tc_session).
  6. I believe this race occurs because of misuse of SQLAlchemy scoped_sessions. I refactored BwAuthority to use local sessions for each function in torctl/SQLSupport.py that accesses the database, and to call tc_session.remove() prior to returning, as is recommended here: http://www.sqlalchemy.org/docs/orm/session.html#contextual-thread-local-sessions . The effect should be to flush mapped objects from the local session to the database so that queries from the Elixir bound sessions will succeed.
  7. Now we wait and see if this race condition persists :-(.
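The visibility problem described in the list above (rows that one session believes it has stored not showing up in another session's query) can be illustrated by analogy with the standard sqlite3 module. This is a sketch under stated assumptions, not the SQLAlchemy/Elixir code from torctl: the two plain connections stand in for the two sessions, and the `router` table and `demo_visibility` function are hypothetical.

```python
import os
import sqlite3
import tempfile

def demo_visibility():
    """Return (rows seen before commit, rows seen after commit).

    'writer' and 'reader' are two separate connections to the same file,
    standing in for two sessions that do not share pending state.
    """
    fd, path = tempfile.mkstemp(suffix=".sqlite")
    os.close(fd)
    try:
        writer = sqlite3.connect(path)
        reader = sqlite3.connect(path)
        writer.execute("CREATE TABLE router (idhex TEXT PRIMARY KEY)")
        writer.commit()

        # The insert is pending in the writer's open transaction only.
        writer.execute("INSERT INTO router VALUES ('AAAA')")
        before = reader.execute("SELECT COUNT(*) FROM router").fetchall()[0][0]

        # Once the writer commits, the other connection can see the row.
        writer.commit()
        after = reader.execute("SELECT COUNT(*) FROM router").fetchall()[0][0]

        writer.close()
        reader.close()
        return before, after
    finally:
        os.remove(path)

print(demo_visibility())
```

The second connection finds nothing until the first one has pushed its pending work to the database, which is the same reason for finishing with each local session before other sessions query for the objects it inserted.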

comment:4 Changed 8 years ago by aagbsn

Fixes to bwscanner.py to ensure that if any exceptions occur, the slice will be restarted by the parent process.

See: commit 37668610f78ac5b29ae399fac24ba70e2a2a643c
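A minimal sketch of the restart behaviour, assuming the parent detects failure through the child's exit status. This is not the code from the commit above: `run_slice`, `max_attempts`, and the placeholder child commands are hypothetical. An unhandled exception in a Python child makes the interpreter exit with a non-zero status, which is what the parent checks here.

```python
import subprocess
import sys

def run_slice(child_cmd, max_attempts=3):
    """Run one slice in a child process, rerunning it if the child fails.

    Returns the number of attempts used. Raises RuntimeError if the child
    still exits non-zero after max_attempts tries.
    """
    for attempt in range(1, max_attempts + 1):
        # Each attempt is a fresh process, so leaked memory dies with it.
        result = subprocess.run(child_cmd)
        if result.returncode == 0:
            return attempt
    raise RuntimeError("slice failed after %d attempts" % max_attempts)

if __name__ == "__main__":
    # Placeholder child that succeeds immediately; a real child would scan
    # the slice whose parameters the parent passes on its command line.
    ok_child = [sys.executable, "-c", "import sys; sys.exit(0)"]
    print(run_slice(ok_child))
```

Capping the number of attempts is a choice made for this sketch; a parent could equally retry without limit.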

comment:5 Changed 8 years ago by aagbsn

Resolution: fixed
Status: new → closed

Update:
This issue has been resolved.
