per commit msg:
Tests show that SQLSupport.reset_all() may clear too much: if BwAuthority calls ScanSupport.reset_stats() after each speedrace() run, only the first slice is properly recorded; the rest are empty.
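To make the failure mode concrete, here is a toy model (not torflow code; all names here are stand-ins) of a reset that clears too much: it drops the table that recording depends on rather than just zeroing counters, so everything after the first reset records empty.

```python
class StatsStore:
    """Toy analog of per-router stats accumulation across slices."""

    def __init__(self):
        self.routers = {"A": 0, "B": 0}   # per-router counters
        self.finished = {}                # slice number -> recorded results

    def record(self, router, value):
        if router in self.routers:        # measurements for unknown routers are dropped
            self.routers[router] += value

    def finish_slice(self, n):
        self.finished[n] = dict(self.routers)

    def reset_all(self):
        # Bug analog: wipes the router table entirely instead of
        # resetting the counters, so later record() calls hit nothing.
        self.routers = {}


store = StatsStore()
store.record("A", 5)
store.finish_slice(0)     # slice 0 is recorded properly
store.reset_all()         # called after each run, clears too much
store.record("A", 7)      # silently dropped: the router table is gone
store.finish_slice(1)     # slice 1 comes out empty
```

This is only an illustration of the symptom described above, not the actual logic in SQLSupport.py.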
We suspected that this bug was the cause of BwAuthority's memory leaks, since items were not being cleared after each slice completed. However, the leaks persisted after the fix.
mikeperry and I decided to refactor BwAuthority into a parent-child pair of processes so that memory leaks could not persist past each run. Rather than one long-running process, we split bwauthority.py into two: a parent process that passes slice parameters to a child process, and a child that actually scans that slice.
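The parent/child split can be sketched with the standard library; run_slice() here is a hypothetical stand-in for the real per-slice scan in bwauthority.py, not its actual interface.

```python
import multiprocessing


def run_slice(slice_num, result_queue):
    # Stand-in for the real scan: any memory allocated here is
    # reclaimed by the OS when this child process exits.
    result_queue.put((slice_num, "done"))


def scan_all(num_slices):
    """Parent loop: fork a fresh child per slice, collect its result,
    and let the child's entire heap die with it."""
    results = []
    for s in range(num_slices):
        q = multiprocessing.Queue()
        child = multiprocessing.Process(target=run_slice, args=(s, q))
        child.start()
        results.append(q.get())   # read before join() to avoid queue deadlock
        child.join()
    return results
```

The design point is that a leak in run_slice() is now bounded to one slice's lifetime, regardless of where in the scanning code it originates.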
After refactoring BwAuthority, we discovered more issues: in some cases, circuit status events referenced Router objects that were not stored in the database (SQLAlchemy raised sqlalchemy.orm.exc.NoResultFound), even though update_consensus() and _update_db() had supposedly inserted them. To restate: objects that were supposedly stored in the database were failing to appear in queries only seconds later.
This issue happens pretty rarely; my best estimate is about once every two weeks.
This is likely caused by a race between the sessions bound to Elixir models and another shared session (tc_session).
I believe this race occurs because of misuse of SQLAlchemy scoped_sessions. I refactored BwAuthority so that each function in torctl/SQLSupport.py that accesses the database uses a local session and calls tc_session.remove() before returning, as recommended at http://www.sqlalchemy.org/docs/orm/session.html#contextual-thread-local-sessions. The effect should be to flush mapped objects from the local session to the database so that queries from the Elixir-bound sessions succeed.
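A minimal sketch of the per-function local-session pattern, assuming a scoped_session named tc_session as in the text; the Router schema and function bodies below are stand-ins, not the real SQLSupport.py code.

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, scoped_session, sessionmaker

Base = declarative_base()


class Router(Base):
    """Stand-in for the Elixir-mapped Router model."""
    __tablename__ = "routers"
    id = Column(Integer, primary_key=True)
    idhex = Column(String, unique=True)


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
tc_session = scoped_session(sessionmaker(bind=engine))


def update_consensus(idhexes):
    """Insert routers via a session local to this call, then discard it
    with remove() so the rows are visible to other sessions."""
    session = tc_session()          # thread-local session for this function
    try:
        for h in idhexes:
            session.add(Router(idhex=h))
        session.commit()
    finally:
        tc_session.remove()         # drop the thread-local session


def lookup(idhex):
    session = tc_session()
    try:
        return session.query(Router).filter_by(idhex=idhex).one()
    finally:
        tc_session.remove()
```

Because every function commits and then calls remove(), no stale identity map survives the call, which is what should prevent one session from holding objects another session can't see.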
Now we wait and see if this race condition persists :-(.