Opened 8 years ago

Closed 8 years ago

#3834 closed defect (fixed)

Postgres bandwidth scanner has halted on moria

Reported by: mikeperry
Owned by: aagbsn
Priority: High
Component: Core Tor/Torflow
Cc: arma, aagbsn@…

Description

Roger's scanner has made no progress since the second scanner apparently crashed.

There have been a lot of ephemeral conversations about it, and no one person knows the full story.

This bug is to record analysis of the crash.

Child Tickets

Attachments (3)

err.1.log.gz (150.9 KB) - added by aagbsn 8 years ago.
crashes
err.2.log.gz (62.2 KB) - added by aagbsn 8 years ago.
crashes from scanner2
err.3.log.gz (60.5 KB) - added by aagbsn 8 years ago.
crashes from scanner1, trial 2


Change History (32)

comment:1 Changed 8 years ago by mikeperry

Cc: arma added

comment:2 Changed 8 years ago by arma

It's not clear that it crashed. The process was still around. It just didn't log anything more to the bw.log file.

I don't know what it was supposed to do, or how to recognize what it did, so I find it hard to help with debugging.

comment:3 Changed 8 years ago by arma

Ok. I've restarted the bwauth on moria. We can start again.

comment:4 Changed 8 years ago by arma

First traceback on scanner 1:

WARN[Sun Aug 28 20:30:57 2011]:No event for: 650 CIRC 35 LAUNCHED PURPOSE=GENERAL
Traceback (most recent call last):
  File "../../TorCtl/TorCtl.py", line 712, in _eventLoop
    self._handleFn(timestamp, reply)
  File "../../TorCtl/TorCtl.py", line 1333, in _handle1
    l.listen(event)
  File "../../TorCtl/TorCtl.py", line 1285, in listen
    self._map1.get(event.event_name, self.unknown_event)(event)
  File "../../TorCtl/SQLSupport.py", line 874, in circ_status_event
    idhex=r.idhex).with_labels().one()
  File "/var/home/bwauth/work/SQLAlchemy-0.7.2/lib/sqlalchemy/orm/query.py", line 1802, in one
    raise orm_exc.NoResultFound("No row was found for one()")
NoResultFound: No row was found for one()

comment:5 Changed 8 years ago by arma

Scanner 4 has a traceback:

Exception in thread EventLoop (most likely raised during interpreter shutdown):
Traceback (most recent call last):
  File "/usr/lib/python2.5/threading.py", line 486, in __bootstrap_inner
  File "/usr/lib/python2.5/threading.py", line 446, in run
  File "../../TorCtl/TorCtl.py", line 715, in _eventLoop
  File "../../TorCtl/TorUtil.py", line 360, in plog
  File "/usr/lib/python2.5/logging/__init__.py", line 1056, in log
  File "/usr/lib/python2.5/logging/__init__.py", line 1100, in _log
  File "/usr/lib/python2.5/logging/__init__.py", line 1080, in makeRecord
  File "/usr/lib/python2.5/logging/__init__.py", line 258, in __init__
<type 'exceptions.TypeError'>: 'NoneType' object is not callable

comment:6 in reply to:  4 ; Changed 8 years ago by aagbsn

Are you running the latest torflow: (commit: 37668610f78ac5b29ae399fac24ba70e2a2a643c)?
Are you running the latest torctl: (commit: 25cb2bf77b1a98f2273b0f90c99b57e3d6e21a41)?

If yes to the above, it looks like the race described in #2947 still lives. Do you have a complete bw.log you can attach or email?

Replying to arma:

First traceback on scanner 1:

WARN[Sun Aug 28 20:30:57 2011]:No event for: 650 CIRC 35 LAUNCHED PURPOSE=GENERAL
Traceback (most recent call last):
  File "../../TorCtl/TorCtl.py", line 712, in _eventLoop
    self._handleFn(timestamp, reply)
  File "../../TorCtl/TorCtl.py", line 1333, in _handle1
    l.listen(event)
  File "../../TorCtl/TorCtl.py", line 1285, in listen
    self._map1.get(event.event_name, self.unknown_event)(event)
  File "../../TorCtl/SQLSupport.py", line 874, in circ_status_event
    idhex=r.idhex).with_labels().one()
  File "/var/home/bwauth/work/SQLAlchemy-0.7.2/lib/sqlalchemy/orm/query.py", line 1802, in one
    raise orm_exc.NoResultFound("No row was found for one()")
NoResultFound: No row was found for one()
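
The race behind that NoResultFound is the event handler doing a `.one()` lookup for a row another thread hasn't committed yet. A minimal stdlib sketch of the tolerant-lookup shape (analogous to SQLAlchemy's `.first()`, which returns None, versus `.one()`, which raises); the store and handler names here are illustrative, not TorFlow's real API:

```python
# Hypothetical stand-in for the routers table that circ_status_event queries.
store = {}

def circ_status_event_strict(idhex):
    # .one()-style lookup: raises KeyError/NoResultFound if the row is
    # missing -- which kills the event loop when the race fires.
    return store[idhex]

def circ_status_event_tolerant(idhex):
    # .first()-style lookup: returns None so the caller can log and skip
    # the event instead of crashing the thread.
    return store.get(idhex)

assert circ_status_event_tolerant("deadbeef") is None  # row not committed yet
store["deadbeef"] = {"bw": 1000}                       # second thread commits
assert circ_status_event_tolerant("deadbeef") == {"bw": 1000}
```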

comment:7 in reply to:  5 ; Changed 8 years ago by aagbsn

Cc: aagbsn@… added

This is caused by the main thread exiting before the eventLoop thread, because we weren't exiting cleanly. I have a fix for this in my testing branch that just needs review before we can merge it.

https://github.com/aagbsn/torflow/commit/09f232bffa3cf51db21c53c81e544ad977bcb335

Replying to arma:

Scanner 4 has a traceback:

Exception in thread EventLoop (most likely raised during interpreter shutdown):
Traceback (most recent call last):
  File "/usr/lib/python2.5/threading.py", line 486, in __bootstrap_inner
  File "/usr/lib/python2.5/threading.py", line 446, in run
  File "../../TorCtl/TorCtl.py", line 715, in _eventLoop
  File "../../TorCtl/TorUtil.py", line 360, in plog
  File "/usr/lib/python2.5/logging/__init__.py", line 1056, in log
  File "/usr/lib/python2.5/logging/__init__.py", line 1100, in _log
  File "/usr/lib/python2.5/logging/__init__.py", line 1080, in makeRecord
  File "/usr/lib/python2.5/logging/__init__.py", line 258, in __init__
<type 'exceptions.TypeError'>: 'NoneType' object is not callable
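
That interpreter-shutdown TypeError is what you get when the event-loop thread is still logging while Python tears down module globals. A minimal modern-Python sketch of the shape of the fix (signal the loop to stop and join it from an atexit handler); the names are illustrative, not the actual TorCtl API:

```python
import atexit
import threading

stop = threading.Event()
log_lines = []

def event_loop():
    # In TorCtl this would be reading control-port events; logging from
    # here after interpreter teardown is what produced the
    # "'NoneType' object is not callable" TypeError.
    while not stop.is_set():
        stop.wait(0.01)
    log_lines.append("event loop exited cleanly")

t = threading.Thread(target=event_loop, name="EventLoop")
t.start()

def shutdown():
    # Ask the loop to finish and wait for it before the interpreter
    # starts tearing down the logging module.
    stop.set()
    t.join()

atexit.register(shutdown)

# For demonstration, shut down explicitly rather than waiting for exit.
shutdown()
assert log_lines == ["event loop exited cleanly"]
```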

comment:8 in reply to:  6 Changed 8 years ago by arma

Replying to aagbsn:

Are you running the latest torflow: (commit: 37668610f78ac5b29ae399fac24ba70e2a2a643c)?
Are you running the latest torctl: (commit: 25cb2bf77b1a98f2273b0f90c99b57e3d6e21a41)?

If yes to the above, it looks like the race described in #2947 still lives. Do you have a complete bw.log you can attach or email?

Mailed three complete logs that include that backtrace.

comment:9 Changed 8 years ago by arma

Ok, scanner 4 has now hit the traceback that causes the log to stop:

-rw-r--r-- 1 bwauth bwauth 22699559 2011-08-29 18:09 scanner.1/bw.log
-rw-r--r-- 1 bwauth bwauth 29221682 2011-08-29 18:09 scanner.2/bw.log
-rw-r--r-- 1 bwauth bwauth 31998982 2011-08-29 18:09 scanner.3/bw.log
-rw-r--r-- 1 bwauth bwauth 36093582 2011-08-29 17:21 scanner.4/bw.log
[...]
DEBUG[Mon Aug 29 17:21:13 2011]:0.0896589756012 CIRC 7159 EXTENDED $173B220F9F32F39086D5661274A47485EDA26131,$E93DC83A9347FBE3DC168C4B4FA351B1AA5BAFE2
DEBUG[Mon Aug 29 17:21:13 2011]:1314652873.17 0.0790801048279 CIRC 7159 BUILT $173B220F9F32F39086D5661274A47485EDA26131,$E93DC83A9347FBE3DC168C4B4FA351B1AA5BAFE2
DEBUG[Mon Aug 29 17:21:13 2011]:0.178069114685 CIRC 7159 BUILT $173B220F9F32F39086D5661274A47485EDA26131,$E93DC83A9347FBE3DC168C4B4FA351B1AA5BAFE2
DEBUG[Mon Aug 29 17:21:13 2011]:Attaching stream: 10504 to circuit 7159
DEBUG[Mon Aug 29 17:21:13 2011]:0.045471906662 STREAM 10504 SENTCONNECT 7159 38.229.70.2 443
ERROR[Mon Aug 29 17:21:15 2011]:An unknown HTTP error occured
Traceback (most recent call last):
  File "bwauthority_child.py", line 125, in http_request
    reply = urllib2.urlopen(request)
  File "/usr/lib/python2.5/urllib2.py", line 124, in urlopen
    return _opener.open(url, data)
  File "/usr/lib/python2.5/urllib2.py", line 381, in open
    response = self._open(req, data)
  File "/usr/lib/python2.5/urllib2.py", line 399, in _open
    '_open', req)
  File "/usr/lib/python2.5/urllib2.py", line 360, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.5/urllib2.py", line 1115, in https_open
    return self.do_open(httplib.HTTPSConnection, req)
  File "/usr/lib/python2.5/urllib2.py", line 1079, in do_open
    h.request(req.get_method(), req.get_selector(), req.data, headers)
  File "/usr/lib/python2.5/httplib.py", line 866, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python2.5/httplib.py", line 889, in _send_request
    self.endheaders()
  File "/usr/lib/python2.5/httplib.py", line 860, in endheaders
    self._send_output()
  File "/usr/lib/python2.5/httplib.py", line 732, in _send_output
    self.send(msg)
  File "/usr/lib/python2.5/httplib.py", line 699, in send
    self.connect()
  File "/usr/lib/python2.5/httplib.py", line 1134, in connect
    sock.connect((self.host, self.port))
  File "../libs/SocksiPy/socks.py", line 369, in connect
    self.__negotiatesocks5(destpair[0],destpair[1])
  File "../libs/SocksiPy/socks.py", line 236, in __negotiatesocks5
    raise Socks5Error((ord(resp[1]),_socks5errors[ord(resp[1])]))
Socks5Error: (1, 'general SOCKS server failure')
DEBUG[Mon Aug 29 17:21:15 2011]:Removed 127.0.0.1:38421 from our local port list
DEBUG[Mon Aug 29 17:21:15 2011]:Got last exit of E93DC83A9347FBE3DC168C4B4FA351B1AA5BAFE2
DEBUG[Mon Aug 29 17:21:15 2011]:97.1529135706-99.1083300743% circuit build+fetch failed for ('E93DC83A9347FBE3DC168C4B4FA351B1AA5BAFE2', 'ONE')
DEBUG[Mon Aug 29 17:21:15 2011]:Checking if scan count is met...

and the log stops there. This, I believe, is the primary bug.

comment:10 in reply to:  6 ; Changed 8 years ago by arma

Replying to aagbsn:

Are you running the latest torflow: (commit: 37668610f78ac5b29ae399fac24ba70e2a2a643c)?
Are you running the latest torctl: (commit: 25cb2bf77b1a98f2273b0f90c99b57e3d6e21a41)?

I hadn't explicitly answered those here, so: yes.

comment:11 in reply to:  10 Changed 8 years ago by aagbsn

Replying to arma:

Replying to aagbsn:

Are you running the latest torflow: (commit: 37668610f78ac5b29ae399fac24ba70e2a2a643c)?
Are you running the latest torctl: (commit: 25cb2bf77b1a98f2273b0f90c99b57e3d6e21a41)?

I hadn't explicitly answered those here, so: yes.

Thanks.

I think the Socks5Error traceback is unrelated to your scanner freezing. I wonder if the problem is related to postgres. I have a couple of fixes in my testing branch that correct BwAuthority's usage of sqlalchemy/elixir session objects, which might help.

If your scanner does freeze up again, could you provide the output of ps -u postgres u?

comment:12 Changed 8 years ago by arma

I upgraded my git and restarted the bwauths. First issue:

INFO[Thu Sep 01 00:42:56 2011]:TorFlow Version: master 9e4b804a398ebf5bec1ffc261577a4b84b35f220
ERROR[Thu Sep 01 00:42:56 2011]:list index out of range
INFO[Thu Sep 01 00:42:56 2011]:TorCtl Version: unknown unknown

comment:13 Changed 8 years ago by arma

Scanner 3 had an almost immediate problem:

-rw-r--r-- 1 bwauth bwauth 2079006 2011-09-01 05:09 scanner.1/bw.log
-rw-r--r-- 1 bwauth bwauth 2501595 2011-09-01 05:09 scanner.2/bw.log
-rw-r--r-- 1 bwauth bwauth   48435 2011-09-01 00:46 scanner.3/bw.log
-rw-r--r-- 1 bwauth bwauth 3882522 2011-09-01 05:09 scanner.4/bw.log

Here's the end of its output:

DEBUG[Thu Sep 01 00:46:51 2011]:0.186544179916 CIRC 15 BUILT $01340C8CDF5E90B9D4FF1AEBDCA403CD2FF6E562,$070BC8794DCF79B2EC490EF9A38B16FD5EBDE745
DEBUG[Thu Sep 01 00:46:51 2011]:Attaching stream: 21 to circuit 15
DEBUG[Thu Sep 01 00:46:51 2011]:0.0268068313599 STREAM 21 SENTCONNECT 15 38.229.70.2 443
DEBUG[Thu Sep 01 00:46:51 2011]:0.000241041183472 CIRC 16 EXTENDED $56D84A5138C90E13025B26DA0B9AAD7FB796AF38
DEBUG[Thu Sep 01 00:46:51 2011]:Ignoring circ 16
ERROR[Thu Sep 01 00:46:51 2011]:An unknown HTTP error occured
Traceback (most recent call last):
  File "bwauthority_child.py", line 125, in http_request
    reply = urllib2.urlopen(request)
  File "/usr/lib/python2.5/urllib2.py", line 124, in urlopen
    return _opener.open(url, data)
  File "/usr/lib/python2.5/urllib2.py", line 381, in open
    response = self._open(req, data)
  File "/usr/lib/python2.5/urllib2.py", line 399, in _open
    '_open', req)
  File "/usr/lib/python2.5/urllib2.py", line 360, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.5/urllib2.py", line 1115, in https_open
    return self.do_open(httplib.HTTPSConnection, req)
  File "/usr/lib/python2.5/urllib2.py", line 1079, in do_open
    h.request(req.get_method(), req.get_selector(), req.data, headers)
  File "/usr/lib/python2.5/httplib.py", line 866, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python2.5/httplib.py", line 889, in _send_request
    self.endheaders()
  File "/usr/lib/python2.5/httplib.py", line 860, in endheaders
    self._send_output()
  File "/usr/lib/python2.5/httplib.py", line 732, in _send_output
    self.send(msg)
  File "/usr/lib/python2.5/httplib.py", line 699, in send
    self.connect()
  File "/usr/lib/python2.5/httplib.py", line 1134, in connect
    sock.connect((self.host, self.port))
  File "../libs/SocksiPy/socks.py", line 369, in connect
    self.__negotiatesocks5(destpair[0],destpair[1])
  File "../libs/SocksiPy/socks.py", line 236, in __negotiatesocks5
    raise Socks5Error((ord(resp[1]),_socks5errors[ord(resp[1])]))
Socks5Error: (5, 'Connection refused')
DEBUG[Thu Sep 01 00:46:51 2011]:Removed 127.0.0.1:37009 from our local port list
DEBUG[Thu Sep 01 00:46:51 2011]:Got last exit of 070BC8794DCF79B2EC490EF9A38B16FD5EBDE745
DEBUG[Thu Sep 01 00:46:51 2011]:35.0-37.1376656691% circuit build+fetch failed for ('070BC8794DCF79B2EC490EF9A38B16FD5EBDE745', 'dustbunnytycoon')
DEBUG[Thu Sep 01 00:46:51 2011]:Checking if scan count is met...

comment:14 in reply to:  12 ; Changed 8 years ago by aagbsn

Replying to arma:

I upgraded my git and restarted the bwauths. First issue:

INFO[Thu Sep 01 00:42:56 2011]:TorFlow Version: master 9e4b804a398ebf5bec1ffc261577a4b84b35f220
ERROR[Thu Sep 01 00:42:56 2011]:list index out of range
INFO[Thu Sep 01 00:42:56 2011]:TorCtl Version: unknown unknown

Hm. Looks like you don't have TorCtl checked out into torflow/TorCtl? Perhaps this should be logged at NOTICE, warning that the repo wasn't found, rather than at ERROR.
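
The "list index out of range" ERROR comes from version-reporting code indexing into git output that wasn't in the expected shape. A hedged sketch of a defensive lookup that takes an explicit path (so a changed working directory can't break it) and degrades to "unknown"; the function name is hypothetical, not TorFlow's actual helper:

```python
import subprocess

def repo_version(path):
    # Return the current commit of the git checkout at `path`, or
    # "unknown" if the directory isn't there, git isn't installed, or
    # the output is empty -- instead of letting an IndexError escape
    # as an ERROR-level log line.
    try:
        out = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], cwd=path,
            stderr=subprocess.DEVNULL)
        lines = out.decode().splitlines()
        return lines[0] if lines else "unknown"
    except (OSError, subprocess.CalledProcessError):
        return "unknown"

# A path that isn't a git checkout degrades gracefully:
print(repo_version("/nonexistent-dir"))  # prints "unknown"
```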

comment:15 Changed 8 years ago by aagbsn

adding from email:

% ps -w -u postgres u
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
postgres 2353 0.0 0.0 101012 4400 ? S Jun21 14:30 /usr/lib/postgresql/8.3/bin/postgres -D /var/lib/postgresql/8.3/mai
postgres 2358 0.0 0.4 101144 26656 ? Ss Jun21 11:19 postgres: writer process
postgres 2359 0.0 0.0 101012 1068 ? Ss Jun21 6:53 postgres: wal writer process
postgres 2360 0.0 0.0 101564 1488 ? Ss Jun21 11:44 postgres: autovacuum launcher process
postgres 2361 0.0 0.0 71752 1192 ? Ss Jun21 18:33 postgres: stats collector process
postgres 3642 0.2 0.2 103936 16688 ? Ss 00:43 0:40 postgres: bwscanner bwscan1 127.0.0.1(58025) idle in transaction
postgres 3667 0.2 0.2 103888 17280 ? Ss 00:43 0:38 postgres: bwscanner bwscan2 127.0.0.1(58054) idle
postgres 3679 0.0 0.2 103900 14248 ? Ss 00:43 0:08 postgres: bwscanner bwscan3 127.0.0.1(58056) UPDATE waiting
postgres 4023 0.1 0.2 103388 13832 ? Ss 00:46 0:21 postgres: bwscanner bwscan1 127.0.0.1(43811) idle
postgres 4027 0.1 0.2 103328 14720 ? Ss 00:46 0:26 postgres: bwscanner bwscan2 127.0.0.1(43817) idle
postgres 4031 0.0 0.1 103356 8408 ? Ss 00:46 0:00 postgres: bwscanner bwscan3 127.0.0.1(43858) idle in transaction
postgres 30756 0.8 0.2 103308 13736 ? Ss 04:59 0:12 postgres: bwscanner bwscan4 127.0.0.1(45962) idle
postgres 31138 0.3 0.1 103368 9808 ? Ss 05:02 0:03 postgres: bwscanner bwscan4 127.0.0.1(46645) idle in transaction

searching around for postgres "UPDATE WAITING" gives a few interesting results:
http://archives.postgresql.org/pgsql-general/2002-08/msg01119.php
http://archives.postgresql.org/pgsql-sql/1999-05/msg00065.php
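
The "UPDATE waiting" backend is blocked on a lock that one of the "idle in transaction" sessions still holds. A sketch of the diagnostic query one might run to pair them up; column names here assume Postgres 8.3 (`procpid`, `current_query` in pg_stat_activity; later releases renamed them to `pid` and `query`), so treat it as a starting point rather than a drop-in:

```python
# Lists ungranted locks joined to the session that is waiting on them,
# so the blocked bwscan backend can be matched to the lock holder.
BLOCKED_QUERY = """
SELECT a.procpid, a.current_query, l.locktype, l.relation, l.granted
FROM pg_locks l
JOIN pg_stat_activity a ON a.procpid = l.pid
WHERE NOT l.granted;
""".strip()

print(BLOCKED_QUERY)
```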

comment:16 in reply to:  15 Changed 8 years ago by aagbsn

Replying to aagbsn:

adding from email:

% ps -w -u postgres u
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
postgres 2353 0.0 0.0 101012 4400 ? S Jun21 14:30 /usr/lib/postgresql/8.3/bin/postgres -D /var/lib/postgresql/8.3/mai
postgres 2358 0.0 0.4 101144 26656 ? Ss Jun21 11:19 postgres: writer process
postgres 2359 0.0 0.0 101012 1068 ? Ss Jun21 6:53 postgres: wal writer process
postgres 2360 0.0 0.0 101564 1488 ? Ss Jun21 11:44 postgres: autovacuum launcher process
postgres 2361 0.0 0.0 71752 1192 ? Ss Jun21 18:33 postgres: stats collector process
postgres 3642 0.2 0.2 103936 16688 ? Ss 00:43 0:40 postgres: bwscanner bwscan1 127.0.0.1(58025) idle in transaction
postgres 3667 0.2 0.2 103888 17280 ? Ss 00:43 0:38 postgres: bwscanner bwscan2 127.0.0.1(58054) idle
postgres 3679 0.0 0.2 103900 14248 ? Ss 00:43 0:08 postgres: bwscanner bwscan3 127.0.0.1(58056) UPDATE waiting
postgres 4023 0.1 0.2 103388 13832 ? Ss 00:46 0:21 postgres: bwscanner bwscan1 127.0.0.1(43811) idle
postgres 4027 0.1 0.2 103328 14720 ? Ss 00:46 0:26 postgres: bwscanner bwscan2 127.0.0.1(43817) idle
postgres 4031 0.0 0.1 103356 8408 ? Ss 00:46 0:00 postgres: bwscanner bwscan3 127.0.0.1(43858) idle in transaction
postgres 30756 0.8 0.2 103308 13736 ? Ss 04:59 0:12 postgres: bwscanner bwscan4 127.0.0.1(45962) idle
postgres 31138 0.3 0.1 103368 9808 ? Ss 05:02 0:03 postgres: bwscanner bwscan4 127.0.0.1(46645) idle in transaction

searching around for postgres "UPDATE WAITING" gives a few interesting results:
http://archives.postgresql.org/pgsql-general/2002-08/msg01119.php
http://archives.postgresql.org/pgsql-sql/1999-05/msg00065.php

Also found this piece of info:

http://grokbase.com/t/postgresql.org/pgsql-general/2011/06/postgres-8-3-10-alter-table-waiting-issue/16hv6ndfcuqzm2ncyf2w6dr77p6e

Maybe an exception is preventing the transaction from completing? Every freeze Roger has seen has been preceded by a Socks5Error...
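
If a Socks5Error propagates out of code that has an open transaction, the session can indeed stay "idle in transaction", holding the row locks that leave another scanner in "UPDATE waiting". A stdlib-only sketch of the rollback-on-exception shape of that fix, with a stub standing in for the sqlalchemy/elixir session (the real objects live in SQLSupport.py):

```python
class StubSession(object):
    # Stand-in for an SQLAlchemy/Elixir session, just to show the shape
    # of the fix.
    def __init__(self):
        self.state = "idle"
    def begin(self):
        self.state = "in transaction"
    def commit(self):
        self.state = "idle"
    def rollback(self):
        self.state = "idle"

def record_fetch_result(session, do_fetch):
    session.begin()
    try:
        do_fetch()          # may raise, e.g. Socks5Error
        session.commit()
    except Exception:
        # Without this rollback the session stays "idle in transaction",
        # holding row locks that block other scanners' UPDATEs.
        session.rollback()
        raise

s = StubSession()
def failing_fetch():
    raise RuntimeError("general SOCKS server failure")

try:
    record_fetch_result(s, failing_fetch)
except RuntimeError:
    pass
assert s.state == "idle"  # transaction was rolled back, not left open
```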

comment:17 in reply to:  14 Changed 8 years ago by arma

Replying to aagbsn:

Hm. Looks like you don't have TorCtl checked out into torflow/TorCtl? Perhaps this should be logged as NOTICE and warn that the repo wasn't found rather than ERROR.

I think I do have it in the expected location. Perhaps the current working directory gets changed before it looks?

comment:18 Changed 8 years ago by arma

Scanner 4 got another traceback today. Looks like the output is interleaved with some other thread:

[...]
DEBUG[Thu Sep 01 10:49:07 2011]:Got stream bw: 3167
DEBUG[Thu Sep 01 10:49:07 2011]:0.102648973465 STREAM_BW 3167 319 2362
ERROR[Thu Sep 01 10:49:07 2011]:An unknown HTTP error occured
Traceback (most recent call last):
  File "bwauthority_child.py", line 125, in http_request
    reply = urllib2.urlopen(request)
  File "/usr/lib/python2.5/urllib2.py", line 124, in urlopen
DEBUG[Thu Sep 01 10:49:07 2011]:1314888547.54 0.00143194198608 CIRC 2066 CLOSED $BC789B6057C5C71F2F85B3DA202B52EE94C4FF0C,$640AE17D46C7374F57689CD60E508685FB5BB4B5 REASON=DESTROYED REMOTE_REASON=OR_CONN_CLOSED
    return _opener.open(url, data)
  File "/usr/lib/python2.5/urllib2.py", line 381, in open
    response = self._open(req, data)
  File "/usr/lib/python2.5/urllib2.py", line 399, in _open
    '_open', req)
  File "/usr/lib/python2.5/urllib2.py", line 360, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.5/urllib2.py", line 1115, in https_open
    return self.do_open(httplib.HTTPSConnection, req)
  File "/usr/lib/python2.5/urllib2.py", line 1080, in do_open
    r = h.getresponse()
  File "/usr/lib/python2.5/httplib.py", line 928, in getresponse
DEBUG[Thu Sep 01 10:49:07 2011]:0.143916845322 CIRC 2066 CLOSED $BC789B6057C5C71F2F85B3DA202B52EE94C4FF0C,$640AE17D46C7374F57689CD60E508685FB5BB4B5 REASON=DESTROYED REMOTE_REASON=OR_CONN_CLOSED
DEBUG[Thu Sep 01 10:49:07 2011]:Close refcount 0 for BC789B6057C5C71F2F85B3DA202B52EE94C4FF0C
DEBUG[Thu Sep 01 10:49:07 2011]:Close refcount 1 for 640AE17D46C7374F57689CD60E508685FB5BB4B5
    response.begin()
  File "/usr/lib/python2.5/httplib.py", line 385, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.5/httplib.py", line 349, in _read_status
    raise BadStatusLine(line)
BadStatusLine
DEBUG[Thu Sep 01 10:49:07 2011]:Removed 127.0.0.1:60353 from our local port list
DEBUG[Thu Sep 01 10:49:07 2011]:Got last exit of 640AE17D46C7374F57689CD60E508685FB5BB4B5
DEBUG[Thu Sep 01 10:49:07 2011]:71.9952019192-73.9944022391% circuit build+fetch failed for ('640AE17D46C7374F57689CD60E508685FB5BB4B5', 'Unnamed')
DEBUG[Thu Sep 01 10:49:07 2011]:Checking if scan count is met...
DEBUG[Thu Sep 01 10:49:07 2011]:Got stream bw: 3167
[...]

comment:19 Changed 8 years ago by arma

And scanner 4 stopped logging earlier this evening:

[...]
INFO[Thu Sep 01 18:50:15 2011]:Committing jobs...
INFO[Thu Sep 01 18:50:15 2011]:Commit done.
INFO[Thu Sep 01 18:50:15 2011]:Scanner commit done.
DEBUG[Thu Sep 01 18:50:15 2011]:Got stream bw: 6380
DEBUG[Thu Sep 01 18:51:40 2011]:120.854029179 STREAM_BW 6380 0 14932
DEBUG[Thu Sep 01 18:51:41 2011]:121.23456192 CIRC 4292 EXTENDED $23BCACB73CCF46E6346B5C19C72D3D52747757FB,$C20E7B105725C9840E83702B87F6AD3AFF25ABE7
DEBUG[Thu Sep 01 18:51:41 2011]:Ignoring circ 4292
DEBUG[Thu Sep 01 18:51:41 2011]:121.235049009 CIRC 4292 BUILT $23BCACB73CCF46E6346B5C19C72D3D52747757FB,$C20E7B105725C9840E83702B87F6AD3AFF25ABE7
DEBUG[Thu Sep 01 18:51:41 2011]:Wrote stats
NOTICE[Thu Sep 01 18:51:41 2011]:Tor closed control connection. Exiting event thread.
DEBUG[Thu Sep 01 18:51:41 2011]:Ignoring circ 4292
DEBUG[Thu Sep 01 18:51:41 2011]:Got stream bw: 6380
DEBUG[Thu Sep 01 18:51:41 2011]:120.794023037 STREAM_BW 6380 0 22716
[...]
DEBUG[Thu Sep 01 18:51:47 2011]:Ignoring circ 4294
DEBUG[Thu Sep 01 18:51:47 2011]:Ignoring stream: 6388
DEBUG[Thu Sep 01 18:51:47 2011]:Ignoring stream: 6388
DEBUG[Thu Sep 01 18:51:47 2011]:Deleting ignored stream: 6391
WARN[Thu Sep 01 18:51:47 2011]:No event for: 650 NEWDESC $05C368BF20CDD10F2FC486BE48B343F435C9CE29~Unnamed
Traceback (most recent call last):
  File "../../TorCtl/TorCtl.py", line 712, in _eventLoop
    self._handleFn(timestamp, reply)
  File "../../TorCtl/TorCtl.py", line 1337, in _handle1
    self._map1.get(event.event_name, self.unknown_event)(event)
  File "../../TorCtl/PathSupport.py", line 1838, in new_desc_event
    if TorCtl.ConsensusTracker.new_desc_event(self, d):
  File "../../TorCtl/TorCtl.py", line 1693, in new_desc_event
    r = self.c.read_routers(ns)
  File "../../TorCtl/TorCtl.py", line 1073, in read_routers
    r = self.get_router(ns)
  File "../../TorCtl/TorCtl.py", line 1054, in get_router
    desc = self.sendAndRecv("GETINFO desc/id/" + ns.idhex + "\r\n")[0][2]
  File "../../TorCtl/TorCtl.py", line 862, in sendAndRecv
    lines = self._sendImpl(self._doSend, msg)
  File "../../TorCtl/TorCtl.py", line 729, in _sendImpl
    raise self._closedEx
TorCtlClosed

and that was the end of the log. The Tor process is still going, and scanners 1 and 2 are still plugging along.

comment:20 in reply to:  7 ; Changed 8 years ago by aagbsn

Unfortunately the fix (call TorCtl.Connection.close() in the atexit handler, see: github link above) does not work 100% of the time. Here is a traceback:

DEBUG[Fri Sep 02 06:34:28 2011]:0.622465133667 CIRC 2396 CLOSED $0C8B86A3FB22EF3E92ABEEA80ADF879A26B59BA8,$6620BECAFC3AB3E2A255A975239F6EF4EC542F16 REASON=REQUESTED
DEBUG[Fri Sep 02 06:34:28 2011]:Close refcount 0 for 0C8B86A3FB22EF3E92ABEEA80ADF879A26B59BA8
DEBUG[Fri Sep 02 06:34:28 2011]:Close refcount 0 for 6620BECAFC3AB3E2A255A975239F6EF4EC542F16
DEBUG[Fri Sep 02 06:34:28 2011]:1314945267.4 0.620501041412 CIRC 2402 CLOSED $7F4C602A9E0498F244AAA002783E7B8DEB4875EF,$FCC3151A288A58AAD8B6D81E2FB3244DE6526375 REASON=REQUESTED
DEBUG[Fri Sep 02 06:34:28 2011]:0.638204097748 CIRC 2402 CLOSED $7F4C602A9E0498F244AAA002783E7B8DEB4875EF,$FCC3151A288A58AAD8B6D81E2FB3244DE6526375 REASON=REQUESTED
DEBUG[Fri Sep 02 06:34:28 2011]:Close refcount 0 for 7F4C602A9E0498F244AAA002783E7B8DEB4875EF
DEBUG[Fri Sep 02 06:34:28 2011]:Close refcount 0 for FCC3151A288A58AAD8B6D81E2FB3244DE6526375
DEBUG[Fri Sep 02 06:36:50 2011]:Ignoring stream: 3047
DEBUG[Fri Sep 02 06:36:51 2011]:Wrote stats
NOTICE[Fri Sep 02 06:36:51 2011]:Tor closed control connection. Exiting event thread.
DEBUG[Fri Sep 02 06:36:51 2011]:Ignoring stream: 3047
DEBUG[Fri Sep 02 06:36:51 2011]:Ignoring stream: 3047
DEBUG[Fri Sep 02 06:36:52 2011]:Deleting ignored stream: 3045

...snip...

DEBUG[Fri Sep 02 06:36:54 2011]:78.6796159744 CIRC 2417 EXTENDED $AC92AA2248170F4DA7B886932CD430AB0945D2B1
DEBUG[Fri Sep 02 06:36:54 2011]:Ignoring circ 2417
DEBUG[Fri Sep 02 06:36:54 2011]:78.6807179451 CIRC 2417 BUILT $AC92AA2248170F4DA7B886932CD430AB0945D2B1
DEBUG[Fri Sep 02 06:36:54 2011]:Ignoring circ 2417
DEBUG[Fri Sep 02 06:36:54 2011]:Ignoring stream: 3066
DEBUG[Fri Sep 02 06:36:55 2011]:Ignoring stream: 3066
DEBUG[Fri Sep 02 06:36:55 2011]:Deleting ignored stream: 3066
WARN[Fri Sep 02 06:36:55 2011]:No event for: 650 NEWCONSENSUS
Traceback (most recent call last):
  File "../../TorCtl/TorCtl.py", line 712, in _eventLoop
    self._handleFn(timestamp, reply)
  File "../../TorCtl/TorCtl.py", line 1337, in _handle1
    self._map1.get(event.event_name, self.unknown_event)(event)
  File "../../TorCtl/PathSupport.py", line 1834, in new_consensus_event
    TorCtl.ConsensusTracker.new_consensus_event(self, n)
  File "../../TorCtl/TorCtl.py", line 1679, in new_consensus_event
    self._read_routers(self.ns_map.values())
  File "../../TorCtl/TorCtl.py", line 1597, in _read_routers
    routers = self.c.read_routers(nslist) # Sets .down if 3,4,5
  File "../../TorCtl/TorCtl.py", line 1073, in read_routers
    r = self.get_router(ns)
  File "../../TorCtl/TorCtl.py", line 1054, in get_router
    desc = self.sendAndRecv("GETINFO desc/id/" + ns.idhex + "\r\n")[0][2]
  File "../../TorCtl/TorCtl.py", line 862, in sendAndRecv
    lines = self._sendImpl(self._doSend, msg)
  File "../../TorCtl/TorCtl.py", line 729, in _sendImpl
    raise self._closedEx
TorCtlClosed

This seems like it may be related to #1329; atagar may have some insight.

Replying to aagbsn:

This is caused by the main thread exiting before the eventLoop thread; because we weren't exiting cleanly. I have a fix for this in my testing branch that just needs_review before we can merge it.

https://github.com/aagbsn/torflow/commit/09f232bffa3cf51db21c53c81e544ad977bcb335

Replying to arma:

Scanner 4 has a traceback:

Exception in thread EventLoop (most likely raised during interpreter shutdown):
Traceback (most recent call last):
  File "/usr/lib/python2.5/threading.py", line 486, in __bootstrap_inner
  File "/usr/lib/python2.5/threading.py", line 446, in run
  File "../../TorCtl/TorCtl.py", line 715, in _eventLoop
  File "../../TorCtl/TorUtil.py", line 360, in plog
  File "/usr/lib/python2.5/logging/__init__.py", line 1056, in log
  File "/usr/lib/python2.5/logging/__init__.py", line 1100, in _log
  File "/usr/lib/python2.5/logging/__init__.py", line 1080, in makeRecord
  File "/usr/lib/python2.5/logging/__init__.py", line 258, in __init__
<type 'exceptions.TypeError'>: 'NoneType' object is not callable
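
The TorCtlClosed tracebacks above all fire inside the EventLoop thread after Tor closes the control connection, and when the exception escapes mid-handler the child wedges instead of exiting. A modern-Python sketch of catching the closed-connection exception at the loop level so the thread stops dispatching cleanly; the class and function names are illustrative, not TorCtl's actual structure:

```python
import queue  # the 2.5-era TorCtl code used the Queue module

class TorCtlClosed(Exception):
    pass

def event_loop(events, handle):
    # Dispatch events until the connection is reported closed. Catching
    # TorCtlClosed here, rather than letting it escape the thread, lets
    # the loop exit instead of leaving the child wedged mid-handler.
    handled = []
    while True:
        ev = events.get()
        if ev is None:
            break
        try:
            handled.append(handle(ev))
        except TorCtlClosed:
            # Tor closed the control port under us; stop dispatching.
            break
    return handled

q = queue.Queue()
for ev in ["NEWDESC", "NEWCONSENSUS", "CIRC"]:
    q.put(ev)
q.put(None)

def handler(ev):
    if ev == "NEWCONSENSUS":
        raise TorCtlClosed()  # e.g. a GETINFO on the closed connection
    return ev

result = event_loop(q, handler)
print(result)  # -> ['NEWDESC']
```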

comment:21 in reply to:  20 ; Changed 8 years ago by aagbsn

Clarification: the scanner freezes after this Traceback and the child process does not exit.

Replying to aagbsn:

Unfortunately the fix (call TorCtl.Connection.close() in the atexit handler, see: github link above) does not work 100% of the time. Here is a traceback:

DEBUG[Fri Sep 02 06:34:28 2011]:0.622465133667 CIRC 2396 CLOSED $0C8B86A3FB22EF3E92ABEEA80ADF879A26B59BA8,$6620BECAFC3AB3E2A255A975239F6EF4EC542F16 REASON=REQUESTED
DEBUG[Fri Sep 02 06:34:28 2011]:Close refcount 0 for 0C8B86A3FB22EF3E92ABEEA80ADF879A26B59BA8
DEBUG[Fri Sep 02 06:34:28 2011]:Close refcount 0 for 6620BECAFC3AB3E2A255A975239F6EF4EC542F16
DEBUG[Fri Sep 02 06:34:28 2011]:1314945267.4 0.620501041412 CIRC 2402 CLOSED $7F4C602A9E0498F244AAA002783E7B8DEB4875EF,$FCC3151A288A58AAD8B6D81E2FB3244DE6526375 REASON=REQUESTED
DEBUG[Fri Sep 02 06:34:28 2011]:0.638204097748 CIRC 2402 CLOSED $7F4C602A9E0498F244AAA002783E7B8DEB4875EF,$FCC3151A288A58AAD8B6D81E2FB3244DE6526375 REASON=REQUESTED
DEBUG[Fri Sep 02 06:34:28 2011]:Close refcount 0 for 7F4C602A9E0498F244AAA002783E7B8DEB4875EF
DEBUG[Fri Sep 02 06:34:28 2011]:Close refcount 0 for FCC3151A288A58AAD8B6D81E2FB3244DE6526375
DEBUG[Fri Sep 02 06:36:50 2011]:Ignoring stream: 3047
DEBUG[Fri Sep 02 06:36:51 2011]:Wrote stats
NOTICE[Fri Sep 02 06:36:51 2011]:Tor closed control connection. Exiting event thread.
DEBUG[Fri Sep 02 06:36:51 2011]:Ignoring stream: 3047
DEBUG[Fri Sep 02 06:36:51 2011]:Ignoring stream: 3047
DEBUG[Fri Sep 02 06:36:52 2011]:Deleting ignored stream: 3045

...snip...

DEBUG[Fri Sep 02 06:36:54 2011]:78.6796159744 CIRC 2417 EXTENDED $AC92AA2248170F4DA7B886932CD430AB0945D2B1
DEBUG[Fri Sep 02 06:36:54 2011]:Ignoring circ 2417
DEBUG[Fri Sep 02 06:36:54 2011]:78.6807179451 CIRC 2417 BUILT $AC92AA2248170F4DA7B886932CD430AB0945D2B1
DEBUG[Fri Sep 02 06:36:54 2011]:Ignoring circ 2417
DEBUG[Fri Sep 02 06:36:54 2011]:Ignoring stream: 3066
DEBUG[Fri Sep 02 06:36:55 2011]:Ignoring stream: 3066
DEBUG[Fri Sep 02 06:36:55 2011]:Deleting ignored stream: 3066
WARN[Fri Sep 02 06:36:55 2011]:No event for: 650 NEWCONSENSUS
Traceback (most recent call last):
  File "../../TorCtl/TorCtl.py", line 712, in _eventLoop
    self._handleFn(timestamp, reply)
  File "../../TorCtl/TorCtl.py", line 1337, in _handle1
    self._map1.get(event.event_name, self.unknown_event)(event)
  File "../../TorCtl/PathSupport.py", line 1834, in new_consensus_event
    TorCtl.ConsensusTracker.new_consensus_event(self, n)
  File "../../TorCtl/TorCtl.py", line 1679, in new_consensus_event
    self._read_routers(self.ns_map.values())
  File "../../TorCtl/TorCtl.py", line 1597, in _read_routers
    routers = self.c.read_routers(nslist) # Sets .down if 3,4,5
  File "../../TorCtl/TorCtl.py", line 1073, in read_routers
    r = self.get_router(ns)
  File "../../TorCtl/TorCtl.py", line 1054, in get_router
    desc = self.sendAndRecv("GETINFO desc/id/" + ns.idhex + "\r\n")[0][2]
  File "../../TorCtl/TorCtl.py", line 862, in sendAndRecv
    lines = self._sendImpl(self._doSend, msg)
  File "../../TorCtl/TorCtl.py", line 729, in _sendImpl
    raise self._closedEx
TorCtlClosed

This seems like it may be related to #1329; perhaps atagar may have some insight.

Replying to aagbsn:

This is caused by the main thread exiting before the eventLoop thread; because we weren't exiting cleanly. I have a fix for this in my testing branch that just needs_review before we can merge it.

https://github.com/aagbsn/torflow/commit/09f232bffa3cf51db21c53c81e544ad977bcb335

Replying to arma:

Scanner 4 has a traceback:

Exception in thread EventLoop (most likely raised during interpreter shutdown):
Traceback (most recent call last):
  File "/usr/lib/python2.5/threading.py", line 486, in __bootstrap_inner
  File "/usr/lib/python2.5/threading.py", line 446, in run
  File "../../TorCtl/TorCtl.py", line 715, in _eventLoop
  File "../../TorCtl/TorUtil.py", line 360, in plog
  File "/usr/lib/python2.5/logging/__init__.py", line 1056, in log
  File "/usr/lib/python2.5/logging/__init__.py", line 1100, in _log
  File "/usr/lib/python2.5/logging/__init__.py", line 1080, in makeRecord
  File "/usr/lib/python2.5/logging/__init__.py", line 258, in __init__
<type 'exceptions.TypeError'>: 'NoneType' object is not callable

comment:22 in reply to:  18 Changed 8 years ago by aagbsn

Some good news:
I ran into the freezing problem using the SQLite backend, and now we have a little bit more information:

DEBUG[Fri Sep 02 11:45:30 2011]:0.00586295127869 STREAM 75 SENTCONNECT 28 38.229.70.2 443
ERROR[Fri Sep 02 11:45:31 2011]:An unknown HTTP error occured
Traceback (most recent call last):
  File "bwauthority_child.py", line 125, in http_request
    reply = urllib2.urlopen(request)
  File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 391, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 409, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1193, in https_open
    return self.do_open(httplib.HTTPSConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1154, in do_open
    h.request(req.get_method(), req.get_selector(), req.data, headers)
  File "/usr/lib/python2.7/httplib.py", line 955, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python2.7/httplib.py", line 989, in _send_request
    self.endheaders(body)
  File "/usr/lib/python2.7/httplib.py", line 951, in endheaders
    self._send_output(message_body)
  File "/usr/lib/python2.7/httplib.py", line 811, in _send_output
    self.send(msg)
  File "/usr/lib/python2.7/httplib.py", line 773, in send
    self.connect()
  File "/usr/lib/python2.7/httplib.py", line 1154, in connect
    self.timeout, self.source_address)
  File "/usr/lib/python2.7/socket.py", line 562, in create_connection
    sock.connect(sa)
  File "../libs/SocksiPy/socks.py", line 369, in connect
    self.__negotiatesocks5(destpair[0],destpair[1])
  File "../libs/SocksiPy/socks.py", line 236, in __negotiatesocks5
    raise Socks5Error((ord(resp[1]),_socks5errors[ord(resp[1])]))
Socks5Error: (1, 'general SOCKS server failure')
DEBUG[Fri Sep 02 11:45:31 2011]:Removed 127.0.0.1:51292 from our local port list
DEBUG[Fri Sep 02 11:45:31 2011]:Got last exit of 778CABE1D95CD5E4094A82C696972DAECA8634A5
DEBUG[Fri Sep 02 11:45:31 2011]:0.0769230769231-0.115384615385% circuit build+fetch failed for ('778CABE1D95CD5E4094A82C696972DAECA8634A5', 'vmi3007')
DEBUG[Fri Sep 02 11:45:31 2011]:Checking if scan count is met...
WARN[Fri Sep 02 11:45:36 2011]:No event for: 650 STREAM 75 FAILED 28 38.229.70.2:443 REASON=END REMOTE_REASON=MISC
Traceback (most recent call last):
  File "../../TorCtl/TorCtl.py", line 717, in _eventLoop
    self._handleFn(timestamp, reply)
  File "../../TorCtl/TorCtl.py", line 1338, in _handle1
    l.listen(event)
  File "../../TorCtl/TorCtl.py", line 1290, in listen
    self._map1.get(event.event_name, self.unknown_event)(event)
  File "../../TorCtl/SQLSupport.py", line 1086, in stream_status_event
    strm.id).execute(row_type='failedstream')
  File "/home/aagbsn/.virtualenvs/torflow/lib/python2.7/site-packages/sqlalchemy/sql/expression.py", line 2660, in execute
    return e._execute_clauseelement(self, multiparams, params)
  File "/home/aagbsn/.virtualenvs/torflow/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 2292, in _execute_clauseelement
    return connection._execute_clauseelement(elem, multiparams, params)
  File "/home/aagbsn/.virtualenvs/torflow/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1532, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/home/aagbsn/.virtualenvs/torflow/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1640, in _execute_context
    context)
  File "/home/aagbsn/.virtualenvs/torflow/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1633, in _execute_context
    context)
  File "/home/aagbsn/.virtualenvs/torflow/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 325, in do_execute
    cursor.execute(statement, parameters)
OperationalError: (OperationalError) database is locked u'UPDATE stream SET row_type=? WHERE stream.id = ?' ('failedstream', 1)

My hypothesis is that some sqlalchemy sessions are not closed (by session.commit() or session.remove()), though most of the time this doesn't fail because a commit happens elsewhere. When http_request fails (as in the case of a Socks5Error), the commit doesn't happen, the session is left open holding its transaction, and the next unlucky query hits the lock. With SQLite, which doesn't support concurrent writers, that query fails fast with "database is locked"; with postgres the second query just blocks waiting on the lock, so it hangs forever...
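The SQLite side of this failure mode can be reproduced with a minimal stdlib sqlite3 sketch (the table and column names mirror the log above, but the code is purely illustrative): one connection holds an uncommitted write transaction, standing in for a leaked session, and a second writer times out instead of waiting forever.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "test.db")
writer = sqlite3.connect(path, timeout=0.2)  # the leaked session
other = sqlite3.connect(path, timeout=0.2)   # the next unlucky query

writer.execute("CREATE TABLE stream (id INTEGER, row_type TEXT)")
writer.execute("INSERT INTO stream VALUES (1, 'stream')")
writer.commit()

# The leaked session starts a write transaction and never commits:
writer.execute("UPDATE stream SET row_type='failedstream' WHERE id=1")

err = None
try:
    other.execute("UPDATE stream SET row_type='closedstream' WHERE id=1")
except sqlite3.OperationalError as exc:
    err = str(exc)

print(err)  # database is locked
```

With postgres the same situation produces no error at all: the second UPDATE simply blocks on the row lock, which matches the observed silent hang.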

I noticed the session oddities while working on #2947, and refactored SQLSupport.py to use local sessions that are closed after each request. See:
https://github.com/aagbsn/torctl/commit/4cc3605c969cebb0b1880c1257f68e63ecb6064f
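The local-session approach can be sketched generically (this is an illustration of the pattern, not the actual SQLSupport.py code; `Session` here is a stand-in, not SQLAlchemy's class): each event handler gets its own session, and a `finally` guarantees it is closed even when the HTTP request raises, so no transaction outlives the event that opened it.

```python
import contextlib


class Session:
    """Stand-in for a SQLAlchemy session: tracks how many are left open."""
    open_sessions = 0

    def __init__(self):
        Session.open_sessions += 1

    def commit(self):
        pass

    def close(self):
        Session.open_sessions -= 1


@contextlib.contextmanager
def local_session():
    """A session scoped to one event; always closed, even on error."""
    s = Session()
    try:
        yield s
        s.commit()
    finally:
        s.close()


def handle_stream_event(fail=False):
    with local_session() as s:
        if fail:
            # Stand-in for http_request raising Socks5Error mid-handler.
            raise IOError("general SOCKS server failure")
        # ... update stream rows via s here ...


handle_stream_event()
try:
    handle_stream_event(fail=True)
except IOError:
    pass

# Even the failing handler released its session, so nothing holds a lock:
print(Session.open_sessions)  # 0
```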

mikeperry and I have been testing these changes with the postgres backend and have not encountered arma's freezing issue.

I was able to replicate the issue reliably by setting nodes_per_slice and circs_per_node to 1 in bwauthority.cfg. With the sqlite backend, bwauthority exits and is restarted. Over a 30-minute period I logged 3 crashes; a second run logged 5 crashes. After switching TorCtl branches and running the same test, I was unable to reproduce the error after 45 minutes.

--Aaron

Replying to arma:

Scanner 4 got another traceback today. Looks like the output is interleaved with some other thread:

[...]
DEBUG[Thu Sep 01 10:49:07 2011]:Got stream bw: 3167
DEBUG[Thu Sep 01 10:49:07 2011]:0.102648973465 STREAM_BW 3167 319 2362
ERROR[Thu Sep 01 10:49:07 2011]:An unknown HTTP error occured
Traceback (most recent call last):
  File "bwauthority_child.py", line 125, in http_request
    reply = urllib2.urlopen(request)
  File "/usr/lib/python2.5/urllib2.py", line 124, in urlopen
DEBUG[Thu Sep 01 10:49:07 2011]:1314888547.54 0.00143194198608 CIRC 2066 CLOSED $BC789B6057C5C71F2F85B3DA202B52EE94C4FF0C,$640AE17D46C7374F57689CD60E508685FB5BB4B5 REASON=DESTROYED REMOTE_REASON=OR_CONN_CLOSED
    return _opener.open(url, data)
  File "/usr/lib/python2.5/urllib2.py", line 381, in open
    response = self._open(req, data)
  File "/usr/lib/python2.5/urllib2.py", line 399, in _open
    '_open', req)
  File "/usr/lib/python2.5/urllib2.py", line 360, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.5/urllib2.py", line 1115, in https_open
    return self.do_open(httplib.HTTPSConnection, req)
  File "/usr/lib/python2.5/urllib2.py", line 1080, in do_open
    r = h.getresponse()
  File "/usr/lib/python2.5/httplib.py", line 928, in getresponse
DEBUG[Thu Sep 01 10:49:07 2011]:0.143916845322 CIRC 2066 CLOSED $BC789B6057C5C71F2F85B3DA202B52EE94C4FF0C,$640AE17D46C7374F57689CD60E508685FB5BB4B5 REASON=DESTROYED REMOTE_REASON=OR_CONN_CLOSED
DEBUG[Thu Sep 01 10:49:07 2011]:Close refcount 0 for BC789B6057C5C71F2F85B3DA202B52EE94C4FF0C
DEBUG[Thu Sep 01 10:49:07 2011]:Close refcount 1 for 640AE17D46C7374F57689CD60E508685FB5BB4B5
    response.begin()
  File "/usr/lib/python2.5/httplib.py", line 385, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.5/httplib.py", line 349, in _read_status
    raise BadStatusLine(line)
BadStatusLine
DEBUG[Thu Sep 01 10:49:07 2011]:Removed 127.0.0.1:60353 from our local port list
DEBUG[Thu Sep 01 10:49:07 2011]:Got last exit of 640AE17D46C7374F57689CD60E508685FB5BB4B5
DEBUG[Thu Sep 01 10:49:07 2011]:71.9952019192-73.9944022391% circuit build+fetch failed for ('640AE17D46C7374F57689CD60E508685FB5BB4B5', 'Unnamed')
DEBUG[Thu Sep 01 10:49:07 2011]:Checking if scan count is met...
DEBUG[Thu Sep 01 10:49:07 2011]:Got stream bw: 3167
[...]

Changed 8 years ago by aagbsn

Attachment: err.1.log.gz added

crashes

Changed 8 years ago by aagbsn

Attachment: err.2.log.gz added

crashes from scanner2

Changed 8 years ago by aagbsn

Attachment: err.3.log.gz added

crashes from scanner1, trial 2

comment:23 in reply to:  21 ; Changed 8 years ago by aagbsn

I noticed this comment in TorCtl.py:

# eventThread provides a more reliable indication of when we are done.
# The _loop thread won't always die when self.close() is called.

ah.

I removed the call to self._thread.join(), where _thread is the _loop thread. I'm not sure whether this will have any other side effects.
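The hazard that motivates removing the join can be illustrated with a generic sketch (this is not TorCtl code): an unbounded `join()` on a loop thread that never exits wedges shutdown forever, whereas a bounded join, or marking the thread as a daemon, lets the process get out.

```python
import threading
import time

stop = threading.Event()


def event_loop():
    # Stand-in for a _loop thread that may never notice close():
    while not stop.is_set():
        time.sleep(0.05)


t = threading.Thread(target=event_loop)
t.daemon = True  # daemon threads don't block interpreter exit
t.start()

# A bounded join returns even though the loop is still running,
# instead of hanging the way t.join() with no timeout would:
t.join(timeout=0.1)
print(t.is_alive())  # True

stop.set()
t.join(timeout=2.0)
print(t.is_alive())  # False
```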

Replying to aagbsn:

Clarification: the scanner freezes after this Traceback and the child process does not exit.

Replying to aagbsn:

Unfortunately the fix (call TorCtl.Connection.close() in the atexit handler, see: github link above) does not work 100% of the time. Here is a traceback:

DEBUG[Fri Sep 02 06:34:28 2011]:0.622465133667 CIRC 2396 CLOSED $0C8B86A3FB22EF3E92ABEEA80ADF879A26B59BA8,$6620BECAFC3AB3E2A255A975239F6EF4EC542F16 REASON=REQUESTED
DEBUG[Fri Sep 02 06:34:28 2011]:Close refcount 0 for 0C8B86A3FB22EF3E92ABEEA80ADF879A26B59BA8
DEBUG[Fri Sep 02 06:34:28 2011]:Close refcount 0 for 6620BECAFC3AB3E2A255A975239F6EF4EC542F16
DEBUG[Fri Sep 02 06:34:28 2011]:1314945267.4 0.620501041412 CIRC 2402 CLOSED $7F4C602A9E0498F244AAA002783E7B8DEB4875EF,$FCC3151A288A58AAD8B6D81E2FB3244DE6526375 REASON=REQUESTED
DEBUG[Fri Sep 02 06:34:28 2011]:0.638204097748 CIRC 2402 CLOSED $7F4C602A9E0498F244AAA002783E7B8DEB4875EF,$FCC3151A288A58AAD8B6D81E2FB3244DE6526375 REASON=REQUESTED
DEBUG[Fri Sep 02 06:34:28 2011]:Close refcount 0 for 7F4C602A9E0498F244AAA002783E7B8DEB4875EF
DEBUG[Fri Sep 02 06:34:28 2011]:Close refcount 0 for FCC3151A288A58AAD8B6D81E2FB3244DE6526375
DEBUG[Fri Sep 02 06:36:50 2011]:Ignoring stream: 3047
DEBUG[Fri Sep 02 06:36:51 2011]:Wrote stats
NOTICE[Fri Sep 02 06:36:51 2011]:Tor closed control connection. Exiting event thread.
DEBUG[Fri Sep 02 06:36:51 2011]:Ignoring stream: 3047
DEBUG[Fri Sep 02 06:36:51 2011]:Ignoring stream: 3047
DEBUG[Fri Sep 02 06:36:52 2011]:Deleting ignored stream: 3045

...snip...

DEBUG[Fri Sep 02 06:36:54 2011]:78.6796159744 CIRC 2417 EXTENDED $AC92AA2248170F4DA7B886932CD430AB0945D2B1
DEBUG[Fri Sep 02 06:36:54 2011]:Ignoring circ 2417
DEBUG[Fri Sep 02 06:36:54 2011]:78.6807179451 CIRC 2417 BUILT $AC92AA2248170F4DA7B886932CD430AB0945D2B1
DEBUG[Fri Sep 02 06:36:54 2011]:Ignoring circ 2417
DEBUG[Fri Sep 02 06:36:54 2011]:Ignoring stream: 3066
DEBUG[Fri Sep 02 06:36:55 2011]:Ignoring stream: 3066
DEBUG[Fri Sep 02 06:36:55 2011]:Deleting ignored stream: 3066
WARN[Fri Sep 02 06:36:55 2011]:No event for: 650 NEWCONSENSUS
Traceback (most recent call last):
  File "../../TorCtl/TorCtl.py", line 712, in _eventLoop
    self._handleFn(timestamp, reply)
  File "../../TorCtl/TorCtl.py", line 1337, in _handle1
    self._map1.get(event.event_name, self.unknown_event)(event)
  File "../../TorCtl/PathSupport.py", line 1834, in new_consensus_event
    TorCtl.ConsensusTracker.new_consensus_event(self, n)
  File "../../TorCtl/TorCtl.py", line 1679, in new_consensus_event
    self._read_routers(self.ns_map.values())
  File "../../TorCtl/TorCtl.py", line 1597, in _read_routers
    routers = self.c.read_routers(nslist) # Sets .down if 3,4,5
  File "../../TorCtl/TorCtl.py", line 1073, in read_routers
    r = self.get_router(ns)
  File "../../TorCtl/TorCtl.py", line 1054, in get_router
    desc = self.sendAndRecv("GETINFO desc/id/" + ns.idhex + "\r\n")[0][2]
  File "../../TorCtl/TorCtl.py", line 862, in sendAndRecv
    lines = self._sendImpl(self._doSend, msg)
  File "../../TorCtl/TorCtl.py", line 729, in _sendImpl
    raise self._closedEx
TorCtlClosed

This seems like it may be related to #1329; perhaps atagar has some insight.

Replying to aagbsn:

This is caused by the main thread exiting before the eventLoop thread, because we weren't exiting cleanly. I have a fix for this in my testing branch that just needs_review before we can merge it.

https://github.com/aagbsn/torflow/commit/09f232bffa3cf51db21c53c81e544ad977bcb335
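The shape of that fix can be sketched generically (illustrative names, not the actual torflow code): the main thread explicitly signals and joins the event thread before exiting, so the event loop never outlives interpreter state it depends on, which is what produces the `'NoneType' object is not callable` logging errors during shutdown.

```python
import threading


class Connection:
    """Stand-in for a TorCtl-style connection with an event thread."""

    def __init__(self):
        self.closed = False
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._event_loop)

    def launch_thread(self):
        self._thread.start()

    def _event_loop(self):
        # Stand-in for the event loop: runs until close() is called.
        self._stop.wait()

    def close(self):
        self._stop.set()
        self._thread.join()  # event thread is gone before main exits
        self.closed = True


conn = Connection()
conn.launch_thread()
try:
    pass  # ... scanner work ...
finally:
    conn.close()  # clean exit: stop and join before the main thread returns
```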

Replying to arma:

Scanner 4 has a traceback:

Exception in thread EventLoop (most likely raised during interpreter shutdown):
Traceback (most recent call last):
  File "/usr/lib/python2.5/threading.py", line 486, in __bootstrap_inner
  File "/usr/lib/python2.5/threading.py", line 446, in run
  File "../../TorCtl/TorCtl.py", line 715, in _eventLoop
  File "../../TorCtl/TorUtil.py", line 360, in plog
  File "/usr/lib/python2.5/logging/__init__.py", line 1056, in log
  File "/usr/lib/python2.5/logging/__init__.py", line 1100, in _log
  File "/usr/lib/python2.5/logging/__init__.py", line 1080, in makeRecord
  File "/usr/lib/python2.5/logging/__init__.py", line 258, in __init__
<type 'exceptions.TypeError'>: 'NoneType' object is not callable

comment:24 Changed 8 years ago by arma

I restarted all the scanners on moria, since a third one had wedged.

comment:25 in reply to:  23 ; Changed 8 years ago by aagbsn

This didn't fix it. The fix for #3798 should be reverted until TorCtl.Connection.close() is fixed.


comment:26 in reply to:  25 Changed 8 years ago by aagbsn

Replying to aagbsn:

This didn't fix it. The fix for #3798 should be reverted until TorCtl.Connection.close() is fixed.

Done. Also the local session changes have been merged into TorCtl master.

arma: these fixes should resolve the scanner freezing.

comment:27 Changed 8 years ago by arma

bwauth@moria:~/work/torflow.git/NetworkScanners/BwAuthority/data$ ls -la scanner.?/bw.log
-rw-r--r-- 1 bwauth bwauth 214679496 2011-09-14 18:40 scanner.1/bw.log
-rw-r--r-- 1 bwauth bwauth 280650155 2011-09-14 18:40 scanner.2/bw.log
-rw-r--r-- 1 bwauth bwauth 316280175 2011-09-14 18:40 scanner.3/bw.log
-rw-r--r-- 1 bwauth bwauth 344261604 2011-09-14 18:40 scanner.4/bw.log

Looks very promising. I think everything is running smoothly. The IO load is still considerable, and I haven't decided whether I should keep it on moria at this level, but it looks like it is working as intended. Woo!

comment:28 Changed 8 years ago by arma

For the record, I've been running:

INFO[Tue Sep 06 21:33:44 2011]:TorFlow Version: master 5736529160df12435fb39d11f81a5da92f74d615
NOTICE[Tue Sep 06 21:33:44 2011]:Git Repo at ../../TorCtl/.git/ Not Found

comment:29 in reply to:  28 Changed 8 years ago by aagbsn

Resolution: fixed
Status: new → closed

Replying to arma:

For the record, I've been running:

INFO[Tue Sep 06 21:33:44 2011]:TorFlow Version: master 5736529160df12435fb39d11f81a5da92f74d615
NOTICE[Tue Sep 06 21:33:44 2011]:Git Repo at ../../TorCtl/.git/ Not Found

new ticket opened for the NOTICE at: #4024

Note: See TracTickets for help on using tickets.