Opened 7 years ago

Closed 7 months ago

#4015 closed defect (user disappeared)

BwAuthority postgres scanner freezes with changes from #3679 TorCtl Event Parsing Rewrite

Reported by: aagbsn Owned by: aagbsn
Priority: High Milestone:
Component: Core Tor/Torflow Version:
Severity: Blocker Keywords:
Cc: Actual Points:
Parent ID: #13630 Points:
Reviewer: Sponsor:

Description

BwAuthority, postgres, and the changes in #3679 provide the perfect storm for more BwAuthority freezes.

Initially everything works fine, but 2 days later 2 of my scanners had halted. I restarted them and another froze in about a day.

This looks similar to the problem arma raises in #3834, where the scanner freezes without WARN, ERROR, so it may be db locking related:

INFO[Tue Sep 13 16:16:38 2011]:Waiting on consensus result: False
DEBUG[Tue Sep 13 16:16:38 2011]:Deleting ignored stream: 12028
INFO[Tue Sep 13 16:16:38 2011]:Newdesc timer is up. Assuming we have full consensus
INFO[Tue Sep 13 16:16:38 2011]:Consensus change... Updating rank history

This problem appears to have been introduced or exposed with these changes. The bwauth VM and my testing vm have not frozen with the old Event Parsing code.

Child Tickets

Change History (13)

comment:1 Changed 7 years ago by mikeperry

What is the status of this? Have we tried re-testing?

Has it still not appeared anywhere else?

comment:2 Changed 7 years ago by mikeperry

Thoughts on ways to force a traceback/logs out of this piece of shit:

kill -SIGINT child_pid # To cause a KeyboardInterrupt traceback in a random python thread
kill -SIGKILL tor_pid # To generate a traceback inside the read loop due to closed control port
kill -SIGHUP tor_pid # To cause tor to send some random control port events during re-init

Any others we can try?

comment:3 Changed 7 years ago by mikeperry

Priority: normalmajor

Bumping this to major because we've seen these types of logless crashes before. If the #3679 patch actually does make them more prevalent, we should probably use the occasion to try to find out why they are happening.

comment:4 in reply to:  2 Changed 7 years ago by mikeperry

Replying to mikeperry:

Thoughts on ways to force a traceback/logs out of this piece of shit:

kill -SIGINT child_pid # To cause a KeyboardInterrupt traceback in a random python thread
kill -SIGKILL tor_pid # To generate a traceback inside the read loop due to closed control port
kill -SIGHUP tor_pid # To cause tor to send some random control port events during re-init

Any others we can try?

ps -eLf will give us the thread ids of the bw scanner children. We can try sending SIGINT, SIGALRM.

We could also try attaching gdb and running "thread apply all backtrace" to get the actual python backtraces. Maybe those will yield clues.

comment:5 Changed 7 years ago by aagbsn

I haven't see one freeze up solid yet, but they do hang seems to take forever updating.

DEBUG[Wed Oct 05 17:22:54 2011]:54554.944793 Read 1 ND => 446 routers. Update: True
DEBUG[Wed Oct 05 17:22:54 2011]:Rebuilding ExactUniformGenerator
DEBUG[Wed Oct 05 17:22:54 2011]:Rebuilding ExactUniformGenerator
DEBUG[Wed Oct 05 17:22:54 2011]:Rebuilding ExactUniformGenerator
INFO[Wed Oct 05 17:22:54 2011]:Consensus change... Updating db
$ date
Wed Oct  5 18:15:47 PDT 2011

comment:6 Changed 7 years ago by aagbsn

No crashes, but my scanners are not making any progress:

(torflow)bwscanner@tor:~/tor/torflow/NetworkScanners/BwAuthority$ grep "slice number" data/scanner.1/bw.log 
DEBUG[Sun Oct 02 17:57:14 2011]:Starting slice number 0
(torflow)bwscanner@tor:~/tor/torflow/NetworkScanners/BwAuthority$ grep "slice number" data/scanner.2/bw.log 
DEBUG[Sun Oct 02 17:58:44 2011]:Starting slice number 0
(torflow)bwscanner@tor:~/tor/torflow/NetworkScanners/BwAuthority$ grep "slice number" data/scanner.3/bw.log 
DEBUG[Sun Oct 02 17:58:13 2011]:Starting slice number 0
(torflow)bwscanner@tor:~/tor/torflow/NetworkScanners/BwAuthority$ grep "slice number" data/scanner.4/bw.log 
DEBUG[Sun Oct 02 17:57:53 2011]:Starting slice number 0
(torflow)bwscanner@tor:~/tor/torflow/NetworkScanners/BwAuthority$ date
Sat Oct  8 14:02:25 PDT 2011

comment:7 Changed 7 years ago by aagbsn

BwAuthority is spending a lot of time waiting to get consensus and not doing much else.

(torflow)bwscanner@tor:~/tor/torflow/NetworkScanners/BwAuthority$ grep "Not enough router descriptors" data/scanner.1/bw.log | wc -l
174777
(torflow)bwscanner@tor:~/tor/torflow/NetworkScanners/BwAuthority$ grep "Not enough router descriptors" data/scanner.2/bw.log | wc -l
138768
(torflow)bwscanner@tor:~/tor/torflow/NetworkScanners/BwAuthority$ grep "Not enough router descriptors" data/scanner.3/bw.log | wc -l
148491
(torflow)bwscanner@tor:~/tor/torflow/NetworkScanners/BwAuthority$ grep "Not enough router descriptors" data/scanner.4/bw.log | wc -l
144698

See these WARNs:

INFO[Sat Oct 08 14:09:33 2011]:Waiting on consensus result: False
INFO[Sat Oct 08 14:09:33 2011]:Router descriptor for 019EF7DBC8D67CCCA755128EF73D118E381E09EC does not match ns fingerprint (NS @ 2011-10-08 00:52:35 vs Desc @ 2011-10-08 18:53:20)
WARN[Sat Oct 08 14:09:33 2011]:No router desc for 019EF7DBC8D67CCCA755128EF73D118E381E09EC after NEWDESC
DEBUG[Sat Oct 08 14:09:33 2011]:68656.1056559 Read 1 ND => 218 routers. Update: False
INFO[Sat Oct 08 14:09:33 2011]:Consensus change... Updating db
INFO[Sat Oct 08 14:09:39 2011]:Consensus db updated
INFO[Sat Oct 08 14:09:39 2011]:Waiting on consensus result: False
INFO[Sat Oct 08 14:09:39 2011]:Router descriptor for 01EE3D06EB598F62A35A30DDB614F7D48F7E5EE4 does not match ns fingerprint (NS @ 2011-10-08 00:55:02 vs Desc @ 2011-10-08 18:55:46)
WARN[Sat Oct 08 14:09:39 2011]:No router desc for 01EE3D06EB598F62A35A30DDB614F7D48F7E5EE4 after NEWDESC
DEBUG[Sat Oct 08 14:09:39 2011]:68662.0935042 Read 1 ND => 218 routers. Update: False
INFO[Sat Oct 08 14:09:39 2011]:Consensus change... Updating db
INFO[Sat Oct 08 14:09:41 2011]:Consensus db updated
INFO[Sat Oct 08 14:09:41 2011]:Waiting on consensus result: False
INFO[Sat Oct 08 14:09:42 2011]:Router descriptor for 09F64E00F34C88F604163F24D37BEAF9245702EA does not match ns fingerprint (NS @ 2011-10-08 01:09:46 vs Desc @ 2011-10-08 19:10:41)
WARN[Sat Oct 08 14:09:42 2011]:No router desc for 09F64E00F34C88F604163F24D37BEAF9245702EA after NEWDESC
DEBUG[Sat Oct 08 14:09:42 2011]:68650.4785399 Read 1 ND => 218 routers. Update: False
INFO[Sat Oct 08 14:09:42 2011]:Consensus change... Updating db
INFO[Sat Oct 08 14:09:43 2011]:Consensus db updated
INFO[Sat Oct 08 14:09:43 2011]:Waiting on consensus result: False
INFO[Sat Oct 08 14:09:43 2011]:Router descriptor for 0F241A5759BC0BA2C14D085B7E614751C82C70F9 does not match ns fingerprint (NS @ 2011-10-08 01:31:59 vs Desc @ 2011-10-08 18:11:42)
WARN[Sat Oct 08 14:09:43 2011]:No router desc for 0F241A5759BC0BA2C14D085B7E614751C82C70F9 after NEWDESC
DEBUG[Sat Oct 08 14:09:43 2011]:68652.445874 Read 1 ND => 218 routers. Update: False
INFO[Sat Oct 08 14:09:44 2011]:Consensus change... Updating db
INFO[Sat Oct 08 14:09:44 2011]:Consensus db updated

comment:8 Changed 7 years ago by aagbsn

I think I found the bug; see #3679

comment:9 in reply to:  8 Changed 7 years ago by aagbsn

Replying to aagbsn:

I think I found the bug; see #3679

I thought wrong. The issue remains.

comment:10 Changed 6 years ago by mikeperry

Owner: changed from mikeperry to aagbsn
Status: newassigned

comment:11 Changed 5 years ago by atagar

Resolution: wontfix
Status: assignedclosed

Ticket #3679 isn't ever gonna happen. Resolving.

comment:12 Changed 3 years ago by ln5

Parent ID: #3679
Resolution: wontfix
Status: closedreopened

This issue isn't resolved just because TorCtl didn't get fixed.

comment:13 Changed 7 months ago by teor

Parent ID: #13630
Resolution: user disappeared
Severity: Blocker
Status: reopenedclosed

This is either no longer an issue, or we'll fix it by moving away from torflow.

Note: See TracTickets for help on using tickets.