Opened 4 years ago

Closed 3 years ago

#9862 closed defect (fixed)

Tor hangs because stem isn't reading its stdout

Reported by: zwol
Owned by: atagar
Priority: High
Milestone:
Component: Core Tor/Stem
Version:
Severity:
Keywords: controller
Cc: phw
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

In order to do careful monitoring of an experiment I'm running over Tor, I wrote a bridge controller that allocates an ORPort to each client as it checks in. Clients come and go; there are only ever about 20 of them at a time, but stale connections don't get cleaned up very quickly, so I wind up having to manage 100+ ORPorts.

It appears that at some critical number of ORPorts -- the logs suggest 160 -- Tor gets stuck halfway through the process of reinitializing itself; crucially, it doesn't ever respond to the SETCONF message, and it also stops processing all other input.

I'm not sure whether this is *supposed* to be a supported configuration -- it's certainly an *abnormal* configuration -- but it seems to me that if there is an upper limit, SETCONF should fail when you hit it, rather than having the relay just stop working.

Debug logs attached.

Child Tickets

Attachments (2)

tor.log.xz (2.6 MB) - added by zwol 4 years ago.
debug-level Tor log
stem.log.xz (6.1 KB) - added by zwol 4 years ago.
STEM trace

Change History (24)

Changed 4 years ago by zwol

debug-level Tor log

Changed 4 years ago by zwol

STEM trace

comment:1 Changed 4 years ago by zwol

The code for the controller in question is here: https://github.com/zackw/tbbscraper/blob/master/controller/controller.py

comment:2 Changed 4 years ago by arma

  • Keywords tor-relay added
  • Milestone set to Tor: 0.2.5.x-final

comment:3 Changed 4 years ago by nickm

I tried starting a Tor instance with 300 ORPorts, and it worked okay. So apparently it's adding a bunch of them in succession, rather than the sheer number, that causes the problem.

The most useful test here would be to attach a debugger to the process and find out what all of its threads are doing, to see if something is blocking.

comment:4 Changed 4 years ago by zwol

This is interesting. I think what we actually have here is a deadlock caused by STEM.

All but one of the threads in the Tor process (there are nine) are blocked in recv and have the same stack trace (with different file descriptors):

#0  0x00007ff31c61c8bc in __libc_recv (fd=fd@entry=214, buf=<optimized out>, 
    n=<optimized out>, flags=<optimized out>, flags@entry=0)
    at ../sysdeps/unix/sysv/linux/x86_64/recv.c:34
#1  0x00007ff31d8f163a in recv (__flags=0, __n=552, __buf=0x7ff30f7f5940, 
    __fd=214) at /usr/include/x86_64-linux-gnu/bits/socket2.h:45
#2  read_all (fd=fd@entry=214, buf=buf@entry=0x7ff30f7f5940 "", 
    count=count@entry=552, isSocket=isSocket@entry=1)
    at ../src/common/util.c:1685
#3  0x00007ff31d8a79be in cpuworker_main (data=0x0, data@entry=0x7ff3080f7fd0)
    at ../src/or/cpuworker.c:428
#4  0x00007ff31d8defe8 in tor_pthread_helper_fn (_data=0x0)
    at ../src/common/compat.c:2248
#5  0x00007ff31c32bb50 in start_thread (arg=<optimized out>)
    at pthread_create.c:304
#6  0x00007ff31c61ba7d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112

... so I presume they are all CPU workers waiting for something to do.

The remaining thread has this stack trace:

Program received signal SIGINT, Interrupt.
0x00007ff31c60fb0d in write () at ../sysdeps/unix/syscall-template.S:82
82      in ../sysdeps/unix/syscall-template.S
(gdb) bt
#0  0x00007ff31c60fb0d in write () at ../sysdeps/unix/syscall-template.S:82
#1  0x00007ff31d8f152e in write_all (fd=1, 
    buf=buf@entry=0x7fff49cb7650 "Oct 01 22:15:01.000 [notice] Your Tor server's identity key fingerprint is 'tbbscraperentry 73EFD4FE8D5D2466ECBDCFAC11894A72A322FD3C'\n", count=count@entry=134, isSocket=0, isSocket@entry=1238070893)
    at ../src/common/util.c:1661
#2  0x00007ff31d8e9fae in logv (severity=5, domain=<optimized out>, 
    funcname=0x7ff31d920bb3 "init_keys", suffix=suffix@entry=0x0, 
    format=format@entry=0x7ff31d9205f8 "Your Tor server's identity key fingerprint is '%s %s'", ap=ap@entry=0x7fff49cb9dc0) at ../src/common/log.c:417
#3  0x00007ff31d8ea263 in log_fn_ (severity=severity@entry=5, 
    domain=domain@entry=1, fn=fn@entry=0x7ff31d920bb3 "init_keys", 
    format=format@entry=0x7ff31d9205f8 "Your Tor server's identity key fingerprint is '%s %s'") at ../src/common/log.c:455
#4  0x00007ff31d822d15 in init_keys () at ../src/or/router.c:922
#5  0x00007ff31d881de5 in options_act (old_options=0x7ff308105910)
    at ../src/or/config.c:1396
#6  set_options (new_val=new_val@entry=0x7ff314032e40, 
    msg=msg@entry=0x7fff49cba148) at ../src/or/config.c:596
#7  0x00007ff31d88360d in options_trial_assign (list=0x7ff314032570, 
    use_defaults=use_defaults@entry=0, clear_first=clear_first@entry=1, 
    msg=msg@entry=0x7fff49cba148) at ../src/or/config.c:1865
#8  0x00007ff31d8a042e in control_setconf_helper (
    conn=conn@entry=0x7ff320da7ec0, len=len@entry=3704, body=<optimized out>, 
    body@entry=0x7ff320cfcbd8 "ORPort=\"9000 IPv4Only\" ORPort=\"9001 IPv4Only\" ORPort=\"9002 IPv4Only\" ORPort=\"9003 IPv4Only\" ORPort=\"9004 IPv4Only\" ORPort=\"9005 IPv4Only\" ORPort=\"9006 IPv4Only\" ORPort=\"9007 IPv4Only\" ORPort=\"9008 IPv"..., use_defaults=0) at ../src/or/control.c:762
#9  0x00007ff31d8a3a25 in handle_control_resetconf (body=<optimized out>, 
    len=<optimized out>, conn=<optimized out>) at ../src/or/control.c:809
#10 connection_control_process_inbuf (conn=conn@entry=0x7ff320da7ec0)
    at ../src/or/control.c:3419
#11 0x00007ff31d8874a9 in connection_process_inbuf (
    conn=conn@entry=0x7ff320da7ec0, package_partial=package_partial@entry=1)
    at ../src/or/connection.c:4011
#12 0x00007ff31d88d6dd in connection_handle_read_impl (conn=0x7ff320da7ec0)
    at ../src/or/connection.c:2839
#13 connection_handle_read (conn=conn@entry=0x7ff320da7ec0)
    at ../src/or/connection.c:2880
#14 0x00007ff31d7e9061 in conn_read_callback (fd=<optimized out>, 
    event=<optimized out>, _conn=0x7ff320da7ec0) at ../src/or/main.c:718
#15 0x00007ff31cee1ccc in event_base_loop ()
   from /usr/lib/x86_64-linux-gnu/libevent-2.0.so.5
#16 0x00007ff31d7e99f5 in do_main_loop () at ../src/or/main.c:1992
#17 0x00007ff31d7eb1de in tor_main (argc=7, argv=0x7fff49cba648)
    at ../src/or/main.c:2708
#18 0x00007ff31c55fead in __libc_start_main (main=<optimized out>, 
    argc=<optimized out>, ubp_av=<optimized out>, init=<optimized out>, 
    fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fff49cba638)
    at libc-start.c:228
#19 0x00007ff31d7e5a9d in _start ()

So Tor is stuck trying to write a log message to file descriptor 1, which is a pipe, the other end of which is connected to the controller process, because STEM parses notice-level log output to know when the Tor process is fully spun up. (Tor's stderr is also a pipe connected to STEM.) But STEM is blocked reading from the control socket, and in general doesn't appear to bother reading from *either* its stdout or stderr pipes after stem.process.launch_tor returns. I guess the pipe just happens to fill up at 160 open ORPorts. I can work around this in my code but I think it still qualifies as a bug in STEM.
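
For reference, here is a minimal sketch of the failure mode described above, not specific to Tor or Stem (the chatty child command is made up): a child whose stdout is a pipe the parent never drains will eventually block in write() once the kernel's pipe buffer fills.

import subprocess
import time

# Hypothetical chatty child: writes to its stdout forever.
child = subprocess.Popen(
    ['python', '-c',
     'import sys\nwhile True: sys.stdout.write("x" * 1024)'],
    stdout=subprocess.PIPE,
)

# The parent never reads child.stdout, so once the OS pipe buffer
# (typically 64 KiB on Linux) is full, the child's write() blocks --
# exactly the state Tor is in at frame #1 (write_all) above.
time.sleep(10)
print(child.poll())   # still None: the child is alive but stuck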

comment:5 Changed 4 years ago by atagar

Hi zwol. I've dug through python's subprocess module for over an hour trying to find a method to ignore further stdout content without a special puller thread, but no luck so far. I'll keep thinking about it, but probably won't come up with anything until this weekend.

Cheers! -Damian

PS. Stem is not an acronym, so the library's called "Stem". Not "STEM".

comment:6 Changed 4 years ago by atagar

  • Component changed from Tor to Stem
  • Owner set to atagar

By the way, the launch_tor() function returns the tor process which has the stdout and stderr attributes. You can likely call 'tor_process.communicate()' from a daemon thread to avoid this.
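
A rough sketch of that suggestion (hedged: the config value is a placeholder, and as comment:13 below points out, communicate() buffers everything it reads in memory):

import threading

import stem.process

tor_process = stem.process.launch_tor_with_config(
    config={'SocksPort': '9050'},
)

# communicate() keeps reading tor's stdout/stderr until the process exits,
# so the pipes can never fill up and block tor.
drainer = threading.Thread(target=tor_process.communicate)
drainer.daemon = True   # don't let this thread keep the interpreter alive
drainer.start()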

comment:7 Changed 4 years ago by zwol

You can likely call 'tor_process.communicate()' from a daemon thread to avoid this.

That is approximately the workaround I put into my code. May I suggest that as a first step here, the present need to do something like this should be documented?

I've dug through python's subprocess module for over an hour so far trying to find a method to ignore further stdout content without a special puller thread but no luck so far.

So I recognize that this probably would require changes to Tor as well as Stem, but the fact that Stem needs to parse log messages in the first place seems like the real bug here. What exactly is it doing with them, and can we think of a better technique? Maybe a new command line option that tells Tor to close file descriptor N once it reaches a certain stage of initialization, or something like that?

comment:8 follow-up: Changed 4 years ago by atagar

... but the fact that Stem needs to parse log messages in the first place seems like the real bug here. What exactly is it doing with them

Stem is using them to figure out when tor has bootstrapped enough to reach a usable state. They're also used so we can raise an exception if tor fails before that.

... and can we think of a better technique?

Suggestions welcome. Personally I'm not a fan of closing file descriptor N. Having the python process consume the rest of stdout (even if it's with a null reader by default) is preferable since it's both less hacky and users might want it. The only trick is doing this without requiring a daemon thread.

comment:9 in reply to: ↑ 8 Changed 4 years ago by arma

Replying to atagar:

Stem is using them to figure out when tor has bootstrapped enough to reach a usable state. It's also used so we can raise an exception if it fails before that.

The bootstrap status events (or the getinfo) don't do it for you?

comment:10 Changed 4 years ago by rransom

Perhaps Stem should set Log err stdout once it has connected to Tor's control port and enabled the events that it wants.

comment:11 Changed 4 years ago by atagar

Oh, just thought of something. If you call tor_process.stdout.close() does this still manifest? If not then that's a pretty trivial fix to make in launch_tor().

The bootstrap status events (or the getinfo) don't do it for you?

No for a couple reasons:

  1. It doesn't tell us when we *can* connect to the control port. We could poll instead (a rough sketch of polling is at the end of this comment), but reading stdout is better.
  2. The launch_tor() function simply launches tor, it doesn't require that the user has a control port.

I recall you asking this on an earlier ticket too.
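
For completeness, a hedged sketch of the polling alternative mentioned in point 1 (the port number and timeout are assumptions, not something Stem does today):

import time

import stem
import stem.control

def wait_for_control_port(port=9051, timeout=30):
    # Repeatedly try to open the control port until tor accepts the
    # connection, or give up after the timeout.
    start = time.time()
    while time.time() - start < timeout:
        try:
            controller = stem.control.Controller.from_port(port=port)
            controller.close()
            return True
        except stem.SocketError:
            time.sleep(0.2)
    return False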

comment:12 Changed 4 years ago by atagar

  • Status changed from new to needs_information

Swapping the status (waiting for the answer to "If you call tor_process.stdout.close() does this still manifest?").

comment:13 Changed 4 years ago by zwol

waiting for the answer to "If you call tor_process.stdout.close() does this still manifest?"

I'm sorry - to me, that so obviously won't work that I didn't even register it as a question. If you do that, the Tor process will get a SIGPIPE and crash the very next time it writes to its stdout (resp. stderr).

A dedicated thread calling communicate() is also a bad idea, for a different reason: that will accumulate all of the log messages in memory until the Tor process eventually exits. For my use case, that could be weeks, and megabytes of notices. (Probably not many megabytes, but I'm already struggling with the memory usage of this controller; every little bit helps.)

This is my current workaround:

class PipeDrainer(threading.Thread):
    """Read and discard all data from a pipe.  Exits when the pipe is closed."""
    def __init__(self, fp):
        threading.Thread.__init__(self)
        self.fp = fp 

    def run(self):
        try:
            while len(self.fp.read(8192)) > 0:
                pass
        except:
            pass
        self.fp.close()

# ...
class TorBridge(object):
# ...
    def __enter__(self):
        try:
            # ...
            self.process = stem.process.launch_tor_with_config(config={ ... },
                               take_ownership=True)
            self.stdout_drainer = PipeDrainer(self.process.stdout)
            self.stdout_drainer.start()
            self.stderr_drainer = PipeDrainer(self.process.stderr)
            self.stderr_drainer.start()
            # ...
Last edited 4 years ago by zwol

comment:14 Changed 4 years ago by zwol

  • Status changed from needs_information to new

comment:15 Changed 3 years ago by phw

I ran into the same problem as I'm using Stem and my Tor process generates plenty of log messages. My workaround is:

torCtrl.set_conf("Log", "err file /dev/null")
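
In context, that workaround might look like the following (hedged: the control port and authentication details are assumptions). Once tor's log output no longer goes to stdout, the unread pipe can't fill up:

from stem.control import Controller

with Controller.from_port(port=9051) as controller:
    controller.authenticate()
    # Stop tor from logging anything to stdout for the rest of its lifetime.
    controller.set_conf('Log', 'err file /dev/null')
    # ... rest of the controller logic ...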

comment:16 Changed 3 years ago by phw

  • Cc phw added

comment:17 Changed 3 years ago by atagar

That certainly sounds better than a poller thread, since otherwise the tor process will still hang if it outlives our python script. I'm not thrilled with having our Controller automatically do that after authentication, but that's the best option so far.

comment:18 Changed 3 years ago by nickm

  • Milestone Tor: 0.2.5.x-final deleted

Not a Tor bug; removing from the milestone.

comment:19 Changed 3 years ago by atagar

  • Keywords controller added; tor-relay removed
  • Priority changed from normal to major
  • Summary changed from "Tor hangs if you ask it to open too many ORPorts" to "Tor hangs because stem isn't reading its stdout"
  • Version Tor: 0.2.4.17-rc deleted

I apologize for how long I've let this linger on. This is definitely the most important gotcha I need to fix for our upcoming Stem release. This will be taken care of at some point this month (probably next week).

comment:20 follow-up: Changed 3 years ago by atagar

waiting for the answer to "If you call tor_process.stdout.close() does this still manifest?"

I'm sorry - to me, that so obviously won't work that I didn't even register it as a question. If you do that, the Tor process will get a SIGPIPE and crash the very next time it writes to its stdout (resp. stderr).

Hi zwol. Are you sure about this? I just tried making this tweak...

https://gitweb.torproject.org/user/atagar/stem.git/commitdiff/4f38835

... and Stem's integ tests pass. I'm not sure if this fixes the issue where tor seizes up, but tor seems to gracefully handle the closure of stdout/stderr.

Would someone running into this issue mind giving this change a try?
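
For anyone who wants to test this without patching Stem, the caller-side equivalent of that commit is roughly the following (a hedged sketch; whether tor tolerates the closed pipes is exactly the question raised in the next comment):

import stem.process

tor_process = stem.process.launch_tor_with_config(
    config={'SocksPort': '9050'},
)

# Close our ends of the stdout/stderr pipes so tor's later log writes
# cannot block waiting for us to read them.
if tor_process.stdout:
    tor_process.stdout.close()
if tor_process.stderr:
    tor_process.stderr.close()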

comment:21 in reply to: ↑ 20 Changed 3 years ago by zwol

Replying to atagar:

I'm sorry - to me, that so obviously won't work that I didn't even register it as a question. If you do that, the Tor process will get a SIGPIPE and crash the very next time it writes to its stdout (resp. stderr).

Hi zwol. Are you sure about this? I just tried making this tweak...

https://gitweb.torproject.org/user/atagar/stem.git/commitdiff/4f38835

... and Stem's integ tests pass.

Tor could contain code to gracefully handle its stdout/stderr/other log files being closed out from under it, in which case this would work fine; but I don't actually know if it does this. I don't have time to dig into it this week.

comment:22 Changed 3 years ago by atagar

  • Resolution set to fixed
  • Status changed from new to closed

Sweet, it works! Fix pushed.

I tested this with the following. Prior to the fix tor would hang after ~3 seconds, but with the fix things happily chug along.

import time

import stem.process

from stem.control import EventType, Controller

tor_process = stem.process.launch_tor_with_config(
  config = { 
    'ControlPort': '9051',
    'Log': 'DEBUG stdout',
  },  
  take_ownership = True,
)

with Controller.from_port() as controller:
  controller.authenticate()

  def heartbeat(event):
    print "%s - %s / %s" % (time.time(), event.read, event.written)

  controller.add_event_listener(heartbeat, EventType.BW)

  print "Press any key to quit..."
  raw_input()

Thanks for the catch!
