In order to do careful monitoring of an experiment I'm running over Tor, I wrote a bridge controller that allocates an ORPort to each client as it checks in. Clients come and go; there are only ever about 20 of them at a time, but stale connections don't get cleaned up very quickly, so I wind up having to manage 100+ ORPorts.
It appears that at some critical number of ORPorts -- the logs suggest 160 -- Tor gets stuck halfway through the process of reinitializing itself; crucially, it doesn't ever respond to the SETCONF message, and it also stops processing all other input.
I'm not sure whether this is supposed to be a supported configuration -- it's certainly an abnormal configuration -- but it seems to me that if there is an upper limit, SETCONF should fail when you hit it, rather than having the relay just stop working.
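For reference, the controller pushes the whole ORPort list in a single SETCONF, along these lines (a simplified sketch, not my actual code; the names and the control port are illustrative):

import stem.control

def push_orport_list(controller, ports):
    # One SETCONF carrying the full ORPort list, i.e.
    # SETCONF ORPort="9000 IPv4Only" ORPort="9001 IPv4Only" ...
    controller.set_conf('ORPort', ['%d IPv4Only' % p for p in ports])

with stem.control.Controller.from_port(port=9051) as controller:
    controller.authenticate()
    # Around 160 ports, tor stops responding to this SETCONF.
    push_orport_list(controller, range(9000, 9160))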
Debug logs attached.
I tried starting a Tor instance with 300 ORPorts, and it worked okay. So apparently there's something about adding a bunch of them in succession that might be a problem.
The most useful test here would be to attach a debugger to the process and find out what all of its threads are doing, to see if something is blocking.
This is interesting. I think what we actually have here is a deadlock caused by STEM.
All but one of the threads in the Tor process (there are nine) are blocked in recv and have the same stack trace (with different file descriptors):
#0 0x00007ff31c61c8bc in __libc_recv (fd=fd@entry=214, buf=<optimized out>, n=<optimized out>, flags=<optimized out>, flags@entry=0) at ../sysdeps/unix/sysv/linux/x86_64/recv.c:34
#1 0x00007ff31d8f163a in recv (__flags=0, __n=552, __buf=0x7ff30f7f5940, __fd=214) at /usr/include/x86_64-linux-gnu/bits/socket2.h:45
#2 read_all (fd=fd@entry=214, buf=buf@entry=0x7ff30f7f5940 "", count=count@entry=552, isSocket=isSocket@entry=1) at ../src/common/util.c:1685
#3 0x00007ff31d8a79be in cpuworker_main (data=0x0, data@entry=0x7ff3080f7fd0) at ../src/or/cpuworker.c:428
#4 0x00007ff31d8defe8 in tor_pthread_helper_fn (_data=0x0) at ../src/common/compat.c:2248
#5 0x00007ff31c32bb50 in start_thread (arg=<optimized out>) at pthread_create.c:304
#6 0x00007ff31c61ba7d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
... so I presume they are all CPU workers waiting for something to do.
The remaining thread has this stack trace:
Program received signal SIGINT, Interrupt.
0x00007ff31c60fb0d in write () at ../sysdeps/unix/syscall-template.S:82
82      in ../sysdeps/unix/syscall-template.S
(gdb) bt
#0 0x00007ff31c60fb0d in write () at ../sysdeps/unix/syscall-template.S:82
#1 0x00007ff31d8f152e in write_all (fd=1, buf=buf@entry=0x7fff49cb7650 "Oct 01 22:15:01.000 [notice] Your Tor server's identity key fingerprint is 'tbbscraperentry 73EFD4FE8D5D2466ECBDCFAC11894A72A322FD3C'\n", count=count@entry=134, isSocket=0, isSocket@entry=1238070893) at ../src/common/util.c:1661
#2 0x00007ff31d8e9fae in logv (severity=5, domain=<optimized out>, funcname=0x7ff31d920bb3 "init_keys", suffix=suffix@entry=0x0, format=format@entry=0x7ff31d9205f8 "Your Tor server's identity key fingerprint is '%s %s'", ap=ap@entry=0x7fff49cb9dc0) at ../src/common/log.c:417
#3 0x00007ff31d8ea263 in log_fn_ (severity=severity@entry=5, domain=domain@entry=1, fn=fn@entry=0x7ff31d920bb3 "init_keys", format=format@entry=0x7ff31d9205f8 "Your Tor server's identity key fingerprint is '%s %s'") at ../src/common/log.c:455
#4 0x00007ff31d822d15 in init_keys () at ../src/or/router.c:922
#5 0x00007ff31d881de5 in options_act (old_options=0x7ff308105910) at ../src/or/config.c:1396
#6 set_options (new_val=new_val@entry=0x7ff314032e40, msg=msg@entry=0x7fff49cba148) at ../src/or/config.c:596
#7 0x00007ff31d88360d in options_trial_assign (list=0x7ff314032570, use_defaults=use_defaults@entry=0, clear_first=clear_first@entry=1, msg=msg@entry=0x7fff49cba148) at ../src/or/config.c:1865
#8 0x00007ff31d8a042e in control_setconf_helper (conn=conn@entry=0x7ff320da7ec0, len=len@entry=3704, body=<optimized out>, body@entry=0x7ff320cfcbd8 "ORPort=\"9000 IPv4Only\" ORPort=\"9001 IPv4Only\" ORPort=\"9002 IPv4Only\" ORPort=\"9003 IPv4Only\" ORPort=\"9004 IPv4Only\" ORPort=\"9005 IPv4Only\" ORPort=\"9006 IPv4Only\" ORPort=\"9007 IPv4Only\" ORPort=\"9008 IPv"..., use_defaults=0) at ../src/or/control.c:762
#9 0x00007ff31d8a3a25 in handle_control_resetconf (body=<optimized out>, len=<optimized out>, conn=<optimized out>) at ../src/or/control.c:809
#10 connection_control_process_inbuf (conn=conn@entry=0x7ff320da7ec0) at ../src/or/control.c:3419
#11 0x00007ff31d8874a9 in connection_process_inbuf (conn=conn@entry=0x7ff320da7ec0, package_partial=package_partial@entry=1) at ../src/or/connection.c:4011
#12 0x00007ff31d88d6dd in connection_handle_read_impl (conn=0x7ff320da7ec0) at ../src/or/connection.c:2839
#13 connection_handle_read (conn=conn@entry=0x7ff320da7ec0) at ../src/or/connection.c:2880
#14 0x00007ff31d7e9061 in conn_read_callback (fd=<optimized out>, event=<optimized out>, _conn=0x7ff320da7ec0) at ../src/or/main.c:718
#15 0x00007ff31cee1ccc in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent-2.0.so.5
#16 0x00007ff31d7e99f5 in do_main_loop () at ../src/or/main.c:1992
#17 0x00007ff31d7eb1de in tor_main (argc=7, argv=0x7fff49cba648) at ../src/or/main.c:2708
#18 0x00007ff31c55fead in __libc_start_main (main=<optimized out>, argc=<optimized out>, ubp_av=<optimized out>, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fff49cba638) at libc-start.c:228
#19 0x00007ff31d7e5a9d in _start ()
So Tor is stuck trying to write a log message to file descriptor 1, which is a pipe whose other end is connected to the controller process; STEM parses notice-level log output to know when the Tor process is fully spun up. (Tor's stderr is also a pipe connected to STEM.) But STEM is blocked reading from the control socket, and in general doesn't appear to read from the Tor process's stdout or stderr pipes at all after stem.process.launch_tor returns. I guess the pipe just happens to fill up at 160 open ORPorts. I can work around this in my code, but I think it still qualifies as a bug in STEM.
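As a self-contained illustration of the failure mode (a generic pipe-backpressure demo, nothing Tor- or Stem-specific):

import subprocess, sys

# A child that chatters on stdout forever, standing in for Tor's notice-level logging.
child = subprocess.Popen(
    [sys.executable, '-u', '-c', 'while True: print("x" * 1024)'],
    stdout=subprocess.PIPE)

# The parent never reads child.stdout.  Once the kernel pipe buffer (64 KiB by
# default on Linux) fills, the child blocks in write() and makes no further
# progress, just as Tor is blocked in write_all() in the backtrace above.
child.wait()   # never returns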
Hi zwol. I've dug through Python's subprocess module for over an hour trying to find a way to ignore further stdout content without a special puller thread, but no luck so far. I'll keep thinking about it, but probably won't come up with anything until this weekend.
Cheers! -Damian
PS. Stem is not an acronym, so the library's called "Stem". Not "STEM".
By the way, the launch_tor() function returns the tor process, which has stdout and stderr attributes. You can likely call 'tor_process.communicate()' from a daemon thread to avoid this.
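For instance (untested sketch; the config values are just placeholders):

import threading
import stem.process

tor_process = stem.process.launch_tor_with_config(
    config={'SocksPort': '9050'},   # illustrative config
    take_ownership=True)

# Drain tor's stdout/stderr in the background so it can't block on a full pipe.
drainer = threading.Thread(target=tor_process.communicate)
drainer.daemon = True
drainer.start()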
You can likely call 'tor_process.communicate()' from a daemon thread to avoid this.
That is approximately the workaround I put into my code. May I suggest, as a first step here, that the present need to do something like this be documented?
I've dug through Python's subprocess module for over an hour trying to find a way to ignore further stdout content without a special puller thread, but no luck so far.
So I recognize that this probably would require changes to Tor as well as Stem, but the fact that Stem needs to parse log messages in the first place seems like the real bug here. What exactly is it doing with them, and can we think of a better technique? Maybe a new command line option which tells Tor to close file descriptor N when it reaches a certain level of initializedness, or something like that?
... but the fact that Stem needs to parse log messages in the first place seems like the real bug here. What exactly is it doing with them
Stem is using them to figure out when tor has bootstrapped enough to reach a usable state. It's also used so we can raise an exception if it fails before that.
... and can we think of a better technique?
Suggestions welcome. Personally I'm not a fan of closing file descriptor N. Having the Python process consume the rest of stdout (even if it's with a null reader by default) is preferable, since it's less hacky and users might want it anyway. The only trick is doing this without requiring a daemon thread.
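For anyone following along, the launch-and-wait pattern is roughly this (a simplified sketch of the general idea, not Stem's actual implementation):

import subprocess

def launch_and_wait(tor_cmd='tor'):
    # Start tor and read its stdout line by line until it reports that
    # bootstrapping finished, raising if the process exits first.
    proc = subprocess.Popen([tor_cmd], stdout=subprocess.PIPE, stderr=subprocess.PIPE)

    while True:
        line = proc.stdout.readline()
        if not line:
            raise OSError('tor exited before bootstrapping completed')
        elif b'Bootstrapped 100%' in line:
            # Nothing reads proc.stdout after this point -- which is where
            # the pipe can eventually fill up.
            return proc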
Stem is using them to figure out when tor has bootstrapped enough to reach a usable state. It's also used so we can raise an exception if it fails before that.
The bootstrap status events (or the getinfo) don't do it for you?
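For concreteness, I was imagining something like this (sketch only; the port and the response text are illustrative, and it assumes a control port is already reachable):

import stem.control

# Ask tor how far along bootstrapping is, over the control port.
with stem.control.Controller.from_port(port=9051) as controller:
    controller.authenticate()
    phase = controller.get_info('status/bootstrap-phase')
    # e.g. 'NOTICE BOOTSTRAP PROGRESS=100 TAG=done SUMMARY="Done"'
    print(phase)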
Oh, just thought of something. If you call tor_process.stdout.close() does this still manifest? If not then that's a pretty trivial fix to make in launch_tor().
The bootstrap status events (or the getinfo) don't do it for you?
No, for a couple of reasons:
1. It doesn't tell us when we can connect to the control port. We could poll instead, but reading stdout is better.
2. The launch_tor() function simply launches tor; it doesn't require that the user has a control port.
I recall you asking this on an earlier ticket too.
Waiting for the answer to "If you call tor_process.stdout.close() does this still manifest?"
I'm sorry - to me, that so obviously won't work that I didn't even register it as a question. If you do that, the Tor process will get a SIGPIPE and crash the very next time it writes to its stdout (resp. stderr).
A dedicated thread calling communicate() is also a bad idea, for a different reason: that will accumulate all of the log messages in memory until the Tor process eventually exits. For my use case, that could be weeks, and megabytes of notices. (Probably not many megabytes, but I'm already struggling with the memory usage of this controller; every little bit helps.)
This is my current workaround:
import threading
import stem.process

class PipeDrainer(threading.Thread):
    """Read and discard all data from a pipe. Exits when the pipe is closed."""
    def __init__(self, fp):
        threading.Thread.__init__(self)
        self.fp = fp

    def run(self):
        try:
            while len(self.fp.read(8192)) > 0:
                pass
        except:
            pass
        self.fp.close()

# ...

class TorBridge(object):
    # ...
    def __enter__(self):
        try:
            # ...
            self.process = stem.process.launch_tor_with_config(
                config={ ... },
                take_ownership=True)

            # Drain tor's stdout and stderr so it never blocks on a full pipe.
            self.stdout_drainer = PipeDrainer(self.process.stdout)
            self.stdout_drainer.start()
            self.stderr_drainer = PipeDrainer(self.process.stderr)
            self.stderr_drainer.start()
            # ...