In order to do careful monitoring of an experiment I'm running over Tor, I wrote a bridge controller that allocates an ORPort to each client as it checks in. Clients come and go; there are only ever about 20 of them at a time, but stale connections don't get cleaned up very quickly, so I wind up having to manage 100+ ORPorts.
It appears that at some critical number of ORPorts -- the logs suggest 160 -- Tor gets stuck halfway through the process of reinitializing itself; crucially, it doesn't ever respond to the SETCONF message, and it also stops processing all other input.
I'm not sure whether this is supposed to be a supported configuration -- it's certainly an abnormal configuration -- but it seems to me that if there is an upper limit, SETCONF should fail when you hit it, rather than having the relay just stop working.
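For reference, the controller pushes the whole ORPort list in a single SETCONF, along these lines (a simplified sketch, not my actual code; the names and the control port are illustrative):

import stem.control

def push_orport_list(controller, ports):
    # One SETCONF carrying the full ORPort list, i.e.
    # SETCONF ORPort="9000 IPv4Only" ORPort="9001 IPv4Only" ...
    controller.set_conf('ORPort', ['%d IPv4Only' % p for p in ports])

with stem.control.Controller.from_port(port=9051) as controller:
    controller.authenticate()
    # Around 160 ports, tor stops responding to this SETCONF.
    push_orport_list(controller, range(9000, 9160))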
Debug logs attached.
I tried starting a Tor instance with 300 ORPorts, and it worked okay. So apparently there's something about adding a bunch of them in succession that might be a problem.
The most useful test here would be to attach a debugger to the process and find out what all of its threads are doing, to see if something is blocking.
This is interesting. I think what we actually have here is a deadlock caused by STEM.
All but one of the threads in the Tor process (there are nine) are blocked in recv and have the same stack trace (with different file descriptors):
#0 0x00007ff31c61c8bc in __libc_recv (fd=fd@entry=214, buf=<optimized out>, n=<optimized out>, flags=<optimized out>, flags@entry=0) at ../sysdeps/unix/sysv/linux/x86_64/recv.c:34
#1 0x00007ff31d8f163a in recv (__flags=0, __n=552, __buf=0x7ff30f7f5940, __fd=214) at /usr/include/x86_64-linux-gnu/bits/socket2.h:45
#2 read_all (fd=fd@entry=214, buf=buf@entry=0x7ff30f7f5940 "", count=count@entry=552, isSocket=isSocket@entry=1) at ../src/common/util.c:1685
#3 0x00007ff31d8a79be in cpuworker_main (data=0x0, data@entry=0x7ff3080f7fd0) at ../src/or/cpuworker.c:428
#4 0x00007ff31d8defe8 in tor_pthread_helper_fn (_data=0x0) at ../src/common/compat.c:2248
#5 0x00007ff31c32bb50 in start_thread (arg=<optimized out>) at pthread_create.c:304
#6 0x00007ff31c61ba7d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
... so I presume they are all CPU workers waiting for something to do.
The remaining thread has this stack trace:
Program received signal SIGINT, Interrupt.
0x00007ff31c60fb0d in write () at ../sysdeps/unix/syscall-template.S:82
82      in ../sysdeps/unix/syscall-template.S
(gdb) bt
#0 0x00007ff31c60fb0d in write () at ../sysdeps/unix/syscall-template.S:82
#1 0x00007ff31d8f152e in write_all (fd=1, buf=buf@entry=0x7fff49cb7650 "Oct 01 22:15:01.000 [notice] Your Tor server's identity key fingerprint is 'tbbscraperentry 73EFD4FE8D5D2466ECBDCFAC11894A72A322FD3C'\n", count=count@entry=134, isSocket=0, isSocket@entry=1238070893) at ../src/common/util.c:1661
#2 0x00007ff31d8e9fae in logv (severity=5, domain=<optimized out>, funcname=0x7ff31d920bb3 "init_keys", suffix=suffix@entry=0x0, format=format@entry=0x7ff31d9205f8 "Your Tor server's identity key fingerprint is '%s %s'", ap=ap@entry=0x7fff49cb9dc0) at ../src/common/log.c:417
#3 0x00007ff31d8ea263 in log_fn_ (severity=severity@entry=5, domain=domain@entry=1, fn=fn@entry=0x7ff31d920bb3 "init_keys", format=format@entry=0x7ff31d9205f8 "Your Tor server's identity key fingerprint is '%s %s'") at ../src/common/log.c:455
#4 0x00007ff31d822d15 in init_keys () at ../src/or/router.c:922
#5 0x00007ff31d881de5 in options_act (old_options=0x7ff308105910) at ../src/or/config.c:1396
#6 set_options (new_val=new_val@entry=0x7ff314032e40, msg=msg@entry=0x7fff49cba148) at ../src/or/config.c:596
#7 0x00007ff31d88360d in options_trial_assign (list=0x7ff314032570, use_defaults=use_defaults@entry=0, clear_first=clear_first@entry=1, msg=msg@entry=0x7fff49cba148) at ../src/or/config.c:1865
#8 0x00007ff31d8a042e in control_setconf_helper (conn=conn@entry=0x7ff320da7ec0, len=len@entry=3704, body=<optimized out>, body@entry=0x7ff320cfcbd8 "ORPort=\"9000 IPv4Only\" ORPort=\"9001 IPv4Only\" ORPort=\"9002 IPv4Only\" ORPort=\"9003 IPv4Only\" ORPort=\"9004 IPv4Only\" ORPort=\"9005 IPv4Only\" ORPort=\"9006 IPv4Only\" ORPort=\"9007 IPv4Only\" ORPort=\"9008 IPv"..., use_defaults=0) at ../src/or/control.c:762
#9 0x00007ff31d8a3a25 in handle_control_resetconf (body=<optimized out>, len=<optimized out>, conn=<optimized out>) at ../src/or/control.c:809
#10 connection_control_process_inbuf (conn=conn@entry=0x7ff320da7ec0) at ../src/or/control.c:3419
#11 0x00007ff31d8874a9 in connection_process_inbuf (conn=conn@entry=0x7ff320da7ec0, package_partial=package_partial@entry=1) at ../src/or/connection.c:4011
#12 0x00007ff31d88d6dd in connection_handle_read_impl (conn=0x7ff320da7ec0) at ../src/or/connection.c:2839
#13 connection_handle_read (conn=conn@entry=0x7ff320da7ec0) at ../src/or/connection.c:2880
#14 0x00007ff31d7e9061 in conn_read_callback (fd=<optimized out>, event=<optimized out>, _conn=0x7ff320da7ec0) at ../src/or/main.c:718
#15 0x00007ff31cee1ccc in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent-2.0.so.5
#16 0x00007ff31d7e99f5 in do_main_loop () at ../src/or/main.c:1992
#17 0x00007ff31d7eb1de in tor_main (argc=7, argv=0x7fff49cba648) at ../src/or/main.c:2708
#18 0x00007ff31c55fead in __libc_start_main (main=<optimized out>, argc=<optimized out>, ubp_av=<optimized out>, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fff49cba638) at libc-start.c:228
#19 0x00007ff31d7e5a9d in _start ()
So Tor is stuck trying to write a log message to file descriptor 1, which is a pipe whose other end is connected to the controller process; STEM parses notice-level log output to know when the Tor process is fully spun up. (Tor's stderr is also a pipe connected to STEM.) But STEM is blocked reading from the control socket, and in general doesn't appear to read from the Tor process's stdout or stderr pipes at all after stem.process.launch_tor returns. I guess the pipe just happens to fill up at 160 open ORPorts. I can work around this in my code, but I think it still qualifies as a bug in STEM.
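As a self-contained illustration of the failure mode (a generic pipe-backpressure demo, nothing Tor- or Stem-specific):

import subprocess, sys

# A child that chatters on stdout forever, standing in for Tor's notice-level logging.
child = subprocess.Popen(
    [sys.executable, '-u', '-c', 'while True: print("x" * 1024)'],
    stdout=subprocess.PIPE)

# The parent never reads child.stdout.  Once the kernel pipe buffer (64 KiB by
# default on Linux) fills, the child blocks in write() and makes no further
# progress, just as Tor is blocked in write_all() in the backtrace above.
child.wait()   # never returns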
Hi zwol. I've dug through Python's subprocess module for over an hour trying to find a way to ignore further stdout content without a special puller thread, but no luck so far. I'll keep thinking about it, but probably won't come up with anything until this weekend.
Cheers! -Damian
PS. Stem is not an acronym, so the library's called "Stem". Not "STEM".
By the way, the launch_tor() function returns the tor process, which has stdout and stderr attributes. You can likely call 'tor_process.communicate()' from a daemon thread to avoid this.
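For instance (untested sketch; the config values are just placeholders):

import threading
import stem.process

tor_process = stem.process.launch_tor_with_config(
    config={'SocksPort': '9050'},   # illustrative config
    take_ownership=True)

# Drain tor's stdout/stderr in the background so it can't block on a full pipe.
drainer = threading.Thread(target=tor_process.communicate)
drainer.daemon = True
drainer.start()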
You can likely call 'tor_process.communicate()' from a daemon thread to avoid this.
That is approximately the workaround I put into my code. May I suggest, as a first step here, that the present need to do something like this be documented?
I've dug through Python's subprocess module for over an hour trying to find a way to ignore further stdout content without a special puller thread, but no luck so far.
So I recognize that this probably would require changes to Tor as well as Stem, but the fact that Stem needs to parse log messages in the first place seems like the real bug here. What exactly is it doing with them, and can we think of a better technique? Maybe a new command line option which tells Tor to close file descriptor N when it reaches a certain level of initializedness, or something like that?
... but the fact that Stem needs to parse log messages in the first place seems like the real bug here. What exactly is it doing with them
Stem is using them to figure out when tor has bootstrapped enough to reach a usable state. It's also used so we can raise an exception if it fails before that.
... and can we think of a better technique?
Suggestions welcome. Personally I'm not a fan of closing file descriptor N. Having the Python process consume the rest of stdout (even if it's with a null reader by default) is preferable, since it's less hacky and users might want it anyway. The only trick is doing this without requiring a daemon thread.
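For anyone following along, the launch-and-wait pattern is roughly this (a simplified sketch of the general idea, not Stem's actual implementation):

import subprocess

def launch_and_wait(tor_cmd='tor'):
    # Start tor and read its stdout line by line until it reports that
    # bootstrapping finished, raising if the process exits first.
    proc = subprocess.Popen([tor_cmd], stdout=subprocess.PIPE, stderr=subprocess.PIPE)

    while True:
        line = proc.stdout.readline()
        if not line:
            raise OSError('tor exited before bootstrapping completed')
        elif b'Bootstrapped 100%' in line:
            # Nothing reads proc.stdout after this point -- which is where
            # the pipe can eventually fill up.
            return proc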
Stem is using them to figure out when tor has bootstrapped enough to reach a usable state. It's also used so we can raise an exception if it fails before that.
The bootstrap status events (or the getinfo) don't do it for you?
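For concreteness, I was imagining something like this (sketch only; the port and the response text are illustrative, and it assumes a control port is already reachable):

import stem.control

# Ask tor how far along bootstrapping is, over the control port.
with stem.control.Controller.from_port(port=9051) as controller:
    controller.authenticate()
    phase = controller.get_info('status/bootstrap-phase')
    # e.g. 'NOTICE BOOTSTRAP PROGRESS=100 TAG=done SUMMARY="Done"'
    print(phase)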
Oh, just thought of something. If you call tor_process.stdout.close() does this still manifest? If not then that's a pretty trivial fix to make in launch_tor().
The bootstrap status events (or the getinfo) don't do it for you?
No, for a couple of reasons:
1. It doesn't tell us when we can connect to the control port. We could poll instead, but reading stdout is better.
2. The launch_tor() function simply launches tor; it doesn't require that the user has a control port.
I recall you asking this on an earlier ticket too.
Waiting for the answer to "If you call tor_process.stdout.close() does this still manifest?"
I'm sorry - to me, that so obviously won't work that I didn't even register it as a question. If you do that, the Tor process will get a SIGPIPE and crash the very next time it writes to its stdout (resp. stderr).
A dedicated thread calling communicate() is also a bad idea, for a different reason: that will accumulate all of the log messages in memory until the Tor process eventually exits. For my use case, that could be weeks, and megabytes of notices. (Probably not many megabytes, but I'm already struggling with the memory usage of this controller; every little bit helps.)
This is my current workaround:
import threading
import stem.process

class PipeDrainer(threading.Thread):
    """Read and discard all data from a pipe. Exits when the pipe is closed."""
    def __init__(self, fp):
        threading.Thread.__init__(self)
        self.fp = fp

    def run(self):
        try:
            while len(self.fp.read(8192)) > 0:
                pass
        except:
            pass
        self.fp.close()

# ...

class TorBridge(object):
    # ...
    def __enter__(self):
        try:
            # ...
            self.process = stem.process.launch_tor_with_config(
                config={ ... },
                take_ownership=True)

            # Drain tor's stdout and stderr so it never blocks on a full pipe.
            self.stdout_drainer = PipeDrainer(self.process.stdout)
            self.stdout_drainer.start()
            self.stderr_drainer = PipeDrainer(self.process.stderr)
            self.stderr_drainer.start()
            # ...