Sebastian reports that his server (debian, tor master, libevent patches-2.0) gets lots of "TLS error while handshaking (with bufferevent) with [scrubbed]: ssl handshake failure (in SSL routines:SSL3_READ_BYTES:SSL3_ST_SR_CERT_A)" warnings. Those imply that (some? all?) connections aren't making it in.
Additionally, his circuit timeouts seem to increase and increase. Perhaps some outgoing connections aren't working.
So tentatively, I'm going to guess that the situation is: sometimes outgoing connections succeed, and sometimes they fail. Sometimes incoming connections succeed, and sometimes they fail.
The first thing to ask is, is that right? Or was I mistaken when I thought that incoming connections work sometimes?
Second thing to ask is, is there any pattern to which hosts work and which don't? Are hosts in "mostly working" and "mostly not working" classes, or does each conn fail more or less at random? Are the ones that work with incoming conns the same as the ones that work with outgoing conns?
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information
Child items
...
Show closed items
Linked items
0
Link issues together to show that they're related.
Learn more.
Oh! For future work on this, please go with patches-2.0? I think that there shouldn't be anything to break it in master, but I'm more comfortable with the stability of patches-2.0 going forward for a while.
My relay (Windows 7, tor master, libevent patches-2.0, DisableIOCP 0) registers in the network, but apparently doesn't relay a single byte.
There are a few warnings similar to what Sebastian reports, but only 24 in half a day: "TLS error while handshaking (with bufferevents) with [scrubbed]: http request (in SSL routines:SSL23_GET_CLIENT_HELLP:SSL23_ST_SR_CLNT_HELLO_A)"
I took out the DisableIOCP 0 part to see if the relay starts pushing some traffic. Anything else I should try?
Relay mode: Taking out DisableIOCP 0 didn't change anything.
Client mode: The same tor.exe (tor master, libevent patches-2.0) works just fine when copied into a TBB and using it to browse the web.
Bridge mode: I think setting up a bridge and making sure it sees some traffic is more difficult than setting up a relay. How would a bridge not have the same problem as a relay?
Relay mode without bufferevents: I'll have to try running the relay without bufferevents to see if the problem really comes from that, but I need the box for something else now. I'll try that once I have the box back.
I ran an 0.2.2.31-rc relay on Windows for a few hours which attracted some traffic. I then switched to tor master, libevent patches-2.0, bufferevents, and DisableIOCP 0. The result is that the circuit build timeout doubled to 19 seconds within 6 minutes. I think something's broken here.
AHA. There are definitely some rate-limiting bugs. Libevent commit 5b18f13048d4d6 fixes one of them, where ticks that should have been 333 msec were instead 22 minutes. (Yuck.)
There's another one, where filtering bufferevents get rate-limited at the wrong point. I'm working on that one now.
The latest Tor (b161674d5dc or later) with the latest Libevent (5b18f13048d4d6) might have fewer bugs. Some will remain, though. It will be good to know which ones are gone. If you run into bugs, please be absolutely sure which version of Tor and which version of Libevent you were using.
Okay, I think that the latest Tor and libevent described above are a good choice everywhere except with IOCP. For IOCP, I've started a Libevent branch "20_iocp_ratelim", that needs more testing and review before it's ready to merge into libevent.
Running fluxe3 with bufferevents now (no filtering bufferevents yet):
~/build/libevent$ git rev-parse HEAD
2cbe115cbcdd47d954fa47a96c206c5c6ee63ddc
~/build/tor$ git rev-parse HEAD
b161674d5dc312f95a495bed3c25d892fc9f0550
The situation looks very much improved, no more reports about failing to open circuits (the circuit build timeout also doesn't rise anymore).
I did see another few of these tho: [warn] TLS error while handshaking (with bufferevent) with [scrubbed]: ssl handshake failure (in SSL routines:SSL3_READ_BYTES:SSL3_ST_SR_CERT_A)
When connecting to it as a client, i get this:
$ src/or/tor --usebridges 1 --bridge 78.47.18.110:80[notice] Tor v0.2.3.2-alpha-dev (git-b161674d5dc312f9). This is experimental software. Do not rely on it for strong anonymity. (Running on Darwin x86_64)...[notice] Bootstrapped 10%: Finishing handshake with directory server.[notice] Learned fingerprint ED13D1D13C1E57C6A406DD64551D2F905AB99AFF for bridge 78.47.18.110:80[notice] Bootstrapped 15%: Establishing an encrypted directory connection.[notice] Bootstrapped 20%: Asking for networkstatus consensus.[notice] Bootstrapped 50%: Loading relay descriptors.[notice] new bridge descriptor 'fluxe3' (fresh)[notice] I learned some more directory information, but not enough to build a circuit: We have no usable consensus.[warn] Received http status code 404 ("Not found") from server '78.47.18.110:80' while fetching consensus directory.[warn] Received http status code 404 ("Not found") from server '78.47.18.110:80' while fetching consensus directory.
and sure enough - there isn't a cached-microdesc-consensus file in fluxe3's datadir. Only 20 minutes after starting up did it decide to fetch a microdesc-consensus and store it in its datadir. A client that connets through it (again via the bridges line from above) successfully bootstraps. While it does so, the relay logs (5 times in total):
[warn] Bug: Duplicate call to connection_mark_for_close at directory.c:3549 (first at directory.c:3549)
This COULD be a coincidence as it is a public relay, but it happened right in the second where the client fetched descriptors, and it happened again when I re-started my client without cached dir info. Restarting it with cached dir info doesn't cause problems.
Another thing that is curious is that it takes the client a full minute to go do this step:
Aug 25 07:55:38.000 [notice] I learned some more directory information, but not enough to build a circuit: We have no usable consensus.Aug 25 07:56:39.000 [notice] We're missing a certificate from authority with signing key 3509BA5A624403A905C74DA5C8A0CEC9E0D3AF86: launching request.
This is reproducible. After all this, using the tor client for basic browsing works.
The client is not using bufferevents in these tests. I'm keeping the relay up on bufferevents for now, please run tests as you see fit.
00:28 < Sebastian> nickm: going to bed in a minute. Here's what I have so far:00:29 < Sebastian> 14 ssl handshake failure (in SSL routines:SSL3_READ_BYTES:SSL3_ST_SR_CERT_A) messages. They all come from unique IP addresses.00:29 < Sebastian> none of those are relays.00:29 < nickm> Sebastian: just walked in the door. Feel free to summarize. Okay if I copy-and-paste onto trac?00:30 < Sebastian> some of them are very similar, which makes me think that they might be on dynip and got a new IP which triggered a new request00:30 < Sebastian> yes, please do00:30 < nickm> ok00:30 < Sebastian> one sslv3 alert bad record mac (in SSL routines:SSL3_READ_BYTES:SSL_ST_OK) message00:31 < Sebastian> this one came from a relay00:31 < Sebastian> at a university00:31 < Sebastian> so I think they are messing with things00:31 < Sebastian> no other things to report yet.00:32 < Sebastian> (well, got a few of the known connection_mark_for_close messages, but those seem pretty isolated.00:32 < Sebastian> )00:33 < nickm> Can you tell me which relay so I can experiment? 00:34 < Sebastian> gurgle ( gurgle.cs.uwaterloo.ca, ip address 129.97.74.17 )00:34 < nickm> oh interesting. I wonder if they're running any unusual performance hacks.00:35 < Sebastian> orport 501100:35 < Sebastian> feel free to experiment with fluxe3 too00:35 < Sebastian> I'll keep it running00:38 < Sebastian> ok, I think I'm going to bed. Laters
with respect to gurgle, Ian says (posting here with permission):
On port 5011 we seem to be running git version 218e84b634d10c2 (fromyour repo?)Kevin was playing with it today, but just to change the bandwidthparameters and HUPping it.Some weird log lines, though:Aug 25 19:59:30.000 [warn] Couldn't open "data/state.tmp" (data/state)for writing: Too many open filesAug 25 19:59:30.000 [warn] Unable to write state to file "data/state";will try again laterAug 25 20:00:46.000 [warn] Could not open "data/stats/dirreq-stats": Toomany open filesAug 25 20:01:51.000 [notice] Heartbeat: Tor's uptime is 36 days 12:02hours, with 5932 circuits open. I've pushed 3630.23 GB and received3594.84 GB.Aug 25 20:14:03.000 [warn] Cannot seed RNG -- no entropy source found.Aug 25 20:28:38.000 [notice] Removed 21/7414 microdescriptors as old.Aug 25 20:28:38.000 [warn] Couldn't open "data/cached-microdescs.tmp"(data/cached-microdescs) for writing: Too many open filesAug 25 20:58:39.000 [notice] Removed 5/7393 microdescriptors as old.Aug 25 20:59:30.000 [warn] Couldn't open "data/state.tmp" (data/state)for writing: Too many open filesAug 25 20:59:30.000 [warn] Unable to write state to file "data/state";will try again laterAug 25 21:14:04.000 [warn] Cannot seed RNG -- no entropy source found.Aug 25 21:28:40.000 [notice] Removed 33/7388 microdescriptors as old.Aug 25 21:59:30.000 [warn] Couldn't open "data/state.tmp" (data/state)for writing: Too many open files...Aug 26 15:29:16.000 [notice] Removed 47/7858 microdescriptors as old.Aug 26 16:00:40.000 [warn] Failing because we have 992 connectionsalready. Please raise your ulimit -n. [589338339 similar message(s)suppressed in last 21600 seconds]Aug 26 16:29:18.000 [notice] Removed 48/7811 microdescriptors as old.Aug 26 17:14:24.000 [warn] Cannot seed RNG -- no entropy source found.It's not immediately apparent to me that running out of file descriptorsshould cause the SSL error you're seeing, though.
This looks like gentoo bug https://bugs.gentoo.org/show_bug.cgi?id=376515. There the users reported that it was fixed when switching to libevent-2.0.14. Perhaps changes in libevent .13 and .14 might provide a clue as to what's going on here.
Sorry a closer look shows that the gentoo error ends at the same place (ssl handshake) but ends with SSL23_ST_CR_SRVR_HELLO_A while the original here ends with SSL3_ST_SR_CERT_A so maybe its not the same.