Opened 8 years ago

Closed 3 years ago

#3769 closed defect (wontfix)

Bufferevent-based server sometimes succeeds, sometimes fails.

Reported by: nickm Owned by:
Priority: High Milestone: Tor: unspecified
Component: Core Tor/Tor Version:
Severity: Keywords: bufferevents tor-relay
Cc: iang@… Actual Points:
Parent ID: #3561 Points:
Reviewer: Sponsor:

Description

Sebastian reports that his server (debian, tor master, libevent patches-2.0) gets lots of "TLS error while handshaking (with bufferevent) with [scrubbed]: ssl handshake failure (in SSL routines:SSL3_READ_BYTES:SSL3_ST_SR_CERT_A)" warnings. Those imply that (some? all?) connections aren't making it in.

Additionally, his circuit timeouts seem to increase and increase. Perhaps some outgoing connections aren't working.

So tentatively, I'm going to guess that the situation is: sometimes outgoing connections succeed, and sometimes they fail. Sometimes incoming connections succeed, and sometimes they fail.

The first thing to ask is, is that right? Or was I mistaken when I thought that incoming connections work sometimes?

Second thing to ask is, is there any pattern to which hosts work and which don't? Are hosts in "mostly working" and "mostly not working" classes, or does each conn fail more or less at random? Are the ones that work with incoming conns the same as the ones that work with outgoing conns?

Child Tickets

Change History (25)

comment:1 Changed 8 years ago by nickm

Trying out a bridge might also be a way to explore this.

comment:2 Changed 8 years ago by Sebastian

just a quick correction, it was also libevent master, not patches-2.0

comment:3 Changed 8 years ago by nickm

Oh! For future work on this, please go with patches-2.0? I think that there shouldn't be anything to break it in master, but I'm more comfortable with the stability of patches-2.0 going forward for a while.

comment:4 Changed 8 years ago by karsten

My relay (Windows 7, tor master, libevent patches-2.0, DisableIOCP 0) registers in the network, but apparently doesn't relay a single byte.

There are a few warnings similar to what Sebastian reports, but only 24 in half a day: "TLS error while handshaking (with bufferevents) with [scrubbed]: http request (in SSL routines:SSL23_GET_CLIENT_HELLP:SSL23_ST_SR_CLNT_HELLO_A)"

I took out the DisableIOCP 0 part to see if the relay starts pushing some traffic. Anything else I should try?

comment:5 Changed 8 years ago by nickm

Try connecting to it as a client, to see how it fails from the client side?

comment:6 Changed 8 years ago by nickm

Also, see if bridges have the same problem. If so, this should get easier to examine. (And does it work as a client for you?)

comment:7 Changed 8 years ago by karsten

Relay mode: Taking out DisableIOCP 0 didn't change anything.

Client mode: The same tor.exe (tor master, libevent patches-2.0) works just fine when copied into a TBB and using it to browse the web.

Bridge mode: I think setting up a bridge and making sure it sees some traffic is more difficult than setting up a relay. How would a bridge not have the same problem as a relay?

Relay mode without bufferevents: I'll have to try running the relay without bufferevents to see if the problem really comes from that, but I need the box for something else now. I'll try that once I have the box back.

comment:8 Changed 8 years ago by karsten

I ran an 0.2.2.31-rc relay on Windows for a few hours which attracted some traffic. I then switched to tor master, libevent patches-2.0, bufferevents, and DisableIOCP 0. The result is that the circuit build timeout doubled to 19 seconds within 6 minutes. I think something's broken here.

comment:9 Changed 8 years ago by nickm

AHA. There are definitely some rate-limiting bugs. Libevent commit 5b18f13048d4d6 fixes one of them, where ticks that should have been 333 msec were instead 22 minutes. (Yuck.)

There's another one, where filtering bufferevents get rate-limited at the wrong point. I'm working on that one now.

comment:10 Changed 8 years ago by nickm

The latest Tor (b161674d5dc or later) with the latest Libevent (5b18f13048d4d6) might have fewer bugs. Some will remain, though. It will be good to know which ones are gone. If you run into bugs, please be absolutely sure which version of Tor and which version of Libevent you were using.

comment:11 Changed 8 years ago by nickm

Okay, I think that the latest Tor and libevent described above are a good choice everywhere except with IOCP. For IOCP, I've started a Libevent branch "20_iocp_ratelim", that needs more testing and review before it's ready to merge into libevent.

You can find a Tor built for windows with Tor master, libevent 20_iocp_ratelim:

http://www.wangafu.net/~nickm/volatile/tor-win32-bev-20110824.zip
http://www.wangafu.net/~nickm/volatile/tor-win32-bev-20110824.zip.asc

Perhaps it will work better; perhaps it will crash.

Same tests as before would now be useful again.

comment:12 Changed 8 years ago by Sebastian

Running fluxe3 with bufferevents now (no filtering bufferevents yet):

~/build/libevent$ git rev-parse HEAD
2cbe115cbcdd47d954fa47a96c206c5c6ee63ddc

~/build/tor$ git rev-parse HEAD
b161674d5dc312f95a495bed3c25d892fc9f0550

The situation looks very much improved, no more reports about failing to open circuits (the circuit build timeout also doesn't rise anymore).

I did see another few of these tho: [warn] TLS error while handshaking (with bufferevent) with [scrubbed]: ssl handshake failure (in SSL routines:SSL3_READ_BYTES:SSL3_ST_SR_CERT_A)

When connecting to it as a client, i get this:

$ src/or/tor --usebridges 1 --bridge 78.47.18.110:80
[notice] Tor v0.2.3.2-alpha-dev (git-b161674d5dc312f9). This is experimental software. Do not rely on it for strong anonymity. (Running on Darwin x86_64)
...
[notice] Bootstrapped 10%: Finishing handshake with directory server.
[notice] Learned fingerprint ED13D1D13C1E57C6A406DD64551D2F905AB99AFF for bridge 78.47.18.110:80
[notice] Bootstrapped 15%: Establishing an encrypted directory connection.
[notice] Bootstrapped 20%: Asking for networkstatus consensus.
[notice] Bootstrapped 50%: Loading relay descriptors.
[notice] new bridge descriptor 'fluxe3' (fresh)
[notice] I learned some more directory information, but not enough to build a circuit: We have no usable consensus.
[warn] Received http status code 404 ("Not found") from server '78.47.18.110:80' while fetching consensus directory.
[warn] Received http status code 404 ("Not found") from server '78.47.18.110:80' while fetching consensus directory.

and sure enough - there isn't a cached-microdesc-consensus file in fluxe3's datadir. Only 20 minutes after starting up did it decide to fetch a microdesc-consensus and store it in its datadir. A client that connets through it (again via the bridges line from above) successfully bootstraps. While it does so, the relay logs (5 times in total):

[warn] Bug: Duplicate call to connection_mark_for_close at directory.c:3549 (first at directory.c:3549)

This COULD be a coincidence as it is a public relay, but it happened right in the second where the client fetched descriptors, and it happened again when I re-started my client without cached dir info. Restarting it with cached dir info doesn't cause problems.

Another thing that is curious is that it takes the client a full minute to go do this step:

Aug 25 07:55:38.000 [notice] I learned some more directory information, but not enough to build a circuit: We have no usable consensus.
Aug 25 07:56:39.000 [notice] We're missing a certificate from authority with signing key 3509BA5A624403A905C74DA5C8A0CEC9E0D3AF86: launching request.

This is reproducible. After all this, using the tor client for basic browsing works.

The client is not using bufferevents in these tests. I'm keeping the relay up on bufferevents for now, please run tests as you see fit.

comment:13 Changed 8 years ago by nickm

00:28 < Sebastian> nickm: going to bed in a minute. Here's what I have so far:

00:29 < Sebastian> 14 ssl handshake failure (in SSL 
                   routines:SSL3_READ_BYTES:SSL3_ST_SR_CERT_A) messages. They 
                   all come from unique IP addresses.
00:29 < Sebastian> none of those are relays.
00:29 < nickm> Sebastian: just walked in the door.  Feel free to summarize.  
               Okay if I copy-and-paste onto trac?
00:30 < Sebastian> some of them are very similar, which makes me think that 
                   they might be on dynip and got a new IP which triggered a 
                   new request
00:30 < Sebastian> yes, please do

00:30 < nickm> ok
00:30 < Sebastian> one sslv3 alert bad record mac (in SSL 
                   routines:SSL3_READ_BYTES:SSL_ST_OK) message
00:31 < Sebastian> this one came from a relay
00:31 < Sebastian> at a university
00:31 < Sebastian> so I think they are messing with things
00:31 < Sebastian> no other things to report yet.
00:32 < Sebastian> (well, got a few of the known connection_mark_for_close 
                   messages, but those seem pretty isolated.
00:32 < Sebastian> )
00:33 < nickm> Can you tell me which relay so I can experiment? 

00:34 < Sebastian> gurgle ( gurgle.cs.uwaterloo.ca, ip address 129.97.74.17 )
00:34 < nickm> oh interesting.  I wonder if they're running any unusual 
               performance hacks.
00:35 < Sebastian> orport 5011
00:35 < Sebastian> feel free to experiment with fluxe3 too
00:35 < Sebastian> I'll keep it running
00:38 < Sebastian> ok, I think I'm going to bed. Laters

comment:14 Changed 8 years ago by nickm

I opened a few new bugs for the most separable issues above; have a look at the children of #3561.

comment:15 Changed 8 years ago by nickm

I emailed Ian about gurgle.

comment:16 Changed 8 years ago by nickm

with respect to gurgle, Ian says (posting here with permission):

On port 5011 we seem to be running git version 218e84b634d10c2 (from
your repo?)

Kevin was playing with it today, but just to change the bandwidth
parameters and HUPping it.

Some weird log lines, though:

Aug 25 19:59:30.000 [warn] Couldn't open "data/state.tmp" (data/state)
for writing: Too many open files
Aug 25 19:59:30.000 [warn] Unable to write state to file "data/state";
will try again later
Aug 25 20:00:46.000 [warn] Could not open "data/stats/dirreq-stats": Too
many open files
Aug 25 20:01:51.000 [notice] Heartbeat: Tor's uptime is 36 days 12:02
hours, with 5932 circuits open. I've pushed 3630.23 GB and received
3594.84 GB.
Aug 25 20:14:03.000 [warn] Cannot seed RNG -- no entropy source found.
Aug 25 20:28:38.000 [notice] Removed 21/7414 microdescriptors as old.
Aug 25 20:28:38.000 [warn] Couldn't open "data/cached-microdescs.tmp"
(data/cached-microdescs) for writing: Too many open files
Aug 25 20:58:39.000 [notice] Removed 5/7393 microdescriptors as old.
Aug 25 20:59:30.000 [warn] Couldn't open "data/state.tmp" (data/state)
for writing: Too many open files
Aug 25 20:59:30.000 [warn] Unable to write state to file "data/state";
will try again later
Aug 25 21:14:04.000 [warn] Cannot seed RNG -- no entropy source found.
Aug 25 21:28:40.000 [notice] Removed 33/7388 microdescriptors as old.
Aug 25 21:59:30.000 [warn] Couldn't open "data/state.tmp" (data/state)
for writing: Too many open files
...
Aug 26 15:29:16.000 [notice] Removed 47/7858 microdescriptors as old.
Aug 26 16:00:40.000 [warn] Failing because we have 992 connections
already. Please raise your ulimit -n. [589338339 similar message(s)
suppressed in last 21600 seconds]
Aug 26 16:29:18.000 [notice] Removed 48/7811 microdescriptors as old.
Aug 26 17:14:24.000 [warn] Cannot seed RNG -- no entropy source found.


It's not immediately apparent to me that running out of file descriptors
should cause the SSL error you're seeing, though.

comment:17 Changed 8 years ago by iang

Cc: iang@… added

comment:18 Changed 8 years ago by blueness

This looks like gentoo bug https://bugs.gentoo.org/show_bug.cgi?id=376515. There the users reported that it was fixed when switching to libevent-2.0.14. Perhaps changes in libevent .13 and .14 might provide a clue as to what's going on here.

comment:19 in reply to:  18 Changed 8 years ago by blueness

Replying to blueness:

This looks like gentoo bug https://bugs.gentoo.org/show_bug.cgi?id=376515.

Sorry a closer look shows that the gentoo error ends at the same place (ssl handshake) but ends with SSL23_ST_CR_SRVR_HELLO_A while the original here ends with SSL3_ST_SR_CERT_A so maybe its not the same.

comment:20 Changed 7 years ago by nickm

Keywords: bufferevents added

comment:21 Changed 7 years ago by nickm

Milestone: Tor: 0.2.3.x-finalTor: 0.2.4.x-final

comment:22 Changed 7 years ago by nickm

Keywords: tor-relay added

comment:23 Changed 7 years ago by nickm

Component: Tor RelayTor

comment:24 Changed 6 years ago by nickm

Milestone: Tor: 0.2.4.x-finalTor: unspecified

comment:25 Changed 3 years ago by nickm

Resolution: wontfix
Status: newclosed

The bufferevents code and corresponding build options have been removed in 0.2.9.2-alpha

Note: See TracTickets for help on using tickets.