Opened 5 years ago

Last modified 21 months ago

#13603 new defect

Many channels closing with "IOERROR" reason given on control port.

Reported by:  toralf             Owned by:
Priority:     Medium             Milestone:       Tor: unspecified
Component:    Core Tor/Tor       Version:         Tor: 0.2.5.10
Severity:     Normal             Keywords:        tor-relay needs-analysis networking
Cc:                              Actual Points:
Parent ID:                       Points:
Reviewer:                        Sponsor:

Description

My Tor relay https://globe.torproject.org/#/relay/F1BE15429B3CE696D6807F4D4A58B1BFEC45C822
usually serves about 1000 inbound, 500 outbound, and 500 exit connections in parallel, using just 20% of one CPU core.

However, I wonder about the high number of IOERRORs I see on page 1#5 of the arm tool window when I press e + q:

 │ 20:47:12 [ORCONN] STATUS: CONNECTED  ENDPOINT: $DB65666C334A177E8D4EA5B12E933E16275DEEE7~FridgeNet
 │ 20:47:11 [ORCONN] STATUS: LAUNCHED   ENDPOINT: $F0B6616974070C06ED80207075CDE5112158FF86~walrusWaffles
 │ 20:47:11 [ORCONN] STATUS: CLOSED     ENDPOINT: $8B9AD4D3055E9F9A5A027BD98F61583E673FF84B~GoldDragon REASON: IOERROR
 │ 20:47:11 [ORCONN] STATUS: LAUNCHED   ENDPOINT: $DB65666C334A177E8D4EA5B12E933E16275DEEE7~FridgeNet
 │ 20:47:11 [ORCONN] STATUS: CLOSED     ENDPOINT: $8E3C2D457B2799C26F70012F5A4A857DA7D326AE~Unnamed REASON: DONE
 │ 20:47:11 [ORCONN] STATUS: CONNECTED  ENDPOINT: $C47EDFA198D217F5DDCEFFCA6CB837D1C0AC5E6A~chuckschuldiner
 │ 20:47:11 [ORCONN] STATUS: CONNECTED  ENDPOINT: $4587B58F494875B6ACF1E29B1D64B2C93BD533D3~SYDIT
 │ 20:47:11 [ORCONN] STATUS: NEW        ENDPOINT: 78.47.188.4:58097
 │ 20:47:10 [ORCONN] STATUS: CONNECTED  ENDPOINT: $E91B2D9F72ADA255D46ED32F880A0B1BFFF64981~lessQQmorePEWPEW
 │ 20:47:10 [ORCONN] STATUS: LAUNCHED   ENDPOINT: $4587B58F494875B6ACF1E29B1D64B2C93BD533D3~SYDIT
 │ 20:47:10 [ORCONN] STATUS: LAUNCHED   ENDPOINT: $E91B2D9F72ADA255D46ED32F880A0B1BFFF64981~lessQQmorePEWPEW
 │ 20:47:10 [ORCONN] STATUS: CONNECTED  ENDPOINT: $FB40DC3C724E5AA6ECD3EA1F64C0F559B48F21B5~torpidsDEinterwerk
 │ 20:47:10 [ORCONN] STATUS: LAUNCHED   ENDPOINT: $FB40DC3C724E5AA6ECD3EA1F64C0F559B48F21B5~torpidsDEinterwerk
 │ 20:47:09 [ORCONN] STATUS: CLOSED     ENDPOINT: $D4643D88FFF8A7DA22FC26070FD9B4F22C88CD2C~default REASON: IOERROR
 │ 20:47:09 [ORCONN] STATUS: CLOSED     ENDPOINT: $C48FECDA99852AA28953780A5F48911F2EF27B06~ololiuqui REASON: IOERROR
 │ 20:47:09 [ORCONN] STATUS: CLOSED     ENDPOINT: $3EE29A5BA006487B96AA5DC2CE2243793E5C463C~overratedconfigs REASON: IOERROR
 │ 20:47:09 [ORCONN] STATUS: CLOSED     ENDPOINT: $2FD0BA57A34DC2792AF470398F72F37F9E51DC2D~serotonin REASON: IOERROR
 │ 20:47:09 [ORCONN] STATUS: CLOSED     ENDPOINT: $A49A0ACF7D45C1A70CCD08F3B69322BB6B1AA7E9~bennobaksteen REASON: IOERROR
 │ 20:47:08 [ORCONN] STATUS: FAILED     ENDPOINT: $372082F3E01DE6A6333D30329C6903A19D0E8E87~Unnamed REASON: NOROUTE NCIRCS: 1
 │ 20:47:08 [ORCONN] STATUS: LAUNCHED   ENDPOINT: $372082F3E01DE6A6333D30329C6903A19D0E8E87~Unnamed
 │ 20:47:08 [ORCONN] STATUS: FAILED     ENDPOINT: $90AB1CE329584D694E33D0C56E85C5639027D0BB~Unnamed REASON: DONE
 │ 20:47:07 [ORCONN] STATUS: CLOSED     ENDPOINT: $AA2FA395FC85AFC3CF038E027D9DD00A92FF3255~gulmohar REASON: IOERROR

The system is a hardened 64-bit Gentoo with kernel 3.17.1.
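For reference, the same ORCONN event stream that arm displays can be read directly from the control port with stem; a minimal sketch, assuming the control port listens on the default 9051 and that cookie or password authentication is already configured:

#!/usr/bin/python2
# Minimal sketch: print ORCONN status events, roughly as arm does.
import time
from stem.control import EventType, Controller

def print_orconn(event):
  line = "%s [ORCONN] STATUS: %-9s ENDPOINT: %s" % (
    time.strftime("%H:%M:%S", time.localtime(event.arrived_at)),
    event.status, event.endpoint)
  if event.reason:
    line += " REASON: %s" % event.reason
  print(line)

with Controller.from_port(port=9051) as controller:
  controller.authenticate()
  controller.add_event_listener(print_orconn, EventType.ORCONN)
  raw_input()  # keep listening until Enter is pressed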

Child Tickets

Attachments (5)

historgram.png (17.6 KB) - added by toralf 5 years ago.
histogram
all_known.png (52.2 KB) - added by toralf 5 years ago.
histogram with labeled axis of those relays known by directory authorities and mirrors
all_known_detail.png (51.6 KB) - added by toralf 5 years ago.
detail for relays with ioerrors < 200
fingerprints (439.8 KB) - added by toralf 21 months ago.
histogram (1.8 KB) - added by toralf 21 months ago.


Change History (31)

comment:1 Changed 5 years ago by toralf

What I meant with the summary is that as soon as I see more than 600 connections (inbound + outbound), I observe such errors.

In addition, today I observed (for the first time since I installed the Tor relay) warnings like these:

Oct 30 20:45:43.000 [notice] Your Guard 0xbaddad ($EA8637EA746451C0680559FDFF34ABA54DDAE831) is failing more circuits than usual. Most likely this means the Tor network is overloaded. Success counts are 165/237. Use counts are 68/68. 210 circuits completed, 0 were unusable, 45 collapsed, and 6 timed out. For reference, your timeout cutoff is 60 seconds.
Oct 30 20:46:45.000 [warn] Your Guard 0xbaddad ($EA8637EA746451C0680559FDFF34ABA54DDAE831) is failing a very large amount of circuits. Most likely this means the Tor network is overloaded, but it could also mean an attack against you or potentially the guard itself. Success counts are 87/176. Use counts are 68/68. 148 circuits completed, 0 were unusable, 60 collapsed, and 3 timed out. For reference, your timeout cutoff is 60 seconds.
Oct 30 21:46:32.000 [warn] Your Guard 0xbaddad ($EA8637EA746451C0680559FDFF34ABA54DDAE831) is failing an extremely large amount of circuits. This could indicate a route manipulation attack, extreme network overload, or a bug. Success counts are 52/176. Use counts are 73/73. 138 circuits completed, 0 were unusable, 86 collapsed, and 2 timed out. For reference, your timeout cutoff is 60 seconds.

comment:2 Changed 5 years ago by toralf

Summary: a lot of IOERRORS if > 600 connections are open → a lot of IOERRORS if traffic is higher than about 12 MBit/sec

comment:3 Changed 5 years ago by phw

How do you start your Tor relay? Via the distribution's init script or manually?

comment:4 Changed 5 years ago by toralf

Via Gentoo's distribution init script (it is a hardened, stable 64-bit Gentoo):

tor-relay ~ # cat /etc/init.d/tor 
#!/sbin/runscript
# Copyright 1999-2014 Gentoo Foundation
# Distributed under the terms of the GNU General Public License v2
# $Header: /var/cvsroot/gentoo-x86/net-misc/tor/files/tor.initd-r7,v 1.1 2014/09/23 23:14:00 blueness Exp $

PIDFILE=/var/run/tor/tor.pid
CONFFILE=/etc/tor/torrc
GRACEFUL_TIMEOUT=${GRACEFUL_TIMEOUT:-60}

# See bug #523552, and https://trac.torproject.org/projects/tor/ticket/5525
# Graceful = wait 30 secs or so until all connections are properly closed.
extra_commands="checkconfig"
extra_started_commands="graceful gracefulstop reload"
description="Anonymizing overlay network for TCP"
description_checkconfig="Check for valid config file."
description_reload="Reload the configuration."
description_graceful="Gracefully restart."
description_gracefulstop="Gracefully stop."

depend() {
        need net
}

checkconfig() {
        # first check that it exists
        if [ ! -f ${CONFFILE} ] ; then
                eerror "You need to setup ${CONFFILE} first"
                eerror "Example is in ${CONFFILE}.sample"
                return 1
        fi

        # now verify whether the configuration is valid
        /usr/bin/tor --verify-config -f ${CONFFILE} > /dev/null 2>&1
        if [ $? -eq 0 ] ; then
                einfo "Tor configuration (${CONFFILE}) is valid."
                return 0
        else
                eerror "Tor configuration (${CONFFILE}) not valid."
                /usr/bin/tor --verify-config -f ${CONFFILE}
                return 1
        fi
}

start() {
        checkconfig || return 1
        checkpath -d -m 0755 -o tor:tor /var/run/tor
        ebegin "Starting Tor"
        HOME=/var/lib/tor
        start-stop-daemon --start --pidfile "${PIDFILE}" --quiet --exec /usr/bin/tor  -- -f "${CONFFILE}" --runasdaemon 1 --PidFile "${PIDFILE}" > /dev/null 2>&1
        eend $?
}

stop() {
        ebegin "Stopping Tor"
        start-stop-daemon --stop --pidfile "${PIDFILE}" --exec /usr/bin/tor -- --PidFile "${PIDFILE}"
        eend $?
}

graceful() {
        gracefulstop
        start
        eend $?
}

gracefulstop() {
        ebegin "Gracefully stopping Tor"
        ebegin "This can take up to ${GRACEFUL_TIMEOUT} seconds"
        start-stop-daemon -P --stop --signal INT -R ${GRACEFUL_TIMEOUT} --pidfile "${PIDFILE}" --exec /usr/bin/tor -- --PidFile "${PIDFILE}"
        eend $?
}

reload() {
        if [ ! -f ${PIDFILE} ]; then
                eerror "${SVCNAME} isn't running"
                return 1
        fi
        checkconfig || return 1
        ebegin "Reloading Tor configuration"
        start-stop-daemon --signal HUP --pidfile ${PIDFILE}
        eend $?
}

comment:5 Changed 5 years ago by phw

Hmm, shouldn't the script use ulimit to raise the file descriptor limit? At least Debian's init script calls ulimit -n some-large-number. If this is indeed missing, it would explain why your relay starts acting weirdly after a specific number of connections has been established: it might simply be running out of file descriptors.
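One way to check whether descriptors really are the bottleneck is to compare Tor's own descriptor limit with the number it currently has open; a sketch using stem (it assumes a reachable control port, that the running Tor supports the GETINFO keys process/descriptor-limit and process/pid, and a Linux /proc that is readable, so it may need to run as the tor user or root):

#!/usr/bin/python2
# Sketch: compare Tor's descriptor limit with its currently open descriptors.
import os
from stem.control import Controller

with Controller.from_port(port=9051) as controller:
  controller.authenticate()
  limit = controller.get_info("process/descriptor-limit")
  pid = controller.get_info("process/pid")
  open_fds = len(os.listdir("/proc/%s/fd" % pid))
  print("descriptor limit: %s, currently open: %i" % (limit, open_fds))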

comment:6 Changed 5 years ago by toralf

well, this is done in the appropriate config file:

tor-relay ~ # cat /etc/conf.d/tor
#
# Set the file limit
rc_ulimit="-n 30000"

I changed it to 60000, but the situation is the same: as soon as the number of connections and/or the bandwidth rises, I get the errors.

comment:7 Changed 5 years ago by toralf

Hmm, a debug-level log file gave me, over a time frame of 20 seconds:

tor-relay /var/log/tor # grep -c 'TLS error' debug.log
12

tor-relay /var/log/tor # egrep -c 'misc error\) for' debug.log
90

Does this help to narrow down this issue?
Also, how many IOERRORs are normal?

comment:8 Changed 5 years ago by toralf

And this is unusually high:

tor-relay /var/log/tor # netstat --tcp --udp --program -n | grep -c TIME_WAIT
512

tor-relay /var/log/tor # netstat --tcp --udp --program -n | grep -c tor
1368

Hmm, since I run my Tor relay as a non-exit relay now, I have 2700 connections but just 22 in the TIME_WAIT state.


comment:9 Changed 5 years ago by toralf

Priority: minor → normal
Summary: a lot of IOERRORS if traffic is higher than about 12 MBit/sec → too much IOERRORS ?

comment:10 Changed 5 years ago by toralf

With the script below I observed today, within 11 minutes, 351 closed OR connections: 71 with reason DONE, but 277 with reason IOERROR.

#!/usr/bin/python2
# Count closed OR connections by closure reason and print details
# for every connection that closed with IOERROR.

import time
import functools

from stem import ORStatus, ORClosureReason
from stem.control import EventType, Controller

i_done = 0      # closed with reason DONE
i_closed = 0    # closed, any reason
i_ioerror = 0   # closed with reason IOERROR

def main():
  with Controller.from_port() as controller:
    controller.authenticate()

    # subscribe to ORCONN events from the control port
    orconn_listener = functools.partial(orconn_event, controller)
    controller.add_event_listener(orconn_listener, EventType.ORCONN)

    raw_input()  # run until Enter is pressed

def orconn_event(controller, event):
  global i_done
  global i_closed
  global i_ioerror

  if event.status == ORStatus.CLOSED:
    i_closed += 1

    if event.reason == ORClosureReason.DONE:
      i_done += 1

    if event.reason == ORClosureReason.IOERROR:
      i_ioerror += 1
      fingerprint = event.endpoint_fingerprint

      try:
        # look up the relay in the consensus to print its address, port,
        # country and nickname next to the counters
        relay = controller.get_network_status(fingerprint)
        print " %i %i %i %i %s %s %15s %5i %s %s" % (i_closed, i_done, i_ioerror, event.arrived_at, time.ctime(event.arrived_at), fingerprint, relay.address, relay.or_port, controller.get_info("ip-to-country/%s" % relay.address, 'unknown'), relay.nickname)
      except Exception as exc:
        # relay not found in the consensus -- print what we have
        print " %i %i %i %i %s %s"                % (i_closed, i_done, i_ioerror, event.arrived_at, time.ctime(event.arrived_at), fingerprint)

if __name__ == '__main__':
  main()

I took a look at the values reported by ifconfig; there are no RX/TX errors so far.


comment:11 Changed 5 years ago by toralf

Well, line 2079 in tortls.c

  r = SSL_read(tls->ssl, cp, (int)len);

gives

read returned r=0, err=-8

which means SSL_ERROR_WANT_ACCEPT, right?
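If I read tortls.c/tortls.h correctly, the err value in that debug message is Tor's internal return code from tor_tls_get_error() rather than an OpenSSL SSL_ERROR_* value; a hypothetical decoding helper (the numeric values are my reading of the tortls.h defines and should be double-checked against the actual source tree):

# Hypothetical helper: map the "err=" values from tor_tls_read()'s debug
# message to the TOR_TLS_* names defined in tortls.h (values as I read
# them from the 0.2.5 source -- please verify against your tree).
TOR_TLS_CODES = {
   0: "TOR_TLS_DONE",
  -1: "TOR_TLS_WANTREAD",
  -2: "TOR_TLS_WANTWRITE",
  -3: "TOR_TLS_CLOSE",
  -4: "TOR_TLS_ERROR_TIMEOUT",
  -5: "TOR_TLS_ERROR_NO_ROUTE",
  -6: "TOR_TLS_ERROR_CONNREFUSED",
  -7: "TOR_TLS_ERROR_CONNRESET",
  -8: "TOR_TLS_ERROR_IO",
  -9: "TOR_TLS_ERROR_MISC",
}

print(TOR_TLS_CODES.get(-8, "unknown"))  # prints TOR_TLS_ERROR_IO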


comment:12 Changed 5 years ago by toralf

With the script shown above I have been collecting data for my exit node since the 31st of December. The raw data contain about 650,000 lines covering about 12,800 fingerprints over the 25 days. The current number of fingerprints known by the directory authorities is about 7,000 (derived using https://stem.torproject.org/tutorials/mirror_mirror_on_the_wall.html#where-can-i-get-the-current-descriptors ).

While the majority of relays have a reasonably small number of failed attempts, there is one relay with 1545 IOERRORs.

With the files "all" containing the fingerprints of all known relays and the raw data files from the error collecting script (file names: ioerror.<num>) I used the following command sequence to produce the histograms with gnuplot as seen in the attachment section:

$> cat ioerror.* | perl -we 'my %h = (); while (<>) { @f=split(); $h{$f[9]}++; } while ((my $k, $v) = each %h) { print $k, "\t", $v, "\n" }' | sort > io.csv
$> rm io_curr.csv; cat all | while read d; do grep $d io.csv >> io_curr.csv; done
$> awk ' { print $2 } ' < io_curr.csv > sc
$> cat sc | perl -we 'my %h = (); while (<>) { chomp(); $h{$_}++; } while ((my $k, $v) = each %h) { print $k, "\t", $v, "\n" }' | sort -n > tc
$> gnuplot -e 'set xrange [0:200]; set yrange [0:120]; set xlabel "ioerrors"; set ylabel "relays"; plot "tc" w l lt rgb "green"; pause -1'
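Roughly the same aggregation can also be done in a single Python pass; a sketch that assumes the raw files are named ioerror.* and that, as in the perl one-liner above, the fingerprint is the 10th whitespace-separated field (index 9):

# Sketch: count IOERRORs per fingerprint, keep only relays listed in the
# file "all", and print a histogram (ioerror count -> number of relays).
import glob
from collections import Counter

per_relay = Counter()
for path in glob.glob("ioerror.*"):
  with open(path) as f:
    for line in f:
      fields = line.split()
      if len(fields) > 9:
        per_relay[fields[9]] += 1

with open("all") as f:
  known = set(line.strip() for line in f)

histogram = Counter(n for fp, n in per_relay.items() if fp in known)
for n in sorted(histogram):
  print("%i\t%i" % (n, histogram[n]))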

At first glance the histogram looks like a superposition of two different processes. One looks like the well-known f(x) = exp(-x) decay, but there also seems to be a Gaussian curve with a mean of about x = 100.

If desired I can provide the raw data.

FWIW: there are about 5000 fingerprints in the data set that are not known by the directory authorities. Those were excluded from the data set (although it wouldn't make any difference).


Changed 5 years ago by toralf

Attachment: historgram.png added

histogram

Changed 5 years ago by toralf

Attachment: all_known.png added

histogram with labeled axis of those relays known by directory authorities and mirrors

Changed 5 years ago by toralf

Attachment: all_known_detail.png added

detail for relays with ioerrors < 200

comment:13 Changed 5 years ago by nickm

Milestone: Tor: 0.2.7.x-final

comment:14 Changed 5 years ago by nickm

Status: new → assigned

comment:15 Changed 4 years ago by nickm

Keywords: 027-triaged-1-out added

Marking triaged-out items from first round of 0.2.7 triage.

comment:16 Changed 4 years ago by nickm

Milestone: Tor: 0.2.7.x-final → Tor: 0.2.???

Make all non-needs_review, non-needs_revision, 027-triaged-1-out items belong to 0.2.???

comment:17 Changed 3 years ago by JivanAmara

Severity: Normal

I'm seeing the same thing on an Ubuntu 16.04 host. Happy to provide details if someone picks this issue up.

comment:18 Changed 3 years ago by teor

Milestone: Tor: 0.2.??? → Tor: 0.3.???

Milestone renamed

comment:19 Changed 3 years ago by nickm

Keywords: tor-03-unspecified-201612 added
Milestone: Tor: 0.3.??? → Tor: unspecified

Finally admitting that 0.3.??? was a euphemism for Tor: unspecified all along.

comment:20 Changed 2 years ago by nickm

Keywords: tor-03-unspecified-201612 removed

Remove an old triaging keyword.

comment:21 Changed 2 years ago by nickm

Keywords: 027-triaged-in added

comment:22 Changed 2 years ago by nickm

Keywords: 027-triaged-in removed

comment:23 Changed 2 years ago by nickm

Keywords: 027-triaged-1-out removed

comment:24 Changed 2 years ago by nickm

Status: assigned → new

Change the status of all assigned/accepted Tor tickets with owner="" to "new".

comment:25 Changed 2 years ago by nickm

Keywords: tor-relay needs-analysis networking added
Summary: too much IOERRORS ? → Many channels closing with "IOERROR" reason given on control port.

comment:26 Changed 21 months ago by toralf

I'd like to come back to this topic, mainly due to the DoS against Tor relays in the last few days.

I used [1] to gather the closure reasons of closed connections and [2] to count the DONE, IOERROR, TIMEOUT, and CONNECTRESET occurrences and create a histogram of those values. The measurement was made over the last 3 days.

While the majority of relays have just a few IOERRORs, there are a bunch of relays with hundreds or thousands of failures within 2 days without any DONE. FWIW, the authority faravahar shows almost exclusively CONNECTRESET.

I'll attach the fingerprints and the histogram obtained from my 2 relays (running on the same hardware).

[1] https://github.com/toralf/torutils/blob/master/orstatus.py
[2] https://github.com/toralf/torutils/blob/master/plot.sh

Changed 21 months ago by toralf

Attachment: fingerprints added

Changed 21 months ago by toralf

Attachment: histogram added