Opened 10 months ago

Last modified 7 weeks ago

#29427 needs_revision defect

kist: Poor performance with a small number of sockets

Reported by: dgoulet Owned by: dgoulet
Priority: Medium Milestone: Tor: 0.4.3.x-final
Component: Core Tor/Tor Version: Tor: 0.3.2.1-alpha
Severity: Major Keywords: tor-sched, kist, regression?, 041-deferred-20190530, BugSmashFund, 035-backport, 040-backport, 041-backport, 042-backport
Cc: pastly, robgjansen Actual Points: 0.1
Parent ID: Points: 0.1
Reviewer: nickm Sponsor:

Description

We recently found that KIST performs very poorly if tor has a very small number of sockets.

How KIST operates

KIST is scheduled when cells are put on a circuit queue. A scheduler run might not handle all cells, because how much it can write depends on the available space in the TCP buffer for each socket. What KIST does at the moment is reschedule itself 10ms later (a static value).

The problem is that if there are very few sockets (as with most tor clients), KIST will be able to handle one socket very quickly, say in 1ms, and then it will sleep for the remaining 9ms until it is rescheduled.

During that 9ms of waiting, tor is not pushing bytes onto the wire even though it could. See the attached graph made by pastly, which shows how badly KIST underperforms with the current 10ms interval.
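
To make the current behavior concrete, here is a minimal C sketch of the static-interval logic described above. It is illustrative only: the helper functions are hypothetical stand-ins for Tor's event-loop and scheduler plumbing, not its real API.

#include <stdint.h>

#define KIST_SCHED_RUN_INTERVAL_MSEC 10 /* the static value at issue */

/* Hypothetical stand-ins, assumed for illustration. */
void kist_scheduler_run(void);            /* flush what TCP space allows */
void schedule_run_in_msec(uint32_t msec); /* arm a one-shot timer */
int cells_still_pending(void);            /* cells left on circuit queues */

static void
scheduler_timer_cb(void)
{
  kist_scheduler_run();
  if (cells_still_pending()) {
    /* Even if our only socket was drained in ~1ms, we still wait the
     * full static interval before running again, so the NIC sits idle
     * for the remaining ~9ms. */
    schedule_run_in_msec(KIST_SCHED_RUN_INTERVAL_MSEC);
  }
}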

Consequences

(There might be more; don't treat this as an exhaustive list.)

  1. Clients are basically capped in bandwidth, because they generally talk to their Guard over a single socket.
  2. A new relay joining the network won't have many connections, so when the authorities or our bandwidth scanners measure it, they will only see a capped value compared to what the relay could actually sustain (if higher). The measurement will recover after a while, once the relay starts seeing traffic and its number of sockets ramps up.

Solution

As you can see in the attached graph, bringing the scheduler interval down to 2ms gives us better performance than Vanilla. That could be a short-term solution.

A better, more medium-term solution would be to make the scheduling interval dynamic, depending on how fast tor thinks the TCP buffer on the socket will be emptied. That basically depends on the connection's throughput. For example, a 100mbit NIC towards a Guard might only push through 10mbit, so we would need a way for tor to learn that per connection, which would allow KIST to estimate when it needs to be rescheduled for that connection.
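
As a rough illustration of that dynamic approach, here is a sketch in C. The function name is made up, and it assumes tor already has a per-connection throughput estimate, which is exactly the part we would still need to build:

#include <stdint.h>

/* Estimate when a connection's kernel TCP buffer will have drained, given
 * the bytes we just queued on it and a per-connection throughput estimate
 * (bytes/sec) that tor would have to learn from recent write history. */
static uint32_t
estimated_drain_msec(uint64_t queued_bytes, uint64_t est_bytes_per_sec)
{
  if (est_bytes_per_sec == 0)
    return 10; /* no estimate yet: fall back to the static 10ms default */
  uint64_t msec = (queued_bytes * 1000) / est_bytes_per_sec;
  if (msec < 1)
    msec = 1;  /* never busy-loop the scheduler */
  if (msec > 10)
    msec = 10; /* never wait longer than the current static interval */
  return (uint32_t)msec;
}

For example, with 5000 queued bytes and an estimated 1MB/s of per-connection throughput, this would reschedule the run in about 5ms instead of the static 10ms.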

Child Tickets

Attachments (2)

perf-10ms.png (123.6 KB) - added by dgoulet 10 months ago.
perf-10ms-more-controlled.png (79.9 KB) - added by pastly 9 months ago.


Change History (19)

Changed 10 months ago by dgoulet

Attachment: perf-10ms.png added

comment:1 Changed 10 months ago by pastly

Cc: pastly added

comment:2 Changed 10 months ago by robgjansen

Cc: robgjansen added

More intuition about the problem:

KIST tries not to overload the kernel when there are many sockets, and so it only runs every 10 msec. On high performance relays with lots of sockets, this is a good thing.

But on a relay with only 1 active socket, it is possible that you fill the kernel socket buffer, the NIC takes only e.g. 2 msec to send it, and then KIST doesn't run again for another 8 msec. So the NIC sits there idle for those 8 msec.

This is only an issue when the sum of bytes in all kernel-level socket buffers is less than the number of bytes the NIC could send in 10 msec. This can happen when there are only a few sockets, or on a really really fast NIC.


What to do:

Clients:

KIST was designed for relays. Clients don't need to prioritize traffic the way relays do, so they don't really need KIST. Clients can simply run the vanilla scheduler so that they read/write ASAP (rather than deferring I/O like KIST does). Or clients can run KIST with a 1 msec scheduling interval.

Relays:

For relays, we could guess how long it would take the kernel to send out all of the notsent bytes sitting in kernel buffers plus all outbuf bytes sitting in Tor outbufs. If the time we guess is less than 10 msec, then we could run KIST sooner. This guess would probably involve knowing or estimating the NIC speed.

Slightly more formally: if we know we can write b bytes per KIST scheduling run, i.e. b bytes per m milliseconds, but we only actually wrote w bytes where w < b, then we know we need to call back sooner than m milliseconds. We can dynamically compute the time it will take the kernel to send the w bytes as:

new_callback_time = m * (w/b)

Then we check back again in new_callback_time milliseconds (the time when the kernel will be empty) instead of m milliseconds.

Then also, KIST will never let itself write more than b bytes across all sockets, because it knows that its network card can't possibly write more than b bytes.
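
A direct translation of that formula might look like this in C; it is a sketch of the computation above, not code from Tor:

#include <stdint.h>

/* Given that a full scheduling run can write at most `b` bytes per `m`
 * milliseconds, but this run only wrote `w` bytes (w < b), estimate when
 * the kernel buffers will be empty: new_callback_time = m * (w / b). */
static uint32_t
next_callback_msec(uint64_t w, uint64_t b, uint32_t m)
{
  if (b == 0 || w >= b)
    return m;  /* wrote a full run's worth: keep the normal interval */
  uint64_t t = (m * w) / b; /* rounds down: better slightly early than idle */
  return t > 0 ? (uint32_t)t : 1; /* clamp to at least 1 msec */
}

For example, with m = 10, b = 100000 and w = 20000, this calls the scheduler back in 2 msec instead of 10.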

Issues:

I don't know how each relay can reliably compute the value of b. Maybe we start with the "observed bandwidth" as an estimate? But then we need to allow b to grow in case the relay suddenly got faster, or for new relays?

Last edited 9 months ago by dgoulet

comment:3 in reply to:  2 Changed 9 months ago by dgoulet

Replying to robgjansen:

> KIST was designed for relays. Clients don't need to prioritize traffic the way relays do, so they don't really need KIST. Clients can simply run the vanilla scheduler so that they read/write ASAP (rather than deferring I/O like KIST does). Or clients can run KIST with a 1 msec scheduling interval.

Fortunately, right now, it is easy for Tor to know whether it is running as a relay or not. An easy solution is to adjust KISTSchedRunInterval to 2 msec for clients *only* (initial testing at 1 msec apparently locks up tor; that needs to be investigated). A rough sketch of the idea follows.
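
The sketch below is purely illustrative: server_mode() is a real predicate in Tor, but the constants and the helper function here are assumptions for this example, not the actual patch.

#include <stdint.h>

#define KIST_INTERVAL_RELAY_MSEC  10 /* keep current behavior on relays */
#define KIST_INTERVAL_CLIENT_MSEC  2 /* 1 msec apparently locks up tor */

/* Illustrative helper: pick the KIST run interval based on whether this
 * instance acts as a relay (in Tor, server_mode(get_options()) answers
 * that; here we just take a flag so the sketch stands alone). */
static uint32_t
pick_kist_interval_msec(int am_relay)
{
  return am_relay ? KIST_INTERVAL_RELAY_MSEC : KIST_INTERVAL_CLIENT_MSEC;
}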

What worries me here is onion services. They can have a *lot* of circuits to many rendezvous points, so there is a clear need for circuit priority and for not overloading the Guard link, which is basically what KIST helps with. But then, we don't have a way to measure the NIC throughput in use for clients/HS :S ...

> I don't know how each relay can reliably compute the value of b. Maybe we start with the "observed bandwidth" as an estimate? But then we need to allow b to grow in case the relay suddenly got faster, or for new relays?

For relays, I think the observed bandwidth from the consensus could be a good start until we have a reliable way for Tor to measure its throughput regularly.
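
For instance, b could be seeded from the observed bandwidth with some headroom so the estimate can grow; the names and the 25% headroom below are assumptions for illustration only:

#include <stdint.h>

/* Seed `b` (bytes writable per scheduling run) from the relay's observed
 * bandwidth (bytes/sec), with headroom so the estimate can grow if the
 * relay turns out to be faster than currently measured. */
static uint64_t
estimate_b_bytes(uint64_t observed_bw_bytes_per_sec, uint32_t interval_msec)
{
  uint64_t b = (observed_bw_bytes_per_sec * interval_msec) / 1000;
  return b + b / 4; /* +25% headroom; new relays would need more than this */
}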

comment:4 Changed 9 months ago by pastly

I have more confidence in the following graph as a useful piece of evidence for you.

This is a tiny 10-relay network run on localhost on my Debian 9 desktop. There is one Tor client with one curl process downloading a single large file of all zeros from an nginx instance also running on localhost. The client builds normal three-hop circuits to this webserver, always choosing the target relay as the first hop. All relays and the client use the same scheduler, and in the case of KIST, the same run interval too. Everyone runs Debian's unmodified Tor 0.3.5.7.

Here is the configuration of a relay in the network:

$ cat torrc-common

ShutdownWaitLength 2
ExitRelay 1
IPv6Exit 1
ExitPolicy accept *:*
CookieAuthentication 1
ContactInfo pastly@torproject.org
LogTimeGranularity 1
SafeLogging 0

DirAuthority auth1 orport=10102 no-v2 v3ident=13572CEF296468E344506CAE402BDE55A28C21CD 127.100.1.1:10103 04C4B152E7EE3960B947BDE96823728132BE2A06
DirAuthority auth2 orport=10106 no-v2 v3ident=47188F93370723370B6C1F441C9131F68F65F54C 127.100.1.1:10107 A182371ABFBDE825B359AD005EEA795F27F91C81
DirAuthority auth3 orport=10110 no-v2 v3ident=CA8134FE7E018D48C4821E3C3233DE5A6C68C823 127.100.1.1:10111 71A9A9E880118B4BCA5B5A4303BF8C0534F92D2F
TestingTorNetwork 1
# change between kist and vanilla here
# change KISTSchedRunInterval with consensus
#     param and waiting for it to be disseminated
#     to all
Schedulers Vanilla

$ cat relay1/torrc 

%include torrc-common
DataDirectory relay1
PidFile relay1/tor.pid
#Log notice file relay1/notice.log
Address 127.100.1.1
SocksPort 127.100.1.1:10112
ControlPort 127.100.1.1:10113
ControlSocket /redacted/path/to/relay1/control_socket
ORPort 127.100.1.1:10114
DirPort 127.100.1.1:10115
Nickname relay1
CacheDirectory /tmp/relay1


Here is the configuration of the client in this network:

$ cat torrc-common 

DirAuthority auth1 orport=10102 no-v2 v3ident=13572CEF296468E344506CAE402BDE55A28C21CD 127.100.1.1:10103 04C4B152E7EE3960B947BDE96823728132BE2A06
DirAuthority auth2 orport=10106 no-v2 v3ident=47188F93370723370B6C1F441C9131F68F65F54C 127.100.1.1:10107 A182371ABFBDE825B359AD005EEA795F27F91C81
DirAuthority auth3 orport=10110 no-v2 v3ident=CA8134FE7E018D48C4821E3C3233DE5A6C68C823 127.100.1.1:10111 71A9A9E880118B4BCA5B5A4303BF8C0534F92D2F

TestingTorNetwork 1
NumCPUs 1
LogTimeGranularity 1
SafeLogging 0
ShutdownWaitLength 2
CookieAuthentication 1
# change between kist and vanilla here
# change KISTSchedRunInterval with consensus
#     param and waiting for it to be disseminated
#     to all
Schedulers Vanilla

$ cat client10301/torrc 

%include torrc-common
DataDirectory client10301
PidFile client10301/tor.pid
#Log notice file client10301/notice.log
SocksPort 127.0.0.1:10301
ControlSocket /redacted/path/to/client10301/control_socket
CacheDirectory /tmp/client10301
EntryNodes relay1

Please use the following graph for insight instead of the previously shared perf-10ms.png. It is much closer to the real world (unmodified Tor binary, 3-hop circuits, etc.).

Changed 9 months ago by pastly

Attachment: perf-10ms-more-controlled.png added

comment:5 Changed 6 months ago by nickm

Keywords: regression? added

comment:6 Changed 6 months ago by nickm

Keywords: 041-deferred-20190530 added

Marking these tickets as deferred from 041.

comment:7 Changed 6 months ago by nickm

Milestone: Tor: 0.4.1.x-final → Tor: 0.4.2.x-final

comment:8 Changed 2 months ago by nickm

Keywords: 042-should added

comment:9 Changed 7 weeks ago by ahf

Owner: set to dgoulet
Status: new → assigned

Distributing 0.4.2 tickets between network team members.

comment:10 Changed 7 weeks ago by dgoulet

Actual Points: 0.1
Keywords: BugSmashFund added
Points: 0.1
Status: assigned → needs_review

PR: https://github.com/torproject/tor/pull/1387
Branch: ticket29427_042_01

comment:11 Changed 7 weeks ago by nickm

Quick question: do we have a way to test that this performs as expected on different platforms?

comment:12 in reply to:  11 Changed 7 weeks ago by dgoulet

Replying to nickm:

> Quick question: do we have a way to test that this performs as expected on different platforms?

We do not. The interval is chosen from torrc and the consensus, so it shouldn't matter which platform we are on?
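
For context, that resolution order might be sketched like this (illustrative names only, not Tor's actual lookup functions):

#include <stdint.h>

#define KIST_DEFAULT_RUN_INTERVAL_MSEC 10

/* Hypothetical stand-ins for tor's torrc and consensus-parameter lookups. */
int32_t kist_interval_from_torrc(void);            /* 0 means "not set" */
int32_t kist_interval_from_consensus(int32_t def); /* param value or def */

/* An explicit torrc value wins; otherwise take the consensus parameter.
 * Either way the result is a plain number, independent of the platform. */
static int32_t
kist_run_interval_msec(void)
{
  int32_t torrc_val = kist_interval_from_torrc();
  if (torrc_val > 0)
    return torrc_val;
  return kist_interval_from_consensus(KIST_DEFAULT_RUN_INTERVAL_MSEC);
}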

comment:13 Changed 7 weeks ago by nickm

> We do not. The interval is chosen from torrc and the consensus, so it shouldn't matter which platform we are on?

My concern is that different platforms sometimes handle small timers very differently.

Should there be separate consensus parameters for client and server?

comment:14 in reply to:  13 Changed 7 weeks ago by dgoulet

Replying to nickm:

>> We do not. The interval is chosen from torrc and the consensus, so it shouldn't matter which platform we are on?

> My concern is that different platforms sometimes handle small timers very differently.

True. For multiple platforms, I have to say no :S.

I would assume timers at the msec level are fine on most of our targeted platforms. Also, this specific interval is only used on Linux and the BSDs; Windows uses the Vanilla scheduler, which doesn't have this problem.

> Should there be separate consensus parameters for client and server?

Yes, this is a good idea actually. But does this mean 043, considering we are past our feature freeze?

comment:15 Changed 7 weeks ago by nickm

Reviewer: nickm

Setting myself as reviewer per discussion at meeting.

One more question: is this something we want to think about potentially backporting? If not, should it wait for 043 when we can treat it as a feature and add a new consensus parameter?

comment:16 in reply to:  15 Changed 7 weeks ago by dgoulet

Keywords: 035-backport 040-backport 041-backport 042-backport added
Milestone: Tor: 0.4.2.x-final → Tor: 0.4.3.x-final
Status: needs_review → needs_revision

Replying to nickm:

> Setting myself as reviewer per discussion at meeting.

> One more question: is this something we want to think about potentially backporting? If not, should it wait for 043 when we can treat it as a feature and add a new consensus parameter?

Discussed with nickm on IRC. There is a good argument for not partitioning clients/HS into two "performance" buckets.

So because of this, we'll defer this to 043, add two consensus parameters (client and relay sched intervals), and backport it to 035. We'll have a good chunk of the 043 cycle to make sure it works properly.

comment:17 Changed 7 weeks ago by nickm

Keywords: 042-should removed