Opened 5 years ago

Closed 2 years ago

#12890 closed enhancement (fixed)

Design and implement optimizations for socket write limits

Reported by: robgjansen
Owned by:
Priority: Medium
Milestone: Tor: 0.3.2.x-final
Component: Core Tor/Tor
Version:
Severity: Normal
Keywords: tor-relay kist networking tcp
Cc:
Actual Points:
Parent ID: #12541
Points:
Reviewer:
Sponsor:

Description

KIST has two components: global scheduling (#9262) and socket write limits. This ticket is to track discussion about the design that should be implemented to realize socket write limits, and discussion about the implementation.

The goal of the write limit is to never send to the kernel what the kernel wouldn't send out to the network anyway due to throttling at the TCP layer. Rob's USENIX Security paper computed write limits for each socket as

sock_space = sock_buf_size - sock_buf_len
tcp_space = (snd_cwnd - snd_unacked) * mss
sock_write_limit = min(sock_space, tcp_space)

And then a global write limit across all sockets for each scheduling round is computed according to the upstream bandwidth of the relay and the configured write callback time interval. Writing in a given round ends when either the global limit is reached, or all of the socket limits are reached.

The TCP information can be collected with a getsockopt call, but doing this for every socket for every write round (callback interval) can get expensive. A kernel hacker, Patrick McHardy, suggested using the "netlink socket diag" interface (examples here and here) to collect information for multiple sockets all at once instead of a separate system call for each.

Note that the socket write limit need not actually be computed, because the kernel will return EAGAIN when the socket is full anyway. Along these lines, Bryan Ford suggested setting the socket buffer size based on the amount Tor thinks it should send plus a little extra (e.g., tcp_space*1.25), and then let the kernel push back automatically instead of trying to compute a new write limit for every socket for every write interval round. Then Tor can continue to try to write as much as it can and let the kernel push back when Tor should stop. In this case, we need to ensure TCP auto-tuning is disabled, as otherwise it may undo our settings by adjusting our socket buffer sizes underneath us.

I think we need two intervals: e.g., we want to try to write every 10 milliseconds, and then update snd_cwnd/write limits/socket buffer sizes every 100 milliseconds.

Child Tickets

Attachments (4)

0001-refactor-loop-to-separate-function-for-easier-Shadow.patch (3.2 KB) - added by robgjansen 5 years ago.
0001-remove-sock_space-calculation.patch (1.1 KB) - added by robgjansen 5 years ago.
0001-refactor-sleep-call-outside-of-sockmgr_thread_loop_o.patch (2.1 KB) - added by robgjansen 5 years ago.
kist-updates-0262.tar (30.0 KB) - added by robgjansen 4 years ago.
Updates to KIST, done on nickm's 'kist' branch after merging it with tag 'tor-0.2.6.2-alpha'


Change History (46)

comment:1 Changed 5 years ago by robgjansen

In case it isn't clear, this should be implemented on top of #9262.

comment:2 Changed 5 years ago by nickm

Component: - Select a component → Tor
Keywords: tor-relay added

comment:3 Changed 5 years ago by nickm

Milestone: Tor: 0.2.6.x-final

comment:4 Changed 5 years ago by nickm

Milestone: Tor: 0.2.6.x-final → Tor: 0.2.7.x-final

comment:5 Changed 5 years ago by robgjansen

Milestone: Tor: 0.2.7.x-final → Tor: 0.2.6.x-final

Currently talking with Nick about the design of this component. We decided that this could completely be implemented in a separate thread with no feedback to the master thread.

Get the buffer lengths:

ioctl(fd, SIOCINQ, &inqlen);
ioctl(fd, SIOCOUTQ, &outqlen);

This gives us the lengths (i.e., the data waiting for I/O) of the receive and send buffers, respectively.

Get the TCP info:

getsockopt(fd, SOL_TCP, TCP_INFO, (void *)&tcp_info, &tcp_info_length);

That gives us this struct (or something like it):

struct tcp_info {
	__u8	tcpi_state;
	__u8	tcpi_ca_state;
	__u8	tcpi_retransmits;
	__u8	tcpi_probes;
	__u8	tcpi_backoff;
	__u8	tcpi_options;
	__u8	tcpi_snd_wscale : 4, tcpi_rcv_wscale : 4;

	__u32	tcpi_rto;
	__u32	tcpi_ato;
	__u32	tcpi_snd_mss;
	__u32	tcpi_rcv_mss;

	__u32	tcpi_unacked;
	__u32	tcpi_sacked;
	__u32	tcpi_lost;
	__u32	tcpi_retrans;
	__u32	tcpi_fackets;

	/* Times. */
	__u32	tcpi_last_data_sent;
	__u32	tcpi_last_ack_sent;     /* Not remembered, sorry. */
	__u32	tcpi_last_data_recv;
	__u32	tcpi_last_ack_recv;

	/* Metrics. */
	__u32	tcpi_pmtu;
	__u32	tcpi_rcv_ssthresh;
	__u32	tcpi_rtt;
	__u32	tcpi_rttvar;
	__u32	tcpi_snd_ssthresh;
	__u32	tcpi_snd_cwnd;
	__u32	tcpi_advmss;
	__u32	tcpi_reordering;

	__u32	tcpi_rcv_rtt;
	__u32	tcpi_rcv_space;

	__u32	tcpi_total_retrans;
};

(More info here: http://linuxgazette.net/136/pfeiffer.html)

Then we want setsockopt to set the buffer sizes based on the current length and the tcp window:

setsockopt(fd, SOL_SOCKET, SO_SNDBUF, (void *)&snd_size, snd_size_len);
setsockopt(fd, SOL_SOCKET, SO_RCVBUF, (void *)&rcv_size, rcv_size_len);

NOTE: When setting SO_SNDBUF and SO_RCVBUF with setsockopt, the value applied is double the value passed (it is automatically doubled by the kernel).

comment:6 in reply to:  5 ; Changed 5 years ago by dgoulet

Replying to robgjansen:

Then we want setsockopt to set the buffer sizes based on the current length and the tcp window:

setsockopt(fd, SOL_SOCKET, SO_SNDBUF, (void *)&snd_size, snd_size_len);
setsockopt(fd, SOL_SOCKET, SO_RCVBUF, (void *)&rcv_size, rcv_size_len);

NOTE: When setting SO_SNDBUF and SO_RCVBUF with setsockopt, the value applied is double the value passed (it is automatically doubled by the kernel).

Here is some extra info on that, hope this can be useful:

The kernel makes sure also that you do not go above /proc/sys/net/core/wmem_max for SO_SNDBUF and /proc/sys/net/core/rmem_max for SO_RCVBUF. Note that those are system wide thus should probably not be mangled by tor (unless the box is dedicated for tor traffic :). Furthermore, that value is multiplied by 2 and must at least be a certain minimum. (Below is a snippet from the 3.18.0 kernel).

#define TCP_SKB_MIN_TRUESIZE    (2048 + SKB_DATA_ALIGN(sizeof(struct sk_buff)))

#define SOCK_MIN_SNDBUF     (TCP_SKB_MIN_TRUESIZE * 2)
#define SOCK_MIN_RCVBUF      TCP_SKB_MIN_TRUESIZE

sk->sk_sndbuf = max_t(u32, val * 2, SOCK_MIN_SNDBUF);
sk->sk_rcvbuf = max_t(u32, val * 2, SOCK_MIN_RCVBUF);

The size of struct sk_buff is 232 bytes. Note that this is not considered a stable ABI for user space, so the size of sk_buff can vary over time. Also, SKB_DATA_ALIGN is an alignment on the L1 cache line, but it's often only 32 bytes, so here sk_buff is actually rounded up to 256 bytes.

Having congestion control in user space is difficult performance-wise. For one, you have to make extra syscalls for every I/O operation, and it also becomes a game of guessing the maximum size, which is tightly controlled by the kernel.

comment:7 in reply to:  6 ; Changed 5 years ago by robgjansen

Replying to dgoulet:

Having congestion control in user space is difficult performance-wise. For one, you have to make extra syscalls for every I/O operation, and it also becomes a game of guessing the maximum size, which is tightly controlled by the kernel.

Hmmm. Yeah.

Relatedly, do you know if TCP auto-tuning in the kernel is disabled when the buffer size is explicitly set with setsockopt? If not, do you know a socket option to disable TCP auto-tuning for a given socket?

comment:8 in reply to:  7 Changed 5 years ago by yawning

Ooof.

Replying to robgjansen:

ioctl(fd, SIOCINQ, &inqlen);
ioctl(fd, SIOCOUTQ, &outqlen);

TIOCOUTQ/TIOCINQ are more portable, the SIOCINQ/SIOCOUTQ defines are Linux-isms.

Get the TCP info:

getsockopt(fd, SOL_TCP, TCP_INFO, (void *)&tcp_info, &tcp_info_length);

For reference, the *BSD structure is at:
https://svnweb.freebsd.org/base/head/sys/netinet/tcp.h?revision=246210&view=markup#l192

For Vista and later there is GetPerTcpConnectionEStats()
http://msdn.microsoft.com/en-us/library/bb485738%28VS.85%29.aspx

Replying to robgjansen:

Replying to dgoulet:
Relatedly, do you know if TCP auto-tuning in the kernel is disabled when the buffer size is explicitly set with setsockopt? If not, do you know a socket option to disable TCP auto-tuning for a given socket?

Yes, it does disable auto-tuning if the app layer requests something explicit.

As an aside, I'm slightly worried about a sufficiently infrequent update window leading to underutilizing the link, but the interval can be tunable, and we can see what a good value is in the wild. (The reverse case, where we overutilize the link because an insufficiently frequent polling interval keeps us from responding to cwnd getting cut, is no worse than the current behavior, so I'm not worried about that.)

comment:9 Changed 5 years ago by yawning

More regarding portability, since nickm was asking about it in IRC, and I went and actually looked at the code because comments by other people who use TCP_INFO scared me.

The good:

  • Linux < 2.6.x does not support this, since this is when the call was added.
  • Windows depends on version (with a preference towards the Windows-ism added in Vista since it's a single call that appears to return all the necessary information). The iperf people claim that TCP_INFO is supported, but I can't find documentation to back this up.

The bad:

  • NetBSD/OpenBSD do not support TCP_INFO.

The ugly:

  • FreeBSD support was added in the FreeBSD 6.0 timeframe, *but* a bunch of values are missing (http://fxr.watson.org/fxr/source/netinet/tcp_usrreq.c#L1259), including a way to get "SND.UNA", making it effectively worthless ("SND.WND", "SND.NXT" and "cwnd" are insufficient to calculate the amount of in-flight data). However, making it work is a trivial one-line change to tcp_usrreq.c:tcp_fill_info().

  /* The full patch will want to rename tcp_info.__tcpi_unacked... */
  ti->__tcpi_unacked = tp->snd_una;

The FreeBSD version of the tcp_space calculation assuming the patch would be something like:

  /*
   * Everything is in bytes, so none of that MSS sillyness.
   *
   * It's probably worth calculating the available space based off: 
   *  wnd_size = min(ti.tcpi_snd_cwnd, ti.tcpi_snd_wnd)
   * instead of just tcpi_snd_cwnd since the information is already
   * in the tcp_info structure.
   */
  tcp_space = ti.tcpi_snd_cwnd - (ti.tcpi_snd_next - ti.__tcpi_unacked);

The Fruit Company:

  • Darwin does not expose the fact that it supports TCP_INFO in /usr/include/netinet/tcp.h, but the XNU source (at least xnu-2050.18.24) has code that implements this. Since they are Apple, their idea of what `struct tcp_info` should look like is different from everyone else's, and ironically enough is the easiest to use. The header comment also notes that "The TCP_INFO socket option is a private API and is subject to change.".
      /*
       * The Fruit Company has their own struct tcp_info, that can be
       * queried by getsockopt() TCP_INFO (undocumented), or sysctl().
       *
       * The Fruit version of struct tcp_info handily includes:
       *  ti->tcpi_txunacked = tp->snd_max - tp->snd_una;
       *
       * As an added bonus at least as of xnu-2050.18.24, we can omit the
       * TIOCOUTQ call.
       */
      sndbuf_bytes = ti.tcpi_snd_sbbytes; /* Look ma, I saved a syscall! */
      tcp_space = ti.tcpi_snd_cwnd - ti.tcpi_txunacked; /* Both in bytes. */
    

I assume that just using the ioctl() calls to get the send buffer capacity/current size on non-Linux/Windows platforms will still be an improvement, but it bums me out a bit. If people want I can write the kernel patches for all of the BSDs.

*Edit: Actually looked at the XNU code.*

Last edited 5 years ago by yawning (previous) (diff)

comment:10 Changed 5 years ago by nickm

I have a linux-only implementation that seems to run and adjust socket buffers in my branch "kist".

It still needs work:

  • Darwin, BSD, Windows support.
  • Shutting down the thread and freeing memory on exit.
  • Avoiding its loop when Tor is quiescent.
  • Some kind of debugging log to make sure the values it's setting are reasonable.
  • Upper and lower bounds on what it's willing to set the buffer sizes to.
  • Tuning for global read/write limits
  • Tuning for maximum output buffer sizes.

Still, it's a start!

comment:11 Changed 5 years ago by nickm

(Rob, please let me know which of the above are necessary before you could start testing this.)

comment:12 in reply to:  11 Changed 5 years ago by robgjansen

Replying to nickm:

(Rob, please let me know which of the above are necessary before you could start testing this.)

Some changes and thoughts:

  1. I have attached a patch for a slight refactoring of the thread main loop to make it easier to run the thread in Shadow.
  2. I thought about it, and I'm pretty sure we don't actually want the socket size calculation anymore. I needed it in my prototype because I was not setting socket buf sizes explicitly, and so had to deal with the buf size changing underneath me from tcp autotuning. Since we are setting the buf size, taking buf_size = MIN(sock_space, tcp_space); would mean that the buf size could never increase - and I think we don't want that! So, all we would have to compute is tcp_space and we can save the other two syscalls :) I have attached a patch for this as well.
  3. It would be useful to have the global read/write limits feature there before starting to test. For now, this could be as simple as a configuration option that specifies the upstream link capacity, and then the logic necessary to enforce the limit.
  4. We probably also want to write a bit above the per-socket and/or global limits, so that we can minimize kernel 'starvation' (when the kernel could have sent something, but it had no data to send because our calculations were a bit off). Options that specify how much above the per-socket and global limits we want to write would be helpful. For example, buf_size = tcp_space; would become buf_size = tcp_space * get_options()->KISTSockBufSizeFactor; (e.g., 1.25) and there would be a similar config for global limits. I think upper/lower bounds are a good idea to enforce some max/min values.

Changed 5 years ago by robgjansen

comment:13 Changed 5 years ago by nickm

  1. I have attached a patch for a slight refactoring of the thread main loop

looks good to me.

  2. and I think we don't want that!

You mean, we don't want to have the buffer size get smaller and smaller forever? Indeed, that sounds bad.

  3. It would be useful to have the global read/write limits feature there before starting to test. For now, this could be as simple as a configuration option that specifies the upstream link capacity, and then the logic necessary to enforce the limit.

Don't we already have this? I thought that BandwidthRate already enforced a global read/write limit.

  4. We probably also want to write a bit above the per-socket and/or global limits

Will do.

comment:14 Changed 5 years ago by nickm

Branch updated. Thanks!

comment:15 in reply to:  13 Changed 5 years ago by robgjansen

Replying to nickm:

  1. It would be useful to have the global read/write limits feature there before starting to test. For now, this could be as simple as a configuration option that specifies the upstream link capacity, and then the logic necessary to enforce the limit.

Don't we already have this? I thought that BandwidthRate already enforced a global read/write limit.

We want to keep the non-circuit queues as small as possible. Does the circuit scheduler stop pulling cells from the circuit queues at the same time that BandwidthRate is reached? I guess we can test as-is and revisit this issue if it turns out to be a problem.

Thanks for the fast branch updates!

comment:16 Changed 5 years ago by dgoulet

Looking at nickm's branch and the remaining tasks (comment:10), it seems something is missing from the paper (http://www.robgjansen.com/publications/kist-sec2014.pdf).

Algorithm 2 of the paper shows (line 4 and 8) that tor should do a lookup on the connection for available write data (raw or not) before doing the adjustment. However, I don't see that right now in the code, is it still needed?

I think this might be needed, else we might get into a situation where tor iterates over the sockets every X seconds and adjusts their sizes even when it's not needed, which might degrade performance.

For instance, I did a quick kernel trace of tor bootstrapping with the kist branch, and you can see that there are a bunch of get/setsockopt() calls where one of the sockets (fd = 6) never actually writes anything but still gets adjusted.

https://paste.debian.net/plainh/0ed81334

comment:17 Changed 5 years ago by robgjansen

Thanks for looking over the branch. The lines you reference in Alg 2 were primarily there because the prototype we implemented for KIST worked exclusively inside of the connection_write_callback function in main.c. In our prototype, KIST only operated on sockets that were in the global list of writable sockets.

In the new implementation, all sockets are 'watched' by the helper thread, whether they have data to send or not. Even if the socket has no data in Tor, the TCP window could be changing in the kernel due to incoming ACKS, so there is some value in doing the calls even in this case.

One approach is to do the kernel calls for a few seconds after Tor drains the app buffer, and then stop until more data was written. To figure out when to stop and start the calls, we probably need to add more communication between the process and the thread to notify when the socket has data in the Tor buffer and not. (I think adding back in the other kernel call that I removed in my patch to check the socket length would not work, because the socket could be empty at the time of check even though more data has been written.)

I suggest we keep things simple for now (which is certainly OK for Shadow), but try to monitor how much overhead the extra calls consume and do something like I outlined above if necessary. I do worry, though, that this may end up adding more overhead to the main process instead of the helper thread, which is probably not what we want.

comment:18 in reply to:  17 Changed 5 years ago by dgoulet

Replying to robgjansen:

I suggest we keep things simple for now (which is certainly OK for Shadow), but try to monitor how much overhead the extra calls consume and do something like I outlined above if necessary. I do worry, though, that this may end up adding more overhead to the main process instead of the helper thread, which is probably not what we want.

Sounds good!

Let me know when you want to start testing. I can help here with kernel tracing and analyzing the experiments (with and without KIST) by looking at the syscall overhead between the two runs, the number of failed (EWOULDBLOCK) writes in both cases, and the average amount of data written per call, which I think can help us tune things and measure regressions.

comment:19 Changed 5 years ago by yawning

Before I forget, since I talked to nickm about this on IRC: I believe there's a possible future improvement here, depending on how expensive the TCP_INFO query turns out to be, though someone with more time than I have should simulate things.

The polling frequency should be dynamic on a per socket level, based on the state of the TCP connection.

TCP congestion control at a high level is divided into two phases, Slow Start and Congestion Avoidance.

During slow start, on each ACK that covers new data, the congestion window grows by cwnd += min(N, SMSS) (Older TCP/IP stacks grew it by SMSS, which leads to a misbehaving receiver attack). Congestion window growth here is extremely aggressive, so the discrepancy between our polling and the actual congestion window will be quite large and we will end up under-utilizing the link's capacity. This should only happen on relatively new connections, connections that are cruddy enough that the retransmission timer fires, and after the connection has been idle for a while, but the first case affects perceived user responsiveness, so it's worth thinking about.

During congestion avoidance, the congestion window grows by up to SMSS per RTT (cwnd += max(1, SMSS*SMSS/cwnd) on each ACK that covers unacknowledged data). When in congestion avoidance, our estimate should track cwnd fairly closely (at least until loss occurs), so our polling interval can be relatively infrequent. There's likewise potential for being clever here and adjusting the polling interval based on the RTT estimate.

The information required to distinguish between these two states is readily available (cwnd, ssthresh, and rtt) at least on Linux, so it is possible to query such things, so it's mostly a matter of "Is it affordable in terms of syscalls".

comment:20 Changed 5 years ago by robgjansen

I'm trying to get this running in Shadow now, and the patch I just attached should help. I'm currently working out bootstrapping issues I noticed after upgrading to 0.2.6.x-alpha.

comment:21 Changed 5 years ago by nickm

applied that to my kist branch; thanks.

Rob, what do you think about Yawning's comments above?

comment:22 in reply to:  21 Changed 5 years ago by robgjansen

Replying to nickm:

Rob, what do you think about Yawning's comments above?

Donald Knuth once wrote:

premature optimization is the root of all evil

I completely agree with him. Further optimizations are certainly possible, such as dynamically adjusting the polling interval based on which phase we are in (slow start vs. congestion avoidance) and using SMSS and RTT from the TCP info struct. IMHO, those optimizations should come after we have the simpler version working and merged, assuming that the simpler version already improves the situation with tolerable overhead.

comment:23 Changed 5 years ago by dgoulet

Here's a quick performance measurement of the get/setsockopt() syscalls on nickm's latest branch (kist), to give us an idea of the overhead added so far.

Measurement is basically tor bootstrapping and a torsocks + wget:

getsockopt() mean: 1811 ns, stdev: 1541 ns (394 times)
setsockopt() mean: 1503 ns, stdev: 740 ns (396 times)

Thus, about 0.003314 msec (3314 ns) of combined overhead for each socket.

comment:24 Changed 5 years ago by nickm

Milestone: Tor: 0.2.6.x-final → Tor: 0.2.7.x-final

I'm tentatively bumping KIST stuff to 0.2.7.x, since I think it won't be done this month. Please let me know if I'm wrong

comment:25 Changed 5 years ago by nickm

I'm tentatively bumping KIST stuff to 0.2.7.x, since I think it won't be done this month. Please let me know if I'm wrong

comment:26 Changed 5 years ago by robgjansen

Sorry for the silence, but I have not dropped this! I've been dealing with various problems and am still working on suggesting design changes/additions to Nick's current branch. See my disappointing update in #12891 comment 6.

Last edited 5 years ago by robgjansen (previous) (diff)

comment:27 Changed 5 years ago by nickm

Owner: robgjansen deleted
Status: new → assigned

comment:28 Changed 5 years ago by nickm

Keywords: 027-triaged-1-out added

Marking triaged-out items from first round of 0.2.7 triage.

comment:29 Changed 5 years ago by nickm

Milestone: Tor: 0.2.7.x-final → Tor: 0.2.???

Make all non-needs_review, non-needs_revision, 027-triaged-1-out items belong to 0.2.???

Changed 4 years ago by robgjansen

Attachment: kist-updates-0262.tar added

Updates to KIST, done on nickm's 'kist' branch after merging it with tag 'tor-0.2.6.2-alpha'

comment:30 Changed 4 years ago by robgjansen

I merged nickm's 'kist' branch with tag 'tor-0.2.6.2-alpha' and then made several updates to the kist code. The 6 commits are attached here. I kept the changes as separate commits so it is easier to review each change.

The most significant change is a quick algorithm that I hacked together to compute an adaptive interval length for how often getsockopt() needs to be called to get an updated tcp_info state for that socket. The hope is that this will avoid unnecessary context switches; for example, if we just sent some data and need an ACK before the window will get updated, then we can wait for about RTT milliseconds before asking the kernel for new TCP info.

I'm hoping for help with these things:

  1. merge the code somewhere public and with a proper [more recent] version of Tor? Perhaps in nickm's public kist branch?
  2. review the existing code (I added questions with an XXX and TODO label)
  3. help from someone to properly code the part labeled with FIXME, if needed: if the socket_table was a min heap, that would tighten our socket descriptor iteration loop.

I've done some testing in Shadow and will post the results in #12891.

comment:31 Changed 4 years ago by nickm

Sounds great!

Current plan: I'll try to look at this code on Friday if I can possibly do so. I'll either fix the fixmes or figure out more about what's needed to do so. If you've got revisions to make based on that, let's do 'em before PETS. Then let's try to find a 20 minute sit-down time at PETS to figure out details and next steps?

comment:32 Changed 4 years ago by robgjansen

Great plan! I'm doing some experiments now with the new model I posted in #12891, but I'll wait to make any code changes until your 'kist' branch is updated.

comment:33 Changed 3 years ago by yawning

Severity: Normal

Linux 4.6.0 adds a somewhat more friendly bit of information to struct tcp_info that might increase accuracy (write_seq - snd_nxt).

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=cd9b266095f422267bddbec88f9098b48ea548fc

tcpi_notsent_bytes reports the number of bytes in the write queue that have not yet been sent.

comment:34 Changed 3 years ago by teor

Milestone: Tor: 0.2.??? → Tor: 0.3.???

Milestone renamed

comment:35 Changed 3 years ago by nickm

Keywords: tor-03-unspecified-201612 added
Milestone: Tor: 0.3.??? → Tor: unspecified

Finally admitting that 0.3.??? was a euphemism for Tor: unspecified all along.

comment:36 Changed 2 years ago by nickm

Keywords: tor-03-unspecified-201612 removed

Remove an old triaging keyword.

comment:37 Changed 2 years ago by nickm

Keywords: 027-triaged-in added

comment:38 Changed 2 years ago by nickm

Keywords: 027-triaged-in removed

comment:39 Changed 2 years ago by nickm

Keywords: 027-triaged-1-out removed

comment:40 Changed 2 years ago by nickm

Status: assigned → new

Change the status of all assigned/accepted Tor tickets with owner="" to "new".

comment:41 Changed 2 years ago by nickm

Keywords: kist networking tcp added

comment:42 Changed 2 years ago by pastly

Milestone: Tor: unspecified → Tor: 0.3.2.x-final
Resolution: fixed
Status: new → closed

A lot has changed in the last year that I've been implementing KIST.

It looks like most/all of this ticket deals with the KIST prototype that would run stuff off of the main thread and collect stats on all sockets. That's no longer a thing.

Huge optimization: when it comes time to do some scheduling, only collect TCP info for the sockets that have channels ready to send. To quote our new KIST paper, which hasn't been accepted anywhere yet:

We observed that the median number of write-pending sockets that accumulated during a 10 millisecond period was 23 (with min=1, q1=18, q3=27, and max=127), while the median amount of time to collect TCP information on all write-pending sockets was 23 microseconds (with min=1, q1=17, q3=33, and max=674). We observed a linear relationship between the amount of time required to collect TCP information on all write-pending sockets and the number of such sockets (1.08 microseconds per pending socket), independent of the total number of open sockets. Therefore, we believe that the KIST overhead, with our optimization of only collecting TCP information on pending sockets, should be tolerable to run in the main thread for even the fastest Tor relay.

But what does the algorithm look like? See here.

  struct tcp_info tcp;
  socklen_t tcp_info_len = sizeof(tcp);
  getsockopt(sock, SOL_TCP, TCP_INFO, (void *)&(tcp), &tcp_info_len);
  ioctl(sock, SIOCOUTQNSD, &(ent->notsent));
  ent->cwnd = tcp.tcpi_snd_cwnd;
  ent->unacked = tcp.tcpi_unacked;
  ent->mss = tcp.tcpi_snd_mss;

  int64_t tcp_space, extra_space;
  tcp_space = (ent->cwnd - ent->unacked) * ent->mss;
  if (tcp_space < 0) tcp_space = 0;
  extra_space = (ent->cwnd * ent->mss) * sock_buf_size_factor - ent->notsent;
  if (extra_space < 0) extra_space = 0;
  ent->limit = tcp_space + extra_space;
  /* Rate-limited logging: report the values once every 1000 updates. */
  if (++counter >= 1000) {
    counter -= 1000;
    log_info(LD_SCHED, "socket info: cwnd=%d unacked=%d notsent=%d limit=%"
        PRIi64,
        ent->cwnd,
        ent->unacked,
        ent->notsent,
        ent->limit);
  }

First calculate the amount of "TCP space" there is. This is the amount of data that the kernel should send out onto the wire immediately since TCP won't limit it.

Then calculate some amount of extra space. To avoid starving the kernel, we want a little extra data to be there so if ACKs come back before the next scheduling run the kernel has something to send so the wire isn't idle. Assuming the sock_buf_size_factor is 1.0, we allow up to one extra congestion window worth of data to sit in the outbound kernel socket buffer.

Add together the TCP space and extra space and that's the socket's KIST-imposed write limit.

It requires two system calls. A recent version of Linux (which came out in March 2016; see git describe --contains cd9b2660 against the kernel tree) adds tcpi_notsent_bytes to struct tcp_info. We'd only need one system call if we took advantage of that.

What about cross platform support?

In the name of actually getting KIST merged and to keep the code simple, we make no attempt to support anything other than Linux. This is a great balance between maximizing benefit and development cost. KIST checks for the struct tcp_info, that it has the right members, and that it can make the ioctl syscall too. If so, we have full KIST support. If not, we can still run the KIST scheduler but make the per-socket write limit INT_MAX. It won't hurt.

Note: See TracTickets for help on using tickets.