KIST has two components: global scheduling (#9262 (moved)) and socket write limits. This ticket is to track discussion about the design that should be implemented to realize socket write limits, and discussion about the implementation.
The goal of the write limit is to never send to the kernel what the kernel wouldn't send out to the network anyway due to throttling at the TCP layer. Rob's USENIX Security paper computed write limits for each socket as the minimum of the free socket buffer space and the TCP space, i.e., socket_limit = min(sndbufcap - sndbuflen, (cwnd - unacked) * mss).
And then a global write limit across all sockets for each scheduling round is computed according to the upstream bandwidth of the relay and the configured write callback time interval. Writing in a given round ends when either the global limit is reached, or all of the socket limits are reached.
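As a concrete sketch of the global limit described above (the helper name and units are illustrative, not from the Tor source), assuming upstream bandwidth in bytes per second and the callback interval in milliseconds:

```c
#include <assert.h>
#include <stdint.h>

/* Global write limit for one scheduling round: the number of bytes the
 * upstream link could drain during one write callback interval.
 * (Illustrative helper, not the actual Tor code.) */
static inline uint64_t
global_write_limit(uint64_t upstream_bytes_per_sec, uint64_t interval_ms)
{
    return upstream_bytes_per_sec * interval_ms / 1000;
}
```

For example, a relay with 10 MB/s upstream and a 10 ms round could write at most about 100 KB per round before the round ends.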
The TCP information can be collected with a getsockopt call, but doing this for every socket for every write round (callback interval) can get expensive. A kernel hacker, Patrick McHardy, suggested using the "netlink socket diag" interface (examples here and here) to collect information for multiple sockets all at once instead of a separate system call for each.
Note that the socket write limit need not actually be computed, because the kernel will return EAGAIN when the socket is full anyway. Along these lines, Bryan Ford suggested setting the socket buffer size based on the amount Tor thinks it should send plus a little extra (e.g., tcp_space*1.25), and then let the kernel push back automatically instead of trying to compute a new write limit for every socket for every write interval round. Then Tor can continue to try to write as much as it can and let the kernel push back when Tor should stop. In this case, we need to ensure TCP auto-tuning is disabled, as otherwise it may undo our settings by adjusting our socket buffer sizes underneath us.
I think we need two intervals: e.g., we want to try to write every 10 milliseconds, and then update snd_cwnd/write limits/socket buffer sizes every 100 milliseconds.
Currently talking with Nick about the design of this component. We decided that this could completely be implemented in a separate thread with no feedback to the master thread.
Then we want setsockopt to set the buffer sizes based on the current length and the tcp window:
{{{
setsockopt(fd, SOL_SOCKET, SO_SNDBUF, (void *)&snd_size, snd_size_len);
setsockopt(fd, SOL_SOCKET, SO_RCVBUF, (void *)&rcv_size, rcv_size_len);
}}}
**NOTE:** When setting `SO_SNDBUF` and `SO_RCVBUF` with `setsockopt`, the value applied is double the value passed (it is automatically doubled by the kernel).
Here is some extra info on that, hope this can be useful:
The kernel makes sure also that you do not go above /proc/sys/net/core/wmem_max for SO_SNDBUF and /proc/sys/net/core/rmem_max for SO_RCVBUF. Note that those are system wide thus should probably not be mangled by tor (unless the box is dedicated for tor traffic :). Furthermore, that value is multiplied by 2 and must at least be a certain minimum. (Below is a snippet from the 3.18.0 kernel).
{{{
sk->sk_sndbuf = max_t(u32, val * 2, SOCK_MIN_SNDBUF);
sk->sk_rcvbuf = max_t(u32, val * 2, SOCK_MIN_RCVBUF);
}}}
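The doubling behavior is easy to observe from user space. A small Linux-only sketch (the helper name is illustrative): request a buffer size, then read back what the kernel actually applied.

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* Returns the effective send buffer size after requesting `requested`
 * bytes, or -1 on error. On Linux the reported value is double the
 * requested one, clamped by SOCK_MIN_SNDBUF below and
 * /proc/sys/net/core/wmem_max above. */
static int
effective_sndbuf(int fd, int requested)
{
    int out = 0;
    socklen_t len = sizeof(out);
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF,
                   &requested, sizeof(requested)) < 0)
        return -1;
    if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &out, &len) < 0)
        return -1;
    return out;
}
```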
The size of struct sk_buff is 232 bytes. Note that this is not considered a stable ABI for user space, thus the size of sk_buff can vary over time. Also, `SKB_DATA_ALIGN` is an alignment on the L1 cache, but it's often only 32 bytes, so here sk_buff is actually aligned to 256 bytes.

Having congestion control in user space is **difficult** vis-a-vis performance. For one, you have to do extra syscalls for every I/O operation, but it's also a "play the guessing max size game" which is tightly controlled by the kernel.
Hmmm. Yeah.
Relatedly, do you know if TCP auto-tuning in the kernel is disabled when the buffer size is explicitly set with setsockopt? If not, do you know a socket option to disable TCP auto-tuning for a given socket?
Replying to dgoulet:
Relatedly, do you know if TCP auto-tuning in the kernel is disabled when the buffer size is explicitly set with setsockopt? If not, do you know a socket option to disable TCP auto-tuning for a given socket?
Yes, it does disable auto-tuning if the app layer requests something explicit.
As an aside, I'm slightly worried about a sufficiently infrequent update window leading to underutilizing the link, but the interval can be tunable, and we can see what a good value is in the wild. (The reverse case, where we overutilize the link because we don't respond to cwnd getting cut due to an insufficiently frequent polling interval, is no worse than the current behavior, so I'm not worried about that.)
More regarding portability, since nickm was asking about it on IRC; comments by other people that use TCP_INFO scared me enough that I went and actually looked at the code.
The good:
Linux < 2.6.x does not support this, since 2.6 is when the call was added.
Windows depends on version (with a preference towards the Windows-ism added in Vista since it's a single call that appears to return all the necessary information). The iperf people claim that TCP_INFO is supported, but I can't find documentation to back this up.
The bad:
NetBSD/OpenBSD do not support TCP_INFO.
The ugly:
FreeBSD support was added in the FreeBSD 6.0 timeframe, but a bunch of values are missing (http://fxr.watson.org/fxr/source/netinet/tcp_usrreq.c#L1259), including a way to get "SND.UNA", making it effectively worthless ("SND.WND", "SND.NXT" and "cwnd" are insufficient to calculate the amount of in-flight data). However, making it work is a trivial one-line change to tcp_usrreq.c:tcp_fill_info().
{{{
/* The full patch will want to rename tcp_info.__tcpi_unacked... */
ti->__tcpi_unacked = tp->snd_una;
}}}
The FreeBSD version of the tcp_space calculation assuming the patch would be something like:
{{{
/*
 * Everything is in bytes, so none of that MSS silliness.
*
* It's probably worth calculating the available space based off:
* wnd_size = min(ti.tcpi_snd_cwnd, ti.tcpi_snd_wnd)
* instead of just tcpi_snd_cwnd since the information is already
* in the tcp_info structure.
*/
tcp_space = ti.tcpi_snd_cwnd - (ti.tcpi_snd_next - ti.__tcpi_unacked);
}}}
The Fruit Company:
Darwin does not expose the fact that it supports TCP_INFO in /usr/include/netinet/tcp.h, but the XNU source (at least xnu-2050.18.24) has code that implements this. Since they are Apple, their idea of what struct tcp_info should look like is different from everyone else's, and ironically enough is the easiest to use. The header comment also notes that "The TCP_INFO socket option is a private API and is subject to change.".
{{{
/*
 * The Fruit Company has their own struct tcp_info, that can be
 * queried by getsockopt() TCP_INFO (undocumented), or sysctl().
 * The Fruit version of struct tcp_info handily includes:
 *   ti->tcpi_txunacked = tp->snd_max - tp->snd_una;
 * As an added bonus, at least as of xnu-2050.18.24, we can omit the
 * TIOCOUTQ call.
 */
sndbuf_bytes = ti.tcpi_snd_sbbytes; /* Look ma, I saved a syscall! */
tcp_space = ti.tcpi_snd_cwnd - ti.tcpi_txunacked; /* Both in bytes. */
}}}
I assume that just using the ioctl() calls to get the send buffer capacity/current size on non-Linux/Windows platforms will still be an improvement, but it bums me out a bit. If people want I can write the kernel patches for all of the BSDs.
(Rob, please let me know which of the above are necessary before you could start testing this.)
Some changes and thoughts:
I have attached a patch for a slight refactoring of the thread main loop to make it easier to run the thread in Shadow.
I thought about it, and I'm pretty sure we don't actually want the socket size calculation anymore. I needed it in my prototype because I was not setting socket buf sizes explicitly, and so had to deal with the buf size changing underneath me from tcp autotuning. Since we are setting the buf size, taking buf_size = MIN(sock_space, tcp_space); would mean that the buf size could never increase - and I think we don't want that! So, all we would have to compute is tcp_space and we can save the other two syscalls :) I have attached a patch for this as well.
It would be useful to have the global read/write limits feature there before starting to test. For now, this could be as simple as a configuration option that specifies the upstream link capacity, and then the logic necessary to enforce the limit.
We probably also want to write a bit above the per-socket and/or global limits, so that we can minimize kernel 'starvation' (when the kernel could have sent something, but it had no data to send because our calculations were a bit off). Options that specify how much above the per-socket and global limits we want to write would be helpful. For example, buf_size = tcp_space; would become buf_size = tcp_space * get_options()->KISTSockBufSizeFactor; (e.g., 1.25) and there would be a similar config for global limits. I think upper/lower bounds are a good idea to enforce some max/min values.
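A hypothetical sketch of how such options might combine (the function name, and the bounds, mirror the KISTSockBufSizeFactor idea above but are not existing torrc options):

```c
/* Scale tcp_space by the configured factor, then clamp to configured
 * lower/upper bounds. All quantities in bytes; names are hypothetical. */
static unsigned
kist_buf_size(unsigned tcp_space, double factor,
              unsigned lower_bound, unsigned upper_bound)
{
    double sized = (double)tcp_space * factor;
    if (sized < (double)lower_bound)
        return lower_bound;
    if (sized > (double)upper_bound)
        return upper_bound;
    return (unsigned)sized;
}
```

With factor 1.25, a socket with 100 KB of TCP space would get a 125 KB buffer, while a nearly idle socket would be held at the lower bound instead of shrinking toward zero.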
I have attached a patch for a slight refactoring of the thread main loop
looks good to me.
and I think we don't want that!
You mean, we don't want to have the buffer size get smaller and smaller forever? Indeed, that sounds bad.
It would be useful to have the global read/write limits feature there before starting to test. For now, this could be as simple as a configuration option that specifies the upstream link capacity, and then the logic necessary to enforce the limit.
Don't we already have this? I thought that BandwidthRate already enforced a global read/write limit.
We probably also want to write a bit above the per-socket and/or global limits
It would be useful to have the global read/write limits feature there before starting to test. For now, this could be as simple as a configuration option that specifies the upstream link capacity, and then the logic necessary to enforce the limit.
Don't we already have this? I thought that BandwidthRate already enforced a global read/write limit.
We want to keep the non-circuit queues as small as possible. Does the circuit scheduler stop pulling cells from the circuit queues at the same time that BandwidthRate is reached? I guess we can test as-is and revisit this issue if it turns out to be a problem.
Algorithm 2 of the paper shows (lines 4 and 8) that tor should do a lookup on the connection for available write data (raw or not) before doing the adjustment. However, I don't see that in the code right now; is it still needed?
I think this might be needed; otherwise we might get into a situation where tor iterates over the sockets every X seconds and adjusts their sizes even though it's not needed, which might hurt performance.
For instance, I did a quick kernel trace of tor bootstrapping with the kist branch, and you can see that there are a bunch of get/setsockopt() calls where one of the sockets (fd = 6) never writes anything but still gets adjusted.
Thanks for looking over the branch. The lines you reference in Alg 2 were primarily there because the prototype we implemented for KIST worked exclusively inside of the connection_write_callback function in main.c. In our prototype, KIST only operated on sockets that were in the global list of writable sockets.
In the new implementation, all sockets are 'watched' by the helper thread, whether they have data to send or not. Even if the socket has no data in Tor, the TCP window could be changing in the kernel due to incoming ACKS, so there is some value in doing the calls even in this case.
One approach is to do the kernel calls for a few seconds after Tor drains the app buffer, and then stop until more data is written. To figure out when to stop and start the calls, we probably need to add more communication between the process and the thread to notify when the socket has data in the Tor buffer and when it does not. (I think adding back in the other kernel call that I removed in my patch to check the socket length would not work, because the socket could be empty at the time of the check even though more data has been written.)
I suggest we keep things simple for now (which is certainly OK for Shadow), but try to monitor how much overhead the extra calls consume and do something like I outlined above if necessary. I do worry, though, that this may end up adding more overhead to the main process instead of the helper thread, which is probably not what we want.
I suggest we keep things simple for now (which is certainly OK for Shadow), but try to monitor how much overhead the extra calls consume and do something like I outlined above if necessary. I do worry, though, that this may end up adding more overhead to the main process instead of the helper thread, which is probably not what we want.
Sounds good!
Let me know when you want to start testing. I can help here with kernel tracing and analyzing the experiments (with and without KIST) by looking at the syscall overhead between the two runs, the number of failed (EWOULDBLOCK) writes in both cases, and the average amount of data written per call, which I think can help us tune stuff and measure regression.
Before I forget, since I talked to nickm about this on IRC: I believe there's a possible future improvement here, depending on how expensive the TCP_INFO query turns out to be, though someone with more time than I have should simulate things.
The polling frequency should be dynamic on a per socket level, based on the state of the TCP connection.
TCP congestion control at a high level is divided into two phases, Slow Start and Congestion Avoidance.
During slow start, on each ACK that covers new data, the congestion window grows by cwnd += min(N, SMSS) (Older TCP/IP stacks grew it by SMSS, which leads to a misbehaving receiver attack). Congestion window growth here is extremely aggressive, so the discrepancy between our polling and the actual congestion window will be quite large and we will end up under-utilizing the link's capacity. This should only happen on relatively new connections, connections that are cruddy enough that the retransmission timer fires, and after the connection has been idle for a while, but the first case affects perceived user responsiveness, so it's worth thinking about.
During congestion avoidance, the congestion window grows by up to SMSS per RTT (cwnd += max(1, SMSS*SMSS/cwnd) on each ACK that covers unacknowledged data). When in congestion avoidance, our estimate should track cwnd fairly closely (at least until loss occurs), so our polling interval can be relatively infrequent. There's likewise potential for being clever here and adjusting the polling interval based on the RTT estimate.
The information required to distinguish between these two states (cwnd, ssthresh, and rtt) is readily available, at least on Linux, so it's mostly a matter of "Is it affordable in terms of syscalls?".
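A toy model of the two growth rules just described, per RFC 5681 with quantities in bytes (a sketch for intuition about the polling-frequency tradeoff, not a claim about any particular kernel's implementation):

```c
/* Congestion window after one ACK covering `acked` new bytes.
 * Slow start (cwnd < ssthresh): cwnd += min(acked, SMSS) -- roughly
 * doubling per RTT, so an infrequent poll falls far behind.
 * Congestion avoidance: cwnd += SMSS*SMSS/cwnd (floored at 1 byte) --
 * roughly SMSS per RTT, so a stale estimate stays close. */
static unsigned
cwnd_after_ack(unsigned cwnd, unsigned ssthresh, unsigned smss,
               unsigned acked)
{
    if (cwnd < ssthresh)
        return cwnd + (acked < smss ? acked : smss);
    unsigned inc = smss * smss / cwnd;
    return cwnd + (inc ? inc : 1);
}
```

With SMSS = 1460, a slow-start ACK grows a 14600-byte window by a full segment, while the same ACK in congestion avoidance on a 100000-byte window grows it by only 21 bytes.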
I'm trying to get this running in Shadow now, and the patch I just attached should help. Currently working out bootstrapping issues I noticed after updating to 0.2.6.x-alpha.
Rob, what do you think about Yawning's comments above?
Donald Knuth once wrote:
premature optimization is the root of all evil
I completely agree with him. Further optimizations are possible to dynamically adjust the polling interval based on which phase we are in (slow start vs. congestion avoidance) and using the SMSS and RTT from the TCP info struct. IMHO, those optimizations should come after we have the simpler version working and merged, assuming that the simpler version already improves the situation with tolerable overhead.
Sorry for the silence, but I have not dropped this! I've been dealing with various problems and am still working on suggesting design changes/additions to Nick's current branch. See my disappointing update in #12891 (moved) comment 6.
I merged nickm's 'kist' branch with tag 'tor-0.2.6.2-alpha' and then made several updates to the kist code. The 6 commits are attached here. I kept the changes as separate commits so it is easier to review each change.
The most significant change is a quick algorithm that I hacked together to compute an adaptive interval length for how often getsockopt() needs to be called to get an updated tcp_info state for that socket. The hope is that this will avoid unnecessary context switches; for example, if we just sent some data and need an ACK before the window will get updated, then we can wait for about RTT milliseconds before asking the kernel for new TCP info.
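The adaptive-interval idea could be sketched minimally as follows (the function and its bounds are hypothetical, not the attached commits; on Linux, tcp_info's rtt field is in microseconds):

```c
/* Wait roughly one RTT before the next getsockopt(TCP_INFO) for this
 * socket, clamped to sane bounds: data we just sent can't be ACKed, and
 * the window can't move, any sooner than one RTT anyway. */
static unsigned
next_poll_delay_ms(unsigned rtt_usec, unsigned min_ms, unsigned max_ms)
{
    unsigned rtt_ms = rtt_usec / 1000;
    if (rtt_ms < min_ms)
        return min_ms;
    if (rtt_ms > max_ms)
        return max_ms;
    return rtt_ms;
}
```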
I'm hoping for help with these things:
merge the code somewhere public and with a proper [more recent] version of Tor? Perhaps in nickm's public kist branch?
review the existing code (I added questions with an XXX and TODO label)
help from someone to properly code the part labeled with FIXME, if needed: if the socket_table was a min heap, that would tighten our socket descriptor iteration loop.
I've done some testing in Shadow and will post the results in #12891 (moved).
Current plan: I'll try to look at this code on Friday if I can possibly do so. I'll either fix the fixmes or figure out more about what's needed to do so. If you've got revisions to make based on that, let's do 'em before PETS. Then let's try to find a 20 minute sit-down time at PETS to figure out details and next steps?
Great plan! I'm doing some experiments now with the new model I posted in #12891 (moved), but I'll wait to make any code changes until your 'kist' branch is updated.
A lot has changed in the last year that I've been implementing KIST.
It looks like most/all of this ticket deals with the KIST prototype that would run stuff off of the main thread and collect stats on all sockets. That's no longer a thing.
Huge optimization: when it comes time to do some scheduling, only collect TCP info for the sockets that have channels ready to send. To quote our new KIST paper that hasn't been accepted anywhere yet:
We observed that the median number of write-pending sockets that accumulated during a 10 millisecond period was 23 (with min=1, q1=18, q3=27, and max=127), while the median amount of time to collect TCP information on all write-pending sockets was 23 microseconds (with min=1, q1=17, q3=33, and max=674). We observed a linear relationship between the amount of time required to collect TCP information on all write-pending sockets and the number of such sockets (1.08 microseconds per pending socket), independent of the total number of open sockets. Therefore, we believe that the KIST overhead, with our optimization of only collecting TCP information on pending sockets, should be tolerable to run in the main thread for even the fastest Tor relay.
First calculate the amount of "TCP space" there is. This is the amount of data that the kernel should send out onto the wire immediately since TCP won't limit it.
Then calculate some amount of extra space. To avoid starving the kernel, we want a //little// extra data to be there so if ACKs come back before the next scheduling run the kernel has something to send so the wire isn't idle. Assuming the sock_buf_size_factor is 1.0, we allow up to one extra congestion window worth of data to sit in the outbound kernel socket buffer.
Add together the TCP space and extra space and that's the socket's KIST-imposed write limit.
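The three steps above can be sketched as follows (a sketch under the stated assumptions, not the merged implementation; cwnd and unacked are in packets, as Linux's struct tcp_info reports them, and mss is in bytes):

```c
#include <stdint.h>

/* Per-socket KIST write limit: TCP space plus extra space. */
static uint64_t
kist_write_limit(uint32_t cwnd, uint32_t unacked, uint32_t mss,
                 double sock_buf_size_factor)
{
    /* 1. TCP space: what the kernel would put on the wire immediately. */
    uint64_t tcp_space = 0;
    if (cwnd > unacked)
        tcp_space = (uint64_t)(cwnd - unacked) * mss;
    /* 2. Extra space: up to one cwnd of data (scaled by the factor) so
     *    the kernel isn't starved if ACKs arrive before the next run. */
    uint64_t extra_space =
        (uint64_t)((double)cwnd * mss * sock_buf_size_factor);
    /* 3. The KIST-imposed limit is the sum of the two. */
    return tcp_space + extra_space;
}
```

With a factor of 1.0, a socket with cwnd of 10 packets, 4 unacked, and a 1000-byte MSS gets 6000 bytes of TCP space plus 10000 bytes of extra space, for a 16000-byte limit.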
It requires two system calls. A recent version of Linux (which came out in March 2016; see `git describe --contains cd9b2660` against the kernel) has a struct tcp_info intended for internal use that has notsentbytes in it. We'd only need one system call if we took advantage of that.
What about cross platform support?
In the name of actually getting KIST merged and to keep the code simple, we make no attempt to support anything other than Linux. This is a great balance between maximizing benefit and development cost. KIST checks for the struct tcp_info, that it has the right members, and that it can make the ioctl syscall too. If so, we have full KIST support. If not, we can still run the KIST scheduler but make the per-socket write limit INT_MAX. It won't hurt.
Trac: Milestone: Tor: unspecified to Tor: 0.3.2.x-final; Resolution: N/A to fixed; Status: new to closed