KIST has two components: global scheduling (#9262 (moved)) and socket write limits. This ticket is to track discussion about the design that should be implemented to realize socket write limits, and discussion about the implementation.
The goal of the write limit is to never send to the kernel what the kernel wouldn't send out to the network anyway due to throttling at the TCP layer. Rob's USENIX Security paper computed a write limit for each socket from the socket's send-buffer state and TCP state.
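In terms of the quantities used later in this ticket, the paper's per-socket computation is roughly the following (a sketch, not the paper's exact notation):

{{{
socket_space = sndbuf_capacity - sndbuf_length  /* free space in the send buffer */
tcp_space    = (cwnd - unacked) * mss           /* bytes TCP could send right now */
limit        = MIN(socket_space, tcp_space)     /* per-socket write limit */
}}}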
And then a global write limit across all sockets for each scheduling round is computed according to the upstream bandwidth of the relay and the configured write callback time interval. Writing in a given round ends when either the global limit is reached, or all of the socket limits are reached.
The TCP information can be collected with a getsockopt call, but doing this for every socket for every write round (callback interval) can get expensive. A kernel hacker, Patrick McHardy, suggested using the "netlink socket diag" interface (examples here and here) to collect information for multiple sockets all at once instead of a separate system call for each.
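For comparison, the expensive per-socket variant is a single getsockopt() per socket per round; a minimal Linux sketch (in Linux's struct tcp_info, cwnd and unacked are counted in segments, so snd_mss converts them to bytes):

{{{
#include <stdint.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Bytes TCP could push out right now, or 0 on error. */
static uint64_t get_tcp_space(int fd) {
  struct tcp_info ti;
  socklen_t len = sizeof(ti);
  if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) != 0)
    return 0;
  if (ti.tcpi_snd_cwnd <= ti.tcpi_unacked)
    return 0;
  return (uint64_t)(ti.tcpi_snd_cwnd - ti.tcpi_unacked) * ti.tcpi_snd_mss;
}
}}}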
Note that the socket write limit need not actually be computed, because the kernel will return EAGAIN when the socket is full anyway. Along these lines, Bryan Ford suggested setting the socket buffer size based on the amount Tor thinks it should send plus a little extra (e.g., tcp_space*1.25), and then letting the kernel push back automatically instead of trying to compute a new write limit for every socket for every write interval round. Tor can then keep trying to write as much as it can and let the kernel push back when Tor should stop. In this case, we need to ensure TCP auto-tuning is disabled, as otherwise it may undo our settings by adjusting our socket buffer sizes underneath us.
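A minimal sketch of that push-back approach, assuming tcp_space has already been measured, the socket is non-blocking, and using the 1.25 slack factor from the example above (the helper name is hypothetical):

{{{
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: size the buffer with some slack, write greedily,
 * and let the kernel push back via EAGAIN when the buffer fills. */
static ssize_t write_with_pushback(int fd, const void *buf, size_t len,
                                   size_t tcp_space) {
  int snd_size = (int)(tcp_space * 1.25);  /* the "little extra" above */
  setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &snd_size, sizeof(snd_size));

  ssize_t n = write(fd, buf, len);
  if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
    /* Send buffer full: the kernel says stop; try again after it drains. */
  }
  return n;
}
}}}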
I think we need two intervals: e.g., we want to try to write every 10 milliseconds, and then update snd_cwnd/write limits/socket buffer sizes every 100 milliseconds.
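A sketch of the two timers with libevent (which tor already uses); the callback names are hypothetical and the intervals are just the examples above:

{{{
#include <event2/event.h>

static void write_cb(evutil_socket_t fd, short what, void *arg) {
  /* Every 10 ms: write out as much as the current limits allow. */
}

static void update_cb(evutil_socket_t fd, short what, void *arg) {
  /* Every 100 ms: refresh snd_cwnd / write limits / buffer sizes. */
}

static void setup_kist_timers(struct event_base *base) {
  struct timeval write_tv  = { 0, 10000 };   /* 10 ms */
  struct timeval update_tv = { 0, 100000 };  /* 100 ms */
  event_add(event_new(base, -1, EV_PERSIST, write_cb, NULL), &write_tv);
  event_add(event_new(base, -1, EV_PERSIST, update_cb, NULL), &update_tv);
}
}}}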
Currently talking with Nick about the design of this component. We decided that this could be implemented entirely in a separate thread with no feedback to the master thread.
Then we want setsockopt to set the buffer sizes based on the current length and the tcp window:
{{{
setsockopt(fd, SOL_SOCKET, SO_SNDBUF, (void *)&snd_size, sizeof(snd_size));
setsockopt(fd, SOL_SOCKET, SO_RCVBUF, (void *)&rcv_size, sizeof(rcv_size));
}}}

**NOTE:** When setting `SO_SNDBUF` and `SO_RCVBUF` with `setsockopt`, the value applied is double the value passed (it is automatically doubled by the kernel).
Here is some extra info on that, hope this can be useful:
The kernel also makes sure that you do not go above /proc/sys/net/core/wmem_max for SO_SNDBUF and /proc/sys/net/core/rmem_max for SO_RCVBUF. Note that those are system-wide, and thus should probably not be mangled by tor (unless the box is dedicated to tor traffic :). Furthermore, the value is multiplied by 2 and clamped to a minimum. (Below is a snippet from the 3.18.0 kernel.)
{{{
sk->sk_sndbuf = max_t(u32, val * 2, SOCK_MIN_SNDBUF);
sk->sk_rcvbuf = max_t(u32, val * 2, SOCK_MIN_RCVBUF);
}}}
The size of struct sk_buff is 232 bytes. Note that this is not considered a stable ABI for user space, so the size of sk_buff can vary over time. Also, `SKB_DATA_ALIGN` is an alignment on the L1 cache line, which is often only 32 bytes, so here sk_buff is actually aligned to 256 bytes.

Having congestion control in user space is **difficult** vis-a-vis performance. For one, you have to do extra syscalls for every I/O operation, but it's also a "play the guessing max size game" which is tightly controlled by the kernel.
Hmmm. Yeah.
Relatedly, do you know if TCP auto-tuning in the kernel is disabled when the buffer size is explicitly set with setsockopt? If not, do you know a socket option to disable TCP auto-tuning for a given socket?
Replying to dgoulet:
> Relatedly, do you know if TCP auto-tuning in the kernel is disabled when the buffer size is explicitly set with setsockopt? If not, do you know a socket option to disable TCP auto-tuning for a given socket?
Yes, it does disable auto-tuning if the app layer requests something explicit.
As an aside, I'm slightly worried that a sufficiently infrequent update interval could lead to underutilizing the link, but the interval can be made tunable, and we can see what a good value is in the wild. (The reverse case, where we overutilize the link because an insufficiently frequent polling interval makes us miss cwnd getting cut, is no worse than the current behavior, so I'm not worried about that.)
More regarding portability, since nickm was asking about it on IRC; I went and actually looked at the code because comments by other people that use TCP_INFO scared me.
The good:
Linux < 2.6.x does not support this, since this is when the call was added.
Windows depends on version (with a preference towards the Windows-ism added in Vista since it's a single call that appears to return all the necessary information). The iperf people claim that TCP_INFO is supported, but I can't find documentation to back this up.
The bad:
NetBSD/OpenBSD do not support TCP_INFO.
The ugly:
FreeBSD support was added in the FreeBSD 6.0 timeframe, but a bunch of values are missing (http://fxr.watson.org/fxr/source/netinet/tcp_usrreq.c#L1259), including a way to get "SND.UNA", making it effectively worthless ("SND.WND", "SND.NXT" and "cwnd" are insufficient to calculate the amount of in-flight data). However, making it work is a trivial one-line change to tcp_usrreq.c:tcp_fill_info():
{{{
/* The full patch will want to rename tcp_info.__tcpi_unacked... */
ti->__tcpi_unacked = tp->snd_una;
}}}
The FreeBSD version of the tcp_space calculation assuming the patch would be something like:
{{{
/*
 * Everything is in bytes, so none of that MSS silliness.
*
* It's probably worth calculating the available space based off:
* wnd_size = min(ti.tcpi_snd_cwnd, ti.tcpi_snd_wnd)
* instead of just tcpi_snd_cwnd since the information is already
* in the tcp_info structure.
*/
tcp_space = ti.tcpi_snd_cwnd - (ti.tcpi_snd_next - ti.__tcpi_unacked);
}}}
The Fruit Company:
Darwin does not expose the fact that it supports TCP_INFO in /usr/include/netinet/tcp.h, but the XNU source (at least xnu-2050.18.24) has code that implements this. Since they are Apple, their idea of what struct tcp_info should look like is different from everyone else's, and ironically enough is the easiest to use. The header comment also notes that "The TCP_INFO socket option is a private API and is subject to change.".
{{{
/*
 * The Fruit Company has their own struct tcp_info, that can be
 * queried by getsockopt() TCP_INFO (undocumented), or sysctl().
 * The Fruit version of struct tcp_info handily includes:
 *   ti->tcpi_txunacked = tp->snd_max - tp->snd_una;
 * As an added bonus, at least as of xnu-2050.18.24, we can omit the
 * TIOCOUTQ call.
 */
sndbuf_bytes = ti.tcpi_snd_sbbytes;               /* Look ma, I saved a syscall! */
tcp_space = ti.tcpi_snd_cwnd - ti.tcpi_txunacked; /* Both in bytes. */
}}}
I assume that just using the ioctl() calls to get the send buffer capacity/current size on non-Linux/Windows platforms will still be an improvement, but it bums me out a bit. If people want, I can write the kernel patches for all of the BSDs.
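For what it's worth, that fallback pair of calls would look something like this (TIOCOUTQ is the Linux spelling for the queued-bytes half; the BSDs use different ioctls, and the result will lag what TCP_INFO provides):

{{{
#include <sys/ioctl.h>
#include <sys/socket.h>

/* Free space in the send buffer, without TCP_INFO. */
static int get_sock_space(int fd) {
  int cap = 0, queued = 0;
  socklen_t optlen = sizeof(cap);
  if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &cap, &optlen) != 0)
    return 0;
  if (ioctl(fd, TIOCOUTQ, &queued) != 0)  /* bytes sitting in the buffer */
    return 0;
  return cap > queued ? cap - queued : 0;
}
}}}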
(Rob, please let me know which of the above are necessary before you could start testing this.)
Some changes and thoughts:
I have attached a patch for a slight refactoring of the thread main loop to make it easier to run the thread in Shadow.
I thought about it, and I'm pretty sure we don't actually want the socket size calculation anymore. I needed it in my prototype because I was not setting socket buf sizes explicitly, and so had to deal with the buf size changing underneath me from tcp autotuning. Since we are setting the buf size, taking buf_size = MIN(sock_space, tcp_space); would mean that the buf size could never increase - and I think we don't want that! So, all we would have to compute is tcp_space and we can save the other two syscalls :) I have attached a patch for this as well.
It would be useful to have the global read/write limits feature there before starting to test. For now, this could be as simple as a configuration option that specifies the upstream link capacity, and then the logic necessary to enforce the limit.
We probably also want to write a bit above the per-socket and/or global limits, so that we can minimize kernel 'starvation' (when the kernel could have sent something, but it had no data to send because our calculations were a bit off). Options that specify how much above the per-socket and global limits we want to write would be helpful. For example, buf_size = tcp_space; would become buf_size = tcp_space * get_options()->KISTSockBufSizeFactor; (e.g., 1.25) and there would be a similar config for global limits. I think upper/lower bounds are a good idea to enforce some max/min values.
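A sketch of what that could look like (KISTSockBufSizeFactor is the option suggested above; the min/max constants are hypothetical placeholders):

{{{
/* Hypothetical bounds; real values would come from testing in the wild. */
#define KIST_SNDBUF_MIN (32 * 1024)
#define KIST_SNDBUF_MAX (4 * 1024 * 1024)

buf_size = tcp_space * get_options()->KISTSockBufSizeFactor; /* e.g., 1.25 */
if (buf_size < KIST_SNDBUF_MIN)
  buf_size = KIST_SNDBUF_MIN;
if (buf_size > KIST_SNDBUF_MAX)
  buf_size = KIST_SNDBUF_MAX;
}}}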
> I have attached a patch for a slight refactoring of the thread main loop
looks good to me.
> and I think we don't want that!
You mean, we don't want to have the buffer size get smaller and smaller forever? Indeed, that sounds bad.
> It would be useful to have the global read/write limits feature there before starting to test. For now, this could be as simple as a configuration option that specifies the upstream link capacity, and then the logic necessary to enforce the limit.
Don't we already have this? I thought that BandwidthRate already enforced a global read/write limit.
> We probably also want to write a bit above the per-socket and/or global limits
> > It would be useful to have the global read/write limits feature there before starting to test. For now, this could be as simple as a configuration option that specifies the upstream link capacity, and then the logic necessary to enforce the limit.
> Don't we already have this? I thought that BandwidthRate already enforced a global read/write limit.
We want to keep the non-circuit queues as small as possible. Does the circuit scheduler stop pulling cells from the circuit queues at the same time that BandwidthRate is reached? I guess we can test as-is and revisit this issue if it turns out to be a problem.
Algorithm 2 of the paper shows (lines 4 and 8) that tor should do a lookup on the connection for available write data (raw or not) before doing the adjustment. However, I don't see that in the code right now; is it still needed?

I think this might be needed, or else we might get into a situation where tor iterates over the sockets every X seconds and adjusts their sizes even when it's not needed, which might hurt performance.
For instance, I did a quick kernel trace of tor bootstrapping with the kist branch, and you can see that there are a bunch of get/setsockopt() calls where one of the sockets (fd = 6) actually never writes anything but still gets adjusted.
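A sketch of that guard (connection_get_outbuf_len() and adjust_socket_buffers() are hypothetical names standing in for whatever the implementation uses to check pending write data and to do the adjustment):

{{{
/* Per adjustment round: skip sockets with nothing queued to write,
 * so idle fds (like fd = 6 above) don't get adjusted for nothing. */
if (connection_get_outbuf_len(conn) == 0)
  continue;                    /* nothing to send; leave it alone */
adjust_socket_buffers(conn);   /* the get/setsockopt() dance */
}}}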