KIST has two components: global scheduling (#9262 (moved)) and socket write limits. This ticket is to track discussion about the design that should be implemented to realize socket write limits, and discussion about the implementation.
The goal of the write limit is to never send to the kernel what the kernel wouldn't send out to the network anyway due to throttling at the TCP layer. Rob's USENIX Security paper computed write limits for each socket as the minimum of the free socket buffer space and the TCP space, i.e., socket_limit = min(sndbufcap - sndbuflen, (cwnd - unacked) * mss).
And then a global write limit across all sockets for each scheduling round is computed according to the upstream bandwidth of the relay and the configured write callback time interval. Writing in a given round ends when either the global limit is reached, or all of the socket limits are reached.
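As a concrete sketch of the global limit described above (the helper name and units are illustrative, not from the Tor source), assuming upstream bandwidth in bytes per second and the callback interval in milliseconds:

```c
#include <assert.h>
#include <stdint.h>

/* Global write limit for one scheduling round: the number of bytes the
 * upstream link could drain during one write callback interval.
 * (Illustrative helper, not the actual Tor code.) */
static inline uint64_t
global_write_limit(uint64_t upstream_bytes_per_sec, uint64_t interval_ms)
{
    return upstream_bytes_per_sec * interval_ms / 1000;
}
```

For example, a relay with 10 MB/s upstream and a 10 ms round could write at most about 100 KB per round before the round ends.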
The TCP information can be collected with a getsockopt call, but doing this for every socket for every write round (callback interval) can get expensive. A kernel hacker, Patrick McHardy, suggested using the "netlink socket diag" interface (examples here and here) to collect information for multiple sockets all at once instead of a separate system call for each.
Note that the socket write limit need not actually be computed, because the kernel will return EAGAIN when the socket is full anyway. Along these lines, Bryan Ford suggested setting the socket buffer size based on the amount Tor thinks it should send plus a little extra (e.g., tcp_space*1.25), and then let the kernel push back automatically instead of trying to compute a new write limit for every socket for every write interval round. Then Tor can continue to try to write as much as it can and let the kernel push back when Tor should stop. In this case, we need to ensure TCP auto-tuning is disabled, as otherwise it may undo our settings by adjusting our socket buffer sizes underneath us.
I think we need two intervals: e.g., we want to try to write every 10 milliseconds, and then update snd_cwnd/write limits/socket buffer sizes every 100 milliseconds.
Currently talking with Nick about the design of this component. We decided that this could completely be implemented in a separate thread with no feedback to the master thread.
Then we want setsockopt to set the buffer sizes based on the current length and the tcp window:
{{{
setsockopt(fd, SOL_SOCKET, SO_SNDBUF, (void *)&snd_size, snd_size_len);
setsockopt(fd, SOL_SOCKET, SO_RCVBUF, (void *)&rcv_size, rcv_size_len);
}}}
**NOTE:** When setting `SO_SNDBUF` and `SO_RCVBUF` with `setsockopt`, the value applied is double the value passed (it is automatically doubled by the kernel).
Here is some extra info on that, hope this can be useful:
The kernel makes sure also that you do not go above /proc/sys/net/core/wmem_max for SO_SNDBUF and /proc/sys/net/core/rmem_max for SO_RCVBUF. Note that those are system wide thus should probably not be mangled by tor (unless the box is dedicated for tor traffic :). Furthermore, that value is multiplied by 2 and must at least be a certain minimum. (Below is a snippet from the 3.18.0 kernel).
{{{
sk->sk_sndbuf = max_t(u32, val * 2, SOCK_MIN_SNDBUF);
sk->sk_rcvbuf = max_t(u32, val * 2, SOCK_MIN_RCVBUF);
}}}
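The doubling behavior is easy to observe from user space. A small Linux-only sketch (the helper name is illustrative): request a buffer size, then read back what the kernel actually applied.

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* Returns the effective send buffer size after requesting `requested`
 * bytes, or -1 on error. On Linux the reported value is double the
 * requested one, clamped by SOCK_MIN_SNDBUF below and
 * /proc/sys/net/core/wmem_max above. */
static int
effective_sndbuf(int fd, int requested)
{
    int out = 0;
    socklen_t len = sizeof(out);
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF,
                   &requested, sizeof(requested)) < 0)
        return -1;
    if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &out, &len) < 0)
        return -1;
    return out;
}
```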
The size of struct sk_buff is 232 bytes. Note that this is not considered a stable ABI for user space, thus the size of sk_buff can vary over time. Also, `SKB_DATA_ALIGN` is an alignment on the L1 cache, but it's often only 32 bytes, so here sk_buff is actually aligned to 256 bytes.

Having congestion control in user space is **difficult** vis-a-vis performance. For one, you have to do extra syscalls for every I/O operation, but it's also a "play the guessing max size game" which is tightly controlled by the kernel.
Hmmm. Yeah.
Relatedly, do you know if TCP auto-tuning in the kernel is disabled when the buffer size is explicitly set with setsockopt? If not, do you know a socket option to disable TCP auto-tuning for a given socket?
Replying to dgoulet:
Relatedly, do you know if TCP auto-tuning in the kernel is disabled when the buffer size is explicitly set with setsockopt? If not, do you know a socket option to disable TCP auto-tuning for a given socket?
Yes, it does disable auto-tuning if the app layer requests something explicit.
As an aside, I'm slightly worried about a sufficiently infrequent update window leading to underutilizing the link, but the interval can be tunable, and we can see what a good value is in the wild. (The reverse case, where we overutilize the link because we don't respond to cwnd getting cut due to an insufficiently frequent polling interval, is no worse than the current behavior, so I'm not worried about that.)
More regarding portability, since nickm was asking about it on IRC; comments by other people that use TCP_INFO scared me enough that I went and actually looked at the code.
The good:
Linux < 2.6.x does not support this, since 2.6 is when the call was added.
Windows depends on version (with a preference towards the Windows-ism added in Vista since it's a single call that appears to return all the necessary information). The iperf people claim that TCP_INFO is supported, but I can't find documentation to back this up.
The bad:
NetBSD/OpenBSD do not support TCP_INFO.
The ugly:
FreeBSD support was added in the FreeBSD 6.0 timeframe, but a bunch of values are missing (http://fxr.watson.org/fxr/source/netinet/tcp_usrreq.c#L1259), including a way to get "SND.UNA", making it effectively worthless ("SND.WND", "SND.NXT" and "cwnd" are insufficient to calculate the amount of in-flight data). However, making it work is a trivial one-line change to tcp_usrreq.c:tcp_fill_info().
{{{
/* The full patch will want to rename tcp_info.__tcpi_unacked... */
ti->__tcpi_unacked = tp->snd_una;
}}}
The FreeBSD version of the tcp_space calculation assuming the patch would be something like:
{{{
/*
 * Everything is in bytes, so none of that MSS silliness.
*
* It's probably worth calculating the available space based off:
* wnd_size = min(ti.tcpi_snd_cwnd, ti.tcpi_snd_wnd)
* instead of just tcpi_snd_cwnd since the information is already
* in the tcp_info structure.
*/
tcp_space = ti.tcpi_snd_cwnd - (ti.tcpi_snd_next - ti.__tcpi_unacked);
}}}
The Fruit Company:
Darwin does not expose the fact that it supports TCP_INFO in /usr/include/netinet/tcp.h, but the XNU source (at least xnu-2050.18.24) has code that implements this. Since they are Apple, their idea of what struct tcp_info should look like is different from everyone else's, and ironically enough is the easiest to use. The header comment also notes that "The TCP_INFO socket option is a private API and is subject to change.".
{{{
/*
 * The Fruit Company has their own struct tcp_info, that can be
 * queried by getsockopt() TCP_INFO (undocumented), or sysctl().
 * The Fruit version of struct tcp_info handily includes:
 *   ti->tcpi_txunacked = tp->snd_max - tp->snd_una;
 * As an added bonus, at least as of xnu-2050.18.24, we can omit the
 * TIOCOUTQ call.
 */
sndbuf_bytes = ti.tcpi_snd_sbbytes; /* Look ma, I saved a syscall! */
tcp_space = ti.tcpi_snd_cwnd - ti.tcpi_txunacked; /* Both in bytes. */
}}}
I assume that just using the ioctl() calls to get the send buffer capacity/current size on non-Linux/Windows platforms will still be an improvement, but it bums me out a bit. If people want I can write the kernel patches for all of the BSDs.
(Rob, please let me know which of the above are necessary before you could start testing this.)
Some changes and thoughts:
I have attached a patch for a slight refactoring of the thread main loop to make it easier to run the thread in Shadow.
I thought about it, and I'm pretty sure we don't actually want the socket size calculation anymore. I needed it in my prototype because I was not setting socket buf sizes explicitly, and so had to deal with the buf size changing underneath me from tcp autotuning. Since we are setting the buf size, taking buf_size = MIN(sock_space, tcp_space); would mean that the buf size could never increase - and I think we don't want that! So, all we would have to compute is tcp_space and we can save the other two syscalls :) I have attached a patch for this as well.
It would be useful to have the global read/write limits feature there before starting to test. For now, this could be as simple as a configuration option that specifies the upstream link capacity, and then the logic necessary to enforce the limit.
We probably also want to write a bit above the per-socket and/or global limits, so that we can minimize kernel 'starvation' (when the kernel could have sent something, but it had no data to send because our calculations were a bit off). Options that specify how much above the per-socket and global limits we want to write would be helpful. For example, buf_size = tcp_space; would become buf_size = tcp_space * get_options()->KISTSockBufSizeFactor; (e.g., 1.25) and there would be a similar config for global limits. I think upper/lower bounds are a good idea to enforce some max/min values.
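A hypothetical sketch of how such options might combine (the function name, and the bounds, mirror the KISTSockBufSizeFactor idea above but are not existing torrc options):

```c
/* Scale tcp_space by the configured factor, then clamp to configured
 * lower/upper bounds. All quantities in bytes; names are hypothetical. */
static unsigned
kist_buf_size(unsigned tcp_space, double factor,
              unsigned lower_bound, unsigned upper_bound)
{
    double sized = (double)tcp_space * factor;
    if (sized < (double)lower_bound)
        return lower_bound;
    if (sized > (double)upper_bound)
        return upper_bound;
    return (unsigned)sized;
}
```

With factor 1.25, a socket with 100 KB of TCP space would get a 125 KB buffer, while a nearly idle socket would be held at the lower bound instead of shrinking toward zero.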
I have attached a patch for a slight refactoring of the thread main loop
looks good to me.
and I think we don't want that!
You mean, we don't want to have the buffer size get smaller and smaller forever? Indeed, that sounds bad.
It would be useful to have the global read/write limits feature there before starting to test. For now, this could be as simple as a configuration option that specifies the upstream link capacity, and then the logic necessary to enforce the limit.
Don't we already have this? I thought that BandwidthRate already enforced a global read/write limit.
We probably also want to write a bit above the per-socket and/or global limits
It would be useful to have the global read/write limits feature there before starting to test. For now, this could be as simple as a configuration option that specifies the upstream link capacity, and then the logic necessary to enforce the limit.
Don't we already have this? I thought that BandwidthRate already enforced a global read/write limit.
We want to keep the non-circuit queues as small as possible. Does the circuit scheduler stop pulling cells from the circuit queues at the same time that BandwidthRate is reached? I guess we can test as-is and revisit this issue if it turns out to be a problem.
Algorithm 2 of the paper shows (lines 4 and 8) that tor should do a lookup on the connection for available write data (raw or not) before doing the adjustment. However, I don't see that in the code right now; is it still needed?
I think this might be needed; otherwise we might get into a situation where tor iterates over the sockets every X seconds and adjusts their sizes even though it's not needed, which might hurt performance.
For instance, I did a quick kernel trace of tor bootstrapping with the kist branch, and you can see that there are a bunch of get/setsockopt() calls where one of the sockets (fd = 6) never writes anything but still gets adjusted.
Thanks for looking over the branch. The lines you reference in Alg 2 were primarily there because the prototype we implemented for KIST worked exclusively inside of the connection_write_callback function in main.c. In our prototype, KIST only operated on sockets that were in the global list of writable sockets.
In the new implementation, all sockets are 'watched' by the helper thread, whether they have data to send or not. Even if the socket has no data in Tor, the TCP window could be changing in the kernel due to incoming ACKS, so there is some value in doing the calls even in this case.
One approach is to do the kernel calls for a few seconds after Tor drains the app buffer, and then stop until more data is written. To figure out when to stop and start the calls, we probably need to add more communication between the process and the thread to notify when the socket has data in the Tor buffer and when it does not. (I think adding back in the other kernel call that I removed in my patch to check the socket length would not work, because the socket could be empty at the time of the check even though more data has been written.)
I suggest we keep things simple for now (which is certainly OK for Shadow), but try to monitor how much overhead the extra calls consume and do something like I outlined above if necessary. I do worry, though, that this may end up adding more overhead to the main process instead of the helper thread, which is probably not what we want.
I suggest we keep things simple for now (which is certainly OK for Shadow), but try to monitor how much overhead the extra calls consume and do something like I outlined above if necessary. I do worry, though, that this may end up adding more overhead to the main process instead of the helper thread, which is probably not what we want.
Sounds good!
Let me know when you want to start testing. I can help here with kernel tracing and analyzing the experiments (with and without KIST) by looking at the syscall overhead between the two runs, the number of failed (EWOULDBLOCK) writes in both cases, and the average amount of data written per call, which I think can help us tune stuff and measure regression.
Before I forget, since I talked to nickm about this on IRC: I believe there's a possible future improvement here, depending on how expensive the TCP_INFO query turns out to be, though someone with more time than I have should simulate things.
The polling frequency should be dynamic on a per socket level, based on the state of the TCP connection.
TCP congestion control at a high level is divided into two phases, Slow Start and Congestion Avoidance.
During slow start, on each ACK that covers new data, the congestion window grows by cwnd += min(N, SMSS) (Older TCP/IP stacks grew it by SMSS, which leads to a misbehaving receiver attack). Congestion window growth here is extremely aggressive, so the discrepancy between our polling and the actual congestion window will be quite large and we will end up under-utilizing the link's capacity. This should only happen on relatively new connections, connections that are cruddy enough that the retransmission timer fires, and after the connection has been idle for a while, but the first case affects perceived user responsiveness, so it's worth thinking about.
During congestion avoidance, the congestion window grows by up to SMSS per RTT (cwnd += max(1, SMSS*SMSS/cwnd) on each ACK that covers unacknowledged data). When in congestion avoidance, our estimate should track cwnd fairly closely (at least until loss occurs), so our polling interval can be relatively infrequent. There's likewise potential for being clever here and adjusting the polling interval based on the RTT estimate.
The information required to distinguish between these two states (cwnd, ssthresh, and rtt) is readily available, at least on Linux, so it's mostly a matter of "Is it affordable in terms of syscalls?".
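A toy model of the two growth rules just described, per RFC 5681 with quantities in bytes (a sketch for intuition about the polling-frequency tradeoff, not a claim about any particular kernel's implementation):

```c
/* Congestion window after one ACK covering `acked` new bytes.
 * Slow start (cwnd < ssthresh): cwnd += min(acked, SMSS) -- roughly
 * doubling per RTT, so an infrequent poll falls far behind.
 * Congestion avoidance: cwnd += SMSS*SMSS/cwnd (floored at 1 byte) --
 * roughly SMSS per RTT, so a stale estimate stays close. */
static unsigned
cwnd_after_ack(unsigned cwnd, unsigned ssthresh, unsigned smss,
               unsigned acked)
{
    if (cwnd < ssthresh)
        return cwnd + (acked < smss ? acked : smss);
    unsigned inc = smss * smss / cwnd;
    return cwnd + (inc ? inc : 1);
}
```

With SMSS = 1460, a slow-start ACK grows a 14600-byte window by a full segment, while the same ACK in congestion avoidance on a 100000-byte window grows it by only 21 bytes.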
I'm trying to get this running in Shadow now, and the patch I just attached should help. Currently working out bootstrapping issues I noticed after updating to 0.2.6.x-alpha.
Rob, what do you think about Yawning's comments above?
Donald Knuth once wrote:
premature optimization is the root of all evil
I completely agree with him. Further optimizations are possible to dynamically adjust the polling interval based on which phase we are in (slow start vs. congestion avoidance) and using the SMSS and RTT from the TCP info struct. IMHO, those optimizations should come after we have the simpler version working and merged, assuming that the simpler version already improves the situation with tolerable overhead.
Sorry for the silence, but I have not dropped this! I've been dealing with various problems and am still working on suggesting design changes/additions to Nick's current branch. See my disappointing update in #12891 (moved) comment 6.
I merged nickm's 'kist' branch with tag 'tor-0.2.6.2-alpha' and then made several updates to the kist code. The 6 commits are attached here. I kept the changes as separate commits so it is easier to review each change.
The most significant change is a quick algorithm that I hacked together to compute an adaptive interval length for how often getsockopt() needs to be called to get an updated tcp_info state for that socket. The hope is that this will avoid unnecessary context switches; for example, if we just sent some data and need an ACK before the window will get updated, then we can wait for about RTT milliseconds before asking the kernel for new TCP info.
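The adaptive-interval idea could be sketched minimally as follows (the function and its bounds are hypothetical, not the attached commits; on Linux, tcp_info's rtt field is in microseconds):

```c
/* Wait roughly one RTT before the next getsockopt(TCP_INFO) for this
 * socket, clamped to sane bounds: data we just sent can't be ACKed, and
 * the window can't move, any sooner than one RTT anyway. */
static unsigned
next_poll_delay_ms(unsigned rtt_usec, unsigned min_ms, unsigned max_ms)
{
    unsigned rtt_ms = rtt_usec / 1000;
    if (rtt_ms < min_ms)
        return min_ms;
    if (rtt_ms > max_ms)
        return max_ms;
    return rtt_ms;
}
```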
I'm hoping for help with these things:
merge the code somewhere public and with a proper [more recent] version of Tor? Perhaps in nickm's public kist branch?
review the existing code (I added questions with an XXX and TODO label)
help from someone to properly code the part labeled with FIXME, if needed: if the socket_table was a min heap, that would tighten our socket descriptor iteration loop.
I've done some testing in Shadow and will post the results in #12891 (moved).
Current plan: I'll try to look at this code on Friday if I can possibly do so. I'll either fix the fixmes or figure out more about what's needed to do so. If you've got revisions to make based on that, let's do 'em before PETS. Then let's try to find a 20 minute sit-down time at PETS to figure out details and next steps?
Great plan! I'm doing some experiments now with the new model I posted in #12891 (moved), but I'll wait to make any code changes until your 'kist' branch is updated.
A lot has changed in the last year that I've been implementing KIST.
It looks like most/all of this ticket deals with the KIST prototype that would run stuff off of the main thread and collect stats on all sockets. That's no longer a thing.
Huge optimization: when it comes time to do some scheduling, only collect TCP info for the sockets that have channels ready to send. To quote our new KIST paper that hasn't been accepted anywhere yet:
We observed that the median number of write-pending sockets that accumulated during a 10 millisecond period was 23 (with min=1, q1=18, q3=27, and max=127), while the median amount of time to collect TCP information on all write-pending sockets was 23 microseconds (with min=1, q1=17, q3=33, and max=674). We observed a linear relationship between the amount of time required to collect TCP information on all write-pending sockets and the number of such sockets (1.08 microseconds per pending socket), independent of the total number of open sockets. Therefore, we believe that the KIST overhead, with our optimization of only collecting TCP information on pending sockets, should be tolerable to run in the main thread for even the fastest Tor relay.
First calculate the amount of "TCP space" there is. This is the amount of data that the kernel should send out onto the wire immediately since TCP won't limit it.
Then calculate some amount of extra space. To avoid starving the kernel, we want a //little// extra data to be there so if ACKs come back before the next scheduling run the kernel has something to send so the wire isn't idle. Assuming the sock_buf_size_factor is 1.0, we allow up to one extra congestion window worth of data to sit in the outbound kernel socket buffer.
Add together the TCP space and extra space and that's the socket's KIST-imposed write limit.
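The three steps above can be sketched as follows (a sketch under the stated assumptions, not the merged implementation; cwnd and unacked are in packets, as Linux's struct tcp_info reports them, and mss is in bytes):

```c
#include <stdint.h>

/* Per-socket KIST write limit: TCP space plus extra space. */
static uint64_t
kist_write_limit(uint32_t cwnd, uint32_t unacked, uint32_t mss,
                 double sock_buf_size_factor)
{
    /* 1. TCP space: what the kernel would put on the wire immediately. */
    uint64_t tcp_space = 0;
    if (cwnd > unacked)
        tcp_space = (uint64_t)(cwnd - unacked) * mss;
    /* 2. Extra space: up to one cwnd of data (scaled by the factor) so
     *    the kernel isn't starved if ACKs arrive before the next run. */
    uint64_t extra_space =
        (uint64_t)((double)cwnd * mss * sock_buf_size_factor);
    /* 3. The KIST-imposed limit is the sum of the two. */
    return tcp_space + extra_space;
}
```

With a factor of 1.0, a socket with cwnd of 10 packets, 4 unacked, and a 1000-byte MSS gets 6000 bytes of TCP space plus 10000 bytes of extra space, for a 16000-byte limit.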
It requires two system calls. A recent version of Linux (which came out in March 2016; see `git describe --contains cd9b2660` against the kernel) has a struct tcp_info intended for internal use that has notsentbytes in it. We'd only need one system call if we took advantage of that.
What about cross platform support?
In the name of actually getting KIST merged and to keep the code simple, we make no attempt to support anything other than Linux. This is a great balance between maximizing benefit and development cost. KIST checks for the struct tcp_info, that it has the right members, and that it can make the ioctl syscall too. If so, we have full KIST support. If not, we can still run the KIST scheduler but make the per-socket write limit INT_MAX. It won't hurt.
Trac: Milestone: Tor: unspecified to Tor: 0.3.2.x-final; Resolution: N/A to fixed; Status: new to closed