Opened 5 months ago

Last modified 4 months ago

#22798 needs_review defect

Windows relay is several times slower than Linux relay

Reported by: Vort
Owned by:
Priority: Medium
Milestone: Tor: unspecified
Component: Core Tor/Tor
Version: Tor: 0.2.9.11
Severity: Normal
Keywords: tor-relay performance windows
Cc:
Actual Points:
Parent ID:
Points: 5
Reviewer:
Sponsor:

Description (last modified by teor)

I launched two relays: the first runs natively on Windows, the second runs on Linux in a virtual machine.

Then I measured their bandwidth using a three-hop circuit: refEntry, myRelay, refExit
refEntry is 13B2354C74CCE29815B4E1F692F2F0E86C7F13DD
refExit is 07C05ED4825F51D5BE4CDBBAA80BFA484132A2F5

The Windows version of Tor was able to provide 51 KiB/s.
The Linux version provided 163 KiB/s, which is three times higher.

But those are my own measurements.
The BwAuth ratings for these relays differ far more:
The Windows one has weight = 18 (19/13/22/18).
The Linux one got weight = 1030 (293/1030/1460).

As a result, actual traffic rises from 1 KiB/s (Windows) to ~500 KiB/s (Linux).

I can keep the relay in a virtual machine for a while, but it would be much better if the Windows version were fixed.

Here are the versions of software used in tests:
OS: Windows 7 SP1 x64 (host)
OS: Ubuntu 16.04 x64 (guest)
VM: VirtualBox 5.1.22
Tor: 0.2.9.11 (Linux)
Tor: 0.2.9.11, 0.3.0.8 (Windows)

I have also captured a TCP packet dump from the relay's network interface:
(REDACTED)

Packets 1-1584 are the slow transfer (Windows relay).
Packets 1585-8659 are the fast transfer (Linux VM relay).

I can run additional tests and provide more information if needed.

Child Tickets

Ticket  Status  Owner     Summary                                             Component
#22847  closed  tbb-team  Upgrade Tor Browser Windows builds to cygwin 2.7.0  Applications/Tor Browser

Attachments (15)

tor_apimon.png (69.9 KB) - added by Vort 5 months ago.
random_tweak.patch (331 bytes) - added by Vort 5 months ago.
tor_slow_select.png (98.8 KB) - added by Vort 5 months ago.
tor_fast_select.png (98.1 KB) - added by Vort 5 months ago.
BwTest.cpp (2.0 KB) - added by Vort 5 months ago.
strace.txt (1.1 MB) - added by Vort 5 months ago.
BwTest.2.cpp (2.4 KB) - added by Vort 5 months ago.
BwTest.3.cpp (4.2 KB) - added by Vort 5 months ago.
tor_vm_fast_upload.png (30.1 KB) - added by Vort 5 months ago.
upload_difference.png (23.0 KB) - added by Vort 5 months ago.
tor_upload_without_lso.png (9.6 KB) - added by Vort 5 months ago.
isb_performance.png (6.7 KB) - added by Vort 5 months ago.
tor_windows_upload_hackfix_v1.patch (2.3 KB) - added by Vort 4 months ago.
tor_windows_upload_fix_v2.patch (1.9 KB) - added by Vort 4 months ago.
tor_windows_upload_fix_v3.patch (1.9 KB) - added by Vort 4 months ago.


Change History (138)

comment:1 Changed 5 months ago by teor

Description: modified
Severity: Major → Normal

Hi,

Thanks for reporting this issue.

We know that relays on Windows are slower than on Linux.
On the public network, 99% of relays are Linux or BSD.
But most clients are Windows.
So we would like to make tor perform better on Windows.
But we need more Windows developers to help out so we can do this.

If CPU load is high on Windows, you can help by providing a performance profile that shows where tor spends most of its CPU time. If it's not, then I'm not sure what to look at next. Maybe someone who has experience with Windows network server and performance can help.

Also, please don't post detailed network flows from the live Tor network; it's not safe for users.

comment:2 Changed 5 months ago by teor

Keywords: windows added; network win32 removed
Points: 5

comment:3 Changed 5 months ago by teor

Status: new → needs_information

comment:4 Changed 5 months ago by Vort

CPU load during that 50 KiB/s test is so low that it is hard to measure.
It is somewhere around 0.05%-0.15%.

I think the reason for the slowdown is somewhere in the interaction with the network API.
But I'm not sure, of course.

My IP and MAC were redacted in that file.
The IPs of the two other relays are public and can be viewed on Atlas.
But anyway, I can send this file privately to developers.

comment:5 in reply to:  4 Changed 5 months ago by yawning

Replying to Vort:

I think the reason for the slowdown is somewhere in the interaction with the network API.

The Tor daemon using IOCP is probably required to solve this sort of issue correctly. At one point, there was the bufferevents code that started doing this, but it was incomplete and buggy, and the code was removed due to rot.

comment:6 Changed 5 months ago by cypherpunks

High-performance Windows Sockets Applications

Microsoft Windows networking components have been developed for performance and scalability. This enables applications to maximize available network bandwidth. Windows Sockets and the Windows TCP/IP protocol stack have been tuned and streamlined. As a result, properly written Windows applications can achieve exceptional throughput and performance
These achievements illustrate that Windows TCP/IP processes data very quickly. Many applications, however, do not take advantage of the Windows, TCP/IP, and Windows Sockets performance capabilities because they unknowingly implement performance-hampering techniques.

comment:7 Changed 5 months ago by cypherpunks

Why do you want a Windows relay? Especially, a 32-bit version?

comment:8 in reply to:  7 Changed 5 months ago by Vort

Why do you want a Windows relay?

Because my computer is not only for Tor.

Especially, a 32-bit version?

Because that is the kind of build offered on the official website:
https://www.torproject.org/dist/torbrowser/7.0.1/tor-win32-0.3.0.8.zip
I can build/install a 64-bit version if there are instructions on how to do that.

comment:9 Changed 5 months ago by cypherpunks

At least try a non-ancient Windows Sockets version (the current one is from the Win95 era) in https://gitweb.torproject.org/tor.git/tree/src/common/compat.c#n3221

  r = WSAStartup(0x202,&WSAData);

More in https://msdn.microsoft.com/en-us/library/windows/desktop/ms742213(v=vs.85).aspx

comment:10 in reply to:  9 Changed 5 months ago by Vort

At least try a non-ancient Windows Sockets version (the current one is from the Win95 era)

I have tried this already. No effect.

comment:11 Changed 5 months ago by Vort

Maybe OpenSSL is doing something wrong?
Here is a trace of the send() and recv() API calls at the end of the transmission:
attachment:tor_apimon.png

Note that there are no recv()s in the screenshot.
4097 * 10 / (19.234 - 18.515) = 56981 bytes/s (55 KiB/s)
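
The estimate above divides the bytes sent by the time between the first and last traced call. A minimal C sketch of that arithmetic, assuming ten send() calls of 4097 bytes each and the two timestamps read off the screenshot (the function name is made up for illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Average send rate in bytes per second for `calls` send() calls of
 * `bytes_per_call` bytes each, observed between two trace timestamps
 * given in seconds. */
double send_rate_bytes_per_s(size_t bytes_per_call, unsigned calls,
                             double t_first, double t_last)
{
    assert(t_last > t_first);
    return (double)bytes_per_call * calls / (t_last - t_first);
}
```

Calling `send_rate_bytes_per_s(4097, 10, 18.515, 19.234)` and dividing the result by 1024 gives the KiB/s figure quoted in the comment.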

Changed 5 months ago by Vort

Attachment: tor_apimon.png added

comment:12 Changed 5 months ago by Vort

Random tweak: attachment:random_tweak.patch
And the relay became much faster: 55 KiB/s -> 120 KiB/s.
The bug is somewhere nearby.

Changed 5 months ago by Vort

Attachment: random_tweak.patch added

comment:13 Changed 5 months ago by cypherpunks

Random tweak

Also try increasing ConstrainedSockSize (enable ConstrainedSockets for that). Some megabytes would be good, though the maximum may be 256k.

Last edited 5 months ago by cypherpunks

comment:15 in reply to:  13 Changed 5 months ago by Vort

Try to increase ConstrainedSockSize

Yes! That's it.

ConstrainedSockets 1
ConstrainedSockSize 262144

And the Windows relay is as fast as the Linux relay.
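
For context, socket buffers can also be sized from code. The sketch below is a POSIX-style illustration, not Tor's implementation: it assumes that a setting like ConstrainedSockSize ends up as SO_SNDBUF/SO_RCVBUF requests, and both helper names are hypothetical. Winsock offers the same options, but there the value pointer has to be cast to `const char *`.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Request the same size for the send and receive buffers of `fd`.
 * Returns 0 on success, -1 if either setsockopt() call fails. */
int set_socket_buffers(int fd, int size)
{
    assert(size > 0);
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size)) < 0)
        return -1;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size)) < 0)
        return -1;
    return 0;
}

/* Read back the send buffer size the kernel actually applied, or -1. */
int get_send_buffer(int fd)
{
    int size = 0;
    socklen_t len = sizeof(size);
    if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &size, &len) < 0)
        return -1;
    return size;
}
```

The kernel may round or cap the requested size, so reading the value back with `get_send_buffer()` is the only way to know what was applied.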

comment:16 Changed 5 months ago by cypherpunks

Yes! That's it.

You can tweak registry settings to change the default values for socket buffers.

comment:17 in reply to:  16 Changed 5 months ago by Vort

You can tweak registry settings to change the default values for socket buffers.

In that case there should be instructions for Windows relay operators.

comment:18 Changed 5 months ago by Vort

Or, maybe, this problem affects regular users too?

comment:19 Changed 5 months ago by cypherpunks

Or, maybe, this problem affects regular users too?

Please, test it.

There was once the inverse problem with socket buffers (or at least a theory about it).

comment:20 in reply to:  12 Changed 5 months ago by cypherpunks

Status: needs_information → new

The test in comment:12 shows that this is a bottleneck in Tor, which sooner or later leads to system buffer exhaustion under high load.

comment:21 Changed 5 months ago by cypherpunks

this is a bottleneck in Tor, which sooner or later leads to system buffer exhaustion under high load.

No. That test succeeded because tor_tls_write() could pass more bytes at once to OpenSSL, and Windows' send() is too complex. Increasing the chunk size could actually lead to buffer exhaustion, and this ticket is not about that.

comment:22 in reply to:  19 Changed 5 months ago by Vort

Replying to cypherpunks:

Or, maybe, this problem affects regular users too?

Please, test it.

I can't see such a problem in non-relay mode.
I tested with an HTTP upload and the same refEntry and refExit relays.
Probably the difference is that in the relay case the connection is incoming, and in the non-relay case it is outgoing.

comment:23 Changed 5 months ago by Vort

Yes.
If the connection is initiated from my relay to the remote relay, the speed is 464 KiB/s.
If it is initiated from the remote relay to my relay, it is 53 KiB/s.

comment:24 Changed 5 months ago by cypherpunks

Probably the difference is that in the relay case the connection is incoming, and in the non-relay case it is outgoing.
Yes.

Can you also test an even crazier random tweak?
Try replacing this code:

        if (!connection_is_reading(conn)) {
          connection_stop_writing(conn);
          conn->write_blocked_on_bw = 1;
          /* we'll start reading again when we get more tokens in our
           * read bucket; then we'll start writing again too.
           */
        }

with

        if (!connection_is_reading(conn)) {
          break;
        }

in connection_handle_write_impl()

Last edited 5 months ago by cypherpunks

comment:25 Changed 5 months ago by cypherpunks

Try to remove code:

Wait, I'll edit the post; it was a wrong idea.

comment:26 Changed 5 months ago by cypherpunks

Try to replace code

The idea behind this: a small socket buffer (SO_SNDBUF), combined with OpenSSL internals returning TOR_TLS_WANTREAD for a socket that is already blocked on read (by bandwidth limits), leads to a write lag (100 ms by default), so your relay writes fewer bytes.

You could also test this theory without patches by decreasing the TokenBucketRefillInterval value (this will increase CPU load).

comment:27 Changed 5 months ago by Vort

I will test this once my test relay is in the consensus.

But I can already say that I see lags in select() calls.
In the normal case, select() returns 1 almost instantly (within 0-1 ms) after a send() fails with code 10035 (WSAEWOULDBLOCK).
In the slow case, select() runs for 10-100 ms, often returning 0 because of a timeout.

comment:28 in reply to:  24 Changed 5 months ago by Vort

Try to replace code:

No, it doesn't help.

I guess the problem is somewhere near the creation of a new connection.
Incoming connections behave as if they had TCP window = default send buffer size = 8 KiB.
Outgoing connections most likely have 8 KiB buffers too, but they do not wait until all ACKs arrive.

comment:29 in reply to:  27 Changed 5 months ago by cypherpunks

Replying to Vort:

In the slow case, select() runs for 10-100 ms, often returning 0 because of a timeout.

The only commit that mentions a select() timeout is https://gitweb.torproject.org/tor.git/commit/?id=4645f28c3b125f9d281eb457d110c431a6a0b166
But it applies to multi-threaded Tor only, which is not the case on Windows.

comment:30 Changed 5 months ago by Vort

The only commit that mentions a select() timeout is

I am talking about the select() that is called from libevent-2-0-5.dll.
And the timeout value is not a problem by itself.
The problem is that the timeout is being hit because of some misconfiguration.

comment:31 Changed 5 months ago by cypherpunks

No, it doesn't help.

Yeah, and OpenSSL can't return WANTREAD for non-handshake operations. It was a wrong theory, alas.

comment:32 Changed 5 months ago by Vort

Here are screenshots that show slow and fast select()s:
attachment:tor_slow_select.png
attachment:tor_fast_select.png

In slow mode, there are many select()s and few send()s.
In fast mode, there are few select()s and many send()s.

Changed 5 months ago by Vort

Attachment: tor_slow_select.png added

Changed 5 months ago by Vort

Attachment: tor_fast_select.png added

comment:33 Changed 5 months ago by cypherpunks

Here are screenshots that show slow and fast select()s:

So an incoming connection behaves like a POSIX socket with strict limits, and an outgoing one like a Windows socket with complex logic (4097*2 > 8192).

comment:34 Changed 5 months ago by cypherpunks

Incoming connections behave as if they had TCP window = default send buffer size = 8 KiB.
Outgoing connections most likely have 8 KiB buffers too, but they do not wait until all ACKs arrive.

Or maybe not: maybe not 8k, maybe not incoming. We need more measurement data.

comment:35 Changed 5 months ago by cypherpunks

https://community.openvpn.net/openvpn/ticket/640 interesting stuff.

So, we can learn that Win8 and up (and Linux) we do not need to do anything to sndbuf/rcbuf (31.9 vs. 34.9 sounds like test noise to me), but the recommendation for XP, Vista and Win7 should be "use big buffers".

comment:36 Changed 5 months ago by cypherpunks

https://serverfault.com/questions/608060/windows-tcp-window-scaling-hitting-plateau-too-early is yet more food for thought. It (doesn't) explain which afd.sys version you need in order to tune the send-window registry key, why that never works, which autotuning features you need to turn off and on, why m$ ruined the TCP stack in win7, and what to do about it. (Spoiler: nobody knows how to fix win7, except maybe to install win98 win10.)

comment:37 Changed 5 months ago by Vort

Maybe it is possible to find some open-source software that uses non-blocking mode, doesn't set SO_SNDBUF and works fine on Windows 7?

comment:38 Changed 5 months ago by cypherpunks

Does enabling https://en.wikipedia.org/wiki/Compound_TCP change anything for you?

comment:39 in reply to:  38 Changed 5 months ago by Vort

Does enabling https://en.wikipedia.org/wiki/Compound_TCP change anything for you?

No.

comment:40 Changed 5 months ago by Vort

I ran a test using WANem and 3 virtual machines: Windows 7, Windows XP and WANem.
The connection was set to 10 Mbit/s with a 150 ms delay.
With this configuration, Win7->WinXP iperf gives 3.3 Mbit/s (400 KiB/s).
This is worse than the theoretical limit of 1.19 MiB/s,
but far better than the 50 KiB/s I was getting in the real configuration.

If I figure out how to set up a completely virtual Tor network, I will be able to make additional measurements.
But, most likely, I will get the same 50 KiB/s with it.
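
The unit conversions in this comment can be checked with a few lines of C. This is only a sketch of the arithmetic, assuming Mbit/s means 10^6 bits per second and KiB means 1024 bytes; the function name is illustrative.

```c
#include <assert.h>

/* Convert a link rate in Mbit/s (10^6 bits per second) to KiB/s. */
double mbit_to_kib_per_s(double mbit_per_s)
{
    assert(mbit_per_s >= 0.0);
    return mbit_per_s * 1000000.0 / 8.0 / 1024.0;
}
```

With these definitions, `mbit_to_kib_per_s(10.0) / 1024.0` is about 1.19 MiB/s and `mbit_to_kib_per_s(3.3)` is about 403 KiB/s, in line with the figures above.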

comment:41 Changed 5 months ago by Vort

Here is a simple program that reproduces the problem:
attachment:BwTest.cpp

Usage:
BwTest.exe -s
BwTest.exe -c 10.0.1.2

In the same configuration where iPerf shows 400 KiB/s, this program shows 80 KiB/s.

Changed 5 months ago by Vort

Attachment: BwTest.cpp added

comment:42 in reply to:  41 Changed 5 months ago by cypherpunks

Replying to Vort:

Here is a simple program that reproduces the problem:

Any difference from https://msdn.microsoft.com/en-us/library/windows/desktop/ms740149(v=vs.85).aspx?

comment:43 in reply to:  42 Changed 5 months ago by Vort

Replying to cypherpunks:

Replying to Vort:

Here is a simple program that reproduces the problem:

Any difference from https://msdn.microsoft.com/en-us/library/windows/desktop/ms740149(v=vs.85).aspx?

Some things are similar, some are different.
I don't know exactly which differences you want to know about.

comment:44 in reply to:  43 Changed 5 months ago by cypherpunks

Replying to Vort:

Some things are similar, some are different.
I don't know exactly which differences you want to know about.

Use MS crap to test bw, and if it's the same, set buffer to 64k (instead of 8k), and give up.

comment:45 Changed 5 months ago by cypherpunks

set buffer to 64k (instead of 8k)

A fixed size is not a solution: for some peers you need even more buffer space, while for others it wastes time and space. Win7 was announced as modern, with auto-tuning features; it should somehow support all possible network connections across the planet.

comment:46 in reply to:  45 Changed 5 months ago by cypherpunks

Replying to cypherpunks:

set buffer to 64k (instead of 8k)

A fixed size is not a solution: for some peers you need even more buffer space, while for others it wastes time and space. Win7 was announced as modern, with auto-tuning features; it should somehow support all possible network connections across the planet.

Then use https://dxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/md/windows/ntio.c

comment:47 Changed 5 months ago by cypherpunks

Then use https://dxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/md/windows/ntio.c

Firefox uses a fixed size on Windows, hard-coded to 131072 * 4 bytes. The comment in the code:

    // The Windows default of 8KB is too small and as of vista sp1, autotuning
    // only applies to receive window

comment:48 Changed 5 months ago by cypherpunks

https://msdn.microsoft.com/en-us/library/windows/desktop/bb736549(v=vs.85).aspx looks like a way to implement send buffer auto-tuning that really works:

Applications that perform one blocking or non-blocking send request at a time typically rely on internal send buffering by Winsock to achieve decent throughput. The send buffer limit for a given connection is controlled by the SO_SNDBUF socket option. For the blocking and non-blocking send method, the send buffer limit determines how much data is kept outstanding in TCP. If the ISB value for the connection is larger than the send buffer limit, then the throughput achieved on the connection will not be optimal. In order to achieve better throughput, the applications can set the send buffer limit based on the result of the ISB query as ISB change notifications occur on the connection.

On every connection_handle_write_impl() call, or somewhere else, Tor would query the ISB, do the calculation, and set the optimal value for SO_SNDBUF.
It is not much code, but it seems to be for Win7 only.

Last edited 5 months ago by cypherpunks

comment:49 Changed 5 months ago by Vort

I was wondering what the key to good iPerf speed is.
I launched API Monitor and saw SO_SNDBUF being set with size = 212992.
But it came not from iperf3.exe, but from cygwin1.dll.
Digging into that library revealed that this part of the code was recently disabled:
https://github.com/mirror/newlib-cygwin/commit/609d2b22afc63d80029a4fe85e64b51c84ccae08

comment:50 Changed 5 months ago by teor

I logged #22847 to upgrade the Tor windows builds to cygwin 2.7.0.

comment:51 Changed 5 months ago by Vort

It is better to investigate why cygwin can be faster, even without SO_SNDBUF.

comment:52 Changed 5 months ago by Vort

Probably the only thing that matters is the size of the buffer passed to send().
If I change 4097 in my code to 32768, the speed rises from 80 KiB/s to 310 KiB/s.
Even without SO_SNDBUF.

comment:53 in reply to:  51 Changed 5 months ago by teor

Replying to Vort:

It is better to investigate why cygwin can be faster, even without SO_SNDBUF.

The tor bundles use mingw, not cygwin. Sorry for not working that out sooner.

comment:54 Changed 5 months ago by Vort

The tor bundles use mingw, not cygwin.

And mingw is a good choice.

comment:55 Changed 5 months ago by teor

See #22848 for the equivalent mingw-w64 bug.

comment:56 Changed 5 months ago by Vort

See #22848 for the equivalent mingw-w64 bug.

Other way round.

When auto-tuning is enabled, speed is 50 KiB/s.
When it is disabled (via ConstrainedSockSize), speed is 400 KiB/s.

But this is not enough information to draw conclusions.

comment:57 Changed 5 months ago by Vort

Here is the strace result, just for completeness:
attachment:strace.txt

So it turns out that those 4 KiB send()s are fine on Linux but bad on Windows?

Changed 5 months ago by Vort

Attachment: strace.txt added

comment:58 Changed 5 months ago by cypherpunks

When auto-tuning is enabled, speed is 50 KiB/s.
When it is disabled (via ConstrainedSockSize), speed is 400 KiB/s.

But this is not enough information to draw conclusions.

There is no evidence that auto-tuning is enabled under any setting in your case. With the default buffer size (assume 8k) the speed is 50 KiB/s, while with the modified one (256k) it is 400 KiB/s.

Making 256k the default for recv and send would make popular Windows relays use more memory than Linux relays. That is the way to a new ticket: "Windows relay is several times more hungry than Linux relay".

comment:59 Changed 5 months ago by Vort

There is no evidence that auto-tuning is enabled under any setting in your case

Yes, but it should be enabled.
netsh winsock show autotuning -> enabled.

Maybe something is just preventing it from working correctly.

comment:60 Changed 5 months ago by cypherpunks

Maybe something is just preventing it from working correctly.

Or maybe it is just a bug inside the Windows code.

Look at the FileZilla code for something like this (it is a copy/paste from some old code, so something may be wrong):

static void update_tcp_send_buffer_size(SOCKET s)
{
    ULONG v = 0;       /* ideal send backlog (ISB) reported by the stack */
    DWORD outlen = 0;

#ifndef SIO_IDEAL_SEND_BACKLOG_QUERY
#define SIO_IDEAL_SEND_BACKLOG_QUERY 0x4004747b
#endif

    /* WSAIoctl() returns 0 on success; only then apply the ISB as SO_SNDBUF. */
    if (!p_WSAIoctl(s, SIO_IDEAL_SEND_BACKLOG_QUERY, 0, 0, &v, sizeof(v), &outlen, 0, 0)) {
        p_setsockopt(s, SOL_SOCKET, SO_SNDBUF, (const char*)&v, sizeof(v));
    }
}

Test it with your BwTest.cpp (update the buffer size once a second, when you compute the speed).
It is actually just two lines of code to test.

comment:61 Changed 5 months ago by Vort

Test it with your BwTest.cpp (update the buffer size once a second, when you compute the speed).

80 KiB/s -> 400 KiB/s
Nice.
I need to test against a Linux machine instead of Windows XP.
Maybe this code is capable of 1 MiB/s too.

comment:62 Changed 5 months ago by Vort

Maybe this code is capable of 1 MiB/s too.

Yes, with a Win7->Ubuntu transfer, the speed reaches 1190 KiB/s.

comment:63 Changed 5 months ago by cypherpunks

Can you check which values SIO_IDEAL_SEND_BACKLOG_QUERY returns and how they change over time?

Changed 5 months ago by Vort

Attachment: BwTest.2.cpp added

comment:64 Changed 5 months ago by Vort

attachment:BwTest.2.cpp, Win7->Ubuntu:

bw = 60 KiB/s, buf = 65536
bw = 437 KiB/s, buf = 262144
bw = 1237 KiB/s, buf = 1048576
bw = 1108 KiB/s, buf = 2097152
bw = 2496 KiB/s, buf = 2097152
bw = 1529 KiB/s, buf = 2097152
bw = 1189 KiB/s, buf = 2097152
bw = 1194 KiB/s, buf = 2097152
bw = 1189 KiB/s, buf = 2097152
bw = 1189 KiB/s, buf = 2097152
bw = 1190 KiB/s, buf = 2097152

comment:65 Changed 5 months ago by cypherpunks

Interesting. Does it really need a 2M send buffer to reach 1M/s on your test link?
Then why does it report

bw = 1237 KiB/s, buf = 1048576

It had already reached 1M/s while SNDBUF was previously set to 256k.
Maybe set SNDBUF to v/2 and watch the values again?
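
One way to sanity-check the buffer sizes in this discussion is the bandwidth-delay product: the number of bytes that must be in flight to keep a link busy. A small sketch, assuming the 10 Mbit/s WANem link from comment:40 and treating its 150 ms delay as the round-trip time (that is an assumption, and the function name is illustrative):

```c
#include <assert.h>

/* Bandwidth-delay product in bytes: link rate (bits/s) times RTT (s). */
double bdp_bytes(double rate_bits_per_s, double rtt_s)
{
    assert(rate_bits_per_s >= 0.0 && rtt_s >= 0.0);
    return rate_bits_per_s / 8.0 * rtt_s;
}
```

Under those assumptions `bdp_bytes(10e6, 0.150)` is about 187500 bytes (roughly 183 KiB), well below the 2M value reported as the ideal send backlog.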

comment:66 Changed 5 months ago by Vort

Interesting. Does it really need a 2M send buffer to reach 1M/s on your test link?

Maybe this depends on how WANem's limiting is implemented?

Maybe set SNDBUF to v/2 and watch the values again?

The results are strange:

bw = 60 KiB/s, buf = 65536
bw = 233 KiB/s, buf = 131072
bw = 441 KiB/s, buf = 262144
bw = 874 KiB/s, buf = 524288
bw = 1257 KiB/s, buf = 1048576
bw = 1437 KiB/s, buf = 2097152
bw = 1693 KiB/s, buf = 4194304
bw = 2199 KiB/s, buf = 4194304
bw = 113 KiB/s, buf = 4194304
bw = 2 KiB/s, buf = 4194304
bw = 1900 KiB/s, buf = 4194304
bw = 1185 KiB/s, buf = 4194304
bw = 1190 KiB/s, buf = 4194304
bw = 1189 KiB/s, buf = 4194304
bw = 1194 KiB/s, buf = 4194304

comment:67 Changed 5 months ago by Vort

I will try to increase the memory of the WANem VM and run the test again.

comment:68 in reply to:  67 Changed 5 months ago by Vort

I will try to increase the memory of the WANem VM and run the test again.

It doesn't help.
The same drop to 2 KiB/s.
I don't know what exactly that means.

But anyway, with v/2 the buffer grows even larger.

comment:69 Changed 5 months ago by cypherpunks

But anyway, with v/2 the buffer grows even larger.

I guess with v*2 the buffer wouldn't grow.

comment:70 Changed 5 months ago by Vort

Another thing worth trying is setting SO_SNDBUF=IDEAL_SEND_BACKLOG after every send().
(No v/2 with this test)
Results:

bw = 410 KiB/s, buf = 262144
bw = 2953 KiB/s, buf = 2097152
bw = 3209 KiB/s, buf = 4194304
bw = 165 KiB/s, buf = 4194304
bw = 356 KiB/s, buf = 4194304
bw = 287 KiB/s, buf = 4194304
bw = 2093 KiB/s, buf = 4194304
bw = 1189 KiB/s, buf = 4194304
bw = 1189 KiB/s, buf = 4194304
bw = 1194 KiB/s, buf = 4194304
bw = 1189 KiB/s, buf = 4194304
bw = 1190 KiB/s, buf = 4194304
bw = 1189 KiB/s, buf = 4194304

Now I am almost sure that bandwidth drops are due to WANem limiting.
But I don't have a real 10 Mbit/150 ms link for testing purposes.

comment:71 Changed 5 months ago by cypherpunks

Now I am almost sure that bandwidth drops are due to WANem limiting.
But I don't have a real 10 Mbit/150 ms link for testing purposes.

I guess a connection via WANem is almost like a connection via a real WAN with loss and jitter: every moment is unique. The ideal buffer size is a result of such conditions, and it seems that it never shrinks in Windows 7.

comment:72 in reply to:  59 Changed 5 months ago by cypherpunks

Yes, but it should be enabled.
netsh winsock show autotuning -> enabled.

Maybe something is just preventing it from working correctly.

Then again, is that true for outbound connections too? Or are only incoming connections broken? Maybe Microsoft limits incoming connections this way in non-server (non-Ultimate or something) editions?

comment:73 Changed 5 months ago by Vort

I guess a connection via WANem is almost like a connection via a real WAN with loss and jitter: every moment is unique.

You can't get 3209 KiB/s on a 10 Mbit/s link.
So WANem is not perfect (or maybe I made a configuration mistake somewhere).

and it seems that it never shrinks in Windows 7

It looks that way.
But if WANem produces bandwidth spikes, that could also explain the absence of shrinking.

By the way, I wasn't able to make IDEAL_SEND_BACKLOG grow beyond 4194304, even on a 100 Mbit/s link.

comment:74 Changed 5 months ago by Vort

Then again, is that true for outbound connections too? Or are only incoming connections broken?

Testing this requires a complete rewrite (and port) of BwTest.
And I'm not sure that this test reproduces the problem with 100% accuracy.
So maybe I will do this, but not right now.

Another thought:
Right now I am running the relay with ~1100 established connections.
It is the official 0.3.0.8 build with the ConstrainedSockSize 262144 tweak.
But I don't see where the wasted 262144 * 2 * 1100 = 550 MiB are located.
tor.exe has only 118 MiB of "Private Bytes" and 259 MiB of "Virtual Size" in use.
That is not much higher than what a freshly started tor.exe process uses: 55 MiB and 161 MiB respectively.
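
The worst-case figure in this comment is one send buffer plus one receive buffer per connection, all completely full. A sketch of that estimate follows; the function names are illustrative, and real usage is normally lower because buffers are rarely full on every socket at once.

```c
#include <assert.h>
#include <stdint.h>

/* Worst-case bytes held in socket buffers: one send and one receive
 * buffer of `buf_size` bytes for each of `connections` sockets. */
uint64_t socket_buffer_bytes(uint64_t buf_size, uint64_t connections)
{
    /* Guard against overflow of the 64-bit product. */
    assert(connections == 0 || buf_size <= UINT64_MAX / 2 / connections);
    return buf_size * 2 * connections;
}

/* Convert bytes to MiB (1 MiB = 1024 * 1024 bytes). */
double bytes_to_mib(uint64_t bytes)
{
    return (double)bytes / (1024.0 * 1024.0);
}
```

For 1100 connections with 262144-byte buffers this gives 550 MiB; for the 966 connections mentioned in comment:83 it gives 483 MiB.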

comment:75 Changed 5 months ago by cypherpunks

I tested this code with IDEAL_SEND_BACKLOG on a virtual (unlimited), low-latency local link (Win7 guest to Linux host). It grows the buffer to 2M for no reason. If I only report the ideal buffer size (it says 64k) without actually changing anything (the system default SNDBUF is reported as 8k), it transfers at the same speed as with the 2M buffer.

comment:76 Changed 5 months ago by cypherpunks

But I don't see where the wasted 262144 * 2 * 1100 = 550 MiB are located.

  • Do you see the gopher?
  • No.
  • I don't see it either. But it's there.
  • Got it.

comment:77 Changed 5 months ago by cypherpunks

I tested this code with IDEAL_SEND_BACKLOG on a virtual (unlimited), low-latency local link (Win7 guest to Linux host). It grows the buffer to 2M for no reason. If I only report the ideal buffer size (it says 64k) without actually changing anything (the system default SNDBUF is reported as 8k), it transfers at the same speed as with the 2M buffer.

Tested with latencies of 1, 5 and 10: it always wants 2M (this probably depends on the maximum size of system memory).

Outbound connections sometimes behave differently: if some real latency is present (1 ms, 10 ms), it doesn't try to get all the available bandwidth, and IDEAL_SEND_BACKLOG returns 64k every time.

Last edited 5 months ago by cypherpunks

comment:78 Changed 5 months ago by cypherpunks

And I'm not sure that this test reproduces the problem with 100% accuracy.

At least SNDBUF is reported as 8k and the speed depends on latency for outbound connections; Windows doesn't try to adapt the send buffer size to the connection by itself. Both inbound and outbound connections are affected by this bug.

comment:79 Changed 5 months ago by cypherpunks

https://social.technet.microsoft.com/Forums/windows/en-US/d047ba5c-9c25-4089-85fc-569693870d71/winsock-send-buffer-size-windows-7?forum=w7itpronetworking

Hi dear technicians,

Something is wrong or I don't get is right what's with winsock ?

look:

when I set in netsh: netsh winsock set autotuning on

I receive this message => Winsock send autotuning is disabled.

when I set in netsh: netsh winsock set autotuning off,

I receive this message=> Winsock send autotuning is enabled.

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\AFD\Parameters\

REG_DWORD "DynamicSendBufferDisable"

setting this value to 0, I receive this message => Winsock send autotuning is disabled.

setting this value to 1, I receive this message => Winsock send autotuning is enabled.

I think, something is inverted or I am wrong!?

Haha, it doesn't work for me either. The code is probably just missing.

comment:80 Changed 5 months ago by Vort

Here is the new version of the test program:
attachment:BwTest.3.cpp
It now uses non-blocking mode and is fully compatible with Linux.

With this version, the speed is now 52 KiB/s (instead of 80 KiB/s for the previous version),
which is closer to what I observe with Tor.

Both inbound and outbound connections are affected by this bug.

That is strange.
I see the same thing with the test program.
But the effect is different for Tor.
I rechecked that again.

Another confirmation of the inbound/outbound difference: if the two behaved the same, regular Windows users would see very slow upload speeds in many cases.
But that is not what happens.

Most likely, the reproduction is still not accurate enough.

Changed 5 months ago by Vort

Attachment: BwTest.3.cpp added

comment:81 in reply to:  76 Changed 5 months ago by Vort

  • Do you see the gopher?
  • No.

He has no place to hide.

comment:82 Changed 5 months ago by cypherpunks

But I don't see where the wasted 262144 * 2 * 1100 = 550 MiB are located.

Linux doesn't report socket buffers per process either: you can send 4M to the queue without any additional memory being reported for your process. But you can add up the queues from netstat, or read the memory used by all sockets as reported in /proc/net/sockstat.
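
As an illustration of the /proc/net/sockstat approach, the sketch below pulls the "mem" field (counted in pages) out of the "TCP:" line of text in that format. The function name is hypothetical, and the caller is assumed to have already read the file into a string.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return the "mem" field (in pages) of the "TCP:" line in sockstat-style
 * text, or -1 if the line or the field is missing. */
long sockstat_tcp_mem_pages(const char *text)
{
    assert(text != NULL);
    const char *line = strstr(text, "TCP:");
    if (line == NULL)
        return -1;
    /* Only accept a " mem " token that lies on the same line. */
    size_t line_len = strcspn(line, "\n");
    const char *mem = strstr(line, " mem ");
    if (mem == NULL || (size_t)(mem - line) >= line_len)
        return -1;
    long pages;
    if (sscanf(mem, " mem %ld", &pages) != 1)
        return -1;
    return pages;
}
```

Multiplying the result by the system page size gives the kernel memory currently used by TCP socket buffers across all processes.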

comment:83 Changed 5 months ago by Vort

It looks like socket buffers in Windows are located in the Nonpaged Pool.
Now tor.exe has 966 connections, and the Nonpaged Pool has a size of 255 MiB (checked with RamMap).
Still not 480 MiB.
I will look at these numbers again when the connection count changes.

comment:84 Changed 5 months ago by cypherpunks

But the effect is different for Tor.
I rechecked that again.
If the connection is initiated from my relay to the remote relay, the speed is 464 KiB/s.
If it is initiated from the remote relay to my relay, it is 53 KiB/s.

Tor does call bind() for non-loopback addresses when connecting to the outside:

  if (bindaddr && bind(s, bindaddr, bindaddr_len) < 0) {
    *socket_error = tor_socket_errno(s);
    log_warn(LD_NET,"Error binding network socket: %s",
             tor_socket_strerror(*socket_error));
    tor_close_socket(s);
    return -1;
  }

This might change something for sockets on Windows.

But 464 KiB/s still sounds like a 64K buffer on a link with 150 ms latency: better than 8k, but not optimal.
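
The reasoning here is the usual window-limited bound: a sender that can keep at most one buffer's worth of data in flight per round trip cannot exceed buffer size divided by RTT. A minimal sketch, assuming the 150 ms latency mentioned above is the round-trip time (the function name is illustrative):

```c
#include <assert.h>

/* Upper bound on throughput (bytes/s) when at most `window_bytes` can be
 * in flight per round trip of `rtt_s` seconds. */
double window_limited_rate(double window_bytes, double rtt_s)
{
    assert(rtt_s > 0.0);
    return window_bytes / rtt_s;
}
```

With an 8 KiB window, `window_limited_rate(8192, 0.150) / 1024` is about 53 KiB/s, which matches the slow case; with a 64 KiB window the bound is about 427 KiB/s, in the neighbourhood of the 464 KiB/s measured.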

Last edited 5 months ago by cypherpunks

comment:85 Changed 5 months ago by cypherpunks

This might change something for sockets on Windows.

Nothing changes: 53 KiB/s on a link with 150 ms latency. Anyway, Tor only binds when an OutboundBindAddress option is set.

comment:86 Changed 5 months ago by cypherpunks

If connection is initiated from my relay to remote relay, then speed is 464 KiB/s.
If from remote relay to my relay, then 53 KiB/s.

Maybe routes were different with different latency?
Just tested: 464 KiB/s corresponds to about 16-18 ms.
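
Read the other way round, the same bound gives the round-trip time implied by a measured speed. This sketch assumes the 8 KiB send buffer discussed earlier in the ticket; the function name is hypothetical:

```c
/* Round-trip time, in milliseconds, at which a sender limited to
 * window_bytes in flight would reach exactly kib_per_s. */
double implied_rtt_ms(double window_bytes, double kib_per_s)
{
  return (window_bytes / 1024.0) / kib_per_s * 1000.0;
}
```

implied_rtt_ms(8192, 464) is about 17.2 ms, while a 64 KiB buffer at the same speed would correspond to about 138 ms.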

comment:87 Changed 5 months ago by Vort

Maybe routes were different with different latency?

This is how the tests were made:

Two Tor instances were launched on the same Windows 7 computer:
TorClient and TorRelay.

Then I made a 3-hop circuit via the control port of TorClient: refEntry, myRelay, refExit.
Attached a stream and downloaded a test file from 38.229.72.16 via the SOCKS port of TorClient.
This test forced refEntry to make a connection to myRelay.
Then the download command forced my relay to upload data to refEntry.
I guess that the ping between myRelay and refEntry is somewhere near 150 ms.
This test produced 52 KiB/s.

Next test:
Restarted TorRelay (which drops the connection from refEntry to myRelay).
Connected to the control port of TorRelay.
Created the 2-hop circuit refEntry, refExit.
This forced an outgoing connection from TorRelay to refEntry.
The next part is identical to test one:
3-hop circuit via the control port of TorClient: refEntry, myRelay, refExit, and file download.
But since the TorRelay <-> refEntry connection is already established, an incoming connection is not created.
I don't know whether Tor creates a new outgoing connection or reuses the existing one, but I don't see an incoming connection in this case.
This test produced 464 KiB/s.

Maybe I am doing something wrong.
So please try to reproduce this thing.

If your ping to refEntry is not 150 ms, then select another refEntry.

comment:88 Changed 5 months ago by cypherpunks

Maybe I am doing something wrong.
So please try to reproduce this thing.

I don't trust Windows with an internet connection. My virtual environment is isolated. Maybe that is the reason I can't reproduce the result with outbound connections. Btw, is your Win7 behind WANem isolated from the internet?

comment:89 Changed 5 months ago by Vort

Btw, is your Win7 behind WANem isolated from the internet?

Yes.

comment:90 Changed 5 months ago by Vort

I don't trust Windows with an internet connection.

Is it hard to make an isolated Tor network?

comment:91 Changed 5 months ago by cypherpunks

Is it hard to make an isolated Tor network?

Chutney does that, except probably on Windows.

comment:92 Changed 5 months ago by Vort

Can't reproduce inbound/outbound difference with VM and isolated Tor network.
Probably that is because host and guest have different sets of OS updates.

comment:93 Changed 5 months ago by cypherpunks

Probably that is because host and guest have different sets of OS updates.

Microsoft probably fixed it silently and partially some days ago. My VM runs an "outdated" Win7 too. I tried almost everything: redirected the internet detection, assigned a public IP address, connected to port 443. Nada: 53 KiB/s at 150 ms, as if hard-coded.

comment:94 Changed 5 months ago by Vort

Sadly, the updates did not help me to reproduce the inbound/outgoing difference.
But now I know that outgoing upload does not differ between relay and client.
Both of them show ~50 KiB/s in the isolated VM.

comment:95 Changed 5 months ago by cypherpunks

Win7 detects that we are trying to fool it with a VM and disables the super-exclusive feature :)

comment:96 Changed 5 months ago by cypherpunks

Or it is triggered by some of the peer's parameters, like the TCP window.

comment:97 Changed 5 months ago by Vort

Or the VM simulation is inaccurate somewhere.
Here is an upload test with a NAT-ed VM and guard E555B09C (ping = 124 ms from my location):
attachment:tor_vm_fast_upload.png

Changed 5 months ago by Vort

Attachment: tor_vm_fast_upload.png added

comment:98 Changed 5 months ago by cypherpunks

Tested NAT-ed guest Win7 with "BwTest.3.exe -c ip_of_local_host -u" (modified to report SNDBUF at start and later). When NAT-ed, it connects to the host's "bwtest -s -d" over the "lo" interface, so I added a delay to "lo" and got a 500 ms ping. It uploads faster than the usual 53 KiB/s, but the reported SNDBUF == 8KB anyway. In the guest's Wireshark, the ACKs have no delays. WAN latency doesn't affect TCP data in NAT mode: the guest communicates internally with the VM process on the host, without the host's TCP stack, so all ACKs appear immediately, the peer's tcp_window differs, etc. The VM process behaves like a proxy for the guest in NAT mode.

comment:99 Changed 5 months ago by Vort

Tested NAT-ed guest Win7 with "BwTest.3.exe -c ip_of_local_host -u"

There may be a lot of reasons why localhost tests give incorrect results.

For example:

Tested with latency 1, 5, 10: it always wants 2M (probably depends on the maximum size of system memory).

I've made a test with a real Win7<->Win7 100 Mbit/s WiFi link (pings range from 1 ms to 100 ms and higher, but usually they are between 1 ms and 10 ms).
And BwTest -c ip -u -t did not raise ISB beyond 65536.

comment:100 Changed 5 months ago by cypherpunks

There may be a lot of reasons why localhost tests give incorrect results.

From manual for VirtualBox:

The network frames sent out by the guest operating system are received by VirtualBox’s NAT engine, which extracts the TCP/IP data and resends it using the host operating system. To an application on the host, or to another computer on the same network as the host, it looks like the data was sent by the VirtualBox application on the host, using an IP address belonging to the host. VirtualBox listens for replies to the packages sent, and repacks and resends them to the guest machine on its private network.

Last edited 5 months ago by cypherpunks

comment:101 Changed 5 months ago by Vort

From manual for VirtualBox:

This just means that NAT is as bad as localhost for testing.
But the isolated network gives different results too :/

comment:102 Changed 5 months ago by cypherpunks

Maybe your uplink has some accelerator installed that behaves like a transparent proxy for outgoing connections? Can you inspect your Tor client traffic to a high-latency peer to find out how fast the answers arrive?

comment:103 Changed 5 months ago by Vort

Can you inspect your Tor client traffic to a high-latency peer to find out how fast the answers arrive?

Strange things happen.

This is the same upload test that I used in the VM with NAT, but run on the host.
Selected guards have similar pings.
But for E555B09C I am getting 67 KiB/s (instead of 454 KiB/s for VM test), for 13B2354C - 462 KiB/s:
attachment:upload_difference.png

Changed 5 months ago by Vort

Attachment: upload_difference.png added

comment:104 Changed 5 months ago by cypherpunks

But for E555B09C I am getting 67 KiB/s (instead of 454 KiB/s for VM test), for 13B2354C - 462 KiB/s:

Look at Len=8192: it seems your upload is good when https://en.wikipedia.org/wiki/Large_segment_offload works, but it doesn't work reliably?

comment:105 Changed 5 months ago by cypherpunks

upload is good when

"Good" as if it were a 64KB buffer, which is the NIC buffer for offloaded data. It seems LSO (TSO) just hides the broken Windows TCP stack. Try disabling it for the network adapter to test the real Windows 7 upload speed.

comment:106 Changed 5 months ago by Vort

Do the latest results mean that

  1. Regular users also can be affected by this problem?
  2. It's not Linux, which accelerated my relay, but NAT from VirtualBox?

comment:107 Changed 5 months ago by Vort

Try disabling it for the network adapter to test the real Windows 7 upload speed.

Results are less stable, but upload to 13B2354C still can go up to 450 KiB/s:
attachment:tor_upload_without_lso.png

Changed 5 months ago by Vort

Attachment: tor_upload_without_lso.png added

comment:108 Changed 5 months ago by cypherpunks

Regular users also can be affected by this problem?

Not sure.

It's not Linux, which accelerated my relay, but NAT from VirtualBox?

Yes, a NAT-ed VM strips out the guest's TCP/IP stack, so performance depends entirely on the host in that case.

Results are less stable, but upload to 13B2354C still can go up to 450 KiB/s:

Mystery of Windows.

comment:109 Changed 5 months ago by cypherpunks

Btw, peers advertise a huge TCP window, 500K - 1M; my local test didn't show such values.

comment:110 Changed 5 months ago by cypherpunks

peers advertise a huge TCP window, 500K - 1M

And no change. How is that possible?

comment:111 Changed 5 months ago by cypherpunks

Tuned the window size: it really doesn't change with every dozen KB received if you allow it to use dozens of MB. Still, I can't get the test Win7 in the VM to send more bytes.

comment:112 Changed 5 months ago by cypherpunks

It's not Linux, which accelerated my relay, but NAT from VirtualBox?

According to the manual, VirtualBox also tunes the buffers itself, effectively disabling host optimizations, if any exist.

The VirtualBox NAT stack performance is often determined by its interaction with the host’s TCP/IP stack and the size of several buffers ( SO_RCVBUF and SO_SNDBUF ). For certain setups users might want to adjust the buffer size for a better performance. This can be achieved using the following commands (values are in kilobytes and can range from 8 to 1024)
Each of these buffers has a default size of 64KB.

comment:113 Changed 5 months ago by Vort

Here is the test for WSAIoctl/setsockopt performance.
BwTest.exe -s -u -t was started on a Windows 7 computer with a real 100 Mbit/s, 1 ms link.
Yet again, ISB was equal to 65536 for the whole test.
Result: the WSAIoctl and setsockopt functions use 0.25% of CPU resources, compared to other WS2_32 functions.
attachment:isb_performance.png

Changed 5 months ago by Vort

Attachment: isb_performance.png added

comment:114 Changed 5 months ago by cypherpunks

Yet again, ISB was equal to 65536 for the whole test.

Yes, ISB really works correctly with real peers. I tested auto-tuning of SNDBUF on an average Linux: it raises the buffer in a similar way and never shrinks it later. ISB is the way to fix broken upload on Windows, except that the maximum value should maybe be limited for relays: 4MB is a way to crash a relay (or to DoS the entire system) with thousands of connections.
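
A rough sketch of the idea, not the attached patch: query the ideal send backlog with WSAIoctl and apply it as SO_SNDBUF, limited by a cap. The cap value, the macro SKETCH_MAX_SNDBUF and both function names are hypothetical; the fallback definition of SIO_IDEAL_SEND_BACKLOG_QUERY is only for older headers that lack it. The Windows part is guarded so the helper still compiles elsewhere.

```c
#ifdef _WIN32
#include <winsock2.h>
#include <ws2tcpip.h>
#ifndef SIO_IDEAL_SEND_BACKLOG_QUERY
#define SIO_IDEAL_SEND_BACKLOG_QUERY 0x4004747b
#endif
#endif

/* Hypothetical cap: the thread only says that 4MB per socket is too
 * much for a relay with thousands of connections. */
#define SKETCH_MAX_SNDBUF (1024UL * 1024UL)

/* Pick the size to request: the ideal backlog, limited by the cap. */
unsigned long capped_sndbuf(unsigned long isb, unsigned long cap)
{
  return isb < cap ? isb : cap;
}

#ifdef _WIN32
/* Ask the TCP stack for its ideal send backlog and apply it as
 * SO_SNDBUF. Returns 0 on success, -1 if either call fails. */
int apply_ideal_send_backlog(SOCKET s)
{
  ULONG isb = 0;
  DWORD returned = 0;
  int size;
  if (WSAIoctl(s, SIO_IDEAL_SEND_BACKLOG_QUERY, NULL, 0,
               &isb, sizeof(isb), &returned, NULL, NULL) != 0)
    return -1;
  size = (int)capped_sndbuf(isb, SKETCH_MAX_SNDBUF);
  if (setsockopt(s, SOL_SOCKET, SO_SNDBUF,
                 (const char *)&size, sizeof(size)) != 0)
    return -1;
  return 0;
}
#endif
```

Because ISB changes over the life of a connection, such a function would have to be called repeatedly (for example after writes), not just once at connect time.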

comment:115 Changed 5 months ago by Vort

except that the maximum value should maybe be limited for relays: 4MB is a way to crash a relay (or to DoS the entire system) with thousands of connections

It would be good to create a test with thousands of connections and different buffer sizes and see how much RAM it will actually use (with RamMap, for example)

comment:116 Changed 5 months ago by cypherpunks

It would be good to create a test with thousands of connections and different buffer sizes and see how much RAM it will actually use (with RamMap, for example)

I believe it depends on the actual data: requesting 4M of SO_SNDBUF doesn't lead to 4M of usage. You need to fill the send queue with real data. So memory usage depends on the peers. Someone could exploit that by filling system memory while your process isn't aware of it, or something could go wrong for other reasons.

comment:117 Changed 5 months ago by Vort

You need to fill the send queue with real data.

And the amount of that data depends on the relay's bandwidth (or even on RelayBandwidthRate).
These amounts will be megabytes, not gigabytes.

comment:118 Changed 4 months ago by Vort

Here is my attempt to fix this problem:
attachment:tor_windows_upload_hackfix_v1.patch

It breaks the build, but allows producing a tor.exe for testing.

comment:119 Changed 4 months ago by Vort

I have made a test with Windows 8.1 (6.3.9600) and WANem at 40Mbit/150ms.
The maximum speeds for BwTest -s -u and BwTest -s -u -t turned out to be the same: 4475 KiB/s.

comment:120 Changed 4 months ago by cypherpunks

Here is my attempt to fix this problem

SO_SNDBUF accepts an int, while SIO_IDEAL_SEND_BACKLOG_QUERY returns a ULONG; there is no difference if sizeof(int)==sizeof(long). What will happen on win x128 or in some other special case: can setsockopt parse such input correctly?

comment:121 Changed 4 months ago by Vort

What will happen on win x128 or in some other special case

I guess they will not change it:
An unsigned LONG. The range is 0 through 4294967295 decimal.
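
If the type mismatch is a concern, one way to sidestep it is to convert explicitly and pass setsockopt a real int. A sketch, using uint32_t as a stand-in for ULONG (which matches the quoted range); the function name is hypothetical:

```c
#include <limits.h>
#include <stdint.h>

/* Convert a 32-bit unsigned backlog value to the int that SO_SNDBUF
 * expects, saturating at INT_MAX instead of wrapping to a negative
 * value. */
int isb_to_sockopt_int(uint32_t isb)
{
  return isb > (uint32_t)INT_MAX ? INT_MAX : (int)isb;
}
```

The caller would then pass the address of the resulting int together with sizeof(int) to setsockopt, so the option length always matches the type.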

comment:122 Changed 4 months ago by Vort

New iteration of the patch: attachment:tor_windows_upload_fix_v2.patch

  • I have found a better place for the update_send_buffer_size function.
  • A version check was added.

comment:123 Changed 4 months ago by Vort

Status: new → needs_review

I think this patch is ready for review.
Questions:

  1. Is it OK to compare result >= 0, or should it be result > 0?
  2. Maybe non-encrypted connections require an update_send_buffer_size call too?
  3. What is a good description for the function? Would "Fix slow upload for Windows Vista and Windows 7 (bug #22798)" be enough?
  4. Is this code compatible enough with C compilers?
