[warn] channelpadding_compute_time_until_pad_for_netflow(): Bug: Channel padding timeout scheduled 164729ms in the past. Did the monotonic clock just jump?
Normal Tor client, running Tor 0.3.1.0-alpha-dev 2e4f3b36bdfd5118, got this log line:
May 10 02:50:11.217 [warn] channelpadding_compute_time_until_pad_for_netflow(): Bug: Channel padding timeout scheduled 164729ms in the past. Did the monotonic clock just jump? (on Tor 0.3.1.0-alpha-dev 2e4f3b36bdfd5118)
My tor client had been idle for hours before this point, and the log message coincides with a new request that I made. I'm pretty sure there was no clock jump.
Alas, I have no more detailed logs. I'll try to reproduce with logs.
Hrmm. Noting some thoughts while we wait for more info or other instances: this could have happened if Tor stopped running its once-per-second housekeeping while you were reconnecting.
That large a delta suggests that a padding time was chosen for the channel, but a timer wasn't scheduled yet. Then the channel was forgotten about for roughly two minutes, until the connection housekeeping started running again and we tried to schedule the padding timer, only to find that it should have fired two minutes ago...
Hrmm. So in this log it looks like circuit construction delayed the connection housekeeping/run_scheduled_events() for a few seconds. That delay meant the padding time elapsed before channelpadding_decide_to_pad_channel() was called, so the time was already in the past by the time it ran. See also https://trac.torproject.org/projects/tor/ticket/16585
I am not sure what to do about this. I could change the code to always schedule an actual timer right away and cancel it if data is sent, instead of waiting around until we get within a second of the padding time before scheduling a timer. This may mean more timer churn, but now that the timers are more efficient, perhaps that is not an issue.
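For illustration, the schedule-and-cancel idea could look roughly like the sketch below. This is only a sketch against Tor's tor_timer_t API (timer_new()/timer_schedule()/timer_disable()); the helper names are invented here, and it assumes the channel keeps its timer in chan->padding_timer.

{{{
/* Sketch only: schedule a real timer as soon as the padding time is
 * chosen, and cancel it if real data gets sent first. The helper names
 * are invented for illustration. */
static void
padding_timer_cb(tor_timer_t *timer, void *arg)
{
  channel_t *chan = arg;
  (void)timer;
  /* The timer fired: send the netflow padding cell now. */
  channelpadding_send_padding_cell_for_callback(chan);
}

static void
schedule_padding_timer_now(channel_t *chan, uint64_t ms_until_pad)
{
  struct timeval tv;
  tv.tv_sec = (time_t)(ms_until_pad / 1000);
  tv.tv_usec = (int)((ms_until_pad % 1000) * 1000);

  if (!chan->padding_timer)
    chan->padding_timer = timer_new(padding_timer_cb, chan);
  /* Arm the timer right away instead of waiting for the per-second
   * housekeeping to get within a second of the padding time. */
  timer_schedule(chan->padding_timer, &tv);
}

static void
cancel_padding_timer(channel_t *chan)
{
  /* Called whenever real traffic is queued on the channel. */
  if (chan->padding_timer)
    timer_disable(chan->padding_timer);
}
}}}

The tradeoff is exactly the timer churn mentioned above: every idle channel would keep a live timer armed, instead of only the ones within a second of their padding time.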
Or, we could demote the log message to notice.
In terms of when padding actually gets sent, I'm guessing it still won't happen any earlier with the schedule-and-cancel timer approach, because if this is #16585 (moved), libevent wouldn't call us back any sooner anyway.
This is making me think that we should just demote this message and not increase the code complexity. Any objections?
I got that message 3 times within the last 48 hours with 0.3.1.3-alpha, e.g.:
Jun 16 15:02:36.000 [warn] channelpadding_compute_time_until_pad_for_netflow(): Bug: Channel padding timeout scheduled 39ms in the past. Did the monotonic clock just jump? (on Tor 0.3.1.3-alpha dc47d936d47ffc25)
What were the numbers you got for the other two times? (Apparently the numbers will help Mike guess about the bug.)
Ok, I demoted this to notice if the callback is over 10s late, and to info if it is less than that. I don't think there is anything more we can do here, until we're either multi-threaded or we have a solution for #16585 (moved).
See mikeperry/bug22212 (7914091445e076871bc9fdd41f8304a9b3af8e8c).
> What were the numbers you got for the other two times? (Apparently the numbers will help Mike guess about the bug.)
{{{
mr-fox linux # grep channel /tmp/warn.log | cut -f2- -d')' | cut -f1 -d'.'
: Bug: Channel padding timeout scheduled 39ms in the past
: Bug: Channel padding timeout scheduled 1396ms in the past
: Bug: Channel padding timeout scheduled 25ms in the past
: Bug: Channel padding timeout scheduled 516ms in the past
: Bug: Channel padding timeout scheduled 395ms in the past
: Bug: Channel padding timeout scheduled 979ms in the past
: Bug: Channel padding timeout scheduled 25ms in the past
: Bug: Channel padding timeout scheduled 17ms in the past
: Bug: Channel padding timeout scheduled 114ms in the past
: Bug: Channel padding timeout scheduled 1163ms in the past
}}}
Ok, I added a changes file, and refactored that check. I also lowered the time limit for the notice line to 4.5 seconds, which is the point at which we start to risk allowing a new netflow record to get emitted.
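In other words, the severity now depends on how late the callback fired. A minimal sketch of that selection (the helper name is invented, not the actual channelpadding.c code; 4500 ms is the 4.5 second cutoff just mentioned):

{{{
/* Illustrative helper only, not the actual channelpadding.c code:
 * pick the log severity from how late the padding callback fired. */
static int
padding_lateness_severity(int64_t ms_late)
{
  /* Past ~4.5s of lateness, the idle gap may already be long enough for
   * a middlebox to emit a new netflow record, so that case is worth a
   * notice; smaller slips are routine scheduling jitter, info at most. */
  if (ms_late > 4500)
    return LOG_NOTICE;
  return LOG_INFO;
}
}}}

The message itself would then go out via something like log_fn(padding_lateness_severity(ms_late), LD_OR, ...).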
FWIW, I got this warning on a chutney network after I suspended my laptop for a few seconds while the network was running. Unfortunately, I've lost the logs but I'm pretty sure that the branch included the fix above.
I added a 2nd Tor exit relay to my machine, listening on ports 9001 and 9030, both on IPv4.
The new fingerprint is 6EABEBF38CE7E3DF672C4DB01383606FE3EB2215 (the old one is 1AF72E8906E6C49481A791A6F8F84F8DFEBBB2BA).
Both were started at the same time after rebooting the machine, and their notice logs are nearly identical, except for the message below from the old, established exit relay, which I got a few minutes ago:
Aug 26 10:27:48.000 [notice] Received reload signal (hup). Reloading config and resetting internal state.
Aug 26 10:27:48.000 [notice] Read configuration file "/etc/tor/torrc".
Aug 26 10:27:48.000 [notice] Tor 0.3.1.5-alpha (git-83389502ee631465) opening log file.
Aug 26 10:27:49.000 [notice] Channel padding timeout scheduled 231988ms in the past.
The new exit (just a few days old) didn't show this behavior.
Two lines is not a whole lot of context here. It's also not clear if these are all happening after a SIGHUP, after a fresh start of Tor, or both/either?
That said, if this happens right after a fresh start of Tor, then this may be caused by us not setting chan->timestamp_xfer_ms on channels before that code is reached. Odd that this is suddenly happening now, though. Perhaps some change in the channel code is causing us to not set that field in some cases? I can look into that, I guess.
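To illustrate why an unset (or stale) chan->timestamp_xfer_ms would produce exactly this kind of message, assuming the deadline is derived from timestamp_xfer_ms plus the freshly chosen timeout (as the analysis later in this ticket spells out), with made-up numbers:

{{{
/* Made-up numbers, just to show the arithmetic: a timestamp_xfer_ms that
 * never got updated (or was last touched minutes ago) turns directly into
 * a padding deadline that is already far in the past. */
uint64_t now_ms             = 240000; /* coarse monotonic clock, in ms */
uint64_t timestamp_xfer_ms  = 3000;   /* stale: last touched ~4 minutes ago */
uint64_t padding_timeout_ms = 5000;   /* freshly chosen ~5s padding timeout */

uint64_t deadline_ms = timestamp_xfer_ms + padding_timeout_ms; /* 8000 */
uint64_t late_by_ms  = now_ms - deadline_ms;                   /* 232000 */
/* => "Channel padding timeout scheduled 232000ms in the past." */
}}}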
Same here on a non-exit relay (B143D439B72D239A419F8DCE07B8A8EB1B486FA7) freshly upgraded to 0.3.1.7, also on a host with NTP.
The first occurrence was about 25 minutes after relay start:
Oct 25 18:06:47.000 [notice] Channel padding timeout scheduled 230620ms in the past.
Oct 25 18:43:39.000 [notice] Channel padding timeout scheduled 216408ms in the past.
Oct 25 19:31:04.000 [notice] Channel padding timeout scheduled 242298ms in the past.
Oct 25 20:39:16.000 [notice] Channel padding timeout scheduled 540349ms in the past.
Oct 25 23:11:53.000 [notice] Channel padding timeout scheduled 221720ms in the past.
Putting this in 0.3.2 so we don't miss it. This notice is quite common in normal relay logs; I get maybe 20-30 a day on a fast relay.
Shouldn't we demote it to info? I'm not sure this is worth telling the operator about, since there's nothing they can do about it, other than perhaps syncing their clock.
Trac: Milestone: Tor: 0.3.1.x-final to Tor: 0.3.2.x-final
> Shouldn't we demote it to info? I'm not sure this is worth telling the operator about, since there's nothing they can do about it, other than perhaps syncing their clock.
I'm a bit nervous about quieting it -- it seems there should be a real bug here, right? My relay has lines like
Dec 21 14:19:54.371 [notice] Channel padding timeout scheduled 617944ms in the past.
Dec 22 16:22:23.034 [notice] Channel padding timeout scheduled 209517ms in the past.
Dec 28 14:18:18.271 [notice] Channel padding timeout scheduled 165914ms in the past.
Jan 06 16:16:11.495 [notice] Channel padding timeout scheduled 186827ms in the past.
Jan 06 17:57:23.067 [notice] Channel padding timeout scheduled 141687ms in the past.
Jan 06 17:57:23.068 [notice] Channel padding timeout scheduled 254534ms in the past.
and I'm quite sure it isn't just freezing for 10+ minutes at a time.
How is Tor setting its timeouts for so far in the past?
Jan 13 02:51:31.465 [info] channelpadding_send_padding_cell_for_callback(): Sending netflow keepalive on 3895 to 99.107.176.235:57814 (988E6BEC7AB2451AA66D2F9B2A9FA2A928C9795C) after 5058 ms. Delta 6ms
Jan 13 02:51:38.308 [info] channelpadding_send_padding_cell_for_callback(): Sending netflow keepalive on 3895 to 99.107.176.235:57814 (988E6BEC7AB2451AA66D2F9B2A9FA2A928C9795C) after 6843 ms. Delta 20ms
Jan 13 02:51:42.244 [info] channelpadding_send_padding_cell_for_callback(): Sending netflow keepalive on 3895 to 99.107.176.235:57814 (988E6BEC7AB2451AA66D2F9B2A9FA2A928C9795C) after 3938 ms. Delta 4ms
Jan 13 02:54:28.603 [notice] Channel padding timeout scheduled 159796ms in the past.
Jan 13 02:54:28.603 [info] channelpadding_send_padding_cell_for_callback(): Sending netflow keepalive on 3895 to 99.107.176.235:57814 (988E6BEC7AB2451AA66D2F9B2A9FA2A928C9795C) after 166359 ms. Delta 159798ms
Jan 13 02:54:33.863 [info] channelpadding_send_padding_cell_for_callback(): Sending netflow keepalive on 3895 to 99.107.176.235:57814 (988E6BEC7AB2451AA66D2F9B2A9FA2A928C9795C) after 5260 ms. Delta 6ms
Jan 13 02:54:43.120 [info] channelpadding_send_padding_cell_for_callback(): Sending netflow keepalive on 3895 to 99.107.176.235:57814 (988E6BEC7AB2451AA66D2F9B2A9FA2A928C9795C) after 9258 ms. Delta 10ms
Jan 13 02:54:50.978 [info] channelpadding_send_padding_cell_for_callback(): Sending netflow keepalive on 3895 to 99.107.176.235:57814 (988E6BEC7AB2451AA66D2F9B2A9FA2A928C9795C) after 7860 ms. Delta 6ms
I wonder if my conn to this relay was full of stuff that was trying to flush, during this 160s that surprised the netflow padder. That is, maybe if the outbuf-or-whatever never goes empty, we're never updating the "time to next netflow" timer? And it finally does go empty, and the netflow timer says "omg why did nobody call me until now?"
Ingredient one:
channel_timestamp_active(), channel_timestamp_recv(), and channel_timestamp_xmit() are the only functions that update chan->timestamp_xfer. They get called when we add stuff to our outbuf or read stuff from our inbuf, but they never get called when we flush our outbuf, that is, when we actually wrote stuff onto the network.
Ingredient two:
channelpadding_send_padding_cell_for_callback(), if called when chan->has_queued_writes(chan) is true (i.e. we still have stuff on the outbuf trying to flush), sets chan->next_padding_time back to 0.
Ingredient three:
channelpadding_compute_time_until_pad_for_netflow(), if chan->next_padding_time is zero, sets it to chan->timestamp_xfer + the freshly chosen padding_timeout. It then promptly checks whether that time is surprisingly far in the past, and warns if it is.
So if it's been 160 seconds since we read any new cells or queued any new cells onto the chan (say, because we've spent the whole time flushing queued stuff, hadn't finished flushing until now, and the other side didn't have anything to say), then timestamp_xfer is going to be 160 seconds in the past, and the newly computed next_padding_time will be slightly less than 160 seconds in the past. That causes us to log_notice and then send a padding cell right then -- when in reality we probably flushed on that chan way more recently than 160 seconds ago.
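Putting the three ingredients together, here is a condensed sketch of that sequence with a toy struct; this is not the actual channel_t or channelpadding.c code, just the logic described above:

{{{
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/* Toy stand-in for the fields discussed above; not the real channel_t. */
struct toy_chan {
  uint64_t timestamp_xfer;    /* last time we queued or read cells (ms) */
  uint64_t next_padding_time; /* 0 means "no padding deadline chosen" */
  int has_queued_writes;      /* outbuf still trying to flush? */
};

static void
toy_decide_to_pad(struct toy_chan *chan, uint64_t now_ms,
                  uint64_t padding_timeout_ms)
{
  /* Ingredient two: if the callback runs while the outbuf is still
   * draining, the old deadline is discarded. */
  if (chan->has_queued_writes) {
    chan->next_padding_time = 0;
    return;
  }
  /* Ingredient three: a fresh deadline is computed from timestamp_xfer,
   * which (ingredient one) was never updated while we were flushing. */
  if (chan->next_padding_time == 0)
    chan->next_padding_time = chan->timestamp_xfer + padding_timeout_ms;

  /* If timestamp_xfer is ~160s stale, the "new" deadline is already far
   * in the past, so we complain and pad immediately. */
  if (chan->next_padding_time < now_ms)
    printf("Channel padding timeout scheduled %" PRIu64 "ms in the past.\n",
           now_ms - chan->next_padding_time);
}

int main(void)
{
  /* A channel whose last queue/read was 160s ago and that only just
   * finished flushing its outbuf. */
  struct toy_chan chan = { .timestamp_xfer = 5000, .next_padding_time = 0,
                           .has_queued_writes = 0 };
  toy_decide_to_pad(&chan, 165000 /* now */, 5000 /* ~5s timeout */);
  /* Prints: Channel padding timeout scheduled 155000ms in the past. */
  return 0;
}
}}}

Nothing in that sequence involves a clock jump; it's just the stale timestamp leaking into the freshly computed deadline.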
Is the fix to call channel_timestamp_active() from connection_or_finished_flushing()? What is chan->timestamp_active for, anyway?
There's a good argument to be made for instead calling channel_timestamp_active() from connection_or_flushed_some(), rather than only when we finish flushing. Since it is "active", for some notion of active.
In case this helps, "Channel padding timeout scheduled XXXXXXms in the past." also appears with Tor 0.3.2.9:
Jan 09 21:10:25.461 [notice] Tor 0.3.2.9 (git-9e8b762fcecfece6) running on Linux with Libevent 2.1.8-stable, OpenSSL 1.1.1-dev, Zlib 1.2.8, Liblzma 5.1.0alpha, and Libzstd 1.3.2.
.../...
Jan 14 22:45:33.000 [notice] Channel padding timeout scheduled 182518ms in the past.
Jan 15 01:59:48.000 [notice] Heartbeat: Tor's uptime is 5 days 3:59 hours, with 6126 circuits open. I've sent 324.43 GB and received 309.87 GB.
Jan 15 01:59:48.000 [notice] Circuit handshake stats since last time: 2349/2349 TAP, 866755/867105 NTor.
Jan 15 01:59:48.000 [notice] Since startup, we have initiated 0 v1 connections, 0 v2 connections, 0 v3 connections, and 109957 v4 connections; and received 182 v1 connections, 3720 v2 connections, 6311 v3 connections, and 458930 v4 connections.
Jan 15 05:59:48.000 [notice] Heartbeat: Tor's uptime is 5 days 7:59 hours, with 5870 circuits open. I've sent 336.13 GB and received 321.51 GB.
Jan 15 05:59:48.000 [notice] Circuit handshake stats since last time: 1602/1602 TAP, 811672/811672 NTor.
Jan 15 05:59:48.000 [notice] Since startup, we have initiated 0 v1 connections, 0 v2 connections, 0 v3 connections, and 112952 v4 connections; and received 182 v1 connections, 3835 v2 connections, 6599 v3 connections, and 475508 v4 connections.
Jan 15 06:22:19.000 [notice] Channel padding timeout scheduled 221453ms in the past.
Jan 15 09:59:48.000 [notice] Heartbeat: Tor's uptime is 5 days 11:59 hours, with 7482 circuits open. I've sent 351.16 GB and received 336.47 GB.
Jan 15 09:59:48.000 [notice] Circuit handshake stats since last time: 2110/2110 TAP, 841999/841999 NTor.
Jan 15 09:59:48.000 [notice] Since startup, we have initiated 0 v1 connections, 0 v2 connections, 0 v3 connections, and 116107 v4 connections; and received 183 v1 connections, 4016 v2 connections, 6939 v3 connections, and 493174 v4 connections.
Jan 15 12:26:07.000 [notice] Channel padding timeout scheduled 263389ms in the past.
Jan 15 13:59:48.000 [notice] Heartbeat: Tor's uptime is 5 days 15:59 hours, with 8796 circuits open. I've sent 363.28 GB and received 348.51 GB.
Jan 15 13:59:48.000 [notice] Circuit handshake stats since last time: 2072/2072 TAP, 750040/750040 NTor.
Jan 15 13:59:48.000 [notice] Since startup, we have initiated 0 v1 connections, 0 v2 connections, 0 v3 connections, and 119247 v4 connections; and received 184 v1 connections, 4184 v2 connections, 7267 v3 connections, and 509112 v4 connections.
For those who want to try the patch, here is what I'm running currently:
diff --git a/src/or/connection_or.c b/src/or/connection_or.c
index dadfdc4..ecc1ae4 100644
--- a/src/or/connection_or.c
+++ b/src/or/connection_or.c
@@ -603,6 +603,9 @@ connection_or_flushed_some(or_connection_t *conn)
 {
   size_t datalen;
+  /* Update the channel's active timestamp */
+  channel_timestamp_active(TLS_CHAN_TO_BASE(conn->chan));
+
   /* The channel will want to update its estimated queue size */
   channel_update_xmit_queue_size(TLS_CHAN_TO_BASE(conn->chan));
It is, alas, a rare log message for me, so it's hard to tell quickly whether I have resolved the issue. Maybe somebody who is getting the log message more often than I am can try it too?
This fix looks good to me. It also seems backport-safe, since in the worst case these timestamp updates will be redundant with the channel-layer ones, and in the best case they will cover the areas we missed.