Opened 3 years ago

Last modified 22 months ago

#20423 reopened defect

Clock jumps on relay due to hostname lookup timeouts

Reported by: Felixix Owned by:
Priority: High Milestone: Tor: unspecified
Component: Core Tor/Tor Version: Tor: 0.2.9.10
Severity: Normal Keywords: regression, 029-backport, 032-unreached
Cc: Actual Points:
Parent ID: #21789 Points: 1
Reviewer: Sponsor:

Description

The security update (Tor v0.2.8.9 running on FreeBSD with Libevent
2.0.22-stable, OpenSSL LibreSSL 2.4.3 and Zlib 1.2.8.) shows the
following log entry each hour:

Oct 19 02:51:07.000 [warn] Your system clock just jumped 136 seconds
forward; assuming established circuits no longer work.
Oct 19 02:51:07.000 [warn] assign_to_cpuworker failed. Ignoring.
...
Oct 19 02:51:07.000 [warn] assign_to_cpuworker failed. Ignoring.
Oct 19 02:51:15.000 [notice] Tor has successfully opened a circuit.
Looks like client functionality is working.
...
Oct 19 03:51:10.000 [warn] Your system clock just jumped 138 seconds
forward; assuming established circuits no longer work.
Oct 19 03:51:11.000 [warn] assign_to_cpuworker failed. Ignoring.
...
Oct 19 04:50:37.000 [warn] Your system clock just jumped 105 seconds
forward; assuming established circuits no longer work.
Oct 19 04:50:37.000 [warn] assign_to_cpuworker failed. Ignoring.
...
Oct 19 05:51:14.000 [warn] Your system clock just jumped 142 seconds
forward; assuming established circuits no longer work.
Oct 19 05:51:15.000 [warn] assign_to_cpuworker failed. Ignoring.
...

The warning first appeared on 2.8.7 after update on September 13th (Tor
v0.2.8.7 running on FreeBSD with Libevent 2.0.22-stable, OpenSSL
LibreSSL 2.4.2 and Zlib 1.2.8.). That time I switched back (Tor v0.2.7.6
running on FreeBSD with Libevent 2.0.22-stable, OpenSSL LibreSSL 2.4.2
and Zlib 1.2.8.) and the warning disappeared.


Lately those warnings pop up:

2016-10-21_09-03-15_noticesJ5.log:Oct 19 13:39:55.000 [warn] assign_to_cpuworker failed. Ignoring.
2016-10-21_09-03-15_noticesJ5.log:Oct 19 13:40:00.000 [warn] circuit_mark_for_close_: Bug: Duplicate

call to circuit_mark_for_close at src/or/onion.c:182 (first at src/or/command.c:559) (on Tor 0.2.8.9 )
2016-10-21_09-03-15_noticesJ5.log:Oct 19 14:39:55.000 [warn] Your system clock just jumped 151 seconds

forward; assuming established circuits no longer work.
2016-10-21_09-03-15_noticesJ5.log:Oct 19 14:39:55.000 [warn] assign_to_cpuworker failed. Ignoring.

2016-10-21_09-03-15_noticesJ5.log:Oct 19 22:39:51.000 [warn] assign_to_cpuworker failed. Ignoring.
2016-10-21_09-03-15_noticesJ5.log:Oct 19 22:39:56.000 [warn] circuit_mark_for_close_: Bug: Duplicate

call to circuit_mark_for_close at src/or/onion.c:182 (first at src/or/command.c:559) (on Tor 0.2.8.9 )
2016-10-21_09-03-15_noticesJ5.log:Oct 19 22:39:56.000 [warn] circuit_mark_for_close_: Bug: Duplicate

call to circuit_mark_for_close at src/or/onion.c:182 (first at src/or/command.c:559) (on Tor 0.2.8.9 )
2016-10-21_09-03-15_noticesJ5.log:Oct 19 22:39:56.000 [warn] circuit_mark_for_close_: Bug: Duplicate

call to circuit_mark_for_close at src/or/onion.c:182 (first at src/or/command.c:559) (on Tor 0.2.8.9 )
2016-10-21_09-03-15_noticesJ5.log:Oct 19 23:40:02.000 [warn] Your system clock just jumped 158 seconds

forward; assuming established circuits no longer work.
2016-10-21_09-03-15_noticesJ5.log:Oct 19 23:40:02.000 [warn] assign_to_cpuworker failed. Ignoring.


Posted in tor-relays:
https:// lists.torproject.org/pipermail/tor-relays/2016-October/010727.html

A similar warning is reported in tor-talk:
https:// lists.torproject.org/pipermail/tor-talk/2016-October/042425.html


There seems to be no impact by LibreSSL. Here the Tor configurations of the past:

tor-2.6.9-bin libre-2.2.3-bin libevent2-bin => ok

tor-2.6.10_1-src libre-2.2.5-src libevent2-bin => ok

tor-2.7.6-src libre-2.2.5-src libevent2-bin => ok

tor-2.7.6-src libre-2.3.4-src libevent2-bin => ok

tor-2.8.7-src libre-2.4.2-src libevent2-bin => warn

tor-2.7.6_1-src libre-2.4.2-src libevent2-src => ok

tor-2.8.9-src libre-2.4.3-src libevent2-src(libre) => warn

tor-2.8.9-src openSSL-1.0.2j-src libevent2-src(libre) => warn

tor-2.8.9-src openSSL-1.0.2j-src libevent2-src(libre) => warn

tor-2.8.9-src openSSL-1.0.2j-src libevent2-src(openssl) => warn


Hourly a block of up to hundred and more warnings with mostly the identical timestamp shows up.
The warning comes on 7 of my 7 relays. May-be the setup is specific on my side (same for all relays):

  • Freebsd 10.1
  • Tor in Jail
  • Jail by RDR/NAT, RDR with PF filter rule
  • No changes during the transition from 2.6.x until 2.8.7

How can the warning be avoided ?

Get circuits lost (assuming established circuits no longer work) ?

Child Tickets

Attachments (1)

make-test-freebsd-10.3.log (32.7 KB) - added by s7r 3 years ago.

Download all attachments as: .zip

Change History (40)

comment:1 Changed 3 years ago by dgoulet

Component: User ExperienceCore Tor/Tor
Keywords: regression freebsd added; assign_to_cpuworker circuit_mark_for_close_ removed
Milestone: Tor: 0.2.9.x-final
Priority: MediumHigh
Version: Tor: 0.2.8.7Tor: 0.2.8.9

comment:2 Changed 3 years ago by Felixix

Some more informations

"assign_to_cpuworker failed" is not connected to Heartbeat.
Change the heartbeat periode to 1.5h but still "assign_to_cpuworker failed" is hourly.

A reload (hup) forces "assign_to_cpuworker failed" warnings.

Observation (log notice and top)
Oct 23 16:51:25.000 [warn] assign_to_cpuworker failed. Ignoring.
...
Oct 23 16:51:25.000 [warn] assign_to_cpuworker failed. Ignoring.
17:48 Tor process at 90% CPU
17:49 Tor process at 43% CPU
17:50 Tor process at 7% CPU
17:51 Tor process at 5% CPU
Oct 23 17:51:22.000 [warn] Your system clock just jumped 140 seconds forward; assuming established circuits no longer work.
Oct 23 17:51:22.000 [warn] assign_to_cpuworker failed. Ignoring.
...
Oct 23 17:51:22.000 [warn] assign_to_cpuworker failed. Ignoring.
17:52 Tor process at 68% CPU
17:53 Tor process at 71% CPU
17:54 Tor process at 77% CPU
17:55 Tor process at 79% CPU

  • 58 minutes performing well
  • 140 secs doing whatever and not pushing data
  • After ready with doing whatever back with warn clock and warn cpuworker

Openntpd with 8 servers and all with 1500-1600 secs query interval.

Taking a 5 minutes log info shows (force warning by hup; then back to notice level):
Oct 23 20:15:25.000 [info] assign_onionskin_to_cpuworker: circ->p_chan gone. Failing circ.
Oct 23 20:15:25.000 [warn] assign_to_cpuworker failed. Ignoring.

comment:3 Changed 3 years ago by nickm

I think that the root problem here is whatever causes Tor to stall and overload, eating a bunch of CPU. That would cause all of:

  • The "clock jumped X seconds forward" warning
  • The remote side to close the circuit and its channel already because we took too long...
  • ...leading to the duplicate close warning when we try to close it for being too old...
  • ...and/or leading to assign_to_cpuworker() failing because the channel isn't there any more.

I'm not sure whether this stalling problem is new in 028 or not, though: it could be that it was present previously, but that we weren't seeing it, because of the changes that 028 made to how we close circuits in #17218.

Did you see "Your system clock just jumped X seconds forward" warnings with older versions of Tor? If not, we should try to figure out why those started.

Additionally, we should clean up the warnings for assign_to_cpuworker() failing (that shouldn't be a warning), and the duplicate close warning. (There should be a check for whether the circuit is marked in the loop at the end of onion_pending_add().)

comment:4 in reply to:  3 Changed 3 years ago by Felixix

Replying to nickm:

I think that the root problem here is whatever causes Tor to stall and overload, eating a bunch of CPU. That would cause all of:

  • The "clock jumped X seconds forward" warning
  • The remote side to close the circuit and its channel already because we took too long...
  • ...leading to the duplicate close warning when we try to close it for being too old...
  • ...and/or leading to assign_to_cpuworker() failing because the channel isn't there any more.

The other way around. Tor is stalling for about X seconds at 5% CPU. And 58 minutes back at 70-90%. Not the typical overload. I think no circuits are handled then.

Did you see "Your system clock just jumped X seconds forward" warnings with older versions of Tor? If not, we should try to figure out why those started.

I never had those warnings before (2.6.9 - 2.7.6, see ticket text). Roll-back from 2.8.9 (like 2.8.7 before) to 2.7.6 and warnings and X seconds are gone.

Additionally, we should clean up the warnings for assign_to_cpuworker() failing (that shouldn't be a warning), and the duplicate close warning. (There should be a check for whether the circuit is marked in the loop at the end of onion_pending_add().)

I am glad I received warnings. Otherwise I might not have noticed it.

Thanks and cheers

comment:5 Changed 3 years ago by arma

I agree with Nick that the mystery here is the clock jump -- the other things like the cpuworker line follow from that clock jump mystery.

In the past, we had issues where libevent thought time had passed when it hadn't. Those were only on weird platforms, like some of the BSDs, but here we are with one of those. Curious.

comment:6 Changed 3 years ago by palyhomp

As I recall I did not have the clock jump issue on FreeBSD 10.3, they started appear when I upgraded to FreeBSD 11-RC1.

Since updating to 0.2.9.4-alpha the clock jump issue seems to be not so dramatic. Below 300sec where they tended to be >> 1000

Sep 13 15:55:02 tor Tor[36694]: Your system clock just jumped 5203 seconds forward; assuming established circuits no longer work.
Sep 13 17:22:29 tor Tor[36694]: Your system clock just jumped 5247 seconds forward; assuming established circuits no longer work.
Sep 13 18:48:00 tor Tor[36694]: Your system clock just jumped 5131 seconds forward; assuming established circuits no longer work.
Sep 13 20:20:10 tor Tor[36694]: Your system clock just jumped 5530 seconds forward; assuming established circuits no longer work.
Sep 13 22:14:32 tor Tor[36694]: Your system clock just jumped 6862 seconds forward; assuming established circuits no longer work.
Version 0, edited 3 years ago by palyhomp (next)

comment:7 Changed 3 years ago by arma

If you log at loglevel debug for a bit, can you check if there are things that happen in between? Or is your Tor really waking up, doing some stuff, then getting distracted for 150 seconds, then waking up again?

Also, can you attach / paste your torrc here? Maybe you accidentally have 1000 onion services configured or something. :)

comment:8 Changed 3 years ago by Felixix


So far:

Tor 2.6.x and 2.7.x in a jail => ok
Tor 2.8.1, 2.8.7 and 2.8.9 in a jail => warning
Tor 2.8.9 on *host* (not in a jail) => ok


Next:

Tor 2.8.9 back to jail with some minutes debug
Tor 2.9.x-devel in a jail


My torrc:

SocksPort 0
RunAsDaemon 1
ORPort 443 NoListen
ORPort 1234 NoAdvertise
DirPort 80 NoListen
DirPort 5678 NoAdvertise
Nickname ...
RelayBandwidthRate 14000 KB
RelayBandwidthBurst 50000 KB
ContactInfo ...
MyFamily ...
ExitPolicy reject *:*
HeartbeatPeriod 3600 seconds
NUMCPUs 2
MaxMemInQueues 2 GB
Log notice file /var/log/tor/tor
EntryStatistics 1
ConnDirectionStatistics 1
DirPortFrontPage /var/mypage.html
ExtraInfoStatistics 1
DownloadExtraInfo 1


comment:9 Changed 3 years ago by arma

Summary: Warning notice log: assign_to_cpuworker failed. Ignoring.Clock jumps on FreeBSD relay

Your torrc looks reasonable.

I am now wondering if something about your bsd jail is preventing Tor from doing something, or starving it, or otherwise breaking things.

comment:10 Changed 3 years ago by arma

In particular, what changed about freebsd jails between "FreeBSD 10.3" and "FreeBSD 11-RC1"?

comment:11 in reply to:  10 Changed 3 years ago by Felixix

Replying to arma:

In particular, what changed about freebsd jails between "FreeBSD 10.3" and "FreeBSD 11-RC1"?

Please notice that "https://trac.torproject.org/projects/tor/ticket/20423#comment:6" is a different machine/operator. I am on "10.1-RELEASE-p40".

@palyhomp: You are on host or jail?

comment:12 in reply to:  7 Changed 3 years ago by Felixix

Replying to arma:

If you log at loglevel debug for a bit, can you check if there are things that happen in between? Or is your Tor really waking up, doing some stuff, then getting distracted for 150 seconds, then waking up again?


"[debug] resolve_my_address: Guessed local host name as 'jail1'" is *22924* times.

No other logs between "Nov 01 07:57:07" and "Nov 01 08:02:32". This is key.

Here is the stalling observed with "top":
2016-11-01_09-05-00: 19130 12 _tor 2 20 0 247M 222M uwait 3:03 31.98% tor
2016-11-01_09-04-00: 19130 12 _tor 2 20 0 247M 222M uwait 3:02 25.98% tor
2016-11-01_09-03-00: 19130 12 _tor 2 20 0 247M 222M uwait 3:02 37.99% tor
2016-11-01_09-02-00: 19130 12 _tor 2 20 0 247M 222M uwait 3:02 3.96% tor
2016-11-01_09-01-00: 19130 12 _tor 2 20 0 247M 222M uwait 3:02 3.96% tor
2016-11-01_09-00-00: 19130 12 _tor 2 20 0 247M 222M uwait 3:02 3.96% tor
2016-11-01_08-59-00: 19130 12 _tor 2 20 0 247M 222M uwait 3:01 1.95% tor
2016-11-01_08-58-00: 19130 12 _tor 2 20 0 247M 222M uwait 3:01 3.96% tor
2016-11-01_08-57-00: 19130 12 _tor 2 20 0 247M 222M uwait 3:01 19.97% tor
2016-11-01_08-56-00: 19130 12 _tor 2 20 0 247M 222M uwait 3:01 31.98% tor
2016-11-01_08-55-00: 19130 12 _tor 2 20 0 247M 222M uwait 3:00 23.97% tor
2016-11-01_08-54-00: 19130 12 _tor 2 20 0 247M 222M uwait 3:00 23.97% tor


Some of the log:

Nov 01 07:57:07.000 [notice] Tor 0.2.8.9 opening new log file.
Nov 01 07:57:07.000 [warn] Your log may contain sensitive information - you're logging more than "notice". Don't log unless it serves an important reason. Overwrite the log afterwards.
Nov 01 07:57:07.000 [debug] tor_disable_debugger_attach: Attemping to disable debugger attachment to Tor for unprivileged users.

...

Nov 01 07:57:07.000 [info] cell_ewma_set_scale_factor: Enabled cell_ewma algorithm because of value in CircuitPriorityHalflifeMsec in consensus; scale factor is 0.793701 per 10 seconds
Nov 01 07:57:07.000 [info] options_act: Worker-related options changed. Rotating workers.
Nov 01 07:57:07.000 [debug] configure_nameservers: stat()ing /etc/resolv.conf
Nov 01 07:57:07.000 [info] configure_nameservers: No change to '/etc/resolv.conf'
Nov 01 07:57:07.000 [debug] resolve_my_address: Guessed local host name as 'jail1'
Nov 01 07:57:07.000 [info] resolve_my_address: Could not resolve guessed local hostname 'jail1'. Trying something else.
Nov 01 07:57:07.000 [info] resolve_my_address: Learned IP address '10.0.1.1' for local interface. Using that.
Nov 01 07:57:07.000 [info] resolve_my_address: Address '<guessed from interfaces>' resolves to private IP address '10.0.1.1'. Tor servers that use the default DirAuthorities must have public IP addresses.
Nov 01 07:57:07.000 [info] router_pick_published_address: Could not determine our address locally. Checking if directory headers provide any hints.
Nov 01 07:57:07.000 [info] router_pick_published_address: Success: chose address 'mmm.yyy.iii.ppp'.
Nov 01 07:57:07.000 [debug] resolve_my_address: Guessed local host name as 'jail1'
Nov 01 07:57:07.000 [info] resolve_my_address: Could not resolve guessed local hostname 'jail1'. Trying something else.
Nov 01 07:57:07.000 [info] resolve_my_address: Learned IP address '10.0.1.1' for local interface. Using that.
Nov 01 07:57:07.000 [info] resolve_my_address: Address '<guessed from interfaces>' resolves to private IP address '10.0.1.1'. Tor servers that use the default DirAuthorities must have public IP addresses.
Nov 01 07:57:07.000 [info] router_pick_published_address: Could not determine our address locally. Checking if directory headers provide any hints.
Nov 01 07:57:07.000 [info] router_pick_published_address: Success: chose address 'mmm.yyy.iii.ppp'.
Nov 01 07:57:07.000 [debug] resolve_my_address: Guessed local host name as 'jail1'
Nov 01 07:57:07.000 [info] resolve_my_address: Could not resolve guessed local hostname 'jail1'. Trying something else.
Nov 01 07:57:07.000 [info] resolve_my_address: Learned IP address '10.0.1.1' for local interface. Using that.
Nov 01 07:57:07.000 [info] resolve_my_address: Address '<guessed from interfaces>' resolves to private IP address '10.0.1.1'. Tor servers that use the default DirAuthorities must have public IP addresses.
Nov 01 07:57:07.000 [info] router_pick_published_address: Could not determine our address locally. Checking if directory headers provide any hints.
Nov 01 07:57:07.000 [info] router_pick_published_address: Success: chose address 'mmm.yyy.iii.ppp'.
Nov 01 07:57:07.000 [debug] resolve_my_address: Guessed local host name as 'jail1'
Nov 01 07:57:07.000 [info] resolve_my_address: Could not resolve guessed local hostname 'jail1'. Trying something else.
Nov 01 07:57:07.000 [info] resolve_my_address: Learned IP address '10.0.1.1' for local interface. Using that.
Nov 01 07:57:07.000 [info] resolve_my_address: Address '<guessed from interfaces>' resolves to private IP address '10.0.1.1'. Tor servers that use the default DirAuthorities must have public IP addresses.
Nov 01 07:57:07.000 [info] router_pick_published_address: Could not determine our address locally. Checking if directory headers provide any hints.
Nov 01 07:57:07.000 [info] router_pick_published_address: Success: chose address 'mmm.yyy.iii.ppp'.
Nov 01 07:57:07.000 [debug] resolve_my_address: Guessed local host name as 'jail1'
Nov 01 07:57:07.000 [info] resolve_my_address: Could not resolve guessed local hostname 'jail1'. Trying something else.
Nov 01 07:57:07.000 [info] resolve_my_address: Learned IP address '10.0.1.1' for local interface. Using that.
Nov 01 07:57:07.000 [info] resolve_my_address: Address '<guessed from interfaces>' resolves to private IP address '10.0.1.1'. Tor servers that use the default DirAuthorities must have public IP addresses.
Nov 01 07:57:07.000 [info] router_pick_published_address: Could not determine our address locally. Checking if directory headers provide any hints.
Nov 01 07:57:07.000 [info] router_pick_published_address: Success: chose address 'mmm.yyy.iii.ppp'.

... grep "Guessed local host name as" == 22924 times ...

Nov 01 08:02:32.000 [debug] resolve_my_address: Guessed local host name as 'jail1'
Nov 01 08:02:32.000 [info] resolve_my_address: Could not resolve guessed local hostname 'jail1'. Trying something else.
Nov 01 08:02:32.000 [info] resolve_my_address: Learned IP address '10.0.1.1' for local interface. Using that.
Nov 01 08:02:32.000 [info] resolve_my_address: Address '<guessed from interfaces>' resolves to private IP address '10.0.1.1'. Tor servers that use the default DirAuthorities must have public IP addresses.
Nov 01 08:02:32.000 [info] router_pick_published_address: Could not determine our address locally. Checking if directory headers provide any hints.
Nov 01 08:02:32.000 [info] router_pick_published_address: Success: chose address 'mmm.yyy.iii.ppp'.
Nov 01 08:02:32.000 [debug] resolve_my_address: Guessed local host name as 'jail1'
Nov 01 08:02:32.000 [info] resolve_my_address: Could not resolve guessed local hostname 'jail1'. Trying something else.
Nov 01 08:02:32.000 [info] resolve_my_address: Learned IP address '10.0.1.1' for local interface. Using that.
Nov 01 08:02:32.000 [info] resolve_my_address: Address '<guessed from interfaces>' resolves to private IP address '10.0.1.1'. Tor servers that use the default DirAuthorities must have public IP addresses.
Nov 01 08:02:32.000 [info] router_pick_published_address: Could not determine our address locally. Checking if directory headers provide any hints.
Nov 01 08:02:32.000 [info] router_pick_published_address: Success: chose address 'mmm.yyy.iii.ppp'.
Nov 01 08:02:32.000 [debug] resolve_my_address: Guessed local host name as 'jail1'
Nov 01 08:02:32.000 [info] resolve_my_address: Could not resolve guessed local hostname 'jail1'. Trying something else.
Nov 01 08:02:32.000 [info] resolve_my_address: Learned IP address '10.0.1.1' for local interface. Using that.
Nov 01 08:02:32.000 [info] resolve_my_address: Address '<guessed from interfaces>' resolves to private IP address '10.0.1.1'. Tor servers that use the default DirAuthorities must have public IP addresses.
Nov 01 08:02:32.000 [info] router_pick_published_address: Could not determine our address locally. Checking if directory headers provide any hints.
Nov 01 08:02:32.000 [info] router_pick_published_address: Success: chose address 'mmm.yyy.iii.ppp'.
Nov 01 08:02:32.000 [debug] resolve_my_address: Guessed local host name as 'jail1'
Nov 01 08:02:32.000 [info] resolve_my_address: Could not resolve guessed local hostname 'jail1'. Trying something else.
Nov 01 08:02:32.000 [info] resolve_my_address: Learned IP address '10.0.1.1' for local interface. Using that.
Nov 01 08:02:32.000 [info] resolve_my_address: Address '<guessed from interfaces>' resolves to private IP address '10.0.1.1'. Tor servers that use the default DirAuthorities must have public IP addresses.
Nov 01 08:02:32.000 [info] router_pick_published_address: Could not determine our address locally. Checking if directory headers provide any hints.
Nov 01 08:02:32.000 [info] router_pick_published_address: Success: chose address 'mmm.yyy.iii.ppp'.
Nov 01 08:02:32.000 [debug] file_status: stat()ing /var/db/tor/keys/ed25519_signing_secret_key_encrypted
Nov 01 08:02:32.000 [debug] check_private_dir: stat()ing /var/db/tor
Nov 01 08:02:32.000 [debug] check_private_dir: stat()ing /var/db/tor/keys
Nov 01 08:02:32.000 [debug] file_status: stat()ing /var/db/tor/keys/ed25519_master_id_public_key
Nov 01 08:02:32.000 [debug] file_status: stat()ing /var/db/tor/keys/ed25519_master_id_secret_key_encrypted
Nov 01 08:02:32.000 [debug] configure_nameservers: stat()ing /etc/resolv.conf
Nov 01 08:02:32.000 [info] configure_nameservers: No change to '/etc/resolv.conf'
Nov 01 08:02:32.000 [debug] scheduler_evt_callback: Scheduler event callback called
Nov 01 08:02:32.000 [debug] scheduler_run: We have a chance to run the scheduler
Nov 01 08:02:32.000 [debug] scheduler_run: Scheduler saw pending channel 27509 at 0x80988d640 with 64 cells writeable
Nov 01 08:02:32.000 [debug] channel_tls_get_overhead_estimate_method: Estimated overhead ratio for TLS chan 27509 is 1.013652
Nov 01 08:02:32.000 [debug] channel_write_packed_cell: Writing packed_cell_t 0x80f10e280 to channel 0x80988d640 with global ID 27509
Nov 01 08:02:32.000 [debug] channel_tls_get_overhead_estimate_method: Estimated overhead ratio for TLS chan 27509 is 1.013652
Nov 01 08:02:32.000 [debug] channel_update_xmit_queue_size: Increasing queue size for channel 27509 by 521 from 0 to 521
Nov 01 08:02:32.000 [debug] channel_update_xmit_queue_size: Increasing global queue size by 521 for channel 27509, new size is 61727
Nov 01 08:02:32.000 [debug] scheduler_adjust_queue_size: Queue size adjustment by +521 for channel 27509
Nov 01 08:02:32.000 [debug] scheduler_adjust_queue_size: Queue heuristic is now 521

...

Lots of:

Nov 01 08:02:33.000 [debug] conn_read_callback: socket 795 wants to read.
Nov 01 08:02:33.000 [debug] connection_read_to_buf: 795: starting, inbuf_datalen 0 (0 pending in tls object). at_most 16448.
Nov 01 08:02:33.000 [debug] connection_read_to_buf: After TLS read of 514: 543 read, 0 written
Nov 01 08:02:33.000 [debug] connection_or_process_cells_from_inbuf: 795: starting, inbuf_datalen 514 (0 pending in tls object).
Nov 01 08:02:33.000 [debug] connection_or_process_cells_from_inbuf: 795: starting, inbuf_datalen 0 (0 pending in tls object).

...

Nov 01 08:02:33.000 [debug] scheduler_adjust_queue_size: Queue size adjustment by -528 for channel 1147
Nov 01 08:02:33.000 [debug] scheduler_adjust_queue_size: Queue heuristic is now 0
Nov 01 08:02:33.000 [debug] connection_bucket_refill_helper: global_read_bucket now 1073741824.
Nov 01 08:02:33.000 [debug] connection_bucket_refill_helper: global_write_bucket now 1073741824.
Nov 01 08:02:33.000 [debug] connection_bucket_refill_helper: global_relayed_read_bucket now 10240000.
Nov 01 08:02:33.000 [debug] connection_bucket_refill_helper: global_relayed_write_bucket now 10240000.
Nov 01 08:02:33.000 [debug] connection_bucket_refill_helper: or_conn->read_bucket now 1073741824.
Nov 01 08:02:33.000 [debug] connection_bucket_refill_helper: or_conn->read_bucket now 1073741824.
Nov 01 08:02:33.000 [debug] connection_bucket_refill_helper: or_conn->read_bucket now 1073741824.

... 2604 times: Nov 01 08:02:33.000 [debug] connection_bucket_refill_helper: or_conn->read_bucket now 1073741824.

Nov 01 08:02:33.000 [debug] connection_bucket_refill_helper: or_conn->read_bucket now 1073741824.
Nov 01 08:02:33.000 [debug] connection_bucket_refill_helper: or_conn->read_bucket now 1073741824.
Nov 01 08:02:33.000 [debug] connection_bucket_refill_helper: or_conn->read_bucket now 1073741824.
Nov 01 08:02:33.000 [debug] connection_bucket_refill_helper: or_conn->read_bucket now 1073741824.
Nov 01 08:02:33.000 [warn] Your system clock just jumped 326 seconds forward; assuming established circuits no longer work.

Finally the warning.

comment:13 Changed 3 years ago by Felixix

There is an obvious difference in the log files between 0.2.7.6 and 0.2.8.9:

  • 0.2.8.9 in the jail, debug after reload with about 23k times resolve_my_address etc, 5-6 min Tor down to 4% CPU
  • 0.2.7.6_1 in the *same* jail and keys, debug after reload with 4 times resolve_my_address etc, pushing data on 25% CPU

For now I keep it like it is. If you want me to do anything please tell me. I would be glad if I can assist/test.


0.2.8.9 in the jail, debug after reload:

Nov 03 18:35:02.000 [info] router_write_fingerprint: Dumping fingerprint to "/var/db/tor/fingerprint"...
Nov 03 18:35:02.000 [notice] Your Tor server's identity key fingerprint is 'blabla fingerprint'
Nov 03 18:35:02.000 [info] cell_ewma_set_scale_factor: Enabled cell_ewma algorithm because of value in CircuitPriorityHalflifeMsec in consensus; scale factor is 0.793701 per 10 seconds
Nov 03 18:35:02.000 [info] options_act: Worker-related options changed. Rotating workers.
Nov 03 18:35:02.000 [debug] configure_nameservers: stat()ing /etc/resolv.conf
Nov 03 18:35:02.000 [info] configure_nameservers: No change to '/etc/resolv.conf'

23484*
Nov 03 18:35:02.000 [debug] resolve_my_address: Guessed local host name as 'jail1'
Nov 03 18:35:02.000 [info] resolve_my_address: Could not resolve guessed local hostname 'jail1'. Trying something else.
Nov 03 18:35:02.000 [info] resolve_my_address: Learned IP address '10.x.x.x' for local interface. Using that.
Nov 03 18:35:02.000 [info] resolve_my_address: Address '<guessed from interfaces>' resolves to private IP address '10.x.x.x'. Tor servers that use the default DirAuthorities must have public IP addresses.
Nov 03 18:35:02.000 [info] router_pick_published_address: Could not determine our address locally. Checking if directory headers provide any hints.
Nov 03 18:35:02.000 [info] router_pick_published_address: Success: chose address 'm.y.i.p'.
...

25432*
Nov 03 18:40:34.000 [debug] connection_bucket_refill_helper: or_conn->read_bucket now 1073741824.

!!!
Nov 03 18:40:34.000 [warn] Your system clock just jumped 332 seconds forward; assuming established circuits no longer work.

2498*
Nov 03 18:40:34.000 [debug] run_connection_housekeeping: Sending keepalive to (an.y.i.p:port)


0.2.7.6_1 in the same jail, debug after reload:

Nov 03 21:25:46.000 [info] int router_write_fingerprint(int): Dumping fingerprint to "/var/db/tor/fingerprint"...
Nov 03 21:25:46.000 [notice] Your Tor server's identity key fingerprint is 'blabla fingerprint'
Nov 03 21:25:46.000 [info] void cell_ewma_set_scale_factor(const or_options_t *, const networkstatus_t *): Enabled cell_ewma algorithm because of value in CircuitPriorityHalflifeMsec in consensus; scale factor is 0.793701 per 10 seconds
Nov 03 21:25:46.000 [info] int options_act(const or_options_t *): Worker-related options changed. Rotating workers.
Nov 03 21:25:46.000 [debug] int configure_nameservers(int): stat()ing /etc/resolv.conf
Nov 03 21:25:46.000 [info] int configure_nameservers(int): No change to '/etc/resolv.conf'

4*
Nov 03 21:25:46.000 [debug] int resolve_my_address(int, const or_options_t *, uint32_t *, const char , char ): Guessed local host name as 'jail1'
Nov 03 21:25:46.000 [info] int resolve_my_address(int, const or_options_t *, uint32_t *, const char , char ): Could not resolve guessed local hostname 'jail1'. Trying something else.
Nov 03 21:25:46.000 [info] int resolve_my_address(int, const or_options_t *, uint32_t *, const char , char ): Learned IP address '10.x.x.x' for local interface. Using that.
Nov 03 21:25:46.000 [info] int resolve_my_address(int, const or_options_t *, uint32_t *, const char , char ): Address '<guessed from interfaces>' resolves to private IP address '10.x.x.x'. Tor servers that use the default DirAuthorities must have public IP addresses.
Nov 03 21:25:46.000 [info] int router_pick_published_address(const or_options_t *, uint32_t *): Could not determine our address locally. Checking if directory headers provide any hints.
Nov 03 21:25:46.000 [info] int router_pick_published_address(const or_options_t *, uint32_t *): Success: chose address 'm.y.i.p'.
...

Nov 03 21:25:46.000 [debug] void directory_initiate_command_rend(const tor_addr_t *, uint16_t, uint16_t, const char *, uint8_t, uint8_t, dir_indirection_t, const char *, const char *, size_t, time_t, const rend_data_t *): anonymized 0, use_begindir 0.
...

20*
Nov 03 21:25:46.000 [debug] void connection_bucket_refill_helper(int *, int, int, int, const char *): or_conn->read_bucket now 1073741824.

comment:14 Changed 3 years ago by s7r

I think this is not (always) related to the clock change. I have over 70 lines of Nov 08 10:19:49.000 [warn] assign_to_cpuworker failed. Ignoring. , one immediately after another with nothing in between, most of them with the very same timestamp, in 3 buckets of :49 (seconds) :50 (seconds) and :51 (seconds).

I am on FreeBSD 10.3-RELEASE-p11 with Tor 0.2.9.4-alpha (git-8b0755c9bb296ae2) with Libevent 2.0.22-stable, OpenSSL 1.0.2j and Zlib 1.2.8.

I want to add that this relays has been running quite for some time, was in 0.2.9.1 and 0.2.8.x as well and is the first time ever it reports this.

Last edited 3 years ago by s7r (previous) (diff)

comment:15 Changed 3 years ago by arma

I wonder if there is something about this freebsd user's jail that makes calls like gethostname super slow? And then this ticket would be a variant on #20610?

Felixix, do things get better if you explicitly set an Address line in your torrc?

comment:16 Changed 3 years ago by teor

In 0.2.9.1-alpha, we changed how we used timers:

I wonder if your OS has a broken monotonic timer:

    - Tor now uses the operating system's monotonic timers (where
      available) for internal fine-grained timing. Previously we would
      look at the system clock, and then attempt to compensate for the
      clock running backwards. Closes ticket 18908.

Alternately, I wonder if your OS previously relied on Tor compensating for the clock running backwards.

Can you please run the tor unit tests in your jail?

comment:17 Changed 3 years ago by s7r

P.S. jumping in maybe this is helpful. I get the message on a base system (not inside a jail), the clock did not change and I do have explicitly set Address in torrc.

One correction: I have a script which monitors the log file and emails me for every warn, but it has some limits that it will not send too many duplicate messages at every 6 hours. While closely looking at the log file, I discovered that there are way more such messages than I thought:

Nov 08 10:19:51.000 [warn] assign_to_cpuworker failed. Ignoring.
# cat /var/log/tor | grep assign_to_cpuworker | wc -l

1881

There are exactly 573 messages at 10:19:49.000 ; 692 at 10:19:50.000 ; 616 at 10:19:51.000

Again, this is not a jail, Address is set and it's been a relay for long time, this is the first time this warning appears and hasn't repeated so far. Can share access on this machine if needed.

comment:18 Changed 3 years ago by teor

Can you please run the tor unit tests for 0.2.9.5-alpha.
They have a timer unit test that uses monotonic time.
We use monotonic time in tor in 0.2.9.
Perhaps it is broken on your machine.

Changed 3 years ago by s7r

Attachment: make-test-freebsd-10.3.log added

comment:19 Changed 3 years ago by s7r

Couldn't properly run make test on the port so cloned the git repository (latest master branch) to run it. Attached the log file -- the monotonic time test failed. Is it used occasionally or something? Otherwise why is not the warning printed all the time / most of the time? How can it be fixed?

Last edited 3 years ago by s7r (previous) (diff)

comment:20 Changed 3 years ago by teor

Maybe the variance is small most of the time, but large occasionally?
The warning is only printed when the clock changes dramatically.

comment:21 Changed 3 years ago by s7r

I can confirm that is true. Just run make test again on the same machine and it outputs:
util/monotonic_time: OK
util/monotonic_time_ratchet: [forking] OK

Would it be wise to rephrase this log message?

comment:22 in reply to:  21 Changed 3 years ago by teor

Replying to s7r:

I can confirm that is true. Just run make test again on the same machine and it outputs:
util/monotonic_time: OK
util/monotonic_time_ratchet: [forking] OK

Would it be wise to rephrase this log message?

No, we need to find out whether:

  • the process really hangs for multiple seconds, or
  • the clock really regularly jumps multiple seconds,

And then either:

  • remove the hangs, or
  • use a more reliable time source.

comment:23 in reply to:  15 Changed 3 years ago by Felixix

Replying to arma:

I wonder if there is something about this freebsd user's jail that makes calls like gethostname super slow? And then this ticket would be a variant on #20610?

About
https:// trac.torproject.org/projects/tor/ticket/20610
or better:
https:// lists.torproject.org/pipermail/tor-relays/2016-November/010924.html

I quote:
Nov 01 18:38:17.000 - 23195
...
Nov 01 19:38:17.000 - 23207
...
Nov 01 20:38:17.000 - 23232

Exactly what I see - as well I get 23k warnings. See the comment above.
( https:// trac.torproject.org/projects/tor/ticket/20423?replyto=15#comment:13 )

A different warning and linux. Is this only on FreeBSD? It appeared to me since 0.2.8.x.

Felixix, do things get better if you explicitly set an Address line in your torrc?

I did:

ORPort my.ext.i.p:443 NoListen
ORPort my.loc.i.p:1234 NoAdvertise
DirPort my.ext.i.p:80 NoListen
DirPort my.loc.i.p:5678 NoAdvertise

and received the warnings.

comment:24 in reply to:  17 Changed 3 years ago by Felixix

Replying to s7r:

P.S. jumping in maybe this is helpful.

Thanks for jumping in.

Can you force the warnings by hup/reload ?

Are you on 0.2.8.x ? Can you temporary switch to 0.2.7.6 ?

I apologize for asking that directly.

comment:25 Changed 3 years ago by teor

So you didn't set an Address <IPv4> line in your torrc?

If you don't set Address, and tor is using gethostbyname on your hostname to find your address, and that lookup goes out to the network (because your hostname is not in /etc/hosts), and the lookup times out occasionally, that could cause these problems.

It still would be a good idea for us to work out how to avoid calling router_pick_published_address() that often, and see if that fixes this issue.

comment:26 in reply to:  25 Changed 3 years ago by Felixix

Replying to teor:

So you didn't set an Address <IPv4> line in your torrc?

I will do Address my.ext.i.p in my torrc and come back.

comment:27 Changed 3 years ago by Felixix

The mistery seems solved! After setting Address my.ext.i.p the hourly warnings and the stalling is gone again. What one would do if guessing becomes necessary (dynamic ip) ?

Two relays are configured newly and both work fine. I am confident to switch all relays to 0.2.8.9 now.

And tell me if I shall test a modified ip guessing sometimes. My setup (jail) seems sensitive and reproducible for it.

Thanks to everybody


0.2.8.9 with Address my.ext.i.p

torrc good
Address my.ext.i.p
ORPort my.ext.i.p:443 NoListen
ORPort my.loc.i.p:1234 NoAdvertise
DirPort my.ext.i.p:80 NoListen
DirPort my.loc.i.p:5678 NoAdvertise

top tor process shows no stalling.


0.2.8.9 without Address my.ext.i.p

torrc bad
#Address my.ext.i.p
ORPort my.ext.i.p:443 NoListen
ORPort my.loc.i.p:1234 NoAdvertise
DirPort my.ext.i.p:80 NoListen
DirPort my.loc.i.p:5678 NoAdvertise

log bad
...
Nov 10 07:30:47.410 [notice] Tor v0.2.8.9 running on FreeBSD with Libevent 2.0.22-stable, OpenSSL LibreSSL 2.4.3 and Zlib 1.2.8.
...
Nov 10 07:30:47.425 [notice] Opening OR listener on my.loc.i.p:1234
Nov 10 07:30:47.426 [notice] Opening Directory listener on my.loc.i.p:5678
...
Nov 10 07:36:43.000 [notice] Guessed our IP address as my.ext.i.p
...
Nov 10 09:36:27.000 [warn] Your system clock just jumped 321 seconds forward; assuming established circuits no longer work.
Nov 10 09:36:28.000 [warn] assign_to_cpuworker failed. Ignoring.
in total 168 times
Nov 10 09:36:30.000 [warn] assign_to_cpuworker failed. Ignoring.
...
Nov 10 10:36:30.000 [warn] Your system clock just jumped 325 seconds forward; assuming established circuits no longer work.
Nov 10 10:36:31.000 [warn] assign_to_cpuworker failed. Ignoring.
in total 171 times
Nov 10 10:36:33.000 [warn] assign_to_cpuworker failed. Ignoring.
...
Nov 10 11:36:30.000 [warn] Your system clock just jumped 325 seconds forward; assuming established circuits no longer work.
Nov 10 11:36:31.000 [warn] assign_to_cpuworker failed. Ignoring.
in total 164 times
Nov 10 11:36:33.000 [warn] assign_to_cpuworker failed. Ignoring.

top bad tor process
2016-11-10_11-40-00 25.98% tor
2016-11-10_11-39-00 17.97% tor
2016-11-10_11-38-00 11.96% tor
2016-11-10_11-37-00 13.96% tor
2016-11-10_11-36-00 3.96% tor
2016-11-10_11-35-00 1.95% tor
2016-11-10_11-34-00 3.96% tor
2016-11-10_11-33-00 3.96% tor
2016-11-10_11-32-00 3.96% tor
2016-11-10_11-31-00 19.97% tor
2016-11-10_11-30-00 19.97% tor
2016-11-10_11-29-00 17.97% tor
2016-11-10_11-28-00 21.97% tor
2016-11-10_11-27-00 25.98% tor

comment:28 Changed 3 years ago by teor

Summary: Clock jumps on FreeBSD relayClock jumps on FreeBSD relay due to hostname lookup timeouts

Ok, so #20610 deals with the excessive logging part of this issue, #19974 deals with the monotonic time part of this issue.

I suspect that [warn] assign_to_cpuworker failed. Ignoring. is a symptom of the host name lookup timeout. So I've retitled this issue to focus on the hostname lookup timeout.

We could improve this issue by avoiding calling getaddrinfo() or gethostbyname() so often (although it's only being called once per hour, which is probably the right frequency), or by somehow putting a lower timeout on it, or by calling it asynchronously.

Alternately, we could log a message when tor_addr_lookup() takes more than N seconds, encouraging relay operators to set 'Address', modify /etc/hosts and /etc/hostname, or use IP addresses rather than hostnames in their config.

comment:29 in reply to:  28 Changed 3 years ago by nickm

Replying to teor:

We could improve this issue by avoiding calling getaddrinfo() or gethostbyname() so often (although it's only being called once per hour, which is probably the right frequency), or by somehow putting a lower timeout on it, or by calling it asynchronously.

I think that "calling it asynchronously" is the right answer. We could do that either from a job that we pass to a worker thread, or using libevent's evdns resolver. (I believe that might not initialize the libevent evdns resolver very well, though, so the "use a worker" thing might be a better option)

comment:30 Changed 3 years ago by arma

Summary: Clock jumps on FreeBSD relay due to hostname lookup timeoutsClock jumps on relay due to hostname lookup timeouts

I'm removing the word 'freebsd' from the ticket title, since this bug can affect any relay with a crummy resolver. Also, I put a diagnosis of the bug on #20610. Patch coming here soonish.

comment:31 Changed 3 years ago by arma

Keywords: freebsd removed
Status: newneeds_review

My bug20423 branch should resolve this issue.

(It does not resolve the fact that maybe the relay has a broken resolver, but it will at least not try to do thousands of resolve attempts in a row.)

It is a patch on maint-0.2.9. We can consider a backport to 028 if we like it enough, or we can leave 028 as a regression and focus on our future.

Once we get this one in, we might in 0.3.0 consider changing many more of these calls to cache_only=1. In fact, if we leave check_descriptor_ipaddress_changed() calling resolve_my_address() once a minute, maybe we're all set and the whole of router_pick_published_address() can just look at the cached values? I didn't want to make an invasive change like that in 0.2.9 though, since there are probably edge cases we care about, e.g. where getinfo_helper_misc() calls router_pick_published_address() and we're not a relay so we've never called resolve_my_address() before that very moment.

comment:32 Changed 3 years ago by nickm

Resolution: fixed
Status: needs_reviewclosed

This looks plausible to me, and simple enough for 0.2.9. Merging. Could you please open new tickets for the "once we get this in" part of this ticket?

comment:33 in reply to:  31 Changed 3 years ago by pastly

Replying to arma:

My bug20423 branch should resolve this issue.

I can confirm no more log spam wrt picking an address.

Last edited 3 years ago by pastly (previous) (diff)

comment:34 Changed 2 years ago by teor

Milestone: Tor: 0.2.9.x-finalTor: 0.3.0.x-final
Resolution: fixed
Status: closedreopened
Version: Tor: 0.2.8.9Tor: 0.2.9.10

A relay operator reports that this is still happening in 0.2.9.10:
https://lists.torproject.org/pipermail/tor-relays/2017-March/012140.html

We should probably try and fix it in 0.3.0.

I also opened #21789 in 0.3.1 for arma's 'once we get this in...'.

comment:35 Changed 2 years ago by teor

Points: 1

From the operator's emails, it appears that the relay isn't stuck in the main thread, but warnings are still being issued at regular intervals. This will cause lots of spurious bug reports if we stick it in a stable release (or it may even indicate an underlying issue).

I also opened #21809 for the hang on SIGINT that they reported.

comment:36 Changed 2 years ago by teor

Keywords: 029-backport added

comment:37 Changed 2 years ago by nickm

Milestone: Tor: 0.3.0.x-finalTor: 0.3.2.x-final

comment:38 Changed 22 months ago by teor

Parent ID: #21789

#21789 may mitigate this issue

comment:39 Changed 22 months ago by nickm

Keywords: 032-unreached added
Milestone: Tor: 0.3.2.x-finalTor: unspecified

Mark a large number of tickets that I do not think we will do for 0.3.2.

Note: See TracTickets for help on using tickets.