Opened 6 months ago

Closed 5 weeks ago

#24977 closed defect (fixed)

Non-fatal assertion !(tor_mem_is_zero((const char*)node->hsdir_index->fetch, DIGEST256_LEN))

Reported by: asn
Owned by: asn
Priority: Medium
Milestone: Tor: 0.3.4.x-final
Component: Core Tor/Tor
Version: Tor: 0.3.2.1-alpha
Severity: Normal
Keywords: tor-hs, prop224, 034-triage-20180328, 034-removed-20180502
Cc:
Actual Points:
Parent ID:
Points: 1
Reviewer: dgoulet
Sponsor:

Description

This one is back from the pre-release days of hsv3: something causes the hsdir index to be left unset for some relays.

I got this to happen on my hsv3 service a few weeks ago. It fired a few times in the same second for the same node, and then it got fixed... There were no other references to that node (or its fingerprint) earlier in the log.

Jan 04 21:30:54.000 [warn] tor_bug_occurred_(): Bug: src/or/hs_common.c:1277: node_has_hsdir_index: Non-fatal assertion !(tor_mem_is_zero((const char*)node->hsdir_index->fetch, DIGEST256_LEN)) failed. (on Tor 0.3.2.6-alpha-dev b6fd78ea301bd089)
Jan 04 21:30:55.000 [warn] Bug: Non-fatal assertion !(tor_mem_is_zero((const char*)node->hsdir_index->fetch, DIGEST256_LEN)) failed in node_has_hsdir_index at src/or/hs_common.c:1277. Stack trace: (on Tor 0.3.2.6-alpha-dev b6fd78ea301bd089)
Jan 04 21:30:55.000 [warn] Bug:     ./tor/src/or/tor(log_backtrace+0x42) [0x7f6079a21db2] (on Tor 0.3.2.6-alpha-dev b6fd78ea301bd089)
Jan 04 21:30:55.000 [warn] Bug:     ./tor/src/or/tor(tor_bug_occurred_+0xb7) [0x7f6079a3cc57] (on Tor 0.3.2.6-alpha-dev b6fd78ea301bd089)
Jan 04 21:30:55.000 [warn] Bug:     ./tor/src/or/tor(hs_get_responsible_hsdirs+0x4f9) [0x7f6079a046c9] (on Tor 0.3.2.6-alpha-dev b6fd78ea301bd089)
Jan 04 21:30:55.000 [warn] Bug:     ./tor/src/or/tor(hs_service_run_scheduled_events+0x1a5b) [0x7f6079a11c5b] (on Tor 0.3.2.6-alpha-dev b6fd78ea301bd089)
Jan 04 21:30:55.000 [warn] Bug:     ./tor/src/or/tor(+0x4b9f1) [0x7f60798ed9f1] (on Tor 0.3.2.6-alpha-dev b6fd78ea301bd089)
Jan 04 21:30:55.000 [warn] Bug:     ./tor/src/or/tor(+0x6b4f0) [0x7f607990d4f0] (on Tor 0.3.2.6-alpha-dev b6fd78ea301bd089)
Jan 04 21:30:55.000 [warn] Bug:     /usr/lib/x86_64-linux-gnu/libevent-2.0.so.5(event_base_loop+0x7fc) [0x7f6078f253dc] (on Tor 0.3.2.6-alpha-dev b6fd78ea301bd089)
Jan 04 21:30:55.000 [warn] Bug:     ./tor/src/or/tor(do_main_loop+0x244) [0x7f60798f0eb4] (on Tor 0.3.2.6-alpha-dev b6fd78ea301bd089)
Jan 04 21:30:55.000 [warn] Bug:     ./tor/src/or/tor(tor_main+0x1c25) [0x7f60798f46f5] (on Tor 0.3.2.6-alpha-dev b6fd78ea301bd089)
Jan 04 21:30:55.000 [warn] Bug:     ./tor/src/or/tor(main+0x19) [0x7f60798ec629] (on Tor 0.3.2.6-alpha-dev b6fd78ea301bd089)
Jan 04 21:30:55.000 [warn] Bug:     /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1) [0x7f60781182b1] (on Tor 0.3.2.6-alpha-dev b6fd78ea301bd089)
Jan 04 21:30:55.000 [warn] Bug:     ./tor/src/or/tor(_start+0x2a) [0x7f60798ec67a] (on Tor 0.3.2.6-alpha-dev b6fd78ea301bd089)
Jan 04 21:30:55.000 [info] hs_get_responsible_hsdirs(): Node $EEC47B34F9403AA7F765D070BB3011E50A373C21~ivanmk2 at 185.22.173.162 was found without hsdir index.
Jan 04 21:30:55.000 [info] run_upload_descriptor_event(): Initiating upload for hidden service current descriptor for service onion with 3/3 introduction points.

Child Tickets

Change History (32)

comment:1 Changed 6 months ago by nickm

Component: - Select a component → Core Tor/Tor

comment:2 Changed 6 months ago by dgoulet

Owner: set to dgoulet
Status: new → accepted

comment:3 Changed 6 months ago by dgoulet

Hmmm I have not seen that on any of my v3 services for a _while_ now.

Here is what I can see that makes me wonder: we can have a node_t without a known ed25519 ID, that is, before we get the descriptor.

Notice node_set_hsdir_index(): it doesn't set anything if node_get_ed25519_id() returns NULL. We only set the HSDir index if rs->supports_v3_hsdir, meaning when we have an rs. But hs_get_responsible_hsdirs() looks up the node by identity_digest...

And node_supports_v3_hsdir() can return true using only the ri, not the rs... so we have a mismatch here: we only set the indexes when we have an rs, but we can also validate HSDir support from the ri alone.

Although, in this situation, we loop over the rs list so any node_t we look at from the rs->identity_digest should return a node that has a valid rs....

Very puzzling!

comment:4 Changed 6 months ago by asn

Milestone: Tor: 0.3.3.x-final → Tor: 0.3.4.x-final
Status: accepted → needs_information

Yep, this needs more debugging and more logs. Seems quite rare right now. Pushing to 034 until more info arrives.

comment:5 Changed 4 months ago by nickm

Keywords: 034-triage-20180328 added

comment:6 Changed 3 months ago by cypherpunks

Parent ID: #25955

comment:7 Changed 3 months ago by asn

Keywords: 034-removed-20180502 added
Milestone: Tor: 0.3.4.x-final → Tor: unspecified

comment:8 Changed 2 months ago by cypherpunks

My computer started with the wrong time. The bug was triggered while I ran "openntpd -d -s" to set the correct time.

System: debian stretch amd64 with openRC as init system
torrc: socksPort + onion service v2 + onion service v3

May 14 19:01:07 txt Tor[3576]: Bootstrapped 0%: Starting
May 14 19:01:07 txt Tor[3576]: Received local state file with skewed time (/var/lib/tor/state): It seems that our clock is behind by 16 hours, 57 minutes, or that theirs is ahead. Tor requires an accurate clock to work: please check your time, timezone, and date settings.
May 14 19:01:07 txt Tor[3576]: Problem bootstrapping. Stuck at 0%: Starting. (Clock skew -61037 in local state file from /var/lib/tor/state; CLOCK_SKEW; count 1; recommendation warn; host ? at ?)
May 14 19:01:07 txt Tor[3576]: Parsing GEOIP IPv4 file /usr/share/tor/geoip.
May 14 19:01:08 txt Tor[3576]: Parsing GEOIP IPv6 file /usr/share/tor/geoip6.
May 14 19:01:13 txt Tor[3576]: Our clock is 14 hours, 58 minutes behind the time published in the consensus network status document (2018-05-15 08:00:00 UTC).  Tor needs an accurate clock to work correctly. Please check your time and date settings!
May 14 19:01:13 txt Tor[3576]: Received microdesc flavor consensus with skewed time (CONSENSUS): It seems that our clock is behind by 14 hours, 58 minutes, or that theirs is ahead. Tor requires an accurate clock to work: please check your time, timezone, and date settings.
May 14 19:01:13 txt Tor[3576]: Problem bootstrapping. Stuck at 0%: Starting. (Clock skew -53934 in microdesc flavor consensus from CONSENSUS; CLOCK_SKEW; count 2; recommendation warn; host ? at ?)
May 14 19:01:13 txt Tor[3576]: Starting with guard context "default"
May 14 19:01:13 txt Tor[3576]: Opening Socks listener on /var/run/tor/socks
May 14 19:01:13 txt Tor[3576]: Opening Control listener on /var/run/tor/control
May 14 19:01:13 txt Tor[3576]: Bootstrapped 5%: Connecting to directory server
May 14 19:01:13 txt Tor[3576]: Bootstrapped 10%: Finishing handshake with directory server
May 14 19:01:14 txt Tor[3576]: Bootstrapped 15%: Establishing an encrypted directory connection
May 14 19:01:14 txt Tor[3576]: Bootstrapped 20%: Asking for networkstatus consensus
May 14 19:01:14 txt Tor[3576]: Bootstrapped 25%: Loading networkstatus consensus
May 14 19:01:15 txt Tor[3576]: Our clock is 15 hours, 58 minutes behind the time published in the consensus network status document (2018-05-15 09:00:00 UTC).  Tor needs an accurate clock to work correctly. Please check your time and date settings!
May 14 19:01:15 txt Tor[3576]: Received microdesc flavor consensus with skewed time (CONSENSUS): It seems that our clock is behind by 15 hours, 58 minutes, or that theirs is ahead. Tor requires an accurate clock to work: please check your time, timezone, and date settings.
May 14 19:01:15 txt Tor[3576]: Problem bootstrapping. Stuck at 25%: Loading networkstatus consensus. (Clock skew -57526 in microdesc flavor consensus from CONSENSUS; CLOCK_SKEW; count 3; recommendation warn; host ? at ?)
May 14 19:01:15 txt Tor[3576]: I learned some more directory information, but not enough to build a circuit: We have no recent usable consensus.
May 14 19:01:18 txt Tor[3576]: Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
May 14 19:01:18 txt Tor[3576]: Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
May 14 19:01:18 txt Tor[3576]: Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
May 14 19:01:25 txt Tor[3576]: Received NETINFO cell with skewed time (OR:86.59.21.38:443): It seems that our clock is behind by 16 hours, 58 minutes, or that theirs is ahead. Tor requires an accurate clock to work: please check your time, timezone, and date settings.
May 14 19:01:25 txt Tor[3576]: Problem bootstrapping. Stuck at 25%: Loading networkstatus consensus. (Clock skew -61111 in NETINFO cell from OR; CLOCK_SKEW; count 4; recommendation warn; host 847B1F850344D7876491A54892F904934E4EB85D at 86.59.21.38:443)
May 14 19:01:26 txt Tor[3576]: Received directory with skewed time (DIRSERV:86.59.21.38:443): It seems that our clock is behind by 16 hours, 58 minutes, or that theirs is ahead. Tor requires an accurate clock to work: please check your time, timezone, and date settings.
May 14 19:01:26 txt Tor[3576]: Problem bootstrapping. Stuck at 25%: Loading networkstatus consensus. (Clock skew -61114 in directory from DIRSERV; CLOCK_SKEW; count 5; recommendation warn; host ? at 86.59.21.38:443)
May 14 19:01:32 txt Tor[3576]: Received NETINFO cell with skewed time (OR:193.23.244.244:443): It seems that our clock is behind by 16 hours, 58 minutes, or that theirs is ahead. Tor requires an accurate clock to work: please check your time, timezone, and date settings.
May 14 19:01:32 txt Tor[3576]: Problem bootstrapping. Stuck at 25%: Loading networkstatus consensus. (Clock skew -61112 in NETINFO cell from OR; CLOCK_SKEW; count 6; recommendation warn; host 7BE683E65D48141321C5ED92F075C55364AC7123 at 193.23.244.244:443)
May 14 19:01:32 txt Tor[3576]: Received directory with skewed time (DIRSERV:193.23.244.244:443): It seems that our clock is behind by 16 hours, 58 minutes, or that theirs is ahead. Tor requires an accurate clock to work: please check your time, timezone, and date settings.
May 14 19:01:32 txt Tor[3576]: Problem bootstrapping. Stuck at 25%: Loading networkstatus consensus. (Clock skew -61114 in directory from DIRSERV; CLOCK_SKEW; count 7; recommendation warn; host ? at 193.23.244.244:443)
May 14 19:01:33 txt Tor[3576]: Our clock is 16 hours, 58 minutes behind the time published in the consensus network status document (2018-05-15 10:00:00 UTC).  Tor needs an accurate clock to work correctly. Please check your time and date settings!
May 14 19:01:33 txt Tor[3576]: Received microdesc flavor consensus with skewed time (CONSENSUS): It seems that our clock is behind by 16 hours, 58 minutes, or that theirs is ahead. Tor requires an accurate clock to work: please check your time, timezone, and date settings.
May 14 19:01:33 txt Tor[3576]: Problem bootstrapping. Stuck at 25%: Loading networkstatus consensus. (Clock skew -61108 in microdesc flavor consensus from CONSENSUS; CLOCK_SKEW; count 8; recommendation warn; host ? at ?)
May 14 19:01:33 txt Tor[3576]: I learned some more directory information, but not enough to build a circuit: We have no recent usable consensus.
May 14 19:01:39 txt Tor[3576]: Received NETINFO cell with skewed time (OR:194.109.206.212:443): It seems that our clock is behind by 16 hours, 58 minutes, or that theirs is ahead. Tor requires an accurate clock to work: please check your time, timezone, and date settings.
May 14 19:01:39 txt Tor[3576]: Problem bootstrapping. Stuck at 25%: Loading networkstatus consensus. (Clock skew -61113 in NETINFO cell from OR; CLOCK_SKEW; count 9; recommendation warn; host 7EA6EAD6FD83083C538F44038BBFA077587DD755 at 194.109.206.212:443)
May 14 19:01:39 txt Tor[3576]: Received directory with skewed time (DIRSERV:194.109.206.212:443): It seems that our clock is behind by 16 hours, 58 minutes, or that theirs is ahead. Tor requires an accurate clock to work: please check your time, timezone, and date settings.
May 14 19:01:39 txt Tor[3576]: Problem bootstrapping. Stuck at 25%: Loading networkstatus consensus. (Clock skew -61113 in directory from DIRSERV; CLOCK_SKEW; count 10; recommendation warn; host ? at 194.109.206.212:443)
May 14 19:01:49 txt Tor[3576]: Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
May 14 19:01:49 txt Tor[3576]: Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
May 14 19:01:49 txt Tor[3576]: Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
May 14 19:01:55 txt Tor[3576]: Received directory with skewed time (DIRSERV:194.109.206.212:443): It seems that our clock is behind by 16 hours, 58 minutes, or that theirs is ahead. Tor requires an accurate clock to work: please check your time, timezone, and date settings.
May 14 19:01:55 txt Tor[3576]: Problem bootstrapping. Stuck at 25%: Loading networkstatus consensus. (Clock skew -61113 in directory from DIRSERV; CLOCK_SKEW; count 11; recommendation warn; host ? at 194.109.206.212:443)
May 14 19:02:01 txt Tor[3576]: Received NETINFO cell with skewed time (OR:131.188.40.189:443): It seems that our clock is behind by 16 hours, 58 minutes, or that theirs is ahead. Tor requires an accurate clock to work: please check your time, timezone, and date settings.
May 14 19:02:01 txt Tor[3576]: Problem bootstrapping. Stuck at 25%: Loading networkstatus consensus. (Clock skew -61113 in NETINFO cell from OR; CLOCK_SKEW; count 12; recommendation warn; host F2044413DAC2E02E3D6BCF4735A19BCA1DE97281 at 131.188.40.189:443)
May 14 19:02:01 txt Tor[3576]: Received directory with skewed time (DIRSERV:131.188.40.189:443): It seems that our clock is behind by 16 hours, 58 minutes, or that theirs is ahead. Tor requires an accurate clock to work: please check your time, timezone, and date settings.
May 14 19:02:01 txt Tor[3576]: Problem bootstrapping. Stuck at 25%: Loading networkstatus consensus. (Clock skew -61113 in directory from DIRSERV; CLOCK_SKEW; count 13; recommendation warn; host ? at 131.188.40.189:443)
May 14 19:02:08 txt Tor[3576]: Received NETINFO cell with skewed time (OR:154.35.175.225:443): It seems that our clock is behind by 16 hours, 58 minutes, or that theirs is ahead. Tor requires an accurate clock to work: please check your time, timezone, and date settings.
May 14 19:02:08 txt Tor[3576]: Problem bootstrapping. Stuck at 25%: Loading networkstatus consensus. (Clock skew -61112 in NETINFO cell from OR; CLOCK_SKEW; count 14; recommendation warn; host CF6D0AAFB385BE71B8E111FC5CFF4B47923733BC at 154.35.175.225:443)
May 14 19:02:08 txt Tor[3576]: Received directory with skewed time (DIRSERV:154.35.175.225:443): It seems that our clock is behind by 16 hours, 58 minutes, or that theirs is ahead. Tor requires an accurate clock to work: please check your time, timezone, and date settings.
May 14 19:02:08 txt Tor[3576]: Problem bootstrapping. Stuck at 25%: Loading networkstatus consensus. (Clock skew -61113 in directory from DIRSERV; CLOCK_SKEW; count 15; recommendation warn; host ? at 154.35.175.225:443)
May 14 19:02:13 txt Tor[3576]: Received NETINFO cell with skewed time (OR:171.25.193.9:80): It seems that our clock is behind by 16 hours, 58 minutes, or that theirs is ahead. Tor requires an accurate clock to work: please check your time, timezone, and date settings.
May 14 19:02:13 txt Tor[3576]: Problem bootstrapping. Stuck at 25%: Loading networkstatus consensus. (Clock skew -61113 in NETINFO cell from OR; CLOCK_SKEW; count 16; recommendation warn; host BD6A829255CB08E66FBE7D3748363586E46B3810 at 171.25.193.9:80)
May 14 19:02:14 txt Tor[3576]: Received directory with skewed time (DIRSERV:171.25.193.9:80): It seems that our clock is behind by 16 hours, 58 minutes, or that theirs is ahead. Tor requires an accurate clock to work: please check your time, timezone, and date settings.
May 14 19:02:14 txt Tor[3576]: Problem bootstrapping. Stuck at 25%: Loading networkstatus consensus. (Clock skew -61113 in directory from DIRSERV; CLOCK_SKEW; count 17; recommendation warn; host ? at 171.25.193.9:80)
May 14 19:02:21 txt Tor[3576]: Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
May 14 19:02:21 txt Tor[3576]: Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
May 14 19:02:21 txt Tor[3576]: Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
May 14 19:02:28 txt Tor[3576]: Received NETINFO cell with skewed time (OR:128.31.0.34:9101): It seems that our clock is behind by 16 hours, 58 minutes, or that theirs is ahead. Tor requires an accurate clock to work: please check your time, timezone, and date settings.
May 14 19:02:28 txt Tor[3576]: Problem bootstrapping. Stuck at 25%: Loading networkstatus consensus. (Clock skew -61112 in NETINFO cell from OR; CLOCK_SKEW; count 18; recommendation warn; host 9695DFC35FFEB861329B9F1AB04C46397020CE31 at 128.31.0.34:9101)
May 15 12:01:08 txt Tor[3576]: Tried for 61127 seconds to get a connection to [scrubbed]:443. Giving up.
May 15 12:01:08 txt Tor[3576]: Tried for 61127 seconds to get a connection to [scrubbed]:443. Giving up.
May 15 12:01:08 txt Tor[3576]: Your system clock just jumped 61114 seconds forward; assuming established circuits no longer work.
May 15 12:01:08 txt Tor[3576]: Bootstrapped 80%: Connecting to the Tor network
May 15 12:01:08 txt Tor[3576]: Bootstrapped 90%: Establishing a Tor circuit
May 15 12:01:08 txt Tor[3576]: Tor has successfully opened a circuit. Looks like client functionality is working.
May 15 12:01:08 txt Tor[3576]: Bootstrapped 100%: Done
May 15 12:01:08 txt Tor[3576]: tor_bug_occurred_(): Bug: ../src/or/hs_service.c:1952: should_rotate_descriptors: Non-fatal assertion !(service->desc_current == NULL || service->desc_next == NULL) failed. (on Tor 0.3.3.5-rc )
May 15 12:01:08 txt Tor[3576]: Bug: Non-fatal assertion !(service->desc_current == NULL || service->desc_next == NULL) failed in should_rotate_descriptors at ../src/or/hs_service.c:1952. Stack trace: (on Tor 0.3.3.5-rc )
May 15 12:01:08 txt Tor[3576]: Bug:     /usr/bin/tor(log_backtrace+0x44) [0x55da645ac3f4] (on Tor 0.3.3.5-rc )
May 15 12:01:08 txt Tor[3576]: Bug:     /usr/bin/tor(tor_bug_occurred_+0xb9) [0x55da645c7c29] (on Tor 0.3.3.5-rc )
May 15 12:01:08 txt Tor[3576]: Bug:     /usr/bin/tor(hs_service_run_scheduled_events+0x887) [0x55da6459aa77] (on Tor 0.3.3.5-rc )
May 15 12:01:08 txt Tor[3576]: Bug:     /usr/bin/tor(+0x505e1) [0x55da644735e1] (on Tor 0.3.3.5-rc )
May 15 12:01:08 txt Tor[3576]: Bug:     /usr/bin/tor(+0x70df0) [0x55da64493df0] (on Tor 0.3.3.5-rc )
May 15 12:01:08 txt Tor[3576]: Bug:     /usr/lib/x86_64-linux-gnu/libevent-2.0.so.5(event_base_loop+0x6a0) [0x7f5ba944c5a0] (on Tor 0.3.3.5-rc )
May 15 12:01:08 txt Tor[3576]: Bug:     /usr/bin/tor(do_main_loop+0x2bc) [0x55da644777ec] (on Tor 0.3.3.5-rc )
May 15 12:01:08 txt Tor[3576]: Bug:     /usr/bin/tor(tor_run_main+0x275) [0x55da64478ef5] (on Tor 0.3.3.5-rc )
May 15 12:01:08 txt Tor[3576]: Bug:     /usr/bin/tor(tor_main+0x3a) [0x55da6447233a] (on Tor 0.3.3.5-rc )
May 15 12:01:08 txt Tor[3576]: Bug:     /usr/bin/tor(main+0x19) [0x55da644720a9] (on Tor 0.3.3.5-rc )
May 15 12:01:08 txt Tor[3576]: Bug:     /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1) [0x7f5ba7ca82e1] (on Tor 0.3.3.5-rc )
May 15 12:01:08 txt Tor[3576]: Bug:     /usr/bin/tor(_start+0x2a) [0x55da644720fa] (on Tor 0.3.3.5-rc )
May 15 12:01:08 txt Tor[3576]: should_rotate_descriptors(): Bug: Service descriptor is NULL ((nil)/(nil)). Next rotation time is 1526367600 (now: 1526378468). Valid after time from consensus is 1526378400 (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: tor_bug_occurred_(): Bug: ../src/or/hs_common.c:1294: node_has_hsdir_index: Non-fatal assertion !(tor_mem_is_zero((const char*)node->hsdir_index->fetch, DIGEST256_LEN)) failed. (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug: Non-fatal assertion !(tor_mem_is_zero((const char*)node->hsdir_index->fetch, DIGEST256_LEN)) failed in node_has_hsdir_index at ../src/or/hs_common.c:1294. Stack trace: (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug:     /usr/bin/tor(log_backtrace+0x44) [0x55da645ac3f4] (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug:     /usr/bin/tor(tor_bug_occurred_+0xb9) [0x55da645c7c29] (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug:     /usr/bin/tor(hs_get_responsible_hsdirs+0x396) [0x55da6458d236] (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug:     /usr/bin/tor(hs_service_run_scheduled_events+0x1a44) [0x55da6459bc34] (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug:     /usr/bin/tor(+0x505e1) [0x55da644735e1] (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug:     /usr/bin/tor(+0x70df0) [0x55da64493df0] (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug:     /usr/lib/x86_64-linux-gnu/libevent-2.0.so.5(event_base_loop+0x6a0) [0x7f5ba944c5a0] (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug:     /usr/bin/tor(do_main_loop+0x2bc) [0x55da644777ec] (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug:     /usr/bin/tor(tor_run_main+0x275) [0x55da64478ef5] (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug:     /usr/bin/tor(tor_main+0x3a) [0x55da6447233a] (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug:     /usr/bin/tor(main+0x19) [0x55da644720a9] (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug:     /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1) [0x7f5ba7ca82e1] (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug:     /usr/bin/tor(_start+0x2a) [0x55da644720fa] (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: tor_bug_occurred_(): Bug: ../src/or/hs_common.c:1294: node_has_hsdir_index: Non-fatal assertion !(tor_mem_is_zero((const char*)node->hsdir_index->fetch, DIGEST256_LEN)) failed. (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug: Non-fatal assertion !(tor_mem_is_zero((const char*)node->hsdir_index->fetch, DIGEST256_LEN)) failed in node_has_hsdir_index at ../src/or/hs_common.c:1294. Stack trace: (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug:     /usr/bin/tor(log_backtrace+0x44) [0x55da645ac3f4] (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug:     /usr/bin/tor(tor_bug_occurred_+0xb9) [0x55da645c7c29] (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug:     /usr/bin/tor(hs_get_responsible_hsdirs+0x396) [0x55da6458d236] (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug:     /usr/bin/tor(hs_service_run_scheduled_events+0x1a44) [0x55da6459bc34] (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug:     /usr/bin/tor(+0x505e1) [0x55da644735e1] (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug:     /usr/bin/tor(+0x70df0) [0x55da64493df0] (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug:     /usr/lib/x86_64-linux-gnu/libevent-2.0.so.5(event_base_loop+0x6a0) [0x7f5ba944c5a0] (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug:     /usr/bin/tor(do_main_loop+0x2bc) [0x55da644777ec] (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug:     /usr/bin/tor(tor_run_main+0x275) [0x55da64478ef5] (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug:     /usr/bin/tor(tor_main+0x3a) [0x55da6447233a] (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug:     /usr/bin/tor(main+0x19) [0x55da644720a9] (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug:     /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1) [0x7f5ba7ca82e1] (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug:     /usr/bin/tor(_start+0x2a) [0x55da644720fa] (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: tor_bug_occurred_(): Bug: ../src/or/hs_common.c:1294: node_has_hsdir_index: Non-fatal assertion !(tor_mem_is_zero((const char*)node->hsdir_index->fetch, DIGEST256_LEN)) failed. (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug: Non-fatal assertion !(tor_mem_is_zero((const char*)node->hsdir_index->fetch, DIGEST256_LEN)) failed in node_has_hsdir_index at ../src/or/hs_common.c:1294. Stack trace: (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug:     /usr/bin/tor(log_backtrace+0x44) [0x55da645ac3f4] (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug:     /usr/bin/tor(tor_bug_occurred_+0xb9) [0x55da645c7c29] (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug:     /usr/bin/tor(hs_get_responsible_hsdirs+0x396) [0x55da6458d236] (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug:     /usr/bin/tor(hs_service_run_scheduled_events+0x1a44) [0x55da6459bc34] (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug:     /usr/bin/tor(+0x505e1) [0x55da644735e1] (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug:     /usr/bin/tor(+0x70df0) [0x55da64493df0] (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug:     /usr/lib/x86_64-linux-gnu/libevent-2.0.so.5(event_base_loop+0x6a0) [0x7f5ba944c5a0] (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug:     /usr/bin/tor(do_main_loop+0x2bc) [0x55da644777ec] (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug:     /usr/bin/tor(tor_run_main+0x275) [0x55da64478ef5] (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug:     /usr/bin/tor(tor_main+0x3a) [0x55da6447233a] (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug:     /usr/bin/tor(main+0x19) [0x55da644720a9] (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug:     /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1) [0x7f5ba7ca82e1] (on Tor 0.3.3.5-rc )
May 15 12:01:10 txt Tor[3576]: Bug:     /usr/bin/tor(_start+0x2a) [0x55da644720fa] (on Tor 0.3.3.5-rc )

The log shows many more recurrences of tor_bug_occurred_(); I only copied the first few. The last recurrence was 5 seconds later.

comment:9 Changed 2 months ago by nickm

Milestone: Tor: unspecified → Tor: 0.3.4.x-final

comment:10 Changed 2 months ago by dgoulet

Parent ID: #25955
Status: needs_information → accepted

comment:11 Changed 2 months ago by asn

Thanks for the logs, cypherpunks from comment:8.

Seems like we just triggered the debugging code we added in #25306. See:

should_rotate_descriptors(): Bug: Service descriptor is NULL ((nil)/(nil)). Next rotation time is 1526367600 (now: 1526378468). Valid after time from consensus is 1526378400 (on Tor 0.3.3.5-rc )

Pretty fun that we end up with both descriptors as NULL.

This seems not so hard to reproduce: start up an HS with a hella skewed clock, and then set the clock to the right time so that it gets the right consensus.

Update: yes indeed, I managed to trivially repro this by setting the system time back with date --set='4 days ago', letting Tor try to bootstrap, and then after a bit setting the clock to the right time.

Last edited 2 months ago by asn (previous) (diff)

comment:12 Changed 2 months ago by asn

OK I went deep and debugged this.

The basic problem is that upon fixing the clock we end up with next_rotation_time in the past, whereas normally it should be in the future like the comment from #25306 suggests:

     * So, when tor starts, the next rotation time is set to the start of the
     * next SRV period using the consensus valid after time so it should
     * always be set to a future time value.

The root problem lies elsewhere though: in the voting schedule calculation logic.

Here are the bug steps:

1) We start up with a skewed clock. We are fetching a consensus from the future which we don't consider live, and hence we don't build any HS descs.

2) However, when we fetch the consensus we also call voting_schedule_recalculate_timing() which calls get_voting_schedule() with a skewed approx_time(). This means that get_voting_schedule() calculates new_voting_schedule->interval_starts with a skewed clock, and the result is also skewed.

3) Then we fix up the clock. We now start trusting the consensus again, but we never call get_voting_schedule() again because it's only called when receiving a fetched consensus and we've already done that... This means that we remain with a skewed interval_starts schedule.

4) Then the HS service code calls set_rotation_time() which uses the skewed interval_starts through sr_state_get_start_time_of_current_protocol_run() -> get_start_time_of_current_round() -> voting_schedule_get_next_valid_after_time() and ends up with a skewed next_rotation_time.

5) Then we go into should_rotate_descriptors() with the basic assumption defeated (we just started up but next_rotation_time is in the past) which causes the BUG because we needlessly try to rotate the descriptors.

comment:13 Changed 2 months ago by dgoulet

Owner: changed from dgoulet to asn
Reviewer: dgoulet
Status: accepted → assigned

comment:14 Changed 2 months ago by asn

Status: assigned → needs_review

OK, I pushed a branch called bug24977_dev in my repo with a fix to this issue!

Seems like the issue is not just about hsv3: it's a general logic bug where Tor does not recompute consensus information after a clock skew gets fixed. Basically, we call networkstatus_set_current_consensus() with a skewed clock, and we never call it again when the clock gets fixed. The bug is mainly visible for v3 because that's the main subsystem that uses the current time together with consensus information, but its effects could potentially ripple elsewhere too.

You can reproduce the original bug by doing date --set='4 days ago', launching tor, letting it get stuck bootstrapping and then setting the right date again. The patch above should fix the asserts.

My fix above only does the bare minimum to fix the bug for HSv3; it does not attempt to recompute all the consensus information that could depend on the current time, because that would require a much bigger refactor: networkstatus_set_current_consensus() is quite complicated. In the future we might want to call some more functions after detecting a clock skew, like reschedule_dirvote() and update_consensus_networkstatus_fetch_time().

comment:15 Changed 2 months ago by dgoulet

Status: needs_review → needs_revision

I think ideally we would like to do this if those two conditions are met:

  1. Clock has jumped (at least the discrepancy from the code of 2+ sec)
  2. If we transition from "not live" to "live"

Recomputing the voting timings and the HSDir index only makes sense if we previously had a consensus that we didn't consider live, and it became live after the clock jump.

I'm not sure how easily we can achieve (2) here. Probably something like two flags, where one says "previous consensus was not live" and a second says "the whole recalculation of a new live consensus needs to happen"... It becomes a bit non-trivial to correctly set those with our code flow...

The other option is to keep the valid_after time of the consensus that was used for the HSDir index computation. If that time is 0 (unset) or smaller than the one in our *live* consensus, and we detect the clock jump, we should readjust.

The whole point of the above, I think, is to avoid recomputing only on a >= 100-second clock jump, since edge cases appear if we correct by, for instance, only 10 seconds.

comment:16 in reply to:  15 Changed 2 months ago by asn

Replying to dgoulet:

> I think ideally we would like to do this if those two conditions are met:
>
>   1. Clock has jumped (at least the discrepancy from the code of 2+ sec)
>   2. If we transition from "not live" to "live"
>
> Recomputing the voting timings and the HSDir index only makes sense if we previously had a consensus that we didn't consider live, and it became live after the clock jump.
>
> I'm not sure how easily we can achieve (2) here. Probably something like two flags, where one says "previous consensus was not live" and a second says "the whole recalculation of a new live consensus needs to happen"... It becomes a bit non-trivial to correctly set those with our code flow...

Yeah the code flow is not very welcoming to this flag approach, because no particular event happens when we move from non-live to live, and we would probably have to check every second for the transition; otherwise if we miss the transition even for a second this bug can occur again I think. I wonder how to do this cleanly...

> The other option is to possibly keep the valid_after time of the consensus that was used for the HSDir index computation. If that time is 0 (unset) or smaller than the one we have in our *live* consensus and we detect the clock jump, we should readjust.

Hmm, I kinda get the idea here, but it seems a bit dirty to rely on whether HSDir index is set, to also readjust voting schedule etc. I wonder what else we could rely on here...

> The whole point of the above I think is to avoid only recomputing with the >= 100 seconds clock jump since edge cases appear if for instance we correct for only 10 seconds.

Hmm, I get the theoretical issue here, but it seems ultra unlikely we will miss the transition for 10 seconds skew, given the way consensus documents are propagated on the network. Because this can only happen if Alice fetches a consensus with valid-after 12:00 at 12:00:05, but her system time is skewed at 11:59:56. It's very unlikely that Alice will fetch the consensus so early in its lifetime and also have so tiny skewed clock, but I guess the theoretical issue is there.

Need to think of how to do this cleanly and without too much overhead.

comment:17 Changed 2 months ago by dgoulet

Indeed... that is true. Probably the current approach is fine and should cover the majority of issues. And ultimately it will get fixed after a while at the next consensus.

Detecting non-live to live is a PITA... I think our best bet is probably what you did which is to readjust everything on clock jump regardless of what we think we have or not. Safe.

comment:18 Changed 2 months ago by dgoulet

Status: needs_revision → merge_ready

Now I'm curious what nickm thinks. Time to merge_ready.

comment:19 Changed 8 weeks ago by nickm

I think this might actually be a bugfix on 0.3.4, not on 0.3.2: 0.3.4 changed the way that we call the dirvote calculation code when we did #25937. Could that be the cause of this?

As for the patch itself: Is it possible for us to have this recalculation get _scheduled_ in update_current_time, but actually executed inside a mainloop_event? Or does it hurt us to have other stuff called in between those points? The problem there is that update_current_time already does too much: I'd rather keep our callgraph simple, and make tiny-fast functions like this _never_ have a slow-complex case.

comment:20 in reply to:  19 ; Changed 8 weeks ago by asn

Status: merge_ready → needs_revision

Replying to nickm:

> I think this might actually be a bugfix on 0.3.4, not on 0.3.2: 0.3.4 changed the way that we call the dirvote calculation code when we did #25937. Could that be the cause of this?

Hmm, I don't think so. Because also the backtrace from comment:8 is based on Tor 0.3.3.5-rc and represents the bug fixed by the branch. I'll look into this anyway.

> As for the patch itself: Is it possible for us to have this recalculation get _scheduled_ in update_current_time, but actually executed inside a mainloop_event? Or does it hurt us to have other stuff called in between those points? The problem there is that update_current_time already does too much: I'd rather keep our callgraph simple, and make tiny-fast functions like this _never_ have a slow-complex case.

Yes, I think this is possible.

I will try to make it so that update_current_time() sets a flag if needed, and then we actually do the computations as part of the HS subsystem.
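As a rough illustration of that split between detection and recomputation, here is a sketch in plain C. The function names are hypothetical stand-ins, not tor's actual functions; `NUM_JUMPED_SECONDS_BEFORE_WARN` mirrors the 100-second threshold discussed above:

```c
#include <assert.h>

/* Threshold for treating elapsed time as a clock jump (assumed value). */
#define NUM_JUMPED_SECONDS_BEFORE_WARN 100

/* Flag raised by the time-update path, consumed by the HS subsystem. */
static int consensus_needs_recalc = 0;

/* Sketch of update_current_time(): stays tiny and only raises a flag
 * when the clock went backward or jumped far forward. */
static void
update_current_time_sketch(long seconds_elapsed)
{
  if (seconds_elapsed < 0 ||
      seconds_elapsed >= NUM_JUMPED_SECONDS_BEFORE_WARN)
    consensus_needs_recalc = 1;
}

/* Sketch of the HS subsystem's scheduled events: the heavy
 * recomputation happens here, not in the time-update path.
 * Returns 1 if a recalculation was performed. */
static int
hs_scheduled_events_sketch(void)
{
  if (consensus_needs_recalc) {
    consensus_needs_recalc = 0;
    /* ... recompute HSDir indexes, voting timings, etc. ... */
    return 1;
  }
  return 0;
}
```

The design choice here is exactly the one nickm asks for: the per-second path does O(1) work, and the slow-complex case only ever runs inside the subsystem's own callback.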

comment:21 in reply to:  20 Changed 8 weeks ago by asn

Status: needs_revision → merge_ready

Replying to asn:

> Replying to nickm:
>
> > I think this might actually be a bugfix on 0.3.4, not on 0.3.2: 0.3.4 changed the way that we call the dirvote calculation code when we did #25937. Could that be the cause of this?
>
> Hmm, I don't think so. Because also the backtrace from comment:8 is based on Tor 0.3.3.5-rc and represents the bug fixed by the branch. I'll look into this anyway.

Hm. I checked the branch from #25937 and I don't think it's related to this bug. This bug has been around forever, but it only started being visible now because we require a live consensus for some HSv3 functions to work properly. Hence, when we fetch a consensus with a skewed clock and wrongly consider it non-live, the relevant HSv3 functions also misbehave. Our patch now basically re-runs those functions after the clock gets fixed.

> > As for the patch itself: Is it possible for us to have this recalculation get _scheduled_ in update_current_time, but actually executed inside a mainloop_event? Or does it hurt us to have other stuff called in between those points? The problem there is that update_current_time already does too much: I'd rather keep our callgraph simple, and make tiny-fast functions like this _never_ have a slow-complex case.
>
> Yes, I think this is possible.
>
> I will try to make it so that update_current_time() sets a flag if needed, and then we actually do the computations as part of the HS subsystem.

I implemented the above approach in my branch bug24977. You can also see the relevant PR here: https://github.com/torproject/tor/pull/124

Please compare the approach of bug24977 with the one from bug24977_dev. I think the one using callbacks indeed looks better.

Not sure if this bug fix warrants a backport, it seems to be quite rare because it requires a skewed clock to occur. I'd vote for no, but let me know if you have a different opinion and I can prepare branches.

comment:22 Changed 7 weeks ago by nickm

This looks better, but it adds a once-a-second event at the same time when we're trying to remove those.

Have a look at bug24977_again in my public repo; what do you think of that?

comment:23 Changed 7 weeks ago by dgoulet

Hmmmm this new branch adds a call to the recalculate consensus data function to the hs_service callback... it makes me think that we might have the wrong approach. Here is why with the hs service callback example:

So in the case of not having a live consensus because of a skewed clock, having the recalculation in a mainloop event means that the hs_service callback can be called _before_ it. Which seems to be the reason why this was added to the hs service callback:

+  if (need_to_recalculate_consensus_data)
+    recalculate_consensus_data_callback(NULL, NULL);

Because we recalculate the data in an event loop now, it means other events can trigger the assert() on the hsdir indexes because they would see the consensus being live but the recalculate consensus callback was not yet triggered. For instance, a client request to a .onion will trigger the assert() if done before the recalculation happened because it would consider the consensus live.

In other words, there is a time gap between "clock jump, now I have a live consensus" and "recalculate consensus data", so we'll end up in some desynched state imo. HS is just one big example; there might be more.

The earlier patch from asn did recalculate everything on the spot when a clock jump was noticed.

comment:24 in reply to:  23 ; Changed 7 weeks ago by asn

Replying to dgoulet:

> Because we recalculate the data in an event loop now, it means other events can trigger the assert() on the hsdir indexes because they would see the consensus being live but the recalculate consensus callback was not yet triggered. For instance, a client request to a .onion will trigger the assert() if done before the recalculation happened because it would consider the consensus live.
>
> In other words, there is a time gap between "clock jump, now I have a live consensus" and "recalculate consensus data", so we'll end up in some desynched state imo. HS is just one big example; there might be more.

Hm, I guess the HS-client example above is indeed a plausible (but unlikely) edge-case. It's worth noting that the example you cited above applies only to the HS client case, whereas I think we covered it for HS services since services basically function based on those callbacks. Of course, there might be other cases around the codebase, which we are not aware of right now ("if a tree falls in a forest and no one is around to hear it...").

Some obvious approaches:

a) Fix all the cases by recomputing data immediately when a clock jump is noticed (as per my first branch). This has the negative effect of giving variable time duration to the callback as Nick pointed out.
b) Fix the HS client edge-case dgoulet pointed out, by calling recalculate_consensus_data_callback() when an HS client tries to connect to an HS.
c) Merge Nick's branch and leave the edge-cases open. In the future, we can fix those too as we understand more things about this problem.
d) Do more research, try to understand the problem more, and figure out an even better solution if possible...

comment:25 in reply to:  24 Changed 6 weeks ago by dgoulet

Replying to asn:

> Hm, I guess the HS-client example above is indeed a plausible (but unlikely) edge-case. It's worth noting that the example you cited above applies only to the HS client case, whereas I think we covered it for HS services since services basically function based on those callbacks. Of course, there might be other cases around the codebase, which we are not aware of right now ("if a tree falls in a forest and no one is around to hear it...").

Yes, my worry is mostly about other possible issues this affects. And making it a special case for the hs service callback to recalculate seems like the wrong approach...

> a) Fix all the cases by recomputing data immediately when a clock jump is noticed (as per my first branch). This has the negative effect of giving variable time duration to the callback as Nick pointed out.
> b) Fix the HS client edge-case dgoulet pointed out, by calling recalculate_consensus_data_callback() when an HS client tries to connect to an HS.
> c) Merge Nick's branch and leave the edge-cases open. In the future, we can fix those too as we understand more things about this problem.
> d) Do more research, try to understand the problem more, and figure out an even better solution if possible...

Not the easiest call here but I think (a) is what needs to be done even though I don't like much doing extra work in the update time callback.......

The reason is that (b) is *one* possible issue among possibly others. Special cases where we directly call the callback make me think we are piling onto a house of cards. The problem with (c) is that this issue took us quite a bit of time to finally find, and I bet if something outside of the HS subsystem is affected by this, it will be just as hard to find out. And (d), I doubt it is needed; we have a good grasp on the problem imo.

comment:26 Changed 6 weeks ago by dgoulet

Status: merge_ready → needs_revision

I think we are still unsure here on how to do it properly...

comment:27 Changed 6 weeks ago by asn

Status: needs_revision → needs_review

OK people, in branch bug24977_final you can see the approach we discussed where we use the consensus valid-after to detect outdated state and then fix it on-demand:

https://github.com/torproject/tor/pull/146

I didn't do the (1) part where we add a timer when we receive a non-live consensus that triggers when it becomes live, because I figured out that the on-demand approach should cover that too for the HSv3 use case. Let me know if you'd like to see that too, in case it helps with the non-HSv3 use cases. Perhaps we can have that as a ticket for the future?

BTW, I chose the nodelist to carry the valid_after timestamp, since that's the data structure that basically carries the HSDir indices. I didn't want to add it to each individual node_t because that would be a bit weird.
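A rough sketch of that on-demand repair pattern, in plain C. The struct and function names here are hypothetical illustrations, not the real nodelist code; the idea is just to check the recorded valid_after right before the HSDir indices are used and rebuild if it is unset or stale:

```c
#include <assert.h>
#include <time.h>

/* Hypothetical stand-in for the nodelist-level bookkeeping: remember
 * the valid_after of the consensus the HSDir indexes were built from. */
struct nodelist_sketch {
  time_t indexes_built_from; /* 0 means "never built" */
  int rebuild_count;         /* counter, for illustration only */
};

/* Called on demand, right before the HSDir indexes are consulted. */
static void
ensure_hsdir_indexes_fresh(struct nodelist_sketch *nl,
                           time_t live_valid_after)
{
  if (nl->indexes_built_from == 0 ||
      nl->indexes_built_from < live_valid_after) {
    /* ... rebuild the HSDir indexes against the live consensus ... */
    nl->rebuild_count++;
    nl->indexes_built_from = live_valid_after;
  }
}
```

With this shape, a clock jump needs no special event at all: the first HS operation after the consensus becomes live notices the stale timestamp and repairs itself.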


Please also see commit 2520ee34c which fixes another client-side case which could produce this bug and stack trace. I left more details in the commit message.

Let me know if you have questions or you'd like to see things done differently. I think this is still doable for 034.

comment:28 Changed 6 weeks ago by dgoulet

Status: needs_review → needs_revision

Commented on PR.

comment:29 Changed 6 weeks ago by asn

Status: needs_revision → needs_review

Addressed dgoulet's review.

comment:30 Changed 5 weeks ago by dgoulet

Status: needs_review → merge_ready

Ok, I think we can go in autosquash mode and merge this to master!

comment:31 Changed 5 weeks ago by nickm

To master or to 0.3.4?

comment:32 Changed 5 weeks ago by nickm

Resolution: fixed
Status: merge_ready → closed

Going to assume 0.3.4.

Looking at the branch, it seems plausible to me, and I don't see anything frightening.

Merged to 0.3.4 and forward!

Note: See TracTickets for help on using tickets.