Opened 8 months ago

Closed 2 months ago

Last modified 2 months ago

#28223 closed defect (fixed)

Unparseable microdescriptor on public relay

Reported by: teor Owned by: nickm
Priority: Medium Milestone: Tor: 0.4.0.x-final
Component: Core Tor/Tor Version:
Severity: Normal Keywords: 040-roadmap-proposed, regression?, 035-can, postfreeze-ok, 040-deferred-20190220, asn-merge
Cc: Actual Points: .2
Parent ID: Points: .2
Reviewer: teor Sponsor:

Description

A relay operator reported this error:

Get this at my exit relay since yesterday:

# head /tmp/warn.log
Oct 23 23:30:17.000 [notice] Tor 0.3.5.3-alpha opening new log file.
Oct 23 23:30:33.000 [warn] parse error: internal NUL character.
Oct 23 23:30:33.000 [warn] Unparseable microdescriptor
Oct 23 23:30:33.000 [warn] parse error: internal NUL character.
Oct 23 23:30:33.000 [warn] Unparseable microdescriptor

even with log level "debug" there seems to be no more information.
Anything I should worry about?

https://lists.torproject.org/pipermail/tor-relays/2018-October/date.html

Child Tickets

Attachments (1)

cached-microdescs.new.xz (551.8 KB) - added by madage 3 months ago.
cached-microdescriptors file with nul bytes filled relay info

Download all attachments as: .zip

Change History (28)

comment:1 Changed 8 months ago by teor

Keywords: 035-must added
Priority: MediumHigh

We must triage this issue in 0.3.5.

comment:2 Changed 8 months ago by nickm

"internal nul character" sounds like it's getting junk in its microdescriptors, either from a messed up directory source, a messed up cache file, maybe a bug in fetching and uncompressing microdescriptors.

It would be good to improve the log messages here, and maybe we'll get a better idea what is happening.

comment:3 Changed 7 months ago by nickm

Keywords: 035-can added; 035-must removed

comment:4 Changed 7 months ago by teor

Keywords: 040-roadmap-proposed added; 035-roadmap-proposed removed
Milestone: Tor: 0.3.5.x-finalTor: 0.4.0.x-final

This is a ticket for better logging, it can go in 0.4.0.

comment:5 Changed 5 months ago by nickm

Keywords: postfreeze-ok added

Mark some tickets as postfreeze-ok, to indicate that I think they are okay to accept in 0.4.0 post-freeze. Does not indicate that they are all necessary to do postfreeze.

comment:6 Changed 4 months ago by nickm

Keywords: 040-deferred-20190220 added
Milestone: Tor: 0.4.0.x-finalTor: unspecified

Deferring 51 tickets from 0.4.0.x-final. Tagging them with 040-deferred-20190220 for visibility. These are the tickets that did not get 040-must, 040-can, or tor-ci.

comment:7 Changed 3 months ago by teor

Keywords: 040-must added
Milestone: Tor: unspecifiedTor: 0.4.0.x-final

A user on IRC reported this bug in 0.4.0.3-alpha.

We should re-triage this ticket in 0.4.0.

comment:8 Changed 3 months ago by teor

Owner: set to mikeperry
Status: newassigned

Assigning to Mike to add some logging, because everyone else has been assigned more than two 040-must tickets.

Changed 3 months ago by madage

Attachment: cached-microdescs.new.xz added

cached-microdescriptors file with nul bytes filled relay info

comment:9 Changed 3 months ago by madage

Attached my cached-microdescriptors file where the bug happened. Yesterday the cache must have been rotated as I am not seeing the warnings anymore.

comment:10 Changed 3 months ago by teor

Owner: mikeperry deleted

Un-assigning 040-must bugs from Mike, because he's overloaded.
We'll work out what to do with these tickets at the meeting.

comment:11 Changed 3 months ago by nickm

Owner: set to nickm
Status: assignedaccepted

comment:12 Changed 3 months ago by nickm

Okay, let's try to figure this out. The nuls are on line 16158, between the end of one md and the start of the annotations for the next. There is a 41-minute gap between the @last-listed times of the two microdescs.

The NULs fill the file between position 0xd8546 and 0xd9000 inclusive. That last figure suggests to me that we're maybe doing an improper seek.

Looking at the code in microdesc.c, if we're doing an improper seek, I suspect the following places:

  • tor_fd_getpos(), which calls lseek to find the current position in the file. This doesn't seem too likely to me, since it happens between writing the annotations and writing the descriptor.
  • The call from start_writing_to_file() to tor_fd_seekend(). This looks correct to me, alas.
  • The call to for_fd_setpos() in microdesc_cache_rebuild(). I have trouble believing this call is the cause of the problem, since it happens when we're rebuilding the main cache, and this bug is in the journal.

Now, here is another thought. Is this happening as we write to disk, or at some earlier phase? That is, this bug would also make sense if we had something padding microdesc bodies with nuls before or after we downloaded them.

Last edited 3 months ago by nickm (previous) (diff)

comment:13 Changed 3 months ago by nickm

I wonder if any other programs are looking at these files too. Does arm poke them at all? Is there any other tool in use?

comment:14 Changed 3 months ago by nickm

Keywords: 040-must removed
Priority: HighMedium

I would also love to know what platform this has happening on. I would _guess_ it's a reasonably normal linux, but I haven't heard one way or another.

I've written a diagnostic patch as bug28223_diagnostic_035; I think we should merge it in 0.4.0 but not backport. It also includes a couple of extra #includes, though I bet they won't be needed. There is a PR at https://github.com/torproject/tor/pull/891 .

Going to remove this from "040-must" since it is not actually harmful, and does not appear to be new in 040.

comment:15 Changed 3 months ago by nickm

Status: acceptedneeds_review

comment:16 Changed 3 months ago by nickm

Actual Points: .2

comment:17 Changed 3 months ago by toralf

This is a stable hardened Gentoo Linux server.

comment:18 Changed 3 months ago by madage

This bug happened on a Linux 32bit machine with self compiled kernel and tor. Tor was built without libseccomp/sandbox support. libc is glibc 2.28.

After nick brought the 41min gap between last-listed relay times to my attention, I thought it was related to a power failure that seems to have happened at that time. kern.log has a last entry about 9h10 and a reboot at 9h48. However, after a close look at tor notice.log, I see that tor itself was restarted at 9h10 after not being able to get a rendevous descriptor and it finished bootstraping at 9h11. After that tor was restarted again at 9h49.

I should mention that I was not the user who first reported this to teor months back, but maybe (s)he had a power failure/tor restart while fetching the microdescriptors as well?

comment:19 Changed 3 months ago by teor

Points: .2

comment:20 Changed 2 months ago by asn

Reviewer: teor

comment:21 Changed 2 months ago by teor

Status: needs_reviewneeds_revision

CI failed on Windows, it looks like there's some duplicate code, and we could log some more useful information about the context of the NUL.

comment:22 Changed 2 months ago by nickm

Status: needs_revisionneeds_review

I've tried to do the requested changes. I couldn't log an identifier for the MD, since a lot of the code that calls warn_if_nul has no concept of individual microdescs, and the code that does have a concept of them doesn't know anything about MD identity until after the point when it's called.

comment:23 Changed 2 months ago by teor

Status: needs_reviewmerge_ready

Hexencode isn't very readable, but it's better than nothing.

Thanks!

comment:24 Changed 2 months ago by nickm

Keywords: asn-merge added

comment:25 Changed 2 months ago by asn

Resolution: fixed
Status: merge_readyclosed

Merged to 040 and onwards.

comment:26 Changed 2 months ago by teor

I pushed an extra commit bffba9d26 to master to accept the following practracker change:

-problem function-size /src/feature/dirparse/microdesc_parse.c:microdescs_parse_from_string() 154
+problem function-size /src/feature/dirparse/microdesc_parse.c:microdescs_parse_from_string() 169

Let's work out how to solve this merge-forward and practracker fails issue.
Requiring a master merge for each backport is one solution (#29881).

comment:27 Changed 2 months ago by teor

Actually, let's get #30051 done, so we do practracker as a pre-push hook.
Once all the mergers have the pre-push hook, we won't fail CI master any more.

Note: See TracTickets for help on using tickets.