Opened 6 months ago

Last modified 2 weeks ago

#30901 assigned enhancement

Add control port trace logging to tor

Reported by: teor Owned by: teor
Priority: Medium Milestone: Tor: 0.4.3.x-final
Component: Core Tor/Tor Version:
Severity: Normal Keywords: tor-ci-fail-sometimes
Cc: nickm, catalyst, teor Actual Points: 1.8
Parent ID: #29437 Points: 1
Reviewer: nickm Sponsor:

Description

To diagnose #30235, we need to be able to see what tor is sending and receiving on the control port.

I don't think that tor has any logging code for control port messages: I had a quick look, and there wasn't anything obvious.

Showing each message at info level could be good, but it might also be risky?
Maybe we should safe_string_client() the actual message text.

nickm and catalyst are refactoring the controller code right now, so they might know where I should add the logging statements.

Child Tickets

TicketTypeStatusOwnerSummary
#30700defectreopenedatagarTor's Travis stem job shows debug logs from 10 minutes after the hang
#31319enhancementclosedteorDocument logs that are excluded from the control port log event
#31359enhancementclosedteorSplit a control_event module out of control, and make it use control safe logging
#31854defectclosedteorIn tests and log.c, stop using ~0 a log domain mask

Change History (24)

comment:1 Changed 6 months ago by nickm

This sounds like debug to me, not info. We could use the LD_CONTROL domain to select these messages if we need to.

Intercepting incoming commands should be pretty easy -- the right place to do that is probably somewhere around handle_control_command.

For outgoing responses and events, I'd like catalyst's POV: they are actively refactoring that code right now, and probably have a better idea of how to hook it than I do.

comment:2 in reply to:  1 ; Changed 5 months ago by teor

Hi catalyst,

I'd like to make progress on this ticket today or tomorrow:

Replying to nickm:

For outgoing responses and events, I'd like catalyst's POV: they are actively refactoring that code right now, and probably have a better idea of how to hook it than I do.

Here's one way we could move forward:

I write a patch, and then you refactor it so that it fits in the new control code.

Let me know if you'd like me to do it another way.

comment:3 in reply to:  2 ; Changed 5 months ago by catalyst

Replying to teor:

Hi catalyst,

I'd like to make progress on this ticket today or tomorrow:

Replying to nickm:

For outgoing responses and events, I'd like catalyst's POV: they are actively refactoring that code right now, and probably have a better idea of how to hook it than I do.

Here's one way we could move forward:

I write a patch, and then you refactor it so that it fits in the new control code.

Let me know if you'd like me to do it another way.

If you're OK with hooking control_write_reply() and missing the set of control protocol writers that don't yet funnel through there, that might be best. I think after #30889 is merged, the remaining control protocol writers that don't go through that are the async control events, GETINFO, and GETCONF. (I'd have to check again to be more sure.)

comment:4 in reply to:  3 Changed 5 months ago by catalyst

Replying to catalyst:

Replying to teor:

Hi catalyst,

I'd like to make progress on this ticket today or tomorrow:

Replying to nickm:

For outgoing responses and events, I'd like catalyst's POV: they are actively refactoring that code right now, and probably have a better idea of how to hook it than I do.

Here's one way we could move forward:

I write a patch, and then you refactor it so that it fits in the new control code.

Let me know if you'd like me to do it another way.

If you're OK with hooking control_write_reply() and missing the set of control protocol writers that don't yet funnel through there, that might be best. I think after #30889 is merged, the remaining control protocol writers that don't go through that are the async control events, GETINFO, and GETCONF. (I'd have to check again to be more sure.)

Looking at my notes, it seems to be GETCONF (not GETINFO, sorry for misremembering), MAPADDRESS, and the async control events that use connection_add_buf() directly. I guess you could wrap specifically the connection_add_buf() that are related to the control protocol.

The control reply refactor work that I'm working on is #30984. I think we can minimize conflicts in that code if you stick to wrapping the callers of connection_add_buf() that talk to the control protocol. Or we can talk more about coordinating more invasive changes if you think that's better.

comment:5 Changed 5 months ago by catalyst

It occurs to me that if we capture async control event output in these log messages, we will have to exclude log events that are sent as async control events, to prevent loops.

comment:6 in reply to:  5 Changed 5 months ago by teor

Replying to catalyst:

It occurs to me that if we capture async control event output in these log messages, we will have to exclude log events that are sent as async control events, to prevent loops.

Since we want to debug a controller hang, I'd like to avoid modifying the control messages entirely.
Because additional log messages could break or modify the deadlock.

There's also a module include layering issue: the logging code calls the async control code, so the async control code can't include the log functions.

I'd like a set of signal safe logs at debug level, like the error sigsafe logs from https://github.com/torproject/tor/blob/master/src/lib/err/torerr.c
But I also need a set of domains for each of those fds.
At this stage, I don't think we'll need any levels other than debug and error, so it's not worth making the code fully generic.

comment:7 Changed 5 months ago by gaba

Cc: teor added
Keywords: network-team-roadmap-august added

comment:8 Changed 4 months ago by teor

Actual Points: 0.8
Status: newneeds_revision

I have a work in progress branch here:
https://github.com/torproject/tor/pull/1196

I also opened #31319 to merge a control-spec log event documentation update.

comment:9 Changed 4 months ago by teor

Actual Points: 0.81.8

The code works, and produces log output like:

Aug 06 01:35:53.760 [debug] conn_read_callback: socket 8 wants to read.
Aug 06 01:35:53.760 [debug] read_to_chunk: Read 48 bytes. 48 on inbuf.
T= 1565019353 Control Trace Conn: 0x611000002200, Command Read: Content: "AUTHENTICATE \r\n".
Aug 06 01:35:53.761 [info] handle_control_authenticate: Authenticated control connection (8)
T= 1565019353 Control Trace Conn: 0x611000002200, Reply Printf: Content: "250 OK\r\n".

The output is raw, because we can't use the log module from the control_trace module.

I still need to split out some of the commits into separate branches, and squash the remaining commits.

comment:10 Changed 4 months ago by nickm

You asked for comment on the format. I'm going to suggest that we generate something designed to be easier to parse, on the theory that we will want to process the output with other tools.

comment:11 Changed 4 months ago by nickm

(A note on what I meant by "easier to parse": there should be a regex or a grammar or something that we commit to having our outputs follow.)

comment:12 Changed 4 months ago by teor

Oh, and I should also change the control module (and all its dependencies), so they only use the control_safe logging module. No BUG()s or tor_asserts() or tor_log()s.

Otherwise, the control code could create an infinite logging loop. (It is an existing bug, so it's not essential to fix now.)

comment:13 in reply to:  10 Changed 4 months ago by teor

Replying to nickm:

You asked for comment on the format. I'm going to suggest that we generate something designed to be easier to parse, on the theory that we will want to process the output with other tools.

I suggest a key=value or key="quoted value" grammar:

T=1565019353 Trace=Control Conn=0x611000002200 Direction=Read Type=Command Content="AUTHENTICATE \r\n"
T=1565019353 Trace=Control Conn=0x611000002200 Direction=Write Type=Printf Content="250 OK\r\n"

comment:14 Changed 4 months ago by nickm

that grammar lgtm; we use it in many places already.

comment:15 Changed 4 months ago by teor

Nick suggested that we could disable controller logging before logging a trace log, and enable it afterwards. That might avoid a bunch of the refactoring in this branch.

comment:16 Changed 4 months ago by teor

Owner: set to teor
Status: needs_revisionassigned

comment:17 Changed 4 months ago by teor

Status: assignedneeds_revision

comment:18 Changed 3 months ago by nickm

Keywords: 042-deferred-20190918 added
Milestone: Tor: 0.4.2.x-finalTor: 0.4.3.x-final

Defer numerous 0.4.2 tickets to 0.4.3.

comment:19 Changed 3 months ago by teor

Keywords: 042-deferred-20190918 removed
Milestone: Tor: 0.4.3.x-finalTor: 0.4.2.x-final

I've split some unrelated code out of this branch. Now I can use some of the previous code to implement a cleaner solution.

comment:20 Changed 3 months ago by teor

Milestone: Tor: 0.4.2.x-finalTor: 0.4.3.x-final

comment:21 Changed 2 months ago by teor

Reviewer: nickm
Status: needs_revisionneeds_review

I started this branch based on #31854, assuming that it will merge as-is. (There was a small merge conflict between #31854 and master, which I have fixed.)

Here is a revised design:

  1. Add a LD_SKIP_CB log flag, which skips callback loggers when sending the log message, and: a) rename other flags to avoid ambiguity b) write unit tests for all the log flags
  2. Implement a log_debug_control_trace() macro, which: a) uses the domain LD_CONTROL|LD_SKIP_CB b) escapes user-controlled data, and raw data from tor c) formats the trace log using a key="quoted value" grammar d) logs the trace message using log_debug()

Here is my initial branch, at step 1. a)

It also includes the commits from #31854.

I'm looking for a design review at this stage.

comment:22 Changed 2 months ago by nickm

This design looks plausible. A few suggestions to consider if they seem like a good idea to you:

  • The log_debug_control_trace() macro's name seems pretty long.
  • The log_debug_control_trace() macro sounds like it belongs in src/feature/control somewhere.
  • The documentation for the flags, including existing flags, should explain not only what the flags do, but why you would want to use them.

comment:23 Changed 2 months ago by nickm

Status: needs_reviewassigned

comment:24 Changed 2 weeks ago by teor

Keywords: network-team-roadmap-august removed
Sponsor: Sponsor31-can

We decided not to do this as part of Sponsor 31.

Note: See TracTickets for help on using tickets.