Opened 4 years ago

Closed 2 years ago

#13362 closed enhancement (fixed)

Make Onionoo's logs more useful

Reported by: karsten
Owned by: iwakeh
Priority: Medium
Milestone: Onionoo 3.1-1.0.0
Component: Metrics/Onionoo
Version:
Severity: Normal
Keywords:
Cc: iwakeh
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

I noticed a few problems with Onionoo's logs since we switched to logback.

First, error logs are strangely truncated, as in this older example:

2014-09-24 22:23:43 ERROR o.t.o.u.DescriptorDownloader(1132014-09-24 22:23:43 ERROR o.t.o.u.DescriptorDownloader(1132014-09-24 22:23:43 ERROR o.t.o.u.DescriptorDownloader(1132014-09-24 22:23:43 ERROR o.t.o.u.DescriptorDownloader(1132014-09-24 22:23:43 ERROR o.t.o.u.DescriptorDownloader(1132014-09-24 22:23:43 ERROR o.t.o.u.DescriptorDownloader(1132014-09-24 22:23:44 ERROR o.t.o.u.DescriptorDownloader(1132014-09-24 22:23:44 ERROR o.t.o.u.DescriptorDownloader(1132014-09-24 22:25:08 ERROR o.t.o.u.NodeDetailsStatusUpdater(464

My guess is that this can be explained by parentheses being special in logback's pattern layout, where they group conversion words and literal parentheses have to be escaped. We could use [] instead, or simply use the same pattern as for the main log.
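
To illustrate, here is roughly what the fix could look like in logback.xml. This is a minimal sketch: the appender class and file name are assumptions, and only the pattern line matters.

  <appender name="FILEERR" class="ch.qos.logback.core.FileAppender">
    <file>logs/onionoo-error.log</file>
    <encoder>
      <!-- Parentheses group conversion words in logback patterns and
           would have to be escaped as \( and \); square brackets have
           no special meaning and can be used literally. -->
      <pattern>%date %level %logger[%line] %msg%n</pattern>
    </encoder>
  </appender>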

Second, the statistics logs are missing the most relevant parts. Here's an example:

2014-09-04 10:15:06 [runtime: 843] Initializing.
2014-09-04 10:15:06 [runtime: 845] Acquired lock
2014-09-04 10:15:06 [runtime: 847] Initialized descriptor source
2014-09-04 10:15:06 [runtime: 852] Initialized document store
2014-09-04 10:15:06 [runtime: 877] Initialized status update runner
2014-09-04 10:15:06 [runtime: 890] Initialized document writer runner
2014-09-04 10:15:06 [runtime: 891] Downloading descriptors.
2014-09-04 10:15:44 [runtime: 38516] Reading descriptors.
2014-09-04 10:24:51 [runtime: 585920] Updating internal status files.
2014-09-04 10:30:58 [runtime: 952445] Updating document files.
2014-09-04 10:35:14 [runtime: 1208036] Shutting down.
2014-09-04 10:35:14 [runtime: 1208039] Wrote parse histories
2014-09-04 10:35:44 [runtime: 1238709] Flushed document cache
2014-09-04 10:35:44 [runtime: 1238709] Gathering statistics.
2014-09-04 10:35:44 [runtime: 1238711] NodeDetailsStatusUpdater
2014-09-04 10:35:44 [runtime: 1238713] ClientsStatusUpdater
2014-09-04 10:35:44 [runtime: 1238715] UptimeStatusUpdater
2014-09-04 10:35:44 [runtime: 1238715] GeoIP lookup service
2014-09-04 10:35:44 [runtime: 1238719] Reverse domain name resolver
2014-09-04 10:35:44 [runtime: 1238720] Descriptor source
2014-09-04 10:35:44 [runtime: 1238721] Document store
2014-09-04 10:35:44 [runtime: 1238721] Releasing lock.
2014-09-04 10:35:44 [runtime: 1238721] Released lock
2014-09-04 10:35:44 [runtime: 1238721] Terminating.

I think the problem is that we're not including {} placeholders in log calls with additional arguments, so logback is simply dropping those arguments on the floor. But maybe we shouldn't mix log calls with placeholders and log calls that concatenate strings themselves. Taking out calls with multiple arguments (except for logging exceptions) seems like the simplest fix for now, though I could see us switching to placeholders in the future.
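
For illustration, a minimal sketch of the difference (the logger and messages are made up; the behavior is standard slf4j): extra arguments are silently dropped when the format string has no {} placeholders, unless the last argument is a Throwable.

  import org.slf4j.Logger;
  import org.slf4j.LoggerFactory;

  public class PlaceholderExample {

    private static final Logger log =
        LoggerFactory.getLogger(PlaceholderExample.class);

    public static void main(String[] args) {
      long processed = 42;
      // Bad: no placeholder, so the second argument is dropped and
      // only "NodeDetailsStatusUpdater" is logged.
      log.info("NodeDetailsStatusUpdater", processed);
      // Good: placeholder included, so the argument shows up in the log.
      log.info("NodeDetailsStatusUpdater processed {} status entries",
          processed);
      // Exceptions as the last argument are treated specially and are
      // logged with a stack trace even without a placeholder.
      log.error("Downloading descriptors failed",
          new RuntimeException("example"));
    }
  }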

Third, and this is not really an issue but a minor inconvenience, the [runtime: 123456] part of both main and error logs is not as easy to read as one might expect. I wonder if we should simply switch timestamps to %date{ISO8601, UTC} for millisecond precision. That would make logs somewhat easier to read.
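
For example, a pattern along these lines (only the %date{ISO8601, UTC} part is the actual suggestion; the rest of the pattern is an assumption):

  <pattern>%date{ISO8601, UTC} %level %logger[%line] %msg%n</pattern>

ISO8601 timestamps come out as, e.g., 2014-09-04 10:15:06,843, so the millisecond precision currently provided by the [runtime: ...] counter would be preserved.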

Child Tickets

Ticket   Type         Status  Owner  Summary
#13561   enhancement  closed         logging documentation

Attachments (1)

0001-Fix-truncated-error-log-entries.patch (756 bytes) - added by leeroy 4 years ago.


Change History (18)

comment:1 Changed 4 years ago by karsten

Status: new → needs_review

Please review my task-13362 branch.

comment:2 Changed 4 years ago by iwakeh

Thanks for removing the parenthesis bug from FILEERR.

All the formatting and other choices I made when writing the logback.xml were
suggestions and examples (like the runtime value), because I could not really
infer what was necessary or wanted, and I intended to showcase some of
logback's features.
I should have pointed that out more clearly when the logging was introduced.
So, good that we now get around to defining the logging structure in detail!

Another topic for the logging documentation is what we talked about in the e-mail-appender issue.
Maybe add one document about configuration, which would contain the logging decisions/findings
discussed so far and, later, some configuration topics for the embedded Jetty?

Last edited 4 years ago by iwakeh

comment:3 Changed 4 years ago by iwakeh

Status: needs_review → needs_revision

comment:4 Changed 4 years ago by iwakeh

I removed the code example from the previous comment, as it was not what I intended.
I shouldn't edit in a hurry ;-)

comment:5 Changed 4 years ago by karsten

Agreed, let's define what we'd like to log and how, and let's document that both for code contributors and for service operators. Some suggestions:

  • I'd like to keep a detailed log for debugging problems. We could use two log levels for normal operation: debug and info.
  • There should also be logs for warnings and errors. We might even suggest setting up Nagios scripts that check whether either of these log files is not empty and issue a warning message to the operator. During normal operation, there shouldn't be any warnings or even errors, so we should only use those two log levels when the operator must take a look. Once the operator resolves the problem they should delete the log file; warnings and error messages will still be contained in detailed logs. We'll probably want two sets of warning/error logs for back-end and front-end.
  • We don't really need a separate statistics log file. What we might want to do instead is define thresholds for certain operations, like reading descriptors within a certain time, and emit a warning if an operation takes longer than that. We could use the same log files as described above.

I didn't merge my task-13362 branch to master yet. Please let me know which changes require more discussion, and I'll split up the commit and merge the undisputed parts.

comment:6 in reply to: 5 Changed 4 years ago by iwakeh

task-13362 branch is ok to be merged.

I think we mostly agree on the logging.
Starting from your suggestions I also added what the logging document should mention for whom.
(ops=operators, devs=developers)

  • The detailed log including everything (debug and up) is important for development.
    • Logging instructions: How to turn the detailed log off (ops) and how to change the threshold level (ops, devs).
  • The warn-error-log for ops (warning and up).
    • Logging instructions: How to turn it off (devs,ops).

I wouldn't impose Nagios on operators, as there are many other monitoring solutions; just provide a link to the available Nagios scripts. And also mention the possibility of configuring appenders (without detailing it), so whoever has experience with logback (or wants to learn) can use whatever they prefer.
The logging document should point out that warnings and errors are to be taken seriously (ops).
In addition, log messages themselves should contain possible solutions, or hints about what to do. (This is actually a topic for the contributor document.)
For example, the missing GeoIP error could refer to the install file.

The 'statistics' log statements can be treated as you suggested, no need for a separate log.

Separate log files for back-end and front-end can be set using the command-line options introduced with the one-jar/one-war task.
Basically, how to do this needs to be explained in the logging document.
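
One standard way to do this with logback (sketched here under the assumption that it matches the one-jar/one-war setup; the file and jar names are placeholders) is to point each process at its own configuration file via a system property:

  java -Dlogback.configurationFile=logback-backend.xml -jar onionoo-backend.jar
  java -Dlogback.configurationFile=logback-frontend.xml -jar onionoo-frontend.jar

Each configuration can then declare its own appenders and log files, so back-end and front-end never write to the same file.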

As for the (future) production environment, you mentioned that the two ought to run under different users anyway. So, they should already have separate logs?

... What we might want to do instead is define thresholds for certain operations, like reading descriptors of a certain time, and emit a warning if that operation takes longer than that.

Maybe, these warnings/errors are only relevant for development and performance tuning? Can an operator do something about these warnings? Or, do you want these to be reported as a track issue?

Last edited 4 years ago by iwakeh

comment:7 in reply to:  6 Changed 4 years ago by karsten

Replying to iwakeh:

task-13362 branch is ok to be merged.

Great, merged.

I think we mostly agree on the logging.
Starting from your suggestions I also added what the logging document should mention for whom.
[...]

Thanks for starting this! Would you want to turn these ideas into a documentation patch that we can later merge? I guess the ops part would go into INSTALL and the devs part into CONTRIB.md. New ticket? :)

Just one concern: we wouldn't want ops to edit logback.xml, right? In theory, ops should be able to just take the jar and war and provide any configuration specifics as command-line arguments. Or is that unrealistic as a goal?

As for the (future) production environment, you mentioned that the two ought to run under different users anyway. So, they should already have separate logs?

Oh. You're right, I forgot about that. Should be fine to not provide anything at the command line and just log to the current directory by default then.

... What we might want to do instead is define thresholds for certain operations, like reading descriptors of a certain time, and emit a warning if that operation takes longer than that.

Maybe, these warnings/errors are only relevant for development and performance tuning? Can an operator do something about these warnings? Or, do you want these to be reported as a track issue?

What's a track issue? I was thinking of pretty high thresholds here, beyond which the service would be almost unusable. There are two important time constants: every hour a new cronjob starts, and after six hours the front-end considers the back-end data stale. If a back-end execution takes longer than one hour, that's worth a warning, and if it takes longer than a couple of hours, that's worth shouting at the operator to check things and possibly inform the developers.
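
A minimal sketch of such a check (the class and method names are made up; only the one-hour and six-hour constants come from the paragraph above):

  import java.util.concurrent.TimeUnit;

  import org.slf4j.Logger;
  import org.slf4j.LoggerFactory;

  public class TimedRun {

    private static final Logger log = LoggerFactory.getLogger(TimedRun.class);

    private static final long WARN_MILLIS = TimeUnit.HOURS.toMillis(1);
    private static final long ERROR_MILLIS = TimeUnit.HOURS.toMillis(6);

    static void runUpdate(Runnable update) {
      long start = System.currentTimeMillis();
      update.run();
      long elapsed = System.currentTimeMillis() - start;
      if (elapsed > ERROR_MILLIS) {
        // The front-end now considers the back-end data stale.
        log.error("Back-end run took {} ms; front-end data is stale. "
            + "Please check the host and possibly inform the developers.",
            elapsed);
      } else if (elapsed > WARN_MILLIS) {
        // Longer than the hourly cronjob interval.
        log.warn("Back-end run took {} ms, which is longer than the "
            + "hourly cronjob interval.", elapsed);
      }
    }
  }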

Anyway, we can discuss this later and just include the statistics log lines in the debug/info log for now. New ticket?

comment:8 Changed 4 years ago by iwakeh

I should have written 'Trac' issue ;-)
i.e., make operators report here if they observe very unusual performance.

Two new derived tickets for the open topics.

comment:9 Changed 4 years ago by leeroy

I noticed the error log being truncated today. It is indeed due to the special handling of parentheses. I didn't see the problem with missing information in the statistics logs (but I may not know what to look for). What remains of this ticket that needs revision?

Changed 4 years ago by leeroy

comment:10 in reply to:  9 Changed 4 years ago by karsten

Replying to leeroy:

I noticed the error log being truncated today. It is indeed due to the special handling of parentheses. I didn't see the problem with missing information in the statistics logs (but I may not know what to look for). What remains of this ticket that needs revision?

Aha, it looks like the fixes discussed above got lost when switching to Jetty. There were two logback.xml files, and somehow we managed to delete the wrong one. I just re-applied the fixes.

So, what remains to be done? I'd say the two child tickets. We could, in theory, remove their parent ticket relations to this ticket and close this one. Or we could work on the child tickets and then close all of them at once. Want to help with the child tickets?

comment:11 Changed 4 years ago by leeroy

From reading here and the child tickets, it sounds like the improvements will need to consider the import of archives. So maybe I should look into obtaining thresholds including that possibility first? When the back-end imports archives during the update, the entire process takes 12 hours (plus a scaling modifier of ~1.3 depending on the availability of CPU cores for load-heavy parts like reading consensuses). When the updater is run on recent data for the first time, a similar behavior occurs with a baseline of 1 hour. Other updates, which apply only recent updates of n consensus intervals (with n << 72), are relatively short operations.

  • In any case, how should the thresholds be defined? In a file? These thresholds will be sensitive to improvements in the code: if the import of data is sped up, the thresholds need to be easy to update. (A sketch of a file-based approach follows this list.)
  • Other possibilities may include: defining a test (run with the rest of the tests) to profile performance, and having the import and update functionality provide a hint as to how much data is being processed.
  • What kind of additional logging, for the frontend, should be added?
  • Does it make sense to echo some messages to the screen/tty/console session?
  • tail -f doesn't help if the shared logging facility rotates the log file, which it likely will do when importing archives. Should the log file only be rotated between update runs, instead of during an update?
  • (to be added to documentation) Monitoring tools will lose the file handle during log rotation.
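
Regarding the first bullet, a file-based approach could be as simple as a Java properties file; this is only a sketch, and the file name and key are hypothetical:

  import java.io.IOException;
  import java.io.InputStream;
  import java.nio.file.Files;
  import java.nio.file.Paths;
  import java.util.Properties;

  public class Thresholds {

    /** Reads the warn threshold, defaulting to one hour (3600000 ms). */
    public static long loadUpdateWarnMillis() throws IOException {
      Properties props = new Properties();
      try (InputStream in = Files.newInputStream(
          Paths.get("thresholds.properties"))) {
        props.load(in);
      }
      return Long.parseLong(
          props.getProperty("update.warn.millis", "3600000"));
    }
  }

Updating a threshold after a code improvement then means editing one line in that file rather than recompiling.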
Last edited 4 years ago by leeroy

comment:12 Changed 3 years ago by iwakeh

Milestone: Onionoo 3.1.1

comment:13 Changed 3 years ago by iwakeh

Owner: set to iwakeh
Status: needs_revision → assigned

Owner wasn't set.

comment:14 Changed 3 years ago by karsten

Milestone: Onionoo 3.1.1 → Onionoo 3.1-1.0.0

Milestone renamed

comment:15 Changed 2 years ago by iwakeh

Severity: Normal

Depends on the definitions from #20540.

comment:16 Changed 2 years ago by iwakeh

Closing as all tasks in here are done and the log-level definition implementation is a different ticket (#13361).

comment:17 Changed 2 years ago by iwakeh

Resolution: fixed
Status: assigned → closed