Opened 4 years ago

Closed 2 years ago

#13362 closed enhancement (fixed)

Make Onionoo's logs more useful

Reported by: karsten
Owned by: iwakeh
Priority: Medium
Milestone: Onionoo 3.1-1.0.0
Component: Metrics/Onionoo
Version:
Severity: Normal
Keywords:
Cc: iwakeh
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

I noticed a few problems with Onionoo's logs since we switched to logback.

First, error logs are strangely truncated, as in this older example:

2014-09-24 22:23:43 ERROR o.t.o.u.DescriptorDownloader(1132014-09-24 22:23:43 ERROR o.t.o.u.DescriptorDownloader(1132014-09-24 22:23:43 ERROR o.t.o.u.DescriptorDownloader(1132014-09-24 22:23:43 ERROR o.t.o.u.DescriptorDownloader(1132014-09-24 22:23:43 ERROR o.t.o.u.DescriptorDownloader(1132014-09-24 22:23:43 ERROR o.t.o.u.DescriptorDownloader(1132014-09-24 22:23:44 ERROR o.t.o.u.DescriptorDownloader(1132014-09-24 22:23:44 ERROR o.t.o.u.DescriptorDownloader(1132014-09-24 22:25:08 ERROR o.t.o.u.NodeDetailsStatusUpdater(464

My guess is that this can be explained by parentheses being special in logback's pattern layout, where they group conversion words and literal parentheses have to be escaped. We could use [] instead, or simply use the same pattern as for the main log.
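
To illustrate, here is roughly what the fix could look like in logback.xml. This is a minimal sketch: the appender class and file name are assumptions, and only the pattern line matters.

  <appender name="FILEERR" class="ch.qos.logback.core.FileAppender">
    <file>logs/onionoo-error.log</file>
    <encoder>
      <!-- Parentheses group conversion words in logback patterns and
           would have to be escaped as \( and \); square brackets have
           no special meaning and can be used literally. -->
      <pattern>%date %level %logger[%line] %msg%n</pattern>
    </encoder>
  </appender>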

Second, the statistics logs are missing the most relevant parts. Here's an example:

2014-09-04 10:15:06 [runtime: 843] Initializing.
2014-09-04 10:15:06 [runtime: 845] Acquired lock
2014-09-04 10:15:06 [runtime: 847] Initialized descriptor source
2014-09-04 10:15:06 [runtime: 852] Initialized document store
2014-09-04 10:15:06 [runtime: 877] Initialized status update runner
2014-09-04 10:15:06 [runtime: 890] Initialized document writer runner
2014-09-04 10:15:06 [runtime: 891] Downloading descriptors.
2014-09-04 10:15:44 [runtime: 38516] Reading descriptors.
2014-09-04 10:24:51 [runtime: 585920] Updating internal status files.
2014-09-04 10:30:58 [runtime: 952445] Updating document files.
2014-09-04 10:35:14 [runtime: 1208036] Shutting down.
2014-09-04 10:35:14 [runtime: 1208039] Wrote parse histories
2014-09-04 10:35:44 [runtime: 1238709] Flushed document cache
2014-09-04 10:35:44 [runtime: 1238709] Gathering statistics.
2014-09-04 10:35:44 [runtime: 1238711] NodeDetailsStatusUpdater
2014-09-04 10:35:44 [runtime: 1238713] ClientsStatusUpdater
2014-09-04 10:35:44 [runtime: 1238715] UptimeStatusUpdater
2014-09-04 10:35:44 [runtime: 1238715] GeoIP lookup service
2014-09-04 10:35:44 [runtime: 1238719] Reverse domain name resolver
2014-09-04 10:35:44 [runtime: 1238720] Descriptor source
2014-09-04 10:35:44 [runtime: 1238721] Document store
2014-09-04 10:35:44 [runtime: 1238721] Releasing lock.
2014-09-04 10:35:44 [runtime: 1238721] Released lock
2014-09-04 10:35:44 [runtime: 1238721] Terminating.

I think the problem is that we're not including {} placeholders in log calls with additional arguments, so logback is simply dropping those arguments on the floor. But maybe we shouldn't mix log calls with placeholders and log calls that concatenate strings themselves. Taking out calls with multiple arguments (except for logging exceptions) seems like the simplest fix for now, though I could see us switching to placeholders in the future.
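
For illustration, a minimal sketch of the difference (the logger and messages are made up; the behavior is standard slf4j): extra arguments are silently dropped when the format string has no {} placeholders, unless the last argument is a Throwable.

  import org.slf4j.Logger;
  import org.slf4j.LoggerFactory;

  public class PlaceholderExample {

    private static final Logger log =
        LoggerFactory.getLogger(PlaceholderExample.class);

    public static void main(String[] args) {
      long processed = 42;
      // Bad: no placeholder, so the second argument is dropped and
      // only "NodeDetailsStatusUpdater" is logged.
      log.info("NodeDetailsStatusUpdater", processed);
      // Good: placeholder included, so the argument shows up in the log.
      log.info("NodeDetailsStatusUpdater processed {} status entries",
          processed);
      // Exceptions as the last argument are treated specially and are
      // logged with a stack trace even without a placeholder.
      log.error("Downloading descriptors failed",
          new RuntimeException("example"));
    }
  }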

Third, and this is not really an issue but a minor inconvenience, the [runtime: 123456] part of both main and error logs is not as easy to read as one might expect. I wonder if we should simply switch timestamps to %date{ISO8601, UTC} for millisecond precision. That would make logs somewhat easier to read.
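
For example, a pattern along these lines (only the %date{ISO8601, UTC} part is the actual suggestion; the rest of the pattern is an assumption):

  <pattern>%date{ISO8601, UTC} %level %logger[%line] %msg%n</pattern>

ISO8601 timestamps come out as, e.g., 2014-09-04 10:15:06,843, so the millisecond precision currently provided by the [runtime: ...] counter would be preserved.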

Child Tickets

Ticket   Type         Status  Owner  Summary
#13561   enhancement  closed         logging documentation

Attachments (1)

0001-Fix-truncated-error-log-entries.patch (756 bytes) - added by leeroy 4 years ago.


Change History (18)

comment:1 Changed 4 years ago by karsten

Status: new → needs_review

Please review my task-13362 branch.

comment:2 Changed 4 years ago by iwakeh

Thanks for removing the parenthesis bug from FILEERR.

All the formatting and other choices I made when writing the logback.xml were
suggestions and examples (like the runtime value), because I could not really
infer what was necessary or wanted, and I intended to showcase some of
logback's features.
I should have pointed that out more clearly when the logging was introduced.
So, good that we now get around to defining the logging structure in detail!

Another topic for the logging documentation is what we talked about in the e-mail-appender issue.
Maybe add one document about configuration, which would contain the logging decisions/findings
discussed so far and, later, some configuration topics for the embedded Jetty?

Last edited 4 years ago by iwakeh

comment:3 Changed 4 years ago by iwakeh

Status: needs_review → needs_revision

comment:4 Changed 4 years ago by iwakeh

I removed the code example from the previous comment, as it was not what I intended.
I shouldn't edit in a hurry ;-)

comment:5 Changed 4 years ago by karsten

Agreed, let's define what we'd like to log and how, and let's document that both for code contributors and for service operators. Some suggestions:

  • I'd like to keep a detailed log for debugging problems. We could use two log levels for normal operation: debug and info.
  • There should also be logs for warnings and errors. We might even suggest setting up Nagios scripts that check whether either of these log files is not empty and issue a warning message to the operator. During normal operation, there shouldn't be any warnings or even errors, so we should only use those two log levels when the operator must take a look. Once the operator resolves the problem they should delete the log file; warnings and error messages will still be contained in detailed logs. We'll probably want two sets of warning/error logs for back-end and front-end.
  • We don't really need a separate statistics log file. What we might want to do instead is define thresholds for certain operations, like reading descriptors within a certain time, and emit a warning if an operation takes longer than that. We could use the same log files as described above.

I didn't merge my task-13362 branch to master yet. Please let me know which changes require more discussion, and I'll split up the commit and merge the undisputed parts.

comment:6 in reply to: 5 Changed 4 years ago by iwakeh

task-13362 branch is ok to be merged.

I think we mostly agree on the logging.
Starting from your suggestions I also added what the logging document should mention for whom.
(ops=operators, devs=developers)

  • The detailed log including everything (debug and up) is important for development.
    • Logging instructions: How to turn the detailed log off (ops) and how to change the threshold level (ops, devs).
  • The warn-error-log for ops (warning and up).
    • Logging instructions: How to turn it off (devs,ops).

I wouldn't impose Nagios on operators, as there are many other monitoring solutions; just provide a link to the available Nagios scripts. And also mention the possibility of configuring appenders (without detailing it), so whoever has experience with logback (or wants to learn) can use whatever they prefer.
The logging document should point out that warnings and errors are to be taken seriously (ops).
In addition, log messages themselves should contain possible solutions, or hints about what to do. (This is actually a topic for the contributor document.)
For example, the missing GeoIP error could refer to the install file.

The 'statistics' log statements can be treated as you suggested, no need for a separate log.

Separate log files for back-end and front-end can be set using the command-line options introduced with the one-jar/one-war task.
Basically, how to do this needs to be explained in the logging document.
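
One standard way to do this with logback (sketched here under the assumption that it matches the one-jar/one-war setup; the file and jar names are placeholders) is to point each process at its own configuration file via a system property:

  java -Dlogback.configurationFile=logback-backend.xml -jar onionoo-backend.jar
  java -Dlogback.configurationFile=logback-frontend.xml -jar onionoo-frontend.jar

Each configuration can then declare its own appenders and log files, so back-end and front-end never write to the same file.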

As for the (future) production environment, you mentioned that the two ought to run under different users anyway. So, they should already have separate logs?

... What we might want to do instead is define thresholds for certain operations, like reading descriptors of a certain time, and emit a warning if that operation takes longer than that.

Maybe, these warnings/errors are only relevant for development and performance tuning? Can an operator do something about these warnings? Or, do you want these to be reported as a track issue?

Last edited 4 years ago by iwakeh

comment:7 in reply to:  6 Changed 4 years ago by karsten

Replying to iwakeh:

task-13362 branch is ok to be merged.

Great, merged.

I think we mostly agree on the logging.
Starting from your suggestions I also added what the logging document should mention for whom.
[...]

Thanks for starting this! Would you want to turn these ideas into a documentation patch that we can later merge? I guess the ops part would go into INSTALL and the devs part into CONTRIB.md. New ticket? :)

Just one concern: we wouldn't want ops to edit logback.xml, right? In theory, ops should be able to just take the jar and war and provide any configuration specifics as command-line arguments. Or is that unrealistic as a goal?

As for the (future) production environment, you mentioned that the two ought to run under different users anyway. So, they should already have separate logs?

Oh. You're right, I forgot about that. Should be fine to not provide anything at the command line and just log to the current directory by default then.

... What we might want to do instead is define thresholds for certain operations, like reading descriptors of a certain time, and emit a warning if that operation takes longer than that.

Maybe, these warnings/errors are only relevant for development and performance tuning? Can an operator do something about these warnings? Or, do you want these to be reported as a track issue?

What's a track issue? I was thinking of pretty high thresholds here, beyond which the service would be almost unusable. There are two important time constants: every hour a new cronjob starts, and after six hours the front-end considers the back-end data stale. If a back-end execution takes longer than one hour, that's worth a warning, and if it takes longer than a couple of hours, that's worth shouting at the operator to check things and possibly inform the developers.
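
A minimal sketch of such a check (the class and method names are made up; only the one-hour and six-hour constants come from the paragraph above):

  import java.util.concurrent.TimeUnit;

  import org.slf4j.Logger;
  import org.slf4j.LoggerFactory;

  public class TimedRun {

    private static final Logger log = LoggerFactory.getLogger(TimedRun.class);

    private static final long WARN_MILLIS = TimeUnit.HOURS.toMillis(1);
    private static final long ERROR_MILLIS = TimeUnit.HOURS.toMillis(6);

    static void runUpdate(Runnable update) {
      long start = System.currentTimeMillis();
      update.run();
      long elapsed = System.currentTimeMillis() - start;
      if (elapsed > ERROR_MILLIS) {
        // The front-end now considers the back-end data stale.
        log.error("Back-end run took {} ms; front-end data is stale. "
            + "Please check the host and possibly inform the developers.",
            elapsed);
      } else if (elapsed > WARN_MILLIS) {
        // Longer than the hourly cronjob interval.
        log.warn("Back-end run took {} ms, which is longer than the "
            + "hourly cronjob interval.", elapsed);
      }
    }
  }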

Anyway, we can discuss this later and just include the statistics log lines in the debug/info log for now. New ticket?

comment:8 Changed 4 years ago by iwakeh

I should have written 'Trac' issue ;-)
i.e., make operators report here if they observe very unusual performance.

Two new derived tickets for the open topics.

comment:9 Changed 4 years ago by leeroy

I noticed the error log being truncated today. It is indeed due to the special handling of parentheses. I didn't see the problem with missing information in the statistics logs (but I may not know what to look for). What remains of this ticket that needs revision?

Changed 4 years ago by leeroy

comment:10 in reply to:  9 Changed 4 years ago by karsten

Replying to leeroy:

I noticed the error log being truncated today. It is indeed due to the special handling of parentheses. I didn't see the problem with missing information in the statistics logs (but I may not know what to look for). What remains of this ticket that needs revision?

Aha, it looks like the fixes discussed above got lost when switching to Jetty. There were two logback.xml files, and somehow we managed to delete the wrong one. I just re-applied the fixes.

So, what remains to be done? I'd say the two child tickets. We could, in theory, remove their parent ticket relations to this ticket and close this one. Or we could work on the child tickets and then close all of them at once. Want to help with the child tickets?

comment:11 Changed 4 years ago by leeroy

From reading here and the child tickets, it sounds like the improvements will need to consider the import of archives. So maybe I should look into obtaining thresholds including that possibility first? When the back-end imports archives during the update, the entire process takes 12 hours (plus a scaling modifier of ~1.3 depending on the availability of CPU cores for load-heavy parts like reading consensuses). When the updater is run on recent data for the first time, a similar behavior occurs with a baseline of 1 hour. Other updates, which apply only recent updates of n consensus intervals (with n << 72), are relatively short operations.

  • In any case, how should the thresholds be defined? In a file? These thresholds will be sensitive to improvements in the code: if the import of data is sped up, the thresholds need to be easy to update. (A sketch of a file-based approach follows this list.)
  • Other possibilities may include: defining a test (run with the rest of the tests) to profile performance, and having the import and update functionality provide a hint as to how much data is being processed.
  • What kind of additional logging, for the frontend, should be added?
  • Does it make sense to echo some messages to the screen/tty/console session?
  • tail -f doesn't help if the shared logging facility rotates the log file, which it likely will do when importing archives. Should the log file only be rotated between update runs, instead of during an update?
  • (to be added to documentation) Monitoring tools will lose the file handle during log rotation.
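
Regarding the first bullet, a file-based approach could be as simple as a Java properties file; this is only a sketch, and the file name and key are hypothetical:

  import java.io.IOException;
  import java.io.InputStream;
  import java.nio.file.Files;
  import java.nio.file.Paths;
  import java.util.Properties;

  public class Thresholds {

    /** Reads the warn threshold, defaulting to one hour (3600000 ms). */
    public static long loadUpdateWarnMillis() throws IOException {
      Properties props = new Properties();
      try (InputStream in = Files.newInputStream(
          Paths.get("thresholds.properties"))) {
        props.load(in);
      }
      return Long.parseLong(
          props.getProperty("update.warn.millis", "3600000"));
    }
  }

Updating a threshold after a code improvement then means editing one line in that file rather than recompiling.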
Last edited 4 years ago by leeroy

comment:12 Changed 3 years ago by iwakeh

Milestone: Onionoo 3.1.1

comment:13 Changed 3 years ago by iwakeh

Owner: set to iwakeh
Status: needs_revision → assigned

Owner wasn't set.

comment:14 Changed 3 years ago by karsten

Milestone: Onionoo 3.1.1 → Onionoo 3.1-1.0.0

Milestone renamed

comment:15 Changed 2 years ago by iwakeh

Severity: Normal

Depends on the definitions from #20540.

comment:16 Changed 2 years ago by iwakeh

Closing as all tasks in here are done and the log-level definition implementation is a different ticket (#13361).

comment:17 Changed 2 years ago by iwakeh

Resolution: fixed
Status: assigned → closed