java projects should use slf4j (our default implementation is logback, but that is less important)
Yep.
errors are possibly mailed by logging frameworks (in their default settings). So error level should be important (an 'action item') for operation.
info-level should give an indication that all things are running as expected.
Agreed about those two, but we'll have to decide when to use the other three levels in slf4j. Here's a suggestion:
trace: I suggest disregarding this level, because the slf4j developers themselves only added it to "bow to popular demand". It seems we can easily get away by just using debug instead.
debug: Let's use this for detailed messages to debug a problem, under the assumption that these logs are typically disabled in production and only enabled when debugging a problem.
info: It seems useful to log whenever a process or major step starts or ends, but under the assumption that these logs will only be written to files and not mailed out to operators. It might be a good requirement to write info-level logs in a way that operators can understand them without having to read any code. One example where info might be too high is where metrics-lib informs us which DescriptorCollector implementation it's serving us, because that's something the operator wouldn't normally care about.
warn: We could use warn to inform the operator of a problem that we were able to recover from but that they should be aware of. The warning should be written in a way that the operator understands, and it should be something that the operator can do something against. Stated differently, we should expect to be contacted by operators who are unclear what to do about a given warning. And if they cannot do anything against it, maybe it should be an info message rather than a warning. We might want to recommend that operators include warnings in any automated notifications they receive from their service instance.
error: We should use error for problems that we cannot recover from. Otherwise they're similar to warnings.
maybe, log statistics to separate log files (as in Onionoo).
Are there log domains of some sort? It seems that we should leave the decision of log files to the operator, but could say that these log messages go into a "statistics" log domain that the operator may log to the same or a different file. By the way, we'd probably want to log these messages on info level.
what else?
Maybe one thing:
Log levels used by metrics-lib, where a problem with parsing a descriptor can have different consequences depending on the application. In other words, if we log a warning or even error, we should give the application the opportunity to tone down that warning, or ignore it, because it doesn't care as much. What we could do is use a log domain "parsing", or we could let applications define logging by logger name and tone down org.torproject.descriptor.* loggers.
trace: I suggest disregarding this level, because the slf4j developers themselves only added it to "bow to popular demand". It seems we can easily get away by just using debug instead.
Agreed.
debug: Let's use this for detailed messages to debug a problem, under the assumption that these logs are typically disabled in production and only enabled when debugging a problem.
Agreed.
info: It seems useful to log whenever a process or major step starts or ends, but under the assumption that these logs will only be written to files and not mailed out to operators. It might be a good requirement to write info-level logs in a way that operators can understand them without having to read any code.
Fine.
One example where info might be too high is where metrics-lib informs us which DescriptorCollector implementation it's serving us, because that's something the operator wouldn't normally care about.
The logging statement you refer to can be useful to the operator once there are other metrics-lib implementations available and the operator choses to use metrics-lib-ng instead of the regular implementation. Or, in case we provide other parsing implementations, too.
(In order to make it really code independent metrics-lib would need to supply an identifier, but when different implementations are used the full class-name can just be stated in the documentation to that implementation and operators wouldn't need to know that this is a java class name.)
warn: We could use warn to inform the operator of a problem that we were able to recover from but that they should be aware of. The warning should be written in a way that the operator understands, and it should be something that the operator can do something against. Stated differently, we should expect to be contacted by operators who are unclear what to do about a given warning. And if they cannot do anything against it, maybe it should be an info message rather than a warning. We might want to recommend that operators include warnings in any automated notifications they receive from their service instance.
Agreed. Even require that the 'cure' is added to the warning, e.g. "provide more disk space" etc. Requiring this will give a good distinction between warnings and info messages.
error: We should use error for problems that we cannot recover from. Otherwise they're similar to warnings.
Maybe, define errors for the unusual (errors would be logged in some places where we do not really expect anything) and ask operators to post these (ml or trac)?
maybe, log statistics to separate log files (as in Onionoo).
Are there log domains of some sort? It seems that we should leave the decision of log files to the operator, but could say that these log messages go into a "statistics" log domain that the operator may log to the same or a different file. By the way, we'd probably want to log these messages on info level.
Yes, you can call it domains. There should be domains for certain log lines (like statistics) that can be easily directed to different log-line consumers (other files, databases, servers, etc.).
what else?
Maybe one thing:
Log levels used by metrics-lib, where a problem with parsing a descriptor can have different consequences depending on the application. In other words, if we log a warning or even error, we should give the application the opportunity to tone down that warning, or ignore it, because it doesn't care as much. What we could do is use a log domain "parsing", or we could let applications define logging by logger name and tone down org.torproject.descriptor.* loggers.
Maybe, for metrics-lib we should not log too much but rather throw exceptions. That would be the right way to make the application to deal with problems.
But this is "error handling" rather than logging.
One example where info might be too high is where metrics-lib informs us which DescriptorCollector implementation it's serving us, because that's something the operator wouldn't normally care about.
The logging statement you refer to can be useful to the operator once there are other metrics-lib implementations available and the operator choses to use metrics-lib-ng instead of the regular implementation. Or, in case we provide other parsing implementations, too.
(In order to make it really code independent metrics-lib would need to supply an identifier, but when different implementations are used the full class-name can just be stated in the documentation to that implementation and operators wouldn't need to know that this is a java class name.)
Okay. I'm not certain where this message fits in best. On the one hand I wouldn't expect an operator to select an implementation that is not the default. But on the other hand I can see us asking an operator for their info-level logs whenever something's wrong, and then we might want to see which exact implementation was used. I'd say let's defer this until we have more such cases to speculate about.
warn: We could use warn to inform the operator of a problem that we were able to recover from but that they should be aware of. The warning should be written in a way that the operator understands, and it should be something that the operator can do something against. Stated differently, we should expect to be contacted by operators who are unclear what to do about a given warning. And if they cannot do anything against it, maybe it should be an info message rather than a warning. We might want to recommend that operators include warnings in any automated notifications they receive from their service instance.
Agreed. Even require that the 'cure' is added to the warning, e.g. "provide more disk space" etc. Requiring this will give a good distinction between warnings and info messages.
I like this!
error: We should use error for problems that we cannot recover from. Otherwise they're similar to warnings.
Maybe, define errors for the unusual (errors would be logged in some places where we do not really expect anything) and ask operators to post these (ml or trac)?
You mean assign a unique error number? Sure, sounds like a fine idea! Should we also do that for warnings, so that operators can give us the warning number whenever they're unsure how to handle an issue?
Maybe, for metrics-lib we should not log too much but rather throw exceptions. That would be the right way to make the application to deal with problems.
But this is "error handling" rather than logging.
Well, it touches both areas. If we say we rather throw an exception than log a warning or an error, that's something we can work with. I didn't look at the code yet to see whether that would work. It might!
Agreed with the rest that I didn't quote above. Let's turn this thread into a guide somewhere as soon as we agree on the remaining things.
Maybe, define errors for the unusual (errors would be logged in some places where we do not really expect anything) and ask operators to post these (ml or trac)?
You mean assign a unique error number? Sure, sounds like a fine idea! Should we also do that for warnings, so that operators can give us the warning number whenever they're unsure how to handle an issue?
Assigning a number wouldn't be possible. I was referring to errors that weren't anticipated when coding. A good example is ticket #20695 (moved) where an error was caught. The logged text ERROR o.t.c.c.CollecTorMain:71 The relaydescs module failed: ... should be ERROR o.t.c.c.CollecTorMain:71 Please post this one trac.torproject.org or to the metrics-team mailing list. The relaydescs module failed: ...
Maybe, for metrics-lib we should not log too much but rather throw exceptions. That would be the right way to make the application to deal with problems.
But this is "error handling" rather than logging.
Well, it touches both areas. If we say we rather throw an exception than log a warning or an error, that's something we can work with. I didn't look at the code yet to see whether that would work. It might!
Agreed with the rest that I didn't quote above. Let's turn this thread into a guide somewhere as soon as we agree on the remaining things.
Alright, I finally got around to trying out those guidelines. Please take a look at this commit in my metrics-lib branch task-20540 where I wrote or tweaked log statements for DescriptorIndexCollector and friends. Is that roughly what you intended when writing those guidelines?
I really want to keep the info level here, because this gives valuable runtime information. Maybe, the text could be phrased differently to make it seem less 'techy'? On the other hand, people running metrics-lib or other Metrics products can be expected to not be offended by such an info level statement.
The two if (log.isDebugEnabled()) are redundant as the logging statement is already in the good form, see Parametrized Logging in the Logback manual (especially the Better alternative section). (Will add this to the wiki logging section.)
This warning should contain a hint about what to do. The sentence Local directory already exists and is not a directory. sounds quite cryptic.
All info level statements in DescriptorIndexCollector are fine. This could be changed to Finished descriptor collection.?
Here I would add a hint to look at the error message, e.g., adding The following error message provides more details. or similar. Something that makes operators pay attention and thus detect some file system errors or what ever.
Why a slash in here: log.debug("Fetching remote file /{} with expected ...?
This makes me raise the escalate vs. logging question, which is not yet covered in the logging definition. The aborted descriptor collection should be propagated to the calling application, e.g., turn the warning into a RuntimeException with the same text.
Thoughts?
And I changed that one log statement back to an IllegalArgumentException, even though I think that's a stretch. (I believe that we should only throw RuntimeExceptions at callers for things they're clearly responsible for, like passing null to us where we clearly said that they must not do such a thing. But the caller does not have exclusive control over the file system, so it might not have been them who created the non-directory file. Yeah, a stretch, gray area, edge case... I don't know the final answer yet.)
Want to take another look?
Should I continue with the other metrics-lib classes, like DescriptorReaderImpl?
And I changed that one log statement back to an IllegalArgumentException, even though I think that's a stretch. (I believe that we should only throw RuntimeExceptions at callers for things they're clearly responsible for, like passing null to us where we clearly said that they must not do such a thing. But the caller does not have exclusive control over the file system, so it might not have been them who created the non-directory file. Yeah, a stretch, gray area, edge case... I don't know the final answer yet.)
Want to take another look?
Looks fine!
Should I continue with the other metrics-lib classes, like DescriptorReaderImpl?
Sure. This is a good approach, one by one.
When there are more examples for the exception vs. logging for deriving a rule (of thumb), I'll add that to the wiki.