The hourly cronjob currently avoids concurrent executions by writing a lock file at startup and deleting it upon termination. If such a file already exists at startup, the cronjob doesn't start.
This strategy works fine if a live process doesn't finish on time. It fails pretty badly if a process died, because the stale lock file stays around and subsequent runs won't start without human intervention.
You could write the pid of the process into the lockfile, and if there's a lockfile, check that a process with that pid is still running. If it isn't, remove the lockfile and start over. This is not 100% reliable, because the system may have started another process with the same pid, but it should be a lot better. You can make it more complex too, but that's probably not warranted.
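A minimal sketch of the pid-in-lockfile idea, assuming Java 9 or later, where `ProcessHandle` offers a liveness check without platform-specific calls. The class `PidLock` and its methods are made up for illustration and are not part of Onionoo. Pid reuse remains a possible false positive, as noted above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class PidLock {

  /** Returns true if the lock was acquired, false if a live process holds it. */
  static boolean acquire(Path lockFile) throws IOException {
    if (Files.exists(lockFile)) {
      String content = new String(Files.readAllBytes(lockFile),
          StandardCharsets.UTF_8).trim();
      if (isAlive(content)) {
        return false;
      }
      // Stale lock: the recorded process is gone or the content is not a pid.
      Files.delete(lockFile);
    }
    long ownPid = ProcessHandle.current().pid();
    Files.write(lockFile, Long.toString(ownPid).getBytes(StandardCharsets.UTF_8));
    return true;
  }

  /** Returns true if the text is a pid of a currently running process. */
  static boolean isAlive(String pidText) {
    try {
      long pid = Long.parseLong(pidText);
      return ProcessHandle.of(pid).map(ProcessHandle::isAlive).orElse(false);
    } catch (NumberFormatException e) {
      return false;
    }
  }

  /** Removes the lock file at the end of a run. */
  static void release(Path lockFile) throws IOException {
    Files.deleteIfExists(lockFile);
  }
}
```

The check-then-write sequence is not atomic, so two processes starting at the very same moment could both get the lock. For an hourly job that is probably acceptable.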
That could work. But I'm not sure if there's a platform-independent way to build it. (One might argue that the current cron strategy is not platform-independent anyway, but hey.)
Or maybe it's time to move away from having cron execute a new hourly Java process towards a single Java process that runs in a loop and sleeps after an execution until the next. I just wonder how the service operator would learn when an execution fails, ideally via email. Hmm.
I second giving up cron:
cron is designed for entirely independent executions (ideally).
Having a long running task and not wanting the next task to run, unless the
previous one is finished introduces a dependency between the two.
Once there is such a dependency it should rather be addressed by design.
Some ideas for a different approach:
Using the ScheduledExecutorService with fixed rate or delay gives quite some control over the single executions and could mail in case of failure.
If the future version of onionoo uses logging (e.g. logback), the failure mails could be sent via an SMTPAppender.
Actually, a combination of these two might be best: Java-controlled timed execution and mailing via SMTPAppender.
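A sketch of how the Java-controlled timed execution could look, assuming a single-threaded scheduler. `HourlyUpdater`, `guarded` and the failure callback are hypothetical names, not existing Onionoo code. The callback is the place where an ERROR would be logged so that an SMTPAppender (or anything else) can pick it up. The wrapper matters because a `ScheduledExecutorService` suppresses all subsequent executions of a task once one run throws.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

class HourlyUpdater {

  /** Wraps a task so that a failure is reported instead of escaping. */
  static Runnable guarded(Runnable task, Consumer<Throwable> onFailure) {
    return () -> {
      try {
        task.run();
      } catch (Throwable t) {
        // An escaping exception would cancel all future runs, so hand it
        // to the callback (e.g. to log an ERROR) and carry on.
        onFailure.accept(t);
      }
    };
  }

  /** Runs the task now, then again each time the delay has passed after a run ends. */
  static ScheduledExecutorService start(Runnable task,
      Consumer<Throwable> onFailure, long delay, TimeUnit unit) {
    ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleWithFixedDelay(guarded(task, onFailure), 0, delay, unit);
    return scheduler;
  }
}
```

With a single scheduler thread, two runs never overlap inside the JVM, so the lock file would only be needed to guard against a second JVM. A call could look like `HourlyUpdater.start(updateTask, t -> log.error("Update failed", t), 1, TimeUnit.HOURS)`, where `updateTask` and `log` are placeholders.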
Many regularly running cronjobs employ locking if parallel execution is a concern. Your machine might always have a clock jump, a high load or some other condition preventing orderly execution otherwise. If onionoo is prone to dying (which apparently happened here), you'd want a watchdog to do the email sending bits in a different process anyway, rather than increase the lifetime of the onionoo process?
(The watchdog would be platform-specific again, in any case. What are the target platforms? Do they really include Windows? Because cron is easily portable to BSD.)
I like the idea of using ScheduledExecutorService and SMTPAppender for logging warnings and non-fatal errors. I'd like to move away from cron for that. This is currently blocking on better logging infrastructure, but once we have that, I'd say let's switch.
Regarding the watchdog, I wonder if we can add a Nagios warning for this. In theory, it's sufficient to download https://onionoo.torproject.org/summary?limit=0 and make sure the included timestamps are not older than, say, three hours. Sebastian, do you know how to write such a warning and make sure that I learn about problems via email?
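The age check itself could be sketched like this. It assumes that a timestamp (e.g. the `relays_published` field) has already been extracted from the summary document and has the form `yyyy-MM-dd HH:mm:ss` in UTC. Both the field name and the format are assumptions here, and `FreshnessCheck` is a hypothetical helper. Downloading and JSON parsing are left out.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

class FreshnessCheck {

  // Assumed timestamp format, interpreted as UTC.
  private static final DateTimeFormatter FORMAT =
      DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

  /** Returns true if the timestamp is at most maxAge older than now. */
  static boolean isFresh(String published, Instant now, Duration maxAge) {
    Instant timestamp = LocalDateTime.parse(published, FORMAT)
        .toInstant(ZoneOffset.UTC);
    return Duration.between(timestamp, now).compareTo(maxAge) <= 0;
  }
}
```

A check would call something like `FreshnessCheck.isFresh(published, Instant.now(), Duration.ofHours(3))` and report a problem when it returns false.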
> I like the idea of using ScheduledExecutorService and SMTPAppender for logging warnings and non-fatal errors. I'd like to move away from cron for that. This is currently blocking on better logging infrastructure, but once we have that, I'd say let's switch.
The Nagios warning is implemented, and the better logging infrastructure is in place. That means this ticket isn't blocking on anything anymore. Should we try the SMTPAppender first and then switch to using ScheduledExecutorService? Would you be able to submit a patch for the former?
Not so much a patch b/c it is mostly configuration and I don't know
the Onionoo server parameters. Hence, some changes will
be necessary after applying the patch. I tried to indicate the changes
by XML comments.
The attached patch adds lines to build.xml and logback.xml.
In addition, a java mail implementation has to be provided by installing
gnumail-1.1.2.jar and gnumail-providers-1.1.2.jar, which can be found in
wheezy package libgnumail-java.
Well, now mailing should work. That is, once an ERROR is logged, the e-mails
are triggered and will contain a few log messages up to this error.
Note that I added the appender to two loggers for testing purposes. I would expect it to send me mail once per hour with the statistics. Still, I don't receive anything.
I also tried sending mail from the command line. The following command succeeds and results in an actual email in my inbox:
metrics@sewerzowi:/srv/onionoo.torproject.org/onionoo$ echo "Test body." | mail -s "Test subject" karsten@torproject.org
Not sure if this is an issue that sysadmins could fix, because sending mail apparently works. Is there an easy way to debug this on the Java side?
Unless an ERROR was logged, I think this is normal behavior.
Quote from my comment below:
''Well, now mailing should work. That is, once an ERROR is logged, the e-mails
are triggered and will contain a few log messages up to this error.''
Currently statistics are logged as INFO. Hence, no mail.
The mail appender functionality is intended as follows:
The log buffer of the appender is filled up to a certain number of lines; then the oldest
lines are dropped when more logging statements arrive. The first ERROR logged will cause
the mailing of all the lines in the buffer up to the ERROR in a single mail. This makes
it possible to tell what happened just by reading that one mail.
If the stat-log should be mailed, too, this requires either changing its log level
(actually, one dummy ERROR written to "statistics" after the statistics logging lines
should suffice to trigger the mailing; everything else could stay on INFO), or some additional
coding (more than editing logback.xml).
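To illustrate the behavior described above, here is a toy model of such a buffer. It is not logback's code, only a simplified stand-in. The class and method names are made up, and clearing the buffer after a mail is an assumption of this model.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class ErrorTriggeredBuffer {

  private final int capacity;
  private final Deque<String> buffer = new ArrayDeque<>();
  private final List<List<String>> sentMails = new ArrayList<>();

  /** The capacity is the maximum number of lines kept, at least 1. */
  ErrorTriggeredBuffer(int capacity) {
    this.capacity = capacity;
  }

  /** Buffers a line and "mails" the buffer content if the level is ERROR. */
  void log(String level, String message) {
    if (buffer.size() == capacity) {
      buffer.removeFirst();  // Drop the oldest line to make room.
    }
    buffer.addLast(level + " " + message);
    if ("ERROR".equals(level)) {
      // One mail with everything buffered up to and including the ERROR.
      sentMails.add(new ArrayList<>(buffer));
      buffer.clear();
    }
  }

  /** Returns the content of all mails triggered so far. */
  List<List<String>> sentMails() {
    return sentMails;
  }
}
```

In this model, any number of INFO lines alone never produces a mail, which matches the statistics case: the lines sit in the buffer until an ERROR arrives.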
PS:
I usually have an error immediately after starting ant run, b/c I still do not have the
correct geoip setup. Is the documentation already updated or should I open a ticket?
In addition, I usually have time-out errors during the run.
> PS:
> I usually have an error immediately after starting ant run, b/c I still do not have the
> correct geoip setup. Is the documentation already updated or should I open a ticket?
Oops. Please find the updated INSTALL file in master.
> In addition, I usually have time-out errors during the run.
Can you be more specific, ideally in a new ticket?
Why are your gnumail jars not in /usr/share/java as all the others?
I assume javax.mail.* from the gnumail jars is not found.
The setup looks ok otherwise.
For debugging, set the following in logback.xml: change `<configuration debug="false">` to `<configuration debug="true">`.
This will print a detailed logging setup at the beginning.
I hope you'll find a ClassNotFound somewhere.
(There should be a ticket for logging documentation, where these things
can be kept for future reference. Contributor or Deployer?)
> Are you sure? I explicitly added the EMAIL appender to the statistics logger.
> I don't see where messages would be filtered based on log level.
The SMTPAppender is triggered by ERROR; it's part of its functionality.
(The default setting, which is enabled in our case, uses an OnErrorEvaluator; see the tar-ball with sources.)
> I also added an error log statement to the end of Main, but still don't receive emails.
That should trigger mailing.
> Oops. Please find the updated INSTALL file in master.
Thanks!
> > In addition, I usually have time-out errors during the run.
> Can you be more specific, ideally in a new ticket?
Download time-out errors due to my internet connection, I guess.
No need for a new ticket, I think.
Could not fetch or store https://collector.torproject.org/recent/bridge-descriptors/statuses/20140917-093705-4A0CCD2DDC7995083D73F5D667100C8A5831F16D. Skipping. Reason: Connection timed out
> Why are your gnumail jars not in /usr/share/java as all the others?
> I assume javax.mail.* from the gnumail jars is not found.
The jars are not installed to /usr/share/java/ yet, but I copied them over for testing. Once everything works I'm planning to ask our sysadmin to install them. But the jars should be present:
> The setup looks ok otherwise.
> For debugging, set the following in logback.xml:
> {{{
> - <configuration debug="false">
> + <configuration debug="true">
> }}}
> This will print a detailed logging setup at the beginning.
> I hope you'll find a ClassNotFound somewhere.
Hmm, nothing in onionoo-all.log. Weird. Do you have an example of what should be logged?
> (There should be a ticket for logging documentation, where these things
> can be kept for future reference. Contributor or Deployer?)
Fine question. (I don't have a good answer yet.)
> > Are you sure? I explicitly added the EMAIL appender to the statistics logger.
> > I don't see where messages would be filtered based on log level.
> The SMTPAppender is triggered by ERROR; it's part of its functionality.
> (The default setting, which is enabled in our case, uses an OnErrorEvaluator; see the tar-ball with sources.)
> > I also added an error log statement to the end of Main, but still don't receive emails.
> That should trigger mailing.
Understood. Makes sense!
> > {{{
> > - <configuration debug="false">
> > + <configuration debug="true">
> > }}}
> > This will print a detailed logging setup at the beginning.
> > I hope you'll find a ClassNotFound somewhere.
> Hmm, nothing in onionoo-all.log. Weird. Do you have an example of what should be logged?
Well, at the beginning the logging setup is written to stdout if debug is enabled as above.
That makes sense, b/c at that point the logging is still in the process of being configured.
{{{
vagrant@vagrant:/srv/onionoo.torproject.org/onionoo$ ant run
[java] 11:22:25,917 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Could NOT find resource [logback.groovy]
[java] 11:22:25,917 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Could NOT find resource [logback-test.xml]
[java] 11:22:25,917 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Found resource [logback.xml] at [file:/srv/onionoo.torproject.org/onionoo/classes/logback.xml]
[java] 11:22:26,013 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not set
[java] 11:22:26,021 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.rolling.RollingFileAppender]
[java] 11:22:26,032 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [FILEALL]
[java] 11:22:26,076 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] property
[java] 11:22:26,195 |-INFO in c.q.l.core.rolling.TimeBasedRollingPolicy - No compression will be used
[java] 11:22:26,197 |-INFO in c.q.l.core.rolling.TimeBasedRollingPolicy - Will use the pattern ./onionoo-all.%d{yyyy-MM-dd}.%i.log for the active file
....
}}}
and a lot more. If an Exception is thrown, there are even more lines. The exception I hope to see:
{{{
 ...
[java] 11:28:17,750 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.classic.net.SMTPAppender]
[java] 11:28:17,752 |-ERROR in ch.qos.logback.core.joran.action.AppenderAction - Could not create an Appender of type [ch.qos.logback.classic.net.SMTPAppender]. ch.qos.logback.core.util.DynamicClassLoadingException: Failed to instantiate type ch.qos.logback.classic.net.SMTPAppender
[java] at ch.qos.logback.core.util.DynamicClassLoadingException: Failed to instantiate type ch.qos.logback.classic.net.SMTPAppender
[java] at at ch.qos.logback.core.util.OptionHelper.instantiateByClassName(OptionHelper.java:54)
[java] at at ch.qos.logback.core.util.OptionHelper.instantiateByClassName(OptionHelper.java:32)
[java] at at ch.qos.logback.core.joran.action.AppenderAction.begin(AppenderAction.java:54)
[java] at at ch.qos.logback.core.joran.spi.Interpreter.callBeginAction(Interpreter.java:276)
[java] at at ch.qos.logback.core.joran.spi.Interpreter.startElement(Interpreter.java:148)
[java] at at ch.qos.logback.core.joran.spi.Interpreter.startElement(Interpreter.java:130)
[java] at at ch.qos.logback.core.joran.spi.EventPlayer.play(EventPlayer.java:50)
[java] at at ch.qos.logback.core.joran.GenericConfigurator.doConfigure(GenericConfigurator.java:147)
[java] at at ch.qos.logback.core.joran.GenericConfigurator.doConfigure(GenericConfigurator.java:133)
[java] at at ch.qos.logback.core.joran.GenericConfigurator.doConfigure(GenericConfigurator.java:96)
[java] at at ch.qos.logback.core.joran.GenericConfigurator.doConfigure(GenericConfigurator.java:55)
[java] at at ch.qos.logback.classic.util.ContextInitializer.configureByResource(ContextInitializer.java:75)
[java] at at ch.qos.logback.classic.util.ContextInitializer.autoConfig(ContextInitializer.java:148)
[java] at at org.slf4j.impl.StaticLoggerBinder.init(StaticLoggerBinder.java:84)
[java] at at org.slf4j.impl.StaticLoggerBinder.<clinit>(StaticLoggerBinder.java:54)
[java] at at org.slf4j.LoggerFactory.bind(LoggerFactory.java:128)
[java] at at org.slf4j.LoggerFactory.performInitialization(LoggerFactory.java:108)
[java] at at org.slf4j.LoggerFactory.getILoggerFactory(LoggerFactory.java:279)
[java] at at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:252)
[java] at at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:265)
[java] at at org.torproject.onionoo.server.ServerMain.<clinit>(ServerMain.java:12)
[java] Caused by: java.lang.NoClassDefFoundError: javax/mail/Message
[java] at at java.lang.Class.getDeclaredConstructors0(Native Method)
[java] at at java.lang.Class.privateGetDeclaredConstructors(Class.java:2532)
[java] at at java.lang.Class.getConstructor0(Class.java:2842)
[java] at at java.lang.Class.newInstance(Class.java:345)
[java] at at ch.qos.logback.core.util.OptionHelper.instantiateByClassName(OptionHelper.java:50)
[java] at ... 20 common frames omitted
[java] Caused by: java.lang.ClassNotFoundException: javax.mail.Message
[java] at at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
[java] at at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
[java] at at java.security.AccessController.doPrivileged(Native Method)
[java] at at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
[java] at at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
[java] at at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
[java] at at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
[java] at ... 25 common frames omitted
[java] 11:28:17,752 |-ERROR in ch.qos.logback.core.joran.spi.Interpreter@23:73 - ActionException in Action for tag [appender] ch.qos.logback.core.joran.spi.ActionException: ch.qos.logback.core.util.DynamicClassLoadingException: Failed to instantiate type ch.qos.logback.classic.net.SMTPAppender
 ...
}}}
PS:
If there are no exceptions for the SMTPAppender, it could be the e-mail address in from: `metrics@sewerzowi.torproject.org`. Your mail test above (at the end of comment 11) might send from `metrics@localhost`.
If that is the reason for the missing mails, the inbox of 'metrics' could have the rejected mails.
And the mailing will work with metrics@localhost.
[java] 18:15:02,806 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Could NOT find resource [logback.groovy]
[java] 18:15:02,807 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Could NOT find resource [logback-test.xml]
[java] 18:15:02,807 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Found resource [logback.xml] at [file:/srv/onionoo.torproject.org/onionoo/classes/logback.xml]
[java] 18:15:02,965 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.rolling.RollingFileAppender]
[java] 18:15:02,970 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [FILEALL]
[java] 18:15:03,013 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] property
[java] 18:15:03,113 |-INFO in c.q.l.core.rolling.TimeBasedRollingPolicy - No compression will be used
[java] 18:15:03,115 |-INFO in c.q.l.core.rolling.TimeBasedRollingPolicy - Will use the pattern /srv/onionoo.torproject.org/onionoo/onionoo-all.%d{yyyy-MM-dd}.%i.log for the active file
[java] 18:15:03,118 |-INFO in ch.qos.logback.core.rolling.SizeAndTimeBasedFNATP@2e6ed964 - The date pattern is 'yyyy-MM-dd' from file name pattern '/srv/onionoo.torproject.org/onionoo/onionoo-all.%d{yyyy-MM-dd}.%i.log'.
[java] 18:15:03,119 |-INFO in ch.qos.logback.core.rolling.SizeAndTimeBasedFNATP@2e6ed964 - Roll-over at midnight.
[java] 18:15:03,123 |-INFO in ch.qos.logback.core.rolling.SizeAndTimeBasedFNATP@2e6ed964 - Setting initial period to Wed Sep 17 17:32:19 UTC 2014
[java] 18:15:03,129 |-INFO in ch.qos.logback.core.rolling.RollingFileAppender[FILEALL] - Active log file name: /srv/onionoo.torproject.org/onionoo/onionoo-all.log
[java] 18:15:03,130 |-INFO in ch.qos.logback.core.rolling.RollingFileAppender[FILEALL] - File property is set to [/srv/onionoo.torproject.org/onionoo/onionoo-all.log]
[java] 18:15:03,131 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.FileAppender]
[java] 18:15:03,131 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [FILEERR]
[java] 18:15:03,132 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] property
[java] 18:15:03,141 |-INFO in ch.qos.logback.core.FileAppender[FILEERR] - File property is set to [/srv/onionoo.torproject.org/onionoo/onionoo-err.log]
[java] 18:15:03,141 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.FileAppender]
[java] 18:15:03,141 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [FILESTATISTICS]
[java] 18:15:03,142 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] property
[java] 18:15:03,151 |-INFO in ch.qos.logback.core.FileAppender[FILESTATISTICS] - File property is set to [/srv/onionoo.torproject.org/onionoo/onionoo-statistics.log]
[java] 18:15:03,151 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.classic.net.SMTPAppender]
[java] 18:15:03,165 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [EMAIL]
[java] 18:15:03,230 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [FILEERR] to Logger[org.torproject]
[java] 18:15:03,231 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [EMAIL] to Logger[org.torproject]
[java] 18:15:03,232 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [FILESTATISTICS] to Logger[org.torproject.onionoo.cron.Main]
[java] 18:15:03,232 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [FILESTATISTICS] to Logger[statistics]
[java] 18:15:03,232 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [EMAIL] to Logger[statistics]
[java] 18:15:03,232 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - Setting level of ROOT logger to ALL
[java] 18:15:03,232 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [FILEALL] to Logger[ROOT]
[java] 18:15:03,232 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - End of configuration.
[java] 18:15:03,234 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@629bee3a - Registering current configuration as safe fallback point
[java] 18:15:03,241 |-INFO in ch.qos.logback.classic.net.SMTPAppender[EMAIL] - SMTPAppender [EMAIL] is tracking [1] buffers
Still no luck. But I wonder if we can do better than using SMTPAppender. The Nagios plugin that checks that Onionoo is running and returns recent data works just fine. Maybe we should write a second plugin that checks that the back-end cronjob works without problems. And we could write a third plugin that checks that the front-end part works without issues. The two new plugins would probably simply look for a file with error logs on disk. I can write those scripts using Python. Also, not requiring a working mail setup could make deployment easier, too. Whoever wants to deploy Onionoo can use whatever they like to watch the service, which could be our Nagios scripts or something else. What do you think?
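A rough sketch of what such an error-log check could do, using the usual Nagios plugin exit codes (0 OK, 2 CRITICAL, 3 UNKNOWN). The plan above is to write the plugins in Python; Java is used here only to stay in the language of the code base. `ErrorLogCheck` is a hypothetical name, the default file name is taken from the FILEERR appender output above, and treating a missing or empty file as "no errors" is an assumption.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class ErrorLogCheck {

  static final int OK = 0;
  static final int CRITICAL = 2;
  static final int UNKNOWN = 3;

  /** Returns a Nagios status code based on whether the error log has content. */
  static int check(Path errorLog) {
    try {
      // Assumption: a missing or empty error log means nothing went wrong.
      if (!Files.exists(errorLog) || Files.size(errorLog) == 0) {
        return OK;
      }
      return CRITICAL;
    } catch (IOException e) {
      return UNKNOWN;
    }
  }

  public static void main(String[] args) {
    Path log = Paths.get(args.length > 0 ? args[0] : "onionoo-err.log");
    int status = check(log);
    String label = status == OK ? "OK" : status == CRITICAL ? "CRITICAL" : "UNKNOWN";
    System.out.println("ONIONOO " + label + " - checked " + log);
    System.exit(status);
  }
}
```

A real check would probably also look at how old the entries are, so that a single error from last week doesn't keep the check red forever.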
Yes, you're right. Monitoring should be separated from the application.
For the front-end I'd suggest jmx/MBeans (actually comment here #11573). These can very easily be verified with a nagios plugin.
We even could re-use the jmx-example that comes with the tomcat-extra (i.e. jmx) package.
It lists sessions, memory and the like as plain text.
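As a small illustration of the kind of plain-text output meant here, the following sketch reads the standard memory MXBean of the local JVM. It is not the Tomcat example itself, and `JmxMemoryReport` is a made-up name. A plugin would have to connect to the front-end's JVM through a JMX connector instead, which is not shown.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

class JmxMemoryReport {

  /** Returns heap figures of the current JVM as one line of plain text. */
  static String report() {
    MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
    MemoryUsage heap = memory.getHeapMemoryUsage();
    return "heap_used=" + heap.getUsed()
        + " heap_committed=" + heap.getCommitted()
        + " heap_max=" + heap.getMax();
  }
}
```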
Using nagios plugins right now is useful and won't prevent any other/additional monitoring later on.
And, python plugins are way quicker to write and deploy.
Depending on the server setup, the nagios plugin might get read access to the logs of onionoo's backend?