Opened 4 years ago

Closed 3 years ago

#16296 closed defect (invalid)

Implement lock file in a more robust way

Reported by: karsten
Owned by:
Priority: Medium
Milestone:
Component: Metrics/Onionoo
Version:
Severity: Normal
Keywords:
Cc:
Actual Points:
Parent ID: #16401
Points:
Reviewer:
Sponsor:

Description

The current lock file gets written when the hourly updater starts and gets deleted when it exits. It works well for preventing concurrent executions, but it doesn't handle cases where a process doesn't exit cleanly: subsequent executions then won't start at all. One way a process can fail to exit cleanly is that the machine gets rebooted while the hourly updater is running.

One way to fix this might be to use Java's FileLock instead of rolling our own lock file.
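
For illustration only, here is a minimal sketch of what a FileLock-based approach could look like; the class name and lock file path are made up and not taken from the Onionoo code base. Because an OS-level lock is released automatically when the process exits, a crash or reboot would no longer block subsequent runs:

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

/* Hypothetical sketch, not Onionoo code. */
public class UpdaterLockSketch {

  public static void main(String[] args) throws IOException {
    try (FileChannel channel = FileChannel.open(Paths.get("lock"),
        StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
      /* tryLock() returns null if another process already holds the lock. */
      FileLock lock = channel.tryLock();
      if (lock == null) {
        System.err.println("Another updater instance appears to be running.");
        return;
      }
      /* Run the hourly update here.  The lock is released when the
       * channel is closed or the JVM exits, cleanly or not. */
    }
  }
}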

This is more a defect than an enhancement, because it broke the main Onionoo service instance.

Child Tickets

Attachments (3)

0001-Make-LockFile-more-robust.-Per-ticket-16296.patch (8.5 KB) - added by leeroy 4 years ago.
onionoo-mem-usage.png (30.4 KB) - added by karsten 4 years ago.
0001-Make-LockFile-even-more-robust.-Per-16296.patch (7.5 KB) - added by leeroy 4 years ago.
Improved permission handling. Also fixes 16401.

Change History (15)

comment:1 Changed 4 years ago by leeroy

How robust does it need to be? I ran into this when I hit ^C on the updater, which left the lock file behind. It could perhaps also be solved by adding a shutdown hook to the updater's runtime. Maybe try a bit of both? Neither option is going to catch an unexpected system crash or reset. If you need it to be bulletproof, it might be worthwhile to also check the pid of the running Java VM.
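
For what it's worth, a shutdown hook along these lines would cover the ^C case; the class name and lock file path here are hypothetical and not taken from the attached patch:

import java.io.File;

/* Hypothetical sketch, not the attached patch. */
public class ShutdownHookSketch {

  public static void main(String[] args) {
    final File lockFile = new File("lock");  // assumed lock file path
    /* The hook runs on normal exit and on SIGINT/SIGTERM, but not on
     * kill -9 or a machine reset, so it is only a partial fix. */
    Runtime.getRuntime().addShutdownHook(new Thread(new Runnable() {
      public void run() {
        if (lockFile.exists() && !lockFile.delete()) {
          System.err.println("Could not remove lock file on shutdown.");
        }
      }
    }));
    /* ... run the updater here ... */
  }
}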

Oh Java, I had almost forgotten thy quirks. Okay, so I made an attempt at a patch that retains cross-platform compatibility while being as robust as possible. I couldn't get FileLock to work reliably, so I resorted to other means. How does it look?

On a related subject, the lock file is only used by the one-time updater; the periodic updater doesn't use it at all. Should this be changed?

comment:2 Changed 4 years ago by leeroy

Status: new → needs_review

comment:3 Changed 4 years ago by karsten

Oh wow, that's quite some code there for preventing concurrent executions.

Please don't hate me for this suggestion, but what do you think of making the periodic updater the normal mode of operation and getting rid of the lock file entirely?

Want to take a look at #13600 and the branch there? Note that it currently uses a lock file, but I could see us removing that. Service operators would then be responsible for not starting two instances of the updater concurrently. If there's no risk anymore that two cron runs get in each other's way, this seems like a reasonable assumption.

Thoughts?

comment:4 Changed 4 years ago by leeroy

I don't think that's something to do lightly. There are many problems with removing the lock file as you suggest.

  • By running two Java processes, one for the server and one for the periodic updater, each with access to the same heap size, you leave each at the mercy of the operating system.
  • Java's GC works best within the context of a single JVM. When you leave the periodic updater running, even after it releases resources internally, its GC doesn't kick in until the OS says `I'm low on memory`. Even once the updater's GC does run, the JVM doesn't automatically make that memory (2-4 GB) available again.
  • That means the updater may continue to occupy the allocated heap space. Relying on the OS, which uses virtual addressing, negatively impacts other server processes (imagine adding a db process).
  • With a cron job, that memory is always freed at the end of the update.
  • The periodic updater can currently be run concurrently because it doesn't use the lock at all.
  • Telling an operator "it's your own fault if you run more than one updater and break something" is lazy development, which leads to poor quality.
  • You'll have no choice but to transition to a separate database process.

A lock file provides many benefits. It's underutilized.

  • The current Onionoo design is effectively a schemaless, semi-structured db. It needs locks because they aren't provided any other way.
  • A lock during an update means the developer of Onionoo went to the effort of making corruption of the data stores unlikely.
  • It enables the updater to detect potential error conditions across system restarts. Instead of just removing the lock, the updater can use the existence of an invalid lock to trigger error detection/correction.
  • The updater can be modified to use multiple locks, making extensibility, performance improvements, and message passing easier.
  • The use of (high-performance) locks to merge stages of the updating process leads to less dependence on the heap and more dependence on other (unused) I/O subsystems.

tl;dr How the updater is run cannot be enforced. It should therefore be flexible enough to allow operator choice, and it should be fault tolerant, just in case an operator decides to use some alternative automatic run mechanism (periodic or one-time) or accidentally runs more than one. I, for one operator, do not, under any circumstance, want to re-import data when it takes an hour to import recent data (72h) and 12 hours to import a single month of archives.

If kept, a lock should be created in the updater instance, not in the one-time version or the periodic one. Note that mapped memory is used in the patch. A further improvement (see #15844) would be to remove the use of sockets (I never liked that anyway) and instead use the lock content together with some other validation method. I didn't try anything fancy in the patch because this use of the lock file isn't defined (except to prevent a concurrent one-time updater). It's up to you. Please do consider the updated comments in #15844 before deciding to scrap the lock file.

Changed 4 years ago by karsten

Attachment: onionoo-mem-usage.png added

comment:5 Changed 4 years ago by karsten

This reply doesn't address most of your (valid) points above.

I was curious about memory usage over time if the hourly updater is not started by cron. That's why I let it run on the Onionoo mirror and added another cronjob to measure memory usage:

* * * * * ps ax -o rss,command | grep java >> top-usage

See the attached graph. It shows that Java indeed does not free most of its memory between runs. But it also mostly sticks to the configured 4 GB threshold and doesn't grow indefinitely. I'd say that while it could behave nicer in theory, this is acceptable behavior.

But I agree with you that we shouldn't take out the lock file lightly at this point. Needs more thoughts.

comment:6 Changed 4 years ago by leeroy

True, it does stick to the 4 GB. That's what the -Xmx flag does: it tells the JVM it may use 4 GB even if it doesn't need it, which is most of the time for a one-time update or an hourly periodic run. In fact, the only time 4 GB is actually needed is for processing archives (~3.7 GB on average for a single month). The problem is that this trashes the OS cache, which could be used for better things like disk cache or web frontend cache (see that whopping 5.3 GB of virtual address space used for the hourly update). Not to mention what would happen if you introduce another dependency, like a db process, which relies on the OS memory management and caching.

I've touched on this topic elsewhere, so not to worry; it's a major concern.

Oh, I've also made the LockFile even more robust against permission-related problems. I'm attaching the revision here (and I'm testing it on a running instance). I see there aren't any unit tests for this component; if you would like some, just mention it here. The permission handling addition also deals with #16401. If you would prefer more separation, please do mention that here as well.

(I think I will go back and try to make it work the way tor does it. I haven't tried to make Java behave like C in a couple of weeks, since FileLock failed; maybe some other solution will work.)

comment:7 Changed 4 years ago by leeroy

I think I might have done something wrong the first time. I obtained a file channel and then checked the locking, and it would still allow acquiring an exclusive lock, as well as I/O, in another JVM. I now open the file using the file channel directly, and the exclusive locking appears to work.

Yes, FileLock works if the locking mechanism is rewritten to use FileChannel.open. I can't remember what I did before, something closer to the original code I think, but the tryLock method didn't work that way. It works now, so I'll replace that last patch. You should find this is closer to how tor handles file locking.

In case you're worried about the absence of a FileChannel.close(), try adding DELETE_ON_CLOSE to FileChannel.open(), which shows that Java doesn't even leave the file around. There is no leak, because even in the current implementation the updater creates a new LockFile each time, so the file descriptor gets GC'd at that point. Similarly, for the permission-checking code, the methods are static, and so everything is released on an early return from checking directory contents.
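
A rough sketch of the combination described above; the class name, lock file path, and option set are assumptions about the approach, not a copy of the attached patch:

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

/* Hypothetical sketch, not the attached patch. */
public class ChannelLockSketch {

  public static void main(String[] args) throws IOException {
    FileChannel channel = FileChannel.open(Paths.get("lock"),
        StandardOpenOption.CREATE, StandardOpenOption.WRITE,
        StandardOpenOption.DELETE_ON_CLOSE);
    FileLock lock = channel.tryLock();
    if (lock == null) {
      channel.close();
      System.err.println("Lock held by another process; exiting.");
      return;
    }
    /* ... run the update.  Closing the channel releases the lock and,
     * because of DELETE_ON_CLOSE, removes the file as well. */
    channel.close();
  }
}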

Changed 4 years ago by leeroy

Improved permission handling. Also fixes 16401.

comment:8 Changed 4 years ago by karsten

Sorry again for throwing in something here that doesn't address most of your comments, but I figured it's better to note this than stay silent:

I noticed that in NodeDetailsStatusUpdater we try hard not to have DocumentStore read NodeStatus instances from disk until after we have finished parsing new descriptors in NodeDetailsStatusUpdater.processDescriptor and before NodeDetailsStatusUpdater.readNodeStatuses is called. That works fine for the --single-run case, but it fails for the no-args execution of the hourly updater, because in that case DocumentStore just keeps NodeStatus instances in memory until the next run starts. What we should do, I think, is invalidate the cache after each hourly update. I think the following code achieves that, but I'd appreciate another set of eyes to look and another set of hands to test before merging this. Thanks!

diff --git a/src/main/java/org/torproject/onionoo/cron/Main.java b/src/main/java/org/torproject/onionoo/cron/Main.java
index 6f39cb6..d8739e1 100644
--- a/src/main/java/org/torproject/onionoo/cron/Main.java
+++ b/src/main/java/org/torproject/onionoo/cron/Main.java
@@ -204,6 +204,8 @@ public class Main implements Runnable {
     if (this.ds != null) {
       this.ds.flushDocumentCache();
       this.log.info("Flushed document cache");
+      this.ds.invalidateDocumentCache();
+      this.log.info("Invalidated document cache");
     }
   }

comment:9 Changed 4 years ago by leeroy

No problem, I will test this.

comment:10 Changed 4 years ago by leeroy

When you say it fails, what exactly do you mean? Regardless of how the updater is launched, it goes through the run method, which ensures that shutDown and cleanUp are called. Doesn't that take care of the problem?

comment:11 Changed 4 years ago by leeroy

Parent ID: #16401

comment:12 Changed 3 years ago by karsten

Resolution: invalid
Severity: Normal
Status: needs_review → closed

We stopped using a lock file two months ago, so this ticket is moot.