Opened 2 years ago

Closed 19 months ago

#22990 closed enhancement (fixed)

Add a heartbeat log message indicating progress and estimated time left

Reported by: karsten Owned by: iwakeh
Priority: Medium Milestone:
Component: Metrics/Library Version:
Severity: Normal Keywords: metrics-2018
Cc: Actual Points:
Parent ID: Points:
Reviewer: iwakeh Sponsor:

Description

It would be really useful to know how long it still takes to read a bunch of descriptors, as this can range from minutes to hours or even days. In some cases this information would help to decide whether to continue or abort and try something different.

How about we start reading descriptors by walking files in the given directories, summing up their bytes, and using that total sum to periodically write a log message with progress and estimated time left?

Once per minute on debug level would be sufficient. And it's okay if it's not super precise and gets confused by compression or other factors. Just a rough number that gets more precise over time.

Child Tickets

Change History (21)

comment:1 Changed 2 years ago by karsten

It's probably worth noting that progress is not only a function of reading and parsing speed, but also of processing speed of the application. That means that we cannot easily say that parsing X bytes of server descriptors takes Y minutes. If the application takes a while to process each returned descriptor, that delays the overall process. That's why this heartbeat would be useful also to debug applications and find bottlenecks.

comment:2 Changed 2 years ago by iwakeh

Yes, a heartbeat would be useful.

Why not keep it simple by only noting the total versus processed count (units in bytes and files or whatever is appropriate) and omitting the time estimate?
From a few heartbeat message it is easy to infer/guesstimate the time needed or a potential slowdown for the current operation.

The heartbeats could be written to their own log for further processing (using some logback config features).

comment:3 Changed 2 years ago by karsten

Keeping it simple by omitting the time estimate sounds fine to me. We could always add that later if we think it would be useful.

Regarding using a separate log, I'd probably keep that simple, too, and just use the same log. But I don't feel strongly about this.

Okay, I might grab this ticket and hack on this the next time I'm processing larger amounts of data. But if you want to hack on this first, feel free to grab it, and I'll review the patch.

comment:4 Changed 2 years ago by iwakeh

Owner: changed from metrics-team to iwakeh
Status: newaccepted

Why not continue now as I started thinking about it.

comment:5 Changed 23 months ago by karsten

Keywords: metrics-2018 added

comment:6 Changed 23 months ago by karsten

Keywords: metrics-2017 added; metrics-2018 removed

comment:7 Changed 21 months ago by karsten

Status: acceptedneeds_review

Please review commit bd52455 in my task-22990 branch.

We could probably do more, but this patch turned out to be quite useful already to test my #24218 implementation. I could imagine postponing future improvements until we see a need for them. But if you have further ideas that are relatively easy to implement, let's do them, too.

comment:8 Changed 21 months ago by iwakeh

Reviewer: iwakeh

Adding these to my reviewing list.

comment:9 Changed 21 months ago by iwakeh

Status: needs_reviewneeds_revision

I'd like to use a Logger different from the standard Logger. Maybe, 'statistics' as in Onionoo, i.e., LoggerFactory.getLogger("statistics")...? The progress information is rather of statistical nature, the log statements will show up in the general ('-all.log') anyway; and it would be possible to have the information separated by simple logback configuration.

The percent calculation should only be done once: 100 * progress / total.
Why not also keep track of the percentage done, i.e., previousPercentDone and percentDone or similar?

More importantly, readTarballs needs to catch the IOE to avoid halting the process just b/c of one tarball being faulty.

Information about skipped tarballs should also be logged to the 'statistics' logger and the tarball skipped should be logged as warning as before.

Maybe, think about other possible statistics, which could be easily logged?

Last edited 21 months ago by iwakeh (previous) (diff)

comment:10 in reply to:  9 ; Changed 21 months ago by karsten

Replying to iwakeh:

I'd like to use a Logger different from the standard Logger. Maybe, 'statistics' as in Onionoo, i.e., LoggerFactory.getLogger("statistics")...? The progress information is rather of statistical nature, the log statements will show up in the general ('-all.log') anyway; and it would be possible to have the information separated by simple logback configuration.

Okay.

The percent calculation should only be done once: 100 * progress / total.
Why not only keep track of the percentage done, i.e., previousPercentDone and percentDone or similar?

Sure, why not.

More importantly, readTarballs needs to catch the IOE to avoid halting the process just b/c of one tarball being faulty.

Ugh, very good point. That's an oversight.

Information about skipped tarballs should also be logged to the 'statistics' logger

Not sure about that one. That's something different than what we're trying to achieve here: have an easy way to learn about progress when importing large amounts of data.

and the tarball skipped should be logged as warning as before.

No, it's perfectly fine to leave a tarball in place and use the history file to avoid reading it once again. If we want to log it, it should be on debug.

Maybe, think about other possible statistics, which could be easily logged?

Nothing came to mind yesterday. The new message was already very useful when importing one year of descriptors which took five hours. After three or four hours one wonders whether it's still doing anything useful or if it's going to take days or weeks, and then it's really useful to have this message. I'd say let's save future statistics for when we find out that we need them.

I can revise the patch today or tomorrow, unless you beat me to it.

comment:11 in reply to:  10 Changed 21 months ago by iwakeh

Replying to karsten:

... Okay.

Just some replies below to what we didn't totally agree on :-)

...

Information about skipped tarballs should also be logged to the 'statistics' logger


Not sure about that one. That's something different than what we're trying to achieve here: have an easy way to learn about progress when importing large amounts of data.

Well, progress will be really fast, if tarballs are skipped.

and the tarball skipped should be logged as warning as before.

No, it's perfectly fine to leave a tarball in place and use the history file to avoid reading it once again. If we want to log it, it should be on debug.

That will be an additional change. Currently, these are logged as warning (I basically suggested it to keep the behavior we had):
` log.warn("Unable to read descriptor file {}.", file, e);` line 251

Maybe, think about other possible statistics, which could be easily logged?

Nothing came to mind yesterday. The new message was already very useful when importing one year of descriptors which took five hours. After three or four hours one wonders whether it's still doing anything useful or if it's going to take days or weeks, and then it's really useful to have this message. I'd say let's save future statistics for when we find out that we need them.

Yep, that's fine.

I can revise the patch today or tomorrow, unless you beat me to it.

Whoever starts on this should first add a tiny comment here :-)

comment:12 in reply to:  10 ; Changed 21 months ago by iwakeh

Replying to karsten:

...

More importantly, readTarballs needs to catch the IOE to avoid halting the process just b/c of one tarball being faulty.

Ugh, very good point. That's an oversight.

This is an excellent example where having decent tests would serve as documentation and ensuring that the wanted behavior, i.e., keep processing, is kept when making changes.

comment:13 Changed 21 months ago by karsten

Replying to iwakeh:

Replying to karsten:

... Okay.

Just some replies below to what we didn't totally agree on :-)

...

Information about skipped tarballs should also be logged to the 'statistics' logger


Not sure about that one. That's something different than what we're trying to achieve here: have an easy way to learn about progress when importing large amounts of data.

Well, progress will be really fast, if tarballs are skipped.

Ah, no, they won't be processed in the readTarballs() method at all, because they're not even added to tarballs in readDescriptorFiles().

and the tarball skipped should be logged as warning as before.

No, it's perfectly fine to leave a tarball in place and use the history file to avoid reading it once again. If we want to log it, it should be on debug.

That will be an additional change. Currently, these are logged as warning (I basically suggested it to keep the behavior we had):
` log.warn("Unable to read descriptor file {}.", file, e);` line 251

Oh, you mean skipped because of an IOException? Yes, we should keep that as a warning. I thought you meant skipped because we read them before. Alright, nothing to do here then.

Maybe, think about other possible statistics, which could be easily logged?

Nothing came to mind yesterday. The new message was already very useful when importing one year of descriptors which took five hours. After three or four hours one wonders whether it's still doing anything useful or if it's going to take days or weeks, and then it's really useful to have this message. I'd say let's save future statistics for when we find out that we need them.

Yep, that's fine.

I can revise the patch today or tomorrow, unless you beat me to it.

Whoever starts on this should first add a tiny comment here :-)

Sounds good!

comment:14 in reply to:  12 ; Changed 21 months ago by karsten

Replying to iwakeh:

Replying to karsten:

...

More importantly, readTarballs needs to catch the IOE to avoid halting the process just b/c of one tarball being faulty.

Ugh, very good point. That's an oversight.

This is an excellent example where having decent tests would serve as documentation and ensuring that the wanted behavior, i.e., keep processing, is kept when making changes.

Indeed.

And it's also an excellent example how code review works! :)

comment:15 in reply to:  14 Changed 21 months ago by iwakeh

Replying to karsten:

Replying to iwakeh:

Replying to karsten:

...

More importantly, readTarballs needs to catch the IOE to avoid halting the process just b/c of one tarball being faulty.

Ugh, very good point. That's an oversight.

This is an excellent example where having decent tests would serve as documentation and ensuring that the wanted behavior, i.e., keep processing, is kept when making changes.

Indeed.

And it's also an excellent example how code review works! :)

Except that the review should rather be for issues that cannot be checked by code :-)

comment:16 Changed 21 months ago by karsten

I'm revising this patch now.

comment:17 Changed 21 months ago by karsten

Status: needs_revisionneeds_review

comment:18 Changed 21 months ago by karsten

Please also review commit 39b1242 in the same branch with another tweak.

comment:19 Changed 20 months ago by iwakeh

Keywords: metrics-2018 added; metrics-2017 removed

Will be completed in 2018.

comment:20 Changed 19 months ago by iwakeh

Status: needs_reviewmerge_ready

Looks fine and passes all automated checks&tests.
Please find a tiny fixup commit removing a 'printStackTrace' call.

Ready for merge.

comment:21 Changed 19 months ago by karsten

Resolution: fixed
Status: merge_readyclosed

Fixup commit 03d5496 looks good. Squashed everything, rebased to master, and pushed. Closing. Thanks!

Note: See TracTickets for help on using tickets.