Opened 2 years ago

Last modified 22 months ago

#22140 new enhancement

Store raw descriptor contents as UTF-8 encoded Strings rather than byte[]

Reported by: karsten Owned by: metrics-team
Priority: Medium Milestone:
Component: Metrics/Library Version:
Severity: Normal Keywords: metrics-2018
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

When we're reading descriptors from disk we're storing raw descriptor contents as byte[] and returning them in Descriptor#getRawDescriptorBytes(). Also, we're storing partial raw descriptor contents in DirSourceEntry#getDirSourceEntryBytes() and NetworkStatusEntry#getStatusEntryBytes().

Storing byte[] can be useful when writing raw contents back to disk, because we can be sure that contents are exactly the same as when we read them from disk. Namely, we don't have to worry about character encoding.

However, support for handling (large) byte[] content is limited. Today I looked into ways to handle large descriptor files (#20395), and I found that most libraries work best with character streams, not with byte streams. And I only briefly considered implementing Knuth-Morris-Pratt myself...

So, I looked at the four main code bases using metrics-lib (CollecTor, ExoneraTor, metrics-web, Onionoo) to see which of them use raw descriptor bytes and how. After all, if we're not using them ourselves, we can as well get rid of them. Here's what I found:

  1. Onionoo's DescriptorQueue uses raw bytes to keep statistics on processed bytes, which seems like something that would still work reasonably well with character lengths.
  2. CollecTor's DescriptorPersistence indeed uses raw descriptor bytes to write descriptors obtained from another CollecTor instance to disk. We'd have to change that.
  3. CollecTor's VotePersistence uses raw descriptor bytes to calculate the digest of votes, which is something we should implement in metrics-lib directly (#20333).
  4. ExoneraTor's ExoneraTorDatabaseImporter imports raw status entry bytes into the database, but we know that those are just ASCII, so this would work as well with UTF-8 strings.
  5. metrics-web's RelayDescriptorDatabaseImporter also imports raw status entry bytes into the database, which works with strings for the same reason as above.

I might have overlooked something.

But if not, CollecTor's DescriptorPersistence is the only place where we really need byte[] rather than String. If we can change that, we can switch from Descriptor#getRawDescriptorBytes() to Descriptor#getRawDescriptor() and deprecate the former (and do the same with the other two partial contents).

And then we can resume #20395 with a much more complete toolbox.

Child Tickets

Change History (5)

comment:1 Changed 2 years ago by karsten

Type: defectenhancement

comment:2 Changed 2 years ago by karsten

Uhm, after working on #20333, I realized that we'll need to process raw bytes for calculating descriptor digests. That defeats the plans above to some extent. We'll need to work with raw bytes at least until files are split up into descriptors. This doesn't mean that storing strings rather than raw bytes wouldn't still be beneficial. But unfortunately this change won't give us the more complete toolbox that I was hoping for. Hmm.

comment:3 Changed 2 years ago by iwakeh

Another caveat is that not all writing is yet done through the persistence mechanism (cf. #20518).

But, I think (as suggested in #20395 the FileChannel approach) that maybe lazy reading/providing of bytes could leave the code here as is?

comment:4 Changed 2 years ago by iwakeh

Please, refer to #20395 for the 'lazy reading/providing of bytes' approach in order to just have one place for the discussion.

comment:5 Changed 22 months ago by karsten

Keywords: metrics-2018 added
Note: See TracTickets for help on using tickets.