#21932 closed defect (fixed)

Stop relying on the platform's default charset

Reported by: karsten Owned by: metrics-team
Priority: Medium Milestone: metrics-lib 2.0.0
Component: Metrics/Library Version:
Severity: Normal Keywords:
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

While looking into the encoding issue of different Onionoo instances producing different contact string encodings (#15813), I tracked down this issue to metrics-lib's ServerDescriptorImpl.java class and its usage of new String(byte[]).

The issue is that the constructor above uses "the platform's default charset". Turns out that the main Onionoo instance uses US-ASCII as default charset (Charset.defaultCharset()) and the mirror uses UTF-8. (Interestingly, the mirror only uses UTF-8 for commands executed by cron and also uses US-ASCII for commands directly executed by my user, so the default would change depending on whether Onionoo's updater was started automatically after a reboot or started manually by the user; which made debugging just a bit more challenging!)

Long story short, we should not rely on the platform's default charset when converting bytes to strings or vice versa, but we should explicitly specify the charset we want! We just need to pick one.

Somewhat related I ran an analysis of character encodings in relay server descriptors two weeks ago. Here's what I found:

$ wget
https://collector.torproject.org/archive/relay-descriptors/server-descriptors/server-descriptors-2017-02.tar.xz
$ tar xf server-descriptors-2017-02.tar.xz
$ find server-descriptors-2017-02 -type f -exec file --mime {} \; > mimes
$ cut -d" " -f3 mimes | sort | uniq -c
  68 charset=iso-8859-1
466900 charset=us-ascii
1145 charset=utf-8

I'd say let's just pretend that server descriptors are UTF-8 encoded. In this case, the following patch will resolve the issue for server descriptors:

diff --git a/src/main/java/org/torproject/descriptor/impl/ServerDescriptorImpl.java b/src/main/java/org/torproject/descriptor/impl/ServerDescriptorImpl.java
index 309cad4..2381378 100644
--- a/src/main/java/org/torproject/descriptor/impl/ServerDescriptorImpl.java
+++ b/src/main/java/org/torproject/descriptor/impl/ServerDescriptorImpl.java
@@ -8,6 +8,7 @@ import org.torproject.descriptor.DescriptorParseException;
 import org.torproject.descriptor.ServerDescriptor;
 
 import java.io.UnsupportedEncodingException;
+import java.nio.charset.StandardCharsets;
 import java.security.MessageDigest;
 import java.security.NoSuchAlgorithmException;
 import java.util.ArrayList;
@@ -56,8 +57,8 @@ public abstract class ServerDescriptorImpl extends DescriptorImpl
   }
 
   private void parseDescriptorBytes() throws DescriptorParseException {
-    Scanner scanner = new Scanner(new String(this.rawDescriptorBytes))
-        .useDelimiter("\n");
+    Scanner scanner = new Scanner(new String(this.rawDescriptorBytes,
+        StandardCharsets.UTF_8)).useDelimiter("\n");
     String nextCrypto = "";
     List<String> cryptoLines = null;
     while (scanner.hasNext()) {

If this sounds like a reasonable plan, we should look into other places in the code where we use methods relying on the platform's default charset and explicitly specify a charset there, too.

Child Tickets

Change History (21)

comment:1 Changed 13 months ago by karsten

Status: newneeds_review

Please review the small patch and suggested process given in the ticket description.

comment:2 Changed 12 months ago by iwakeh

Status: needs_reviewneeds_information

This ticket doesn't only address becoming independent of platform charset settings (which is good), but it also suggests switching from ASCII to UTF8.
In many places metrics-lib currently reads also writes US-ASCII. As far as the (gruesome) charset topics go, UTF8 should be a superset of ASCII. Starting to read and write descriptors consistently in utf8 should work and would be good also to accommodate some of the character in names and other user configurable descriptor fields. In general, the proposed change seems fine.

Is it also intended to only let metrics-lib write utf8 files with the proposed changes?

comment:3 Changed 12 months ago by iwakeh

See also #21933.

comment:4 Changed 12 months ago by karsten

Good points!

Can you give an example (or maybe even a complete list) of places where metrics-lib writes US-ASCII files?

comment:5 Changed 12 months ago by iwakeh

Actually, on closer look the writing happens on a lower level and writes what it gets (downloads), which might explain the three charsets in your statistics above.

Should this be turned into writing utf-8 consistently?

comment:6 Changed 12 months ago by karsten

I also looked a bit more at the code. I did not find any code in metrics-lib that would convert strings to US-ASCII before writing them to disk. Regarding the three charsets, those are not produced by metrics-lib but by Tor relays and kept unchanged by CollecTor which only reads and writes byte arrays. The issue discussed here only affects reading and parsing descriptors, not writing descriptors.

I also looked at archived descriptors to see which descriptor types are affected. I only found relay server descriptors and version 1 Tor directories from 2007 and before that internally contain relay server descriptors as well. The latter are parsed in metrics-lib using the same code that also parses server descriptors. (I didn't find bridge server descriptors to be affected, but the reason is that CollecTor's bridgedescs module converts them to ASCII, which is something we should reconsider, too.) In summary, if we fix server descriptor parsing in metrics-lib, we're good.

So, you have a point that this ticket doesn't only address becoming independent of platform charset settings, but that is also suggests switching from ASCII (default charset on Debian stable) to UTF8. I think we should do this anyway. Though it's potentially surprising and also backwards-incompatible, so I believe we'll have to call the next release 2.0.0 if we want to put this in. Works for me, but what do you think?

comment:7 Changed 12 months ago by iwakeh

Milestone: metrics-lib 1.8.0

comment:8 Changed 11 months ago by karsten

Status: needs_informationneeds_review

I started hacking on this and would appreciate a quick review of at least the detailed commit message before I continue. Thanks!

comment:9 Changed 11 months ago by iwakeh

The reasoning in the commit message is fine. Should I look at the code later?

comment:10 Changed 11 months ago by karsten

Status: needs_reviewneeds_revision

Thanks, if the reasoning in the commit message sounds okay, I can continue working on the patch over the next days. (What would help is if we could close #19607 today or tomorrow, so that we don't have to rebase too much code here later on.)

comment:11 Changed 11 months ago by karsten

Status: needs_revisionneeds_review

There, I started splitting up this huge commit into several smaller ones that are easier to review. Please find my task-21932-2 branch with four commits that are ready for review and a fifth (and last) commit with the remaining changes. If you'd like to review the first four commits, I'd merge them to master and work more on the fifth commit later today or tomorrow. Thanks!

comment:12 Changed 11 months ago by karsten

Alright, I worked on a major revision of that fifth commit (while leaving the first four untouched). The commit messages says it all. Please find it in my task-21932-3 branch. I believe the fifth commit in particular requires thorough review and much more testing than existing unit tests. Thoughts? :)

comment:13 Changed 11 months ago by karsten

I just pushed one more commit to that branch to fix a bug.

comment:14 in reply to:  13 Changed 11 months ago by iwakeh

Replying to karsten:

I just pushed one more commit to that branch to fix a bug.

Good, that you found that! Should there be a test added for this in particular?

All looks fine, passes tests and checks. I added a simple test class directly for DescriptorImpl here.

Some questions (partially not related to current changes):

  1. Rename getScanner into newScanner, because a new Scanner is created with every call and this is not a 'getter'?
  2. ExitList.EOL = "\n" = DescriptorImpl.NL why not use one for all delimiters in newScanner calls? Shouldn't NL be defined in Descriptor and be used everywhere, i.e., deprecate ExitList.EOL and replace with release 2.0.0?
  3. Make NL the delimiter default that is already set for the Scanner returned by newScanner?
  4. The Scanner usage in TorperfResult is essentially getLine() in the two places. Do Torperf descriptors contain "\r"?
  5. DescriptorImpl.setDigestXXX allow empty or null argument. Should there be a check?

comment:15 Changed 11 months ago by karsten

Replying to iwakeh:

Replying to karsten:

I just pushed one more commit to that branch to fix a bug.

Good, that you found that!

Indeed. I guess I would have found it when running Onionoo and metrics-web with this metrics-lib version, which I was planning to do today. But always good to find bugs early. Let's see what else I'll find today!

Should there be a test added for this in particular?

Sure, added.

All looks fine, passes tests and checks. I added a simple test class directly for DescriptorImpl here.

Looks good!

Some questions (partially not related to current changes):

  1. Rename getScanner into newScanner, because a new Scanner is created with every call and this is not a 'getter'?

Good idea, changed.

  1. ExitList.EOL = "\n" = DescriptorImpl.NL why not use one for all delimiters in newScanner calls? Shouldn't NL be defined in Descriptor and be used everywhere, i.e., deprecate ExitList.EOL and replace with release 2.0.0?

This one is tricky. We shouldn't impose a newline character for all descriptor types. For example, Torperf measurement results use either \n or \r\n as newline character(s), depending on whether they're written by Torperf or OnionPerf. And we don't know what future descriptor types will (want to) use.

But do we need to expose this in the interface at all? I could imagine deprecating ExitList.EOL and just using DescriptorImpl.NL, but without moving DescriptorImpl.NL to Descriptor.NL. Should we do that? We could easily do this in 1.9.0.

  1. Make NL the delimiter default that is already set for the Scanner returned by newScanner?

See above. We'd have to provide an overloaded method for non-default delimiters, and the question is whether that saves so much code or complexity overall. We could do it, but we could just leave it as is.

  1. The Scanner usage in TorperfResult is essentially getLine() in the two places. Do Torperf descriptors contain "\r"?

See above. But to be honest, I focused mainly on Tor descriptors and just tweaked the other descriptors until everything compiled again. ;) We could probably make TorperfResultImpl and ExitListImpl better if we wanted. But maybe we should save that for 1.9.0?

  1. DescriptorImpl.setDigestXXX allow empty or null argument. Should there be a check?

Do you think that's necessary? These methods are only called by ourselves, and we're already making sure that we're neither passing an empty or null argument when calling that method. I don't feel strongly though.

Okay, I pushed two new commits to my task-21932-3 branch with new tests and renamed newScanner() method. I'll hunt for new bugs now. If I don't find any, should I squash and merge to master? Or do you want to add more changes from your suggestions above?

comment:16 in reply to:  15 Changed 11 months ago by karsten

Replying to karsten:

Okay, I pushed two new commits to my task-21932-3 branch with new tests and renamed newScanner() method. I'll hunt for new bugs now. If I don't find any, should I squash and merge to master? Or do you want to add more changes from your suggestions above?

Update: I didn't run into any issues, but I'll do another round of tests as soon as we have a pre-release tarball.

Ready to merge, or needs revision?

comment:17 Changed 11 months ago by iwakeh

Status: needs_reviewmerge_ready

The current commits look fine. Tests and checks pass.

As the the open questions above existed for a while their discussion can be transferred to the API overhaul ticket #19640 (in 1.9.0). I'll add a comment there in a sec.

All else is fine to be released in 1.8.0.

comment:18 Changed 11 months ago by karsten

Milestone: metrics-lib 1.8.0metrics-lib 2.0.0
Status: merge_readyassigned

Great, thanks for looking! Rebased, squashed, and pushed to master. Changing status to assigned and milestone to 2.0.0, because we still rely on the platform's default charset. Still good to have the preparatory changes merged now and released soon.

comment:19 Changed 10 months ago by karsten

Status: assignedneeds_review

Please review my branch task-21932-4.

comment:20 Changed 10 months ago by iwakeh

Status: needs_reviewmerge_ready

Changes look as planned; tests and checks pass.
Ready for 2.0.0

comment:21 Changed 10 months ago by karsten

Resolution: fixed
Status: merge_readyclosed

Perfect, thanks for checking! Cherry-picked and pushed to master. Closing.

Note: See TracTickets for help on using tickets.