Opened 4 years ago

Last modified 7 months ago

#21933 needs_review defect

Fix deserialization of UTF-8 characters in details statuses and documents

Reported by: karsten Owned by: metrics-team
Priority: Medium Milestone:
Component: Metrics/Onionoo Version:
Severity: Normal Keywords: metrics-2018
Cc: metrics-team Actual Points:
Parent ID: Points:
Reviewer: Sponsor:


While looking into the encoding issue of different Onionoo instances producing different contact string encodings (#15813), I tracked down a somewhat related issue with UTF-8 characters all being converted to ?.

The issue is related to how we're avoiding to store UTF-8 characters in details statuses and details documents and instead escaping those characters. We're doing this correctly for the serialization part but incorrectly for the deserialization part.

We have two choices here. We could either give up on the escaping part and just store UTF-8 characters directly. Or we could fix the deserialization part. I have a fix for the latter and ran out of time for the former, but maybe the former would be the better fix. I'm including my fix here anyway:

diff --git a/src/main/java/org/torproject/onionoo/docs/ b/src/main/java/org/torproject/onionoo/docs/
index 39d6271..b6b1c4c 100644
--- a/src/main/java/org/torproject/onionoo/docs/
+++ b/src/main/java/org/torproject/onionoo/docs/
@@ -496,22 +496,25 @@ public class DocumentStore {
     if (!parse) {
       return this.retrieveUnparsedDocumentFile(documentType,
-    } else if (documentType.equals(DetailsDocument.class)
-        || documentType.equals(BandwidthDocument.class)
+    } else if (documentType.equals(BandwidthDocument.class)
         || documentType.equals(WeightsDocument.class)
         || documentType.equals(ClientsDocument.class)
         || documentType.equals(UptimeDocument.class)) {
       return this.retrieveParsedDocumentFile(documentType,
+    } else if (documentType.equals(DetailsStatus.class)
+        || documentType.equals(DetailsDocument.class)) {
+      if (documentType.equals(DetailsStatus.class)) {
+        documentString = "{" + documentString + "}";
+      }
+      documentString = StringUtils.replace(documentString, "\\u", "\\\\u");
+      return this.retrieveParsedDocumentFile(documentType, documentString);
     } else if (documentType.equals(BandwidthStatus.class)
         || documentType.equals(WeightsStatus.class)
         || documentType.equals(ClientsStatus.class)
         || documentType.equals(UptimeStatus.class)
         || documentType.equals(UpdateStatus.class)) {
       return this.retrieveParsedStatusFile(documentType, documentString);
-    } else if (documentType.equals(DetailsStatus.class)) {
-      return this.retrieveParsedDocumentFile(documentType, "{"
-          + documentString + "}");
     } else {
       log.error("Parsing is not supported for type "
           + documentType.getName() + ".");

The main difference here is the StringUtils.replace() part. Without this line (so, current master) we would pass a string containing \uxxxx to Gson.fromJson() which would unescape it and turn it into the corresponding UTF-8 character. So far so good. But when we would later write this status or document back to disk, DocumentStore#writeToFile will write these bytes to disk as US-ASCII, and that will replace all UTF-8 characters with ?.

The patch fixes this by replacing \uxxxx in the file content with \\uxxxx which Gson.fromJson() will not consider an escaped UTF-8 character. We do have code in place that reverses this double-escaping, see DetailsStatus#unescapeJson. So, the patch fixes the problem by keeping things escaped until they are used.

Again, I think the cleaner fix would be to give up on escaping UTF-8 characters and just switch to UTF-8. The part that might make this a little harder is that we'll have to make sure that this works correctly for existing files. And if it does not, we'll need to special-case those somehow. But maybe the patch above helps to come up with this cleaner fix.

Child Tickets

Change History (5)

comment:1 Changed 3 years ago by karsten

Keywords: metrics-2018 added

comment:2 Changed 3 years ago by karsten

Keywords: metrics-2017 added; metrics-2018 removed

comment:3 Changed 3 years ago by iwakeh

Keywords: metrics-2018 added; metrics-2017 removed

Will be completed in 2018.

comment:4 Changed 2 years ago by teor

In, Tor relays make sure that their ContactInfo lines are UTF-8 (see #27428).
We expect to merge the corresponding authority checks in 0.3.6, see #24033 and children.

comment:5 Changed 7 months ago by karsten

Cc: metrics-team added
Status: newneeds_review

I spent some more time on this issue and implemented the cleaner fix mentioned above that gives up on escaping UTF-8 characters and just switches to UTF-8. This reduces complexity quite a bit and should be easier for us to maintain in the future. Please review commit 9d4e79a in my task-21933 branch.

Note: See TracTickets for help on using tickets.