Opened 4 years ago

Closed 4 years ago

#14859 closed defect (fixed)

Adapt Onionoo's parsing code for MaxMind's GeoLite2 City files to their new format

Reported by: karsten
Owned by:
Priority: Medium
Milestone:
Component: Metrics/Onionoo
Version:
Severity:
Keywords:
Cc: iwakeh
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

MaxMind has changed their file format for GeoLite2 City database files. They split up blocks into IPv4 and IPv6 addresses, and they now support different locales. We need to update our parsing code to read their new formats. At the same time we should drop support for the old format in order to remove otherwise dead code.

I'll post a branch in a minute. Please review. I'll hold back on merging that branch into master until all mirrors have upgraded to the new database. It's just a matter of fetching the latest database and putting GeoLite2-City-Blocks-IPv4.csv and GeoLite2-City-Locations-en.csv into the geoip/ directory. iwakeh, please let me know when that's the case. It's not that mirrors would break, but they would stop resolving IP addresses to locations and ASes.
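
For illustration, here's a minimal sketch of what joining the two new files on geoname_id could look like. The file names are the ones mentioned above; the UTF-8 decoding, the column positions, and the naive comma splitting (which ignores quoted fields) are assumptions for this sketch, not taken from the branch.

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class GeoLite2JoinSketch {

  public static void main(String[] args) throws IOException {
    /* Map geoname_id to the full locations line; the first column of the
     * locations file is assumed to be geoname_id. */
    Map<String, String> locationLines = new HashMap<String, String>();
    try (BufferedReader locations = new BufferedReader(new InputStreamReader(
        new FileInputStream("geoip/GeoLite2-City-Locations-en.csv"),
        StandardCharsets.UTF_8))) {
      String line = locations.readLine(); /* skip header */
      while ((line = locations.readLine()) != null) {
        String geonameId = line.split(",", 2)[0];
        locationLines.put(geonameId, line);
      }
    }
    /* The blocks file is assumed to start with network and geoname_id. */
    try (BufferedReader blocks = new BufferedReader(new InputStreamReader(
        new FileInputStream("geoip/GeoLite2-City-Blocks-IPv4.csv"),
        StandardCharsets.UTF_8))) {
      String line = blocks.readLine(); /* skip header */
      while ((line = blocks.readLine()) != null) {
        String[] parts = line.split(",", 3);
        System.out.println(parts[0] + " -> " + locationLines.get(parts[1]));
      }
    }
  }
}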

Child Tickets

Attachments (1)

0001-Do-not-proceed-with-lookup-if-an-encoding-error-is-r.patch (4.7 KB) - added by iwakeh 4 years ago.
patch suggestion


Change History (15)

comment:2 Changed 4 years ago by iwakeh

This looks OK; I didn't go into the database details, but if the tests pass, things should be fine.

Maybe, while you're working on it anyway, include more ISO-8859-1 characters in the test cases?
(Like Kırşehir, Tekirdağ, İnegöl, Çankırı, ...)

comment:3 Changed 4 years ago by karsten

Oh wow, looks like ISO-8859-1 was not the right charset at all. The special characters you mention are all in ISO-8859-9, not in ISO-8859-1. Oops! Turns out MaxMind uses UTF-8 for their files.

Please take a look at the latest commit I pushed to the branch mentioned above. It changes character encoding and adds test cases for the Turkish cities you mentioned.
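
Roughly, such a test boils down to writing a small UTF-8 CSV snippet and checking that the Turkish characters survive the read path. The sketch below only shows the general shape; the class name and the inline CSV line (including the geoname_id) are made up, and the real tests are in the branch.

import static org.junit.Assert.assertTrue;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

import org.junit.Test;

public class Utf8CityNameTest {

  @Test
  public void testTurkishCityNameSurvivesUtf8Read() throws Exception {
    File csvFile = File.createTempFile("GeoLite2-City-Locations-en", ".csv");
    csvFile.deleteOnExit();
    /* Made-up geoname_id and truncated columns; only the encoding matters
     * for this sketch. */
    Writer writer = new OutputStreamWriter(
        new FileOutputStream(csvFile), StandardCharsets.UTF_8);
    writer.write("1234567,en,EU,Europe,TR,Turkey,Kırşehir\n");
    writer.close();
    BufferedReader reader = new BufferedReader(new InputStreamReader(
        new FileInputStream(csvFile), StandardCharsets.UTF_8.newDecoder()));
    assertTrue(reader.readLine().contains("Kırşehir"));
    reader.close();
  }
}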

Anything else we should test? If not, are you ready for these commits to be merged into master?

comment:4 Changed 4 years ago by iwakeh

Oh, I didn't verify ISO-8859-x. Good catch.

No other test cases, unless you have the time to cover other European accents and special characters (á æ ö ø é ...).
Or, instead, verify the UTF-8 encoding of the MaxMind files explicitly, so that there'll be a failing test if MaxMind changes the file format?

This branch should be fine now, even without further tests.

comment:5 Changed 4 years ago by karsten

Status: needs_review → new

Agreed about adding more tests for other special characters later.

I'm not sure how to verify that the MaxMind files are actually UTF-8 encoded. That's different from writing our own unit tests where we provide our own data. This verification code would have to run in the production code and somehow detect the character encoding when parsing MaxMind's files. Any idea how to do that?

I just merged the branch to master. I'll leave this ticket open until we have resolved the minor remaining issues above.

comment:6 Changed 4 years ago by iwakeh

Of course, you're right, this is not a test, but a run-time verification.

Using InputStreamReader(InputStream in, CharsetDecoder dec) for reading the files will throw a CharacterCodingException on invalid input, e.g.:

BufferedReader br = new BufferedReader(new InputStreamReader(
    new FileInputStream(this.geoLite2CityBlocksIPv4CsvFile),
    StandardCharsets.UTF_8.newDecoder()));

The mentioned classes are from java.nio.charset.

A CharacterCodingException could exit the run; trigger a warning and fall back to the current way of reading the file, which could lead to mangled output; or simply cause us not to use the wrongly encoded file, like in the missing-geoip case.

The latter seems best to me.

comment:7 Changed 4 years ago by karsten

Hmm, looks like the current code already contains such a run-time verification:

    try {
      // ...
      BufferedReader br = new BufferedReader(new InputStreamReader(
          new FileInputStream(this.geoLite2CityBlocksIPv4CsvFile),
          "UTF-8"));
      // ...
    } catch (IOException e) {
      log.error("I/O exception while reading "
          + this.geoLite2CityBlocksIPv4CsvFile.getAbsolutePath() + ".");
      return lookupResults;
    }

The differences from your code are that the encoding is specified as a String and that CharacterCodingException is not caught directly but via its superclass IOException. But I think the effect should be the same, no?

comment:8 Changed 4 years ago by iwakeh

Actually, the implementations of the two InputStreamReader constructors differ:

  • the current code internally uses a decoder with CodingErrorAction.REPLACE, i.e. it substitutes erroneous input with a replacement character (see java.io.InputStreamReader and sun.nio.cs.StreamDecoder);
  • whereas the constructor with the decoder parameter throws a java.nio.charset.MalformedInputException (a subclass of CharacterCodingException).

I couldn't come up with an invalid UTF-8 file, so the code below demonstrates the difference using US-ASCII.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class CharsetDecTest {

  public static void main(String[] args) throws Exception {
    String geo = "GeoLite2-City-Locations-en.csv";
    try {
      /* Constructor taking a charset name: silently replaces bad input. */
      BufferedReader br1 = new BufferedReader(new InputStreamReader(
          new FileInputStream(new File(geo)), "US-ASCII"));
      while (br1.readLine() != null) { }
    } catch (Throwable t) {
      System.out.println("e1: " + t);
    }
    try {
      /* Constructor taking a CharsetDecoder: throws MalformedInputException. */
      BufferedReader br2 = new BufferedReader(new InputStreamReader(
          new FileInputStream(new File(geo)),
          StandardCharsets.US_ASCII.newDecoder()));
      while (br2.readLine() != null) { }
    } catch (Throwable t) {
      System.out.println("e2: " + t);
    }
  }
}

comment:9 Changed 4 years ago by karsten

Interesting! I can reproduce this problem by reading a UTF-16 document and pretending it's UTF-8:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

public class CharsetDecTest {

  public static void main(String[] args) throws Exception {
    String geo = "GeoLite2-City-Locations-en.csv";
    try {
      BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(
          new FileOutputStream(new File(geo)), "UTF-16"));
      bw.write("Hello world!\n");
      bw.close();
    } catch (Throwable t) {
      System.out.println("e0: " + t);
    }
    try {
      BufferedReader br1 = new BufferedReader(new InputStreamReader(
          new FileInputStream(new File(geo)), "UTF-8"));
      String line;
      while ((line = br1.readLine()) != null) {
        System.out.println(line);
      }
    } catch (Throwable t) {
      System.out.println("e1: " + t);
    }
    try {
      BufferedReader br2 = new BufferedReader(new InputStreamReader(
          new FileInputStream(new File(geo)),
          StandardCharsets.UTF_8.newDecoder()));
      String line;
      while ((line = br2.readLine()) != null) {
        System.out.println(line);
      }
    } catch (Throwable t) {
      System.out.println("e2: " + t);
    }
  }
}

Output:

??

comment:10 Changed 4 years ago by karsten

Heh, Trac cut off the part of my comment where I said that it messes up the output.

Anyway, the more important part that got cut off: Want to submit a patch for this, or should I write one?

comment:11 Changed 4 years ago by iwakeh

Find a quick patch attached. Please make sure the error handling makes sense.

I basically added the exception to the error log message, so it would be easy to determine what exactly caused the lookup failure.

In addition, I set the CodingErrorAction explicitly, in order to avoid relying on the current JDK implementation.
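
In case it helps review, this is roughly what setting the error action explicitly amounts to. A sketch only, with a made-up class and method name; the attached patch is authoritative.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.InputStreamReader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class ReportingReaderSketch {

  /* Returns a reader that throws MalformedInputException or
   * UnmappableCharacterException on bad bytes instead of silently
   * replacing them, independent of the JDK's default error action. */
  static BufferedReader openUtf8Reader(File csvFile)
      throws FileNotFoundException {
    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
    return new BufferedReader(new InputStreamReader(
        new FileInputStream(csvFile), decoder));
  }
}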

Changed 4 years ago by iwakeh

patch suggestion

comment:12 Changed 4 years ago by karsten

Status: new → needs_review

Oh wow, it doesn't stop, does it? Turns out that ASN database files are not UTF-8 encoded but ISO-8859-1 encoded. I wrote a patch and unit tests for that and applied your patch with some tweaks. Please review my branch task-14859-2. I'm running this branch locally now to give it some testing.
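
For reference, the encoding change boils down to something like the sketch below; the ASN file name is an assumption (the ticket only says the databases live in the geoip/ directory), and the actual code is in task-14859-2.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class AsnCharsetSketch {

  public static void main(String[] args) throws IOException {
    /* Hypothetical file name standing in for the ASN database file. */
    File asnDatabaseFile = new File("geoip", "GeoIPASNum2.csv");
    /* Read the ASN database as ISO-8859-1 instead of UTF-8. */
    BufferedReader reader = new BufferedReader(new InputStreamReader(
        new FileInputStream(asnDatabaseFile),
        StandardCharsets.ISO_8859_1.newDecoder()));
    String line;
    while ((line = reader.readLine()) != null) {
      System.out.println(line);
    }
    reader.close();
  }
}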

comment:13 Changed 4 years ago by karsten

Nope, it doesn't stop. Here's another related issue: even if AS names are correctly parsed from the database file, any contained non-ASCII characters are written to the details documents file as \\u00F2, which is okay. However, we don't undo that double escaping, and we return \\u00F2 to HTTP clients, which is six valid characters but not the single escaped UTF-8 character we're supposed to return.

This bug is not limited to AS names, but city names and other fields have the same problem.

Note that including the fields parameter fixes this problem, because it leads to rewriting the document and correctly replacing \\u with \u. Compare these two queries:

https://onionoo.torproject.org/details?search=C13E70D4

..."city_name":"S\\u00E3o Paulo",...,"as_name":"Servi\\uFFFDos de Comunica\\uFFFD\\uFFFDo S.A."...

https://onionoo.torproject.org/details?search=C13E70D4&fields=city_name,as_name

..."city_name":"S\u00E3o Paulo","as_name":"Servi\uFFFDos de Comunica\uFFFD\uFFFDo S.A."}...

Pushed a fix as another commit to my task-14859-2 branch. Please review.

comment:14 Changed 4 years ago by karsten

Resolution: fixed
Status: needs_review → closed

I'm taking back my last comment above. The bug described there exists, but I'd like to fix that bug differently. However, the bug is more related to #13267, and I'd like to close this ticket now and discuss the issue over there.

I went ahead and merged the following two commits into master, because there's not really much to review:

commit 9398d6999022b4d03a2305b8dbaf66bfd52801be
Author: iwakeh <iwakeh@users.ourproject.org>
Date:   Thu Jan 22 21:00:00 2015 +0000

    Do not proceed with lookup if an encoding error is reported and log file encoding errors.

commit 2d8eb09d1e37af4efd1776acf73ff94b98ca8bbd
Author: Karsten Loesing <karsten.loesing@gmx.net>
Date:   Tue Feb 24 10:50:54 2015 +0100

    Fix character encoding of ASN database file.
    
    Looks like MaxMind's ASN database files are ISO-8859-1 encoded, not UTF-8
    encoded...

Resolving. Thanks!
