Opened 7 years ago

Closed 2 years ago

#6450 closed defect (wontfix)

Compass' command-line script can't encode unicode characters

Reported by: karsten
Owned by: metrics-team
Priority: Low
Milestone:
Component: Metrics/Compass
Version:
Severity:
Keywords:
Cc: atagar, delber@…
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

Today I found that tail and less are unhappy about the task #6329 script printing out unicode characters. When piping its output into tail or less, the script exits with a traceback. When writing to stdout directly, Python is happy.

Here's how to reproduce the problem:

  • Clone the metrics-tasks repository.
  • Navigate to the #6329 script and make it download required data: cd task-6329/; ./tor-relays-stats.py -d
  • Find a unicode character in an AS name: grep -B1 "as_name.*\\\\u" details.json
  • Display relays in that AS, e.g. AS28548: ./tor-relays-stats.py -i -a 28548 | tail

Python then prints the following traceback:

Traceback (most recent call last):
  File "./tor-relays-stats.py", line 197, in <module>
    short=70 if options.short else None)
  File "./tor-relays-stats.py", line 110, in print_groups
    print formatted_group[:short]
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 144: ordinal not in range(128)
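
The same error can be reproduced without the script. A minimal sketch, assuming (as Python 2 does) that print falls back to the ASCII codec when stdout is a pipe rather than a terminal. The AS name and the helper `encode_for_pipe` are made up for illustration:

```python
# Hypothetical AS name containing U+00F3, the character from the traceback.
as_name = u"Telef\xf3nica"


def encode_for_pipe(text, encoding="ascii"):
    """Encode text the way a Python 2 print to a pipe would (assumed: ASCII)."""
    return text.encode(encoding)


try:
    encode_for_pipe(as_name)
    error_message = None
except UnicodeEncodeError as error:
    # Same exception type as in the traceback above.
    error_message = str(error)

print(error_message)
```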

I found that a possible solution is to replace all Unicode characters with '?'s, but that doesn't seem very elegant:

-                              exit, guard, country, as_number, as_name)
+                              exit, guard, country, as_number, as_name.encode('ascii', 'replace'))
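
To make the trade-off concrete, here is what the 'replace' handler produces next to two alternatives. This is only a sketch: the AS name is a made-up example, and whether UTF-8 output is acceptable depends on the terminal or tool reading it.

```python
as_name = u"Telef\xf3nica"  # hypothetical AS name, not taken from details.json

# Lossy: every non-ASCII character becomes '?'.
replaced = as_name.encode("ascii", "replace")
# ASCII-only output, but the code point stays visible as an escape.
escaped = as_name.encode("ascii", "backslashreplace")
# Keeps the character, encoded as two bytes.
utf8 = as_name.encode("utf-8")

for label, data in (("replace", replaced),
                    ("backslashreplace", escaped),
                    ("utf-8", utf8)):
    print("%s: %r" % (label, data))
```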

Are there better solutions?

Change History (7)

comment:1 Changed 7 years ago by delber

Cc: delber@… added

If you look at every '\u' present in details.json, you notice that most of them are placeholders for badly parsed characters. I think the problem probably lies within Onionoo (or even Tor itself). Are strings supposed to be UTF-8 encoded? What should Onionoo do when the strings are not valid UTF-8?

comment:2 Changed 7 years ago by atagar

Hi Karsten. Unicode is a bit of a pita in python 2.x. Options are...

  • Make sure to use unicode strings rather than regular (ascii) strings. Unicode strings start with a 'u', for instance...
>>> u"こんにちは"
u'\u3053\u3093\u306b\u3061\u306f'
  • Switch to python 3.x. One of the big changes in the upgrade was that strings are now unicode (like java).

Hope that helps. -Damian

comment:3 in reply to:  1 Changed 7 years ago by karsten

Replying to delber:

If you look at every '\u' present in details.json, you notice that most of them are placeholders for badly parsed characters. I think the problem probably lies within Onionoo (or even Tor itself). Are strings supposed to be UTF-8 encoded? What should Onionoo do when the strings are not valid UTF-8?

Onionoo adds these characters after looking them up in Maxmind's GeoIP or ASN database. Onionoo's response is supposed to be UTF-8 encoded. But when you request something from Onionoo, you receive a Unicode string? I wonder if that's the problem here.
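
A small sketch of what the script likely sees after parsing: Python's json module turns a \u escape into a Unicode string regardless of how the document was encoded on the wire, so the text stays Unicode until it is printed. The document below is hand-written; the as_name field matches the grep in the description, but the value is invented:

```python
import json

# Hand-written stand-in for one entry of details.json; the value is invented.
raw = b'{"as_name": "Telef\\u00f3nica"}'

entry = json.loads(raw.decode("utf-8"))
as_name = entry["as_name"]

# The six-character escape has become a single Unicode character (U+00F3).
print("%d %r" % (len(as_name), as_name))
```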

comment:4 in reply to:  2 Changed 7 years ago by karsten

Replying to atagar:

Hi Karsten. Unicode is a bit of a pita in python 2.x. Options are...

  • Make sure to use unicode strings rather than regular (ascii) strings. Unicode strings start with a 'u', for instance...

See above. The Python script should be able to handle a UTF-8 encoded document that it downloads from Onionoo.

  • Switch to python 3.x. One of the big changes in the upgrade was that strings are now unicode (like java).

I'd rather want to stick with the Python version supported in Debian stable. Maybe we can work around this encoding problem somehow.
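
One possible workaround that stays on Python 2 is to wrap the output stream in a UTF-8 writer, so the encoding no longer depends on whether stdout is a terminal. A sketch, using an in-memory byte stream as a stand-in for the piped stdout; `utf8_writer` is a hypothetical helper, not part of the script:

```python
import codecs
import io


def utf8_writer(byte_stream):
    """Return a writer that encodes Unicode text as UTF-8 onto byte_stream."""
    return codecs.getwriter("utf-8")(byte_stream)


# Stand-in for a piped stdout; the AS name is a made-up example.
pipe = io.BytesIO()
out = utf8_writer(pipe)
out.write(u"AS28548 Telef\xf3nica\n")

written = pipe.getvalue()
print(repr(written))
```

In the script itself this would presumably become `sys.stdout = codecs.getwriter('utf-8')(sys.stdout)` near the top (Python 2 only). Setting `PYTHONIOENCODING=utf-8` in the environment for the invocation is another option that needs no code change.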

comment:5 Changed 7 years ago by karsten

Component: Metrics Utilities → Compass
Owner: set to gsathya
Summary: Task #6329 Python script can't encode unicode characters → Compass' command-line script can't encode unicode characters

The #6329 script is now called Compass and has its own Component, but the problem persists.

comment:6 Changed 2 years ago by karsten

Owner: changed from gsathya to metrics-team
Status: new → assigned

I believe that gsathya stopped working on these tickets quite a while ago. Reassigning to the friendly metrics-team user. (gsathya, thanks for having worked on all these tickets back when you did!)

comment:7 Changed 2 years ago by irl

Resolution: wontfix
Status: assigned → closed

The metrics team has put a goal of shutting down Compass in its roadmap and merging functionality with Relay Search (previously known as Atlas). This ticket is specific to Compass and as development on Compass has ceased, I am marking this ticket as wontfix.

See #23517 for information on the planned work to integrate Compass functionality into Relay Search.
