Opened 6 years ago

Closed 6 years ago

#5047 closed enhancement (wontfix)

Implement basic usage statistics in obfsproxy

Reported by: karsten
Owned by: karsten
Priority: Medium
Milestone:
Component: Obfuscation/Obfsproxy
Version:
Severity:
Keywords:
Cc: asn, atagar
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

We should implement some basic usage statistics in obfsproxy to learn about usage as long as Tor doesn't have support for obfsproxy statistics (#5040). Once Tor supports these statistics, the implementation in obfsproxy can be removed. Both Tor's and obfsproxy's statistics should be equivalent or at least easily comparable.

The idea is to have obfsproxy log incoming connections in a privacy-aware way and provide a simple script to convert these logs into a format that can be published without issues. Bridge operators can periodically run the script and send the output to the Tor developers who publish and analyze them. The implementation in obfsproxy should be quite simple in order not to break too much stuff. The conversion script should be dead simple, so that bridge operators can understand what's going on.

Here's a possible approach:

We want to count daily connections by country and daily unique IP addresses by country. Similar to other statistics in Tor, we want to aggregate data over 24-hour periods, resolve IP addresses to country codes, and round up frequencies to multiples of 8.

  1. When obfsproxy starts, it does three things: a) generate a secret string S that it only keeps in memory; b) note the timestamp TS when it started; c) create a buffer B with a capacity of 100 log messages.
  2. Whenever obfsproxy receives a client connection, it runs steps 3 to 5.
  3. It checks whether at least 24 hours have passed since TS. If so, it flushes all log messages from buffer B, shuffles them, and appends them to a file on disk. It also increments TS in 24-hour steps until TS is not more than 24 hours in the past.
  4. It checks whether B is full, i.e., contains 100 messages. If so, it flushes B and appends messages to a file on disk in random order.
  5. It creates a new log message containing a) timestamp TS (which is NOT the current timestamp!), b) the country code of the connecting IP as resolved by a GeoIP database, c) the hashed IP address using secret S, i.e., H(IP || S) with a cryptographic hash function of the implementor's choice. An example log message would be "2012-02-07 14:01:04 de 1234567890123456789012345678901234567890".
  6. When obfsproxy stops, it does NOT flush the contents of B to disk. It forgets about S, possibly in a cryptographically secure manner.
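
The steps above could look roughly like the following Python sketch. None of the names come from obfsproxy: `BufferedConnectionLog`, `resolve_country`, and the injectable `now` clock are assumptions made for illustration, and SHA-1 merely stands in for "a cryptographic hash function of the implementor's choice".

```python
import hashlib
import os
import random
import time

DAY = 24 * 60 * 60
BUFFER_CAPACITY = 100


class BufferedConnectionLog:
    """Hypothetical helper; names are illustrative, not from obfsproxy."""

    def __init__(self, path, resolve_country, now=time.time):
        self.path = path
        self.resolve_country = resolve_country  # callable: ip -> country code
        self.now = now
        self.secret = os.urandom(32)   # S: kept in memory only
        self.period_start = now()      # TS
        self.buffer = []               # B

    def _flush(self):
        random.shuffle(self.buffer)    # remove the original connection order
        with open(self.path, "a") as log_file:
            log_file.writelines(line + "\n" for line in self.buffer)
        self.buffer = []

    def record(self, ip):
        # Step 3: period over? Flush, then advance TS in 24-hour steps.
        if self.now() - self.period_start >= DAY:
            self._flush()
            while self.now() - self.period_start >= DAY:
                self.period_start += DAY
        # Step 4: buffer full? Flush before adding the new message.
        if len(self.buffer) >= BUFFER_CAPACITY:
            self._flush()
        # Step 5: log TS (not the current time), country code, and H(IP || S).
        stamp = time.strftime("%Y-%m-%d %H:%M:%S",
                              time.gmtime(self.period_start))
        digest = hashlib.sha1(ip.encode("ascii") + self.secret).hexdigest()
        self.buffer.append("%s %s %s" % (stamp, self.resolve_country(ip), digest))
```

Nothing is flushed on shutdown in this sketch: the buffer and the secret simply disappear with the process, which corresponds to step 6. Securely wiping S is not attempted here.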

The buffer has two functions here. First, it removes the original order of connections, which may still be meaningful if it contains connections from countries with few connections. Second, the buffer protects the timing of single client connections that occur when obfsproxy is terminated and restarted shortly after a 24-hour interval ends. The buffer size of 100 was arbitrarily chosen to avoid memory problems on heavily used bridges. Higher numbers are preferred, but if that makes things more complicated, 100 should be a large enough number.

The log messages still reveal too much information to be published. They shouldn't contain IP hashes, and frequencies still need to be rounded up to the next multiple of 8. The following bash script, which probably requires a lot more comments, converts a log message file into a format that can be published by bridge operators.

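#!/bin/bash
# Input: file "data" with lines of the form "<date> <time> <cc> <hashed ip>".
echo "Daily rounded total requests by country"
# Drop the IP hash, count identical lines, round counts up to multiples of 8.
cut -d" " -f1-3 data | sort | uniq -c | \
awk '{printf "%s %s %s %d\n", $2, $3, $4, 8*(int(($1+7)/8))}'
echo "Daily rounded unique IPs by country"
# Deduplicate on the IP hash first, then drop it and count as above.
sort data | uniq | cut -d" " -f1-3 | uniq -c | \
awk '{printf "%s %s %s %d\n", $2, $3, $4, 8*(int(($1+7)/8))}'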
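
For operators who find Python easier to follow than a shell pipeline, the same aggregation can be sketched as below. This is only a rough equivalent, assuming the four-field line format of the example log message; `aggregate` and `round_up_to_8` are illustrative names.

```python
from collections import Counter


def round_up_to_8(count):
    # Round up to the next multiple of 8, as the awk expression does.
    return 8 * ((count + 7) // 8)


def aggregate(lines):
    """Return (requests, unique_ips): dicts keyed by (date, time, country)."""
    requests = Counter()
    distinct = set()
    for line in lines:
        date, clock, country, ip_hash = line.split()
        requests[(date, clock, country)] += 1
        distinct.add((date, clock, country, ip_hash))
    # Each distinct (period, country, hash) combination counts as one IP.
    unique_ips = Counter(entry[:3] for entry in distinct)
    return ({key: round_up_to_8(n) for key, n in requests.items()},
            {key: round_up_to_8(n) for key, n in unique_ips.items()})
```

Both returned dictionaries contain only rounded counts, so the IP hashes never reach the output.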

Note that the approach taken here was designed to keep the changes to obfsproxy small. Of course, we could implement everything in obfsproxy and write nice files that bridge operators can mail to the Tor devs directly. That would be an implementation similar to what Tor does for the various statistics. The buffered logging approach seemed to be a good compromise between not logging sensitive data and not adding too much code. Whether that is true is a question for the obfsproxy developers.

Attachments (2)

log2stats.py (5.0 KB) - added by karsten 6 years ago.
Python script to convert obfsproxy log into aggregated stats
log2stats.2.py (5.4 KB) - added by karsten 6 years ago.
Python script to convert obfsproxy log into aggregated stats, second attempt


Change History (14)

Changed 6 years ago by karsten

Attachment: log2stats.py added

Python script to convert obfsproxy log into aggregated stats

comment:1 Changed 6 years ago by karsten

It seems we don't have time to implement the approach described above. Here's a simpler approach: obfsproxy bridge operators keep info-level logs of their obfsproxy (as soon as #5069 is implemented) and run the attached Python script on them. The output is the same information that we'd put in extra-info descriptors. For example:

fingerprint 1234567890123456789012345678901234567890
obfsproxy-stats-end (86400 s) 2012-02-10 09:22:04
obfsproxy-ips DE=8,DK=8,EU=8,JP=8
obfsproxy-reqs DE=8,DK=16,EU=8,JP=8
fingerprint 1234567890123456789012345678901234567890
obfsproxy-stats-end (86400 s) 2012-02-11 09:22:04
obfsproxy-ips EU=8
obfsproxy-reqs EU=8
fingerprint 1234567890123456789012345678901234567890
obfsproxy-stats-end (86400 s) 2012-02-12 09:22:04
obfsproxy-ips EU=8
obfsproxy-reqs EU=8
fingerprint 1234567890123456789012345678901234567890
obfsproxy-stats-end (86400 s) 2012-02-13 11:22:04
obfsproxy-ips ??=8,EU=8
obfsproxy-reqs ??=8,EU=8

The Python script is still very rough around the edges. I'd love some help with that from someone who actually can code Python.

comment:2 Changed 6 years ago by atagar

def writestats(statsstart, ips, geoipfile, fingerprint):

Personally I found this very difficult to read, and I suspect most of the work and temporary variables are unnecessary. Here's a completely untested function that, I think, will do the same thing...

import re

def writestats(statsstart, ips, geoip_path, fingerprint):
  if not ips: return # nothing to write

  # matches against lines from the maxmind geoip db, such as...
  # 18939904,19005439,JP

  geoip_line = re.compile(r'^(\d+),(\d+),([A-Z0-9]{2})$')

  # mapping of country codes to their ips
  locale_to_ip = {}

  # Ips in the geoip file are sorted, so sort our listing too so we only
  # ever need to compare against the first element. Ips are expected to be
  # integers, like the range bounds in the file.
  ips.sort()

  with open(geoip_path) as geoip_file:
    for line in geoip_file:
      if not ips: break # everything is resolved

      m = geoip_line.match(line.strip())

      # skip line if it's not an entry (probably a blank line or comment)
      if not m: continue

      ip_start, ip_end, locale = int(m.group(1)), int(m.group(2)), m.group(3)

      # Ips that come before the current entry aren't in the file. Pop them
      # off as unknown until that's no longer the case.
      while ips and ips[0] < ip_start:
        locale_to_ip.setdefault("??", []).append(ips.pop(0))

      # Ips that fall within the current entry belong to its locale.
      while ips and ips[0] <= ip_end:
        locale_to_ip.setdefault(locale, []).append(ips.pop(0))

  # any remaining ips after processing the file are unknown
  while ips:
    locale_to_ip.setdefault("??", []).append(ips.pop(0))

  # now move on with printing stuff, probably something like...
  for locale in locale_to_ip:
    connection_count = len(locale_to_ip[locale])
    unique_connection_count = len(set(locale_to_ip[locale]))

Besides that, it looks good. Just some nitpicks...

starting = re.compile('([\d\-:TZ]* )?\[info\] Starting.$')
exiting = re.compile('([\d\-:TZ]* )?\[info\] Exiting.$')
connect = re.compile('([\d\-:TZ]* )?\[[a-z]*\] ([\d\.]+):[\d]+ ' +
                     '\([a-z0-9]+\): trying to connect to [\d\.:]+$')

Move the starting/ending/connect regexes into header constants? When regexes get substantial like this it's also nice to have a comment with examples for what they match against (greatly helps readability).
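
Something along these lines, as a sketch only: the constant names are made up, the example lines in the comments are inferred from the patterns themselves, and the address 192.0.2.1 is purely illustrative.

```python
import re

# Optional timestamp prefix, e.g. "2012-02-10T22:37:00Z " (assumed format).
TIMESTAMP = r'(?:[\d\-:TZ]+ )?'

# Matches: "[info] Starting."
STARTING = re.compile(TIMESTAMP + r'\[info\] Starting\.$')

# Matches: "[info] Exiting."
EXITING = re.compile(TIMESTAMP + r'\[info\] Exiting\.$')

# Matches: "[info] 192.0.2.1:1234 (obfs2): trying to connect to 127.0.0.1:5001"
# Group 1 is the client IP address.
CONNECT = re.compile(TIMESTAMP + r'\[[a-z]*\] ([\d.]+):\d+ '
                     r'\([a-z0-9]+\): trying to connect to [\d.:]+$')
```

Using raw strings also avoids relying on Python passing unknown escapes like `\d` through unchanged.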

for idx, ree in enumerate(re_list):

Neat, the enumerate builtin is new to me. In writestats you use it more than you need to (since you never use the 'idx'), and personally I'd do away with re_list here so it just uses a tuple inline. But it doesn't matter.

for line in open(obfsproxylog, 'r'):

You actually probably want...

with open(obfsproxylog) as obfsproxylog_file:
  for line in obfsproxylog_file:
    ...

This will close the file when you're done. Also, 'r' is the default mode for the open method.

if statsstart > 0:

If you initialize statsstart to sys.maxint then this isn't needed.

del ips[:]

Why not just do "ips = []"? In either case you're only removing references.

ipn = int("%02x%02x%02x%02x" %
          (int(st[0]), int(st[1]), int(st[2]), int(st[3])), 16)

Looks like the map function could help you here.

>>> st = ["1", "2", "3", "4"]
>>> map(int, st)
[1, 2, 3, 4]
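
For instance, the whole conversion could shrink to something like this sketch (`ip_to_int` is a hypothetical name, and it assumes a well-formed dotted-quad IPv4 address):

```python
def ip_to_int(address):
    """Convert a dotted-quad IPv4 string to its 32-bit integer value."""
    a, b, c, d = map(int, address.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d
```

With this, `ip_to_int("1.33.0.0")` gives 18939904, the start of the JP range in the geoip example line above.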

Cheers! -Damian

comment:3 Changed 6 years ago by karsten

Cc: atagar added
Owner: changed from asn to karsten
Status: changed from new to assigned

Thanks, atagar, for the great code review! That was really helpful.

Here's a revised script that fixes most of the things you mentioned. It also takes command-line arguments now and writes fingerprint hashes, no original fingerprints.

May I ask you for a second review round? The goal would be to add the script to the obfsproxy repository and enable bridge operators to easily understand what the script does.

And do you know an easy way to write the CC=NN pairs first ordered by NN (after rounding up, that is), then by CC? For example: DE=48,EU=48,AU=32

Changed 6 years ago by karsten

Attachment: log2stats.2.py added

Python script to convert obfsproxy log into aggregated stats, second attempt

comment:4 Changed 6 years ago by atagar

I'm about to head to sleep and I'll be busy most of tomorrow so a second pass will need to wait. But for your last question concerning multi-key sorting the sort function's key value can take a tuple which, iirc, does what you want...

>>> foo = [(1, 'a'), (2, 'z'), (2, 'b')]
>>> foo.sort(key = lambda i: (i[0], i[1]))
>>> foo
[(1, 'a'), (2, 'b'), (2, 'z')]
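
For the ordering you asked about (NN descending, then CC ascending), negating the count inside the key tuple should do it. A minimal sketch, with `format_pairs` as a made-up helper name and a plain dict of already-rounded counts as the assumed input:

```python
def format_pairs(counts):
    """Format {country: count} as 'CC=NN,...', highest count first, then by code."""
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ",".join("%s=%d" % (country, count) for country, count in ordered)
```

Calling `format_pairs({"AU": 32, "EU": 48, "DE": 48})` yields `DE=48,EU=48,AU=32`.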

Cheers! -Damian

comment:5 Changed 6 years ago by atagar

Took a quick scan and looks good to me.

locale_to_reqs[locale] = 8 * (int((len(locale_to_ip[locale]) + 7) / 8))

Is the extra int conversion really necessary? int / int = int

>>> foo = [1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> len(foo) / 8
1
>>> type(len(foo) / 8)
<type 'int'>
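
One caveat: that only holds on Python 2. Under Python 3, `/` between two ints returns a float, so the explicit floor-division operator is the safer spelling on either version. A small sketch (the function name is just for illustration):

```python
def round_up_to_multiple_of_8(count):
    # '//' is floor division in both Python 2 and 3, so no int() call is needed.
    return 8 * ((count + 7) // 8)
```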

print "hashed-fingerprint %s" % sha1(a2b_hex(fingerprint)).hexdigest()

Is this right? The fingerprint arg is already equal to...

fingerprint = sha1(a2b_hex(m.group(1))).hexdigest()

ccnn = []

I haven't a clue what 'ccnn' stands for. It might be the convention with some languages to abbreviate variables to cryptic and largely un-guessable acronyms, but with python it's preferable to spell it out.

Cheers! -Damian

comment:6 Changed 6 years ago by karsten

Thanks for all the suggestions! I added a branch log2stats to my public obfsproxy repository that has an updated log2stats.py script. (Good catch with the int conversion and hashed fingerprints; I didn't implement the tuple-sorting thing, and I left ccnn unchanged.)

It's still unclear if we should ask obfsproxy operators to log on info level and use this script, or if they should rather send us their heartbeat log messages (#5083). Let's not add this script to obfsproxy until we have a decision there.

comment:7 Changed 6 years ago by asn

If #5101 gets merged, it will break the current version of the python script. It changes:

2012-02-10 22:37:00 [info] 89.78.67.56:1234 (obfs2): trying to connect to 127.0.0.1:5001

to

2012-02-10 22:37:00 [info] 89.78.67.56:1234 (obfs2): Successful outbound connection to '127.0.0.1:5001'.

BTW, the new log line will only appear when the connection is successful. When the connection fails we still get:

89.78.67.56:1234: connection to 127.0.0.1:5001 failed: Connection timeout
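
A pattern that accepts both the old and the new message might look like the sketch below. `CONNECT_EITHER` and `client_ip` are hypothetical names, and only the two message forms quoted above are covered.

```python
import re

# Group 1 is the client IP address; either message variant is accepted.
CONNECT_EITHER = re.compile(
    r"\[info\] ([\d.]+):\d+ \([a-z0-9]+\): "
    r"(?:trying to connect to [\d.:]+"
    r"|Successful outbound connection to '[\d.:]+'\.)$")


def client_ip(line):
    """Return the client IP from a connection log line, or None if no match."""
    match = CONNECT_EITHER.search(line.rstrip("\n"))
    return match.group(1) if match else None
```

The failure message has no `[info]` tag in the quoted form, so it is not counted as a connection here.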

comment:8 in reply to: 7 Changed 6 years ago by nickm

Replying to asn:

If #5101 gets merged, it will break the current version of the python script.

#5101 got merged; the script will need tweaking. Also, IPs are now scrubbed by default; the script will need to handle that if it doesn't already.

comment:9 Changed 6 years ago by arma

Looks like we're good to go here?

comment:10 Changed 6 years ago by karsten

Do people still care about having an updated log2stats.py script in obfsproxy, now that we have heartbeat messages and [scrubbed] addresses by default?

comment:11 Changed 6 years ago by arma

I think we should focus on teaching obfsproxy how to tell Tor that it just got a connection and what its IP address was. Then Tor can just do its stats like normal.

comment:12 Changed 6 years ago by karsten

Resolution: wontfix
Status: changed from assigned to closed

Agreed. Closing this ticket. #5040 is about making bridges add obfsproxy stats to their extra-info descriptors.
