
Data extraction for Comparison

Here is the modified batch file, which automates even more of the steps:

@ECHO OFF

REM Download the tarballs manually and unpack them

REM Put the relay data in X:\consensuses-2012-03

REM Put the bridge data in X:\bridge-descriptors-2012-03
 
ECHO Extracting nickname lines using grep

"C:\Program Files (x86)\GnuWin32\bin\grep.exe" -r -h 2012 X:\consensuses-2012-03 > X:\dump.txt

"C:\Program Files (x86)\GnuWin32\bin\grep.exe" -r -h 2012 X:\bridge-descriptors-2012-03\statuses > X:\dump2.txt

ECHO Printing the second column of relays to file

"C:\Program Files (x86)\GnuWin32\bin\awk.exe" {print$2} X:\dump.txt > X:\print.txt

ECHO Printing the second column of bridges to file

"C:\Program Files (x86)\GnuWin32\bin\awk.exe" {print$2} X:\dump2.txt > X:\print2.txt

REM Sorting first to use uniq without arguments

ECHO Please wait while the lines get sorted

"C:\Program Files (x86)\GnuWin32\bin\sort.exe" X:\print.txt > X:\sorted.txt

"C:\Program Files (x86)\GnuWin32\bin\sort.exe" X:\print2.txt > X:\sorted2.txt

REM Then using uniq

ECHO Keeping only unique names

"C:\Program Files (x86)\GnuWin32\bin\uniq.exe" X:\sorted.txt > X:\unique.txt

"C:\Program Files (x86)\GnuWin32\bin\uniq.exe" X:\sorted2.txt > X:\unique2.txt

REM Turned out that I need the dump sorted

ECHO Please wait while the dump gets sorted

"C:\Program Files (x86)\GnuWin32\bin\sort.exe" X:\dump.txt > X:\dump_sorted.txt

"C:\Program Files (x86)\GnuWin32\bin\sort.exe" X:\dump2.txt > X:\dump2_sorted.txt

REM now you need to remove the dates manually from both unique.txt files

ECHO Please press any key to close this

PAUSE

EXIT
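
For readers without the GnuWin32 tools, the grep, awk, sort and uniq steps of the batch file can be approximated in a few lines of Python. This is only a sketch: the function name and the sample lines are made up for illustration, and Python's default string ordering may differ from the ordering GNU sort produces.

```python
def unique_second_column(lines, pattern="2012"):
    """Roughly mimic: grep PATTERN | awk '{print $2}' | sort | uniq."""
    values = set()
    for line in lines:
        if pattern not in line:        # grep step: keep lines containing the pattern
            continue
        fields = line.split()          # awk step: split on whitespace
        if len(fields) >= 2:
            values.add(fields[1])      # second column; the set removes duplicates
    return sorted(values)              # sort + uniq in one go


# Made-up sample lines, shaped like the lines grep extracts.
sample = [
    "r relayA AAAA BBBB 2012-03-01 00:10:00 10.0.0.1 9001 0",
    "s Fast Running",
    "r relayB CCCC DDDD 2012-03-01 00:20:00 10.0.0.2 9001 0",
    "r relayA AAAA BBBB 2012-03-01 01:10:00 10.0.0.1 9001 0",
    "valid-after 2012-03-01 00:00:00",
]
print(unique_second_column(sample))  # ['2012-03-01', 'relayA', 'relayB']
```

The date in the output shows why the dates still have to be removed afterwards.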

The improved approach of May 2012 for comparing March 2012 nicknames:

Platform: Windows 7 (64 bit)

Data for relays: consensuses-2012-03

Data for bridges: bridge-descriptors-2012-03 (the statuses)

  1. Download both files manually (I used a browser)
  1. Unpack them manually (I used 7zip). I unpacked all the data the bridge descriptors tarball contained; to speed things up, you can unpack the statuses only.
  1. Run grep on the relay data to extract the lines that contain the nicknames
"C:\Program Files (x86)\GnuWin32\bin\grep.exe" -r -h 2012 X:\input folder > X:\dump.txt

(I used 2012 as the pattern because it was simple. A regular expression will work as well.)

  1. Automate the processing of dump.txt (I created a batch.bat with the following content)
@ECHO OFF

ECHO Printing the second column to file

"C:\Program Files (x86)\GnuWin32\bin\awk.exe" {print$2} X:\dump.txt > X:\print.txt

REM Sorting first to use uniq without arguments

ECHO Please wait while the lines get sorted

"C:\Program Files (x86)\GnuWin32\bin\sort.exe" X:\print.txt > X:\sorted.txt

REM Then using uniq

ECHO Keeping only unique names

"C:\Program Files (x86)\GnuWin32\bin\uniq.exe" X:\sorted.txt > X:\unique.txt

REM Turned out that I need the dump sorted

ECHO Please wait while the dump gets sorted

"C:\Program Files (x86)\GnuWin32\bin\sort.exe" X:\dump.txt > X:\dump_sorted.txt

ECHO Please press any key to close this

PAUSE

EXIT

(extensions are not required)

  1. Run it, move all created files to another folder (like X:\relays\data) and delete dump.txt. (I know that this could be included in the batch file.)
  1. Run grep on the bridge data
"C:\Program Files (x86)\GnuWin32\bin\grep.exe" -r -h 2012 X:\input folder > X:\dump.txt
  1. Run the batch.bat again. (It's possible to output dump.txt and dump2.txt at the grep step [which could be included in the batch.bat] and change the batch to output different files)

(all of them with one batch)

  1. Remove the dates (2012-03) manually from both "unique.txt" files. (This could probably be done with a tool.)
  1. Compare both unique.txt files manually. (That was the approach; automation with reproducible results would be welcome.)
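
The manual date removal could be handled by a small filter. The following Python sketch assumes the unwanted entries look like 2012-03-01 (a full date on a line of its own); the function name and the sample names are hypothetical.

```python
import re

# Assumption: the entries to drop are full dates such as 2012-03-01.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def drop_dates(names):
    """Return the names without entries that consist only of a date."""
    return [name for name in names if not DATE_RE.fullmatch(name)]


print(drop_dates(["2012-03-01", "2012-03-02", "relayA", "relayB"]))
# ['relayA', 'relayB']
```

Because only whole-line matches are dropped, a nickname that merely contains digits is kept.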



Old version:

This document describes what I did to obtain two lists of names, one for relays and one for bridges, in order to compare them.

It was my plan to make this process open, so anyone can check. While processing the data, I thought that there's room for improvement.

Platform: Windows 7 (you'll see why I mention that)

Data for relays: Consensuses of May 2008

I looked into the server descriptors as well, but they contain more data, which I didn't consider useful for the comparison.

Data for bridges: Statuses of bridges of May 2008

Download consensuses

https://metrics.torproject.org/data.html#relaydesc

  1. I downloaded the consensuses of May 2008, "consensuses-2008-05.tar.bz2"
  1. Unpacked them, which gave me the root folder "consensuses-2008-05" containing 31 folders with 24 files each


Inspect the files

I used Notepad++ since the default Notepad has many downsides.

  1. I opened the file "2008-05-01-00-00-00-consensus"
  1. The lines containing the relay name start with an "r". Could I use that somehow?
  1. I considered these lines to be the only useful ones
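
The leading "r" can indeed be used. The Python sketch below picks the nickname and the following field out of such a line. It assumes the nickname is the second whitespace-separated field and that the third field is the string this page later calls the fingerprint; the function name and the sample line are made up.

```python
def parse_r_line(line):
    """Return (nickname, fingerprint) for an "r" line, or None for other lines."""
    fields = line.split()
    if len(fields) < 3 or fields[0] != "r":
        return None
    # Assumption: field 2 is the nickname, field 3 the fingerprint-like string.
    return fields[1], fields[2]


print(parse_r_line("r relayA AAAA BBBB 2008-05-01 00:10:00 10.0.0.1 9001 0"))
# ('relayA', 'AAAA')
print(parse_r_line("valid-after 2008-05-01 00:00:00"))
# None
```

Matching on the leading "r" instead of the year avoids the false positives from lines such as valid-after.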


Process the files

Since manually copying each line to a new file is slow, I used grep for Windows. I had it installed already, but it was rather unused.

Linux is wonderful here, Windows lacks this function.

  1. I found a way to extract the lines containing the relay name by using grep, now I needed a pattern
  1. All relay lines contain the year 2008, so I could use this as the pattern. Pattern found.

Using a regular expression for the full data might have worked.

Using a regular expression for the 27-character string might have worked.

I used "2008" because it was simple and gave only a few false positives.

I tested it with a single file, then the folder "01".

  1. Having all relay lines of 24 hours in a single file revealed that at least some of them were not up for 24 hours.

  1. I decided to use grep on all files at once to have all relay lines in one file.

grep is able to do so and I didn't want to miss any relay

grep is really fast

  1. I used "grep -r -h 2008 X:\consensuses-2008-05 > X:\comparison\dump.txt"

I renamed the file to "relays unsorted uncleaned.txt"
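
The recursive grep run could also be reproduced without grep. This Python sketch walks a folder tree and collects every line starting with "r ", which is an assumption that replaces the "2008" pattern. The function name, the temporary folder and the file content are made up so that the example runs on its own.

```python
import tempfile
from pathlib import Path


def collect_r_lines(root):
    """Return every line starting with "r " from all files below root."""
    lines = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        with path.open(encoding="utf-8", errors="replace") as handle:
            for line in handle:
                if line.startswith("r "):
                    lines.append(line.rstrip("\n"))
    return lines


# Tiny made-up folder tree standing in for the unpacked consensuses.
with tempfile.TemporaryDirectory() as root:
    day = Path(root) / "01"
    day.mkdir()
    (day / "2008-05-01-00-00-00-consensus").write_text(
        "valid-after 2008-05-01 00:00:00\n"
        "r relayA AAAA BBBB 2008-05-01 00:10:00 10.0.0.1 9001 0\n"
        "s Fast Running\n",
        encoding="utf-8",
    )
    found = collect_r_lines(root)

print(found)
```

On real data, root would be the unpacked folder, for example X:\consensuses-2008-05.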


Sort the lines

I considered it useful to sort the lines. Windows isn't able to sort the content of files.

Since I used Notepad++ for looking into the files, I wanted to use it for sorting as well. It can't do that natively, but there's a plugin.

  1. I decided not to strip the leading "r"

It shouldn't hurt.

  1. I used Notepad++ with the plugin "Column Sort" to sort the lines

That's time and memory intensive.

  1. I saved a copy of the sorted data and removed the valid-after, fresh-until, valid-until and vote digest lines

The vote digest was included once because that line contained "2008".

I saved it as "relays sorted cleaned.txt".

If I had processed the files manually, it would have taken far, far longer.

The tools were a great help so far. Considering that Linux distributions can do that by default, it should not be hard to reproduce this.


Try trimming the list

To compare nicknames in the first place, it should be much easier to see the same nickname just once.

Manually trimming would have worked, but would have consumed much time.

To keep just the name and fingerprint, I wanted to treat the file as CSV, only in order to remove data from the files.

  1. I loaded the sorted copy into a spreadsheet program, but not all lines could be imported because spreadsheet programs limit the number of rows. I therefore had to split the list first.

I used LibreOffice 3.5, but Microsoft Excel limits the number of lines as well.


Split the list

Windows is able to split files, but I don't know how well.

I used GSplit, because I knew it could split after x occurrences of a pattern. This includes special characters like the Line Feed character. So I could make sure to keep the lines themselves intact and could choose exactly how many lines the files would contain: the first 1000000, the second the rest.
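
Splitting after a fixed number of lines can also be sketched in Python. The function name is hypothetical and the chunk size of 2 is only for the demonstration; 1000000 would match the split described here. Each chunk could then be written to its own file, such as part1.txt and part2.txt.

```python
from itertools import islice


def split_lines(lines, chunk_size):
    """Yield lists of at most chunk_size lines, keeping every line intact."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


parts = list(split_lines(["a", "b", "c", "d", "e"], 2))
print(parts)  # [['a', 'b'], ['c', 'd'], ['e']]
```

Because the input can be an open file object, the whole file never has to be held in memory at once.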

  1. I split the file into two parts by using GSplit

Settings I changed or used:

"I want to split after the nth occurrence of a specified pattern"

"Split after the occurrence number"

"1000000"

"0x0A" as this is the LF or Line Feed

Filename "part{num}.txt"

"Do not add Gsplit tags to piece files"

Keep the wanted data

I considered nickname and fingerprint to be valuable, because the fingerprint makes identification easier.

  1. I loaded each part in a spreadsheet application (Calc from LibreOffice 3.5)
  1. I used spaces as the separator and made sure every column is treated as text

Treating the data as text prevents it from being interpreted; for example, "001" would otherwise be turned into "1" because the leading zeros are ignored.

  1. I removed the columns that seemed not to be required and saved each file as CSV

No commas were added; I ended up with "nickname" "fingerprint" separated by a space, with no empty lines in between.


Trim the list

Now both files contained the nick and the fingerprint, but still multiple times. I wanted to remove the duplicates.

I used Notepad++ with the TextFX plugin.

  1. I loaded both files into Notepad++ and used TextFX to sort them, as it can output unique lines only.

In fact TextFX could have done the first sorting as well.

"Sort ascending"

"Sort outputs only unique lines"

"Sort line case insensitive" makes no difference between Tor, ToR and tor; the differing fingerprints keep those lines from being dropped.

  1. I copied both sorted lists into a new file and removed a single line, because it appeared twice

It should have been possible to combine both CSV files before sorting, but that's a matter of memory.

  1. I discarded the changes made to the CSV files (I did not save them)
  1. I saved the new list as "relay names fingerprint.txt", which now contains 9469 lines

Strange: there were never so many relays. Did I mess up?

There are relay names that are the same, but have a different fingerprint; this explains some occurrences.

I noticed that some fingerprints appeared at least twice, but had different nicknames. I checked the source data and they were not up at the same time.

I decided to go on, even though it was strange.
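
Both observations (one nickname with several fingerprints, one fingerprint with several nicknames) can be checked systematically. This is a Python sketch with made-up pairs and a hypothetical function name; real input would be the nickname and fingerprint pairs from the list.

```python
from collections import defaultdict


def group_pairs(pairs):
    """Find nicknames used with several fingerprints, and the reverse."""
    by_name = defaultdict(set)
    by_fingerprint = defaultdict(set)
    for name, fingerprint in pairs:
        by_name[name].add(fingerprint)
        by_fingerprint[fingerprint].add(name)
    shared_names = {n: sorted(f) for n, f in by_name.items() if len(f) > 1}
    renamed = {f: sorted(n) for f, n in by_fingerprint.items() if len(n) > 1}
    return shared_names, renamed


# Made-up (nickname, fingerprint) pairs.
pairs = [("tor", "FP1"), ("tor", "FP2"), ("alpha", "FP3"), ("beta", "FP3"), ("gamma", "FP4")]
shared_names, renamed = group_pairs(pairs)
print(shared_names)  # {'tor': ['FP1', 'FP2']}
print(renamed)       # {'FP3': ['alpha', 'beta']}
```

Counting the entries in both results would show how much of the unexpectedly large line count each effect explains.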


Unnamed relays

Before I started, I wondered if Unnamed relays would tell me anything.

I looked at "Unnamed" and counted the occurrences (whole word, match case).

It appeared 3390 times.

  1. I removed "Unnamed" (case sensitive) and saved as

"relay names fingerprint no unnamed.txt"

I kept UNNAMED and unnamed as well as Unnamed + any addition

Should I trim the list further?

  1. I loaded the file into Calc and removed the fingerprints

saved as "relay names only.csv"

  1. I sorted the file with Notepad++ and kept the unique names

"unique relay names only sorted.csv"

I may have lost "Tor, ToR and tor", but was OK with that

I was down to 4873 lines

Back in 2008 there weren’t so many relays

Could names have changed that often?
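
The removal of "Unnamed" and the case-insensitive unique sort can be combined in one step. This Python sketch is not what TextFX does internally; it keeps the first spelling it sees of each name, which is an assumption. The function name and the sample names are made up.

```python
def unique_names(names, drop=("Unnamed",)):
    """Drop exact matches of "Unnamed" and keep one spelling per name, ignoring case."""
    seen = {}
    for name in names:
        if name in drop:                       # case-sensitive, so UNNAMED stays
            continue
        seen.setdefault(name.lower(), name)    # first spelling wins
    return sorted(seen.values(), key=str.lower)


print(unique_names(["Unnamed", "Tor", "ToR", "tor", "UNNAMED", "Unnamed2", "alpha"]))
# ['alpha', 'Tor', 'UNNAMED', 'Unnamed2']
```

As in the manual run, Tor, ToR and tor collapse into one entry, while UNNAMED and Unnamed plus any addition are kept.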


Bridges

  1. I downloaded the bridge data
  1. “Grep”ed the statuses

I also used 2008 as the pattern; there were no false positives this time

  1. Sorted using Column Sort

I sorted with Column Sort in the first place to have an overview

Many lines were exact duplicates

I think it's useless to do this

  1. Sorted again using TextFX unique lines only, saved it

bridges sorted.txt with only unique bridges and fingerprints

  1. Loaded into a spreadsheet application

Remember to treat it as text

  1. Keep only the bridge names

that's the only thing needed

  1. I checked “Unnamed” and it didn't vary at all
  1. Sorted unique, removed "Unnamed"

a final time to make sure I had fewer lines

"bridges names only unique.csv"

  1. Compared them manually (that's what I agreed to)

That was more work than I thought once I saw the bridge list

The bridge IPs were sanitized but one could tell if they are stable

  1. Copied lines I found from "bridges sorted.txt" and "relays sorted cleaned.txt" to "findings.txt"


The files I really worked with are "unique relay names only sorted.csv", "bridges names only unique.csv", "bridges sorted.txt" and "relays sorted cleaned.txt".

I did not know if the other files I created along the way would be useful, so I saved them. In the end I haven't used them.


My approach, as I planned it, was to look at the bridge names and compare them to the relay names, mainly because there are many more relays.

Would and should my approach be different if there were 50000 bridges?


I'm sure some call me (something) for not taking a shortcut. I'm sure I could remove or skip a few steps if I knew the right tools. Also I'm on Windows.

After I did all this, I was quite sure that this can be done with a script. A more experienced user would be better at this.

What I'm looking for is an improvement on how I approached it. There are plans to compare the names of bridges and relays from a recent tarball.

Maybe it's even possible to use an algorithm that prints out exact matches.
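
Printing exact matches is possible with a set intersection. This is a minimal Python sketch, assuming each list is a text file with one name per line; the function names and the sample names are made up.

```python
def read_names(path):
    """Read one name per line from a text file, skipping empty lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def exact_matches(relay_names, bridge_names):
    """Return the names that appear in both lists (case-sensitive)."""
    return sorted(set(relay_names) & set(bridge_names))


# Made-up names; real input would come from read_names() on both files.
relays = ["alpha", "beta", "gamma", "Tor"]
bridges = ["beta", "delta", "tor", "gamma"]
print(exact_matches(relays, bridges))  # ['beta', 'gamma']
```

The comparison is case-sensitive, so Tor and tor do not count as a match here; lowercasing both lists first would change that.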

Last modified on Jun 7, 2012, 12:43:42 PM