Opened 7 years ago

Closed 5 years ago

#9385 closed defect (fixed)

bridgedb's email responder should fuzzy match email addresses within time periods

Reported by: isis
Owned by: isis
Priority: Medium
Milestone:
Component: Circumvention/BridgeDB
Version:
Severity:
Keywords: bridgedb-email, bridgedb-0.2.3
Cc: isis, sysrqb, phw, mrphs
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

tl;dr: We're getting trolled hardcore. We should have some sort of fuzzy matching on email addresses within a time limit.

While looking into #9277, I noticed several problems in the directory where BridgeDB stores its logfiles.

One of them is that BridgeDB's email response distributor is incredibly naive and susceptible to massive trolling. Setting aside the fact that there are five days' worth of logfiles which include the *full* *text* of the response emails, *including* *the* *client* *email* *addresses*, it is actually lucky that I saw these email addresses, because there is a definite pattern to them.

There were 200 occurrences of 'gmail.com':

$ grep -Er '@gmail\.com' | awk -Pe  '{"From "} ; { print $2 }' | grep gmail\.com | wc -l
200

120 of which were unique:

$ grep -Er '@gmail\.com' | awk -Pe  '{"From "} ; { print $2 }' | grep gmail\.com | sort | uniq | wc -l
120

The problem is that there are multiple addresses making requests in a row which are not only quite clearly related (i.e. <static_username>+<incremental_integer>@gmail.com, or <base32_80bit_hash>@gmail.com) but also are quite obviously snark/trolling from various adversaries.

For example, one of the usernames with incremental integers was 'feidanchaoren', which I saw incremented 34 times, i.e.

feidanchaoren00001@
feidanchaoren00002@
[...]
feidanchaoren00034@

There were multiple requests (though at minimum 30 minutes apart) from precisely the same username+integer.

Also, 'fei dan' is pinyin for 飞蛋, which means 'flying egg' in English. It is from a Confucian parable which, if I understood it correctly (and I am well-versed in neither Traditional Chinese nor Confucianism), is about a man who pays so much attention to a bunch of eggs, trying to ensure that they hatch, that he does not pay any attention to what to do afterwards. The eggs hatch, and the chickens fly away. Roughly, it means: "if you pay too much attention to details and not enough to the bigger picture, you are made of #fail". And 'chao ren' (超人) is 'superman' in English, or more accurately Nietzsche's 'Übermensch' in German. I would assume we're being trolled pretty hard.

One way to fix this might be to take the time period which we currently wait between responses, and in addition to rejecting emails from precisely the same username, we can block anything which fuzzy matches. However, going down the path of finding clever regexes to match things like the fake .onion address looking email addresses in addition to all the other things which are clearly patterns to a human sounds like a good way to either write unreadable code or accidentally block honest users.
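A minimal sketch of that idea, under stated assumptions: the class name, the three-hour period, and the 0.9 threshold are all hypothetical, and the similarity measure is the standard library's `difflib.SequenceMatcher` ratio rather than anything BridgeDB uses. Only local parts on the same domain are compared, so that the shared '@gmail.com' suffix does not inflate the similarity.

```python
import difflib
import time


class FuzzyRateLimiter:
    """Refuse a request if a similar address was answered recently.

    Hypothetical helper for illustration; not part of BridgeDB.
    """

    def __init__(self, period=3 * 60 * 60, threshold=0.9):
        self.period = period        # seconds to wait between responses (assumed)
        self.threshold = threshold  # similarity ratio counted as a match (assumed)
        self.seen = {}              # address -> time of the last response

    def allow(self, address, now=None):
        now = time.time() if now is None else now
        # Forget addresses whose waiting period has expired.
        self.seen = {a: t for a, t in self.seen.items()
                     if now - t < self.period}
        local, _, domain = address.rpartition("@")
        for earlier in self.seen:
            e_local, _, e_domain = earlier.rpartition("@")
            if domain != e_domain:
                continue
            ratio = difflib.SequenceMatcher(None, local, e_local).ratio()
            if ratio >= self.threshold:
                return False
        self.seen[address] = now
        return True
```

Even in this form the false-positive risk is visible: two honest users whose long local parts differ by a single character would block each other for the whole period.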

Change History (15)

comment:1 Changed 7 years ago by phw

Interesting. Nice find, isis!

Just for the sake of completeness: the plus sign in GMail addresses is used to automatically label emails. If I own foo@…, I can distribute foo+bar@… to automatically label all incoming emails with "bar". So for the "<static_username>+<incremental_integer>@gmail.com"-attack you mention, you need only a single GMail address.

comment:2 Changed 7 years ago by mparte

> However, going down the path of finding clever regexes to match things like the fake .onion address looking email addresses in addition to all the other things which are clearly patterns to a human sounds like a good way to either write unreadable code or accidentally block honest users.

Could you test for Kolmogorov Complexity?

comment:3 Changed 7 years ago by sysrqb

Replying to phw:

> Interesting. Nice find, isis!

Definitely!

This will be (sadly) very difficult to fix, however. We do handle the '+' notation already:

    idx = localpart.find('+')
    if idx >= 0:
        localpart = localpart[:idx]
    rules = domainrules.get(domain, [])
    if 'ignore_dots' in rules:
        # j.doe@ is the same as jdoe@.
        localpart = localpart.replace(".", "")

    return "%s@%s"%(localpart, domain)
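For anyone who wants to try the excerpt in isolation, here is a self-contained sketch of the same normalisation. The function name and the shape of the `domainrules` argument are assumptions made for illustration; only the '+' stripping and the 'ignore_dots' rule come from the excerpt above.

```python
def canonicalize(address, domainrules=None):
    """Return a canonical form of ``address`` (hypothetical helper)."""
    domainrules = domainrules or {}
    localpart, _, domain = address.strip().rpartition("@")
    # foo+bar@ is delivered to foo@, so drop the '+label' suffix.
    localpart = localpart.split("+", 1)[0]
    if "ignore_dots" in domainrules.get(domain, []):
        # j.doe@ is the same as jdoe@.
        localpart = localpart.replace(".", "")
    return "%s@%s" % (localpart, domain)


rules = {"gmail.com": ["ignore_dots"]}
print(canonicalize("j.doe+bridges@gmail.com", rules))
print(canonicalize("feidanchaoren00001@gmail.com", rules))
```

The second call returns its input unchanged, which is the gap being discussed: a counter appended without a '+' survives canonicalisation.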

So, limiting "incrementing" addresses and "similar" addresses is the next challenge. I fear it will be a losing battle.

Replying to mparte:

> Could you test for Kolmogorov Complexity?

Interesting thought, it's something we can look into and experiment with.
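One way such an experiment might start, as a rough sketch only: Kolmogorov complexity is not computable, so this uses the zlib-compressed size of a batch of addresses as a crude stand-in. The function name, the seed, and the generated "unrelated" addresses are made up for illustration.

```python
import random
import zlib


def compression_ratio(addresses):
    """Compressed size divided by raw size for a batch of addresses."""
    raw = "\n".join(addresses).encode("utf-8")
    return len(zlib.compress(raw, 9)) / len(raw)


# Addresses following the incrementing pattern from the ticket description.
incrementing = ["feidanchaoren%05d@gmail.com" % n for n in range(1, 35)]

# Unrelated-looking local parts, generated with a fixed seed for repeatability.
rng = random.Random(9385)
alphabet = "abcdefghijklmnopqrstuvwxyz234567"
unrelated = ["".join(rng.choice(alphabet) for _ in range(18)) + "@gmail.com"
             for _ in range(34)]

print(compression_ratio(incrementing))
print(compression_ratio(unrelated))
```

The incrementing batch compresses far better than the random one. Note the limitation, though: the base32-hash style addresses from the description would look like the random batch here, so a compressibility test would only catch the patterned kind of abuse.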

comment:4 in reply to:  3 Changed 7 years ago by phw

Replying to sysrqb:

> We do handle the '+' notation already:

Great!

> So, limiting "incrementing" addresses and "similar" addresses is the next challenge. I fear it will be a losing battle.

I'm not even sure if that is a battle worth fighting. Even if we come up with the perfect algorithm to detect similar addresses, the adversary could then simply start generating non-similar addresses by, e.g., randomly concatenating words from a dictionary. I feel like it would be very expensive for us to fix this problem but trivial to circumvent our fix once again. These rate-limiting strategies should be implemented by the email provider (that's actually the very reason we require Yahoo/GMail addresses), and I'm afraid there's a good reason they can't do a better job.

comment:5 Changed 7 years ago by phw

Cc: phw added

comment:6 Changed 6 years ago by mrphs

Cc: mrphs added

comment:7 Changed 6 years ago by isis

Keywords: bridgedb-email bridgedb-0.2.x added; email distributor spam bots removed

comment:8 in reply to:  2 Changed 6 years ago by wfn

Replying to mparte:

> > However, going down the path of finding clever regexes to match things like the fake .onion address looking email addresses in addition to all the other things which are clearly patterns to a human sounds like a good way to either write unreadable code or accidentally block honest users.
>
> Could you test for Kolmogorov Complexity?

This is venturing into crazy territory, but fwiw: it might be possible to calculate similarity simply using Hamming distances: https://en.wikipedia.org/wiki/Hamming_distance which at the binary/bit level is the number of bit positions in which $given_string differs from $some_target_string (i.e. the number of 1 bits in their XOR).

Actually implementing such a distance metric on top of email address storage (god forbid) would entail making an email address storage mechanism (probably) based on binary trees. I fear it would be simple to produce false positives, though; the mechanism could prefer calculating distances starting from the right-hand side of the first part of the email address, or whatnot.
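For reference, a bit-level Hamming distance is only a few lines. This is a sketch with a hypothetical function name, and it ignores the storage question entirely.

```python
def hamming_distance(a, b):
    """Number of differing bits between two equal-length strings."""
    x, y = a.encode("utf-8"), b.encode("utf-8")
    if len(x) != len(y):
        raise ValueError("Hamming distance needs equal-length inputs")
    # XOR each byte pair and count the bits that are set in the result.
    return sum(bin(p ^ q).count("1") for p, q in zip(x, y))


print(hamming_distance("feidanchaoren00001", "feidanchaoren00002"))  # 2
```

The equal-length requirement is the practical catch: a counter that rolls over from '9' to '10' changes the length of the local part, at which point the metric is undefined.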

In any case, I guess I agree that this does indeed sound like a losing battle with some insane code to top it off. :(

comment:9 Changed 6 years ago by isis

Status: new → accepted

I forgot to mention, someone on IRC (sorry! I've forgotten your nick! Feel free to speak up!) once was interested in working on this, using the Levenshtein string distance algorithm. This would be super simple to do!

comment:10 Changed 6 years ago by joelanders

Maybe this is silly, but, to make this more difficult: if a baddie wanted to know whether freedom76@… was requesting bridges, they could send requests from freedomNN@… and see if the requests are blocked.

comment:11 Changed 6 years ago by isis

Keywords: bridgedb-0.2.2 added; bridgedb-0.2.x removed
Status: accepted → needs_review

So, I've implemented this slightly differently than proposed.

Right now, there is an EMAIL_BLACKLIST option in the bridgedb.conf config file. There is also another setting, EMAIL_FUZZY_MATCH, which takes an integer for the maximum Levenshtein String Distance for an email address to be considered as "fuzzy-matching" one of the addresses in the blacklist. These things currently are quite simple, i.e. they have no mechanisms for learning, and email addresses must be added to the EMAIL_BLACKLIST list by hand.
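Roughly, the check described above could look like the following sketch. The two option names come from this comment; their values, the function names, and the distance implementation are assumptions for illustration and are not the code in the branch.

```python
# Hypothetical values; only the option names come from the ticket.
EMAIL_BLACKLIST = ["feidanchaoren00001@gmail.com"]
EMAIL_FUZZY_MATCH = 4


def levenshtein(s1, s2):
    """Edit distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (c1 != c2)))  # substitution
        previous = current
    return previous[-1]


def is_blacklisted(address, blacklist=EMAIL_BLACKLIST, fuzzy=EMAIL_FUZZY_MATCH):
    """True if ``address`` is within ``fuzzy`` edits of a blacklisted address."""
    return any(levenshtein(address, bad) <= fuzzy for bad in blacklist)


print(is_blacklisted("feidanchaoren00034@gmail.com"))
print(is_blacklisted("alice@gmail.com"))
```

In this sketch a threshold of 0 degenerates to an exact-match blacklist, and a larger threshold trades more catches for a higher chance of blocking an honest address that happens to sit near a blacklisted one.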

My work is in my fix/9385-fuzzy-email-matching branch, and there are some added unittests, all of which are passing.

I would like to merge this into bridgedb-0.2.2, so please review! Thanks!

comment:12 Changed 6 years ago by isis

Keywords: bridgedb-0.2.3 added; bridgedb-0.2.2 removed

Oops, I meant bridgedb-0.2.3.

comment:13 Changed 6 years ago by isis

Resolution: fixed
Status: needs_review → closed

Okay, merged into bridgedb-0.2.3. This is fixed.

comment:14 Changed 5 years ago by isis

Resolution: fixed
Status: closed → reopened

The implementation was broken. In bridgedb.email.autoresponder.SMTPAutoresponder.runChecks(), it does:

         # If fuzzy matching is enabled via the EMAIL_FUZZY_MATCH setting, then
         # calculate the Levenshtein String Distance (see
         # :func:`~bridgedb.util.levenshteinDistance`):
         if self.incoming.context.fuzzyMatch != 0:
             for blacklistedAddress in self.incoming.context.blacklist:
                 distance = levenshteinDistance(self.incoming.canonicalFromEmail,
                                                blacklistedAddress)
                 if distance <= self.incoming.context.fuzzyMatch:
                     logging.info("Fuzzy-matched %s to blacklisted address %s!"
                                  % (self.incoming.canonicalFromEmail,
                                     blacklistedAddress))
                     return False

which is broken because self.incoming.canonicalFromEmail is the client's canonicalised email domain from the email headers, e.g. for an address 'feidanchaoren0006@gmail.com' it would be 'gmail.com', which obviously isn't anywhere near the allowed Levenshtein distance from the EMAIL_BLACKLIST address which it should have fuzzy-matched ('feidanchaoren0001@gmail.com').

The fix is simple: change self.incoming.canonicalFromEmail to str(client) to use the full email address.
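To put numbers on it without re-running BridgeDB, the sketch below uses two general bounds rather than the real levenshteinDistance function: an edit distance can never be smaller than the difference in length, and for equal-length strings it can never be larger than the number of mismatched positions. The variable names are illustrative only.

```python
blacklisted = "feidanchaoren0001@gmail.com"
client = "feidanchaoren0006@gmail.com"
domain = client.rpartition("@")[2]  # what the broken check compared: 'gmail.com'

# Broken check: the distance is at least the difference in length.
lower_bound_domain = abs(len(blacklisted) - len(domain))

# Fixed check: with equal lengths, the distance is at most the number of
# positions where the two strings differ.
assert len(client) == len(blacklisted)
upper_bound_full = sum(a != b for a, b in zip(client, blacklisted))

print(lower_bound_domain)  # 18
print(upper_bound_full)    # 1
```

So any small EMAIL_FUZZY_MATCH value could never match when only the domain was compared, while the full address is a single edit away from the blacklisted one.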

comment:15 Changed 5 years ago by isis

Resolution: fixed
Status: reopened → closed

Fixed in my hotfix/0.3.1-9385-levenshtein-distance branch.
