bridgedb's email responder should fuzzy match email addresses within time periods

added bridgedb-0.2.3 bridgedb-email component::circumvention/bridgedb owner::isis priority::medium resolution::fixed status::closed type::defect labels

Interesting. Nice find, isis!

Just for the sake of completeness: the plus sign in GMail addresses is used to automatically label emails. If I own foo@gmail.com, I can distribute foo+bar@gmail.com to automatically label all incoming emails with "bar". So for the "<static_username>+<incremental_integer>@gmail.com"-attack you mention, you need only a single GMail address.

However, going down the path of finding clever regexes to match things like the fake .onion address looking email addresses in addition to all the other things which are clearly patterns to a human sounds like a good way to either write unreadable code or accidentally block honest users.

Could you test for Kolmogorov Complexity?

Trac:
Username: mparte

Replying to phw:

Interesting. Nice find, isis!

Definitely!

This will be (sadly) very difficult to fix, however. We do handle the '+' notation already:

    idx = localpart.find('+')
    if idx >= 0:
        localpart = localpart[:idx]
    rules = domainrules.get(domain, [])
    if 'ignore_dots' in rules:
        # j.doe@ is the same as jdoe@.
        localpart = localpart.replace(".", "")

    return "%s@%s"%(localpart, domain)

So, limiting "incrementing" addresses and "similar" addresses is the next challenge. I fear it will be a losing battle.
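To make the gap concrete, here is a minimal, self-contained sketch of the canonicalisation quoted above. The function name `canonicalize` and the `rules` table are assumptions made for illustration, not the actual bridgedb API. It shows that '+' labels collapse to one address while incrementing local parts do not:

```python
def canonicalize(address, domainrules=None):
    """Strip a '+label' suffix and, if the domain's rules ask for it, dots."""
    domainrules = domainrules or {}
    localpart, _, domain = address.rpartition("@")
    # 'foo+bar@' is delivered to 'foo@', so drop everything from the '+'.
    localpart = localpart.split("+", 1)[0]
    if "ignore_dots" in domainrules.get(domain, []):
        # j.doe@ is the same as jdoe@.
        localpart = localpart.replace(".", "")
    return "%s@%s" % (localpart, domain)


# Assumed rule table, for illustration only.
rules = {"gmail.com": ["ignore_dots"]}

# Five '+' labels on one account canonicalise to a single address ...
labelled = {canonicalize("foo+%d@gmail.com" % n, rules) for n in range(5)}
# ... but five incrementing local parts remain five distinct addresses.
incremented = {canonicalize("foo%d@gmail.com" % n, rules) for n in range(5)}
```

With these inputs, `labelled` holds one address and `incremented` holds five, which is exactly the case the existing check cannot catch.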

Replying to mparte:

Could you test for Kolmogorov Complexity?

Interesting thought, it's something we can look into and experiment with.
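As a rough idea of what such an experiment might look like: Kolmogorov complexity itself is uncomputable, so the sketch below swaps in the compressed size under zlib as a crude upper-bound proxy. The helper name `compression_ratio` and the sample addresses are assumptions for illustration. A batch of addresses generated from one pattern should compress far better than a batch of unrelated ones:

```python
import random
import string
import zlib


def compression_ratio(addresses):
    """Compressed size divided by raw size for a non-empty batch of addresses.

    A low ratio suggests the batch is highly regular, i.e. likely generated.
    """
    raw = "\n".join(addresses).encode("utf-8")
    return len(zlib.compress(raw, 9)) / len(raw)


# A batch following the "<static_username><incremental_integer>" pattern.
patterned = ["feidanchaoren%04d@gmail.com" % n for n in range(50)]

# A batch of unrelated-looking local parts (seeded so the run is repeatable).
rng = random.Random(0)
alphabet = string.ascii_lowercase + string.digits
unrelated = ["".join(rng.choice(alphabet) for _ in range(17)) + "@gmail.com"
             for _ in range(50)]

print(compression_ratio(patterned), compression_ratio(unrelated))
```

This only says something about a batch of requests, not about a single address, and any threshold on the ratio would have to be tuned against real traffic.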

Replying to sysrqb:

We do handle the '+' notation already:

Great!

So, limiting "incrementing" addresses and "similar" addresses is the next challenge. I fear it will be a losing battle.

I'm not even sure if that is a battle worth fighting. Even if we come up with the perfect algorithm to detect similar addresses, the adversary could then simply start generating non-similar addresses by, e.g., randomly concatenating words from a dictionary. I feel like it would be very expensive for us to fix this problem but trivial to circumvent our fix once again. These rate-limiting strategies should be implemented by the email provider (that's actually the very reason we require Yahoo/GMail addresses), and I'm afraid there's a good reason they can't do a better job.

Trac:
Cc: isis,sysrqb to isis, sysrqb, phw

Trac:
Cc: isis, sysrqb, phw to isis, sysrqb, phw, mrphs

Trac:
Keywords: distributor, bots, spam, email deleted, bridgedb-0.2.x, bridgedb-email added

Replying to mparte:

However, going down the path of finding clever regexes to match things like the fake .onion address looking email addresses in addition to all the other things which are clearly patterns to a human sounds like a good way to either write unreadable code or accidentally block honest users.

Could you test for Kolmogorov Complexity?

This is venturing into crazy territory, but fwiw: it might be possible to calculate similarity simply using Hamming distances (https://en.wikipedia.org/wiki/Hamming_distance), which at the binary/bit level is the number of bits set in $given_string XOR $some_target_string, i.e. the number of bit flips needed to turn $given_string into an equal-length $some_target_string.

Actually implementing such a distance metric on top of email address storage (god forbid) would entail making an email address storage mechanism (probably) based on binary trees. I fear it would be simple to produce false positives, though; the mechanism could prefer calculating distances starting from the right-hand side of the first part of the email address, or whatnot.
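For reference, a minimal sketch of a bit-level Hamming distance between two local parts. The function name `hamming_bits` is an assumption for illustration. Note that the metric is only defined for inputs of equal length, so it cannot compare addresses that differ by an inserted or dropped character:

```python
def hamming_bits(a, b):
    """Number of differing bits between two equal-length strings."""
    x, y = a.encode("utf-8"), b.encode("utf-8")
    if len(x) != len(y):
        raise ValueError("Hamming distance needs equal-length inputs")
    # XOR each byte pair and count the bits that are set in the result.
    return sum(bin(p ^ q).count("1") for p, q in zip(x, y))


# '1' is 0x31 and '6' is 0x36; they differ in three bit positions.
print(hamming_bits("feidanchaoren0001", "feidanchaoren0006"))  # 3
```

Because the count is taken over raw bits, two unrelated characters can sit closer together than two consecutive digits, which is one way the false positives mentioned above could arise.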

In any case, I guess I agree that this does indeed sound like a losing battle with some insane code to top it off. :(

I forgot to mention, someone on IRC (sorry! I've forgotten your nick! Feel free to speak up!) once was interested in working on this, using the Levenshtein string distance algorithm. This would be super simple to do!
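A minimal sketch of that distance, using the standard row-by-row dynamic programming formulation. The function name `levenshtein` is chosen for illustration and is not necessarily what ends up in bridgedb:

```python
def levenshtein(s1, s2):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn s1 into s2."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1  # keep the rows as short as possible
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        current = [i]
        for j, c2 in enumerate(s2, 1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (c1 != c2)))  # substitution
        previous = current
    return previous[-1]


# Two "incrementing" addresses are a single substitution apart.
print(levenshtein("feidanchaoren0001@gmail.com",
                  "feidanchaoren0006@gmail.com"))  # 1
```

Unlike the Hamming distance, this also handles addresses of different lengths, at a cost proportional to the product of the two lengths per comparison.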

Trac:
Status: new to accepted

Maybe this is silly, but, to make this more difficult: if a baddie wanted to know whether freedom76@gmail.com was requesting bridges, they could send requests from freedomNN@gmail.com and see if the requests are blocked.

Trac:
Username: joelanders

So, I've implemented this slightly differently than proposed.

Right now, there is an EMAIL_BLACKLIST option in the bridgedb.conf config file. There is also another setting, EMAIL_FUZZY_MATCH, which takes an integer for the maximum Levenshtein String Distance for an email address to be considered as "fuzzy-matching" one of the addresses in the blacklist. These things currently are quite simple, i.e. they have no mechanisms for learning, and email addresses must be added to the EMAIL_BLACKLIST list by hand.
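A hypothetical bridgedb.conf excerpt, assuming the file takes Python-style assignments. The address and the threshold below are illustrative values, not recommendations:

```python
# Addresses to reject outright; entries are added by hand.
EMAIL_BLACKLIST = ["feidanchaoren0001@gmail.com"]

# Maximum Levenshtein String Distance at which an incoming address is
# treated as "fuzzy-matching" a blacklisted one (illustrative value).
EMAIL_FUZZY_MATCH = 4
```

The higher the threshold, the more look-alike addresses are caught, and the more likely an honest user with a similar address is blocked too.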

My work is in my fix/9385-fuzzy-email-matching branch, and there are some added unittests, all of which are passing.

I would like to merge this into bridgedb-0.2.2, so please review! Thanks!

Trac:
Status: accepted to needs_review
Keywords: bridgedb-0.2.x deleted, bridgedb-0.2.2 added

Oops, I meant bridgedb-0.2.3.

Trac:
Keywords: bridgedb-0.2.2 deleted, bridgedb-0.2.3 added

Okay, merged into bridgedb-0.2.3. This is fixed.

Trac:
Status: needs_review to closed
Resolution: N/A to fixed

The implementation was broken. In bridgedb.email.autoresponder.SMTPAutoresponder.runChecks(), it does:

         # If fuzzy matching is enabled via the EMAIL_FUZZY_MATCH setting, then
         # calculate the Levenshtein String Distance (see
         # :func:`~bridgedb.util.levenshteinDistance`):
         if self.incoming.context.fuzzyMatch != 0:
             for blacklistedAddress in self.incoming.context.blacklist:
                 distance = levenshteinDistance(self.incoming.canonicalFromEmail,
                                                blacklistedAddress)
                 if distance <= self.incoming.context.fuzzyMatch:
                     logging.info("Fuzzy-matched %s to blacklisted address %s!"
                                  % (self.incoming.canonicalFromEmail,
                                     blacklistedAddress))
                     return False

which is broken because self.incoming.canonicalFromEmail is the client's canonicalised email domain from the email headers, e.g. for an address 'feidanchaoren0006@gmail.com' it would be 'gmail.com', which obviously doesn't have a Levenshtein distance anything close to the EMAIL_BLACKLIST address which it should have fuzzy-matched ('feidanchaoren0001@gmail.com').

The fix is simple: change self.incoming.canonicalFromEmail to str(client) to use the full email address.
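The effect of the bug can be reproduced outside bridgedb with a small sketch. Here `edit_distance` is a stand-in for bridgedb.util.levenshteinDistance, and `fuzzy_blacklisted` and the threshold of 4 are assumptions for illustration, not the real runChecks() code:

```python
def edit_distance(a, b):
    """Stand-in for bridgedb.util.levenshteinDistance (single-row variant)."""
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        diag, row[0] = row[0], i
        for j, cb in enumerate(b, 1):
            diag, row[j] = row[j], min(row[j] + 1,        # deletion
                                       row[j - 1] + 1,    # insertion
                                       diag + (ca != cb)) # substitution
    return row[-1]


def fuzzy_blacklisted(address, blacklist, max_distance):
    """True if address is within max_distance of any blacklisted address.

    A max_distance of 0 disables fuzzy matching, mirroring the check above.
    """
    if max_distance == 0:
        return False
    return any(edit_distance(address, bad) <= max_distance
               for bad in blacklist)


blacklist = ["feidanchaoren0001@gmail.com"]

# Broken behaviour: only the domain is compared, so nothing ever matches.
print(fuzzy_blacklisted("gmail.com", blacklist, 4))                    # False
# Fixed behaviour: the full address is one substitution away.
print(fuzzy_blacklisted("feidanchaoren0006@gmail.com", blacklist, 4))  # True
```

The domain alone is 18 insertions away from the blacklisted address, so no sensible threshold would have matched it.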

Trac:
Resolution: fixed to N/A
Status: closed to reopened

Fixed in my hotfix/0.3.1-9385-levenshtein-distance branch.

Trac:
Resolution: N/A to fixed
Status: reopened to closed

moved to tpo/anti-censorship/bridgedb#9385 (closed)
