Opened 5 years ago

Last modified 9 months ago

#16520 assigned task

Add research idea to Run some onion services to observe crawling trends

Reported by: arma Owned by: metrics-team
Priority: Medium Milestone:
Component: Webpages/Research Version:
Severity: Normal Keywords: tor-hs, nickm-cares, research-ideas
Cc: dlmccoy@…, naif, phw, teor Actual Points:
Parent ID: Points: 1
Reviewer: Sponsor:


We know some research groups that are doing full crawling of onion services. We also know that Ahmia et al are doing it. I keep hearing these days about big security companies selling "onion intelligence" or the like.

What are the characteristics of these crawls? Are many of them one level deep, or k levels deep, or full crawls? Do they obey robots.txt? Do they identify themselves by their user agent? Do they visit URLs that are embedded in HTML comments that humans would never find? Do they de-obfuscate URLs and visit those? Do they get suckered by web tarpits that produce infinite pages? Are the crawling trends growing quickly or slowly?

We should consider running a couple of onion services with various characteristics, monitoring their usage, and seeing if we learn anything.
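A minimal version of such a service could look like the sketch below: a stock tor onion service in front of a local web server, plus a robots.txt whose Disallow line acts as bait. The directory name, backend port, and bait path are illustrative assumptions, not anything specified in this ticket.

```
# torrc fragment: a single onion service forwarding port 80 to a local web server
HiddenServiceDir /var/lib/tor/crawl_observatory
HiddenServicePort 80 127.0.0.1:8080
```

```
# robots.txt served at the site root. Only clients that fetch robots.txt
# ever learn that /secret/ exists, so any later hit on /secret/ identifies
# a crawler that read robots.txt and ignored it.
User-agent: *
Disallow: /secret/
```

Since the onion address would not be published anywhere, any visit at all already indicates harvesting of some kind.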

Child Tickets

Change History (14)

comment:1 Changed 5 years ago by arma

Are there some "honeysite" style tools that already automate most of this process?

comment:2 Changed 5 years ago by arma

Cc: dlmccoy@… added

comment:3 Changed 5 years ago by arma

Cc: naif added

comment:4 Changed 5 years ago by phw

I just set up an onion service that runs a web server. It hosts a simple login page and a robots.txt that points to additional, disallowed files. The onion URL isn't published anywhere and I'll keep an eye on the log files. I plan to extend the setup once I see any crawling.
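The log monitoring this comment describes could be sketched roughly as below. Assumptions not stated in the ticket: the web server writes Apache/nginx "combined"-format logs, /secret/ is the robots.txt-disallowed area, and clients are keyed by source address. (An onion service actually sees every request come from localhost, so in practice one would group by an exported circuit ID or by time window instead.)

```python
# Sketch: flag clients that fetch robots.txt and then visit a
# disallowed path anyway, and record the user agents they send.
import re
from collections import defaultdict

# Apache/nginx "combined" log format (assumed, not confirmed by the ticket).
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d+) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)
DISALLOWED_PREFIX = "/secret/"  # must match the Disallow: line in robots.txt

def classify(log_lines):
    """Per-client stats: did it fetch robots.txt, did it then visit a
    disallowed path anyway, and which user agents did it send."""
    stats = defaultdict(lambda: {"robots": False, "disobeyed": False, "agents": set()})
    for line in log_lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip lines in an unexpected format
        s = stats[m["ip"]]
        s["agents"].add(m["agent"])
        if m["path"] == "/robots.txt":
            s["robots"] = True
        elif m["path"].startswith(DISALLOWED_PREFIX) and s["robots"]:
            s["disobeyed"] = True
    return stats
```

A client that requests /robots.txt and later /secret/anything ends up with disobeyed=True, which answers the "do they obey robots.txt" question from the ticket description for that client.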

comment:5 Changed 5 years ago by phw

Cc: phw added

comment:6 Changed 5 years ago by teor

Cc: teor added

comment:7 Changed 5 years ago by naif

Below is a braindump over the morning espresso.

My past experience with a very simple honeypot (a shell script run from inetd that sent me an email), without any active engagement, only ever turned up simple "curl"-based crawlers, and it took a long time until the first crawler came by.

For intelligence on that, I'd suggest considering two different conditions:
a) catching crawlers targeting unpublished TorHS (which we know harvest the HSDirs)
b) catching crawlers targeting published TorHS (which are listed on one or more indexes)

I'd suggest creating tons (thousands) of TorHS every week, focusing on automated crawlers. A nice side effect of creating many TorHS is that the "bad guys" selling the crawled data will have some difficulties. To do that we must fix #15251, which would also unlock the ability to develop "OnionFlare", the onionized edition of CloudFlare :-)

For content classification and creation (i.e. anarchy site, literacy site, drug site, CP site, political site), I'd suggest using the Ahmia index (which contains classifications) plus Tor2web to create Onion<--->Onion reverse proxies (Tor2web supports a "static mapping").

That way we'll be able to create content and observe timing, behaviour, and approach in crawling different kinds of content without needing to create our own, just acting as a "parasitic" network of onion proxies in front of existing onion sites.

Last edited 5 years ago by naif (previous) (diff)
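Comment:7's batch-creation idea could be sketched as below: generate a torrc fragment declaring N onion services that all forward to one local backend. The paths and backend address are illustrative assumptions, and a real deployment would also need per-service content so the batch doesn't trivially fingerprint itself.

```python
# Sketch: emit a torrc fragment with many HiddenServiceDir/HiddenServicePort
# pairs, all pointing at the same local web server. Directory layout and
# backend address are assumptions for illustration.

def torrc_fragment(n_services, backend="127.0.0.1:8080", base_dir="/var/lib/tor"):
    lines = []
    for i in range(1, n_services + 1):
        lines.append(f"HiddenServiceDir {base_dir}/hs_{i:04d}")
        lines.append(f"HiddenServicePort 80 {backend}")
        lines.append("")  # blank line between service stanzas
    return "\n".join(lines)

if __name__ == "__main__":
    print(torrc_fragment(3))
```

For weekly churn at this scale, tor's control-port ADD_ONION command (e.g. via stem's ephemeral hidden service API) might be more practical than regenerating torrc and reloading, since ephemeral services need no on-disk state.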

comment:8 Changed 5 years ago by dgoulet

Keywords: SponsorR removed
Points: large
Severity: Normal

Removing Sponsor R because this doesn't really fall under Sponsor R for now but rather under other funding.

comment:9 Changed 3 years ago by nickm

Keywords: nickm-cares added

comment:10 Changed 3 years ago by karsten

Owner: set to metrics-team
Status: new → assigned

comment:11 Changed 3 years ago by irl

Keywords: research-ideas added

comment:12 Changed 21 months ago by notirl

Component: Metrics/Analysis → Webpages/Research
Status: assigned → new
Summary: Run some onion services to observe crawling trends → Add research idea to Run some onion services to observe crawling trends
Type: project → task

comment:13 Changed 21 months ago by irl

Keywords: metrics-roadmap-2019-q2 added
Points: large → 1
Status: new → assigned

comment:14 Changed 9 months ago by gaba

Keywords: metrics-roadmap-2019-q2 removed

Releasing tickets from 2019 roadmap into the universe.

Note: See TracTickets for help on using tickets.