| HN Mirror



	by Hitton 2463 days ago
	Disclaimer: I have rather little experience with Golang and only skimmed the crawler code. From what I could see, the author made an effort to make the crawler distributed with k8s (which I don't think is needed, considering there are only approximately 75,000 onion addresses) using modern buzzword technology, but the crawler itself is rather simplistic. It doesn't even seem to index/crawl relative URLs, just absolute ones.
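To illustrate the relative-URL point: a minimal sketch of how a Go crawler could turn any href into an absolute URL before queueing it, using the standard `net/url` package. The helper name `resolveLink` and the example addresses are hypothetical and are not taken from the crawler's code.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveLink is a hypothetical helper: it resolves an href found on a page
// against that page's URL, so relative links become absolute ones.
func resolveLink(pageURL, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	// ResolveReference follows RFC 3986; absolute hrefs pass through unchanged.
	return base.ResolveReference(ref).String(), nil
}

func main() {
	page := "http://example.onion/docs/index.html" // placeholder address
	hrefs := []string{"about.html", "/contact", "../img/logo.png", "http://other.onion/"}
	for _, href := range hrefs {
		abs, err := resolveLink(page, href)
		if err != nil {
			fmt.Println("skipping", href, "because of", err)
			continue
		}
		fmt.Println(href, "->", abs)
	}
}
```

With this in place the crawler only ever needs to store and deduplicate absolute URLs.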

2 comments

etrain 2463 days ago

Assume 100 pages on each onion address (it’s probably power-law but let’s just assume that’s the mean). Latency with Tor is super high. Assume average of 5s to load a single page. This is generous because tail latency will probably dominate mean latency in this setting.

These things can happen in parallel but let’s also assume no more than 32 simultaneous TCP connections per host through a Tor proxy.

So we’re looking at ~75k * 100 * 5 / 32 seconds ≈ 14 days to run through all of them. You may not need to distribute this but there are situations (e.g. I want a fresh index daily) where it is warranted.


creekorful 2463 days ago

Author here. I'm fairly new to Golang too and it's my first project.

Regarding the number of onion addresses available you are wrong. Addresses are encoded in Base32 which means there are 32 characters available. So there are 32^16=1.208925819614629174706176×10^24 addresses available.

Not taken but available.

I agree that the crawler is really simplistic. But the project is new (2 months I think) and has to evolve. You can make a PR if you want to help me improve it!


jerf 2463 days ago

"Addresses are encoded in Base32 which means there are 32 characters available. So there are 32^16=1.208925819614629174706176×10^24 addresses available."

As a defense against the parent comment, though, this proves way too much. It doesn't matter how much k8s you throw at that, you're never going to so much as find your first site, if you're looking at the problem that way.

That's not really a relevant number here.


akklesed 2463 days ago

Offtopic nitpick:

>Addresses are encoded in Base32 which means there are 32 characters available. So there are 32^16=1.208925819614629174706176×10^24 addresses available.

I sorta understand what you mean, technically it's 32 characters per position (5 bits), and 16 positions. In v2 .onion addresses, that is.
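For reference, the v2 figure is just 32^16 = 2^(5*16) = 2^80. A small sketch with `math/big` reproduces the exact number; `addressSpace` is a hypothetical helper name used only for this example.

```go
package main

import (
	"fmt"
	"math/big"
)

// addressSpace returns alphabet^positions as an exact big integer.
func addressSpace(alphabet, positions int64) *big.Int {
	return new(big.Int).Exp(big.NewInt(alphabet), big.NewInt(positions), nil)
}

func main() {
	v2 := addressSpace(32, 16) // 32 symbols per position, 16 positions
	fmt.Println("possible v2 addresses:", v2)
	// For an exact power of two, BitLen()-1 is the exponent.
	fmt.Println("bits of address space:", v2.BitLen()-1)
}
```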

v3 ones [1] are 56 positions, but not all the bits are used for addressing, so the same formula wouldn't quite work to calculate the real theoretical capacity. IIRC someone already made a site which generates unlimited links to v3 addresses (without having them lead anywhere, of course).

[1] https://trac.torproject.org/projects/tor/wiki/doc/NextGenOni...


kodablah 2463 days ago

> IIRC someone already made a site which generates unlimited links to v3 addresses (without having them lead anywhere, of course)

V3 addresses are just ed25519 pub keys and a couple byte changes. You can use Go libraries like Bine [0] to generate as many V3 (or V2) addresses as you want from keys.

0 - https://godoc.org/github.com/cretz/bine/torutil#OnionService...


bluesign 2463 days ago

I think the 75000 figure comes from these stats [1].

[1] https://metrics.torproject.org/hidserv-dir-onions-seen.html


ValleZ 2463 days ago

I assure you that there are less than 10k unique onion addresses. This is a huge overkill to have a distributed system to crawl something this small.

edit: onion services, not addresses
