Share code that uses new URL Search tool and win AWS credit

	Share code that uses new URL Search tool and win AWS credit (commoncrawl.org)
	17 points by LisaG 4894 days ago

7 comments

djoerd 4894 days ago

While I know that some of the pages of my home page are in the crawl, they do not show up with the following query: http://urlsearch.commoncrawl.org/?q=nl.utwente.cs.wwwhome%2F... nor with: http://urlsearch.commoncrawl.org/?q=nl.utwente.cs.wwwhome%2F... (no, this is not only an ego search problem ;-) )

anjackson 4894 days ago

Yes, that's a little odd. If you shorten the search term, the results come up just fine: http://urlsearch.commoncrawl.org/?q=nl.utwente.cs.wwwhome/~h...

greglindahl 4894 days ago

Works for me!

http://urlsearch.commoncrawl.org/?q=com.pbm.www%2F~lindahl

So there's a bug there, but it doesn't always show up for ~.

LisaG 4894 days ago

I hope that some of you who use/play around with the Common Crawl data will try out using the JSON files from the URL Search and then share your code.

If you didn't see the details in the blog post, Common Crawl is giving out $100 in AWS credit to the first five people who share code that incorporates a JSON file from the URL Search.

visarga 4893 days ago

Is it possible to get a list of webhosts, like all the domains and subdomains, stripped from the rest of the url?
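One way to do this yourself, sketched in Python. This assumes you have already pulled the URLs out of the search results into a plain list; `unique_hosts` and the sample URLs are hypothetical and not part of any Common Crawl tool.

```python
from urllib.parse import urlsplit


def unique_hosts(urls):
    """Return the sorted set of hostnames found in an iterable of URLs."""
    hosts = set()
    for url in urls:
        # .hostname is lowercased and drops any port; it is None if no host is present.
        host = urlsplit(url).hostname
        if host:
            hosts.add(host)
    return sorted(hosts)


# Made-up sample input for illustration only.
sample = [
    "http://www.example.com/a/b.html",
    "http://www.example.com/c?x=1",
    "http://blog.example.org:8080/post",
]
print(unique_hosts(sample))
```

Paths, query strings, and ports are dropped, so each host appears once no matter how many of its pages were crawled.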

LisaG 4894 days ago

From @djoerd: Why does @CommonCrawl URL search (http://urlsearch.commoncrawl.org/) need 'tld.domain' format rather than 'domain.tld'? Read Google's BigTable paper.
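The short version of the BigTable argument: when keys are stored in sorted order, reversing the hostname puts all hosts of one domain next to each other. A minimal Python sketch of that effect follows; `reverse_host` is a hypothetical helper, and this is not the actual URL Search code.

```python
def reverse_host(host):
    """Reverse the dot-separated labels: 'www.google.com' -> 'com.google.www'."""
    return ".".join(reversed(host.lower().split(".")))


# Sample hostnames chosen for illustration.
hosts = [
    "www.google.com",
    "example.org",
    "maps.google.com",
    "wwwhome.cs.utwente.nl",
    "mail.google.com",
]

# After sorting, every google.com host sits in one contiguous run.
keys = sorted(reverse_host(h) for h in hosts)
for key in keys:
    print(key)
```

With natural-order hostnames the same sort would scatter `mail.`, `maps.`, and `www.` across the key space, so listing one domain would need a full scan instead of a range read.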

frederi 4894 days ago

Why can't they just write code that reverses the input?

srobertson 4894 days ago

The main intent of the search is to retrieve a list of URLs that the site crawled for a given subdomain, domain or TLD. So for now you can do that using reversed URL notation, which I admit is not very intuitive.

We're toying with the idea of implementing some sort of wildcard so that we can present the URLs in natural order. Something like *.google.com to retrieve all URLs under google. But we wanted to judge the level of interest first. After all, "done" is better than "perfect".
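A rough sketch of how such a wildcard could be mapped onto the existing reversed notation, in Python. The function names and sample keys are hypothetical, and the real service may handle this quite differently.

```python
def wildcard_to_prefix(pattern):
    """Turn a pattern like '*.google.com' into the reversed prefix 'com.google'."""
    if pattern.startswith("*."):
        pattern = pattern[2:]
    return ".".join(reversed(pattern.lower().split(".")))


def matches(reversed_key, prefix):
    """True if the host part of a reversed key falls under the prefix."""
    host = reversed_key.split("/", 1)[0]
    # Compare whole labels so 'com.google' does not match 'com.googleplex'.
    return host == prefix or host.startswith(prefix + ".")


# Made-up keys in the 'reversed.host/path' form used by the search box.
keys = [
    "com.google.www/intl/en/about",
    "com.google.maps/",
    "com.googleplex.www/",
    "org.example/",
]
prefix = wildcard_to_prefix("*.google.com")
print([k for k in keys if matches(k, prefix)])
```

The user types the natural form, and only the lookup key is reversed internally.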

LisaG 4894 days ago

"Done" is better than "perfect" should be on a sign hanging in every startup.

Aloisius 4894 days ago

Just an oversight. Most of our work is done by people who graciously volunteer their time. We'll get that fixed.

lubujackson 4894 days ago

I'd love it if there were a feature to search for a specific URL. For example, putting "com.google" in quotes would load just the Google homepage.

srobertson 4894 days ago

good suggestion

lubujackson 4894 days ago

Top results for "com" are a little odd. Seems like @ wasn't filtered from the domain part of the URL (though it should be, I would think).

srobertson 4894 days ago

I think it's just that the site converts Unicode URLs for display purposes. If you click on one of the links with "@" in it, you'll see the real URL in URL-encoded format:

http://%2E%2E%2E@harunyahya.com/Ajax/get.comments/oid/4612 176.34.181.212 20120516214328 text/html 912
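To see what that URL actually contains, here is a small Python check using only the standard library. It is one way of reading the record, not a description of how the site processes URLs: the text before "@" is the URL's userinfo component, and the host is whatever follows it.

```python
from urllib.parse import urlsplit, unquote

url = "http://%2E%2E%2E@harunyahya.com/Ajax/get.comments/oid/4612"
parts = urlsplit(url)

# userinfo (the part before '@') is returned still percent-encoded, so decode it.
userinfo = unquote(parts.username)
host = parts.hostname

print(userinfo)  # ...
print(host)      # harunyahya.com
```

So the "@" belongs to the URL itself rather than to the domain, and the domain part is still harunyahya.com.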

djoerd 4894 days ago

The first FAQ link seems to be broken (maybe a web server setting gone bad?) BTW, this is a great resource. Thanks for sharing this!

LisaG 4894 days ago

Thanks for the catch, Djoerd! We will fix it now.
