I hope that some of you who use or experiment with the Common Crawl data will try out the JSON files from the URL Search and then share your code.
If you didn't see the details in the blog post, Common Crawl is giving out $100 in AWS credit to the first five people who share code that incorporates a JSON file from the URL Search.
From @djoerd
Why does @CommonCrawl URL search (http://urlsearch.commoncrawl.org/ ) need 'tld.domain' format rather than 'domain.tld'? Read Google's BigTable paper.
The main intent of the search is to retrieve a list of the URLs crawled for a given subdomain, domain, or TLD. For now you can do that using reversed URL notation, which I admit is not very intuitive.
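For anyone wondering what the reversed notation buys you, here is a minimal sketch in Python. It is not the URL Search code; `reverse_host` and the sample host names are made up for illustration. The idea, as described in the Bigtable paper, is that reversing the host labels makes all hosts under one domain sort next to each other, so a lookup by domain or TLD becomes a simple prefix scan.

```python
def reverse_host(host: str) -> str:
    """Turn 'maps.google.com' into 'com.google.maps' (hypothetical helper)."""
    labels = host.lower().strip(".").split(".")
    return ".".join(reversed(labels))


# Sample hosts, invented for this example.
hosts = ["maps.google.com", "www.example.org", "google.com", "mail.google.com"]

# Once reversed and sorted, every google.com host is contiguous.
reversed_keys = sorted(reverse_host(h) for h in hosts)

for key in reversed_keys:
    print(key)
```

With natural-order host names, "mail.google.com" and "maps.google.com" would be separated by every other host starting with "m"; with reversed keys they sit together under the "com.google" prefix.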
We're toying with the idea of implementing some sort of wildcard so that we can present the URLs in natural order, something like *.google.com to retrieve all URLs under google.com. But we wanted to gauge the level of interest first. After all, "done" is better than "perfect".
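One way such a wildcard could work, sketched in Python under the assumption that it is just translated into a reversed-notation prefix behind the scenes. The function names `wildcard_to_prefix` and `matches` are hypothetical and not part of the URL Search.

```python
def wildcard_to_prefix(pattern: str) -> str:
    """Convert '*.google.com' (or 'google.com') to the reversed prefix 'com.google'."""
    host = pattern[2:] if pattern.startswith("*.") else pattern
    return ".".join(reversed(host.lower().split(".")))


def matches(reversed_key: str, prefix: str) -> bool:
    """True if the reversed host equals the prefix or is a subdomain of it."""
    # Require a label boundary so 'com.googleplex' does not match 'com.google'.
    return reversed_key == prefix or reversed_key.startswith(prefix + ".")


prefix = wildcard_to_prefix("*.google.com")
print(prefix)
print(matches("com.google.maps", prefix))
print(matches("com.googleplex", prefix))
```

The label-boundary check is the only subtle part: a plain string prefix test would also pick up unrelated domains that happen to start with the same characters.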
I think it's just that the site converts Unicode URLs for display purposes. If you click on one of the links with "@" in it, you'll see the real URL in URL-encoded format.
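To illustrate the difference between the displayed form and the URL-encoded form, here is a small Python sketch using the standard library. The example URL is invented, and I'm assuming the display simply decodes percent-escapes; I don't know exactly what the site does internally.

```python
from urllib.parse import quote, unquote

# Hypothetical URL as it might be shown on the results page.
displayed = "http://example.com/users/@alice/caf\u00e9"

# Percent-encode everything except ':' and '/', using UTF-8 for non-ASCII characters.
encoded = quote(displayed, safe=":/")
print(encoded)

# Decoding the percent-escapes gives back the displayed form.
print(unquote(encoded) == displayed)
```

Here "@" becomes "%40" and the accented character becomes a two-byte UTF-8 escape, which is the kind of difference you would see between the link text and the address bar.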