Ask HN: What can we do with 20B links from sites?
3 points by itsokaywelldoit 1025 days ago
Hello, geeks of the world. It's my first time writing here. A friend and I built a project that takes a list of domains such as www.wikipedia.org (just an example!), saves every link found in the HTML of each domain's start page, and sorts the links into types: images, internal site links, and so on. We had a list of about 200 million domains and parsed them all, which gave us 20 billion links, each paired with its source domain and a link type.
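
To make the extract-and-sort step concrete, here is a minimal sketch in Python using only the standard library. It is an illustration, not the project's actual code: the names `LinkCollector`, `classify_link` and `extract_links` are hypothetical, and the only link type taken from the post is `EXTERNAL_LINK`; the other labels are assumed. It also treats two URLs as internal only when their hostnames match exactly, which is a simplification.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Which attribute holds the URL for each tag we look at.
URL_ATTRS = {"a": "href", "img": "src", "script": "src", "link": "href"}


def classify_link(source_domain, destination_url, tag):
    """Return a coarse link type. Labels other than EXTERNAL_LINK are assumed."""
    if tag == "img":
        return "IMAGE"
    if tag in ("script", "link"):
        return "RESOURCE"
    dest_domain = urlparse(destination_url).netloc
    return "INTERNAL_LINK" if dest_domain == source_domain else "EXTERNAL_LINK"


class LinkCollector(HTMLParser):
    """Collects one record per URL-bearing tag found in a page."""

    def __init__(self, source_url):
        super().__init__()
        self.source_url = source_url
        self.source_domain = urlparse(source_url).netloc
        self.records = []
        self._current_anchor = None  # record of the <a> we are inside, if any

    def handle_starttag(self, tag, attrs):
        attr = URL_ATTRS.get(tag)
        value = dict(attrs).get(attr) if attr else None
        if not value:
            return
        # Resolve relative URLs against the page they were found on.
        destination_url = urljoin(self.source_url, value)
        record = {
            "source_url": self.source_url,
            "source_domain": self.source_domain,
            "destination_url": destination_url,
            "destination_domain": urlparse(destination_url).netloc,
            "link_type": classify_link(self.source_domain, destination_url, tag),
            "anchor_text": "",
        }
        self.records.append(record)
        if tag == "a":
            self._current_anchor = record

    def handle_data(self, data):
        # Text inside an open <a> becomes its anchor text.
        if self._current_anchor is not None:
            self._current_anchor["anchor_text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current_anchor is not None:
            text = self._current_anchor["anchor_text"].strip()
            self._current_anchor["anchor_text"] = text
            self._current_anchor = None


def extract_links(source_url, html):
    """Parse one page's HTML and return its link records."""
    collector = LinkCollector(source_url)
    collector.feed(html)
    collector.close()
    return collector.records
```

Fetching the page is left out on purpose; `extract_links` only needs the start page URL and its HTML as a string.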

We also found a service that provides newly registered domains and domains that recently died, so we can keep our database up to date. On top of that, we could generate regular reports on what changed, improve the sorting of the resulting links, and write custom processors for clients who want more data from sites that meet their criteria.

Our first idea was to query for links to WordPress plugins used on sites and generate reports for commercial plugin developers, with regular updates on who uses the plugin, who the new users are, and who stopped using it. But we haven't sent out many emails yet, so no answers so far.
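
A rough sketch of how such a plugin report could be computed from two snapshots of the link records, in Python. It assumes sites keep plugins under WordPress's default `/wp-content/plugins/<slug>/` path (sites that move that directory would be missed), and the function names `plugin_slug`, `plugin_users` and `usage_report` are hypothetical.

```python
import re

# Default WordPress plugin path; the slug is the directory name after it.
PLUGIN_PATH = re.compile(r"/wp-content/plugins/([^/?#]+)")


def plugin_slug(url):
    """Return the plugin slug a URL points into, or None."""
    match = PLUGIN_PATH.search(url)
    return match.group(1) if match else None


def plugin_users(records, slug):
    """Set of source domains with at least one link into the given plugin."""
    return {
        r["source_domain"]
        for r in records
        if plugin_slug(r["destination_url"]) == slug
    }


def usage_report(previous_records, current_records, slug):
    """Compare two snapshots: current users, new users, and dropped users."""
    before = plugin_users(previous_records, slug)
    now = plugin_users(current_records, slug)
    return {
        "current": sorted(now),
        "new": sorted(now - before),
        "stopped": sorted(before - now),
    }
```

At 20 billion rows this would of course be a database query rather than in-memory sets; the set differences just show the logic of the report.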

Example of the juicy part of our data:

{
  "source_url": "http://www.wikipedia.org",
  "source_domain": "www.wikipedia.org",
  "destination_url": "https://creativecommons.org/licenses/by-sa/4.0/",
  "destination_domain": "creativecommons.org",
  "link_type": "EXTERNAL_LINK",
  "anchor_text": "Creative Commons Attribution-ShareAlike License"
}

Please help us gather ideas about who could be interested in such data and what insights it could give. Some we thought of: leads from sites that use a competitor's product; sites that use your plugin, and which plugins are most often combined with yours; which sites are linked to most often from other sites; which sites have contact forms or "contact us" pages (and collecting those forms); Google Analytics and Google Ads usage; whether a site links to Google Play and/or the App Store; links to social media sites, and which social media accounts show up most often on home pages.

We're sitting on a well of information and we don't know what to look for in it or what people would find interesting. Damn, we could be doing graphs and maps, and instead we're just sitting here going "ehh, what would be interesting to people?" with a lot of "m"s, like that doggo.
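
Two of these ideas are simple enough to sketch in Python over records shaped like the example above. This is only an illustration under assumptions: the function names are hypothetical, and the app store host list (`play.google.com`, `apps.apple.com`) is a guess at a starting point, not an exhaustive set.

```python
from collections import Counter, defaultdict

# Assumed, non-exhaustive list of app store hosts.
APP_STORE_DOMAINS = {"play.google.com", "apps.apple.com"}


def most_referred(records, top_n=10):
    """Rank destination domains by how many distinct source domains link to them."""
    sources = defaultdict(set)
    for r in records:
        if r["link_type"] == "EXTERNAL_LINK":
            sources[r["destination_domain"]].add(r["source_domain"])
    counts = Counter({dest: len(srcs) for dest, srcs in sources.items()})
    return counts.most_common(top_n)


def sites_with_app_links(records):
    """Source domains whose start page links to an app store."""
    return {
        r["source_domain"]
        for r in records
        if r["destination_domain"] in APP_STORE_DOMAINS
    }
```

Counting distinct source domains, rather than raw links, keeps one site with hundreds of links to the same destination from dominating the ranking.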

Help us find ideas for what to do with this data and who to target.

If you want some of this data to do something fun with, write to us. We can put together the query and send you the results, no problem :)

Thanks a bunch! Looking forward to your comments

1 comments

How did you decide to mine this data specifically if you have trouble finding a market for it? What was the insight?
We had a client who needed a similar project, but for a small number of domains, and we had the list of domains 'just lying around', so we thought 'wait, if our client knows where to use it, then there's a market' and, well, wrote some more code and combined the two.
Sorry, I missed the point of the question. We went for links specifically because of that client. While working on his project we slowly began to realise that cross-referencing sites, finding links to competitors, and things like that might be useful to someone. Also, with links you can go deeper into a site and scrape the links from all of its pages.

And maaybe we fell too much in love with the concept, so now we're firmly stuck in it, finding reasons for our behaviour other than "woohoo, it's so fun to play with, and we have a justification to spend time here: it's gonna work for us!"

Some time ago I did the same thing. Doing something without knowing the full path can turn out to be worthless, so, having learned from my own mistakes, I think it's better to know exactly what you are going to do with it beforehand.