Full MongoDB database dump of the Blippex search engine (blippex.github.io)
23 points by karli 4708 days ago
4 comments

So it's a bunch of internet URLs without content or content metadata?
Seems that way. I mean, that's still valuable and interesting for other reasons, but let's not call it a "dump of a search engine". There's nothing in there that's actually searchable!

Still, nice gesture by Blippex. Somebody will find something interesting to do with this, even if they just use it educationally.

Yes, we will add metadata in the next dump, but currently the time_spent is all that matters.

Our plans are to add categories, the language and rough information about the content type (video, image, etc).

Gerald, CTO Blippex

> Last friday we reached the milestone of 50k searches per day. Today we are releasing as promised the first dump of our database.

I hope whatever they are releasing is not the user search query data.

Remember when AOL did that?

http://techcrunch.com/2006/08/06/aol-proudly-releases-massiv...

edit - okay, I think it's the index data.

so this is just when a given site was crawled?

  {
    "_id": "b919f02c8f053c41e8ee86311ca9b0f6",
    "url": "https://www.example.com/",
    "host": "www.example.com",
    "root": "example.com",
    "time_spent": [
      {
        "sec": 45,
        "seen_at": ISODate("2013-06-23T00:41:44.0Z")
      },
      {
        "sec": 5,
        "seen_at": ISODate("2013-07-01T14:41:44.0Z")
      }
    ]
  }

Hi,

Yes, as the blog post says, the only thing missing is the full text of each page, which is needed for indexing and searching. We don't dare release it because of copyright issues ("hey, you're distributing the full text of my page!").

With this data you could, for example, build a new Alexa and find out what the most visited page was last week :)

> With this data you could, for example, build a new Alexa and find out what the most visited page was last week :)

While what you're doing is interesting, and this data could shed some light on a lot of questions, you're putting the cart way before the horse here.

125k searches equates to, generously, 12.5k users? From the Chrome Web Store and Play Store it seems there are about 500 users from those.

A correct statement would be 'with this data you could, for example, find out what the most visited web page was from our subset of a subset of 12.5k users'.

That said, if you get a significant market share this could be very interesting. I'm guessing you don't always plan on providing dumps for free and will monetize them at some point?

Of course you are right. With such data you could build something like Alexa, but we are aware that it is currently a tiny subset of the web and does not represent much. It has potential if it grows.

Since we do not consider this data to be ours, we do not plan to charge anyone for the dumps.

Gerald, CTO Blippex

How would we figure out which page was visited the most last week? Are these crawl logs or access logs?

From blippex.org:

'Blippex is a search engine by the people, for the people. Individuals that have our browser extension installed tell us how long they stayed on a webpage.'

I haven't had a chance to download the dump yet but I'm assuming that the time points are time spent on the website by users.
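
If that reading is right, "most visited last week" is just a count of time_spent entries per URL inside a 7-day window. Here's a rough sketch, assuming each document looks like the sample posted upthread and that seen_at has already been parsed into a timezone-aware datetime. The function name, the sample records, and the "now" date are all made up for illustration; I haven't run this against the real dump.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def most_visited(docs, now, days=7, top=3):
    """Rank URLs by how many time_spent entries fall in the last `days` days."""
    cutoff = now - timedelta(days=days)
    visits = Counter()   # number of reported visits per URL
    seconds = Counter()  # total seconds reported per URL
    for doc in docs:
        for entry in doc.get("time_spent", []):
            if cutoff <= entry["seen_at"] <= now:
                visits[doc["url"]] += 1
                seconds[doc["url"]] += entry["sec"]
    return [(url, n, seconds[url]) for url, n in visits.most_common(top)]


def utc(*args):
    """Shorthand for a UTC datetime (helper for the sample data only)."""
    return datetime(*args, tzinfo=timezone.utc)


# Hypothetical records shaped like the sample document in the thread.
docs = [
    {"url": "https://www.example.com/", "time_spent": [
        {"sec": 45, "seen_at": utc(2013, 6, 23, 0, 41, 44)},
        {"sec": 5, "seen_at": utc(2013, 7, 1, 14, 41, 44)},
    ]},
    {"url": "https://www.example.org/", "time_spent": [
        {"sec": 10, "seen_at": utc(2013, 6, 30, 9, 0, 0)},
        {"sec": 20, "seen_at": utc(2013, 7, 1, 9, 0, 0)},
    ]},
]

ranking = most_visited(docs, now=utc(2013, 7, 2))
for url, n, total in ranking:
    print(url, n, total)
```

Swapping doc["url"] for doc["root"] would give a per-domain ranking instead, which is closer to what Alexa reports. Either way it only reflects people who have the extension installed.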

It would be crazy cool to get a real-time feed of browsed URLs (not this dump format). Kind of like the mythical Twitter fire-hose.

Nice idea, I think we should do that.

Gerald, CTO Blippex

This will be fun as a list of places to try out 0-day exploits on.