by asgard1024
4699 days ago
Actually, it's not so simple. Sure, basic crawling is easy; the hard part is extracting meaningful data from what you fetch. It's very easy to get stuck in a loop on many sites, for instance. Protocol violations abound: some sites serve binaries as text/html, for example. What I heard about a smaller search engine was that its web crawling was augmented with manually added rules for various sites, to keep junk from spoiling the database. Not a trivial task at all. Answering queries is IMHO algorithmically much better understood, because it's a constrained problem. But extracting information from the real world, with all the PHP and HTML "hackers", is not so easy.
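To make the loop and mislabeled-content problems concrete, here is a rough Python sketch of the kind of guards a crawler might use. Everything in it is an assumption for illustration: the names (`normalize_url`, `looks_binary`, `Frontier`), the limits, the magic-number list, and the `SITE_RULES` entry are all hypothetical, not how any particular search engine does it.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical limits; a real crawler would tune these per site.
MAX_DEPTH = 5
MAX_PAGES_PER_HOST = 1000

# Hypothetical manually maintained rules: path prefixes to skip per host.
SITE_RULES = {"example.com": ("/calendar/",)}

# A few well-known file signatures (PDF, PNG, GIF, ZIP, gzip, ELF).
BINARY_MAGIC = (b"%PDF-", b"\x89PNG", b"GIF8", b"PK\x03\x04", b"\x1f\x8b", b"\x7fELF")


def normalize_url(url: str) -> str:
    """Canonicalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url)
    # Sort query parameters and drop the fragment, which never reaches the server.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, query, ""))


def looks_binary(body: bytes) -> bool:
    """Sniff the payload instead of trusting a text/html Content-Type header."""
    head = body[:512]
    return head.startswith(BINARY_MAGIC) or b"\x00" in head


class Frontier:
    """Decides whether a URL is worth fetching, to avoid endless loops."""

    def __init__(self):
        self.seen = set()
        self.per_host = {}

    def should_fetch(self, url: str, depth: int) -> bool:
        key = normalize_url(url)
        parts = urlsplit(key)
        if depth > MAX_DEPTH or key in self.seen:
            return False
        # Apply the hand-written per-site rules before anything else is counted.
        if parts.path.startswith(SITE_RULES.get(parts.netloc, ())):
            return False
        if self.per_host.get(parts.netloc, 0) >= MAX_PAGES_PER_HOST:
            return False
        self.seen.add(key)
        self.per_host[parts.netloc] = self.per_host.get(parts.netloc, 0) + 1
        return True
```

Even this only catches the easy cases. Session IDs in paths, infinite calendars on hosts you haven't written a rule for, and text formats that aren't HTML all slip through, which is roughly why the manual rules pile up.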
It is also why innovation in search isn't moving as fast as it could be.
If Google opened up (unlimited) web API access to their search interface to, say, a large city for a year or two, people would really get a taste of what innovation in search looks like.
And of course it would be in Google's interest, because search as a platform or marketplace is where the future of Google really lies. All the other distractions that exist to defend the advertising empire, like Android, Chrome and YouTube, are really sideshows.