15 years ago, I could reasonably write a search engine. Myself. 1 person. In a few weeks (modulo bandwidth and server farm). I write a program that grabs a web page, and reads out keywords. Today, if I grab a web page, quite often, that web page has nothing except for JavaScript code. That code grabs the actual content from the server, lays it out, and animates it. To write a web search engine, I need to write a complete JavaScript library.
At the time, we were talking about developing all sorts of agents. Things that would shop for you. Things that would find parts for you. Things that would remember what web sites you visited, and let you search them. Things that would track where in a long set of pages you were (blog, comic, etc.), and let you keep reading from there. It happened for a while, and then it died when the web became too damn hard. Writing anything that can reasonably see and parse web pages now takes many, many web years. There are only four or five organizations with that kind of resources (WebKit, Mozilla, Opera, IE, and internally, Google). There are countless things we just didn't even imagine.
It's like the DMCA. You notice all the innovations that happen, but you miss all the innovations it made impossible.
>15 years ago, I could reasonably write a search engine.
No, 15 years ago you could reasonably write a search engine for 15 years ago. It would suck by today's standards.
You want to handle Javascript? Easy! There are plenty of tools to choose from now. Run a browser as your crawler, visit the sites, and read the generated source instead of the static source. Shove that into your 15-years-ago search engine, and there's no difference.
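A rough sketch of that pipeline, in Python, with assumptions stated up front: `keywords` stands in for the "15-years-ago" indexer, and `rendered_source` is a hypothetical fetch step that uses Selenium's `driver.page_source` to read the DOM after scripts have run. It needs Selenium and a local browser driver installed, and it does not wait for content that loads late, so a real crawler would need more care.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from typing import Callable


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def keywords(html: str, top: int = 5) -> list[str]:
    """The old-style indexer: most frequent words in the page text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z]{3,}", " ".join(parser.chunks).lower())
    return [word for word, _ in Counter(words).most_common(top)]


def rendered_source(url: str) -> str:
    """Hypothetical fetch step: returns the source after JavaScript has run."""
    from selenium import webdriver  # third-party, imported only when used

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()


def index_page(url: str, fetch: Callable[[str], str] = rendered_source) -> list[str]:
    """Fetch the generated source and hand it to the unchanged indexer."""
    return keywords(fetch(url))
```

The fetch step is a parameter so the indexer never needs to know whether the HTML came from a plain HTTP request or from a browser.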
>Things that would track where in a long set of pages you were
You mean bookmarks? Add a scroll %, assuming they're not nice enough to use anchor tags / IDs meaningfully, and you're golden.
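For what it's worth, the bookkeeping side of a bookmark that carries a scroll position is small. A minimal sketch in Python, where `Bookmark`, `ReadingPositions`, and the series names are all made up for illustration; reading the actual scroll position out of a browser is not shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Bookmark:
    url: str
    scroll: float = 0.0  # fraction of the page scrolled, 0.0 to 1.0


class ReadingPositions:
    """One bookmark per series; saving a new position replaces the old one."""

    def __init__(self) -> None:
        self._by_series: dict[str, Bookmark] = {}

    def save(self, series: str, url: str, scroll: float = 0.0) -> None:
        # Clamp so a bad reading never stores a position outside the page.
        clamped = min(max(scroll, 0.0), 1.0)
        self._by_series[series] = Bookmark(url, clamped)

    def resume(self, series: str) -> Optional[Bookmark]:
        """Return the last saved position, or None if the series is unknown."""
        return self._by_series.get(series)
```

Keying by series rather than by URL means there is nothing to delete when you move on to the next page.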
>Writing anything that can reasonably see and parse web pages...
has become a community effort, instead of a bunch of isolated silos where people reinvented the wheel out of necessity.
The resources required aren't so large just because browsers are so much more complex; they're large because browsers are so much faster, and you won't survive if you can't compete. How long did we languish with crappy Javascript engines? How much would you need to know to actively compete in that area alone now? It's easy to make a slow-but-functional browser, and if you looked around you'd see some people doing just that. Making a fast-and-resilient one is as hard as making a fast-and-resilient anything, especially where human input (i.e., HTML) is expected to be consumed.
> You mean bookmarks? Add a scroll %, assuming they're not nice enough to use anchor tags / IDs meaningfully, and you're golden.
Bookmarks in books work okay. You move them. Bookmarks in browsers don't. You have to remove the old one, add the new one, and the overall process is too cumbersome to be useful for the application I mentioned.
We actually built a site to solve that problem. If you have a series of pages (blog, comic, book, etc) and want to mark your place in them with a bookmark that moves as you read, try Serialist (https://serialist.net/).
As to the auto-updating bookmarks, would it resolve the issue if I made an extension to do that for you? I can see the use, honestly, and I like it. (seriously, I'm offering, and I'd probably use it myself. It'd be an interesting project. Even if it doesn't resolve the issue - we might just fundamentally disagree here, I'm OK with that.)
But why should that be part of the browser, when modern browsers allow you to do damn near anything by simply leveraging it? Why should we rely on browser makers to tell us what's possible, when we can do it ourselves, because of the changes in the past 15 years?
I'd love to see that extension. If you write it, I will use it. I use Chrome too, so it should work here.
As to what should and shouldn't be part of the browser -- the way to figure that out is experimentation and competition. When you make technologies and standards simple and easy, people will make independent implementations and try things. The vast majority will be dumb, but some (often unanticipated ones) will turn out to be useful, clever, or brilliant. That's how the technology improves.
When you make standards big and cumbersome, progress stops.
If you want to move a bookmark to a different place on a blog / content site, it is probably because you want to read new entries. RSS does this fairly well.
If you want to read through a site's archives, what I do is keep it open in a tab. It is restored when I reopen my browser, saved if I reboot, etc. It's not as handy as a bookmark, but it comes close.
With all the headless Webkit tools coming out nowadays (and all the free and fast JS engines like V8), writing a spider that runs a JS engine and clicks on all kinds of non-<a> elements is not beyond the reach of somebody innovative and motivated enough to create new kinds of spidering robots.
You won't need to write a complete JavaScript library. Look at all the testing suites that automate browser instances, Selenium being the most well-known.
15 years ago the thing we call a "web application" hardly existed. If a web page "has nothing except JavaScript" (e.g. Gmail), it is probably a web app, and indexing it makes little sense anyway. If someone misuses JS on a content site, that's another story.
And your comment about innovation makes no sense at all. Capabilities of modern browsers (Canvas, geolocation, local storage, offline apps, etc.) offer more opportunities for innovation than "old web" could even imagine.
I think you (and most people here) underestimate what the "old web" could imagine, though. We had all sorts of ideas for agents that would go out and grab and analyze data for us in all sorts of clever and interesting ways. Search engines got built, as did one or two other things, and then the web just got too complex.
Hell, even I had a simple app that went out and grabbed all my favorite comics and showed them to me, nicely formatted, and without ads.
You mean ad filtered RSS/Atom? I assume such a program would be much faster to write these days: have a set of newsfeeds, map() them with a filter function and merge the results.
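Something like the following, as a sketch in Python using only the standard library. It assumes RSS 2.0 feeds where every item carries a `pubDate`, and `is_ad` is a made-up heuristic; a real filter would depend on how a given feed marks its ads. Downloading the feeds is left out.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def parse_feed(rss_xml: str) -> list[dict]:
    """Turn one RSS 2.0 document into a list of item dicts."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            # Assumes pubDate is present and in the usual RFC 822 form.
            "published": parsedate_to_datetime(item.findtext("pubDate")),
        })
    return items


def is_ad(item: dict) -> bool:
    """Hypothetical heuristic: treat marked titles as advertisements."""
    return item["title"].lower().startswith(("sponsored", "[ad]"))


def merged(feeds: list[str]) -> list[dict]:
    """map() the feeds through the parser, filter, and merge newest first."""
    parsed = map(parse_feed, feeds)
    kept = (item for items in parsed for item in items if not is_ad(item))
    return sorted(kept, key=lambda item: item["published"], reverse=True)
```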
While the web gets more complex, the tools at hand get better. Much better.
Gmail's HTML view works fine in Links. That team has been showing competence and diligence that's increasingly rare, and I wish people wouldn't tar them with the same brush as the clowns who write js-only crap.
> At the time, we were talking about developing all sorts of agents. Things that would shop for you. Things that would find parts for you. Things that would remember what web sites you visited, and let you search them. Things that would track where in a long set of pages you were (blog, comic, etc.), and let you keep reading from there.
The drive toward semantic markup in HTML5 is supposed to help the web get back to those original ideals. Over time, we'll increasingly expect web developers to conform to a subset of possible HTML arrangements, much like book publishers conform to a subset of the possible random arrangements and orientations of letters on a page (odd poetry excepted).
Your premise that the web is somehow less effective because you can't scrape data from pages easily doesn't make much sense to me.
Have you taken a look recently at the plethora of web APIs for just about every purpose? The modern way of collecting machine-friendly data from a server is through APIs and semantic content (RDFa, microformats, etc.).
Not through HTML / CSS / Javascript formatted pages which are made primarily for human consumption.
Most people would gladly make it harder for a single person to write a search engine if, in return, it makes it easier for them to make good web pages and web apps.
I would blame poor/lazy devs inappropriately using JS rather than the evolution of the browser for this. For the average web page, it's unnecessary 90% of the time to require JavaScript for any core functionality (not so much with web applications). I have a hard time understanding why people do this as it's often much easier to test and develop when you're layering on JS unobtrusively.
Agreed that it's nearly impossible to generally parse web pages now, though if you're screen scraping it's still pretty easy (if not easier than before) to pull out data. Before you had to parse the DOM; now you can often get structured data via JSON APIs. It's more brittle, though.
I think he's saying that it makes scraping harder.
But today JS frameworks like jQuery give us the means to do anything we want javascript-related, in any browser that half-supports javascript. By deprecating IE7 they're just saying they're going to drop all of the extra hacks they had to use to keep IE7 working.
A lot of what newer browsers give us is just better rendering. You can replace a mess of tables and nested divs with things like border-radius, which means less client-side html to wade through.
Are you kidding? The semantic web and using more metadata is making it easier than ever. Nowadays, in many cases, you not only have the content, it is also tagged with microformats or RDFa.
Try looking at Freebase or DBpedia and tell me where you had such a huge amount of easily parsable, semantic content in the 90s.
More and more content is being taken entirely off the open web and siloed behind a server that talks an unstable proprietary protocol, with exactly one blob of javascript in existence that knows how to tunnel requests over HTTP to access shreds of that content and cram them into an utterly non-semantic DOM. We are hurtling backwards into the client-server hell the web had saved us from.
Yeah, I don't see that. I see more and more accessible APIs[1] and pages having more and more of an incentive to be semantic due to search engines now reading that data (hRecipe, for example).
Service architectures have also been moving from stuff like SOAP to REST, which is definitely more open and accessible.
And even Ajax-laden webpages are still just a Firebug Network tab away since they all run over HTTP, and then you have a nicely structured data format instead of having to deal with messy HTML pages.
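To make that concrete, here is a small Python sketch of consuming a response body copied from the network tab. The shape (a top-level `posts` list with `id` and `title` fields) is invented for the example; every site will have its own.

```python
import json


def extract_posts(body: str) -> list[dict]:
    """Pull id and title out of a captured JSON response body."""
    payload = json.loads(body)
    posts = []
    for entry in payload.get("posts", []):
        # Skip entries missing the fields we rely on instead of crashing,
        # since nothing guarantees the backend keeps this shape.
        if "id" in entry and "title" in entry:
            posts.append({"id": entry["id"], "title": entry["title"]})
    return posts
```

No DOM parsing is involved, although the code is only as stable as the response format behind it.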
A JSON (or SOAP) backend is only usable by third parties if its API is kept stable. There are far too many devs who redesign their backend request and response formats at the drop of a hat because they think their js client is the only one that matters (a self-fulfilling prophecy) and they can replace it simultaneously. And their responses tend to look like "here's some more markup to stuff into an arbitrary location in the DOM we're using today", not semantically structured (e.g., Rails now has this built into JavaScriptGenerator). A given site can be reverse-engineered, but anything built on that is going to be fragile and short-lived, much more so than when the typical visual rendering desired for a page determined its structure.