| HN Mirror


	by knicholes 822 days ago
	This seems pretty simple to do. Search the HTML of the main page for anchor tags. Add the links in those tags to an array as your exploration frontier. Once done parsing that HTML, load the next link. Add deduplication to avoid loops and just run a depth-first search. What am I missing?
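For reference, the approach described above might look roughly like the sketch below. It is not taken from any particular crawler. The names `AnchorCollector`, `crawl`, `fetch`, and `max_pages` are made up for illustration, and `fetch` is assumed to be any callable that returns a page's HTML as a string, or `None` when the page should be skipped.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class AnchorCollector(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def crawl(start_url, fetch, max_pages=100):
    """Depth-first crawl. `fetch(url)` is an assumed callable that
    returns HTML text, or None if the page cannot be loaded."""
    seen = {start_url}        # deduplication, so link cycles terminate
    frontier = [start_url]    # used as a stack, which makes this depth-first
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.pop()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = AnchorCollector()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links and drop "#fragment" suffixes so the
            # same page is not queued twice under different spellings.
            link, _fragment = urldefrag(urljoin(url, href))
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited
```

Passing `fetch` in as a parameter keeps the sketch independent of any HTTP library. A real crawler would presumably also restrict itself to one host and respect robots.txt.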

2 comments

somethingAlex 822 days ago

For brochure / static content sites this is definitely the beginnings of a web crawler, but it can be a lot trickier for web apps.

For example, clicking a link which loads some data, then clicking edit (which isn't even an anchor), typing in & clicking stuff, then clicking the save button (don't click the cancel button!) would not be an interaction that would get picked up with your suggestion. Detecting loops becomes much more ambiguous and backtracking to get all the permutations of interactions becomes a whole other problem to solve.

dns_snek 822 days ago

In many web apps there are going to be buttons and links that are not represented as <a> tags. You would realistically have to enumerate everything that has any kind of event handler attached, since any of them could potentially trigger an API call.
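As a rough illustration of how much wider that net is, here is a sketch that scans static markup for likely interaction targets rather than just anchors. The tag list and the names `InteractiveCollector` and `find_interactive` are assumptions made for this example. It only sees inline `on*` attributes and `role="button"`, so handlers attached from script via `addEventListener` stay invisible to it; finding those would need a real browser.

```python
from html.parser import HTMLParser

# Tags treated as interactive by default (an illustrative choice).
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}


class InteractiveCollector(HTMLParser):
    """Records elements that look clickable or editable in static HTML."""

    def __init__(self):
        super().__init__()
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # Inline handlers such as onclick="..." or onchange="..."
        inline_handlers = sorted(
            name for name in attr_map if name.startswith("on")
        )
        if (
            tag in INTERACTIVE_TAGS
            or inline_handlers
            or attr_map.get("role") == "button"
        ):
            self.candidates.append((tag, attr_map.get("id"), inline_handlers))


def find_interactive(html):
    """Return (tag, id, inline_handlers) for each candidate element."""
    parser = InteractiveCollector()
    parser.feed(html)
    return parser.candidates
```

Even this short list already includes elements like a "cancel" button that a crawler has to decide whether to click, which is where the ordering and backtracking problems start.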

You would also have to fill and submit forms with valid and invalid data. You would have to toggle checkboxes, change radio buttons, click buttons (e.g. "Apply filters" after changing values in a product filter section), and generally go through many combinations of inputs to find all valid parameters and possible responses.
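To give a feel for how quickly those combinations multiply, here is a small sketch that enumerates every combination of candidate values for a form. The field names and values are hypothetical, and `input_combinations` is a made-up helper, not part of any crawling library.

```python
from itertools import product


def input_combinations(fields):
    """Yield one dict per combination of candidate values.

    `fields` maps a field name to the list of values to try for it,
    including deliberately invalid ones.
    """
    names = list(fields)
    for values in product(*(fields[name] for name in names)):
        yield dict(zip(names, values))


# Hypothetical product filter form: one checkbox, one radio group,
# and one text input with valid and invalid values.
filter_form = {
    "in_stock": [True, False],
    "sort": ["price", "rating"],
    "max_price": ["", "50", "-1", "abc"],
}

combos = list(input_combinations(filter_form))
print(len(combos))  # 2 * 2 * 4 = 16 submissions for one small form
```

The count is the product of the number of values per field, so each additional field multiplies the number of submissions needed, before even considering the order in which controls are used.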