by qrv3w 2232 days ago
This is great! It's a wonderful write-up.

I've also made something almost identical: Go libraries for scraping recipe ingredients [1] and instructions [2]. Instead of the LCA method used here, my version scores each HTML tag and looks for the longest sequence of the highest-scoring tags; those are taken to be the "ingredients" or "instructions". It works very well (although I think this one works better).
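
To illustrate the "longest sequence of highest-scoring tags" idea, here is a minimal sketch. It is not the code from [1] or [2]; the function name longestRun, the threshold, and the sample scores are all hypothetical, and it assumes each HTML element has already been given an integer score in document order.

```go
package main

import "fmt"

// longestRun returns the half-open range [start, end) of the longest
// contiguous run of scores that are all >= threshold. On ties the
// earliest run wins. If no score qualifies it returns (0, 0).
// Hypothetical helper for illustration only.
func longestRun(scores []int, threshold int) (start, end int) {
	runStart := -1 // index where the current run began, -1 if not in a run
	for i, s := range scores {
		if s < threshold {
			runStart = -1
			continue
		}
		if runStart < 0 {
			runStart = i
		}
		// Keep the current run only if it is strictly longer than the best so far.
		if i+1-runStart > end-start {
			start, end = runStart, i+1
		}
	}
	return start, end
}

func main() {
	// Made-up scores for consecutive HTML elements in document order.
	scores := []int{0, 1, 3, 3, 2, 0, 3, 3, 3, 3, 1, 0}
	start, end := longestRun(scores, 2)
	fmt.Println(start, end) // prints "6 10"
}
```

The elements in the returned range would then be treated as the ingredient (or instruction) list. A real scraper would probably tolerate an occasional low-scoring element inside a run; this sketch does not.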

As the article mentioned, I found that the heuristics for finding HTML elements with ingredients turn out to be surprisingly simple: such elements usually contain just a number, a measurement, and a food! This simple heuristic worked better than the more sophisticated things I tried.
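
A rough sketch of that number + measurement + food heuristic, for illustration only. This is not the scoring used in [1]; the name scoreLine and the tiny word lists are assumptions, and a real library would need far larger vocabularies.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// numberRe matches simple quantities such as "2", "1.5" or "1/2".
var numberRe = regexp.MustCompile(`^\d+([./]\d+)?$`)

// Tiny hypothetical vocabularies, for illustration only.
var measurements = map[string]bool{"cup": true, "cups": true, "tbsp": true, "tsp": true, "g": true, "oz": true}
var foods = map[string]bool{"flour": true, "sugar": true, "butter": true, "salt": true, "eggs": true}

// scoreLine returns 0..3: one point each if the text contains
// a number, a measurement word, and a food word.
func scoreLine(line string) int {
	var hasNumber, hasMeasure, hasFood bool
	for _, w := range strings.Fields(strings.ToLower(line)) {
		w = strings.Trim(w, ",.()") // strip surrounding punctuation
		switch {
		case numberRe.MatchString(w):
			hasNumber = true
		case measurements[w]:
			hasMeasure = true
		case foods[w]:
			hasFood = true
		}
	}
	score := 0
	for _, hit := range []bool{hasNumber, hasMeasure, hasFood} {
		if hit {
			score++
		}
	}
	return score
}

func main() {
	fmt.Println(scoreLine("2 cups flour"))                // prints 3
	fmt.Println(scoreLine("Subscribe to our newsletter")) // prints 0
}
```

Scores like these could be computed for the text of each HTML element, so that ingredient lists stand out as clusters of high-scoring elements.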

[1]: https://github.com/schollz/ingredients

[2]: https://github.com/schollz/instructions