by qrv3w
2232 days ago
This is great! It's a wonderful write-up. I've also made something almost identical: a pair of Go libraries for scraping recipe ingredients [1] and instructions [2]. Instead of the LCA method here, my version tries to find the longest sequence of highest-scoring HTML tags and treats those as the "ingredients" or "instructions". It works very well (although I think this one works better). Like the article mentioned, I found that the heuristics for finding HTML elements with ingredients turn out to be surprisingly simple: they usually include just a number, a measurement, and a food! This simple heuristic worked better than the more sophisticated things I tried.

[1]: https://github.com/schollz/ingredients
[2]: https://github.com/schollz/instructions
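
To make the number/measurement/food idea concrete, here is a rough sketch in Go. It is not the code from the libraries linked above. The word lists, `scoreLine`, and `longestRun` are all hypothetical names made up for illustration, and it assumes the text of each HTML tag has already been extracted into a slice of strings.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Hypothetical, tiny word lists; a real scraper would need far larger ones.
var measures = map[string]bool{
	"cup": true, "cups": true, "tbsp": true, "tsp": true,
	"g": true, "kg": true, "ml": true, "oz": true, "lb": true,
}
var foods = map[string]bool{
	"flour": true, "sugar": true, "butter": true, "salt": true,
	"milk": true, "egg": true, "eggs": true,
}

var (
	numberRe = regexp.MustCompile(`[0-9¼½¾]`)
	wordRe   = regexp.MustCompile(`[a-z]+`)
)

// scoreLine returns 0 to 3: one point each for containing
// a number, a measurement word, and a food word.
func scoreLine(text string) int {
	score := 0
	if numberRe.MatchString(text) {
		score++
	}
	hasMeasure, hasFood := false, false
	for _, w := range wordRe.FindAllString(strings.ToLower(text), -1) {
		if measures[w] {
			hasMeasure = true
		}
		if foods[w] {
			hasFood = true
		}
	}
	if hasMeasure {
		score++
	}
	if hasFood {
		score++
	}
	return score
}

// longestRun returns the start (inclusive) and end (exclusive) of the
// longest consecutive run of lines scoring at least minScore.
// It returns (0, 0) when no line qualifies.
func longestRun(lines []string, minScore int) (int, int) {
	bestStart, bestEnd := 0, 0
	start := -1
	for i := 0; i <= len(lines); i++ {
		if i < len(lines) && scoreLine(lines[i]) >= minScore {
			if start < 0 {
				start = i
			}
			continue
		}
		// The current run (if any) ends here; keep it if it is the longest.
		if start >= 0 && i-start > bestEnd-bestStart {
			bestStart, bestEnd = start, i
		}
		start = -1
	}
	return bestStart, bestEnd
}

func main() {
	lines := []string{
		"My grandmother's pancakes",
		"2 cups flour",
		"1 tbsp sugar",
		"3 eggs",
		"Mix everything and fry.",
	}
	s, e := longestRun(lines, 2)
	fmt.Println(lines[s:e])
}
```

The threshold of 2 is an arbitrary choice for this sketch: it lets a line like "3 eggs" through even though it has no measurement word.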