by savichmx 671 days ago
Usually, scraping tools need you to point out where the content and other metadata are located. My parser is universal and works with every site out of the box. It automatically figures out where the crucial information is located and then tries to parse it.
1 comment

Can you elaborate on how it does that? My knee-jerk reaction is that it's an LLM API call, which, if true, would make me immediately suspicious (so I guess don't elaborate unless it isn't that lol)
Right now my parser uses a combination of open-source parsers and merges the best results they produce. These parsers take different approaches: some have hardcoded patterns and keywords that they search for in the DOM structure, and some use their own ML classification models. As for LLMs, I plan to try them too, at least for websites that cannot be parsed with existing tools. I am also thinking about creating my own ML model trained on a large number of HTML files, but that option is too expensive for me so far.
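
For readers wondering what "run several parsers and keep the best result" could look like, here is a minimal sketch. It is not the author's code: the four toy extractors are hypothetical regex-based stand-ins for real open-source parsers, and the merge rules (majority vote for short fields, longest candidate for body text) are assumptions chosen only for illustration.

```python
import re
from collections import Counter
from typing import Callable, Dict, List

# An extractor takes raw HTML and returns whatever fields it managed to find.
Extractor = Callable[[str], Dict[str, str]]


def _strip_tags(fragment: str) -> str:
    """Replace tags with spaces and collapse runs of whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", fragment)
    return re.sub(r"\s+", " ", no_tags).strip()


# Hypothetical stand-ins for real parsers; each uses a different heuristic.
def title_tag_extractor(html: str) -> Dict[str, str]:
    m = re.search(r"<title[^>]*>(.*?)</title>", html, re.I | re.S)
    return {"title": _strip_tags(m.group(1))} if m else {}


def h1_extractor(html: str) -> Dict[str, str]:
    m = re.search(r"<h1[^>]*>(.*?)</h1>", html, re.I | re.S)
    return {"title": _strip_tags(m.group(1))} if m else {}


def article_extractor(html: str) -> Dict[str, str]:
    m = re.search(r"<article[^>]*>(.*?)</article>", html, re.I | re.S)
    return {"text": _strip_tags(m.group(1))} if m else {}


def paragraph_extractor(html: str) -> Dict[str, str]:
    paras = re.findall(r"<p[^>]*>(.*?)</p>", html, re.I | re.S)
    text = " ".join(_strip_tags(p) for p in paras)
    return {"text": text} if text else {}


def combine(html: str, extractors: List[Extractor]) -> Dict[str, str]:
    """Run every extractor and merge their candidates field by field."""
    candidates: Dict[str, List[str]] = {}
    for extract in extractors:
        try:
            result = extract(html)
        except Exception:
            continue  # one failing parser should not sink the whole run
        for field, value in result.items():
            if value:
                candidates.setdefault(field, []).append(value)

    merged: Dict[str, str] = {}
    for field, values in candidates.items():
        if field == "text":
            # Assumed rule: the longest body text is the most complete one.
            merged[field] = max(values, key=len)
        else:
            # Assumed rule: the value most extractors agree on wins.
            merged[field] = Counter(values).most_common(1)[0][0]
    return merged


EXTRACTORS: List[Extractor] = [
    title_tag_extractor,
    h1_extractor,
    article_extractor,
    paragraph_extractor,
]
```

Calling `combine(page_html, EXTRACTORS)` returns a single dict such as `{"title": ..., "text": ...}`. In a real setup each toy function would be replaced by a thin wrapper around an actual extraction library, and the merge rules would probably need per-field scoring rather than simple length or vote counts.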