Hacker News
Show HN: I created a tool that allows you to parse web pages into structured data (rapture-parser.com)
2 points by savichmx 671 days ago
Hi everyone. I've built a tool, "Rapture Parser", that lets you parse and extract useful information from any web page. You can use it through the UI or through a REST API (in case you need to integrate it with your system). For now it's a very raw first version, but I plan to extend it with many new features and make it more powerful, so that it can parse absolutely any web page or file, even one that is behind a paywall or not in a parsable format.

I would really love to get your feedback and hear your ideas, so I can improve my parser and make it easier to use.

Thank you!

2 comments

How does this differ from just scraping a site?
Usually, with scraping tools you need to specify where the content and other metadata are located. My parser is universal and works with every site out of the box. It automatically detects where the crucial information is located and then tries to parse it.
Can you elaborate on how it does that? My knee-jerk reaction is an LLM API call which, if true, would make me immediately suspicious (so I guess don't elaborate unless it isn't that lol)
Right now my parser uses a combination of open-source parsers and merges the best results they produce. These parsers take different approaches: some have hardcoded patterns and keywords that they search for in the DOM structure, and some use their own ML classification models. As for LLMs, I plan to try them too, at least for websites that cannot be parsed with existing tools. I am also thinking about creating my own ML model trained on a huge number of HTML files (but this option is too expensive for me so far).
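To make the "combine several parsers" idea concrete, here is a minimal sketch of one way such merging could work. Everything in it is an assumption for illustration: the three regex extractors are hypothetical stand-ins (not the open-source parsers the tool actually uses), and the merging rule (majority vote per field, longest value on a tie) is just one plausible choice.

```python
import re
from collections import Counter
from typing import Callable, Dict, List, Optional

# An extractor takes raw HTML and returns a dict of fields; a value may be None.
Extractor = Callable[[str], Dict[str, Optional[str]]]


# Hypothetical stand-ins for real parsers, each using a different signal.
def title_tag_extractor(html: str) -> Dict[str, Optional[str]]:
    m = re.search(r"<title[^>]*>(.*?)</title>", html, re.I | re.S)
    return {"title": m.group(1) if m else None}


def h1_extractor(html: str) -> Dict[str, Optional[str]]:
    m = re.search(r"<h1[^>]*>(.*?)</h1>", html, re.I | re.S)
    return {"title": m.group(1) if m else None}


def og_extractor(html: str) -> Dict[str, Optional[str]]:
    m = re.search(r'<meta\s+property="og:title"\s+content="([^"]*)"', html, re.I)
    return {"title": m.group(1) if m else None}


def merge_results(results: List[Dict[str, Optional[str]]]) -> Dict[str, str]:
    """Assumed merging rule: per field, take the most common non-empty value;
    on a tie, prefer the longest candidate."""
    merged: Dict[str, str] = {}
    fields = {key for result in results for key in result}
    for field in sorted(fields):
        values = [r[field].strip() for r in results if r.get(field)]
        values = [v for v in values if v]
        if not values:
            continue  # no extractor found this field
        counts = Counter(values)
        top = max(counts.values())
        candidates = [v for v, c in counts.items() if c == top]
        merged[field] = max(candidates, key=len)
    return merged


def parse(html: str, extractors: List[Extractor]) -> Dict[str, str]:
    """Run every extractor, skip the ones that crash, and merge the rest."""
    results = []
    for extractor in extractors:
        try:
            results.append(extractor(html))
        except Exception:
            continue  # one failing parser should not break the whole run
    return merge_results(results)


EXTRACTORS: List[Extractor] = [title_tag_extractor, h1_extractor, og_extractor]
```

Calling `parse(html, EXTRACTORS)` on a page returns a single dict with one value per field. A real setup would presumably wrap actual parsing libraries behind the same `Extractor` signature and cover more fields (author, date, body text).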
How can pages be parsed when behind a paywall?

I would hope there aren't many worthwhile pages with "display: none" for paywalled content.

I'm not sure it will work in 100% of cases, but the idea is to buy subscriptions to the most popular paid services and use them to read the content.