by zulko 656 days ago

Same experience here. I've been building a classical music database [1] where historical events and composer life events are scraped from Wikipedia by asking ChatGPT to extract lists of `[{event, year, location}, ...]` from biographies.
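As a rough illustration of handling that `[{event, year, location}, ...]` output, here is a minimal sketch that assumes the model replies with a JSON array. The `Event` class, the `parse_events` helper, and the sample reply are hypothetical and not taken from the linked repository.

```python
import json
from dataclasses import dataclass


@dataclass
class Event:
    event: str
    year: int
    location: str


def parse_events(raw: str) -> list[Event]:
    """Turn a JSON array returned by the model into typed records.

    Items that are missing a field or have a non-numeric year are skipped,
    since LLM output is not guaranteed to follow the requested schema.
    """
    events = []
    for item in json.loads(raw):
        try:
            events.append(
                Event(
                    event=str(item["event"]),
                    year=int(item["year"]),
                    location=str(item.get("location") or ""),
                )
            )
        except (KeyError, TypeError, ValueError):
            continue  # malformed item: drop it rather than guess
    return events


# Hypothetical model reply: the year arrives once as a number, once as a
# string, and the last item is incomplete.
SAMPLE_REPLY = """[
  {"event": "Born", "year": 1770, "location": "Bonn"},
  {"event": "Moved to Vienna", "year": "1792", "location": "Vienna"},
  {"event": "Undated anecdote"}
]"""

events = parse_events(SAMPLE_REPLY)
```

Skipping malformed items silently is a design choice for the sketch; a real pipeline might log them or re-prompt instead.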

- Using chatgpt-mini was the only cheap option; it worked well (although I have a feeling it's being dumbed down these days) and made the project virtually free.

- Just extracting the webpage text from the HTML with `BeautifulSoup(html).text` slashes the number of tokens (but can be risky when dealing with complex tables).

- At some point I needed to scrape ~10,000 pages that have the same format and it was much more efficient speed-wise and price-wise to provide ChatGPT with the HTML once and say "write some python code that extracts data", then apply that code to the 10,000 pages. I'm thinking a very smart GPT-based web parser could do that, with dynamically generated scraping methods.

- Finally, because this article mentions tables: Pandas has a very nice function, `pd.read_html("http://the-website.com")`, that will detect and parse all tables on a page. But the article does a good job of pointing at websites where the method would fail because the tables don't use `<table>` elements.
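The text-extraction point in the list above can be sketched roughly as follows. This assumes the `beautifulsoup4` package is installed; the `html_to_text` helper and the sample page are made up for illustration. It uses `get_text()` (the method behind `.text`) and additionally drops `<script>` and `<style>` tags, which is an extra step not mentioned in the comment.

```python
from bs4 import BeautifulSoup

# Made-up page standing in for a scraped biography.
SAMPLE_HTML = """
<html><head><style>p {color: red}</style><script>var x = 1;</script></head>
<body><h1>Example Composer</h1>
<p>Born in <b>Vienna</b> in 1800.</p></body></html>
"""


def html_to_text(html: str) -> str:
    """Reduce an HTML page to its visible text before sending it to an LLM."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove script and style elements so their contents never reach the model.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Join the remaining text fragments with single spaces.
    return soup.get_text(separator=" ", strip=True)


text = html_to_text(SAMPLE_HTML)
```

As the comment warns, flattening to text discards table structure, so pages with complex tables may need a different treatment.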

[1] https://github.com/Zulko/composer-timelines

davidsojevic 656 days ago

If you haven't considered it, you can also use the direct wikitext markup, from which the HTML is derived.

Depending on how you use it, the wikitext may or may not be more ingestible if you're passing it through to an LLM anyway. You may also be able to pare it down by heading/section, so that you reduce it to only the sections that are likely to be relevant (e.g. "Life and career").
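A minimal sketch of that section filtering, assuming top-level wikitext headings of the form `== Title ==`. The `split_sections` helper and the sample article are hypothetical; real wikitext has more edge cases (templates, comments, headings inside `<nowiki>`), so a dedicated parser may be safer.

```python
import re

# Matches level-2 headings only; "=== Sub ===" lines stay in their parent section.
HEADING_RE = re.compile(
    r"^==[ \t]*([^=\n][^\n]*?)[ \t]*==[ \t]*$", re.MULTILINE
)


def split_sections(wikitext: str) -> dict[str, str]:
    """Map each level-2 heading to its body; the lead text is stored under ""."""
    matches = list(HEADING_RE.finditer(wikitext))
    lead_end = matches[0].start() if matches else len(wikitext)
    sections = {"": wikitext[:lead_end].strip()}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(wikitext)
        sections[match.group(1)] = wikitext[match.end():end].strip()
    return sections


# Made-up article text for illustration.
SAMPLE_WIKITEXT = """'''Example Composer''' was a composer.

== Life and career ==
Born in 1800.

=== Early years ===
Studied violin.

== Works ==
Nine symphonies.

== References ==
<references />
"""

sections = split_sections(SAMPLE_WIKITEXT)
relevant = sections["Life and career"]  # only this part would go to the LLM
```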

You can also download full dumps [0] from Wikipedia and query them via SQL to make your life easier if you're processing them.

[0] https://en.wikipedia.org/wiki/Wikipedia:Database_download#Wh...?

zulko 656 days ago

> reduce it to only sections that are likely to be relevant (eg. "Life and career")

True, but I also managed to do this from the HTML. I tried getting a page's wikitext through the API but couldn't find out how.

Just querying the HTML page was less friction and fast enough that I didn't need a dump (although when AI becomes cheap enough, there are probably a lot of things to do with a Wikipedia dump!).

One advantage of using online Wikipedia instead of a dump is that I have a pipeline on GitHub Actions where I just enter a composer name and it automagically scrapes the web and adds the composer to the database (takes exactly one minute from the click of the button!).

distances 656 days ago

Wikipedia's api.php supports JSON output, which probably already helps quite a bit. For example https://en.wikipedia.org/w/api.php?action=query&prop=extract...
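For reference, a hedged sketch of building such api.php requests with the standard library. The parameter sets below are my own assumptions based on the MediaWiki API (`prop=extracts` with `explaintext` for plain text, `action=parse` with `prop=wikitext` for raw markup); they are not a reconstruction of the truncated link above. The helper names and the User-Agent string are hypothetical.

```python
import json
import urllib.request
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def extract_url(title: str) -> str:
    """URL asking for a plain-text extract of a page, returned as JSON."""
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,
        "format": "json",
        "titles": title,
    }
    return f"{API_URL}?{urlencode(params)}"


def wikitext_url(title: str) -> str:
    """URL asking for the raw wikitext of a page, returned as JSON."""
    params = {
        "action": "parse",
        "page": title,
        "prop": "wikitext",
        "format": "json",
    }
    return f"{API_URL}?{urlencode(params)}"


def fetch_json(url: str) -> dict:
    """Fetch a URL and decode the JSON body (performs a network request)."""
    request = urllib.request.Request(
        url,
        # Placeholder User-Agent; replace with your own contact details.
        headers={"User-Agent": "composer-timeline-example/0.1"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


url = extract_url("Ludwig van Beethoven")
```

Calling `fetch_json(url)` would return the decoded response; its exact layout depends on the chosen action and format version, so inspect it before indexing into it.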

zulko 656 days ago

Oooh, I had missed that, thanks!

iudqnolq 655 days ago

This doesn't directly address your issue, but since this caused me some pain I'll share that if you want to parse structured information from Wikipedia infoboxes, the npm module wtf_wikipedia works.
