by zulko
656 days ago
Same experience here. I've been building a classical music database [1] where historical and composer life events are scraped off Wikipedia by asking ChatGPT to extract lists of `[{event, year, location}, ...]` from biographies.
- Using chatgpt-mini was the only cheap option. It worked well (although I have a feeling it's getting dumber these days) and made the project virtually free.
- Just extracting the webpage text from the HTML with `BeautifulSoup(html).text` slashes the number of tokens (but can be risky when dealing with complex tables).
- At some point I needed to scrape ~10,000 pages that share the same format, and it was much more efficient, speed-wise and price-wise, to give ChatGPT the HTML once and say "write some python code that extracts data", then apply that code to the 10,000 pages. I'm thinking a very smart GPT-based web parser could do that, with dynamically generated scraping methods.
- Finally, because this article mentions tables: Pandas has a very nice function, `pd.read_html("http://the-website.com")`, that will detect and parse all tables on a page. But the article does a good job of pointing at websites where that method would fail because the tables don't use `<table>` elements.

[1] https://github.com/Zulko/composer-timelines
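To make the token-saving and "write the parser once" points concrete, here is a rough sketch. It uses only the standard library's `html.parser` as a stand-in for BeautifulSoup, and the page format (`<li class="event" data-year="...">`), the `extract_events` function, and the sample pages are all made up for illustration. In practice `extract_events` would be whatever code the LLM writes after seeing one sample page.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    """Strip markup so far fewer tokens are sent to the model."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical shared page format: one <li class="event"> per life event.
EVENT_RE = re.compile(r'<li class="event" data-year="(\d{4})">(.*?)</li>', re.S)


def extract_events(html: str) -> list:
    """Stand-in for scraping code generated once, then reused on every page."""
    return [
        {"year": int(year), "event": html_to_text(body)}
        for year, body in EVENT_RE.findall(html)
    ]


SAMPLE_HTML = """
<html><head><style>p {color: red}</style><script>var x = 1;</script></head>
<body><div class="bio"><p>Born in <b>Bonn</b> in 1770.</p></div></body></html>
"""

PAGES = [
    '<ul><li class="event" data-year="1770">Born in <b>Bonn</b></li></ul>',
    '<ul><li class="event" data-year="1792">Moved to Vienna</li></ul>',
]

plain_text = html_to_text(SAMPLE_HTML)
# No LLM call per page: the same extractor runs over every page.
all_events = [event for page in PAGES for event in extract_events(page)]
```

The trade-off is the one mentioned above: flattening to text throws away table structure, and a generated extractor only works as long as every page really does follow the same layout.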
|
Depending on how you use it, the wikitext may or may not be more ingestible if you're passing it through to an LLM anyway. You may also be able to pare it down by heading/section so that you only pass along the sections that are likely to be relevant (e.g. "Life and career").
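A minimal sketch of that paring-down step, assuming standard `== Heading ==` wikitext markup. The function names, keyword list, and sample text are illustrative only, and real articles have templates and edge cases that a dedicated wikitext parser would handle better.

```python
import re

# Matches wikitext headings such as "== Life and career ==" or "=== Vienna ===".
HEADING_RE = re.compile(r"^(={2,6})\s*(.*?)\s*\1\s*$", re.MULTILINE)


def split_sections(wikitext):
    """Return (level, title, body) tuples; the lead section has level 0."""
    matches = list(HEADING_RE.finditer(wikitext))
    lead_end = matches[0].start() if matches else len(wikitext)
    sections = [(0, "", wikitext[:lead_end].strip())]
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(wikitext)
        level = len(match.group(1))
        sections.append((level, match.group(2), wikitext[match.end():end].strip()))
    return sections


def keep_sections(wikitext, keywords):
    """Keep sections whose title contains a keyword, plus their subsections."""
    kept = []
    keep_level = None  # heading level of the matching section currently open
    for level, title, body in split_sections(wikitext):
        if keep_level is not None and level > keep_level:
            pass  # subsection of a section we are already keeping
        elif any(k.lower() in title.lower() for k in keywords):
            keep_level = level
        else:
            keep_level = None
        if keep_level is not None:
            marker = "=" * level
            kept.append(f"{marker} {title} {marker}\n{body}")
    return "\n\n".join(kept)


SAMPLE_WIKITEXT = """Ludwig van Beethoven was a German composer.

== Life and career ==
Born in Bonn in 1770.

=== Vienna ===
Moved to Vienna in 1792.

== Legacy ==
Widely regarded as influential.

== References ==
{{reflist}}
"""

relevant = keep_sections(SAMPLE_WIKITEXT, keywords=("life", "career"))
```

Here `relevant` holds only the "Life and career" section and its "Vienna" subsection, which is what would be sent to the model instead of the whole article.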
You can also download full dumps [0] from Wikipedia and query them via SQL to make your life easier if you're processing them.
[0] https://en.wikipedia.org/wiki/Wikipedia:Database_download#Wh...?