by chii 1422 days ago
Why didn't they just download the dumps from https://dumps.wikimedia.org/enwiktionary/ (as explained in https://en.wiktionary.org/wiki/Help:FAQ#Downloading_Wiktiona...)?

Scraping, even via an API, is way less efficient IMHO.


They’re in wikitext, which looks to be considerably less semantic than the crawled data. I’m not sure that’s the reason, but it could be a reason.
I'd say that's not the reason, since the wikitext is pretty semantic. The wiki source of https://en.wiktionary.org/wiki/subbureau#English is:

  ==English==

  ===Etymology===
  {{prefix|en|sub|bureau}}

  ===Noun===
  {{en-noun|s|subbureaux}}

  # A [[district]]-level public security bureau in [[China]].
So as long as one can parse wikitext, it's split up pretty well!
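
For what it's worth, here is a rough Python sketch of how little parsing that particular sample needs. It assumes only three constructs: `==` headings, flat (non-nested) `{{...}}` templates, and `#` definition lines. Real entries have nested templates and a lot more, so a proper wikitext parser (mwparserfromhell, for example) is probably the safer route. `parse_entry` and the regex names are made up for illustration; they are not part of any Wiktionary tooling.

```python
import re

# The wikitext quoted above for "subbureau".
SAMPLE = """==English==

===Etymology===
{{prefix|en|sub|bureau}}

===Noun===
{{en-noun|s|subbureaux}}

# A [[district]]-level public security bureau in [[China]].
"""

# Assumption: headings look like ==Title== with 2 to 6 matching '=' per side.
HEADING = re.compile(r"^(={2,6})\s*(.+?)\s*\1\s*$")
# Assumption: templates are flat, i.e. no {{...}} nested inside another.
TEMPLATE = re.compile(r"\{\{([^{}]*)\}\}")
# [[target]] or [[target|label]]; the visible text is kept.
LINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")


def parse_entry(wikitext):
    """Split wikitext into a list of sections (hypothetical helper)."""
    sections = []
    current = None
    for raw in wikitext.splitlines():
        line = raw.strip()
        if not line:
            continue
        heading = HEADING.match(line)
        if heading:
            # Start a new section; the number of '=' gives the nesting level.
            current = {
                "level": len(heading.group(1)),
                "title": heading.group(2),
                "templates": [],
                "definitions": [],
            }
            sections.append(current)
            continue
        if current is None:
            continue  # text before the first heading is ignored here
        # Record each template as [name, arg1, arg2, ...].
        for body in TEMPLATE.findall(line):
            current["templates"].append(body.split("|"))
        # '# ' lines are definitions; strip link markup down to plain text.
        if line.startswith("# "):
            current["definitions"].append(LINK.sub(r"\1", line[2:]))
    return sections


if __name__ == "__main__":
    for section in parse_entry(SAMPLE):
        print(section)
```

On the sample this yields three sections (English, Etymology, Noun), with the `prefix` and `en-noun` templates split into their arguments and the definition reduced to plain text.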