| HN Mirror



	by chii 1422 days ago
	Why didn't they just download the dumps via https://dumps.wikimedia.org/enwiktionary/ (as explained in https://en.wiktionary.org/wiki/Help:FAQ#Downloading_Wiktiona...)? Scraping, even via an API, is way less efficient imho.
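For illustration, a minimal sketch of reading pages out of such a dump with only the Python standard library. It assumes the dump is a bz2-compressed MediaWiki XML export with `<page>`, `<title>` and `<text>` elements; the names `open_dump`, `iter_pages` and `SAMPLE` are made up for this example, and the namespace version in the sample is an assumption (the code ignores it anyway).

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Drop the '{namespace}' prefix so the export schema version doesn't matter."""
    return tag.rsplit("}", 1)[-1]


def open_dump(path):
    """Open a .xml.bz2 dump as a binary stream without unpacking it to disk."""
    return bz2.open(path, "rb")


def iter_pages(stream):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream."""
    title, text = None, None
    for _, elem in ET.iterparse(stream, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # drop this page's children once we are done with it


# Tiny in-memory stand-in for a real dump (hypothetical sample data).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>subbureau</title>
    <revision>
      <text>==English==</text>
    </revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
```

With a real download you would pass `open_dump("<path to the .xml.bz2 file>")` to `iter_pages` instead of the in-memory sample, which streams the file rather than loading it whole.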

1 comments

gnubison 1422 days ago

They’re in wikitext, which looks to be considerably less semantic than the crawled data. I’m not sure that’s the reason, but it could be a reason.

chii 1421 days ago

I'd say that's not the reason, since the wikitext is pretty semantic. The wiki source of https://en.wiktionary.org/wiki/subbureau#English is:

  ==English==

  ===Etymology===
  {{prefix|en|sub|bureau}}

  ===Noun===
  {{en-noun|s|subbureaux}}

  # A [[district]]-level public security bureau in [[China]].

So as long as one can parse wikitext, it's split up pretty well!
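A rough sketch of what "parsing" could mean for an entry this simple, using only regular expressions. This is not how Wiktionary or any particular tool does it; `parse`, `HEADING`, `TEMPLATE` and `LINK` are names invented for the example, and it assumes flat (non-nested) templates and one heading or definition per line.

```python
import re

WIKITEXT = """\
==English==

===Etymology===
{{prefix|en|sub|bureau}}

===Noun===
{{en-noun|s|subbureaux}}

# A [[district]]-level public security bureau in [[China]].
"""

# "==Title==" style headings; the number of "=" signs gives the nesting level.
HEADING = re.compile(r"^(=+)\s*(.*?)\s*\1\s*$")
# Flat templates only: "{{name|arg|arg}}" with no templates nested inside.
TEMPLATE = re.compile(r"\{\{([^{}]*)\}\}")
# "[[target]]" or "[[target|label]]"; keeps the visible text.
LINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")


def parse(wikitext):
    """Return a list of sections with their level, title, templates and definitions."""
    sections = []
    for line in wikitext.splitlines():
        m = HEADING.match(line)
        if m:
            sections.append({"level": len(m.group(1)), "title": m.group(2),
                             "templates": [], "definitions": []})
        elif sections:
            current = sections[-1]
            for t in TEMPLATE.findall(line):
                current["templates"].append(t.split("|"))
            if line.startswith("# "):
                # Definition line: strip the "# " marker and unwrap links.
                current["definitions"].append(LINK.sub(r"\1", line[2:]))
    return sections


sections = parse(WIKITEXT)
for s in sections:
    print(s["level"], s["title"], s["templates"], s["definitions"])
```

Real entries have nested templates and much messier markup, so for anything beyond a toy like this a dedicated wikitext parser such as mwparserfromhell is probably the safer route.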