
by zepearl 2652 days ago

I used Python to load the contents of the articles into a DB (this is a possibly incorrect extract of very old code; I have something like 20 different versions lying around, so I'm not 100% sure that this one worked well):

===

  from lxml import etree

  sInputFileName = "/my/input/wiki_file.xml"
  # Stream the file and only yield completed <doc> elements.
  context = etree.iterparse(sInputFileName, events=('end',), tag='doc')

  for event, elem in context:
    # elem.text and the attributes can be None, so fall back to "".
    sPageContents = elem.text or ""
    iThisArticleCharLength = len(sPageContents)
    sPageURL = (elem.get("url") or "")[0:4000]
    sPageTitle = (elem.get("title") or "")[0:4000]

    # <do what you want with these vars...>

    # Free the element once processed to keep memory usage flat.
    elem.clear()

===
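
The snippet above stops before the "load into a DB" step, so here is a minimal sketch of what that step could look like. It is not the original code: the database (SQLite, in memory), the `articles` table, its column names, and the `load_articles` helper are all assumptions made for illustration. It also swaps lxml for the standard library's `xml.etree.ElementTree` so it runs without extra packages; the stdlib `iterparse` has no `tag=` argument, so the filter on `doc` is done by hand. The sample input assumes the `<doc>` elements sit under a single root element.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET


def load_articles(xml_source, conn):
    """Stream <doc> elements from xml_source into a hypothetical `articles` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "url TEXT, title TEXT, char_length INTEGER, contents TEXT)"
    )
    count = 0
    # Stdlib iterparse has no tag= filter, so check elem.tag by hand.
    for event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "doc":
            continue
        contents = elem.text or ""
        conn.execute(
            "INSERT INTO articles VALUES (?, ?, ?, ?)",
            (
                (elem.get("url") or "")[:4000],
                (elem.get("title") or "")[:4000],
                len(contents),
                contents,
            ),
        )
        count += 1
        elem.clear()  # release the parsed text once it is stored
    conn.commit()
    return count


# Tiny made-up sample; the single <docs> root is an assumption about the file layout.
SAMPLE = b"""<docs>
<doc url="https://example.org/a" title="Alpha">First article text.</doc>
<doc url="https://example.org/b" title="Beta">Second one.</doc>
</docs>"""

conn = sqlite3.connect(":memory:")
inserted = load_articles(io.BytesIO(SAMPLE), conn)
rows = conn.execute(
    "SELECT title, char_length FROM articles ORDER BY title"
).fetchall()
print(inserted, rows)
```

For a real dump you would pass the file path instead of the `BytesIO` object and probably commit in batches rather than once at the end; parameterized `?` placeholders keep article text with quotes in it from breaking the INSERT.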