HN Mirror


by hnriot 4303 days ago

For some URLs, add_url throws an exception. For example:

http://en.wikipedia.org/wiki/Horse (I don't like snakes)

  File "/home/drace/dev/NLUlite/client_python/NLUlite.py", line 375, in add_url
    parser.feed(page)
  File "/usr/lib/python2.7/HTMLParser.py", line 114, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/HTMLParser.py", line 158, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.7/HTMLParser.py", line 305, in parse_starttag
    attrvalue = self.unescape(attrvalue)
  File "/usr/lib/python2.7/HTMLParser.py", line 472, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/usr/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 8: ordinal not in range(128)
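A possible workaround, as a minimal sketch: on Python 2.7 this error typically shows up when HTMLParser is fed a raw byte string that contains non-ASCII bytes (0xe2 is the lead byte of many three-byte UTF-8 sequences, such as an en dash) and then has to unescape an entity in an attribute value. Decoding the page to text before calling feed() avoids the implicit ASCII decode. The names LinkTitleCollector and feed_decoded are hypothetical and not part of NLUlite, and UTF-8 is an assumed encoding for the page.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7, as in the traceback


class LinkTitleCollector(HTMLParser):
    """Hypothetical parser that collects the title attribute of <a> tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "title":
                    self.titles.append(value)


def feed_decoded(parser, raw_page, encoding="utf-8"):
    """Decode raw bytes to text before feeding the parser.

    The encoding is an assumption; undecodable bytes are replaced
    rather than raising, so one bad byte does not abort the parse.
    """
    if isinstance(raw_page, bytes):
        raw_page = raw_page.decode(encoding, "replace")
    parser.feed(raw_page)
    parser.close()
    return parser


# An attribute value with a UTF-8 en dash (bytes e2 80 93) and an entity.
raw = b'<a href="/wiki/Horse" title="Horse \xe2\x80\x93 Tom &amp; Jerry">x</a>'
collector = feed_decoded(LinkTitleCollector(), raw)
```

If the page comes from an HTTP response, the charset declared in the Content-Type header would be a better choice than a hard-coded encoding.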

It's also really slow at learning. I have plenty of everything (cores, memory, etc.), and it still takes minutes to process web pages. I guess you do say on the website that the free version is slow.

2 comments

NLUlite 4302 days ago

By the way, the commercial version's parser scales almost linearly with the number of (independent) threads. The Wisdom.ask() method is also faster with the multithreaded version.


NLUlite 4303 days ago

Thanks for the feedback, we are looking into it.
