| HN Mirror


	by throwawayyyz 4743 days ago
	I love how the code is such a mess. You can really tell one guy just wrote this whole thing over the span of a decade... It's just one patch on top of another and the comments are pretty amusing. Also funny to see hardcoded algorithms for pre-defined site paths and whole domains such as facebook/myspace/vimeo. This is truly a makeshift search engine on a massive scale. EDIT: Gotta say, this has some very useful pieces of code. I'm working on a niche-specific crawler and am battling the url stripping/cleanup part of it. This is very useful: https://github.com/gigablast/open-source-search-engine/blob/...

runarb 4743 days ago

Just found a little gem myself. I am working on another open source search engine[0], and needed a way to make badly behaved document filters time out.

Unfortunately the document filter in question does spawn child processes, so the normal approach of using fork() and a monitoring process was not working. However, using ulimit like this should work: https://github.com/gigablast/open-source-search-engine/blob/... . Hadn’t thought about spawning a new shell and letting it have control like that :)

0: https://github.com/searchdaimon/enterprise-search
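A minimal sketch of the ulimit idea, not the code from the linked file: prefix the command with `ulimit -t` and hand the whole line to a shell via system(). The helper names, the 1024-byte buffer, and the choice of `-t` (CPU seconds) are assumptions for illustration. `ulimit -t` is supported by common shells such as bash and dash, but check yours.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

/* Hypothetical helper: writes "ulimit -t SECS; CMD" into buf.
 * Returns 0 on success, -1 if the line would not fit. */
static int build_limited_command(char *buf, size_t size,
                                 unsigned cpu_secs, const char *cmd)
{
    int n = snprintf(buf, size, "ulimit -t %u; %s", cpu_secs, cmd);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Hypothetical helper: runs cmd in a shell that first lowers its own
 * CPU-time limit. Processes started by that shell inherit the limit.
 * Returns the command's exit status, or -1 if it could not be run or
 * did not exit normally (for example, killed after exceeding the limit). */
static int run_with_cpu_limit(unsigned cpu_secs, const char *cmd)
{
    char line[1024]; /* assumed size, adjust as needed */

    if (build_limited_command(line, sizeof line, cpu_secs, cmd) != 0)
        return -1;

    int status = system(line);
    if (status == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Something like `run_with_cpu_limit(30, "myfilter < in.doc > out.txt")` would be the call, where `myfilter` is a made-up name. Two caveats: the limit counts CPU time, not wall-clock time, so a filter that sleeps or blocks on I/O will not be stopped; and the limit applies to each process separately rather than to the process tree as a whole.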

conductor 4743 days ago

There is a possible buffer overflow right there (if the HOME directory path is long enough). Why don't people use snprintf?
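For illustration, a bounded version of that kind of path building might look like the sketch below. It is not the code from the linked file; `build_home_path` and the file name are hypothetical. The point is that snprintf never writes past `size` bytes, and its return value tells you whether the result was truncated.

```c
#include <stdio.h>

/* Hypothetical helper: writes "<home>/<name>" into buf.
 * Returns 0 on success. Returns -1 and leaves buf as an empty string
 * if the path would not fit, instead of overflowing like sprintf. */
static int build_home_path(char *buf, size_t size,
                           const char *home, const char *name)
{
    /* snprintf returns the length the full string would have had. */
    int n = snprintf(buf, size, "%s/%s", home, name);

    if (n < 0 || (size_t)n >= size) {
        if (size > 0)
            buf[0] = '\0'; /* do not hand back a truncated path */
        return -1;
    }
    return 0;
}
```

A caller would pass `getenv("HOME")` (after checking it for NULL) as `home` and treat -1 as an error rather than using a cut-off path.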

runarb 4743 days ago

>Why don't people use snprintf?

Old habits perhaps? When I look back, I remember that my first books on C were full of problematic sprintf and strcpy use. It may then be easy to continue using what you first learned, even when you know better. It's basically the "baby duck syndrome"[0] for C functions.

0: http://en.wikipedia.org/wiki/Imprinting_(psychology)#Baby_du...
