
	by rb2k_ 5703 days ago
	Uh, look what the cat dragged in: my thesis :) Hope some of you enjoy the read. I'm open to comments and criticism.

4 comments

cd34 5703 days ago

I'm guessing English isn't your primary language. There are numerous typos and grammatical errors throughout the document. Spellcheck would catch about 90% of them, since they aren't project names. The grammar errors might be harder to find, since you usually get the right word stem but end up with a word that is merely close.

It also seems like 2/3 of the way through, you started templating your answers and reviews and didn't do as thorough an analysis of the competitive solutions.

The thesis does demonstrate that you know and understand the technology, but I don't get the sense you have an in-depth understanding of what was done. While the results suggest the project was successful, it seems more like you were an observer, validating decisions. Judging by some of the undertones, it also seems you don't agree that the decisions made were the correct ones.

Still, there is a lot of great information in there, presented very well. You might consider submitting it to highscalability.com.

Would you implement the current structure the same way after writing your thesis?

rb2k_ 5703 days ago

Thank you for the feedback!

Typos: Yeah, I'm German. Could you just point out some of the errors (2-3), that would help me to look for them harder next time :)

Lack of detail towards the end: The thesis was written after most of the project was done and I wanted to give people new to the field an introduction to the tools I used and the problems I encountered. All of this was an actual internship project and the ability to use it as my thesis was just a nice "addon".

That's probably why you (rightfully) noticed that some of the competitive solutions (e.g. graph databases) might not have gotten the level of detail and research they deserved. It was a balance between delivering a working product and putting the thesis on a theoretically sound basis while moving to another country :)

In general, I'd re-implement it more or less the same way. I would probably do one or more of the following things:

- take a look at how Riak search turned out

- switch from MySQL to Postgres

- think about another way of determining popularity than incoming links (this can get problematic when trying to recrawl sites: you'd have to keep track of all of the domains that link to a certain site. Maybe graph databases would be a good solution for this problem)

- start with coding EVERYTHING in an asynchronous manner. Maybe use em-synchrony (https://github.com/igrigorik/em-synchrony)

- write more tests (the more the better)
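
The recrawl problem with incoming-link counts can be illustrated with a minimal sketch. It is written in Python purely for illustration (the crawler itself was written in Ruby), and the class and method names are hypothetical, not taken from the thesis. Instead of a bare counter, it keeps the set of source domains per target, so a recrawl replaces a source's old links rather than counting them twice:

```python
from collections import defaultdict


class InboundLinkIndex:
    """Hypothetical helper: tracks which source domains link to which targets."""

    def __init__(self):
        self.outgoing = defaultdict(set)  # source domain -> targets it links to
        self.incoming = defaultdict(set)  # target domain -> sources linking to it

    def record_crawl(self, source, targets):
        """Replace everything known about `source` with a fresh crawl result."""
        new_targets = set(targets) - {source}  # ignore self-links
        # Links that existed on the previous crawl but are gone now.
        for stale in self.outgoing[source] - new_targets:
            self.incoming[stale].discard(source)
        for target in new_targets:
            self.incoming[target].add(source)
        self.outgoing[source] = new_targets

    def popularity(self, target):
        """Number of distinct domains currently linking to `target`."""
        return len(self.incoming[target])


index = InboundLinkIndex()
index.record_crawl("a.example", ["c.example", "d.example"])
index.record_crawl("b.example", ["c.example"])
# Recrawl: a.example no longer links to c.example.
index.record_crawl("a.example", ["d.example"])
print(index.popularity("c.example"))  # 1
print(index.popularity("d.example"))  # 1
```

The price is exactly the bookkeeping mentioned above: every source-to-target edge has to be stored, which is the kind of data a graph database is designed for.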

cd34 5703 days ago

Things I remember: postgressql, defiantly (you meant definitely), you used deduct rather than deduce. Several typos were obvious typos that spellcheck would find. Double keys, letters swapped, etc.

Writing async from the start is worlds easier than refactoring. Had you been there at the start, I'm thinking your thesis may have taken a much different approach. It looks like you understand scalability, but every day there's a new product to evaluate. :) Good luck with it.
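
As a rough illustration of what "async from the start" means for a crawler, here is a minimal sketch using Python's asyncio. It is not the thesis code (that was Ruby, with em-synchrony suggested above); `fetch` is a hypothetical stand-in that sleeps instead of touching the network, and the concurrency limit of 5 is an arbitrary assumption:

```python
import asyncio


async def fetch(url, delay=0.01):
    """Stand-in for a real HTTP request; sleeps instead of using the network."""
    await asyncio.sleep(delay)
    return f"<html>{url}</html>"


async def crawl(urls, max_concurrency=5):
    """Fetch all URLs concurrently, with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(url):
        async with semaphore:
            return url, await fetch(url)

    results = await asyncio.gather(*(bounded_fetch(u) for u in urls))
    return dict(results)


pages = asyncio.run(crawl([f"http://example.com/page{i}" for i in range(20)]))
print(len(pages))  # 20
```

Because every caller of `fetch` has to be a coroutine as well, retrofitting this onto blocking code means touching the whole call chain, which is why starting this way tends to be easier than refactoring later.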

arkitaip 5703 days ago

Very timely and interesting. I am currently looking for a crawler that is tightly integrated with Drupal and that can be easily managed through Drupal nodes. Any suggestions on a solution for a small site that only needs to handle thousands of pages/URLs?

rb2k_ 5703 days ago

I don't really know what "managed through Drupal nodes" means in this context. For a simple Drupal full-text search I can recommend Apache Solr ( http://drupal.org/project/apachesolr ).

For regular crawling:

I found anemone ( http://anemone.rubyforge.org/ ) to be a lovely framework for single page crawls.

Other interesting candidates:

https://github.com/hasmanydevelopers/RDaneel

http://www.redaelli.org/matteo-blog/projects/ebot/

http://nutch.apache.org/ (meh, java)

toumhi 5703 days ago

Scrapy (http://scrapy.org/) is a well-documented, open-source Python scraping framework that I've used in a couple of projects.

rb2k_ 5703 days ago

Indeed, seems like a great framework.

Considering the timespan of the project, I had to rely on something I'm pretty ok at (Ruby), but I remember hitting a lot of posts about Scrapy on the way.

nowarninglabel 5703 days ago

Excellent to have this up. I'm glad that you made it available. I'm doing web crawling w/ Drupal, so always interesting to see how others are doing it.

rubyrescue 5703 days ago

Great paper. You mentioned that you would have considered Riak if it had search. Now that it does, if you did this again would you seriously consider using it instead?

rb2k_ 5703 days ago

The crawler currently runs on a single large EC2 instance. I could see myself trying to use a bunch of EC2 micro instances instead and then use Riak + Riak Search.

I actually tried putting a dump of the data into Riak and it seemed to hold up pretty well on my MacBook.

Another problem was that Riak didn't allow me to do server-side increments on the "incoming links" counter, which MySQL, MongoDB, or Redis allow. However, I think this could be solved by using Redis as a caching layer.
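
The caching idea could look roughly like the sketch below: increments are accumulated in a fast counter layer and written back to the main store in batches, one read-modify-write per key. This is only an illustration in Python under stated assumptions: plain dictionaries stand in for both Redis and Riak, and the class and method names are hypothetical. With a real Redis, the accumulation step would map onto its atomic INCR/INCRBY commands.

```python
from collections import Counter


class BufferedLinkCounter:
    """Hypothetical write-back buffer for "incoming links" counters."""

    def __init__(self, store):
        self.store = store        # stand-in for the main store (a dict here)
        self.pending = Counter()  # stand-in for the caching layer

    def increment(self, domain, amount=1):
        # Cheap and local: no round trip to the main store per discovered link.
        self.pending[domain] += amount

    def flush(self):
        """Apply all buffered increments; returns the number of keys written."""
        flushed = len(self.pending)
        for domain, delta in self.pending.items():
            self.store[domain] = self.store.get(domain, 0) + delta
        self.pending.clear()
        return flushed


store = {"example.org": 10}
counter = BufferedLinkCounter(store)
for _ in range(3):
    counter.increment("example.org")
counter.increment("example.net")
written = counter.flush()
print(written, store)  # 2 {'example.org': 13, 'example.net': 1}
```

The trade-off is that counts in the main store lag behind until the next flush, and anything still buffered is lost if the cache goes down first.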

I have to admit that I would love to use Riak for something just because it seems to be a really slick piece of software, so it's hard to stay objective :)
