HN Mirror


by sleepythread 4213 days ago

A common misconception about Hadoop is that you should use it whenever your data is large. The decision should be driven more by how fast the data grows than by its current size.

I agree that for the given use case the solution is appropriate and works fine. The problem described in the post is not a Big Data problem.

Hadoop would help if millions of games were played every day and the statistics had to be updated daily, etc. In that case the given solution would hit a bottleneck, and some optimisation or code changes would be needed to keep it running.

Hadoop and its ecosystem are not a silver bullet and should not be used for everything. The problem has to be a Big Data problem.


yuanchuan 4213 days ago

It is the buzz surrounding Hadoop that makes people misunderstand its uses and capabilities. I have met non-technical analysts who want RDBMS performance on Hadoop. They expect queries on hundreds of GB of data to finish in seconds to minutes.

I always give this analogy to people who misunderstand Hadoop: would you crack an egg with a stone or with a spoon?

Hadoop and RDBMS only have a thin overlapping region in the Venn diagram that describes their capabilities and use cases.

Ultimately, it is cost vs. efficiency. Hadoop can solve all data problems; so can an RDBMS. It is an engineering tradeoff that people have to make.


sleepythread 4213 days ago

I totally agree with you. Capability "LIKE" will drive Hadoop adoption. Hadoop should not be seen as a replacement for an RDBMS. These are two different tools made for different purposes.


pacala 4213 days ago

> They expect seconds to minutes scale queries on hundreds of GB of data.

Use BigQuery from Google.


yuanchuan 4212 days ago

On-premise cluster.

Cloud solutions are totally out due to the nature of the data. Not everything can be done in the cloud.

If you have such a huge amount of data, the total time it takes to transfer it there and compute is not competitive with an on-premise solution, unless all your data already lives in the cloud.
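
As a rough back-of-the-envelope sketch of the transfer cost alone (the 500 GB and 100 Mbit/s figures are made-up assumptions for illustration, not numbers from this thread, and the helper name is hypothetical):

```python
def transfer_hours(data_gb: float, link_mbps: float) -> float:
    """Hours needed to move data_gb gigabytes over a link of link_mbps megabits/s.

    Uses decimal units (1 GB = 8000 megabits) and ignores protocol overhead,
    so real transfers would take somewhat longer.
    """
    megabits = data_gb * 8 * 1000
    seconds = megabits / link_mbps
    return seconds / 3600


# Assumed figures: 500 GB of data over a 100 Mbit/s uplink.
hours = transfer_hours(500, 100)
print(f"{hours:.1f} hours just to upload")
```

Under those assumptions the upload takes about 11 hours before any query runs, which is the overhead an on-premise cluster does not pay.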


pacala 4212 days ago

I would look into https://spark.apache.org/ then. You can get quite good performance out of it, but you need to spend more effort babysitting your data.
