by subprotocol
4400 days ago
I use Spark a lot and my experience has been quite the opposite. The queries I run against Spark cover billions of events and the results come back in under a second. I could only speculate as to what this user's issues were. One difference between Hadoop and Spark is that Spark is more sensitive: you sometimes need to tell it how many tasks to use. In practice it is no big deal at all. Perhaps the user was running into this: the data for a task in Spark is held entirely in memory, whereas Hadoop will load and spill to disk within a task. So if you give a single Hadoop reducer 1TB of data, it will complete after a very long time. In Spark, if you did this you would need 1TB of memory on the executor. I wouldn't give an executor/JVM anything over 10GB. So if you have lots of memory, just be sure to balance it with cores and executors. I have seen Spark use up all the inodes on a system before. A job with 1000 map and 1000 reduce tasks would create 1M spill files on disk. However, that was on an earlier version of Spark and I was using ext3. I think this has since been improved. For me Spark runs circles around Hadoop.
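To put rough numbers on the two points above, here is a minimal back-of-the-envelope sketch in plain Python. It does not call any Spark API. The function names and the 1GB per-task budget are hypothetical, picked only for illustration, and it assumes the data is spread evenly across tasks (no skew).

```python
GB = 1024 ** 3
TB = 1024 ** 4


def min_tasks(total_bytes: int, per_task_budget_bytes: int) -> int:
    """Smallest task count that keeps each task's share within the budget.

    Assumes data is split evenly across tasks; real jobs may be skewed.
    """
    if total_bytes < 0 or per_task_budget_bytes <= 0:
        raise ValueError("sizes must be non-negative and the budget positive")
    # Integer ceiling division, so huge byte counts are not rounded as floats.
    return max(1, -(-total_bytes // per_task_budget_bytes))


def files_per_shuffle(map_tasks: int, reduce_tasks: int) -> int:
    """File count when every map task writes one file per reduce task."""
    return map_tasks * reduce_tasks


# 1TB handed to a single task needs 1TB in memory; with a (hypothetical)
# 1GB budget per task, the same data needs at least this many tasks:
tasks_needed = min_tasks(1 * TB, 1 * GB)  # 1024

# The inode problem: 1000 map tasks x 1000 reduce tasks.
files_created = files_per_shuffle(1000, 1000)  # 1000000
```

The takeaway is that the task count is the knob: raising it shrinks what each task must hold in memory, but on the older setup described above it also multiplied the number of files on disk.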
|
This is interesting, I haven't gotten Spark to do anything at all in less than a second. How big is this dataset (what does each event consist of)? How is the data stored? How many machines / cores are you running across? What sort of queries are you running?
>I could only speculate as to what this users issues were.
I'm the author of the above post and unfortunately I can also "only speculate" as to what my issues were. Maybe Spark doesn't like 100x growth in the size of an RDD using flatMap? Maybe large-scale joins don't work well? Who knows. The problem, however, doesn't seem to be anything covered in the tuning guide(s).