| HN Mirror

> How big is this dataset (what does each event consist of)?

Standard clickstream data, maybe 50-ish parameters per event.

> What sort of queries are you running? > How is the data stored?

Depends on the use-case. For sub-second adhoc queries we go against bitmap indexes. Other queries we uses RDD.cache() after a group/cogroup and answer queries directly from that. For other queries we go hit ORC files. Spark is very memory sensitive compared to hadoop, so using a columnar store and only pulling out the data that you absolutely need goes a very long way. Minimizing cross-communication and shuffling is key to achieving sub-second. It's impossible to achieve that if you're waiting for TB of data to shuffle around =)

> How many machines / cores are running across?

Depends on the use case. Clusters are 10-30 machines, some we run virtual on open stack. We will grow our 30 node cluster in 6mo.

> Maybe Spark doesn't like 100x growth in the size of an RDD using flatMap

You may actually just need to proportionally scale the number of partitions for that particular task by the same amount. Also when possible use mapPartitions, it is very memory efficient compared to map/flatMap.

> Maybe large-scale joins don't work well

Keep in mind that what ever happens per task happens all in memory. For large joins I created a "bloom join" implementation (not currently open source =( ) that does this efficiently. It takes two passes at the data, but minimizes what is shuffled.