by rr808
390 days ago
1) Yes, Scala and the JVM are fast. If we could just use them to clean up a feed on a single box, that would be great. The problem is that calling the Spark API creates a lot of complexity for developers, and the runtime platform is super slow.
2) Yes, for the few feeds that are a TB, we need Spark. The platform really just loads from Hadoop, transforms, then saves back again.
b) The reason centralised clusters exist is that you can't have dozens or hundreds of data engineers and scientists all copying company data onto their laptops, causing support headaches because they can't install X library, and making productionising impossible. There are bigger concerns than your personal productivity.