by raja_sekar
2430 days ago
Author of the repo here. It is definitely not orders of magnitude faster, and I don't think I claimed that anywhere. But yes, the JVM is sometimes a problem for in-memory big data processing. Spark itself tried to address this with the Tungsten engine, which avoids large Java objects by keeping data in a compact binary format accessed through sun.misc.Unsafe. This is why DataFrames are generally much faster than RDDs (which typically hold Java objects), and also why only certain native types are allowed in DataFrames.

This project was just for exploring the feasibility of implementing Spark itself in a native language. Closure serialization can be a nightmare here. Spark DataFrames are already highly optimized, and years of optimization have gone into Spark, so even 2-4x better performance would be very difficult to achieve. If it did translate to that, it could be a good alternative and reduce cloud costs a bit, especially if the Python APIs remain compatible.

So I just thought of open-sourcing it; if others see the benefits, it will automatically grow with the help of the community. It is still a long, long way from Spark-level maturity. Spark is a huge ecosystem built on top of the already big Hadoop ecosystem.
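
As a rough illustration of the object-overhead point above, here is a Python sketch that packs rows into one contiguous fixed-width buffer instead of keeping one object per row. The two-column row layout and the function names are made up for the example; this is not Spark's actual binary row format, only the general idea of why restricting columns to a few native types makes such a layout possible.

```python
import struct

# Hypothetical row layout for illustration: (id: int64, score: float64).
# "<qd" = little-endian, no padding, so every row is exactly 16 bytes.
ROW = struct.Struct("<qd")

def pack_rows(rows):
    """Pack (int, float) rows into one contiguous bytearray."""
    buf = bytearray(ROW.size * len(rows))
    for i, (row_id, score) in enumerate(rows):
        ROW.pack_into(buf, i * ROW.size, row_id, score)
    return buf

def read_row(buf, i):
    """Read row i by computing its offset; other rows are not touched."""
    return ROW.unpack_from(buf, i * ROW.size)

def sum_scores(buf):
    """Scan the score column straight from the packed buffer."""
    return sum(score for _, score in ROW.iter_unpack(buf))

rows = [(i, i * 0.5) for i in range(1000)]
buf = pack_rows(rows)
```

Because every row has the same width, row `i` always lives at byte offset `i * ROW.size`, so there are no per-row objects to allocate or garbage-collect. The trade-off is that only types with a known binary encoding fit into such a layout.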
There’s nothing automatic about it: you or someone else will need to put a lot of work into leading the community, merging pull requests, debugging, etc.
(Sad to say, promotion too, in a lot of cases.)