by avibryant
4483 days ago
Yeah, for those same reasons I much prefer using Scalding's typed API [1], which feels very similar to Scoobi. The tuple API shown in these slides is great for places like Etsy that already have a large investment in Cascading, but otherwise you're better off getting the added type safety and similarity to the standard Scala API.

[1] https://github.com/twitter/scalding/wiki/Type-safe-api-refer...
|
- There are a couple of types for datasets in the Scalding API: TypedPipe, and KeyedList and its subclasses. Scoobi subsumes both of these under DList; thanks to the usual Scala wizardry, this has all the methods to operate on key-value pairs without loss of type safety. This isn't a huge deal, but it removes the tiny pains of constantly converting back and forth between the two.
- Scoobi's other abstraction, DObject, represents a single value. These are usually created by aggregations or as a way to expose the distributed cache, and have all the operations you'd expect when joining them together or with full datasets. You can emulate this in Cascading / Scalding, but it's a bit less explicit and more error-prone.
- There's no equivalent to the compile-time check for serialization in Scalding, AFAICT.
- Scoobi has fewer opinions about the job runner itself... there are some helpers for setting up the job, but all features are available as a library. For some reason, I found the two harder to separate in Scalding?
- IIRC, Scalding did job setup by mutating a Cascading object that was available implicitly in the Job. In Scoobi, you build up an immutable data structure describing the computation and hand that to the compiler. This suits my sense of aesthetics better, I suppose...
* Also, thanks to you guys for Algebird! That's a really fantastic little project, and I use it all the time.
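
To make the first and last bullets above a bit more concrete, here is a rough sketch in plain Scala with no Hadoop dependency. Every name in it (Plan, Source, Mapped, Grouped, PairOps, LocalRunner) is made up for illustration and is not Scoobi's or Scalding's real API; it only assumes that the "Scala wizardry" is something like an implicit extension that applies when the element type is a pair, and that the "compiler" can be stood in for by an in-memory runner.

```scala
// Hypothetical names throughout: an illustration of the design only,
// not Scoobi's or Scalding's actual API.

// An immutable description of a computation. Building one runs nothing.
sealed trait Plan[A] {
  def map[B](f: A => B): Plan[B] = Mapped(this, f)
  // How an in-memory runner evaluates this node.
  def evalLocal: Seq[A]
}

final case class Source[A](rows: Seq[A]) extends Plan[A] {
  def evalLocal: Seq[A] = rows
}

final case class Mapped[A, B](in: Plan[A], f: A => B) extends Plan[B] {
  def evalLocal: Seq[B] = in.evalLocal.map(f)
}

final case class Grouped[K, V](in: Plan[(K, V)]) extends Plan[(K, Seq[V])] {
  def evalLocal: Seq[(K, Seq[V])] =
    in.evalLocal.groupBy(_._1).toSeq.map { case (k, kvs) => (k, kvs.map(_._2)) }
}

object Plan {
  // Key-value methods exist only when the element type is a pair, so one
  // dataset type serves both cases and no conversion step is needed.
  implicit class PairOps[K, V](private val self: Plan[(K, V)]) {
    def groupByKey: Plan[(K, Seq[V])] = Grouped(self)
  }
}

// Stand-in for the "compiler": the only place where a plan is executed.
object LocalRunner {
  def run[A](plan: Plan[A]): Seq[A] = plan.evalLocal
}

object Example {
  val words: Plan[String] = Source(Seq("a", "b", "a"))
  // Each step returns a new Plan; nothing is mutated and nothing runs yet.
  val counts: Plan[(String, Int)] =
    words.map(w => (w, 1)).groupByKey.map { case (w, ones) => (w, ones.sum) }
  // Source(Seq(1, 2, 3)).groupByKey would not compile: Int is not a pair.
  val result: Map[String, Int] = LocalRunner.run(counts).toMap
}
```

The point of the sketch is that `counts` is just a value: it can be inspected, passed around, or handed to a different runner, and there is no implicit mutable job object for a step to register itself with.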