by aub3bhat
3570 days ago
The paper is frankly stupid and a great example of the difference between practice and academia. It looks good because they are using a snapshot of the Twitter network from 2010. In reality the workflow is complex, e.g. the follower graph gets updated every hour. 10 different teams have different requirements for how to set up the graph and computations. These computations need to be run at different (hourly, daily, weekly) granularities. 100 downstream jobs also depend on them and need to start as soon as the previous job finishes. The output of the jobs gets imported/indexed in a database, which is then pushed to production systems and/or used by analysts who might update and retry/rerun computations. Unlike a bunch of out of touch researchers the key concern isn't how "fast" calculations finish, but several others such as ability to reuse, fault tolerance, multi user support etc. I can outrun a Boeing 777 on my bike in a 3 meter race but no one would care. The single laptop example is essentially that.
We used these data and workloads because that was what GraphX used. If you take the graphs any bigger, Spark and GraphX at least couldn't handle it and just failed. They've probably gotten better in the meantime, so take that with a grain of salt.
> Unlike a bunch of out of touch researchers the key concern isn't how "fast" calculations finish, but several others such as ability to reuse, fault tolerance, multi user support etc.
The paper says these exact things. You have to keep reading, and it's hard, I know, but for example the last paragraph of section 5 says pretty much exactly this.
And, if you read the paper even more carefully, it is pretty clearly not about whether you should use these systems or not, but how you should not evaluate them (i.e. only on tasks at a scale that a laptop could do better).