by elhawtaky 2039 days ago
I'm the author. Let me know if you have any questions.
7 comments

Reading through your article, this solution is built on top of S3. So moving and listing files is faster, presumably due to a new metadata system you've built for tracking files. The trade-off is that writes must be strictly slower than they were previously, because you've added a network hop: all read and write data now flows through these workers. That also adds a point of failure; if you stream too much data through these workers, you could potentially OOM them. Reads are potentially faster, but that depends on the hit rate of the block cache. It would be nice to see a more transparent post listing the pros and cons, rather than what reads as a technical advertisement.
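To make the hit-rate point concrete, here is a rough sketch of expected read latency through a caching worker. The model and every number in it are hypothetical illustrations, not measurements from the article or the paper: `cache_ms` is the total latency of a cache hit, `s3_ms` is a direct S3 read, and `hop_ms` is the extra worker hop paid on a miss.

```python
def expected_read_ms(hit_rate, cache_ms, s3_ms, hop_ms):
    """Expected read latency through a caching worker.

    Hit: served from the block cache. Miss: S3 read plus the extra hop.
    """
    return hit_rate * cache_ms + (1 - hit_rate) * (s3_ms + hop_ms)


def break_even_hit_rate(cache_ms, s3_ms, hop_ms):
    """Hit rate at which the cached path matches a direct S3 read.

    Solves h*cache + (1-h)*(s3+hop) = s3, giving h = hop / (s3 + hop - cache).
    """
    return hop_ms / (s3_ms + hop_ms - cache_ms)


# Hypothetical latencies in milliseconds, chosen only for illustration.
CACHE_MS, S3_MS, HOP_MS = 2, 30, 5

for h in (0.0, 0.5, 0.9):
    print(h, expected_read_ms(h, CACHE_MS, S3_MS, HOP_MS))
print("break-even hit rate:", break_even_hit_rate(CACHE_MS, S3_MS, HOP_MS))
```

Under this model, reads through the workers beat direct S3 only above the break-even hit rate; below it the extra hop dominates.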
I'm one of the co-authors. The numbers for writes are in the paper, so it is very unfair to call it an advertisement. And it is a global cache - if the block is cached somewhere, it will be used in reads.
The parent does make a good point about centralization of requests being a problem. S3 load balances under the hood, so different key prefixes within a bucket are usually serviced in isolation -- a DoS to one prefix will usually not affect other prefixes.

It seems like you'd be limiting yourself for concurrent access if everything is flowing through the MySQL cluster. Not a bad thing! It just perhaps warrants a caveat note. I'd expect S3 to smoke HopsFS on concurrency.

Read the paper: S3 doesn't smoke HopsFS on concurrency. In the 2nd paper, we get 1.6m ops/sec on HopsFS HA over 3 availability zones (without S3 as a backing layer). Can you get that on S3? Maybe if you are Netflix; otherwise your starting quota is in the 1000s of ops/sec.
"Amazon S3 automatically scales to high request rates. For example, your application can achieve at least 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix in a bucket. There are no limits to the number of prefixes in a bucket. You can increase your read or write performance by parallelizing reads. For example, if you create 10 prefixes in an Amazon S3 bucket to parallelize reads, you could scale your read performance to 55,000 read requests per second."

With 500 prefixes it doesn't matter if you're Netflix or a student working on a distributed computing project: you can reach that 1.6m requests per second.
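As a rough check of that arithmetic, the sketch below uses only the per-prefix rates from the S3 documentation quoted above. It assumes requests are spread evenly across prefixes and ignores any ramp-up time or account-level limits, neither of which the quote addresses; the function names are made up for illustration.

```python
import math

# Per-prefix baseline rates taken from the S3 documentation quote above.
WRITE_OPS_PER_PREFIX = 3_500  # PUT/COPY/POST/DELETE per second
READ_OPS_PER_PREFIX = 5_500   # GET/HEAD per second

TARGET = 1_600_000  # the 1.6m ops/sec figure cited for HopsFS


def prefixes_needed(target_ops_per_sec, ops_per_prefix):
    """Smallest number of prefixes whose combined baseline rate meets the target."""
    return math.ceil(target_ops_per_sec / ops_per_prefix)


def aggregate_rate(num_prefixes, ops_per_prefix):
    """Combined baseline rate, assuming perfectly even load across prefixes."""
    return num_prefixes * ops_per_prefix


print(prefixes_needed(TARGET, READ_OPS_PER_PREFIX))   # 291
print(prefixes_needed(TARGET, WRITE_OPS_PER_PREFIX))  # 458
print(aggregate_rate(500, WRITE_OPS_PER_PREFIX))      # 1750000
print(aggregate_rate(500, READ_OPS_PER_PREFIX))       # 2750000
```

So on paper 500 prefixes clear 1.6m ops/sec for both reads and writes; whether an account can actually sustain that is the quota question raised elsewhere in the thread.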

There is no limit to the number of prefixes in a bucket. You can scale to be arbitrarily large.

S3 has quotas. Go try and get 1m ops/sec as a quota.
The linked post says "has the same cost as S3", yet on the linked pricing page there's only a "Contact Us" Enterprise plan besides a free one. Am I missing something?
There is no extra charge for HopsFS as part of the Hopsworks platform, so you only pay for what you read/write/store in S3. SaaS pricing is public on hopsworks.ai.
This was also not clear to me when looking at the pricing page. I assumed there was Free and Enterprise, and that to use this beyond 250 working days, I would have to go to Enterprise.
No. As with Dropbox, if you do referrals you can get extra credit and stay on the free tier indefinitely. Or you can pay as you go on the non-enterprise tier.
> ... but has 100X the performance of S3 for file move/rename operations

Isn't rename in S3 effectively a copy-delete operation?
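A toy illustration of why that matters for "directory" renames: with only per-object copy and delete available, the request count grows with the number of objects under the prefix. `ToyObjectStore` is an in-memory stand-in invented for this sketch, not a real S3 client, and listing is left uncounted for simplicity.

```python
class ToyObjectStore:
    """In-memory stand-in for a flat key/value object store (not a real S3 client)."""

    def __init__(self):
        self.objects = {}
        self.requests = 0  # simulated per-object requests (put/copy/delete)

    def put(self, key, data):
        self.objects[key] = data
        self.requests += 1

    def copy(self, src, dst):
        self.objects[dst] = self.objects[src]
        self.requests += 1

    def delete(self, key):
        del self.objects[key]
        self.requests += 1

    def list_keys(self, prefix):
        return sorted(k for k in self.objects if k.startswith(prefix))


def rename_prefix(store, old_prefix, new_prefix):
    """Emulate a directory rename: copy each object to its new key, then delete the old key."""
    for key in store.list_keys(old_prefix):
        new_key = new_prefix + key[len(old_prefix):]
        store.copy(key, new_key)
        store.delete(key)


store = ToyObjectStore()
for i in range(1000):
    store.put(f"logs/2020/{i}.txt", b"x")

before = store.requests
rename_prefix(store, "logs/2020/", "archive/2020/")
print(store.requests - before)  # 2000: one copy and one delete per object
```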

That’s my understanding too. Also, rename/copy turned out not to be very useful at the end of the day. Nearly all my implementations just boil down to randomized characters as IDs.
Yup. Use a system like Redis/DynamoDB or even a traditional database to store the metadata, and use random UUIDs for the actual file storage.

And tag the files for expiration/clean up. S3 is not a file system and people should stop treating it like one - only to get bitten by these assumptions around it being a FS.
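A minimal sketch of that pattern, with plain dicts standing in for both the metadata store (Redis/DynamoDB/SQL) and the object store. The class and method names are hypothetical; the point is only that a rename touches the metadata entry and never the stored bytes.

```python
import uuid


class MetadataIndex:
    """Hypothetical path -> object-key index over an opaque blob store."""

    def __init__(self):
        self.path_to_key = {}  # stand-in for Redis/DynamoDB/a SQL table
        self.blobs = {}        # stand-in for the object store

    def write(self, path, data):
        key = uuid.uuid4().hex  # random key, unrelated to the logical path
        self.blobs[key] = data
        self.path_to_key[path] = key
        return key

    def read(self, path):
        return self.blobs[self.path_to_key[path]]

    def rename(self, old_path, new_path):
        # Metadata-only update: the blob keeps its key and is never copied.
        self.path_to_key[new_path] = self.path_to_key.pop(old_path)


index = MetadataIndex()
key = index.write("/reports/q1.csv", b"a,b,c")
index.rename("/reports/q1.csv", "/archive/q1.csv")
print(index.path_to_key["/archive/q1.csv"] == key)  # True
print(index.read("/archive/q1.csv"))                # b'a,b,c'
```

Overwriting a path in this sketch leaves the old blob orphaned, which is where the expiration/cleanup tagging comes in.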

You say it's "POSIX-like" - so what from POSIX had to be left out?
Random writes are not supported in the HDFS API.
What tradeoffs did you make? In what situations does S3 have better characteristics than HopsFS?
How does this differ from ObjectiveFS?
Funnily enough, I wasn't aware of ObjectiveFS - I guess it's because I can't find a research paper for it. HopsFS on S3 is similar to ADLS on Azure (built on Azure block storage). Internally, ADLS and HopsFS are different - HopsFS has a scale-out consistent metadata layer with a CDC API, while ADLS doesn't. But ADLS-v2 is also very good.
I searched your page, and then this HN discussion, for the string 'ssh' and got nothing ...

What is the access protocol? What tools am I using to access the POSIX presentation?