by rsync
2458 days ago
"At the UH Cancer Center we routinely deal with datasets in the TB - PB range ..." ... "Do you happen to have an S3 bucket with that data live?" As someone not working in academia (or in this field at all), can you help me understand the question you have just asked? Specifically, wouldn't it be tremendously profligate for them to have that PB-range dataset living in S3? Given the resources that a university has (Internet2 connectivity, hardware budget, and (relatively) cheap manpower), why would they ever store that data outside of their own UH datacenter? If the answer is "offsite backup", wouldn't it be Glacier or Nearline or ... anything but S3?
There are many ways to shave S3 costs (e.g. Intelligent-Tiering, Glacier), but at some point the data become so slow to access that you can't offer a pleasant user experience for browsing, searching, and feeding pipelines.
Most importantly, the "my data, my bucket" strategy gives users control over their data. A university with its own bucket has more control over its data than it would if Google, Facebook, etc. hosted and monetized it.