I solved this problem locally. When a file is uploaded to the server, it is cached in Redis before going to S3. Whenever the codebase needs the file, it checks Redis first; if the file is not there, it fetches it from S3 and caches it again.
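Roughly, the flow looks like the sketch below. This is only an illustration: `CachedFileStore` is a made-up name, and plain dicts stand in for Redis and S3 so it runs without either service.

```python
class CachedFileStore:
    """Write-through on upload, read-through on access (hypothetical sketch)."""

    def __init__(self, cache, backing):
        self.cache = cache      # stand-in for Redis (any dict-like object)
        self.backing = backing  # stand-in for S3 (any dict-like object)

    def upload(self, key, data):
        # Cache first so readers can see the file right away, then persist it.
        self.cache[key] = data
        self.backing[key] = data

    def read(self, key):
        data = self.cache.get(key)
        if data is None:
            # Cache miss: fetch from the backing store and cache it again.
            data = self.backing[key]
            self.cache[key] = data
        return data


store = CachedFileStore(cache={}, backing={})
store.upload("report.pdf", b"%PDF-1.7 ...")
store.cache.clear()  # simulate eviction or a Redis restart
assert store.read("report.pdf") == b"%PDF-1.7 ..."
assert "report.pdf" in store.cache  # re-cached after the miss
```

With a real Redis you would probably also set an expiry on cached entries so large files don't stay in memory forever.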
Exactly. A write-through cache is how Userify[0] used to work for self-hosted versions. (When it was a Python app, we used Redis to keep state synced across multiple processes, but now that it's a Go app, we do all the caching and state management in memory using Ristretto[1].)
However, we now install to the local disk filesystem by default, since it's much faster to do a periodic hot sync to S3 (with restic or aws-cli, for example), or to just version the EBS or instance volume, than to treat S3 as the primary backing store. The other reason you might want to use S3 as the primary store is if you use a lot of disk, but our files are compressed and extremely small, even for a large installation with tens of thousands of users and instances.
What were the reasons to move from Redis to Ristretto? The two seem very different, since Redis is distributed whereas Ristretto is local to the process.
In our case, Python (because of the GIL) required us to run a single Python process per core in order to take advantage of multiple cores, so we needed Redis to maintain a unified memory state across all of those processes. Go can automatically span multiple cores from a single process.
We also saw about a 10x speedup by moving all caching into the server process, and since it was all in the same process, we no longer had to compress and encrypt data before sending it to Redis. We still checkpoint the moving server state, encrypted and compressed, to disk every sixty seconds, just like Redis would do with BGSAVE, so we can start back up within a few seconds (actually faster than the old Redis-based setup after a restart).
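For anyone curious what that kind of periodic checkpoint can look like, here is a minimal sketch in Python. It is not our actual code (ours is Go): the function names are invented, JSON plus zlib stands in for the real format, and encryption is left out because the standard library has no suitable cipher. `get_state` is assumed to return a snapshot that is safe to serialize.

```python
import json
import os
import tempfile
import threading
import zlib


def save_checkpoint(state, path):
    """Write the state compressed, then atomically swap it into place."""
    payload = zlib.compress(json.dumps(state).encode("utf-8"))
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # readers never see a half-written file


def load_checkpoint(path):
    """Read a checkpoint back at startup."""
    with open(path, "rb") as f:
        return json.loads(zlib.decompress(f.read()).decode("utf-8"))


def checkpoint_forever(get_state, path, stop, interval=60.0):
    """Save every `interval` seconds until the `stop` event is set."""
    # Event.wait returns False on timeout and True once stop is set.
    while not stop.wait(interval):
        save_checkpoint(get_state(), path)
```

You would run `checkpoint_forever` in a background thread with a `threading.Event` as `stop`, and call `load_checkpoint` once on startup. Writing to a temp file and renaming means a crash mid-write leaves the previous checkpoint intact.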
Files are just a bunch of bytes. No harm in putting them in a database.
There were some benchmarks (I can't find them now) where SQLite was faster than the native file system at retrieving, searching, and adding files to a large directory.
SQLite reads and writes small blobs (for example, thumbnail images) 35% faster than the same blobs can be read from or written to individual files on disk using fread() or fwrite().
Furthermore, a single SQLite database holding 10-kilobyte blobs uses about 20% less disk space than storing the blobs in individual files.
The performance difference arises (we believe) because when working from an SQLite database, the open() and close() system calls are invoked only once, whereas open() and close() are invoked once for each blob when using blobs stored in individual files. It appears that the overhead of calling open() and close() is greater than the overhead of using the database. The size reduction arises from the fact that individual files are padded out to the next multiple of the filesystem block size, whereas the blobs are packed more tightly into an SQLite database.
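To make the idea concrete, storing small files as blobs takes only a few lines with Python's built-in `sqlite3` module. This is just an illustration, not the benchmark code: the table layout and the `put_file`/`get_file` helpers are made up, and it makes no performance claims of its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for a real store
conn.execute("CREATE TABLE files (name TEXT PRIMARY KEY, data BLOB NOT NULL)")


def put_file(name, data):
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO files (name, data) VALUES (?, ?)",
            (name, data),
        )


def get_file(name):
    row = conn.execute(
        "SELECT data FROM files WHERE name = ?", (name,)
    ).fetchone()
    return None if row is None else row[0]


put_file("thumb-001.png", b"\x89PNG fake thumbnail bytes")
assert get_file("thumb-001.png") == b"\x89PNG fake thumbnail bytes"
assert get_file("missing.png") is None
```

The database is opened once and every blob goes through the same connection, which is the open()/close() saving the quote describes.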
0. https://userify.com (ssh key management + sudo for teams)
1. https://github.com/dgraph-io/ristretto