by gpav 2740 days ago
Judging from the date on the reddit post, it's a bit late for my comment (two days after the post), but the real issue is not how to duplicate so many DVDs; it's how to rip those DVDs onto some cheap storage for later burning onto DVDs. So you want high data bandwidth between stacks of DVD drives and the storage, with lots of memory for caching (or however you cache DVD data streams). Fry's had a 4TB external hard drive for $89 this weekend. Maybe have a bunch of SSDs as the intake point for the DVDs, and then offload the ripped copies from the SSDs onto hard drives while you're changing out the DVDs. You would want to use the fastest interfaces available; I've no idea these days what the cool kids are using. (My first hard drive was a 5-1/4" full-height 10 MB MFM. Thought I would never need more storage than that. I think my second HD was 20 MB and used RLL.)

Organizing the DVDs by length would allow optimizing the loading/ripping process to assure minimum time lost waiting for the operator's hands to be free. This kind of planning makes for an interesting project.

It might make an interesting crowd-funded project, if it's reasonably easy to get a research permit. Plan it out, go in with the hardware, come out with the images. Use all the error correction opportunities you've got available.

Do a web search for "bulk dvd ripping" (without quotes) and you'll find lots and lots of discussion and advice, including some about building a dedicated DVD ripping rig. MakeMKV gets good press, in my very quick read of a few posts.

And there's always the option of crowd-funding to raise the exact amount needed to pay off the break-even for Amazon's investment. I can't imagine they'd fight back too hard when looking at a large check vs. a non-performing asset, unless Bezos personally never intended to let the footage go free.

2 comments

A 16x DVD drive outputs around 21MB/s. Any modern hard drive is likely to have 5-12x the sequential write performance of a 16x DVD drive's read rate, so there shouldn't be any need for SSDs.
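As a rough sanity check on that claim, here is a minimal Python sketch of the arithmetic. It assumes purely sequential writes (the best case) and takes the 21MB/s figure from the comment above; the function name and the 150MB/s hard drive figure in the usage note are illustrative assumptions, not measurements.

```python
import math

# Peak output of one 16x DVD drive, as quoted in the comment above.
DVD_16X_MB_S = 21.0


def max_readers(hdd_write_mb_s: float, dvd_mb_s: float = DVD_16X_MB_S) -> int:
    """Return how many DVD readers one hard drive could absorb at peak rate.

    Assumption: the hard drive sustains hdd_write_mb_s for sequential
    writes and loses nothing to seeking between streams.
    """
    return math.floor(hdd_write_mb_s / dvd_mb_s)
```

For example, `max_readers(150.0)` gives 7 under these assumptions. Real throughput with several interleaved streams will be lower.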
> Any modern hard drive is likely to have 5-12x the sequential write performance

The data rate falls through the floor as soon as the access pattern isn't sequential though, which it won't be if you are using multiple readers. While an OS might be bright enough to organise data flowing out of write buffers so it isn't as random as it could be, there is a limit to how far it will go with this, because it is a general-purpose OS and optimising for multiple bulk streams would punish more interactive activity. If you have a tool that bypasses the OS cache and works in large enough blocks you might see better results, except if the write activity from each stream lines up, at which point this will make things worse.

Pulling the data off multiple DVD drives onto an SSD, then switching output to a second SSD once the first is nearly full so that ripping continues while the first one's contents are dumped sequentially to cheaper-per-GB traditional drives, would probably be the way I'd suggest.

In fact, you could get away without swapping between two SSDs: the read activity pulling data from the SSD to a traditional drive is unlikely to have much effect on the write performance for the data coming off the DVDs unless you have a great many readers in one machine. If you are doing this all relatively manually, you can reduce the manual steps: once a DVD copy is complete, add it to a queue to be moved using something like https://en.wikipedia.org/wiki/TeraCopy so you don't have to worry about manually coordinating the SSD-to-cheaper-drive copy operation to keep it sequential.
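If you would rather script the queue than use a GUI tool, the idea can be sketched in a few lines of Python. This is only an illustration, not how TeraCopy works: a single worker thread moves finished rips one at a time, so the hard drive only ever sees one write stream. The function name `start_mover` and the directory layout are hypothetical.

```python
import queue
import shutil
import threading
from pathlib import Path


def start_mover(dest_dir):
    """Start one worker that moves queued files into dest_dir, one at a time.

    Returns (jobs, thread). Put file paths on jobs as rips finish;
    put None to tell the worker to stop.
    """
    dest_dir = Path(dest_dir)
    jobs = queue.Queue()

    def worker():
        while True:
            src = jobs.get()
            if src is None:  # sentinel: no more work
                jobs.task_done()
                break
            # A single mover means a single sequential write stream.
            shutil.move(str(src), str(dest_dir / Path(src).name))
            jobs.task_done()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return jobs, thread
```

The ripping side calls `jobs.put(path)` only after a rip has fully completed, and `jobs.put(None)` at the end of the session.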

Assuming 15 minutes to read each disk (it is a long time since I pulled data off a DVD in bulk, so this is guesswork based on an old memory of it taking a little more than 10 minutes to read a full DVD9 disk, rounded up to 15 to allow for manual process inefficiencies and some disks being slower to extract due to condition causing rereads, etc.) you are looking at wanting 21 or more drives constantly on the go to get the job done in 3 solid 8-hour days (2,000x15/3/8/60 = 20.8). Five laptops, each with an internal SSD (128GB+) to extract to, five DVD readers on USB3 to extract from, and a 4+TB spinning disk (also external) to finally write to, might do the job and have the space (2,000x8.5/5 = ~3.4TB output per laptop). You'll need a powered USB dock for each laptop instead of a passive hub, and you are going to want to add more of everything to allow for the possibility of device failures.
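The back-of-envelope numbers above are easy to rerun with different guesses. A minimal Python sketch, where the function names are made up for illustration and the inputs (2,000 disks, 15 minutes each, 8.5GB per DVD9) are the estimates from this comment rather than measured values:

```python
import math


def drives_needed(discs, minutes_per_disc, days, hours_per_day):
    """Drives that must run constantly to finish within the time window."""
    total_drive_minutes = discs * minutes_per_disc
    available_minutes = days * hours_per_day * 60
    return math.ceil(total_drive_minutes / available_minutes)


def output_tb_per_machine(discs, gb_per_disc, machines):
    """Worst-case output per machine in TB (1 TB = 1000 GB), all disks full."""
    return discs * gb_per_disc / machines / 1000


print(drives_needed(2000, 15, 3, 8))         # 20.8 rounded up
print(output_tb_per_machine(2000, 8.5, 5))   # TB per laptop if every disk is a full DVD9
```

With those inputs it gives 21 drives and 3.4TB per laptop; dropping the per-disk time to 10 minutes brings the drive count down to 14.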

Of course significantly less resource is needed (or you get more contingency time (and/or spare kit to deal with failures) from the same resource) if most of the media is DVD5 and/or not full disks. I've assumed the initial three days is just for obtaining the content - I've not accounted for any other processing (such as indexing and transcoding) or further distribution.

I've done this kind of stuff, and between OS buffering and making sure the ripping software is writing large blocks (say 4-32MB at a time), it's possible to run drives at basically full bandwidth with something less than a dozen streams. There is going to be more inner/outer track bandwidth variation than there is perf falloff going from 1 to 6 streams with large blocks. There are a lot of reasons for this, but a lot of it has to do with data placement effectively combining multiple streams into writes to the same sequential track.
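To make "writing large blocks" concrete, here is a minimal Python sketch of a copy loop that issues one large unbuffered write per block. It is an assumption about what such a tool might do, not any particular ripper's implementation; `copy_in_large_blocks` and the 16MiB default are illustrative, and this does not bypass the OS page cache.

```python
def copy_in_large_blocks(src, dst, block_size=16 * 1024 * 1024):
    """Copy src to dst in large blocks; return the number of bytes copied.

    buffering=0 gives raw file objects, so each write() call hands the OS
    a whole block instead of many small buffered chunks.
    """
    copied = 0
    with open(src, "rb", buffering=0) as fin, open(dst, "wb", buffering=0) as fout:
        while True:
            block = fin.read(block_size)
            if not block:  # end of input
                break
            view = memoryview(block)
            # Raw writes may be partial, so loop until the block is out.
            while len(view) > 0:
                written = fout.write(view)
                view = view[written:]
                copied += written
    return copied
```

Here `src` could be an ISO image or, on Linux, a device node such as `/dev/sr0` (an assumption about your setup), with one such copy running per DVD drive.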

More interesting is that even "sequential" reads/writes already have seek times built in, because HDs aren't spiral-track: head switching and track-to-track seek (and the associated rotational latency/finding the servo track) are inherent in sequential IO perf. Most filesystem placement/schedulers aren't going to place 3 files being written at the same time on opposite sides of a disk, so those head switch and track seek times see nearly immeasurable increases; the drive itself is also buffering a large part of a track write, and moving 3 tracks and a head is basically the same as just moving a head.

there's an easy fix. suppose Ubuntu or openSUSE Linux: run your install on an SSD of at least 128 GB. set up a swapfile of at least 64 GB right on your root partition & make sure it's enabled as swap (put it in /etc/fstab or do it manually each boot). now attach & mount your hi capacity storage.

just have your script queue up a few in /tmp before moving to the mounted storage.

pretty easy & now you have multi level buffering caches that Linux knows how to work with efficiently & that can nearly guarantee sequential writes

> research permit

The government has to explicitly permit you to do research?

What the fuck is wrong with this world?!

It gives access to an archive containing a huge quantity of irreplaceable material. Here are the requirements: https://www.archives.gov/research/start/researcher-card.html