by gwern 3236 days ago
Until today's release of the headless Linux client, you still had to run the full StarCraft program, which gets expensive fast. And it massively complicates the workflow to have to play through every game serially to recreate the state rather than simply reading random rows of data from a 300GB dataframe on disk.

Oh I see, thanks, I didn't know. But man, 300 GB per game sounds completely nuts!
No, total. For comparison they quote the replay files at what was it, 5GB? It's a classic space-time tradeoff, but in deep learning right now, hard drives are far cheaper than CPUs/GPUs. Playing out the games as you need individual datapoints would probably be at least twice as slow, while anyone can easily store 300GB these days.
I believe the 400GB is the total amount for the 65000 different game replays.
@wfunction: yes, TorchCraft includes a serializer that compresses the useful game state into a relatively small struct. That is then further compressed with other tricks and zstd.
Oh but how does that work? That's ~6 MB per game which sounds like just a list of actions rather than precomputed data per frame. Is it compressed somehow?
"The full dataset after compression is 365 GB, 1535 million frames, and 496 million player actions." - Yes
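For a rough sense of scale, here is the back-of-the-envelope arithmetic on the numbers quoted in this thread. It is only a sketch: it assumes decimal units (1 GB = 10^9 bytes) and takes the 65000 replay count from the comment above, and the variable names are mine.

```python
# Figures as quoted in the thread; decimal GB is an assumption.
DATASET_BYTES = 365e9   # "365 GB" after compression
FRAMES = 1535e6         # "1535 million frames"
ACTIONS = 496e6         # "496 million player actions"
REPLAYS = 65000         # replay count mentioned in the thread


def per_unit(total, count):
    """Average amount of `total` attributable to each of `count` items."""
    return total / count


mb_per_replay = per_unit(DATASET_BYTES, REPLAYS) / 1e6
bytes_per_frame = per_unit(DATASET_BYTES, FRAMES)
frames_per_replay = per_unit(FRAMES, REPLAYS)
actions_per_replay = per_unit(ACTIONS, REPLAYS)

print(f"{mb_per_replay:.1f} MB per replay")
print(f"{bytes_per_frame:.0f} bytes per frame")
print(f"{frames_per_replay:.0f} frames per replay")
print(f"{actions_per_replay:.0f} actions per replay")
```

So the ~6 MB per game works out to a couple of hundred bytes per stored frame on average, which is per-frame state after compression rather than just an action list.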
FYI, there are two things being discussed here. The dataset linked in the comment above is for Brood War. The headless client released today is for SC2.
I am aware of that. The point remains the same: both Brood War and SC2 are expensive to run, so you really don't want to replay games on demand, and it's worth spending disk space to cache the results of playing out the replay files. This will probably also be true of the replay files DM/Blizzard will be releasing, even with the lite client.
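To make the tradeoff concrete, here is a toy sketch of the caching idea. It does not use TorchCraft or any real replay format: `play_out_replay` is a hypothetical stand-in for the expensive game engine, and the three-field frame record is invented for illustration. The point is only that once frames are written as fixed-size records, any single frame is one seek away instead of a full serial playthrough.

```python
import os
import struct
import tempfile

# Hypothetical fixed-size frame record: frame id, minerals, score.
RECORD = struct.Struct("<IIf")


def play_out_replay(n_frames):
    """Stand-in for the expensive step: frames only come out in order."""
    for i in range(n_frames):
        yield (i, 50 + 8 * i, i * 0.5)


def cache_replay(path, n_frames):
    """Play the replay once and write every frame to disk."""
    with open(path, "wb") as f:
        for frame in play_out_replay(n_frames):
            f.write(RECORD.pack(*frame))


def read_frame(path, index):
    """Fetch one frame by seeking straight to its offset."""
    with open(path, "rb") as f:
        f.seek(index * RECORD.size)
        return RECORD.unpack(f.read(RECORD.size))


cache_path = os.path.join(tempfile.mkdtemp(), "replay.bin")
cache_replay(cache_path, 1000)
print(read_frame(cache_path, 742))
```

Without the cache, getting frame 742 means generating frames 0 through 741 first; with it, the cost is one seek and a 12-byte read, paid for in disk space.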