by dpflan 3238 days ago
Related: Today I learned that a group of AI researchers has released a paper called: STARDATA: A StarCraft AI Research Dataset. According to one of the authors: "We're releasing a dataset of 65k StarCraft: Brood War games, 1.5b frames, 500m actions, 400GB of data. Check it out!"

> Article: https://arxiv.org/abs/1708.02139

> Github: https://github.com/TorchCraft/StarData

1 comments

The great thing about this is that it includes the game state throughout the game. It's been pretty easy to find lots of StarCraft replays, but the replays only include enough information to recreate the game (basically just the player actions). If you wanted to know what was happening in the game at the time the player made an action, you had to load up StarCraft and simulate the game up to that point. This dataset has already run the game for you and provided the data!
Is it that much computation to simulate an entire game? You obviously don't need to render the graphics or anything, it should just be a list of events that occur, which doesn't seem all that slow to process.
Until today's release of the headless Linux client, you still had to run the full StarCraft program, which gets expensive fast. And it massively complicates the workflow to have to play through every game serially to recreate the state rather than simply reading random rows of data from a 300GB dataframe on disk.
Oh I see, thanks, I didn't know. But man, 300 GB per game sounds completely nuts!
No, that's the total. For comparison, they quote the replay files at, what was it, 5GB? It's a classic space-time tradeoff, but in deep learning right now, hard drives are far cheaper than CPUs/GPUs. Playing out the games as you need individual datapoints would probably be at least twice as slow, while anyone can easily store 300GB these days.
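To make the space-time tradeoff concrete, here is a toy sketch in Python. It is not StarCraft or TorchCraft code: `step` is a hypothetical stand-in for one deterministic engine tick, and the "state" is a single integer. The point is only that recreating frame N from a replay costs N ticks, while a precomputed table answers any frame with one lookup.

```python
import random

def step(state: int) -> int:
    """Hypothetical deterministic update, standing in for one engine tick."""
    return (state * 1103515245 + 12345) % 2**31

def state_by_replay(seed: int, frame: int) -> int:
    """Recreate the state at `frame` by simulating from frame 0: O(frame) work."""
    state = seed
    for _ in range(frame):
        state = step(state)
    return state

def build_cache(seed: int, n_frames: int) -> list[int]:
    """Play the game once and keep every frame's state: O(n_frames) storage."""
    states = [seed]
    for _ in range(n_frames - 1):
        states.append(step(states[-1]))
    return states

N_FRAMES = 10_000
cache = build_cache(seed=42, n_frames=N_FRAMES)

# Random sampling, as a training loop would do: each lookup is O(1) with the
# cache, while the replay path re-simulates from the start every time.
rng = random.Random(0)
for frame in (rng.randrange(N_FRAMES) for _ in range(5)):
    assert cache[frame] == state_by_replay(42, frame)
```

The cache costs storage proportional to the number of frames, which is the same trade the dataset makes at a much larger scale.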
I believe the 400GB is the total amount for the 65000 different game replays
@wfunction: yes, TorchCraft includes a serializer that compresses the useful game state into a relatively small struct. That is then further compressed with other tricks and zstd.
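A rough illustration of that idea in Python, with loud caveats: the record layout below is made up and is not TorchCraft's actual struct, the XOR delta is only one guess at what "other tricks" could look like, and `zlib` from the standard library stands in for zstd. It assumes every frame has the same number of units so frames are the same size.

```python
import struct
import zlib

# Hypothetical per-unit record: id, x, y, hit points (NOT TorchCraft's layout).
UNIT = struct.Struct("<HhhH")

def pack_frame(units) -> bytes:
    """Serialize one frame's units into a compact fixed-size byte string."""
    return b"".join(UNIT.pack(*u) for u in units)

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length frames; unchanged bytes become zeros."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(frames) -> bytes:
    """Delta-encode each frame against the previous one, then compress."""
    prev, deltas = bytes(len(frames[0])), []
    for frame in frames:
        deltas.append(xor_bytes(prev, frame))
        prev = frame
    return zlib.compress(b"".join(deltas), 9)  # zlib as a stand-in for zstd

def decode(blob: bytes, frame_size: int) -> list:
    """Invert encode(): decompress, then undo the deltas frame by frame."""
    raw, prev, frames = zlib.decompress(blob), bytes(frame_size), []
    for i in range(0, len(raw), frame_size):
        prev = xor_bytes(prev, raw[i:i + frame_size])
        frames.append(prev)
    return frames

# Toy game: 50 units, 200 frames, exactly one unit moves per frame.
units = [[i, i * 10, i * 7, 100] for i in range(50)]
frames = []
for t in range(200):
    units[t % 50][1] += 1
    frames.append(pack_frame(units))

raw_size = sum(len(f) for f in frames)
blob = encode(frames)
assert decode(blob, len(frames[0])) == frames
assert len(blob) < raw_size
```

Because most of the state is unchanged from one frame to the next, the deltas are mostly zeros and compress well, which is one plausible reason per-frame data can stay small.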
Oh but how does that work? That's ~6 MB per game which sounds like just a list of actions rather than precomputed data per frame. Is it compressed somehow?
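For reference, the ~6 MB figure falls out of the numbers quoted in the announcement (65k games, 1.5b frames, 400GB). A quick check in Python, assuming decimal gigabytes and taking the rounded figures at face value:

```python
GB = 10**9                    # assumption: decimal gigabytes
total_bytes = 400 * GB
games = 65_000
frames = 1_500_000_000

per_game_mb = total_bytes / games / 10**6   # about 6.15 MB per game
per_frame_bytes = total_bytes / frames      # about 267 bytes per frame
frames_per_game = frames / games            # about 23,077 frames per game

print(f"{per_game_mb:.2f} MB/game, {per_frame_bytes:.0f} B/frame, "
      f"{frames_per_game:.0f} frames/game")
```

So the stored data averages a few hundred bytes per frame, which is small for full game state but more than a bare action list would need.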
FYI, there are two things being discussed here. The dataset linked in the comment above is for Brood War. The headless client released today is for SC2.
I am aware of that. The point remains the same: both Brood War and SC2 are expensive to run, so you really don't want to run them repeatedly, and it's worth spending disk space to cache the results of playing out a replay file. This will probably also be true of the replay files DM/Blizzard will be releasing, even with the lite client.