I think it's noteworthy that they even include command line arguments that are mistyped, for example "bulid".
What happens if you accidentally paste an AWS secret key or similar in the middle of a command line argument? Will that too appear in public csv files a year later?
Hi. Team member here. We used a simple algorithm to prevent that. We essentially got the data itself to vote on what a real command was for exactly this reason. This means that a lot of people typed "bulid" since the vote passed on that one. I don't have a count, but many rows were not included in the data since they didn't pass the minimum threshold for being a real command. Imagine you spelled "build" backwards for some reason. That would have been quite uncommon.
Well, it looks like they are including "command verbs" even if they are mistyped, for example "bulid".
What happens if you accidentally paste an AWS secret key or similar in the middle of a command verb? Will that too appear in public csv files a year later?
See my comment to the grandparent comment on our approach to only including common command strings (which wouldn't include anyone's AWS key). Also, and more importantly, we will only collect known arguments. From the blog post:
> Only known arguments and options will be collected (not arbitrary strings).
We don't want your AWS secret key in this data as much as you do. We have put systematic mitigations in place to ensure that this doesn't happen.
What happens if you accidentally paste an AWS secret key or similar in the middle of a command line argument? Will that too appear in public csv files a year later?