by learnstats2 3974 days ago
>We use the public GREYC keystroke benchmark database

Yes. That's their own database which they're talking up, the one that they made to do this research. That's what I was talking about.

>In order to reduce the bias due to this high quantity of male information, we only kept the first n male samples( where n is the number of female samples).

As it happens, I hadn't read this part.

On reflection, what I understand now is far worse than what I originally understood:

- They have 35 females and 98 males, and they take many handwriting samples from each.

- Since the participants provided many samples, these samples appear both in the training set data and in the test set data.

- I use the training set data to figure out if I can recognise the handwriting of the 35 female participants.

- Then I look through the test data to see if I can identify those participants again.

Basically, what you've shown is that you can identify the handwriting of 35 people if you've already seen it, 88% of the time.

Splitting groups into 'female' and 'male' is a red herring. This method would presumably work even if I split them into two random groups.
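
To make that concrete, here is a toy sketch of the effect I mean. Everything in it is an assumption for illustration: the data is synthetic (not the GREYC database), the classifier is a plain 1-nearest-neighbour rather than whatever the paper used, and the names (simulate, nearest_label, accuracy) are made up. Each person gets a random "signature" and a random group label that has nothing to do with that signature.

```python
import random

def simulate(n_people=70, n_samples=10, dim=5, noise=0.1, seed=0):
    """Synthetic rows of (person_id, group_label, feature_vector)."""
    rng = random.Random(seed)
    rows = []
    for pid in range(n_people):
        label = rng.randint(0, 1)  # random group, unrelated to the features
        centre = [rng.gauss(0, 1) for _ in range(dim)]  # this person's signature
        for _ in range(n_samples):
            rows.append((pid, label, [c + rng.gauss(0, noise) for c in centre]))
    return rows

def nearest_label(train, x):
    """Group label of the training row closest to x (squared Euclidean distance)."""
    best = min(train, key=lambda r: sum((a - b) ** 2 for a, b in zip(r[2], x)))
    return best[1]

def accuracy(train, test):
    hits = sum(nearest_label(train, x) == label for _, label, x in test)
    return hits / len(test)

rows = simulate()

# Leaky split: alternate samples, so every person is on both sides.
leaky_acc = accuracy(rows[0::2], rows[1::2])

# Clean split: hold out whole people, so test people are never seen in training.
clean_train = [r for r in rows if r[0] % 2 == 0]
clean_test = [r for r in rows if r[0] % 2 == 1]
clean_acc = accuracy(clean_train, clean_test)

print(f"same people in train and test: {leaky_acc:.2f}")
print(f"unseen people in test:         {clean_acc:.2f}")
```

If the reasoning holds, the first number should come out high even though the labels are random, because the classifier is just re-identifying people it has already seen, and the second should hover around chance. Holding out whole participants is the split I would want to see before believing a gender result.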

If I'm right, this is not even state-of-the-art. In 2006 they could have been scoring 96%: http://abcnews.go.com/Technology/story?id=97978&page=2