| HN Mirror


by HDMI_Cable 361 days ago

| The main challenge in building content-aware memory models is lack of data. To my knowledge, no publicly available dataset exists that contains real-world usage data with both card textual content and review histories.

I wonder if the author has ever considered reaching out to the makers of Anki decks used by premeds and medical students, like the AnKing [1]. They create Anki decks for users studying for the MCAT and various med school curricula, so they a) have relatively stable deck content (which is very well annotated and contains lots of keywords that would make semantic grouping quite easy), b) probably have loads of statistics on user reviews (since they have an Anki add-on that sends telemetry to their team to make the decks better, IIRC), and c) cover incredibly disparate information (all the way from high-school physics to neurochemistry).

---

[1]: https://www.theanking.com

1 comments

ran3000 360 days ago

It would be awesome to work on that data. I'm afraid of the privacy implications though.

HDMI_Cable 360 days ago

What sort of privacy implications? I'd imagine that Anki data would be relatively privacy-concern free, as it contains no PII, and for the AnKing decks, all of the content is standardized and so wouldn't contain personal notes. Though, having never worked with this data, please let me know if I'm wrong!

Also, having used those decks in the past, downloaded the add-on, and looked at the monetization structure of developers like the AnKing, I would be very surprised if aggregate data on review statistics wasn't collected in some way. I.e., if the AnKing is already collecting this data to design better decks and understand which cards are the hardest (probably to target individual support), then I imagine that collecting some anonymized version of that data wouldn't be too much of a stretch.

Plus, considering that the developers of AnKing-style decks are all doctors, they probably have a pretty good grasp of handling PII and could (hopefully) make pretty sound decisions on whether to give you access :)

ran3000 360 days ago

You're right, it might work by restricting to just AnKing data. My concern was around other, possibly personal, cards making their way into the dataset.