
by mallamanis 2456 days ago

[I'm one of the Microsoft Research people who worked on this]

We did consider adding StackOverflow questions. Some of our queries in the CodeSearchNet challenge do actually come from StackOverflow (via StaQC [1]). It's certainly interesting to see how all other SO data can be useful for this task. Thanks for the suggestion!

The reason we didn't try this at this point:

Many researchers have tried working with SO data. In my experience, the data has an interesting problem: it's deduplicated! This is great for users but bad for machine learning, since the data looks "sparse" (roughly, each concept appears once). Sparsity is an obstacle because most existing machine learning methods struggle to generalize from sparse data. In contrast, natural language corpora usually contain (e.g.) multiple articles describing more or less the same event.
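
To make the "each concept appears once" point concrete, here is a toy sketch. The concept labels, the example texts, and the `examples_per_concept` helper are all made up for illustration; none of this comes from SO, StaQC, or CodeSearchNet.

```python
from collections import Counter


def examples_per_concept(corpus):
    """Mean number of examples per concept in a list of (concept, text) pairs."""
    counts = Counter(concept for concept, _ in corpus)
    return sum(counts.values()) / len(counts)


# Hypothetical deduplicated Q&A data: every concept shows up exactly once.
deduplicated = [
    ("reverse a list", "How do I reverse a list in Python?"),
    ("read a file", "How to read a text file line by line?"),
    ("parse json", "How can I parse a JSON string?"),
]

# Hypothetical news-like data: several texts describe the same event.
redundant = [
    ("election result", "Candidate A wins the election"),
    ("election result", "Voters hand victory to Candidate A"),
    ("election result", "Election night ends with a win for A"),
    ("storm landfall", "Storm makes landfall on the coast"),
    ("storm landfall", "Coastal towns hit as storm arrives"),
    ("storm landfall", "Storm reaches shore overnight"),
]

dedup_avg = examples_per_concept(deduplicated)
redundant_avg = examples_per_concept(redundant)
print(f"deduplicated: {dedup_avg:.1f} examples per concept")
print(f"redundant:    {redundant_avg:.1f} examples per concept")
```

With one example per concept, a model never sees two phrasings of the same idea, which is roughly what "sparse" means here.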

[1] https://ml4code.github.io/publications/yao2018staqc/