by throwaway314155 478 days ago
Great explanation, thanks. I have some followups if you have the time!

a.) Why does this work as well as it does? Why does compression (or having fewer parameters) encourage better answers in this instance?

b.) Will it naturally transfer to other benchmarks that evaluate different domains? If so, does that imply an approach similarly robust to pre-training that can be used for different domains/modalities?

c.) It works 20-30% of the time - do the researchers find any reason to believe that this could "scale" up in some fashion so that, say, a single larger network could handle any of the problems, rather than needing a new network for each problem? If so, would it improve accuracy as well as robustness?

1 comment

Boo, go read the other comments that explain all of this instead of wasting people's time.
> I have some followups *if you have the time*

Emphasis mine. No one should feel obligated to answer my questions. I had hoped that was obvious.