by throwaway314155
478 days ago
Great explanation, thanks. I have some follow-ups if you have the time!

a.) Why does this work as well as it does? Why does compression (fewer parameters) encourage better answers in this instance?

b.) Will it naturally transfer to other benchmarks that evaluate different domains? If so, does that imply an approach similarly robust to pre-training that can be used for different domains/modalities?

c.) It works 20-30% of the time. Do the researchers find any reason to believe that this could "scale" up in some fashion so that, say, a single larger network could handle any of the problems, rather than needing a new network for each problem? If so, would it improve accuracy as well as robustness?