by neonbjb
781 days ago
I'm James Betker. Of course architecture matters in this regard lol. Comparing a CNN to a transformer is like comparing two children brought up in the same household, one of whom has a severe disability. What I meant in this blog post was that given two NNs which have the same basic components, are sufficiently large, and are trained long enough on the same dataset, the "behavior" of the resulting models is often shockingly similar. "Behavior" here means the typical (mean, heh) responses you get from the model. This is a function of your dataset distribution.

Edit: Perhaps it'd be best to give a specific example. Let's say you train two pairs of networks:
(1) A Mamba SSM and a Transformer on the Pile.
(2) Two transformers, one trained on the Pile, the other trained on Reddit comments.
All are trained to the same MMLU performance. I'd bet big money that the average responses you get when sampling from the models in (1) are nearly identical, whereas those from the two models in (2) will be quite different.
You, sir, are my hero.