by karalala 769 days ago
Already seeing major flaws in the paper.

The benchmarking in Table 1 is extremely questionable. Their table basically contradicts the results from multiple peer-reviewed papers, especially for RNNs, which report results much closer to baseline transformers (and ran much larger experiments btw).

On page 40 they mention that all models are trained with the same lr for comparability.

> This contradicts their own scaling laws table, which uses different lrs for different models

> And no, it is not a fair comparison to use the same lr to test all these different models. The benchmarking results just look like they are using hyperparameters tuned for their own model, which happen not to work for the other models.
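
To make the shared-lr point concrete, here is a toy sketch. Everything in it is made up for illustration: the model names, the U-shaped loss curve, and the numbers are hypothetical and come from no paper or benchmark. It only shows how a ranking at one fixed lr can flip once each model gets its own lr sweep.

```python
import math

# Toy illustration only: the loss curve and all numbers are hypothetical.
def toy_final_loss(lr, best_lr, best_loss):
    # Assumed U-shaped curve in log-lr space: loss grows as lr moves away from best_lr.
    return best_loss + 0.5 * (math.log10(lr) - math.log10(best_lr)) ** 2

# Hypothetical models: name -> (lr at which the model does best, loss at that lr)
models = {
    "model_a": (1e-3, 2.50),
    "model_b": (3e-4, 2.45),
}

lr_grid = [1e-4, 3e-4, 1e-3, 3e-3]

def compare_fixed(lr):
    # Every model trained with the same lr.
    return {name: toy_final_loss(lr, *params) for name, params in models.items()}

def compare_tuned(grid):
    # Every model gets its own sweep; report the best loss found on the grid.
    return {
        name: min(toy_final_loss(lr, *params) for lr in grid)
        for name, params in models.items()
    }

fixed = compare_fixed(1e-3)   # lr happens to be model_a's optimum
tuned = compare_tuned(lr_grid)

print("fixed lr winner:", min(fixed, key=fixed.get))
print("tuned lr winner:", min(tuned, key=tuned.get))
```

In this toy setup the fixed-lr comparison favours whichever model the shared lr happens to suit, while the per-model sweep ranks them by their best achievable loss. Whether that is what happened in the paper is not something this sketch can show.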


You should publish a response paper and get them to retract their paper if it has major flaws.
It's xLSTM contradicting existing peer-reviewed papers lmao. Either xLSTM should fix their benchmarks or the existing peer-reviewed papers should be retracted.

RWKV-v6 > RWKV-v5 > RWKV-v4, not the other way round obviously. HGRN 8 ppl worse than baseline transformers? NIPS 2023 spotlight paper btw.

Are you saying this is obvious because people have published the exact same benchmarks which are 100% comparable in journals? If so where are they? I have seen quite a few published benchmarks that could not quite be reproduced, tbh. So, again, what makes this "obvious" to you?
I thought it was common knowledge that architecture comparisons in papers aren't worth the paper they're printed on; there are so many ways to deliberately or accidentally structure things to favour one architecture over the others. Ultimately the LMSYS Chatbot Arena will be the final judge.
True, but they normally aren't this far off. HGRN claims to outperform the transformer for a 1B-parameter model trained on the Pile. HGRN performing 8 ppl worse suggests that it's useless.
My experience: many are far off, and most of the time the published tables of different papers are hard to compare. If you assert that these results are flawed, I would like to see more substance (code, reproduction, ...). And for balance: for the same reason, it is hard to verify the accuracy of these results without further insight.
So many papers play tricks with the learning rate schedule: https://arxiv.org/abs/2307.06440
Could you explain for a dum-dum?
The xLSTM results are promising but will need larger-scale experiments.

However, they completely messed up the benchmarking experiments for various RNN models whose own papers claim comparable or even better performance than the base transformer.

These experiments seem pretty large already though, no? How are you so sure they messed up benchmarking? Is the code out already?