by lukebechtel
331 days ago
> H-Net demonstrates three important results on language modeling:
> 1. H-Nets scale better with data than state-of-the-art Transformers with BPE tokenization, while learning directly from raw bytes. This improved scaling is even more pronounced on domains without natural tokenization boundaries, like Chinese, code, and DNA.
> 2. H-Nets can be stacked together to learn from deeper hierarchies, which further improves performance.
> 3. H-Nets are significantly more robust to small perturbations in input data like casing, showing an avenue for creating models that are more robust and aligned with human reasoning.
paper