I've been experimenting with transforming Microsoft's Phi-3.5 into a byte-level language model with RetNet-inspired elements. The result is RetNPhi, a hybrid model that combines the strengths of Phi-3.5 with the efficiency of RetNet.

Key features:
- Byte-level processing for universal file type handling
- RetNet's multi-scale exponential decay and group normalization for efficient long-range dependency modeling
- Recurrent inference mode with constant memory usage, regardless of sequence length
- Minimal fine-tuning: only post-layer norms, first token embedding layer, and LoRA on self-attention output projections (o_proj) are adjusted
- Surprisingly coherent output after training on just 64 lines of Tiny Shakespeare

Technical details:
- Based on Microsoft's Phi-3.5 architecture
- Implements RetNet's retention mechanism
- Uses LoRA for efficient adaptation of pretrained weights
- Dual-mode processing: parallel for training, recurrent for inference

Sample output (input: "first citi"):

zen:
you are all resolved rather to die than to fam

This approach could lead to more efficient, locally-runnable language models. The byte-level processing opens up interesting possibilities for handling various data types, while the recurrent inference mode could be a game-changer for running these models on consumer-grade hardware.

I'm particularly interested in feedback on:
1. Potential applications for a byte-level LM with efficient long-context handling
2. Thoughts on the hybridization of Transformer-based models (like Phi) with RetNet concepts
3. Ideas for further optimizing the model for local deployment

GitHub: https://github.com/JosefAlbers/Phi-3-Vision-MLX/blob/main/as...
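
For anyone unfamiliar with the dual-mode idea mentioned above, here is a minimal pure-Python sketch of why the parallel and recurrent forms of retention give the same output. It is not the RetNPhi code: every function name is hypothetical, random lookup tables stand in for learned q/k/v projections, and group normalization, position rotation, gating, and LoRA are all left out. The per-head decay schedule is the one I recall from the RetNet paper, so treat it as an assumption too.

```python
import random

def byte_ids(text):
    """Byte-level 'tokenization': UTF-8 bytes, so the vocabulary is 0..255."""
    return list(text.encode("utf-8"))

def head_decays(num_heads):
    """Multi-scale decay: one gamma per head (schedule assumed from RetNet)."""
    return [1.0 - 2.0 ** (-5 - h) for h in range(num_heads)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retention_parallel(q, k, v, gamma):
    """Training-style form: o_n = sum over m <= n of gamma^(n-m) (q_n . k_m) v_m."""
    d_v = len(v[0])
    out = []
    for n in range(len(q)):
        o = [0.0] * d_v
        for m in range(n + 1):  # causal: only current and past positions
            weight = dot(q[n], k[m]) * gamma ** (n - m)
            for j in range(d_v):
                o[j] += weight * v[m][j]
        out.append(o)
    return out

def retention_recurrent(q, k, v, gamma):
    """Inference-style form: S_n = gamma * S_{n-1} + k_n^T v_n, o_n = q_n S_n."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # fixed size for any sequence length
    out = []
    for q_n, k_n, v_n in zip(q, k, v):
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = gamma * state[i][j] + k_n[i] * v_n[j]
        out.append([sum(q_n[i] * state[i][j] for i in range(d_k))
                    for j in range(d_v)])
    return out

def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

random.seed(0)
D_K, D_V = 4, 3  # toy head dimensions, chosen arbitrarily
ids = byte_ids("first citi")

# Hypothetical stand-ins for learned projections: one random row per byte value.
tables = {
    name: [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(256)]
    for name, dim in (("q", D_K), ("k", D_K), ("v", D_V))
}
q = [tables["q"][i] for i in ids]
k = [tables["k"][i] for i in ids]
v = [tables["v"][i] for i in ids]

# Compare both modes for each head's decay rate.
diffs = [
    max_abs_diff(retention_parallel(q, k, v, g), retention_recurrent(q, k, v, g))
    for g in head_decays(3)
]
print(ids)
print(diffs)
```

The printed differences should sit at floating-point noise. In the recurrent form the only thing carried between steps is the `D_K` x `D_V` state matrix, which is where the constant-memory claim comes from; the parallel form revisits every earlier position instead.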