BTW, for anyone who might not be aware of it, this model trained by Intel based on the Mistral architecture is probably the single best general 7B model available currently:
The Intel one had supervised fine-tuning with the SlimOrca dataset, and then DPO alignment on top of that using a preference dataset.
The technique for generating the preference data is what’s so interesting about that one. Instead of having human labelers choose a preferred response, they generated a response from a small model and a large model, and then always selected the large one’s as the preferred response.