Quite honestly not in my experiments. I wanted to do some Bayesian hyperparameter optimization with some discretized options like noise/no-noise and n_expert/top_k but haven't been able to find the time or free time in one of our GPU clusters. I plan on using perplexity as this is not yet instruction fine tuned.