What do you think about the train/test discrepancy? I.e., will practitioners have to fine-tune NUBIA's models on their training dataset in order to evaluate on their test dataset?
The aggregators in NUBIA are pretrained to correlate with human judgement, so for now NUBIA should only be used for inference, but the idea is that you could use it as a loss function to optimize translation, image captioning, or summarization. It’s too big for that as is, but that's what we’re working towards.
I think the question here is more along the lines of "If I now have, say, radiology reports, do I use NUBIA out of the box, or do I need to make it read radiology reports and get a sense of what high-quality radiology reports look like before using it?"
At least 3 datasets go into making a NUBIA model:
- The general dataset used to train the language model before it is fine-tuned to extract semantic similarity, logical entailment, and grammaticality (e.g., Wikipedia)
- The dataset used to fine-tune the semantic similarity module and logical inference scorer
- The dataset used to train the aggregator that predicts human judgement
So far, the experiments have actually shown that, without any fine-tuning, the NUBIA model trained to assess machine translations agrees with human judgement on image captions better than the metrics specifically designed to assess image captions.
For more advanced cases like, say, scoring medical reports, where grammaticality doesn't matter as much, it may have to be fine-tuned. This is actually not unlike human training, where experts are trained on "what to look for".
The nice thing about this modular architecture and its interpretable scores is that they give you a lot of flexibility to study the individual components and their emergent properties, and to make a judgement call on whether or not to fine-tune.
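To make that judgement call a bit more concrete, here is a minimal sketch of one way to go about it: score a small, human-rated sample from your own domain, then check how well each component score and the aggregate score track the human ratings. Everything in it is an assumption for illustration, not NUBIA's actual API: the feature names, the `needs_finetuning` helper, the 0.5 threshold, and the toy numbers are all made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant signal carries no information
    return cov / sqrt(var_x * var_y)


def component_agreement(feature_scores, human_scores):
    """Correlate every component's scores with the human ratings."""
    return {
        name: pearson(values, human_scores)
        for name, values in feature_scores.items()
    }


def needs_finetuning(agreement, aggregate_key="aggregate", threshold=0.5):
    """Hypothetical rule of thumb: flag the metric for fine-tuning when the
    aggregate score correlates poorly with human ratings on the sample."""
    return agreement[aggregate_key] < threshold


# Toy, made-up scores for five candidate/reference pairs (not real results).
toy_features = {
    "semantic_similarity": [0.9, 0.7, 0.4, 0.2, 0.8],
    "logical_agreement": [0.8, 0.6, 0.5, 0.1, 0.9],
    "grammaticality": [0.3, 0.9, 0.2, 0.8, 0.5],
    "aggregate": [0.85, 0.7, 0.4, 0.3, 0.8],
}
toy_human = [0.9, 0.6, 0.4, 0.1, 0.85]

agreement = component_agreement(toy_features, toy_human)
for name, r in sorted(agreement.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>20}: r = {r:+.2f}")
print("fine-tune?", needs_finetuning(agreement))
```

In a setting like the medical-report case, a component that barely tracks expert ratings (grammaticality, say) while the others still do would suggest retraining only the aggregator on domain ratings rather than the whole stack. That is a reading of the modular design, not something the experiments above have tested.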