by sourabh03agr 1180 days ago
A few details on our approach:

Step 1 - Use the visualisation functionality (e.g. UMAP on BERT embeddings) to qualitatively assess any distribution shift.

From here, we saw that the DialogSum and SAMSum datasets indeed fall into two different clusters, so we can expect performance degradation due to data drift.
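
As a rough numeric complement to the visual check, one could compare the centroids of the two embedding sets. This is only a sketch of a swapped-in technique (centroid cosine distance), not what the UMAP plot computes, and the 3-d vectors below are made-up stand-ins for real BERT embeddings.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, larger means further apart."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy stand-ins for sentence embeddings (assumption: real ones would come from BERT).
reference_embeddings = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [1.0, 0.0, 0.1]]
production_embeddings = [[0.1, 0.9, 0.2], [0.0, 1.0, 0.1], [0.2, 0.8, 0.0]]

# Distance between the two datasets' centroids.
drift_score = cosine_distance(
    centroid(reference_embeddings), centroid(production_embeddings)
)
# Sanity check: a dataset compared with itself should score about zero.
self_score = cosine_distance(
    centroid(reference_embeddings), centroid(reference_embeddings)
)
```

A large drift_score relative to self_score suggests the two sets sit in different regions of embedding space, which is what the two separate clusters showed visually.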

Step 2 - Use statistical techniques to identify clusters near a few low-performing samples (selected by us) and find the data points belonging to them.

Interestingly, this gave us a nice collection of low-performing data points (accuracy ~40% lower than on the whole dataset). On manual inspection, we saw some interesting patterns in the model's failures, which we will use to generate retraining datasets.
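
To make the idea concrete, here is a minimal sketch that collects every point lying within a fixed radius of a hand-picked seed in embedding space. The radius-based neighbourhood, the function names and the toy 2-d points are all assumptions for illustration; the actual statistical technique in the tool may differ.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def neighbours_of_seeds(points, seed_ids, radius):
    """Return sorted ids of points within `radius` of any seed (seeds included)."""
    selected = set()
    for seed_id in seed_ids:
        seed_vector = points[seed_id]
        for point_id, vector in points.items():
            if euclidean(seed_vector, vector) <= radius:
                selected.add(point_id)
    return sorted(selected)

# Toy embeddings: "a", "b", "c" form one group, "d" and "e" another.
points = {
    "a": [0.0, 0.0],
    "b": [0.1, 0.1],
    "c": [0.2, 0.0],
    "d": [5.0, 5.0],
    "e": [5.1, 4.9],
}

# "a" plays the role of a manually selected low-performing sample.
candidates = neighbours_of_seeds(points, seed_ids=["a"], radius=0.5)
```

The resulting candidates are the points one would then inspect manually for shared failure patterns.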

Step 3 - Use UpTrain's Custom Signal interface to define rules for collecting edge-cases. We defined two rules:

1. We saw that the model outputs incomplete summaries when the input text is too long. Hence, we defined a rule on the number of words in the input conversation.

2. In many cases, we saw that the model selects one or two sentences from the conversation as the summary. This generally works well but fails badly when the conversation is all about negating those sentences. We defined a rule to catch such cases.
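
The two rules above can be sketched as plain Python predicates. This is not the Custom Signal API itself: the function names, the 200-word threshold and the negation cue list are hypothetical, and the negation check is only a crude keyword heuristic standing in for whatever rule was actually used.

```python
import re

# Hypothetical threshold; the real cut-off would be tuned on the data.
MAX_WORDS = 200

# Hypothetical list of negation cues for the second rule.
NEGATION_WORDS = {"no", "not", "never", "don't", "can't", "won't", "isn't", "didn't"}

def is_long_input(conversation, max_words=MAX_WORDS):
    """Rule 1: flag conversations with more than `max_words` whitespace-separated words."""
    return len(conversation.split()) > max_words

def has_negation(conversation):
    """Rule 2 (heuristic): flag conversations containing any negation cue."""
    tokens = re.findall(r"[a-z']+", conversation.lower())
    return any(token in NEGATION_WORDS for token in tokens)

def is_edge_case(conversation):
    """Collect a sample for retraining if either rule fires."""
    return is_long_input(conversation) or has_negation(conversation)
```

For example, `is_edge_case("A: I'm coming. B: No, you can't make it.")` fires on the negation rule, while a short conversation without negation cues is left alone.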

Step 4 - We also wanted to check whether we could detect a shift in vocabulary between the two datasets, and defined a custom monitor for it. (Interestingly, we saw more occurrences of words related to Asia in the DialogSum dataset.) Similar monitors can be designed to identify new topics, sentiments, tones of voice, etc.
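
One simple way such a vocabulary monitor could work is to compare relative word frequencies between the two datasets and report words that are much more common in the new one. The sketch below assumes a smoothed frequency ratio with a hypothetical threshold; the function names and the toy sentences are made up for illustration and are not taken from either dataset.

```python
import re
from collections import Counter

def word_frequencies(texts):
    """Relative frequency of each lowercase alphabetic token across `texts`."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def overrepresented_words(reference_texts, new_texts, min_ratio=2.0, smoothing=1e-3):
    """Words whose smoothed frequency in `new_texts` is at least `min_ratio` times the reference."""
    reference = word_frequencies(reference_texts)
    new = word_frequencies(new_texts)
    shifted = []
    for word, frequency in new.items():
        # Smoothing keeps the ratio finite for words unseen in the reference set.
        ratio = (frequency + smoothing) / (reference.get(word, 0.0) + smoothing)
        if ratio >= min_ratio:
            shifted.append(word)
    return sorted(shifted)

# Made-up toy sentences standing in for the two datasets.
reference_texts = ["let us meet for lunch", "see you at the office"]
new_texts = ["let us meet in beijing", "see you in beijing"]

shifted = overrepresented_words(reference_texts, new_texts)
```

On real data one would likely also filter out stop words and require a minimum count, so that rare tokens do not dominate the list.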

Would love for you to play around with the tool and share your feedback!