by hsuyash 1248 days ago
What do you mean by problematic data points, and how do you identify them?

Great question. Problematic data points are essentially the cases where your model is not performing well.

Now, we have three ways to find them:

1. Statistical tools: We cluster your training dataset and flag production cases that are far away from every training cluster (the idea is that if a data point is out-of-distribution, the model may not perform well on it and may need retraining).

2. User feedback: We infer ground truth (GT) from user behaviour. For example, in a recommendation system, GT = whether the user likes the video. In the case of ChatGPT, GT = 0 if we see the user asking the same question in multiple ways. We use such signals to identify cases where the user is not satisfied with the model output.

3. Rule-based signals: Data scientists and ML engineers often have a good idea of where their models are not performing well. These insights can come from analysing user feedback or from manually testing the models. We let them define rule-based signals to pick out any interesting cases they would like to test or retrain their models on.
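
To make the first approach concrete, here is a minimal sketch of distance-to-cluster flagging in plain Python. It is not our actual implementation: the tiny k-means, the farthest-point initialisation, the 99th-percentile threshold, and every function name below are assumptions chosen for illustration.

```python
import math

def nearest_distance(point, centroids):
    """Euclidean distance from a point to its closest centroid."""
    return min(math.dist(point, c) for c in centroids)

def kmeans(points, k, iters=20):
    """Tiny k-means with deterministic farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: nearest_distance(p, centroids)))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        # Move each centroid to the mean of its group; keep it if the group is empty.
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids

def fit_threshold(points, centroids, quantile=0.99):
    """Training-set distance at the given quantile (hypothetical cut-off rule)."""
    dists = sorted(nearest_distance(p, centroids) for p in points)
    return dists[min(len(dists) - 1, int(quantile * len(dists)))]

def is_out_of_distribution(point, centroids, threshold):
    """Flag a production point that is farther from every cluster than the threshold."""
    return nearest_distance(point, centroids) > threshold

# Toy training set: two 3x3 grids of points, centred on (0, 0) and (10, 10).
grid = [(x, y) for x in (-1, 0, 1) for y in (-1, 0, 1)]
training = [(float(x), float(y)) for x, y in grid]
training += [(x + 10.0, y + 10.0) for x, y in grid]

centroids = kmeans(training, k=2)
threshold = fit_threshold(training, centroids)

near = is_out_of_distribution((0.5, 0.5), centroids, threshold)   # close to a cluster
far = is_out_of_distribution((50.0, 50.0), centroids, threshold)  # far from both
```

On this toy data `near` is False and `far` is True. In practice the threshold and the number of clusters would need tuning per dataset.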

Online detection of problematic inputs seems plenty interesting! I am curious: does your framework run the detection logic in process, or as a daemon that the library ships data over to?
The former, i.e., it runs the detection logic in the background on the same machine where the model predictions are happening. Currently we support running simple clustering algorithms, but we are working to enable running simple neural nets as part of the observability loop.
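
For readers wondering what running detection in the background of the serving process could look like, here is a rough sketch using a worker thread and a bounded queue. The framework's real mechanism is not described in this thread, so the `BackgroundDetector` class, its methods, and the example rule are all hypothetical.

```python
import queue
import threading

class BackgroundDetector:
    """Runs a detection function on a worker thread inside the serving process."""

    def __init__(self, detect, max_pending=1000):
        self._detect = detect                      # callable(features, prediction) -> bool
        self._pending = queue.Queue(maxsize=max_pending)
        self.flagged = []                          # samples the detector marked as problematic
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, features, prediction):
        """Called from the prediction path; returns immediately."""
        try:
            self._pending.put_nowait((features, prediction))
            return True
        except queue.Full:
            return False  # drop the sample rather than slow down serving

    def _run(self):
        while True:
            item = self._pending.get()
            if item is None:  # sentinel pushed by close()
                return
            features, prediction = item
            try:
                if self._detect(features, prediction):
                    self.flagged.append((features, prediction))
            except Exception:
                pass  # a failing detector should not kill the worker

    def close(self):
        """Process everything still queued, then stop the worker."""
        self._pending.put(None)
        self._worker.join()

# Hypothetical rule-based signal: flag predictions with a score below 0.5.
detector = BackgroundDetector(lambda features, prediction: prediction < 0.5)
for features, prediction in [((1.0,), 0.9), ((2.0,), 0.2), ((3.0,), 0.4)]:
    detector.submit(features, prediction)
detector.close()
flagged = detector.flagged
```

The bounded queue is a deliberate choice in this sketch: if detection falls behind, samples are dropped instead of adding latency to predictions.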