by twsttest 2169 days ago
I'm not disagreeing with your basic idea, but it seems you're nitpicking and talking past Yann's point.

A model's only link to the real world is the training data, so saying it's sufficient to "worry about the training data" captures all the concerns we may have about bias, because from the model's POV there is no other relevant interface with the real world.

Saying "we need to do more" is devoid of meaning when by addressing the training data we are truly doing all we can as model builders and trainers.

1 comments

So here's an example of more that we can do.

A huge problem in the field is that we must keep using the previous benchmarks. If you constantly change your data, how do you know whether the needle has moved?

So, to tackle this problem, someone with more resources than me needs to create training sets that are less biased. THEN new academic papers need to be benchmarked against both the old biased sets and the new "less biased" sets (I don't think it's possible to ever get to 0% bias; the world just isn't that clean). Eventually, progress needs to transition to being measured on the new, less biased sets.
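
To make the "report on both sets" idea concrete, here is a rough Python sketch. Everything in it is hypothetical: the `predict` callable, the `(input, label, group)` example layout, and the benchmark names are assumptions for illustration, not the API of any real dataset or paper.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple

# Assumed layout: (model input, true label, group tag).
# The group tag is a hypothetical annotation used only to break scores down.
Example = Tuple[object, object, str]


def accuracy_by_group(
    predict: Callable[[object], object],
    examples: Iterable[Example],
) -> Dict[str, float]:
    """Return accuracy per group, plus an 'overall' entry."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for x, y, group in examples:
        hit = int(predict(x) == y)
        # Count every example once for its group and once overall.
        for key in (group, "overall"):
            total[key] += 1
            correct[key] += hit
    return {key: correct[key] / total[key] for key in total}


def report(
    predict: Callable[[object], object],
    benchmarks: Dict[str, Iterable[Example]],
) -> Dict[str, Dict[str, float]]:
    """Score the same model on every benchmark so the numbers sit side by side."""
    return {
        name: accuracy_by_group(predict, examples)
        for name, examples in benchmarks.items()
    }
```

A paper could then publish `report(model, {"legacy": old_set, "rebalanced": new_set})`, which keeps the comparison with prior work while also showing how the model does on the less biased set and on each group within it. This assumes no group is itself tagged "overall".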

The upsampling algorithm used pictures of celebrities. And the researchers put a blurb in their paper that basically said, "We know this is biased, but everyone uses it, so we must too." I feel like this is less useful science than an algorithm trained on more of a mix of actual real-world humans.

I admit it's quite challenging and probably impossible to do in some areas. I mean, how do you make a field whose end algorithmic goal is generalization not use real-world data to generalize about people? But I think the issue can be worked on, and the need to use celebrity photos as training data is a good place to start.

All this is going to do is push researchers to stop releasing data and code when publishing their articles, so that the public can't meme the biases/mistakes in their data/code into Twitter hate mobs.

We'll probably go back to the 2000s model where you have to email the authors for code and data. The authors will delay by saying they are preparing it and then release it a few years later when it becomes irrelevant for public discourse.

ML is a huge field outside of modelling humans and their behavior. For instance, image recognition of vehicles, financial data prediction and analytics, and weather forecasting, to name a few examples. Those don't draw scrutiny. The problem comes with generalizing about humans, with generalizing from biased data, and with applying generalized algorithms in areas where they can cause a lot of harm. I think these researchers should properly be placed under the microscope, since their work has the potential to be very hurtful to society. I do not think they should be subject to death threats or loss of income or whatever the social media mob throws at them these days, but I don't think researchers should be cavalier about creating algorithms that generalize about humans without taking very careful steps to avoid bias in the end result.