by eric4smith 1544 days ago
Impressive BUT.

Who is defining toxic speech? Where is that data being taken from?

This is the definition of using AI to decide where the edges of “speech” should be, based on potentially flawed data.

This is a clown world.

2 comments

> In this example, we’re using the Copilot extension for Visual Studio Code, and a free toxicity dataset that we built;

(Emphasis mine)

Following that link:

> Surge AI is a data labeling platform and workforce. Our labeling team pored over tens of thousands of social media comments to build this toxicity dataset. Each comment was then evaluated by multiple members of our team to determine its severity level.

I feel so sorry for the labeling team. Hope they were paid well.
I think you missed the forest for the trees. It isn't the model that matters, it's that Copilot is building the classifier from intent (comments). It wouldn't matter if it were classifying flowers instead.
No. I did not miss it. The work is pretty good.

My problem is with this dataset, and datasets like it in general, which set the tone through AI for what is acceptable and what is not.