by abeppu 1542 days ago
We're all focusing on the weaknesses of co-pilot (the comments can be longer than the code produced; you need to understand code to know when to elaborate your comment, etc).

But also ... what do you need to know to recognize that the concept of a 'toxicity classifier' is likely broken? We can do _profanity_ detection pretty well, and without a huge amount of data. But with 1000 example comments, can you actually get at 'toxicity'? Can you judge toxicity purely from a comment in isolation, or does it need to be considered in the context in which that comment is made?

Maybe you don't need to know about python, but if you're building this, you should probably have spent some time thinking and grappling with ML problems in context, right? You want to know that, for example, the pipeline copilot is suggesting (word counts, TFIDF, naive Bayes) doesn't understand word order? Or to wonder whether it's tokenizing on just whitespace, and whether `'eat sh!t'` will fail to get flagged b/c `'shit'` and `'sh!t'` are literally orthogonal to the model?
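To make the word-order and tokenization points concrete, here is a minimal stdlib-only sketch. It assumes whitespace tokenization and raw word counts; the helper names `bag_of_words` and `cosine` are made up for illustration, and the pipeline Copilot actually generated may tokenize differently.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on whitespace only, and count tokens."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Word order is invisible: both sentences map to the same counts.
same_bag = bag_of_words("dog bites man") == bag_of_words("man bites dog")

# A one-character substitution produces a token with no overlap at all.
token_similarity = cosine(bag_of_words("shit"), bag_of_words("sh!t"))

# The two phrases overlap only through the shared word "eat".
phrase_similarity = cosine(bag_of_words("eat shit"), bag_of_words("eat sh!t"))
```

Under these assumptions `same_bag` is `True` and `token_similarity` is `0.0`, which is the "literally orthogonal" case: a count-based model sees `'sh!t'` as a brand-new word unless it appeared in the training examples.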

More people should be able to create digital stuff that _does_ things, and maybe copilot is a tool to help us move in that direction. Great! But writing a bad "toxicity classifier" by not really engaging with the problem or thinking about how the solution works and where it fails seems potentially net harmful. More people should be able to make physical stuff too, but 3d-printed high-capacity magazines don't really get most of us where we want to go.


> We're all focusing on the weaknesses of co-pilot (the comments can be longer than the code produced; you need to understand code to know when to elaborate your comment, etc).

See, this tells me you may not have even used copilot. Because while tutorials such as this (and the OpenAI codex tools) have you use comments explicitly to code, the reality is that you're not hammering out plain english requirements for copilot to work. You just code - and sometimes it finishes your thought, sometimes it doesn't. You hit tab to accept autocomplete, just like you would for any other autocomplete. So you are generally reading and evaluating what copilot thinks is a good output and choosing whether it goes in the program or not with the TAB key.

Copilot is great as a 'smart auto-complete' or when you need to do pattern based drudge work... but that's not what this article is about. It's trying to sell people on copilot as a no-code tool.

The leading question is this:

>But as helpful as it is for coders, what if it enabled non-engineers to program too – by merely talking to an AI about their goals?

and it answers this in my opinion deceptively by presenting what amounts to a parlor trick. Whether copilot in general is any good or not is in my mind totally separate to this.

Yeah, I agree Copilot is absolutely not a no-code tool.
Less no-code, more low-code high-tongue tool.
It doesn't actually say that at all, because you can use Copilot in different ways. One way is the way you mention, by writing code and letting Copilot finish those off. Another way is the way GP describes it (and, the technique that the article uses) where you write comments and let Copilot fill out the code.

Just because one uses one of the ways doesn't mean they are not aware of the other way too.

Not logically, no. But it is implied because you actually get both such experiences on-demand in VS Code/vim/emacs. It's a fascinating experience and you find yourself writing more descriptive function names and variable names rather than using handwritten instructions. You quickly realize that comments are just one of many prompt engineering tricks available once you have access to this - and simply generating snippets as the linked article does is quite restricting sometimes.

Basically, the concern that e.g. comment length gets too long is a weird one, because you don't tend to actually use copilot that way if you have access to it through tab-complete.

Perhaps what I really mean is - people should try using copilot for an actual coding project. Its benefits aren't really obvious in contrived examples.

A few years ago I did some work with IBM's Watson Twitter integration. One of the fun things you could do was sentiment analysis. It was reasonably accurate for the extremes, but anything in the gray area would be wildly off. A politely worded tweet that was scathing would come out high on the positive side of the scale, whereas a perfectly reasonable sentence that included profanity inside a quote would immediately land high on the negative side.

This part from the article made me chuckle, because IMO the author fell for some of the most basic language processing smoke & mirrors:

    …so we’ll give it some examples. When generating the array, it even creates the ideal variable name and escapes the quotations.
Here, it generates toxic_comments as a variable name, when the instructions were:

    # create an array with the following toxic comments: [etc]
This is pretty basic language parsing stuff that might have been kicking around awhile. I think the most basic english language parser could output something along the lines of what was suggested, given an understanding of what valid Python should look like. While impressive, it's not nearly as interesting or good as the rest of the work being done.
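As a rough illustration of that claim, here is a toy rule-based sketch. It handles exactly one phrasing via a regular expression, so it is far short of a general English parser. `PATTERN`, `comment_to_assignment`, and the sample comment strings are all hypothetical, not taken from the article.

```python
import re

# Hypothetical pattern for one narrow phrasing; anything else is rejected.
PATTERN = re.compile(
    r"#\s*create an array with the following (?P<name>[a-z ]+?):\s*(?P<items>.+)",
    re.IGNORECASE,
)


def comment_to_assignment(comment):
    """Turn a matching comment into a Python list assignment, or return None."""
    match = PATTERN.fullmatch(comment.strip())
    if match is None:
        return None
    # "toxic comments" -> "toxic_comments"
    name = "_".join(match.group("name").lower().split())
    items = [item.strip() for item in match.group("items").split(",")]
    # repr() takes care of quote escaping for each string literal.
    literals = ", ".join(repr(item) for item in items if item)
    return f"{name} = [{literals}]"


# Placeholder comments, invented for this example.
generated = comment_to_assignment(
    "# create an array with the following toxic comments: you're awful, go away"
)
```

The variable name and the quote escaping both fall out of plain string handling here. What the sketch cannot do is cope with any instruction it was not hand-written for, which is where the gap to Copilot shows up.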

Copilot appears no different to most ML models out there. Poor and incomplete training data will yield ok results for popular things but as soon as you ask for edge cases it will fall apart like Siri trying to understand a Scottish accent.

Eventually it might get there with enough good representative training data but it's unclear to me how long that will take. If it tracks with speech processing models it might take decades plus.

Another consideration is that because the training data comes from public GitHub repos (at least last I read), it's likely ripe for abuse. If that's still how they're doing it, I'm looking forward to the TED Talk in two years from a researcher who "hacked" the copilot AI by polluting its training data.

> I think the most basic english language parser could output something along the lines of what was suggested, given an understanding of what valid Python should look like.

OK, I am waiting for you to propose a basic language parser that can do it. There's a reason we're only now having this debate - it was inconceivable 5 years ago, in the era of basic language parsers.

> OK, I am waiting for you to propose a basic language parser that can do it. There's a reason we're only now having this debate - it was inconceivable 5 years ago, in the era of basic language parsers.

This is really untrue. In fact, making "English as a programming language" was a goal of many older programming languages such as COBOL[1], BASIC, and PASCAL as early as the 60s. It's hardly a new idea and was hardly inconceivable "5 years ago" for something to output a programming language.

The sentence example here could easily be broken down by the ParseTalk model from the mid-90s[2].

Here's a recent-ish example (2018) of someone developing a "fully English" programming language:

https://osmosianplainenglishprogramming.blog/2018/05/02/plai...

It's also a source of fun[3][4][5] for people.

These are all examples of either programming languages straight up using English as syntax, or lexical parsers that can break down language and provide you with the programmatic ability to make this kind of output.

The difference here is that while copilot is pulling in python examples based on its training data set, that one thing the author singled out for amazement could easily be done by these older non-ML methods. The value copilot is adding in the example is just outputting python compared to those other methods. The real value is way larger than that, pulling in potentially more complex code to accomplish a complete task.

It's a bit like seeing an all-electric cargo train and being amazed that a train can run on electricity, when electrified light rail has existed for a long time. The impressive part is not that a thing on rails can use electricity to move around, it's the fact that it can pull heavy cargo efficiently enough to make electric power viable.

[1]: https://en.wikipedia.org/wiki/COBOL#COBOL_60

[2]: https://arxiv.org/abs/cmp-lg/9410017

[3]: https://github.com/RockstarLang/rockstar/blob/main/examples/...

[4]: https://en.wikipedia.org/wiki/Shakespeare_Programming_Langua...

[5]: https://github.com/lhartikk/ArnoldC/wiki/ArnoldC

Years ago at PyData Berlin I remember a talk that tried to classify comments from three major online newspapers, asking whether we could detect where a comment was made.

One newspaper was left leaning, the other had the reputation of right wing trolls commenting and one was somewhat in the middle ground with a reputation of the audience being pseudo intellectual neoliberalists.

The 'center' (most typical) comment for these three sites totally was in line with these sentiments. The perfect proof (or confirmation bias).

But the classification didn't work. While there were clear cut cases (one has to love stereotypes) most cases were just neutral. Meaning they could have been made on any of these media sites. Either they were just too short or just not extreme enough.

I feel (used explicitly here) that toxicity is not something that is easily classifiable without deeper understanding of the context. Else, if feeling a comment was toxic was the measure one would need to query all walks of life from extreme left to extreme right and afterwards would probably be left with a lot of toxicity that doesn't tell us much except that different people will find different things toxic.

didn't watson turn out to be useless and spaghetti code inside? aka ibm's marketing arm
I'll preface this by saying that my time working with it was while I was working at IBM, so feel free to take this with a grain of salt. In my time since I've worked in a few Data/ML and Security positions, so I do have a basis for comparison with other systems.

From what I saw, the actual language-processing part of it was top-tier. It's just it's a hard problem to come up with a demo for that people will actually respond positively to, hence the Jeopardy stint. It has limited real applications. It's really good at what it does but what it does isn't really widely useful.

Nobody wants to see "We're going to replace all our online help / support chat stuff with Watson" because people find those systems frustrating already, even if it would make things vastly better than some of the alternatives.

So you end up with weird stuff like Chef Watson, Doctor Watson, and so on -- things in areas where an ML model isn't going to replace a human anytime soon.

Then Marketing gets involved and suddenly anything that uses any kind of ML needs to have Watson slapped on it, even if it's not doing any language processing.

Welp, you're downplaying IBM too much. IBM got the product direction right earlier than anyone. Watson is a querying system w/ advanced NLP/IR/KRR capability running on dedicated compute chips, and large corps are more or less following this path. It's just that IBM did it too early and used rather old approaches, which don't grow well (thus "spaghetti").

Still, Watson is pretty much the only one in its class. There are good alternatives out there that worked well for many people, but they offer only a subset of Watson's feature set. If an organization needs some real bang, Watson is the only option.

We must suspend disbelief a bit regardless: Any “toxicity classifier” has a limited operational life as people who want to say toxic things will simply adapt their language and walk circles around it.

From simple letter substitution (sh!t) to completely different words/concepts (unalive) to “layer 2 sarcasm” (where someone adopts the persona of someone who supports the worldview that’s against what they believe in a non-obvious attempt to rally people against that persona).
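A small sketch of how that arms race looks in code, assuming a blocklist filter with a character-substitution table. `SUBSTITUTIONS`, `BLOCKLIST`, `normalize`, and `is_flagged` are hypothetical names for illustration, not any real moderation system.

```python
# Hypothetical substitution table and blocklist, for illustration only.
SUBSTITUTIONS = str.maketrans(
    {"!": "i", "1": "i", "@": "a", "$": "s", "0": "o", "3": "e"}
)
BLOCKLIST = {"shit"}


def normalize(text):
    """Lowercase and undo common single-character substitutions."""
    return text.lower().translate(SUBSTITUTIONS)


def is_flagged(text):
    """Flag text if any whitespace-separated token is on the blocklist."""
    return any(token in BLOCKLIST for token in normalize(text).split())


caught = is_flagged("eat sh!t")  # the substitution is undone, so this is flagged
missed = is_flagged("unalive")   # new vocabulary is not on the list
```

Normalization catches the first tier (`caught` is `True`) and does nothing for the second (`missed` is `False`). It also creates new misses of its own: with this table a plain `"shit!"` becomes `"shiti"` and is no longer flagged. Sarcasm is out of reach for anything of this shape.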

People have been getting away with being toxic in public for a long time. ML cannot keep up. Humans can’t even keep up.

(Post author here.) Agree with both you and the parent here! We work a lot in the NLP and Trust & Safety space, and many of the models and datasets we see do ignore context -- and so real-world "toxicity models" often end up simply as "profanity detectors" (https://www.surgehq.ai/blog/are-popular-toxicity-models-simp...). Which would certainly happen with a Naive Bayes model as well.

Similarly, a lot of the training data/features ML engineers use ignore context -- for example, a Reddit comment may seem hateful in isolation, until you realize the subreddit it's in changes the meaning entirely (https://www.surgehq.ai/blog/why-context-aware-datasets-are-c...).

Regarding your point, we actually do a lot of "adversarial labeling" to try to make ML models robust to countermeasures (e.g., making sure that the ML models train on word letter substitutions), but it's pretty tricky!
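One simple way to get letter substitutions into training data is rule-based augmentation, sketched below. This is a swapped-in illustration and not necessarily the adversarial labeling process described above, which involves human labelers. The `VARIANTS` table and function names are hypothetical.

```python
import itertools

# Hypothetical substitution table; real adversarial text is far more varied.
VARIANTS = {"i": ["i", "!", "1"], "s": ["s", "$"]}


def substitution_variants(word):
    """Return every spelling of `word` reachable via the VARIANTS table."""
    options = [VARIANTS.get(ch, [ch]) for ch in word]
    return sorted("".join(combo) for combo in itertools.product(*options))


def augment(examples):
    """Expand (text, label) pairs so each variant keeps the original label."""
    augmented = []
    for text, label in examples:
        for variant in substitution_variants(text):
            augmented.append((variant, label))
    return augmented


training_rows = augment([("shit", 1)])
```

The number of variants grows multiplicatively with the number of substitutable characters, and the table only covers tricks someone has already thought of, which is part of why this stays tricky.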

The fact that "toxicity" is not well-defined or black and white and you'll never be able to reach 100% accuracy is extremely obvious and not very interesting. That's probably why nobody is talking about it.
Sure, but we probably can work on that a little more rather than throwing in the towel and saying 'toxicity is when text matches regexp'.
Well, we probably could throw in the towel. The definition of the word is ever-changing and context-dependent, AND subjective to the receiver. That doesn't sound like something you can train a model for.
If you had access to the reactions of someone reading the content, you could _possibly_ train an agent to spot textual patterns likely to cause the reader to have negative reactions.

You could do a similar thing with a robot DJ, by feeding it a stream of the dancefloor, and training it to keep that dancefloor grooving.

But think about how this is trained. As with all of the authoritarian anti-offence rhetoric (i.e. not person to person politeness, but politeness enforcement), the response should be: who gets to decide?

Some concepts become less offensive over time; some more offensive. 20 years ago gay marriage was offensive in many parts of the world. Should that be codified into our communication tools? Offence is in no way objective, and this will never change.

There is genuine, vast utility in advocating for this sort of thing, but only if you want to be the person with the power to decide what everyone else is allowed to talk about.

Would you say the same about a less divisive speech pattern like a flamebait/flamewar classifier? Because it’s already doable with simple heuristics like upvote/comment ratios and seems like a fine fit for a moderator-assisted classifier.
Yeah I would say the same. It's also not well defined or black or white but that doesn't mean you have to just give up. You can do better than nothing in both cases.
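For reference, the upvote/comment-ratio heuristic mentioned above could look something like this minimal sketch. The function names and both thresholds are hypothetical placeholders, not values used by any real site, and the output is meant as a signal for a human moderator rather than a verdict.

```python
def comment_to_upvote_ratio(num_comments, upvotes):
    """Comments per upvote; upvotes are floored at 1 to avoid dividing by zero."""
    return num_comments / max(upvotes, 1)


def needs_moderator_review(num_comments, upvotes, min_comments=40, ratio_threshold=1.0):
    """Flag busy threads where comments outnumber upvotes.

    Both thresholds are hypothetical placeholders, not values from any real site.
    """
    if num_comments < min_comments:
        return False
    return comment_to_upvote_ratio(num_comments, upvotes) > ratio_threshold


calm = needs_moderator_review(num_comments=50, upvotes=400)
heated = needs_moderator_review(num_comments=120, upvotes=60)
```

Here `calm` is `False` and `heated` is `True`. The heuristic never looks at the text, so it sidesteps the definitional problem entirely, at the cost of only working after a thread has already blown up.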