Hacker News
by aubanel 589 days ago
> For me that's just a great example of how basically the vast majority of AI influencers (who vie for influence on social media, rather than research) are basically clueless about AI and CS

This is a bit stark: there are many great, knowledgeable engineers and scientists who would not get your point about a^nb^n. It's impossible to know 100% of such a wide area as "AI and CS".


>> This is a bit stark: there are many great knowledgeable engineers and scientists who would not get your point about a^nb^n. It's impossible to know 100% of such a wide area as "AI and CS".

I think, engineers, yes, especially those who don't have a background in academic CS. But scientists, no, I don't think so. I don't think it's possible to be a computer scientist without knowing the difference between a regular and a super-regular language. As to knowing that a^nb^n specifically is context-free, as I suggest in the sibling comment, computer scientists who are also AI specialists would recognise a^nb^n immediately, as they would Dyck languages and Reber grammars, because those are standard tests of learnability used to demonstrate various principles, from the good old days of purely symbolic AI, to the brave new world of modern deep learning.

For example, I learned about Reber grammars for the first time when I was trying to understand LSTMs, when they were all the hype in Deep Learning, at the time I was doing my MSc in 2014. Online tutorials on coding LSTMs used Reber grammars as the dataset (because, as with other formal grammars it's easy to generate tons of strings from them and that's awfully convenient for big data approaches).
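To make the "easy to generate tons of strings" point concrete, here is a minimal sketch of a Reber grammar sampler. The transition table is the one commonly reproduced in LSTM tutorials, written down from memory; it is an assumption and not something given in this thread, and the names `REBER`, `generate` and `is_reber` are made up for illustration.

```python
import random

# Assumed transition table (the commonly reproduced Reber grammar).
# state -> list of (emitted symbol, next state); every string is wrapped in B ... E.
REBER = {
    0: [("T", 1), ("P", 2)],
    1: [("S", 1), ("X", 3)],
    2: [("T", 2), ("V", 4)],
    3: [("X", 2), ("S", 5)],
    4: [("P", 3), ("V", 5)],
}
FINAL = 5


def generate(rng):
    """Random walk through the automaton, emitting one symbol per transition."""
    state, out = 0, ["B"]
    while state != FINAL:
        symbol, state = rng.choice(REBER[state])
        out.append(symbol)
    out.append("E")
    return "".join(out)


def is_reber(s):
    """Check membership by replaying the string through the same table."""
    if len(s) < 2 or s[0] != "B" or s[-1] != "E":
        return False
    state = 0
    for ch in s[1:-1]:
        if state == FINAL:
            return False  # symbols left over after reaching the final state
        nxt = dict(REBER[state]).get(ch)
        if nxt is None:
            return False  # no such transition
        state = nxt
    return state == FINAL


rng = random.Random(0)
samples = [generate(rng) for _ in range(200)]
```

Because the same table drives both the sampler and the checker, producing a large labelled dataset is just a loop over `generate`.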

Btw that's really the difference between a computer scientist and a computer engineer: the scientist knows the theory. That's what they do to you in CS school, they drill that stuff in your head with extreme prejudice; at least the good schools do. I see this with my partner who is 10 times a better engineer than me and yet hasn't got a clue what all this Chomsky hierarchy stuff is. But then, my partner is not trying to be an AI influencer.

Strong gatekeeping vibes. "Not even wrong" is perfect for this sort of fixation with labels and titles and an odd seemingly resentful take that gwern has being an AI influencer as a specific goal.
"not even wrong" is supposed to refer to a specific category of flawed argument, but of course like many other terms it's come to really mean "low status belief"
It all feels like their only goal is circumlocutions over the subset of contemporary glyphs they know?

The physical principles remain regardless of how humans write them down.

OK, I concede that if I try to separate engineers from scientists it sounds like I'm trying to gatekeep. In truth, I'm organising things in my head because I started out thinking of myself as an engineer, because I like to make stuff, and at some point I started thinking of myself as a scientist, malgré moi, because I also like to know how stuff works and why. I multiclassed, you see, so I am trying to understand exactly what changed, when, and why.

I mean obviously it happened when I moved from industry to academia, but it's still the case there's a lot of overlap between the two areas, at least in CS and AI. In CS and AI the best engineers make the best scientists and vice versa. I think.

Btw, "gatekeeping" I think assumes that I somehow think of one category less than the other? Is that right? To be clear, I don't. I was responding to the use of both terms in the OP's comments with a personal reflection on the two categories.

I sure hope nobody ever remembers you being confidently wrong about something. But if they do, hopefully that person will have the grace and self-restraint not to broadcast it any time you might make a public appearance, because they're apparently bitter that you still have any credibility.
Point taken and I warned my comment would sound vituperative. Again, the difference is that I'm not an AI influencer, and I'm not trying to make a living by claiming an expertise I don't have. I don't make "public appearances" except in conferences where I present the results of my research.

And you should see the criticism I get from other academics when I try to publish my papers and they decide I'm not even wrong. And that kind of criticism has teeth: my papers don't get published.

Please be aware that your criticism has teeth too, you just don't feel the bite of them. You say I "should see" that criticism you receive on your papers, but I don't; it's delivered in private. Unlike the review comments you get from your peers, you are writing in public. I'm sure you wouldn't appreciate it if your peer reviewer stood up after your conference keynote and told the audience that they'd rejected your paper five years ago, described your errors, and went on to say that nobody at this conference should be listening to you.
What is the point of saying "I warned my comment would sound vituperative"? Acknowledging a flaw in the motivation of your comment doesn't negate that flaw, it means you realize you are posting something mean spirited and consciously deciding to do it even though you recognize you're being mean spirited.
Can I say a bit more about criticism on the side? I've learned to embrace it as a necessary step to self-improvement.

My formative experience as a PhD student was when a senior colleague attacked my work. That was after I asked for his feedback for a paper I was writing where I showed that my system beat his system. He didn't deal with it well, sent me a furiously critical response (with obvious misunderstandings of my work) and then proceeded to tell my PhD advisor and everyone else in a conference we were attending that my work is premature and shouldn't be submitted. My advisor, trusting his ex-student (him) more than his brand new one (me), agreed and suggested I should sit on the paper a bit longer.

Later on the same colleague attacked my system again, but this time he gave me a concrete reason why: he gave me an example of a task that my system could not complete (learn a recursive logic program to return the last element in a list from a single example that is not an example of the base-case of the recursion; it's a lot harder than it may sound).

Now, I had been able to dismiss the earlier criticism as sour grapes, but this one I couldn't get over because my system really couldn't deal with it. So I tried to figure out why: where was the error I was making in my theories? Because my theoretical results said that my system should be able to learn that. Long story short, I did figure it out and I got that example to work, plus a bunch of other hard tests that people had thrown at me in the meanwhile. So I improved.

I still think my colleague's behaviour was immature and not becoming of a senior academic: attacking a PhD student because she did what you've always done, beat your own system, is childish. In my current post-doc I just published a paper with one of our PhD students where we report his system trouncing mine (in speed; still some meat on those old bones otherwise). I think criticism is a good thing overall, if you can learn to use it to improve your work. It doesn't mean that you'll learn to like it, or that you'll be best friends with the person criticising you, it doesn't even mean that they're not out to get you; they probably are... but if the criticism is pointing out a real weakness you have, you can still use it to your advantage no matter what.

Constructive criticism is a good thing, but in this thread you aren't speaking to Gwern directly, you're badmouthing him to his peers. I'm sure you would have felt different if your colleague had done that.
is it really? this is the most common example for context free languages and something most first year CS students will be familiar with.

totally agree that you can be a great engineer and not be familiar with it, but seems weird for an expert in the field to confidently make wrong statements about this.

Thanks, that's what I meant. a^nb^n is a standard test of learnability.
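For readers who haven't met it, a rough sketch of why a^nb^n is the textbook context-free example: it is generated by the two rules S -> a S b and S -> a b, and recognising it needs a counter that can grow without bound, which a finite automaton doesn't have. The code below assumes n >= 1 (some texts also allow the empty string); `derive` and `in_anbn` are illustrative names.

```python
def derive(n):
    """Apply S -> a S b (n - 1) times, then S -> a b. Assumes n >= 1."""
    s = "S"
    for _ in range(n - 1):
        s = s.replace("S", "aSb")
    return s.replace("S", "ab")


def in_anbn(s):
    """Single left-to-right pass with one unbounded counter."""
    count = 0
    seen_b = False
    for ch in s:
        if ch == "a":
            if seen_b:
                return False  # an 'a' after a 'b' breaks the a...ab...b shape
            count += 1
        elif ch == "b":
            seen_b = True
            count -= 1
            if count < 0:
                return False  # more b's than a's
        else:
            return False  # symbol outside the alphabet
    return seen_b and count == 0
```

The counter is the whole trick: the question in this thread is whether a network can learn to implement something like it from examples.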

That stuff is still absolutely relevant, btw. Some DL people like to dismiss it as irrelevant but that's just because they lack the background to appreciate why it matters. Also: the arrogance of youth (hey I've already been a postdoc for a year, I'm ancient). Here's a recent paper on Neural Networks and the Chomsky Hierarchy that tests RNNs and Transformers on formal languages (I think it doesn't test on a^nb^n directly but tests similar a-b based CF languages):

https://arxiv.org/abs/2207.02098

And btw that's a good paper. Probably one of the most satisfying DL papers I've read in recent years. You know when you read a paper and you get this feeling of satiation, like "aaah, that hit the spot"? That's the kind of paper.

a^nb^n can definitely be expressed and recognized with a transformer.

A transformer (with relative invariant positional embedding) has full context so can see the whole sequence. It just has to count and compare.

To convince yourself, construct the weights manually.

First layer :

zeros the characters that are equal to the previous character.

Second layer :

Build a feature to detect and extract the position embedding of the first a, a second feature to detect and extract the position embedding of the last a, a third feature to detect and extract the position embedding of the first b, and a fourth feature to detect and extract the position embedding of the last b.

Third layer :

on top of that, check whether (second feature - first feature) == (fourth feature - third feature).
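The three steps above can be mimicked in plain Python to see what each one contributes. This is only a sketch of the logic, not attention weights: the function names are hypothetical and the "position embedding" is taken to be the integer index. One thing the sketch makes visible is that the span comparison alone also accepts strings like "abab", so the run starts from the first step have to be checked as well (exactly one run of a's followed by one run of b's). It assumes n >= 1.

```python
def run_starts(s):
    """'Layer 1': blank out every character equal to its predecessor."""
    return [c if i == 0 or s[i - 1] != c else None for i, c in enumerate(s)]


def boundary_features(s):
    """'Layer 2': positions of the first a, last a, first b and last b."""
    a_pos = [i for i, c in enumerate(s) if c == "a"]
    b_pos = [i for i, c in enumerate(s) if c == "b"]
    if not a_pos or not b_pos:
        return None
    return a_pos[0], a_pos[-1], b_pos[0], b_pos[-1]


def spans_equal(s):
    """'Layer 3' as described: compare the a-span with the b-span."""
    feats = boundary_features(s)
    if feats is None:
        return False
    first_a, last_a, first_b, last_b = feats
    return (last_a - first_a) == (last_b - first_b)


def accepts(s):
    """Span check plus a shape check on the layer-1 output."""
    if any(c not in "ab" for c in s):
        return False
    starts = [c for c in run_starts(s) if c is not None]
    if starts != ["a", "b"]:
        return False  # not of the form a...ab...b
    return spans_equal(s)
```

Whether gradient descent finds weights that compute these features is a separate question from whether such weights exist.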

The paper doesn't distinguish between the expressive capability of the model and finding the optimum of the model, aka the training procedure.

If you train by only showing examples with varying n, there probably isn't an inductive bias to make it converge naturally towards the optimal solution you can construct by hand. But you can probably train multiple formal languages simultaneously, to make the counting feature emerge from the data.

You can't deduce much from negative results in research besides it requiring more work.

>> The paper doesn't distinguish between the expressive capability of the model and finding the optimum of the model, aka the training procedure.

They do. That's the whole point of the paper: you can set a bunch of weights manually like you suggest, but can you learn them instead; and how? See the Introduction. They make it very clear that they are investigating whether certain concepts can be learned by gradient descent, specifically. They point out that earlier work doesn't do that and that gradient descent is an obvious bit of bias that should affect the ability of different architectures to learn different concepts. Like I say, good work.

>> But you can probably train multiple formal languages simultaneously, to make the counting feature emerge from the data.

You could always try it out yourself, you know. Like I say that's the beauty of grammars: you can generate tons of synthetic data and go to town.
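A minimal sketch of what "generate tons of synthetic data" could look like for the multi-language experiment suggested above: labelled strings from a^nb^n and from Dyck-1 (balanced brackets), split by size so that the test strings are longer than anything seen in training. The choice of languages, the way negatives are made, and all the names here are assumptions for illustration.

```python
import random


def anbn(n):
    return "a" * n + "b" * n


def dyck1(n_pairs, rng):
    """Random balanced bracket string with n_pairs pairs."""
    out, open_left, depth = [], n_pairs, 0
    while open_left > 0 or depth > 0:
        if open_left > 0 and (depth == 0 or rng.random() < 0.5):
            out.append("(")
            open_left -= 1
            depth += 1
        else:
            out.append(")")  # depth > 0 is guaranteed on this branch
            depth -= 1
    return "".join(out)


def is_dyck1(s):
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        else:
            return False
        if depth < 0:
            return False
    return depth == 0


def make_dataset(sizes, rng):
    """Return shuffled (language, string, is_member) triples."""
    data = []
    for n in sizes:
        data.append(("anbn", anbn(n), True))
        data.append(("anbn", anbn(n) + "b", False))  # one b too many
        d = dyck1(n, rng)
        data.append(("dyck1", d, True))
        data.append(("dyck1", d + ")", False))  # unmatched closing bracket
    rng.shuffle(data)
    return data


rng = random.Random(0)
train = make_dataset(range(1, 11), rng)   # short strings
test = make_dataset(range(11, 21), rng)   # strictly longer strings
```

Negatives that are only one symbol away from a positive are deliberately hard; easier negatives (random strings) could be mixed in the same way.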

>> You can't deduce much from negative results in research besides it requiring more work.

I disagree. I'm a falsificationist. The only time we learn anything useful is when stuff fails.

Gradient descent usually gets stuck in local minima; it depends on the shape of the energy landscape, and that's expected behavior.

The current wisdom is that optimizing for multiple tasks simultaneously makes the energy landscape smoother. One task allows the discovery of features which can be used to solve other tasks.

Useful features that are used by many tasks can more easily emerge from the sea of useless features. If you don't have sufficiently many distinct tasks the signal doesn't get above the noise and is much harder to observe.

That's the whole point of "Generalist" intelligence in the scaling hypothesis.

For problems where you can write a solution manually, you can also help the training procedure by regularising your problem: add the auxiliary task of predicting some custom feature. Alternatively, you can "Generatively Pretrain" to obtain useful features, replacing a custom loss function with custom data.
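As a rough illustration of the auxiliary-task idea for a^nb^n: alongside the membership prediction, the model could also be asked to predict a hand-picked feature such as the number of a's, and the two errors are summed. The feature, the squared-error form and the weight `aux_weight` are all assumptions made up for this sketch.

```python
import math


def bce(p, y):
    """Binary cross-entropy for a predicted probability p and a label y in {0, 1}."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def combined_loss(p_member, y_member, count_pred, count_true, aux_weight=0.1):
    """Main task (membership) plus a weighted auxiliary task (count of a's)."""
    main = bce(p_member, y_member)
    aux = (count_pred - count_true) ** 2
    return main + aux_weight * aux
```

With `aux_weight=0` this reduces to the plain membership loss, so the weight controls how strongly the counting feature is pushed.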

The paper is a useful characterisation of the energy landscape of various formal tasks in isolation, but doesn't investigate the more general, simpler problems that occur in practice.

In my country (France), I think most last-year CS students will not have heard of it (pls anyone correct me if I'm wrong).