HN Mirror



	by bobbylarrybobby 1203 days ago
	> Do people just throw random layers and activation functions together until something works?

	In a lot of cases, yes. You can start with a reasonable baseline architectural guess, like “convolution should be good for vision” or “attention should be good for language”, but after that it's a lot of guess and check.


uoaei 1203 days ago

This may be the case in certain shops or disciplines, but I can assure you, if you have the right insights, you can use statistical and ML principles to achieve better accuracy with orders of magnitude less compute.

I consider it a tragedy that we throw huge hogs of models into datacenters and let them churn without giving much thought to improving performance. The climate impacts relative to the real benefits are measurable and depressing for a field so focused on innovation. Usually the 100x model built by a team of "data scientists" over a month is no better than a 1x model built by a couple SMEs over a couple weeks.


nerdponx 1203 days ago

Just because there are a lot of under-qualified bootcamp data scientists out there who think PyTorch can solve all their problems, that doesn't give you the right to sneer at the cutting edge of ML research on the grounds that you've built some good models in Excel in the past.

There are plenty of times when deep learning is the wrong solution, and probably most businesses out there should not ever expect to need any kind of highly compute-intensive model training. But there are also plenty of cases where neural networks are the right solution (even if not "deep learning" as such).

Maybe I'll be convinced if you share some specific examples. Otherwise I'll assume that your comment is just arrogant derision.


uoaei 1203 days ago

I have a lot to say on this and related topics, but I'll try to keep it on topic to this thread.

> sneer at the cutting edge of ML research

I was reacting specifically to mis-specifications of what "good" means in these contexts, e.g., "attention should be good for language" from GGP comment.

The major advances of these models du jour are not the outputs, but rather the infrastructure that it takes to run them. I salute all those involved in implementation. I hope (and see!) these innovations spill into things that help people and generate real value, for instance, all the things businesses use accurate climate forecasts for.

My issue mostly revolves around the discrepancy between value-created-per-FLOPS and cost-per-FLOPS. If we run huge models that serve the same purpose for us as the TV walls in Fahrenheit 451 do for their owners, what are we adding? If we are constantly running pixel-level segmentation networks on the cameras of automobiles that are already being driven by humans, how is that helping?

Stable Diffusion is a real innovation in optimization techniques. This is the only part of recent news that can be considered "cutting edge ML research", as it is an important step in unifying optimization theory with that of non-equilibrium statistical mechanics. The parallels have been noticed there for decades, and the cutting edge of ML research today is unifying them by cribbing aspects of each into the other.

The ideas (models) themselves are quite naive. The perceptual space of these models is essentially one-dimensional. This, in broad strokes, is a terrible model for reality, and so these casual discussions of "intelligence" seem misinformed, not to mention under-determined. It seems like a lot of misinformed people speak quite confidently about the things they don't know that they don't know, especially on a site where people who value their autodidactic skills tend to congregate.

Inductive bias is the primary term of art that is missing from these discussions. The models du jour do not have any architectural components which represent language per se, only flat representations thereof. LLMs work so well because of the Transformers inside; multi-head attention is a clever trick that vastly expands the possible data processing techniques as information passes through the network and is distilled into the outputs, but ultimately it is just a fancy way of doing a technique we could call "quasi-random projections" under the domain of linear algebra.
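For readers who want to see the linear algebra being referred to, here is a minimal sketch of multi-head self-attention in plain Python. It is an illustration under assumptions, not anyone's production implementation: the helper names (`matmul`, `random_matrix`, `attention_head`, `multi_head_attention`), the toy sizes, and the random Gaussian matrices standing in for learned weights are all hypothetical. It only shows that each head is three linear projections followed by a softmax-weighted average; whether trained projections deserve the label "quasi-random" is the point under debate and is not settled by the code.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    """Assumption: random Gaussian entries stand in for learned weights."""
    scale = 1.0 / math.sqrt(rows)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def attention_head(x, w_q, w_k, w_v):
    """One head: three linear projections, then a softmax-weighted average."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]      # transpose of k
    scores = matmul(q, k_t)                   # seq_len x seq_len similarities
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights

def multi_head_attention(x, heads, w_o):
    """Run every head, concatenate per token, apply the output projection."""
    outputs = [attention_head(x, *h)[0] for h in heads]
    concat = [sum((out[t] for out in outputs), []) for t in range(len(x))]
    return matmul(concat, w_o)

rng = random.Random(0)
seq_len, d_model, n_heads = 4, 8, 2           # toy sizes, chosen arbitrarily
d_head = d_model // n_heads
x = random_matrix(seq_len, d_model, rng)      # toy token embeddings
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(n_heads)]
w_o = random_matrix(d_model, d_model, rng)

y = multi_head_attention(x, heads, w_o)
_, attn = attention_head(x, *heads[0])
print(len(y), len(y[0]))                      # one d_model-sized vector per token
```

Each row of `attn` is a probability distribution over the input tokens, so every output vector is a data-dependent mixture of linearly projected inputs.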

The notion of intelligence we entertain in these discussions is hopelessly anthropocentric, and, furthermore, incredibly arrogant. The recent achievements say much more about how easy it is to simulate human-comprehensible language and images than they do about how we are birthing new intelligences. I believe the hype around and deployment of these tools are doing more to hurt society in the long term than to help it, and I welcome more candid discussions about how simple the rules that make us "intelligent" might be. That seems to me, from my studies, to be the real question.


nerdponx 1203 days ago

It kind of sounds like you're saying that it should be possible to build a better GPT-3 by incorporating more linguistics domain knowledge into the model somehow, beyond stacking word embeddings and transformers together. If so... then okay, show me.

The biggest innovation of deep learning is that you can take a relatively general architecture, train it on a ton of data, and end up with surprisingly sophisticated models that absolutely wipe the floor with the older style of models.

AlphaGo/Zero is maybe an exception here: it's not "just" a model, it's a whole system based around a model, which is specifically designed to perform well on certain tasks, and succeeds where a more naive system would probably fail. You could argue that ChatGPT itself is an early kind of advancement along those lines, going beyond just cranking predictions from a model to a more holistic task-specific system. But it's still mostly "just" a huge model trained on a huge pile of data, and I'm not going to argue strongly that it isn't. What I am arguing is that people aren't as naive as you think, and it's not for lack of trying that LLMs don't incorporate detailed knowledge about linguistics, psychology, etc.

You suggest that transformers shouldn't be considered "good" for text, because they don't include enough inductive bias. Don't they? They're really clever things in my opinion, and I think they actually represent quite a lot of inductive bias compared to what came before, and have a lot of intuitive appeal in their respective intended task domains. I hardly think it's fair or correct to consider the transformer nothing more than a linear algebra trick.

I suppose you're arguing that producing state-of-the-art results and beating literally everything else we have is not enough to make them "good". I think your argument is that they only work as well as they do because they are given a tremendous amount of computing power and data. That might be true, but that's literally why people are trying to make them faster and less costly to train!

Don't forget that the other big innovation of deep learning is transfer learning and fine-tuning. A big LLM like GPT-3 only needs to be trained fully every once in a while; after that it can be reused and adapted to a huge variety of tasks relatively efficiently and cheaply. The whole point of the article at the top of this thread is making big models cheaper and faster to train. If you're bothered by the cost- and compute-efficiency of these things, what could be better than making the training process more efficient?
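To make the "train once, adapt cheaply" idea concrete, here is a toy sketch in plain Python. Everything in it is an assumption for illustration: the "pretrained backbone" is just a fixed random projection with a tanh (a hypothetical stand-in for an expensive model), the downstream task is an invented point-in-circle classification, and the names `BACKBONE`, `features`, `predict`, and `mean_loss` are made up. The only thing it demonstrates is the mechanics: the backbone is never updated, its features are computed once, and only a small logistic head is trained.

```python
import math
import random

rng = random.Random(0)

# Hypothetical stand-in for an expensive pretrained backbone: a fixed random
# projection with bias, followed by tanh. It is never updated below.
D_IN, D_FEAT = 2, 16
BACKBONE = [([rng.gauss(0.0, 1.5) for _ in range(D_IN)], rng.gauss(0.0, 1.0))
            for _ in range(D_FEAT)]

def features(x):
    """Frozen forward pass through the backbone."""
    return [math.tanh(sum(w * xi for w, xi in zip(ws, x)) + b)
            for ws, b in BACKBONE]

def predict(head, bias, feats):
    """Logistic-regression head on top of the frozen features."""
    z = sum(w * f for w, f in zip(head, feats)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def mean_loss(head, bias, data):
    """Average binary cross-entropy over (features, label) pairs."""
    eps = 1e-12
    total = 0.0
    for feats, y in data:
        p = predict(head, bias, feats)
        total -= y * math.log(p + eps) + (1.0 - y) * math.log(1.0 - p + eps)
    return total / len(data)

# Invented downstream task: is the point inside a circle of radius 0.7?
points = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
# Backbone features are computed once and reused for every training step.
data = [(features(p), 1.0 if p[0] ** 2 + p[1] ** 2 < 0.49 else 0.0)
        for p in points]

head, bias, lr = [0.0] * D_FEAT, 0.0, 0.2
loss_before = mean_loss(head, bias, data)
for _ in range(300):
    grad_w, grad_b = [0.0] * D_FEAT, 0.0
    for feats, y in data:
        err = predict(head, bias, feats) - y   # d(loss)/d(logit)
        grad_b += err
        for j, f in enumerate(feats):
            grad_w[j] += err * f
    # Only the head's D_FEAT + 1 parameters change; BACKBONE stays frozen.
    head = [w - lr * g / len(data) for w, g in zip(head, grad_w)]
    bias -= lr * grad_b / len(data)
loss_after = mean_loss(head, bias, data)

print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

The cost asymmetry is the point: adapting to the new task touches 17 parameters here, while the (pretend) expensive part is paid for once and shared across tasks.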

Moreover, it's kind of weird to insinuate that the field doesn't care about making these things faster or cheaper to run, when immediately after any new model is announced, a flood of projects follows trying to shrink the model, improve inference speed, etc.

Finally, who said anything about intelligence? I didn't, and nobody else did either. That's a completely unrelated topic as far as I'm concerned, and I leave that one to the philosophers to debate.
