by kristjank 1066 days ago
Yo dawg, we heard you like transformers so we put transformers on your transformers so you can train while you train. The spider web graph shows metatransformers performing worse than their counterparts in almost all fields. Is there a reason I should not believe that an expert model will always outperform a general purpose one, even if it's a metatransformer?
6 comments

I mean, there is a somewhat unique value proposition in a multimodal framework like this metatransformer. Its goal isn't necessarily to beat expert models at their own game, but to provide a unified framework for processing diverse modalities of data.

I think it aims to leverage the cross-modal relationships and unified learning, which might not be possible with expert models designed for only a single modality.

Even if it performs slightly worse on some tasks, the ability to handle multiple modalities within a single framework is a pretty sweet advantage in scenarios where data from various sources needs to be processed simultaneously and patterns across modalities need to be captured somehow.

A general-purpose model could also be a more cost-effective solution in some cases, since ensembles of experts are difficult to scale and parallelize.

>an expert model will always outperform a general purpose one, even if it's a metatransformer

It's an interesting question, as it raises questions of conceptual "boundaries."

The sense-plan-do process requires a search and filter process for task switching, assuming an agent can do more than one thing.

So assume you have a robotic/autonomous agent that is a collection of systems (locomotion, dexterous gripper, visual perception, etc.). If each system can be represented as an "expert module" (the dexterous manipulator, say), then as long as a discriminator can appropriately switch states using the sensor/system inputs, it's conceptually possible that there is a canonical "expert module" that everyone uses. In that case "general purpose" would apply to the agent as a whole, while "expert model" would apply to the dexterous manipulator.
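To make the "discriminator switches between expert modules" idea concrete, here is a minimal sketch. Everything in it is hypothetical: the module names, the `distance_to_object` observation key, and the 0.1 threshold are made up for illustration and don't come from any real robotics stack.

```python
from typing import Callable, Dict, Tuple

Observation = Dict[str, float]
Expert = Callable[[Observation], str]


def locomotion_expert(obs: Observation) -> str:
    # Stand-in for a specialized locomotion module (hypothetical).
    return "walk toward object"


def gripper_expert(obs: Observation) -> str:
    # Stand-in for a specialized dexterous manipulator module (hypothetical).
    return "close gripper"


def discriminator(obs: Observation) -> str:
    # Hypothetical switching rule: grasp once the object is within reach.
    return "gripper" if obs["distance_to_object"] < 0.1 else "locomotion"


class Agent:
    """The 'general purpose' thing is the agent; each expert stays specialized."""

    def __init__(self, experts: Dict[str, Expert],
                 switch: Callable[[Observation], str]) -> None:
        self.experts = experts
        self.switch = switch

    def act(self, obs: Observation) -> Tuple[str, str]:
        # Pick an expert from the observation, then delegate to it.
        name = self.switch(obs)
        return name, self.experts[name](obs)


agent = Agent(
    {"locomotion": locomotion_expert, "gripper": gripper_expert},
    discriminator,
)
print(agent.act({"distance_to_object": 2.0}))
print(agent.act({"distance_to_object": 0.05}))
```

Swapping in a better gripper module changes nothing else in the agent, which is roughly the sense in which a canonical expert module could be shared.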

You can walk that reasoning up the abstraction layers and conclude that (as usual with these turtle stacks) the distinctions arise as each subsystem/module specializes more granularly for the environment it operates in.

I think it's probably forever and always true that any system designed to explore/exploit a bounded environment with comprehensive observations will outperform a system that is required to adapt its sense-plan-do components to the bounded environment without similar observations.

A system would either have to generate different observations than the native agent, or change the boundaries of the environment in a way that is unavailable to the native agent in order to outperform it.

I'm pretty sure it's a relatively small model?

If you had the same quantity of text data as GPT-4 + comparable quantity of data for other domains, it could probably learn transferable skills across those domains.

But it would take a huge amount of processing power that is probably not attainable today.

>Is there a reason I should not believe that an expert model will always outperform a general purpose one, even if it's a metatransformer?

If a general purpose model beats the specialized one, you could almost certainly distill the general purpose one into a better specialized one.
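For anyone who hasn't seen it, "distill" here presumably means standard knowledge distillation: train the small specialized model to match the big model's softened output distribution. A rough sketch of just the loss term, in plain Python, assuming the usual temperature-scaled softmax plus KL divergence (the function names and the temperature of 2.0 are my own choices, not anything from the paper):

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    # Temperature > 1 flattens the distribution, exposing the teacher's
    # relative preferences among the non-top classes.
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits: List[float],
                      student_logits: List[float],
                      temperature: float = 2.0) -> float:
    # KL(teacher || student) over the temperature-softened distributions.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher = [4.0, 1.0, 0.5]        # hypothetical general-purpose model logits
good_student = [3.5, 1.2, 0.4]   # hypothetical student close to the teacher
bad_student = [0.5, 1.0, 4.0]    # hypothetical student far from the teacher

print(distillation_loss(teacher, good_student))
print(distillation_loss(teacher, bad_student))
```

In a real setup this term would be minimized over the specialized task's data, usually alongside the ordinary supervised loss.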

Yo dawg, we just need to figure out what x converges to as you apply transformer() infinite times and then finally attention will no longer be all you need:

transformer(transformer(transformer( ... x ... ))) = ?
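Half-seriously: the limit only exists under extra assumptions. If the map happened to be a contraction, repeated application would converge to a unique fixed point. Nothing in the thread says a transformer is one, so the sketch below uses a made-up scalar `toy_layer` as a stand-in purely to show the iteration.

```python
from typing import Callable, Tuple


def iterate_to_fixed_point(f: Callable[[float], float], x: float,
                           tol: float = 1e-12,
                           max_iters: int = 10_000) -> Tuple[float, int]:
    # Apply f repeatedly until successive values stop changing.
    for step in range(1, max_iters + 1):
        nxt = f(x)
        if abs(nxt - x) < tol:
            return nxt, step
        x = nxt
    raise RuntimeError("did not converge; f may not be a contraction")


def toy_layer(x: float) -> float:
    # Hypothetical stand-in, NOT a transformer: a contraction with slope 0.5,
    # whose fixed point solves x = 0.5 * x + 1.
    return 0.5 * x + 1.0


limit, steps = iterate_to_fixed_point(toy_layer, x=10.0)
print(limit, steps)
```

For a contraction the answer doesn't depend on the starting x, which is about as far from "attention is all you need" as you can get.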

Performance is bounded, so outperformance will approach epsilon?