by dt23 1149 days ago
They are quite different neural net architectures! CNNs (i.e. convolutional neural nets) slide small learned filters over little "patches" of the input to pick up features such as edges. As you go deeper in the network, the filters tend to pick up more and more abstract features, like textures or kinds of object. All this is typically passed into a fully connected network which then gives a prediction. CNNs are most famously used for image classification.
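
To make the "patch" idea concrete, here is a rough sketch in plain Python of a single filter sliding over a tiny image. The `conv2d` helper, the 4x4 image and the hand-picked kernel are all made up for illustration; a real CNN learns its kernel weights and stacks many such layers.

```python
def conv2d(image, kernel):
    """Slide a small kernel over a 2D image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Each output value depends only on a local kh x kw patch.
            acc = 0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out


# Hypothetical 4x4 image: dark left half (0), bright right half (1).
image = [[0, 0, 1, 1] for _ in range(4)]

# Hand-picked vertical-edge kernel (an assumption; real CNNs learn these).
kernel = [[-1, 1],
          [-1, 1]]

edges = conv2d(image, kernel)
print(edges)  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The output is only non-zero in the middle column, where the patch straddles the dark-to-bright boundary. That is the "local structure" a CNN exploits.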

In contrast, an LLM is usually built from a stack of Transformer blocks, which use something called "self-attention": it updates each piece of input seen so far to include information about its relationship with the other bits of input. In text, this is a natural thing to do (what role does my word play in the sentence); you can also do it with images (giving you Vision Transformers, aka ViTs) but it might be less natural. After self-attention comes a little fully connected network, and then the output is passed into the next Transformer block, and so on, a number of times (6 in the original Transformer, dozens in large LLMs), until the output of the final block is used to make the prediction.
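
For comparison, a minimal sketch of self-attention in plain Python. It is heavily simplified: one head, no learned query/key/value projections (the token vectors themselves play all three roles), and no causal mask, so every token looks at every other token. The `self_attention` helper and the three toy token vectors are made up for illustration.

```python
import math


def softmax(xs):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with queries = keys = values = tokens."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Score this token against every token in the context.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        weights = softmax(scores)
        # New representation: weighted mix of all token vectors.
        mixed = [sum(w * v[c] for w, v in zip(weights, tokens))
                 for c in range(d)]
        out.append(mixed)
    return out


# Three hypothetical 2-dimensional token embeddings.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = self_attention(tokens)
```

Unlike the convolution, each output row here is computed from the whole context rather than a fixed local patch, with the weights decided by how similar the tokens are to each other.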

In a nutshell: very different architectures, exploiting different aspects of the data (CNN: local structure; Transformer: relationship between all elements in the context).