"In deep learning your learning engine is gradient descent." – Francois Chollet

The implication of what Francois is saying is that LLMs are not "lookup tables" like the first generation of Alexa. Rather, "knowledge" is implicitly embedded within a high-dimensional space.
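
A toy contrast makes the distinction concrete. The vectors below are hand-picked for illustration (a real model learns them with gradient descent); the point is only that a lookup table needs an exact key, while an embedding space retrieves by proximity of meaning.

```python
import numpy as np

# A lookup table only answers questions it has seen verbatim.
lookup_table = {"what is the capital of france": "Paris"}
print(lookup_table.get("capital city of france?"))  # None: exact match required

# A tiny, hand-made embedding space: similar meanings land near each other.
embeddings = {
    "what is the capital of france": np.array([0.90, 0.10, 0.00]),
    "capital city of france?":       np.array([0.85, 0.15, 0.05]),
    "best pizza toppings":           np.array([0.00, 0.20, 0.95]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

query = embeddings["capital city of france?"]
for text, vec in embeddings.items():
    print(f"{cosine(query, vec):.2f}  {text}")
# The paraphrase scores ~1.0 against the original question, while the
# unrelated query scores far lower: meaning lives in the geometry, not in keys.
```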

These parametric curves trained with gradient descent are a great fit for everything that is System 1-type thinking: pattern recognition, intuition, memorization, etc.

The analogy is imperfect, but the model maps inputs to somewhere in "latent space", and the output is a probability distribution over tokens read off from that space. However, these models generalise within the distribution of the data they were trained on; the structure of the world they build comes from the data and from gradient descent. This is why so much value and effort goes into prompt engineering: it is essentially trying to get the model to do interesting things, to move away from the average, to explore some part of latent space with low probability. I no doubt have the details of what's going on under the hood wrong; we're still discovering how LLMs really think. See Anthropic's "Tracing the thoughts of a large language model".
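
A minimal sketch of the last step, with a made-up five-word vocabulary and made-up logits standing in for a real forward pass. Temperature is one blunt knob for "moving away from the average": low values concentrate probability on the most likely tokens, higher values spread it toward the low-probability parts of the distribution.

```python
import numpy as np

# Toy vocabulary and invented logits; a real LLM produces one logit per
# vocabulary token from the prompt's representation in latent space.
vocab  = ["the", "a", "quantum", "nebula", "average"]
logits = np.array([4.0, 3.5, 0.5, 0.2, 3.8])

def softmax(x, temperature=1.0):
    z = x / temperature
    z = z - z.max()          # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

for t in (0.2, 1.0, 2.0):
    probs = softmax(logits, temperature=t)
    print(f"T={t}: " + ", ".join(f"{w}={p:.2f}" for w, p in zip(vocab, probs)))

# Low temperature piles probability onto the most likely tokens; higher
# temperature flattens the distribution, so sampling wanders further from
# the mode.
```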

In Google’s Q3’24 earnings call, Sundar Pichai said more than 25% of all new code at Google is generated by AI. If the future is here but not universally distributed, what percentage of the web will be AI-generated? And what is the AI training on? The web. This recursive loop is what folks call “model collapse”.

"The more the Internet consists of AI-generated texts, the narrower the distribution of future training data becomes. LLMs trained on this text may become even more narrow, starting a downward spiral. This is called model collapse: Tails of the distribution are thinned out and disappear. The distribution becomes narrower over time." – @ChristophMolnar
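
A toy simulation of that dynamic, not the real LLM setup: fit a Gaussian to some data, sample new "training data" from the fit, refit, and repeat. The sample size and generation count below are arbitrary choices, but the fitted spread reliably shrinks over generations, i.e. the tails thin out.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_generations = 50, 200

data = rng.normal(loc=0.0, scale=1.0, size=n_samples)  # the "human" data
for gen in range(1, n_generations + 1):
    mu, sigma = data.mean(), data.std()        # fit this generation's model
    data = rng.normal(mu, sigma, n_samples)    # next generation trains on its output
    if gen % 50 == 0:
        print(f"generation {gen:3d}: fitted std = {sigma:.3f}")

# Typical output: the fitted std drifts from ~1.0 toward 0 as the
# distribution narrows and the tails disappear.
```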

Much of knowledge work lives in the tails of the distribution. To be valuable as a knowledge worker in the economy, you must seek “the edge of chaos”.

[Image: Cosmic Nebula]