
Sunday, March 31, 2024

This AI Business

Long-horizon tasks

"And so the curve you see is, it gets it right one in a thousand, then one in a hundred, then one in ten, and so forth."

Huge context length

In-context learning is quicker

Ability to work on a thing for months on end

Reliability in chaining tasks

MMLU scores

Million-token attention

Quadratic attention costs

GOFAI ("Good old fashioned artificial intelligence")


"people often talk about how attention at inference time is such a huge cost. When you're actually generating tokens, the operation is not n squared. One set of Q-vectors looks up a whole bunch of KV-vectors and that's linear with respect to the amount of context that the model has."


Picasso vs. Cézanne

"The broader thing being that if you're learning in the forward pass, it's much more sample efficient because you can basically think as you're learning. Like when you read a textbook, you're not just skimming it and trying to absorb inductively, “these words follow these words.” You read it and you think about it, and then you read some more and you think about it some more. "

This seems like the Cézanne way of learning. Most poets go through rounds of editing and revision that LLMs are not trained to do.


 "residual stream, Sholto alluded to the read-write operations, as a poor man's adaptive compute."

"So for the residual stream, imagine you're in a boat going down a river and the boat is the current query where you're trying to predict the next token. So it's “the cat sat on the _____.” And then you have these little streams that are coming off the river where you can get extra passengers or collect extra information if you want. And those correspond to the attention heads and MLPs that are part of the model. " - Trenton

"five to seven levels of recursion"

CNNs: convolutional neural networks were inspired by the visual cortex

"At least in the cerebellum you basically do have a residual stream in what we'll call the attention model for now–and I can go into whatever amount of detail you want for that–where you have inputs that route through it, but they'll also just go directly to the end point that that module will contribute to. So there's a direct path and an indirect path. and, and so the model can pick up whatever information it wants and then add that back in." - Trenton

"cerebellum nominally just does fine motor control but I analogize this to the person who's lost their keys and is just looking under the streetlight where it's very easy to observe this behavior."

"sherlock Holmes.. sample efficient"

Plumbing shortage: plumber robots?

Training incentives for plumbers?

A rich guy with a request for God: fulfilling another person's wish for God's undivided attention at the same time is Pareto efficiency.

If alignment happens before the intelligence explosion
