For the uninitiated, what's a "hybrid linear attention architecture"?
1/4 of their layers are conventional quadratic attention; the other 3/4 use linear attention, which is where the "hybrid" comes from.
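If it helps to see the quadratic part concretely, here's a minimal NumPy sketch of plain (non-linear) single-head attention. The names and sizes are made up for illustration, not taken from this model:

    import numpy as np

    n, d = 1024, 64                  # sequence length, per-head dimension (made up)
    Q = np.random.randn(n, d)        # one query vector per token
    K = np.random.randn(n, d)        # one key vector per token
    V = np.random.randn(n, d)        # one value vector per token

    scores = Q @ K.T / np.sqrt(d)    # shape (n, n): every token scored against every token
    scores -= scores.max(axis=-1, keepdims=True)    # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    out = weights @ V                # shape (n, d)

    # That (n, n) scores matrix is the "quadratic" part: doubling the number
    # of tokens quadruples the memory and compute. Linear-attention layers
    # avoid materializing it, at some cost in modeling quality.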
Could someone explain every term in this subthread in a very simple way to someone who basically only knows "transformers are a neural network architecture that use something called 'attention' to consider the entire input the whole time or something like that", and who does not understand what "quadratic" even means in a time complexity or mathematical sense beyond that "quad" has something to do with the number four.
I am aware I could Google it all or ask an LLM, but I'm still interested in a good human explanation.
any hardware recommendations? how much memory do we need for this?
You will effectively want a 48GB card or more for quantized versions; otherwise you won't have meaningful space left for the KV cache. Blackwell and above is generally a good idea to get faster hardware support for 4-bit formats (some recent models took some time to ship for older architectures, gpt-oss IIRC).
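For a rough sense of why the KV cache eats so much VRAM, here's back-of-the-envelope math. The hyperparameters are assumptions for illustration, not this model's actual config:

    # KV cache size ~= 2 (K and V) * layers * kv_heads * head_dim * bytes/elem * tokens
    layers, kv_heads, head_dim = 48, 8, 128   # assumed values, check the model config
    bytes_per_elem = 2                        # fp16/bf16 cache
    context = 128_000                         # tokens kept in the cache

    kv_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem * context
    print(f"{kv_bytes / 2**30:.1f} GiB")      # ~23.4 GiB here, on top of the weights

That's the cache alone at full context, which is why the quantized weights plus cache won't fit comfortably on a 24GB card.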
https://lifearchitect.ai/models-table/
125 upvotes with 2 comments is kinda sus
Lots of model releases are like this. We can only upvote. We can't run the model on our personal computers, and we can't test their 'Efficient Attention' concept on them either.
Honestly, it would take 24 hours just to download the 98 GB model if I wanted to try it out (assuming I had a card with 98 GB of RAM).