I built it at home this morning and tried it. Perhaps my expectations were too high, but I wasn't terribly impressed. I asked it for a list of ten types of data I might show on a home info display panel. It gave me three. I clarified that I wanted ten; it gave me six. Every request after that just returned the same six things.
I know it's not GPT-4, but I've tried other very small models that run on CPU only and had better results.
This is a technology demo, not a model you'd want to use. Because BitNet models average only ~1.58 bits per weight, you'd expect to need a much larger parameter count than your fp8/fp16 counterparts. On top of that, this is only a 2-billion-parameter model in the first place, and even fp16 2B-parameter models generally perform pretty poorly.
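For a rough sense of what the low bit width buys, the weight storage for 2B parameters works out roughly like this (my own back-of-the-envelope numbers, not from the paper):

    # Back-of-the-envelope weight storage for a 2B-parameter model.
    params = 2e9

    fp16_bytes   = params * 16 / 8     # 16 bits per weight
    packed_bytes = params * 2 / 8      # 2 bits per weight (4 ternary weights per byte)
    ideal_bytes  = params * 1.58 / 8   # ~1.58 bits per weight, the information-theoretic floor

    print(f"fp16:          {fp16_bytes / 1e9:.2f} GB")   # ~4.00 GB
    print(f"2-bit packed:  {packed_bytes / 1e9:.2f} GB")  # ~0.50 GB
    print(f"ideal ternary: {ideal_bytes / 1e9:.2f} GB")   # ~0.40 GB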
OK, that's fair. I still think something was up with my build, though; the online demo worked far better than my local build.
> I've tried other very small models that run on CPU only and had better results
Maybe you can share some comparative examples?
Sure, here's my conversation with BitNet b1.58 2B4T:
https://pastebin.com/ZZ1tADvp
Here's the same prompt given to smollm2:135m:
https://pastebin.com/SZCL5WkC
The quality of the second set of results is not fantastic. The data isn't public, and it repeats itself, mentioning income a few times. I don't think I would use either of these models for accurate data, but I was surprised by the truncated results from BitNet.
Smollm2:360M returned better-quality results with no repetition, but it did suggest things that didn't fit the brief exactly (public data, given location only).
https://pastebin.com/PRFqnqVF
Edit:
I tried the same query on the live demo site and got much better results. Maybe something went wrong on my end?
You were using bitnet.cpp?
Yes
Is there a reason why the 1.58-bit models are always quite small? I think I've seen an 8B, but that's about it.
Is there a technical reason for it, or is it just research convenience?
I suspect it's because current GPU hardware can't efficiently train such low-bit-depth models. You end up needing the activations to use 8 or 16 bits in all the data paths, and you don't get any more throughput per cycle on the multiplications than you would have with FP32.
Custom silicon would solve that, but nobody wants to build custom silicon for a data format that will go out of fashion before the production run is done.
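To make the training point concrete, here's roughly the usual quantization-aware-training trick, sketched in PyTorch (my illustration, not the actual BitNet training code): the latent weights stay full precision, ternary values only exist on the forward pass, and a straight-through estimator routes gradients back to the fp weights, so the matmuls still run at normal GPU precisions.

    import torch

    def ternary_quantize(w: torch.Tensor) -> torch.Tensor:
        # Absmean scaling then rounding to {-1, 0, +1}, roughly in the spirit of
        # the BitNet b1.58 papers; the scale stays full precision.
        scale = w.abs().mean().clamp(min=1e-5)
        return (w / scale).round().clamp(-1, 1) * scale

    class TernaryLinear(torch.nn.Linear):
        def forward(self, x: torch.Tensor) -> torch.Tensor:
            w = self.weight
            # Straight-through estimator: the forward pass sees quantized weights,
            # the backward pass treats quantization as identity, so gradients
            # update the latent fp weights. The matmul itself is still an
            # ordinary fp16/fp32 GEMM on today's GPUs.
            w_q = w + (ternary_quantize(w) - w).detach()
            return torch.nn.functional.linear(x, w_q, self.bias)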
The custom CUDA kernel for the 4-in-8 packing seems to have come out better than a naive approach (such as just treating each weight as an fp8/int8), and it lowers memory-bandwidth requirements. Custom hardware would certainly make that improvement even bigger, but I don't think that's what's limiting training to 2-8 billion parameters so much as research convenience while the groundwork for this type of model is still being figured out.
Makes sense. It might be good for memory-throughput-constrained devices though, so I'm hoping it'll pick up.
They aren't; there is a 1.58-bit version of DeepSeek that's around 200 GB instead of 700 GB.
That's not a real BitNet; it's just post-training quantisation, and its performance suffers compared to a model trained from scratch at 1.58 bits.
You can try out the model in a demo they have set up: https://bitnet-demo.azurewebsites.net/
I guess B1FF@BITNET posts are gonna come from an LLM now.
Context: https://web.archive.org/web/20030830105202/http://www.catb.o...
Does someone have a good understanding of how 2B models can be useful in production? What tasks are you using them for? I wonder what tasks you can fine-tune them on to produce 95-99% results (if anything).
The use cases for small models include sentiment and intent analysis, spam and abuse detection, and classification of various sorts. Generally LLMs are thought of as chat models, but the output need not be a conversation per se.
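For example, a classifier can just be a constrained prompt to a small local model. A rough sketch (using llama-cpp-python here rather than bitnet.cpp, and the model path is a placeholder):

    from llama_cpp import Llama  # llama-cpp-python

    # Placeholder path: point this at whatever small GGUF model you're testing.
    llm = Llama(model_path="./models/small-model.gguf", n_ctx=512, verbose=False)

    LABELS = ["positive", "negative", "neutral"]

    def classify_sentiment(text: str) -> str:
        prompt = (
            f"Classify the sentiment of the following message as one of: {', '.join(LABELS)}.\n"
            f"Message: {text}\n"
            "Sentiment:"
        )
        out = llm(prompt, max_tokens=5, temperature=0.0)
        answer = out["choices"][0]["text"].strip().lower()
        # Small models wander, so fall back to a default on malformed output.
        return answer if answer in LABELS else "neutral"

    print(classify_sentiment("The checkout page keeps crashing and support never replies."))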
My impression was that text embeddings are better suited for classification. Of course the big caveat is that the embeddings must have "internalized" the semantic concept you're trying to map.
From an article I have in draft, experimenting with open-source text embeddings:

./match venture capital
purchase 0.74005488647684
sale 0.80926752301733
place 0.81188663814236
positive sentiment 0.90793311875207
negative sentiment 0.91083707598925
time 0.9108697315425

./store silicon valley
./match venture capital
silicon valley 0.7245139487301
purchase 0.74005488647684
sale 0.80926752301733
place 0.81188663814236
positive sentiment 0.90793311875207
negative sentiment 0.91083707598925
time 0.9108697315425
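(The ./match lookup is essentially just a nearest-neighbour search over stored embeddings by cosine distance; a rough equivalent with sentence-transformers, model choice arbitrary:)

    from sentence_transformers import SentenceTransformer

    # Any open-source embedding model works; this one is just a common default.
    model = SentenceTransformer("all-MiniLM-L6-v2")

    store = ["purchase", "sale", "place", "positive sentiment",
             "negative sentiment", "time", "silicon valley"]
    store_vecs = model.encode(store, normalize_embeddings=True)

    def match(query: str) -> None:
        q = model.encode([query], normalize_embeddings=True)[0]
        # Cosine distance = 1 - cosine similarity; smaller means a closer match,
        # which appears to be the convention in the output above.
        dists = 1.0 - store_vecs @ q
        for term, d in sorted(zip(store, dists), key=lambda pair: pair[1]):
            print(f"{term} {d:.4f}")

    match("venture capital")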
Of course, you need to figure out what these black boxes actually understand. For sentiment analysis, for example, instead of matching against "positive" and "negative" you might end up using matching terms like "kawaii" and "student debt", depending on how the text embedding internalized positives and negatives from its training data.

Anything you'd normally train a smaller custom model for, but with an LLM you can use a prompt instead of training.
I'm more interested in how users are taking 95-99% to 99.99% for generation-assisted tasks. I haven't seen a review or study of techniques, even though on the ground it's pretty trivial to think of some candidates.
Three strategies seem to be:
- Use an LLM to evaluate the result and retry if it doesn't match (rough sketch below).
- Let users trigger a retry.
- Let users edit.
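A sketch of the first strategy (generate and judge are placeholders for whatever model calls or rule-based checks you'd plug in):

    from typing import Callable

    def generate_with_retry(
        prompt: str,
        generate: Callable[[str], str],      # your model call
        judge: Callable[[str, str], bool],   # LLM-as-judge or a cheap rule-based check
        max_attempts: int = 3,
    ) -> str:
        # Strategy 1: generate, evaluate, retry on failure.
        last = ""
        for _ in range(max_attempts):
            last = generate(prompt)
            if judge(prompt, last):
                return last
            # Feed the rejected attempt back so the next try can correct it.
            prompt = f"{prompt}\n\nPrevious attempt was rejected:\n{last}\nTry again."
        # Give up and surface the last attempt; at that point strategies 2 and 3
        # (user-triggered retry, user edit) take over.
        return last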
The on-device models used for Apple Intelligence (writing tools, notification and email/message summaries, etc.) are around 3B parameters.
I mean, they could be better (to put it nicely), but there is a legitimate use-case for them and I'd love to see more work in this space.
https://machinelearning.apple.com/research/introducing-apple...
https://arxiv.org/abs/2407.21075
I'm just playing and experimenting with local LLMs to see what I can do with them. One thing that comes to mind is gaming, e.g. text/dialogue generation in procedural worlds and adventures.
2B models by themselves aren't so useful, but this is very interesting as a proof of concept, because applying the same technique to a 200B model could produce one that's much more efficient (cheaper and more environmentally friendly) than existing 200B models, especially with specialised hardware support.
Summarization on mobile/embedded might be a good use case?
They give some description of how their weights are stored: they pack 4 weights into an int8, indicating that their storage format isn't optimal (2 bits per weight instead of the optimal ~1.58 bits). But I don't know enough about LLM internals to know how material this is.
Could anyone break down the steps further?
This model maps weights to ternary values {-1, 0, 1} (aka trits). One trit holds log(3)/log(2) ≈ 1.58 bits of information. To represent a single trit by itself would require 2 bits, but it is possible to pack 5 trits into 8 bits. This article explains it well: https://compilade.net/blog/ternary-packing
By using 4 ternary weights per 8 bits, the model is not quite as space-efficient as it could be in terms of information density: (4 × 1.58)/8 ≈ 0.79 of the theoretical maximum, vs (5 × 1.58)/8 ≈ 0.99. There is currently no hardware acceleration for operating on 5 trits packed into 8 bits, so the weights would have to be packed and unpacked in software, and packing 5 weights into 8 bits requires slower, more complex packing/unpacking logic.
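Concretely, the two packings look something like this (a toy sketch, not the model's actual kernels):

    def pack4(trits):
        # 4 ternary weights -> 1 byte at 2 bits each (the packing this model uses).
        b = 0
        for t in trits:                      # each t is in {-1, 0, 1}
            b = (b << 2) | (t + 1)           # map {-1, 0, 1} -> {0, 1, 2}
        return b

    def pack5(trits):
        # 5 ternary weights -> 1 byte in base 3, since 3**5 = 243 <= 256.
        b = 0
        for t in trits:
            b = b * 3 + (t + 1)
        return b

    def unpack5(b):
        out = []
        for _ in range(5):
            out.append(b % 3 - 1)
            b //= 3
        return out[::-1]                     # digits come out least-significant first

    print(unpack5(pack5([1, -1, 0, 1, 1])))  # [1, -1, 0, 1, 1]
    # Density: 4 per byte -> 4*1.58/8 ~ 0.79 of optimal; 5 per byte -> ~0.99.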
That link gives a great description of how to pack trits more efficiently, thanks. Encoding in "base 3" was obvious to me, but I didn't realise that 5 trits fit quite tightly into a byte, or that it's possible to "space the values apart" so that they can be extracted using just multiplications and bitwise ops (no division or remainder).
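My reading of the trick in that post (details may differ from the real implementation): the packed byte is the base-3 value scaled up by 256/243, so each digit can be peeled off with just a multiply by 3, a shift, and a mask:

    def pack5_spaced(trits):
        # Base-3 value in [0, 242], then scaled into a byte (ceil(n * 256 / 243))
        # so that multiplying by 3 pushes the leading ternary digit into bits 8-9.
        n = 0
        for t in trits:                      # each t is in {-1, 0, 1}
            n = n * 3 + (t + 1)
        return -(-n * 256 // 243)            # ceiling division; still fits in a byte

    def unpack5_spaced(b):
        out = []
        for _ in range(5):
            b *= 3
            out.append((b >> 8) - 1)         # the high bits are the next digit
            b &= 0xFF                        # keep the "fractional" remainder
        return out

    w = [1, -1, 0, 1, 1]
    assert unpack5_spaced(pack5_spaced(w)) == w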
Not to be confused with BITNET
https://en.m.wikipedia.org/wiki/BITNET
I asked about the last French election and the first sentence of the answer was:
>Marine Le Pen, a prominent figure in France, won the 2017 presidential election despite not championing neoliberalism. Several factors contributed to her success: (…)
What data did they train their model on?