
Running this model on consumer hardware?

#3
by whsinth - opened

I tried running this model, and the 2t one, on my hardware (an RTX 5090 with 32GB of VRAM), but I think I ran out of VRAM and the machine froze.

I then tried bnb-my-repo, resulting in this, which does run but only produces gibberish. I'm not sure whether the problem is in the quantization process or whether the model just works like that.

from transformers import pipeline
pipe = pipeline("text-generation", model="whsinth/comma-v0.1-1t-bnb-4bit")
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at whsinth/comma-v0.1-1t-bnb-4bit and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cuda:0
pipe("```py\n# Hello world\n", max_new_tokens=30, num_return_sequences=3)
[{'generated_text': '```py\n# Hello world\n Gordon Essex abundances shadingagedy tapping percentilamical ∆ipucky pneumonia tool Continuousilonsha pneumoniasensor pneumonia�526disabled pneumonia Amo-enterblastonaws-phenylbenzene'},
 {'generated_text': '```py\n# Hello world\n Excellentcpkg extruded digestbare florobert MBlinea spending JKlineamiguckyasurerail\\new))\nOr\tvobilayoutConsequently.Equ_offEquiReviewed Excellentyar Athe'},
 {'generated_text': "```py\n# Hello world\n Excellent.npmDescriptorERK?id+, renormalpolymeDragLY µ vanishing Thereafter establishments femal mong presumably maneuWn-endLYprint English'=>$wu\\mathrm incis�.trace`)\n"}]

Well, I figured out a few things.

Short version: Here's the BitsAndBytes version: https://huggingface.co/whsinth/comma-v0.1-1t-bnb-8b . It was made simply by loading the original model in Transformers, asking it to use BNB, then saving it (although the Transformers docs don't tell you much about how to save it without uploading to HF - feels like HF is AI's Vercel here...). You can also use it in vLLM.
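
Roughly what that looked like - a sketch, not my exact script; the original repo id and the 8-bit config here are my assumptions:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "common-pile/comma-v0.1-1t"  # assuming this is the original repo id
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

# Load the original weights and let Transformers quantize them with BNB on the fly
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# save_pretrained writes the quantized checkpoint to a local folder - no push_to_hub needed
model.save_pretrained("comma-v0.1-1t-bnb-8bit")
tokenizer.save_pretrained("comma-v0.1-1t-bnb-8bit")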

From my brief testing I think it does get worse than the full version, but I can't measure by how much since I ran the full version on a Colab CPU. One question takes about 8 minutes and, iirc, about 38GB of RAM.

Long version: if you're like me and you want to use this in the real world, this is not what you're looking for. Whatever you're looking for doesn't exist today (a model trained from 100% permissive sources that you can just ollama pull). Here's what I learned from the journey (and it may be incorrect...):

  • This is a base model, meaning it is essentially an autocomplete system. It doesn't talk like ChatGPT.
  • To use a base model, you'd need to think about what input format it was trained on (which I can't find info on) and replicate that. For example, octocoder was trained on GitHub issues, so you'd need to frame your question in the form of a GitHub issue (eg. "User 1: I want to do this. User 2 (AI): Sure, here's how you would do that " and then let the model fill in user 2's answer - see the sketch after this list). I don't know what this model's format is, but I think it worked well enough with a simple markdown prompt.
  • There are 1t and 2t versions, which are trained on 1 & 2 trillion tokens respectively. If you read the Comma paper, you'll know that 2t is actually the same data as 1t, just mostly duplicated twice.
  • You can't make this into the GGUF form used by Ollama/Llama.cpp. The converter in Llama.cpp has a hardcoded list of tokenizers, which I suppose is why you need to upgrade Ollama when new models come out. As mentioned in the paper, Comma has its own tokenizer to ensure that the tokenizer is also built from permissively licensed data, so you'd need to modify Llama.cpp to add it.
  • Someone will need to train this into an "instruction tuned" model, which I think is also hard. There are 3 approaches used on StarCoder that I know of which might be applied to this model:
    • StarCoder2-Instruct used some "seed" source code and let the model itself generate descriptions of the seeds. Then the model is trained by asking it to generate both code & tests from the descriptions. This approach doesn't use external data other than the seeds.
    • OctoCoder used CommitPackFT and OASST. CommitPackFT is a commit dataset derived from The Stack, so going this route means creating a clean version (iirc it exists? probably in the Common Pile too??), and OASST is a human roleplay dataset where humans act as both user and AI.
    • StarChat2 used UltraFeedback and Orca DPO. These are datasets of responses from ChatGPT 3.5 and friends and are often used to train models for chat. While OpenAI doesn't claim copyright over ChatGPT output, it may or may not be tainted depending on your view of ChatGPT's legal issues.
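
To illustrate the point about framing prompts for a base model (second bullet above): this is only a sketch of the kind of completion-style framing I mean, not a documented prompt format for Comma, and the repo id is my assumption.

from transformers import pipeline

pipe = pipeline("text-generation", model="common-pile/comma-v0.1-1t")  # assuming the original repo id

# Frame the question as text the model can plausibly continue,
# rather than as a chat-style instruction
prompt = (
    "Issue: How do I reverse a list in Python?\n"
    "Answer: You can do it like this:\n"
    "```py\n"
)
print(pipe(prompt, max_new_tokens=40)[0]["generated_text"])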

I feel compelled to correct you.

if you're like me and you want to use this in the real world, this is not what you're looking for.

You can use this in the real world if you understand its limitations. Like other permissive models, this is a sentence completion model: you let it finish your sentences. It says as much in the paper. Being a base model has nothing to do with it.

  • You can't make this into the GGUF form used by Ollama/Llama.cpp.

No. You could make it into GGUF, just not today. This is true for all new models, and Comma is no exception: GGUF support always has a turnaround time. Hit up Gerganov's team about getting the tokenizer added to the list, or open a PR about it on their GitHub. This is why we like open source.

As for the rest of the remarks, they seem unrelated or undirected. If you have 32GB of VRAM you have more than enough power to run this, and you're doing something else wrong. It happens, not to worry. I'm on Discord if you'd like help writing code to run this model in Transformers.
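
For reference, something along these lines should fit comfortably on a 32GB card - a minimal sketch, assuming the original repo id and half-precision weights (adjust to taste):

import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="common-pile/comma-v0.1-1t",  # assuming the original repo id
    torch_dtype=torch.bfloat16,         # half precision roughly halves memory vs fp32
    device_map="auto",
)
print(pipe("```py\n# Hello world\n", max_new_tokens=30)[0]["generated_text"])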

In conclusion, I'm glad to know there is demand for this kind of model, and I hope you can appreciate the work Eleuther has done towards creating a future where reasonably trained, attributed, and open models are more prevalent, or even the standard.

With love. X
