# deepseek-r1-distill-qwen-7b (and seemingly other reasoning models) exclude initial `<think>` tag #4761
## Comments
My example got removed by the markdown rendering, but the end of the second paragraph had the `</think>` tag without any initial `<think>` tag.
Same issue on my Mac Studio with Distill Qwen 32B: I only got the `</think>` tag.
Since this affects all thinking models, I've attempted to force the initial `<think>` tag myself.
You can fix this fairly easily; it doesn't really need a ticket. Step 1: create a new template file and call it something like `chatml-thinking-block.tmpl`, matching the name referenced under `template.chat` in the YAML below.
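A minimal sketch of what such a template could look like, assuming LocalAI's Go-template chat format (`.Input` and the chatml markers here are assumptions based on the stock chatml template; check your existing `.tmpl` files for the exact variables). The idea is to emit the opening `<think>` tag yourself, right after the assistant header, since the model starts its reasoning without it:

```
{{/* sketch: open the assistant turn and force the opening <think> tag */}}
{{.Input -}}
<|im_start|>assistant
<think>
```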
Then, inside your YAML file for the gguf, reference it. Mine is:

```yaml
context_size: 65536
f16: true
threads: 4
gpu_layers: 90
name: DeepSeek R1 Distill Qwen 32B
tensor_split: "90,0,0"
main_gpu: "0"
prompt_cache_all: false
parameters:
  model: DeepSeek-R1-Distill-Qwen-32B-abliterated-Q5_K_M.gguf
  temperature: 0.7
  top_k: 40
  top_p: 0.95
  batch: 512
  tfz: 1.0
  n_keep: 0
  # rope_freq_base: 4000000
template:
  chat_messages: chatml
  chat: chatml-thinking-block
  completion: completion
stopwords:
  - <|im_end|>
  - <|end▁of▁sentence|>
```
This does not work. It still strips the initial `<think>` tag.

Is there anything to stop it from showing in the chat window?
Are you using chat completion or text completion to run your examples through? In the above example I'm assuming you're using Open-WebUI (which uses chat completion), which would trigger that think block. Depending on how you're running LocalAI, you can run a different template for each. I decided to upload a video that shows the before/after for the fix I mentioned, in case you don't get it beforehand: https://www.youtube.com/watch?v=qswwhbXS8H0
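For reference, a sketch of the two request styles against LocalAI's OpenAI-compatible API (the base URL and model name below are assumptions taken from the YAML above; adjust to your deployment):

```python
import requests

BASE = "http://localhost:8080"          # assumed LocalAI address
MODEL = "DeepSeek R1 Distill Qwen 32B"  # `name` from the YAML above

# Chat completion: LocalAI renders the `chat` template
# (chatml-thinking-block above), so the forced <think> tag applies here.
chat = requests.post(f"{BASE}/v1/chat/completions", json={
    "model": MODEL,
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
}).json()
print(chat["choices"][0]["message"]["content"])

# Text completion: LocalAI renders the `completion` template instead,
# so the same question can behave differently here.
comp = requests.post(f"{BASE}/v1/completions", json={
    "model": MODEL,
    "prompt": "Why is the sky blue?",
}).json()
print(comp["choices"][0]["text"])
```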
If you're referring to Open-WebUI (the think block), then I'm not sure. In your example you're adding a stop word of `<|think|>`. This can be a problem because these models "think" first, then give results; in other words, generation would stop almost immediately. If you're consuming the message programmatically, I'd consider using a regex to remove the block you don't want (see the sketch below). If it's with Open-WebUI, it shows up as a small box you can click on to expand (or at least it does for me). Personally, I found the "thinking" models don't really hit what I want, so I went back to plain Qwen primarily.
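A minimal regex sketch along those lines (my own pattern, not anything LocalAI ships; it tolerates the missing opening tag this issue is about):

```python
import re

# Remove a DeepSeek-style reasoning block from a completion, tolerating
# the missing opening <think> tag described in this issue.
THINK_BLOCK = re.compile(r"^\s*(?:<think>)?.*?</think>\s*", re.DOTALL)

def strip_think(text: str) -> str:
    """Return the completion with the leading think block removed."""
    return THINK_BLOCK.sub("", text, count=1)

# The broken output this issue describes: no opening tag, closing tag present.
print(strip_think("The sky scatters light...</think>The sky is blue."))
# -> "The sky is blue."
```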
It doesn't affect the output; it's just how I know when the sentence has ended. I'm using the LocalAI UI, and I've noticed this more with Deepseek, but also with a few others after a while. @RealJohnGalt, how much VRAM does DeepSeek-R1-Distill-Qwen-32B-abliterated-Q5 use on your system?
I think you meant to ask me, since I'm the one running these 32B models. The answer is ~46 GB of video RAM. I have three 48 GB cards and run 64k context to keep all the layers on one card. I notice a large enough speed difference when I go across cards that splitting isn't worth it except in very specific situations.
Do you ever have a RAM emergency? I hear having too much can cause this effect. We do need a little chart for a few models, listing VRAM usage and speed; something the community can put together. I suppose it could be called model benchmarks? This would give a rough guide on the systems needed to run this project happily. I wish I'd known months ago what I know now.
I'm unaware of any "RAM emergency". You have RAM that's taken up by three main components when it comes to these gguf files: the core layers, the context, and the "working" area (main_gpu in the config). If you split a model across cards, the overall use seems to increase (I believe the context is cloned to some degree across the cards).

The speed problem on my end comes from one main factor: the communication between the cards. I use A6000 ADAs, which run over the PCIe lanes rather than a direct connection between the cards. The machine is a bit older, PCIe 3 max, so it's a bit slower; plus the more layers there are, the slower the model is as well. So there are a few factors that come into play.

In terms of how much you can stuff on one card, TheBloke did quite a good job on this when he was active. You can see the table on https://huggingface.co/TheBloke/CapybaraHermes-2.5-Mistral-7B-GGUF that explains it; you can pull the tables depending on the size of the model and the context. Many of the newer models, think 32B and higher, support 128k context size. You don't need to run it at that, and I don't, to save on memory.

The main problem I have with three cards is when I have competing things I want to run that "lock" a card in a certain way. E.g. if I'm training a model, that card is locked and I can't use it for anything else while it's training. The video I posted a few replies up is actually from that workstation, and I get around it by restarting the corresponding container.

But, overall, my strategy is: Card 0 runs Qwen 32B at 64k context, and also runs my STT stack. For training, I primarily use Card 0 for smaller models, but that's logistically due to heat concerns; it sits on top of the rest, and training can take 24-48 hours at times. When that happens, I rebalance Card 0's load to Card 2 and kill the optional stuff. It's all scripted.
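As a rough sanity check on that ~46 GB figure, a back-of-the-envelope sketch (the Qwen2.5-32B geometry and the Q5_K_M weight size below are my assumptions; verify against the model card):

```python
# Back-of-the-envelope VRAM estimate for the config above.
# Assumed Qwen2.5-32B geometry: 64 layers, 8 KV heads (GQA), head_dim 128.
layers, kv_heads, head_dim = 64, 8, 128
ctx = 65536            # context_size from the YAML above
bytes_fp16 = 2         # f16: true, so fp16 KV cache

weights_gb = 23.3      # approx. file size of a 32B Q5_K_M gguf
kv_bytes_per_token = 2 * kv_heads * head_dim * bytes_fp16 * layers  # K and V
kv_gb = kv_bytes_per_token * ctx / 1024**3

print(f"KV cache ~{kv_gb:.1f} GB + weights ~{weights_gb} GB "
      f"= ~{kv_gb + weights_gb:.1f} GB before compute buffers")
# ~16.0 + 23.3 = ~39.3 GB, roughly consistent with the ~46 GB observed
# once working buffers are added.
```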
This was a joke (https://en.wikipedia.org/wiki/The_IT_Crowd, https://getyarn.io/yarn-clip/8dfec43b-6f8b-4ea6-8809-57e65e45dd07). This is what I'm talking about: the documentation should have a table with this kind of info. Thanks for giving us a little insight into your system.
**LocalAI version:**
464686a

**Environment, CPU architecture, OS, and Version:**
Linux + Vulkan
**Describe the bug**
(I've had to escape the think tags below, since I cannot get them to show raw due to markdown.)
deepseek-r1-distill-qwen-7b (and the other DeepSeek models I've tested) excludes the initial `<think>` tag but includes the closing `</think>` tag. This creates an issue with some frontends such as Open WebUI, since they recently added support for think tags and hiding/showing/giving stats on thinking.
**To Reproduce**
Ask any question. The response begins with the reasoning directly and ends with `</think>`, but has no opening `<think>` tag.
**Expected behavior**
I expect `<think>` to precede the thinking.

An example output from "why is the sky blue?":
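(The original example output was lost to markdown rendering; illustratively, as my own sketch rather than the reporter's actual output, the broken vs. expected shapes are:)

```
Actual:   Rayleigh scattering favors shorter wavelengths...</think>The sky is blue because...
Expected: <think>Rayleigh scattering favors shorter wavelengths...</think>The sky is blue because...
```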