
deepseek-r1-distill-qwen-7b (and seemingly other reasoning models) exclude initial <think> tag #4761

Open · RealJohnGalt opened this issue Feb 4, 2025 · 13 comments
Labels: bug (Something isn't working), unconfirmed

RealJohnGalt commented Feb 4, 2025

LocalAI version:
464686a

Environment, CPU architecture, OS, and Version:
linux + vulkan

Describe the bug
(I've left the literal think tags out of this report because Markdown hides them when rendered.)
deepseek-r1-distill-qwen-7b (and the other DeepSeek models I've tested) omits the initial <think> tag but does include the closing </think> tag afterward. This creates an issue with some frontends such as Open WebUI, which recently added support for think tags in order to hide/show the reasoning and report stats on it.

To Reproduce
Ask any question: the response begins with the reasoning directly and ends with a closing </think> tag, but has no opening <think> tag.

Expected behavior
I expect an opening <think> tag to precede the reasoning.

An example output for the prompt "why is the sky blue?":

Okay, the user is asking, "why is the sky blue?" I remember that there are a few explanations for this, so I need to figure out which one is the most accurate based on the provided context.

Looking at the context, there are two sources with the same question. Both sources explain that the sky appears blue because of the scattering of sunlight by atmospheric molecules. When sunlight enters the Earth's atmosphere, it's made up of all the colors of the spectrum. As it travels through the air, shorter wavelengths like blue and violet are scattered more than the longer wavelengths like red and orange. This scattering makes the sky appear blue, especially when the sun is high in the sky.

The context also mentions that in polluted skies, small particles like dust can scatter more blue light, making the sky look more blue. However, the user's question is general, so I don't need to include that unless it's necessary. Also, I should cite the sources that provide this information. Since both sources are from Britannica, I'll use [britannica.com] for the context.

I should structure my answer to first explain the phenomenon, then mention the factors that contribute to the sky's color. I'll include the citation at the end to give credit to the source. I'll keep it concise and clear, making sure it addresses the user's question directly.

The sky appears blue primarily due to the scattering of sunlight by molecules in the Earth's atmosphere. Sunlight is composed of all the colors of the visible spectrum, but shorter wavelengths like blue and violet are scattered more than longer wavelengths like red and orange. This scattering effect is more pronounced when the sun is high in the sky, as the sunlight travels through a larger portion of the atmosphere. The result is that the sky appears blue to our eyes [britannica.com].

RealJohnGalt added the bug (Something isn't working) and unconfirmed labels on Feb 4, 2025
@RealJohnGalt (Author)

The tags were stripped by Markdown above, but the end of the second paragraph had the closing </think> tag without any opening <think> tag.

@aotsukiqx

aotsukiqx commented Feb 5, 2025

Same issue on my Mac Studio with Distill Qwen 32B: I only get the closing </think> tag.

@RealJohnGalt (Author)

Since this affects all thinking models, I've attempted to force <think> to the beginning of responses in the YAML, but my attempts are either incorrect or something else is stripping out the initial <think>.

@TheDarkTrumpet

You can fix this fairly easily; it doesn't really need a ticket.

Step 1 - create a new template file, call it something like: chatml-thinking-block.tmpl, and inside it include the following:

{{.Input}}
<|im_start|>assistant
<think>

Then, inside your YAML file for the GGUF (mine is DeepSeek-R1-Distill-Qwen-32B-abliterated-Q5_K_M.yaml), you can have the following:

context_size: 65536
f16: true
threads: 4
gpu_layers: 90
name: DeepSeek R1 Distill Qwen 32B
tensor_split: "90,0,0"
main_gpu: "0"
prompt_cache_all: false
parameters:
  model: DeepSeek-R1-Distill-Qwen-32B-abliterated-Q5_K_M.gguf
  temperature: 0.7
  top_k: 40
  top_p: 0.95
  batch: 512
  tfz: 1.0
  n_keep: 0
#  rope_freq_base: 4000000
template:
  chat_messages: chatml
  chat: chatml-thinking-block
  completion: completion
stopwords:
- <|im_end|>
- <|end▁of▁sentence|>
- <|end▁of▁sentence|>
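
To sanity-check that the template is actually being applied, here is a minimal sketch of a chat-completion request against LocalAI's OpenAI-compatible endpoint (the host, port, and model name are assumptions about your setup; adjust them to match your deployment):

# Minimal sketch: send one chat-completion request and inspect whether the
# returned text starts with the opening <think> tag. Host, port, and model
# name below are assumptions; change them to match your own config.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "DeepSeek R1 Distill Qwen 32B",
        "messages": [{"role": "user", "content": "why is the sky blue?"}],
    },
    timeout=600,
)
content = resp.json()["choices"][0]["message"]["content"]
print(repr(content[:80]))  # look for the opening <think> tag at the very start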

@RealJohnGalt (Author)

You can fix this fairly easily; it doesn't really need a ticket. […]

This does not work; it still strips the initial <think> tag. Also, given that the expected behavior is for the tag to be kept, this should remain a ticket.
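
As a client-side stopgap, the missing tag can be re-inserted before handing the text to the frontend. A minimal sketch (the helper name is mine, and it assumes the response ends up with a closing </think> but no opening tag) would be:

# Hypothetical client-side workaround (not part of LocalAI): re-insert the
# opening <think> tag when only the closing tag is present, so frontends like
# Open WebUI can still detect the reasoning block.
def restore_think_tag(text: str) -> str:
    if "</think>" in text and "<think>" not in text:
        return "<think>\n" + text
    return text

# Example: a response that starts with raw reasoning and ends with </think>.
print(restore_think_tag("Okay, the user is asking...\n</think>\nThe sky appears blue..."))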

@Hello-World-Traveler

Is there anything to stop it from showing in the chat window?

<|assistant|>
 <|end▁of▁sentence|>
 <|think|>

@TheDarkTrumpet

TheDarkTrumpet commented Feb 25, 2025

Are you using chat completion or text completion to run your examples through? In the above example, I'm assuming one's using Open-WebUI (which uses chat completion), which would trigger that think block.

Depending on how you're running LocalAI, you can run docker logs -f and watch what comes into it to make sure the template is actually being hit. If you're using text completion, then the <think> prefix needs to be part of the input you send (since you send the entire prompt yourself) and would be part of that pipeline.

I decided to upload a video that describes the before/after for the fix I mentioned, in case you don't get it beforehand: https://www.youtube.com/watch?v=qswwhbXS8H0
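
For the text-completion case, a minimal sketch of what I mean (the endpoint, host, port, and model name are assumptions about your setup) is to end the prompt you send with the tag yourself:

# Sketch of a text-completion request where the prompt itself ends with <think>,
# since with text completion you send the entire prompt rather than relying on
# the chat template. Host, port, and model name are assumptions for illustration.
import requests

prompt = (
    "<|im_start|>user\nwhy is the sky blue?<|im_end|>\n"
    "<|im_start|>assistant\n<think>\n"
)
resp = requests.post(
    "http://localhost:8080/v1/completions",
    json={"model": "DeepSeek R1 Distill Qwen 32B", "prompt": prompt, "max_tokens": 1024},
    timeout=600,
)
print(resp.json()["choices"][0]["text"])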

@TheDarkTrumpet

Is there anything to stop it from showing in the chat window?

<|assistant|>
 <|end▁of▁sentence|>
 <|think|>

If you're referring to Open-WebUI (the think block), then I'm not sure. In your example you're adding a stop word of <|think|>. This can be a problem because these models "think" first, then produce the result; in other words, generation would stop on the first token. If you're consuming the message programmatically, then I'd consider using a regex to remove the block you're not interested in. If it's with Open-WebUI, it shows up as a small box you can click on to expand (or at least it does for me). Personally, I found the "thinking" models don't really hit what I want, so I went back to Qwen primarily.
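
If you do go the programmatic route, a minimal regex sketch (the function name is mine; it assumes the block is either complete or missing only the opening tag, as in this issue) could look like:

# Sketch of stripping the reasoning block before consuming the answer.
# Handles both a complete <think>...</think> block and the dangling case from
# this issue, where only the closing </think> is present.
import re

def strip_think_block(text: str) -> str:
    cleaned = re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL)
    cleaned = re.sub(r"^.*?</think>\s*", "", cleaned, flags=re.DOTALL)
    return cleaned.strip()

print(strip_think_block("reasoning goes here...</think>\nThe sky appears blue..."))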

@Hello-World-Traveler

and stability within the system.

<|assistant|>

<|assistant|>

<|assistant|> Preventative measures

It doesn't affect the output; it's just how I know when the sentence has ended.

I'm using the LocalAI UI, and I've noticed this more with DeepSeek, but also with a few others after a while.

@RealJohnGalt With DeepSeek-R1-Distill-Qwen-32B-abliterated-Q5, how much VRAM does this model use on your system?

@TheDarkTrumpet

I'm using the LocalAI UI, and I've noticed this more with DeepSeek, but also with a few others after a while.

@RealJohnGalt With DeepSeek-R1-Distill-Qwen-32B-abliterated-Q5, how much VRAM does this model use on your system?

I think you meant to ask me, since I'm the one running these 32B models. The answer is ~46 GB of video RAM. I have three 48 GB cards, and run 64k context to keep all the layers on one card. I notice a large enough speed difference when I split across cards that it isn't worth it except in very specific situations.

@Hello-World-Traveler

Do you ever have a ram emergency? I hear having too much can cause this effect.

We do need a little chart for a few models, listing VRAM usage and speed; something the community could put together. I suppose it could be called model benchmarks? This would give a rough guide to the systems needed to run this project happily. I wish I had known many months ago what I know now.

@TheDarkTrumpet

I'm unaware of any "ram emergency". RAM gets taken up by three main components when it comes to these GGUF files: the core layers, the context, and the "working" area (main_gpu in the config). If you split a model across cards, then the overall use seems to increase (I believe the context is cloned to some degree across the cards).

The speed problem on my end comes from one main factor: the communication between the cards. I use A6000 ADAs, which run over the PCIe lanes rather than a direct connection between the cards. The machine is a bit older, PCIe 3 max, so it's a bit slower; plus the more layers there are, the slower the model is as well. So there are a few factors that come into play on that.

In terms of how much you can stuff on one card, TheBloke did quite a good job on this when he was active. You can see the table on https://huggingface.co/TheBloke/CapybaraHermes-2.5-Mistral-7B-GGUF that explains it; the tables there break it down by model size and context. Many of the newer models, think 32B and higher, support a 128k context size. You don't need to run it at that, and I don't, to save on memory.
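
As a very rough back-of-the-envelope, a sketch of the estimate I use (all the layer/head numbers below are illustrative placeholders, not values from a specific model card) is GGUF file size plus the KV cache for your chosen context:

# Rough VRAM estimate: GGUF file size plus KV cache for the chosen context.
# KV cache bytes ~= 2 (K and V) * layers * context * kv_heads * head_dim * bytes per element.
# This ignores the extra "working" buffers, so treat it as a lower bound.
def estimate_vram_gb(gguf_size_gb: float, layers: int, context: int,
                     kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> float:
    kv_cache_bytes = 2 * layers * context * kv_heads * head_dim * bytes_per_elem
    return gguf_size_gb + kv_cache_bytes / 1024**3

# Illustrative numbers only; check the real model's metadata before relying on this.
print(round(estimate_vram_gb(23.0, layers=64, context=65536, kv_heads=8, head_dim=128), 1))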

The main problem I have with three cards is when there are competing things I want to run that "lock" a card in a certain way. E.g. if I'm training a model, that card is locked and I can't use it for anything else while it's training. The video I posted a few replies up is actually from that workstation, and I get around it by restarting the corresponding container. But, overall, my strategy is:

Card 0: Qwen 32B 64k context. This also runs my STT stack too.
Card 1: Qwen-code, 32b, 32k context and maybe a small model here and there depending on what I'm doing.
Card 2: Miscellaneous card - it runs my TTS, image models, and so on.

For training, I primarily use Card 0 for smaller models, but that's logistically due to heat concerns. It sits on top of the rest, and training can take 24-48 hours at times. When that happens, I rebalance Card 0's load to Card 2 and kill the optional stuff. It's all scripted.

@Hello-World-Traveler

ram emergency

This was a joke (https://en.wikipedia.org/wiki/The_IT_Crowd https://getyarn.io/yarn-clip/8dfec43b-6f8b-4ea6-8809-57e65e45dd07)

This is what I am talking about; the documentation should have a table with this kind of info.

Image

Thanks for giving us a little insight into your system.
