Eval bug: QWQ generates repeated text when running with reduced context length #12251
Comments
Use …
@ggerganov with this option, it responds with …
@remixer-dec you can also try to reduce the batch-size: …
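The exact command in that comment was not preserved, so the following is only a rough sketch of the suggestion, assuming the IQ4_XS quant linked below and illustrative (not quoted) values; llama-server exposes the logical batch size via --batch-size (-b) and the physical micro-batch via --ubatch-size (-ub):

# illustrative only: shrink the logical (-b) and physical (-ub) batch sizes
$ ./llama-server -m Qwen_QwQ-32B-IQ4_XS.gguf --ctx-size 1024 --batch-size 256 --ubatch-size 256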
@aviallon thanks for the suggestion, tried it, didn't work.
Update: I think it is related to the …
Increasing … Also, it's recommended to use greedy sampling (e.g. …).
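The comment above is truncated; one common way to approximate greedy sampling with llama-server (an assumption here, not necessarily the exact flags suggested in the thread) is to set the temperature to 0 and/or top-k to 1:

# illustrative: force (near-)greedy decoding via the server's default sampling parameters
$ ./llama-server -m Qwen_QwQ-32B-IQ4_XS.gguf --ctx-size 4096 --temp 0.0 --top-k 1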
I noticed that the model tends to repeat itself more if ctx-size-per-slot < batch-size.
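For context: llama-server divides --ctx-size across the --parallel slots, so the per-slot context is roughly ctx-size / parallel. A sketch of a configuration that hits the ctx-size-per-slot < batch-size condition above (values are illustrative, not taken from the thread):

# 8192 / 4 = 2048 tokens of context per slot, which is smaller than the 4096-token batch
$ ./llama-server -m Qwen_QwQ-32B-IQ4_XS.gguf --ctx-size 8192 --parallel 4 --batch-size 4096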
Is there any documentation about how slots are implemented and what the correct way to use them is? I feel like it is a very useful feature, but I am misunderstanding how it works. In the case of SGLang it's pretty obvious: you hit the server with a bunch of requests with similar tokens, and they are processed together, reusing the parts that are the same. With llama.cpp it's a bit more tricky from my understanding. Does each slot have its own KV cache? Who decides which slot a request goes to? Do idle slots guarantee immediate responses when others are busy? Is this why idle slots affect the output quality of a single used slot?
Name and Version
$ ./llama-server
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
version: 0 (unknown)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
Not sure why it shows an unknown version; possibly because the source was downloaded as a tgz from GitHub. I built it manually today, probably at commit ea00281.
I also tried earlier versions from a week and a month ago; they have the same issue.
Operating systems
Linux
GGML backends
CUDA
Hardware
RTX 4090
Models
Tried a lot of different quants from
https://huggingface.co/Qwen/QwQ-32B-GGUF and
https://huggingface.co/bartowski/Qwen_QwQ-32B-GGUF
for example
https://huggingface.co/bartowski/Qwen_QwQ-32B-GGUF/blob/main/Qwen_QwQ-32B-IQ4_XS.gguf
Problem description & steps to reproduce
When running llama-server with a fixed (reduced) context length, the model starts to repeat itself; the smaller the context length, the earlier the repetition starts.
To reproduce:
1. Run the server with --ctx-size 1024. I used some of the settings from here to test whether they help, except that I didn't touch the repetition penalty, since it should work just fine without it. Setting -n -2 does not help.
2. Send a chat completions request (see the sketch below).
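A minimal sketch of these two steps, assuming the IQ4_XS quant linked above and the server's default port 8080; the prompt is just an example, and the exact request body used in the report is not shown here:

$ ./llama-server -m Qwen_QwQ-32B-IQ4_XS.gguf --ctx-size 1024
# OpenAI-compatible chat completions endpoint
$ curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Count the letters in the word strawberry."}]}'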
Expected result: the model thinks as much as it wants, stopping only when it hits the context length limit. This is what happens in the AWQ + SGLang stack.
First Bad Commit
No response
Relevant log output