
Eval bug: QWQ generates repeated text when running with reduced context length #12251

Open
remixer-dec opened this issue Mar 7, 2025 · 8 comments

@remixer-dec commented Mar 7, 2025

Name and Version

$./llama-server
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
version: 0 (unknown)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

Not sure why it shows an unknown version, possibly because the source was downloaded as a tgz from GitHub. I built it manually today, probably at commit ea00281.
I tried earlier versions from a week and a month ago; they have the same issue.

Operating systems

Linux

GGML backends

CUDA

Hardware

RTX 4090

Models

I tried a lot of different quants from
https://huggingface.co/Qwen/QwQ-32B-GGUF and
https://huggingface.co/bartowski/Qwen_QwQ-32B-GGUF,
for example
https://huggingface.co/bartowski/Qwen_QwQ-32B-GGUF/blob/main/Qwen_QwQ-32B-IQ4_XS.gguf

Problem description & steps to reproduce

When running llama-server with a fixed (reduced) context length, the model starts to repeat itself; the shorter the context length, the earlier the repetition starts.

To reproduce:

  1. Run the server with --ctx-size 1024. I used some of the settings from here to test whether they help, except that I didn't touch the repetition penalty, since it should work fine without it. Setting -n -2 does not help.

  2. Send a chat completions request:

curl --request POST \
  --url http://localhost:40000/v1/chat/completions \
  --header 'Content-Type: application/json' \
  --data '{
"model":"/models/llms/Qwen_QwQ-32B-IQ4_XS.gguf",
"messages": [
             {"role": "user", "content": "Hello. How many Rs are in strawberry?"}
],
"max_tokens": 1024,
"temperature": 0.6
}'
  3. See this:

[screenshot of the repeated output]

Expected result: the model thinks as much as it wants and stops if it hits the context length limit. This is what happens with the AWQ + SGLang stack.

First Bad Commit

No response

Relevant log output

/llama-server --host 127.0.0.1 \
  --port 40000 \
  --model /models/llms/Qwen_QwQ-32B-IQ4_XS.gguf \
  --ctx-size 1024 \
  --gpu-layers 99 \
  --parallel 10 \
  --mlock \
  --samplers "top_k;top_p;min_p;temperature;dry;typ_p;xtc" \
  --temp 0.6
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
warn: LLAMA_ARG_HOST environment variable is set, but will be overwritten by command line argument --host
build: 0 (unknown) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
system info: n_threads = 12, n_threads_batch = 12, total_threads = 24

system_info: n_threads = 12 (n_threads_batch = 12) / 24 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

main: HTTP server is listening, hostname: 127.0.0.1, port: 40000, http threads: 23
main: loading model
srv    load_model: loading model '/models/llms/Qwen_QwQ-32B-IQ4_XS.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 23725 MiB free
llama_model_loader: loaded meta data with 37 key-value pairs and 771 tensors from /models/llms/Qwen_QwQ-32B-IQ4_XS.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = QwQ 32B
llama_model_loader: - kv   3:                           general.basename str              = QwQ
llama_model_loader: - kv   4:                         general.size_label str              = 32B
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                       general.license.link str              = https://huggingface.co/Qwen/QWQ-32B/b...
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen2.5 32B
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2.5-32B
llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["chat", "text-generation"]
llama_model_loader: - kv  12:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  13:                          qwen2.block_count u32              = 64
llama_model_loader: - kv  14:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv  15:                     qwen2.embedding_length u32              = 5120
llama_model_loader: - kv  16:                  qwen2.feed_forward_length u32              = 27648
llama_model_loader: - kv  17:                 qwen2.attention.head_count u32              = 40
llama_model_loader: - kv  18:              qwen2.attention.head_count_kv u32              = 8
llama_model_loader: - kv  19:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  20:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  27:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  28:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  29:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  30:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  31:               general.quantization_version u32              = 2
llama_model_loader: - kv  32:                          general.file_type u32              = 30
llama_model_loader: - kv  33:                      quantize.imatrix.file str              = /models_out/QwQ-32B-GGUF/Qwen_QwQ-32B...
llama_model_loader: - kv  34:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  35:             quantize.imatrix.entries_count i32              = 448
llama_model_loader: - kv  36:              quantize.imatrix.chunks_count i32              = 128
llama_model_loader: - type  f32:  321 tensors
llama_model_loader: - type q5_K:   64 tensors
llama_model_loader: - type q6_K:    1 tensors
llama_model_loader: - type iq4_xs:  385 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = IQ4_XS - 4.25 bpw
print_info: file size   = 16.47 GiB (4.32 BPW) 
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 5120
print_info: n_layer          = 64
print_info: n_head           = 40
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 5
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 27648
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 32B
print_info: model params     = 32.76 B
print_info: general.name     = QwQ 32B
print_info: vocab type       = BPE
print_info: n_vocab          = 152064
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 64 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 65/65 layers to GPU
load_tensors:        CUDA0 model buffer size = 16473.35 MiB
load_tensors:   CPU_Mapped model buffer size =   394.45 MiB
...............................................................................................warning: failed to mlock 1058283520-byte buffer (after previously locking 0 bytes): Cannot allocate memory
Try increasing RLIMIT_MEMLOCK ('ulimit -l' as root).
.
llama_init_from_model: n_seq_max     = 10
llama_init_from_model: n_ctx         = 1024
llama_init_from_model: n_ctx_per_seq = 102
llama_init_from_model: n_batch       = 1024
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 1000000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (102) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 1024, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 64, can_shift = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   256.00 MiB
llama_init_from_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_init_from_model:  CUDA_Host  output buffer size =     5.80 MiB
llama_init_from_model:      CUDA0 compute buffer size =   307.00 MiB
llama_init_from_model:  CUDA_Host compute buffer size =    12.01 MiB
llama_init_from_model: graph nodes  = 2246
llama_init_from_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 1024
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 10
slot         init: id  0 | task -1 | new slot n_ctx_slot = 102
slot         init: id  1 | task -1 | new slot n_ctx_slot = 102
slot         init: id  2 | task -1 | new slot n_ctx_slot = 102
slot         init: id  3 | task -1 | new slot n_ctx_slot = 102
slot         init: id  4 | task -1 | new slot n_ctx_slot = 102
slot         init: id  5 | task -1 | new slot n_ctx_slot = 102
slot         init: id  6 | task -1 | new slot n_ctx_slot = 102
slot         init: id  7 | task -1 | new slot n_ctx_slot = 102
slot         init: id  8 | task -1 | new slot n_ctx_slot = 102
slot         init: id  9 | task -1 | new slot n_ctx_slot = 102
main: model loaded
main: chat template, chat_template: {%- if tools %}
    {{- '<|im_start|>system\n' }}
    {%- if messages[0]['role'] == 'system' %}
        {{- messages[0]['content'] }}
    {%- else %}
        {{- '' }}
    {%- endif %}
    {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
    {%- for tool in tools %}
        {{- "\n" }}
        {{- tool | tojson }}
    {%- endfor %}
    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
    {%- if messages[0]['role'] == 'system' %}
        {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
  {%- endif %}
{%- endif %}
{%- for message in messages %}
    {%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
        {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
    {%- elif message.role == "assistant" and not message.tool_calls %}
        {%- set content = message.content.split('</think>')[-1].lstrip('\n') %}
        {{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}
    {%- elif message.role == "assistant" %}
        {%- set content = message.content.split('</think>')[-1].lstrip('\n') %}
        {{- '<|im_start|>' + message.role }}
        {%- if message.content %}
            {{- '\n' + content }}
        {%- endif %}
        {%- for tool_call in message.tool_calls %}
            {%- if tool_call.function is defined %}
                {%- set tool_call = tool_call.function %}
            {%- endif %}
            {{- '\n<tool_call>\n{"name": "' }}
            {{- tool_call.name }}
            {{- '", "arguments": ' }}
            {{- tool_call.arguments | tojson }}
            {{- '}\n</tool_call>' }}
        {%- endfor %}
        {{- '<|im_end|>\n' }}
    {%- elif message.role == "tool" %}
        {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
            {{- '<|im_start|>user' }}
        {%- endif %}
        {{- '\n<tool_response>\n' }}
        {{- message.content }}
        {{- '\n</tool_response>' }}
        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
            {{- '<|im_end|>\n' }}
        {%- endif %}
    {%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n<think>\n' }}
{%- endif %}
, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: server is listening on http://127.0.0.1:40000 - starting the main loop
srv  update_slots: all slots are idle
srv  log_server_r: request: GET /v1/models 127.0.0.1 200
srv  log_server_r: request: GET /v1/models 127.0.0.1 200
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 102, n_keep = 0, n_prompt_tokens = 18
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 18, n_tokens = 18, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 18, n_tokens = 18
slot update_slots: id  0 | task 0 | slot context shift, n_keep = 0, n_left = 101, n_discard = 50
slot update_slots: id  0 | task 0 | slot context shift, n_keep = 0, n_left = 101, n_discard = 50
slot update_slots: id  0 | task 0 | slot context shift, n_keep = 0, n_left = 101, n_discard = 50
slot update_slots: id  0 | task 0 | slot context shift, n_keep = 0, n_left = 101, n_discard = 50
slot update_slots: id  0 | task 0 | slot context shift, n_keep = 0, n_left = 101, n_discard = 50
slot update_slots: id  0 | task 0 | slot context shift, n_keep = 0, n_left = 101, n_discard = 50
slot update_slots: id  0 | task 0 | slot context shift, n_keep = 0, n_left = 101, n_discard = 50
slot update_slots: id  0 | task 0 | slot context shift, n_keep = 0, n_left = 101, n_discard = 50
slot update_slots: id  0 | task 0 | slot context shift, n_keep = 0, n_left = 101, n_discard = 50
slot update_slots: id  0 | task 0 | slot context shift, n_keep = 0, n_left = 101, n_discard = 50
srv  cancel_tasks: cancel task, id_task = 0
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
slot      release: id  0 | task 0 | stop processing: n_past = 71, truncated = 1
srv  update_slots: all slots are idle
@ggerganov (Member)

Use --no-context-shift
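
For reference, a minimal sketch of how that would look with the command from the log above (model path and other flags borrowed from the original report; trim as needed):

# Same setup as in the report, but with context shifting disabled:
# instead of discarding old tokens, the server returns an error when a slot runs out of context.
./llama-server --model /models/llms/Qwen_QwQ-32B-IQ4_XS.gguf \
  --ctx-size 1024 --gpu-layers 99 --no-context-shift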

@remixer-dec (Author)

@ggerganov with this option it responds with {"error":{"code":500,"message":"context shift is disabled","type":"server_error"}} in all cases, except when I run it with -n -2 and remove max_tokens from the request; in that case it returns 1 token.
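
For clarity, this is the request variant I mean (a sketch: the same curl call as above, just without max_tokens, against a server started with -n -2):

curl --request POST \
  --url http://localhost:40000/v1/chat/completions \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "/models/llms/Qwen_QwQ-32B-IQ4_XS.gguf",
    "messages": [
      {"role": "user", "content": "Hello. How many Rs are in strawberry?"}
    ],
    "temperature": 0.6
  }'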

remixer-dec changed the title from "Eval bug: QWQ generates repititions when running with reduced context length argument" to "Eval bug: QWQ generates repeated text when running with reduced context length" on Mar 7, 2025
@aviallon (Contributor) commented Mar 7, 2025

@remixer-dec you can also try to reduce the batch-size: --ctx-size 1024 -b 512 would probably work.
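
Something along these lines (a sketch; model path and GPU offload taken from the log above, everything else left at defaults):

# Reduced logical batch size so it does not exceed the context size
./llama-server --model /models/llms/Qwen_QwQ-32B-IQ4_XS.gguf \
  --ctx-size 1024 -b 512 --gpu-layers 99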

@remixer-dec (Author)

@aviallon thanks for the suggestion. I tried it, but it didn't work.

[screenshot of the repeated output]

@remixer-dec (Author) commented Mar 7, 2025

Update: I think it is related to the --parallel argument. When I change it from 10 to 1, I don't see any repetitions, but people have reported this behavior with different settings.
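
Concretely, the only change versus the command in the log above is the --parallel value (sketch):

./llama-server --host 127.0.0.1 --port 40000 \
  --model /models/llms/Qwen_QwQ-32B-IQ4_XS.gguf \
  --ctx-size 1024 --gpu-layers 99 --parallel 1 --mlock \
  --samplers "top_k;top_p;min_p;temperature;dry;typ_p;xtc" \
  --temp 0.6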

@ggerganov (Member)

Increasing -np will reduce the per-slot context, so make sure to adjust the context size. With --ctx-size 1024 --parallel 10 you have only 100 tokens per slot which is very small for any meaningful task.

Also, it's recommended to use greedy sampling (e.g. --top-k 1) for these kinds of tasks. Other sampling settings will likely degrade the overall quality.
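
For example, a sketch that keeps 10 slots but scales the total context so each slot gets ~1024 tokens (the per-slot window is roughly --ctx-size divided by --parallel, as the log above shows; the exact numbers are illustrative and bounded by available VRAM):

# 10240 / 10 = 1024 tokens per slot; --top-k 1 gives greedy sampling
./llama-server --model /models/llms/Qwen_QwQ-32B-IQ4_XS.gguf \
  --ctx-size 10240 --parallel 10 --gpu-layers 99 --top-k 1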

@aviallon (Contributor) commented Mar 7, 2025

> Increasing -np will reduce the per-slot context, so make sure to adjust the context size. With --ctx-size 1024 --parallel 10 you have only 100 tokens per slot which is very small for any meaningful task.
>
> Also, it's recommended to use greedy sampling (e.g. --top-k 1) for these kinds of tasks. Other sampling settings will likely degrade the overall quality.

I've noticed that models tend to repeat themselves more if ctx-size-per-slot < batch-size.
Perhaps outputting an error in that case would make sense?
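
To illustrate with the numbers from this issue (a rough sketch; per-slot context is the total context divided by the number of slots):

# This report: --ctx-size 1024 --parallel 10 -> ~102 tokens per slot, n_batch = 1024 (per-slot ctx < batch size)
# A configuration where the inequality holds the other way:
./llama-server --model /models/llms/Qwen_QwQ-32B-IQ4_XS.gguf \
  --ctx-size 8192 --parallel 4 -b 512 --gpu-layers 99   # 8192 / 4 = 2048 >= 512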

@remixer-dec (Author) commented Mar 7, 2025

Is there any documentation about how slots are implemented and how they are meant to be used?

I feel like it is a very useful feature, but I am misunderstanding how it works. With SGLang it's pretty obvious: you hit the server with a bunch of requests that share similar tokens, and they are processed together, reusing the parts that are the same. With llama.cpp it's a bit more tricky, from my understanding. Does each slot have its own KV cache? Who decides which slot a request goes to? Do idle slots guarantee immediate responses when others are busy? Is that why idle slots affect the output quality of a single used slot?
