
chat request times out after 60s, and context length not detected automatically #3296

Closed
4 tasks done
amida47 opened this issue Jun 19, 2024 · 0 comments

Comments

amida47 commented Jun 19, 2024

Bug Report

Description

Bug Summary:
On Open WebUI, when I try a long-context prompt, the chat request times out after 1 minute.

Steps to Reproduce:
1. Create a GitHub Codespace with 4 cores.
2. Install Ollama via `curl -fsSL https://ollama.com/install.sh | sh`.
3. Run `ollama serve`.
4. Run `ollama run phi3:14b-medium-128k-instruct-q4_0`.
5. Run `docker run -d --network=host -v open-webui:/app/backend/data -e OLLAMA_BASE_URL=http://127.0.0.1:11434 --name open-webui --restart always ghcr.io/open-webui/open-webui:main`.
6. Go to Open WebUI.
7. Test the connection to Ollama by running a simple prompt.
8. Upload a long .tex document and ask it to summarize.

The request to /chat times out after 60 s (a direct check against Ollama is sketched below).
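For reference, a minimal way to check whether the 60 s cutoff comes from the Open WebUI layer rather than from Ollama itself is to stream from Ollama's `/api/chat` endpoint directly; the prompt and `num_ctx` value below are only illustrative:

```bash
# Stream from Ollama directly, bypassing Open WebUI, to see whether the
# response keeps streaming past the 60 s mark. Prompt and num_ctx are
# placeholders, not taken from the report above.
curl http://127.0.0.1:11434/api/chat -d '{
  "model": "phi3:14b-medium-128k-instruct-q4_0",
  "stream": true,
  "options": { "num_ctx": 8192 },
  "messages": [
    { "role": "user", "content": "Summarize the following document: ..." }
  ]
}'
```

If Ollama keeps streaming well past 60 s here, the cutoff is most likely happening in front of it.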

Expected Behavior:
Streaming starts as normal.

Actual Behavior:
The request to /chat times out after 60 s.
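If the cutoff does sit on the Open WebUI side, one possible workaround is to raise its HTTP client timeout. This sketch assumes the `ghcr.io/open-webui/open-webui:main` image in use honors the `AIOHTTP_CLIENT_TIMEOUT` environment variable (in seconds); remove any existing `open-webui` container first:

```bash
# Recreate the container with a longer client timeout (seconds).
# AIOHTTP_CLIENT_TIMEOUT is assumed to be read by this Open WebUI build;
# drop it if your version does not recognize the variable.
docker rm -f open-webui
docker run -d --network=host \
  -v open-webui:/app/backend/data \
  -e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
  -e AIOHTTP_CLIENT_TIMEOUT=300 \
  --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:main
```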

Environment

  • Open WebUI Version: latest as of 06/19/2024

  • Ollama (if applicable): ollama version is 0.1.44

  • Operating System: GitHub Codespaces (Linux, 4-core)

  • Browser (if applicable): [e.g., Chrome 100.0, Firefox 98.0]

Reproduction Details

Confirmation:

  • I have read and followed all the instructions provided in the README.md.
  • I am on the latest version of both Open WebUI and Ollama.
  • I have included the browser console logs.
  • I have included the Docker container logs.

Logs and Screenshots

output.mp4

Another example: it started streaming, but once the request time reached 60 s it failed.
image
The logs for this screenshot are below; you can see that the context length Ollama used is 2048, while the Phi-3 model has a 128k context, so Open WebUI did not pass the model's context length (a Modelfile workaround is sketched after the logs):

time=2024-06-19T13:53:42.769Z level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=1 memory.available="751.5 MiB" memory.required.full="8.1 GiB" memory.required.partial="642.4 MiB" memory.required.kv="400.0 MiB" memory.weights.total="7.3 GiB" memory.weights.repeating="7.1 GiB" memory.weights.nonrepeating="128.4 MiB" memory.graph.full="266.7 MiB" memory.graph.partial="266.7 MiB"
time=2024-06-19T13:53:42.770Z level=INFO source=server.go:341 msg="starting llama server" cmd="/tmp/ollama3165410411/runners/cpu_avx2/ollama_llama_server --model /home/codespace/.ollama/models/blobs/sha256-b62bc11c25b7e38174045d9ada511e94d81466656dad2bf90a805d027a04fb25 --ctx-size 2048 --batch-size 512 --embedding --log-disable --parallel 1 --port 35253"
time=2024-06-19T13:53:42.770Z level=INFO source=sched.go:338 msg="loaded runners" count=1
time=2024-06-19T13:53:42.770Z level=INFO source=server.go:529 msg="waiting for llama runner to start responding"
time=2024-06-19T13:53:42.770Z level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=1 commit="5921b8f" tid="129792223676288" timestamp=1718805222
INFO [main] system info | n_threads=2 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="129792223676288" timestamp=1718805222 total_threads=4
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="3" port="35253" tid="129792223676288" timestamp=1718805222
llama_model_loader: loaded meta data with 27 key-value pairs and 245 tensors from /home/codespace/.ollama/models/blobs/sha256-b62bc11c25b7e38174045d9ada511e94d81466656dad2bf90a805d027a04fb25 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi3
llama_model_loader: - kv   1:                               general.name str              = Phi3
llama_model_loader: - kv   2:                        phi3.context_length u32              = 131072
llama_model_loader: - kv   3:  phi3.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv   4:                      phi3.embedding_length u32              = 5120
llama_model_loader: - kv   5:                   phi3.feed_forward_length u32              = 17920
llama_model_loader: - kv   6:                           phi3.block_count u32              = 40
llama_model_loader: - kv   7:                  phi3.attention.head_count u32              = 40
llama_model_loader: - kv   8:               phi3.attention.head_count_kv u32              = 10
llama_model_loader: - kv   9:      phi3.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                  phi3.rope.dimension_count u32              = 128
llama_model_loader: - kv  11:                        phi3.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 2
llama_model_loader: - kv  13:              phi3.rope.scaling.attn_factor f32              = 1.190238
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,32064]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,32064]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,32064]   = [3, 3, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  21:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  22:            tokenizer.ggml.padding_token_id u32              = 32000
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  25:                    tokenizer.chat_template str              = {% for message in messages %}{% if (m...
llama_model_loader: - kv  26:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   83 tensors
llama_model_loader: - type q4_0:  161 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 323
llm_load_vocab: token to piece cache size = 0.3372 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phi3
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32064
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 10
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1280
llm_load_print_meta: n_embd_v_gqa     = 1280
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 17920
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 14B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 13.96 B
llm_load_print_meta: model size       = 7.35 GiB (4.53 BPW) 
llm_load_print_meta: general.name     = Phi3
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|endoftext|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 32000 '<|endoftext|>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOT token        = 32007 '<|end|>'
llm_load_tensors: ggml ctx size =    0.14 MiB
time=2024-06-19T13:53:43.221Z level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server not responding"
llm_load_tensors:        CPU buffer size =  7530.58 MiB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   400.00 MiB
llama_new_context_with_model: KV self size  =  400.00 MiB, K (f16):  200.00 MiB, V (f16):  200.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.14 MiB
llama_new_context_with_model:        CPU compute buffer size =   209.01 MiB
llama_new_context_with_model: graph nodes  = 1606
llama_new_context_with_model: graph splits = 1
time=2024-06-19T13:53:43.473Z level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server loading model"
INFO [main] model loaded | tid="129792223676288" timestamp=1718805224
time=2024-06-19T13:53:44.225Z level=INFO source=server.go:572 msg="llama runner started in 1.46 seconds"
[GIN] 2024/06/19 - 13:54:42 | 200 | 59.796477624s |       127.0.0.1 | POST     "/api/chat"
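Note the `--ctx-size 2048` in the `starting llama server` line above versus `phi3.context_length u32 = 131072` in the model metadata: the runner is started with Ollama's default context, not the model's 128k. One workaround that is independent of Open WebUI is to bake a larger `num_ctx` into a derived model via a Modelfile; the 32768 below is illustrative, and a very large context may not fit in a 4-core Codespace's RAM:

```bash
# Create a derived model whose runner starts with a larger --ctx-size.
# num_ctx=32768 is an example value; the KV cache grows with the context,
# so pick something that fits in available memory.
cat > Modelfile.phi3-longctx <<'EOF'
FROM phi3:14b-medium-128k-instruct-q4_0
PARAMETER num_ctx 32768
EOF
ollama create phi3-longctx -f Modelfile.phi3-longctx
ollama run phi3-longctx
```

Open WebUI would then need to be pointed at `phi3-longctx` instead of the original tag.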



## Installation Method

install ollama through `curl -fsSL https://ollama.com/install.sh | sh`
run `ollama serve`
run ` ollama run phi3:14b-medium-128k-instruct-q4_0`
run `docker run -d --network=host -v open-webui:/app/backend/data -e OLLAMA_BASE_URL=http://127.0.0.1:11434 --name open-webui --restart always ghcr.io/open-webui/open-webui:main`
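To compare what the model advertises with what the runner is actually started with, the metadata can also be read back over the API. This assumes the installed Ollama build returns a `model_info` map from `/api/show`; the `jq` filter is optional:

```bash
# Advertised training context of the model (should report 131072 for this tag).
curl -s http://127.0.0.1:11434/api/show \
  -d '{"name": "phi3:14b-medium-128k-instruct-q4_0"}' \
  | jq '.model_info["phi3.context_length"]'

# The context the runner was actually launched with shows up in the
# `ollama serve` output as: ollama_llama_server ... --ctx-size 2048 ...
```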
open-webui locked and limited conversation to collaborators on Jun 19, 2024
tjbck converted this issue into discussion #3298 on Jun 19, 2024

