
_create_completion cache_item.input_ids.tolist() - AttributeError: 'NoneType' object has no attribute 'input_ids' #348

Closed
snxraven opened this issue Jun 8, 2023 · 13 comments
Labels: bug, server

Comments

@snxraven

snxraven commented Jun 8, 2023

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • [x] I carefully followed the README.md.
  • [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

The chat completion endpoint should process the chat request and respond accordingly.

Current Behavior

I see the following crash:

llama.cpp: loading model from ./models/ggml-vicuna-7b-1.1-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 6096
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size  = 3048.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
INFO:     Started server process [7]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://llama-python-server:8000 (Press CTRL+C to quit)
INFO:     172.18.0.3:54642 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/h11_impl.py", line 428, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 276, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 122, in __call__

Environment and Context

This is a docker-compose project; feel free to check how I have the installation configured:

https://git.ssh.surf/snxraven/llama-cpp-python-djs-bot

@gjmulder
Contributor

gjmulder commented Jun 8, 2023

Sadly, there's no obvious error there that points to where the issue lies within llama-cpp-python.

Any chance you can reproduce the issue with an identical curl against the server?
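
Something along these lines should exercise the same code path (host and port taken from the server log above; the message content is just a placeholder):

curl -X POST http://llama-python-server:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello, how are you?"}]}'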

gjmulder added the bug and server labels Jun 8, 2023
@snxraven
Author

snxraven commented Jun 8, 2023

It seems the actual error was cut out of the output I provided; please give me a moment to capture the full traceback.

@snxraven
Author

snxraven commented Jun 8, 2023

@gjmulder hopefully this output is more helpful

llama.cpp: loading model from ./models/ggml-vicuna-7b-1.1-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 6096
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size  = 3048.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
INFO:     Started server process [7]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://llama-python-server:8000 (Press CTRL+C to quit)
INFO:     172.18.0.3:55220 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/h11_impl.py", line 428, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 276, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 83, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
    raise e
  File "/usr/local/lib/python3.10/dist-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 66, in app
    response = await func(request)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 237, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 163, in run_endpoint_function
    return await dependant.call(**values)
  File "/usr/local/lib/python3.10/dist-packages/llama_cpp/server/app.py", line 436, in create_chat_completion
    completion: llama_cpp.ChatCompletion = await run_in_threadpool(
  File "/usr/local/lib/python3.10/dist-packages/starlette/concurrency.py", line 41, in run_in_threadpool
    return await anyio.to_thread.run_sync(func, *args)
  File "/usr/local/lib/python3.10/dist-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.10/dist-packages/llama_cpp/llama.py", line 1381, in create_chat_completion
    completion_or_chunks = self(
  File "/usr/local/lib/python3.10/dist-packages/llama_cpp/llama.py", line 1258, in __call__
    return self.create_completion(
  File "/usr/local/lib/python3.10/dist-packages/llama_cpp/llama.py", line 1210, in create_completion
    completion: Completion = next(completion_or_chunks)  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/llama_cpp/llama.py", line 814, in _create_completion
    cache_item.input_ids.tolist(), prompt_tokens
AttributeError: 'NoneType' object has no attribute 'input_ids'

https://pastebin.com/t01PyjVw

The error seems to reference: https://github.com/abetlen/llama-cpp-python/blob/main/llama_cpp/llama.py#LL814C18-L814C18
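
For context, the crash boils down to the cache lookup handing back None and the code then calling .input_ids on it. Below is a minimal sketch of the kind of guard that avoids the AttributeError, using a hypothetical helper for illustration only, not the fix that actually shipped:

from typing import List, Optional

def reusable_prefix_len(cached_ids: Optional[List[int]], prompt_tokens: List[int]) -> int:
    # Treat a missing or broken cache entry as "nothing to reuse" rather than
    # dereferencing None, which is what raises the AttributeError above.
    if cached_ids is None:
        return 0
    n = 0
    for a, b in zip(cached_ids, prompt_tokens):
        if a != b:
            break
        n += 1
    return n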

gjmulder changed the title from "AttributeError: 'NoneType' object has no attribute 'input_ids'" to "_create_completion cache_item.input_ids.tolist() - AttributeError: 'NoneType' object has no attribute 'input_ids'" Jun 8, 2023
@abetlen
Owner

abetlen commented Jun 8, 2023

Yup, looks like a bug with the new diskcache implementation, working on a fix.

@snxraven
Author

snxraven commented Jun 8, 2023

Thank you @abetlen Much respect :)

@abetlen
Owner

abetlen commented Jun 8, 2023

@snxraven after resolving the cache issue (LlamaCache was changed to an abstract base class; I changed this back to point to LlamaRAMCache to avoid breaking changes), it looks like there might be a bug with the llama.cpp save_state API. I'll need to look into it, but for now it seems that LlamaCache does not work. Note: the models still keep their eval'd tokens, so it's not so bad.
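
If anyone needs a stopgap in the meantime, the simplest thing is to just not attach a cache. A rough sketch via the Python API (model path and n_ctx taken from the log above; double-check the API names against your installed version):

import llama_cpp

llm = llama_cpp.Llama(
    model_path="./models/ggml-vicuna-7b-1.1-q4_0.bin",
    n_ctx=6096,
)

# llm.set_cache(llama_cpp.LlamaCache())  # skip attaching a cache while it is broken

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
)
print(out["choices"][0]["message"]["content"])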

@snxraven
Author

snxraven commented Jun 8, 2023

@abetlen Thanks for the heads-up, I'll watch this issue and the repo for updates. Good looking out

@AlphaAtlas

Is this the related issue?

ggml-org/llama.cpp#1699

I applied the fix at the bottom of the thread, and the cache seems to be working.

@abetlen
Owner

abetlen commented Jun 10, 2023

@AlphaAtlas yes that looks like the issue. Did you also remove this line?

@AlphaAtlas

Yeah, I removed all three of the "and False" lines and am now getting "Llama._create_completion: cache saved" in the log.

@AlphaAtlas

The cache issue is fixed upstream: ggml-org/llama.cpp#1699 (comment)

@abetlen
Owner

abetlen commented Jun 10, 2023

@AlphaAtlas @snxraven should be fixed now in v0.1.62
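
Upgrading the package should pick it up (standard pip invocation; adapt to your Docker build as needed):

pip install --upgrade llama-cpp-python
# or pin the release explicitly
pip install llama-cpp-python==0.1.62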

@snxraven
Author

@abetlen looks perfect :) All seems well; you can most likely close this.

abetlen closed this as completed Jun 11, 2023
xaptronic pushed a commit to xaptronic/llama-cpp-python that referenced this issue Jun 13, 2023