I used llama-cpp-python with LangChain and got an error when I tried to run the example code from the LangChain docs.
I installed llama-cpp-python with:
CMAKE_ARGS="-DLLAMA_CUBLAS=on -DCMAKE_CUDA_FLAGS='-DGGML_CUDA_FORCE_CUSTOM_MEMORY_POOL'" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
I also upgraded LangChain to 0.0.330.
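As a sanity check, llama-cpp-python can also be exercised directly, without LangChain. Below is a minimal sketch, assuming the same model path and settings as the example further down; with verbose=True it prints the same ggml_init_cublas / llm_load_tensors lines shown in the log later in this report:

from llama_cpp import Llama

# Load the GGUF model directly through llama-cpp-python.
llm = Llama(
    model_path="/home/xxxx/llama-2-7b-chat.Q4_K_M.gguf",
    n_gpu_layers=32,
    n_batch=4,
    verbose=True,
)

# Run a short completion; the result is a dict in OpenAI-style completion format.
out = llm("Q: Name the planets in the solar system. A:", max_tokens=32)
print(out["choices"][0]["text"])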
Then I ran the following example code from the LangChain docs:
from langchain.llms import LlamaCpp
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
template = """Question: {question}
Answer: Let's work this out in a step by step way to be sure we have the right answer."""
prompt = PromptTemplate(template=template, input_variables=["question"])
# Callbacks support token-wise streaming
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
n_gpu_layers = 32 # Change this value based on your model and your GPU VRAM pool.
n_batch = 4 # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.
# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="/home/xxxx/llama-2-7b-chat.Q4_K_M.gguf",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    callback_manager=callback_manager,
    verbose=True,  # Verbose is required to pass to the callback manager
)
llm_chain = LLMChain(prompt=prompt, llm=llm)
question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"
llm_chain.run(question)
Here is the output:
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6
  Device 1: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1
...
llm_load_tensors: ggml ctx size = 0.11 MB
llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3060) as main device
llm_load_tensors: mem required = 172.97 MB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloaded 32/35 layers to GPU
llm_load_tensors: VRAM used: 3718.38 MB
..................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 256.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 7.18 MB
llama_new_context_with_model: VRAM scratch buffer: 0.55 MB
llama_new_context_with_model: total VRAM used: 3718.93 MB (model: 3718.38 MB, context: 0.55 MB)
CUDA error 1 at /tmp/pip-install-2o911nrr/llama-cpp-python_7b2f2508c89b451280d9116461f3c9cf/vendor/llama.cpp/ggml-cuda.cu:7036: invalid argument
current device: 1
I have two different cards that work fine with llama.cpp compiled on its own, but I hit this error when using llama-cpp-python. :(
The same issue has already been fixed in llama.cpp, but I don't know how quickly that fix propagates to llama-cpp-python.
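Until a llama-cpp-python release picks up that fix, one possible workaround (a sketch I have not verified, and it gives up multi-GPU entirely) is to hide the second card with CUDA_VISIBLE_DEVICES so that ggml only sees device 0:

import os

# Must be set before the CUDA runtime is initialized, i.e. before the model is loaded.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # keep only the RTX 3060 visible

from langchain.llms import LlamaCpp

llm = LlamaCpp(
    model_path="/home/xxxx/llama-2-7b-chat.Q4_K_M.gguf",
    n_gpu_layers=32,
    n_batch=4,
    verbose=True,
)

Setting CUDA_VISIBLE_DEVICES=0 in the shell before launching Python has the same effect.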
Physical (or virtual) hardware you are using, e.g. for Linux:
Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
Device 0: NVIDIA GeForce RTX 3060
Device 1: NVIDIA GeForce GTX 1080 Ti
Operating System, e.g. for Linux:
Linux localhost.localdomain 3.10.0-1160.90.1.el7.x86_64
SDK version, e.g. for Linux:
Python 3.9.16
GNU Make 4.2.1
g++ (GCC) 11.2.0