**examples/main/README.md** (+2 −2)
```diff
@@ -276,9 +276,9 @@ These options help improve the performance and memory usage of the LLaMA models.
 - `--numa`: Attempt optimizations that help on some systems with non-uniform memory access. This currently consists of pinning an equal proportion of the threads to the cores on each NUMA node, and disabling prefetch and readahead for mmap. The latter causes mapped pages to be faulted in on first access instead of all at once, and in combination with pinning threads to NUMA nodes, more of the pages end up on the NUMA node where they are used. Note that if the model is already in the system page cache, for example because of a previous run without this option, this will have little effect unless you drop the page cache first. This can be done by rebooting the system or on Linux by writing '3' to '/proc/sys/vm/drop_caches' as root.
 
-### Memory Float 32
+### KV cache type
 
-- `--memory-f32`: Use 32-bit floats instead of 16-bit floats for memory key+value. This doubles the context memory requirement and cached prompt file size but does not appear to increase generation quality in a measurable way. Not recommended.
+- `-kvt, --kv-type`: The data type to use for the KV cache. Uses q8_0 by default. Alternatives are f16 and f32. The alternatives increase memory consumption for marginal quality differences.
```
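The page-cache caveat in the `--numa` description can be sketched as a shell session. This is a hedged sketch, not part of the diff: the binary name `./main`, the model path, and the prompt are placeholders, and the `tee` invocation is just one common way to write to `drop_caches` as root.

```shell
# Drop the Linux page cache so model pages mapped by a previous run
# don't mask the effect of --numa (requires root).
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches

# Re-run with NUMA optimizations; the KV cache defaults to q8_0,
# so no --kv-type flag is needed for the default behavior.
# (Binary name, model path, and prompt are illustrative.)
./main -m ./models/7B/ggml-model-q4_0.bin --numa -p "Hello"
```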
**examples/server/README.md** (+1 −1)
```diff
@@ -13,7 +13,7 @@ Command line options:
 - `-ts SPLIT, --tensor-split SPLIT`: When using multiple GPUs this option controls how large tensors should be split across all GPUs. `SPLIT` is a comma-separated list of non-negative values that assigns the proportion of data that each GPU should get in order. For example, "3,2" will assign 60% of the data to GPU 0 and 40% to GPU 1. By default the data is split in proportion to VRAM but this may not be optimal for performance. Requires cuBLAS.
 - `-lv, --low-vram`: Do not allocate a VRAM scratch buffer for holding temporary results. Reduces VRAM usage at the cost of performance, particularly prompt processing speed. Requires cuBLAS.
 - `-b N`, `--batch-size N`: Set the batch size for prompt processing. Default: `512`.
-- `--memory-f32`: Use 32-bit floats instead of 16-bit floats for memory key+value. Not recommended.
+- `-kvt, --kv-type`: The data type to use for the KV cache. Uses q8_0 by default. Alternatives are f16 and f32. The alternatives increase memory consumption for marginal quality differences.
 - `--mlock`: Lock the model in memory, preventing it from being swapped out when memory-mapped.
 - `--no-mmap`: Do not memory-map the model. By default, models are mapped into memory, which allows the system to load only the necessary parts of the model as needed.
 - `--numa`: Attempt optimizations that help on some NUMA systems.
```
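As a usage sketch for the server flag above (not part of the diff): the binary name `./server`, the model path, and the port are placeholders; `--kv-type f16` selects the larger f16 KV cache instead of the q8_0 default described in the option text.

```shell
# Start the HTTP server with an f16 KV cache instead of the q8_0 default.
# Per the option description, this trades extra memory for a marginal
# quality difference. (Binary name, model path, and port are illustrative.)
./server -m ./models/7B/ggml-model-q4_0.bin \
         --kv-type f16 \
         --port 8080
```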