Added weight compression for Dolly 2.0 #1319
Changes from 12 commits
1afee40
1289174
d1206ea
54d1bc8
59f9fb4
fe62185
4c2eb57
4f97d86
1d730d9
66ab819
b2137d8
e4eec8b
9b8e4e4
a6187a0
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -17,6 +17,7 @@ | |
| "\n", | ||
Contributor: In my opinion, this comment looks strange to end users, since 2023.2 is not released yet. It would be better to say that in the 2023.1.0 release weight compression is supported only on CPU and that GPU support will be added later. It is recommended to disable weight compression for GPU.
Collaborator (Author): Done |
||
| "- Install prerequisites\n", | ||
| "- Download and convert the model from a public source using the [OpenVINO integration with Hugging Face Optimum](https://huggingface.co/blog/openvino).\n", | ||
| "- Compress model weights to INT8 with [OpenVINO NNCF](https://github.com/openvinotoolkit/nncf)\n", | ||
| "- Create an instruction-following inference pipeline\n", | ||
| "- Run instruction-following pipeline\n", | ||
| "\n", | ||
|
|
@@ -31,29 +32,31 @@ | |
| ] | ||
| }, | ||
| { | ||
| "attachments": {}, | ||
| "cell_type": "markdown", | ||
| "id": "f97c435a", | ||
| "metadata": {}, | ||
|
||
| "source": [ | ||
| "### Table of content:\n", | ||
| "- [Prerequisites](#Prerequisites-Uparrow)\n", | ||
| " - [Select inference device](#Select-inference-device-Uparrow)\n", | ||
| "- [Download and Convert Model](#Download-and-Convert-Model-Uparrow)\n", | ||
| "- [Create an instruction-following inference pipeline](#Create-an-instruction-following-inference-pipeline-Uparrow)\n", | ||
| " - [Setup imports](#Setup-imports-Uparrow)\n", | ||
| " - [Prepare template for user prompt](#Prepare-template-for-user-prompt-Uparrow)\n", | ||
| " - [Helpers for output parsing](#Helpers-for-output-parsing-Uparrow)\n", | ||
| " - [Main generation function](#Main-generation-function-Uparrow)\n", | ||
| " - [Helpers for application](#Helpers-for-application-Uparrow)\n", | ||
| "- [Run instruction-following pipeline](#Run-instruction-following-pipeline-Uparrow)" | ||
| "### Table of contents:\n", | ||
| "- [Prerequisites](#Prerequisites-$\\Uparrow$)\n", | ||
| " - [Select inference device](#Select-inference-device-$\\Uparrow$)\n", | ||
| "- [Download and Convert Model](#Download-and-Convert-Model-$\\Uparrow$)\n", | ||
| "- [NNCF model weights compression](#NNCF-model-weights-compression-$\\Uparrow$)\n", | ||
| "- [Create an instruction-following inference pipeline](#Create-an-instruction-following-inference-pipeline-$\\Uparrow$)\n", | ||
| " - [Setup imports](#Setup-imports-$\\Uparrow$)\n", | ||
| " - [Prepare template for user prompt](#Prepare-template-for-user-prompt-$\\Uparrow$)\n", | ||
| " - [Helpers for output parsing](#Helpers-for-output-parsing-$\\Uparrow$)\n", | ||
| " - [Main generation function](#Main-generation-function-$\\Uparrow$)\n", | ||
| " - [Helpers for application](#Helpers-for-application-$\\Uparrow$)\n", | ||
| "- [Run instruction-following pipeline](#Run-instruction-following-pipeline-$\\Uparrow$)" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "08aa16b1-d2f6-4a3a-abfb-5ec278133c80", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## Prerequisites [$\\Uparrow$](#Table-of-content:)\n", | ||
| "## Prerequisites [$\\Uparrow$](#Table-of-contents:)\n", | ||
| "\n", | ||
| "First, we should install the [Hugging Face Optimum](https://huggingface.co/docs/optimum/installation) library accelerated by OpenVINO integration.\n", | ||
| "The Hugging Face Optimum Intel API is a high-level API that enables us to convert and quantize models from the Hugging Face Transformers library to the OpenVINO™ IR format. For more details, refer to the [Hugging Face Optimum Intel documentation](https://huggingface.co/docs/optimum/intel/inference)." | ||
|
|
@@ -75,7 +78,7 @@ | |
| "id": "367f84f8-33e8-4ad6-bd40-e6fd41d2d703", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "### Select inference device [$\\Uparrow$](#Table-of-content:)\n", | ||
| "### Select inference device [$\\Uparrow$](#Table-of-contents:)\n", | ||
| "\n", | ||
| "Select a device from the dropdown list for running inference with OpenVINO." | ||
| ] | ||
|
|
@@ -89,12 +92,12 @@ | |
| { | ||
| "data": { | ||
| "application/vnd.jupyter.widget-view+json": { | ||
| "model_id": "5fe94d76fb364dd4ae8e6e39abe65cd7", | ||
| "model_id": "7f17a47330e74340a8a5b01d99d44652", | ||
| "version_major": 2, | ||
| "version_minor": 0 | ||
| }, | ||
| "text/plain": [ | ||
| "Dropdown(description='Device:', options=('CPU', 'GPU', 'AUTO'), value='CPU')" | ||
| "Dropdown(description='Device:', options=('CPU', 'AUTO'), value='CPU')" | ||
| ] | ||
| }, | ||
| "execution_count": 2, | ||
|
|
@@ -123,7 +126,7 @@ | |
| "id": "93fec698-344d-48aa-8899-6821bf3e16bf", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## Download and Convert Model [$\\Uparrow$](#Table-of-content:)\n", | ||
| "## Download and Convert Model [$\\Uparrow$](#Table-of-contents:)\n", | ||
| "\n", | ||
| "Optimum Intel can be used to load optimized models from the [Hugging Face Hub](https://huggingface.co/docs/optimum/intel/hf.co/models) and create pipelines to run an inference with OpenVINO Runtime using Hugging Face APIs. The Optimum Inference models are API compatible with Hugging Face Transformers models. This means we just need to replace `AutoModelForXxx` class with the corresponding `OVModelForXxx` class.\n", | ||
| "\n", | ||
|
|
@@ -192,12 +195,131 @@ | |
| " ov_model.save_pretrained(model_path)" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "5b1238c8-dcc9-4495-aeff-1ecbd8bd5082", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "### NNCF model weights compression [$\\Uparrow$](#Table-of-contents:)\n", | ||
| "\n", | ||
| "NNCF [Weights Compression algorithm](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/CompressWeights.md) compresses weights of a model to `INT8`. This is an alternative to [Quantization algorithm](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/post_training/Quantization.md) that compresses both weights and activations. Weight compression is effective in optimizing footprint and performance of large models where the size of weights is significantly larger than the size of activations, for example, in Large Language Models (LLMs) such as Dolly 2.0. Additionally, Weight Compression usually leads to almost no accuracy drop.\n", | ||
| ">Note: Starting from OpenVINO 2023.2 weight compression will also have an effect when run on a GPU." | ||
| ] | ||
| }, | ||
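To make the idea of weight-only compression concrete, here is a minimal, library-free sketch of symmetric per-row INT8 quantization. It is an illustration of the general technique only and is not NNCF's actual algorithm; the function names `compress_row_int8` and `decompress_row` and the toy weights are hypothetical.

```python
from typing import List, Tuple


def compress_row_int8(row: List[float]) -> Tuple[List[int], float]:
    """Map one row of float weights to integers in [-127, 127] plus a scale."""
    max_abs = max(abs(w) for w in row)
    # Avoid division by zero for an all-zero row.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in row]
    return quantized, scale


def decompress_row(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


# Toy weight matrix (hypothetical values, two rows of four weights).
weights = [[0.12, -0.5, 0.33, 0.07], [1.5, -0.02, 0.8, -1.1]]

compressed = [compress_row_int8(row) for row in weights]
restored = [decompress_row(q, s) for q, s in compressed]

# Largest absolute difference between original and restored weights.
max_error = max(
    abs(a - b)
    for orig_row, rest_row in zip(weights, restored)
    for a, b in zip(orig_row, rest_row)
)
print(f"max abs reconstruction error: {max_error:.5f}")

# Storage: one byte per INT8 weight versus two bytes per FP16 weight.
n_weights = sum(len(row) for row in weights)
print(f"FP16 bytes: {2 * n_weights}, INT8 bytes: {n_weights} (plus one scale per row)")
```

Activations are left untouched in this sketch, which is the point of weight-only compression: only the stored weights shrink, and the rounding error per weight is bounded by half the row's scale.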
| { | ||
| "cell_type": "code", | ||
| "execution_count": 4, | ||
| "id": "8e5c9e68-3772-432f-b231-f1163442357d", | ||
| "metadata": {}, | ||
| "outputs": [ | ||
| { | ||
| "data": { | ||
| "application/vnd.jupyter.widget-view+json": { | ||
| "model_id": "3a774ab747fa4f95b5bd76aaab8c0691", | ||
| "version_major": 2, | ||
| "version_minor": 0 | ||
| }, | ||
| "text/plain": [ | ||
| "Dropdown(description='Compression:', index=1, options=('Disable', 'Enable'), value='Enable')" | ||
| ] | ||
| }, | ||
| "execution_count": 4, | ||
| "metadata": {}, | ||
| "output_type": "execute_result" | ||
| } | ||
| ], | ||
| "source": [ | ||
| "to_compress = widgets.Dropdown(\n", | ||
| " options=['Disable', 'Enable'],\n", | ||
| " value='Disable',\n", | ||
| " description='Compression:',\n", | ||
| " disabled=False,\n", | ||
| ")\n", | ||
| "to_compress" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": 5, | ||
| "id": "392940e3-01da-4876-a9d1-2475ed3da882", | ||
| "metadata": {}, | ||
| "outputs": [ | ||
| { | ||
| "name": "stderr", | ||
| "output_type": "stream", | ||
| "text": [ | ||
| "Framework not specified. Using pt to export to ONNX.\n", | ||
| "Using framework PyTorch: 1.13.1+cpu\n", | ||
| "Overriding 1 configuration item(s)\n", | ||
| "\t- use_cache -> True\n", | ||
| "/home/nsavel/venvs/ov_notebooks/lib/python3.8/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py:594: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n", | ||
| " assert batch_size > 0, \"batch_size has to be defined and > 0\"\n", | ||
| "/home/nsavel/venvs/ov_notebooks/lib/python3.8/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py:314: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n", | ||
| " if seq_len > self.max_seq_len_cached:\n", | ||
| "/home/nsavel/venvs/ov_notebooks/lib/python3.8/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py:239: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n", | ||
| " if key_length > self.bias.shape[-1]:\n", | ||
| "/home/nsavel/venvs/ov_notebooks/lib/python3.8/site-packages/nncf/torch/dynamic_graph/wrappers.py:74: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.\n", | ||
| " op1 = operator(*args, **kwargs)\n", | ||
| "Compiling the model...\n", | ||
| "Set CACHE_DIR to /tmp/tmpelmw8467/model_cache\n" | ||
| ] | ||
| }, | ||
| { | ||
| "name": "stdout", | ||
| "output_type": "stream", | ||
| "text": [ | ||
| "* Original IR model size: 5297.21 MB\n", | ||
| "* Compressed IR model size: 2660.29 MB\n", | ||
| "* Model compression rate: 1.991\n" | ||
| ] | ||
| }, | ||
| { | ||
| "name": "stderr", | ||
| "output_type": "stream", | ||
| "text": [ | ||
| "Compiling the model...\n", | ||
| "Set CACHE_DIR to dolly-v2-3b_compressed/model_cache\n" | ||
| ] | ||
| } | ||
| ], | ||
| "source": [ | ||
| "import gc\n", | ||
| "from optimum.intel import OVQuantizer\n", | ||
| "from transformers import AutoModelForCausalLM\n", | ||
| "\n", | ||
| "compressed_model_path = Path(f'{model_path}_compressed')\n", | ||
| "\n", | ||
| "def calculate_compression_rate(model_path_ov, model_path_ov_compressed):\n", | ||
| " model_size_original = model_path_ov.with_suffix(\".bin\").stat().st_size / 2 ** 20\n", | ||
| " model_size_compressed = model_path_ov_compressed.with_suffix(\".bin\").stat().st_size / 2 ** 20\n", | ||
| " print(f\"* Original IR model size: {model_size_original:.2f} MB\")\n", | ||
| " print(f\"* Compressed IR model size: {model_size_compressed:.2f} MB\")\n", | ||
| " print(f\"* Model compression rate: {model_size_original / model_size_compressed:.3f}\")\n", | ||
| "\n", | ||
| "if to_compress.value == 'Enable':\n", | ||
| " if not compressed_model_path.exists():\n", | ||
| " # Weight compression can't yet be applied after FP16 was applied to FP32 OV model.\n", | ||
| " # Because of this we convert the original model to FP16 first.\n", | ||
| " model = AutoModelForCausalLM.from_pretrained(model_id)\n", | ||
| " model.half()\n", | ||
| " model.save_pretrained(compressed_model_path)\n", | ||
| " ov_model = OVModelForCausalLM.from_pretrained(compressed_model_path, device=current_device, export=True)\n", | ||
| " \n", | ||
| " quantizer = OVQuantizer.from_pretrained(ov_model)\n", | ||
| " quantizer.quantize(save_directory=compressed_model_path, weights_only=True)\n", | ||
| " del quantizer\n", | ||
| " gc.collect()\n", | ||
| " \n", | ||
| " calculate_compression_rate(model_path / 'openvino_model.xml', compressed_model_path / 'openvino_model.xml')\n", | ||
| " ov_model = OVModelForCausalLM.from_pretrained(compressed_model_path, device=current_device)" | ||
| ] | ||
| }, | ||
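As a quick check of the numbers printed in the cell output above, the compression rate is simply the ratio of the two `.bin` file sizes. The sketch below repeats that arithmetic on the reported sizes; the helper name `compression_rate` is hypothetical and is not part of the notebook.

```python
def compression_rate(original_mb: float, compressed_mb: float) -> float:
    """Return how many times smaller the compressed model is."""
    if compressed_mb <= 0:
        raise ValueError("compressed size must be positive")
    return original_mb / compressed_mb


# Sizes taken from the notebook output in this diff.
rate = compression_rate(5297.21, 2660.29)
print(f"* Model compression rate: {rate:.3f}")  # prints 1.991
```

A rate close to 2 is what one would expect if most weights went from 2 bytes each (FP16) to 1 byte each (INT8), with a small remainder for scales and uncompressed tensors.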
| { | ||
| "cell_type": "markdown", | ||
| "id": "b6d9c4a5-ef75-4076-9f1c-f45a2259ec46", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## Create an instruction-following inference pipeline [$\\Uparrow$](#Table-of-content:)\n", | ||
| "## Create an instruction-following inference pipeline [$\\Uparrow$](#Table-of-contents:)\n", | ||
| " \n", | ||
| " The `run_generation` function accepts user-provided text input, tokenizes it, and runs the generation process. Text generation is an iterative process in which each next token depends on the previously generated ones; it continues until the maximum number of tokens is reached or a stop condition is met. To obtain intermediate results without waiting for generation to finish, we will use [`TextIteratorStreamer`](https://huggingface.co/docs/transformers/main/en/internal/generation_utils#transformers.TextIteratorStreamer), provided as part of the Hugging Face [Streaming API](https://huggingface.co/docs/transformers/main/en/generation_strategies#streaming).\n", | ||
| " \n", | ||
|
|
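The streaming pattern described above (generation runs in a background thread while the caller iterates over partial results) can be sketched with the standard library alone. This is a simplified stand-in, not the `TextIteratorStreamer` implementation; `SimpleTextStreamer`, `fake_generate`, and the token list are hypothetical names and data used only for illustration.

```python
import queue
import threading
from typing import Iterator, List

_STOP = object()  # sentinel that marks the end of the stream


class SimpleTextStreamer:
    """Queue-backed iterator: a producer puts text chunks, a consumer iterates."""

    def __init__(self) -> None:
        self._queue: queue.Queue = queue.Queue()

    def put(self, text: str) -> None:
        self._queue.put(text)

    def end(self) -> None:
        self._queue.put(_STOP)

    def __iter__(self) -> Iterator[str]:
        while True:
            item = self._queue.get()  # blocks until the producer adds something
            if item is _STOP:
                return
            yield item


def fake_generate(tokens: List[str], streamer: SimpleTextStreamer,
                  max_new_tokens: int, stop_token: str) -> None:
    """Stand-in for model generation: emit tokens until a limit or stop token."""
    produced = 0
    for token in tokens:
        if produced >= max_new_tokens or token == stop_token:
            break
        streamer.put(token)
        produced += 1
    streamer.end()


streamer = SimpleTextStreamer()
tokens = ["Open", "VINO", " runs", " fast", "### End", " ignored"]
worker = threading.Thread(target=fake_generate, args=(tokens, streamer, 10, "### End"))
worker.start()

partial = ""
chunks = []
for chunk in streamer:  # partial text is available before generation finishes
    partial += chunk
    chunks.append(partial)
worker.join()
print(partial)
```

The consumer loop is where a UI would refresh its output field with each partial string.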
@@ -238,12 +360,12 @@ | |
| "id": "b9b5da4d-d2fd-440b-b204-7fbc6966dd1f", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "### Setup imports [$\\Uparrow$](#Table-of-content:)\n" | ||
| "### Setup imports [$\\Uparrow$](#Table-of-contents:)\n" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": 4, | ||
| "execution_count": 6, | ||
| "id": "6f976094-8603-42c4-8f18-a32ba6d7192e", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
|
|
@@ -261,14 +383,14 @@ | |
| "id": "c58611d6-0a91-4efd-976e-4221acbb43cd", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "### Prepare template for user prompt [$\\Uparrow$](#Table-of-content:)\n", | ||
| "### Prepare template for user prompt [$\\Uparrow$](#Table-of-contents:)\n", | ||
| "\n", | ||
| "For effective generation, the model expects input in a specific format. The code below prepares a template for passing the user instruction to the model, along with additional context." | ||
| ] | ||
| }, | ||
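The template cell itself is not shown in this diff, so the following is only a sketch of what such a prompt template might look like. Only the `### End` marker appears in the notebook text; the other keys and the introductory sentence are assumptions modeled on the instruction format published with Dolly, and `build_prompt` is a hypothetical helper.

```python
# Assumed section markers; only END_KEY is confirmed by the notebook text.
INSTRUCTION_KEY = "### Instruction:"
RESPONSE_KEY = "### Response:"
END_KEY = "### End"

# Assumed introductory sentence that frames the task for the model.
INTRO_BLURB = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)

# The doubled braces keep a literal {instruction} placeholder in the f-string.
PROMPT_TEMPLATE = (
    f"{INTRO_BLURB}\n\n{INSTRUCTION_KEY}\n{{instruction}}\n\n{RESPONSE_KEY}\n"
)


def build_prompt(instruction: str) -> str:
    """Wrap a user instruction in the template expected by the model."""
    return PROMPT_TEMPLATE.format(instruction=instruction.strip())


print(build_prompt("  Explain what OpenVINO is.  "))
```

The prompt ends with the response marker so that the model continues from there, and generation can stop once the end marker is produced.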
| { | ||
| "cell_type": "code", | ||
| "execution_count": 5, | ||
| "execution_count": 7, | ||
| "id": "52ac10a5-3141-4227-8f0b-0617acd027c8", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
|
|
@@ -301,14 +423,14 @@ | |
| "id": "27a01739-1363-42ef-927f-6a340bdbe7ba", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "### Helpers for output parsing [$\\Uparrow$](#Table-of-content:)\n", | ||
| "### Helpers for output parsing [$\\Uparrow$](#Table-of-contents:)\n", | ||
| "\n", | ||
| "The model was retrained to finish generation with the special token `### End`. The code below finds its id so that it can be used as the generation stop criterion." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": 6, | ||
| "execution_count": 8, | ||
| "id": "524e72f4-8750-48ff-b002-e558d03b3302", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
|
|
@@ -351,14 +473,14 @@ | |
| "id": "583202d2-6d29-4729-af2e-232d3ee0bc2c", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "### Main generation function [$\\Uparrow$](#Table-of-content:)\n", | ||
| "### Main generation function [$\\Uparrow$](#Table-of-contents:)\n", | ||
| "\n", | ||
| "As discussed above, the `run_generation` function is the entry point for starting generation. It takes the input instruction as a parameter and returns the model response." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": 7, | ||
| "execution_count": 9, | ||
| "id": "67fb4f9d-5877-48d8-8eff-c30ff6974d7a", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
|
|
@@ -420,14 +542,14 @@ | |
| "id": "562f2dcf-75ef-4554-85e3-e04f486776cc", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "### Helpers for application [$\\Uparrow$](#Table-of-content:)\n", | ||
| "### Helpers for application [$\\Uparrow$](#Table-of-contents:)\n", | ||
| "\n", | ||
| "To build an interactive user interface, we will use the Gradio library. The code below provides helper functions for communicating with UI elements." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": 8, | ||
| "execution_count": 10, | ||
| "id": "f114944f-c060-44ba-ba59-02cb2516554c", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
|
|
@@ -495,7 +617,7 @@ | |
| "id": "50d918a9-1cbe-49a5-85ad-5e370c8af7f5", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## Run instruction-following pipeline [$\\Uparrow$](#Table-of-content:)\n", | ||
| "## Run instruction-following pipeline [$\\Uparrow$](#Table-of-contents:)\n", | ||
| "\n", | ||
| "Now, we are ready to explore model capabilities. This demo provides a simple interface that allows communication with a model using text instruction. Type your instruction into the `User instruction` field or select one from predefined examples and click on the `Submit` button to start generation. Additionally, you can modify advanced generation parameters:\n", | ||
| "\n", | ||
|
|
@@ -508,7 +630,7 @@ | |
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": 9, | ||
| "execution_count": 11, | ||
| "id": "a00c2293-15b1-4734-b9b4-1abb524bb8d6", | ||
| "metadata": { | ||
| "tags": [] | ||
|
|
||