Added weight compression for Dolly 2.0 #1319
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -38,16 +38,17 @@ | |
| "metadata": {}, | ||
| "source": [ | ||
| "### Table of content:\n", | ||
| "- [Prerequisites](#Prerequisites-Uparrow)\n", | ||
| " - [Select inference device](#Select-inference-device-Uparrow)\n", | ||
| "- [Download and Convert Model](#Download-and-Convert-Model-Uparrow)\n", | ||
| "- [Create an instruction-following inference pipeline](#Create-an-instruction-following-inference-pipeline-Uparrow)\n", | ||
| " - [Setup imports](#Setup-imports-Uparrow)\n", | ||
| " - [Prepare template for user prompt](#Prepare-template-for-user-prompt-Uparrow)\n", | ||
| " - [Helpers for output parsing](#Helpers-for-output-parsing-Uparrow)\n", | ||
| " - [Main generation function](#Main-generation-function-Uparrow)\n", | ||
| " - [Helpers for application](#Helpers-for-application-Uparrow)\n", | ||
| "- [Run instruction-following pipeline](#Run-instruction-following-pipeline-Uparrow)" | ||
| "- [Prerequisites](#Prerequisites-$\\Uparrow$)\n", | ||
| " - [Select inference device](#Select-inference-device-$\\Uparrow$)\n", | ||
| "- [Download and Convert Model](#Download-and-Convert-Model-$\\Uparrow$)\n", | ||
| "- [NNCF model weights compression](#NNCF-model-weights-compression-$\\Uparrow$)\n", | ||
| "- [Create an instruction-following inference pipeline](#Create-an-instruction-following-inference-pipeline-$\\Uparrow$)\n", | ||
| " - [Setup imports](#Setup-imports-$\\Uparrow$)\n", | ||
| " - [Prepare template for user prompt](#Prepare-template-for-user-prompt-$\\Uparrow$)\n", | ||
| " - [Helpers for output parsing](#Helpers-for-output-parsing-$\\Uparrow$)\n", | ||
| " - [Main generation function](#Main-generation-function-$\\Uparrow$)\n", | ||
| " - [Helpers for application](#Helpers-for-application-$\\Uparrow$)\n", | ||
| "- [Run instruction-following pipeline](#Run-instruction-following-pipeline-$\\Uparrow$)" | ||
| ] | ||
| }, | ||
| { | ||
|
|
@@ -63,26 +64,15 @@ | |
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": 1, | ||
| "execution_count": 2, | ||
| "id": "4421fc85-bed6-4a62-b8fa-19c7ba474891", | ||
| "metadata": {}, | ||
| "outputs": [ | ||
| { | ||
| "name": "stdout", | ||
| "output_type": "stream", | ||
| "text": [ | ||
| "\n", | ||
| "\u001B[1m[\u001B[0m\u001B[34;49mnotice\u001B[0m\u001B[1;39;49m]\u001B[0m\u001B[39;49m A new release of pip is available: \u001B[0m\u001B[31;49m23.1.2\u001B[0m\u001B[39;49m -> \u001B[0m\u001B[32;49m23.2\u001B[0m\n", | ||
| "\u001B[1m[\u001B[0m\u001B[34;49mnotice\u001B[0m\u001B[1;39;49m]\u001B[0m\u001B[39;49m To update, run: \u001B[0m\u001B[32;49mpip install --upgrade pip\u001B[0m\n", | ||
| "\n", | ||
| "\u001B[1m[\u001B[0m\u001B[34;49mnotice\u001B[0m\u001B[1;39;49m]\u001B[0m\u001B[39;49m A new release of pip is available: \u001B[0m\u001B[31;49m23.1.2\u001B[0m\u001B[39;49m -> \u001B[0m\u001B[32;49m23.2\u001B[0m\n", | ||
| "\u001B[1m[\u001B[0m\u001B[34;49mnotice\u001B[0m\u001B[1;39;49m]\u001B[0m\u001B[39;49m To update, run: \u001B[0m\u001B[32;49mpip install --upgrade pip\u001B[0m\n" | ||
| ] | ||
| } | ||
| ], | ||
| "outputs": [], | ||
| "source": [ | ||
| "!pip install -q \"diffusers>=0.16.1\" \"transformers>=4.28.0\"\n", | ||
| "!pip install -q \"git+https://github.com/huggingface/optimum-intel.git\" datasets onnx onnxruntime gradio" | ||
| "!pip install -q \"git+https://github.com/huggingface/optimum-intel.git\" datasets onnx onnxruntime gradio\n", | ||
| "!pip install -q \"git+https://github.com/openvinotoolkit/nncf.git@release_v260\"\n", | ||
| "!pip install -q \"openvino==2023.1.0.dev20230811\" \"openvino_dev==2023.1.0.dev20230811\"" | ||
| ] | ||
| }, | ||
| { | ||
|
|
@@ -97,22 +87,22 @@ | |
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": 2, | ||
| "execution_count": 3, | ||
| "id": "6ddd57de-9f41-403c-bccc-8d3118654a24", | ||
| "metadata": {}, | ||
| "outputs": [ | ||
| { | ||
| "data": { | ||
| "application/vnd.jupyter.widget-view+json": { | ||
| "model_id": "5bc9f8fc615a4cf7af5cb987afd0211d", | ||
| "model_id": "c940eca7b64742dbae2fcaf98667af98", | ||
| "version_major": 2, | ||
| "version_minor": 0 | ||
| }, | ||
| "text/plain": [ | ||
| "Dropdown(description='Device:', index=2, options=('CPU', 'GPU', 'AUTO'), value='AUTO')" | ||
| "Dropdown(description='Device:', options=('CPU', 'GPU.0', 'GPU.1', 'GPU.2', 'AUTO'), value=' AUTO')" | ||
| ] | ||
| }, | ||
| "execution_count": 2, | ||
| "execution_count": 3, | ||
| "metadata": {}, | ||
| "output_type": "execute_result" | ||
| } | ||
|
|
@@ -160,20 +150,10 @@ | |
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": 3, | ||
| "execution_count": 4, | ||
| "id": "91f42296-627d-44ff-a1cb-936bb6f87992", | ||
| "metadata": {}, | ||
| "outputs": [ | ||
| { | ||
| "name": "stderr", | ||
| "output_type": "stream", | ||
| "text": [ | ||
| "2023-07-17 14:47:00.308996: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.\n", | ||
| "2023-07-17 14:47:00.348466: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n", | ||
| "To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n", | ||
| "2023-07-17 14:47:01.039895: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n" | ||
| ] | ||
| }, | ||
| { | ||
| "name": "stdout", | ||
| "output_type": "stream", | ||
|
|
@@ -185,18 +165,25 @@ | |
| "name": "stderr", | ||
| "output_type": "stream", | ||
| "text": [ | ||
| "No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'\n", | ||
| "comet_ml is installed but `COMET_API_KEY` is not set.\n", | ||
| "No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda-11.7'\n", | ||
| "2023-09-14 15:39:32.055450: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.\n", | ||
| "2023-09-14 15:39:32.089487: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n", | ||
| "To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n", | ||
| "2023-09-14 15:39:32.706748: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n", | ||
| "/home/nsavel/venvs/ov_notebooks_tmp/lib/python3.8/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations\n", | ||
| " warnings.warn(\n", | ||
| "The argument `from_transformers` is deprecated, and will be removed in optimum 2.0. Use `export` instead\n", | ||
| "Framework not specified. Using pt to export to ONNX.\n", | ||
| "Using framework PyTorch: 1.13.1+cpu\n", | ||
| "Overriding 1 configuration item(s)\n", | ||
| "\t- use_cache -> True\n", | ||
| "/home/ea/work/notebooks_convert/notebooks_conv_env/lib/python3.8/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py:504: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n", | ||
| "/home/nsavel/venvs/ov_notebooks_tmp/lib/python3.8/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py:594: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n", | ||
| " assert batch_size > 0, \"batch_size has to be defined and > 0\"\n", | ||
| "/home/ea/work/notebooks_convert/notebooks_conv_env/lib/python3.8/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py:270: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n", | ||
| "/home/nsavel/venvs/ov_notebooks_tmp/lib/python3.8/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py:314: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n", | ||
| " if seq_len > self.max_seq_len_cached:\n", | ||
| "/home/ea/work/notebooks_convert/notebooks_conv_env/lib/python3.8/site-packages/nncf/torch/dynamic_graph/wrappers.py:74: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.\n", | ||
| "/home/nsavel/venvs/ov_notebooks_tmp/lib/python3.8/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py:239: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n", | ||
| " if key_length > self.bias.shape[-1]:\n", | ||
| "/home/nsavel/venvs/ov_notebooks_tmp/lib/python3.8/site-packages/nncf/torch/dynamic_graph/wrappers.py:74: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.\n", | ||
| " op1 = operator(*args, **kwargs)\n", | ||
| "In-place op on output of tensor.shape. See https://pytorch.org/docs/master/onnx.html#avoid-inplace-operations-when-using-tensor-shape-in-tracing-mode\n", | ||
| "In-place op on output of tensor.shape. See https://pytorch.org/docs/master/onnx.html#avoid-inplace-operations-when-using-tensor-shape-in-tracing-mode\n", | ||
|
|
@@ -232,7 +219,7 @@ | |
| "In-place op on output of tensor.shape. See https://pytorch.org/docs/master/onnx.html#avoid-inplace-operations-when-using-tensor-shape-in-tracing-mode\n", | ||
| "Saving external data to one file...\n", | ||
| "Compiling the model...\n", | ||
| "Set CACHE_DIR to /tmp/tmpndw8_20n/model_cache\n" | ||
| "Set CACHE_DIR to /tmp/tmp3vew161f/model_cache\n" | ||
| ] | ||
| } | ||
| ], | ||
|
|
@@ -255,6 +242,101 @@ | |
| " ov_model.save_pretrained(model_path)" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "5b1238c8-dcc9-4495-aeff-1ecbd8bd5082", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "### NNCF model weights compression [$\\Uparrow$](#Table-of-content:)\n", | ||
| "\n", | ||
| "NNCF [Weights Compression algorithm](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/CompressWeights.md) compresses weights of a model to `INT8`. This is an alternative to [Quantization algorithm](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/post_training/Quantization.md) that compresses both weights and activations. Weight compression is effective in optimizing footprint and performance of large models where the size of weights is significantly larger than the size of activations, for example, in Large Language Models (LLMs) such as Dolly 2.0. Additionaly, Weight Compression usually leads to almost no accuracy drop." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": 5, | ||
| "id": "8e5c9e68-3772-432f-b231-f1163442357d", | ||
| "metadata": {}, | ||
| "outputs": [ | ||
| { | ||
| "data": { | ||
| "application/vnd.jupyter.widget-view+json": { | ||
| "model_id": "6be9ab974c06454e81077fe735e7cb37", | ||
| "version_major": 2, | ||
| "version_minor": 0 | ||
| }, | ||
| "text/plain": [ | ||
| "Dropdown(description='Compression:', index=1, options=('Disable', 'Enable'), value='Enable')" | ||
| ] | ||
| }, | ||
| "execution_count": 5, | ||
| "metadata": {}, | ||
| "output_type": "execute_result" | ||
| } | ||
| ], | ||
| "source": [ | ||
| "to_compress = widgets.Dropdown(\n", | ||
| " options=['Disable', 'Enable'],\n", | ||
| " value='Enable',\n", | ||
| " description='Compression:',\n", | ||
| " disabled=False,\n", | ||
| ")\n", | ||
| "to_compress" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": 6, | ||
| "id": "392940e3-01da-4876-a9d1-2475ed3da882", | ||
| "metadata": {}, | ||
| "outputs": [ | ||
| { | ||
| "name": "stdout", | ||
| "output_type": "stream", | ||
| "text": [ | ||
| "* Original IR model size: 10590.42 MB\n", | ||
| "* Compressed IR model size: 2660.28 MB\n", | ||
| "* Model compression rate: 3.981\n" | ||
| ] | ||
| }, | ||
| { | ||
| "name": "stderr", | ||
| "output_type": "stream", | ||
| "text": [ | ||
| "Compiling the model...\n", | ||
| "Set CACHE_DIR to dolly-v2-3b_compressed/model_cache\n" | ||
| ] | ||
| } | ||
| ], | ||
| "source": [ | ||
| "import nncf\n", | ||
| "import shutil\n", | ||
| "import openvino.runtime as ov\n", | ||
| "\n", | ||
| "compressed_model_path = Path(f'{model_path}_compressed') / 'openvino_model.xml'\n", | ||
| "\n", | ||
| "def compress_model(model):\n", | ||
|
| " if not compressed_model_path.exists():\n", | ||
| " if not compressed_model_path.parent.exists():\n", | ||
| " compressed_model_path.parent.mkdir()\n", | ||
| " compressed_model = nncf.compress_weights(model)\n", | ||
| " ov.serialize(compressed_model, compressed_model_path)\n", | ||
| " shutil.copy(model_path / 'config.json', compressed_model_path.parent / 'config.json') # Copy config.json manually\n", | ||
| " del compressed_model\n", | ||
| "\n", | ||
| "def calculate_compression_rate(model_path_ov, model_path_ov_compressed):\n", | ||
| " model_size_original = model_path_ov.with_suffix(\".bin\").stat().st_size / 2 ** 20\n", | ||
| " model_size_compressed = model_path_ov_compressed.with_suffix(\".bin\").stat().st_size / 2 ** 20\n", | ||
| " print(f\"* Original IR model size: {model_size_original:.2f} MB\")\n", | ||
| " print(f\"* Compressed IR model size: {model_size_compressed:.2f} MB\")\n", | ||
| " print(f\"* Model compression rate: {model_size_original / model_size_compressed:.3f}\")\n", | ||
| "\n", | ||
| "if to_compress.value == 'Enable':\n", | ||
| " compress_model(ov_model.model)\n", | ||
| " calculate_compression_rate(model_path / 'openvino_model.xml', compressed_model_path)\n", | ||
| " ov_model = OVModelForCausalLM.from_pretrained(compressed_model_path.parent, device=current_device)" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "b6d9c4a5-ef75-4076-9f1c-f45a2259ec46", | ||
|
|
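The new cell above wires weight compression into the notebook's `OVModelForCausalLM` flow. For reference, a minimal standalone sketch of the same idea is shown below; the IR paths are hypothetical, and the calls assume the versions pinned earlier in this diff (NNCF `release_v260` and OpenVINO 2023.1, where `ov.serialize` is still the saving entry point).

```python
from pathlib import Path

import nncf
import openvino.runtime as ov

# Hypothetical paths: an FP16/FP32 IR exported earlier and a target for the compressed copy.
ir_path = Path("dolly-v2-3b/openvino_model.xml")
compressed_ir_path = Path("dolly-v2-3b_compressed/openvino_model.xml")

core = ov.Core()
model = core.read_model(ir_path)                 # load the original IR

compressed_model = nncf.compress_weights(model)  # INT8 weight-only compression, no calibration data needed
compressed_ir_path.parent.mkdir(parents=True, exist_ok=True)
ov.serialize(compressed_model, str(compressed_ir_path))

# Weights live in the .bin file, so compare those footprints on disk.
orig_mb = ir_path.with_suffix(".bin").stat().st_size / 2**20
comp_mb = compressed_ir_path.with_suffix(".bin").stat().st_size / 2**20
print(f"{orig_mb:.2f} MB -> {comp_mb:.2f} MB ({orig_mb / comp_mb:.2f}x smaller)")
```

As in the cell above, the compressed IR can then be reloaded with `OVModelForCausalLM.from_pretrained(...)` once `config.json` has been copied next to it.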
@@ -306,7 +388,7 @@ | |
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": 4, | ||
| "execution_count": 7, | ||
| "id": "6f976094-8603-42c4-8f18-a32ba6d7192e", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
|
|
@@ -331,7 +413,7 @@ | |
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": 5, | ||
| "execution_count": 8, | ||
| "id": "52ac10a5-3141-4227-8f0b-0617acd027c8", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
|
|
@@ -371,7 +453,7 @@ | |
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": 6, | ||
| "execution_count": 9, | ||
| "id": "524e72f4-8750-48ff-b002-e558d03b3302", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
|
|
@@ -421,7 +503,7 @@ | |
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": 7, | ||
| "execution_count": 10, | ||
| "id": "67fb4f9d-5877-48d8-8eff-c30ff6974d7a", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
|
|
@@ -490,7 +572,7 @@ | |
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": 8, | ||
| "execution_count": 11, | ||
| "id": "f114944f-c060-44ba-ba59-02cb2516554c", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
|
|
@@ -571,7 +653,7 @@ | |
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": 9, | ||
| "execution_count": 12, | ||
| "id": "a00c2293-15b1-4734-b9b4-1abb524bb8d6", | ||
| "metadata": { | ||
| "tags": [] | ||
|
|
@@ -581,7 +663,7 @@ | |
| "name": "stderr", | ||
| "output_type": "stream", | ||
| "text": [ | ||
| "/tmp/ipykernel_1272681/896135151.py:57: GradioDeprecationWarning: The `enable_queue` parameter has been deprecated. Please use the `.queue()` method instead.\n", | ||
| "/tmp/ipykernel_3967369/3994661578.py:57: GradioDeprecationWarning: The `enable_queue` parameter has been deprecated. Please use the `.queue()` method instead.\n", | ||
| " demo.launch(enable_queue=True, share=False, height=800)\n" | ||
| ] | ||
| }, | ||
|
|
@@ -734,4 +816,4 @@ | |
| }, | ||
| "nbformat": 4, | ||
| "nbformat_minor": 5 | ||
| } | ||
| } | ||