Merged
@@ -18,6 +18,7 @@
"\n",
@eaidova (Contributor) commented on Sep 25, 2023:
The comment looks strange for end users, in my opinion, as 2023.2 is not released yet. It would be better to say that in the 2023.1.0 release weights compression is supported only on CPU and that GPU support will be added later. It is recommended to disable weights compression for GPU.



Collaborator (Author) replied:
Done
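
A minimal sketch of the behaviour the reviewer suggests, assuming the `device` dropdown defined in the "Select inference device" cell of this notebook: the compression default simply flips to 'Disable' when a GPU is selected, since the 2023.1.0 release supports weight compression on CPU only. This is an illustration, not code from the PR.

```python
import ipywidgets as widgets

# Assumption: `device` is the ipywidgets Dropdown created in the
# "Select inference device" cell of this notebook.
default_compression = 'Disable' if device.value.startswith('GPU') else 'Enable'

to_compress = widgets.Dropdown(
    options=['Disable', 'Enable'],
    value=default_compression,
    description='Compression:',
    disabled=False,
)
to_compress
```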

"- Install prerequisites\n",
"- Download and convert the model from a public source using the [OpenVINO integration with Hugging Face Optimum](https://huggingface.co/blog/openvino).\n",
"- Compress model weights to INT8 with [OpenVINO NNCF](https://github.com/openvinotoolkit/nncf)\n",
"- Create an instruction-following inference pipeline\n",
"- Run instruction-following pipeline\n",
"\n",
@@ -38,16 +39,17 @@
"metadata": {},
"source": [
"### Table of content:\n",
"- [Prerequisites](#Prerequisites-Uparrow)\n",
" - [Select inference device](#Select-inference-device-Uparrow)\n",
"- [Download and Convert Model](#Download-and-Convert-Model-Uparrow)\n",
"- [Create an instruction-following inference pipeline](#Create-an-instruction-following-inference-pipeline-Uparrow)\n",
" - [Setup imports](#Setup-imports-Uparrow)\n",
" - [Prepare template for user prompt](#Prepare-template-for-user-prompt-Uparrow)\n",
" - [Helpers for output parsing](#Helpers-for-output-parsing-Uparrow)\n",
" - [Main generation function](#Main-generation-function-Uparrow)\n",
" - [Helpers for application](#Helpers-for-application-Uparrow)\n",
"- [Run instruction-following pipeline](#Run-instruction-following-pipeline-Uparrow)"
"- [Prerequisites](#Prerequisites-$\\Uparrow$)\n",
" - [Select inference device](#Select-inference-device-$\\Uparrow$)\n",
"- [Download and Convert Model](#Download-and-Convert-Model-$\\Uparrow$)\n",
"- [NNCF model weights compression](#NNCF-model-weights-compression-$\\Uparrow$)\n",
"- [Create an instruction-following inference pipeline](#Create-an-instruction-following-inference-pipeline-$\\Uparrow$)\n",
" - [Setup imports](#Setup-imports-$\\Uparrow$)\n",
" - [Prepare template for user prompt](#Prepare-template-for-user-prompt-$\\Uparrow$)\n",
" - [Helpers for output parsing](#Helpers-for-output-parsing-$\\Uparrow$)\n",
" - [Main generation function](#Main-generation-function-$\\Uparrow$)\n",
" - [Helpers for application](#Helpers-for-application-$\\Uparrow$)\n",
"- [Run instruction-following pipeline](#Run-instruction-following-pipeline-$\\Uparrow$)"
]
},
{
@@ -63,26 +65,15 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": 2,
"id": "4421fc85-bed6-4a62-b8fa-19c7ba474891",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.1.2\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.2\u001b[0m\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n",
"\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.1.2\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.2\u001b[0m\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n"
]
}
],
"outputs": [],
"source": [
"%pip install -q \"diffusers>=0.16.1\" \"transformers>=4.28.0\"\n",
"%pip install -q \"git+https://github.com/huggingface/optimum-intel.git\" datasets onnx onnxruntime gradio"
"%pip install -q \"git+https://github.com/huggingface/optimum-intel.git\" datasets onnx onnxruntime gradio\n",
"%pip install -q \"git+https://github.com/openvinotoolkit/nncf.git@release_v260\"\n",
"%pip install -q \"openvino==2023.1.0.dev20230811\" \"openvino_dev==2023.1.0.dev20230811\""
]
},
{
@@ -97,22 +88,22 @@
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 5,
"id": "6ddd57de-9f41-403c-bccc-8d3118654a24",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "5bc9f8fc615a4cf7af5cb987afd0211d",
"model_id": "18b43cd3ea0f4d30b0973918023f3b12",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Dropdown(description='Device:', index=2, options=('CPU', 'GPU', 'AUTO'), value='AUTO')"
"Dropdown(description='Device:', options=('CPU', 'GPU.0', 'GPU.1', 'GPU.2', 'AUTO'), value='CPU')"
]
},
"execution_count": 2,
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
@@ -125,7 +116,7 @@
"\n",
"device = widgets.Dropdown(\n",
" options=core.available_devices + [\"AUTO\"],\n",
" value='AUTO',\n",
" value='CPU',\n",
" description='Device:',\n",
" disabled=False,\n",
")\n",
@@ -160,44 +151,19 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 6,
"id": "91f42296-627d-44ff-a1cb-936bb6f87992",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2023-07-17 14:47:00.308996: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.\n",
"2023-07-17 14:47:00.348466: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n",
"To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n",
"2023-07-17 14:47:01.039895: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, tensorflow, onnx, openvino\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'\n",
"comet_ml is installed but `COMET_API_KEY` is not set.\n",
"The argument `from_transformers` is deprecated, and will be removed in optimum 2.0. Use `export` instead\n",
"Framework not specified. Using pt to export to ONNX.\n",
"Using framework PyTorch: 1.13.1+cpu\n",
"Using framework PyTorch: 2.0.1+cu117\n",
"Overriding 1 configuration item(s)\n",
"\t- use_cache -> True\n",
"/home/ea/work/notebooks_convert/notebooks_conv_env/lib/python3.8/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py:504: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n",
" assert batch_size > 0, \"batch_size has to be defined and > 0\"\n",
"/home/ea/work/notebooks_convert/notebooks_conv_env/lib/python3.8/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py:270: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n",
" if seq_len > self.max_seq_len_cached:\n",
"/home/ea/work/notebooks_convert/notebooks_conv_env/lib/python3.8/site-packages/nncf/torch/dynamic_graph/wrappers.py:74: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.\n",
" op1 = operator(*args, **kwargs)\n",
"In-place op on output of tensor.shape. See https://pytorch.org/docs/master/onnx.html#avoid-inplace-operations-when-using-tensor-shape-in-tracing-mode\n",
"In-place op on output of tensor.shape. See https://pytorch.org/docs/master/onnx.html#avoid-inplace-operations-when-using-tensor-shape-in-tracing-mode\n",
"In-place op on output of tensor.shape. See https://pytorch.org/docs/master/onnx.html#avoid-inplace-operations-when-using-tensor-shape-in-tracing-mode\n",
@@ -230,9 +196,25 @@
"In-place op on output of tensor.shape. See https://pytorch.org/docs/master/onnx.html#avoid-inplace-operations-when-using-tensor-shape-in-tracing-mode\n",
"In-place op on output of tensor.shape. See https://pytorch.org/docs/master/onnx.html#avoid-inplace-operations-when-using-tensor-shape-in-tracing-mode\n",
"In-place op on output of tensor.shape. See https://pytorch.org/docs/master/onnx.html#avoid-inplace-operations-when-using-tensor-shape-in-tracing-mode\n",
"Saving external data to one file...\n",
"Saving external data to one file...\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"============= Diagnostic Run torch.onnx.export version 2.0.1+cu117 =============\n",
"verbose: False, log level: Level.ERROR\n",
"======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Compiling the model...\n",
"Set CACHE_DIR to /tmp/tmpndw8_20n/model_cache\n"
"Set CACHE_DIR to /tmp/tmpdbawql7m/model_cache\n"
]
}
],
@@ -255,6 +237,93 @@
" ov_model.save_pretrained(model_path)"
]
},
{
"cell_type": "markdown",
"id": "5b1238c8-dcc9-4495-aeff-1ecbd8bd5082",
"metadata": {},
"source": [
"### NNCF model weights compression [$\\Uparrow$](#Table-of-content:)\n",
"\n",
"NNCF [Weights Compression algorithm](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/CompressWeights.md) compresses weights of a model to `INT8`. This is an alternative to [Quantization algorithm](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/post_training/Quantization.md) that compresses both weights and activations. Weight compression is effective in optimizing footprint and performance of large models where the size of weights is significantly larger than the size of activations, for example, in Large Language Models (LLMs) such as Dolly 2.0. Additionaly, Weight Compression usually leads to almost no accuracy drop."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "8e5c9e68-3772-432f-b231-f1163442357d",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "66698b98b669482c97f1a29db1e38a66",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Dropdown(description='Compression:', index=1, options=('Disable', 'Enable'), value='Enable')"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"to_compress = widgets.Dropdown(\n",
" options=['Disable', 'Enable'],\n",
" value='Enable',\n",
" description='Compression:',\n",
" disabled=False,\n",
")\n",
"to_compress"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "392940e3-01da-4876-a9d1-2475ed3da882",
"metadata": {},
"outputs": [
{
"ename": "ValueError",
"evalue": "`weights_only` currently not supported for `OVModels`, only available for torch.nn.Module.",
"output_type": "error",
"traceback": [
"\u001B[0;31m---------------------------------------------------------------------------\u001B[0m",
"\u001B[0;31mValueError\u001B[0m Traceback (most recent call last)",
"Cell \u001B[0;32mIn[10], line 18\u001B[0m\n\u001B[1;32m 16\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m compressed_model_path\u001B[38;5;241m.\u001B[39mexists():\n\u001B[1;32m 17\u001B[0m quantizer \u001B[38;5;241m=\u001B[39m OVQuantizer\u001B[38;5;241m.\u001B[39mfrom_pretrained(ov_model)\n\u001B[0;32m---> 18\u001B[0m \u001B[43mquantizer\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mquantize\u001B[49m\u001B[43m(\u001B[49m\u001B[43msave_directory\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mcompressed_model_path\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mweights_only\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[38;5;28;43;01mTrue\u001B[39;49;00m\u001B[43m)\u001B[49m\n\u001B[1;32m 19\u001B[0m \u001B[38;5;28;01mdel\u001B[39;00m quantizer\n\u001B[1;32m 20\u001B[0m gc\u001B[38;5;241m.\u001B[39mcollect()\n",
"File \u001B[0;32m~/venvs/ov_notebooks_tmp/lib/python3.8/site-packages/optimum/intel/openvino/quantization.py:167\u001B[0m, in \u001B[0;36mOVQuantizer.quantize\u001B[0;34m(self, calibration_dataset, save_directory, quantization_config, file_name, batch_size, data_collator, remove_unused_columns, weights_only, **kwargs)\u001B[0m\n\u001B[1;32m 165\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m weights_only:\n\u001B[1;32m 166\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28misinstance\u001B[39m(\u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mmodel, OVBaseModel):\n\u001B[0;32m--> 167\u001B[0m \u001B[38;5;28;01mraise\u001B[39;00m \u001B[38;5;167;01mValueError\u001B[39;00m(\n\u001B[1;32m 168\u001B[0m \u001B[38;5;124m\"\u001B[39m\u001B[38;5;124m`weights_only` currently not supported for `OVModels`, only available for torch.nn.Module.\u001B[39m\u001B[38;5;124m\"\u001B[39m\n\u001B[1;32m 169\u001B[0m )\n\u001B[1;32m 170\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m calibration_dataset \u001B[38;5;129;01mis\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m \u001B[38;5;28;01mNone\u001B[39;00m:\n\u001B[1;32m 171\u001B[0m logger\u001B[38;5;241m.\u001B[39mwarning(\n\u001B[1;32m 172\u001B[0m \u001B[38;5;124m\"\u001B[39m\u001B[38;5;124m`calibration_dataset` was provided but will not be used as `weights_only` is set to `True`.\u001B[39m\u001B[38;5;124m\"\u001B[39m\n\u001B[1;32m 173\u001B[0m )\n",
"\u001B[0;31mValueError\u001B[0m: `weights_only` currently not supported for `OVModels`, only available for torch.nn.Module."
]
}
],
"source": [
"import gc\n",
"import nncf\n",
"import shutil\n",
"import openvino.runtime as ov\n",
"from optimum.intel import OVQuantizer\n",
"\n",
"compressed_model_path = Path(f'{model_path}_compressed')\n",
"\n",
"def calculate_compression_rate(model_path_ov, model_path_ov_compressed):\n",
" model_size_original = model_path_ov.with_suffix(\".bin\").stat().st_size / 2 ** 20\n",
" model_size_compressed = model_path_ov_compressed.with_suffix(\".bin\").stat().st_size / 2 ** 20\n",
" print(f\"* Original IR model size: {model_size_original:.2f} MB\")\n",
" print(f\"* Compressed IR model size: {model_size_compressed:.2f} MB\")\n",
" print(f\"* Model compression rate: {model_size_original / model_size_compressed:.3f}\")\n",
"\n",
"if to_compress.value == 'Enable':\n",
" if not compressed_model_path.exists():\n",
" quantizer = OVQuantizer.from_pretrained(ov_model)\n",
" quantizer.quantize(save_directory=compressed_model_path, weights_only=True)\n",
" del quantizer\n",
" gc.collect()\n",
" calculate_compression_rate(model_path / 'openvino_model.xml', compressed_model_path / 'openvino_model.xml')\n",
" ov_model = OVModelForCausalLM.from_pretrained(compressed_model_path.parent, device=current_device)"
]
},
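
The traceback above comes from an `optimum-intel` build whose `OVQuantizer.quantize()` does not yet accept `weights_only=True` for OpenVINO models. As a hedged alternative sketch, the exported IR can be compressed with NNCF's `compress_weights` API directly; `model_path` and the `openvino_model.xml` file name follow the cells above, while the rest of this snippet is an assumption rather than the PR's code.

```python
from pathlib import Path

import nncf
import openvino.runtime as ov

# Assumption: `model_path` is the directory produced by ov_model.save_pretrained(...) above.
core = ov.Core()
ir_model = core.read_model(Path(model_path) / "openvino_model.xml")

# compress_weights() replaces the FP16/FP32 weight constants of the IR with INT8 ones
# (weights-only compression; activations stay in floating point).
compressed_ir = nncf.compress_weights(ir_model)

compressed_dir = Path(f"{model_path}_compressed")
compressed_dir.mkdir(exist_ok=True)
ov.serialize(compressed_ir, str(compressed_dir / "openvino_model.xml"))
```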
{
"cell_type": "markdown",
"id": "b6d9c4a5-ef75-4076-9f1c-f45a2259ec46",
@@ -306,7 +375,7 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": null,
"id": "6f976094-8603-42c4-8f18-a32ba6d7192e",
"metadata": {},
"outputs": [],
@@ -331,7 +400,7 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": null,
"id": "52ac10a5-3141-4227-8f0b-0617acd027c8",
"metadata": {},
"outputs": [],
@@ -371,7 +440,7 @@
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": null,
"id": "524e72f4-8750-48ff-b002-e558d03b3302",
"metadata": {},
"outputs": [],
@@ -421,7 +490,7 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": null,
"id": "67fb4f9d-5877-48d8-8eff-c30ff6974d7a",
"metadata": {},
"outputs": [],
@@ -490,7 +559,7 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": null,
"id": "f114944f-c060-44ba-ba59-02cb2516554c",
"metadata": {},
"outputs": [],
@@ -571,42 +640,12 @@
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": null,
"id": "a00c2293-15b1-4734-b9b4-1abb524bb8d6",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/tmp/ipykernel_1272681/896135151.py:57: GradioDeprecationWarning: The `enable_queue` parameter has been deprecated. Please use the `.queue()` method instead.\n",
" demo.launch(enable_queue=True, share=False, height=800)\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Running on local URL: http://127.0.0.1:7860\n",
"\n",
"To create a public link, set `share=True` in `launch()`.\n"
]
},
{
"data": {
"text/html": [
"<div><iframe src=\"http://127.0.0.1:7860/\" width=\"100%\" height=\"800\" allow=\"autoplay; camera; microphone; clipboard-read; clipboard-write;\" frameborder=\"0\" allowfullscreen></iframe></div>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"outputs": [],
"source": [
"available_devices = Core().available_devices + [\"AUTO\"]\n",
"\n",
@@ -668,6 +707,14 @@
" except Exception:\n",
" demo.launch(enable_queue=True, share=True, height=800)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "af0c8d9d-693d-48d5-8039-54788049236d",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
1 change: 1 addition & 0 deletions notebooks/240-dolly-2-instruction-following/README.md
@@ -19,6 +19,7 @@ The tutorial consists of the following steps:

- Install prerequisites
- Download and convert the model from a public source using the [OpenVINO integration with Hugging Face Optimum](https://huggingface.co/blog/openvino).
- Compress model weights to INT8 with [OpenVINO NNCF](https://github.com/openvinotoolkit/nncf)
- Create an instruction-following inference pipeline
- Run instruction-following pipeline
