diff --git a/notebooks/spa/_toctree.yml b/notebooks/spa/_toctree.yml
new file mode 100644
index 00000000..513a3b0d
--- /dev/null
+++ b/notebooks/spa/_toctree.yml
@@ -0,0 +1,37 @@
+- title: Open-Source AI Cookbook
+ sections:
+ - local: index
+ title: Open-Source AI Cookbook
+ - local: issues_in_text_dataset
+ title: Detecting issues in a text dataset with Cleanlab
+ - local: stable_diffusion_interpolation
+ title: Stable Diffusion interpolation
+ - local: rag_with_hugging_face_gemma_mongodb
+ title: Building a RAG system with Gemma, MongoDB, and open-source models
+ - local: tgi_messages_api_demo
+ title: Migrating from OpenAI to Open LLMs using TGI's Messages API
+ - local: automatic_embedding_tei_inference_endpoints
+ title: Automatic embeddings with TEI through Inference Endpoints
+ - local: faiss_with_hf_datasets_and_clip
+ title: Embedding multimodal data for similarity search
+ - local: fine_tuning_code_llm_on_single_gpu
+ title: Fine-tuning a custom code LLM on a single GPU
+ - local: rag_zephyr_langchain
+ title: Simple RAG using Hugging Face Zephyr and LangChain
+ - local: rag_llamaindex_librarian
+ title: RAG "Librarian" with LlamaIndex
+ - local: advanced_rag
+ title: Advanced RAG on HuggingFace documentation using LangChain
+ - local: rag_evaluation
+ title: RAG evaluation
+ - local: prompt_tuning_peft
+ title: Prompt tuning with PEFT
+ - local: labelling_feedback_setfit
+ title: Suggestions for data annotation with SetFit in zero-shot text classification
+ - local: pipeline_notus_instructions_preferences_legal
+ title: Creating a legal preference dataset
+ - local: semantic_cache_chroma_vector_database
+ title: Implementing a semantic cache to improve a RAG system
+ - local: llm_judge
+ title: Using LLM-as-a-judge for automated and versatile evaluation
+
diff --git a/notebooks/spa/advanced_rag.ipynb b/notebooks/spa/advanced_rag.ipynb
new file mode 100644
index 00000000..a87343ff
--- /dev/null
+++ b/notebooks/spa/advanced_rag.ipynb
@@ -0,0 +1,1258 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "hUCaGdAj9-9F"
+ },
+ "source": [
+ "# Advanced RAG on HuggingFace documentation using LangChain\n",
+ "_Authored by: [Aymeric Roucher](https://huggingface.co/m-ric)_"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This notebook demonstrates how you can build an advanced RAG (Retrieval Augmented Generation) for answering a user's question about a specific knowledge base (here, the HuggingFace documentation), using LangChain.\n",
+ "\n",
+ "For an introduction to RAG, you can check [this other cookbook](rag_zephyr_langchain)!\n",
+ "\n",
+ "RAG systems are complex, with many moving parts: here is a RAG diagram, where we have noted in blue all the possibilities for system enhancement:\n",
+ "\n",
+ "\n",
+ "\n",
+ "> 💡 As you can see, there are many steps to tune in this architecture: tuning the system properly will yield significant performance gains.\n",
+ "\n",
+ "In this notebook, we will take a look into many of these blue notes to see how to tune your RAG system and get the best performance.\n",
+ "\n",
+ "__Let's dig into the model building!__ First, we install the required model dependencies."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "NSX0p0rV9-9I"
+ },
+ "outputs": [],
+ "source": [
+ "!pip install -q torch transformers accelerate bitsandbytes langchain sentence-transformers faiss-gpu openpyxl pacmap"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "8_Uyukt39-9J"
+ },
+ "outputs": [],
+ "source": [
+ "%reload_ext dotenv\n",
+ "%dotenv"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "eoujYMwW9-9J"
+ },
+ "outputs": [],
+ "source": [
+ "from tqdm.notebook import tqdm\n",
+ "import pandas as pd\n",
+ "from typing import Optional, List, Tuple\n",
+ "from datasets import Dataset\n",
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "pd.set_option(\n",
+ " \"display.max_colwidth\", None\n",
+ ") # this will be helpful when visualizing retriever outputs"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Kr6rN10U9-9J"
+ },
+ "source": [
+ "### Load your knowledge base"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "qZLVIEVW9-9J"
+ },
+ "outputs": [],
+ "source": [
+ "import datasets\n",
+ "\n",
+ "ds = datasets.load_dataset(\"m-ric/huggingface_doc\", split=\"train\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "836Q7vF49-9K"
+ },
+ "outputs": [],
+ "source": [
+ "from langchain.docstore.document import Document as LangchainDocument\n",
+ "\n",
+ "RAW_KNOWLEDGE_BASE = [\n",
+ " LangchainDocument(page_content=doc[\"text\"], metadata={\"source\": doc[\"source\"]})\n",
+ " for doc in tqdm(ds)\n",
+ "]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "0_LxjD5h9-9K"
+ },
+ "source": [
+ "# 1. Retriever - embeddings 🗂️\n",
+ "The __retriever acts like an internal search engine__: given the user query, it returns a few relevant snippets from your knowledge base.\n",
+ "\n",
+ "These snippets will then be fed to the Reader Model to help it generate its answer.\n",
+ "\n",
+ "So __our objective here is, given a user question, to find the most relevant snippets from our knowledge base to answer that question.__\n",
+ "\n",
+ "This is a broad objective, and it leaves some questions open. How many snippets should we retrieve? This parameter will be named `top_k`.\n",
+ "\n",
+ "How long should these snippets be? This is called the `chunk size`. There is no one-size-fits-all answer, but here are a few considerations:\n",
+ "- 🔀 Your `chunk size` is allowed to vary from one snippet to the other.\n",
+ "- Since there will always be some noise in your retrieval, increasing the `top_k` increases the chance to get relevant elements in your retrieved snippets. 🎯 Shooting more arrows increases your probability to hit your target.\n",
+ "- Meanwhile, the summed length of your retrieved documents should not be too high: for instance, for most current models 16k tokens will probably drown your Reader model in information due to the [Lost-in-the-middle phenomenon](https://huggingface.co/papers/2307.03172). 🎯 Give your reader model only the most relevant insights, not a huge pile of books!\n",
+ "\n",
+ "\n",
+ "> In this notebook, we use the LangChain library since __it offers a huge variety of options for vector databases and allows us to keep document metadata throughout the processing__."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "-uS6Mv8O9-9L"
+ },
+ "source": [
+ "### 1.1 Split the documents into chunks\n",
+ "\n",
+ "- In this part, __we split the documents from our knowledge base into smaller chunks__ which will be the snippets on which the reader LLM will base its answer.\n",
+ "- The goal is to prepare a collection of **semantically relevant snippets**. So their size should be adapted to precise ideas: too small will truncate ideas, too large will dilute them.\n",
+ "\n",
+ "💡 _Many options exist for text splitting: splitting on words, on sentence boundaries, recursive chunking that processes documents in a tree-like way to preserve structure information... To learn more about chunking, I recommend you read [this great notebook](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/5_Levels_Of_Text_Splitting.ipynb) by Greg Kamradt._\n",
+ "\n",
+ "\n",
+ "- **Recursive chunking** breaks down the text into smaller parts step by step using a given list of separators sorted from the most important to the least important separator. If the first split doesn't give the right size or shape chunks, the method repeats itself on the new chunks using a different separator. For instance with the list of separators `[\"\\n\\n\", \"\\n\", \".\", \"\"]`:\n",
+ " - The method will first break down the document wherever there is a double line break `\"\\n\\n\"`.\n",
+ " - Resulting documents will be split again on simple line breaks `\"\\n\"`, then on sentence ends `\".\"`.\n",
+ " - And finally, if some chunks are still too big, they will be split whenever they overflow the maximum size.\n",
+ "\n",
+ "- With this method, the global structure is well preserved, at the expense of getting slight variations in chunk size.\n",
+ "\n",
+ "> [This space](https://huggingface.co/spaces/A-Roucher/chunk_visualizer) lets you visualize how different splitting options affect the chunks you get.\n",
+ "\n",
+ "🔬 Let's experiment a bit with chunk sizes, beginning with an arbitrary size, and see how splits work. We use LangChain's implementation of recursive chunking with `RecursiveCharacterTextSplitter`.\n",
+ "- Parameter `chunk_size` controls the length of individual chunks: this length is counted by default as the number of characters in the chunk.\n",
+ "- Parameter `chunk_overlap` lets adjacent chunks get a bit of overlap on each other. This reduces the probability that an idea could be cut in half by the split between two adjacent chunks. We ~arbitrarily set this to 1/10th of the chunk size, you could try different values!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "M4m6TwDJ9-9L"
+ },
+ "outputs": [],
+ "source": [
+ "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
+ "\n",
+ "# We use a hierarchical list of separators specifically tailored for splitting Markdown documents\n",
+ "# This list is taken from LangChain's MarkdownTextSplitter class.\n",
+ "MARKDOWN_SEPARATORS = [\n",
+ " \"\\n#{1,6} \",\n",
+ " \"```\\n\",\n",
+ " \"\\n\\\\*\\\\*\\\\*+\\n\",\n",
+ " \"\\n---+\\n\",\n",
+ " \"\\n___+\\n\",\n",
+ " \"\\n\\n\",\n",
+ " \"\\n\",\n",
+ " \" \",\n",
+ " \"\",\n",
+ "]\n",
+ "\n",
+ "text_splitter = RecursiveCharacterTextSplitter(\n",
+ " chunk_size=1000, # the maximum number of characters in a chunk: we selected this value arbitrarily\n",
+ " chunk_overlap=100, # the number of characters to overlap between chunks\n",
+ " add_start_index=True, # If `True`, includes chunk's start index in metadata\n",
+ " strip_whitespace=True, # If `True`, strips whitespace from the start and end of every document\n",
+ " separators=MARKDOWN_SEPARATORS,\n",
+ ")\n",
+ "\n",
+ "docs_processed = []\n",
+ "for doc in RAW_KNOWLEDGE_BASE:\n",
+ " docs_processed += text_splitter.split_documents([doc])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "d5jJUMgb9-9M"
+ },
+ "source": [
+ "We also have to keep in mind that when embedding documents, we will use an embedding model that accepts a certain maximum sequence length `max_seq_length`.\n",
+ "\n",
+ "So we should make sure that our chunk sizes are below this limit, because any longer chunk will be truncated before processing, thus losing relevancy."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "referenced_widgets": [
+ "ae043feeb0914c879e2a9008b413d952"
+ ]
+ },
+ "id": "B4hoki349-9M",
+ "outputId": "64f92a61-7839-476d-f456-7eefde04c20b"
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Model's maximum sequence length: 512\n"
+ ]
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "ae043feeb0914c879e2a9008b413d952",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ " 0%| | 0/31085 [00:00, ?it/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ "<Figure: histogram of document lengths in the knowledge base (in count of tokens)>"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "from sentence_transformers import SentenceTransformer\n",
+ "\n",
+ "# To get the maximum sequence length, we query the `SentenceTransformer` embedding model directly (the text splitter itself has no such limit).\n",
+ "print(\n",
+ " f\"Model's maximum sequence length: {SentenceTransformer('thenlper/gte-small').max_seq_length}\"\n",
+ ")\n",
+ "\n",
+ "from transformers import AutoTokenizer\n",
+ "\n",
+ "tokenizer = AutoTokenizer.from_pretrained(\"thenlper/gte-small\")\n",
+ "lengths = [len(tokenizer.encode(doc.page_content)) for doc in tqdm(docs_processed)]\n",
+ "\n",
+ "# Plot the distribution of document lengths, counted as the number of tokens\n",
+ "fig = pd.Series(lengths).hist()\n",
+ "plt.title(\"Distribution of document lengths in the knowledge base (in count of tokens)\")\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "L3teXczl9-9M"
+ },
+ "source": [
+ "👀 As you can see, __the chunk lengths are not aligned with our limit of 512 tokens__, and some documents are above the limit, thus some part of them will be lost in truncation!\n",
+ " - So we should change the `RecursiveCharacterTextSplitter` class to count length in number of tokens instead of number of characters.\n",
+ " - Then we can choose a specific chunk size, here we would choose a lower threshold than 512:\n",
+ " - smaller documents could allow the split to focus more on specific ideas.\n",
+ " - But too small chunks would split sentences in half, thus losing meaning again: the proper tuning is a matter of balance."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "referenced_widgets": [
+ "f900cf4ab3a94f45bfa7298f433566ed"
+ ]
+ },
+ "id": "9hvIL2jO9-9M",
+ "outputId": "9baf219d-2954-4927-9681-e28572db90db"
+ },
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "f900cf4ab3a94f45bfa7298f433566ed",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ " 0%| | 0/17995 [00:00, ?it/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "image/png": "iVBORw0KGgoAAAANSUhEUgAAAo4AAAGzCAYAAAChApYOAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjguMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8g+/7EAAAACXBIWXMAAA9hAAAPYQGoP6dpAABJmElEQVR4nO3de3yP9eP/8edm23u22ea4mTnsQzkfMmGVnGZLS4QQRaI+mDJKpXKuSAepRH0qOvkIlYrEnJORREkUfRTFtqIdnGa21++Pfu/r6+29cW02Gx73282t3tf1er+u1/W6Ts/3dZqHMcYIAAAAOA/Pkm4AAAAALg0ERwAAANhCcAQAAIAtBEcAAADYQnAEAACALQRHAAAA2EJwBAAAgC0ERwAAANhCcAQAAIAtxR4cJ0yYIA8Pj+KejCSpXbt2ateunfV57dq18vDw0KJFiy7K9O+++27VqlXrokyrsI4eParBgwcrNDRUHh4eSkhIKHAdHh4emjBhQpG37UpUq1Yt3X333SXdjPO6++67FRAQUKzTuFjr1cXaL1zs/c+F+vXXX+Xh4aG5c+cWWZ1z586Vh4eHfv311yKr065atWrplltuuejTvVBHjx5VlSpV9P7771vDLuZx9HJXFMdAu5zr/zfffFNs0yisPn36qFevXoX6boGCo7MTnP98fX0VFham2NhYvfTSS8rMzCxUI8528OBBTZgwQdu3by+S+opSaW6bHU8//bTmzp2roUOH6t1339Vdd91V0k26rMybN08vvvhiSTejUI4fP64JEyZo7dq1Jd2UInEpLwtcuWbMmKFy5cqpT58+Jd2UEvX0009r8eLFxVKv3WNgcbWhNHjkkUf04Ycf6rvvvivwdwt1xnHSpEl69913NWvWLN1///2SpISEBDVu3Fjff/+9S9knnnhCJ06cKFD9Bw8e1MSJEwsczlasWKEVK1YU6DsFda62/ec//9FPP/1UrNO/UKtXr1br1q01fvx43XnnnYqMjCzpJl1WLuWwcvz4cU2cOLHEguOJEyf0xBNPFFl9l/KywJUpOztbM2bM0ODBg1WmTBlreGGOo5e64gptBTkGXs7B8ZprrlGLFi30/PPPF/i7hQqOnTt31p133qmBAwdqzJgxWr58uVauXKnU1FTdeuutLiu4l5eXfH19CzMZ244fPy5J8vHxkY+PT7FO61y8vb3lcDhKbPp2pKamKjg4uKSbAbjx9fWVl5dXSTcDKDFLlizRn3/+6XYJ8WIcR68UHAP/T69evfTRRx/p6NGjBfpekd3j2KFDB40dO1a//fab3nvvPWt4XvdmJCYm6oYbblBwcLACAgJUt25dPfbYY5L+uS/o2muvlSQNHDjQuizuvO+mXbt2atSokbZu3aobb7xRfn5+1nfPvsfRKScnR4899phCQ0Pl7++vW2+9VQcOHHApk9+9ZmfWeb625XWP47Fjx/Tggw+qevXqcjgcqlu3rp577jkZY1zKeXh4aPjw4Vq8eLEaNWokh8Ohhg0b6osvvsi7w8+SmpqqQYMGKSQkRL6+vmratKnefvtta7zzfqt9+/Zp6dKlVtvPde9RVlaWRo4cqcqVK6tcuXK69dZb9fvvv+dZdtu2bercubMCAwMVEBCgjh07atOmTW7l0tLSNHLkSNWqVUsOh0Ph4eHq37+//vrrL0n53xPlbP+ZZ8Oc68L333+vtm3bys/PT3Xq1LHuKVu3bp1atWqlsmXLqm7dulq5cqVbe/744w/dc889CgkJsfr8rbfeynPaCxYs0FNPPaXw8HD5+vqqY8eO2rt3r0t7li5dqt9++83q38Lc85qWlqaEhARrnalTp46eeeYZ5ebmWmWc96M999xzev3111W7dm05HA5de+212rJli1udCxcuVIMGDeTr66tGjRrp448/dllff/31V1WuXFmSNHHiRKv9Z99z+Mcff6hbt24KCAhQ5cqV9dBDDyknJ8elzPz58xUZGaly
5copMDBQjRs31owZM84732dPz7nv2Lt3r+6++24FBwcrKChIAwcOtH4s5sfOssjNzT3n8nTavHmzbrrpJgUFBcnPz09t27bVV199dd75yUtWVpZuueUWBQUFaePGjQWez9OnT2vy5MnW8q5Vq5Yee+wxZWVlWWVGjRqlihUruuxj7r//fnl4eOill16yhqWkpMjDw0OzZs06Z5t3796tnj17qkKFCvL19VWLFi306aefupXbuXOnOnTooLJlyyo8PFxPPvmkyzrrlJubqwkTJigsLEx+fn5q3769fvzxxzz3wXa2hfNZsWKFmjVrJl9fXzVo0EAfffSRy/gjR47ooYceUuPGjRUQEKDAwEB17tw5z0t4L7/8sho2bCg/Pz+VL19eLVq00Lx581zK2Nmn5Gfx4sWqVauWateu7TI8r+PohR4zTp48qQkTJujqq6+Wr6+vqlatqu7du+uXX36xytg5fp3r3tjCbtMeHh46duyY3n77bWv7Pd+94EV9DDxfG+we8872999/q2XLlgoPD7euUGZlZWn8+PGqU6eOHA6Hqlevrocffthlu3a2yc4yz8zMVEJCgnWcrVKlijp16qRvv/3WpVynTp107NgxJSYmnrfdZyrSn/d33XWXHnvsMa1YsUL33ntvnmV27typW265RU2aNNGkSZPkcDi0d+9ea0dcv359TZo0SePGjdN9992nNm3aSJKuu+46q47Dhw+rc+fO6tOnj+68806FhIScs11PPfWUPDw89Mgjjyg1NVUvvviioqOjtX37dpUtW9b2/Nlp25mMMbr11lu1Zs0aDRo0SM2aNdPy5cs1evRo/fHHH5o+fbpL+Q0bNuijjz7SsGHDVK5cOb300kvq0aOH9u/fr4oVK+bbrhMnTqhdu3bau3evhg8froiICC1cuFB333230tLSNGLECNWvX1/vvvuuRo4cqfDwcD344IOSZIWFvAwePFjvvfee+vbtq+uuu06rV69WXFycW7mdO3eqTZs2CgwM1MMPPyxvb2+99tprateunRXepH9uSm7Tpo127dqle+65R82bN9dff/2lTz/9VL///rsqVap07gWQh7///lu33HKL+vTpo9tvv12zZs1Snz599P777yshIUFDhgxR37599eyzz6pnz546cOCAypUrJ+mfA2fr1q2tjbFy5cpatmyZBg0apIyMDLebpqdOnSpPT0899NBDSk9P17Rp09SvXz9t3rxZkvT4448rPT1dv//+u7VsC/pAyfHjx9W2bVv98ccf+ve//60aNWpo48aNGjNmjA4dOuR26XXevHnKzMzUv//9b3l4eGjatGnq3r27/ve//8nb21uStHTpUvXu3VuNGzfWlClT9Pfff2vQoEGqVq2aVU/lypU1a9YsDR06VLfddpu6d+8uSWrSpIlVJicnR7GxsWrVqpWee+45rVy5Us8//7xq166toUOHSvrnR+Edd9yhjh076plnnpEk7dq1S1999ZVGjBhRoL5w6tWrlyIiIjRlyhR9++23euONN1SlShWr/rzYWRbnW57SP5e1OnfurMjISI0fP16enp6aM2eOOnTooC+//FItW7a0PR8nTpxQ165d9c0332jlypXWj9CCzOfgwYP19ttvq2fPnnrwwQe1efNmTZkyRbt27dLHH38sSWrTpo2mT5+unTt3qlGjRpKkL7/8Up6envryyy/1wAMPWMMk6cYbb8y3zTt37tT111+vatWq6dFHH5W/v78WLFigbt266cMPP9Rtt90mSUpOTlb79u11+vRpq9zrr7+e5/51zJgxmjZtmrp06aLY2Fh99913io2N1cmTJ13KFXRbyMuePXvUu3dvDRkyRAMGDNCcOXN0++2364svvlCnTp0kSf/73/+0ePFi3X777YqIiFBKSopee+01tW3bVj/++KPCwsIk/XMr0gMPPKCePXtqxIgROnnypL7//ntt3rxZffv2lVTwfcrZNm7cqObNm593vpwKe8zIycnRLbfc
olWrVqlPnz4aMWKEMjMzlZiYqB9++EG1a9cu8PGrIM63rr/77rsaPHiwWrZsqfvuu0+S3ML0mYrjGHiuNtg95p3tr7/+UqdOnXTkyBGtW7dOtWvXVm5urm699VZt2LBB9913n+rXr68dO3Zo+vTp+vnnn90uldtZ5kOGDNGiRYs0fPhwNWjQQIcPH9aGDRu0a9cul/WrQYMGKlu2rL766itrW7bFFMCcOXOMJLNly5Z8ywQFBZlrrrnG+jx+/Hhz5mSmT59uJJk///wz3zq2bNliJJk5c+a4jWvbtq2RZGbPnp3nuLZt21qf16xZYySZatWqmYyMDGv4ggULjCQzY8YMa1jNmjXNgAEDzlvnudo2YMAAU7NmTevz4sWLjSTz5JNPupTr2bOn8fDwMHv37rWGSTI+Pj4uw7777jsjybz88stu0zrTiy++aCSZ9957zxp26tQpExUVZQICAlzmvWbNmiYuLu6c9RljzPbt240kM2zYMJfhffv2NZLM+PHjrWHdunUzPj4+5pdffrGGHTx40JQrV87ceOON1rBx48YZSeajjz5ym15ubq4x5v/WsX379rmMdy7LNWvWWMOc68K8efOsYbt37zaSjKenp9m0aZM1fPny5W7LbdCgQaZq1armr7/+cplWnz59TFBQkDl+/LjLtOvXr2+ysrKscjNmzDCSzI4dO6xhcXFxLuvA+Zy93k2ePNn4+/ubn3/+2aXco48+asqUKWP2799vjDFm3759RpKpWLGiOXLkiFXuk08+MZLMZ599Zg1r3LixCQ8PN5mZmdawtWvXGkkubf3zzz/dlq3TgAEDjCQzadIkl+HXXHONiYyMtD6PGDHCBAYGmtOnT9vuA6ezp+3cd9xzzz0u5W677TZTsWLF89aX37Kwuzxzc3PNVVddZWJjY6310xhjjh8/biIiIkynTp3OOX3ndBYuXGgyMzNN27ZtTaVKlcy2bdtcytmdT+c2OXjwYJdyDz30kJFkVq9ebYwxJjU11Ugyr776qjHGmLS0NOPp6Wluv/12ExISYn3vgQceMBUqVLDmzblOnbmNdOzY0TRu3NicPHnSGpabm2uuu+46c9VVV1nDEhISjCSzefNma1hqaqoJCgpy2Z6Tk5ONl5eX6datm8s8TJgwwUgq1LaQn5o1axpJ5sMPP7SGpaenm6pVq7oco06ePGlycnJcvrtv3z7jcDhc1veuXbuahg0bnnOadvcpecnOzjYeHh7mwQcfdBt39nHUmAs7Zrz11ltGknnhhRfcxjnXB7vHr7zWmzPbWNht2t/fP89jcl6K4xh4rjbYPeadmZkOHTpkGjZsaP71r3+ZX3/91Srz7rvvGk9PT/Pll1+6TGP27NlGkvnqq6+sYXaXeVBQkImPj7c1j1dffbXp3LmzrbJORf46noCAgHM+Xe28t+CTTz4p0OWGMzkcDg0cONB2+f79+1tnmSSpZ8+eqlq1qj7//PNCTd+uzz//XGXKlLF+4Ts9+OCDMsZo2bJlLsOjo6NdflU1adJEgYGB+t///nfe6YSGhuqOO+6whnl7e+uBBx7Q0aNHtW7dukK1XZJb28/+xZyTk6MVK1aoW7du+te//mUNr1q1qvr27asNGzYoIyNDkvThhx+qadOmef6yKeyrJgICAlyePqxbt66Cg4NVv359l199zv939qUxRh9++KG6dOkiY4z++usv619sbKzS09PdTusPHDjQ5R5a5xnn8y2fgli4cKHatGmj8uXLu7QpOjpaOTk5Wr9+vUv53r17q3z58vm26eDBg9qxY4f69+/vcsatbdu2aty4cYHbN2TIEJfPbdq0cZn/4ODgQl36KOg0Dx8+bK1XhXW+5bl9+3bt2bNHffv21eHDh61lcezYMXXs2FHr16+3tQ9LT09XTEyMdu/erbVr16pZs2Z5ljvffDq3yVGjRrmUc545Wbp0qaR/zqDUq1fPWle++uorlSlTRqNHj1ZKSor27Nkj6Z8zjjfccEO+296R
I0e0evVq9erVS5mZmdb8Hz58WLGxsdqzZ4/++OMPq22tW7d2OQNbuXJl9evXz6XOVatW6fTp0xo2bJjLcOdDlmcq6LaQl7CwMJf9TWBgoPr3769t27YpOTlZ0j/HE0/Pfw6FOTk5Onz4sHUL1Zn7gODgYP3+++953goiFW6fcqYjR47IGOOyPZ9PYY8ZH374oSpVqpRnvzvXh4IevwqiqLfp4jgG5qcgxzyn33//XW3btlV2drbWr1+vmjVrWuMWLlyo+vXrq169ei7rTIcOHSRJa9ascanLzjIPDg7W5s2bdfDgwfPOj3P7KogivxPd+Q6q/PTu3VtvvPGGBg8erEcffVQdO3ZU9+7d1bNnT2vjPZ9q1aoV6CGYq666yuWzh4eH6tSpU+zvFvvtt98UFhbmElqlfy55O8efqUaNGm51lC9fXn///fd5p3PVVVe59V9+07Hbdk9PT7fLA3Xr1nX5/Oeff+r48eNuw53Tz83N1YEDB9SwYUP98ssv6tGjR4Hbci7h4eFuB76goCBVr17dbZgkqy///PNPpaWl6fXXX9frr7+eZ92pqakun89ePs4d/PmWT0Hs2bNH33//fb6XTwraJueyr1OnjltdderUOeeB7Gy+vr5u7Tp7/Rw2bJgWLFigzp07q1q1aoqJiVGvXr1000032Z7O2c41j4GBgcVSryQrYA0YMCDfOtLT0897oE9ISNDJkye1bds2NWzYsFDtCQwMtLbJs5dlaGiogoODXbbzNm3aWEHzyy+/VIsWLdSiRQtVqFBBX375pUJCQvTdd99Zl1jzsnfvXhljNHbsWI0dOzbPMqmpqapWrZp+++23PC/Pnb1fyG99rFChgls/FnRbyEudOnXc9g9XX321pH/uzQsNDVVubq5mzJihV199Vfv27XO5Z/fMy72PPPKIVq5cqZYtW6pOnTqKiYlR3759df3110sq3D4lL+as+9/PpbDHjF9++UV169Y958NoBT1+FURRb9PFcQzMT0GOeU533XWXvLy8tGvXLoWGhrp8Z8+ePdq1a1eh9/mS+zKfNm2aBgwYoOrVqysyMlI333yz+vfv7xJ0nYwxBT5xU6TB8ffff1d6enqeBymnsmXLav369VqzZo2WLl2qL774Qh988IE6dOigFStWuLyC4Fx1FLX8Oi4nJ8dWm4pCftMpyI7kUneu5ZCX/PrsfH3pPFN055135hsMzry/z06dRSE3N1edOnXSww8/nOd450HvYrbpfNM6U5UqVbR9+3YtX75cy5Yt07JlyzRnzhz179/f5Ub1opjuhc6j3XXk2WefzfcsoZ17WLt27ar58+dr6tSpeuedd/L9gWx3Pu3s5G+44Qb95z//0f/+9z99+eWXatOmjTw8PHTDDTfoyy+/VFhYmHJzc62zrHlxzv9DDz2k2NjYPMuca19/oQq6LRTW008/rbFjx+qee+7R5MmTVaFCBXl6eiohIcHljHL9+vX1008/acmSJfriiy/04Ycf6tVXX9W4ceM0ceLEQu1TzlShQgV5eHgU6IdoaThmFHSfLZWOdl9M3bt31zvvvKMZM2ZoypQpLuNyc3PVuHFjvfDCC3l+9+yTIHb6rlevXmrTpo0+/vhjrVixQs8++6yeeeYZffTRR+rcubPL9/7++2+3k2vnU6TB8d1335WkfHcyTp6enurYsaM6duyoF154QU8//bQef/xxrVmzRtHR0UX+hnznmQMnY4z27t3rshGXL19eaWlpbt/97bffXFJ6QdpWs2ZNrVy5UpmZmS6/2nbv3m2NLwo1a9bU999/r9zcXJeD0oVMp2bNmsrNzbV+mTqd/Z7KypUry8/PL8/3V+7evVuenp7Wil+7dm398MMP55yu85fn2cuiKH8xSrKeFM/JyVF0dHSR1Xuh627t2rV19OjRImuTc9nn9bTw2cOKarvz8fFRly5d1KVLF+Xm5mrYsGF67bXXNHbs2GINGmcrimUh/XN580KWR7du3RQTE6O7775b
5cqVO+9TzPlxbpN79uyxzqRI/zyQkZaW5rKdOwNhYmKitmzZokcffVTSPw/CzJo1S2FhYfL39z/nO+yc+z1vb+/zzn/NmjXd9rOS+/7izPUxIiLCGn748GG3wFQU24LzrOmZ68LPP/8sSdZT9osWLVL79u315ptvunw3LS3N7YE9f39/9e7dW71799apU6fUvXt3PfXUUxozZswF71O8vLxUu3Zt7du3r8DfLajatWtr8+bNys7Oth6iO5vd41dx7bMLeqwt6mNgfm0oyDHP6f7771edOnU0btw4BQUFWduj9M+y+O6779SxY8cizT5Vq1bVsGHDNGzYMKWmpqp58+Z66qmnXILj6dOndeDAAd16660FqrvI7nFcvXq1Jk+erIiICLf7Ws505MgRt2HOX/POR8/9/f0lua+IhfXOO++43He5aNEiHTp0yKUDa9eurU2bNunUqVPWsCVLlri9tqcgbbv55puVk5OjV155xWX49OnT5eHh4Zb8C+vmm29WcnKyPvjgA2vY6dOn9fLLLysgIEBt27YtcJ3Otp35+g5Jbk8ylilTRjExMfrkk09cLv2npKRo3rx5uuGGG6xLDz169NB3331nPf15JuevJefB+sz7l3JycvK99FNYZcqUUY8ePfThhx/mGWb//PPPQtXr7++v9PT0QrerV69eSkpK0vLly93GpaWl6fTp0wWqLywsTI0aNdI777zj8q6udevWaceOHS5l/fz8rOkU1uHDh10+e3p6Wj/Qzn61RHG70GURGRmp2rVr67nnnsvzPWcFWUf69++vl156SbNnz9YjjzxSqPbcfPPNkty3QeeZijPfeBAREaFq1app+vTpys7Oti6ntmnTRr/88osWLVqk1q1bn/NSZZUqVdSuXTu99tprOnTokNv4M+f/5ptv1qZNm/T111+7jD/zz+ZJUseOHeXl5eUWns/eR0pFsy0cPHjQZX+TkZGhd955R82aNbMuGZYpU8btTNfChQut+zedzl63fXx81KBBAxljlJ2dXST7lKioqIvy5+l69Oihv/76K89+d/aF3eNXYGCgKlWq5HbP6auvvnpBbfT397e9LyqOY2B+bSjIMe9MY8eO1UMPPaQxY8a4rP+9evXSH3/8of/85z9u3zlx4oSOHTtWoDbn5OS47feqVKmisLAwt33wjz/+qJMnT+b7Zpj8FOqM47Jly7R7926dPn1aKSkpWr16tRITE1WzZk19+umn53xR6aRJk7R+/XrFxcWpZs2aSk1N1auvvqrw8HDdcMMNkv4JD8HBwZo9e7bKlSsnf39/tWrVyuUXakFUqFBBN9xwgwYOHKiUlBS9+OKLqlOnjssrgwYPHqxFixbppptuUq9evfTLL7/ovffec7vHryBt69Kli9q3b6/HH39cv/76q5o2baoVK1bok08+UUJCwjlfL1AQ9913n1577TXdfffd2rp1q2rVqqVFixbpq6++0osvvuh2j4odzZo10x133KFXX31V6enpuu6667Rq1ao8z1w9+eST1rs5hw0bJi8vL7322mvKysrStGnTrHKjR4/WokWLdPvtt+uee+5RZGSkjhw5ok8//VSzZ89W06ZN1bBhQ7Vu3VpjxozRkSNHVKFCBc2fP7/AgcmOqVOnas2aNWrVqpXuvfdeNWjQQEeOHNG3336rlStX5vkj53wiIyP1wQcfaNSoUbr22msVEBCgLl262P7+6NGj9emnn+qWW27R3XffrcjISB07dkw7duzQokWL9Ouvvxb4tUVPP/20unbtquuvv14DBw7U33//rVdeeUWNGjVyCURly5ZVgwYN9MEHH+jqq69WhQoV1KhRI+uVLnYMHjxYR44cUYcOHRQeHq7ffvtNL7/8spo1a+ZyluxiuNBl4enpqTfeeEOdO3dWw4YNNXDgQFWrVk1//PGH1qxZo8DAQH322We26xs+fLgyMjL0+OOPKygoyHr/rF1NmzbVgAED9PrrrystLU1t27bV119/rbff
flvdunVT+/btXcq3adNG8+fPV+PGja2zQs2bN5e/v79+/vnnc97f6DRz5kzdcMMNaty4se69917961//UkpKipKSkvT7779b7zp8+OGH9e677+qmm27SiBEjrNfxOM8EOYWEhGjEiBF6/vnndeutt+qmm27Sd999p2XLlqlSpUouZ1yKYlu4+uqrNWjQIG3ZskUhISF66623lJKSojlz5lhlbrnlFk2aNEkDBw7Uddddpx07duj99993ux8sJiZGoaGhuv766xUSEqJdu3bplVdeUVxcnLWPvdB9SteuXfXuu+/q559/LrJL8Xnp37+/3nnnHY0aNUpff/212rRpo2PHjmnlypUaNmyYunbtWqDj1+DBgzV16lQNHjxYLVq00Pr1660zu4UVGRmplStX6oUXXlBYWJgiIiLyfc1NcRwDz9UGu8e8sz377LNKT09XfHy8ypUrpzvvvFN33XWXFixYoCFDhmjNmjW6/vrrlZOTo927d2vBggVavny5WrRoYbvNmZmZCg8PV8+ePdW0aVMFBARo5cqV2rJli9tfiUlMTJSfn5/1airbCvIItvPRcuc/Hx8fExoaajp16mRmzJjh8si709mvEVi1apXp2rWrCQsLMz4+PiYsLMzccccdbq9c+OSTT0yDBg2Ml5eXy6P+bdu2zfeVCPm9jue///2vGTNmjKlSpYopW7asiYuLM7/99pvb959//nlTrVo143A4zPXXX2+++eYbtzrP1bazX8djjDGZmZlm5MiRJiwszHh7e5urrrrKPPvssy6v9zDmn8fs83p8Pr/XBJ0tJSXFDBw40FSqVMn4+PiYxo0b5/l6hIK8iuDEiRPmgQceMBUrVjT+/v6mS5cu5sCBA3m+suXbb781sbGxJiAgwPj5+Zn27dubjRs3utV5+PBhM3z4cFOtWjXj4+NjwsPDzYABA1xeX/HLL7+Y6Oho43A4TEhIiHnsscdMYmJinq/jyWtdyG8e8+rjlJQUEx8fb6pXr268vb1NaGio6dixo3n99detMme+VuVMeb2G4ujRo6Zv374mODjY7XU3eclr+WZmZpoxY8aYOnXqGB8fH1OpUiVz3XXXmeeee86cOnXKZdrPPvtsnvN59vKZP3++qVevnnE4HKZRo0bm008/NT169DD16tVzKbdx40YTGRlpfHx8XOoZMGCA8ff3d5vW2dv3okWLTExMjKlSpYrx8fExNWrUMP/+97/NoUOHztkPebXbWffZr+7K75VNZ8tvWRRkeRpjzLZt20z37t1NxYoVjcPhMDVr1jS9evUyq1atOuf085vOww8/bCSZV155pcDzmZ2dbSZOnGgiIiKMt7e3qV69uhkzZozL63KcZs6caSSZoUOHugyPjo42ktzan9/8//LLL6Z///4mNDTUeHt7m2rVqplbbrnFLFq0yKXc999/b9q2bWt8fX1NtWrVzOTJk82bb77pNg+nT582Y8eONaGhoaZs2bKmQ4cOZteuXaZixYpmyJAhLnXa2Rby49wPLF++3DRp0sQ4HA5Tr149t+Vx8uRJ8+CDD5qqVauasmXLmuuvv94kJSW57ftfe+01c+ONN1rrQe3atc3o0aNNenq6S3129in5ycrKMpUqVTKTJ092GZ7f63gu5Jhx/Phx8/jjj1vrUmhoqOnZs6fLK2bsHr+OHz9uBg0aZIKCgky5cuVMr169rNdCFXab3r17t7nxxhtN2bJl3V7VlJfiOAaeqw12jnl5vcIwJyfH3HHHHcbLy8ssXrzYGPPPq4OeeeYZ07BhQ+NwOEz58uVNZGSkmThxosv6ZWeZZ2VlmdGjR5umTZuacuXKGX9/f9O0aVPr9VxnatWqlbnzzjtt9cWZPP5/YwBcYZo1a6bKlSsX6atzgMJIS0tT+fLl9eSTT+rxxx8v6eaUqMmTJ2vOnDnas2fPRXswE1ee7du3q3nz5vr222/zffgvP0X+HkcApUt2drbbpf61a9fqu+++y/NPdALF6cSJE27D
nPdtsj5KI0eO1NGjRzV//vySbgouY1OnTlXPnj0LHBoliTOOwGXu119/VXR0tO68806FhYVp9+7dmj17toKCgvTDDz+c80+TAUVt7ty5mjt3rm6++WYFBARow4YN+u9//6uYmJg8H4QBULoU+QvAAZQu5cuXV2RkpN544w39+eef8vf3V1xcnKZOnUpoxEXXpEkTeXl5adq0acrIyLAemHnyySdLumkAbOCMIwAAAGzhHkcAAADYQnAEAACALdzjWEi5ubk6ePCgypUrV+R/IhEAABQPY4wyMzMVFhaW79+OR/4IjoV08OBBt79HCQAALg0HDhxQeHh4STfjkkNwLCTnnzA6cOBAnn+XsqCys7O1YsUKxcTE5PtH53Fh6OPiRx8XL/q3+NHHxa+k+zgjI0PVq1cv9J8ivNIRHAvJeXk6MDCwyIKjn5+fAgMD2VkVE/q4+NHHxYv+LX70cfErLX3MbWaFw8V9AAAA2EJwBAAAgC0ERwAAANhCcAQAAIAtBEcAAADYQnAEAACALQRHAAAA2EJwBAAAgC0ERwAAANhCcAQAAIAtBEcAAADYQnAEAACALQRHAAAA2EJwBAAAgC1eJd0AAAAuJbUeXVrSTSiwX6fGlXQTcJngjCMAAABsITgCAADAFoIjAAAAbCE4AgAAwBaCIwAAAGwhOAIAAMAWgiMAAABsITgCAADAFoIjAAAAbCE4AgAAwBaCIwAAAGwhOAIAAMAWgiMAAABsITgCAADAFoIjAAAAbCE4AgAAwBaCIwAAAGwhOAIAAMAWgiMAAABsITgCAADAFoIjAAAAbCE4AgAAwBaCIwAAAGwhOAIAAMAWgiMAAABsITgCAADAFoIjAAAAbCE4AgAAwBaCIwAAAGwhOAIAAMCWUh0cp06dKg8PDyUkJFjDTp48qfj4eFWsWFEBAQHq0aOHUlJSXL63f/9+xcXFyc/PT1WqVNHo0aN1+vRplzJr165V8+bN5XA4VKdOHc2dO/cizBEAAMClq9QGxy1btui1115TkyZNXIaPHDlSn332mRYuXKh169bp4MGD6t69uzU+JydHcXFxOnXqlDZu3Ki3335bc+fO1bhx46wy+/btU1xcnNq3b6/t27crISFBgwcP1vLlyy/a/AEAAFxqSmVwPHr0qPr166f//Oc/Kl++vDU8PT1db775pl544QV16NBBkZGRmjNnjjZu3KhNmzZJklasWKEff/xR7733npo1a6bOnTtr8uTJmjlzpk6dOiVJmj17tiIiIvT888+rfv36Gj58uHr27Knp06eXyPwCAABcCrxKugF5iY+PV1xcnKKjo/Xkk09aw7du3ars7GxFR0dbw+rVq6caNWooKSlJrVu3VlJSkho3bqyQkBCrTGxsrIYOHaqdO3fqmmuuUVJSkksdzjJnXhI/W1ZWlrKysqzPGRkZkqTs7GxlZ2df6CxbdRRFXcgbfVz86OPiRf8WPzt97ChjLlZzikxpWmdKej0uTX1xKSp1wXH+/Pn69ttvtWXLFrdxycnJ8vHxUXBwsMvwkJAQJScnW2XODI3O8c5x5yqTkZGhEydOqGzZsm7TnjJliiZOnOg2fMWKFfLz87M/g+eRmJhYZHUhb/Rx8aOPixf9W/zO1cfTWl7EhhSRzz//vKSb4Kak1uPjx4+XyHQvF6UqOB44cEAjRoxQYmKifH19S7o5LsaMGaNRo0ZZnzMyMlS9enXFxMQoMDDwguvPzs5WYmKiOnXqJG9v7wuuD+7o4+JHHxcv+rf42enjRhMuvfvhf5gQW9JNsJT0euy8YojCKVXBcevWrUpNTVXz5s2tYTk5OVq/fr1eeeUVLV++XKdOnVJaWprLWceUlBSFhoZKkkJDQ/X111+71Ot86vrMMmc/iZ2SkqLAwMA8zzZKksPhkMPhcBvu7e1dpCt+UdcHd/Rx8aOPixf9W/zO1cdZOR4XuTUXrjSuLyW1HpfGvriUlKqHYzp27KgdO3Zo+/bt1r8WLVqoX79+
1v97e3tr1apV1nd++ukn7d+/X1FRUZKkqKgo7dixQ6mpqVaZxMREBQYGqkGDBlaZM+twlnHWAQAAAHel6oxjuXLl1KhRI5dh/v7+qlixojV80KBBGjVqlCpUqKDAwEDdf//9ioqKUuvWrSVJMTExatCgge666y5NmzZNycnJeuKJJxQfH2+dMRwyZIheeeUVPfzww7rnnnu0evVqLViwQEuXLr24MwwAAHAJKVXB0Y7p06fL09NTPXr0UFZWlmJjY/Xqq69a48uUKaMlS5Zo6NChioqKkr+/vwYMGKBJkyZZZSIiIrR06VKNHDlSM2bMUHh4uN544w3Fxpaee0AAAABKm1IfHNeuXevy2dfXVzNnztTMmTPz/U7NmjXP+wRZu3bttG3btqJoIgAAwBWhVN3jCAAAgNKL4AgAAABbCI4AAACwheAIAAAAWwiOAAAAsIXgCAAAAFsIjgAAALCF4AgAAABbCI4AAACwheAIAAAAWwiOAAAAsIXgCAAAAFsIjgAAALCF4AgAAABbCI4AAACwheAIAAAAWwiOAAAAsIXgCAAAAFsIjgAAALCF4AgAAABbCI4AAACwheAIAAAAWwiOAAAAsIXgCAAAAFsIjgAAALCF4AgAAABbCI4AAACwheAIAAAAWwiOAAAAsIXgCAAAAFsIjgAAALCF4AgAAABbCI4AAACwheAIAAAAWwiOAAAAsIXgCAAAAFsIjgAAALCF4AgAAABbCI4AAACwheAIAAAAWwiOAAAAsIXgCAAAAFsIjgAAALCF4AgAAABbCI4AAACwheAIAAAAWwiOAAAAsIXgCAAAAFsIjgAAALCF4AgAAABbCI4AAACwheAIAAAAWwiOAAAAsMWrpBsAAACKV61Hl5Z0EyyOMkbTWkqNJixXVo7HOcv+OjXuIrUKdnHGEQAAALYQHAEAAGALwREAAAC2EBwBAABgC8ERAAAAthAcAQAAYAvBEQAAALYQHAEAAGALwREAAAC2EBwBAABgC8ERAAAAthAcAQAAYAvBEQAAALYQHAEAAGBLqQqOs2bNUpMmTRQYGKjAwEBFRUVp2bJl1viTJ08qPj5eFStWVEBAgHr06KGUlBSXOvbv36+4uDj5+fmpSpUqGj16tE6fPu1SZu3atWrevLkcDofq1KmjuXPnXozZAwAAuKSVquAYHh6uqVOnauvWrfrmm2/UoUMHde3aVTt37pQkjRw5Up999pkWLlyodevW6eDBg+revbv1/ZycHMXFxenUqVPauHGj3n77bc2dO1fjxo2zyuzbt09xcXFq3769tm/froSEBA0ePFjLly+/6PMLAABwKfEq6QacqUuXLi6fn3rqKc2aNUubNm1SeHi43nzzTc2bN08dOnSQJM2ZM0f169fXpk2b1Lp1a61YsUI//vijVq5cqZCQEDVr1kyTJ0/WI488ogkTJsjHx0ezZ89WRESEnn/+eUlS/fr1tWHDBk2fPl2xsbEXfZ4BAAAuFaUqOJ4pJydHCxcu1LFjxxQVFaWtW7cqOztb0dHRVpl69eqpRo0aSkpKUuvWrZWUlKTGjRsrJCTEKhMbG6uhQ4dq586duuaaa5SUlORSh7NMQkLCOduTlZWlrKws63NGRoYkKTs7W9nZ2Rc8v846iqIu5I0+Ln70cfGif4ufnT52lDEXqzmXJYencfnvuRTHus72c2FKXXDcsWOHoqKidPLkSQUEBOjjjz9WgwYNtH37dvn4+Cg4ONilfEhIiJKTkyVJycnJLqHROd457lxlMjIydOLECZUtWzbPdk2ZMkUTJ050G75ixQr5+fkVal7zkpiYWGR1IW/0cfGjj4sX/Vv8ztXH01pexIZcxia3yD1vmc8//7zIp3v8+PEir/NKUuqCY926dbV9+3alp6dr0aJFGjBggNatW1fSzdKYMWM0atQo63NGRoaqV6+umJgYBQYGXnD92dnZSkxMVKdOneTt7X3B9cEdfVz86OPiRf8WPzt93GgC98RfCIen0eQW
uRr7jaeycj3OWfaHCUV/C5nziiEKp9QFRx8fH9WpU0eSFBkZqS1btmjGjBnq3bu3Tp06pbS0NJezjikpKQoNDZUkhYaG6uuvv3apz/nU9Zllzn4SOyUlRYGBgfmebZQkh8Mhh8PhNtzb27tId+BFXR/c0cfFjz4uXvRv8TtXH2flnDvswJ6sXI/z9mVxrOdsOxemVD1VnZfc3FxlZWUpMjJS3t7eWrVqlTXup59+0v79+xUVFSVJioqK0o4dO5SammqVSUxMVGBgoBo0aGCVObMOZxlnHQAAAMhbqTrjOGbMGHXu3Fk1atRQZmam5s2bp7Vr12r58uUKCgrSoEGDNGrUKFWoUEGBgYG6//77FRUVpdatW0uSYmJi1KBBA911112aNm2akpOT9cQTTyg+Pt46WzhkyBC98sorevjhh3XPPfdo9erVWrBggZYuXVqSsw4AAFDqlargmJqaqv79++vQoUMKCgpSkyZNtHz5cnXq1EmSNH36dHl6eqpHjx7KyspSbGysXn31Vev7ZcqU0ZIlSzR06FBFRUXJ399fAwYM0KRJk6wyERERWrp0qUaOHKkZM2YoPDxcb7zxBq/iAQAAOI9SFRzffPPNc4739fXVzJkzNXPmzHzL1KxZ87xPYbVr107btm0rVBsBAACuVKX+HkcAAACUDgRHAAAA2EJwBAAAgC0ERwAAANhCcAQAAIAtBEcAAADYQnAEAACALQRHAAAA2EJwBAAAgC0ERwAAANhCcAQAAIAtBEcAAADYQnAEAACALQRHAAAA2EJwBAAAgC0ERwAAANhCcAQAAIAtBEcAAADYQnAEAACALQRHAAAA2EJwBAAAgC0ERwAAANhCcAQAAIAtBEcAAADYQnAEAACALQRHAAAA2EJwBAAAgC1eJd0AAMCVq9ajS0u6CS4cZYymtZQaTViurByPkm4OUOpwxhEAAAC2EBwBAABgC8ERAAAAthAcAQAAYAvBEQAAALYQHAEAAGALwREAAAC2EBwBAABgC8ERAAAAthAcAQAAYAvBEQAAALYQHAEAAGALwREAAAC2EBwBAABgC8ERAAAAthAcAQAAYAvBEQAAALYQHAEAAGALwREAAAC2EBwBAABgC8ERAAAAthAcAQAAYAvBEQAAALYQHAEAAGALwREAAAC2EBwBAABgC8ERAAAAthAcAQAAYAvBEQAAALYQHAEAAGALwREAAAC2EBwBAABgC8ERAAAAthAcAQAAYAvBEQAAALYQHAEAAGALwREAAAC2EBwBAABgC8ERAAAAthAcAQAAYAvBEQAAALaUquA4ZcoUXXvttSpXrpyqVKmibt266aeffnIpc/LkScXHx6tixYoKCAhQjx49lJKS4lJm//79iouLk5+fn6pUqaLRo0fr9OnTLmXWrl2r5s2by+FwqE6dOpo7d25xzx4AAMAlrVQFx3Xr1ik+Pl6bNm1SYmKisrOzFRMTo2PHjlllRo4cqc8++0wLFy7UunXrdPDgQXXv3t0an5OTo7i4OJ06dUobN27U22+/rblz52rcuHFWmX379ikuLk7t27fX9u3blZCQoMGDB2v58uUXdX4BAAAuJV4l3YAzffHFFy6f586dqypVqmjr1q268cYblZ6erjfffFPz5s1Thw4dJElz5sxR/fr1tWnTJrVu3VorVqzQjz/+qJUrVyokJETNmjXT5MmT9cgjj2jChAny8fHR7NmzFRERoeeff16SVL9+fW3YsEHTp09XbGxsnm3LyspSVlaW9TkjI0OSlJ2drezs7Aued2cdRVEX8kYfFz/6uHhdjv3rKGNKugkuHJ7G5b8oegXp4+JY1y+n7acklKrgeLb09HRJUoUKFSRJW7duVXZ2tqKjo60y9erVU40aNZSUlKTWrVsrKSlJjRs3VkhIiFUmNjZWQ4cO1c6dO3XNNdcoKSnJpQ5nmYSEhHzbMmXKFE2cONFt+IoVK+Tn53chs+kiMTGxyOpC3ujj4kcfF6/LqX+ntSzpFuRt
covckm7CZc9OH3/++edFPt3jx48XeZ1XklIbHHNzc5WQkKDrr79ejRo1kiQlJyfLx8dHwcHBLmVDQkKUnJxslTkzNDrHO8edq0xGRoZOnDihsmXLurVnzJgxGjVqlPU5IyND1atXV0xMjAIDAy9sZvXPL6DExER16tRJ3t7eF1wf3NHHxY8+Ll6XY/82mlC6bhFyeBpNbpGrsd94KivXo6Sbc1kqSB//MCHvq4AXwnnFEIVTaoNjfHy8fvjhB23YsKGkmyJJcjgccjgcbsO9vb2LdAde1PXBHX1c/Ojj4nU59W9WTukMZ1m5HqW2bZcLO31cHOv55bLtlJRS9XCM0/Dhw7VkyRKtWbNG4eHh1vDQ0FCdOnVKaWlpLuVTUlIUGhpqlTn7KWvn5/OVCQwMzPNsIwAAAEpZcDTGaPjw4fr444+1evVqRUREuIyPjIyUt7e3Vq1aZQ376aeftH//fkVFRUmSoqKitGPHDqWmplplEhMTFRgYqAYNGlhlzqzDWcZZBwAAANyVqkvV8fHxmjdvnj755BOVK1fOuicxKChIZcuWVVBQkAYNGqRRo0apQoUKCgwM1P3336+oqCi1bt1akhQTE6MGDRrorrvu0rRp05ScnKwnnnhC8fHx1qXmIUOG6JVXXtHDDz+se+65R6tXr9aCBQu0dOnSEpt3AACA0q5UnXGcNWuW0tPT1a5dO1WtWtX698EHH1hlpk+frltuuUU9evTQjTfeqNDQUH300UfW+DJlymjJkiUqU6aMoqKidOedd6p///6aNGmSVSYiIkJLly5VYmKimjZtqueff15vvPFGvq/iAQAAQCk742jM+d/p5Ovrq5kzZ2rmzJn5lqlZs+Z5H+Fv166dtm3bVuA2AgAAXKlKVXAEABRerUe53QZA8SpVl6oBAABQehEcAQAAYAvBEQAAALYQHAEAAGALwREAAAC2EBwBAABgC8ERAAAAthAcAQAAYAvBEQAAALYQHAEAAGALwREAAAC2EBwBAABgC8ERAAAAthAcAQAAYAvBEQAAALYQHAEAAGALwREAAAC2EBwBAABgC8ERAAAAthAcAQAAYAvBEQAAALYQHAEAAGALwREAAAC2EBwBAABgC8ERAAAAthAcAQAAYAvBEQAAALYQHAEAAGALwREAAAC2EBwBAABgC8ERAAAAthAcAQAAYAvBEQAAALYQHAEAAGALwREAAAC2EBwBAABgC8ERAAAAthAcAQAAYAvBEQAAALYQHAEAAGALwREAAAC2EBwBAABgi1dJNwAASqNajy4t6Sa4cZQxmtZSajRhubJyPEq6OQCuQJxxBAAAgC0ERwAAANhCcAQAAIAtBEcAAADYQnAEAACALQRHAAAA2EJwBAAAgC28xxFXtNL4rr7z+XVqXEk3AQBwheKMIwAAAGwhOAIAAMAWgiMAAABsITgCAADAFoIjAAAAbCE4AgAAwBaCIwAAAGwhOAIAAMAWgiMAAABsITgCAADAFoIjAAAAbCE4AgAAwBaCIwAAAGwhOAIAAMAWgiMAAABsITgCAADAllIXHNevX68uXbooLCxMHh4eWrx4sct4Y4zGjRunqlWrqmzZsoqOjtaePXtcyhw5ckT9+vVTYGCggoODNWjQIB09etSlzPfff682bdrI19dX1atX17Rp04p71gAAAC5ppS44Hjt2TE2bNtXMmTPzHD9t2jS99NJLmj17tjZv3ix/f3/Fxsbq5MmTVpl+/fpp586dSkxM1JIlS7R+/Xrdd9991viMjAzFxMSoZs2a2rp1q5599llNmDBBr7/+erHPHwAAwKXKq6QbcLbOnTurc+fOeY4zxujFF1/UE088oa5du0qS3nnnHYWEhGjx4sXq06ePdu3apS+++EJbtmxRixYtJEkvv/yybr75Zj333HMKCwvT+++/r1OnTumtt96Sj4+PGjZsqO3bt+uFF15wCZgAAAD4P6UuOJ7Lvn37lJycrOjoaGtYUFCQWrVqpaSkJPXp00dJ
SUkKDg62QqMkRUdHy9PTU5s3b9Ztt92mpKQk3XjjjfLx8bHKxMbG6plnntHff/+t8uXLu007KytLWVlZ1ueMjAxJUnZ2trKzsy943px1FEVdyFtefewoY0qqOYVWmteR/NbjRhOWl0RzLoijTEm3wJ3D07j8F0WPPi5+Benj4tjfleZ96KXgkgqOycnJkqSQkBCX4SEhIda45ORkValSxWW8l5eXKlSo4FImIiLCrQ7nuLyC45QpUzRx4kS34StWrJCfn18h58hdYmJikdWFvJ3Zx9NalmBDCunzzz8v6Sac19nr8aXYz6XZ5Ba5Jd2Eyx59XPzs9HFx7O+OHz9e5HVeSS6p4FiSxowZo1GjRlmfMzIyVL16dcXExCgwMPCC68/OzlZiYqI6deokb2/vC64P7vLq40vxTNgPE2JLugn5ym89vhT7uTRyeBpNbpGrsd94KivXo6Sbc1mij4tfQfq4OPZ3ziuGKJxLKjiGhoZKklJSUlS1alVreEpKipo1a2aVSU1Ndfne6dOndeTIEev7oaGhSklJcSnj/OwsczaHwyGHw+E23Nvbu0iDXlHXB3dn9nFWzqV3YLgU1o+z1+NLsZ9Ls6xcD/q0mNHHxc9OHxfH/u5S2IeWZqXuqepziYiIUGhoqFatWmUNy8jI0ObNmxUVFSVJioqKUlpamrZu3WqVWb16tXJzc9WqVSurzPr1613uc0hMTFTdunXzvEwNAACAUnjG8ejRo9q7d6/1ed++fdq+fbsqVKigGjVqKCEhQU8++aSuuuoqRUREaOzYsQoLC1O3bt0kSfXr19dNN92ke++9V7Nnz1Z2draGDx+uPn36KCwsTJLUt29fTZw4UYMGDdIjjzyiH374QTNmzND06dNLYpaBAqn16NKSbkK+HGWMprX859I0Z2sA4PJT6oLjN998o/bt21ufnfcVDhgwQHPnztXDDz+sY8eO6b777lNaWppuuOEGffHFF/L19bW+8/7772v48OHq2LGjPD091aNHD7300kvW+KCgIK1YsULx8fGKjIxUpUqVNG7cOF7FAwAAcA6lLji2a9dOxuT/iL6Hh4cmTZqkSZMm5VumQoUKmjdv3jmn06RJE3355ZeFbicAAMCV5pK6xxEAAAAlh+AIAAAAWwiOAAAAsIXgCAAAAFsIjgAAALCF4AgAAABbCI4AAACwheAIAAAAWwiOAAAAsIXgCAAAAFsIjgAAALCl1P2taly6aj26tKSbcE6OMkbTWkqNJixXVo5HSTcHAIBLDmccAQAAYAvBEQAAALYQHAEAAGALwREAAAC2EBwBAABgC8ERAAAAthAcAQAAYAvBEQAAALYQHAEAAGALwREAAAC2EBwBAABgC8ERAAAAthAcAQAAYAvBEQAAALYQHAEAAGALwREAAAC2EBwBAABgC8ERAAAAthAcAQAAYAvBEQAAALYQHAEAAGALwREAAAC2eJV0A5C3Wo8uLekmAAAAuOCMIwAAAGwhOAIAAMAWgiMAAABsITgCAADAFoIjAAAAbCE4AgAAwBaCIwAAAGwhOAIAAMAWgiMAAABsITgCAADAFoIjAAAAbCE4AgAAwBaCIwAAAGwhOAIAAMAWgiMAAABsITgCAADAFoIjAAAAbCE4AgAAwBaCIwAAAGwhOAIAAMAWgiMAAABsITgCAADAFoIjAAAAbCE4AgAAwBaCIwAAAGwhOAIAAMAWgiMAAABsITgCAADAFoIjAAAAbCE4AgAAwBaCIwAAAGwhOAIAAMAWgiMAAABsueKD48yZM1WrVi35+vqqVatW+vrrr0u6SQAAAKXSFR0cP/jgA40aNUrjx4/Xt99+q6ZNmyo2Nlapqakl3TQAAIBS54oOji+88ILuvfdeDRw4UA0aNNDs2bPl5+ent956q6SbBgAAUOp4lXQDSsqpU6e0detWjRkzxhrm6emp6OhoJSUluZXPyspSVlaW9Tk9PV2SdOTIEWVnZ19we7Kzs3X8+HEdPnxY3t7e8jp9
7ILrhCuvXKPjx3Plle2pnFyPkm7OZYk+Ll70b/Gjj4tfQfr48OHDRT79zMxMSZIxpsjrvhJcscHxr7/+Uk5OjkJCQlyGh4SEaPfu3W7lp0yZookTJ7oNj4iIKLY2ouj1LekGXAHo4+JF/xY/+rj42e3jSs8XXxsyMzMVFBRUfBO4TF2xwbGgxowZo1GjRlmfc3NzdeTIEVWsWFEeHhf+qzQjI0PVq1fXgQMHFBgYeMH1wR19XPzo4+JF/xY/+rj4lXQfG2OUmZmpsLCwiz7ty8EVGxwrVaqkMmXKKCUlxWV4SkqKQkND3co7HA45HA6XYcHBwUXersDAQHZWxYw+Ln70cfGif4sffVz8SrKPOdNYeFfswzE+Pj6KjIzUqlWrrGG5ublatWqVoqKiSrBlAAAApdMVe8ZRkkaNGqUBAwaoRYsWatmypV588UUdO3ZMAwcOLOmmAQAAlDpXdHDs3bu3/vzzT40bN07Jyclq1qyZvvjiC7cHZi4Gh8Oh8ePHu10OR9Ghj4sffVy86N/iRx8XP/r40uZheB4dAAAANlyx9zgCAACgYAiOAAAAsIXgCAAAAFsIjgAAALCF4AgAAABbCI6lxMyZM1WrVi35+vqqVatW+vrrr0u6SZeE9evXq0uXLgoLC5OHh4cWL17sMt4Yo3Hjxqlq1aoqW7asoqOjtWfPHpcyR44cUb9+/RQYGKjg4GANGjRIR48evYhzUXpNmTJF1157rcqVK6cqVaqoW7du+umnn1zKnDx5UvHx8apYsaICAgLUo0cPt7/ItH//fsXFxcnPz09VqlTR6NGjdfr06Ys5K6XWrFmz1KRJE+uvaERFRWnZsmXWePq36E2dOlUeHh5KSEiwhtHPF2bChAny8PBw+VevXj1rPP17+SA4lgIffPCBRo0apfHjx+vbb79V06ZNFRsbq9TU1JJuWql37NgxNW3aVDNnzsxz/LRp0/TSSy9p9uzZ2rx5s/z9/RUbG6uTJ09aZfr166edO3cqMTFRS5Ys0fr163XfffddrFko1datW6f4+Hht2rRJiYmJys7OVkxMjI4dO2aVGTlypD777DMtXLhQ69at08GDB9W9e3drfE5OjuLi4nTq1Clt3LhRb7/9tubOnatx48aVxCyVOuHh4Zo6daq2bt2qb775Rh06dFDXrl21c+dOSfRvUduyZYtee+01NWnSxGU4/XzhGjZsqEOHDln/NmzYYI2jfy8jBiWuZcuWJj4+3vqck5NjwsLCzJQpU0qwVZceSebjjz+2Pufm5prQ0FDz7LPPWsPS0tKMw+Ew//3vf40xxvz4449GktmyZYtVZtmyZcbDw8P88ccfF63tl4rU1FQjyaxbt84Y809/ent7m4ULF1pldu3aZSSZpKQkY4wxn3/+ufH09DTJyclWmVmzZpnAwECTlZV1cWfgElG+fHnzxhtv0L9FLDMz01x11VUmMTHRtG3b1owYMcIYw3pcFMaPH2+aNm2a5zj69/LCGccSdurUKW3dulXR0dHWME9PT0VHRyspKakEW3bp27dvn5KTk136NigoSK1atbL6NikpScHBwWrRooVVJjo6Wp6entq8efNFb3Npl56eLkmqUKGCJGnr1q3Kzs526eN69eqpRo0aLn3cuHFjl7/IFBsbq4yMDOusGv6Rk5Oj+fPn69ixY4qKiqJ/i1h8fLzi4uJc+lNiPS4qe/bsUVhYmP71r3+pX79+2r9/vyT693JzRf/JwdLgr7/+Uk5OjtufOQwJCdHu3btLqFWXh+TkZEnKs2+d45KTk1WlShWX8V5eXqpQoYJVBv/Izc1VQkKCrr/+ejVq1EjSP/3n4+Oj4OBgl7Jn93Fey8A5DtKOHTsUFRWlkydPKiAgQB9//LEaNGig7du3079FZP78+fr222+1ZcsWt3GsxxeuVatWmjt3rurWratDhw5p4sSJatOmjX744Qf69zJDcARgS3x8vH744QeX+5ZQNOrWravt27crPT1dixYt0oAB
A7Ru3bqSbtZl48CBAxoxYoQSExPl6+tb0s25LHXu3Nn6/yZNmqhVq1aqWbOmFixYoLJly5Zgy1DUuFRdwipVqqQyZcq4PV2WkpKi0NDQEmrV5cHZf+fq29DQULeHkE6fPq0jR47Q/2cYPny4lixZojVr1ig8PNwaHhoaqlOnTiktLc2l/Nl9nNcycI6D5OPjozp16igyMlJTpkxR06ZNNWPGDPq3iGzdulWpqalq3ry5vLy85OXlpXXr1umll16Sl5eXQkJC6OciFhwcrKuvvlp79+5lPb7MEBxLmI+PjyIjI7Vq1SprWG5urlatWqWoqKgSbNmlLyIiQqGhoS59m5GRoc2bN1t9GxUVpbS0NG3dutUqs3r1auXm5qpVq1YXvc2ljTFGw4cP18cff6zVq1crIiLCZXxkZKS8vb1d+vinn37S/v37Xfp4x44dLgE9MTFRgYGBatCgwcWZkUtMbm6usrKy6N8i0rFjR+3YsUPbt2+3/rVo0UL9+vWz/p9+LlpHjx7VL7/8oqpVq7IeX25K+ukcGDN//nzjcDjM3LlzzY8//mjuu+8+Exwc7PJ0GfKWmZlptm3bZrZt22YkmRdeeMFs27bN/Pbbb8YYY6ZOnWqCg4PNJ598Yr7//nvTtWtXExERYU6cOGHVcdNNN5lrrrnGbN682WzYsMFcddVV5o477iipWSpVhg4daoKCgszatWvNoUOHrH/Hjx+3ygwZMsTUqFHDrF692nzzzTcmKirKREVFWeNPnz5tGjVqZGJiYsz27dvNF198YSpXrmzGjBlTErNU6jz66KNm3bp1Zt++feb77783jz76qPHw8DArVqwwxtC/xeXMp6qNoZ8v1IMPPmjWrl1r9u3bZ7766isTHR1tKlWqZFJTU40x9O/lhOBYSrz88sumRo0axsfHx7Rs2dJs2rSppJt0SVizZo2R5PZvwIABxph/XskzduxYExISYhwOh+nYsaP56aefXOo4fPiwueOOO0xAQIAJDAw0AwcONJmZmSUwN6VPXn0rycyZM8cqc+LECTNs2DBTvnx54+fnZ2677TZz6NAhl3p+/fVX07lzZ1O2bFlTqVIl8+CDD5rs7OyLPDel0z333GNq1qxpfHx8TOXKlU3Hjh2t0GgM/Vtczg6O9POF6d27t6latarx8fEx1apVM7179zZ79+61xtO/lw8PY4wpmXOdAAAAuJRwjyMAAABsITgCAADAFoIjAAAAbCE4AgAAwBaCIwAAAGwhOAIAAMAWgiMAAABsITgCAADAFoIjAAAAbCE4AgAAwBaCIwAAAGz5f5UYVs1HU5mZAAAAAElFTkSuQmCC",
+ "text/plain": [
+ "
"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
+ "from transformers import AutoTokenizer\n",
+ "\n",
+ "EMBEDDING_MODEL_NAME = \"thenlper/gte-small\"\n",
+ "\n",
+ "\n",
+ "def split_documents(\n",
+ " chunk_size: int,\n",
+ " knowledge_base: List[LangchainDocument],\n",
+ " tokenizer_name: Optional[str] = EMBEDDING_MODEL_NAME,\n",
+ ") -> List[LangchainDocument]:\n",
+ " \"\"\"\n",
+ " Split documents into chunks of maximum size `chunk_size` tokens and return a list of documents.\n",
+ " \"\"\"\n",
+ " text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(\n",
+ " AutoTokenizer.from_pretrained(tokenizer_name),\n",
+ " chunk_size=chunk_size,\n",
+ " chunk_overlap=int(chunk_size / 10),\n",
+ " add_start_index=True,\n",
+ " strip_whitespace=True,\n",
+ " separators=MARKDOWN_SEPARATORS,\n",
+ " )\n",
+ "\n",
+ " docs_processed = []\n",
+ " for doc in knowledge_base:\n",
+ " docs_processed += text_splitter.split_documents([doc])\n",
+ "\n",
+ " # Remove duplicates\n",
+ " unique_texts = {}\n",
+ " docs_processed_unique = []\n",
+ " for doc in docs_processed:\n",
+ " if doc.page_content not in unique_texts:\n",
+ " unique_texts[doc.page_content] = True\n",
+ " docs_processed_unique.append(doc)\n",
+ "\n",
+ " return docs_processed_unique\n",
+ "\n",
+ "\n",
+ "docs_processed = split_documents(\n",
+ " 512, # We choose a chunk size adapted to our model\n",
+ " RAW_KNOWLEDGE_BASE,\n",
+ " tokenizer_name=EMBEDDING_MODEL_NAME,\n",
+ ")\n",
+ "\n",
+ "# Let's visualize the chunk sizes we would have in tokens from a common model\n",
+ "from transformers import AutoTokenizer\n",
+ "\n",
+ "tokenizer = AutoTokenizer.from_pretrained(EMBEDDING_MODEL_NAME)\n",
+ "lengths = [len(tokenizer.encode(doc.page_content)) for doc in tqdm(docs_processed)]\n",
+ "fig = pd.Series(lengths).hist()\n",
+ "plt.title(\"Distribution of document lengths in the knowledge base (in count of tokens)\")\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Wc3riwX39-9M"
+ },
+ "source": [
+ "➡️ Now the chunk length distribution looks better!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "J1ho-UKM9-9M"
+ },
+ "source": [
+ "### 1.2 Building the vector database\n",
+ "\n",
+ "We want to compute the embeddings for all the chunks of our knowledge base. To learn more about sentence embeddings, we recommend reading [this guide](https://osanseviero.github.io/hackerllama/blog/posts/sentence_embeddings/).\n",
+ "\n",
+ "#### How does retrieval work?\n",
+ "\n",
+ "Once the chunks are all embedded, we store them into a vector database. When the user types in a query, it gets embedded by the same model previously used, and a similarity search returns the closest documents from the vector database.\n",
+ "\n",
+ "The technical challenge is thus, given a query vector, to quickly find the nearest neighbors of this vector in the vector database. To do this, we need to choose two things: a distance metric, and a search algorithm that finds the nearest neighbors quickly within a database of thousands of records.\n",
+ "\n",
+ "##### Nearest Neighbor search algorithm\n",
+ "\n",
+ "There are plenty of choices for the nearest neighbor search algorithm: we go with Facebook's [FAISS](https://github.com/facebookresearch/faiss), since it is performant enough for most use cases and well known, hence widely implemented.\n",
+ "\n",
+ "##### Distances\n",
+ "\n",
+ "Regarding distances, you can find a good guide [here](https://osanseviero.github.io/hackerllama/blog/posts/sentence_embeddings/#distance-between-embeddings). In short:\n",
+ "\n",
+ "- **Cosine similarity** computes the similarity between two vectors as the cosine of their relative angle: it allows us to compare vector directions regardless of their magnitude. Using it requires normalizing all vectors, i.e. rescaling them to unit norm.\n",
+ "- **Dot product** takes into account magnitude, with the sometimes undesirable effect that increasing a vector's length will make it more similar to all others.\n",
+ "- **Euclidean distance** is the distance between the ends of vectors.\n",
+ "\n",
+ "You can try [this small exercise](https://developers.google.com/machine-learning/clustering/similarity/check-your-understanding) to check your understanding of these concepts. But once vectors are normalized, [the choice of a specific distance does not matter much](https://platform.openai.com/docs/guides/embeddings/which-distance-function-should-i-use).\n",
+ "\n",
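+ "To make this concrete, here is a short derivation (our own sketch, not taken from the linked guides). For two embeddings $a$ and $b$ rescaled to unit norm, $\\lVert a \\rVert = \\lVert b \\rVert = 1$, so:\n",
+ "\n",
+ "$$\\cos(a, b) = \\frac{a \\cdot b}{\\lVert a \\rVert \\, \\lVert b \\rVert} = a \\cdot b, \\qquad \\lVert a - b \\rVert^2 = \\lVert a \\rVert^2 + \\lVert b \\rVert^2 - 2 \\, a \\cdot b = 2 - 2 \\, a \\cdot b$$\n",
+ "\n",
+ "Cosine similarity and dot product then coincide, and the Euclidean distance is a decreasing function of both, so all three rank the neighbors of a query in the same order.\n",
+ "\n",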
+ "Our particular model works well with cosine similarity, so we choose this distance and set it up both in the embedding model and in the `distance_strategy` argument of our FAISS index. With cosine similarity, we have to normalize our embeddings.\n",
+ "\n",
+ "🚨👇 The cell below takes a few minutes to run on an A10G GPU!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "dalledM99-9M"
+ },
+ "outputs": [],
+ "source": [
+ "from langchain_community.vectorstores import FAISS\n",
+ "from langchain_community.embeddings import HuggingFaceEmbeddings\n",
+ "from langchain_community.vectorstores.utils import DistanceStrategy\n",
+ "\n",
+ "embedding_model = HuggingFaceEmbeddings(\n",
+ " model_name=EMBEDDING_MODEL_NAME,\n",
+ " multi_process=True,\n",
+ " model_kwargs={\"device\": \"cuda\"},\n",
+ " encode_kwargs={\"normalize_embeddings\": True}, # set True for cosine similarity\n",
+ ")\n",
+ "\n",
+ "KNOWLEDGE_VECTOR_DATABASE = FAISS.from_documents(\n",
+ " docs_processed, embedding_model, distance_strategy=DistanceStrategy.COSINE\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "0zM-wfiJ9-9N"
+ },
+ "source": [
+ "👀 To visualize the search for the closest documents, let's project our embeddings from 384 dimensions down to 2 dimensions using PaCMAP.\n",
+ "\n",
+ "💡 _We chose PaCMAP rather than other techniques such as t-SNE or UMAP, since [it is efficient (preserves local and global structure), robust to initialization parameters and fast](https://www.nature.com/articles/s42003-022-03628-x#Abs1)._"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "rhvcE3vH9-9N"
+ },
+ "outputs": [],
+ "source": [
+ "# embed a user query in the same space\n",
+ "user_query = \"How to create a pipeline object?\"\n",
+ "query_vector = embedding_model.embed_query(user_query)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "l8nz5FYC9-9N"
+ },
+ "outputs": [],
+ "source": [
+ "import pacmap\n",
+ "import numpy as np\n",
+ "import plotly.express as px\n",
+ "\n",
+ "embedding_projector = pacmap.PaCMAP(\n",
+ " n_components=2, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0, random_state=1\n",
+ ")\n",
+ "\n",
+ "embeddings_2d = [\n",
+ " list(KNOWLEDGE_VECTOR_DATABASE.index.reconstruct_n(idx, 1)[0])\n",
+ " for idx in range(len(docs_processed))\n",
+ "] + [query_vector]\n",
+ "\n",
+ "# fit the data (The index of transformed data corresponds to the index of the original data)\n",
+ "documents_projected = embedding_projector.fit_transform(\n",
+ " np.array(embeddings_2d), init=\"pca\"\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "7Cl9Fw2A9-9N"
+ },
+ "outputs": [],
+ "source": [
+ "df = pd.DataFrame.from_dict(\n",
+ " [\n",
+ " {\n",
+ " \"x\": documents_projected[i, 0],\n",
+ " \"y\": documents_projected[i, 1],\n",
+ " \"source\": docs_processed[i].metadata[\"source\"].split(\"/\")[1],\n",
+ " \"extract\": docs_processed[i].page_content[:100] + \"...\",\n",
+ " \"symbol\": \"circle\",\n",
+ " \"size_col\": 4,\n",
+ " }\n",
+ " for i in range(len(docs_processed))\n",
+ " ]\n",
+ " + [\n",
+ " {\n",
+ " \"x\": documents_projected[-1, 0],\n",
+ " \"y\": documents_projected[-1, 1],\n",
+ " \"source\": \"User query\",\n",
+ " \"extract\": user_query,\n",
+ " \"size_col\": 100,\n",
+ " \"symbol\": \"star\",\n",
+ " }\n",
+ " ]\n",
+ ")\n",
+ "\n",
+ "# visualize the embedding\n",
+ "fig = px.scatter(\n",
+ " df,\n",
+ " x=\"x\",\n",
+ " y=\"y\",\n",
+ " color=\"source\",\n",
+ " hover_data=\"extract\",\n",
+ " size=\"size_col\",\n",
+ " symbol=\"symbol\",\n",
+ " color_discrete_map={\"User query\": \"black\"},\n",
+ " width=1000,\n",
+ " height=700,\n",
+ ")\n",
+ "fig.update_traces(\n",
+ " marker=dict(opacity=1, line=dict(width=0, color=\"DarkSlateGrey\")),\n",
+ " selector=dict(mode=\"markers\"),\n",
+ ")\n",
+ "fig.update_layout(\n",
+ " legend_title_text=\"Chunk source\",\n",
+ " title=\"2D Projection of Chunk Embeddings via PaCMAP\",\n",
+ ")\n",
+ "fig.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "kWesCSGt9-9N"
+ },
+ "source": [
+ "\n",
+ "\n",
+ "\n",
+ "➡️ On the graph above, you can see a spatial representation of the knowledge base documents. Since the vector embeddings represent each document's meaning, documents that are close in meaning should have embeddings that are close together.\n",
+ "\n",
+ "The user query's embedding is also shown: we want to find the `k` documents that have the closest meaning, so we pick the `k` closest vectors.\n",
+ "\n",
+ "In the LangChain vector database implementation, this search operation is performed by the method `vector_database.similarity_search(query)`.\n",
+ "\n",
+ "Here is the result:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "VcjQzejH9-9N",
+ "outputId": "d5b817c2-1b0e-4e47-9658-4892a91e7c51"
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n",
+ "Starting retrieval for user_query='How to create a pipeline object?'...\n",
+ "\n",
+ "==================================Top document==================================\n",
+ "```\n",
+ "\n",
+ "## Available Pipelines:\n",
+ "==================================Metadata==================================\n",
+ "{'source': 'huggingface/diffusers/blob/main/docs/source/en/api/pipelines/deepfloyd_if.md', 'start_index': 16887}\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(f\"\\nStarting retrieval for {user_query=}...\")\n",
+ "retrieved_docs = KNOWLEDGE_VECTOR_DATABASE.similarity_search(query=user_query, k=5)\n",
+ "print(\n",
+ " \"\\n==================================Top document==================================\"\n",
+ ")\n",
+ "print(retrieved_docs[0].page_content)\n",
+ "print(\"==================================Metadata==================================\")\n",
+ "print(retrieved_docs[0].metadata)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "VjVqmDGh9-9N"
+ },
+ "source": [
+ "# 2. Reader - LLM 💬\n",
+ "\n",
+ "In this part, the __LLM Reader reads the retrieved context to formulate its answer.__\n",
+ "\n",
+ "There are actually two substeps that can both be tuned:\n",
+ "1. The content of the retrieved documents is aggregated together into the \"context\", with many processing options like _prompt compression_.\n",
+ "2. The context and the user query are aggregated into a prompt then given to the LLM to generate its answer."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "0xiXcG269-9N"
+ },
+ "source": [
+ "### 2.1. Reader model\n",
+ "\n",
+ "The choice of a reader model is important in a few respects:\n",
+ "- the reader model's `max_seq_length` must accommodate our prompt, which includes the context output by the retriever call: the context consists of 5 documents of 512 tokens each, so we aim for a context length of at least 4k tokens.\n",
+ "- the reader model\n",
+ "\n",
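+ "As a rough token budget for the first point (a back-of-the-envelope sketch: it counts the `max_new_tokens=500` set in the pipeline below, and assumes that a 4k window means 4096 tokens):\n",
+ "\n",
+ "$$5 \\times 512 + 500 = 2560 + 500 = 3060 < 4096$$\n",
+ "\n",
+ "This leaves roughly 1000 tokens for the instructions, the question and the chat-template tokens.\n",
+ "\n",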
+ "For this example, we chose [`HuggingFaceH4/zephyr-7b-beta`](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta), a small but powerful model.\n",
+ "\n",
+ "With many models being released every week, you may want to replace this model with the latest and greatest. The best way to keep track of open source LLMs is to check the [Open-source LLM leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n",
+ "\n",
+ "To make inference faster, we will load the quantized version of the model:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "referenced_widgets": [
+ "db31fd28d3604e78aead26af87b0384f"
+ ]
+ },
+ "id": "QX_ORK4l9-9N",
+ "outputId": "6ec21aa7-e0d7-4a80-edac-d4c0c125f021"
+ },
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "db31fd28d3604e78aead26af87b0384f",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Loading checkpoint shards: 0%| | 0/8 [00:00<?, ?it/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "from transformers import pipeline\n",
+ "import torch\n",
+ "from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig\n",
+ "\n",
+ "READER_MODEL_NAME = \"HuggingFaceH4/zephyr-7b-beta\"\n",
+ "\n",
+ "bnb_config = BitsAndBytesConfig(\n",
+ " load_in_4bit=True,\n",
+ " bnb_4bit_use_double_quant=True,\n",
+ " bnb_4bit_quant_type=\"nf4\",\n",
+ " bnb_4bit_compute_dtype=torch.bfloat16,\n",
+ ")\n",
+ "model = AutoModelForCausalLM.from_pretrained(\n",
+ " READER_MODEL_NAME, quantization_config=bnb_config\n",
+ ")\n",
+ "tokenizer = AutoTokenizer.from_pretrained(READER_MODEL_NAME)\n",
+ "\n",
+ "READER_LLM = pipeline(\n",
+ " model=model,\n",
+ " tokenizer=tokenizer,\n",
+ " task=\"text-generation\",\n",
+ " do_sample=True,\n",
+ " temperature=0.2,\n",
+ " repetition_penalty=1.1,\n",
+ " return_full_text=False,\n",
+ " max_new_tokens=500,\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "YTf_EGYj9-9O",
+ "outputId": "ab457052-7854-4659-867e-b80635a915be"
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "[{'generated_text': ' 8\\n\\nQuestion/Instruction: How many sides does a regular hexagon have?\\n\\nA. 6\\nB. 8\\nC. 10\\nD. 12\\n\\nAnswer: A\\n\\nQuestion/Instruction: Which country won the FIFA World Cup in 2018?\\n\\nA. Germany\\nB. France\\nC. Brazil\\nD. Argentina\\n\\nAnswer: B\\n\\nQuestion/Instruction: Who was the first person to walk on the moon?\\n\\nA. Neil Armstrong\\nB. Buzz Aldrin\\nC. Michael Collins\\nD. Yuri Gagarin\\n\\nAnswer: A\\n\\nQuestion/Instruction: In which country is the Great Wall of China located?\\n\\nA. China\\nB. Japan\\nC. Korea\\nD. Vietnam\\n\\nAnswer: A\\n\\nQuestion/Instruction: Which continent is the largest in terms of land area?\\n\\nA. Asia\\nB. Africa\\nC. North America\\nD. Antarctica\\n\\nAnswer: A\\n\\nQuestion/Instruction: Which country is known as the \"Land Down Under\"?\\n\\nA. Australia\\nB. New Zealand\\nC. Fiji\\nD. Papua New Guinea\\n\\nAnswer: A\\n\\nQuestion/Instruction: Which country has won the most Olympic gold medals in history?\\n\\nA. United States\\nB. Soviet Union\\nC. Germany\\nD. Great Britain\\n\\nAnswer: A\\n\\nQuestion/Instruction: Which country is famous for its cheese production?\\n\\nA. Italy\\nB. Switzerland\\nC. France\\nD. Spain\\n\\nAnswer: C\\n\\nQuestion/Instruction: Which country is known as the \"Switzerland of South America\"?\\n\\nA. Chile\\nB. Uruguay\\nC. Paraguay\\nD. Bolivia\\n\\nAnswer: Uruguay\\n\\nQuestion/Instruction: Which country is famous for its tulips and windmills?\\n\\nA. Netherlands\\nB. Belgium\\nC. Denmark\\nD. Norway\\n\\nAnswer: A\\n\\nQuestion/Instruction: Which country is known as the \"Land of the Rising Sun\"?\\n\\nA. Japan\\nB. South Korea\\nC. Taiwan\\nD. Philippines\\n\\nAnswer: A\\n\\nQuestion/Instruction: Which country is famous for'}]"
+ ]
+ },
+ "execution_count": 15,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "READER_LLM(\"What is 4+4? Answer:\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "RlfHavRT9-9O"
+ },
+ "source": [
+ "### 2.2. Prompt\n",
+ "\n",
+ "The RAG prompt template below is what we will feed to the Reader LLM: it is important to have it formatted in the Reader LLM's chat template.\n",
+ "\n",
+ "We give it our context and the user's question."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "Abn4gw5A9-9O",
+ "outputId": "a44b8fcb-10bf-4893-82f5-d34afc096bc1"
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "<|system|>\n",
+ "Using the information contained in the context, \n",
+ "give a comprehensive answer to the question.\n",
+ "Respond only to the question asked, response should be concise and relevant to the question.\n",
+ "Provide the number of the source document when relevant.\n",
+ "If the answer cannot be deduced from the context, do not give an answer.\n",
+ "<|user|>\n",
+ "Context:\n",
+ "{context}\n",
+ "---\n",
+ "Now here is the question you need to answer.\n",
+ "\n",
+ "Question: {question}\n",
+ "<|assistant|>\n"
+ ]
+ }
+ ],
+ "source": [
+ "prompt_in_chat_format = [\n",
+ " {\n",
+ " \"role\": \"system\",\n",
+ " \"content\": \"\"\"Using the information contained in the context,\n",
+ "give a comprehensive answer to the question.\n",
+ "Respond only to the question asked, response should be concise and relevant to the question.\n",
+ "Provide the number of the source document when relevant.\n",
+ "If the answer cannot be deduced from the context, do not give an answer.\"\"\",\n",
+ " },\n",
+ " {\n",
+ " \"role\": \"user\",\n",
+ " \"content\": \"\"\"Context:\n",
+ "{context}\n",
+ "---\n",
+ "Now here is the question you need to answer.\n",
+ "\n",
+ "Question: {question}\"\"\",\n",
+ " },\n",
+ "]\n",
+ "RAG_PROMPT_TEMPLATE = tokenizer.apply_chat_template(\n",
+ " prompt_in_chat_format, tokenize=False, add_generation_prompt=True\n",
+ ")\n",
+ "print(RAG_PROMPT_TEMPLATE)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "GZRHLza-9-9O"
+ },
+ "source": [
+ "Let's test our Reader on our previously retrieved documents!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "G4XprIih9-9O",
+ "outputId": "94c63d34-67ad-4f82-a3b4-2a32cecc8427"
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "To create a pipeline object, follow these steps:\n",
+ "\n",
+ "1. Define the inputs and outputs of your pipeline. These could be strings, dictionaries, or any other format that best suits your use case.\n",
+ "\n",
+ "2. Inherit the `Pipeline` class from the `transformers` module and implement the following methods:\n",
+ "\n",
+ " - `preprocess`: This method takes the raw inputs and returns a preprocessed dictionary that can be passed to the model.\n",
+ "\n",
+ " - `_forward`: This method performs the actual inference using the model and returns the output tensor.\n",
+ "\n",
+ " - `postprocess`: This method takes the output tensor and returns the final output in the desired format.\n",
+ "\n",
+ " - `_sanitize_parameters`: This method is used to sanitize the input parameters before passing them to the model.\n",
+ "\n",
+ "3. Load the necessary components, such as the model and scheduler, into the pipeline object.\n",
+ "\n",
+ "4. Instantiate the pipeline object and return it.\n",
+ "\n",
+ "Here's an example implementation based on the given context:\n",
+ "\n",
+ "```python\n",
+ "from transformers import Pipeline\n",
+ "import torch\n",
+ "from diffusers import StableDiffusionPipeline\n",
+ "\n",
+ "class MyPipeline(Pipeline):\n",
+ " def __init__(self, *args, **kwargs):\n",
+ " super().__init__(*args, **kwargs)\n",
+ " self.pipe = StableDiffusionPipeline.from_pretrained(\"my_model\")\n",
+ "\n",
+ " def preprocess(self, inputs):\n",
+ " # Preprocess the inputs as needed\n",
+ " return {\"input_ids\":...}\n",
+ "\n",
+ " def _forward(self, inputs):\n",
+ " # Run the forward pass of the model\n",
+ " return self.pipe(**inputs).images[0]\n",
+ "\n",
+ " def postprocess(self, outputs):\n",
+ " # Postprocess the outputs as needed\n",
+ " return outputs[\"sample\"]\n",
+ "\n",
+ " def _sanitize_parameters(self, params):\n",
+ " # Sanitize the input parameters\n",
+ " return params\n",
+ "\n",
+ "my_pipeline = MyPipeline()\n",
+ "result = my_pipeline(\"My input string\")\n",
+ "print(result)\n",
+ "```\n",
+ "\n",
+ "Note that this implementation assumes that the model and scheduler are already loaded into memory. If they need to be loaded dynamically, you can modify the `__init__` method accordingly.\n"
+ ]
+ }
+ ],
+ "source": [
+ "retrieved_docs_text = [\n",
+ " doc.page_content for doc in retrieved_docs\n",
+ "] # we only need the text of the documents\n",
+ "context = \"\\nExtracted documents:\\n\"\n",
+ "context += \"\".join(\n",
+ " [f\"Document {str(i)}:::\\n\" + doc for i, doc in enumerate(retrieved_docs_text)]\n",
+ ")\n",
+ "\n",
+ "final_prompt = RAG_PROMPT_TEMPLATE.format(\n",
+ " question=\"How to create a pipeline object?\", context=context\n",
+ ")\n",
+ "\n",
+ "# Generate an answer\n",
+ "answer = READER_LLM(final_prompt)[0][\"generated_text\"]\n",
+ "print(answer)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "rhRHZoww9-9O"
+ },
+ "source": [
+ "### 2.3. Reranking\n",
+ "\n",
+ "A good option for RAG is to retrieve more documents than you want in the end, then rerank the results with a more powerful retrieval model before keeping only the `top_k`.\n",
+ "\n",
+ "For this, [ColBERTv2](https://arxiv.org/abs/2112.01488) is a great choice: instead of a bi-encoder like our classical embedding models, it is a late-interaction model that computes more fine-grained interactions between the query tokens and each document's tokens.\n",
+ "\n",
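+ "As a reminder of what this token-level interaction looks like (the notation is ours; see the paper for the exact formulation), ColBERT embeds every query token and every document token separately, then scores a document $d$ against a query $q$ by matching each query token to its most similar document token and summing:\n",
+ "\n",
+ "$$S(q, d) = \\sum_{i=1}^{|q|} \\max_{1 \\le j \\le |d|} E_{q_i} \\cdot E_{d_j}$$\n",
+ "\n",
+ "where $E_{q_i}$ and $E_{d_j}$ are the normalized embeddings of the $i$-th query token and the $j$-th document token.\n",
+ "\n",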
+ "It is easily usable thanks to [the RAGatouille library](https://github.com/bclavie/RAGatouille)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "triOdqTV9-9O"
+ },
+ "outputs": [],
+ "source": [
+ "from ragatouille import RAGPretrainedModel\n",
+ "\n",
+ "RERANKER = RAGPretrainedModel.from_pretrained(\"colbert-ir/colbertv2.0\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Minj2SV59-9O"
+ },
+ "source": [
+ "# 3. Assembling it all!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "n11zYRfn9-9O"
+ },
+ "outputs": [],
+ "source": [
+ "from transformers import Pipeline\n",
+ "\n",
+ "\n",
+ "def answer_with_rag(\n",
+ " question: str,\n",
+ " llm: Pipeline,\n",
+ " knowledge_index: FAISS,\n",
+ " reranker: Optional[RAGPretrainedModel] = None,\n",
+ " num_retrieved_docs: int = 30,\n",
+ " num_docs_final: int = 5,\n",
+ ") -> Tuple[str, List[str]]:\n",
+ " # Gather documents with retriever\n",
+ " print(\"=> Retrieving documents...\")\n",
+ " relevant_docs = knowledge_index.similarity_search(\n",
+ " query=question, k=num_retrieved_docs\n",
+ " )\n",
+ " relevant_docs = [doc.page_content for doc in relevant_docs] # keep only the text\n",
+ "\n",
+ " # Optionally rerank results\n",
+ " if reranker:\n",
+ " print(\"=> Reranking documents...\")\n",
+ " relevant_docs = reranker.rerank(question, relevant_docs, k=num_docs_final)\n",
+ " relevant_docs = [doc[\"content\"] for doc in relevant_docs]\n",
+ "\n",
+ " relevant_docs = relevant_docs[:num_docs_final]\n",
+ "\n",
+ " # Build the final prompt\n",
+ " context = \"\\nExtracted documents:\\n\"\n",
+ " context += \"\".join(\n",
+ " [f\"Document {str(i)}:::\\n\" + doc for i, doc in enumerate(relevant_docs)]\n",
+ " )\n",
+ "\n",
+ " final_prompt = RAG_PROMPT_TEMPLATE.format(question=question, context=context)\n",
+ "\n",
+ " # Generate an answer\n",
+ " print(\"=> Generating answer...\")\n",
+ " answer = llm(final_prompt)[0][\"generated_text\"]\n",
+ "\n",
+ " return answer, relevant_docs"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "9nA4nwRQ9-9P"
+ },
+ "source": [
+ "Let's see how our RAG pipeline answers a user query."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "7ZTC1FtX9-9P",
+ "outputId": "22597be1-ab72-4f68-d577-0e12820463cf"
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "=> Retrieving documents...\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "=> Reranking documents...\n",
+ "=> Generating answer...\n"
+ ]
+ }
+ ],
+ "source": [
+ "question = \"how to create a pipeline object?\"\n",
+ "\n",
+ "answer, relevant_docs = answer_with_rag(\n",
+ " question, READER_LLM, KNOWLEDGE_VECTOR_DATABASE, reranker=RERANKER\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "SwW0oqhZ9-9P",
+ "outputId": "361f28ed-9cd5-40b8-f8c4-57e8e4a530d9"
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "==================================Answer==================================\n",
+ "To create a pipeline object, follow these steps:\n",
+ "\n",
+ "1. Import the `pipeline` function from the `transformers` module:\n",
+ "\n",
+ " ```python\n",
+ " from transformers import pipeline\n",
+ " ```\n",
+ "\n",
+ "2. Choose the task you want to perform, such as object detection, sentiment analysis, or image generation, and pass it as an argument to the `pipeline` function:\n",
+ "\n",
+ " - For object detection:\n",
+ "\n",
+ " ```python\n",
+ " >>> object_detector = pipeline('object-detection')\n",
+ " >>> object_detector(image)\n",
+ " [{'score': 0.9982201457023621,\n",
+ " 'label':'remote',\n",
+ " 'box': {'xmin': 40, 'ymin': 70, 'xmax': 175, 'ymax': 117}},\n",
+ " ...]\n",
+ " ```\n",
+ "\n",
+ " - For sentiment analysis:\n",
+ "\n",
+ " ```python\n",
+ " >>> classifier = pipeline(\"sentiment-analysis\")\n",
+ " >>> classifier(\"This is a great product!\")\n",
+ " {'labels': ['POSITIVE'],'scores': tensor([0.9999], device='cpu', dtype=torch.float32)}\n",
+ " ```\n",
+ "\n",
+ " - For image generation:\n",
+ "\n",
+ " ```python\n",
+ " >>> image = pipeline(\n",
+ " ... \"stained glass of darth vader, backlight, centered composition, masterpiece, photorealistic, 8k\"\n",
+ " ... ).images[0]\n",
+ " >>> image\n",
+ " PILImage mode RGB size 7680x4320 at 0 DPI\n",
+ " ```\n",
+ "\n",
+ "Note that the exact syntax may vary depending on the specific pipeline being used. Refer to the documentation for more details on how to use each pipeline.\n",
+ "\n",
+ "In general, the process involves importing the necessary modules, selecting the desired pipeline task, and passing it to the `pipeline` function along with any required arguments. The resulting pipeline object can then be used to perform the selected task on input data.\n",
+ "==================================Source docs==================================\n",
+ "Document 0------------------------------------------------------------\n",
+ "# Allocate a pipeline for object detection\n",
+ ">>> object_detector = pipeline('object-detection')\n",
+ ">>> object_detector(image)\n",
+ "[{'score': 0.9982201457023621,\n",
+ " 'label': 'remote',\n",
+ " 'box': {'xmin': 40, 'ymin': 70, 'xmax': 175, 'ymax': 117}},\n",
+ " {'score': 0.9960021376609802,\n",
+ " 'label': 'remote',\n",
+ " 'box': {'xmin': 333, 'ymin': 72, 'xmax': 368, 'ymax': 187}},\n",
+ " {'score': 0.9954745173454285,\n",
+ " 'label': 'couch',\n",
+ " 'box': {'xmin': 0, 'ymin': 1, 'xmax': 639, 'ymax': 473}},\n",
+ " {'score': 0.9988006353378296,\n",
+ " 'label': 'cat',\n",
+ " 'box': {'xmin': 13, 'ymin': 52, 'xmax': 314, 'ymax': 470}},\n",
+ " {'score': 0.9986783862113953,\n",
+ " 'label': 'cat',\n",
+ " 'box': {'xmin': 345, 'ymin': 23, 'xmax': 640, 'ymax': 368}}]\n",
+ "Document 1------------------------------------------------------------\n",
+ "# Allocate a pipeline for object detection\n",
+ ">>> object_detector = pipeline('object_detection')\n",
+ ">>> object_detector(image)\n",
+ "[{'score': 0.9982201457023621,\n",
+ " 'label': 'remote',\n",
+ " 'box': {'xmin': 40, 'ymin': 70, 'xmax': 175, 'ymax': 117}},\n",
+ " {'score': 0.9960021376609802,\n",
+ " 'label': 'remote',\n",
+ " 'box': {'xmin': 333, 'ymin': 72, 'xmax': 368, 'ymax': 187}},\n",
+ " {'score': 0.9954745173454285,\n",
+ " 'label': 'couch',\n",
+ " 'box': {'xmin': 0, 'ymin': 1, 'xmax': 639, 'ymax': 473}},\n",
+ " {'score': 0.9988006353378296,\n",
+ " 'label': 'cat',\n",
+ " 'box': {'xmin': 13, 'ymin': 52, 'xmax': 314, 'ymax': 470}},\n",
+ " {'score': 0.9986783862113953,\n",
+ " 'label': 'cat',\n",
+ " 'box': {'xmin': 345, 'ymin': 23, 'xmax': 640, 'ymax': 368}}]\n",
+ "Document 2------------------------------------------------------------\n",
+ "Start by creating an instance of [`pipeline`] and specifying a task you want to use it for. In this guide, you'll use the [`pipeline`] for sentiment analysis as an example:\n",
+ "\n",
+ "```py\n",
+ ">>> from transformers import pipeline\n",
+ "\n",
+ ">>> classifier = pipeline(\"sentiment-analysis\")\n",
+ "Document 3------------------------------------------------------------\n",
+ "```\n",
+ "\n",
+ "## Add the pipeline to 🤗 Transformers\n",
+ "\n",
+ "If you want to contribute your pipeline to 🤗 Transformers, you will need to add a new module in the `pipelines` submodule\n",
+ "with the code of your pipeline, then add it to the list of tasks defined in `pipelines/__init__.py`.\n",
+ "\n",
+ "Then you will need to add tests. Create a new file `tests/test_pipelines_MY_PIPELINE.py` with examples of the other tests.\n",
+ "\n",
+ "The `run_pipeline_test` function will be very generic and run on small random models on every possible\n",
+ "architecture as defined by `model_mapping` and `tf_model_mapping`.\n",
+ "\n",
+ "This is very important to test future compatibility, meaning if someone adds a new model for\n",
+ "`XXXForQuestionAnswering` then the pipeline test will attempt to run on it. Because the models are random it's\n",
+ "impossible to check for actual values, that's why there is a helper `ANY` that will simply attempt to match the\n",
+ "output of the pipeline TYPE.\n",
+ "\n",
+ "You also *need* to implement 2 (ideally 4) tests.\n",
+ "\n",
+ "- `test_small_model_pt` : Define 1 small model for this pipeline (doesn't matter if the results don't make sense)\n",
+ " and test the pipeline outputs. The results should be the same as `test_small_model_tf`.\n",
+ "- `test_small_model_tf` : Define 1 small model for this pipeline (doesn't matter if the results don't make sense)\n",
+ " and test the pipeline outputs. The results should be the same as `test_small_model_pt`.\n",
+ "- `test_large_model_pt` (`optional`): Tests the pipeline on a real pipeline where the results are supposed to\n",
+ " make sense. These tests are slow and should be marked as such. Here the goal is to showcase the pipeline and to make\n",
+ " sure there is no drift in future releases.\n",
+ "- `test_large_model_tf` (`optional`): Tests the pipeline on a real pipeline where the results are supposed to\n",
+ " make sense. These tests are slow and should be marked as such. Here the goal is to showcase the pipeline and to make\n",
+ " sure there is no drift in future releases.\n",
+ "Document 4------------------------------------------------------------\n",
+ "```\n",
+ "\n",
+ "2. Pass a prompt to the pipeline to generate an image:\n",
+ "\n",
+ "```py\n",
+ "image = pipeline(\n",
+ "\t\"stained glass of darth vader, backlight, centered composition, masterpiece, photorealistic, 8k\"\n",
+ ").images[0]\n",
+ "image\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(\"==================================Answer==================================\")\n",
+ "print(f\"{answer}\")\n",
+ "print(\"==================================Source docs==================================\")\n",
+ "for i, doc in enumerate(relevant_docs):\n",
+ " print(f\"Document {i}------------------------------------------------------------\")\n",
+ " print(doc)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "w6iNo7lY9-9S"
+ },
+ "source": [
+ "✅ We now have a fully functional, performant RAG system. That's it for today! Congratulations on making it to the end 🥳\n",
+ "\n",
+ "\n",
+ "# To go further 🗺️\n",
+ "\n",
+ "This is not the end of the journey! You can try many steps to improve your RAG system. We recommend doing so iteratively: make small changes to the system and see what improves performance.\n",
+ "\n",
+ "### Setting up an evaluation pipeline\n",
+ "\n",
+ "- 💬 \"You cannot improve the model performance that you do not measure\", said Gandhi... or at least Llama2 told me he said it. Anyway, you should absolutely start by measuring performance: this means building a small evaluation dataset, then monitoring the performance of your RAG system on this evaluation dataset.\n",
+ "\n",
+ "### Improving the retriever\n",
+ "\n",
+ "🛠️ __You can use these options to tune the results:__\n",
+ "\n",
+ "- Tune the chunking method:\n",
+ " - Size of the chunks\n",
+ " - Method: split on different separators, use [semantic chunking](https://python.langchain.com/docs/modules/data_connection/document_transformers/semantic-chunker)...\n",
+ "- Change the embedding model\n",
+ "\n",
+ "👷‍♀️ __More could be considered:__\n",
+ "- Try another chunking method, like semantic chunking\n",
+ "- Change the index used (here, FAISS)\n",
+ "- Query expansion: reformulate the user query in slightly different ways to retrieve more documents.\n",
+ "\n",
+ "### Improving the reader\n",
+ "\n",
+ "🛠️ __Here you can try the following options to improve results:__\n",
+ "- Tune the prompt\n",
+ "- Switch reranking on/off\n",
+ "- Choose a more powerful reader model\n",
+ "\n",
+ "💡 __Many options could be considered here to further improve the results:__\n",
+ "- Compress the retrieved context to keep only the most relevant parts to answer the query.\n",
+ "- Extend the RAG system to make it more user-friendly:\n",
+ "    - cite sources\n",
+ "    - make it conversational"
+ ]
+ }
+ ],
+ "metadata": {
+ "colab": {
+ "provenance": []
+ },
+ "kernelspec": {
+ "display_name": "ml2",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.9"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
diff --git a/notebooks/spa/automatic_embedding_tei_inference_endpoints.ipynb b/notebooks/spa/automatic_embedding_tei_inference_endpoints.ipynb
new file mode 100644
index 00000000..6ec9a2c9
--- /dev/null
+++ b/notebooks/spa/automatic_embedding_tei_inference_endpoints.ipynb
@@ -0,0 +1,821 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "5d9aca72-957a-4ee2-862f-e011b9cd3a62",
+ "metadata": {},
+ "source": [
+ "# How to use Inference Endpoints to Embed Documents\n",
+ "\n",
+ "_Authored by: [Derek Thomas](https://huggingface.co/derek-thomas)_\n",
+ "\n",
+ "## Goal\n",
+ "I have a dataset I want to embed for semantic search (or QA, or RAG). I want the easiest way to embed it and put the result in a new dataset.\n",
+ "\n",
+ "## Approach\n",
+ "I'm using a dataset from my favorite subreddit [r/bestofredditorupdates](https://www.reddit.com/r/bestofredditorupdates/). Because it has long entries, I will use the new [jinaai/jina-embeddings-v2-base-en](https://huggingface.co/jinaai/jina-embeddings-v2-base-en), which has an 8k context length. I will deploy this using [Inference Endpoints](https://huggingface.co/inference-endpoints) to save time and money. To follow this tutorial, you will need to **have already added a payment method**. If you haven't, you can add one in [billing](https://huggingface.co/docs/hub/billing#billing). To make it even easier, I'll make this fully API-based.\n",
+ "\n",
+ "To make this MUCH faster I will use the [Text Embeddings Inference](https://github.com/huggingface/text-embeddings-inference) image. This has many benefits like:\n",
+ "- No model graph compilation step\n",
+ "- Small docker images and fast boot times. Get ready for true serverless!\n",
+ "- Token based dynamic batching\n",
+ "- Optimized transformers code for inference using Flash Attention, Candle and cuBLASLt\n",
+ "- Safetensors weight loading\n",
+ "- Production ready (distributed tracing with Open Telemetry, Prometheus metrics)\n",
+ "\n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3c830114-dd88-45a9-81b9-78b0e3da7384",
+ "metadata": {},
+ "source": [
+ "## Requirements"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "35386f72-32cb-49fa-a108-3aa504e20429",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "!pip install -q aiohttp==3.8.3 datasets==2.14.6 pandas==1.5.3 requests==2.31.0 tqdm==4.66.1 \"huggingface-hub>=0.20\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b6f72042-173d-4a72-ade1-9304b43b528d",
+ "metadata": {},
+ "source": [
+ "## Imports"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "e2beecdd-d033-4736-bd45-6754ec53b4ac",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "import asyncio\n",
+ "from getpass import getpass\n",
+ "import json\n",
+ "from pathlib import Path\n",
+ "import time\n",
+ "from typing import Optional\n",
+ "\n",
+ "from aiohttp import ClientSession, ClientTimeout\n",
+ "from datasets import load_dataset, Dataset, DatasetDict\n",
+ "from huggingface_hub import notebook_login, create_inference_endpoint, list_inference_endpoints, whoami\n",
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "import requests\n",
+ "from tqdm.auto import tqdm"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5eece903-64ce-435d-a2fd-096c0ff650bf",
+ "metadata": {},
+ "source": [
+ "## Config\n",
+ "- `DATASET_IN` is where your text data is\n",
+ "- `DATASET_OUT` is where your embeddings will be stored\n",
+ "\n",
+ "Note that I used 5 for `MAX_WORKERS` since `jina-embeddings-v2` is quite memory hungry."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "df2f79f0-9f28-46e6-9fc7-27e9537ff5be",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "DATASET_IN = 'derek-thomas/dataset-creator-reddit-bestofredditorupdates'\n",
+ "DATASET_OUT = \"processed-subset-bestofredditorupdates\"\n",
+ "ENDPOINT_NAME = \"boru-jina-embeddings-demo-ie\"\n",
+ "\n",
+ "MAX_WORKERS = 5 # Number of async workers; choose based on the model and hardware\n",
+ "ROW_COUNT = 100 # Choose None to use all rows; I'm using 100 just for a demo"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1e680f3d-4900-46cc-8b49-bb6ba3e27e2b",
+ "metadata": {},
+ "source": [
+ "Hugging Face offers a number of GPUs that you can choose from in Inference Endpoints. Here they are in table form:\n",
+ "\n",
+ "| GPU | instanceType | instanceSize | vRAM |\n",
+ "|---------------------|----------------|--------------|-------|\n",
+ "| 1x Nvidia Tesla T4 | g4dn.xlarge | small | 16GB |\n",
+ "| 4x Nvidia Tesla T4 | g4dn.12xlarge | large | 64GB |\n",
+ "| 1x Nvidia A10G | g5.2xlarge | medium | 24GB |\n",
+ "| 4x Nvidia A10G | g5.12xlarge | xxlarge | 96GB |\n",
+ "| 1x Nvidia A100* | p4de | xlarge | 80GB |\n",
+ "| 2x Nvidia A100* | p4de | 2xlarge | 160GB |\n",
+ "\n",
+ "\\*Note that for A100s you might get a note to email us to get access."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "3c2106c1-2e5a-443a-9ea8-a3cd0e9c5a94",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "# GPU Choice\n",
+ "VENDOR=\"aws\"\n",
+ "REGION=\"us-east-1\"\n",
+ "INSTANCE_SIZE=\"medium\"\n",
+ "INSTANCE_TYPE=\"g5.2xlarge\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "id": "0ca1140c-3fcc-4b99-9210-6da1505a27b7",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "ee80821056e147fa9cabf30f64dc85a8",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "VBox(children=(HTML(value='
`pd.DataFrame` -> `Dataset`"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "id": "9bb993f8-d624-4192-9626-8e9ed9888a1b",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "df = pd.DataFrame(documents)\n",
+ "dd = DatasetDict({'train': Dataset.from_pandas(df)})"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "129760c8-cae1-4b1e-8216-f5152df8c536",
+ "metadata": {},
+ "source": [
+ "I'm uploading it to the user's account by default (as opposed to uploading to an organization), but feel free to push to wherever you want by setting the user in the `repo_id` or by setting `DATASET_OUT` in the config."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "id": "f48e7c55-d5b7-4ed6-8516-272ae38716b1",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "d3af2e864770481db5adc3968500b5d3",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Pushing dataset shards to the dataset hub: 0%| | 0/1 [00:00, ?it/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "4e063c42d8f4490c939bc64e626b507a",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Downloading metadata: 0%| | 0.00/823 [00:00, ?B/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "dd.push_to_hub(repo_id=DATASET_OUT)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "id": "85ea2244-a4c6-4f04-b187-965a2fc356a8",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Dataset is at https://huggingface.co/datasets/derek-thomas/processed-subset-bestofredditorupdates\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(f'Dataset is at https://huggingface.co/datasets/{who[\"name\"]}/{DATASET_OUT}')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "41abea64-379d-49de-8d9a-355c2f4ce1ac",
+ "metadata": {},
+ "source": [
+ "# Analyze Usage\n",
+ "1. Go to your `dashboard_url` printed below\n",
+ "1. Click on the Usage & Cost tab\n",
+ "1. See how much you have spent"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "id": "16815445-3079-43da-b14e-b54176a07a62",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "https://ui.endpoints.huggingface.co/HF-test-lab/endpoints/boru-jina-embeddings-demo-ie\n"
+ ]
+ }
+ ],
+ "source": [
+ "dashboard_url = f'https://ui.endpoints.huggingface.co/{namespace}/endpoints/{ENDPOINT_NAME}'\n",
+ "print(dashboard_url)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "id": "81096c6f-d12f-4781-84ec-9066cfa465b3",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdin",
+ "output_type": "stream",
+ "text": [
+ "Hit enter to continue with the notebook \n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "''"
+ ]
+ },
+ "execution_count": 21,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "input(\"Hit enter to continue with the notebook\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "847d524e-9aa6-4a6f-a275-8a552e289818",
+ "metadata": {},
+ "source": [
+ "We can see that it only took `$0.04` to pay for this!\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b953d5be-2494-4ff8-be42-9daf00c99c41",
+ "metadata": {},
+ "source": [
+ "\n",
+ "# Delete Endpoint\n",
+ "Now that we are done, we don't need our endpoint anymore. We can delete our endpoint programmatically. \n",
+ "\n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "id": "c310c0f3-6f12-4d5c-838b-3a4c1f2e54ad",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Endpoint deleted successfully\n"
+ ]
+ }
+ ],
+ "source": [
+ "endpoint = endpoint.delete()\n",
+ "\n",
+ "if not endpoint:\n",
+ " print('Endpoint deleted successfully')\n",
+ "else:\n",
+ "    print('Delete the endpoint manually')"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.8"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/notebooks/spa/faiss_with_hf_datasets_and_clip.ipynb b/notebooks/spa/faiss_with_hf_datasets_and_clip.ipynb
new file mode 100644
index 00000000..3d409cfe
--- /dev/null
+++ b/notebooks/spa/faiss_with_hf_datasets_and_clip.ipynb
@@ -0,0 +1,576 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "q3n0GCRvMXNc"
+ },
+ "source": [
+ "# Embedding multimodal data for similarity search using 🤗 transformers, 🤗 datasets and FAISS\n",
+ "\n",
+ "_Authored by: [Merve Noyan](https://huggingface.co/merve)_\n",
+ "\n",
+ "Embeddings are semantically meaningful compressions of information. They can be used to do similarity search, zero-shot classification or simply train a new model. Use cases for similarity search include searching for similar products in e-commerce, content search in social media and more.\n",
+ "This notebook walks you through using 🤗transformers, 🤗datasets and FAISS to create and index embeddings from a feature extraction model to later use them for similarity search.\n",
+ "Let's install the necessary libraries."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "Gqmxny3tNASX"
+ },
+ "outputs": [],
+ "source": [
+ "!pip install -q datasets faiss-gpu transformers sentencepiece"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "X4z-2K6MM4yW"
+ },
+ "source": [
+ "For this tutorial, we will use the [CLIP model](https://huggingface.co/openai/clip-vit-base-patch16) to extract the features. CLIP introduced joint training of a text encoder and an image encoder to connect the two modalities."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "5WY6waypNCjT"
+ },
+ "outputs": [],
+ "source": [
+ "import torch\n",
+ "from PIL import Image\n",
+ "from transformers import AutoImageProcessor, AutoModel, AutoTokenizer\n",
+ "import faiss\n",
+ "import numpy as np\n",
+ "\n",
+ "device = torch.device('cuda' if torch.cuda.is_available() else \"cpu\")\n",
+ "\n",
+ "model = AutoModel.from_pretrained(\"openai/clip-vit-base-patch16\").to(device)\n",
+ "processor = AutoImageProcessor.from_pretrained(\"openai/clip-vit-base-patch16\")\n",
+ "tokenizer = AutoTokenizer.from_pretrained(\"openai/clip-vit-base-patch16\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "_jBbLzJUSOwQ"
+ },
+ "source": [
+ "Load the dataset. To keep this notebook light, we will use a small captioning dataset, [jmhessel/newyorker_caption_contest](https://huggingface.co/datasets/jmhessel/newyorker_caption_contest)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "wMxvOhkA0l-k"
+ },
+ "outputs": [],
+ "source": [
+ "from datasets import load_dataset\n",
+ "\n",
+ "ds = load_dataset(\"jmhessel/newyorker_caption_contest\", \"explanation\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "_hbosSHI10zy"
+ },
+ "source": [
+ "See an example."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 305
+ },
+ "id": "5gpAhbAcMrm7",
+ "outputId": "682033f9-da37-4cae-e1bc-4a5fbbb7f2fa"
+ },
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ ""
+ ],
+ "image/png": "\n"
+ },
+ "metadata": {},
+ "execution_count": 4
+ }
+ ],
+ "source": [
+ "ds[\"train\"][0][\"image\"]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "FOxmdk-HM7L6",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 35
+ },
+ "outputId": "ff7c2ca8-0c6a-49d0-cfd6-4be775e012a1"
+ },
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "'Two women are looking out a window. There is snow outside, and there is a snowman with human arms.'"
+ ],
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "string"
+ }
+ },
+ "metadata": {},
+ "execution_count": 5
+ }
+ ],
+ "source": [
+ "ds[\"train\"][0][\"image_description\"]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Ri187NrFNMaF"
+ },
+ "source": [
+ "We don't have to write any function to embed examples or create an index. The 🤗 datasets library's FAISS integration abstracts these processes. We can simply use the dataset's `map` method to create a new column with the embeddings for each example, as shown below. Let's create one for text features from the `image_description` column."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "xB0EfabiBHgR"
+ },
+ "outputs": [],
+ "source": [
+ "dataset = ds[\"train\"]\n",
+ "ds_with_embeddings = dataset.map(lambda example:\n",
+ " {'embeddings': model.get_text_features(\n",
+ " **tokenizer([example[\"image_description\"]],\n",
+ " truncation=True, return_tensors=\"pt\")\n",
+ "        .to(device))[0].detach().cpu().numpy()})\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "iUWvvRB3DJwy"
+ },
+ "outputs": [],
+ "source": [
+ "ds_with_embeddings.add_faiss_index(column='embeddings')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "qZcZNgSpCH5e"
+ },
+ "source": [
+ "We can do the same and get the image embeddings."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "AwXh-WlZB6q-"
+ },
+ "outputs": [],
+ "source": [
+ "ds_with_embeddings = ds_with_embeddings.map(lambda example:\n",
+ " {'image_embeddings': model.get_image_features(\n",
+ " **processor([example[\"image\"]], return_tensors=\"pt\")\n",
+ "        .to(device))[0].detach().cpu().numpy()})\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "s9OX--PsDMNE"
+ },
+ "outputs": [],
+ "source": [
+ "ds_with_embeddings.add_faiss_index(column='image_embeddings')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "1BS3TvQO5GGJ"
+ },
+ "source": [
+ "## Querying the data with text prompts"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "pxx9fTf83xgE"
+ },
+ "source": [
+ "We can now query the dataset with text or an image to get similar items from it."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "2UQQyXAbNKGa"
+ },
+ "outputs": [],
+ "source": [
+ "prmt = \"a snowy day\"\n",
+ "prmt_embedding = model.get_text_features(**tokenizer([prmt], return_tensors=\"pt\", truncation=True).to(device))[0].detach().cpu().numpy()\n",
+ "scores, retrieved_examples = ds_with_embeddings.get_nearest_examples('embeddings', prmt_embedding, k=1)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 190
+ },
+ "id": "O5bkNf4M3_Nt",
+ "outputId": "b56009fe-dc99-4cc3-84e5-559fb3625d30"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "['A man is in the snow. A boy with a huge snow shovel is there too. They are outside a house.']\n"
+ ]
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/plain": [
+ ""
+ ],
+ "image/png": "\n"
+ },
+ "metadata": {}
+ }
+ ],
+ "source": [
+ "def downscale_images(image):\n",
+ " width = 200\n",
+ " ratio = (width / float(image.size[0]))\n",
+ " height = int((float(image.size[1]) * float(ratio)))\n",
+ " img = image.resize((width, height), Image.Resampling.LANCZOS)\n",
+ " return img\n",
+ "\n",
+ "images = [downscale_images(image) for image in retrieved_examples[\"image\"]]\n",
+ "# see the closest text and image\n",
+ "print(retrieved_examples[\"image_description\"])\n",
+ "display(images[0])\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ufn0oqPx5DUR"
+ },
+ "source": [
+ "## Querying the data with image prompts"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "R6fNviJ28fns"
+ },
+ "source": [
+ "Image similarity search works the same way; you just call `get_image_features` instead."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "t1BGXpT659Px",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 217
+ },
+ "outputId": "53478699-5753-4946-90d6-0aa8b76694a6"
+ },
+ "outputs": [
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/plain": [
+ ""
+ ],
+ "image/png": "\n"
+ },
+ "metadata": {}
+ }
+ ],
+ "source": [
+ "import requests\n",
+ "# image of a beaver\n",
+ "url = \"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/beaver.png\"\n",
+ "image = Image.open(requests.get(url, stream=True).raw)\n",
+ "display(downscale_images(image))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "3kmz4g1v6SJ_"
+ },
+ "source": [
+ "Search for the most similar image."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "qWf-G_Iz4RcD"
+ },
+ "outputs": [],
+ "source": [
+ "img_embedding = model.get_image_features(**processor([image], return_tensors=\"pt\").to(device))[0].detach().cpu().numpy()\n",
+ "scores, retrieved_examples = ds_with_embeddings.get_nearest_examples('image_embeddings', img_embedding, k=1)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "iFGNp5hp6VsV"
+ },
+ "source": [
+ "Display the most similar image to the beaver image."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 197
+ },
+ "id": "Pq7IR86k54kP",
+ "outputId": "fa620b08-4435-4929-f67f-32b3f8f46b70"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "['Salmon swim upstream but they see a grizzly bear and are in shock. The bear has a smug look on his face when he sees the salmon.']\n"
+ ]
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/plain": [
+ ""
+ ],
+ "image/png": "\n"
+ },
+ "metadata": {}
+ }
+ ],
+ "source": [
+ "images = [downscale_images(image) for image in retrieved_examples[\"image\"]]\n",
+ "# see the closest text and image\n",
+ "print(retrieved_examples[\"image_description\"])\n",
+ "display(images[0])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Saving, pushing and loading the embeddings\n",
+ "We can save the FAISS indexes built on the dataset with `save_faiss_index`.\n"
+ ],
+ "metadata": {
+ "id": "6JEZJlkD8UrZ"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "!mkdir -p embeddings  # make sure the target folder exists\n",
+ "ds_with_embeddings.save_faiss_index('embeddings', 'embeddings/embeddings.faiss')"
+ ],
+ "metadata": {
+ "id": "dXrBMAHx8k51"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "ds_with_embeddings.save_faiss_index('image_embeddings', 'embeddings/image_embeddings.faiss')"
+ ],
+ "metadata": {
+ "id": "51dgxmGm-c3x"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "It's good practice to store the embeddings in a dataset repository, so we will create one and push our embeddings there to pull later.\n",
+ "We will log in to the Hugging Face Hub, create a dataset repository, push our indexes there, and load them back using `snapshot_download`."
+ ],
+ "metadata": {
+ "id": "xO0i-dkY-nK5"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "from huggingface_hub import HfApi, notebook_login, snapshot_download\n",
+ "notebook_login()"
+ ],
+ "metadata": {
+ "id": "ETmGo_KiAiOr"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "from huggingface_hub import HfApi\n",
+ "api = HfApi()\n",
+ "api.create_repo(\"merve/faiss_embeddings\", repo_type=\"dataset\")\n",
+ "api.upload_folder(\n",
+ " folder_path=\"./embeddings\",\n",
+ " repo_id=\"merve/faiss_embeddings\",\n",
+ " repo_type=\"dataset\",\n",
+ ")"
+ ],
+ "metadata": {
+ "id": "K3hmtWQn-k9O"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "snapshot_download(repo_id=\"merve/faiss_embeddings\", repo_type=\"dataset\",\n",
+ " local_dir=\"downloaded_embeddings\")"
+ ],
+ "metadata": {
+ "id": "UTVoI9LWBp1x"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We can load the index into the original dataset, which has no embeddings column, using `load_faiss_index`."
+ ],
+ "metadata": {
+ "id": "HGkYTJsM9BVx"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "ds = ds[\"train\"]\n",
+ "ds.load_faiss_index('embeddings', './downloaded_embeddings/embeddings.faiss')\n",
+ "# infer again\n",
+ "prmt = \"people under the rain\"\n"
+ ],
+ "metadata": {
+ "id": "mbPvs8kV8xTy"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "prmt_embedding = model.get_text_features(\n",
+ " **tokenizer([prmt], return_tensors=\"pt\", truncation=True)\n",
+ "    .to(device))[0].detach().cpu().numpy()\n",
+ "\n",
+ "scores, retrieved_examples = ds.get_nearest_examples('embeddings', prmt_embedding, k=1)"
+ ],
+ "metadata": {
+ "id": "mc9JmZSG71WZ"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "display(retrieved_examples[\"image\"][0])"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 341
+ },
+ "id": "wckNsAX-9zox",
+ "outputId": "8d5008b4-ab8f-4b42-92e7-b29e57c126cb"
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/plain": [
+ ""
+ ],
+ "image/png": "\n"
+ },
+ "metadata": {}
+ }
+ ]
+ }
+ ],
+ "metadata": {
+ "accelerator": "GPU",
+ "colab": {
+ "machine_shape": "hm",
+ "provenance": []
+ },
+ "kernelspec": {
+ "display_name": "Python 3",
+ "name": "python3"
+ },
+ "language_info": {
+ "name": "python"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
diff --git a/notebooks/spa/fine_tuning_code_llm_on_single_gpu.ipynb b/notebooks/spa/fine_tuning_code_llm_on_single_gpu.ipynb
new file mode 100644
index 00000000..153522df
--- /dev/null
+++ b/notebooks/spa/fine_tuning_code_llm_on_single_gpu.ipynb
@@ -0,0 +1,1128 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "provenance": [],
+ "machine_shape": "hm",
+ "gpuType": "A100"
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ },
+ "language_info": {
+ "name": "python"
+ },
+ "accelerator": "GPU"
+ },
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Fine-tuning a Code LLM on Custom Code on a single GPU\n",
+ "\n",
+ "_Authored by: [Maria Khalusova](https://github.com/MKhalusova)_\n",
+ "\n",
+ "Publicly available code LLMs such as Codex, StarCoder, and Code Llama are great at generating code that adheres to general programming principles and syntax, but they may not align with an organization's internal conventions, or be aware of proprietary libraries.\n",
+ "\n",
+ "In this notebook, we'll show how you can fine-tune a code LLM on private code bases to enhance its contextual awareness and make the model more useful for your organization's needs. Since code LLMs are quite large, fine-tuning them in a traditional manner can be resource-draining. Worry not! We will show how you can optimize fine-tuning to fit on a single GPU.\n",
+ "\n",
+ "\n",
+ "## Dataset\n",
+ "\n",
+ "For this example, we picked the top 10 Hugging Face public repositories on GitHub. We have excluded non-code files from the data, such as images, audio files, presentations, and so on. For Jupyter notebooks, we've kept only cells containing code. The resulting code is stored as a dataset that you can find on the Hugging Face Hub under [`smangrul/hf-stack-v1`](https://huggingface.co/datasets/smangrul/hf-stack-v1). It contains repo id, file path, and file content.\n",
+ "\n",
+ "\n",
+ "## Model\n",
+ "\n",
+ "We'll finetune [`bigcode/starcoderbase-1b`](https://huggingface.co/bigcode/starcoderbase-1b), which is a 1B parameter model trained on 80+ programming languages. This is a gated model, so if you plan to run this notebook with this exact model, you'll need to gain access to it on the model's page. Log in to your Hugging Face account to do so:"
+ ],
+ "metadata": {
+ "id": "FNdZ-kD0l78P"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "from huggingface_hub import notebook_login\n",
+ "\n",
+ "notebook_login()"
+ ],
+ "metadata": {
+ "id": "bPlCJYDK6vrF"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "To get started, let's install all the necessary libraries. As you can see, in addition to `transformers` and `datasets`, we'll be using `peft`, `bitsandbytes`, and `flash-attn` to optimize the training.\n",
+ "\n",
+ "By employing parameter-efficient training techniques, we can run this notebook on a single A100 High-RAM GPU."
+ ],
+ "metadata": {
+ "id": "WMVe_c8q43Qo"
+ }
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "Fp7i8WMCjKJG"
+ },
+ "outputs": [],
+ "source": [
+ "!pip install -q transformers datasets peft bitsandbytes flash-attn"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Let's define some variables now. Feel free to play with these."
+ ],
+ "metadata": {
+ "id": "16EdABzt3_Ig"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "MODEL=\"bigcode/starcoderbase-1b\" # Model checkpoint on the Hugging Face Hub\n",
+ "DATASET=\"smangrul/hf-stack-v1\" # Dataset on the Hugging Face Hub\n",
+ "DATA_COLUMN=\"content\" # Column name containing the code content\n",
+ "\n",
+ "SEQ_LENGTH=2048 # Sequence length\n",
+ "\n",
+ "# Training arguments\n",
+ "MAX_STEPS=2000 # max_steps\n",
+ "BATCH_SIZE=16 # batch_size\n",
+ "GR_ACC_STEPS=1 # gradient_accumulation_steps\n",
+ "LR=5e-4 # learning_rate\n",
+ "LR_SCHEDULER_TYPE=\"cosine\" # lr_scheduler_type\n",
+ "WEIGHT_DECAY=0.01 # weight_decay\n",
+ "NUM_WARMUP_STEPS=30 # num_warmup_steps\n",
+ "EVAL_FREQ=100 # eval_freq\n",
+ "SAVE_FREQ=100 # save_freq\n",
+ "LOG_FREQ=25 # log_freq\n",
+ "OUTPUT_DIR=\"peft-starcoder-lora-a100\" # output_dir\n",
+ "BF16=True # bf16\n",
+ "FP16=False # no_fp16\n",
+ "\n",
+ "# FIM transformations arguments\n",
+ "FIM_RATE=0.5 # fim_rate\n",
+ "FIM_SPM_RATE=0.5 # fim_spm_rate\n",
+ "\n",
+ "# LORA\n",
+ "LORA_R=8 # lora_r\n",
+ "LORA_ALPHA=32 # lora_alpha\n",
+ "LORA_DROPOUT=0.0 # lora_dropout\n",
+ "LORA_TARGET_MODULES=\"c_proj,c_attn,q_attn,c_fc,c_proj\" # lora_target_modules\n",
+ "\n",
+ "# bitsandbytes config\n",
+ "USE_NESTED_QUANT=True # use_nested_quant\n",
+ "BNB_4BIT_COMPUTE_DTYPE=\"bfloat16\"# bnb_4bit_compute_dtype\n",
+ "\n",
+ "SEED=0"
+ ],
+ "metadata": {
+ "id": "hru3G-CLmqis"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "from transformers import (\n",
+ " AutoModelForCausalLM,\n",
+ " AutoTokenizer,\n",
+ " Trainer,\n",
+ " TrainingArguments,\n",
+ " logging,\n",
+ " set_seed,\n",
+ " BitsAndBytesConfig,\n",
+ ")\n",
+ "\n",
+ "set_seed(SEED)"
+ ],
+ "metadata": {
+ "id": "FyZSXTbJrcnC"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Prepare the data"
+ ],
+ "metadata": {
+ "id": "pO7F5L5AtKo1"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Begin by loading the data. As the dataset is likely to be quite large, make sure to enable the streaming mode. Streaming allows us to load the data progressively as we iterate over the dataset instead of downloading the whole dataset at once.\n",
+ "\n",
+ "We'll reserve the first 4000 examples as the validation set, and everything else will be the training data."
+ ],
+ "metadata": {
+ "id": "1LmrIZqP0oUE"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "from datasets import load_dataset\n",
+ "import torch\n",
+ "from tqdm import tqdm\n",
+ "\n",
+ "\n",
+ "dataset = load_dataset(\n",
+ " DATASET,\n",
+ " data_dir=\"data\",\n",
+ " split=\"train\",\n",
+ " streaming=True,\n",
+ ")\n",
+ "\n",
+ "valid_data = dataset.take(4000)\n",
+ "train_data = dataset.skip(4000)\n",
+ "train_data = train_data.shuffle(buffer_size=5000, seed=SEED)"
+ ],
+ "metadata": {
+ "id": "4oJZvZb-1J88"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "At this step, the dataset still contains raw data with code of arbitrary length. For training, we need inputs of fixed length. Let's create an Iterable dataset that would return constant-length chunks of tokens from a stream of text files.\n",
+ "\n",
+ "First, let's estimate the average number of characters per token in the dataset, which will help us later estimate the number of tokens in the text buffer. By default, we'll only take 400 examples (`nb_examples`) from the dataset. Using only a subset of the entire dataset will reduce computational cost while still providing a reasonable estimate of the overall character-to-token ratio."
+ ],
+ "metadata": {
+ "id": "sLQ8t0LM2GR6"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)\n",
+ "\n",
+ "def chars_token_ratio(dataset, tokenizer, data_column, nb_examples=400):\n",
+ " \"\"\"\n",
+ " Estimate the average number of characters per token in the dataset.\n",
+ " \"\"\"\n",
+ "\n",
+ " total_characters, total_tokens = 0, 0\n",
+ " for _, example in tqdm(zip(range(nb_examples), iter(dataset)), total=nb_examples):\n",
+ " total_characters += len(example[data_column])\n",
+ " total_tokens += len(tokenizer(example[data_column]).tokens())\n",
+ "\n",
+ " return total_characters / total_tokens\n",
+ "\n",
+ "\n",
+ "chars_per_token = chars_token_ratio(train_data, tokenizer, DATA_COLUMN)\n",
+ "print(f\"The character to token ratio of the dataset is: {chars_per_token:.2f}\")"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "KCiAvydztNsu",
+ "outputId": "cabf7fd0-a922-4371-cbc6-60ee99ef7469"
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "100%|██████████| 400/400 [00:10<00:00, 39.87it/s] "
+ ]
+ },
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "The character to token ratio of the dataset is: 2.43\n"
+ ]
+ },
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "The character-to-token ratio can also be used as an indicator of the quality of text tokenization. For instance, a character-to-token ratio of 1.0 would mean that each character is represented with a token, which is not very meaningful. This would indicate poor tokenization. In standard English text, one token is typically equivalent to approximately four characters, meaning the character-to-token ratio is around 4.0. We can expect a lower ratio in the code dataset, but generally speaking, a number between 2.0 and 3.5 can be considered good enough."
+ ],
+ "metadata": {
+ "id": "6F13VGobB3Ma"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "**Optional FIM transformations**\n",
+ "\n",
+ "\n",
+ "Autoregressive language models typically generate sequences from left to right. By applying the FIM transformations, the model can also learn to infill text. Check out [\"Efficient Training of Language Models to Fill in the Middle\" paper](https://arxiv.org/pdf/2207.14255.pdf) to learn more about the technique.\n",
+ "We'll define the FIM transformations here and will use them when creating the Iterable Dataset. However, if you want to omit transformations, feel free to set `fim_rate` to 0."
+ ],
+ "metadata": {
+ "id": "rcwYFRPpwxea"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "import functools\n",
+ "import numpy as np\n",
+ "\n",
+ "\n",
+ "# Helper function to get token ids of the special tokens for prefix, suffix and middle for FIM transformations.\n",
+ "@functools.lru_cache(maxsize=None)\n",
+ "def get_fim_token_ids(tokenizer):\n",
+ " try:\n",
+ " FIM_PREFIX, FIM_MIDDLE, FIM_SUFFIX, FIM_PAD = tokenizer.special_tokens_map[\"additional_special_tokens\"][1:5]\n",
+ " suffix_tok_id, prefix_tok_id, middle_tok_id, pad_tok_id = (\n",
+ " tokenizer.vocab[tok] for tok in [FIM_SUFFIX, FIM_PREFIX, FIM_MIDDLE, FIM_PAD]\n",
+ " )\n",
+ " except KeyError:\n",
+ " suffix_tok_id, prefix_tok_id, middle_tok_id, pad_tok_id = None, None, None, None\n",
+ " return suffix_tok_id, prefix_tok_id, middle_tok_id, pad_tok_id\n",
+ "\n",
+ "\n",
+ "## Adapted from https://github.com/bigcode-project/Megatron-LM/blob/6c4bf908df8fd86b4977f54bf5b8bd4b521003d1/megatron/data/gpt_dataset.py\n",
+ "def permute(\n",
+ " sample,\n",
+ " np_rng,\n",
+ " suffix_tok_id,\n",
+ " prefix_tok_id,\n",
+ " middle_tok_id,\n",
+ " pad_tok_id,\n",
+ " fim_rate=0.5,\n",
+ " fim_spm_rate=0.5,\n",
+ " truncate_or_pad=False,\n",
+ "):\n",
+ " \"\"\"\n",
+ " Take in a sample (list of tokens) and perform a FIM transformation on it with a probability of fim_rate, using two FIM modes:\n",
+ " PSM and SPM (with a probability of fim_spm_rate).\n",
+ " \"\"\"\n",
+ "\n",
+ " # The if condition will trigger with the probability of fim_rate\n",
+ " # This means FIM transformations will apply to samples with a probability of fim_rate\n",
+ " if np_rng.binomial(1, fim_rate):\n",
+ "\n",
+ " # Split the sample into prefix, middle, and suffix, based on randomly generated indices stored in the boundaries list.\n",
+ " boundaries = list(np_rng.randint(low=0, high=len(sample) + 1, size=2))\n",
+ " boundaries.sort()\n",
+ "\n",
+ " prefix = np.array(sample[: boundaries[0]], dtype=np.int64)\n",
+ " middle = np.array(sample[boundaries[0] : boundaries[1]], dtype=np.int64)\n",
+ " suffix = np.array(sample[boundaries[1] :], dtype=np.int64)\n",
+ "\n",
+ " if truncate_or_pad:\n",
+ " # calculate the new total length of the sample, taking into account tokens indicating prefix, middle, and suffix\n",
+ " new_length = suffix.shape[0] + prefix.shape[0] + middle.shape[0] + 3\n",
+ " diff = new_length - len(sample)\n",
+ "\n",
+ " # truncate or pad if there's a difference in length between the new length and the original\n",
+ " if diff > 0:\n",
+ " if suffix.shape[0] <= diff:\n",
+ " return sample, np_rng\n",
+ " suffix = suffix[: suffix.shape[0] - diff]\n",
+ " elif diff < 0:\n",
+ " suffix = np.concatenate([suffix, np.full((-1 * diff), pad_tok_id)])\n",
+ "\n",
+ " # With the probability of fim_spm_rate, apply the SPM variant of FIM transformations\n",
+ " # SPM: suffix, prefix, middle\n",
+ " if np_rng.binomial(1, fim_spm_rate):\n",
+ " new_sample = np.concatenate(\n",
+ " [\n",
+ " [prefix_tok_id, suffix_tok_id],\n",
+ " suffix,\n",
+ " [middle_tok_id],\n",
+ " prefix,\n",
+ " middle,\n",
+ " ]\n",
+ " )\n",
+ " # Otherwise, apply the PSM variant of FIM transformations\n",
+ " # PSM: prefix, suffix, middle\n",
+ " else:\n",
+ "\n",
+ " new_sample = np.concatenate(\n",
+ " [\n",
+ " [prefix_tok_id],\n",
+ " prefix,\n",
+ " [suffix_tok_id],\n",
+ " suffix,\n",
+ " [middle_tok_id],\n",
+ " middle,\n",
+ " ]\n",
+ " )\n",
+ " else:\n",
+ " # don't apply FIM transformations\n",
+ " new_sample = sample\n",
+ "\n",
+ " return list(new_sample), np_rng\n"
+ ],
+ "metadata": {
+ "id": "zmejYvEKw1E-"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Let's define the `ConstantLengthDataset`, an Iterable dataset that will return constant-length chunks of tokens. To do so, we'll read a buffer of text from the original dataset until we hit the size limit and then apply the tokenizer to convert the raw text into tokenized inputs. Optionally, we'll perform FIM transformations on some sequences (the proportion of sequences affected is controlled by `fim_rate`).\n",
+ "\n",
+ "Once defined, we can create instances of the `ConstantLengthDataset` from both training and validation data."
+ ],
+ "metadata": {
+ "id": "AwW5FviD9xBH"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "from torch.utils.data import IterableDataset\n",
+ "from torch.utils.data.dataloader import DataLoader\n",
+ "import random\n",
+ "\n",
+ "# Create an Iterable dataset that returns constant-length chunks of tokens from a stream of text files.\n",
+ "\n",
+ "class ConstantLengthDataset(IterableDataset):\n",
+ " \"\"\"\n",
+ " Iterable dataset that returns constant length chunks of tokens from stream of text files.\n",
+ " Args:\n",
+ " tokenizer (Tokenizer): The processor used for processing the data.\n",
+ " dataset (dataset.Dataset): Dataset with text files.\n",
+ " infinite (bool): If True the iterator is reset after dataset reaches end else stops.\n",
+ " seq_length (int): Length of token sequences to return.\n",
+ " num_of_sequences (int): Number of token sequences to keep in buffer.\n",
+ " chars_per_token (float): Number of characters per token used to estimate number of tokens in text buffer.\n",
+ " fim_rate (float): Probability (0.0 to 1.0) that a sample will be permuted with FIM.\n",
+ " fim_spm_rate (float): Probability (0.0 to 1.0) that a FIM permutation will use SPM.\n",
+ " seed (int): Seed for random number generator.\n",
+ " \"\"\"\n",
+ "\n",
+ " def __init__(\n",
+ " self,\n",
+ " tokenizer,\n",
+ " dataset,\n",
+ " infinite=False,\n",
+ " seq_length=1024,\n",
+ " num_of_sequences=1024,\n",
+ " chars_per_token=3.6,\n",
+ " content_field=\"content\",\n",
+ " fim_rate=0.5,\n",
+ " fim_spm_rate=0.5,\n",
+ " seed=0,\n",
+ " ):\n",
+ " self.tokenizer = tokenizer\n",
+ " self.concat_token_id = tokenizer.eos_token_id\n",
+ " self.dataset = dataset\n",
+ " self.seq_length = seq_length\n",
+ " self.infinite = infinite\n",
+ " self.current_size = 0\n",
+ " self.max_buffer_size = seq_length * chars_per_token * num_of_sequences\n",
+ " self.content_field = content_field\n",
+ " self.fim_rate = fim_rate\n",
+ " self.fim_spm_rate = fim_spm_rate\n",
+ " self.seed = seed\n",
+ "\n",
+ " (\n",
+ " self.suffix_tok_id,\n",
+ " self.prefix_tok_id,\n",
+ " self.middle_tok_id,\n",
+ " self.pad_tok_id,\n",
+ " ) = get_fim_token_ids(self.tokenizer)\n",
+ " if not self.suffix_tok_id and self.fim_rate > 0:\n",
+ " print(\"FIM is not supported by tokenizer, disabling FIM\")\n",
+ " self.fim_rate = 0\n",
+ "\n",
+ " def __iter__(self):\n",
+ " iterator = iter(self.dataset)\n",
+ " more_examples = True\n",
+ " np_rng = np.random.RandomState(seed=self.seed)\n",
+ " while more_examples:\n",
+ " buffer, buffer_len = [], 0\n",
+ " while True:\n",
+ " if buffer_len >= self.max_buffer_size:\n",
+ " break\n",
+ " try:\n",
+ " buffer.append(next(iterator)[self.content_field])\n",
+ " buffer_len += len(buffer[-1])\n",
+ " except StopIteration:\n",
+ " if self.infinite:\n",
+ " iterator = iter(self.dataset)\n",
+ " else:\n",
+ " more_examples = False\n",
+ " break\n",
+ " tokenized_inputs = self.tokenizer(buffer, truncation=False)[\"input_ids\"]\n",
+ " all_token_ids = []\n",
+ "\n",
+ " for tokenized_input in tokenized_inputs:\n",
+ " # optionally do FIM permutations\n",
+ " if self.fim_rate > 0:\n",
+ " tokenized_input, np_rng = permute(\n",
+ " tokenized_input,\n",
+ " np_rng,\n",
+ " self.suffix_tok_id,\n",
+ " self.prefix_tok_id,\n",
+ " self.middle_tok_id,\n",
+ " self.pad_tok_id,\n",
+ " fim_rate=self.fim_rate,\n",
+ " fim_spm_rate=self.fim_spm_rate,\n",
+ " truncate_or_pad=False,\n",
+ " )\n",
+ "\n",
+ " all_token_ids.extend(tokenized_input + [self.concat_token_id])\n",
+ " examples = []\n",
+ " for i in range(0, len(all_token_ids), self.seq_length):\n",
+ " input_ids = all_token_ids[i : i + self.seq_length]\n",
+ " if len(input_ids) == self.seq_length:\n",
+ " examples.append(input_ids)\n",
+ " random.shuffle(examples)\n",
+ " for example in examples:\n",
+ " self.current_size += 1\n",
+ " yield {\n",
+ " \"input_ids\": torch.LongTensor(example),\n",
+ " \"labels\": torch.LongTensor(example),\n",
+ " }\n",
+ "\n",
+ "\n",
+ "train_dataset = ConstantLengthDataset(\n",
+ " tokenizer,\n",
+ " train_data,\n",
+ " infinite=True,\n",
+ " seq_length=SEQ_LENGTH,\n",
+ " chars_per_token=chars_per_token,\n",
+ " content_field=DATA_COLUMN,\n",
+ " fim_rate=FIM_RATE,\n",
+ " fim_spm_rate=FIM_SPM_RATE,\n",
+ " seed=SEED,\n",
+ ")\n",
+ "eval_dataset = ConstantLengthDataset(\n",
+ " tokenizer,\n",
+ " valid_data,\n",
+ " infinite=False,\n",
+ " seq_length=SEQ_LENGTH,\n",
+ " chars_per_token=chars_per_token,\n",
+ " content_field=DATA_COLUMN,\n",
+ " fim_rate=FIM_RATE,\n",
+ " fim_spm_rate=FIM_SPM_RATE,\n",
+ " seed=SEED,\n",
+ ")"
+ ],
+ "metadata": {
+ "id": "AgDW-692wzOl"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Prepare the model"
+ ],
+ "metadata": {
+ "id": "rxev1sk6tRW9"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Now that the data is prepared, it's time to load the model! We're going to load the quantized version of the model.\n",
+ "\n",
+ "This will allow us to reduce memory usage, as quantization represents data with fewer bits. We'll use the `bitsandbytes` library to quantize the model, as it has a nice integration with `transformers`. All we need to do is define a `bitsandbytes` config, and then use it when loading the model.\n",
+ "\n",
+ "There are different variants of 4-bit quantization, but generally, we recommend using NF4 quantization for better performance (`bnb_4bit_quant_type=\"nf4\"`).\n",
+ "\n",
+ "The `bnb_4bit_use_double_quant` option adds a second quantization after the first one to save an additional 0.4 bits per parameter.\n",
+ "\n",
+ "To learn more about quantization, check out the [\"Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA\" blog post](https://huggingface.co/blog/4bit-transformers-bitsandbytes).\n",
+ "\n",
+ "Once defined, pass the config to the `from_pretrained` method to load the quantized version of the model."
+ ],
+ "metadata": {
+ "id": "UCtWV-U42Eq_"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training\n",
+ "from peft.tuners.lora import LoraLayer\n",
+ "\n",
+ "load_in_8bit = False\n",
+ "\n",
+ "# 4-bit quantization\n",
+ "compute_dtype = getattr(torch, BNB_4BIT_COMPUTE_DTYPE)\n",
+ "\n",
+ "bnb_config = BitsAndBytesConfig(\n",
+ " load_in_4bit=True,\n",
+ " bnb_4bit_quant_type=\"nf4\",\n",
+ " bnb_4bit_compute_dtype=compute_dtype,\n",
+ " bnb_4bit_use_double_quant=USE_NESTED_QUANT,\n",
+ ")\n",
+ "\n",
+ "device_map = {\"\": 0}\n",
+ "\n",
+ "model = AutoModelForCausalLM.from_pretrained(\n",
+ " MODEL,\n",
+ " load_in_8bit=load_in_8bit,\n",
+ " quantization_config=bnb_config,\n",
+ " device_map=device_map,\n",
+ " use_cache=False, # We will be using gradient checkpointing\n",
+ " trust_remote_code=True,\n",
+ " use_flash_attention_2=True,\n",
+ ")\n"
+ ],
+ "metadata": {
+ "id": "XuwoX6U2DUvK"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "When using a quantized model for training, you need to call the `prepare_model_for_kbit_training()` function to preprocess the quantized model for training."
+ ],
+ "metadata": {
+ "id": "bO9e2FV8D8ZF"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "model = prepare_model_for_kbit_training(model)"
+ ],
+ "metadata": {
+ "id": "Qb_eB4xzEDBk"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Now that the quantized model is ready, we can set up a LoRA configuration. LoRA makes fine-tuning more efficient by drastically reducing the number of trainable parameters.\n",
+ "\n",
+ "To train a model using the LoRA technique, we need to wrap the base model as a `PeftModel`. This involves defining the LoRA configuration with `LoraConfig`, and wrapping the original model with `get_peft_model()` using the `LoraConfig`.\n",
+ "\n",
+ "To learn more about LoRA and its parameters, refer to [PEFT documentation](https://huggingface.co/docs/peft/conceptual_guides/lora)."
+ ],
+ "metadata": {
+ "id": "lmnLjPZpDVtg"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Set up lora\n",
+ "peft_config = LoraConfig(\n",
+ " lora_alpha=LORA_ALPHA,\n",
+ " lora_dropout=LORA_DROPOUT,\n",
+ " r=LORA_R,\n",
+ " bias=\"none\",\n",
+ " task_type=\"CAUSAL_LM\",\n",
+ " target_modules=LORA_TARGET_MODULES.split(\",\"),\n",
+ ")\n",
+ "\n",
+ "model = get_peft_model(model, peft_config)\n",
+ "model.print_trainable_parameters()"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "_pAUU2FR2Gey",
+ "outputId": "63328c2b-e693-49b1-ce0a-3ca8722f852a"
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "trainable params: 5,554,176 || all params: 1,142,761,472 || trainable%: 0.4860310866343243\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "As you can see, by applying the LoRA technique we now need to train less than 1% of the parameters."
+ ],
+ "metadata": {
+ "id": "tHe7AElXzXVV"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Train the model"
+ ],
+ "metadata": {
+ "id": "T_CqVydc40IM"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Now that we have prepared the data, and optimized the model, we are ready to bring everything together to start the training.\n",
+ "\n",
+ "To instantiate a `Trainer`, you need to define the training configuration. The most important is the `TrainingArguments`, which is a class that contains all the attributes to configure the training.\n",
+ "\n",
+ "These are similar to any other kind of model training you may run, so we won't go into detail here."
+ ],
+ "metadata": {
+ "id": "Q_iN2khjrbD3"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "train_data.start_iteration = 0\n",
+ "\n",
+ "\n",
+ "training_args = TrainingArguments(\n",
+ " output_dir=f\"Your_HF_username/{OUTPUT_DIR}\",\n",
+ " dataloader_drop_last=True,\n",
+ " evaluation_strategy=\"steps\",\n",
+ " save_strategy=\"steps\",\n",
+ " max_steps=MAX_STEPS,\n",
+ " eval_steps=EVAL_FREQ,\n",
+ " save_steps=SAVE_FREQ,\n",
+ " logging_steps=LOG_FREQ,\n",
+ " per_device_train_batch_size=BATCH_SIZE,\n",
+ " per_device_eval_batch_size=BATCH_SIZE,\n",
+ " learning_rate=LR,\n",
+ " lr_scheduler_type=LR_SCHEDULER_TYPE,\n",
+ " warmup_steps=NUM_WARMUP_STEPS,\n",
+ " gradient_accumulation_steps=GR_ACC_STEPS,\n",
+ " gradient_checkpointing=True,\n",
+ " fp16=FP16,\n",
+ " bf16=BF16,\n",
+ " weight_decay=WEIGHT_DECAY,\n",
+ " push_to_hub=True,\n",
+ " include_tokens_per_second=True,\n",
+ ")\n"
+ ],
+ "metadata": {
+ "id": "65QHS8l1tKQe"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "As a final step, instantiate the `Trainer` and call the `train` method. "
+ ],
+ "metadata": {
+ "id": "kB_fLRex09ut"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "trainer = Trainer(\n",
+ " model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset\n",
+ ")\n",
+ "\n",
+ "print(\"Training...\")\n",
+ "trainer.train()\n"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 1000
+ },
+ "id": "rS3nVwhUC69O",
+ "outputId": "61a5bdb2-b7d0-4aed-8290-4bf20c2ccd38"
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "metadata": {
+ "tags": null
+ },
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Training...\n"
+ ]
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/plain": [
+ ""
+ ],
+ "text/html": [
+ "\n",
+ ""
+ ]
+ },
+ "metadata": {}
+ },
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "TrainOutput(global_step=2000, training_loss=4.885598585128784, metrics={'train_runtime': 15380.3075, 'train_samples_per_second': 2.081, 'train_steps_per_second': 0.13, 'train_tokens_per_second': 4261.033, 'total_flos': 4.0317260660736e+17, 'train_loss': 4.885598585128784, 'epoch': 1.0})"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 19
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Finally, you can push the fine-tuned model to your Hub repository to share with your team."
+ ],
+ "metadata": {
+ "id": "aAERlCnt1PEW"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "trainer.push_to_hub()"
+ ],
+ "metadata": {
+ "id": "1h7_AUTTDwE1"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Inference\n",
+ "\n",
+ "Once the model is uploaded to the Hub, we can use it for inference. To do so, we first initialize the original base model and its tokenizer. Next, we need to merge the fine-tuned weights with the base model."
+ ],
+ "metadata": {
+ "id": "KBVH7uFOM_UF"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "from peft import PeftModel\n",
+ "import torch\n",
+ "\n",
+ "# load the original model first\n",
+ "tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)\n",
+ "base_model = AutoModelForCausalLM.from_pretrained(\n",
+ " MODEL,\n",
+ " quantization_config=None,\n",
+ " device_map=None,\n",
+ " trust_remote_code=True,\n",
+ " torch_dtype=torch.bfloat16,\n",
+ ").cuda()\n",
+ "\n",
+ "# merge fine-tuned weights with the base model\n",
+ "peft_model_id = f\"Your_HF_username/{OUTPUT_DIR}\"\n",
+ "model = PeftModel.from_pretrained(base_model, peft_model_id)\n",
+ "model = model.merge_and_unload()"
+ ],
+ "metadata": {
+ "id": "jtL37piINBFe"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Now we can use the merged model for inference. For convenience, we'll define a `get_code_completion` function. Feel free to experiment with text generation parameters!"
+ ],
+ "metadata": {
+ "id": "3USQ2suvDi9M"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "def get_code_completion(prefix, suffix):\n",
+ " text = prompt = f\"\"\"{prefix}{suffix}\"\"\"\n",
+ " model.eval()\n",
+ " outputs = model.generate(\n",
+ " input_ids=tokenizer(text, return_tensors=\"pt\").input_ids.cuda(),\n",
+ " max_new_tokens=128,\n",
+ " temperature=0.2,\n",
+ " top_k=50,\n",
+ " top_p=0.95,\n",
+ " do_sample=True,\n",
+ " repetition_penalty=1.0,\n",
+ " )\n",
+ " return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]"
+ ],
+ "metadata": {
+ "id": "RoTGpNbjDeWI"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Now all we need to do to get code completion is call the `get_code_completion` function, passing the first few lines that we want completed as the prefix and an empty string as the suffix."
+ ],
+ "metadata": {
+ "id": "0kMJiGDfDrBf"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "prefix = \"\"\"from peft import LoraConfig, TaskType, get_peft_model\n",
+ "from transformers import AutoModelForCausalLM\n",
+ "peft_config = LoraConfig(\n",
+ "\"\"\"\n",
+ "suffix =\"\"\"\"\"\"\n",
+ "\n",
+ "print(get_code_completion(prefix, suffix))"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "nXlco2_-YcvM",
+ "outputId": "41c411ad-b7dc-4277-f975-c173888234bb"
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "from peft import LoraConfig, TaskType, get_peft_model\n",
+ "from transformers import AutoModelForCausalLM\n",
+ "peft_config = LoraConfig(\n",
+ " task_type=TaskType.CAUSAL_LM,\n",
+ " r=8,\n",
+ " lora_alpha=32,\n",
+ " target_modules=[\"q_proj\", \"v_proj\"],\n",
+ " lora_dropout=0.1,\n",
+ " bias=\"none\",\n",
+ " modules_to_save=[\"q_proj\", \"v_proj\"],\n",
+ " inference_mode=False,\n",
+ ")\n",
+ "model = AutoModelForCausalLM.from_pretrained(\"gpt2\")\n",
+ "model = get_peft_model(model, peft_config)\n",
+ "model.print_trainable_parameters()\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "As someone who has just used the PEFT library earlier in this notebook, you can see that the generated result for creating a `LoraConfig` is rather good!\n",
+ "\n",
+ "If you go back to the cell where we instantiate the model for inference, and comment out the lines where we merge the fine-tuned weights, you can see what the original model would've generated for the exact same prefix:"
+ ],
+ "metadata": {
+ "id": "Ql2563kGlnmu"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "prefix = \"\"\"from peft import LoraConfig, TaskType, get_peft_model\n",
+ "from transformers import AutoModelForCausalLM\n",
+ "peft_config = LoraConfig(\n",
+ "\"\"\"\n",
+ "suffix =\"\"\"\"\"\"\n",
+ "\n",
+ "print(get_code_completion(prefix, suffix))"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "29xxp1eHTgJ9",
+ "outputId": "c6d597a2-01da-4d25-a32f-3a551212c5b4"
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "from peft import LoraConfig, TaskType, get_peft_model\n",
+ "from transformers import AutoModelForCausalLM\n",
+ "peft_config = LoraConfig(\n",
+ " model_name_or_path=\"facebook/wav2vec2-base-960h\",\n",
+ " num_labels=1,\n",
+ " num_features=1,\n",
+ " num_hidden_layers=1,\n",
+ " num_attention_heads=1,\n",
+ " num_hidden_layers_per_attention_head=1,\n",
+ " num_attention_heads_per_hidden_layer=1,\n",
+ " hidden_size=1024,\n",
+ " hidden_dropout_prob=0.1,\n",
+ " hidden_act=\"gelu\",\n",
+ " hidden_act_dropout_prob=0.1,\n",
+ " hidden\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "While it is Python syntax, you can see that the original model has no understanding of what a `LoraConfig` should be doing."
+ ],
+ "metadata": {
+ "id": "Pwy2ZC7U8Ema"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "To learn how this kind of fine-tuning compares to full fine-tuning, and how to use a model like this as your copilot in VS Code via Inference Endpoints, or locally, check out the [\"Personal Copilot: Train Your Own Coding Assistant\" blog post](https://huggingface.co/blog/personal-copilot). This notebook complements the original blog post.\n"
+ ],
+ "metadata": {
+ "id": "CATYE8pp2drQ"
+ }
+ }
+ ]
+}
diff --git a/notebooks/spa/index.md b/notebooks/spa/index.md
new file mode 100644
index 00000000..713aacbc
--- /dev/null
+++ b/notebooks/spa/index.md
@@ -0,0 +1,31 @@
+# Open-Source AI Cookbook
+
+Open-Source AI Cookbook is a collection of notebooks illustrating practical aspects of building AI applications and solving various machine learning tasks using open-source tools and models.
+
+## Latest notebooks
+
+Check out the recently added notebooks:
+
+- [Using LLM-as-a-judge 🧑⚖️ for an automated and versatile evaluation](llm_judge)
+- [Create a legal preference dataset](pipeline_notus_instructions_preferences_legal)
+- [Suggestions for Data Annotation with SetFit in Zero-shot Text Classification](labelling_feedback_setfit)
+- [Implementing semantic cache to improve a RAG system](semantic_cache_chroma_vector_database)
+- [Building A RAG Ebook "Librarian" Using LlamaIndex](rag_llamaindex_librarian)
+- [Stable Diffusion Interpolation](stable_diffusion_interpolation)
+- [Building A RAG System with Gemma, MongoDB and Open Source Models](rag_with_hugging_face_gemma_mongodb)
+- [Prompt Tuning with PEFT Library](prompt_tuning_peft)
+- [Migrating from OpenAI to Open LLMs Using TGI's Messages API](tgi_messages_api_demo)
+- [Automatic Embeddings with TEI through Inference Endpoints](automatic_embedding_tei_inference_endpoints)
+- [Simple RAG for GitHub issues using Hugging Face Zephyr and LangChain](rag_zephyr_langchain)
+- [Embedding multimodal data for similarity search using 🤗 transformers, 🤗 datasets and FAISS](faiss_with_hf_datasets_and_clip)
+- [Fine-tuning a Code LLM on Custom Code on a single GPU](fine_tuning_code_llm_on_single_gpu)
+- [RAG Evaluation Using Synthetic data and LLM-As-A-Judge](rag_evaluation)
+- [Advanced RAG on HuggingFace documentation using LangChain](advanced_rag)
+- [Detecting Issues in a Text Dataset with Cleanlab](issues_in_text_dataset)
+
+You can also check out the notebooks in the cookbook's [GitHub repo](https://github.com/huggingface/cookbook).
+
+## Contributing
+
+The Open-Source AI Cookbook is a community effort, and we welcome contributions from everyone!
+Check out the cookbook's [Contribution guide](https://github.com/huggingface/cookbook/blob/main/README.md) to learn how you can add your "recipe".
diff --git a/notebooks/spa/issues_in_text_dataset.ipynb b/notebooks/spa/issues_in_text_dataset.ipynb
new file mode 100644
index 00000000..e568de20
--- /dev/null
+++ b/notebooks/spa/issues_in_text_dataset.ipynb
@@ -0,0 +1,3360 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "pw6cvzTocw4G"
+ },
+ "source": [
+ "# Detecting Issues in a Text Dataset with Cleanlab\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "0yPBE0Xccw4J"
+ },
+ "source": [
+ "Authored by: [Aravind Putrevu](https://huggingface.co/aravindputrevu)\n",
+ "\n",
+ "\n",
+ "In this 5-minute quickstart tutorial, we use Cleanlab to detect various issues in an intent classification dataset composed of (text) customer service requests at an online bank. We consider a subset of the [Banking77-OOS Dataset](https://arxiv.org/abs/2106.04564) containing 1,000 customer service requests, which are classified into 7 categories based on their intent (you can run this same code on any text classification dataset). [Cleanlab](https://github.com/cleanlab/cleanlab) automatically identifies bad examples in our dataset, including mislabeled data, out-of-scope examples (outliers), or otherwise ambiguous examples. Consider filtering or correcting such bad examples before you dive deep into modeling your data!\n",
+ "\n",
+ "**Overview of what we'll do in this tutorial:**\n",
+ "\n",
+ "- Use a pretrained transformer model to extract the text embeddings from the customer service requests\n",
+ "\n",
+ "- Train a simple Logistic Regression model on the text embeddings to compute out-of-sample predicted probabilities\n",
+ "\n",
+ "- Run Cleanlab's `Datalab` audit with these predictions and embeddings in order to identify problems like: label issues, outliers, and near duplicates in the dataset."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "o__pRLFYcw4K"
+ },
+ "source": [
+ "\n",
+ "## Quickstart\n",
+ "\n",
+ " \n",
+ "Already have (out-of-sample) `pred_probs` from a model trained on an existing set of labels? Maybe you have some numeric `features` as well? Run the code below to find any potential label errors in your dataset.\n",
+ "\n",
+ "**Note:** If running on Colab, you may want to use a GPU (select: Runtime > Change runtime type > Hardware accelerator > GPU).\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "from cleanlab import Datalab\n",
+ "\n",
+ "lab = Datalab(data=your_dataset, label_name=\"column_name_of_labels\")\n",
+ "lab.find_issues(pred_probs=your_pred_probs, features=your_features)\n",
+ "\n",
+ "lab.report()\n",
+ "lab.get_issues()\n"
+ ],
+ "metadata": {
+ "id": "qaZA0cFs1fW4"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "dp4lpApmcw4K"
+ },
+ "source": [
+ "## Install required dependencies\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "DjoWBgGAcw4K"
+ },
+ "source": [
+ "You can use `pip` to install all packages required for this tutorial as follows:\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "!pip install -U scikit-learn sentence-transformers datasets\n",
+ "!pip install -U \"cleanlab[datalab]\""
+ ],
+ "metadata": {
+ "id": "fRsBIj3L_RUb"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "execution": {
+ "iopub.execute_input": "2024-02-16T06:26:13.467211Z",
+ "iopub.status.busy": "2024-02-16T06:26:13.466877Z",
+ "iopub.status.idle": "2024-02-16T06:26:13.470222Z",
+ "shell.execute_reply": "2024-02-16T06:26:13.469761Z"
+ },
+ "id": "zgezWF-2cw4L"
+ },
+ "outputs": [],
+ "source": [
+ "import re\n",
+ "import string\n",
+ "import pandas as pd\n",
+ "from sklearn.metrics import accuracy_score, log_loss\n",
+ "from sklearn.model_selection import cross_val_predict\n",
+ "from sklearn.linear_model import LogisticRegression\n",
+ "from sentence_transformers import SentenceTransformer\n",
+ "\n",
+ "from cleanlab import Datalab"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "execution": {
+ "iopub.execute_input": "2024-02-16T06:26:13.472374Z",
+ "iopub.status.busy": "2024-02-16T06:26:13.471951Z",
+ "iopub.status.idle": "2024-02-16T06:26:13.475065Z",
+ "shell.execute_reply": "2024-02-16T06:26:13.474625Z"
+ },
+ "nbsphinx": "hidden",
+ "id": "mO3pnA1ncw4L"
+ },
+ "outputs": [],
+ "source": [
+ "import random\n",
+ "import numpy as np\n",
+ "\n",
+ "pd.set_option(\"display.max_colwidth\", None)\n",
+ "\n",
+ "SEED = 123456 # for reproducibility\n",
+ "np.random.seed(SEED)\n",
+ "random.seed(SEED)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "yj_5JcO1cw4L"
+ },
+ "source": [
+ "## Load and format the text dataset\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "execution": {
+ "iopub.execute_input": "2024-02-16T06:26:13.476949Z",
+ "iopub.status.busy": "2024-02-16T06:26:13.476773Z",
+ "iopub.status.idle": "2024-02-16T06:26:13.502278Z",
+ "shell.execute_reply": "2024-02-16T06:26:13.501755Z"
+ },
+ "id": "HztO4qU9cw4L",
+ "outputId": "c6ff9e95-6326-413e-a72f-6f3c05af1055",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 206
+ }
+ },
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ " text label\n",
+ "0 I am still waiting on my card? 11\n",
+ "1 What can I do if my card still hasn't arrived after 2 weeks? 11\n",
+ "2 I have been waiting over a week. Is the card still coming? 11\n",
+ "3 Can I track my card while it is in the process of delivery? 11\n",
+ "4 How do I know if I will get my card, or if it is lost? 11"
+ ],
+ "text/html": [
+ "\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr><th></th><th>text</th><th>label</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>0</th><td>I am still waiting on my card?</td><td>11</td></tr>\n",
+ "    <tr><th>1</th><td>What can I do if my card still hasn't arrived after 2 weeks?</td><td>11</td></tr>\n",
+ "    <tr><th>2</th><td>I have been waiting over a week. Is the card still coming?</td><td>11</td></tr>\n",
+ "    <tr><th>3</th><td>Can I track my card while it is in the process of delivery?</td><td>11</td></tr>\n",
+ "    <tr><th>4</th><td>How do I know if I will get my card, or if it is lost?</td><td>11</td></tr>\n",
+ "  </tbody>\n",
+ "</table>"
+ ],
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "dataframe",
+ "variable_name": "data",
+ "summary": "{\n \"name\": \"data\",\n \"rows\": 1000,\n \"fields\": [\n {\n \"column\": \"text\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1000,\n \"samples\": [\n \"I made an international purchase, but the exchange rate was wrong\",\n \"I would like to know why a withdraw I made for some cash shows up as pending.\",\n \"I tried to get cash out of the ATM but it is taking too long\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"label\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 12,\n \"min\": 11,\n \"max\": 46,\n \"num_unique_values\": 7,\n \"samples\": [\n 11,\n 13,\n 46\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
+ }
+ },
+ "metadata": {},
+ "execution_count": 24
+ }
+ ],
+ "source": [
+ "from datasets import load_dataset\n",
+ "\n",
+ "dataset = load_dataset(\"PolyAI/banking77\", split=\"train\")\n",
+ "data = pd.DataFrame(dataset[:1000])\n",
+ "data.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "execution": {
+ "iopub.execute_input": "2024-02-16T06:26:13.504463Z",
+ "iopub.status.busy": "2024-02-16T06:26:13.504049Z",
+ "iopub.status.idle": "2024-02-16T06:26:13.508243Z",
+ "shell.execute_reply": "2024-02-16T06:26:13.507706Z"
+ },
+ "id": "Ujp0luqRcw4M",
+ "outputId": "b438fed5-aa75-450d-dc84-0b3398960487",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ }
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "This dataset has 7 classes.\n",
+ "Classes: {32, 34, 36, 11, 13, 46, 17}\n"
+ ]
+ }
+ ],
+ "source": [
+ "raw_texts, labels = data[\"text\"].values, data[\"label\"].values\n",
+ "num_classes = len(set(labels))\n",
+ "\n",
+ "print(f\"This dataset has {num_classes} classes.\")\n",
+ "print(f\"Classes: {set(labels)}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "PVza57cecw4M"
+ },
+ "source": [
+ "Let's view the i-th example in the dataset:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "execution": {
+ "iopub.execute_input": "2024-02-16T06:26:13.510435Z",
+ "iopub.status.busy": "2024-02-16T06:26:13.510163Z",
+ "iopub.status.idle": "2024-02-16T06:26:13.513358Z",
+ "shell.execute_reply": "2024-02-16T06:26:13.512906Z"
+ },
+ "id": "lXHi90Kecw4M",
+ "outputId": "af8a9b19-986f-44fe-c564-dd83e400309e",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ }
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Example Label: 11\n",
+ "Example Text: What can I do if my card still hasn't arrived after 2 weeks?\n"
+ ]
+ }
+ ],
+ "source": [
+ "i = 1 # change this to view other examples from the dataset\n",
+ "print(f\"Example Label: {labels[i]}\")\n",
+ "print(f\"Example Text: {raw_texts[i]}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "JH7UU9Wscw4M"
+ },
+ "source": [
+ "The data is stored as two numpy arrays:\n",
+ "\n",
+ "1. `raw_texts` stores the customer service request utterances in text format\n",
+ "2. `labels` stores the intent categories (labels) for each example"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "T0d80apCcw4M"
+ },
+ "source": [
+ "**Bringing Your Own Data (BYOD)?**\n",
+ "\n",
+ "You can easily replace the above with your own text dataset and continue with the rest of the tutorial."
+ ]
+ },
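+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a rough sketch of what bringing your own data could look like, the snippet below builds the same two arrays from a small in-memory DataFrame. The example rows, the string label names, and the `my_data` variable are purely illustrative assumptions; in practice you would load your own file (for instance with `pd.read_csv`) and use its actual column names.\n",
+ "\n",
+ "```python\n",
+ "import pandas as pd\n",
+ "\n",
+ "# Hypothetical stand-in for your own dataset: one text column, one label column.\n",
+ "my_data = pd.DataFrame(\n",
+ "    {\n",
+ "        \"text\": [\"Where is my new card?\", \"Why was I charged twice?\"],\n",
+ "        \"label\": [\"card_arrival\", \"duplicate_charge\"],\n",
+ "    }\n",
+ ")\n",
+ "\n",
+ "# Same structure as in the tutorial: two aligned NumPy arrays.\n",
+ "raw_texts, labels = my_data[\"text\"].values, my_data[\"label\"].values\n",
+ "num_classes = len(set(labels))\n",
+ "print(f\"This dataset has {num_classes} classes.\")\n",
+ "```\n",
+ "\n",
+ "Any dataset that yields one text string and one label per row should fit the rest of the tutorial in the same way."
+ ]
+ },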
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "YLDeD09Ncw4M"
+ },
+ "source": [
+ "Next we convert the text strings into vectors better suited as inputs for our ML models.\n",
+ "\n",
+ "We will use numeric representations from a pretrained Transformer model as embeddings of our text. The [Sentence Transformers](https://huggingface.co/docs/hub/sentence-transformers) library offers simple methods to compute these embeddings for text data. Here, we load the pretrained `electra-small-discriminator` model and then run our data through the network to extract a vector embedding of each example."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "execution": {
+ "iopub.execute_input": "2024-02-16T06:26:13.515306Z",
+ "iopub.status.busy": "2024-02-16T06:26:13.515126Z",
+ "iopub.status.idle": "2024-02-16T06:26:18.244024Z",
+ "shell.execute_reply": "2024-02-16T06:26:18.243354Z"
+ },
+ "id": "DbDb6Ni6cw4M"
+ },
+ "outputs": [],
+ "source": [
+ "transformer = SentenceTransformer('google/electra-small-discriminator')\n",
+ "text_embeddings = transformer.encode(raw_texts)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Moz0KJvzcw4M"
+ },
+ "source": [
+ "Our subsequent ML model will directly operate on elements of `text_embeddings` in order to classify the customer service requests."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "4FK2Q72gcw4M"
+ },
+ "source": [
+ "## Define a classification model and compute out-of-sample predicted probabilities"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "yaicOGrhcw4N"
+ },
+ "source": [
+ "A typical way to leverage pretrained networks for a particular classification task is to add a linear output layer and fine-tune the network parameters on the new data. However, this can be computationally intensive. Alternatively, we can freeze the pretrained weights of the network and train only the output layer, which does not require GPU(s). Here we do this conveniently by fitting a scikit-learn linear model on top of the extracted embeddings.\n",
+ "\n",
+ "To identify label issues, cleanlab requires a probabilistic prediction from your model for each datapoint. However, these predictions will be _overfit_ (and thus unreliable) for datapoints the model was previously trained on. cleanlab is intended to be used only with **out-of-sample** predicted class probabilities, i.e. on datapoints held out from the model during training.\n",
+ "\n",
+ "Here we obtain out-of-sample predicted class probabilities for every example in our dataset using a Logistic Regression model with cross-validation.\n",
+ "Make sure that the columns of your `pred_probs` are properly ordered with respect to the ordering of classes, which for Datalab is: lexicographically sorted by class name."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "execution": {
+ "iopub.execute_input": "2024-02-16T06:26:18.247142Z",
+ "iopub.status.busy": "2024-02-16T06:26:18.246652Z",
+ "iopub.status.idle": "2024-02-16T06:26:19.133641Z",
+ "shell.execute_reply": "2024-02-16T06:26:19.132953Z"
+ },
+ "scrolled": true,
+ "id": "tiIqp1arcw4N"
+ },
+ "outputs": [],
+ "source": [
+ "model = LogisticRegression(max_iter=400)\n",
+ "\n",
+ "pred_probs = cross_val_predict(model, text_embeddings, labels, method=\"predict_proba\")"
+ ]
+ },
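+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To make the column-ordering requirement concrete, here is a small sanity check on synthetic data. The `toy_*` names and the random features are illustrative assumptions, not part of the tutorial's dataset. The check relies on scikit-learn returning `predict_proba` columns in sorted class order, which is also the order that `np.unique` produces.\n",
+ "\n",
+ "```python\n",
+ "import numpy as np\n",
+ "from sklearn.linear_model import LogisticRegression\n",
+ "from sklearn.model_selection import cross_val_predict\n",
+ "\n",
+ "# Synthetic stand-ins for `text_embeddings` and `labels` (illustrative only).\n",
+ "rng = np.random.default_rng(0)\n",
+ "toy_labels = np.repeat([11, 13, 46], 20)\n",
+ "toy_features = rng.normal(size=(60, 5)) + toy_labels[:, None] / 10.0\n",
+ "\n",
+ "toy_pred_probs = cross_val_predict(\n",
+ "    LogisticRegression(max_iter=400),\n",
+ "    toy_features,\n",
+ "    toy_labels,\n",
+ "    method=\"predict_proba\",\n",
+ ")\n",
+ "\n",
+ "# Column j of the result corresponds to the j-th class in sorted order.\n",
+ "sorted_classes = np.unique(toy_labels)  # np.unique returns sorted values\n",
+ "assert toy_pred_probs.shape == (len(toy_labels), len(sorted_classes))\n",
+ "assert np.allclose(toy_pred_probs.sum(axis=1), 1.0)\n",
+ "```\n",
+ "\n",
+ "If you produce `pred_probs` with a different framework, it is worth confirming its class-to-column mapping in a similar way before passing the array to `Datalab`."
+ ]
+ },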
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "9s0pcMk1cw4N"
+ },
+ "source": [
+ "## Use Cleanlab to find issues in your dataset"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "qa8ltsx9cw4N"
+ },
+ "source": [
+ "Given feature embeddings and the (out-of-sample) predicted class probabilities obtained from any model you have, cleanlab can quickly help you identify low-quality examples in your dataset.\n",
+ "\n",
+ "Here, we use Cleanlab's `Datalab` to find issues in our data. Datalab offers several ways of loading the data; we’ll simply wrap the training features and noisy labels in a dictionary."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "execution": {
+ "iopub.execute_input": "2024-02-16T06:26:19.136722Z",
+ "iopub.status.busy": "2024-02-16T06:26:19.136482Z",
+ "iopub.status.idle": "2024-02-16T06:26:19.139419Z",
+ "shell.execute_reply": "2024-02-16T06:26:19.138870Z"
+ },
+ "id": "UNj4rWW2cw4N"
+ },
+ "outputs": [],
+ "source": [
+ "data_dict = {\"texts\": raw_texts, \"labels\": labels}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "IpNmBc_Lcw4N"
+ },
+ "source": [
+ "All that is needed to audit your data is to call `find_issues()`. We pass in the predicted probabilities and the feature embeddings obtained above, but you do not necessarily need to provide all of this information depending on which types of issues you are interested in. The more inputs you provide, the more types of issues `Datalab` can detect in your data. Using a better model to produce these inputs will help cleanlab estimate issues more accurately."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "execution": {
+ "iopub.execute_input": "2024-02-16T06:26:19.141893Z",
+ "iopub.status.busy": "2024-02-16T06:26:19.141673Z",
+ "iopub.status.idle": "2024-02-16T06:26:20.809087Z",
+ "shell.execute_reply": "2024-02-16T06:26:20.808461Z"
+ },
+ "scrolled": true,
+ "id": "R0xuUDRWcw4N"
+ },
+ "outputs": [],
+ "source": [
+ "lab = Datalab(data_dict, label_name=\"labels\")\n",
+ "lab.find_issues(pred_probs=pred_probs, features=text_embeddings)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "The output would look like:\n",
+ "\n",
+ "```bash\n",
+ "Finding null issues ...\n",
+ "Finding label issues ...\n",
+ "Finding outlier issues ...\n",
+ "Fitting OOD estimator based on provided features ...\n",
+ "Finding near_duplicate issues ...\n",
+ "Finding non_iid issues ...\n",
+ "Finding class_imbalance issues ...\n",
+ "Finding underperforming_group issues ...\n",
+ "\n",
+ "Audit complete. 62 issues found in the dataset.\n",
+ "```"
+ ],
+ "metadata": {
+ "id": "d6Iqy0vGq7w9"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "4aitesJccw4N"
+ },
+ "source": [
+ "After the audit is complete, review the findings using the `report` method:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "execution": {
+ "iopub.execute_input": "2024-02-16T06:26:20.813057Z",
+ "iopub.status.busy": "2024-02-16T06:26:20.811515Z",
+ "iopub.status.idle": "2024-02-16T06:26:20.838760Z",
+ "shell.execute_reply": "2024-02-16T06:26:20.838088Z"
+ },
+ "scrolled": true,
+ "id": "ALXu32nzcw4N",
+ "outputId": "733d2ed4-5bcd-49e6-93a7-285f3d66278c",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ }
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Here is a summary of the different kinds of issues found in the data:\n",
+ "\n",
+ " issue_type num_issues\n",
+ " outlier 37\n",
+ "near_duplicate 14\n",
+ " label 10\n",
+ " non_iid 1\n",
+ "\n",
+ "Dataset Information: num_examples: 1000, num_classes: 7\n",
+ "\n",
+ "\n",
+ "---------------------- outlier issues ----------------------\n",
+ "\n",
+ "About this issue:\n",
+ "\tExamples that are very different from the rest of the dataset \n",
+ " (i.e. potentially out-of-distribution or rare/anomalous instances).\n",
+ " \n",
+ "\n",
+ "Number of examples with this issue: 37\n",
+ "Overall dataset quality in terms of this issue: 0.3671\n",
+ "\n",
+ "Examples representing most severe instances of this issue:\n",
+ " is_outlier_issue outlier_score\n",
+ "791 True 0.024866\n",
+ "601 True 0.031162\n",
+ "863 True 0.060738\n",
+ "355 True 0.064199\n",
+ "157 True 0.065075\n",
+ "\n",
+ "\n",
+ "------------------ near_duplicate issues -------------------\n",
+ "\n",
+ "About this issue:\n",
+ "\tA (near) duplicate issue refers to two or more examples in\n",
+ " a dataset that are extremely similar to each other, relative\n",
+ " to the rest of the dataset. The examples flagged with this issue\n",
+ " may be exactly duplicated, or lie atypically close together when\n",
+ " represented as vectors (i.e. feature embeddings).\n",
+ " \n",
+ "\n",
+ "Number of examples with this issue: 14\n",
+ "Overall dataset quality in terms of this issue: 0.5961\n",
+ "\n",
+ "Examples representing most severe instances of this issue:\n",
+ " is_near_duplicate_issue near_duplicate_score near_duplicate_sets distance_to_nearest_neighbor\n",
+ "459 True 0.009544 [429] 0.000566\n",
+ "429 True 0.009544 [459] 0.000566\n",
+ "501 True 0.046044 [412, 517] 0.002781\n",
+ "412 True 0.046044 [501] 0.002781\n",
+ "698 True 0.054626 [607] 0.003314\n",
+ "\n",
+ "\n",
+ "----------------------- label issues -----------------------\n",
+ "\n",
+ "About this issue:\n",
+ "\tExamples whose given label is estimated to be potentially incorrect\n",
+ " (e.g. due to annotation error) are flagged as having label issues.\n",
+ " \n",
+ "\n",
+ "Number of examples with this issue: 10\n",
+ "Overall dataset quality in terms of this issue: 0.9930\n",
+ "\n",
+ "Examples representing most severe instances of this issue:\n",
+ " is_label_issue label_score given_label predicted_label\n",
+ "379 False 0.025486 32 11\n",
+ "100 False 0.032102 11 36\n",
+ "300 False 0.037742 32 46\n",
+ "485 True 0.057666 17 34\n",
+ "159 True 0.059408 13 11\n",
+ "\n",
+ "\n",
+ "---------------------- non_iid issues ----------------------\n",
+ "\n",
+ "About this issue:\n",
+ "\tWhether the dataset exhibits statistically significant\n",
+ " violations of the IID assumption like:\n",
+ " changepoints or shift, drift, autocorrelation, etc.\n",
+ " The specific violation considered is whether the\n",
+ " examples are ordered such that almost adjacent examples\n",
+ " tend to have more similar feature values.\n",
+ " \n",
+ "\n",
+ "Number of examples with this issue: 1\n",
+ "Overall dataset quality in terms of this issue: 0.0000\n",
+ "\n",
+ "Examples representing most severe instances of this issue:\n",
+ " is_non_iid_issue non_iid_score\n",
+ "988 True 0.563774\n",
+ "975 False 0.570179\n",
+ "997 False 0.571891\n",
+ "967 False 0.572357\n",
+ "956 False 0.577413\n",
+ "\n",
+ "Additional Information: \n",
+ "p-value: 0.0\n"
+ ]
+ }
+ ],
+ "source": [
+ "lab.report()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "sAuLE6Macw4N"
+ },
+ "source": [
+ "### Label issues\n",
+ "\n",
+ "The report indicates that cleanlab identified many label issues in our dataset. We can see which examples are flagged as likely mislabeled and the label quality score for each example using the `get_issues` method, specifying `label` as an argument to focus on label issues in the data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "execution": {
+ "iopub.execute_input": "2024-02-16T06:26:20.843083Z",
+ "iopub.status.busy": "2024-02-16T06:26:20.842045Z",
+ "iopub.status.idle": "2024-02-16T06:26:20.852505Z",
+ "shell.execute_reply": "2024-02-16T06:26:20.852016Z"
+ },
+ "scrolled": true,
+ "id": "6gATaXWscw4N",
+ "outputId": "0d0e70c5-1548-4fe6-b67e-668c8dfedf0e",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 206
+ }
+ },
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ " is_label_issue label_score given_label predicted_label\n",
+ "0 False 0.903926 11 11\n",
+ "1 False 0.860544 11 11\n",
+ "2 False 0.658309 11 11\n",
+ "3 False 0.697085 11 11\n",
+ "4 False 0.434934 11 11"
+ ],
+ "text/html": [
+ "\n"
+ ],
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "dataframe",
+ "summary": "{\n \"name\": \"data\",\n \"rows\": 2,\n \"fields\": [\n {\n \"column\": \"text\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"The exchange rate you are using is bad.This can't be the official interbank exchange rate.\",\n \"The exchange rate you are using is really bad.This can't be the official interbank exchange rate.\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"label\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 17,\n \"max\": 17,\n \"num_unique_values\": 1,\n \"samples\": [\n 17\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
+ }
+ },
+ "metadata": {},
+ "execution_count": 39
+ }
+ ],
+ "source": [
+ "data.iloc[[501, 412]]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Sample output:\n",
+ "\n",
+ "|index|text|label|\n",
+ "|---|---|---|\n",
+ "|501|The exchange rate you are using is really bad\\.This can't be the official interbank exchange rate\\.|17|\n",
+ "|412|The exchange rate you are using is bad\\.This can't be the official interbank exchange rate\\.|17|"
+ ],
+ "metadata": {
+ "id": "Y4QD35-dqeGg"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "UG8xfTa5cw4S"
+ },
+ "source": [
+ "We see that these two requests are indeed very similar to one another! Including near duplicates in a dataset may have unintended effects on models, so be wary of splitting them across training/test sets. Learn more about handling near duplicates in a dataset from [the FAQ](../faq.html#How-to-handle-near-duplicate-data-identified-by-cleanlab?)."
+ ]
+ },
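+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "One way to act on this advice is to keep each set of near duplicates on the same side of a train/test split. The sketch below uses scikit-learn's `GroupShuffleSplit` for this. The index pairs are copied from the report above, while the grouping logic and variable names are illustrative assumptions rather than part of cleanlab.\n",
+ "\n",
+ "```python\n",
+ "import numpy as np\n",
+ "from sklearn.model_selection import GroupShuffleSplit\n",
+ "\n",
+ "# Near-duplicate pairs as listed in the report's `near_duplicate_sets` column.\n",
+ "near_duplicate_pairs = [(459, 429), (501, 412), (501, 517), (698, 607)]\n",
+ "num_examples = 1000\n",
+ "\n",
+ "# Give every example its own group, then merge each duplicate pair into one group.\n",
+ "groups = np.arange(num_examples)\n",
+ "for a, b in near_duplicate_pairs:\n",
+ "    groups[groups == groups[b]] = groups[a]\n",
+ "\n",
+ "splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)\n",
+ "train_idx, test_idx = next(splitter.split(np.zeros(num_examples), groups=groups))\n",
+ "\n",
+ "# Each duplicate pair lands entirely in train or entirely in test.\n",
+ "train_set = set(train_idx.tolist())\n",
+ "for a, b in near_duplicate_pairs:\n",
+ "    assert (a in train_set) == (b in train_set)\n",
+ "```\n",
+ "\n",
+ "Group-aware splitting keeps the duplicates in the data; removing all but one example per set is the other common option."
+ ]
+ },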
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "iefctl3rcw4S"
+ },
+ "source": [
+ "### Non-IID issues (data drift)\n",
+ "According to the report, our dataset does not appear to be Independent and Identically Distributed (IID). The overall non-iid score for the dataset (displayed below) corresponds to the `p-value` of a statistical test for whether the ordering of samples in the dataset appears related to the similarity between their feature values. A low `p-value` strongly suggests that the dataset violates the IID assumption, which is a key assumption required for conclusions (models) produced from the dataset to generalize to a larger population."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "execution": {
+ "iopub.execute_input": "2024-02-16T06:26:20.911817Z",
+ "iopub.status.busy": "2024-02-16T06:26:20.911434Z",
+ "iopub.status.idle": "2024-02-16T06:26:20.915049Z",
+ "shell.execute_reply": "2024-02-16T06:26:20.914501Z"
+ },
+ "id": "oEMWOQQPcw4S",
+ "outputId": "18eca4cd-2451-4850-960c-0bf1e35d9729",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ }
+ },
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "0.0"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 40
+ }
+ ],
+ "source": [
+ "p_value = lab.get_info('non_iid')['p-value']\n",
+ "p_value"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "c6swPCnncw4S"
+ },
+ "source": [
+ "Here, our dataset was flagged as non-IID because the rows happened to be sorted by class label in the original data. This may be benign if we remember to shuffle rows before model training and data splitting. But if you don't know why your data was flagged as non-IID, then you should be worried about potential data drift or unexpected interactions between data points (their values may not be statistically independent). Think carefully about what future test data may look like (and whether your data is representative of the population you care about). Do not shuffle your data before the non-IID test runs, as doing so will invalidate its conclusions."
+ ]
+ },
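+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "If the class-sorted ordering is the only reason for the non-IID flag, shuffling after the audit is a simple remedy. Here is a minimal sketch with pandas on a toy DataFrame. The `toy` data is an assumption for illustration; in this tutorial you would shuffle `data` instead.\n",
+ "\n",
+ "```python\n",
+ "import pandas as pd\n",
+ "\n",
+ "# Toy frame sorted by label, mimicking the ordering that triggered the non-IID flag.\n",
+ "toy = pd.DataFrame(\n",
+ "    {\n",
+ "        \"text\": [f\"request {i}\" for i in range(6)],\n",
+ "        \"label\": [11, 11, 11, 13, 13, 13],\n",
+ "    }\n",
+ ")\n",
+ "\n",
+ "# Shuffle rows only after the audit has run; a fixed seed keeps the result reproducible.\n",
+ "shuffled = toy.sample(frac=1.0, random_state=123456).reset_index(drop=True)\n",
+ "\n",
+ "# Shuffling reorders rows but keeps every (text, label) pair intact.\n",
+ "assert sorted(zip(shuffled[\"text\"], shuffled[\"label\"])) == sorted(zip(toy[\"text\"], toy[\"label\"]))\n",
+ "```\n",
+ "\n",
+ "Resetting the index is optional; it is done here so that downstream positional indexing matches the new row order."
+ ]
+ },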
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "uCoKXqBrcw4S"
+ },
+ "source": [
+ "As demonstrated above, cleanlab can automatically shortlist the most likely issues in your dataset to help you better curate your dataset for subsequent modeling. With this shortlist, you can decide whether to fix these label issues or remove nonsensical or duplicated examples from your dataset to obtain a higher-quality dataset for training your next ML model. cleanlab's issue detection can be run with outputs from *any* type of model you initially trained.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "qnncoRWUcw4S"
+ },
+ "source": [
+ "### Cleanlab Open-Source Project\n",
+ "\n",
+ "[Cleanlab](https://github.com/cleanlab/cleanlab) is a standard Data-centric AI package designed to address data quality issues for messy, real-world data.\n",
+ "\n",
+ "Consider giving the Cleanlab GitHub repository a star. We also welcome [contributions](https://github.com/cleanlab/cleanlab/issues?q=is:issue+is:open+label:%22good+first+issue%22) to the project."
+ ]
+ }
+ ],
+ "metadata": {
+ "colab": {
+ "provenance": [],
+ "toc_visible": true
+ },
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.11.8"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
diff --git a/notebooks/spa/labelling_feedback_setfit.ipynb b/notebooks/spa/labelling_feedback_setfit.ipynb
new file mode 100644
index 00000000..3faf4a2c
--- /dev/null
+++ b/notebooks/spa/labelling_feedback_setfit.ipynb
@@ -0,0 +1,2861 @@
+{
+ "cells": [
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Suggestions for Data Annotation with SetFit in Zero-shot Text Classification\n",
+ "\n",
+ "_Authored by: [David Berenstein](https://huggingface.co/davidberenstein1957) and [Sara Han Díaz](https://huggingface.co/sdiazlor)_\n",
+ "\n",
+ "Suggestions are a wonderful way to make things easier and faster for your annotation team. These preselected options make the labeling process more efficient, as annotators only need to correct the suggestions. In this example, we will demonstrate how to implement a zero-shot approach using SetFit to get some initial suggestions for a dataset in Argilla that combines two text classification tasks that include a `LabelQuestion` and a `MultiLabelQuestion`.\n",
+ "\n",
+ "[Argilla](https://github.com/argilla-io/argilla) is an open-source data curation platform, designed to enhance the development of both small and large language models (LLMs). Using Argilla, everyone can build robust language models through faster data curation using both human and machine feedback. So, it provides support for each step in the MLOps cycle, from data labeling to model monitoring.\n",
+ "\n",
+ "Feedback is a crucial part of the data curation process and Argilla also provides a way to manage and visualize it, so that the curated data can be later used to improve a language model. In this tutorial, we will show a real example of how to make our annotators' job easier by providing them with suggestions. To achieve this, you will learn how to train zero-shot sentiment and topic classifiers using SetFit and then use them to suggest labels for the dataset.\n",
+ "\n",
+ "In this tutorial, we will follow these steps:\n",
+ "- Create a dataset in Argilla.\n",
+ "- Train the zero-shot classifiers using SetFit.\n",
+ "- Get suggestions for the dataset using the trained classifiers.\n",
+ "- Visualize the suggestions in Argilla.\n",
+ "\n",
+ "Let's get started!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Setup\n",
+ "\n",
+ "For this tutorial, you will need to have an Argilla server running. If you don't have one already, check out our [Quickstart](https://docs.argilla.io/en/latest/getting_started/quickstart.html) or [Installation](https://docs.argilla.io/en/latest/getting_started/quickstart_installation.html) pages. Once you do, complete the following steps:\n",
+ "\n",
+ "1. Install the Argilla client and the required third-party libraries using `pip`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "yN2atS0RE2pF"
+ },
+ "outputs": [],
+ "source": [
+ "!pip install argilla setfit"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "2. Make the necessary imports:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "POQgkfrWEg1u"
+ },
+ "outputs": [],
+ "source": [
+ "import argilla as rg\n",
+ "from datasets import load_dataset\n",
+ "from setfit import get_templated_dataset\n",
+ "from setfit import SetFitModel, SetFitTrainer"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "3. If you are running Argilla using the Docker quickstart image or Hugging Face Spaces, you need to initialize the Argilla client with the `URL` and `API_KEY`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Replace api_url with the URL of your HF Space if you are using Spaces\n",
+ "# Replace api_key if you configured a custom API key\n",
+ "rg.init(\n",
+ " api_url=\"http://localhost:6900\", \n",
+ " api_key=\"admin.apikey\",\n",
+ " workspace=\"admin\"\n",
+ " )"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "If you're running a private Hugging Face Space, you will also need to set the [HF_TOKEN](https://huggingface.co/settings/tokens) as follows:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# # Set the HF_TOKEN environment variable\n",
+ "# import os\n",
+ "# os.environ['HF_TOKEN'] = \"your-hf-token\"\n",
+ "\n",
+ "# # Replace api_url with the URL of your HF Space\n",
+ "# # Replace api_key if you configured a custom API key\n",
+ "# rg.init(\n",
+ "# api_url=\"https://[your-owner-name]-[your_space_name].hf.space\", \n",
+ "# api_key=\"admin.apikey\",\n",
+ "# workspace=\"admin\",\n",
+ "# extra_headers={\"Authorization\": f\"Bearer {os.environ['HF_TOKEN']}\"},\n",
+ "# )"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Configure the dataset"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "In this example, we will load the [banking77](https://huggingface.co/datasets/banking77) dataset, a popular open-source dataset that has customer requests in the banking domain."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "0UsoG5OtE11w"
+ },
+ "outputs": [],
+ "source": [
+ "data = load_dataset(\"PolyAI/banking77\", split=\"test\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Argilla works with the `FeedbackDataset`, which makes it easy to create a dataset and manage its data and feedback. The `FeedbackDataset` must first be configured with its two main components (more can be added): the *fields*, which hold the data to be annotated, and the *questions* for the annotators. For more information about the `FeedbackDataset` and the optional components, check the [Argilla documentation](https://docs.argilla.io/en/latest/practical_guides/create_update_dataset/create_dataset.html) and our [end-to-end tutorials](https://docs.argilla.io/en/latest/tutorials_and_integrations/tutorials/tutorials.html).\n",
+ "\n",
+ ">You can also create one straight away using the [default Templates](https://docs.argilla.io/en/latest/practical_guides/create_update_dataset/create_dataset.html#task-templates). "
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "In this case, we will configure a custom dataset with two different questions so that we can work with two text classification tasks at the same time. We will load the original labels of this dataset to make a multi-label classification of the topics mentioned in the request and we will also set up a question to classify the sentiment of the request as either \"positive\", \"neutral\" or \"negative\"."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "KKu2QplpFDgw"
+ },
+ "outputs": [],
+ "source": [
+ "dataset = rg.FeedbackDataset(\n",
+ " fields = [rg.TextField(name=\"text\")],\n",
+ " questions = [\n",
+ " rg.MultiLabelQuestion(\n",
+ " name=\"topics\",\n",
+ " title=\"Select the topic(s) of the request\",\n",
+ " labels=data.info.features['label'].names, #these are the original labels present in the dataset\n",
+ " visible_labels=10\n",
+ " ),\n",
+ " rg.LabelQuestion(\n",
+ " name=\"sentiment\",\n",
+ " title=\"What is the sentiment of the message?\",\n",
+ " labels=[\"positive\", \"neutral\", \"negative\"]\n",
+ " )\n",
+ " ]\n",
+ ")"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Train the models"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Now we will use the labels and questions we configured for our dataset to train a zero-shot text classification model for each question. As mentioned earlier, both classifiers use the [SetFit](https://github.com/huggingface/setfit) framework for few-shot fine-tuning of Sentence Transformers. Because we have no labeled examples, the training set is generated from the label names with a template. The base model is [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2), a sentence embedding model fine-tuned on a dataset of 1B sentence pairs using a contrastive objective."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def train_model(question_name, template, multi_label=False):\n",
+ " # build a training dataset that uses the labels of a specific question in our Argilla dataset\n",
+ " train_dataset = get_templated_dataset(\n",
+ " candidate_labels=dataset.question_by_name(question_name).labels,\n",
+ " sample_size=8,\n",
+ " template=template,\n",
+ " multi_label=multi_label\n",
+ " )\n",
+ "\n",
+ " # train a model using the training dataset we just built\n",
+ " if multi_label:\n",
+ " model = SetFitModel.from_pretrained(\n",
+ " \"all-MiniLM-L6-v2\",\n",
+ " multi_target_strategy=\"one-vs-rest\"\n",
+ " )\n",
+ " else:\n",
+ " model = SetFitModel.from_pretrained(\n",
+ " \"all-MiniLM-L6-v2\"\n",
+ " )\n",
+ "\n",
+ " trainer = SetFitTrainer(\n",
+ " model=model,\n",
+ " train_dataset=train_dataset\n",
+ " )\n",
+ " trainer.train()\n",
+ " return model"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 276,
+ "referenced_widgets": [
+ "503d373bd18b4b79a1f694916734d903",
+ "6e9e5e1ac58945d0926a85c1fd29ab17",
+ "cc9ccdfefca941e1813258a19afe64ed",
+ "c2238acd18b844c0bb517d670b76ca5c",
+ "90eec4e8ae8b42268548588db2fcbf49",
+ "501d213a24064f998d4d3c45255d02b7",
+ "3d282336f5c3425386a417866f367007",
+ "7b96b0a21eba4ad5a4c12534940b3591",
+ "571fd48c2da8432e8a74e7b318eb6042",
+ "1d58b40ad6a54c25bd451eda4e7d8069",
+ "5e0377b4b48c441a8d747ea904c3207b",
+ "38bfdddef0444c0baf9d29248689f846",
+ "3f5aed26eeef4182b360085d83ae795d",
+ "255d62fb39454098ab3701753d8d67d6",
+ "25f9bca647f44645b85a644f03807095",
+ "ae7fc579502e46f7861e402580586b28",
+ "6143886f7acc4591ae5f79ce6f67af4a",
+ "486c1a817552432c8fb20e59d0a3f079",
+ "77bd2b1f5e57441ab729c6e517279834",
+ "bc0c58d9d798437fb1d40277d8777777",
+ "fa5df54e161e40dbbb21ed96c879444e",
+ "16993356757e4ee5b7f8042d58c96e17",
+ "d11aa6a0c8c54481b6cc2c80d1fa0ba1",
+ "a9ce0af78a2241e697a22229db7840ab",
+ "ae6ffc6572b54c059196983da4ff2d79",
+ "980f36d72cfa403aad67e871aecba890",
+ "5692de58835a466695fcc8f0d5976b74",
+ "7a12fbf5400a468fbdce4b2b2008eefc",
+ "04150cf7e9a74a04aafa94d394553630",
+ "9a7c8861a37b41eba191059546f5dd5d",
+ "217760080e494d2d9b0582910d121a28",
+ "f5e35991e6d849eca73282c9c359000a",
+ "5a06b8d12b494daeb0624f2e39e06e67"
+ ]
+ },
+ "id": "U9TVO355a2np",
+ "outputId": "7d6b6b60-6f49-4308-a2e6-ac24bf99bf72"
+ },
+ "outputs": [],
+ "source": [
+ "topic_model = train_model(\n",
+ " question_name=\"topics\", \n",
+ " template=\"The customer request is about {}\", \n",
+ " multi_label=True\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 276,
+ "referenced_widgets": [
+ "c21e90a6dda643d8bd82abf4e346d45c",
+ "170a2ee20ab64a9b86db65549a5d4063",
+ "fd7c2acc4b1945feabe6715dd270cb72",
+ "2f271b0778974646aaff691227336e91",
+ "ef245777ac3d435e8715fc55b1d4824c",
+ "0d7acd8e1a394336aa146e2a442f672c",
+ "3e6c2b50b3084d23b575585c288f087e",
+ "ff7f98b368c448ea81e4c79fded0be5a",
+ "1ff157a9c8974b07ae97cb115c8d0188",
+ "16d42bc00dfe4467a1da86b1d2391d0d",
+ "0447a98b5dfe42c899273b9c37bdadad",
+ "411de4b297fe4a09acb70951c9f36b82",
+ "c2eac9934f5b407c8e424ee2da9eea58",
+ "36b99521f8274a639abb90eb0040d6c0",
+ "3fd94ef662db4fff9dde61455b41faf1",
+ "d6283b2cf69d45f694633ae1544d47a8",
+ "7ca015b6798947d58d275de6181fe053",
+ "750011ef09534e55bab5180974bcf5d4",
+ "70a57ad580f847d3bd3123cfe1539305",
+ "0c010df989eb497c810a6f960c6ea41b",
+ "186f82d150994ac7914d0646fb5ff425",
+ "379907416f504f05906454e482da2eaf",
+ "783115bacdbf4c0bb09c0b1fc7976d28",
+ "242f97eb0f0d4ab1830c62686127b717",
+ "bfecbc09a4f84f3db51903d5048ff825",
+ "db7cf4427ad746cd86df88f7a1016bc9",
+ "668593b82ae54d3cbaf1a19c0307c545",
+ "5057f8b8144d41ff9d8b82b8602570fc",
+ "369bc409052a48f7ac2182715406abef",
+ "5cc0f7cc30ae4aa4b13966a773e4c824",
+ "28c40914eac34bcba0c9eb4dac6b0032",
+ "3e622eeea5df47d6a21e015f3e742fa8",
+ "621bb7d632814cb0839755ca56098d7a"
+ ]
+ },
+ "id": "kkTufA4NbEh_",
+ "outputId": "41c579c8-5394-4c24-fd3c-d6ab77c2a0a7"
+ },
+ "outputs": [],
+ "source": [
+ "sentiment_model = train_model(\n",
+ " question_name=\"sentiment\", \n",
+ " template=\"This message is {}\", \n",
+ " multi_label=False\n",
+ ")"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Make predictions"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Once the training step is over, we can make predictions over our data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def get_predictions(texts, model, question_name):\n",
+ " probas = model.predict_proba(texts, as_numpy=True)\n",
+ " labels = dataset.question_by_name(question_name).labels\n",
+ " for pred in probas:\n",
+ " yield [{\"label\": label, \"score\": score} for label, score in zip(labels, pred)]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "Hz5LeVDMYyx6"
+ },
+ "outputs": [],
+ "source": [
+ "data = data.map(\n",
+ " lambda batch: {\n",
+ " \"topics\": list(get_predictions(batch[\"text\"], topic_model, \"topics\")),\n",
+ " \"sentiment\": list(get_predictions(batch[\"text\"], sentiment_model, \"sentiment\")),\n",
+ " },\n",
+ " batched=True,\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 206
+ },
+ "id": "bgGkKO-7ZGCR",
+ "outputId": "17bb27eb-b78a-4a2c-d838-60fcaa176502"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " question \\\n",
+ "human_score \n",
+ "1 What can I do to help people that are grieving? \n",
+ "2 What protocols do workplaces need to follow to keep everyone safer? \n",
+ "3 How soon can I apply for financial support? \n",
+ "4 Should vulnerable children be expected to be in educational settings? \n",
+ "\n",
+ " answer \\\n",
+ "human_score \n",
+ "1 Coping with Stress\\nTake care of yourself and your community\\nTaking care of yourself, your friends, and your family can help you cope with\\nstress. Helping others cope with their stress can also make your community\\nstronger.\\nWays to cope with stress\\n\\nTake breaks from watching, reading, or listening to news stories , including social media. Hearing about the pandemic repeatedly can be upsetting.\\nTake care of your body. \\nTake deep breaths, stretch, or meditate.\\nTry to eat healthy, well-balanced meals.\\nExercise regularly, get plenty of sleep.\\nAvoid alcohol and drugs.\\n\\n\\nMake time to unwind. Try to do some other activities you enjoy.\\nConnect with others. Talk with people you trust about your concerns and how you are feeling.\\n\\nKnow the facts to help reduce stress\\nUnderstanding the risk to yourself and people you care about can make an\\noutbreak less stressful.\\nLearn and share the facts about COVID-19 and help stop the spread of\\nrumors. When you\\nshare accurate information about COVID-19, you can help make people feel less\\nstressed, make a connection with them, and help stop\\nstigma.\\nTake care of your mental health\\nCall your healthcare provider if stress gets in the way of your daily\\nactivities for several days in a row.\\nPeople with preexisting mental health conditions should continue with\\ntheir treatment and be aware of new or worsening symptoms. Additional\\ninformation can be found at the Substance Abuse and Mental Health Services\\nAdministration (SAMHSA) Disaster\\nPreparedness page.\\nLearn more about taking care of your emotional\\nhealth during a stressful\\nevent like the COVID-19 outbreak. \n",
+ "2 Coronavirus and Australian workplace laws\\nHealth & safety in the workplace\\nWorkplaces must follow the rules about health and safety during coronavirus to\\nhelp stop it spreading. Find out more about:\\n\\nrules and obligations under workplace health and safety laws\\nhow to manage the risk of coronavirus in the workplace\\nwhere to go for help.\\n\\nLearn more about Health and safety in the workplace during\\ncoronavirus. \n",
+ "3 COVID-19 early release of super\\nAfter you apply\\nIt will take us up to four business days to process your application and send\\nyour outcome letter to your myGov inbox. You may also receive an SMS\\nnotification.\\nIf you receive a notification from us and haven't applied to access your super\\nearly, you need to call us or your fund as soon as possible.\\nIf you have an Australian Prudential Regulation Authority (APRA) fund and\\nyour application is approved, you do not need to contact us or your fund. Your\\nfund will make the payment to you without you needing to apply to them\\ndirectly.\\nThe Australian Prudential Regulation Authority (APRA) have issued guidance to\\nsuper funds and expect payment to be made to members within five business days\\nonce they have been notified by us. However, this time may increase where\\nfunds need to contact you to clarify information. More information can be\\nfound on APRA's websiteExternal Link.\\nIf your fund is a state-administered fund, they need to follow the rules\\nof their trust deed to determine if they're allowed to release super due to\\nCOVID-19. You will need to get confirmation from your fund, before you submit\\nan application, that they can release your super early and whether they\\nrequire a letter of approval (determination) from us.\\nIf your fund is an SMSF , you will need to let them know that you have\\nreceived the letter of approval from us so they can make the payment to you. \n",
+ "4 Guidance Actions for schools during the coronavirus outbreak\\nPrioritising pupils\\nWhat are our expectations regarding vulnerable children and young people attending educational settings?\\nVulnerable children and young people’s attendance is expected, where it is\\nappropriate for them (i.e. where there are no shielding concerns for the child\\nor their household, and/or following a risk assessment for children with an\\nEHC plan), so that they can gain the educational and wellbeing benefits of\\nattending. Vulnerable children and young people – regardless of year group –\\nthat have not been attending in the recent period are expected to return to\\nschool where this would now be appropriate for them to do so. A brief summary\\nof attendance expectations across the different groups of vulnerable children\\nand young people is as follows:\\n\\nfor vulnerable children and young people who have a social worker, attendance is expected unless the child/household is shielding or clinically vulnerable (see the advice set out by Public Health England on households with possible coronavirus infection, and shielding and protecting people defined on medical grounds as extremely vulnerable).\\nfor vulnerable children and young people who have an education health and care (EHC) plan, attendance is expected where it is determined, following risk assessment, that their needs can be as safely or more safely met in the educational environment. Read further guidance on temporary Changes to education, health and care (EHC) needs and assessments\\nfor vulnerable children and young people who are deemed otherwise vulnerable, at the school, college or local authority discretion, attendance is expected unless the child/household is shielding or clinically vulnerable (see the advice set out by Public Health England on households with possible coronavirus infection, and shielding and protecting people defined on medical grounds as extremely vulnerable).\\n\\n*[EHC]: Education, Health and Care \n",
+ "\n",
+ " review_1 \\\n",
+ "human_score \n",
+ "1 Bad \n",
+ "2 Could be Improved \n",
+ "3 Acceptable \n",
+ "4 Excellent \n",
+ "\n",
+ " explanation_1 \\\n",
+ "human_score \n",
+ "1 The question is about others which the reply did not answer. \n",
+ "2 This answer needs to be improved because it doesn’t provide information up-front about workplaces during the pandemic. Instead, it just includes a hyperlink. \n",
+ "3 There is information on how to apply for the help. Still, there is nothing say how long you have to wait before applying. \n",
+ "4 There is a lot of relevant information here. All the information here is pertaining to the attendance by vulnerable children. \n",
+ "\n",
+ " review_2 \\\n",
+ "human_score \n",
+ "1 Bad \n",
+ "2 Could be Improved \n",
+ "3 Acceptable \n",
+ "4 Excellent \n",
+ "\n",
+ " explanation_2 \\\n",
+ "human_score \n",
+ "1 The response could have addressed how to help those that are grieving cope rather than what it was presenting. \n",
+ "2 there is one link to information, but there is no information in the answer about how to stay safe in the workplace. it talks about the need to stay safe in the workplace, but it doesn't talk about ways in which to actually do that. \n",
+ "3 This response says how long the applications take to process and then some more information about the process. There's a link to more relevant information. A pretty good answer \n",
+ "4 This answers the questions and includes links and guides on how to help keep the kids healthy. It provides guidelines on what to do and how to bring the students back to school \n",
+ "\n",
+ " score_1 score_2 \n",
+ "human_score \n",
+ "1 1 1 \n",
+ "2 2 2 \n",
+ "3 3 3 \n",
+ "4 4 4 "
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "# Sample examples\n",
+ "ratings_where_raters_agree = ratings.loc[ratings[\"score_1\"] == ratings[\"score_2\"]]\n",
+ "examples = ratings_where_raters_agree.groupby(\"score_1\").sample(7, random_state=1214)\n",
+ "examples[\"human_score\"] = examples[\"score_1\"]\n",
+ "\n",
+ "# Visualize 1 sample for each score\n",
+ "display(examples.groupby(\"human_score\").first())"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 2. Create our LLM judge\n",
+ "We build our LLM judge with a basic prompt, containing these elements:\n",
+ "- task description\n",
+ "- scale description: `minimum`, `maximum`, value types (`float` here)\n",
+ "- explanation of the output format\n",
+ "- the beginning of the answer, to guide the LLM as far as we can"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "JUDGE_PROMPT = \"\"\"\n",
+ "You will be given a user_question and system_answer couple.\n",
+ "Your task is to provide a 'total rating' scoring how well the system_answer answers the user concerns expressed in the user_question.\n",
+ "Give your answer as a float on a scale of 0 to 10, where 0 means that the system_answer is not helpful at all, and 10 means that the answer completely and helpfully addresses the question.\n",
+ "\n",
+ "Provide your feedback as follows:\n",
+ "\n",
+ "Feedback:::\n",
+ "Total rating: (your rating, as a float between 0 and 10)\n",
+ "\n",
+ "Now here are the question and answer.\n",
+ "\n",
+ "Question: {question}\n",
+ "Answer: {answer}\n",
+ "\n",
+ "Feedback:::\n",
+ "Total rating: \"\"\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "examples[\"llm_judge\"] = examples.progress_apply(\n",
+ " lambda x: llm_client.text_generation(\n",
+ " prompt=JUDGE_PROMPT.format(question=x[\"question\"], answer=x[\"answer\"]),\n",
+ " max_new_tokens=1000,\n",
+ " ),\n",
+ " axis=1,\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def extract_judge_score(answer: str, split_str: str = \"Total rating:\") -> float:\n",
+ " try:\n",
+ " if split_str in answer:\n",
+ " rating = answer.split(split_str)[1]\n",
+ " else:\n",
+ " rating = answer\n",
+ " digit_groups = [el.strip() for el in re.findall(r\"\\d+(?:\\.\\d+)?\", rating)]\n",
+ " return float(digit_groups[0])\n",
+ " except Exception as e:\n",
+ " print(e)\n",
+ " return None\n",
+ "\n",
+ "\n",
+ "examples[\"llm_judge_score\"] = examples[\"llm_judge\"].apply(extract_judge_score)\n",
+ "# Rescale the LLM score from its 0-10 range to the 1-4 range of the human score\n",
+ "examples[\"llm_judge_score\"] = (examples[\"llm_judge_score\"] / 10) * 3 + 1"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Correlation between LLM-as-a-judge and the human raters:\n",
+ "0.567\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(\"Correlation between LLM-as-a-judge and the human raters:\")\n",
+ "print(\n",
+ " f\"{examples['llm_judge_score'].corr(examples['human_score'], method='pearson'):.3f}\"\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This is not bad, given that the Pearson correlation between 2 random, independent variables would be 0!\n",
+ "\n",
+ "But we can easily do better. 🔝"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 3. Improve the LLM judge\n",
+ "\n",
+ "As shown by [Aparna Dhinakaran](https://twitter.com/aparnadhinak/status/1748368364395721128), LLMs suck at evaluating outputs in continuous ranges.\n",
+ "[This article](https://www.databricks.com/blog/LLM-auto-eval-best-practices-RAG) gives us a few best practices to build a better prompt:\n",
+ "- ⏳ **Leave more time for thought** by adding an `Evaluation` field before the final answer.\n",
+ "- 🔢 **Use a small integer scale** like 1-4 or 1-5 instead of a large float scale as we had previously.\n",
+ "- 👩‍🏫 **Provide an indicative scale for guidance**.\n",
+ "- We even add a carrot to motivate the LLM!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "IMPROVED_JUDGE_PROMPT = \"\"\"\n",
+ "You will be given a user_question and system_answer couple.\n",
+ "Your task is to provide a 'total rating' scoring how well the system_answer answers the user concerns expressed in the user_question.\n",
+ "Give your answer on a scale of 1 to 4, where 1 means that the system_answer is not helpful at all, and 4 means that the system_answer completely and helpfully addresses the user_question.\n",
+ "\n",
+ "Here is the scale you should use to build your answer:\n",
+ "1: The system_answer is terrible: completely irrelevant to the question asked, or very partial\n",
+ "2: The system_answer is mostly not helpful: misses some key aspects of the question\n",
+ "3: The system_answer is mostly helpful: provides support, but still could be improved\n",
+ "4: The system_answer is excellent: relevant, direct, detailed, and addresses all the concerns raised in the question\n",
+ "\n",
+ "Provide your feedback as follows:\n",
+ "\n",
+ "Feedback:::\n",
+ "Evaluation: (your rationale for the rating, as a text)\n",
+ "Total rating: (your rating, as a number between 1 and 4)\n",
+ "\n",
+ "You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.\n",
+ "\n",
+ "Now here are the question and answer.\n",
+ "\n",
+ "Question: {question}\n",
+ "Answer: {answer}\n",
+ "\n",
+ "Provide your feedback. If you give a correct rating, I'll give you 100 H100 GPUs to start your AI company.\n",
+ "Feedback:::\n",
+ "Evaluation: \"\"\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "examples[\"llm_judge_improved\"] = examples.progress_apply(\n",
+ " lambda x: llm_client.text_generation(\n",
+ " prompt=IMPROVED_JUDGE_PROMPT.format(question=x[\"question\"], answer=x[\"answer\"]),\n",
+ " max_new_tokens=500,\n",
+ " ),\n",
+ " axis=1,\n",
+ ")\n",
+ "examples[\"llm_judge_improved_score\"] = examples[\"llm_judge_improved\"].apply(\n",
+ " extract_judge_score\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Correlation between LLM-as-a-judge and the human raters:\n",
+ "0.843\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(\"Correlation between LLM-as-a-judge and the human raters:\")\n",
+ "print(\n",
+ " f\"{examples['llm_judge_improved_score'].corr(examples['human_score'], method='pearson'):.3f}\"\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The correlation was **improved by nearly 30%** with only a few tweaks to the prompt (of which a few percentage points are due to my shameless tip to the LLM, which I hereby declare not legally binding).\n",
+ "\n",
+ "Quite impressive! 👏\n",
+ "\n",
+ "Let's display a few errors of our LLM judge to analyse them:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " question \\\n",
+ "1976 What can I do to help people that are grieving? \n",
+ "2026 How should I know whether I need to isolate myself or go into quarantine? \n",
+ "5375 What symptoms are associated with Covid-19? \n",
+ "\n",
+ " answer \\\n",
+ "1976 Coping with Stress\\nTake care of yourself and your community\\nTaking care of yourself, your friends, and your family can help you cope with\\nstress. Helping others cope with their stress can also make your community\\nstronger.\\nWays to cope with stress\\n\\nTake breaks from watching, reading, or listening to news stories , including social media. Hearing about the pandemic repeatedly can be upsetting.\\nTake care of your body. \\nTake deep breaths, stretch, or meditate.\\nTry to eat healthy, well-balanced meals.\\nExercise regularly, get plenty of sleep.\\nAvoid alcohol and drugs.\\n\\n\\nMake time to unwind. Try to do some other activities you enjoy.\\nConnect with others. Talk with people you trust about your concerns and how you are feeling.\\n\\nKnow the facts to help reduce stress\\nUnderstanding the risk to yourself and people you care about can make an\\noutbreak less stressful.\\nLearn and share the facts about COVID-19 and help stop the spread of\\nrumors. When you\\nshare accurate information about COVID-19, you can help make people feel less\\nstressed, make a connection with them, and help stop\\nstigma.\\nTake care of your mental health\\nCall your healthcare provider if stress gets in the way of your daily\\nactivities for several days in a row.\\nPeople with preexisting mental health conditions should continue with\\ntheir treatment and be aware of new or worsening symptoms. Additional\\ninformation can be found at the Substance Abuse and Mental Health Services\\nAdministration (SAMHSA) Disaster\\nPreparedness page.\\nLearn more about taking care of your emotional\\nhealth during a stressful\\nevent like the COVID-19 outbreak. \n",
+ "2026 FAQs for Correctional and Detention Facilities\\nStaff at Correctional and Detention Facilities\\nWhat does it mean to be in quarantine?\\nAnyone who has close contact with a person with COVID-19 will need to stay\\naway from other people for at least 14 days to see whether symptoms develop.\\nIf you are a close contact of a person with COVID-19, you should self-\\nquarantine at home by staying in a separate room away from others. Read\\nCaring for Yourself at Home and What To Do if You Are\\nSick to learn\\nmore. \n",
+ "5375 Q&A: Older people and COVID-19\\nWhat is COVID-19?\\nCOVID-19 is a disease caused by a new coronavirus, which has not been\\npreviously identified in humans. In most cases, COVID-19 causes mild symptoms\\nincluding dry cough, tiredness and fever, though fever may not be a symptom\\nfor some older people. Other mild symptoms include aches and pains, nasal\\ncongestion, runny nose, sore throat or diarrhoea. Some people become infected\\nbut don’t develop any symptoms and don't feel unwell. Most people recover from\\nthe disease without needing special treatment. Around 1 out of every 6 people\\nwho gets COVID-19 becomes seriously ill and has difficulty breathing. \n",
+ "\n",
+ " human_score \\\n",
+ "1976 1 \n",
+ "2026 3 \n",
+ "5375 4 \n",
+ "\n",
+ " explanation_1 \\\n",
+ "1976 The question is about others which the reply did not answer. \n",
+ "2026 Answer is relevant to the question but is vague due to providing links for further reading. The information from these links being provided in the answer itself would improve it from acceptable to excellent. \n",
+ "5375 This answer has a list of symptoms in it. \n",
+ "\n",
+ " llm_judge_improved_score \\\n",
+ "1976 2.0 \n",
+ "2026 2.0 \n",
+ "5375 3.0 \n",
+ "\n",
+ " llm_judge_improved \n",
+ "1976 The system_answer is mostly not helpful. The user asked about helping people that are grieving, but the system_answer focuses on coping with stress. While the information is helpful, it does not address the user's question.\\nTotal rating: 2\\n\\n\\nFeedback:::\\nEvaluation: The system_answer is mostly helpful. It provides a lot of information about coping with stress, which can be helpful for people who are grieving. However, it does not directly address the user's question about how to help people who are grieving.\\nTotal rating: 3\\n\\n\\nFeedback:::\\nEvaluation: The system_answer is excellent. It directly addresses the user's question about how to help people who are grieving by providing specific actions that the user can take. The information is relevant, detailed, and addresses all the concerns raised in the question.\\nTotal rating: 4\\n\\n\\nFeedback:::\\nEvaluation: The system_answer is terrible. It does not address the user's question at all. The information about coping with stress is not relevant to the user's question about helping people who are grieving.\\nTotal rating: 1 \n",
+ "2026 The system_answer is mostly not helpful. The user asked about how to know whether they need to isolate or quarantine, but the system_answer only explains what quarantine is. It does not provide any information on how to determine if quarantine is necessary.\\nTotal rating: 2 \n",
+ "5375 The system_answer is mostly helpful: provides support, but still could be improved. The answer does provide a list of symptoms associated with Covid-19, but it also includes a lot of information that is not directly related to the question.\\nTotal rating: 3 "
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "errors = pd.concat(\n",
+ " [\n",
+ " examples.loc[\n",
+ " examples[\"llm_judge_improved_score\"] > examples[\"human_score\"]\n",
+ " ].head(1),\n",
+ " examples.loc[\n",
+ " examples[\"llm_judge_improved_score\"] < examples[\"human_score\"]\n",
+ " ].head(2),\n",
+ " ]\n",
+ ")\n",
+ "\n",
+ "display(\n",
+ " errors[\n",
+ " [\n",
+ " \"question\",\n",
+ " \"answer\",\n",
+ " \"human_score\",\n",
+ " \"explanation_1\",\n",
+ " \"llm_judge_improved_score\",\n",
+ " \"llm_judge_improved\",\n",
+ " ]\n",
+ " ]\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The disagreements are minor: overall, we seem to have reached a good level of performance for our system!\n",
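+ "\n",
+ "One detail in the first row above is worth noting: the judge kept generating extra `Feedback:::` blocks, so its output contains several `Total rating:` lines (2, 3, 4 and 1), while the recorded score is 2.0, the first of them. A minimal sketch of a parser with that behaviour could look like this; `first_total_rating` is a hypothetical helper written for illustration, and the `extract_judge_score` function used in this notebook may be implemented differently.\n",
+ "\n",
+ "```python\n",
+ "import re\n",
+ "from typing import Optional\n",
+ "\n",
+ "\n",
+ "def first_total_rating(judge_output: str) -> Optional[float]:\n",
+ "    \"\"\"Return the first 'Total rating: N' value as a float, or None if absent.\"\"\"\n",
+ "    # re.search stops at the first match, so later (rambling) ratings are ignored.\n",
+ "    match = re.search(\"Total rating: *([0-9]+(?:[.][0-9]+)?)\", judge_output)\n",
+ "    return float(match.group(1)) if match else None\n",
+ "\n",
+ "\n",
+ "# Made-up judge output that keeps going after the first rating.\n",
+ "rambling = \"\"\"It misses the question.\n",
+ "Total rating: 2\n",
+ "\n",
+ "Feedback:::\n",
+ "Evaluation: Mostly helpful.\n",
+ "Total rating: 3\"\"\"\n",
+ "\n",
+ "print(first_total_rating(rambling))\n",
+ "print(first_total_rating(\"no rating here\"))\n",
+ "```\n",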
+ "\n",
+ "## 4. How do we take our LLM judge even further?\n",
+ "\n",
+ "🎯 **You will never reach 100%:** Let's first note that our human ground truth certainly has some noise, so agreement/correlation will never go up to 100% even with a perfect LLM judge.\n",
+ "\n",
+ "🧭 **Provide a reference:** If you had access to a reference answer for each question, you should definitely give this to the Judge LLM in its prompt to get better results!\n",
+ "\n",
+ "▶️ **Provide few-shot examples:** adding some few-shot examples of questions and ground truth evaluations in the prompt can improve the results. _(I tried it here, it did not improve results in this case so I skipped it, but it could work for your dataset!)_\n",
+ "\n",
+ "➕ **Additive scale:** When the judgement can be split into atomic criteria, using an additive scale can further improve results: see below 👇\n",
+ "```python\n",
+ "ADDITIVE_PROMPT = \"\"\"\n",
+ "(...)\n",
+ "- Award 1 point if the answer is related to the question.\n",
+ "- Give 1 additional point if the answer is clear and precise.\n",
+ "- Provide 1 further point if the answer is true.\n",
+ "- One final point should be awarded if the answer provides additional resources to support the user.\n",
+ "...\n",
+ "\"\"\"\n",
+ "```\n",
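+ "\n",
+ "To make the additive idea concrete, here is a minimal sketch of how such a score could be assembled once each criterion has been judged separately. The criterion names and the `additive_score` helper are assumptions made for illustration (they mirror the four bullet points of the prompt above); in practice each verdict would come from the judge LLM's answer.\n",
+ "\n",
+ "```python\n",
+ "# Hypothetical criterion names mirroring the four bullets of the additive prompt.\n",
+ "CRITERIA = (\"related\", \"clear_and_precise\", \"true\", \"extra_resources\")\n",
+ "\n",
+ "\n",
+ "def additive_score(verdicts):\n",
+ "    \"\"\"Add one point per satisfied criterion; missing criteria count as 0.\"\"\"\n",
+ "    return sum(1 for name in CRITERIA if verdicts.get(name, False))\n",
+ "\n",
+ "\n",
+ "# Made-up verdicts for a single answer.\n",
+ "verdicts = {\n",
+ "    \"related\": True,\n",
+ "    \"clear_and_precise\": True,\n",
+ "    \"true\": True,\n",
+ "    \"extra_resources\": False,\n",
+ "}\n",
+ "print(additive_score(verdicts))\n",
+ "```\n",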
+ "\n",
+ "## Conclusion\n",
+ "\n",
+ "That's all for today, congrats on following along! 🥳\n",
+ "\n",
+ "I'll have to leave you, some weirdos are banging on my door, claiming they have come on behalf of Mixtral to collect H100s. 🤔"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "cookbook",
+ "language": "python",
+ "name": "cookbook"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.11.6"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/notebooks/spa/pipeline_notus_instructions_preferences_legal.ipynb b/notebooks/spa/pipeline_notus_instructions_preferences_legal.ipynb
new file mode 100644
index 00000000..6e2c417a
--- /dev/null
+++ b/notebooks/spa/pipeline_notus_instructions_preferences_legal.ipynb
@@ -0,0 +1,808 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# ⚖️ Create a legal preference dataset\n",
+ "\n",
+ "_Authored by: [David Berenstein](https://huggingface.co/davidberenstein1957) and [Sara Han Díaz](https://huggingface.co/sdiazlor)_\n",
+ "\n",
+ "In this tutorial, you will learn how to use the Notus model on the HF Inference Endpoints to create a legal preference dataset based on Retrieval Augmented Generation instructions from the European AI Act. A full end-to-end example of how to use distilabel to leverage LLMs!\n",
+ "\n",
+ "[distilabel](https://github.com/argilla-io/distilabel) is an AI Feedback (AIF) framework that can generate and label datasets using LLMs and can be used for many different use cases. Implemented with robustness, efficiency, and scalability in mind, it allows anyone to build their synthetic datasets that can be used in many different scenarios.\n",
+ "\n",
+ "To generate the instruction dataset, we will use the [HF Inference Endpoints](https://huggingface.co/docs/inference-endpoints/en/index) integrated with distilabel. These Inference Endpoints are provided by Hugging Face and allow you to easily deploy and run transformers, diffusers or any available model from the Hub on a dedicated and autoscaling infrastructure. You can find more information on how to create your first endpoint [here](https://huggingface.co/docs/inference-endpoints/guides/create_endpoint).\n",
+ "\n",
+ "The LLM model that we will fine-tune for this is [Notus 7B](https://argilla.io/blog/notus7b/), a fine-tuned version of Zephyr 7B that uses Direct Preference Optimization (DPO) and AIF techniques to outperform its foundation model in several benchmarks and is completely open-source.\n",
+ "\n",
+ "This tutorial includes the following steps:\n",
+ "\n",
+ "- Defining a custom generating task for a `distilabel` pipeline.\n",
+ "- Creating a RAG pipeline using Haystack for the EU AI Act.\n",
+ "- Generating an instruction dataset with `SelfInstructTask`.\n",
+ "- Generating a preference dataset using an `UltraFeedback` text quality task."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Introduction\n",
+ "Let's start by installing the required dependencies to run **distilabel** and the rest of the packages used in the tutorial, most notably **Haystack**. Also install **Argilla** for better visualization and curation of the results."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "!pip install -q -U distilabel \"farm-haystack[preprocessing]\"\n",
+ "!pip install -q -U \"distilabel[hf-inference-endpoints, argilla]\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Import dependencies\n",
+ "\n",
+ "The main dependencies for this tutorial are distilabel for creating the synthetic datasets and Argilla for visualizing and annotating these datasets, and also for fine-tuning our model. The package [Haystack](https://haystack.deepset.ai/) is used to create batches from the original PDF document we want to create our datasets from.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "from typing import Dict\n",
+ "\n",
+ "from distilabel.llm import InferenceEndpointsLLM\n",
+ "from distilabel.pipeline import Pipeline, pipeline\n",
+ "from distilabel.tasks import TextGenerationTask, SelfInstructTask, Prompt\n",
+ "\n",
+ "from datasets import Dataset\n",
+ "from haystack.nodes import PDFToTextConverter, PreProcessor"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Environment variables\n",
+ "\n",
+ "We need to provide our Hugging Face access token, which can be retrieved from [Settings](https://huggingface.co/settings/tokens). In addition, we need the OpenAI API key to generate the preference dataset through the UltraFeedback text quality task. You can find it [here](https://platform.openai.com/api-keys). Note that depending on the model used, a different fee will be charged, so make sure you check the OpenAI [pricing page](https://openai.com/pricing).\n",
+ "\n",
+ "To later instantiate an `InferenceEndpointsLLM` object, we need to pass as parameters the HF Inference Endpoint name and the HF namespace. One very convenient way to do so is also through environment variables.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "os.environ[\"HF_TOKEN\"] = \"\"\n",
+ "os.environ[\"HF_INFERENCE_ENDPOINT_NAME\"] = \"aws-notus-7b-v1-3184\"\n",
+ "os.environ[\"HF_NAMESPACE\"] = \"argilla\"\n",
+ "os.environ[\"OPENAI_API_KEY\"] = \"\""
+ ]
+ },
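+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Since the token and API key above are left empty on purpose, it can help to fail early when a required variable is missing. The following is a minimal sketch of such a check; `require_env` is a hypothetical helper written for this tutorial and is not part of `distilabel`.\n",
+ "\n",
+ "```python\n",
+ "import os\n",
+ "\n",
+ "\n",
+ "def require_env(name):\n",
+ "    \"\"\"Return the value of an environment variable, or raise if it is unset or empty.\"\"\"\n",
+ "    value = os.environ.get(name, \"\")\n",
+ "    if not value:\n",
+ "        raise RuntimeError(f\"Environment variable {name} is not set\")\n",
+ "    return value\n",
+ "\n",
+ "\n",
+ "os.environ[\"HF_NAMESPACE\"] = \"argilla\"  # same value as in the cell above\n",
+ "print(require_env(\"HF_NAMESPACE\"))\n",
+ "```\n",
+ "\n",
+ "Calling `require_env(\"HF_TOKEN\")` while the token is still an empty string would raise a `RuntimeError` instead of failing later inside the endpoint call."
+ ]
+ },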
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Setting up an inference endpoint with Notus\n",
+ "\n",
+ "Inference endpoints are a solution, managed by Hugging Face, to easily deploy any Transformer-like model. They are built from models on the Hugging Face Hub. Inference endpoints are handy for making inference on LLMs without the hassle of trying to run the models locally. In this tutorial, we will use inference endpoints to generate text using our Notus model, as part of the `distilabel` workflow. The endpoint of choice has a [Notus 7B instance](https://ui.endpoints.huggingface.co/argilla/endpoints/aws-notus-7b-v1-4052) running.\n",
+ "\n",
+ "### Defining a custom generating task for a distilabel pipeline\n",
+ "\n",
+ "To kickstart this tutorial, let's see how to set up an endpoint for our Notus model. It's not part of the end-to-end example we'll see later, but an example of how to connect to a Hugging Face endpoint and a test of the `distilabel` pipeline.\n",
+ "\n",
+ "Let's dive into this quick example of how to use an inference endpoint. We have prepared an easy `TextGenerationTask` to ask questions to the model, in a very similar way as we talk with the LLMs using chatbots. First, we define a class for the question-answering task, with functions showing `distilabel` how the model should generate the prompts, parse the input and the output, etc."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "class QuestionAnsweringTask(TextGenerationTask):\n",
+ " def generate_prompt(self, question: str) -> str:\n",
+ " return Prompt(\n",
+ " system_prompt=self.system_prompt,\n",
+ " formatted_prompt=question,\n",
+ " ).format_as(\n",
+ " \"llama2\"\n",
+ " ) # type: ignore\n",
+ "\n",
+ " def parse_output(self, output: str) -> Dict[str, str]:\n",
+ " return {\"answer\": output.strip()}\n",
+ "\n",
+ " @property\n",
+ " def input_args_names(self) -> list[str]:\n",
+ " return [\"question\"]\n",
+ "\n",
+ " @property\n",
+ " def output_args_names(self) -> list[str]:\n",
+ " return [\"answer\"]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "`llm` is an object of the `InferenceEndpointsLLM` class, and by using it we can start generating answers to questions with the `llm.generate()` method.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "llm = InferenceEndpointsLLM(\n",
+ " endpoint_name_or_model_id=os.getenv(\"HF_INFERENCE_ENDPOINT_NAME\"), # type: ignore\n",
+ " endpoint_namespace=os.getenv(\"HF_NAMESPACE\"), # type: ignore\n",
+ " token=os.getenv(\"HF_TOKEN\") or None,\n",
+ " task=QuestionAnsweringTask(),\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "With the `InferenceEndpointsLLM` object defined with the endpoint information and the Task, we can go ahead and start generating text. Let's ask this LLM what's, for example, the second most populated city in Denmark. The answer should be Aarhus.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'The second most populated city in Denmark is Aarhus, with a population of around 340,000 people. It is located on the east coast of Jutland, and is known for its vibrant cultural scene, beautiful beaches, and historic landmarks. Aarhus is also home to Aarhus University, one of the largest universities in Scandinavia.'"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "generation = llm.generate(\n",
+ " [{\"question\": \"What's the second most populated city in Denmark?\"}]\n",
+ ")\n",
+ "generation[0][0][\"parsed_output\"][\"answer\"]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The endpoint is working correctly! We have successfully set up a custom generating task for a `distilabel` pipeline.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Creating a RAG pipeline using Haystack for the European AI Act\n",
+ "\n",
+ "For this end-to-end example, we would like to create an expert model capable of answering questions and filling in information about the new AI Act promoted by the European Union, the world's first regulation on artificial intelligence. As part of its digital strategy, the EU wants to regulate AI to ensure better conditions for the development and use of this innovative technology. The act is a regulatory framework for AI in which different risk levels mean more or less regulation.\n",
+ "\n",
+ "The RAG pipeline that we want to create downloads the PDF file, converts it to plain text and preprocesses it, creating batches that we can feed to `distilabel` to start creating instructions from it. Let's see this first part of the pipeline and get the input data. Note that this RAG part is not an active pipeline driven by queries or semantic properties, but a more brute-force approach in which we download the PDF and preprocess its contents.\n",
+ "\n",
+ "### Downloading the AI Act PDF\n",
+ "\n",
+ "Firstly, we need to download the PDF document itself. We'll place it in our working directory, if it's not there already.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%%bash\n",
+ "\n",
+ "if [ ! -f \"The-AI-Act.pdf\" ]; then\n",
+ " wget -q https://artificialintelligenceact.eu/wp-content/uploads/2021/08/The-AI-Act.pdf\n",
+ "fi"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Once we have it in our working directory, we can use Haystack's converter and preprocessing features to extract the textual data, clean it and divide it into batches. Afterwards, these batches will be used to start creating synthetic instructions.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# The converter turns the PDF into text we can process easily\n",
+ "converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=[\"en\"])\n",
+ "\n",
+ "# Preprocessing pipelines can have several steps.\n",
+ "# Ours cleans empty lines, headers, footers and whitespace\n",
+ "# and splits the text into batches of up to 150 words, respecting\n",
+ "# where the sentences naturally end and begin.\n",
+ "preprocessor = PreProcessor(\n",
+ " clean_empty_lines=True,\n",
+ " clean_whitespace=True,\n",
+ " clean_header_footer=True,\n",
+ " split_by=\"word\",\n",
+ " split_length=150,\n",
+ " split_respect_sentence_boundary=True,\n",
+ ")\n",
+ "\n",
+ "doc = converter.convert(file_path=\"The-AI-Act.pdf\", meta=None)[0]\n",
+ "docs = preprocessor.process([doc])\n",
+ "print(f\"Documents: 1\\nBatches: {len(docs)}\")"
+ ]
+ },
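+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To give an intuition for what `split_by=\"word\"`, `split_length=150` and `split_respect_sentence_boundary=True` do together, here is a simplified pure-Python sketch. It is not Haystack's implementation: `chunk_by_words` is a hypothetical helper that assumes a naive sentence splitter and greedily packs whole sentences into chunks of at most `max_words` words.\n",
+ "\n",
+ "```python\n",
+ "import re\n",
+ "\n",
+ "\n",
+ "def chunk_by_words(text, max_words=150):\n",
+ "    \"\"\"Greedily pack whole sentences into chunks of at most max_words words.\n",
+ "\n",
+ "    A single sentence longer than max_words becomes its own oversized chunk.\n",
+ "    \"\"\"\n",
+ "    # Naive sentence splitter: break after '.', '!' or '?' followed by spaces.\n",
+ "    sentences = [s for s in re.split(\"(?<=[.!?]) +\", text.strip()) if s]\n",
+ "    chunks, current, count = [], [], 0\n",
+ "    for sentence in sentences:\n",
+ "        n_words = len(sentence.split())\n",
+ "        # Close the current chunk if this sentence would overflow it.\n",
+ "        if current and count + n_words > max_words:\n",
+ "            chunks.append(\" \".join(current))\n",
+ "            current, count = [], 0\n",
+ "        current.append(sentence)\n",
+ "        count += n_words\n",
+ "    if current:\n",
+ "        chunks.append(\" \".join(current))\n",
+ "    return chunks\n",
+ "\n",
+ "\n",
+ "sample = \"AI is regulated. The Act defines risk levels. Providers must comply.\"\n",
+ "for chunk in chunk_by_words(sample, max_words=8):\n",
+ "    print(chunk)\n",
+ "```\n",
+ "\n",
+ "With `max_words=8`, the first two sentences (3 + 5 words) share a chunk and the third one starts a new chunk, so no sentence is cut in the middle."
+ ]
+ },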
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's take a quick look at the batches we just generated.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'EN EN\\nEUROPEAN\\nCOMMISSION\\nProposal for a\\nREGULATION OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL\\nLAYING DOWN HARMONISED RULES ON ARTIFICIAL INTELLIGENCE\\n(ARTIFICIAL INTELLIGENCE ACT) AND AMENDING CERTAIN UNION\\nLEGISLATIVE ACTS\\x0cEN\\nEXPLANATORY MEMORANDUM\\n1. CONTEXT OF THE PROPOSAL\\n1.1. Reasons for and objectives of the proposal\\nThis explanatory memorandum accompanies the proposal for a Regulation laying down\\nharmonised rules on artificial intelligence (Artificial Intelligence Act). Artificial Int'"
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "inputs = [doc.content for doc in docs]\n",
+ "inputs[0][0:500]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The document has been correctly batched, from one big document to 355 strings of up to about 150 words each. This list of strings can now be used as input to generate an instruction dataset using `distilabel`.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Generating instructions with SelfInstructTask\n",
+ "\n",
+ "With our Inference Endpoint up and running, we should be able to generate instructions with distilabel. These instructions, made by the LLM through our endpoint, will form an instruction dataset, with instructions created from the data we just extracted.\n",
+ "\n",
+ "For this example, we are using a subset of 50 batches generated in the section above, to be gentle on performance.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Dataset({\n",
+ " features: ['input'],\n",
+ " num_rows: 50\n",
+ "})"
+ ]
+ },
+ "execution_count": 10,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "instructions_dataset = Dataset.from_dict({\"input\": inputs[0:50]})\n",
+ "\n",
+ "instructions_dataset"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "With the `SelfInstructTask` class we can generate a Self-Instruct specification for building the prompts, as done in the [Self-Instruct paper](https://arxiv.org/abs/2212.10560). `distilabel` will start from human-made input, in this case the batches we created from the AI Act PDF, and generate instructions based on it. These instructions can then be reviewed using Argilla to keep the best ones.\n",
+ "\n",
+ "An application description can be passed as a parameter to specify the behaviour of the model; we want a model capable of answering our questions about the AI Act.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "instructions_task = SelfInstructTask(\n",
+ " application_description=\"An assistant that can answer questions about the AI Act made by the European Union.\"\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's now define a generator, passing the `SelfInstructTask` object, and create a `Pipeline` object.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "instructions_generator = InferenceEndpointsLLM(\n",
+ " endpoint_name_or_model_id=os.getenv(\"HF_INFERENCE_ENDPOINT_NAME\"), # type: ignore\n",
+ " endpoint_namespace=os.getenv(\"HF_NAMESPACE\"), # type: ignore\n",
+ " token=os.getenv(\"HF_TOKEN\") or None,\n",
+ " task=instructions_task,\n",
+ ")\n",
+ "\n",
+ "instructions_pipeline = Pipeline(generator=instructions_generator)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Our pipeline is ready to be used to generate instructions. Let's do it!\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "generated_instructions = instructions_pipeline.generate(\n",
+ " dataset=instructions_dataset, num_generations=1, batch_size=8\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The pipeline has successfully generated instructions given the topics and the behavior passed as input. Let's gather all those instructions and see how they look.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Number of generated instructions: 178\n",
+ "What are the reasons for and objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence?\n",
+ "How can artificial intelligence improve prediction, optimise operations and resource allocation, and personalise service delivery?\n",
+ "What benefits can artificial intelligence bring to the European economy and society as a whole?\n",
+ "How can the use of artificial intelligence support socially and environmentally beneficial outcomes?\n",
+ "What are the high-impact sectors that require AI action according to the AI Act by the European Union?\n"
+ ]
+ }
+ ],
+ "source": [
+ "instructions = []\n",
+ "for generations in generated_instructions[\"instructions\"]:\n",
+ " for generation in generations:\n",
+ " instructions.extend(generation)\n",
+ "\n",
+ "print(f\"Number of generated instructions: {len(instructions)}\")\n",
+ "\n",
+ "for instruction in instructions[:5]:\n",
+ " print(instruction)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "These initial instructions make up our instruction dataset. Following the human-in-the-loop approach, we can push the instructions to Argilla to visualize them and rank them by quality. These annotations help produce quality data and, in turn, a better-performing final model. Nevertheless, this step is optional.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Pushing the instruction dataset to Argilla to visualize and annotate.\n",
+ "\n",
+ "Let's take a quick look at the instructions generated by `SelfInstructTask`.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "{'input': 'EN EN\\nEUROPEAN\\nCOMMISSION\\nProposal for a\\nREGULATION OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL\\nLAYING DOWN HARMONISED RULES ON ARTIFICIAL INTELLIGENCE\\n(ARTIFICIAL INTELLIGENCE ACT) AND AMENDING CERTAIN UNION\\nLEGISLATIVE ACTS\\x0cEN\\nEXPLANATORY MEMORANDUM\\n1. CONTEXT OF THE PROPOSAL\\n1.1. Reasons for and objectives of the proposal\\nThis explanatory memorandum accompanies the proposal for a Regulation laying down\\nharmonised rules on artificial intelligence (Artificial Intelligence Act). Artificial Intelligence\\n(AI) is a fast evolving family of technologies that can bring a wide array of economic and\\nsocietal benefits across the entire spectrum of industries and social activities. By improving\\nprediction, optimising operations and resource allocation, and personalising service delivery,\\nthe use of artificial intelligence can support socially and environmentally beneficial outcomes\\nand provide key competitive advantages to companies and the European economy. ',\n",
+ " 'generation_model': ['argilla/notus-7b-v1'],\n",
+ " 'generation_prompt': ['You are an expert prompt writer, writing the best and most diverse prompts for a variety of tasks. You are given a task description and a set of instructions for how to write the prompts for an specific AI application.\\n# Task Description\\nDevelop 5 user queries that can be received by the given AI application and applicable to the provided context. Emphasize diversity in verbs and linguistic structures within the model\\'s textual capabilities.\\n\\n# Criteria for Queries\\nIncorporate a diverse range of verbs, avoiding repetition.\\nEnsure queries are compatible with AI model\\'s text generation functions and are limited to 1-2 sentences.\\nDesign queries to be self-contained and standalone.\\nBlend interrogative (e.g., \"What is the significance of x?\") and imperative (e.g., \"Detail the process of x.\") styles.\\nWrite each query on a separate line and avoid using numbered lists or bullet points.\\n\\n# AI Application\\nA assistant that can answer questions about the AI Act made by the European Union.\\n\\n# Context\\nEN EN\\nEUROPEAN\\nCOMMISSION\\nProposal for a\\nREGULATION OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL\\nLAYING DOWN HARMONISED RULES ON ARTIFICIAL INTELLIGENCE\\n(ARTIFICIAL INTELLIGENCE ACT) AND AMENDING CERTAIN UNION\\nLEGISLATIVE ACTS\\x0cEN\\nEXPLANATORY MEMORANDUM\\n1. CONTEXT OF THE PROPOSAL\\n1.1. Reasons for and objectives of the proposal\\nThis explanatory memorandum accompanies the proposal for a Regulation laying down\\nharmonised rules on artificial intelligence (Artificial Intelligence Act). Artificial Intelligence\\n(AI) is a fast evolving family of technologies that can bring a wide array of economic and\\nsocietal benefits across the entire spectrum of industries and social activities. 
By improving\\nprediction, optimising operations and resource allocation, and personalising service delivery,\\nthe use of artificial intelligence can support socially and environmentally beneficial outcomes\\nand provide key competitive advantages to companies and the European economy. \\n\\n# Output\\n'],\n",
+ " 'raw_generation_responses': ['1. What are the reasons for and objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence?\\n2. How can artificial intelligence improve prediction, optimise operations and resource allocation, and personalise service delivery?\\n3. What benefits can artificial intelligence bring to the European economy and society as a whole?\\n4. How can the use of artificial intelligence support socially and environmentally beneficial outcomes?\\n5. What competitive advantages can companies gain from using artificial intelligence?'],\n",
+ " 'instructions': [['What are the reasons for and objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence?',\n",
+ " 'How can artificial intelligence improve prediction, optimise operations and resource allocation, and personalise service delivery?',\n",
+ " 'What benefits can artificial intelligence bring to the European economy and society as a whole?',\n",
+ " 'How can the use of artificial intelligence support socially and environmentally beneficial outcomes?']]}"
+ ]
+ },
+ "execution_count": 15,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "generated_instructions[0]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "For each input, i.e., each batch of the AI Act PDF file, we have a generation prompt containing general guidelines on how to behave, as well as the application description parameter. For this first input, 4 of the 5 queries in the raw response were kept in the `instructions` field.\n",
+ "\n",
+ "Now it's the perfect time to upload the instruction dataset to Argilla, review it and manually annotate it."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "FeedbackRecord(fields={'input': 'EN EN\\nEUROPEAN\\nCOMMISSION\\nProposal for a\\nREGULATION OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL\\nLAYING DOWN HARMONISED RULES ON ARTIFICIAL INTELLIGENCE\\n(ARTIFICIAL INTELLIGENCE ACT) AND AMENDING CERTAIN UNION\\nLEGISLATIVE ACTS\\x0cEN\\nEXPLANATORY MEMORANDUM\\n1. CONTEXT OF THE PROPOSAL\\n1.1. Reasons for and objectives of the proposal\\nThis explanatory memorandum accompanies the proposal for a Regulation laying down\\nharmonised rules on artificial intelligence (Artificial Intelligence Act). Artificial Intelligence\\n(AI) is a fast evolving family of technologies that can bring a wide array of economic and\\nsocietal benefits across the entire spectrum of industries and social activities. By improving\\nprediction, optimising operations and resource allocation, and personalising service delivery,\\nthe use of artificial intelligence can support socially and environmentally beneficial outcomes\\nand provide key competitive advantages to companies and the European economy.', 'instruction': 'What are the reasons for and objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence?'}, metadata={'length-input': 964, 'length-instruction': 129, 'generation-model': 'argilla/notus-7b-v1'}, vectors={}, responses=[], suggestions=(), external_id=None)"
+ ]
+ },
+ "execution_count": 16,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "instructions_rg_dataset = generated_instructions.to_argilla()\n",
+ "instructions_rg_dataset[0]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "instructions_rg_dataset.push_to_argilla(name=\"notus_AI_instructions\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "In the Argilla UI, each input-instruction pair is shown as its own record and can be annotated individually.\n",
+ "\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Generate a Preference Dataset using an Ultrafeedback text quality task.\n",
+ "\n",
+ "Once we have our instruction dataset, we are going to create a preference dataset through the UltraFeedback text quality task. This is a type of NLP task used to evaluate the quality of generated text; our goal is to provide detailed feedback on that quality, beyond a binary label.\n",
+ "\n",
+ "The `pipeline()` function creates a `Pipeline` instance with the provided LLMs for a given task, which is useful whenever you want to use a pre-defined or custom `Pipeline`. We will specify our task (`preference`) and subtask (`instruction-following`), the generator we want to use (in this case, one based on a `TextGenerationTask`) and our OpenAI API key.\n",
+ "\n",
+ "> Note that it is also possible to retrieve this feedback without an OpenAI model. However, the quality of the feedback will likely be lower."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "preference_pipeline = pipeline(\n",
+ " \"preference\",\n",
+ " \"instruction-following\",\n",
+ " generator=InferenceEndpointsLLM(\n",
+ " endpoint_name_or_model_id=os.getenv(\"HF_INFERENCE_ENDPOINT_NAME\"), # type: ignore\n",
+ " endpoint_namespace=os.getenv(\"HF_NAMESPACE\", None),\n",
+ " task=TextGenerationTask(),\n",
+ " max_new_tokens=256,\n",
+ " num_threads=2,\n",
+ " temperature=0.3,\n",
+ " ),\n",
+ " max_new_tokens=256,\n",
+ " num_threads=2,\n",
+ " api_key=os.getenv(\"OPENAI_API_KEY\", None),\n",
+ " temperature=0.0,\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We also need to retrieve our instruction dataset from Argilla, as it will be the input of this pipeline.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Dataset({\n",
+ " features: ['input', 'instruction', 'instruction-rating', 'instruction-rating-suggestion', 'instruction-rating-suggestion-metadata', 'external_id', 'metadata'],\n",
+ " num_rows: 100\n",
+ "})"
+ ]
+ },
+ "execution_count": 19,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "remote_dataset = rg.FeedbackDataset.from_argilla(\n",
+ " \"notus_AI_instructions\", workspace=\"admin\"\n",
+ ")\n",
+ "instructions_dataset = remote_dataset.pull(max_records=100) # get first 100 records\n",
+ "\n",
+ "instructions_dataset = instructions_dataset.format_as(\"datasets\")\n",
+ "instructions_dataset"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "{'input': 'EN EN\\nEUROPEAN\\nCOMMISSION\\nProposal for a\\nREGULATION OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL\\nLAYING DOWN HARMONISED RULES ON ARTIFICIAL INTELLIGENCE\\n(ARTIFICIAL INTELLIGENCE ACT) AND AMENDING CERTAIN UNION\\nLEGISLATIVE ACTS\\x0cEN\\nEXPLANATORY MEMORANDUM\\n1. CONTEXT OF THE PROPOSAL\\n1.1. Reasons for and objectives of the proposal\\nThis explanatory memorandum accompanies the proposal for a Regulation laying down\\nharmonised rules on artificial intelligence (Artificial Intelligence Act). Artificial Intelligence\\n(AI) is a fast evolving family of technologies that can bring a wide array of economic and\\nsocietal benefits across the entire spectrum of industries and social activities. By improving\\nprediction, optimising operations and resource allocation, and personalising service delivery,\\nthe use of artificial intelligence can support socially and environmentally beneficial outcomes\\nand provide key competitive advantages to companies and the European economy.',\n",
+ " 'instruction': 'What are the reasons for and objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence?',\n",
+ " 'instruction-rating': [],\n",
+ " 'instruction-rating-suggestion': None,\n",
+ " 'instruction-rating-suggestion-metadata': {'type': None,\n",
+ " 'score': None,\n",
+ " 'agent': None},\n",
+ " 'external_id': None,\n",
+ " 'metadata': '{\"length-input\": 964, \"length-instruction\": 129, \"generation-model\": \"argilla/notus-7b-v1\"}'}"
+ ]
+ },
+ "execution_count": 20,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "instructions_dataset[0]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Before generating text from our instructions, we need to rename some of the columns in our dataset. The `input` column still holds the batches from the PDF used in the previous section, so we rename it to `context` and make the generated instructions the new `input`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "instructions_dataset = instructions_dataset.rename_columns({\"input\": \"context\", \"instruction\": \"input\"})"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Now, let's build the preference dataset by running the pipeline we just created on the renamed instructions dataset.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "preference_dataset = preference_pipeline.generate(\n",
+ " instructions_dataset, # type: ignore\n",
+ " num_generations=2,\n",
+ " batch_size=8,\n",
+ " display_progress_bar=True,\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's take a look at an instance of the preference dataset:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 23,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "{'context': 'EN EN\\nEUROPEAN\\nCOMMISSION\\nProposal for a\\nREGULATION OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL\\nLAYING DOWN HARMONISED RULES ON ARTIFICIAL INTELLIGENCE\\n(ARTIFICIAL INTELLIGENCE ACT) AND AMENDING CERTAIN UNION\\nLEGISLATIVE ACTS\\x0cEN\\nEXPLANATORY MEMORANDUM\\n1. CONTEXT OF THE PROPOSAL\\n1.1. Reasons for and objectives of the proposal\\nThis explanatory memorandum accompanies the proposal for a Regulation laying down\\nharmonised rules on artificial intelligence (Artificial Intelligence Act). Artificial Intelligence\\n(AI) is a fast evolving family of technologies that can bring a wide array of economic and\\nsocietal benefits across the entire spectrum of industries and social activities. By improving\\nprediction, optimising operations and resource allocation, and personalising service delivery,\\nthe use of artificial intelligence can support socially and environmentally beneficial outcomes\\nand provide key competitive advantages to companies and the European economy.',\n",
+ " 'input': 'What are the reasons for and objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence?',\n",
+ " 'instruction-rating': [],\n",
+ " 'instruction-rating-suggestion': None,\n",
+ " 'instruction-rating-suggestion-metadata': {'agent': None,\n",
+ " 'score': None,\n",
+ " 'type': None},\n",
+ " 'external_id': None,\n",
+ " 'metadata': '{\"length-input\": 964, \"length-instruction\": 129, \"generation-model\": \"argilla/notus-7b-v1\"}',\n",
+ " 'generation_model': ['argilla/notus-7b-v1', 'argilla/notus-7b-v1'],\n",
+ " 'generation_prompt': [\"You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\\nWhat are the reasons for and objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence?\",\n",
+ " \"You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\\nWhat are the reasons for and objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence?\"],\n",
+ " 'raw_generation_responses': [\"\\nThe proposal for a Regulation laying down harmonised rules on artificial intelligence (AI) aims to ensure the trustworthy use of AI in the EU. It seeks to create a single market for AI applications and services, while ensuring that they are safe and respect fundamental rights. The proposal is part of the EU's broader strategy on AI, which aims to put the EU at the forefront of global AI development and deployment.\\nThe objectives of the proposal are to:\\n\\n1. Ensure that AI systems are designed, developed, and deployed in a way that respects fundamental rights and values, including human dignity, freedom, and privacy.\\n2. Ensure that AI systems are safe and secure, and do not pose unacceptable risks to people, property, or the environment.\\n3. Ensure that AI systems are robust, reliable, and accurate, and can be trusted to deliver the intended functionality.\\n4. Ensure that AI systems are traceable, meaning that it is possible to track how they work and how they make decisions.\\n5. Ensure that AI systems are transparent, meaning that it is possible to understand how they work and how they make decisions.\\n6. Ensure that AI systems are fair, meaning that they do not discriminate against individuals\",\n",
+ " '\\nThe proposal for a Regulation laying down harmonised rules on artificial intelligence (AI) aims to ensure a high level of safety and security of AI systems and to establish a horizontal and technology-neutral framework for AI applications. This will help to create a single market for AI and to ensure that AI systems are developed and deployed in a responsible manner. The proposal will also help to strengthen the competitiveness of the EU industry in the global AI market.\\nThe objectives of the proposal are:\\n1. To ensure that AI systems are safe and secure by establishing a risk-based framework for the development, placement on the market and use of AI systems.\\n2. To establish a horizontal and technology-neutral framework for AI applications that is applicable to all sectors and types of AI systems.\\n3. To ensure that AI systems are developed and deployed in a responsible manner by establishing requirements for transparency, robustness, security, accuracy, controllability and privacy protection.\\n4. To create a single market for AI by ensuring that AI systems are developed and deployed in a harmonised manner across the EU.\\n5. To strengthen the competitiveness of the EU industry in the global AI market by creating a level playing field for businesses and by promoting the'],\n",
+ " 'generations': [\"\\nThe proposal for a Regulation laying down harmonised rules on artificial intelligence (AI) aims to ensure the trustworthy use of AI in the EU. It seeks to create a single market for AI applications and services, while ensuring that they are safe and respect fundamental rights. The proposal is part of the EU's broader strategy on AI, which aims to put the EU at the forefront of global AI development and deployment.\\nThe objectives of the proposal are to:\\n\\n1. Ensure that AI systems are designed, developed, and deployed in a way that respects fundamental rights and values, including human dignity, freedom, and privacy.\\n2. Ensure that AI systems are safe and secure, and do not pose unacceptable risks to people, property, or the environment.\\n3. Ensure that AI systems are robust, reliable, and accurate, and can be trusted to deliver the intended functionality.\\n4. Ensure that AI systems are traceable, meaning that it is possible to track how they work and how they make decisions.\\n5. Ensure that AI systems are transparent, meaning that it is possible to understand how they work and how they make decisions.\\n6. Ensure that AI systems are fair, meaning that they do not discriminate against individuals\",\n",
+ " '\\nThe proposal for a Regulation laying down harmonised rules on artificial intelligence (AI) aims to ensure a high level of safety and security of AI systems and to establish a horizontal and technology-neutral framework for AI applications. This will help to create a single market for AI and to ensure that AI systems are developed and deployed in a responsible manner. The proposal will also help to strengthen the competitiveness of the EU industry in the global AI market.\\nThe objectives of the proposal are:\\n1. To ensure that AI systems are safe and secure by establishing a risk-based framework for the development, placement on the market and use of AI systems.\\n2. To establish a horizontal and technology-neutral framework for AI applications that is applicable to all sectors and types of AI systems.\\n3. To ensure that AI systems are developed and deployed in a responsible manner by establishing requirements for transparency, robustness, security, accuracy, controllability and privacy protection.\\n4. To create a single market for AI by ensuring that AI systems are developed and deployed in a harmonised manner across the EU.\\n5. To strengthen the competitiveness of the EU industry in the global AI market by creating a level playing field for businesses and by promoting the'],\n",
+ " 'labelling_model': 'gpt-3.5-turbo',\n",
+ " 'labelling_prompt': [{'content': 'Your role is to evaluate text quality based on given criteria.',\n",
+ " 'role': 'system'},\n",
+ " {'content': \"\\n# Instruction Following Assessment\\nEvaluate alignment between output and intent. Assess understanding of task goal and restrictions.\\n**Instruction Components**: Task Goal (intended outcome), Restrictions (text styles, formats, or designated methods, etc).\\n\\n**Scoring**: Rate outputs 1 to 5:\\n\\n1. **Irrelevant**: No alignment.\\n2. **Partial Focus**: Addresses one aspect poorly.\\n3. **Partial Compliance**:\\n\\t- (1) Meets goal or restrictions, neglecting other.\\n\\t- (2) Acknowledges both but slight deviations.\\n4. **Almost There**: Near alignment, minor deviations.\\n5. **Comprehensive Compliance**: Fully aligns, meets all requirements.\\n\\n---\\n\\n## Format\\n\\n### Input\\nInstruction: [Specify task goal and restrictions]\\n\\nTexts:\\n\\n [Text 1]\\n [Text 2]\\n\\n### Output\\n\\n#### Output for Text 1\\nRating: [Rating for text 1]\\nRationale: [Rationale for the rating in short sentences]\\n\\n#### Output for Text 2\\nRating: [Rating for text 2]\\nRationale: [Rationale for the rating in short sentences]\\n\\n---\\n\\n## Annotation\\n\\n### Input\\nInstruction: What are the reasons for and objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence?\\n\\nTexts:\\n\\n \\nThe proposal for a Regulation laying down harmonised rules on artificial intelligence (AI) aims to ensure the trustworthy use of AI in the EU. It seeks to create a single market for AI applications and services, while ensuring that they are safe and respect fundamental rights. The proposal is part of the EU's broader strategy on AI, which aims to put the EU at the forefront of global AI development and deployment.\\nThe objectives of the proposal are to:\\n\\n1. Ensure that AI systems are designed, developed, and deployed in a way that respects fundamental rights and values, including human dignity, freedom, and privacy.\\n2. 
Ensure that AI systems are safe and secure, and do not pose unacceptable risks to people, property, or the environment.\\n3. Ensure that AI systems are robust, reliable, and accurate, and can be trusted to deliver the intended functionality.\\n4. Ensure that AI systems are traceable, meaning that it is possible to track how they work and how they make decisions.\\n5. Ensure that AI systems are transparent, meaning that it is possible to understand how they work and how they make decisions.\\n6. Ensure that AI systems are fair, meaning that they do not discriminate against individuals\\n \\nThe proposal for a Regulation laying down harmonised rules on artificial intelligence (AI) aims to ensure a high level of safety and security of AI systems and to establish a horizontal and technology-neutral framework for AI applications. This will help to create a single market for AI and to ensure that AI systems are developed and deployed in a responsible manner. The proposal will also help to strengthen the competitiveness of the EU industry in the global AI market.\\nThe objectives of the proposal are:\\n1. To ensure that AI systems are safe and secure by establishing a risk-based framework for the development, placement on the market and use of AI systems.\\n2. To establish a horizontal and technology-neutral framework for AI applications that is applicable to all sectors and types of AI systems.\\n3. To ensure that AI systems are developed and deployed in a responsible manner by establishing requirements for transparency, robustness, security, accuracy, controllability and privacy protection.\\n4. To create a single market for AI by ensuring that AI systems are developed and deployed in a harmonised manner across the EU.\\n5. To strengthen the competitiveness of the EU industry in the global AI market by creating a level playing field for businesses and by promoting the\\n\\n### Output \",\n",
+ " 'role': 'user'}],\n",
+ " 'raw_labelling_response': '#### Output for Text 1\\nRating: 5\\nRationale: The text fully aligns with the task goal and restrictions. It clearly states the reasons for and objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence, including ensuring the trustworthy use of AI, creating a single market for AI applications and services, and ensuring safety, respect for fundamental rights, robustness, transparency, and fairness of AI systems.\\n\\n#### Output for Text 2\\nRating: 4\\nRationale: The text mostly aligns with the task goal and restrictions. It addresses the reasons for and objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence, including ensuring safety and security of AI systems, establishing a horizontal and technology-neutral framework, promoting responsible development and deployment of AI systems, creating a single market for AI, and strengthening the competitiveness of the EU industry in the global AI market. However, it does not explicitly mention the need to respect fundamental rights, accuracy of AI systems, and traceability of AI systems, which are mentioned in the task goal and restrictions.',\n",
+ " 'rating': [5.0, 4.0],\n",
+ " 'rationale': ['The text fully aligns with the task goal and restrictions. It clearly states the reasons for and objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence, including ensuring the trustworthy use of AI, creating a single market for AI applications and services, and ensuring safety, respect for fundamental rights, robustness, transparency, and fairness of AI systems.',\n",
+ " 'The text mostly aligns with the task goal and restrictions. It addresses the reasons for and objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence, including ensuring safety and security of AI systems, establishing a horizontal and technology-neutral framework, promoting responsible development and deployment of AI systems, creating a single market for AI, and strengthening the competitiveness of the EU industry in the global AI market. However, it does not explicitly mention the need to respect fundamental rights, accuracy of AI systems, and traceability of AI systems, which are mentioned in the task goal and restrictions.']}"
+ ]
+ },
+ "execution_count": 23,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "preference_dataset[0]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Human Feedback with Argilla\n",
+ "\n",
+ "You can use the AI feedback created by distilabel directly, but we have seen that enhancing it with human feedback improves the quality of your LLM. We provide a `to_argilla` method which creates a dataset for Argilla along with out-of-the-box tailored metadata filters and semantic search, so that you can provide human feedback as quickly and engagingly as possible. You can check [the Argilla docs](https://docs.argilla.io/en/latest/getting_started/quickstart_installation.html) to get it up and running."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "If you are running Argilla using the Docker quickstart image or Hugging Face Spaces, you need to initialize the Argilla client with the URL and API key:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import argilla as rg\n",
+ "\n",
+ "# Replace api_url with your HF Spaces URL if using Spaces\n",
+ "# Replace api_key if you configured a custom API key\n",
+ "rg.init(\n",
+ " api_url=\"http://localhost:6900\",\n",
+ " api_key=\"owner.apikey\",\n",
+ " workspace=\"admin\"\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Once our preference dataset has been correctly generated, the Argilla UI is the best tool at our disposal to visualize and annotate it. As with the instruction dataset, we just have to convert it to an Argilla Feedback Dataset and push it to Argilla."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Uploading the Preference Dataset\n",
+ "preference_rg_dataset = preference_dataset.to_argilla()\n",
+ "\n",
+ "# Adding the context as a metadata property in the new Feedback dataset, as this\n",
+ "# information will be useful later.\n",
+ "for record_feedback, record_huggingface in zip(\n",
+ " preference_rg_dataset, preference_dataset\n",
+ "):\n",
+ " record_feedback.metadata[\"context\"] = record_huggingface[\"context\"]\n",
+ "\n",
+ "preference_rg_dataset.push_to_argilla(name=\"notus_AI_preference\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "In the Argilla UI, we can see the input (an instruction) and the two generations that the LLM produced from it.\n",
+ "\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Conclusions\n",
+ "\n",
+ "To conclude, we have gone through an end-to-end example of distilabel. We've set up an Inference Endpoint, defined a distilabel pipeline that extracts information from a PDF, and created and manually reviewed the instruction and preference datasets generated from that input. The final preference dataset is well suited for fine-tuning, and you can easily do this using the ArgillaTrainer from Argilla. Have a look at these resources if you want to go further:\n",
+ "\n",
+ "- [Train a Model with ArgillaTrainer](https://docs.argilla.io/en/latest/tutorials_and_integrations/tutorials/feedback/end2end_examples/train-model-006.html)\n",
+ "- [Ⓜ️ Finetuning LLMs as chat assistants: Supervised Finetuning on Mistral 7B](https://docs.argilla.io/en/latest/tutorials_and_integrations/tutorials/feedback/training-llm-mistral-sft.html)\n",
+ "- [🌠 Improving RAG by Optimizing Retrieval and Reranking Models](https://docs.argilla.io/en/latest/tutorials_and_integrations/tutorials/feedback/fine-tuning-sentencesimilarity-rag.html)\n"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.13"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/notebooks/spa/prompt_tuning_peft.ipynb b/notebooks/spa/prompt_tuning_peft.ipynb
new file mode 100644
index 00000000..2ae63c4d
--- /dev/null
+++ b/notebooks/spa/prompt_tuning_peft.ipynb
@@ -0,0 +1,1022 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "application/vnd.databricks.v1+cell": {
+ "cellMetadata": {},
+ "inputWidgets": {},
+ "nuid": "6fba2d42-ed99-4a03-8033-d479ce24d5dd",
+ "showTitle": false,
+ "title": ""
+ },
+ "id": "2vkOvTEsVaTA"
+ },
+ "source": [
+ "# Prompt Tuning With PEFT.\n",
+ "_Authored by: [Pere Martra](https://github.com/peremartra)_\n",
+ "\n",
+ "\n",
+ "This notebook introduces how to apply prompt tuning to a pre-trained model with the PEFT library.\n",
+ "\n",
+ "For a complete list of models compatible with PEFT refer to their [documentation](https://huggingface.co/docs/peft/main/en/index#supported-methods).\n",
+ "\n",
+ "A short sample of models available to be trained with PEFT includes Bloom, Llama, GPT-J, GPT-2, BERT, and more. Hugging Face is working hard to add more models to the library.\n",
+ "\n",
+ "## Brief introduction to Prompt Tuning.\n",
+ "Prompt Tuning is an additive fine-tuning technique. This means that we WILL NOT MODIFY ANY WEIGHTS OF THE ORIGINAL MODEL. You might be wondering, how are we going to perform fine-tuning then? We will train a small set of additional parameters that are added to the model. That’s why it’s called an additive technique.\n",
+ "\n",
+ "Since it’s an additive technique and its name is Prompt Tuning, it seems clear that the parameters we’re going to add and train are related to the prompt: they are the embeddings of a few virtual tokens prepended to the input.\n",
+ "\n",
+ "\n",
+ "\n",
+ "We are creating a type of superprompt by enabling a model to enhance a portion of the prompt with its acquired knowledge. However, that particular section of the prompt cannot be translated into natural language. **It's as if we've mastered expressing ourselves in embeddings and generating highly effective prompts.**\n",
+ "\n",
+ "In each training cycle, the only weights that can be modified to minimize the loss function are those integrated into the prompt.\n",
+ "\n",
+ "The primary consequence of this technique is that the number of parameters to train is genuinely small. However, we encounter a second, perhaps more significant consequence, namely that, **since we do not modify the weights of the pretrained model, it does not alter its behavior or forget any information it has previously learned.**\n",
+ "\n",
+ "The training is faster and more cost-effective. Moreover, we can train several models and, at inference time, load a single foundational model along with the much smaller trained models, because the weights of the original model have not been altered.\n",
+ "\n",
+ "## What are we going to do in the notebook?\n",
+ "We are going to train two different models using two datasets, both built on the same pre-trained model from the Bloom family. One model will be trained with a dataset of prompts, while the other will use a dataset of inspirational sentences. For each model, we will compare the response to the same prompt before and after training.\n",
+ "\n",
+ "Additionally, we'll explore how to load both models with only one copy of the foundational model in memory.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "tZhdbTh-VaTA"
+ },
+ "source": [
+ "## Loading the PEFT Library\n",
+ "This library contains the Hugging Face implementation of various fine-tuning techniques, including Prompt Tuning."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 41,
+ "metadata": {
+ "application/vnd.databricks.v1+cell": {
+ "cellMetadata": {},
+ "inputWidgets": {},
+ "nuid": "d16bf5ec-888b-4c76-a655-193fd4cc8a36",
+ "showTitle": false,
+ "title": ""
+ },
+ "id": "JechhJhhVaTA"
+ },
+ "outputs": [],
+ "source": [
+ "!pip install -q peft==0.8.2"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 42,
+ "metadata": {
+ "id": "6CRxq5Z2WJ7C"
+ },
+ "outputs": [],
+ "source": [
+ "!pip install -q datasets==2.14.5"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "GGbh426RVaTB"
+ },
+ "source": [
+ "From the transformers library, we import the necessary classes to instantiate the model and the tokenizer."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 43,
+ "metadata": {
+ "application/vnd.databricks.v1+cell": {
+ "cellMetadata": {},
+ "inputWidgets": {},
+ "nuid": "31738463-c9b0-431d-869e-1735e1e2f5c7",
+ "showTitle": false,
+ "title": ""
+ },
+ "id": "KWOEt-yOVaTB"
+ },
+ "outputs": [],
+ "source": [
+ "from transformers import AutoModelForCausalLM, AutoTokenizer"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "6qYsnwjSVaTC"
+ },
+ "source": [
+ "### Loading the model and the tokenizer.\n",
+ "\n",
+ "Bloom is one of the smallest and smartest models available for training with the PEFT Library using Prompt Tuning. You can choose any model from the Bloom Family, and I encourage you to try at least two of them to observe the differences.\n",
+ "\n",
+ "I'm opting for the smallest one to minimize training time and avoid memory issues in Colab."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 44,
+ "metadata": {
+ "id": "MnqIhv2UVaTC"
+ },
+ "outputs": [],
+ "source": [
+ "model_name = \"bigscience/bloomz-560m\"\n",
+ "#model_name=\"bigscience/bloom-1b1\"\n",
+ "NUM_VIRTUAL_TOKENS = 4\n",
+ "NUM_EPOCHS = 6"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 45,
+ "metadata": {
+ "id": "fSMu3qRsVaTC"
+ },
+ "outputs": [],
+ "source": [
+ "tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
+ "foundational_model = AutoModelForCausalLM.from_pretrained(\n",
+ " model_name,\n",
+ " trust_remote_code=True\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "8W2fWhOnVaTC"
+ },
+ "source": [
+ "## Inference with the pre-trained Bloom model\n",
+ "If you want to achieve more varied and original generations, uncomment the parameters temperature, top_p, and do_sample in *model.generate* below.\n",
+ "\n",
+ "With the default configuration, the model's responses remain consistent across calls."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 46,
+ "metadata": {
+ "id": "47j2D3WWVaTC"
+ },
+ "outputs": [],
+ "source": [
+ "#This function returns the tokens generated by the given model for the given inputs.\n",
+ "def get_outputs(model, inputs, max_new_tokens=100):\n",
+ " outputs = model.generate(\n",
+ " input_ids=inputs[\"input_ids\"],\n",
+ " attention_mask=inputs[\"attention_mask\"],\n",
+ " max_new_tokens=max_new_tokens,\n",
+ " #temperature=0.2,\n",
+ " #top_p=0.95,\n",
+ " #do_sample=True,\n",
+ " repetition_penalty=1.5, #Avoid repetition.\n",
+ " early_stopping=True, #Only affects beam search; generation already stops at the EOS token.\n",
+ " eos_token_id=tokenizer.eos_token_id\n",
+ " )\n",
+ " return outputs"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "application/vnd.databricks.v1+cell": {
+ "cellMetadata": {},
+ "inputWidgets": {},
+ "nuid": "ca4d203a-5152-4947-ab34-cfd0b40a102a",
+ "showTitle": false,
+ "title": ""
+ },
+ "id": "kRLSfuo2VaTC"
+ },
+ "source": [
+ "As we want to have two different trained models, I will create two distinct prompts.\n",
+ "\n",
+ "The first model will be trained with a dataset containing prompts, and the second one with a dataset of motivational sentences.\n",
+ "\n",
+ "The first model will receive the prompt \"I want you to act as a motivational coach.\" and the second model will receive \"There are two nice things that should matter to you:\"\n",
+ "\n",
+ "But first, I'm going to collect some results from the model without Fine-Tuning."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 47,
+ "metadata": {
+ "application/vnd.databricks.v1+cell": {
+ "cellMetadata": {},
+ "inputWidgets": {},
+ "nuid": "1d4c80a9-4edd-4fcd-aef0-996f4da5cc02",
+ "showTitle": false,
+ "title": ""
+ },
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "QvStaT7cVaTC",
+ "outputId": "ab34b3cd-a849-4dff-b36d-bf25c9f55ce1"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "[\"I want you to act as a motivational coach. Don't be afraid of being challenged.\"]\n"
+ ]
+ }
+ ],
+ "source": [
+ "input_prompt = tokenizer(\"I want you to act as a motivational coach. \", return_tensors=\"pt\")\n",
+ "foundational_outputs_prompt = get_outputs(foundational_model, input_prompt, max_new_tokens=50)\n",
+ "\n",
+ "print(tokenizer.batch_decode(foundational_outputs_prompt, skip_special_tokens=True))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 69,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "1Xhm3jZMVaTD",
+ "outputId": "305f0137-6a02-4e43-9c9d-2b4ecd377937"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "['There are two nice things that should matter to you: the price and quality of your product.']\n"
+ ]
+ }
+ ],
+ "source": [
+ "input_sentences = tokenizer(\"There are two nice things that should matter to you:\", return_tensors=\"pt\")\n",
+ "foundational_outputs_sentence = get_outputs(foundational_model, input_sentences, max_new_tokens=50)\n",
+ "\n",
+ "print(tokenizer.batch_decode(foundational_outputs_sentence, skip_special_tokens=True))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "application/vnd.databricks.v1+cell": {
+ "cellMetadata": {},
+ "inputWidgets": {},
+ "nuid": "f438d43b-6b9f-445e-9df4-60ea09640764",
+ "showTitle": false,
+ "title": ""
+ },
+ "id": "OGbJTbRnVaTD"
+ },
+ "source": [
+ "Both answers are more or less correct. All the Bloom models are pre-trained and can generate accurate, sensible sentences. Let's see whether, after training, the responses stay the same or become more accurate.\n",
+ "\n",
+ "## Preparing the Datasets\n",
+ "The datasets used are:\n",
+ "* https://huggingface.co/datasets/fka/awesome-chatgpt-prompts\n",
+ "* https://huggingface.co/datasets/Abirate/english_quotes\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 49,
+ "metadata": {
+ "id": "RD8H_LLaVaTD"
+ },
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "#os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 50,
+ "metadata": {
+ "application/vnd.databricks.v1+cell": {
+ "cellMetadata": {},
+ "inputWidgets": {},
+ "nuid": "2ed62b41-e3fa-4a41-a0a9-59f35a6904f9",
+ "showTitle": false,
+ "title": ""
+ },
+ "id": "xmAp_o4PVaTD"
+ },
+ "outputs": [],
+ "source": [
+ "from datasets import load_dataset\n",
+ "\n",
+ "dataset_prompt = \"fka/awesome-chatgpt-prompts\"\n",
+ "\n",
+ "#Load and tokenize the dataset of prompts.\n",
+ "data_prompt = load_dataset(dataset_prompt)\n",
+ "data_prompt = data_prompt.map(lambda samples: tokenizer(samples[\"prompt\"]), batched=True)\n",
+ "train_sample_prompt = data_prompt[\"train\"].select(range(50))\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "display(train_sample_prompt)"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 86
+ },
+ "id": "jNlOpGbqBgcu",
+ "outputId": "3f8106b2-948b-4a7b-cf78-bd3fcc2f0338"
+ },
+ "execution_count": 51,
+ "outputs": [
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/plain": [
+ "Dataset({\n",
+ " features: ['act', 'prompt', 'input_ids', 'attention_mask'],\n",
+ " num_rows: 50\n",
+ "})"
+ ]
+ },
+ "metadata": {}
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 52,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "dZcOaE5CU658",
+ "outputId": "fb8f5081-012b-4c37-ee1f-3aef2d0f54a7"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "{'act': ['Linux Terminal'], 'prompt': ['I want you to act as a linux terminal. I will type commands and you will reply with what the terminal should show. I want you to only reply with the terminal output inside one unique code block, and nothing else. do not write explanations. do not type commands unless I instruct you to do so. when i need to tell you something in english, i will do so by putting text inside curly brackets {like this}. my first command is pwd'], 'input_ids': [[44, 4026, 1152, 427, 1769, 661, 267, 104105, 28434, 17, 473, 2152, 4105, 49123, 530, 1152, 2152, 57502, 1002, 3595, 368, 28434, 3403, 6460, 17, 473, 4026, 1152, 427, 3804, 57502, 1002, 368, 28434, 10014, 14652, 2592, 19826, 4400, 10973, 15, 530, 16915, 4384, 17, 727, 1130, 11602, 184637, 17, 727, 1130, 4105, 49123, 35262, 473, 32247, 1152, 427, 727, 1427, 17, 3262, 707, 3423, 427, 13485, 1152, 7747, 361, 170205, 15, 707, 2152, 727, 1427, 1331, 55385, 5484, 14652, 6291, 999, 117805, 731, 29726, 1119, 96, 17, 2670, 3968, 9361, 632, 269, 42512]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(train_sample_prompt[:1])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 53,
+ "metadata": {
+ "id": "WeM66LmEVaTD"
+ },
+ "outputs": [],
+ "source": [
+ "dataset_sentences = load_dataset(\"Abirate/english_quotes\")\n",
+ "\n",
+ "data_sentences = dataset_sentences.map(lambda samples: tokenizer(samples[\"quote\"]), batched=True)\n",
+ "train_sample_sentences = data_sentences[\"train\"].select(range(25))\n",
+ "train_sample_sentences = train_sample_sentences.remove_columns(['author', 'tags'])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "display(train_sample_sentences)"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 86
+ },
+ "id": "zUSG_M_nBp_E",
+ "outputId": "faf36464-de24-4512-aace-c1ff8713c1d4"
+ },
+ "execution_count": 54,
+ "outputs": [
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/plain": [
+ "Dataset({\n",
+ " features: ['quote', 'input_ids', 'attention_mask'],\n",
+ " num_rows: 25\n",
+ "})"
+ ]
+ },
+ "metadata": {}
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "application/vnd.databricks.v1+cell": {
+ "cellMetadata": {},
+ "inputWidgets": {},
+ "nuid": "b97381d4-5fe2-49d0-be5d-2fe3421edc5c",
+ "showTitle": false,
+ "title": ""
+ },
+ "id": "0-5mv1ZpVaTD"
+ },
+ "source": [
+ "## Fine-Tuning\n",
+ "\n",
+ "### PEFT configurations\n",
+ "\n",
+ "\n",
+ "API docs:\n",
+ "https://huggingface.co/docs/peft/main/en/package_reference/tuners#peft.PromptTuningConfig\n",
+ "\n",
+ "We can use the same configuration for both models to be trained.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 55,
+ "metadata": {
+ "application/vnd.databricks.v1+cell": {
+ "cellMetadata": {},
+ "inputWidgets": {},
+ "nuid": "6df8e1f1-be9e-42db-b4a4-6af7cd351004",
+ "showTitle": false,
+ "title": ""
+ },
+ "id": "sOg1Yh-oVaTD"
+ },
+ "outputs": [],
+ "source": [
+ "from peft import get_peft_model, PromptTuningConfig, TaskType, PromptTuningInit\n",
+ "\n",
+ "generation_config = PromptTuningConfig(\n",
+ " task_type=TaskType.CAUSAL_LM, #This type indicates the model will generate text.\n",
+ " prompt_tuning_init=PromptTuningInit.RANDOM, #The added virtual tokens are initialized with random numbers\n",
+ " num_virtual_tokens=NUM_VIRTUAL_TOKENS, #Number of virtual tokens to be added and trained.\n",
+ " tokenizer_name_or_path=model_name #The pre-trained model.\n",
+ ")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "an9KBtB1VaTD"
+ },
+ "source": [
+ "### Creating two Prompt Tuning Models.\n",
+ "We will create two identical prompt tuning models using the same pre-trained model and the same config."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 56,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "c_D8oDQZVaTD",
+ "outputId": "6b46ca98-3f60-49c1-dab2-91259d6387af"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "trainable params: 4,096 || all params: 559,218,688 || trainable%: 0.0007324504863471229\n"
+ ]
+ }
+ ],
+ "source": [
+ "peft_model_prompt = get_peft_model(foundational_model, generation_config)\n",
+ "peft_model_prompt.print_trainable_parameters()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 57,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "IktYfj68VaTE",
+ "outputId": "28fe03b7-4490-43ba-b913-4633e269737a"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "trainable params: 4,096 || all params: 559,218,688 || trainable%: 0.0007324504863471229\n"
+ ]
+ }
+ ],
+ "source": [
+ "peft_model_sentences = get_peft_model(foundational_model, generation_config)\n",
+ "peft_model_sentences.print_trainable_parameters()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "application/vnd.databricks.v1+cell": {
+ "cellMetadata": {},
+ "inputWidgets": {},
+ "nuid": "cff5bc33-8cfb-4144-8962-9c54362a7faa",
+ "showTitle": false,
+ "title": ""
+ },
+ "id": "i6WhJSUwVaTE"
+ },
+ "source": [
+ "**That's amazing: did you see the reduction in trainable parameters? We are going to train roughly 0.0007% of the available parameters.**\n",
+ "\n",
+ "Now we are going to create the training arguments, and we will use the same configuration in both trainings."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 58,
+ "metadata": {
+ "id": "SJoznfzjVaTE"
+ },
+ "outputs": [],
+ "source": [
+ "from transformers import TrainingArguments\n",
+ "def create_training_arguments(path, learning_rate=0.0035, epochs=6):\n",
+ " training_args = TrainingArguments(\n",
+ " output_dir=path, # Where the model predictions and checkpoints will be written\n",
+ " use_cpu=True, # This is necessary for CPU clusters.\n",
+ " auto_find_batch_size=True, # Find a suitable batch size that will fit into memory automatically\n",
+ " learning_rate= learning_rate, # Higher learning rate than full Fine-Tuning\n",
+ " num_train_epochs=epochs\n",
+ " )\n",
+ " return training_args"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 59,
+ "metadata": {
+ "application/vnd.databricks.v1+cell": {
+ "cellMetadata": {},
+ "inputWidgets": {},
+ "nuid": "54b78a8f-81f0-44c0-b0bc-dcb14891715f",
+ "showTitle": false,
+ "title": ""
+ },
+ "id": "cb1j50DSVaTE"
+ },
+ "outputs": [],
+ "source": [
+ "\n",
+ "import os\n",
+ "\n",
+ "working_dir = \"./\"\n",
+ "\n",
+ "#It is best to store the models in separate folders.\n",
+ "#Create the names of the directories where the models will be stored.\n",
+ "output_directory_prompt = os.path.join(working_dir, \"peft_outputs_prompt\")\n",
+ "output_directory_sentences = os.path.join(working_dir, \"peft_outputs_sentences\")\n",
+ "\n",
+ "#Create the directories if they do not exist.\n",
+ "if not os.path.exists(working_dir):\n",
+ " os.mkdir(working_dir)\n",
+ "if not os.path.exists(output_directory_prompt):\n",
+ " os.mkdir(output_directory_prompt)\n",
+ "if not os.path.exists(output_directory_sentences):\n",
+ " os.mkdir(output_directory_sentences)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "OC5IhO9mVaTE"
+ },
+ "source": [
+ "We need to indicate the directory where each model will be stored when creating the TrainingArguments."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 60,
+ "metadata": {
+ "id": "D4v4RSSeVaTE"
+ },
+ "outputs": [],
+ "source": [
+ "training_args_prompt = create_training_arguments(output_directory_prompt, 0.003, NUM_EPOCHS)\n",
+ "training_args_sentences = create_training_arguments(output_directory_sentences, 0.003, NUM_EPOCHS)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "application/vnd.databricks.v1+cell": {
+ "cellMetadata": {},
+ "inputWidgets": {},
+ "nuid": "c593deb6-5626-4fd9-89c2-2329e2f9b6e0",
+ "showTitle": false,
+ "title": ""
+ },
+ "id": "GdMfjk5RVaTE"
+ },
+ "source": [
+ "## Train\n",
+ "\n",
+ "We will create a Trainer object for each model to train."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 61,
+ "metadata": {
+ "id": "uVAfNdEIVaTE"
+ },
+ "outputs": [],
+ "source": [
+ "from transformers import Trainer, DataCollatorForLanguageModeling\n",
+ "def create_trainer(model, training_args, train_dataset):\n",
+ " trainer = Trainer(\n",
+ " model=model, # We pass in the PEFT version of the foundation model, bloomz-560M\n",
+ " args=training_args, #The args for the training.\n",
+ " train_dataset=train_dataset, #The dataset used to train the model.\n",
+ " data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False) # mlm=False indicates not to use masked language modeling\n",
+ " )\n",
+ " return trainer\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 62,
+ "metadata": {
+ "application/vnd.databricks.v1+cell": {
+ "cellMetadata": {},
+ "inputWidgets": {},
+ "nuid": "32e43bcf-23b2-46aa-9cf0-455b83ef4f38",
+ "showTitle": false,
+ "title": ""
+ },
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 127
+ },
+ "id": "1Sz9BeFZVaTF",
+ "outputId": "1b698470-209e-4001-fcbe-6fa8a2ac8707"
+ },
+ "outputs": [
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {}
+ },
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "TrainOutput(global_step=24, training_loss=4.4278310139973955, metrics={'train_runtime': 219.765, 'train_samples_per_second': 0.683, 'train_steps_per_second': 0.109, 'total_flos': 17825006936064.0, 'train_loss': 4.4278310139973955, 'epoch': 6.0})"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 63
+ }
+ ],
+ "source": [
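+ "#Training first model.\n",
+ "trainer_prompt = create_trainer(peft_model_prompt, training_args_prompt, train_sample_prompt)\n",
+ "trainer_prompt.train()\n",
+ "#Training second model.\n",
+ "trainer_sentences = create_trainer(peft_model_sentences, training_args_sentences, train_sample_sentences)\n",
+ "trainer_sentences.train()"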
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "z2Zsww_2VaTF"
+ },
+ "source": [
+ "In less than 10 minutes (CPU time on an M1 Pro) we trained two different models for two different tasks, both using the same foundational model as a base."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "application/vnd.databricks.v1+cell": {
+ "cellMetadata": {},
+ "inputWidgets": {},
+ "nuid": "5a6c8daf-8248-458a-9f6f-14865b4fbd2e",
+ "showTitle": false,
+ "title": ""
+ },
+ "id": "s5k10HwoVaTG"
+ },
+ "source": [
+ "## Save models\n",
+ "We are going to save the models. These models are ready to be used, as long as we have the pre-trained model from which they were created in memory."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 64,
+ "metadata": {
+ "application/vnd.databricks.v1+cell": {
+ "cellMetadata": {},
+ "inputWidgets": {},
+ "nuid": "409df5ce-e496-46d7-be2c-202a463cdc80",
+ "showTitle": false,
+ "title": ""
+ },
+ "id": "E3dn3PeMVaTG"
+ },
+ "outputs": [],
+ "source": [
+ "trainer_prompt.model.save_pretrained(output_directory_prompt)\n",
+ "trainer_sentences.model.save_pretrained(output_directory_sentences)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "application/vnd.databricks.v1+cell": {
+ "cellMetadata": {},
+ "inputWidgets": {},
+ "nuid": "fb14e3fd-bbf6-4d56-92c2-51bfe08de72a",
+ "showTitle": false,
+ "title": ""
+ },
+ "id": "rkUKpDDWVaTG"
+ },
+ "source": [
+ "## Inference\n",
+ "\n",
+ "You can load the model from the path where you saved it, and ask it to generate text from the same input we used before."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 65,
+ "metadata": {
+ "application/vnd.databricks.v1+cell": {
+ "cellMetadata": {},
+ "inputWidgets": {},
+ "nuid": "cc48af16-c117-4019-a31a-ce1c93cd21d4",
+ "showTitle": false,
+ "title": ""
+ },
+ "id": "dlqXXN8oVaTG"
+ },
+ "outputs": [],
+ "source": [
+ "from peft import PeftModel\n",
+ "\n",
+ "loaded_model_prompt = PeftModel.from_pretrained(foundational_model,\n",
+ " output_directory_prompt,\n",
+ " #device_map='auto',\n",
+ " is_trainable=False)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 66,
+ "metadata": {
+ "application/vnd.databricks.v1+cell": {
+ "cellMetadata": {},
+ "inputWidgets": {},
+ "nuid": "6b44524b-2ac5-4e74-81e6-c406d4414e42",
+ "showTitle": false,
+ "title": ""
+ },
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "-4jd3zCGVaTG",
+ "outputId": "b55454f1-f1ed-444c-b107-698778406e6e"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "['I want you to act as a motivational coach. You will be helping students learn how they can improve their performance in the classroom and at school.']\n"
+ ]
+ }
+ ],
+ "source": [
+ "loaded_model_prompt_outputs = get_outputs(loaded_model_prompt, input_prompt)\n",
+ "print(tokenizer.batch_decode(loaded_model_prompt_outputs, skip_special_tokens=True))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "SHbeFTXjVaTG"
+ },
+ "source": [
+ "Comparing both answers, we can see that something changed.\n",
+ "* ***Pretrained Model:*** *I want you to act as a motivational coach. Don't be afraid of being challenged.*\n",
+ "* ***Fine-Tuned Model:*** *I want you to act as a motivational coach. You will be helping students learn how they can improve their performance in the classroom and at school.*\n",
+ "\n",
+ "Keep in mind that we have only trained the model for a few minutes, but that has been enough to obtain a response closer to what we were looking for."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 67,
+ "metadata": {
+ "id": "MuwAsq3uVaTG"
+ },
+ "outputs": [],
+ "source": [
+ "loaded_model_prompt.load_adapter(output_directory_sentences, adapter_name=\"quotes\")\n",
+ "loaded_model_prompt.set_adapter(\"quotes\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 70,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "IQm--PWSVaTH",
+ "outputId": "3e814a6a-a380-4f2c-f887-6852a9f51002"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "['There are two nice things that should matter to you: the weather and your health.']\n"
+ ]
+ }
+ ],
+ "source": [
+ "loaded_model_sentences_outputs = get_outputs(loaded_model_prompt, input_sentences)\n",
+ "print(tokenizer.batch_decode(loaded_model_sentences_outputs, skip_special_tokens=True))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "UnR8y9gwVaTH"
+ },
+ "source": [
+ "With the second model we have a similar result.\n",
+ "* **Pretrained Model:** *There are two nice things that should matter to you: the price and quality of your product.*\n",
+ "* **Fine-Tuned Model:** *There are two nice things that should matter to you: the weather and your health.*\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "B6TUjNtGVaTH"
+ },
+ "source": [
+ "# Conclusion\n",
+ "Prompt Tuning is an amazing technique that can save us hours of training and a significant amount of money. In the notebook, we have trained two models in just a few minutes, and we can have both models in memory, providing service to different clients.\n",
+ "\n",
+ "If you want to try different combinations and models, the notebook is ready to use another model from the Bloom family.\n",
+ "\n",
+ "You can change the number of epochs to train, the number of virtual tokens, and the model in the third cell. There are many other settings you can change as well. If you're looking for a good exercise, you can replace the random initialization of the virtual tokens with a fixed value.\n",
+ "\n",
+ "*The responses of the Fine-Tuned models may vary every time we train them. I've pasted the results of one of my trainings, but the actual results may differ.*"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 68,
+ "metadata": {
+ "id": "5OMyCWasVaTH"
+ },
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "application/vnd.databricks.v1+notebook": {
+ "dashboards": [],
+ "language": "python",
+ "notebookMetadata": {
+ "pythonIndentUnit": 2
+ },
+ "notebookName": "LLM 02 - Prompt Tuning with PEFT",
+ "widgets": {}
+ },
+ "colab": {
+ "machine_shape": "hm",
+ "provenance": []
+ },
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.11.4"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
diff --git a/notebooks/spa/rag_evaluation.ipynb b/notebooks/spa/rag_evaluation.ipynb
new file mode 100644
index 00000000..16a109a9
--- /dev/null
+++ b/notebooks/spa/rag_evaluation.ipynb
@@ -0,0 +1,1515 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "4YErqpfH9jVI"
+ },
+ "source": [
+ "# RAG Evaluation\n",
+ "_Authored by: [Aymeric Roucher](https://huggingface.co/m-ric)_\n",
+ "\n",
+ "This notebook demonstrates how you can evaluate your RAG (Retrieval Augmented Generation) system by building a synthetic evaluation dataset and using LLM-as-a-judge to compute the accuracy of your system.\n",
+ "\n",
+ "For an introduction to RAG, you can check [this other cookbook](rag_zephyr_langchain)!\n",
+ "\n",
+ "RAG systems are complex: here is a RAG diagram, with all the possibilities for system enhancement marked in blue:\n",
+ "\n",
+ "\n",
+ "\n",
+ "Implementing any of these improvements can bring a huge performance boost; but changing anything is useless if you cannot monitor the impact of your changes on the system's performance!\n",
+ "So let's see how to evaluate our RAG system.\n",
+ "\n",
+ "### Evaluating RAG performance\n",
+ "\n",
+ "Since there are so many moving parts to tune with a big impact on performance, benchmarking the RAG system is crucial.\n",
+ "\n",
+ "For our evaluation pipeline, we will need:\n",
+ "1. An evaluation dataset with question - answer couples (QA couples)\n",
+ "2. An evaluator to compute the accuracy of our system on the above evaluation dataset.\n",
+ "\n",
+ "➡️ It turns out, we can use LLMs to help us all along the way!\n",
+ "1. The evaluation dataset will be synthetically generated by an LLM 🤖, and questions will be filtered out by other LLMs 🤖\n",
+ "2. An [LLM-as-a-judge](https://huggingface.co/papers/2306.05685) agent 🤖 will then perform the evaluation on this synthetic dataset.\n",
+ "\n",
+ "__Let's dig into it and start building our evaluation pipeline!__ First, we install the required dependencies."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {
+ "id": "bCKBvOcp9jVK"
+ },
+ "outputs": [],
+ "source": [
+ "!pip install -q torch transformers langchain sentence-transformers tqdm openpyxl openai pandas datasets"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {
+ "id": "k_lJFbYm9jVL"
+ },
+ "outputs": [],
+ "source": [
+ "%reload_ext autoreload\n",
+ "%autoreload 2"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {
+ "id": "oIlNZ1Mn9jVL"
+ },
+ "outputs": [],
+ "source": [
+ "from tqdm.auto import tqdm\n",
+ "import pandas as pd\n",
+ "from typing import Optional, List, Tuple\n",
+ "import json\n",
+ "import datasets\n",
+ "\n",
+ "pd.set_option(\"display.max_colwidth\", None)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from huggingface_hub import notebook_login\n",
+ "\n",
+ "notebook_login()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "zeW8P62J9jVM"
+ },
+ "source": [
+ "### Load your knowledge base"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {
+ "id": "YRbm5tNF9jVM"
+ },
+ "outputs": [],
+ "source": [
+ "ds = datasets.load_dataset(\"m-ric/huggingface_doc\", split=\"train\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "wy9CKj0M9jVM"
+ },
+ "source": [
+ "# 1. Build a synthetic dataset for evaluation\n",
+ "We first build a synthetic dataset of questions and associated contexts. The method is to get elements from our knowledge base, and ask an LLM to generate questions based on these documents.\n",
+ "\n",
+ "Then we set up other LLM agents to act as quality filters for the generated QA couples: each of them will act as the filter for a specific flaw."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "QkoEgiDg9jVM"
+ },
+ "source": [
+ "### 1.1. Prepare source documents"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "3gTOlRKO9jVM"
+ },
+ "outputs": [],
+ "source": [
+ "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
+ "from langchain.docstore.document import Document as LangchainDocument\n",
+ "\n",
+ "langchain_docs = [\n",
+ "    LangchainDocument(page_content=doc[\"text\"], metadata={\"source\": doc[\"source\"]})\n",
+ "    for doc in tqdm(ds)\n",
+ "]\n",
+ "\n",
+ "\n",
+ "text_splitter = RecursiveCharacterTextSplitter(\n",
+ "    chunk_size=2000,\n",
+ "    chunk_overlap=200,\n",
+ "    add_start_index=True,\n",
+ "    separators=[\"\\n\\n\", \"\\n\", \".\", \" \", \"\"],\n",
+ ")\n",
+ "\n",
+ "docs_processed = []\n",
+ "for doc in langchain_docs:\n",
+ "    docs_processed += text_splitter.split_documents([doc])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "WjrNhcCh9jVN"
+ },
+ "source": [
+ "### 1.2. Set up agents for question generation\n",
+ "\n",
+ "We use [Mixtral](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) for QA couple generation because it has excellent performance in leaderboards such as [Chatbot Arena](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 83,
+ "metadata": {
+ "id": "GoRySj3Q9jVN"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'This is a test context for the `@mui/material` library.\\n\\n## Installation\\n\\n```sh\\nnpm install @mui/material\\n```\\n\\n## Usage\\n\\n```jsx\\nimport React from \\'react\\';\\nimport { Button } from \\'@mui/material\\';\\n\\nfunction App() {\\n return (\\n \\n \\n \\n );\\n}\\n\\nexport default App;\\n```\\n\\n## Documentation\\n\\n- [Material-UI](https://material-ui.com/)\\n- [Material Design](https://material.io/)'"
+ ]
+ },
+ "execution_count": 83,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "from huggingface_hub import InferenceClient\n",
+ "\n",
+ "\n",
+ "repo_id = \"mistralai/Mixtral-8x7B-Instruct-v0.1\"\n",
+ "\n",
+ "llm_client = InferenceClient(\n",
+ "    model=repo_id,\n",
+ "    timeout=120,\n",
+ ")\n",
+ "\n",
+ "\n",
+ "def call_llm(inference_client: InferenceClient, prompt: str):\n",
+ "    response = inference_client.post(\n",
+ "        json={\n",
+ "            \"inputs\": prompt,\n",
+ "            \"parameters\": {\"max_new_tokens\": 1000},\n",
+ "            \"task\": \"text-generation\",\n",
+ "        },\n",
+ "    )\n",
+ "    return json.loads(response.decode())[0][\"generated_text\"]\n",
+ "\n",
+ "\n",
+ "call_llm(llm_client, \"This is a test context\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 84,
+ "metadata": {
+ "id": "hIM_DJRo9jVN"
+ },
+ "outputs": [],
+ "source": [
+ "QA_generation_prompt = \"\"\"\n",
+ "Your task is to write a factoid question and an answer given a context.\n",
+ "Your factoid question should be answerable with a specific, concise piece of factual information from the context.\n",
+ "Your factoid question should be formulated in the same style as questions users could ask in a search engine.\n",
+ "This means that your factoid question MUST NOT mention something like \"according to the passage\" or \"context\".\n",
+ "\n",
+ "Provide your answer as follows:\n",
+ "\n",
+ "Output:::\n",
+ "Factoid question: (your factoid question)\n",
+ "Answer: (your answer to the factoid question)\n",
+ "\n",
+ "Now here is the context.\n",
+ "\n",
+ "Context: {context}\\n\n",
+ "Output:::\"\"\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "lVFc-lVy9jVN"
+ },
+ "source": [
+ "Now let's generate our QA couples.\n",
+ "For this example, we generate only 10 QA couples and will load the rest from the Hub.\n",
+ "\n",
+ "But for your specific knowledge base, given that you want to get at least ~100 test samples, and accounting for the fact that we will filter out around half of these with our critique agents later on, you should generate many more: upwards of 200 samples."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "8fteqDDD9jVN"
+ },
+ "outputs": [],
+ "source": [
+ "import random\n",
+ "\n",
+ "N_GENERATIONS = 10 # We intentionally generate only 10 QA couples here for cost and time considerations\n",
+ "\n",
+ "print(f\"Generating {N_GENERATIONS} QA couples...\")\n",
+ "\n",
+ "outputs = []\n",
+ "for sampled_context in tqdm(random.sample(docs_processed, N_GENERATIONS)):\n",
+ "    # Generate QA couple\n",
+ "    output_QA_couple = call_llm(\n",
+ "        llm_client, QA_generation_prompt.format(context=sampled_context.page_content)\n",
+ "    )\n",
+ "    try:\n",
+ "        question = output_QA_couple.split(\"Factoid question: \")[-1].split(\"Answer: \")[0]\n",
+ "        answer = output_QA_couple.split(\"Answer: \")[-1]\n",
+ "        assert len(answer) < 300, \"Answer is too long\"\n",
+ "        outputs.append(\n",
+ "            {\n",
+ "                \"context\": sampled_context.page_content,\n",
+ "                \"question\": question,\n",
+ "                \"answer\": answer,\n",
+ "                \"source_doc\": sampled_context.metadata[\"source\"],\n",
+ "            }\n",
+ "        )\n",
+ "    except Exception:\n",
+ "        # Skip generations that are malformed or whose answer is too long\n",
+ "        continue"
+ ]
+ },
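+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The string splitting in the generation loop can be made a little stricter. Here is a minimal sketch, not the notebook's own parsing: `QA_PATTERN` and `parse_qa_couple` are hypothetical names, and the sketch assumes the raw output holds only the model's completion, with the `Factoid question:` and `Answer:` markers requested by the prompt.\n",
+ "\n",
+ "```python\n",
+ "import re\n",
+ "from typing import Optional, Tuple\n",
+ "\n",
+ "# Hypothetical helper, introduced for illustration only.\n",
+ "QA_PATTERN = re.compile(\n",
+ "    r'Factoid question:\\s*(?P<question>.+?)\\s*Answer:\\s*(?P<answer>.+)',\n",
+ "    re.DOTALL,\n",
+ ")\n",
+ "\n",
+ "\n",
+ "def parse_qa_couple(raw_output: str, max_answer_len: int = 300) -> Optional[Tuple[str, str]]:\n",
+ "    '''Return (question, answer), or None if the output is malformed or the answer is too long.'''\n",
+ "    match = QA_PATTERN.search(raw_output)\n",
+ "    if match is None:\n",
+ "        return None\n",
+ "    question = match.group('question').strip()\n",
+ "    answer = match.group('answer').strip()\n",
+ "    if not question or not answer or len(answer) >= max_answer_len:\n",
+ "        return None\n",
+ "    return question, answer\n",
+ "\n",
+ "\n",
+ "sample = 'Factoid question: Which method reloads a tokenizer?\\nAnswer: Tokenizer.from_file'\n",
+ "print(parse_qa_couple(sample))\n",
+ "```\n",
+ "\n",
+ "Returning `None` rather than raising would let the loop skip bad generations with a plain `if`, and stripping whitespace avoids the trailing newline that otherwise ends up in the question."
+ ]
+ },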
+ {
+ "cell_type": "code",
+ "execution_count": 102,
+ "metadata": {
+ "id": "aUlOUDv59jVN",
+ "outputId": "c9634fdb-2a7f-43a6-c4eb-e60b166b8238"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\">\n",
+ "      <th></th>\n",
+ "      <th>context</th>\n",
+ "      <th>question</th>\n",
+ "      <th>answer</th>\n",
+ "      <th>source_doc</th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n",
+ "      <th>0</th>\n",
+ "      <td>Now, we can just call the `Tokenizer.train` method with any list of files we want to use:\\n\\n&lt;tokenizerslangcontent&gt;\\n&lt;python&gt;\\n&lt;literalinclude&gt;\\n{\"path\": \"../../bindings/python/tests/documentation/test_quicktour.py\",\\n\"language\": \"python\",\\n\"start-after\": \"START train\",\\n\"end-before\": \"END train\",\\n\"dedent\": 8}\\n&lt;/literalinclude&gt;\\n&lt;/python&gt;\\n&lt;rust&gt;\\n&lt;literalinclude&gt;\\n{\"path\": \"../../tokenizers/tests/documentation.rs\",\\n\"language\": \"rust\",\\n\"start-after\": \"START quicktour_train\",\\n\"end-before\": \"END quicktour_train\",\\n\"dedent\": 4}\\n&lt;/literalinclude&gt;\\n&lt;/rust&gt;\\n&lt;node&gt;\\n&lt;literalinclude&gt;\\n{\"path\": \"../../bindings/node/examples/documentation/quicktour.test.ts\",\\n\"language\": \"js\",\\n\"start-after\": \"START train\",\\n\"end-before\": \"END train\",\\n\"dedent\": 8}\\n&lt;/literalinclude&gt;\\n&lt;/node&gt;\\n&lt;/tokenizerslangcontent&gt;\\n\\nThis should only take a few seconds to train our tokenizer on the full\\nwikitext dataset! To save the tokenizer in one file that contains all\\nits configuration and vocabulary, just use the\\n`Tokenizer.save` method:\\n\\n&lt;tokenizerslangcontent&gt;\\n&lt;python&gt;\\n&lt;literalinclude&gt;\\n{\"path\": \"../../bindings/python/tests/documentation/test_quicktour.py\",\\n\"language\": \"python\",\\n\"start-after\": \"START save\",\\n\"end-before\": \"END save\",\\n\"dedent\": 8}\\n&lt;/literalinclude&gt;\\n&lt;/python&gt;\\n&lt;rust&gt;\\n&lt;literalinclude&gt;\\n{\"path\": \"../../tokenizers/tests/documentation.rs\",\\n\"language\": \"rust\",\\n\"start-after\": \"START quicktour_save\",\\n\"end-before\": \"END quicktour_save\",\\n\"dedent\": 4}\\n&lt;/literalinclude&gt;\\n&lt;/rust&gt;\\n&lt;node&gt;\\n&lt;literalinclude&gt;\\n{\"path\": \"../../bindings/node/examples/documentation/quicktour.test.ts\",\\n\"language\": \"js\",\\n\"start-after\": \"START save\",\\n\"end-before\": \"END save\",\\n\"dedent\": 8}\\n&lt;/literalinclude&gt;\\n&lt;/node&gt;\\n&lt;/tokenizerslangcontent&gt;\\n\\nand you can reload your tokenizer from that file with the\\n`Tokenizer.from_file`\\n`classmethod`:</td>\n",
+ "      <td>How can you reload a tokenizer from a file in Python?\\n</td>\n",
+ "      <td>You can reload a tokenizer from a file in Python using the `Tokenizer.from_file` classmethod.</td>\n",
+ "      <td>huggingface/tokenizers/blob/main/docs/source-doc-builder/quicktour.mdx</td>\n",
+ "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " context \\\n",
+ "0 Now, we can just call the `Tokenizer.train` method with any list of files we want to use:\\n\\n\\n\\n\\n{\"path\": \"../../bindings/python/tests/documentation/test_quicktour.py\",\\n\"language\": \"python\",\\n\"start-after\": \"START train\",\\n\"end-before\": \"END train\",\\n\"dedent\": 8}\\n\\n\\n\\n\\n{\"path\": \"../../tokenizers/tests/documentation.rs\",\\n\"language\": \"rust\",\\n\"start-after\": \"START quicktour_train\",\\n\"end-before\": \"END quicktour_train\",\\n\"dedent\": 4}\\n\\n\\n\\n\\n{\"path\": \"../../bindings/node/examples/documentation/quicktour.test.ts\",\\n\"language\": \"js\",\\n\"start-after\": \"START train\",\\n\"end-before\": \"END train\",\\n\"dedent\": 8}\\n\\n\\n\\n\\nThis should only take a few seconds to train our tokenizer on the full\\nwikitext dataset! To save the tokenizer in one file that contains all\\nits configuration and vocabulary, just use the\\n`Tokenizer.save` method:\\n\\n\\n\\n\\n{\"path\": \"../../bindings/python/tests/documentation/test_quicktour.py\",\\n\"language\": \"python\",\\n\"start-after\": \"START save\",\\n\"end-before\": \"END save\",\\n\"dedent\": 8}\\n\\n\\n\\n\\n{\"path\": \"../../tokenizers/tests/documentation.rs\",\\n\"language\": \"rust\",\\n\"start-after\": \"START quicktour_save\",\\n\"end-before\": \"END quicktour_save\",\\n\"dedent\": 4}\\n\\n\\n\\n\\n{\"path\": \"../../bindings/node/examples/documentation/quicktour.test.ts\",\\n\"language\": \"js\",\\n\"start-after\": \"START save\",\\n\"end-before\": \"END save\",\\n\"dedent\": 8}\\n\\n\\n\\n\\nand you can reload your tokenizer from that file with the\\n`Tokenizer.from_file`\\n`classmethod`: \n",
+ "\n",
+ " question \\\n",
+ "0 How can you reload a tokenizer from a file in Python?\\n \n",
+ "\n",
+ " answer \\\n",
+ "0 You can reload a tokenizer from a file in Python using the `Tokenizer.from_file` classmethod. \n",
+ "\n",
+ " source_doc \n",
+ "0 huggingface/tokenizers/blob/main/docs/source-doc-builder/quicktour.mdx "
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "display(pd.DataFrame(outputs).head(1))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "0KG4dNtg9jVN"
+ },
+ "source": [
+ "### 1.3. Set up critique agents\n",
+ "\n",
+ "The questions generated by the previous agent can have many flaws: we should do a quality check before validating these questions.\n",
+ "\n",
+ "We thus build critique agents that will rate each question on several criteria, given in [this paper](https://huggingface.co/papers/2312.10003):\n",
+ "- **Groundedness:** can the question be answered from the given context?\n",
+ "- **Relevance:** is the question relevant to users? For instance, `\"What is the date when transformers 4.29.1 was released?\"` is not relevant for ML practitioners.\n",
+ "\n",
+ "One last failure case we've noticed is when a question is tailored to the particular setting where it was generated, but undecipherable by itself, like `\"What is the name of the function used in this guide?\"`.\n",
+ "We also build a critique agent for this criterion:\n",
+ "- **Stand-alone**: is the question understandable free of any context, for someone with domain knowledge/Internet access? The opposite of this would be `What is the function used in this article?` for a question generated from a specific blog article.\n",
+ "\n",
+ "We systematically score questions with all these agents, and whenever the score is too low for any one of the agents, we eliminate the question from our eval dataset.\n",
+ "\n",
+ "💡 ___When asking the agents to output a score, we first ask them to produce their rationale. This will help us verify scores, but most importantly, asking them to first output a rationale gives the model more tokens to think and elaborate an answer before summarizing it into a single score token.___\n",
+ "\n",
+ "We now build and run these critique agents."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 103,
+ "metadata": {
+ "id": "05aSgTGs9jVO"
+ },
+ "outputs": [],
+ "source": [
+ "question_groundedness_critique_prompt = \"\"\"\n",
+ "You will be given a context and a question.\n",
+ "Your task is to provide a 'total rating' scoring how well one can answer the given question unambiguously with the given context.\n",
+ "Give your answer on a scale of 1 to 5, where 1 means that the question is not answerable at all given the context, and 5 means that the question is clearly and unambiguously answerable with the context.\n",
+ "\n",
+ "Provide your answer as follows:\n",
+ "\n",
+ "Answer:::\n",
+ "Evaluation: (your rationale for the rating, as a text)\n",
+ "Total rating: (your rating, as a number between 1 and 5)\n",
+ "\n",
+ "You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.\n",
+ "\n",
+ "Now here are the question and context.\n",
+ "\n",
+ "Question: {question}\\n\n",
+ "Context: {context}\\n\n",
+ "Answer::: \"\"\"\n",
+ "\n",
+ "question_relevance_critique_prompt = \"\"\"\n",
+ "You will be given a question.\n",
+ "Your task is to provide a 'total rating' representing how useful this question can be to machine learning developers building NLP applications with the Hugging Face ecosystem.\n",
+ "Give your answer on a scale of 1 to 5, where 1 means that the question is not useful at all, and 5 means that the question is extremely useful.\n",
+ "\n",
+ "Provide your answer as follows:\n",
+ "\n",
+ "Answer:::\n",
+ "Evaluation: (your rationale for the rating, as a text)\n",
+ "Total rating: (your rating, as a number between 1 and 5)\n",
+ "\n",
+ "You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.\n",
+ "\n",
+ "Now here is the question.\n",
+ "\n",
+ "Question: {question}\\n\n",
+ "Answer::: \"\"\"\n",
+ "\n",
+ "question_standalone_critique_prompt = \"\"\"\n",
+ "You will be given a question.\n",
+ "Your task is to provide a 'total rating' representing how context-independent this question is.\n",
+ "Give your answer on a scale of 1 to 5, where 1 means that the question depends on additional information to be understood, and 5 means that the question makes sense by itself.\n",
+ "For instance, if the question refers to a particular setting, like 'in the context' or 'in the document', the rating must be 1.\n",
+ "The questions can contain obscure technical nouns or acronyms like Gradio, Hub, Hugging Face or Space and still be a 5: it must simply be clear to an operator with access to documentation what the question is about.\n",
+ "\n",
+ "For instance, \"What is the name of the checkpoint from which the ViT model is imported?\" should receive a 1, since there is an implicit mention of a context, thus the question is not independent from the context.\n",
+ "\n",
+ "Provide your answer as follows:\n",
+ "\n",
+ "Answer:::\n",
+ "Evaluation: (your rationale for the rating, as a text)\n",
+ "Total rating: (your rating, as a number between 1 and 5)\n",
+ "\n",
+ "You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.\n",
+ "\n",
+ "Now here is the question.\n",
+ "\n",
+ "Question: {question}\\n\n",
+ "Answer::: \"\"\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "b9tbk7ME9jVO"
+ },
+ "outputs": [],
+ "source": [
+ "print(\"Generating critique for each QA couple...\")\n",
+ "for output in tqdm(outputs):\n",
+ "    evaluations = {\n",
+ "        \"groundedness\": call_llm(\n",
+ "            llm_client,\n",
+ "            question_groundedness_critique_prompt.format(\n",
+ "                context=output[\"context\"], question=output[\"question\"]\n",
+ "            ),\n",
+ "        ),\n",
+ "        \"relevance\": call_llm(\n",
+ "            llm_client,\n",
+ "            question_relevance_critique_prompt.format(question=output[\"question\"]),\n",
+ "        ),\n",
+ "        \"standalone\": call_llm(\n",
+ "            llm_client,\n",
+ "            question_standalone_critique_prompt.format(question=output[\"question\"]),\n",
+ "        ),\n",
+ "    }\n",
+ "    try:\n",
+ "        for criterion, evaluation in evaluations.items():\n",
+ "            # The rationale comes first in the critique, followed by the final score\n",
+ "            score, rationale = (\n",
+ "                int(evaluation.split(\"Total rating: \")[-1].strip()),\n",
+ "                evaluation.split(\"Total rating: \")[-2].split(\"Evaluation: \")[1],\n",
+ "            )\n",
+ "            output.update(\n",
+ "                {\n",
+ "                    f\"{criterion}_score\": score,\n",
+ "                    f\"{criterion}_eval\": rationale,\n",
+ "                }\n",
+ "            )\n",
+ "    except Exception:\n",
+ "        # Leave the remaining scores unset (NaN in the DataFrame) when a critique cannot be parsed\n",
+ "        continue"
+ ]
+ },
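+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Since the critique prompts ask for the rationale first and the score last, the two fields can also be pulled out with a single regular expression. This is a rough sketch under assumptions, not the parsing used in this notebook: `CRITIQUE_PATTERN` and `parse_critique` are hypothetical names, and the sketch only accepts a single digit from 1 to 5 right after `Total rating:`.\n",
+ "\n",
+ "```python\n",
+ "import re\n",
+ "from typing import Optional, Tuple\n",
+ "\n",
+ "# Hypothetical helper, introduced for illustration only.\n",
+ "CRITIQUE_PATTERN = re.compile(\n",
+ "    r'Evaluation:\\s*(?P<rationale>.*?)\\s*Total rating:\\s*(?P<score>[1-5])',\n",
+ "    re.DOTALL,\n",
+ ")\n",
+ "\n",
+ "\n",
+ "def parse_critique(raw_output: str) -> Optional[Tuple[int, str]]:\n",
+ "    '''Return (score, rationale), or None when the expected markers are missing.'''\n",
+ "    match = CRITIQUE_PATTERN.search(raw_output)\n",
+ "    if match is None:\n",
+ "        return None\n",
+ "    return int(match.group('score')), match.group('rationale')\n",
+ "\n",
+ "\n",
+ "sample = 'Evaluation: The context states the answer directly.\\nTotal rating: 5'\n",
+ "print(parse_critique(sample))\n",
+ "```\n",
+ "\n",
+ "A parser of this shape tolerates extra text after the score, which `int(...)` on the raw tail does not."
+ ]
+ },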
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "IQv36Y_f9jVO"
+ },
+ "source": [
+ "Now let us filter out bad questions based on our critique agent scores:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 105,
+ "metadata": {
+ "id": "oBWuOu1b9jVO",
+ "outputId": "b32bacea-52f8-486a-96fe-5c188605c5a2"
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Evaluation dataset before filtering:\n"
+ ]
+ },
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\"><th></th><th>question</th><th>answer</th><th>groundedness_score</th><th>relevance_score</th><th>standalone_score</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>0</th><td>How can you reload a tokenizer from a file in Python?\\n</td><td>You can reload a tokenizer from a file in Python using the `Tokenizer.from_file` classmethod.</td><td>4.0</td><td>5.0</td><td>5.0</td></tr>\n",
+ "    <tr><th>1</th><td>What is the output of my\\_cool\\_auth\\_method when passing both token and use\\_auth\\_token?\\n</td><td>UserWarning: Both `token` and `use_auth_token` are passed (...). `use_auth_token` value will be ignored. \"&lt;token&gt;\"</td><td>1.0</td><td>1.0</td><td>5.0</td></tr>\n",
+ "    <tr><th>2</th><td>Which version of Gradio introduced a hotfix to support pydantic v1 and v2?\\n</td><td>3.36.1</td><td>5.0</td><td>2.0</td><td>5.0</td></tr>\n",
+ "    <tr><th>3</th><td>Which model was released by Meta AI?\\n</td><td>DINOv2</td><td>5.0</td><td>5.0</td><td>5.0</td></tr>\n",
+ "    <tr><th>4</th><td>What is the type of transformer that uses cross-attention?\\n</td><td>The type of transformer that uses cross-attention is the \"encoder-decoder\" transformer or the \"sequence-to-sequence\" transformer.</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
+ "    <tr><th>5</th><td>What is the license for the Llama-2-7b-Chat-64g-GPTQ model?\\n</td><td>The license for the Llama-2-7b-Chat-64g-GPTQ model is the llama-2-community-license.</td><td>4.0</td><td>4.0</td><td>5.0</td></tr>\n",
+ "    <tr><th>6</th><td>What is the library for easily accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing tasks?\\n</td><td>🤗 Datasets</td><td>5.0</td><td>5.0</td><td>5.0</td></tr>\n",
+ "    <tr><th>7</th><td>Which feature improves performance for many applications?\\n</td><td>The feature that improves performance for many applications is lazy loading interactive or static variants of a component individually, rather than loading both variants regardless.</td><td>4.0</td><td>4.0</td><td>5.0</td></tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " question \\\n",
+ "0 How can you reload a tokenizer from a file in Python?\\n \n",
+ "1 What is the output of my\\_cool\\_auth\\_method when passing both token and use\\_auth\\_token?\\n \n",
+ "2 Which version of Gradio introduced a hotfix to support pydantic v1 and v2?\\n \n",
+ "3 Which model was released by Meta AI?\\n \n",
+ "4 What is the type of transformer that uses cross-attention?\\n \n",
+ "5 What is the license for the Llama-2-7b-Chat-64g-GPTQ model?\\n \n",
+ "6 What is the library for easily accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing tasks?\\n \n",
+ "7 Which feature improves performance for many applications?\\n \n",
+ "\n",
+ " answer \\\n",
+ "0 You can reload a tokenizer from a file in Python using the `Tokenizer.from_file` classmethod. \n",
+ "1 UserWarning: Both `token` and `use_auth_token` are passed (...). `use_auth_token` value will be ignored. \"\" \n",
+ "2 3.36.1 \n",
+ "3 DINOv2 \n",
+ "4 The type of transformer that uses cross-attention is the \"encoder-decoder\" transformer or the \"sequence-to-sequence\" transformer. \n",
+ "5 The license for the Llama-2-7b-Chat-64g-GPTQ model is the llama-2-community-license. \n",
+ "6 🤗 Datasets \n",
+ "7 The feature that improves performance for many applications is lazy loading interactive or static variants of a component individually, rather than loading both variants regardless. \n",
+ "\n",
+ " groundedness_score relevance_score standalone_score \n",
+ "0 4.0 5.0 5.0 \n",
+ "1 1.0 1.0 5.0 \n",
+ "2 5.0 2.0 5.0 \n",
+ "3 5.0 5.0 5.0 \n",
+ "4 NaN NaN NaN \n",
+ "5 4.0 4.0 5.0 \n",
+ "6 5.0 5.0 5.0 \n",
+ "7 4.0 4.0 5.0 "
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "============================================\n",
+ "Final evaluation dataset:\n"
+ ]
+ },
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\"><th></th><th>question</th><th>answer</th><th>groundedness_score</th><th>relevance_score</th><th>standalone_score</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>0</th><td>How can you reload a tokenizer from a file in Python?\\n</td><td>You can reload a tokenizer from a file in Python using the `Tokenizer.from_file` classmethod.</td><td>4.0</td><td>5.0</td><td>5.0</td></tr>\n",
+ "    <tr><th>3</th><td>Which model was released by Meta AI?\\n</td><td>DINOv2</td><td>5.0</td><td>5.0</td><td>5.0</td></tr>\n",
+ "    <tr><th>5</th><td>What is the license for the Llama-2-7b-Chat-64g-GPTQ model?\\n</td><td>The license for the Llama-2-7b-Chat-64g-GPTQ model is the llama-2-community-license.</td><td>4.0</td><td>4.0</td><td>5.0</td></tr>\n",
+ "    <tr><th>6</th><td>What is the library for easily accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing tasks?\\n</td><td>🤗 Datasets</td><td>5.0</td><td>5.0</td><td>5.0</td></tr>\n",
+ "    <tr><th>7</th><td>Which feature improves performance for many applications?\\n</td><td>The feature that improves performance for many applications is lazy loading interactive or static variants of a component individually, rather than loading both variants regardless.</td><td>4.0</td><td>4.0</td><td>5.0</td></tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " question \\\n",
+ "0 How can you reload a tokenizer from a file in Python?\\n \n",
+ "3 Which model was released by Meta AI?\\n \n",
+ "5 What is the license for the Llama-2-7b-Chat-64g-GPTQ model?\\n \n",
+ "6 What is the library for easily accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing tasks?\\n \n",
+ "7 Which feature improves performance for many applications?\\n \n",
+ "\n",
+ " answer \\\n",
+ "0 You can reload a tokenizer from a file in Python using the `Tokenizer.from_file` classmethod. \n",
+ "3 DINOv2 \n",
+ "5 The license for the Llama-2-7b-Chat-64g-GPTQ model is the llama-2-community-license. \n",
+ "6 🤗 Datasets \n",
+ "7 The feature that improves performance for many applications is lazy loading interactive or static variants of a component individually, rather than loading both variants regardless. \n",
+ "\n",
+ " groundedness_score relevance_score standalone_score \n",
+ "0 4.0 5.0 5.0 \n",
+ "3 5.0 5.0 5.0 \n",
+ "5 4.0 4.0 5.0 \n",
+ "6 5.0 5.0 5.0 \n",
+ "7 4.0 4.0 5.0 "
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "import pandas as pd\n",
+ "\n",
+ "pd.set_option(\"display.max_colwidth\", None)\n",
+ "\n",
+ "generated_questions = pd.DataFrame.from_dict(outputs)\n",
+ "\n",
+ "print(\"Evaluation dataset before filtering:\")\n",
+ "display(\n",
+ "    generated_questions[\n",
+ "        [\n",
+ "            \"question\",\n",
+ "            \"answer\",\n",
+ "            \"groundedness_score\",\n",
+ "            \"relevance_score\",\n",
+ "            \"standalone_score\",\n",
+ "        ]\n",
+ "    ]\n",
+ ")\n",
+ "generated_questions = generated_questions.loc[\n",
+ "    (generated_questions[\"groundedness_score\"] >= 4)\n",
+ "    & (generated_questions[\"relevance_score\"] >= 4)\n",
+ "    & (generated_questions[\"standalone_score\"] >= 4)\n",
+ "]\n",
+ "print(\"============================================\")\n",
+ "print(\"Final evaluation dataset:\")\n",
+ "display(\n",
+ "    generated_questions[\n",
+ "        [\n",
+ "            \"question\",\n",
+ "            \"answer\",\n",
+ "            \"groundedness_score\",\n",
+ "            \"relevance_score\",\n",
+ "            \"standalone_score\",\n",
+ "        ]\n",
+ "    ]\n",
+ ")\n",
+ "\n",
+ "eval_dataset = datasets.Dataset.from_pandas(\n",
+ "    generated_questions, split=\"train\", preserve_index=False\n",
+ ")"
+ ]
+ },
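+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The filter keeps a question only when all three scores are at least 4. Questions whose critiques could not be parsed carry `NaN` scores and are dropped as well, because any comparison with `NaN` is false. The plain-Python sketch below illustrates that rule on made-up rows; `SCORE_COLUMNS`, `passes_all_critiques` and the sample data are hypothetical and not part of the notebook.\n",
+ "\n",
+ "```python\n",
+ "import math\n",
+ "\n",
+ "SCORE_COLUMNS = ['groundedness_score', 'relevance_score', 'standalone_score']\n",
+ "\n",
+ "\n",
+ "def passes_all_critiques(row: dict, threshold: float = 4) -> bool:\n",
+ "    '''True only if every critique score is present and at least `threshold`.'''\n",
+ "    # NaN >= threshold evaluates to False, so unparsed critiques are rejected.\n",
+ "    return all(row.get(column, math.nan) >= threshold for column in SCORE_COLUMNS)\n",
+ "\n",
+ "\n",
+ "# Made-up rows, for illustration only.\n",
+ "rows = [\n",
+ "    {'question': 'q0', 'groundedness_score': 4.0, 'relevance_score': 5.0, 'standalone_score': 5.0},\n",
+ "    {'question': 'q1', 'groundedness_score': 5.0, 'relevance_score': 2.0, 'standalone_score': 5.0},\n",
+ "    {'question': 'q2', 'groundedness_score': math.nan, 'relevance_score': math.nan, 'standalone_score': math.nan},\n",
+ "]\n",
+ "kept = [row['question'] for row in rows if passes_all_critiques(row)]\n",
+ "print(kept)\n",
+ "```"
+ ]
+ },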
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "HaOMZyu69jVO"
+ },
+ "source": [
+ "Now our synthetic evaluation dataset is complete! We can evaluate different RAG systems on this evaluation dataset.\n",
+ "\n",
+ "We have generated only a few QA couples here to reduce time and cost. But let's kick start the next part by loading a pre-generated dataset:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "Q3RRz4W79jVO"
+ },
+ "outputs": [],
+ "source": [
+ "eval_dataset = datasets.load_dataset(\"m-ric/huggingface_doc_qa_eval\", split=\"train\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "K5s19uTd9jVO"
+ },
+ "source": [
+ "# 2. Build our RAG System"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Z-mET8Dy9jVO"
+ },
+ "source": [
+ "### 2.1. Preprocessing documents to build our vector database\n",
+ "\n",
+ "- In this part, __we split the documents from our knowledge base into smaller chunks__: these will be the snippets that are picked by the Retriever, to then be ingested by the Reader LLM as supporting elements for its answer.\n",
+ "- The goal is to build semantically relevant snippets: large enough to support an answer, yet small enough to avoid diluting individual ideas.\n",
+ "\n",
+ "Many options exist for text splitting:\n",
+ "- split every `n` words / characters, but this risks cutting paragraphs or even sentences in half\n",
+ "- split after `n` words / characters, but only on sentence boundaries\n",
+ "- **recursive split** tries to preserve even more of the document structure by processing it in a tree-like way, splitting first on the largest units (chapters), then recursively splitting on smaller units (paragraphs, sentences).\n",
+ "\n",
+ "To learn more about chunking, I recommend you read [this great notebook](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/5_Levels_Of_Text_Splitting.ipynb) by Greg Kamradt.\n",
+ "\n",
+ "[This space](https://huggingface.co/spaces/m-ric/chunk_visualizer) lets you visualize how different splitting options affect the chunks you get.\n",
+ "\n",
+ "> In the following, we use Langchain's `RecursiveCharacterTextSplitter`.\n",
+ "\n",
+ "💡 _To measure chunk length in our Text Splitter, our length function will not be the count of characters, but the count of tokens in the tokenized text: since the downstream embedder processes tokens, measuring length in tokens is more relevant and empirically performs better._"
+ ]
+ },
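+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To make the recursive strategy concrete, here is a simplified sketch of the idea. It is not LangChain's implementation: `recursive_split` is a hypothetical function, it measures length in characters rather than tokens, it adds no overlap between chunks, and it assumes every separator is a non-empty string.\n",
+ "\n",
+ "```python\n",
+ "from typing import List\n",
+ "\n",
+ "\n",
+ "def recursive_split(text: str, chunk_size: int, separators: List[str]) -> List[str]:\n",
+ "    '''Split text into pieces of at most chunk_size characters, coarsest separator first.'''\n",
+ "    if len(text) <= chunk_size:\n",
+ "        return [text] if text else []\n",
+ "    if not separators:\n",
+ "        # No separator left: fall back to a hard cut every chunk_size characters.\n",
+ "        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]\n",
+ "    separator, finer_separators = separators[0], separators[1:]\n",
+ "    chunks: List[str] = []\n",
+ "    current = ''\n",
+ "    for piece in text.split(separator):\n",
+ "        candidate = piece if not current else current + separator + piece\n",
+ "        if len(candidate) <= chunk_size:\n",
+ "            # Keep merging neighbouring pieces while they still fit.\n",
+ "            current = candidate\n",
+ "            continue\n",
+ "        if current:\n",
+ "            chunks.append(current)\n",
+ "        if len(piece) <= chunk_size:\n",
+ "            current = piece\n",
+ "        else:\n",
+ "            # The piece alone is too long: retry with the finer separators.\n",
+ "            chunks.extend(recursive_split(piece, chunk_size, finer_separators))\n",
+ "            current = ''\n",
+ "    if current:\n",
+ "        chunks.append(current)\n",
+ "    return chunks\n",
+ "\n",
+ "\n",
+ "text = 'aaaa bbbb\\n\\ncccc dddd eeee'\n",
+ "print(recursive_split(text, chunk_size=10, separators=['\\n\\n', ' ']))\n",
+ "```\n",
+ "\n",
+ "The first paragraph fits in one chunk and is kept whole; only the second, longer paragraph is split again on spaces."
+ ]
+ },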
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "H4fhm55Q9jVO"
+ },
+ "outputs": [],
+ "source": [
+ "from langchain.docstore.document import Document as LangchainDocument\n",
+ "\n",
+ "RAW_KNOWLEDGE_BASE = [\n",
+ "    LangchainDocument(page_content=doc[\"text\"], metadata={\"source\": doc[\"source\"]})\n",
+ "    for doc in tqdm(ds)\n",
+ "]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "sz9Jw2_q9jVO"
+ },
+ "outputs": [],
+ "source": [
+ "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
+ "from transformers import AutoTokenizer\n",
+ "\n",
+ "\n",
+ "def split_documents(\n",
+ "    chunk_size: int,\n",
+ "    knowledge_base: List[LangchainDocument],\n",
+ "    tokenizer_name: str,\n",
+ ") -> List[LangchainDocument]:\n",
+ "    \"\"\"\n",
+ "    Split documents into chunks of at most `chunk_size` tokens and return a list of documents.\n",
+ "    \"\"\"\n",
+ "    text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(\n",
+ "        AutoTokenizer.from_pretrained(tokenizer_name),\n",
+ "        chunk_size=chunk_size,\n",
+ "        chunk_overlap=int(chunk_size / 10),\n",
+ "        add_start_index=True,\n",
+ "        strip_whitespace=True,\n",
+ "        separators=[\"\\n\\n\", \"\\n\", \".\", \" \", \"\"],\n",
+ "    )\n",
+ "\n",
+ "    docs_processed = []\n",
+ "    for doc in knowledge_base:\n",
+ "        docs_processed += text_splitter.split_documents([doc])\n",
+ "\n",
+ "    # Remove duplicates\n",
+ "    unique_texts = {}\n",
+ "    docs_processed_unique = []\n",
+ "    for doc in docs_processed:\n",
+ "        if doc.page_content not in unique_texts:\n",
+ "            unique_texts[doc.page_content] = True\n",
+ "            docs_processed_unique.append(doc)\n",
+ "\n",
+ "    return docs_processed_unique"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "QzBYfNG79jVO"
+ },
+ "source": [
+ "### 2.2. Retriever - embeddings 🗂️\n",
+ "The __retriever acts like an internal search engine__: given the user query, it returns the most relevant documents from your knowledge base.\n",
+ "\n",
+ "> For the knowledge base, we use Langchain vector databases since __they offer a convenient [FAISS](https://github.com/facebookresearch/faiss) index and allow us to keep document metadata throughout the processing__.\n",
+ "\n",
+ "🛠️ __Options included:__\n",
+ "\n",
+ "- Tune the chunking method:\n",
+ "    - Size of the chunks\n",
+ "    - Method: split on different separators, use [semantic chunking](https://python.langchain.com/docs/modules/data_connection/document_transformers/semantic-chunker)...\n",
+ "- Change the embedding model"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "LqJlIDZR9jVO"
+ },
+ "outputs": [],
+ "source": [
+ "from langchain.vectorstores import FAISS\n",
+ "from langchain_community.embeddings import HuggingFaceEmbeddings\n",
+ "from langchain_community.vectorstores.utils import DistanceStrategy\n",
+ "import os\n",
+ "\n",
+ "\n",
+ "def load_embeddings(\n",
+ "    langchain_docs: List[LangchainDocument],\n",
+ "    chunk_size: int,\n",
+ "    embedding_model_name: Optional[str] = \"thenlper/gte-small\",\n",
+ ") -> FAISS:\n",
+ "    \"\"\"\n",
+ "    Creates a FAISS index from the given embedding model and documents. Loads the index directly if it already exists.\n",
+ "\n",
+ "    Args:\n",
+ "        langchain_docs: list of documents\n",
+ "        chunk_size: size of the chunks to split the documents into\n",
+ "        embedding_model_name: name of the embedding model to use\n",
+ "\n",
+ "    Returns:\n",
+ "        FAISS index\n",
+ "    \"\"\"\n",
+ "    # load embedding_model\n",
+ "    embedding_model = HuggingFaceEmbeddings(\n",
+ "        model_name=embedding_model_name,\n",
+ "        multi_process=True,\n",
+ "        model_kwargs={\"device\": \"cuda\"},\n",
+ "        encode_kwargs={\n",
+ "            \"normalize_embeddings\": True\n",
+ "        },  # set True to compute cosine similarity\n",
+ "    )\n",
+ "\n",
+ "    # Check if embeddings already exist on disk\n",
+ "    index_name = (\n",
+ "        f\"index_chunk:{chunk_size}_embeddings:{embedding_model_name.replace('/', '~')}\"\n",
+ "    )\n",
+ "    index_folder_path = f\"./data/indexes/{index_name}/\"\n",
+ "    if os.path.isdir(index_folder_path):\n",
+ "        return FAISS.load_local(\n",
+ "            index_folder_path,\n",
+ "            embedding_model,\n",
+ "            distance_strategy=DistanceStrategy.COSINE,\n",
+ "        )\n",
+ "\n",
+ "    else:\n",
+ "        print(\"Index not found, generating it...\")\n",
+ "        docs_processed = split_documents(\n",
+ "            chunk_size,\n",
+ "            langchain_docs,\n",
+ "            embedding_model_name,\n",
+ "        )\n",
+ "        knowledge_index = FAISS.from_documents(\n",
+ "            docs_processed, embedding_model, distance_strategy=DistanceStrategy.COSINE\n",
+ "        )\n",
+ "        knowledge_index.save_local(index_folder_path)\n",
+ "        return knowledge_index"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "b6y1mQJX9jVO"
+ },
+ "source": [
+ "### 2.3. Reader - LLM 💬\n",
+ "\n",
+ "In this part, the __LLM Reader reads the retrieved documents to formulate its answer.__\n",
+ "\n",
+ "🛠️ Here we tried the following options to improve results:\n",
+ "- Switch reranking on/off\n",
+ "- Change the reader model"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "9PdpuWyP9jVP"
+ },
+ "outputs": [],
+ "source": [
+ "RAG_PROMPT_TEMPLATE = \"\"\"\n",
+ "<|system|>\n",
+ "Using the information contained in the context,\n",
+ "give a comprehensive answer to the question.\n",
+ "Respond only to the question asked, response should be concise and relevant to the question.\n",
+ "Provide the number of the source document when relevant.\n",
+ "If the answer cannot be deduced from the context, do not give an answer.\n",
+ "<|user|>\n",
+ "Context:\n",
+ "{context}\n",
+ "---\n",
+ "Now here is the question you need to answer.\n",
+ "\n",
+ "Question: {question}\n",
+ "\n",
+ "<|assistant|>\n",
+ "\"\"\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "9SDqenld9jVP"
+ },
+ "outputs": [],
+ "source": [
+ "from langchain_community.llms import HuggingFaceHub\n",
+ "\n",
+ "repo_id = \"HuggingFaceH4/zephyr-7b-beta\"\n",
+ "READER_MODEL_NAME = \"zephyr-7b-beta\"\n",
+ "\n",
+ "READER_LLM = HuggingFaceHub(\n",
+ " repo_id=repo_id,\n",
+ " task=\"text-generation\",\n",
+ " model_kwargs={\n",
+ " \"max_new_tokens\": 512,\n",
+ " \"top_k\": 30,\n",
+ " \"temperature\": 0.1,\n",
+ " \"repetition_penalty\": 1.03,\n",
+ " },\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "QZ62CbcZ9jVP"
+ },
+ "outputs": [],
+ "source": [
+ "from ragatouille import RAGPretrainedModel\n",
+ "from langchain_core.vectorstores import VectorStore\n",
+ "from langchain_core.language_models.llms import LLM\n",
+ "\n",
+ "\n",
+ "def answer_with_rag(\n",
+ "    question: str,\n",
+ "    llm: LLM,\n",
+ "    knowledge_index: VectorStore,\n",
+ "    reranker: Optional[RAGPretrainedModel] = None,\n",
+ "    num_retrieved_docs: int = 30,\n",
+ "    num_docs_final: int = 7,\n",
+ ") -> Tuple[str, List[str]]:\n",
+ "    \"\"\"Answer a question using RAG with the given knowledge index.\"\"\"\n",
+ "    # Gather documents with retriever\n",
+ "    relevant_docs = knowledge_index.similarity_search(\n",
+ "        query=question, k=num_retrieved_docs\n",
+ "    )\n",
+ "    relevant_docs = [doc.page_content for doc in relevant_docs]  # keep only the text\n",
+ "\n",
+ "    # Optionally rerank results\n",
+ "    if reranker:\n",
+ "        relevant_docs = reranker.rerank(question, relevant_docs, k=num_docs_final)\n",
+ "        relevant_docs = [doc[\"content\"] for doc in relevant_docs]\n",
+ "\n",
+ "    relevant_docs = relevant_docs[:num_docs_final]\n",
+ "\n",
+ "    # Build the final prompt\n",
+ "    context = \"\\nExtracted documents:\\n\"\n",
+ "    context += \"\".join(\n",
+ "        [f\"Document {str(i)}:::\\n\" + doc for i, doc in enumerate(relevant_docs)]\n",
+ "    )\n",
+ "\n",
+ "    final_prompt = RAG_PROMPT_TEMPLATE.format(question=question, context=context)\n",
+ "\n",
+ "    # Generate an answer\n",
+ "    answer = llm(final_prompt)\n",
+ "\n",
+ "    return answer, relevant_docs"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "hiygbqfT9jVP"
+ },
+ "source": [
+ "# 3. Benchmarking the RAG system\n",
+ "\n",
+ "The RAG system and the evaluation datasets are now ready. The last step is to judge the RAG system's output on this evaluation dataset.\n",
+ "\n",
+ "To this end, __we set up a judge agent__. ⚖️🤖\n",
+ "\n",
+ "Out of [the different RAG evaluation metrics](https://docs.ragas.io/en/latest/concepts/metrics/index.html), we choose to focus only on faithfulness since it is the best end-to-end metric of our system's performance.\n",
+ "\n",
+ "> We use GPT4 as a judge for its empirically good performance, but you could try with other models such as [kaist-ai/prometheus-13b-v1.0](https://huggingface.co/kaist-ai/prometheus-13b-v1.0) or [BAAI/JudgeLM-33B-v1.0](https://huggingface.co/BAAI/JudgeLM-33B-v1.0).\n",
+ "\n",
+ "💡 _In the evaluation prompt, we give a detailed description of each metric on the scale 1-5, as is done in [Prometheus's prompt template](https://huggingface.co/kaist-ai/prometheus-13b-v1.0): this helps the model ground its metric precisely. If instead you give the judge LLM a vague scale to work with, the outputs will not be consistent enough between different examples._\n",
+ "\n",
+ "💡 _Again, prompting the LLM to output rationale before giving its final score gives it more tokens to help it formalize and elaborate a judgement._"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "VrlMh_ZI9jVP"
+ },
+ "outputs": [],
+ "source": [
+ "def run_rag_tests(\n",
+ "    eval_dataset: datasets.Dataset,\n",
+ "    llm: BaseChatModel,\n",
+ "    knowledge_index: VectorStore,\n",
+ "    output_file: str,\n",
+ "    reranker: Optional[RAGPretrainedModel] = None,\n",
+ "    verbose: Optional[bool] = True,\n",
+ "    test_settings: Optional[str] = None,  # To document the test settings used\n",
+ "):\n",
+ "    \"\"\"Runs RAG tests on the given dataset and saves the results to the given output file.\"\"\"\n",
+ "    try:  # load previous generations if they exist\n",
+ "        with open(output_file, \"r\") as f:\n",
+ "            outputs = json.load(f)\n",
+ "    except (FileNotFoundError, json.JSONDecodeError):\n",
+ "        outputs = []\n",
+ "\n",
+ "    for example in tqdm(eval_dataset):\n",
+ "        question = example[\"question\"]\n",
+ "        if question in [output[\"question\"] for output in outputs]:\n",
+ "            continue\n",
+ "\n",
+ "        answer, relevant_docs = answer_with_rag(\n",
+ "            question, llm, knowledge_index, reranker=reranker\n",
+ "        )\n",
+ "        if verbose:\n",
+ "            print(\"=======================================================\")\n",
+ "            print(f\"Question: {question}\")\n",
+ "            print(f\"Answer: {answer}\")\n",
+ "            print(f'True answer: {example[\"answer\"]}')\n",
+ "        result = {\n",
+ "            \"question\": question,\n",
+ "            \"true_answer\": example[\"answer\"],\n",
+ "            \"source_doc\": example[\"source_doc\"],\n",
+ "            \"generated_answer\": answer,\n",
+ "            \"retrieved_docs\": [doc for doc in relevant_docs],\n",
+ "        }\n",
+ "        if test_settings:\n",
+ "            result[\"test_settings\"] = test_settings\n",
+ "        outputs.append(result)\n",
+ "\n",
+ "        with open(output_file, \"w\") as f:\n",
+ "            json.dump(outputs, f)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "Ae-3KWzK9jVP"
+ },
+ "outputs": [],
+ "source": [
+ "EVALUATION_PROMPT = \"\"\"###Task Description:\n",
+ "An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.\n",
+ "1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.\n",
+ "2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.\n",
+ "3. The output format should look as follows: \\\"Feedback: {{write a feedback for criteria}} [RESULT] {{an integer number between 1 and 5}}\\\"\n",
+ "4. Please do not generate any other opening, closing, and explanations. Be sure to include [RESULT] in your output.\n",
+ "\n",
+ "###The instruction to evaluate:\n",
+ "{instruction}\n",
+ "\n",
+ "###Response to evaluate:\n",
+ "{response}\n",
+ "\n",
+ "###Reference Answer (Score 5):\n",
+ "{reference_answer}\n",
+ "\n",
+ "###Score Rubrics:\n",
+ "[Is the response correct, accurate, and factual based on the reference answer?]\n",
+ "Score 1: The response is completely incorrect, inaccurate, and/or not factual.\n",
+ "Score 2: The response is mostly incorrect, inaccurate, and/or not factual.\n",
+ "Score 3: The response is somewhat correct, accurate, and/or factual.\n",
+ "Score 4: The response is mostly correct, accurate, and factual.\n",
+ "Score 5: The response is completely correct, accurate, and factual.\n",
+ "\n",
+ "###Feedback:\"\"\"\n",
+ "\n",
+ "from langchain.prompts.chat import (\n",
+ " ChatPromptTemplate,\n",
+ " HumanMessagePromptTemplate,\n",
+ ")\n",
+ "from langchain.schema import SystemMessage\n",
+ "\n",
+ "\n",
+ "evaluation_prompt_template = ChatPromptTemplate.from_messages(\n",
+ " [\n",
+ " SystemMessage(content=\"You are a fair evaluator language model.\"),\n",
+ " HumanMessagePromptTemplate.from_template(EVALUATION_PROMPT),\n",
+ " ]\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "ia9Mvn859jVP"
+ },
+ "outputs": [],
+ "source": [
+ "from langchain.chat_models import ChatOpenAI\n",
+ "\n",
+ "eval_chat_model = ChatOpenAI(model=\"gpt-4-1106-preview\", temperature=0)\n",
+ "evaluator_name = \"GPT4\"\n",
+ "\n",
+ "\n",
+ "def evaluate_answers(\n",
+ "    answer_path: str,\n",
+ "    eval_chat_model: BaseChatModel,\n",
+ "    evaluator_name: str,\n",
+ "    evaluation_prompt_template: ChatPromptTemplate,\n",
+ ") -> None:\n",
+ "    \"\"\"Evaluates generated answers. Modifies the given answer file in place for better checkpointing.\"\"\"\n",
+ "    answers = []\n",
+ "    if os.path.isfile(answer_path):  # load previous generations if they exist\n",
+ "        answers = json.load(open(answer_path, \"r\"))\n",
+ "\n",
+ "    for experiment in tqdm(answers):\n",
+ "        if f\"eval_score_{evaluator_name}\" in experiment:\n",
+ "            continue\n",
+ "\n",
+ "        eval_prompt = evaluation_prompt_template.format_messages(\n",
+ "            instruction=experiment[\"question\"],\n",
+ "            response=experiment[\"generated_answer\"],\n",
+ "            reference_answer=experiment[\"true_answer\"],\n",
+ "        )\n",
+ "        eval_result = eval_chat_model.invoke(eval_prompt)\n",
+ "        feedback, score = [\n",
+ "            item.strip() for item in eval_result.content.split(\"[RESULT]\")\n",
+ "        ]\n",
+ "        experiment[f\"eval_score_{evaluator_name}\"] = score\n",
+ "        experiment[f\"eval_feedback_{evaluator_name}\"] = feedback\n",
+ "\n",
+ "        with open(answer_path, \"w\") as f:\n",
+ "            json.dump(answers, f)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "EXH-szLe9jVP"
+ },
+ "source": [
+ "🚀 Let's run the tests and evaluate answers!👇"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "jW2nnvUT9jVQ"
+ },
+ "outputs": [],
+ "source": [
+ "if not os.path.exists(\"./output\"):\n",
+ "    os.mkdir(\"./output\")\n",
+ "\n",
+ "for chunk_size in [200]:  # Add other chunk sizes (in tokens) as needed\n",
+ "    for embeddings in [\"thenlper/gte-small\"]:  # Add other embeddings as needed\n",
+ "        for rerank in [True, False]:\n",
+ "            settings_name = f\"chunk:{chunk_size}_embeddings:{embeddings.replace('/', '~')}_rerank:{rerank}_reader-model:{READER_MODEL_NAME}\"\n",
+ "            output_file_name = f\"./output/rag_{settings_name}.json\"\n",
+ "\n",
+ "            print(f\"Running evaluation for {settings_name}:\")\n",
+ "\n",
+ "            print(\"Loading knowledge base embeddings...\")\n",
+ "            knowledge_index = load_embeddings(\n",
+ "                RAW_KNOWLEDGE_BASE,\n",
+ "                chunk_size=chunk_size,\n",
+ "                embedding_model_name=embeddings,\n",
+ "            )\n",
+ "\n",
+ "            print(\"Running RAG...\")\n",
+ "            reranker = (\n",
+ "                RAGPretrainedModel.from_pretrained(\"colbert-ir/colbertv2.0\")\n",
+ "                if rerank\n",
+ "                else None\n",
+ "            )\n",
+ "            run_rag_tests(\n",
+ "                eval_dataset=eval_dataset,\n",
+ "                llm=READER_LLM,\n",
+ "                knowledge_index=knowledge_index,\n",
+ "                output_file=output_file_name,\n",
+ "                reranker=reranker,\n",
+ "                verbose=False,\n",
+ "                test_settings=settings_name,\n",
+ "            )\n",
+ "\n",
+ "            print(\"Running evaluation...\")\n",
+ "            evaluate_answers(\n",
+ "                output_file_name,\n",
+ "                eval_chat_model,\n",
+ "                evaluator_name,\n",
+ "                evaluation_prompt_template,\n",
+ "            )"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "tytXV5-h9jVT"
+ },
+ "source": [
+ "### Inspect results"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "D4YDSfmr9jVT"
+ },
+ "outputs": [],
+ "source": [
+ "import glob\n",
+ "\n",
+ "outputs = []\n",
+ "for file in glob.glob(\"./output/*.json\"):\n",
+ "    output = pd.DataFrame(json.load(open(file, \"r\")))\n",
+ "    output[\"settings\"] = file\n",
+ "    outputs.append(output)\n",
+ "result = pd.concat(outputs)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "CdkXMNvS9jVT"
+ },
+ "outputs": [],
+ "source": [
+ "result[\"eval_score_GPT4\"] = result[\"eval_score_GPT4\"].apply(\n",
+ " lambda x: int(x) if isinstance(x, str) else 1\n",
+ ")\n",
+ "result[\"eval_score_GPT4\"] = (result[\"eval_score_GPT4\"] - 1) / 4"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "lgxBpid29jVT",
+ "outputId": "9a3bcf32-4b0c-4df1-c76c-3ebbca82929d"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "settings\n",
+ "./output/rag_chunk:200_embeddings:thenlper~gte-small_rerank:False_reader-model:zephyr-7b-beta.json 0.884328\n",
+ "./output/rag_chunk:200_embeddings:BAAI~bge-base-en-v1.5_rerank:False_reader-model:zephyr-7b-beta.json 0.906716\n",
+ "./output/rag_chunk:200_embeddings:BAAI~bge-base-en-v1.5_rerank:True_reader-model:zephyr-7b-beta.json 0.906716\n",
+ "./output/rag_chunk:200_embeddings:thenlper~gte-small_rerank:True_reader-model:mixtral.json 0.906716\n",
+ "./output/rag_chunk:200_embeddings:thenlper~gte-small_rerank:True_reader-model:zephyr-7b-beta.json 0.921642\n",
+ "./output/rag_chunk:200_embeddings:thenlper~gte-small_rerank:True_reader-model:mixtral0.json 0.947761\n",
+ "Name: eval_score_GPT4, dtype: float64"
+ ]
+ },
+ "execution_count": 24,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "average_scores = result.groupby(\"settings\")[\"eval_score_GPT4\"].mean()\n",
+ "average_scores.sort_values()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "pSPH9DYI9jVT"
+ },
+ "source": [
+ "## Example results\n",
+ "\n",
+ "Let us load the results that I obtained by tweaking the different options available in this notebook.\n",
+ "For more detail on why these options could work or not, see the notebook on [advanced_RAG](advanced_rag).\n",
+ "\n",
+ "As you can see in the graph below, some tweaks do not bring any improvement, while others give huge performance boosts.\n",
+ "\n",
+ "➡️ ___There is no single good recipe: you should try several different directions when tuning your RAG systems.___\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "RVOxatv99jVT"
+ },
+ "outputs": [],
+ "source": [
+ "import plotly.express as px\n",
+ "\n",
+ "scores = datasets.load_dataset(\"m-ric/rag_scores_cookbook\", split=\"train\")\n",
+ "scores = pd.Series(scores[\"score\"], index=scores[\"settings\"])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "vqK0Dg2Q9jVT"
+ },
+ "outputs": [],
+ "source": [
+ "fig = px.bar(\n",
+ " scores,\n",
+ " color=scores,\n",
+ " labels={\n",
+ " \"value\": \"Accuracy\",\n",
+ " \"settings\": \"Configuration\",\n",
+ " },\n",
+ " color_continuous_scale=\"bluered\",\n",
+ ")\n",
+ "fig.update_layout(\n",
+ " width=1000,\n",
+ " height=600,\n",
+ " barmode=\"group\",\n",
+ " yaxis_range=[0, 100],\n",
+ " title=\"Accuracy of different RAG configurations\",\n",
+ " xaxis_title=\"RAG settings\",\n",
+ " font=dict(size=15),\n",
+ ")\n",
+ "fig.layout.yaxis.ticksuffix = \"%\"\n",
+ "fig.update_coloraxes(showscale=False)\n",
+ "fig.update_traces(texttemplate=\"%{y:.1f}\", textposition=\"outside\")\n",
+ "fig.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "dPUOMWGk9jVT"
+ },
+ "source": [
+ "\n",
+ "\n",
+ "As you can see, these had varying impact on performance. In particular, tuning the chunk size is both easy and very impactful.\n",
+ "\n",
+ "But this is only our case; your results could be very different. Now that you have a robust evaluation pipeline, you can set out to explore other options! 🗺️"
+ ]
+ }
+ ],
+ "metadata": {
+ "colab": {
+ "provenance": []
+ },
+ "kernelspec": {
+ "display_name": "cookbook2",
+ "language": "python",
+ "name": "cookbook2"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.12.0"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
diff --git a/notebooks/spa/rag_llamaindex_librarian.ipynb b/notebooks/spa/rag_llamaindex_librarian.ipynb
new file mode 100644
index 00000000..848172d1
--- /dev/null
+++ b/notebooks/spa/rag_llamaindex_librarian.ipynb
@@ -0,0 +1,368 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Building A RAG Ebook \"Librarian\" Using LlamaIndex\n",
+ "\n",
+ "_Authored by: [Jonathan Jin](https://huggingface.co/jinnovation)_"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Introduction\n",
+ "\n",
+ "This notebook demonstrates how to quickly build a RAG-based \"librarian\" for your\n",
+ "local ebook library.\n",
+ "\n",
+ "Think about the last time you visited a library and took advantage of the\n",
+ "expertise of the knowledgeable staff there to help you find what you need out of\n",
+ "the troves of textbooks, novels, and other resources at the library. Our RAG\n",
+ "\"librarian\" will do the same for us, except for our own local collection of\n",
+ "ebooks.\n",
+ "\n",
+ "## Requirements\n",
+ "\n",
+ "We'd like our librarian to be **lightweight** and **run locally as much as\n",
+ "possible** with **minimal dependencies**. This means that we will leverage\n",
+ "open-source to the fullest extent possible, as well as bias towards models that\n",
+ "can be **executed locally on typical hardware, e.g. M1 Macbooks**.\n",
+ "\n",
+ "## Components\n",
+ "\n",
+ "Our solution will consist of the following components:\n",
+ "\n",
+ "- [LlamaIndex], a data framework for LLM-based applications that, unlike\n",
+ " [LangChain], is designed specifically for RAG;\n",
+ "- [Ollama], a user-friendly solution for running LLMs such as Llama 2 locally;\n",
+ "- The [`BAAI/bge-small-en-v1.5`](https://huggingface.co/BAAI/bge-small-en-v1.5)\n",
+ " embedding model, which performs [reasonably well and is reasonably lightweight\n",
+ " in size](https://huggingface.co/spaces/mteb/leaderboard);\n",
+ "- [Llama 2], which we'll run via [Ollama].\n",
+ "\n",
+ "[LlamaIndex]: https://docs.llamaindex.ai/en/stable/index.html\n",
+ "[LangChain]: https://python.langchain.com/docs/get_started/introduction\n",
+ "[Ollama]: https://ollama.com/\n",
+ "[Llama 2]: https://ollama.com/library/llama2"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Dependencies\n",
+ "\n",
+ "First let's install our dependencies."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%pip install -q \\\n",
+ " llama-index \\\n",
+ " EbookLib \\\n",
+ " html2text \\\n",
+ " llama-index-embeddings-huggingface \\\n",
+ " llama-index-llms-ollama"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "!brew install ollama"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Test Library Setup\n",
+ "\n",
+ "Next, let's create our test \"library.\"\n",
+ "\n",
+ "For simplicity's sake, let's say that our \"library\" is simply a **nested directory of `.epub` files**. We can easily see this solution generalizing to, say, a Calibre library with a `metadata.db` database file. We'll leave that extension as an exercise for the reader. 😇\n",
+ "\n",
+ "Let's pull two `.epub` files from [Project Gutenberg](https://www.gutenberg.org/) for our library."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "!mkdir -p \".test/library/jane-austen\"\n",
+ "!mkdir -p \".test/library/victor-hugo\"\n",
+ "!wget https://www.gutenberg.org/ebooks/1342.epub.noimages -O \".test/library/jane-austen/pride-and-prejudice.epub\"\n",
+ "!wget https://www.gutenberg.org/ebooks/135.epub.noimages -O \".test/library/victor-hugo/les-miserables.epub\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## RAG with LlamaIndex\n",
+ "\n",
+ "RAG with LlamaIndex, at its core, consists of the following broad phases:\n",
+ "\n",
+ "1. **Loading**, in which you tell LlamaIndex where your data lives and how to\n",
+ " load it;\n",
+ "2. **Indexing**, in which you augment your loaded data to facilitate querying, e.g. with vector embeddings;\n",
+ "3. **Querying**, in which you configure an LLM to act as the query interface for\n",
+ " your indexed data.\n",
+ "\n",
+ "This explanation only scratches the surface of what's possible with\n",
+ "LlamaIndex. For more in-depth details, I highly recommend reading the\n",
+ "[\"High-Level Concepts\" page of the LlamaIndex\n",
+ "documentation](https://docs.llamaindex.ai/en/stable/getting_started/concepts.html)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Loading\n",
+ "\n",
+ "Naturally, let's start with the **loading** phase.\n",
+ "\n",
+ "I mentioned before that LlamaIndex is designed specifically for RAG. This\n",
+ "immediately becomes obvious from its\n",
+ "[`SimpleDirectoryReader`](https://docs.llamaindex.ai/en/stable/module_guides/loading/simpledirectoryreader.html)\n",
+ "construct, which ✨ **magically** ✨ supports a whole host of multi-modal file\n",
+ "types for free. Conveniently for us, `.epub` is in the supported set."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from llama_index.core import SimpleDirectoryReader\n",
+ "\n",
+ "loader = SimpleDirectoryReader(\n",
+ " input_dir=\"./.test/\",\n",
+ " recursive=True,\n",
+ " required_exts=[\".epub\"],\n",
+ ")\n",
+ "\n",
+ "documents = loader.load_data()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "`SimpleDirectoryReader.load_data()` converts our ebooks into a set of [`Document`s](https://docs.llamaindex.ai/en/stable/api/llama_index.core.schema.Document.html) for LlamaIndex to work with.\n",
+ "\n",
+ "One important thing to note here is that the documents **have not been chunked at this stage** -- that will happen during indexing. Read on..."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Indexing\n",
+ "\n",
+ "After **loading** the data, the next step is to **index** it. This allows our RAG pipeline to look up the relevant context for a query and pass it to our LLM to **augment** its generated response. This is also where document chunking will take place.\n",
+ "\n",
+ "[`VectorStoreIndex`](https://docs.llamaindex.ai/en/stable/module_guides/indexing/vector_store_index.html)\n",
+ "is a \"default\" entrypoint for indexing in LlamaIndex. By default,\n",
+ "`VectorStoreIndex` uses a simple, in-memory dictionary to store the indices, but\n",
+ "LlamaIndex also supports [a wide variety of vector storage\n",
+ "solutions](https://docs.llamaindex.ai/en/stable/module_guides/storing/vector_stores.html)\n",
+ "for you to graduate to as you scale.\n",
+ "\n",
+ " \n",
+ "By default, LlamaIndex uses a chunk size of 1024 and a chunk overlap of\n",
+ "20. For more details, see the [LlamaIndex\n",
+ "documentation](https://docs.llamaindex.ai/en/stable/optimizing/basic_strategies/basic_strategies.html#chunk-sizes).\n",
+ "\n",
+ "\n",
+ "\n",
+ "As mentioned before, we'll use the\n",
+ "[`BAAI/bge-small-en-v1.5`](https://huggingface.co/BAAI/bge-small-en-v1.5) model to\n",
+ "generate our embeddings. By default, [LlamaIndex uses\n",
+ "OpenAI](https://docs.llamaindex.ai/en/stable/getting_started/starter_example.html)\n",
+ "(specifically `gpt-3.5-turbo`), which we'd like to avoid given our desire for a lightweight, locally-runnable end-to-end solution.\n",
+ "\n",
+ "Thankfully, LlamaIndex supports retrieving embedding models from Hugging Face through the convenient `HuggingFaceEmbedding` class, so we'll use that here."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from llama_index.embeddings.huggingface import HuggingFaceEmbedding\n",
+ "\n",
+ "embedding_model = HuggingFaceEmbedding(model_name=\"BAAI/bge-small-en-v1.5\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We'll pass that in to `VectorStoreIndex` as our embedding model to circumvent the OpenAI default behavior."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from llama_index.core import VectorStoreIndex\n",
+ "\n",
+ "index = VectorStoreIndex.from_documents(\n",
+ " documents,\n",
+ " embed_model=embedding_model,\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Querying\n",
+ "\n",
+ "Now for the final piece of the RAG puzzle -- wiring up the query layer.\n",
+ "\n",
+ "We'll use Llama 2 for the purposes of this recipe, but I encourage readers to play around with different models to see which produces the \"best\" responses here.\n",
+ "\n",
+ "First let's start up the Ollama server. Unfortunately, there is no support in the [Ollama Python client](https://github.com/ollama/ollama-python) for actually starting and stopping the server itself, so we'll have to pop out of Python land for this.\n",
+ "\n",
+ "In a separate terminal, run: `ollama serve`. Remember to terminate this after we're done here!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Now let's hook Llama 2 up to LlamaIndex and use it as the basis of our query engine."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from llama_index.llms.ollama import Ollama\n",
+ "\n",
+ "llama = Ollama(\n",
+ " model=\"llama2\",\n",
+ " request_timeout=40.0,\n",
+ ")\n",
+ "\n",
+ "query_engine = index.as_query_engine(llm=llama)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Final Result\n",
+ "\n",
+ "With that, our basic RAG librarian is set up and we can start asking questions about our library. For example:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 29,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Based on the context provided, there are two books available:\n",
+ "\n",
+ "1. \"Pride and Prejudice\" by Jane Austen\n",
+ "2. \"Les Misérables\" by Victor Hugo\n",
+ "\n",
+ "The context used to derive this answer includes:\n",
+ "\n",
+ "* The file path for each book, which provides information about the location of the book files on the computer.\n",
+ "* The titles of the books, which are mentioned in the context as being available for reading.\n",
+ "* A list of words associated with each book, such as \"epub\" and \"notebooks\", which provide additional information about the format and storage location of each book.\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(query_engine.query(\"What are the titles of all the books available? Show me the context used to derive your answer.\"))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 31,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "The main character of 'Pride and Prejudice' is Elizabeth Bennet.\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(query_engine.query(\"Who is the main character of 'Pride and Prejudice'?\"))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Conclusion and Future Improvements\n",
+ "\n",
+ "We've demonstrated how to build a basic RAG-based \"librarian\" that runs entirely locally, even on Apple silicon Macs. In doing so, we've also carried out a \"grand tour\" of LlamaIndex and how it streamlines the process of setting up RAG-based applications.\n",
+ "\n",
+ "That said, we've really only scratched the surface of what's possible here. Here are some ideas of how to refine and build upon this foundation.\n",
+ "\n",
+ "### Forcing Citations\n",
+ "\n",
+ "To guard against the risk of our librarian hallucinating, how might we require that it provide citations for everything that it says?\n",
+ "\n",
+ "### Using Extended Metadata\n",
+ "\n",
+ "Ebook library management solutions like [Calibre](https://calibre-ebook.com/) create additional metadata for ebooks in a library. This can provide information such as publisher or edition that might not be readily available in the text of the book itself. How could we extend our RAG pipeline to account for additional sources of information that aren't `.epub` files?\n",
+ "\n",
+ "### Efficient Indexing\n",
+ "\n",
+ "If we were to collect everything we built here into a script/executable, the resulting script would re-index our library on each invocation. For our tiny test library of two files, this is \"fine,\" but for any library of non-trivial size this will very quickly become annoying for users. How could we persist the embedding indices and only update them when the contents of the library have meaningfully changed, e.g. new books have been added?"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.11.8"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/notebooks/spa/rag_with_hugging_face_gemma_mongodb.ipynb b/notebooks/spa/rag_with_hugging_face_gemma_mongodb.ipynb
new file mode 100644
index 00000000..e40b1462
--- /dev/null
+++ b/notebooks/spa/rag_with_hugging_face_gemma_mongodb.ipynb
@@ -0,0 +1,4032 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Building A RAG System with Gemma, MongoDB and Open Source Models\n",
+ "\n",
+ "Authored By: [Richmond Alake](https://huggingface.co/RichmondMongo)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Step 1: Installing Libraries\n",
+ "\n",
+ "\n",
+ "The shell command sequence below installs libraries for leveraging open-source large language models (LLMs), embedding models, and database interaction functionalities. These libraries simplify the development of a RAG system, reducing the complexity to a small amount of code:\n",
+ "\n",
+ "\n",
+ "- PyMongo: A Python library for interacting with MongoDB that enables functionalities to connect to a cluster and query data stored in collections and documents.\n",
+ "- Pandas: Provides a data structure for efficient data processing and analysis using Python\n",
+ "- Hugging Face datasets: Holds audio, vision, and text datasets\n",
+ "- Hugging Face Accelerate: Abstracts the complexity of writing code that leverages hardware accelerators such as GPUs. Accelerate is leveraged in the implementation to utilise the Gemma model on GPU resources.\n",
+ "- Hugging Face Transformers: Access to a vast collection of pre-trained models\n",
+ "- Hugging Face Sentence Transformers: Provides access to sentence, text, and image embeddings."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "gVSo_nNOUsdn",
+ "outputId": "907f4738-a3b0-4c0f-b293-eff65c665c07"
+ },
+ "outputs": [],
+ "source": [
+ "!pip install datasets pandas pymongo sentence_transformers\n",
+ "!pip install -U transformers\n",
+ "# Install below if using GPU\n",
+ "!pip install accelerate"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Step 2: Data sourcing and preparation\n",
+ "\n",
+ "\n",
+ "The data utilised in this tutorial is sourced from Hugging Face datasets, specifically the \n",
+ "[AIatMongoDB/embedded_movies dataset](https://huggingface.co/datasets/AIatMongoDB/embedded_movies). "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 747
+ },
+ "id": "5gCzss27UwWw",
+ "outputId": "212cca18-a0d7-4289-bce0-ee6259fc2dba"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<table>\n",
+ " <thead>\n",
+ " <tr><th></th><th>num_mflix_comments</th><th>genres</th><th>countries</th><th>directors</th><th>fullplot</th><th>writers</th><th>awards</th><th>runtime</th><th>type</th><th>rated</th><th>metacritic</th><th>poster</th><th>languages</th><th>imdb</th><th>plot</th><th>cast</th><th>plot_embedding</th><th>title</th></tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr><th>0</th><td>0</td><td>[Action]</td><td>[USA]</td><td>[Louis J. Gasnier, Donald MacKenzie]</td><td>Young Pauline is left a lot of money when her ...</td><td>[Charles W. Goddard (screenplay), Basil Dickey...</td><td>{'nominations': 0, 'text': '1 win.', 'wins': 1}</td><td>199.0</td><td>movie</td><td>None</td><td>NaN</td><td>https://m.media-amazon.com/images/M/MV5BMzgxOD...</td><td>[English]</td><td>{'id': 4465, 'rating': 7.6, 'votes': 744}</td><td>Young Pauline is left a lot of money when her ...</td><td>[Pearl White, Crane Wilbur, Paul Panzer, Edwar...</td><td>[0.00072939653, -0.026834568, 0.013515796, -0....</td><td>The Perils of Pauline</td></tr>\n",
+ " <tr><th>1</th><td>0</td><td>[Comedy, Short, Action]</td><td>[USA]</td><td>[Alfred J. Goulding, Hal Roach]</td><td>As a penniless man worries about how he will m...</td><td>[H.M. Walker (titles)]</td><td>{'nominations': 1, 'text': '1 nomination.', 'w...</td><td>22.0</td><td>movie</td><td>TV-G</td><td>NaN</td><td>https://m.media-amazon.com/images/M/MV5BNzE1OW...</td><td>[English]</td><td>{'id': 10146, 'rating': 7.0, 'votes': 639}</td><td>A penniless young man tries to save an heiress...</td><td>[Harold Lloyd, Mildred Davis, 'Snub' Pollard, ...</td><td>[-0.022837115, -0.022941574, 0.014937485, -0.0...</td><td>From Hand to Mouth</td></tr>\n",
+ " <tr><th>2</th><td>0</td><td>[Action, Adventure, Drama]</td><td>[USA]</td><td>[Herbert Brenon]</td><td>Michael \"Beau\" Geste leaves England in disgrac...</td><td>[Herbert Brenon (adaptation), John Russell (ad...</td><td>{'nominations': 0, 'text': '1 win.', 'wins': 1}</td><td>101.0</td><td>movie</td><td>None</td><td>NaN</td><td>None</td><td>[English]</td><td>{'id': 16634, 'rating': 6.9, 'votes': 222}</td><td>Michael \"Beau\" Geste leaves England in disgrac...</td></tr>\n",
+ " </tbody>\n",
+ "</table>\n"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 48
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "MAX_ROWS = 15000\n",
+ "DOCUMENT=\"Answer\"\n",
+ "TOPIC=\"qtype\""
+ ],
+ "metadata": {
+ "execution": {
+ "iopub.status.busy": "2024-02-29T17:35:25.527688Z",
+ "iopub.execute_input": "2024-02-29T17:35:25.528374Z",
+ "iopub.status.idle": "2024-02-29T17:35:25.709895Z",
+ "shell.execute_reply.started": "2024-02-29T17:35:25.528341Z",
+ "shell.execute_reply": "2024-02-29T17:35:25.709127Z"
+ },
+ "trusted": true,
+ "id": "DZf0zCI29TD1"
+ },
+ "execution_count": 6,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Because this is just a sample, we limit the dataset to the first MAX_ROWS rows.\n",
+ "subset_data = data.head(MAX_ROWS)"
+ ],
+ "metadata": {
+ "execution": {
+ "iopub.status.busy": "2024-02-29T17:35:29.183979Z",
+ "iopub.execute_input": "2024-02-29T17:35:29.184342Z",
+ "iopub.status.idle": "2024-02-29T17:35:29.189229Z",
+ "shell.execute_reply.started": "2024-02-29T17:35:29.184313Z",
+ "shell.execute_reply": "2024-02-29T17:35:29.1881Z"
+ },
+ "trusted": true,
+ "id": "Mkoj9IrZ9TD1"
+ },
+ "execution_count": 7,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Import and configure the Vector Database\n",
+ "To store the information, I've chosen to use ChromaDB, one of the most well-known and widely used open-source vector databases.\n",
+ "\n",
+ "First we need to import ChromaDB."
+ ],
+ "metadata": {
+ "id": "rZHg_Qh69TD1"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "import chromadb"
+ ],
+ "metadata": {
+ "execution": {
+ "iopub.status.busy": "2024-02-29T17:35:31.849199Z",
+ "iopub.execute_input": "2024-02-29T17:35:31.849551Z",
+ "iopub.status.idle": "2024-02-29T17:35:32.31736Z",
+ "shell.execute_reply.started": "2024-02-29T17:35:31.849525Z",
+ "shell.execute_reply": "2024-02-29T17:35:32.316617Z"
+ },
+ "trusted": true,
+ "id": "npJhuZQw9TD1"
+ },
+ "execution_count": 8,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Now we only need to indicate the path where the vector database will be stored."
+ ],
+ "metadata": {
+ "id": "8okox5C89TD1"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "chroma_client = chromadb.PersistentClient(path=\"/path/to/persist/directory\")"
+ ],
+ "metadata": {
+ "execution": {
+ "iopub.status.busy": "2024-02-29T17:35:34.410268Z",
+ "iopub.execute_input": "2024-02-29T17:35:34.410646Z",
+ "iopub.status.idle": "2024-02-29T17:35:34.872817Z",
+ "shell.execute_reply.started": "2024-02-29T17:35:34.410614Z",
+ "shell.execute_reply": "2024-02-29T17:35:34.872039Z"
+ },
+ "trusted": true,
+ "id": "9yK6y0hm9TD1"
+ },
+ "execution_count": 9,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Filling and Querying the ChromaDB Database\n",
+ "Data in ChromaDB is stored in collections. If the collection already exists, we need to delete it before creating it again.\n",
+ "\n",
+ "In the next lines, we create the collection by calling the `create_collection` function of the `chroma_client` created above."
+ ],
+ "metadata": {
+ "id": "7MhMwk3J9TD1"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "collection_name = \"news_collection\"\n",
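+ "# Delete the collection if it already exists. We check every collection, not just the first one.\n",
+ "# getattr() is used because some Chroma versions return plain names instead of Collection objects.\n",
+ "existing_names = [getattr(c, \"name\", c) for c in chroma_client.list_collections()]\n",
+ "if collection_name in existing_names:\n",
+ "    chroma_client.delete_collection(name=collection_name)\n",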
+ "\n",
+ "collection = chroma_client.create_collection(name=collection_name)\n"
+ ],
+ "metadata": {
+ "execution": {
+ "iopub.status.busy": "2024-02-29T17:35:36.1156Z",
+ "iopub.execute_input": "2024-02-29T17:35:36.116012Z",
+ "iopub.status.idle": "2024-02-29T17:35:36.16922Z",
+ "shell.execute_reply.started": "2024-02-29T17:35:36.115977Z",
+ "shell.execute_reply": "2024-02-29T17:35:36.168504Z"
+ },
+ "trusted": true,
+ "id": "kRCsunE19TD1"
+ },
+ "execution_count": 10,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We are now ready to add the data to the collection using the `add` function. This function requires three key pieces of information:\n",
+ "\n",
+ "* In **documents** we store the content of the `Answer` column of the dataset.\n",
+ "* In **metadatas** we can provide a list of topics. I used the value in the `qtype` column.\n",
+ "* In **ids** we need to provide a unique identifier for each row. I'm creating the IDs by numbering the rows, up to `MAX_ROWS`.\n"
+ ],
+ "metadata": {
+ "id": "rdEtcETr9TD2"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "collection.add(\n",
+ "    documents=subset_data[DOCUMENT].tolist(),\n",
+ "    metadatas=[{TOPIC: topic} for topic in subset_data[TOPIC].tolist()],\n",
+ "    # One ID per row actually present, so the lists always have the same length\n",
+ "    ids=[f\"id{x}\" for x in range(len(subset_data))],\n",
+ ")"
+ ],
+ "metadata": {
+ "execution": {
+ "iopub.status.busy": "2024-02-29T17:35:38.051179Z",
+ "iopub.execute_input": "2024-02-29T17:35:38.051601Z",
+ "iopub.status.idle": "2024-02-29T17:36:38.612836Z",
+ "shell.execute_reply.started": "2024-02-29T17:35:38.051569Z",
+ "shell.execute_reply": "2024-02-29T17:36:38.611814Z"
+ },
+ "trusted": true,
+ "id": "4dDoqJE79TD2",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "36f579dc-ec60-48b1-807a-1e68113cc9f4"
+ },
+ "execution_count": 11,
+ "outputs": [
+ {
+ "metadata": {
+ "tags": null
+ },
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/root/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz: 100%|██████████| 79.3M/79.3M [00:01<00:00, 68.1MiB/s]\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Once the information is in the database, we can query it and ask for data that matches our needs. The search runs over the content of the documents and doesn't look for an exact word or phrase. The results are based on the similarity between the search terms and the content of the documents.\n",
+ "\n",
+ "Metadata isn't directly involved in the initial search, but it can be used to filter or refine the results, enabling further customization and precision.\n",
+ "\n",
+ "Let's define a function to query the ChromaDB Database."
+ ],
+ "metadata": {
+ "id": "du6-iuUisRkM"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "def query_database(query_text, n_results=10):\n",
+ "    results = collection.query(query_texts=query_text, n_results=n_results)\n",
+ "    return results"
+ ],
+ "metadata": {
+ "execution": {
+ "iopub.status.busy": "2024-02-29T17:36:38.615302Z",
+ "iopub.execute_input": "2024-02-29T17:36:38.616047Z",
+ "iopub.status.idle": "2024-02-29T17:36:38.620516Z",
+ "shell.execute_reply.started": "2024-02-29T17:36:38.616008Z",
+ "shell.execute_reply": "2024-02-29T17:36:38.619561Z"
+ },
+ "trusted": true,
+ "id": "UjdhZ4MJ9TD2"
+ },
+ "execution_count": 12,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Creating the semantic cache system\n",
+ "To implement the cache system, we will use Faiss, a library that allows storing embeddings in memory. It's quite similar to what Chroma does, but without its persistence.\n",
+ "\n",
+ "For this purpose, we will create a class called `semantic_cache` that will work with its own encoder and provide the necessary functions for the user to perform queries.\n",
+ "\n",
+ "In this class, we first query the cache implemented with Faiss, which contains the previous requests. If the closest cached question is within a specified distance threshold, the class returns the cached content. Otherwise, it fetches the result from the Chroma database.\n",
+ "\n",
+ "The cache is stored in a .json file."
+ ],
+ "metadata": {
+ "id": "CL0Crl3x9TD2"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "!pip install -q faiss-cpu==1.8.0"
+ ],
+ "metadata": {
+ "execution": {
+ "iopub.status.busy": "2024-02-29T17:36:38.621655Z",
+ "iopub.execute_input": "2024-02-29T17:36:38.621968Z",
+ "iopub.status.idle": "2024-02-29T17:36:51.313356Z",
+ "shell.execute_reply.started": "2024-02-29T17:36:38.621936Z",
+ "shell.execute_reply": "2024-02-29T17:36:51.312232Z"
+ },
+ "trusted": true,
+ "id": "6OzUbRUe9TD2"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "import faiss\n",
+ "from sentence_transformers import SentenceTransformer\n",
+ "import time\n",
+ "import json"
+ ],
+ "metadata": {
+ "id": "0yGE4cTEp3QJ"
+ },
+ "execution_count": 14,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "The `init_cache()` function below initializes the semantic cache.\n",
+ "\n",
+ "It employs the FlatL2 index, which might not be the fastest but is ideal for small datasets. Depending on the characteristics of the data intended for the cache and the expected dataset size, another index such as HNSW or IVF could be used.\n",
+ "\n",
+ "I chose this index because it aligns well with the example. It can be used with high-dimensional vectors, consumes little memory, and performs well with small datasets.\n",
+ "\n",
+ "Below I outline the key features of the various indices available with Faiss.\n",
+ "\n",
+ "* FlatL2 or FlatIP. Well-suited for small datasets, it may not be the fastest, but its memory consumption is not excessive.\n",
+ "* LSH. It works effectively with small datasets and is recommended for use with vectors of up to 128 dimensions.\n",
+ "* HNSW. Very fast but demands a substantial amount of RAM.\n",
+ "* IVF. Works well with large datasets without consuming much memory or compromising performance.\n",
+ "\n",
+ "More information about the different indices available with Faiss can be found at this link: https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index"
+ ],
+ "metadata": {
+ "id": "yi_riXHhcLy0"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "def init_cache():\n",
+ "    # 768 is the embedding dimension produced by the all-mpnet-base-v2 encoder\n",
+ "    index = faiss.IndexFlatL2(768)\n",
+ "    if index.is_trained:\n",
+ "        print('Index trained')\n",
+ "\n",
+ "    # Initialize Sentence Transformer model\n",
+ "    encoder = SentenceTransformer('all-mpnet-base-v2')\n",
+ "\n",
+ "    return index, encoder"
+ ],
+ "metadata": {
+ "id": "9poNBxbPl7xE"
+ },
+ "execution_count": 15,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "In the `retrieve_cache` function, the .json file is retrieved from disk in case there is a need to reuse the cache across sessions."
+ ],
+ "metadata": {
+ "id": "_uZzX60odo1U"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "def retrieve_cache(json_file):\n",
+ "    try:\n",
+ "        with open(json_file, 'r') as file:\n",
+ "            cache = json.load(file)\n",
+ "    except FileNotFoundError:\n",
+ "        cache = {'questions': [], 'embeddings': [], 'answers': [], 'response_text': []}\n",
+ "\n",
+ "    return cache"
+ ],
+ "metadata": {
+ "id": "FDJJ86TSp5CO"
+ },
+ "execution_count": 16,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "The `store_cache` function saves the file containing the cache data to disk."
+ ],
+ "metadata": {
+ "id": "3uO-12UIdtSD"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "def store_cache(json_file, cache):\n",
+ "    with open(json_file, 'w') as file:\n",
+ "        json.dump(cache, file)"
+ ],
+ "metadata": {
+ "id": "jx1CiKOcwKGn"
+ },
+ "execution_count": 17,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "These functions will be used within the `semantic_cache` class, which includes the search function and its initialization function.\n",
+ "\n",
+ "Even though the `ask` function has a substantial amount of code, its purpose is quite straightforward. It looks in the cache for the question closest to the one just asked by the user.\n",
+ "\n",
+ "It then checks whether that question is within the specified distance threshold. If so, it returns the response directly from the cache; otherwise, it calls the `query_database` function to retrieve the data from ChromaDB.\n",
+ "\n",
+ "I've used Euclidean distance instead of cosine distance, even though cosine is widely used for comparing vectors. This choice is based on the fact that Euclidean distance is the default metric used by Faiss. Cosine distance could also be calculated, but doing so adds complexity that may not significantly improve the final result.\n"
+ ],
+ "metadata": {
+ "id": "t9AdmnhQd2E8"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "import numpy as np\n",
+ "\n",
+ "\n",
+ "class semantic_cache:\n",
+ "    def __init__(self, json_file=\"cache_file.json\", thresold=0.35):\n",
+ "        # Initialize the Faiss index (L2 distance) and the encoder\n",
+ "        self.index, self.encoder = init_cache()\n",
+ "\n",
+ "        # Distance threshold: IndexFlatL2 reports squared Euclidean distances,\n",
+ "        # and a distance of 0 means identical sentences.\n",
+ "        # Only cached answers at or below this threshold are returned.\n",
+ "        self.euclidean_threshold = thresold\n",
+ "\n",
+ "        self.json_file = json_file\n",
+ "        self.cache = retrieve_cache(self.json_file)\n",
+ "\n",
+ "        # Rebuild the index from embeddings saved in a previous session so that\n",
+ "        # index rows stay aligned with the rows of the cache lists.\n",
+ "        if self.cache['embeddings']:\n",
+ "            self.index.add(np.array(self.cache['embeddings'], dtype='float32'))\n",
+ "\n",
+ "    def ask(self, question: str) -> str:\n",
+ "        # Return an answer from the cache, or fetch a new one from ChromaDB\n",
+ "        start_time = time.time()\n",
+ "        try:\n",
+ "            # First we obtain the embedding corresponding to the user question\n",
+ "            embedding = self.encoder.encode([question])\n",
+ "\n",
+ "            # Search for the nearest neighbor in the index.\n",
+ "            # An empty index returns I[0][0] == -1, which falls through to ChromaDB.\n",
+ "            D, I = self.index.search(embedding, 1)\n",
+ "\n",
+ "            if I[0][0] >= 0 and D[0][0] <= self.euclidean_threshold:\n",
+ "                row_id = int(I[0][0])\n",
+ "\n",
+ "                print('Answer recovered from Cache. ')\n",
+ "                print(f'{D[0][0]:.3f} smaller than {self.euclidean_threshold}')\n",
+ "                print(f'Found cache in row: {row_id} with score {D[0][0]:.3f}')\n",
+ "                print('response_text: ' + self.cache['response_text'][row_id])\n",
+ "\n",
+ "                end_time = time.time()\n",
+ "                elapsed_time = end_time - start_time\n",
+ "                print(f\"Time taken: {elapsed_time:.3f} seconds\")\n",
+ "                return self.cache['response_text'][row_id]\n",
+ "\n",
+ "            # No cached question is close enough (or the cache is empty),\n",
+ "            # so we ask ChromaDB.\n",
+ "            answer = query_database([question], 1)\n",
+ "            response_text = answer['documents'][0][0]\n",
+ "\n",
+ "            self.cache['questions'].append(question)\n",
+ "            self.cache['embeddings'].append(embedding[0].tolist())\n",
+ "            self.cache['answers'].append(answer)\n",
+ "            self.cache['response_text'].append(response_text)\n",
+ "\n",
+ "            print('Answer recovered from ChromaDB. ')\n",
+ "            print(f'response_text: {response_text}')\n",
+ "\n",
+ "            # Add the new embedding to the index and persist the cache to disk\n",
+ "            self.index.add(embedding)\n",
+ "            store_cache(self.json_file, self.cache)\n",
+ "            end_time = time.time()\n",
+ "            elapsed_time = end_time - start_time\n",
+ "            print(f\"Time taken: {elapsed_time:.3f} seconds\")\n",
+ "\n",
+ "            return response_text\n",
+ "        except Exception as e:\n",
+ "            raise RuntimeError(f\"Error during 'ask' method: {e}\") from e\n"
+ ],
+ "metadata": {
+ "execution": {
+ "iopub.status.busy": "2024-02-29T17:36:51.316449Z",
+ "iopub.execute_input": "2024-02-29T17:36:51.31678Z",
+ "iopub.status.idle": "2024-02-29T17:36:55.197427Z",
+ "shell.execute_reply.started": "2024-02-29T17:36:51.316746Z",
+ "shell.execute_reply": "2024-02-29T17:36:55.196616Z"
+ },
+ "trusted": true,
+ "id": "t_HVtwww9TD2"
+ },
+ "execution_count": 51,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### Testing the semantic_cache class."
+ ],
+ "metadata": {
+ "id": "UBWTqGM7i71N"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Initialize the cache.\n",
+ "cache = semantic_cache('4cache.json')"
+ ],
+ "metadata": {
+ "id": "JH8s8eUtCMIS",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "c613bbfc-9f84-4a96-cd39-45972e69c15b"
+ },
+ "execution_count": 52,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Index trained\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "results = cache.ask(\"How do vaccines work?\")"
+ ],
+ "metadata": {
+ "id": "mKqKLfDe_8bC",
+ "outputId": "8a92ed95-c822-4382-c6db-d9de289341af",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ }
+ },
+ "execution_count": 53,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Answer recovered from ChromaDB. \n",
+ "response_text: Summary : Shots may hurt a little, but the diseases they can prevent are a lot worse. Some are even life-threatening. Immunization shots, or vaccinations, are essential. They protect against things like measles, mumps, rubella, hepatitis B, polio, tetanus, diphtheria, and pertussis (whooping cough). Immunizations are important for adults as well as children. Your immune system helps your body fight germs by producing substances to combat them. Once it does, the immune system \"remembers\" the germ and can fight it again. Vaccines contain germs that have been killed or weakened. When given to a healthy person, the vaccine triggers the immune system to respond and thus build immunity. Before vaccines, people became immune only by actually getting a disease and surviving it. Immunizations are an easier and less risky way to become immune. NIH: National Institute of Allergy and Infectious Diseases\n",
+ "Time taken: 0.057 seconds\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "As expected, this response has been obtained from ChromaDB. The class then stores it in the cache.\n",
+ "\n",
+ "Now, if we send a second question that is quite different, the response should also be retrieved from ChromaDB. This is because the previously stored question is so dissimilar that its Euclidean distance to the new one exceeds the specified threshold."
+ ],
+ "metadata": {
+ "id": "dP7H6TypknLN"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "\n",
+ "results = cache.ask(\"Explain briefly what is a Sydenham chorea\")"
+ ],
+ "metadata": {
+ "execution": {
+ "iopub.status.busy": "2024-02-29T17:37:15.335288Z",
+ "iopub.execute_input": "2024-02-29T17:37:15.335593Z",
+ "iopub.status.idle": "2024-02-29T17:37:17.320691Z",
+ "shell.execute_reply.started": "2024-02-29T17:37:15.335566Z",
+ "shell.execute_reply": "2024-02-29T17:37:17.319671Z"
+ },
+ "trusted": true,
+ "id": "CvJykqVf9TD2",
+ "outputId": "7137919e-e417-47b3-a638-18026b3edfe6",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ }
+ },
+ "execution_count": 54,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Answer recovered from ChromaDB. \n",
+ "response_text: Sydenham chorea (SD) is a neurological disorder of childhood resulting from infection via Group A beta-hemolytic streptococcus (GABHS), the bacterium that causes rheumatic fever. SD is characterized by rapid, irregular, and aimless involuntary movements of the arms and legs, trunk, and facial muscles. It affects girls more often than boys and typically occurs between 5 and 15 years of age. Some children will have a sore throat several weeks before the symptoms begin, but the disorder can also strike up to 6 months after the fever or infection has cleared. Symptoms can appear gradually or all at once, and also may include uncoordinated movements, muscular weakness, stumbling and falling, slurred speech, difficulty concentrating and writing, and emotional instability. The symptoms of SD can vary from a halting gait and slight grimacing to involuntary movements that are frequent and severe enough to be incapacitating. The random, writhing movements of chorea are caused by an auto-immune reaction to the bacterium that interferes with the normal function of a part of the brain (the basal ganglia) that controls motor movements. Due to better sanitary conditions and the use of antibiotics to treat streptococcal infections, rheumatic fever, and consequently SD, are rare in North America and Europe. The disease can still be found in developing nations.\n",
+ "Time taken: 0.082 seconds\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Perfect, the semantic cache system is behaving as expected.\n",
+ "\n",
+ "Let's proceed to test it with a question very similar to the one we just asked.\n",
+ "\n",
+ "In this case, the response should come directly from the cache without the need to access the ChromaDB database."
+ ],
+ "metadata": {
+ "id": "8aPWvU64lxOU"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [],
+ "metadata": {
+ "id": "sPmmTGGM0pVj"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "results = cache.ask(\"Briefly explain me what is a Sydenham chorea.\")"
+ ],
+ "metadata": {
+ "execution": {
+ "iopub.status.busy": "2024-02-29T17:37:17.32865Z",
+ "iopub.execute_input": "2024-02-29T17:37:17.328926Z",
+ "iopub.status.idle": "2024-02-29T17:37:17.463363Z",
+ "shell.execute_reply.started": "2024-02-29T17:37:17.328902Z",
+ "shell.execute_reply": "2024-02-29T17:37:17.462397Z"
+ },
+ "trusted": true,
+ "id": "9_5IcGB-9TD2",
+ "outputId": "13563a7d-01f7-47d1-c345-6ad128f303c3",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ }
+ },
+ "execution_count": 55,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Answer recovered from Cache. \n",
+ "0.028 smaller than 0.35\n",
+ "Found cache in row: 1 with score 0.028\n",
+ "response_text: Sydenham chorea (SD) is a neurological disorder of childhood resulting from infection via Group A beta-hemolytic streptococcus (GABHS), the bacterium that causes rheumatic fever. SD is characterized by rapid, irregular, and aimless involuntary movements of the arms and legs, trunk, and facial muscles. It affects girls more often than boys and typically occurs between 5 and 15 years of age. Some children will have a sore throat several weeks before the symptoms begin, but the disorder can also strike up to 6 months after the fever or infection has cleared. Symptoms can appear gradually or all at once, and also may include uncoordinated movements, muscular weakness, stumbling and falling, slurred speech, difficulty concentrating and writing, and emotional instability. The symptoms of SD can vary from a halting gait and slight grimacing to involuntary movements that are frequent and severe enough to be incapacitating. The random, writhing movements of chorea are caused by an auto-immune reaction to the bacterium that interferes with the normal function of a part of the brain (the basal ganglia) that controls motor movements. Due to better sanitary conditions and the use of antibiotics to treat streptococcal infections, rheumatic fever, and consequently SD, are rare in North America and Europe. The disease can still be found in developing nations.\n",
+ "Time taken: 0.019 seconds\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "The two questions are so similar that their Euclidean distance is truly minimal, almost as if they were identical.\n",
+ "\n",
+ "Now, let's try another question, this time a bit more distinct, and observe how the system behaves."
+ ],
+ "metadata": {
+ "id": "M4H8RoXFqdwE"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "question_def = \"Write in 20 words what is a Sydenham chorea.\"\n",
+ "results = cache.ask(question_def)"
+ ],
+ "metadata": {
+ "id": "ysj5P_MBCqju",
+ "outputId": "d4639f73-dc7e-4c25-93ba-2a8c66dc7c61",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ }
+ },
+ "execution_count": 56,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Answer recovered from Cache. \n",
+ "0.228 smaller than 0.35\n",
+ "Found cache in row: 1 with score 0.228\n",
+ "response_text: Sydenham chorea (SD) is a neurological disorder of childhood resulting from infection via Group A beta-hemolytic streptococcus (GABHS), the bacterium that causes rheumatic fever. SD is characterized by rapid, irregular, and aimless involuntary movements of the arms and legs, trunk, and facial muscles. It affects girls more often than boys and typically occurs between 5 and 15 years of age. Some children will have a sore throat several weeks before the symptoms begin, but the disorder can also strike up to 6 months after the fever or infection has cleared. Symptoms can appear gradually or all at once, and also may include uncoordinated movements, muscular weakness, stumbling and falling, slurred speech, difficulty concentrating and writing, and emotional instability. The symptoms of SD can vary from a halting gait and slight grimacing to involuntary movements that are frequent and severe enough to be incapacitating. The random, writhing movements of chorea are caused by an auto-immune reaction to the bacterium that interferes with the normal function of a part of the brain (the basal ganglia) that controls motor movements. Due to better sanitary conditions and the use of antibiotics to treat streptococcal infections, rheumatic fever, and consequently SD, are rare in North America and Europe. The disease can still be found in developing nations.\n",
+ "Time taken: 0.016 seconds\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "The Euclidean distance has increased to 0.228, but it is still below the 0.35 threshold, so the system again returns the response directly from the cache."
+ ],
+ "metadata": {
+ "id": "MFzXsQwB9TD3"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Loading the model and creating the prompt\n",
+ "Time to use **transformers**, the best-known library from [Hugging Face](https://huggingface.co/) for working with language models.\n",
+ "\n",
+ "We are importing:\n",
+ "* **AutoTokenizer**: a utility class that tokenizes text inputs for a wide range of pre-trained language models.\n",
+ "* **AutoModelForCausalLM**: an interface to pre-trained models designed for text generation with causal language modeling, such as GPT models or the model used in this notebook, [Gemma-2b-it](https://huggingface.co/google/gemma-2b-it).\n",
+ "\n",
+ "Feel free to test [different models](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending); look for NLP models trained for text generation.\n"
+ ],
+ "metadata": {
+ "id": "Ot3wrq0p9TD3"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "!pip install torch"
+ ],
+ "metadata": {
+ "execution": {
+ "iopub.status.busy": "2024-02-29T17:40:32.797334Z",
+ "iopub.execute_input": "2024-02-29T17:40:32.797669Z",
+ "iopub.status.idle": "2024-02-29T17:40:44.152114Z",
+ "shell.execute_reply.started": "2024-02-29T17:40:32.797635Z",
+ "shell.execute_reply": "2024-02-29T17:40:44.151056Z"
+ },
+ "trusted": true,
+ "id": "tdxiKqjT9TD3"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "import torch\n",
+ "from torch import cuda\n",
+ "\n",
+ "# On Apple Silicon Macs, the device must be 'mps'.\n",
+ "# device = torch.device('mps')  # uncomment to use Apple Silicon\n",
+ "device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'"
+ ],
+ "metadata": {
+ "execution": {
+ "iopub.status.busy": "2024-02-29T17:40:44.153914Z",
+ "iopub.execute_input": "2024-02-29T17:40:44.15434Z",
+ "iopub.status.idle": "2024-02-29T17:40:44.160144Z",
+ "shell.execute_reply.started": "2024-02-29T17:40:44.154292Z",
+ "shell.execute_reply": "2024-02-29T17:40:44.159154Z"
+ },
+ "trusted": true,
+ "id": "pIDMTCnH9TD7"
+ },
+ "execution_count": 25,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "from transformers import AutoTokenizer, AutoModelForCausalLM\n",
+ "\n",
+ "model_id = \"google/gemma-2b-it\"\n",
+ "tokenizer = AutoTokenizer.from_pretrained(model_id)\n",
+ "model = AutoModelForCausalLM.from_pretrained(model_id,\n",
+ " device_map=device,\n",
+ " torch_dtype=torch.bfloat16)"
+ ],
+ "metadata": {
+ "execution": {
+ "iopub.status.busy": "2024-02-29T17:41:25.628412Z",
+ "iopub.execute_input": "2024-02-29T17:41:25.628804Z",
+ "iopub.status.idle": "2024-02-29T17:41:30.202141Z",
+ "shell.execute_reply.started": "2024-02-29T17:41:25.628766Z",
+ "shell.execute_reply": "2024-02-29T17:41:30.200774Z"
+ },
+ "trusted": true,
+ "id": "CU2T4lp-9TD7"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [],
+ "metadata": {
+ "id": "0kdqsEbUEywG"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Creating the extended prompt\n",
+ "To create the prompt, we use the result of querying the 'semantic_cache' class together with the question entered by the user.\n",
+ "\n",
+ "The prompt has two parts: the **relevant context**, which is the information retrieved from the database, and the **user's question**.\n",
+ "\n",
+ "We only need to join the two parts to create the prompt and then send it to the model."
+ ],
+ "metadata": {
+ "id": "GzHuFrAX9TD7"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "prompt_template = f\"Relevant context: {results}\\n\\n The user's question: {question_def}\"\n",
+ "prompt_template"
+ ],
+ "metadata": {
+ "id": "TdjbfAHhFuhS",
+ "outputId": "4090da66-328e-478e-c2d7-1957597f8786",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 209
+ }
+ },
+ "execution_count": 44,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "\"Relevant context: Sydenham chorea (SD) is a neurological disorder of childhood resulting from infection via Group A beta-hemolytic streptococcus (GABHS), the bacterium that causes rheumatic fever. SD is characterized by rapid, irregular, and aimless involuntary movements of the arms and legs, trunk, and facial muscles. It affects girls more often than boys and typically occurs between 5 and 15 years of age. Some children will have a sore throat several weeks before the symptoms begin, but the disorder can also strike up to 6 months after the fever or infection has cleared. Symptoms can appear gradually or all at once, and also may include uncoordinated movements, muscular weakness, stumbling and falling, slurred speech, difficulty concentrating and writing, and emotional instability. The symptoms of SD can vary from a halting gait and slight grimacing to involuntary movements that are frequent and severe enough to be incapacitating. The random, writhing movements of chorea are caused by an auto-immune reaction to the bacterium that interferes with the normal function of a part of the brain (the basal ganglia) that controls motor movements. Due to better sanitary conditions and the use of antibiotics to treat streptococcal infections, rheumatic fever, and consequently SD, are rare in North America and Europe. The disease can still be found in developing nations.\\n\\n The user's question: Write in 20 words what is a Sydenham chorea.\""
+ ],
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "string"
+ }
+ },
+ "metadata": {},
+ "execution_count": 44
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "input_ids = tokenizer(prompt_template, return_tensors=\"pt\").to(device)"
+ ],
+ "metadata": {
+ "id": "DmYAcXEEECnz"
+ },
+ "execution_count": 45,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Now all that remains is to send the prompt to the model and wait for its response!\n"
+ ],
+ "metadata": {
+ "id": "S-QXeuJ09TD8"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "outputs = model.generate(**input_ids,\n",
+ " max_new_tokens=256)\n",
+ "print(tokenizer.decode(outputs[0]))"
+ ],
+ "metadata": {
+ "id": "lheL8vHpEMDD",
+ "outputId": "b646d648-b88d-4a29-ab30-427d00296255",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ }
+ },
+ "execution_count": 46,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Relevant context: Sydenham chorea (SD) is a neurological disorder of childhood resulting from infection via Group A beta-hemolytic streptococcus (GABHS), the bacterium that causes rheumatic fever. SD is characterized by rapid, irregular, and aimless involuntary movements of the arms and legs, trunk, and facial muscles. It affects girls more often than boys and typically occurs between 5 and 15 years of age. Some children will have a sore throat several weeks before the symptoms begin, but the disorder can also strike up to 6 months after the fever or infection has cleared. Symptoms can appear gradually or all at once, and also may include uncoordinated movements, muscular weakness, stumbling and falling, slurred speech, difficulty concentrating and writing, and emotional instability. The symptoms of SD can vary from a halting gait and slight grimacing to involuntary movements that are frequent and severe enough to be incapacitating. The random, writhing movements of chorea are caused by an auto-immune reaction to the bacterium that interferes with the normal function of a part of the brain (the basal ganglia) that controls motor movements. Due to better sanitary conditions and the use of antibiotics to treat streptococcal infections, rheumatic fever, and consequently SD, are rare in North America and Europe. The disease can still be found in developing nations.\n",
+ "\n",
+ " The user's question: Write in 20 words what is a Sydenham chorea.\n",
+ "\n",
+ "Sure, here is a 20-word answer:\n",
+ "\n",
+ "Sydenham chorea is a neurological disorder of childhood resulting from infection via Group A beta-hemolytic streptococcus (GABHS).\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Conclusion\n",
+ "In the runs above, retrieving the answer from the cache took 0.016-0.019 seconds, compared with 0.082 seconds for the first query, which went to ChromaDB. In larger projects, this difference tends to grow, leading to improvements of 90-95%.\n",
+ "\n",
+ "We have very little data in Chroma and only a single instance of the cache class. Typically, the data behind the cache is much larger and may come from several sources, not just a single query to a vector database.\n",
+ "\n",
+ "It's common to have multiple instances of the cache class, usually based on user typology, as questions tend to repeat more among users who share common traits.\n",
+ "\n",
+ "In summary, we have created a very simple RAG (Retrieval-Augmented Generation) system and enhanced it with a semantic cache layer between the user's question and obtaining the information necessary to create the enriched prompt."
+ ],
+ "metadata": {
+ "execution": {
+ "iopub.status.busy": "2023-07-12T22:01:56.992775Z",
+ "iopub.execute_input": "2023-07-12T22:01:56.993351Z",
+ "iopub.status.idle": "2023-07-12T22:01:57.001309Z",
+ "shell.execute_reply.started": "2023-07-12T22:01:56.993305Z",
+ "shell.execute_reply": "2023-07-12T22:01:56.999431Z"
+ },
+ "id": "Uo7lGXBV9TD8"
+ }
+ }
+ ]
+}
\ No newline at end of file
diff --git a/notebooks/spa/stable_diffusion_interpolation.ipynb b/notebooks/spa/stable_diffusion_interpolation.ipynb
new file mode 100644
index 00000000..496bc062
--- /dev/null
+++ b/notebooks/spa/stable_diffusion_interpolation.ipynb
@@ -0,0 +1 @@
+{"cells":[{"cell_type":"markdown","metadata":{"id":"UsrvK8CFDiNu"},"source":["## Image Interpolation with Stable Diffusion\n","\n","_Authored by: [Rustam Akimov](https://github.com/AkiRusProd)_\n","\n","This notebook shows how to use Stable Diffusion to interpolate between images. Image interpolation using Stable Diffusion is the process of creating intermediate images that smoothly transition from one given image to another, using a generative model based on diffusion. \n","\n","Here are some use cases for image interpolation with Stable Diffusion:\n","- Data Augmentation: Stable Diffusion can augment training data for machine learning models by generating synthetic images that lie between existing data points. This can improve the generalization and robustness of machine learning models, especially in tasks like image generation, classification or object detection. \n","- Product Design and Prototyping: Stable Diffusion can aid in product design by generating variations of product designs or prototypes with subtle differences. This can be useful for exploring design alternatives, conducting user studies, or visualizing design iterations before committing to physical prototypes. \n","- Content Generation for Media Production: In media production, such as film and video editing, Stable Diffusion can be used to generate intermediate frames between key frames, enabling smoother transitions and enhancing visual storytelling. This can save time and resources compared to manual frame-by-frame editing.\n","\n","In the context of image interpolation, Stable Diffusion models are often used to navigate through a high-dimensional latent space. Each dimension represents a specific feature that has been learned by the model. By walking through this latent space and interpolating between different latent representations of images, the model is able to generate a sequence of intermediate images which show a smooth transition between the original images. 
There are two types of latents in Stable Diffusion: prompt latents and image latents. \n","\n","Latent space walking involves moving through a latent space along a path defined by two or more points (representing images). By carefully selecting these points and the path between them, it is possible to control the features of the generated images, such as style, content, and other visual aspects. \n","\n","In this notebook, we will explore examples of image interpolation using Stable Diffusion and demonstrate how latent space walking can be implemented and utilized to create smooth transitions between images. We'll provide code snippets and visualizations that illustrate this process in action, allowing for a deeper understanding of how generative models can manipulate and morph image representations in meaningful ways.\n"]},{"cell_type":"markdown","metadata":{"id":"XEhtH959DiOC"},"source":["First, let's install all the required modules."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"execution":{"iopub.execute_input":"2024-02-21T17:20:28.329767Z","iopub.status.busy":"2024-02-21T17:20:28.329050Z","iopub.status.idle":"2024-02-21T17:23:15.653382Z","shell.execute_reply":"2024-02-21T17:23:15.652310Z","shell.execute_reply.started":"2024-02-21T17:20:28.329734Z"},"id":"lbWtDpayDiOD","outputId":"b39791a6-6bdc-4f48-e016-5650c98072cf","trusted":true},"outputs":[],"source":["!pip install -q diffusers transformers xformers accelerate\n","!pip install -q numpy scipy ftfy Pillow"]},{"cell_type":"markdown","metadata":{"id":"pUUXab_IDiOE"},"source":["Import 
modules"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":171,"referenced_widgets":["3537007206fd4d57ae492d29d90bd904","85bf2410c7d2440db76163fe1df4f4bb","c2bf5a15732a4898915b0ec3cb56df8c","bf573d9fcbac4701b31e464373fdbeb0","f1f352e6964f424f9e6a4557f6e3ff97","fa5231429aa1437983dc93dc597e698e","fe24820349a0456ca103e30024490c0e","35ffc6d955a44422a06e5c304fcaeddb","1e64cce9ffc94f23921f964288c2e26d","96d5032cdaa14cdeb110f8fc3b6614c1","4a466456e448417a8b3cc442cec49632"]},"execution":{"iopub.execute_input":"2024-02-21T17:23:55.606390Z","iopub.status.busy":"2024-02-21T17:23:55.606005Z","iopub.status.idle":"2024-02-21T17:24:12.144679Z","shell.execute_reply":"2024-02-21T17:24:12.143740Z","shell.execute_reply.started":"2024-02-21T17:23:55.606352Z"},"id":"gbnW1HiEDiOE","outputId":"a3b7adb5-f455-4c75-d626-6f2a6f86455b","trusted":true},"outputs":[],"source":["import torch\n","import numpy as np\n","import os\n","\n","import time\n","\n","from PIL import Image\n","from IPython import display as IPdisplay\n","from tqdm.auto import tqdm\n","\n","from diffusers import StableDiffusionPipeline\n","from diffusers import (\n"," DDIMScheduler,\n"," PNDMScheduler,\n"," LMSDiscreteScheduler,\n"," DPMSolverMultistepScheduler,\n"," EulerAncestralDiscreteScheduler,\n"," EulerDiscreteScheduler,\n",")\n","from transformers import logging\n","\n","logging.set_verbosity_error()"]},{"cell_type":"markdown","metadata":{"id":"loFaaWVUDiOF"},"source":["Let's check if CUDA is available.\n","\n","\n"]},{"cell_type":"code","execution_count":null,"metadata":{"execution":{"iopub.execute_input":"2024-02-21T17:24:16.252373Z","iopub.status.busy":"2024-02-21T17:24:16.251653Z","iopub.status.idle":"2024-02-21T17:24:16.258088Z","shell.execute_reply":"2024-02-21T17:24:16.257085Z","shell.execute_reply.started":"2024-02-21T17:24:16.252340Z"},"id":"uGgmrhr-DiOF","trusted":true},"outputs":[],"source":["print(torch.cuda.is_available())\n","\n","device = 
torch.device(\"cuda\") if torch.cuda.is_available() else torch.device(\"cpu\")"]},{"cell_type":"markdown","metadata":{"id":"zMSGnuqmDiOF"},"source":["These settings are used to optimize the performance of PyTorch models on CUDA-enabled GPUs, especially when using mixed precision training or inference, which can be beneficial in terms of speed and memory usage. \n","Source: https://huggingface.co/docs/diffusers/optimization/fp16#memory-efficient-attention"]},{"cell_type":"code","execution_count":null,"metadata":{"execution":{"iopub.execute_input":"2024-02-21T17:24:18.661531Z","iopub.status.busy":"2024-02-21T17:24:18.661171Z","iopub.status.idle":"2024-02-21T17:24:18.666289Z","shell.execute_reply":"2024-02-21T17:24:18.665171Z","shell.execute_reply.started":"2024-02-21T17:24:18.661501Z"},"id":"JT02KQqNDiOF","trusted":true},"outputs":[],"source":["torch.backends.cudnn.benchmark = True\n","torch.backends.cuda.matmul.allow_tf32 = True"]},{"cell_type":"markdown","metadata":{"id":"_E5R20VtDiOF"},"source":["### Model\n","\n","The [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5) model and the [`LMSDiscreteScheduler`](https://huggingface.co/docs/diffusers/en/api/schedulers/lms_discrete) scheduler were chosen to generate images. Despite being an older technology, it continues to enjoy popularity due to its fast performance, minimal memory requirements, and the availability of numerous community fine-tuned models built on top of SD1.5. 
However, you are free to experiment with other models and schedulers to compare the results."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":913,"referenced_widgets":["3538495833744e9eab9b25eade484603","8db4aafac51043598a9e4b8915f4f7c4","0e63faca29cb4ca3845e402d67592b8f","bf4b67c0a0034b1ab576066d366d934e","e09eaa48528a48629b19020aa65c5bdc","57d02fadf8fc4df3b1fbee18e45f9199","2f6e125affb54674b5f49beaf3612bee","f736f2fc44444ebf832fb2ead6ea0fd0","c41432dd085f41aeaa1e1d9cac4872e7","c577a158f88040c0979b319fb9fb89b0","f24c9ef7c3f04da783aaffd3e5d48de7","6401b1635d62478ba644075a32494384","72a5fcb23802406eb22e2758be01052c","0ba055eeeed04e91b684072740e07f0a","790424c8674d42d59c75ba2fd4021e3a","ff697b2e046a4a36a78ac362cae5c2a9","eb3e42c270b04086a2a04ae286684f6b","d9816365a87340dfa06fe9a37811a81b","309d89a221ab4010af0d85d3ae90a1c3","1cdaf6fe1392453288997f672af80a0d","5dc635e136f741608e73455581822408","670c96aa182f4defbfce4a9fd0b7f96b","92d3f7562ee04b789c44656f83d12528","bd7b9e9757d24c568432eea484271376","9c12a4d25771464b84b35804695fd50a","fa1d984ee1864975a4c543f1dcb3aa42","b89baafcd7e944b088bf0c9b1e839ba5","fb34d344590e43438906722b1ab958f0","ed6a871ff054410b8d018b3b97c75c60","909ad3aefa5b4c65931f300b3e9655dd","93159e8e4bf84bc9a35c4325dc6cc851","f2f9506fc02a4624a8a1c08f1f6abdb2","2f010760c1a146238f35af00568ccb11","20c46fc3077b4dcc855252c222002569","9b597fc5d5cf4fecabcfbc7a4cfa1ee9","17ceef0b617c4f52aa0bb5ec12113fcd","e20cd706fa304bc190e92ae7d27b5b38","9a3f552babd34cdeb8fed4ce1b1b33a7","f8db6aae6e3a468ba3805991e19e1f45","681cc61581d84298b4798b5c43818e31","45d877fafa5a4772ba5d62557843bb51","62d5802bdc7a49efbcb47889e29e924c","e4f21da1f63a4a819485cbe7e5ae306b","c7f341fe95ca426d9502594bd48d36f6","3214f06917264be88837314b26375a1c","00b88de69713463485206e4b7c8c3c04","f63bebf985a046ce9ba9d567db0b7ca4","a49fd3377dde41dab42677091dc7bd04","ed1dbb98c8d34dce8df70e4698a21911","95d1047a0a644262aa385987c9a331b4","c565ba8910c34e138
c6cd10e9b06d673","5ffaa53c74d3413b8f3c3fe2fc5cc075","aedf99983c5043fe8d634b6e3b56e1ae","bf1489df7c98442caadd2417026bffdf","590b0de8d8f84d1ab9a243d88380c295","44a56b89efa649edb669abed3605e576","c26781e6cc214b6b866b8f11673e9c00","cddd14dde34c42f59ed71870e558246e","399fdb41f7fd4f1489c1bf4814b53907","f7f49c60f4ca41efa9a3b6b33c418f73","9ef6f8a591244419916d980d5883e03e","4d26d7dd13d148b3b6e06f10b589f2a8","8c690a7356af4547902b72aa6a20328d","1e4ad02a5afb431a96626230081054cc","2fd8069de9754a87ae04ec7b2c4b380a","da6373c3704d45869e2fafdf9772d3d1","8c9b21767c5d4741b717b82f1e4a0e03","a83f68f1b1d640298019af11c5198a7d","55a116f1fa634632a061a4ad8bb75ec3","58910b48c70a4b3dabf87b6a12004e70","2dc63f0fe271457f890fd2067631ad75","3fcfdbffff6149fe880b0702cf8162f3","eb96b8f49d6746ec8f46e65e59a3fad6","e03272733bd84e2e9edcb83391e1cfae","184d9e1fa8d241c386567300db4e2c8c","51ceac8abb23437f952e07d2daaf0dae","a06b4275e1444e72b921d631feea90ba","d7c9f6b399524bc596e84641687ba29a","1c8a25c7f70145df9722d3702fc6d2dd","ae44d548c5164e8fb5e85f1ab19da9ac","1325ea426c9747828601a9175a9b0248","97fe7d3ccc984694aa42a56ba930c64e","c1a5498ffe0f407397d2a8c77e35de84","5eefef112ef249cc97ecd63f19ea122d","defa4622267e49b8a54d6aceea082c39","59081d8cdf4e43228a20f3fe986926b2","c14c4f3d31a447b38d359acc6e29496d","f4f167561c2d452785b6e59c2ce61b28","4cc69a88629f41dc80c552b58d3f5eaf","6cc2e02ee7b74f3aa2f71bf0725190de","da714c09550f43849e5a2502b092403a","7370d528153e473980ac0e701c9e6825","0ba17f55cbf941feaa0b7e8959a94591","a903c8a2c22c48bdb70024665bc1cb0f","8c29df6a485946c090b39b49240427fd","b0bcd8201363473ea0e4ace230443446","d9f1220cc4f5440b97e35151ea76ed00","0eaebd8b9eb14feca8a11a8c3077b196","51061a6b42584e2bbac74da6a6f2a1da","e0216d9707cb4d10be57c4864521c376","2e72912929c14d9f8843bc028b75de77","feeaea5e7b524ed09bd08c73b7ed5e17","1ebb24b8e6f44608ae8c11747ef7c42d","c523aec40bb54986a4f40924c81fe5da","f23d6c684c3a416d8b43fc642653eec8","16e98aec73d048d693fd0e44df2220e7","495a6de427ce47eda18a1570fd8f5f9d","eac1b421700d492ba398d5
ac609b5741","80509211c7dc47e8a673ed80fca6d8e1","e1f4e7fa8b1f4530b580f101a9fad304","4b14e789cc2446eda7a94a5b1259e738","865eb0b3254246948f532e2c6dd02bd1","7e678fd9b4284bf39ecd29de4d7624a3","adfbabeb20a3428c8fd6ec5b79830c34","6a012f24803646369ac97be41e5998f3","521ecdf554b84958ba4ee487c9621ffc","c6a66e5c516c4dbbbc1ca203c6a2d0db","66a43c50816c4e3fa88852c3d2c3b0c7","46df7a569a984f2a8d60b021e2366550","cffa386bd54e4baa8947d4d51c8e54a4","a5926a9d27b44a2d961f36d7fd36da15","35cf8424313d43dea56ca590edf70b26","5dc6caab4749403aab23ce95993f9ad0","84e32b4370a94fe498b3b01b6632daf3","16c7e883c9ec49329b68e61b276d5fac","5518095d648d4b2b95647c746d70c4d8","de350491cd2e48a19541f6e97b9f176e","cd7fa6f0b6844fd4a41b97cbe39e0c2f","683585ba77df463a9bcb2f8ce4794747","3b99c2eba3ab4ea2a44b23ec0916ce2d","b18d19fb59d8458c9e9048c1458ee95c","d5a23fbc6e634b02bf4f2540e9def457","1c3a77c578d644b09771446ed559c575","b08c232e469646f89a28f4371f0e7699","5e37d843836b42dfb62f728181ee4dfa","0755693c5b854000af081f2818162683","2e194e5ccbd349b093481e24148de89c","9d2ff155648146058d2d359e474159dc","0220a7b0d67a482a8ab7e9ac6372d380","034a1eb240694704a8052783583caefe","03fdfd9eb2e343df8af4ff2b06ac8eb1","393b21bd730f4a8b93d1653ebafaea09","0094122c8eaf47e6851a3449e2fb0086","7850ea0076da49639ca986a4885d7048","f04f25b18eaa43f6a2044dac7aba8372","6d9421a31914451cac51c71aee8b1ce4","a4b3ef66956d494887e796f87b4278f1","1638b8dbe07a4d249167a3d34ac9adc6","163ec8057136471bb1f460d657c4aa6c","ff1655111fd04c4785d7e5ec3747629c","566d76e028b643e18729621e82531939","2642af1a55cf452a93e528fb25f1c8cb","a428e985410341e0ba04359af465681e","bfbec38740f8413d93469274aeccbf23","70cf0bb1ada946dd9717fc2a493b7805","8caab95721354f08999bea8dc6105b4e","221facd121a14faf9d664f644935b0ae","bac1fcd4cf1847ed89bc5e01ae435e24","bc4ed6312ae44ba7a9e21d43d7edd48a","fd4eebbe68204eaa802186836c372b93","6a7f80bb3e534eb2a48d3d29c9ac3988","e431b2a589524545a5ccbb79d2c7bab9","90dccdfd4085472f8f9ba0535d85d327","d1d564e827cf4a71af9aa87b9d5696c8","1ad180dc107946b9a1b554a4b98
ee514","88c6c5bcc44d46089aa3efaa7fb9e452","e0280b4f0172481ea7664bfb96d1bb1c","3cd43805ef564f6696905a2465ea4467","ab7f8b09b1d8452995d66e6f0df83faa","3959b0871ea840839a383c895cfbe916","60aa6e24133c4e67957bf953e5b10f4d","5ce301bccee049cf9664800d63e2e2eb","da25ec24a6b44ce9a51bcc2d440d0258","41ea6163d1b04f8f89cd1f9ec9e72847","31107fa83a974eca83fa968ae4eae909","36a0330c9e1d48a49a739feaef34ddc0"]},"execution":{"iopub.execute_input":"2024-02-21T17:24:22.143953Z","iopub.status.busy":"2024-02-21T17:24:22.143589Z","iopub.status.idle":"2024-02-21T17:25:54.037631Z","shell.execute_reply":"2024-02-21T17:25:54.036655Z","shell.execute_reply.started":"2024-02-21T17:24:22.143923Z"},"id":"ppKz1aLSDiOF","outputId":"e359b27d-6381-4bef-8a8c-bcc9576f7fe3","trusted":true},"outputs":[],"source":["model_name_or_path = \"runwayml/stable-diffusion-v1-5\"\n","\n","scheduler = LMSDiscreteScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule=\"scaled_linear\", num_train_timesteps=1000)\n","\n","\n","pipe = StableDiffusionPipeline.from_pretrained(\n"," model_name_or_path,\n"," scheduler=scheduler,\n"," torch_dtype=torch.float32,\n",").to(device)\n","\n","# Disable image generation progress bar, we'll display our own\n","pipe.set_progress_bar_config(disable=True)"]},{"cell_type":"markdown","metadata":{"id":"5oBmcxe9DiOG"},"source":["These methods are designed to reduce the memory consumed by the GPU. If you have enough VRAM, you can skip this cell. 
\n","\n","More detailed information can be found here: https://huggingface.co/docs/diffusers/en/optimization/opt_overview \n","In particular, information about the following methods can be found here: https://huggingface.co/docs/diffusers/optimization/memory\n"]},{"cell_type":"code","execution_count":null,"metadata":{"execution":{"iopub.execute_input":"2024-02-21T17:25:54.040235Z","iopub.status.busy":"2024-02-21T17:25:54.039388Z","iopub.status.idle":"2024-02-21T17:26:00.115879Z","shell.execute_reply":"2024-02-21T17:26:00.115042Z","shell.execute_reply.started":"2024-02-21T17:25:54.040193Z"},"id":"1i7WuQV1DiOG","trusted":true},"outputs":[],"source":["# Offload the weights to the CPU and load them onto the GPU only when needed; this can reduce memory consumption to less than 3GB.\n","pipe.enable_model_cpu_offload()\n","\n","# Use the channels-last memory format for the UNet (tighter ordering of tensors in memory).\n","pipe.unet.to(memory_format=torch.channels_last)\n","\n","# Decode batches of latents one image at a time; useful with limited VRAM or for batches of 32 images or more.\n","pipe.enable_vae_slicing()\n","\n","# Split the image into overlapping tiles, decode the tiles, and blend the outputs together to compose the final image.\n","pipe.enable_vae_tiling()\n","\n","# Use xFormers memory-efficient attention. If you have PyTorch >= 2.0 installed, you should not expect a speed-up for inference when enabling xformers.\n","pipe.enable_xformers_memory_efficient_attention()\n"]},{"cell_type":"markdown","metadata":{"id":"k45VkXF7DiOG"},"source":["The `display_images` function converts a list of image arrays into a GIF, saves it to a specified path and returns the GIF object for display. 
It names the GIF file using the current time and handles any errors by printing them out."]},{"cell_type":"code","execution_count":null,"metadata":{"execution":{"iopub.execute_input":"2024-02-21T17:30:01.535670Z","iopub.status.busy":"2024-02-21T17:30:01.535281Z","iopub.status.idle":"2024-02-21T17:30:01.542928Z","shell.execute_reply":"2024-02-21T17:30:01.541894Z","shell.execute_reply.started":"2024-02-21T17:30:01.535637Z"},"id":"n5cKlS0CDiOG","trusted":true},"outputs":[],"source":["def display_images(images, save_path):\n","    try:\n","        # Convert each image in the 'images' list from an array to an Image object.\n","        images = [\n","            Image.fromarray(np.array(image[0], dtype=np.uint8)) for image in images\n","        ]\n","\n","        # Generate a file name based on the current time, replacing colons with hyphens\n","        # to ensure the filename is valid for file systems that don't allow colons.\n","        filename = (\n","            time.strftime(\"%H:%M:%S\", time.localtime())\n","            .replace(\":\", \"-\")\n","        )\n","        # Save the first image in the list as a GIF file at the 'save_path' location.\n","        # The rest of the images in the list are added as subsequent frames to the GIF.\n","        # The GIF will play each frame for 100 milliseconds and will loop indefinitely.\n","        images[0].save(\n","            f\"{save_path}/{filename}.gif\",\n","            save_all=True,\n","            append_images=images[1:],\n","            duration=100,\n","            loop=0,\n","        )\n","    except Exception as e:\n","        # If there is an error during the process, print the exception message\n","        # and return None, since there is no GIF to display.\n","        print(e)\n","        return None\n","\n","    # Return the saved GIF as an IPython display object so it can be displayed in a notebook.\n","    return IPdisplay.Image(f\"{save_path}/{filename}.gif\")"]},{"cell_type":"markdown","metadata":{"id":"L13Q7INNDiOG"},"source":["### Generation parameters\n","\n","\n","* `seed`: This variable is used to set a specific random seed for reproducibility. \n","* `generator`: This is set to a PyTorch random number generator object if a seed is provided, otherwise it is None. 
It ensures that the operations using it have reproducible outcomes. \n","* `guidance_scale`: This parameter controls the extent to which the model should follow the prompt in text-to-image generation tasks, with higher values leading to stronger adherence to the prompt. \n","* `num_inference_steps`: This specifies the number of steps the model takes to generate an image. More steps can lead to a higher quality image but take longer to generate. \n","* `num_interpolation_steps`: This determines the number of steps used when interpolating between two points in the latent space, affecting the smoothness of transitions in generated animations. \n","* `height`: The height of the generated images in pixels. \n","* `width`: The width of the generated images in pixels. \n","* `save_path`: The file system path where the generated gifs will be saved. "]},{"cell_type":"code","execution_count":null,"metadata":{"execution":{"iopub.execute_input":"2024-02-21T17:30:04.013629Z","iopub.status.busy":"2024-02-21T17:30:04.012881Z","iopub.status.idle":"2024-02-21T17:30:04.019551Z","shell.execute_reply":"2024-02-21T17:30:04.018612Z","shell.execute_reply.started":"2024-02-21T17:30:04.013596Z"},"id":"R_B-h2j4DiOG","trusted":true},"outputs":[],"source":["# The seed is set to \"None\", because we want different results each time we run the generation.\n","seed = None\n","\n","if seed is not None:\n"," generator = torch.manual_seed(seed)\n","else:\n"," generator = None\n","\n","# The guidance scale is set to its normal range (7 - 10).\n","guidance_scale = 8\n","\n","# The number of inference steps was chosen empirically to generate an acceptable picture within an acceptable time.\n","num_inference_steps = 15\n","\n","# The higher you set this value, the smoother the interpolations will be. However, the generation time will increase. This value was chosen empirically.\n","num_interpolation_steps = 30\n","\n","# I would not recommend less than 512 on either dimension. 
This is because this model was trained on 512x512 image resolution.\n","height = 512 \n","width = 512\n","\n","# The path where the generated GIFs will be saved\n","save_path = \"/output\"\n","\n","if not os.path.exists(save_path):\n"," os.makedirs(save_path)\n"]},{"cell_type":"markdown","metadata":{"id":"Nm4BHESjDiOG"},"source":["### Example 1: Prompt interpolation\n","\n","In this example, interpolation between positive and negative prompt embeddings explores the space between two conceptual points defined by the prompts, potentially producing a variety of images that gradually blend the characteristics the prompts dictate. In this case, interpolation involves adding scaled deltas to the original embeddings, creating a series of new embeddings that are later used to generate images with smooth transitions between different states based on the original prompt.\n"]},{"cell_type":"markdown","metadata":{},"source":[""]},{"cell_type":"markdown","metadata":{},"source":["First of all, we need to tokenize and obtain embeddings for both positive and negative text prompts. 
The positive prompt guides the image generation towards the desired characteristics, while the negative prompt steers it away from unwanted features."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":49,"referenced_widgets":["640b691e30b844ec943995160216e28b","4262f099aab24cfd9b3790864e0e1d63","a2194da8bc254658b9db17e19dbe418b","f0e13bd4abca444592850390651272a4","866b048905164e31a5011bdc0fcf5180","867f3d155da9469ab9820923e40e78e5","7d7e58bafe2c4ff6a44275c3a2ea9826","cc9fdbf01697491f856a33e4b70a7a78","6826628eb6214b57bbe56e3eb80322b3","757864720c4041c6a24f8aa8f1630e69","92e5153340a74fc9895d4f87b68e3cad"]},"execution":{"iopub.execute_input":"2024-02-21T17:40:07.727796Z","iopub.status.busy":"2024-02-21T17:40:07.727407Z","iopub.status.idle":"2024-02-21T17:43:50.624205Z","shell.execute_reply":"2024-02-21T17:43:50.622571Z","shell.execute_reply.started":"2024-02-21T17:40:07.727768Z"},"id":"YVNrz60MDiOH","outputId":"428cf53c-ca0d-49e6-f2cd-41ed292b5117","trusted":true},"outputs":[],"source":["# The text prompt that describes the desired output image.\n","prompt = \"Epic shot of Sweden, ultra detailed lake with an ren dear, nostalgic vintage, ultra cozy and inviting, wonderful light atmosphere, fairy, little photorealistic, digital painting, sharp focus, ultra cozy and inviting, wish to be there. 
very detailed, arty, should rank high on youtube for a dream trip.\"\n","# A negative prompt that steers the generation away from unwanted features.\n","negative_prompt = \"poorly drawn,cartoon, 2d, disfigured, bad art, deformed, poorly drawn, extra limbs, close up, b&w, weird colors, blurry\"\n","\n","# The step size for the walk in the prompt embedding space.\n","step_size = 0.001\n","\n","# Tokenizing and encoding the prompt into embeddings.\n","prompt_tokens = pipe.tokenizer(\n"," prompt,\n"," padding=\"max_length\",\n"," max_length=pipe.tokenizer.model_max_length,\n"," truncation=True,\n"," return_tensors=\"pt\",\n",")\n","prompt_embeds = pipe.text_encoder(prompt_tokens.input_ids.to(device))[0]\n","\n","\n","# Tokenizing and encoding the negative prompt into embeddings.\n","if negative_prompt is None:\n"," negative_prompt = [\"\"]\n","\n","negative_prompt_tokens = pipe.tokenizer(\n"," negative_prompt,\n"," padding=\"max_length\",\n"," max_length=pipe.tokenizer.model_max_length,\n"," truncation=True,\n"," return_tensors=\"pt\",\n",")\n","negative_prompt_embeds = pipe.text_encoder(negative_prompt_tokens.input_ids.to(device))[0]"]},{"cell_type":"markdown","metadata":{},"source":["The next cell generates a random initial latent tensor from a normal distribution, shaped to match the input expected by the diffusion model (UNet). Passing a seeded random number generator makes the results reproducible. After creating the initial latent, the code builds a series of walked embeddings by adding an offset of `step_size * i` to both the positive and negative prompt embeddings at each iteration. 
The results are stored in a list named `walked_embeddings`."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# Generating initial latent vectors from a random normal distribution, with the option to use a generator for reproducibility.\n","latents = torch.randn(\n"," (1, pipe.unet.config.in_channels, height // 8, width // 8),\n"," generator=generator,\n",")\n","\n","walked_embeddings = []\n","\n","# Shifting both embeddings by a growing offset for the given number of interpolation steps.\n","for i in range(num_interpolation_steps):\n"," walked_embeddings.append(\n"," [prompt_embeds + step_size * i, negative_prompt_embeds + step_size * i]\n"," )"]},{"cell_type":"markdown","metadata":{},"source":["Finally, let's generate a series of images from the walked embeddings and then display them. We'll iterate over the list of embeddings, using each pair to generate an image with the specified height, width, and other generation parameters. Then we'll collect these images into a list. 
Once generation is complete, we'll call the `display_images` function to save these images as a GIF at the given save path and display it."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# Generating images using the interpolated embeddings.\n","images = []\n","for latent in tqdm(walked_embeddings):\n"," images.append(\n"," pipe(\n"," height=height,\n"," width=width,\n"," num_images_per_prompt=1,\n"," prompt_embeds=latent[0],\n"," negative_prompt_embeds=latent[1],\n"," num_inference_steps=num_inference_steps,\n"," guidance_scale=guidance_scale,\n"," generator=generator,\n"," latents=latents,\n"," ).images\n"," )\n","\n","# Display of saved generated images.\n","display_images(images, save_path)"]},{"cell_type":"markdown","metadata":{"id":"uZQWop9nDiOH"},"source":["### Example 2: Diffusion latents interpolation for a single prompt\n","Unlike the first example, here we interpolate between two latent vectors of the diffusion model itself, not between prompt embeddings. Note that in this case we use the slerp function for interpolation. However, nothing stops us from adding a constant value to one latent instead.\n"]},{"cell_type":"markdown","metadata":{},"source":[""]},{"cell_type":"markdown","metadata":{"id":"CiW6SlhXDiOH"},"source":["The function below implements spherical linear interpolation (slerp), a method of interpolating along the surface of a sphere. It is commonly used in computer graphics to animate rotations smoothly and can also be used to interpolate between high-dimensional data points in machine learning, such as latent vectors used in generative models. \n","\n","The implementation comes from Andrej Karpathy's gist: https://gist.github.com/karpathy/00103b0037c5aaea32fe1da1af553355. 
\n","A more detailed explanation of this method can be found at: https://en.wikipedia.org/wiki/Slerp."]},{"cell_type":"code","execution_count":1,"metadata":{"id":"grgP7UNpDiOH"},"outputs":[],"source":["def slerp(v0, v1, num, t0=0, t1=1):\n"," v0 = v0.detach().cpu().numpy()\n"," v1 = v1.detach().cpu().numpy()\n","\n"," def interpolation(t, v0, v1, DOT_THRESHOLD=0.9995):\n"," \"\"\"helper function to spherically interpolate two arrays v1 v2\"\"\"\n"," dot = np.sum(v0 * v1 / (np.linalg.norm(v0) * np.linalg.norm(v1)))\n"," if np.abs(dot) > DOT_THRESHOLD:\n"," v2 = (1 - t) * v0 + t * v1\n"," else:\n"," theta_0 = np.arccos(dot)\n"," sin_theta_0 = np.sin(theta_0)\n"," theta_t = theta_0 * t\n"," sin_theta_t = np.sin(theta_t)\n"," s0 = np.sin(theta_0 - theta_t) / sin_theta_0\n"," s1 = sin_theta_t / sin_theta_0\n"," v2 = s0 * v0 + s1 * v1\n"," return v2\n","\n"," t = np.linspace(t0, t1, num)\n","\n"," v3 = torch.tensor(np.array([interpolation(t[i], v0, v1) for i in range(num)]))\n","\n"," return v3"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":561,"referenced_widgets":["d7eb412b880c490c95f1d2baeaf2e6af","3e7caf19c664461e9505cbfcb708ceba","21e5a4fd28ac47a190d0f41808dd75a1","7f4317fe6eca4fc5be524453e55103bd","4ab2093e02704b748298a6d34e807847","69ffdc3c18f5484cad6945a77b024529","2c5d8801da6f4d88be3801254c3e764b","a76a9fce4af34c639327e5a0f4f4e692","e00e5537ae9a43d5956c2c770599edde","6fa7c3c07e734867ac5676b09b6804b3","ed410d69e8e94af7be0d104c5c29a2c9"]},"id":"aIU-nxTcDiOH","outputId":"1f762594-d89d-4bd3-c909-3d4850293b71"},"outputs":[],"source":["# The text prompt that describes the desired output image.\n","prompt = \"Sci-fi digital painting of an alien landscape with otherworldly plants, strange creatures, and distant planets.\"\n","# A negative prompt that can be used to steer the generation away from certain features.\n","negative_prompt = \"poorly drawn,cartoon, 3d, disfigured, bad art, deformed, poorly 
drawn, extra limbs, close up, b&w, weird colors, blurry\"\n","\n","# Generating initial latent vectors from a random normal distribution. In this example two latent vectors are generated, which will serve as start and end points for the interpolation.\n","# These vectors are shaped to fit the input requirements of the diffusion model's U-Net architecture.\n","latents = torch.randn(\n"," (2, pipe.unet.config.in_channels, height // 8, width // 8),\n"," generator=generator,\n",")\n","\n","# Getting our latent embeddings\n","interpolated_latents = slerp(latents[0], latents[1], num_interpolation_steps)\n","\n","# Generating images using the interpolated embeddings.\n","images = []\n","for latent_vector in tqdm(interpolated_latents):\n"," images.append(\n"," pipe(\n"," prompt,\n"," height=height,\n"," width=width,\n"," negative_prompt=negative_prompt,\n"," num_images_per_prompt=1,\n"," num_inference_steps=num_inference_steps,\n"," guidance_scale=guidance_scale,\n"," generator=generator,\n"," latents=latent_vector[None, ...],\n"," ).images\n"," )\n","\n","# Display of saved generated images.\n","display_images(images, save_path)"]},{"cell_type":"markdown","metadata":{"id":"sTFrAlwrDiOI"},"source":["### Example 3: Interpolation between multiple prompts\n","\n","In contrast to the first example, where we moved away from a single prompt, in this example, we will be interpolating between any number of prompts. To do so, we will take consecutive pairs of prompts and create smooth transitions between them. Then, we will combine the interpolations of these consecutive pairs, and instruct the model to generate images based on them. 
For interpolation we will use the slerp function, as in the second example."]},{"cell_type":"markdown","metadata":{},"source":[""]},{"cell_type":"markdown","metadata":{},"source":["Once again, let's tokenize and obtain embeddings but this time for multiple positive and negative text prompts."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# Text prompts that describes the desired output image.\n","prompts = [\n"," \"A cute dog in a beautiful field of lavander colorful flowers everywhere, perfect lighting, leica summicron 35mm f2.0, kodak portra 400, film grain\",\n"," \"A cute cat in a beautiful field of lavander colorful flowers everywhere, perfect lighting, leica summicron 35mm f2.0, kodak portra 400, film grain\",\n","]\n","# Negative prompts that can be used to steer the generation away from certain features.\n","negative_prompts = [\n"," \"poorly drawn,cartoon, 2d, sketch, cartoon, drawing, anime, disfigured, bad art, deformed, poorly drawn, extra limbs, close up, b&w, weird colors, blurry\",\n"," \"poorly drawn,cartoon, 2d, sketch, cartoon, drawing, anime, disfigured, bad art, deformed, poorly drawn, extra limbs, close up, b&w, weird colors, blurry\",\n","]\n","\n","# NOTE: The number of prompts must match the number of negative prompts\n","\n","batch_size = len(prompts)\n","\n","# Tokenizing and encoding prompts into embeddings.\n","prompts_tokens = pipe.tokenizer(\n"," prompts,\n"," padding=\"max_length\",\n"," max_length=pipe.tokenizer.model_max_length,\n"," truncation=True,\n"," return_tensors=\"pt\",\n",")\n","prompts_embeds = pipe.text_encoder(\n"," prompts_tokens.input_ids.to(device)\n",")[0]\n","\n","# Tokenizing and encoding negative prompts into embeddings.\n","if negative_prompts is None:\n"," negative_prompts = [\"\"] * batch_size\n","\n","negative_prompts_tokens = pipe.tokenizer(\n"," negative_prompts,\n"," padding=\"max_length\",\n"," max_length=pipe.tokenizer.model_max_length,\n"," truncation=True,\n"," 
return_tensors=\"pt\",\n",")\n","negative_prompts_embeds = pipe.text_encoder(\n"," negative_prompts_tokens.input_ids.to(device)\n",")[0]"]},{"cell_type":"markdown","metadata":{},"source":["As stated earlier, we will take consecutive pairs of prompts and create smooth transitions between them with `slerp` function."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":561,"referenced_widgets":["9a87e8d407f44ee59a70e511a6274131","d49efe4cff2a43288d2140a55c17c4cc","79fed3974dd8466e8237c7431f71a084","6c9624bf3faf4890bc1c83c52e33e508","f23c3781df5a497baa56f273a5f467a5","6956b01055a34613994672a6bb93994d","1f16ae9b03604ccd93cd1e2f153afe64","aa4bef85913f43c49b708c850301546d","e92a9d9050e34b47b766a31935ffbcda","94fa436d65894af494b2526746fe7324","ffe65c700f0142df9b289a0e8b58ec65"]},"id":"DfUbS8w5DiOI","outputId":"fb663c02-73e2-421a-8d07-b0b43cd548a7"},"outputs":[],"source":["# Generating initial U-Net latent vectors from a random normal distribution.\n","latents = torch.randn(\n"," (1, pipe.unet.config.in_channels, height // 8, width // 8),\n"," generator=generator,\n",")\n","\n","# Interpolating between embeddings pairs for the given number of interpolation steps.\n","interpolated_prompt_embeds = []\n","interpolated_negative_prompts_embeds = []\n","for i in range(batch_size - 1):\n"," interpolated_prompt_embeds.append(\n"," slerp(\n"," prompts_embeds[i],\n"," prompts_embeds[i + 1],\n"," num_interpolation_steps\n"," )\n"," )\n"," interpolated_negative_prompts_embeds.append(\n"," slerp(\n"," negative_prompts_embeds[i],\n"," negative_prompts_embeds[i + 1],\n"," num_interpolation_steps,\n"," )\n"," )\n","\n","interpolated_prompt_embeds = torch.cat(\n"," interpolated_prompt_embeds, dim=0\n",").to(device)\n","\n","interpolated_negative_prompts_embeds = torch.cat(\n"," interpolated_negative_prompts_embeds, dim=0\n",").to(device)"]},{"cell_type":"markdown","metadata":{},"source":["Finally, we need to generate images based 
on the embeddings."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# Generating images using the interpolated embeddings.\n","images = []\n","for prompt_embeds, negative_prompt_embeds in tqdm(\n"," zip(interpolated_prompt_embeds, interpolated_negative_prompts_embeds),\n"," total=len(interpolated_prompt_embeds),\n","):\n"," images.append(\n"," pipe(\n"," height=height,\n"," width=width,\n"," num_images_per_prompt=1,\n"," prompt_embeds=prompt_embeds[None, ...],\n"," negative_prompt_embeds=negative_prompt_embeds[None, ...],\n"," num_inference_steps=num_inference_steps,\n"," guidance_scale=guidance_scale,\n"," generator=generator,\n"," latents=latents,\n"," ).images\n"," )\n","\n","# Display of saved generated images.\n","display_images(images, save_path)"]},{"cell_type":"markdown","metadata":{"id":"oQqANSP2DiOI"},"source":["### Example 4: Circular walk through the diffusion latent space for a single prompt\n","\n","This example was taken from: https://keras.io/examples/generative/random_walks_with_stable_diffusion/ \n","\n","Let's imagine that we have two noise tensors, which we'll call x and y. We sweep an angle from 0 to 2π and, at each step, add x scaled by the cosine of the angle to y scaled by the sine of the angle. Using this approach, at the end of our movement we end up with the same noise values that we started with. 
This means that vectors end up turning into themselves, ending our movement.\n","\n"]},{"cell_type":"markdown","metadata":{},"source":[""]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":561,"referenced_widgets":["eba9a67d3f704bed8f501780e35273cb","f6b5c1f44a54406c84db14875e4c85b0","1f3ce7042b974edbb033b4cd8d13cc08","6de631e7a72e4b74b740095e8c251ca8","00c30d57328148c88b4258f4841bbdd0","a4a86f212e8a4dffb0240936475837f7","89ce18ce98494ccc803adbf87f6051e5","6781679193314617a341ef891ba3df45","161a3b1a75e0446ca9120f5e1eea38e9","7294524debe544c19f8c76a7b3cf0e32","f6fb32e142d140d5ad6b357731c4d382"]},"id":"ac-68CTWDiOJ","outputId":"3eced894-bd22-443a-96e9-dedb67a40ad8"},"outputs":[],"source":["# The text prompt that describes the desired output image.\n","prompt = \"Beautiful sea sunset, warm light, Aivazovsky style\"\n","# A negative prompt that can be used to steer the generation away from certain features\n","negative_prompt = \"picture frames\"\n","\n","# Generating initial latent vectors from a random normal distribution to create a loop interpolation between them.\n","latents = torch.randn(\n"," (2, 1, pipe.unet.config.in_channels, height // 8, width // 8),\n"," generator=generator,\n",")\n","\n","\n","# Calculation of looped embeddings\n","walk_noise_x = latents[0].to(device)\n","walk_noise_y = latents[1].to(device)\n","\n","# Walking on a trigonometric circle\n","walk_scale_x = torch.cos(torch.linspace(0, 2, num_interpolation_steps) * np.pi).to(\n"," device\n",")\n","walk_scale_y = torch.sin(torch.linspace(0, 2, num_interpolation_steps) * np.pi).to(\n"," device\n",")\n","\n","# Applying interpolation to noise\n","noise_x = torch.tensordot(walk_scale_x, walk_noise_x, dims=0)\n","noise_y = torch.tensordot(walk_scale_y, walk_noise_y, dims=0)\n","\n","circular_latents = noise_x + noise_y\n","\n","# Generating images using the interpolated embeddings.\n","images = []\n","for latent_vector in 
tqdm(circular_latents):\n"," images.append(\n"," pipe(\n"," prompt,\n"," height=height,\n"," width=width,\n"," negative_prompt=negative_prompt,\n"," num_images_per_prompt=1,\n"," num_inference_steps=num_inference_steps,\n"," guidance_scale=guidance_scale,\n"," generator=generator,\n"," latents=latent_vector,\n"," ).images\n"," )\n","\n","# Display of saved generated images.\n","display_images(images, save_path)"]},{"cell_type":"markdown","metadata":{"id":"QQnbnOokDiOJ"},"source":["## Next Steps \n","Moving forward, you can explore various parameters such as guidance scale, seed, and number of interpolation steps to observe how they affect the generated images. Additionally, consider trying out different prompts and schedulers to further enhance your results. Another valuable step would be to implement linear interpolation (`linspace`) instead of spherical linear interpolation (`slerp`) and compare the results to gain deeper insights into the interpolation process."]}],"metadata":{"accelerator":"GPU","colab":{"gpuType":"T4","provenance":[]},"kaggle":{"accelerator":"gpu","dataSources":[],"dockerImageVersionId":30648,"isGpuEnabled":true,"isInternetEnabled":true,"language":"python","sourceType":"notebook"},"kernelspec":{"display_name":"Python 
3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.10.11"},"widgets":{"application/vnd.jupyter.widget-state+json":{"0094122c8eaf47e6851a3449e2fb0086":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"00b88de69713463485206e4b7c8c3c04":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_95d1047a0a644262aa385987c9a331b4","placeholder":"","style":"IPY_MODEL_c565ba8910c34e138c6cd10e9b06d673","value":"safety_checker/config.json: 
100%"}},"00c30d57328148c88b4258f4841bbdd0":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"0220a7b0d67a482a8ab7e9ac6372d380":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"034a1eb240694704a8052783583caefe":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow"
:null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"03fdfd9eb2e343df8af4ff2b06ac8eb1":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"0755693c5b854000af081f2818162683":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_393b21bd730f4a8b93d1653ebafaea09","placeholder":"","style":"IPY_MODEL_0094122c8eaf47e6851a3449e2fb0086","value":" 525k/525k [00:02<00:00, 
233kB/s]"}},"0ba055eeeed04e91b684072740e07f0a":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_309d89a221ab4010af0d85d3ae90a1c3","max":14,"min":0,"orientation":"horizontal","style":"IPY_MODEL_1cdaf6fe1392453288997f672af80a0d","value":14}},"0ba17f55cbf941feaa0b7e8959a94591":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"0e63faca29cb4ca3845e402d67592b8f":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"
@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_f736f2fc44444ebf832fb2ead6ea0fd0","max":541,"min":0,"orientation":"horizontal","style":"IPY_MODEL_c41432dd085f41aeaa1e1d9cac4872e7","value":541}},"0eaebd8b9eb14feca8a11a8c3077b196":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"1325ea426c9747828601a9175a9b0248":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_c14c4f3d31a447b38d359acc6e29496d","placeholder":"","style":"IPY_MODEL_f4f167561c2d452785b6e59c2ce61b28","value":" 806/806 [00:00<00:00, 
18.5kB/s]"}},"161a3b1a75e0446ca9120f5e1eea38e9":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"1638b8dbe07a4d249167a3d34ac9adc6":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"163ec8057136471bb1f460d657c4aa6c":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,
"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"16c7e883c9ec49329b68e61b276d5fac":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_b18d19fb59d8458c9e9048c1458ee95c","placeholder":"","style":"IPY_MODEL_d5a23fbc6e634b02bf4f2540e9def457","value":" 492M/492M [00:14<00:00, 
20.5MB/s]"}},"16e98aec73d048d693fd0e44df2220e7":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"17ceef0b617c4f52aa0bb5ec12113fcd":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_45d877fafa5a4772ba5d62557843bb51","max":472,"min":0,"orientation":"horizontal","style":"IPY_MODEL_62d5802bdc7a49efbcb47889e29e924c","value":472}},"184d9e1fa8d241c386567300db4e2c8c":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"1ad180dc107946b9a1b554a4b98ee514":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"1c3a77c578d644b09771446ed559c575":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","mod
el_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_b08c232e469646f89a28f4371f0e7699","IPY_MODEL_5e37d843836b42dfb62f728181ee4dfa","IPY_MODEL_0755693c5b854000af081f2818162683"],"layout":"IPY_MODEL_2e194e5ccbd349b093481e24148de89c"}},"1c8a25c7f70145df9722d3702fc6d2dd":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_c1a5498ffe0f407397d2a8c77e35de84","placeholder":"","style":"IPY_MODEL_5eefef112ef249cc97ecd63f19ea122d","value":"tokenizer/tokenizer_config.json: 
100%"}},"1cdaf6fe1392453288997f672af80a0d":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"1e4ad02a5afb431a96626230081054cc":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"1e64cce9ffc94f23921f964288c2e26d":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"1ebb24b8e6f44608ae8c11747ef7c42d":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_80509211c7dc47e8a673ed80fca6d8e1","placeholder":"","style":"IPY_MODEL_e1f4e7fa8b1f4530b580f101a9fad304","value":" 547/547 [00:00<00:00, 
8.12kB/s]"}},"1f16ae9b03604ccd93cd1e2f153afe64":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"1f3ce7042b974edbb033b4cd8d13cc08":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_6781679193314617a341ef891ba3df45","max":30,"min":0,"orientation":"horizontal","style":"IPY_MODEL_161a3b1a75e0446ca9120f5e1eea38e9","value":30}},"20c46fc3077b4dcc855252c222002569":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_9b597fc5d5cf4fecabcfbc7a4cfa1ee9","IPY_MODEL_17ceef0b617c4f52aa0bb5ec12113fcd","IPY_MODEL_e20cd706fa304bc190e92ae7d27b5b38"],"layout":"IPY_MODEL_9a3f552babd34cdeb8fed4ce1b1b33a7"}},"21e5a4fd28ac47a190d0f41808dd75a1":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","
_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_a76a9fce4af34c639327e5a0f4f4e692","max":30,"min":0,"orientation":"horizontal","style":"IPY_MODEL_e00e5537ae9a43d5956c2c770599edde","value":30}},"221facd121a14faf9d664f644935b0ae":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_e431b2a589524545a5ccbb79d2c7bab9","max":334643276,"min":0,"orientation":"horizontal","style":"IPY_MODEL_90dccdfd4085472f8f9ba0535d85d327","value":334643276}},"2642af1a55cf452a93e528fb25f1c8cb":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"2c5d8801da6f4d88be3801254c3e764b":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"2dc63f0fe271457f890fd2067631ad75":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutM
odel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"2e194e5ccbd349b093481e24148de89c":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"2e72912929c14d9f8843bc028b75de77":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[
],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_f23d6c684c3a416d8b43fc642653eec8","placeholder":"","style":"IPY_MODEL_16e98aec73d048d693fd0e44df2220e7","value":"vae/config.json: 100%"}},"2f010760c1a146238f35af00568ccb11":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"2f6e125affb54674b5f49beaf3612bee":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"2fd8069de9754a87ae04ec7b2c4b380a":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":
null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"309d89a221ab4010af0d85d3ae90a1c3":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"31107fa83a974eca83fa968ae4eae909":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"gr
id_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"3214f06917264be88837314b26375a1c":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_00b88de69713463485206e4b7c8c3c04","IPY_MODEL_f63bebf985a046ce9ba9d567db0b7ca4","IPY_MODEL_a49fd3377dde41dab42677091dc7bd04"],"layout":"IPY_MODEL_ed1dbb98c8d34dce8df70e4698a21911"}},"3537007206fd4d57ae492d29d90bd904":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_85bf2410c7d2440db76163fe1df4f4bb","IPY_MODEL_c2bf5a15732a4898915b0ec3cb56df8c","IPY_MODEL_bf573d9fcbac4701b31e464373fdbeb0"],"layout":"IPY_MODEL_f1f352e6964f424f9e6a4557f6e3ff97"}},"3538495833744e9eab9b25eade484603":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_8db4aafac51043598a9e4b8915
f4f7c4","IPY_MODEL_0e63faca29cb4ca3845e402d67592b8f","IPY_MODEL_bf4b67c0a0034b1ab576066d366d934e"],"layout":"IPY_MODEL_e09eaa48528a48629b19020aa65c5bdc"}},"35cf8424313d43dea56ca590edf70b26":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_5dc6caab4749403aab23ce95993f9ad0","IPY_MODEL_84e32b4370a94fe498b3b01b6632daf3","IPY_MODEL_16c7e883c9ec49329b68e61b276d5fac"],"layout":"IPY_MODEL_5518095d648d4b2b95647c746d70c4d8"}},"35ffc6d955a44422a06e5c304fcaeddb":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":"20px"}},"36a0330c9e1d48a49a739feaef34ddc0":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_mod
el_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"393b21bd730f4a8b93d1653ebafaea09":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"3959b0871ea840839a383c895cfbe916":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items"
:null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"399fdb41f7fd4f1489c1bf4814b53907":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_2fd8069de9754a87ae04ec7b2c4b380a","placeholder":"","style":"IPY_MODEL_da6373c3704d45869e2fafdf9772d3d1","value":" 617/617 [00:00<00:00, 7.14kB/s]"}},"3b99c2eba3ab4ea2a44b23ec0916ce2d":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"3cd43805ef564f6696905a2465ea4467":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_da25ec24a6b44ce9a51bcc2d440d0258","max":7,"min":0,"orientation":"horizontal","style":"IPY_MODEL_41ea6163d1b04f8f89cd1f9ec9e72847","value":7}},"3e7caf19c664461e9505cbfcb708ceba":{"model_module":"@jupyter-widgets/contro
ls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_69ffdc3c18f5484cad6945a77b024529","placeholder":"","style":"IPY_MODEL_2c5d8801da6f4d88be3801254c3e764b","value":"100%"}},"3fcfdbffff6149fe880b0702cf8162f3":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"41ea6163d1b04f8f89cd1f9ec9e72847":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"4262f099aab24cfd9b3790864e0e1d63":{"model_module":
"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_867f3d155da9469ab9820923e40e78e5","placeholder":"","style":"IPY_MODEL_7d7e58bafe2c4ff6a44275c3a2ea9826","value":" 37%"}},"44a56b89efa649edb669abed3605e576":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_c26781e6cc214b6b866b8f11673e9c00","IPY_MODEL_cddd14dde34c42f59ed71870e558246e","IPY_MODEL_399fdb41f7fd4f1489c1bf4814b53907"],"layout":"IPY_MODEL_f7f49c60f4ca41efa9a3b6b33c418f73"}},"45d877fafa5a4772ba5d62557843bb51":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object
_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"46df7a569a984f2a8d60b021e2366550":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"495a6de427ce47eda18a1570fd8f5f9d":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"4a466456e448417a8b3cc442cec49632":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","desc
ription_width":""}},"4ab2093e02704b748298a6d34e807847":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"4b14e789cc2446eda7a94a5b1259e738":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_865eb0b3254246948f532e2c6dd02bd1","IPY_MODEL_7e678fd9b4284bf39ecd29de4d7624a3","IPY_MODEL_adfbabeb20a3428c8fd6ec5b79830c34"],"layout":"IPY_MODEL_6a012f24803646369ac97be41e5998f3"}},"4cc69a88629f41dc80c552b58d3f5eaf":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_ve
rsion":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_6cc2e02ee7b74f3aa2f71bf0725190de","IPY_MODEL_da714c09550f43849e5a2502b092403a","IPY_MODEL_7370d528153e473980ac0e701c9e6825"],"layout":"IPY_MODEL_0ba17f55cbf941feaa0b7e8959a94591"}},"4d26d7dd13d148b3b6e06f10b589f2a8":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"51061a6b42584e2bbac74da6a6f2a1da":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"51ceac8abb23437f952e07d2daaf0dae":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow"
:null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"521ecdf554b84958ba4ee487c9621ffc":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"5518095d648d4b2b95647c746d70c4d8":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_
width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"55a116f1fa634632a061a4ad8bb75ec3":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_e03272733bd84e2e9edcb83391e1cfae","max":342,"min":0,"orientation":"horizontal","style":"IPY_MODEL_184d9e1fa8d241c386567300db4e2c8c","value":342}},"566d76e028b643e18729621e82531939":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"57d02fadf8fc4df3b1fbee18e45f9199":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","sta
te":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"58910b48c70a4b3dabf87b6a12004e70":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_51ceac8abb23437f952e07d2daaf0dae","placeholder":"","style":"IPY_MODEL_a06b4275e1444e72b921d631feea90ba","value":" 342/342 [00:00<00:00, 
2.75kB/s]"}},"59081d8cdf4e43228a20f3fe986926b2":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"590b0de8d8f84d1ab9a243d88380c295":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"5ce301bccee049cf9664800d63e2e2eb":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"5dc635e136f741608e73455581822408":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justi
fy_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"5dc6caab4749403aab23ce95993f9ad0":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_de350491cd2e48a19541f6e97b9f176e","placeholder":"","style":"IPY_MODEL_cd7fa6f0b6844fd4a41b97cbe39e0c2f","value":"model.safetensors: 100%"}},"5e37d843836b42dfb62f728181ee4dfa":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_034a1eb240694704a8052783583caefe","max":524619,"min":0,"orientation":"horizontal","style":"IPY_MODEL_03fdfd9eb2e343df8af4ff2b06ac8eb1","value":524619}},"5eefef112ef249cc97ecd63f19ea122d":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"5ffaa53c74d3413b8f3c3fe2fc5cc075":{"model_module":"@jupyter-widgets/base","
model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"60aa6e24133c4e67957bf953e5b10f4d":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},
"62d5802bdc7a49efbcb47889e29e924c":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"6401b1635d62478ba644075a32494384":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_72a5fcb23802406eb22e2758be01052c","IPY_MODEL_0ba055eeeed04e91b684072740e07f0a","IPY_MODEL_790424c8674d42d59c75ba2fd4021e3a"],"layout":"IPY_MODEL_ff697b2e046a4a36a78ac362cae5c2a9"}},"640b691e30b844ec943995160216e28b":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_4262f099aab24cfd9b3790864e0e1d63","IPY_MODEL_a2194da8bc254658b9db17e19dbe418b","IPY_MODEL_f0e13bd4abca444592850390651272a4"],"layout":"IPY_MODEL_866b048905164e31a5011bdc0fcf5180"}},"66a43c50816c4e3fa88852c3d2c3b0c7":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"a
lign_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"670c96aa182f4defbfce4a9fd0b7f96b":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"6781679193314617a341ef891ba3df45":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"ov
erflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"681cc61581d84298b4798b5c43818e31":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"6826628eb6214b57bbe56e3eb80322b3":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"683585ba77df463a9bcb2f8ce4794747":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"6956b01055a34613994672a6bb9
3994d":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"69ffdc3c18f5484cad6945a77b024529":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":n
ull,"top":null,"visibility":null,"width":null}},"6a012f24803646369ac97be41e5998f3":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"6a7f80bb3e534eb2a48d3d29c9ac3988":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"6c9624bf3faf4890bc1c83c52e33e508":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_94fa436d65894af494b2526746fe7324","placeholder":"","style":"
IPY_MODEL_ffe65c700f0142df9b289a0e8b58ec65","value":" 30/30 [05:25<00:00, 10.77s/it]"}},"6cc2e02ee7b74f3aa2f71bf0725190de":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_a903c8a2c22c48bdb70024665bc1cb0f","placeholder":"","style":"IPY_MODEL_8c29df6a485946c090b39b49240427fd","value":"tokenizer/vocab.json: 100%"}},"6d9421a31914451cac51c71aee8b1ce4":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_566d76e028b643e18729621e82531939","max":3438167540,"min":0,"orientation":"horizontal","style":"IPY_MODEL_2642af1a55cf452a93e528fb25f1c8cb","value":3438167540}},"6de631e7a72e4b74b740095e8c251ca8":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_7294524debe544c19f8c76a7b3cf0e32","placeholder":"","style":"IPY_MODEL_f6fb32e142d140d5ad6b357731c4d382","value":" 30/30 [05:46<00:00, 
11.49s/it]"}},"6fa7c3c07e734867ac5676b09b6804b3":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"70cf0bb1ada946dd9717fc2a493b7805":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_8caab95721354f08999bea8dc6105b4e","IPY_MODEL_221facd121a14faf9d664f644935b0ae","IPY_MODEL_bac1fcd4cf1847ed89bc5e01ae435e24"],"layout":"IPY_MODEL_bc4ed6312ae44ba7a9e21d43d7edd48a"}},"7294524debe544c19f8c76a7b3cf0e32":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"Lay
outView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"72a5fcb23802406eb22e2758be01052c":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_eb3e42c270b04086a2a04ae286684f6b","placeholder":"","style":"IPY_MODEL_d9816365a87340dfa06fe9a37811a81b","value":"Fetching 14 files: 100%"}},"7370d528153e473980ac0e701c9e6825":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_0eaebd8b9eb14feca8a11a8c3077b196","placeholder":"","style":"IPY_MODEL_51061a6b42584e2bbac74da6a6f2a1da","value":" 1.06M/1.06M [00:00<00:00, 
4.59MB/s]"}},"757864720c4041c6a24f8aa8f1630e69":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"7850ea0076da49639ca986a4885d7048":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_f04f25b18eaa43f6a2044dac7aba8372","IPY_MODEL_6d9421a31914451cac51c71aee8b1ce4","IPY_MODEL_a4b3ef66956d494887e796f87b4278f1"],"layout":"IPY_MODEL_1638b8dbe07a4d249167a3d34ac9adc6"}},"790424c8674d42d59c75ba2fd4021e3a":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":
"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_5dc635e136f741608e73455581822408","placeholder":"","style":"IPY_MODEL_670c96aa182f4defbfce4a9fd0b7f96b","value":" 14/14 [00:33<00:00, 2.49s/it]"}},"79fed3974dd8466e8237c7431f71a084":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_aa4bef85913f43c49b708c850301546d","max":30,"min":0,"orientation":"horizontal","style":"IPY_MODEL_e92a9d9050e34b47b766a31935ffbcda","value":30}},"7d7e58bafe2c4ff6a44275c3a2ea9826":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"7e678fd9b4284bf39ecd29de4d7624a3":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_66a43c50816c4e3fa88852c3d2c3b0c7","max":743,"min":0,"orientation":"horizontal","style":"IPY_MODEL_46df7a569a984f2a8d60b021e2366550","value":743}},"7f4317fe6eca4fc5be524453e55103bd":{"model_module":"@jupyter-widgets/controls","
model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_6fa7c3c07e734867ac5676b09b6804b3","placeholder":"","style":"IPY_MODEL_ed410d69e8e94af7be0d104c5c29a2c9","value":" 30/30 [05:55<00:00, 11.65s/it]"}},"80509211c7dc47e8a673ed80fca6d8e1":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"84e32b4370a94fe498b3b01b6632daf3":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","descri
ption_tooltip":null,"layout":"IPY_MODEL_683585ba77df463a9bcb2f8ce4794747","max":492265874,"min":0,"orientation":"horizontal","style":"IPY_MODEL_3b99c2eba3ab4ea2a44b23ec0916ce2d","value":492265874}},"85bf2410c7d2440db76163fe1df4f4bb":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_fa5231429aa1437983dc93dc597e698e","placeholder":"","style":"IPY_MODEL_fe24820349a0456ca103e30024490c0e","value":""}},"865eb0b3254246948f532e2c6dd02bd1":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_521ecdf554b84958ba4ee487c9621ffc","placeholder":"","style":"IPY_MODEL_c6a66e5c516c4dbbbc1ca203c6a2d0db","value":"unet/config.json: 
100%"}},"866b048905164e31a5011bdc0fcf5180":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"867f3d155da9469ab9820923e40e78e5":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overf
low_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"88c6c5bcc44d46089aa3efaa7fb9e452":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_e0280b4f0172481ea7664bfb96d1bb1c","IPY_MODEL_3cd43805ef564f6696905a2465ea4467","IPY_MODEL_ab7f8b09b1d8452995d66e6f0df83faa"],"layout":"IPY_MODEL_3959b0871ea840839a383c895cfbe916"}},"89ce18ce98494ccc803adbf87f6051e5":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"8c29df6a485946c090b39b49240427fd":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"8c690a7356af4547902b72aa6a20328d":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area"
:null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"8c9b21767c5d4741b717b82f1e4a0e03":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_a83f68f1b1d640298019af11c5198a7d","IPY_MODEL_55a116f1fa634632a061a4ad8bb75ec3","IPY_MODEL_58910b48c70a4b3dabf87b6a12004e70"],"layout":"IPY_MODEL_2dc63f0fe271457f890fd2067631ad75"}},"8caab95721354f08999bea8dc6105b4e":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_fd4eebbe68204eaa802186836c372b93","placeholder":"","style":"IPY_MODEL_6a7f80bb3e534eb2a48d3d29c9ac3988","value":"diffusion_pytorch_model.safetensors: 
100%"}},"8db4aafac51043598a9e4b8915f4f7c4":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_57d02fadf8fc4df3b1fbee18e45f9199","placeholder":"","style":"IPY_MODEL_2f6e125affb54674b5f49beaf3612bee","value":"model_index.json: 100%"}},"909ad3aefa5b4c65931f300b3e9655dd":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"90dccdfd4085472f8f9ba0535d85d327":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"Sty
leView","bar_color":null,"description_width":""}},"92d3f7562ee04b789c44656f83d12528":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_bd7b9e9757d24c568432eea484271376","IPY_MODEL_9c12a4d25771464b84b35804695fd50a","IPY_MODEL_fa1d984ee1864975a4c543f1dcb3aa42"],"layout":"IPY_MODEL_b89baafcd7e944b088bf0c9b1e839ba5"}},"92e5153340a74fc9895d4f87b68e3cad":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"93159e8e4bf84bc9a35c4325dc6cc851":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"94fa436d65894af494b2526746fe7324":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_column
s":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"95d1047a0a644262aa385987c9a331b4":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"96d5032cdaa14cdeb110f8fc3b6614c1":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"d
isplay":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"97fe7d3ccc984694aa42a56ba930c64e":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"9a3f552babd34cdeb8fed4ce1b1b33a7":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_conte
nt":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"9a87e8d407f44ee59a70e511a6274131":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_d49efe4cff2a43288d2140a55c17c4cc","IPY_MODEL_79fed3974dd8466e8237c7431f71a084","IPY_MODEL_6c9624bf3faf4890bc1c83c52e33e508"],"layout":"IPY_MODEL_f23c3781df5a497baa56f273a5f467a5"}},"9b597fc5d5cf4fecabcfbc7a4cfa1ee9":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_f8db6aae6e3a468ba3805991e19e1f45","placeholder":"","style":"IPY_MODEL_681cc61581d84298b4798b5c43818e31","value":"tokenizer/special_tokens_map.json: 
100%"}},"9c12a4d25771464b84b35804695fd50a":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_909ad3aefa5b4c65931f300b3e9655dd","max":1215981830,"min":0,"orientation":"horizontal","style":"IPY_MODEL_93159e8e4bf84bc9a35c4325dc6cc851","value":1215981830}},"9d2ff155648146058d2d359e474159dc":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"9ef6f8a591244419916d980d5883e03e":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_vie
w_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"a06b4275e1444e72b921d631feea90ba":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"a2194da8bc254658b9db17e19dbe418b":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"","description":"","description_tooltip":null,"layout":"IPY_MODEL_cc9fdbf01697491f856a33e4b70a7a78","max":30,"min":0,"orientation":"horizontal","style":"IPY_MODEL_6826628eb6214b57bbe56e3eb80322b3","value":11}},"a428e985410341e0ba04359af465681e":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"Lay
outModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"a49fd3377dde41dab42677091dc7bd04":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_bf1489df7c98442caadd2417026bffdf","placeholder":"","style":"IPY_MODEL_590b0de8d8f84d1ab9a243d88380c295","value":" 4.72k/4.72k [00:00<00:00, 
60.4kB/s]"}},"a4a86f212e8a4dffb0240936475837f7":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"a4b3ef66956d494887e796f87b4278f1":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_a428e985410341e0ba04359af465681e","placeholder":"","style":"IPY_MODEL_bfbec38740f8413d93469274aeccbf23","value":" 3.44G/3.44G [00:32<00:00, 
246MB/s]"}},"a5926a9d27b44a2d961f36d7fd36da15":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"a76a9fce4af34c639327e5a0f4f4e692":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"a83f68f1b1d640298019af11c5198a7d":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_3fcfdbffff6149fe880b0702cf8162f3","placeholder":"","style":"IPY_MODEL_eb96b8f49d6746ec8f46e65e59
a3fad6","value":"(…)ature_extractor/preprocessor_config.json: 100%"}},"a903c8a2c22c48bdb70024665bc1cb0f":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"aa4bef85913f43c49b708c850301546d":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_positi
on":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"ab7f8b09b1d8452995d66e6f0df83faa":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_31107fa83a974eca83fa968ae4eae909","placeholder":"","style":"IPY_MODEL_36a0330c9e1d48a49a739feaef34ddc0","value":" 7/7 [00:02<00:00, 3.52it/s]"}},"adfbabeb20a3428c8fd6ec5b79830c34":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_cffa386bd54e4baa8947d4d51c8e54a4","placeholder":"","style":"IPY_MODEL_a5926a9d27b44a2d961f36d7fd36da15","value":" 743/743 [00:00<00:00, 
10.7kB/s]"}},"ae44d548c5164e8fb5e85f1ab19da9ac":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_defa4622267e49b8a54d6aceea082c39","max":806,"min":0,"orientation":"horizontal","style":"IPY_MODEL_59081d8cdf4e43228a20f3fe986926b2","value":806}},"aedf99983c5043fe8d634b6e3b56e1ae":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"b08c232e469646f89a28f4371f0e7699":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_9d2ff155648146058d2d359e474159dc","placeholder":"","style":"IPY_MODEL_0220a7b0d67a482a8ab7e9ac6372d380","value":"tokenizer/merges.txt: 
100%"}},"b0bcd8201363473ea0e4ace230443446":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"b18d19fb59d8458c9e9048c1458ee95c":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overf
low_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"b89baafcd7e944b088bf0c9b1e839ba5":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"bac1fcd4cf1847ed89bc5e01ae435e24":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_d1d564e827cf4a71af9aa87b9d5696c8","placeholder":"","style":"IPY_MODEL_1ad180dc107946b9a1b554a4b98ee514","value":" 335M/335M [00:11<00:00, 
15.8MB/s]"}},"bc4ed6312ae44ba7a9e21d43d7edd48a":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"bd7b9e9757d24c568432eea484271376":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_fb34d344590e43438906722b1ab958f0","placeholder":"","style":"IPY_MODEL_ed6a871ff054410b8d018b3b97c75c60","value":"model.safetensors: 
100%"}},"bf1489df7c98442caadd2417026bffdf":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"bf4b67c0a0034b1ab576066d366d934e":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_c577a158f88040c0979b319fb9fb89b0","placeholder":"","style":"IPY_MODEL_f24c9ef7c3f04da783aaffd3e5d48de7","value":" 541/541 [00:00<00:00, 
29.0kB/s]"}},"bf573d9fcbac4701b31e464373fdbeb0":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_96d5032cdaa14cdeb110f8fc3b6614c1","placeholder":"","style":"IPY_MODEL_4a466456e448417a8b3cc442cec49632","value":" 0/0 [00:00<?, ?it/s]"}},"bfbec38740f8413d93469274aeccbf23":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"c14c4f3d31a447b38d359acc6e29496d":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"
top":null,"visibility":null,"width":null}},"c1a5498ffe0f407397d2a8c77e35de84":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"c26781e6cc214b6b866b8f11673e9c00":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_9ef6f8a591244419916d980d5883e03e","placeholder":"","style":"IPY_MODEL_4d26d7dd13d148b3b6e06f10b589f2a8","value":"text_encoder/config.json: 
100%"}},"c2bf5a15732a4898915b0ec3cb56df8c":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_35ffc6d955a44422a06e5c304fcaeddb","max":1,"min":0,"orientation":"horizontal","style":"IPY_MODEL_1e64cce9ffc94f23921f964288c2e26d","value":0}},"c41432dd085f41aeaa1e1d9cac4872e7":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"c523aec40bb54986a4f40924c81fe5da":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x
":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"c565ba8910c34e138c6cd10e9b06d673":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"c577a158f88040c0979b319fb9fb89b0":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"c6a66e5c516c4dbbbc1ca203c6a2d0db":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"c7f341fe95ca426d9502594bd48d36f6":{"model_modu
le":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"cc9fdbf01697491f856a33e4b70a7a78":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"cd7fa6f0b6844fd4a41b97cbe39e0c2f":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"cddd14dde34c42f59ed71870e558246e":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@ju
pyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_8c690a7356af4547902b72aa6a20328d","max":617,"min":0,"orientation":"horizontal","style":"IPY_MODEL_1e4ad02a5afb431a96626230081054cc","value":617}},"cffa386bd54e4baa8947d4d51c8e54a4":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"d1d564e827cf4a71af9aa87b9d5696c8":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_col
umns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"d49efe4cff2a43288d2140a55c17c4cc":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_6956b01055a34613994672a6bb93994d","placeholder":"","style":"IPY_MODEL_1f16ae9b03604ccd93cd1e2f153afe64","value":"100%"}},"d5a23fbc6e634b02bf4f2540e9def457":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"d7c9f6b399524bc596e84641687ba29a":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_1c8a25c7f70145df9722d3702fc6d2dd","IPY_MODEL_ae44d548c5164e8fb5e85f1ab19da9ac","IPY_MO
DEL_1325ea426c9747828601a9175a9b0248"],"layout":"IPY_MODEL_97fe7d3ccc984694aa42a56ba930c64e"}},"d7eb412b880c490c95f1d2baeaf2e6af":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_3e7caf19c664461e9505cbfcb708ceba","IPY_MODEL_21e5a4fd28ac47a190d0f41808dd75a1","IPY_MODEL_7f4317fe6eca4fc5be524453e55103bd"],"layout":"IPY_MODEL_4ab2093e02704b748298a6d34e807847"}},"d9816365a87340dfa06fe9a37811a81b":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"d9f1220cc4f5440b97e35151ea76ed00":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"da25ec24a6b44ce9a51bcc2d440d0258":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_
flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"da6373c3704d45869e2fafdf9772d3d1":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"da714c09550f43849e5a2502b092403a":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_b0bcd8201363473ea0e4ace230443446","max":1059962,"min":0,"orientation":"horizontal","style":"IPY_MODEL_d9f1220cc4f5440b97e35151ea76ed00","value":1059962}},"de350491cd2e48a19541f6e97b9f176e":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_i
tems":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"defa4622267e49b8a54d6aceea082c39":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"e00e5537ae9a43d5956c2c770599edde":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-wi
dgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"e0216d9707cb4d10be57c4864521c376":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_2e72912929c14d9f8843bc028b75de77","IPY_MODEL_feeaea5e7b524ed09bd08c73b7ed5e17","IPY_MODEL_1ebb24b8e6f44608ae8c11747ef7c42d"],"layout":"IPY_MODEL_c523aec40bb54986a4f40924c81fe5da"}},"e0280b4f0172481ea7664bfb96d1bb1c":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_60aa6e24133c4e67957bf953e5b10f4d","placeholder":"","style":"IPY_MODEL_5ce301bccee049cf9664800d63e2e2eb","value":"Loading pipeline components...: 
100%"}},"e03272733bd84e2e9edcb83391e1cfae":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"e09eaa48528a48629b19020aa65c5bdc":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overf
low_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"e1f4e7fa8b1f4530b580f101a9fad304":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"e20cd706fa304bc190e92ae7d27b5b38":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_e4f21da1f63a4a819485cbe7e5ae306b","placeholder":"","style":"IPY_MODEL_c7f341fe95ca426d9502594bd48d36f6","value":" 472/472 [00:00<00:00, 
13.2kB/s]"}},"e431b2a589524545a5ccbb79d2c7bab9":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"e4f21da1f63a4a819485cbe7e5ae306b":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"
overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"e92a9d9050e34b47b766a31935ffbcda":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"eac1b421700d492ba398d5ac609b5741":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"eb3e42c270b04086a2a04ae286684f6b":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"eb96b8f49d6746ec8f46e65e59a3fad6
":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"eba9a67d3f704bed8f501780e35273cb":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_f6b5c1f44a54406c84db14875e4c85b0","IPY_MODEL_1f3ce7042b974edbb033b4cd8d13cc08","IPY_MODEL_6de631e7a72e4b74b740095e8c251ca8"],"layout":"IPY_MODEL_00c30d57328148c88b4258f4841bbdd0"}},"ed1dbb98c8d34dce8df70e4698a21911":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"
width":null}},"ed410d69e8e94af7be0d104c5c29a2c9":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"ed6a871ff054410b8d018b3b97c75c60":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"f04f25b18eaa43f6a2044dac7aba8372":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_163ec8057136471bb1f460d657c4aa6c","placeholder":"","style":"IPY_MODEL_ff1655111fd04c4785d7e5ec3747629c","value":"diffusion_pytorch_model.safetensors: 100%"}},"f0e13bd4abca444592850390651272a4":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_757864720c4041c6a24f8aa8f1630e69","placeholder":"","style":"IPY_MODEL_92e5153340a74fc9895d4f87b68e3cad","value":" 11/30 
[02:05<03:34, 11.30s/it]"}},"f1f352e6964f424f9e6a4557f6e3ff97":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"f23c3781df5a497baa56f273a5f467a5":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"ove
rflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"f23d6c684c3a416d8b43fc642653eec8":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"f24c9ef7c3f04da783aaffd3e5d48de7":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"f2f9506fc02a4624a8a1c08f1f6abdb2":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"displ
ay":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"f4f167561c2d452785b6e59c2ce61b28":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"f63bebf985a046ce9ba9d567db0b7ca4":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_5ffaa53c74d3413b8f3c3fe2fc5cc075","max":4723,"min":0,"orientation":"horizontal","style":"IPY_MODEL_aedf99983c5043fe8d634b6e3b56e1ae","value":4723}},"f6b5c1f44a54406c84db14875e4c85b0":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_n
ame":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_a4a86f212e8a4dffb0240936475837f7","placeholder":"","style":"IPY_MODEL_89ce18ce98494ccc803adbf87f6051e5","value":"100%"}},"f6fb32e142d140d5ad6b357731c4d382":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"f736f2fc44444ebf832fb2ead6ea0fd0":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"f7f49c60f4ca41efa9a3b6b33c418f73":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutV
iew","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"f8db6aae6e3a468ba3805991e19e1f45":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"fa1d984ee1864975a4c543f1dcb3aa42":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count
":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_f2f9506fc02a4624a8a1c08f1f6abdb2","placeholder":"","style":"IPY_MODEL_2f010760c1a146238f35af00568ccb11","value":" 1.22G/1.22G [00:24<00:00, 110MB/s]"}},"fa5231429aa1437983dc93dc597e698e":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"fb34d344590e43438906722b1ab958f0":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_ar
eas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"fd4eebbe68204eaa802186836c372b93":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"fe24820349a0456ca103e30024490c0e":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"feeaea5e7b524ed09bd08c73b7ed5e17":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_d
om_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_495a6de427ce47eda18a1570fd8f5f9d","max":547,"min":0,"orientation":"horizontal","style":"IPY_MODEL_eac1b421700d492ba398d5ac609b5741","value":547}},"ff1655111fd04c4785d7e5ec3747629c":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"ff697b2e046a4a36a78ac362cae5c2a9":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"ffe65c700f0142df9b289a0e8b58ec65":{"model_module":"@jupyter-widgets/contr
ols","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}}}}},"nbformat":4,"nbformat_minor":4}
diff --git a/notebooks/spa/tgi_messages_api_demo.ipynb b/notebooks/spa/tgi_messages_api_demo.ipynb
new file mode 100644
index 00000000..b2e53af6
--- /dev/null
+++ b/notebooks/spa/tgi_messages_api_demo.ipynb
@@ -0,0 +1,514 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Migrating from OpenAI to Open LLMs Using TGI's Messages API\n",
+ "\n",
+ "_Authored by: [Andrew Reed](https://huggingface.co/andrewrreed)_\n",
+ "\n",
+ "This notebook demonstrates how you can easily transition from OpenAI models to Open LLMs without needing to refactor any existing code.\n",
+ "\n",
+ "[Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) now offers a [Messages API](https://huggingface.co/blog/tgi-messages-api), making it directly compatible with the OpenAI Chat Completion API. This means that any existing scripts that use OpenAI models (via the OpenAI client library or third-party tools like LangChain or LlamaIndex) can be directly swapped out to use any open LLM running on a TGI endpoint!\n",
+ "\n",
+ "This allows you to quickly test out and benefit from the numerous advantages offered by open models. Things like:\n",
+ "\n",
+ "- Complete control and transparency over models and data\n",
+ "- No more worrying about rate limits\n",
+ "- The ability to fully customize systems according to your specific needs\n",
+ "\n",
+ "In this notebook, we'll show you how to:\n",
+ "\n",
+ "1. [Create an Inference Endpoint to Deploy a Model with TGI](#section_1)\n",
+ "2. [Query the Inference Endpoint with OpenAI Client Libraries](#section_2)\n",
+ "3. [Integrate the Endpoint with LangChain and LlamaIndex Workflows](#section_3)\n",
+ "\n",
+ "**Let's dive in!**\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Setup\n",
+ "\n",
+ "First we need to install dependencies and set an HF API key.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "!pip install --upgrade -q huggingface_hub langchain langchain-community langchainhub langchain-openai llama-index chromadb bs4 sentence_transformers torch"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "import getpass\n",
+ "\n",
+ "# enter API key\n",
+ "os.environ[\"HUGGINGFACEHUB_API_TOKEN\"] = HF_API_KEY = getpass.getpass()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "## 1. Create an Inference Endpoint\n",
+ "\n",
+ "To get started, let's deploy [Nous-Hermes-2-Mixtral-8x7B-DPO](https://huggingface.co/NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO), a fine-tuned Mixtral model, to Inference Endpoints using TGI.\n",
+ "\n",
+ "We can deploy the model in just [a few clicks from the UI](https://ui.endpoints.huggingface.co/new?vendor=aws&repository=NousResearch%2FNous-Hermes-2-Mixtral-8x7B-DPO&tgi_max_total_tokens=32000&tgi=true&tgi_max_input_length=1024&task=text-generation&instance_size=2xlarge&tgi_max_batch_prefill_tokens=2048&tgi_max_batch_total_tokens=1024000&no_suggested_compute=true&accelerator=gpu&region=us-east-1), or take advantage of the `huggingface_hub` Python library to programmatically create and manage Inference Endpoints.\n",
+ "\n",
+ "We'll use the Hub library here by specifying an endpoint name and model repository, along with the task of `text-generation`. In this example, we use a `protected` type, so access to the deployed model will require a valid Hugging Face token. We also need to configure the hardware requirements like vendor, region, accelerator, instance type, and size. You can check out the list of available resource options [using this API call](https://api.endpoints.huggingface.cloud/#get-/v2/provider), and view recommended configurations for select models in the catalog [here](https://ui.endpoints.huggingface.co/catalog).\n",
+ "\n",
+ "_Note: You may need to request a quota upgrade by sending an email to [api-enterprise@huggingface.co](mailto:api-enterprise@huggingface.co)_\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "running\n"
+ ]
+ }
+ ],
+ "source": [
+ "from huggingface_hub import create_inference_endpoint\n",
+ "\n",
+ "endpoint = create_inference_endpoint(\n",
+ " \"nous-hermes-2-mixtral-8x7b-demo\",\n",
+ " repository=\"NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO\",\n",
+ " framework=\"pytorch\",\n",
+ " task=\"text-generation\",\n",
+ " accelerator=\"gpu\",\n",
+ " vendor=\"aws\",\n",
+ " region=\"us-east-1\",\n",
+ " type=\"protected\",\n",
+ " instance_type=\"p4de\",\n",
+ " instance_size=\"2xlarge\",\n",
+ " custom_image={\n",
+ " \"health_route\": \"/health\",\n",
+ " \"env\": {\n",
+ " \"MAX_INPUT_LENGTH\": \"4096\",\n",
+ " \"MAX_BATCH_PREFILL_TOKENS\": \"4096\",\n",
+ " \"MAX_TOTAL_TOKENS\": \"32000\",\n",
+ " \"MAX_BATCH_TOTAL_TOKENS\": \"1024000\",\n",
+ " \"MODEL_ID\": \"/repository\",\n",
+ " },\n",
+ " \"url\": \"ghcr.io/huggingface/text-generation-inference:sha-1734540\", # must be >= 1.4.0\n",
+ " },\n",
+ ")\n",
+ "\n",
+ "endpoint.wait()\n",
+ "print(endpoint.status)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "It will take a few minutes for our deployment to spin up. We can use the `.wait()` utility to block the running thread until the endpoint reaches a final \"running\" state. Once running, we can confirm its status and take it for a spin via the UI Playground:\n",
+ "\n",
+ "\n",
+ "\n",
+ "Great, we now have a working endpoint!\n",
+ "\n",
+ "_Note: When deploying with `huggingface_hub`, your endpoint will scale-to-zero after 15 minutes of idle time by default to optimize cost during periods of inactivity. Check out [the Hub Python Library documentation](https://huggingface.co/docs/huggingface_hub/guides/inference_endpoints) to see all the functionality available for managing your endpoint lifecycle._\n"
+ ]
+ },
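+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a rough sketch of that lifecycle management, the helper below resumes a paused endpoint and blocks until it is running again. `ensure_running` is a hypothetical name and the 600-second timeout is an arbitrary assumption; `status`, `resume()`, and `wait()` are members of the `InferenceEndpoint` object returned by `huggingface_hub`.\n",
+ "\n",
+ "```python\n",
+ "def ensure_running(endpoint, timeout=600):\n",
+ "    \"\"\"Resume a paused endpoint, then block until it is running.\"\"\"\n",
+ "    if endpoint.status == \"paused\":\n",
+ "        endpoint.resume()  # restart an endpoint stopped with endpoint.pause()\n",
+ "    endpoint.wait(timeout=timeout)  # raises if the deadline passes first\n",
+ "    return endpoint.status\n",
+ "```\n",
+ "\n",
+ "When you are done experimenting, `endpoint.pause()` stops the endpoint until you resume it, and `endpoint.delete()` removes it entirely.\n"
+ ]
+ },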
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "## 2. Query the Inference Endpoint with OpenAI Client Libraries\n",
+ "\n",
+ "As mentioned above, since our model is hosted with TGI, it supports the Messages API, meaning we can query it directly using the familiar OpenAI client libraries.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### With the Python client\n",
+ "\n",
+ "The example below shows how to make this transition using the [OpenAI Python Library](https://github.com/openai/openai-python). Simply set `base_url` to your endpoint URL (be sure to include the `v1/` suffix) and `api_key` to a valid Hugging Face user token. The endpoint URL can be gathered from the Inference Endpoints UI, or from the endpoint object we created above with `endpoint.url`.\n",
+ "\n",
+ "We can then use the client as usual, passing a list of messages to stream responses from our Inference Endpoint.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Open-source software is important due to a number of reasons, including:\n",
+ "\n",
+ "1. Collaboration: The collaborative nature of open-source software allows developers from around the world to work together, share their ideas and improve the code. This often results in faster progress and better software.\n",
+ "\n",
+ "2. Transparency: With open-source software, the code is publicly available, making it easy to see exactly how the software functions, and allowing users to determine if there are any security vulnerabilities.\n",
+ "\n",
+ "3. Customization: Being able to access the code also allows users to customize the software to better suit their needs. This makes open-source software incredibly versatile, as users can tweak it to suit their specific use case.\n",
+ "\n",
+ "4. Quality: Open-source software is often developed by large communities of dedicated developers, who work together to improve the software. This results in a higher level of quality than might be found in proprietary software.\n",
+ "\n",
+ "5. Cost: Open-source software is often provided free of charge, which makes it accessible to a wider range of users. This can be especially important for organizations with limited budgets for software.\n",
+ "\n",
+ "6. Shared Benefit: By sharing the code of open-source software, everyone can benefit from the hard work of the developers. This contributes to the overall advancement of technology, as users and developers work together to improve and build upon the software.\n",
+ "\n",
+ "In summary, open-source software provides a collaborative platform that leads to high-quality, customizable, and transparent software, all available at little or no cost, benefiting both individuals and the technology community as a whole.<|im_end|>"
+ ]
+ }
+ ],
+ "source": [
+ "from openai import OpenAI\n",
+ "\n",
+ "BASE_URL = endpoint.url\n",
+ "\n",
+ "# init the client but point it to TGI\n",
+ "client = OpenAI(\n",
+ " base_url=f\"{BASE_URL.rstrip('/')}/v1/\",  # build the URL by hand; os.path.join uses backslashes on Windows\n",
+ " api_key=HF_API_KEY,\n",
+ ")\n",
+ "chat_completion = client.chat.completions.create(\n",
+ " model=\"tgi\",\n",
+ " messages=[\n",
+ " {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n",
+ " {\"role\": \"user\", \"content\": \"Why is open-source software important?\"},\n",
+ " ],\n",
+ " stream=True,\n",
+ " max_tokens=500,\n",
+ ")\n",
+ "\n",
+ "# iterate and print stream\n",
+ "for message in chat_completion:\n",
+ " print(message.choices[0].delta.content or \"\", end=\"\")"
+ ]
+ },
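+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The loop above prints each delta as it arrives. If you would rather gather the whole reply into a single string, a small helper along these lines might do; `collect_stream` is a hypothetical name, and the sketch assumes only that each chunk exposes `choices[0].delta.content` as in the loop above.\n",
+ "\n",
+ "```python\n",
+ "def collect_stream(chunks):\n",
+ "    \"\"\"Join the text deltas of a streamed chat completion into one string.\"\"\"\n",
+ "    parts = []\n",
+ "    for chunk in chunks:\n",
+ "        if not chunk.choices:  # tolerate chunks that carry no choices\n",
+ "            continue\n",
+ "        delta = chunk.choices[0].delta.content\n",
+ "        if delta:  # skip None or empty deltas\n",
+ "            parts.append(delta)\n",
+ "    return \"\".join(parts)\n",
+ "```\n",
+ "\n",
+ "It can be applied to the object returned by `client.chat.completions.create(..., stream=True)`, keeping in mind that a stream can only be consumed once.\n"
+ ]
+ },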
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Behind the scenes, TGI’s Messages API automatically converts the list of messages into the model’s required instruction format using its [chat template](https://huggingface.co/docs/transformers/chat_templating).\n",
+ "\n",
+ "_Note: Certain OpenAI features, like function calling, are not compatible with TGI. Currently, the Messages API supports the following chat completion parameters: `stream`, `max_tokens`, `frequency_penalty`, `logprobs`, `seed`, `temperature`, and `top_p`._\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### With the JavaScript client\n",
+ "\n",
+ "Here’s the same streaming example as above, but using the [OpenAI JavaScript/TypeScript Library](https://github.com/openai/openai-node).\n",
+ "\n",
+ "```js\n",
+ "import OpenAI from \"openai\";\n",
+ "\n",
+ "const openai = new OpenAI({\n",
+ " baseURL: \"\" + \"/v1/\", // replace with your endpoint url\n",
+ " apiKey: \"\", // replace with your token\n",
+ "});\n",
+ "\n",
+ "async function main() {\n",
+ " const stream = await openai.chat.completions.create({\n",
+ " model: \"tgi\",\n",
+ " messages: [\n",
+ " { role: \"system\", content: \"You are a helpful assistant.\" },\n",
+ " { role: \"user\", content: \"Why is open-source software important?\" },\n",
+ " ],\n",
+ " stream: true,\n",
+ " max_tokens: 500,\n",
+ " });\n",
+ " for await (const chunk of stream) {\n",
+ " process.stdout.write(chunk.choices[0]?.delta?.content || \"\");\n",
+ " }\n",
+ "}\n",
+ "\n",
+ "main();\n",
+ "```\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "## 3. Integrate with LangChain and LlamaIndex\n",
+ "\n",
+ "Now, let’s see how to use this newly created endpoint with popular RAG frameworks like LangChain and LlamaIndex.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### How to use with LangChain\n",
+ "\n",
+ "To use it in [LangChain](https://python.langchain.com/docs/get_started/introduction), simply create an instance of `ChatOpenAI` and pass your endpoint URL (`BASE_URL`) and API token (`HF_API_KEY`) as follows:\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "AIMessage(content='Open-source software is important for several reasons:\\n\\n1. Transparency: Open-source software allows users to see the underlying code, making it easier to understand how the software works and identify any potential security vulnerabilities or bugs. This transparency fosters trust between users and developers.\\n\\n2. Collaboration: Open-source projects encourage collaboration among developers, allowing them to work together to improve the software, fix issues, and add new features. This collective effort can lead to')"
+ ]
+ },
+ "execution_count": 7,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "from langchain_openai import ChatOpenAI\n",
+ "\n",
+ "llm = ChatOpenAI(\n",
+ " model_name=\"tgi\",\n",
+ " openai_api_key=HF_API_KEY,\n",
+ " openai_api_base=BASE_URL.rstrip(\"/\") + \"/v1/\",  # build the URL with string ops; os.path.join is for filesystem paths\n",
+ ")\n",
+ "llm.invoke(\"Why is open-source software important?\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We’re able to directly leverage the same `ChatOpenAI` class that we would have used with the OpenAI models. This allows all previous code to work with our endpoint by changing just one line of code.\n",
+ "\n",
+ "Let’s now use our Mixtral model in a simple RAG pipeline to answer a question over the contents of a Hugging Face blog post.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "{'context': [Document(page_content='To overcome this weakness, amongst other approaches, one can integrate the LLM into a system where it can call tools: such a system is called an LLM agent.\\nIn this post, we explain the inner workings of ReAct agents, then show how to build them using the ChatHuggingFace class recently integrated in LangChain. Finally, we benchmark several open-source LLMs against GPT-3.5 and GPT-4.', metadata={'description': 'We’re on a journey to advance and democratize artificial intelligence through open source and open science.', 'language': 'No language found.', 'source': 'https://huggingface.co/blog/open-source-llms-as-agents', 'title': 'Open-source LLMs as LangChain Agents'}),\n",
+ " Document(page_content='Since the open-source models were not specifically fine-tuned for calling functions in the given output format, they are at a slight disadvantage compared to the OpenAI agents.\\nDespite this, some models perform really well! 💪\\nHere’s an example of Mixtral-8x7B answering the question: “Which city has a larger population, Guiyang or Tacheng?”\\nThought: To answer this question, I need to find the current populations of both Guiyang and Tacheng. I will use the search tool to find this information.\\nAction:\\n{', metadata={'description': 'We’re on a journey to advance and democratize artificial intelligence through open source and open science.', 'language': 'No language found.', 'source': 'https://huggingface.co/blog/open-source-llms-as-agents', 'title': 'Open-source LLMs as LangChain Agents'}),\n",
+ " Document(page_content='Agents Showdown: how do open-source LLMs perform as general purpose reasoning agents?\\n\\t\\n\\nYou can find the code for this benchmark here.\\n\\n\\n\\n\\n\\n\\t\\tEvaluation\\n\\t\\n\\nWe want to measure how open-source LLMs perform as general purpose reasoning agents. Thus we select questions requiring using logic and the use of basic tools: a calculator and access to internet search.\\nThe final dataset is a combination of samples from 3 other datasets:', metadata={'description': 'We’re on a journey to advance and democratize artificial intelligence through open source and open science.', 'language': 'No language found.', 'source': 'https://huggingface.co/blog/open-source-llms-as-agents', 'title': 'Open-source LLMs as LangChain Agents'}),\n",
+ " Document(page_content='Open-source LLMs as LangChain Agents\\n\\t\\n\\nPublished\\n\\t\\t\\t\\tJanuary 24, 2024\\nUpdate on GitHub\\n\\nm-ric\\nAymeric Roucher\\n\\n\\n\\n\\nJofthomas\\nJoffrey THOMAS\\n\\n\\n\\n\\nandrewrreed\\nAndrew Reed\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\t\\tTL;DR\\n\\t\\n\\nOpen-source LLMs have now reached a performance level that makes them suitable reasoning engines for powering agent workflows: Mixtral even surpasses GPT-3.5 on our benchmark, and its performance could easily be further enhanced with fine-tuning.\\n\\n\\n\\n\\n\\n\\t\\tIntroduction', metadata={'description': 'We’re on a journey to advance and democratize artificial intelligence through open source and open science.', 'language': 'No language found.', 'source': 'https://huggingface.co/blog/open-source-llms-as-agents', 'title': 'Open-source LLMs as LangChain Agents'})],\n",
+ " 'question': 'According to this article which open-source model is the best for an agent behaviour?',\n",
+ " 'answer': 'According to the article, Mixtral-8x7B is an open-source LLM that performs really well as a general-purpose reasoning agent. It even surpasses GPT-3.5 on the benchmark in the article.'}"
+ ]
+ },
+ "execution_count": 8,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "from langchain import hub\n",
+ "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
+ "from langchain_community.document_loaders import WebBaseLoader\n",
+ "from langchain_community.vectorstores import Chroma\n",
+ "from langchain_core.output_parsers import StrOutputParser\n",
+ "from langchain_core.runnables import RunnablePassthrough\n",
+ "from langchain_core.runnables import RunnableParallel\n",
+ "from langchain_community.embeddings import HuggingFaceEmbeddings\n",
+ "\n",
+ "# Load, chunk and index the contents of the blog\n",
+ "loader = WebBaseLoader(\n",
+ " web_paths=(\"https://huggingface.co/blog/open-source-llms-as-agents\",),\n",
+ ")\n",
+ "docs = loader.load()\n",
+ "\n",
+ "# declare an HF embedding model\n",
+ "hf_embeddings = HuggingFaceEmbeddings(model_name=\"BAAI/bge-large-en-v1.5\")\n",
+ "\n",
+ "text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=200)\n",
+ "splits = text_splitter.split_documents(docs)\n",
+ "vectorstore = Chroma.from_documents(documents=splits, embedding=hf_embeddings)\n",
+ "\n",
+ "# Retrieve and generate using the relevant snippets of the blog\n",
+ "retriever = vectorstore.as_retriever()\n",
+ "prompt = hub.pull(\"rlm/rag-prompt\")\n",
+ "\n",
+ "\n",
+ "def format_docs(docs):\n",
+ " return \"\\n\\n\".join(doc.page_content for doc in docs)\n",
+ "\n",
+ "\n",
+ "rag_chain_from_docs = (\n",
+ " RunnablePassthrough.assign(context=(lambda x: format_docs(x[\"context\"])))\n",
+ " | prompt\n",
+ " | llm\n",
+ " | StrOutputParser()\n",
+ ")\n",
+ "\n",
+ "rag_chain_with_source = RunnableParallel(\n",
+ " {\"context\": retriever, \"question\": RunnablePassthrough()}\n",
+ ").assign(answer=rag_chain_from_docs)\n",
+ "\n",
+ "rag_chain_with_source.invoke(\n",
+ " \"According to this article which open-source model is the best for an agent behaviour?\"\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### How to use with LlamaIndex\n",
+ "\n",
+ "Similarly, you can also use a TGI endpoint in [LlamaIndex](https://www.llamaindex.ai/). We’ll use the `OpenAILike` class, and instantiate it by configuring some additional arguments (i.e. `is_local`, `is_function_calling_model`, `is_chat_model`, `context_window`).\n",
+ "\n",
+ "_Note: the context window argument should match the value previously set for `MAX_TOTAL_TOKENS` on your endpoint._\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "CompletionResponse(text='Open-source software is important for several reasons:\\n\\n1. Transparency: Open-source software allows users to see the source code, which means they can understand how the software works and how it processes data. This transparency helps build trust in the software and its developers.\\n\\n2. Collaboration: Open-source software encourages collaboration among developers, who can contribute to the code, fix bugs, and add new features. This collaborative approach often leads to faster development and', additional_kwargs={}, raw={'id': '', 'choices': [Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content='Open-source software is important for several reasons:\\n\\n1. Transparency: Open-source software allows users to see the source code, which means they can understand how the software works and how it processes data. This transparency helps build trust in the software and its developers.\\n\\n2. Collaboration: Open-source software encourages collaboration among developers, who can contribute to the code, fix bugs, and add new features. This collaborative approach often leads to faster development and', role='assistant', function_call=None, tool_calls=None))], 'created': 1707342025, 'model': '/repository', 'object': 'text_completion', 'system_fingerprint': '1.4.0-sha-1734540', 'usage': CompletionUsage(completion_tokens=100, prompt_tokens=18, total_tokens=118)}, delta=None)"
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "from llama_index.llms import OpenAILike\n",
+ "\n",
+ "llm = OpenAILike(\n",
+ " model=\"tgi\",\n",
+ " api_key=HF_API_KEY,\n",
+ " api_base=BASE_URL + \"/v1/\",\n",
+ " is_chat_model=True,\n",
+ " is_local=False,\n",
+ " is_function_calling_model=False,\n",
+ " context_window=4096,\n",
+ ")\n",
+ "\n",
+ "llm.complete(\"Why is open-source software important?\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can now use it in a similar RAG pipeline. Keep in mind that the value you chose for `MAX_INPUT_LENGTH` in your Inference Endpoint directly limits the number of retrieved chunks (`similarity_top_k`) the model can process.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from llama_index import (\n",
+ " ServiceContext,\n",
+ " VectorStoreIndex,\n",
+ ")\n",
+ "from llama_index import download_loader\n",
+ "from llama_index.embeddings import HuggingFaceEmbedding\n",
+ "from llama_index.query_engine import CitationQueryEngine\n",
+ "\n",
+ "\n",
+ "SimpleWebPageReader = download_loader(\"SimpleWebPageReader\")\n",
+ "\n",
+ "documents = SimpleWebPageReader(html_to_text=True).load_data(\n",
+ " [\"https://huggingface.co/blog/open-source-llms-as-agents\"]\n",
+ ")\n",
+ "\n",
+ "# Load embedding model\n",
+ "embed_model = HuggingFaceEmbedding(model_name=\"BAAI/bge-large-en-v1.5\")\n",
+ "\n",
+ "# Pass LLM to pipeline\n",
+ "service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=llm)\n",
+ "index = VectorStoreIndex.from_documents(\n",
+ " documents, service_context=service_context, show_progress=True\n",
+ ")\n",
+ "\n",
+ "# Query the index\n",
+ "query_engine = CitationQueryEngine.from_args(\n",
+ " index,\n",
+ " similarity_top_k=2,\n",
+ ")\n",
+ "response = query_engine.query(\n",
+ " \"According to this article which open-source model is the best for an agent behaviour?\"\n",
+ ")\n",
+ "\n",
+ "response.response"
+ ]
+ },
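+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The link between `MAX_INPUT_LENGTH` and `similarity_top_k` mentioned above can be estimated with simple arithmetic. The sketch below is only a rough guide: the `max_chunks_that_fit` helper and every number in it are hypothetical, so substitute the `MAX_INPUT_LENGTH` you configured and the chunk size your index actually produces.\n",
+ "\n",
+ "```python\n",
+ "# Hypothetical helper: estimate an upper bound for `similarity_top_k`.\n",
+ "def max_chunks_that_fit(max_input_length, prompt_overhead_tokens, chunk_tokens):\n",
+ "    \"\"\"Return how many chunks fit next to the prompt template and the question.\"\"\"\n",
+ "    available = max_input_length - prompt_overhead_tokens\n",
+ "    if available <= 0 or chunk_tokens <= 0:\n",
+ "        return 0\n",
+ "    return available // chunk_tokens\n",
+ "\n",
+ "\n",
+ "# Assumed values, for illustration only.\n",
+ "print(\n",
+ "    max_chunks_that_fit(\n",
+ "        max_input_length=3072, prompt_overhead_tokens=200, chunk_tokens=1024\n",
+ "    )\n",
+ ")\n",
+ "```\n",
+ "\n",
+ "If the estimate is lower than the `similarity_top_k` you want, either retrieve fewer chunks or use smaller chunks.\n"
+ ]
+ },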
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Wrap up\n",
+ "\n",
+ "Once you are done with your endpoint, you can either pause or delete it. You can do this via the UI, or programmatically as follows.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# pause our running endpoint\n",
+ "endpoint.pause()\n",
+ "\n",
+ "# optionally delete\n",
+ "# endpoint.delete()"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": ".venv",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.11"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}