This project demonstrates how to use Kedro to create a Retrieval-Augmented Generation (RAG)-based chatbot.
The chatbot is designed to assist users with Kedro-related questions by leveraging historical Q&A data from our Kedro Slack support channel. It creates a vector store from Slack conversations and employs a Generative AI-based agent to retrieve relevant context and generate accurate responses.
See the demo on YouTube.
Note: this project is a toy example, designed to explain how you can use Kedro to structure and manage GenAI workflows. While not production-ready, it provides a strong foundation for more advanced implementations.
- Extracts Q&A data from Slack conversations
- Converts text data into embeddings and stores them in a vector database
- Implements a retrieval-augmented chatbot using LangChain and OpenAI
- Provides an interactive CLI for asking questions
- Compares RAG-based answers with responses from a standard LLM (without context retrieval)
- Saves interaction logs, including user questions, retrieved context, and chatbot responses
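The embedding and retrieval steps listed above can be sketched in a few lines of plain Python. This is a toy illustration only: it swaps in a bag-of-words vector and brute-force cosine similarity for the OpenAI embeddings and vector database the project uses, and the `ToyVectorStore` class, the `embed` helper, and the sample snippets are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: term counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """In-memory list of (embedding, text) pairs with brute-force search."""

    def __init__(self) -> None:
        self._items: list[tuple[Counter, str]] = []

    def add(self, text: str) -> None:
        self._items.append((embed(text), text))

    def search(self, query: str, k: int = 1) -> list[str]:
        q = embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[0]), reverse=True)
        return [text for _, text in ranked[:k]]


# Hypothetical snippets standing in for the processed Slack Q&A data.
store = ToyVectorStore()
store.add("Node execution order is determined by dataset dependencies between nodes.")
store.add("Credentials are read from a credentials.yml file in the conf directory.")
store.add("Use tags to run a subset of nodes with kedro run.")

question = "How is node execution order decided?"
context = store.search(question, k=1)

# The retrieved context is placed in the prompt before the question is sent to the model.
prompt = f"Context:\n{context[0]}\n\nQuestion: {question}"
```

A standard LLM call would send only the question; the RAG agent sends the prompt that includes the retrieved context, which is the difference the chatbot lets you compare.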
Clone the repository and install the dependencies:

    git clone https://github.com/ElenaKhaustova/kedro-rag-chatbot.git
    cd kedro-rag-chatbot
    pip install -r requirements.txt

Create a credentials.yml file and place it in the conf/base/ directory with the following format:

    openai:
      openai_api_base: <openai-api-base>
      openai_api_key: <openai-api-key>

The necessary raw data for a test run is already included in data/01_raw.
First, create the vector store. This step processes the Slack Q&A data and stores embeddings in a vector database.

    kedro run -p create_vector_store

Then start the chatbot. This step initializes the AI agent, allowing it to query the vector store and generate responses.

    kedro run -t agent_rag

Note: the agent_rag pipeline is run by tag (-t agent_rag) rather than by pipeline name (-p), because selecting by tag lets it reuse some nodes from the create_vector_store pipeline.
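The difference between selecting nodes by pipeline name (-p) and by tag (-t) can be sketched in plain Python. This is a toy model, not Kedro's implementation: the `ToyNode` class and the node names are hypothetical, and which nodes the real project tags with agent_rag is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyNode:
    name: str
    pipeline: str                    # pipeline in which the node is defined
    tags: frozenset = frozenset()    # tags attached to the node


# Hypothetical nodes; the second one is defined in create_vector_store
# but also carries the agent_rag tag so the agent run can reuse it.
all_nodes = [
    ToyNode("index_slack_messages", "create_vector_store"),
    ToyNode("init_vector_store", "create_vector_store", frozenset({"agent_rag"})),
    ToyNode("run_agent", "agent_rag", frozenset({"agent_rag"})),
]


def select_by_pipeline(nodes, pipeline_name):
    """Rough analogue of running with -p: keep nodes defined in one pipeline."""
    return [n.name for n in nodes if n.pipeline == pipeline_name]


def select_by_tag(nodes, tag):
    """Rough analogue of running with -t: keep tagged nodes from any pipeline."""
    return [n.name for n in nodes if tag in n.tags]


by_pipeline = select_by_pipeline(all_nodes, "create_vector_store")
by_tag = select_by_tag(all_nodes, "agent_rag")
```

In this sketch the tag selection picks up `init_vector_store` even though it is defined in a different pipeline, which is the kind of reuse the note above describes.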
Once the chatbot is running, you can interact with it via the CLI. For each question you ask, the chatbot will provide:
- A response generated by the RAG agent using retrieved context.
- A response from a standard LLM without context retrieval.
This allows you to compare the effectiveness of retrieval-augmented generation versus a general-purpose model.
After exiting the loop, all questions asked, retrieved context, and generated answers are saved in data/08_reporting/output.md.
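The interaction loop and the saved log can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: the `Interaction` record, the `exit` command, the markdown section layout, and the stub callables that stand in for the retriever and the two models are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Interaction:
    question: str
    context: str
    rag_answer: str
    llm_answer: str


def chat_loop(questions, retrieve, rag_answer, llm_answer):
    """Answer each question twice (with and without context) and collect a log."""
    log = []
    for question in questions:
        if question.strip().lower() == "exit":  # hypothetical exit command
            break
        context = retrieve(question)
        log.append(
            Interaction(question, context, rag_answer(question, context), llm_answer(question))
        )
    return log


def to_markdown(log):
    """Render the log as markdown; the section layout is an assumption."""
    parts = []
    for i, item in enumerate(log, start=1):
        parts.append(
            f"## Question {i}\n\n{item.question}\n\n"
            f"### Retrieved context\n\n{item.context}\n\n"
            f"### RAG answer\n\n{item.rag_answer}\n\n"
            f"### LLM answer\n\n{item.llm_answer}\n"
        )
    return "\n".join(parts)


# Stub callables stand in for the retriever and the two models.
log = chat_loop(
    ["What is a node?", "exit", "never asked"],
    retrieve=lambda q: "stub context",
    rag_answer=lambda q, ctx: f"answer using [{ctx}]",
    llm_answer=lambda q: "answer without context",
)
report = to_markdown(log)
```

Writing `report` to a file, for example with `pathlib.Path.write_text`, would produce a report similar in spirit to the one described above.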
Example question:

How can I force node execution order in the Kedro pipeline?

Response from the standard LLM without context retrieval (inaccurate: Kedro's node function has no before or after arguments):

To force node execution order in the Kedro pipeline, you can use the before and after arguments when defining your nodes in the pipeline.py file.
For example, if you have two nodes node1 and node2, and you want node1 to run before node2, you can define them like this:

    from kedro.pipeline import node

    def create_pipeline():
        return Pipeline(
            [
                node(func=node1, inputs="input_data", outputs="output_data", name="node1"),
                node(func=node2, inputs="output_data", outputs="final_output", name="node2", before="node1")
            ]
        )

In this example, node2 will run after node1 because of the before="node1" argument. You can also use the after argument to specify that a node should run after another node.
By using these arguments, you can control the execution order of nodes in your Kedro pipeline.
Response from the RAG agent using retrieved context:

In Kedro, the recommended approach is to rely on the topological sorting of nodes in the pipeline to determine the execution order. However, if you need to force a specific node execution order, you can create "pass-through" nodes to achieve this.
Here is an example of how you can force node execution order by creating pass-through nodes:

    from kedro.pipeline import node

    def pass_through_function(input_data):
        # Perform any necessary operations here
        return input_data

    pass_through_node = node(pass_through_function, inputs="input_data", outputs="output_data")

    # Define your pipeline with the pass-through nodes to enforce execution order
    pipeline = Pipeline([
        pass_through_node,
        node(process_data, inputs="output_data", outputs="processed_data"),
        node(analyze_data, inputs="processed_data", outputs=None)
    ])

By inserting pass-through nodes between the nodes that need to be executed in a specific order, you can enforce the desired execution sequence in the pipeline.
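The topological sorting mentioned above means that execution order follows from which datasets each node reads and writes, not from the order in which nodes are listed. A minimal sketch of that idea, using the standard library's `graphlib` rather than Kedro itself; the node and dataset names mirror the example above, and the dictionary layout is an assumption made for illustration:

```python
from graphlib import TopologicalSorter

# Each node is described by the datasets it reads and writes.
# The nodes are deliberately listed in the "wrong" order.
nodes = {
    "analyze_data": {"inputs": {"processed_data"}, "outputs": set()},
    "process_data": {"inputs": {"output_data"}, "outputs": {"processed_data"}},
    "pass_through": {"inputs": {"input_data"}, "outputs": {"output_data"}},
}

# Map each dataset to the node that produces it.
producers = {ds: name for name, spec in nodes.items() for ds in spec["outputs"]}

# A node depends on every node that produces one of its inputs.
# Datasets with no producer (here input_data) are treated as external inputs.
graph = {
    name: {producers[ds] for ds in spec["inputs"] if ds in producers}
    for name, spec in nodes.items()
}

order = list(TopologicalSorter(graph).static_order())
print(order)  # ['pass_through', 'process_data', 'analyze_data']
```

Because `process_data` consumes the dataset that `pass_through` produces, it cannot run first, which is why chaining datasets through a pass-through node fixes the order.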