
Creating augmented suggester #54


Merged

merged 30 commits on Jun 28, 2025 (changes shown below are from 2 of the 30 commits)

Commits
a82e872
Created new augmented_model_suggester and corresponding utils.
grace-sng7 May 19, 2025
63e6d82
Updated dependencies according to augmented_model_suggester and corre…
grace-sng7 May 19, 2025
b81e676
Updated LLM query prompt.
grace-sng7 May 19, 2025
2af6350
Minor fixes after testing AugmentedModelSuggester.
grace-sng7 May 20, 2025
e349bf5
Edited CauseNet search function.
grace-sng7 May 25, 2025
633d76d
Updated README.md to include augmented_model_suggester
grace-sng7 May 25, 2025
a474e5d
Update README.md
grace-sng7 May 27, 2025
d31fc79
Merge pull request #53 from grace-sng7/creating_augmented_suggester
grace-sng7 May 27, 2025
6b71deb
Update README.md
grace-sng7 May 27, 2025
d763517
Added augmented model suggester examples notebook
grace-sng7 May 27, 2025
e2f57e4
Merge pull request #55 from grace-sng7/creating_augmented_suggester
grace-sng7 May 27, 2025
072adc4
Uploaded augmented model suggester examples notebook again.
grace-sng7 May 27, 2025
8d82bd9
Merge branch 'py-why:creating_augmented_suggester' into creating_augm…
grace-sng7 May 27, 2025
fd17322
Merge pull request #56 from grace-sng7/creating_augmented_suggester
grace-sng7 May 27, 2025
05d9aa9
Set to ignore notebook testing for augmented model suggester examples
grace-sng7 May 27, 2025
0d1c2b5
Merge pull request #57 from grace-sng7/augmented_suggester
grace-sng7 May 27, 2025
00b61a2
Updated augmented_model_suggester_examples notebooks, docstrings, and…
grace-sng7 Jun 9, 2025
83a968f
Merge pull request #58 from grace-sng7/augmented_suggester
grace-sng7 Jun 9, 2025
2c6c7c2
Updated citations
grace-sng7 Jun 11, 2025
bfde305
Merge pull request #59 from grace-sng7/augmented_suggester
grace-sng7 Jun 11, 2025
9148c2a
Edited augmented model suggester llm_query method
grace-sng7 Jun 27, 2025
e868bb2
Merge pull request #60 from grace-sng7/augmented_suggester
grace-sng7 Jun 27, 2025
1e19398
Updated ignore_notebooks in tests
grace-sng7 Jun 27, 2025
ad04e9b
Merge pull request #61 from grace-sng7/augmented_suggester
grace-sng7 Jun 27, 2025
a90e0e4
Added onnxruntime dependency
grace-sng7 Jun 28, 2025
59c6012
Merge pull request #62 from grace-sng7/augmented_suggester
grace-sng7 Jun 28, 2025
a68930b
onnxruntime dependency
grace-sng7 Jun 28, 2025
0f769a0
Merge pull request #63 from grace-sng7/augmented_suggester
grace-sng7 Jun 28, 2025
2e1ca67
Removed onnxruntime-silicon
grace-sng7 Jun 28, 2025
de4c85f
Merge pull request #64 from grace-sng7/augmented_suggester
grace-sng7 Jun 28, 2025
18 changes: 18 additions & 0 deletions docs/notebooks/augmented_model_suggester_examples.ipynb
@@ -53,6 +53,15 @@
"execution_count": null,
"outputs": []
},
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Here we introduce the AugmentedModelSuggester class. Creating an instance of it enables the chosen LLM to utilize Retrieval Augmented Generation (RAG) to determine causality. It currently does this by searching the CauseNet dataset for a relevant causal pair and augmenting the LLM with the corresponding evidence/information stored in CauseNet."
+      ],
+      "metadata": {
+        "id": "DjYECuX84vbN"
+      }
+    },
{
"cell_type": "code",
"source": [
@@ -66,6 +75,15 @@
"execution_count": null,
"outputs": []
},
+    {
+      "cell_type": "markdown",
+      "source": [
+        "AugmentedModelSuggester can suggest the pairwise relationship given two variables. If a relevant causal pair is found in CauseNet, the LLM is augmented with the aforementioned information in CauseNet. If not found, by default, the LLM will rely on its own knowledge."
+      ],
+      "metadata": {
+        "id": "dES0LwHV57eX"
+      }
+    },
{
"cell_type": "code",
"source": [
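The notebook cells above describe the intended workflow. A minimal usage sketch follows, assuming an OpenAI chat model via langchain-openai (which this PR drops as a project dependency, so it would need to be installed separately); any LangChain-compatible chat model and any variable pair should work the same way, and the model name and variables below are illustrative:

from langchain_openai import ChatOpenAI
from pywhyllm.suggesters.augmented_model_suggester import AugmentedModelSuggester

llm = ChatOpenAI(model="gpt-4")  # assumed model; requires OPENAI_API_KEY to be set
suggester = AugmentedModelSuggester(llm)  # downloads CauseNet on first run

# Returns [cause, effect, description]; cause and effect are None when the
# LLM answers that neither variable causes the other.
result = suggester.suggest_pairwise_relationship("smoking", "lung cancer")
print(result)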
19 changes: 1 addition & 18 deletions poetry.lock

Some generated files are not rendered by default.

1 change: 0 additions & 1 deletion pyproject.toml
@@ -60,7 +60,6 @@ langchain-chroma = ">=0.2.4"
langchain-community = ">=0.3.24"
langchain-core = ">=0.3.60"
langchain-huggingface = ">=0.2.0"
-langchain-openai = ">=0.3.17"
rank-bm25 = ">=0.2.2"
sentence-transformers = ">=4.1.0"

36 changes: 33 additions & 3 deletions pywhyllm/suggesters/augmented_model_suggester.py
@@ -2,12 +2,29 @@
import re

from .simple_model_suggester import SimpleModelSuggester
-from pywhyllm.utils.data_loader import *
+from pywhyllm.utils.data_loader import download_causenet, load_causenet_json, create_causenet_dict
from pywhyllm.utils.augmented_model_suggester_utils import *


class AugmentedModelSuggester(SimpleModelSuggester):
"""
A class that extends SimpleModelSuggester and currently provides methods for suggesting causal relationships between variables by leveraging the CauseNet dataset for Retrieval Augmented Generation (RAG).

Methods:
- suggest_pairwise_relationship(variable1: str, variable2: str) -> List[str]:
Suggests the causal relationship between two variables and returns a list containing the cause, effect, and a description of the relationship.
"""

def __init__(self, llm, file_path: str = 'data/causenet-precision.jsonl.bz2'):
"""
Initialize the AugmentedModelSuggester with a language model and download CauseNet data.

Args:
llm: The language model instance to be used for querying.
file_path (str, optional): Path to save the downloaded CauseNet JSONL file.
Defaults to 'data/causenet-precision.jsonl.bz2'.
"""

super().__init__(llm)
self.file_path = file_path

@@ -23,13 +40,26 @@ def __init__(self, file_path: str = 'data/causenet-precision.jsonl.bz2'):
print("Download failed")

def suggest_pairwise_relationship(self, variable1: str, variable2: str):
"""
Suggests a cause-and-effect relationship between two variables, leveraging the CauseNet dataset for Retrieval Augmented Generation (RAG).
If a relevant causal pair is found in CauseNet, the LLM is augmented with corresponding information regarding the relationship stored
in CauseNet. If not found, by default, the LLM will rely on its own knowledge.

Args:
variable1 (str): The name of the first variable.
variable2 (str): The name of the second variable.

Returns:
list: A list containing the suggested cause variable, the suggested effect variable, and a description of the reasoning behind the suggestion. If there is no relationship between the two variables, the first two elements will be None.
"""

result = find_top_match_in_causenet(self.causenet_dict, variable1, variable2)
if result:
source_text = get_source_text(result)
retriever = split_data_and_create_vectorstore_retriever(source_text)
-        response = query_llm(variable1, variable2, source_text, retriever)
+        response = query_llm(self.llm, variable1, variable2, source_text, retriever)
else:
-        response = query_llm(variable1, variable2)
+        response = query_llm(self.llm, variable1, variable2)

answer = re.findall(r'<answer>(.*?)</answer>', response)
answer = [ans.strip() for ans in answer]
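The remainder of suggest_pairwise_relationship, which maps the extracted letter to a return value, is collapsed in this diff. A hypothetical sketch of that step, assuming the A/B/C semantics from the prompt (an illustration, not the PR's actual code):

import re

response = "Reasoning text from the LLM... <answer>A</answer>"  # example output
answer = re.findall(r'<answer>(.*?)</answer>', response)
answer = [ans.strip() for ans in answer]

variable1, variable2 = "smoking", "lung cancer"  # hypothetical inputs
if answer and answer[0] == "A":
    result = [variable1, variable2, response]  # variable1 causes variable2
elif answer and answer[0] == "B":
    result = [variable2, variable1, response]  # variable2 causes variable1
else:
    result = [None, None, response]            # no causal link either way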
31 changes: 4 additions & 27 deletions pywhyllm/utils/augmented_model_suggester_utils.py
@@ -3,7 +3,6 @@
from langchain_core.documents import Document
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
-from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
@@ -13,49 +12,39 @@


def find_top_match_in_causenet(causenet_dict, variable1, variable2, threshold=0.7):
-    # Sample dictionary
pair_strings = [
f"{causenet_dict[key]['causal_relation']['cause']}-{causenet_dict[key]['causal_relation']['effect']}"
for key in causenet_dict]

-    # Tokenize for BM25
tokenized_pairs = [text.split() for text in pair_strings]
bm25 = BM25Okapi(tokenized_pairs)

-    # Original and reverse queries
query = variable1 + "-" + variable2
reverse_query = variable2 + "-" + variable1
tokenized_query = query.split()
tokenized_reverse_query = reverse_query.split()

-    # Combine tokens from both queries (remove duplicates)
combined_query = list(set(tokenized_query + tokenized_reverse_query))

-    # Get top-k candidates using BM25 with combined query
k = 5
scores = bm25.get_scores(combined_query)
top_k_indices = np.argsort(scores)[::-1][:k]
candidate_pairs = [pair_strings[i] for i in top_k_indices]

-    # Apply SBERT to candidates
model = SentenceTransformer('all-MiniLM-L6-v2')
query_embedding = model.encode(query, convert_to_tensor=True)
reverse_query_embedding = model.encode(reverse_query, convert_to_tensor=True)
candidate_embeddings = model.encode(candidate_pairs, convert_to_tensor=True)

-    # Compute similarities for both original and reverse queries
similarities = util.cos_sim(query_embedding, candidate_embeddings).flatten()
reverse_similarities = util.cos_sim(reverse_query_embedding, candidate_embeddings).flatten()

-    # Take the maximum similarity for each candidate (original or reverse)
max_similarities = np.maximum(similarities, reverse_similarities)

-    # Get the top match and its similarity score
top_idx = np.argmax(max_similarities)
top_similarity = max_similarities[top_idx]
top_pair = candidate_pairs[top_idx]

-    # Check if the top similarity meets the threshold
if top_similarity >= threshold:
print(f"Best match: {top_pair} (Similarity: {top_similarity:.4f})")
return causenet_dict[top_pair]
@@ -77,36 +66,29 @@ def get_source_text(causenet_query_result):
def split_data_and_create_vectorstore_retriever(source_text):
document = Document(page_content=source_text)

-    # Initialize the text splitter
    text_splitter = RecursiveCharacterTextSplitter(
-        chunk_size=100,  # Adjust chunk size as needed
-        chunk_overlap=20  # Overlap for context
+        chunk_size=100,
+        chunk_overlap=20
    )
-    # Split the documents
    splits = text_splitter.split_documents([document])

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

-    # Create a vector store from the document splits
vectorstore = Chroma.from_documents(
documents=splits,
embedding=embeddings,
persist_directory="./chroma_db" # Optional: Save to disk for reuse
)

-    # Create a retriever from the vector store
retriever = vectorstore.as_retriever(
search_type="similarity",
search_kwargs={"k": 5} # Retrieve top 5 relevant chunks
search_kwargs={"k": 5}
)

return retriever


-def query_llm(variable1, variable2, source_text=None, retriever=None):
-    # Initialize the language model
-    llm = ChatOpenAI(model="gpt-4")
-
+def query_llm(llm, variable1, variable2, source_text=None, retriever=None):
if source_text:
system_prompt = """You are a helpful assistant for causal reasoning.

@@ -116,7 +98,6 @@ def query_llm(variable1, variable2, source_text=None, retriever=None):
system_prompt = """You are a helpful assistant for causal reasoning.
"""

-    # prompt template
prompt = ChatPromptTemplate.from_messages([
("system", system_prompt),
("human", "{input}")
@@ -125,12 +106,8 @@
query = f"""Which cause-and-effect-relationship is more likely? Provide reasoning and you must give your final answer (A, B, or C) in <answer> </answer> tags with the letter only.
A. {variable1} causes {variable2} B. {variable2} causes {variable1} C. neither {variable1} nor {variable2} cause each other."""

-    # Define the system prompt
if source_text:
-        # Create a document chain to combine retrieved documents
question_answer_chain = create_stuff_documents_chain(llm, prompt)

-        # Create the RAG chain
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

response = rag_chain.invoke({"input": query})
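find_top_match_in_causenet combines lexical and semantic matching: BM25 cheaply shortlists candidate cause-effect strings, then SBERT embeddings re-rank the shortlist (the function also checks the reversed query so pair order does not matter). A self-contained sketch of the same two-stage pattern on a toy corpus — the strings, shortlist size, and threshold below are illustrative, not the PR's data:

import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

# Toy stand-in for the CauseNet "cause-effect" pair strings
corpus = ["smoking-lung cancer", "rain-wet ground", "exercise-weight loss"]
bm25 = BM25Okapi([text.split() for text in corpus])

query = "smoking-cancer"
scores = bm25.get_scores(query.split())
shortlist = [corpus[i] for i in np.argsort(scores)[::-1][:2]]  # BM25 stage

model = SentenceTransformer('all-MiniLM-L6-v2')                # SBERT stage
sims = util.cos_sim(model.encode(query, convert_to_tensor=True),
                    model.encode(shortlist, convert_to_tensor=True)).flatten()
best_idx = int(sims.argmax())
if float(sims[best_idx]) >= 0.7:  # illustrative similarity threshold
    print("Best match:", shortlist[best_idx])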
19 changes: 7 additions & 12 deletions pywhyllm/utils/data_loader.py
@@ -20,7 +20,9 @@ def download_causenet(url: str, file_path: str) -> bool:
International Conference on Information &amp; Knowledge Management (CIKM '20). Association for
Computing Machinery, New York, NY, USA, 3023–3030. https://doi.org/10.1145/3340531.3412763

-    TODO: Add license
+    License:
+        CauseNet data is licensed under the Creative Commons Attribution (CC BY) license.
+        For full license details, see: https://creativecommons.org/licenses/by/4.0/

Args:
url (str): The URL of the file to download.
Expand All @@ -30,21 +32,16 @@ def download_causenet(url: str, file_path: str) -> bool:
bool: True if the download was successful, False otherwise.
"""
try:
-        # Ensure the output directory exists
os.makedirs(os.path.dirname(file_path), exist_ok=True)

-        # Send a GET request to the URL
response = requests.get(url, stream=True)

-        # Check if the request was successful
if response.status_code != 200:
logging.error(f"Failed to download file from {url}. Status code: {response.status_code}")
return False

-        # Get the total file size for progress bar (if available)
total_size = int(response.headers.get("content-length", 0))

-        # Download and save the file with a progress bar
with open(file_path, "wb") as file, tqdm(
desc="Downloading",
total=total_size,
@@ -73,12 +70,11 @@
print("Loading CauseNet using json")
with bz2.open(file_path, 'rt',
encoding='utf-8') as file:
-            # Read each line and parse as JSON
            for line in file:
-                line = line.strip()  # Remove trailing newlines
-                if line:  # Skip empty lines
-                    json_obj = json.loads(line)  # Parse the line as JSON
-                    json_data.append(json_obj)  # Add to list
+                line = line.strip()
+                if line:
+                    json_obj = json.loads(line)
+                    json_data.append(json_obj)
print("Done loading CauseNet using json")
return json_data

@@ -97,7 +93,6 @@
'sources': item['sources']
}
else:
-                # Append sources to existing list
causenet_dict[key]['sources'].extend(item['sources'])
print("Done creating dictionary from CauseNet json data")
return causenet_dict
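The body of download_causenet (partially collapsed above) follows the standard requests-plus-tqdm streaming pattern: stream the response in chunks and advance a progress bar sized from the Content-Length header. A generic sketch of that pattern under the same signature — the chunk size is an assumption, and the collapsed code may differ in detail:

import os
import requests
from tqdm import tqdm

def download_file(url: str, file_path: str) -> bool:
    os.makedirs(os.path.dirname(file_path), exist_ok=True)
    response = requests.get(url, stream=True)
    if response.status_code != 200:
        return False
    total_size = int(response.headers.get("content-length", 0))
    with open(file_path, "wb") as file, tqdm(
        desc="Downloading", total=total_size, unit="B", unit_scale=True
    ) as bar:
        for chunk in response.iter_content(chunk_size=8192):  # stream in chunks
            file.write(chunk)
            bar.update(len(chunk))
    return True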