Commit e52821b

Merge pull request #15 from StacklokLabs/yaml-prompts: Implement YAML and CLI approach

Authored by Luke Hinds · 2 parents cf8b2ae + f575d8a

30 files changed: +4353 −327 lines

.github/workflows/test.yml

Lines changed: 30 additions & 11 deletions

```diff
@@ -22,19 +22,38 @@ jobs:
         python-version: ${{ matrix.python-version }}
         cache: 'pip'
 
-    - name: Install dependencies
+    - name: Install Poetry
       run: |
         python -m pip install --upgrade pip
-        pip install -r requirements.txt
-        pip install -r requirements-dev.txt
+        curl -sSL https://install.python-poetry.org | python3 -
 
-    - name: Run tests with pytest
-      run: |
-        make test
-
-    - name: Run style checks
+    - name: Configure Poetry
       run: |
-        pip install ruff
-        ruff check .
-        ruff format --check .
+        poetry config virtualenvs.create true
+        poetry config virtualenvs.in-project true
+
+    - name: Cache Poetry virtualenv
+      uses: actions/cache@v3
+      id: cache
+      with:
+        path: ./.venv
+        key: venv-${{ runner.os }}-${{ matrix.python-version }}-${{ hashFiles('**/poetry.lock') }}
+
+    - name: Install dependencies
+      if: steps.cache.outputs.cache-hit != 'true'
+      run: poetry install --with dev
+
+    - name: Run code formatting
+      run: poetry run make format
+
+    - name: Run linting
+      run: poetry run make lint
+
+    - name: Run tests
+      run: poetry run make test
+
+    - name: Run security checks
+      run: poetry run make security
 
+    - name: Run build
+      run: poetry run make build
```

Makefile

Lines changed: 26 additions & 8 deletions

```diff
@@ -1,12 +1,30 @@
-# Makefile
-.PHONY: test test-all lint
+.PHONY: clean install format lint test security build all
 
-test:
-	pytest -v --cov=promptwright --cov-report=xml
+clean:
+	rm -rf build/
+	rm -rf dist/
+	rm -rf *.egg-info
+	rm -f .coverage
+	find . -type d -name '__pycache__' -exec rm -rf {} +
+	find . -type f -name '*.pyc' -delete
+
+install:
+	poetry install --with dev
 
-test-all:
-	pytest -v
+format:
+	poetry run black .
+	poetry run ruff check --fix .
 
 lint:
-	ruff check .
-	ruff format --check .
+	poetry run ruff check .
+
+test:
+	poetry run pytest
+
+security:
+	poetry run bandit -r promptwright/
+
+build: clean test
+	poetry build
+
+all: clean install format lint test security build
```
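For orientation, the `all` target chains the full pipeline end to end, and the CI workflow above invokes the same targets individually. A typical local usage of the new targets (a usage sketch, not part of the diff) looks like:

```bash
# Run everything the CI workflow runs:
# clean, install, format, lint, test, security scan, and build
make all

# Or iterate on individual stages while developing
make format && make lint && make test
```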

README.md

Lines changed: 180 additions & 48 deletions

````diff
@@ -5,80 +5,212 @@
 
 ![promptwright-cover](https://github.com/user-attachments/assets/5e345bda-df66-474b-90e7-f488d8f89032)
 
-Promptwright is a Python library from [Stacklok](https://stacklok.com) designed for generating large synthetic
-datasets using a local LLM. The library offers a flexible and easy-to-use set of interfaces, enabling users
-the ability to generate prompt led synthetic datasets.
+Promptwright is a Python library from [Stacklok](https://stacklok.com) designed
+for generating large synthetic datasets using a local LLM. The library offers
+a flexible, easy-to-use set of interfaces that enables users to generate
+prompt-led synthetic datasets.
 
 Promptwright was inspired by [redotvideo/pluto](https://github.com/redotvideo/pluto),
-in fact it started as fork, but ended up largley being a re-write, to allow dataset generation
-against a local LLM model.
+in fact it started as a fork, but ended up largely being a re-write, to allow
+dataset generation against a local LLM model.
 
 The library interfaces with Ollama, making it easy to just pull a model and run
-Promptwright.
+Promptwright, but other providers could be used as long as they provide a
+compatible API (happy to help expand the library to support other providers,
+just open an issue).
 
 ## Features
 
 - **Local LLM Client Integration**: Interact with Ollama based models
 - **Configurable Instructions and Prompts**: Define custom instructions and system prompts
-- **Push to Hugging Face**: Push the generated dataset to Hugging Face Hub.
+- **YAML Configuration**: Define your generation tasks using YAML configuration files
+- **Command Line Interface**: Run generation tasks directly from the command line
+- **Push to Hugging Face**: Push the generated dataset to Hugging Face Hub with automatic dataset cards and tags
 
 ## Getting Started
 
 ### Prerequisites
 
 - Python 3.11+
-- `promptwright` library installed
+- Poetry (for dependency management)
 - Ollama CLI installed and running (see [Ollama Installation](https://ollama.com/))
 - A Model pulled via Ollama (see [Model Compatibility](#model-compatibility))
+- (Optional) Hugging Face account and API token for dataset upload
 
 ### Installation
 
 To install the prerequisites, you can use the following commands:
 
 ```bash
-pip install promptwright
+# Install Poetry if you haven't already
+curl -sSL https://install.python-poetry.org | python3 -
+
+# Install promptwright and its dependencies
+git clone https://github.com/StacklokLabs/promptwright.git
+cd promptwright
+poetry install
+
+# Start Ollama service
 ollama serve
+
+# Pull your desired model
 ollama pull {model_name} # whichever model you want to use
 ```
 
-### Example Usage
-
-There are a few examples in the `examples` directory that demonstrate how to use
-the library to generate different topic based datasets.
-
-### Running an Example
-
-To run an example:
-
-1. Ensure you have started Ollama by running `ollama serve`.
-2. Verify that the required model is downloaded (e.g. `llama3.2:latest`).
-4. Set the `model_name` in the chosen example file to the model you have downloaded.
-
-```python
-
-tree = TopicTree(
-    args=TopicTreeArguments(
-        root_prompt="Creative Writing Prompts",
-        model_system_prompt=system_prompt,
-        tree_degree=5, # Increase degree for more prompts
-        tree_depth=4, # Increase depth for more prompts
-        temperature=0.9, # Higher temperature for more creative variations
-        model_name="ollama/llama3" # Set the model name here
-    )
-)
-engine = DataEngine(
-    args=EngineArguments(
-        instructions="Generate creative writing prompts and example responses.",
-        system_prompt="You are a creative writing instructor providing writing prompts and example responses.",
-        model_name="ollama/llama3",
-        temperature=0.9,
-        max_retries=2,
-```
-5. Run your chosen example file:
-```bash
-python example/creative_writing.py
-```
-6. The generated dataset will be saved to a JSONL file to whatever is set within `dataset.save()`.
+### Usage
+
+Promptwright offers two ways to define and run your generation tasks:
+
+#### 1. Using YAML Configuration (Recommended)
+
+Create a YAML file defining your generation task:
+
+```yaml
+system_prompt: "You are a helpful assistant. You provide clear and concise answers to user questions."
+
+topic_tree:
+  args:
+    root_prompt: "Capital Cities of the World."
+    model_system_prompt: "<system_prompt_placeholder>"
+    tree_degree: 3
+    tree_depth: 2
+    temperature: 0.7
+    model_name: "ollama/mistral:latest"
+  save_as: "basic_prompt_topictree.jsonl"
+
+data_engine:
+  args:
+    instructions: "Please provide training examples with questions about capital cities."
+    system_prompt: "<system_prompt_placeholder>"
+    model_name: "ollama/mistral:latest"
+    temperature: 0.9
+    max_retries: 2
+
+dataset:
+  creation:
+    num_steps: 5
+    batch_size: 1
+    model_name: "ollama/mistral:latest"
+  save_as: "basic_prompt_dataset.jsonl"
+
+# Optional Hugging Face Hub configuration
+huggingface:
+  # Repository in format "username/dataset-name"
+  repository: "your-username/your-dataset-name"
+  # Token can also be provided via HF_TOKEN environment variable or --hf-token CLI option
+  token: "your-hf-token"
+  # Additional tags for the dataset (optional)
+  # "promptwright" and "synthetic" tags are added automatically
+  tags:
+    - "promptwright-generated-dataset"
+    - "geography"
+```
+
+Run using the CLI:
+
+```bash
+promptwright start config.yaml
+```
+
+The CLI supports various options to override configuration values:
+
+```bash
+promptwright start config.yaml \
+  --topic-tree-save-as output_tree.jsonl \
+  --dataset-save-as output_dataset.jsonl \
+  --model-name ollama/llama3 \
+  --temperature 0.8 \
+  --tree-degree 4 \
+  --tree-depth 3 \
+  --num-steps 10 \
+  --batch-size 2 \
+  --hf-repo username/dataset-name \
+  --hf-token your-token \
+  --hf-tags tag1 --hf-tags tag2
+```
+
+#### Hugging Face Hub Integration
+
+Promptwright supports automatic dataset upload to the Hugging Face Hub with the following features:
+
+1. **Dataset Upload**: Upload your generated dataset directly to Hugging Face Hub
+2. **Dataset Cards**: Automatically creates and updates dataset cards
+3. **Automatic Tags**: Adds "promptwright" and "synthetic" tags automatically
+4. **Custom Tags**: Support for additional custom tags
+5. **Flexible Authentication**: HF token can be provided via:
+   - CLI option: `--hf-token your-token`
+   - Environment variable: `export HF_TOKEN=your-token`
+   - YAML configuration: `huggingface.token`
+
+Example using environment variable:
+```bash
+export HF_TOKEN=your-token
+promptwright start config.yaml --hf-repo username/dataset-name
+```
+
+Or pass it in as a CLI option:
+```bash
+promptwright start config.yaml --hf-repo username/dataset-name --hf-token your-token
+```
+
+#### 2. Using Python Code
+
+You can also create generation tasks programmatically using Python code. There
+are several examples in the `examples` directory that demonstrate this approach.
+
+Example Python usage:
+
+```python
+from promptwright import DataEngine, EngineArguments, TopicTree, TopicTreeArguments
+
+system_prompt = "You are a creative writing instructor providing writing prompts and example responses."  # defined so the example is self-contained
+
+tree = TopicTree(
+    args=TopicTreeArguments(
+        root_prompt="Creative Writing Prompts",
+        model_system_prompt=system_prompt,
+        tree_degree=5,
+        tree_depth=4,
+        temperature=0.9,
+        model_name="ollama/llama3"
+    )
+)
+
+engine = DataEngine(
+    args=EngineArguments(
+        instructions="Generate creative writing prompts and example responses.",
+        system_prompt="You are a creative writing instructor providing writing prompts and example responses.",
+        model_name="ollama/llama3",
+        temperature=0.9,
+        max_retries=2,
+    )
+)
+```
+
+### Development
+
+The project uses Poetry for dependency management. Here are some common development commands:
+
+```bash
+# Install dependencies including development dependencies
+make install
+
+# Format code
+make format
+
+# Run linting
+make lint
+
+# Run tests
+make test
+
+# Run security checks
+make security
+
+# Build the package
+make build
+
+# Run all checks and build
+make all
+```
 
 ### Prompt Output Examples
 
@@ -108,7 +240,7 @@ following models so far:
 
 - **Mistral**
 - **LLaMA3**
---**Qwen2.5**
+- **Qwen2.5**
 
 ## Unpredictable Behavior
 
````
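The Python example in the updated README above stops after constructing the `TopicTree` and the `DataEngine`. Below is a minimal sketch of the remaining generation step; the method names `build_tree()`, `tree.save()`, and `create_data()` are assumptions inferred from the YAML configuration fields (`save_as`, `num_steps`, `batch_size`), while `dataset.save()` is the one call confirmed by the old README text. Treat it as illustrative, not a documented API:

```python
# Hypothetical continuation of the README example.
# build_tree(), tree.save(), and create_data() are assumed names inferred
# from the YAML config fields; dataset.save() comes from the old README.
tree.build_tree()                              # expand the topic tree via the LLM
tree.save("creative_writing_topictree.jsonl")  # mirrors topic_tree.save_as

dataset = engine.create_data(
    num_steps=5,      # mirrors dataset.creation.num_steps in the YAML
    batch_size=1,     # mirrors dataset.creation.batch_size
    topic_tree=tree,
)
dataset.save("creative_writing_dataset.jsonl")  # JSONL output, per the old README
```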