Promptwright is a Python library from [Stacklok](https://stacklok.com) designed
for generating large synthetic datasets using a local LLM. The library offers a
flexible and easy-to-use set of interfaces, enabling users to generate
prompt-led synthetic datasets.

Promptwright was inspired by [redotvideo/pluto](https://github.com/redotvideo/pluto);
in fact, it started as a fork but ended up largely being a rewrite, to allow
dataset generation against a local LLM model.

The library interfaces with Ollama, making it easy to just pull a model and run
Promptwright, but other providers could be used as long as they provide a
compatible API (we are happy to help expand the library to support other
providers; just open an issue).

## Features

- **Local LLM Client Integration**: Interact with Ollama-based models
- **Configurable Instructions and Prompts**: Define custom instructions and system prompts
- **YAML Configuration**: Define your generation tasks using YAML configuration files
- **Command Line Interface**: Run generation tasks directly from the command line
- **Push to Hugging Face**: Push the generated dataset to Hugging Face Hub with automatic dataset cards and tags

## Getting Started

### Prerequisites

- Python 3.11+
- Poetry (for dependency management)
- Ollama CLI installed and running (see [Ollama Installation](https://ollama.com/))
- A model pulled via Ollama (see [Model Compatibility](#model-compatibility))
- (Optional) Hugging Face account and API token for dataset upload

### Installation

To install the prerequisites, you can use the following commands:

```bash
# Install Poetry if you haven't already
curl -sSL https://install.python-poetry.org | python3 -

# Install promptwright and its dependencies
git clone https://github.com/StacklokLabs/promptwright.git
cd promptwright
poetry install

# Start Ollama service
ollama serve

# Pull your desired model
ollama pull {model_name} # whichever model you want to use
```
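
Before kicking off a long generation run, it can help to confirm that the Ollama
service is up and that the model you pulled is actually available. A small,
self-contained sketch (not part of Promptwright) that queries Ollama's standard
`/api/tags` endpoint on its default port:

```python
import json
import urllib.error
import urllib.request


def ollama_models(host="http://localhost:11434"):
    """Return the model names a local Ollama server reports via its
    /api/tags endpoint, or None if the server is not reachable."""
    try:
        with urllib.request.urlopen(f"{host}/api/tags", timeout=3) as resp:
            data = json.load(resp)
    except (urllib.error.URLError, OSError):
        return None
    return [m["name"] for m in data.get("models", [])]


models = ollama_models()
if models is None:
    print("Ollama is not running -- start it with `ollama serve`")
else:
    print("Available models:", models)
```

If the model you plan to set as `model_name` is missing from the list, pull it
first with `ollama pull`.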

### Usage

Promptwright offers two ways to define and run your generation tasks:

#### 1. Using YAML Configuration (Recommended)

Create a YAML file defining your generation task:

```yaml
system_prompt: "You are a helpful assistant. You provide clear and concise answers to user questions."

topic_tree:
  args:
    root_prompt: "Capital Cities of the World."
    model_system_prompt: "<system_prompt_placeholder>"
    tree_degree: 3
    tree_depth: 2
    temperature: 0.7
    model_name: "ollama/mistral:latest"
  save_as: "basic_prompt_topictree.jsonl"

data_engine:
  args:
    instructions: "Please provide training examples with questions about capital cities."
    system_prompt: "<system_prompt_placeholder>"
    model_name: "ollama/mistral:latest"
    temperature: 0.9
    max_retries: 2

dataset:
  creation:
    num_steps: 5
    batch_size: 1
    model_name: "ollama/mistral:latest"
  save_as: "basic_prompt_dataset.jsonl"

# Optional Hugging Face Hub configuration
huggingface:
  # Repository in format "username/dataset-name"
  repository: "your-username/your-dataset-name"
  # Token can also be provided via HF_TOKEN environment variable or --hf-token CLI option
  token: "your-hf-token"
  # Additional tags for the dataset (optional)
  # "promptwright" and "synthetic" tags are added automatically
  tags:
    - "promptwright-generated-dataset"
    - "geography"
```
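
Since the YAML file drives the entire run, a quick sanity check before launching
can save time. A minimal sketch using PyYAML; the required section names mirror
the example above, and the check itself is an illustration, not part of
Promptwright's CLI:

```python
import yaml  # PyYAML

# Top-level sections used in the example configuration above.
REQUIRED_SECTIONS = ("system_prompt", "topic_tree", "data_engine", "dataset")


def check_config(path):
    """Load a generation-task YAML file and verify that the top-level
    sections used in the example above are all present."""
    with open(path, encoding="utf-8") as f:
        cfg = yaml.safe_load(f)
    missing = [name for name in REQUIRED_SECTIONS if name not in cfg]
    if missing:
        raise ValueError(f"config is missing sections: {missing}")
    return cfg
```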

Run using the CLI:

```bash
promptwright start config.yaml
```

The CLI supports various options to override configuration values:

```bash
promptwright start config.yaml \
  --topic-tree-save-as output_tree.jsonl \
  --dataset-save-as output_dataset.jsonl \
  --model-name ollama/llama3 \
  --temperature 0.8 \
  --tree-degree 4 \
  --tree-depth 3 \
  --num-steps 10 \
  --batch-size 2 \
  --hf-repo username/dataset-name \
  --hf-token your-token \
  --hf-tags tag1 --hf-tags tag2
```

#### Hugging Face Hub Integration

Promptwright supports automatic dataset upload to the Hugging Face Hub with the following features:

1. **Dataset Upload**: Upload your generated dataset directly to the Hugging Face Hub
2. **Dataset Cards**: Automatically creates and updates dataset cards
3. **Automatic Tags**: Adds "promptwright" and "synthetic" tags automatically
4. **Custom Tags**: Support for additional custom tags
5. **Flexible Authentication**: The HF token can be provided via:
   - CLI option: `--hf-token your-token`
   - Environment variable: `export HF_TOKEN=your-token`
   - YAML configuration: `huggingface.token`

Example using an environment variable:

```bash
export HF_TOKEN=your-token
promptwright start config.yaml --hf-repo username/dataset-name
```

Or pass the token as a CLI option:

```bash
promptwright start config.yaml --hf-repo username/dataset-name --hf-token your-token
```

#### 2. Using Python Code

You can also create generation tasks programmatically using Python code. There
are several examples in the `examples` directory that demonstrate this approach.

Example Python usage:

```python
from promptwright import DataEngine, EngineArguments, TopicTree, TopicTreeArguments

system_prompt = "You are a creative writing instructor providing writing prompts and example responses."

tree = TopicTree(
    args=TopicTreeArguments(
        root_prompt="Creative Writing Prompts",
        model_system_prompt=system_prompt,
        tree_degree=5,
        tree_depth=4,
        temperature=0.9,
        model_name="ollama/llama3",
    )
)

engine = DataEngine(
    args=EngineArguments(
        instructions="Generate creative writing prompts and example responses.",
        system_prompt=system_prompt,
        model_name="ollama/llama3",
        temperature=0.9,
        max_retries=2,
    )
)
```
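
Whichever route you use, the output is a JSON Lines file: one JSON object per
line. A self-contained sketch for loading such a file back for inspection (the
chat-style `messages` layout mentioned in the comment is an assumption about the
record shape, not a documented schema):

```python
import json


def load_jsonl(path):
    """Read a JSON Lines file (one JSON object per line) into a list,
    skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Records are often chat-style training samples, e.g.
# {"messages": [{"role": "system", ...}, {"role": "user", ...}]}
```

This is handy for spot-checking a few samples before pushing the dataset to the
Hugging Face Hub.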

### Development

The project uses Poetry for dependency management. Here are some common development commands:

```bash
# Install dependencies including development dependencies
make install

# Format code
make format

# Run linting
make lint

# Run tests
make test

# Run security checks
make security

# Build the package
make build

# Run all checks and build
make all
```

### Prompt Output Examples

…following models so far:

- **Mistral**
- **LLaMA3**
- **Qwen2.5**

## Unpredictable Behavior