A schema-driven Python random data generator for PostgreSQL.
ℹ️ Name suggestions welcome 😊
The purpose of this tool is to provide a random data generator for each table in an (annotated) PostgreSQL schema. For example, when loading the schema:
CREATE TABLE a(
id CHAR(32) PRIMARY KEY, -- gen: md5
value TEXT
);The tool would automatically generate (and ingest) values for the columns id
and value in table a. Dependencies between tables are supported as well (see
this example).
- Annotations cannot be mixed
- Each table has to be defined in a single file
- IDs based on
INT/BIGINTcan have collisions - No automated tests
- Clone this repository
- Install dependencies:
pip3 install -r requirements.txt
We'll be using ./examples/simple for a demonstration:
-
Create a database and load
examples/simple/{a,b}.sql -
Run the generator:
./generator.py \ --dsn postgresql://postgres@localhost/pydatagen \ --batch-size 100 \ --rows 1000 \ --truncate \ --target examples/simple/two-tables.py
You will have to provide a Python file as entrypoint. The minimum definition
of such file for two tables a and b would look like this:
from lib.base_object import Table
TABLES = {
'public.a': Table(schema_path='...', scaler=1),
'public.b': Table(schema_path='...', scaler=2)
}
GRAPH = {
'public.a': [],
'public.b': []
}
ENTRYPOINT = 'public.a'Here, a and b are independent and b has twice as many rows as a. That
is, if you declare --rows=100, then 100 rows will be generated for a and
200 rows will be generated for b.
Currently, there can be only one annotation per column. An annotation must be placed on the same line as the column. Two examples below. This will work:
CREATE TABLE a(
foo BIGINT -- none_prob: 0.9
, bar BIGINT
)This won't work:
CREATE TABLE a(
foo BIGINT
-- none_prob: 0.9
, bar BIGINT
)| Annotation | Description |
|---|---|
| none_prob: <0.0..1.0> | Sets probability of generating a NULL value (if allowed) |
| gen: | Hardcodes the generator to use. Methods in ./lib/random.py are supported (not all!). For example: -- gen: md5 would use the md5 method. There is a special generator choose_from_list to inject dependencies (see below). |
If you have tables a and b and they are defined like:
-- a.sql
CREATE TABLE a(
id CHAR(32) PRIMARY KEY -- gen: md5
, value BIGINT
);
-- b.sql
CREATE TABLE b(
id CHAR(32) PRIMARY KEY -- gen: md5
, id_a CHAR(32) NOT NULL -- gen: choose_from_list public.a.id
, value BIGINT
)Then, the data generator will take random values from a and use them for
the id_a column of table b. Always supply a complete "path",
that is, the path must be of format: <schema>.<table>.<column>. Use public
as <schema> if you do not use custom schemas.
Modify the Python target file to ensure, that a is generated before b:
GRAPH = {
'public.a': ['public.b'],
'public.b': []
}