PG Datagen - PostgreSQL Random Data Generator

A schema-driven Python random data generator for PostgreSQL.

⚠️ PG Datagen is currently very much alpha, any definition may change. Consider it having a non-stable interface.

ℹ️ Name suggestions welcome 😊

The purpose of this tool is to provide a random data generator for each table in an (annotated) PostgreSQL schema. For example, when loading the schema:

CREATE TABLE a(
  id CHAR(32) PRIMARY KEY, -- gen: md5
  value TEXT
);

The tool would automatically generate (and ingest) values for the columns id and value in table a. Dependencies between tables are supported as well (see this example).

Current Limitations

Annotations cannot be mixed
Each table has to be defined in a single file
IDs based on INT/BIGINT can have collisions
No automated tests

Installation

Clone this repository
Install dependencies: pip3 install -r requirements.txt

Example Usage

We'll be using ./examples/simple for a demonstration:

Create a database and load examples/simple/{a,b}.sql

Run the generator:

./generator.py \
  --dsn postgresql://postgres@localhost/pydatagen \
  --batch-size 100 \
  --rows 1000 \
  --truncate \
  --target examples/simple/two-tables.py

Details

Python Control File

You will have to provide a Python file as entrypoint. The minimum definition of such file for two tables a and b would look like this:

from lib.base_object import Table

TABLES = {
  'public.a': Table(schema_path='...', scaler=1),
  'public.b': Table(schema_path='...', scaler=2)
}

GRAPH = {
  'public.a': [],
  'public.b': []
}

ENTRYPOINT = 'public.a'

Here, a and b are independent and b has twice as many rows as a. That is, if you declare --rows=100, then 100 rows will be generated for a and 200 rows will be generated for b.

Supported Annotations

Currently, there can be only one annotation per column. An annotation must be placed on the same line as the column. Two examples below. This will work:

CREATE TABLE a(
    foo BIGINT -- none_prob: 0.9
  , bar BIGINT
)

This won't work:

CREATE TABLE a(
    foo BIGINT
    -- none_prob: 0.9
  , bar BIGINT
)

Annotation	Description
none_prob: <0.0..1.0>	Sets probability of generating a `NULL` value (if allowed)
gen:	Hardcodes the generator to use. Methods in ./lib/random.py are supported (not all!). For example: `-- gen: md5` would use the `md5` method. There is a special generator `choose_from_list` to inject dependencies (see below).

Inject Dependencies

If you have tables a and b and they are defined like:

-- a.sql
CREATE TABLE a(
    id CHAR(32) PRIMARY KEY -- gen: md5
  , value BIGINT
);
    
-- b.sql
CREATE TABLE b(
    id CHAR(32) PRIMARY KEY -- gen: md5
  , id_a CHAR(32) NOT NULL -- gen: choose_from_list public.a.id
  , value BIGINT
)

Then, the data generator will take random values from a and use them for the id_a column of table b. Always supply a complete "path", that is, the path must be of format: <schema>.<table>.<column>. Use public as <schema> if you do not use custom schemas.

Modify the Python target file to ensure, that a is generated before b:

GRAPH = {
  'public.a': ['public.b'],
  'public.b': []
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PG Datagen - PostgreSQL Random Data Generator

Current Limitations

Installation

Example Usage

Details

Python Control File

Supported Annotations

Inject Dependencies

About

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 49 Commits
.github/workflows		.github/workflows
examples		examples
lib		lib
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
generator.py		generator.py
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

PG Datagen - PostgreSQL Random Data Generator

Current Limitations

Installation

Example Usage

Details

Python Control File

Supported Annotations

Inject Dependencies

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Contributors

Uh oh!

Languages