Releases: Diyago/Tabular-data-generation

v3.2.0 — BayesianGenerator

29 Mar 15:12

What's New

BayesianGenerator (Gaussian Copula)

  • New generator using Gaussian Copula for fast, lightweight synthetic data generation
  • No neural network training required — works out of the box
  • Added examples to Colab notebook
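These notes do not describe the generator's internals. As a rough illustration of the general Gaussian copula idea, and not TabGAN's actual implementation, the standard-library sketch below handles two numeric columns: it maps each column to normal scores, measures their correlation, draws correlated normals, and maps them back through each column's empirical quantiles. All function names here are hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def normal_scores(column):
    """Rank-transform a column to (0, 1), then map it to standard normal scores."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores

def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def empirical_quantile(sorted_column, u):
    """Look up the value at quantile u in an already sorted column."""
    index = min(int(u * len(sorted_column)), len(sorted_column) - 1)
    return sorted_column[index]

def sample_bivariate_copula(col_a, col_b, n_samples, seed=0):
    """Draw synthetic (a, b) pairs that keep the marginals and rank dependence."""
    rng = random.Random(seed)
    rho = pearson(normal_scores(col_a), normal_scores(col_b))
    sorted_a, sorted_b = sorted(col_a), sorted(col_b)
    rows = []
    for _ in range(n_samples):
        # Draw two standard normals with correlation rho.
        g1, g2 = rng.gauss(0, 1), rng.gauss(0, 1)
        z_a = g1
        z_b = rho * g1 + math.sqrt(max(0.0, 1 - rho ** 2)) * g2
        # Map each normal to a uniform, then to the column's own marginal.
        rows.append((empirical_quantile(sorted_a, STD_NORMAL.cdf(z_a)),
                     empirical_quantile(sorted_b, STD_NORMAL.cdf(z_b))))
    return rows
```

Because fitting only needs ranks and one correlation estimate, there is no iterative training step, which is consistent with the "no neural network training" point above.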

AutoSynth & HuggingFace Hub Integration (v3.1.0)

  • AutoSynth: automatically selects the best generator for your dataset
  • HuggingFace Hub: push/pull synthetic datasets directly
  • Blog post with benchmarks and speed comparisons

Other Improvements

  • Execution timing for all generators and quality reports
  • HuggingFace Space demo (Gradio app)
  • Fixed HF Space: disabled SSR, made heavy deps optional
  • Updated PyPI description

Full Changelog: v3.0.2...v3.2.0

TabGAN v3.0.1

28 Mar 05:38

What's New

Quality Report (HTML)

Generate self-contained HTML reports comparing original and synthetic data: column statistics, PSI (Population Stability Index) per column, correlation heatmaps, distribution plots, and ML utility scores (TSTR, train on synthetic and test on real, vs TRTR, train on real and test on real).

from tabgan import QualityReport
report = QualityReport(original_df, synthetic_df, target_col="target").compute()
report.to_html("report.html")
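For readers unfamiliar with PSI (Population Stability Index), the sketch below shows one common way to compute it for a single numeric column: bin both samples on the original data's range and sum (p - q) * ln(p / q) over the bins. The binning scheme, the `eps` floor, and the function name are assumptions for illustration; QualityReport may compute PSI differently.

```python
import math

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant columns

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            # Clamp values outside the original range into the edge bins.
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # Floor each share at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    p, q = bin_shares(expected), bin_shares(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```

Identical samples give a PSI of 0, and larger values indicate a bigger shift between the original and synthetic distributions.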

Constraints System

Enforce business rules on generated data with 4 constraint types: RangeConstraint, UniqueConstraint, FormulaConstraint, RegexConstraint. Integrated directly into generate_data_pipe().

from tabgan import GANGenerator, RangeConstraint
new_train, _ = GANGenerator().generate_data_pipe(
    train, target, test,
    constraints=[RangeConstraint("age", min_val=0, max_val=120)]
)
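These notes do not say whether violating rows are filtered, clipped, or regenerated. As a minimal sketch that assumes simple filtering, a range rule and a uniqueness rule could be applied to generated rows like this. The helper names and the sample rows are hypothetical and are not part of the tabgan API.

```python
def apply_range(rows, column, min_val, max_val):
    """Drop rows whose value in `column` falls outside [min_val, max_val]."""
    return [r for r in rows if min_val <= r[column] <= max_val]

def apply_unique(rows, column):
    """Keep only the first row seen for each distinct value in `column`."""
    seen, kept = set(), []
    for r in rows:
        if r[column] not in seen:
            seen.add(r[column])
            kept.append(r)
    return kept

# Hypothetical generated rows, two of which break the age rule.
generated = [
    {"id": 1, "age": 34},
    {"id": 2, "age": -5},
    {"id": 2, "age": 61},
    {"id": 3, "age": 140},
]
valid = apply_unique(apply_range(generated, "age", 0, 120), "id")
```

Filtering shrinks the output, so a real pipeline would likely need to generate extra rows to compensate.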

Privacy Metrics

Assess re-identification risk with DCR (Distance to Closest Record), NNDR (Nearest Neighbor Distance Ratio), and membership inference risk. Returns an overall privacy score between 0 and 1.

from tabgan import PrivacyMetrics
pm = PrivacyMetrics(original_df, synthetic_df).summary()
print(pm["overall_privacy_score"])
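To make the two distance metrics concrete, here is a minimal brute-force sketch, assuming numeric records and Euclidean distance. For each synthetic record, DCR is the distance to its nearest original record and NNDR is that distance divided by the distance to the second-nearest one. The function name and the handling of zero distances are assumptions; PrivacyMetrics may scale features or aggregate the values differently.

```python
import math

def dcr_and_nndr(original, synthetic):
    """Return a (DCR, NNDR) pair for every synthetic record."""
    results = []
    for s in synthetic:
        # Distances from this synthetic record to every original record.
        dists = sorted(math.dist(s, o) for o in original)
        nearest, second = dists[0], dists[1]
        # A zero second distance means the record sits on duplicated originals;
        # treat that as an exact copy rather than dividing by zero.
        nndr = nearest / second if second > 0 else 0.0
        results.append((nearest, nndr))
    return results
```

Values near zero for either metric mean a synthetic record sits very close to one specific original record, which is the situation these metrics are meant to flag.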

sklearn Pipeline Integration

TabGANTransformer — drop-in sklearn transformer for data augmentation inside Pipeline. Supports get_params/set_params, constraints, and all generator types.

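from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from tabgan import TabGANTransformer

clf = RandomForestClassifier()  # example final estimator
pipe = Pipeline([("augment", TabGANTransformer(gen_x_times=1.5)), ("model", clf)])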

Improvements

  • Refactored codebase: fixed mutable defaults, nested test classes, Warning() bug, make_two_digit() bug, deprecated pkg_resources
  • DRY generator factories via _BaseGenerator base class
  • Professional README with centered badges, pipeline diagram, CLI docs, new feature documentation
  • Python version classifiers added, python_requires updated to >= 3.9
  • Test coverage expanded: 39 → 115 tests

Dependencies

  • Added matplotlib>=3.5, requests

Full Changelog: v2.6.0...v3.0.1

Version 2.0.0 Release Notes with ForestDiffusion

30 Sep 19:35

This release introduces a generator called ForestDiffusion from the paper "Generating and Imputing Tabular Data via Diffusion and Flow-based XGBoost Models".

Installation: pip install tabgan
Data generation: ForestDiffusionGenerator().generate_data_pipe(train, target, test)

1.2.0

26 Dec 11:39

Full Changelog: Diyago/GAN-for-tabular-data@1.0.1...1.2.0

Updates:

  1. Added time-series data generation (TimeGAN)
  2. Fixed version requirements for Colab usage. #22 #23 #24
  3. Made generated data more robust #12
  4. Fixed high memory consumption #14
  5. Added logging #15

Install: pip install tabgan or pip install tabgan==1.2.0

Release to PIP

18 Feb 21:22

research

13 Jul 16:43

Pre-release
Moved experiments to the research folder.