Releases: Diyago/Tabular-data-generation
v3.2.0 — BayesianGenerator
What's New
BayesianGenerator (Gaussian Copula)
- New generator using Gaussian Copula for fast, lightweight synthetic data generation
- No neural network training required — works out of the box
- Added examples to Colab notebook
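For readers unfamiliar with the technique, here is a minimal sketch of Gaussian copula sampling for numeric columns. It is not BayesianGenerator's actual code: the function name `gaussian_copula_sample` and its parameters are hypothetical, and it assumes a purely numeric array with at least two columns. It shows why no training loop is needed: fitting amounts to ranking each column and estimating one correlation matrix.

```python
import numpy as np
from scipy import stats


def gaussian_copula_sample(data: np.ndarray, n_samples: int, seed: int = 0) -> np.ndarray:
    """Sample synthetic rows from a Gaussian copula fitted to numeric `data`.

    Hypothetical helper for illustration; not part of tabgan.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = data.shape

    # 1. Map each column to uniform scores via its ranks (empirical CDF),
    #    then to standard-normal scores via the inverse normal CDF.
    ranks = stats.rankdata(data, axis=0)   # ranks in 1..n_rows
    uniform = ranks / (n_rows + 1)         # strictly inside (0, 1)
    normal_scores = stats.norm.ppf(uniform)

    # 2. Capture the dependence between columns as a correlation matrix.
    corr = np.corrcoef(normal_scores, rowvar=False)

    # 3. Draw correlated normal samples and map them back to uniforms.
    z = rng.multivariate_normal(np.zeros(n_cols), corr, size=n_samples)
    u = stats.norm.cdf(z)

    # 4. Invert each marginal with the empirical quantile function,
    #    so every synthetic column keeps the original value range.
    return np.column_stack(
        [np.quantile(data[:, j], u[:, j]) for j in range(n_cols)]
    )
```

Categorical columns would need an encoding step first, which this sketch leaves out.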
AutoSynth & HuggingFace Hub Integration (v3.1.0)
- AutoSynth: automatically selects the best generator for your dataset
- HuggingFace Hub: push/pull synthetic datasets directly
- Blog post with benchmarks and speed comparisons
Other Improvements
- Execution timing for all generators and quality reports
- HuggingFace Space demo (Gradio app)
- Fixed HF Space: disabled SSR, made heavy deps optional
- Updated PyPI description
Full Changelog: v3.0.2...v3.2.0
TabGAN v3.0.1
What's New
Quality Report (HTML)
Generate self-contained HTML reports comparing original and synthetic data — column statistics, PSI per column, correlation heatmaps, distribution plots, and ML utility scores (TSTR vs TRTR).
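To make the per-column PSI figure concrete, here is a minimal sketch of the Population Stability Index for one numeric column. The helper name `psi`, the quantile binning, and the `eps` floor are assumptions for illustration; the release notes do not say how QualityReport bins values, so its numbers may differ.

```python
import numpy as np


def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a reference and a comparison sample.

    Hypothetical helper; assumes a non-constant numeric column.
    """
    # Bin edges come from quantiles of the reference (original) sample.
    edges = np.unique(np.quantile(expected, np.linspace(0, 1, bins + 1)))

    # Clip so values outside the reference range fall into the outer bins.
    actual = np.clip(actual, edges[0], edges[-1])

    e_counts, _ = np.histogram(expected, bins=edges)
    a_counts, _ = np.histogram(actual, bins=edges)

    # Floor the fractions at eps so empty bins do not produce log(0).
    e_frac = np.maximum(e_counts / e_counts.sum(), eps)
    a_frac = np.maximum(a_counts / a_counts.sum(), eps)

    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))
```

A common rule of thumb reads PSI below 0.1 as stable and above 0.25 as a substantial shift, though the thresholds are conventions rather than guarantees.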
from tabgan import QualityReport
report = QualityReport(original_df, synthetic_df, target_col="target").compute()
report.to_html("report.html")
Constraints System
Enforce business rules on generated data with 4 constraint types: RangeConstraint, UniqueConstraint, FormulaConstraint, RegexConstraint. Integrated directly into generate_data_pipe().
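As an illustration of what the four rule types mean, the sketch below checks them with plain pandas by filtering out violating rows. This is an assumption about semantics only: the column names, the helper `keep_valid_rows`, and the filter-based approach are hypothetical, and tabgan may enforce constraints differently (for example by clipping or resampling).

```python
import pandas as pd


def keep_valid_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Keep rows satisfying range, uniqueness, regex, and formula rules.

    Hypothetical helper with hypothetical column names; not tabgan's API.
    """
    # Range rule: age must lie in [0, 120].
    in_range = df["age"].between(0, 120)

    # Uniqueness rule: keep only the first occurrence of each id.
    unique_id = ~df["id"].duplicated()

    # Regex rule: zip must be exactly five digits.
    zip_ok = df["zip"].astype(str).str.fullmatch(r"\d{5}")

    # Formula rule: total must equal price * qty (within float tolerance).
    formula_ok = (df["total"] - df["price"] * df["qty"]).abs() < 1e-9

    return df[in_range & unique_id & zip_ok & formula_ok]
```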
from tabgan import GANGenerator, RangeConstraint
new_train, _ = GANGenerator().generate_data_pipe(
train, target, test,
constraints=[RangeConstraint("age", min_val=0, max_val=120)]
)
Privacy Metrics
Assess re-identification risk with DCR (Distance to Closest Record), NNDR (Nearest Neighbor Distance Ratio), and membership inference risk. Returns an overall privacy score 0–1.
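The two distance-based metrics can be sketched as follows. This is an illustrative version, not PrivacyMetrics itself: the helper `dcr_and_nndr` is hypothetical, it assumes numeric features that are already scaled, and it uses Euclidean distance. The release notes do not state the metric, the scaling, or how the values are combined into the overall score.

```python
import numpy as np


def dcr_and_nndr(original: np.ndarray, synthetic: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Per-synthetic-row DCR and NNDR against the original rows.

    Hypothetical helper; needs at least two original rows.
    """
    # Pairwise Euclidean distances, shape (n_synthetic, n_original).
    diffs = synthetic[:, None, :] - original[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))

    # Smallest and second-smallest distance for every synthetic row.
    nearest_two = np.partition(dists, 1, axis=1)[:, :2]

    # DCR: distance to the closest original record.
    dcr = nearest_two[:, 0]

    # NNDR: closest distance divided by second-closest (guarded against 0).
    nndr = dcr / np.maximum(nearest_two[:, 1], 1e-12)
    return dcr, nndr
```

A DCR of 0 means a synthetic row copies an original record exactly. NNDR values near 0 mean the row sits much closer to one original record than to any other, which suggests higher re-identification risk.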
from tabgan import PrivacyMetrics
pm = PrivacyMetrics(original_df, synthetic_df).summary()
print(pm["overall_privacy_score"])
sklearn Pipeline Integration
TabGANTransformer — drop-in sklearn transformer for data augmentation inside Pipeline. Supports get_params/set_params, constraints, and all generator types.
from sklearn.pipeline import Pipeline
from tabgan import TabGANTransformer
pipe = Pipeline([("augment", TabGANTransformer(gen_x_times=1.5)), ("model", clf)])
Improvements
- Refactored codebase: fixed mutable defaults, nested test classes, Warning() bug, make_two_digit() bug, deprecated pkg_resources
- DRY generator factories via _BaseGenerator base class
- Professional README with centered badges, pipeline diagram, CLI docs, new feature documentation
- Python version classifiers added, python_requires updated to >= 3.9
- Test coverage expanded: 39 → 115 tests
Dependencies
- Added matplotlib>=3.5, requests
Full Changelog: v2.6.0...v3.0.1
Version 2.0.0 Release Notes with ForestDiffusion
This release introduces a generator called ForestDiffusion from the paper "Generating and Imputing Tabular Data via Diffusion and Flow-based XGBoost Models".
Installation: pip install tabgan
Data generation: ForestDiffusionGenerator().generate_data_pipe(train, target, test)
1.2.0
Full Changelog: Diyago/GAN-for-tabular-data@1.0.1...1.2.0
Updates:
- Added time-series data generation (TimeGAN)
- Fixed version requirements for Colab usage. #22 #23 #24
- Made generated data more robust #12
- Fixed high memory consumption #14
- Added logs #15
Install: pip install tabgan or pip install tabgan==1.2.0