Forest DiffusionModel RC commit

Diyago · Diyago · commit adf8c65b0696 · 2023-09-30T22:19:24.000+03:00
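
Note on the date block re-enabled in `src/tabgan/sampler.py`: it computes `d = (max_date - min_date).days + 1` and draws offsets with `np.random.randint(d, ...)`. The `+ 1` matters because the upper bound of the draw is exclusive, so without it `max_date` could never be sampled. A minimal standard-library sketch of the same arithmetic follows; `random_dates` is a hypothetical helper written for illustration, not part of tabgan, and it uses `random`/`datetime` in place of the NumPy/pandas calls in the diff.

```python
import random
from datetime import date, timedelta


def random_dates(min_date, max_date, n, seed=None):
    """Draw n dates uniformly from [min_date, max_date], both ends inclusive.

    Hypothetical helper for illustration only; it mirrors the arithmetic of
    the script in the diff but is not the library's code.
    """
    rng = random.Random(seed)
    # randrange(d) yields offsets 0..d-1, so d must count both endpoints.
    d = (max_date - min_date).days + 1
    return [min_date + timedelta(days=rng.randrange(d)) for _ in range(n)]


min_date = date(2019, 1, 1)
max_date = date(2021, 12, 31)

# Number of distinct days in the inclusive range (2020 is a leap year).
span_days = (max_date - min_date).days + 1

dates = random_dates(min_date, max_date, 1000, seed=0)
```

Every generated date stays inside the closed interval, which is what the script relies on before `get_year_mnth_dt_from_date` splits the column.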
diff --git a/README.md b/README.md
@@ -2,7 +2,7 @@
 [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black) [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
 [![Downloads](https://pepy.tech/badge/tabgan)](https://pepy.tech/project/tabgan)
 
-# GANs for tabular  data
+# GANs and Diffusions for tabular data
 
 <img src="./images/tabular_gan.png" height="15%" width="15%">
 Generative Adversarial Networks (GANs) are well-known for their success in realistic image generation. However, they can also be applied to generate tabular data. Here will give opportunity to try some of them.
@@ -29,9 +29,10 @@ test = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list("ABCD
 # generate data
 new_train1, new_target1 = OriginalGenerator().generate_data_pipe(train, target, test, )
 new_train2, new_target2 = GANGenerator().generate_data_pipe(train, target, test, )
+new_train3, new_target3 = ForestDiffusionGenerator().generate_data_pipe(train, target, test, )
 
 # example with all params defined
-new_train3, new_target3 = GANGenerator(gen_x_times=1.1, cat_cols=None,
+new_train4, new_target4 = GANGenerator(gen_x_times=1.1, cat_cols=None,
            bot_filter_quantile=0.001, top_filter_quantile=0.999, is_post_process=True,
            adversarial_model_params={
                "metrics": "AUC", "max_depth": 2, "max_bin": 100, 
@@ -41,7 +42,10 @@ new_train3, new_target3 = GANGenerator(gen_x_times=1.1, cat_cols=None,
                                           test, deep_copy=True, only_adversarial=False, use_adversarial=True)
 ```
 
-Both samplers `OriginalGenerator` and `GANGenerator` have same input parameters:
+All samplers (`OriginalGenerator`, `ForestDiffusionGenerator` and `GANGenerator`) share the same input parameters.
+
+1. **GANGenerator** is based on **CTGAN**
+2. **ForestDiffusionGenerator** is based on **Forest Diffusion**
 
 * **gen_x_times**: float = 1.1 - how much data to generate, output might be less because of postprocessing and
   adversarial filtering
@@ -132,43 +136,14 @@ To run experiment follow these steps:
    add more datasets, adjust validation type and categorical encoders.
 5. Observe metrics across all experiment in console or in `./Research/results/fit_predict_scores.txt`
 
-**Task formalization**
-
-Let say we have **T_train** and **T_test** (train and test set respectively). We need to train the model on **T_train**
-and make predictions on **T_test**. However, we will increase the train by generating new data by GAN, somehow similar
-to **T_test**, without using ground truth labels.
 
 **Experiment design**
 
-In the case of having a smaller **T_train** and a different data distribution, we can use CTGAN to generate additional data **T_synth**. First, we train CTGAN on **T_train** with ground truth labels (step 1), then generate additional data **T_synth** (step 2). Secondly, we train boosting in an adversarial way on concatenated **T_train** and **T_synth** (target set to 0) with **T_test** (target set to 1) (steps 3 & 4). The goal is to apply the newly trained adversarial boosting to obtain rows more like **T_test**. Note that initial ground truth labels aren't used for adversarial training. As a result, we take top rows from **T_train** and **T_synth** sorted by correspondence to **T_test** (steps 5 & 6), and train new boosting on them and check results on **T_test**.
-
 ![Experiment design and workflow](./images/workflow.png?raw=true)
 
 **Picture 1.1** Experiment design and workflow
 
-Of course for the benchmark purposes we will test ordinal training without these tricks and another original pipeline
-but without CTGAN (in step 3 we won"t use **T_sync**).
-
-**Datasets**
-
-All datasets came from different domains. They have a different number of observations, number of categorical and
-numerical features. The objective for all datasets - binary classification. Preprocessing of datasets were simple:
-removed all time-based columns from datasets. Remaining columns were either categorical or numerical.
-
-**Table 1.1** Used datasets
-
-| Name | Total points | Train points | Test points | Number of features | Number of categorical features | Short description |
-| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
-| [Telecom](https://www.kaggle.com/blastchar/telco-customer-churn)   | 7.0k | 4.2k |  2.8k |  20   |  16  | Churn prediction for telecom data |
-| [Adult](https://www.kaggle.com/wenruliu/adult-income-dataset)   | 48.8k | 29.3k | 19.5k  |  15  | 8 | Predict if persons" income is bigger 50k |
-| [Employee](https://www.kaggle.com/c/amazon-employee-access-challenge/data)   | 32.7k | 19.6k | 13.1k  | 10  | 9 | Predict an employee"s access needs, given his/her job role|
-| [Credit](https://www.kaggle.com/c/home-credit-default-risk/data)   | 307.5k | 184.5k | 123k  |  121  | 18 | Loan repayment |
-| [Mortgages](https://www.crowdanalytix.com/contests/propensity-to-fund-mortgages)   |  45.6k | 27.4k | 18.2k | 20 | 9 | Predict if house mortgage is founded |
-| [Taxi](https://www.crowdanalytix.com/contests/mckinsey-big-data-hackathon) | 892.5k | 535.5k | 357k | 8 | 5 | Predict the probability of an offer being accepted by a certain driver |
-| [Poverty_A](https://www.drivendata.org/competitions/50/worldbank-poverty-prediction/page/99/)   | 37.6k | 22.5k | 15.0k | 41 | 38 | Predict whether or not a given household for a given country is poor or not |
-
 ## Results
-
 To determine the best sampling strategy, ROC AUC scores of each dataset were scaled (min-max scale) and then averaged
 among the dataset.
 
@@ -224,35 +199,12 @@ arxiv publication:
       primaryClass={cs.LG}
 }
 ```
-library itself:
-```bibtex
-@misc{Diyago2020tabgan,
-    author       = {Ashrapov, Insaf},
-    title        = {GANs for tabular data},
-    howpublished = {\url{https://github.com/Diyago/GAN-for-tabular-data}},
-    year         = {2020}
-}
-```
 
 ## References
 
-[1] Jonathan Hui. GAN — What is Generative Adversarial Networks GAN? (2018), medium article
-
-[2]Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville,
-Yoshua Bengio. Generative Adversarial Networks (2014). arXiv:1406.2661
-
-[3] Lei Xu LIDS, Kalyan Veeramachaneni. Synthesizing Tabular Data using Generative Adversarial Networks (2018). arXiv:
+[1] Lei Xu LIDS, Kalyan Veeramachaneni. Synthesizing Tabular Data using Generative Adversarial Networks (2018). arXiv:
 1811.11264v1 [cs.LG]
 
-[4] Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, Kalyan Veeramachaneni. Modeling Tabular Data using Conditional
-GAN (2019). arXiv:1907.00503v2 [cs.LG]
-
-[5] Denis Vorotyntsev. Benchmarking Categorical Encoders. Medium post
-
-[6] Insaf Ashrapov. GAN-for-tabular-data. Github repository.
-
-[7] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, Timo Aila. Analyzing and Improving the
-Image Quality of StyleGAN (2019) arXiv:1912.04958v2 [cs.CV]
-
-[8]  ODS.ai: Open data science, https://ods.ai/
+[2] Alexia Jolicoeur-Martineau, Kilian Fatras, Tal Kachman. Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees (2023). https://github.com/SamsungSAILMontreal/ForestDiffusion [cs.LG]
 
+[3] Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, Kalyan Veeramachaneni. Modeling Tabular Data using Conditional GAN. NeurIPS (2019)
diff --git a/src/tabgan/sampler.py b/src/tabgan/sampler.py
@@ -307,11 +307,11 @@ def generate_data(
             self.TEMP_TARGET = None
         logging.info("Fitting ForestDiffusion model")
         if self.cat_cols is None:
-            forest_model = ForestDiffusionModel(train_df.to_numpy(), label_y=self.TEMP_TARGET, n_t=50,
+            forest_model = ForestDiffusionModel(train_df.to_numpy(), label_y=None, n_t=50,
                                                 duplicate_K=100,
                                                 diffusion_type='flow', n_jobs=-1)
         else:
-            forest_model = ForestDiffusionModel(train_df.to_numpy(), label_y=self.TEMP_TARGET, n_t=50,
+            forest_model = ForestDiffusionModel(train_df.to_numpy(), label_y=None, n_t=50,
                                                 duplicate_K=100,
                                                 # todo fix bug with cat cols
                                                 #cat_indexes=self.get_column_indexes(train_df, self.cat_cols),
@@ -393,39 +393,42 @@ def get_columns_if_exists(df, col) -> pd.DataFrame:
     logging.info(train)
     target = pd.DataFrame(np.random.randint(0, 2, size=(train_size, 1)), columns=list("Y"))
     test = pd.DataFrame(np.random.randint(0, 100, size=(train_size, 4)), columns=list("ABCD"))
-    # _sampler(OriginalGenerator(gen_x_times=15), train, target, test)
-    # _sampler(
-    #     GANGenerator(gen_x_times=10, only_generated_data=False,
-    #                  gen_params={"batch_size": 500, "patience": 25, "epochs": 500, }), train, target, test
-    # )
-    #
-    # _sampler(OriginalGenerator(gen_x_times=15), train, None, train)
-    # _sampler(
-    #     GANGenerator(cat_cols=["A"], gen_x_times=20, only_generated_data=True),
-    #     train,
-    #     None,
-    #     train,
-    # )
+    _sampler(OriginalGenerator(gen_x_times=15), train, target, test)
+    _sampler(
+        GANGenerator(gen_x_times=10, only_generated_data=False,
+                     gen_params={"batch_size": 500, "patience": 25, "epochs": 500, }), train, target, test
+    )
+
+    _sampler(OriginalGenerator(gen_x_times=15), train, None, train)
+    _sampler(
+        GANGenerator(cat_cols=["A"], gen_x_times=20, only_generated_data=True),
+        train,
+        None,
+        train,
+    )
     _sampler(
         ForestDiffusionGenerator(cat_cols=["A"], gen_x_times=1, only_generated_data=True),
         train,
         None,
         train,
     )
+    _sampler(
+        ForestDiffusionGenerator(gen_x_times=10, only_generated_data=False,
+                     gen_params={"batch_size": 500, "patience": 25, "epochs": 500, }), train, target, test
+    )
+
+    min_date = pd.to_datetime('2019-01-01')
+    max_date = pd.to_datetime('2021-12-31')
+
+    d = (max_date - min_date).days + 1
+
+    train['Date'] = min_date + pd.to_timedelta(np.random.randint(d, size=train_size), unit='d')
+    train = get_year_mnth_dt_from_date(train, 'Date')
 
-    #
-    # min_date = pd.to_datetime('2019-01-01')
-    # max_date = pd.to_datetime('2021-12-31')
-    #
-    # d = (max_date - min_date).days + 1
-    #
-    # train['Date'] = min_date + pd.to_timedelta(np.random.randint(d, size=train_size), unit='d')
-    # train = get_year_mnth_dt_from_date(train, 'Date')
-    #
-    # new_train, new_target = GANGenerator(gen_x_times=1.1, cat_cols=['year'], bot_filter_quantile=0.001,
-    #                                      top_filter_quantile=0.999,
-    #                                      is_post_process=True, pregeneration_frac=2, only_generated_data=False). \
-    #     generate_data_pipe(train.drop('Date', axis=1), None,
-    #                        train.drop('Date', axis=1)
-    #                        )
-    # new_train = collect_dates(new_train)
+    new_train, new_target = GANGenerator(gen_x_times=1.1, cat_cols=['year'], bot_filter_quantile=0.001,
+                                         top_filter_quantile=0.999,
+                                         is_post_process=True, pregeneration_frac=2, only_generated_data=False). \
+        generate_data_pipe(train.drop('Date', axis=1), None,
+                           train.drop('Date', axis=1)
+                           )
+    new_train = collect_dates(new_train)
diff --git a/tests/test_sampler.py b/tests/test_sampler.py
@@ -9,7 +9,7 @@
 import numpy as np
 import pandas as pd
 
-from src.tabgan.sampler import OriginalGenerator, Sampler, GANGenerator
+from src.tabgan.sampler import OriginalGenerator, Sampler, GANGenerator, ForestDiffusionGenerator
 
 
 class TestOriginalGenerator(TestCase):
@@ -94,3 +94,20 @@ def test_generate_data(self):
             self.assertEqual(np.max(self.target.nunique()), np.max(new_target.nunique()))
             self.assertTrue(gen_train.shape[0] > new_train.shape[0])
             self.assertEqual(np.max(self.target.nunique()), np.max(new_target.nunique()))
+
+class TestSamplerForestDiffusion(TestCase):
+    # Top-level and uniquely named so unittest discovery picks it up
+    def setUp(self):
+        self.train = pd.DataFrame(np.random.randint(-10, 150, size=(50, 4)), columns=list('ABCD'))
+        self.target = pd.DataFrame(np.random.randint(0, 2, size=(50, 1)), columns=list('Y'))
+        self.test = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
+        self.gen = ForestDiffusionGenerator(gen_x_times=15)
+        self.sampler = self.gen.get_object_generator()
+
+    def test_generate_data(self):
+        new_train, new_target, test_df = self.sampler.preprocess_data(self.train.copy(),
+                                                                      self.target.copy(), self.test)
+        gen_train, gen_target = self.sampler.generate_data(new_train, new_target, test_df)
+        self.assertEqual(gen_train.shape[0], gen_target.shape[0])
+        self.assertEqual(np.max(self.target.nunique()), np.max(new_target.nunique()))
+        self.assertTrue(gen_train.shape[0] > new_train.shape[0])