Commit 769c44c

v0.4.2

1 parent 273851c

3 files changed (+62, -13 lines)

README.md

Lines changed: 59 additions & 10 deletions
@@ -62,17 +62,10 @@ subsetter --config my-config.yaml sample --plan my-plan.yaml --create --truncate
The sampling process proceeds in four phases:

1. If `--create` is specified it will attempt to create any missing tables. Existing tables will not be touched even if the schema does not match what is expected.
-2. If `--truncate` is specified any tables about to be sampled will be first truncated.
-3. Any sampled tables that are referenced by other tables will first be materialized into temporary tables on the source database.
+2. If `--truncate` is specified any tables about to be sampled will be first truncated. subsetter expects there to be no existing data in the destination database unless configured to run in _merge_ mode.
+3. Any sampled tables that are referenced by other tables will first be materialized into temporary tables on the source database.
4. Data is copied for each table from the source to destination.

-The sampler also supports filters which allow you to transform and anonymize your data using simple column filters. Check the [example](subsetter.example.yaml) config's 'sampler.filters' section for more details on what filters are available and how to configure them. If needed, custom Python plugins can be used to perform arbitrary transformations.

## Plan and sample in one action

There's also a `subset` subcommand to perform the `plan` and `sample` actions
@@ -84,7 +77,63 @@ each.
subsetter -c my-config.yaml subset --create --truncate
```

-# Sampling Multiplicity
+# Sample Transformations

+By default any sampled row is copied directly from the source database to the destination database. However, there are several transformation steps that can be configured at the sampling stage to change this behavior.

+## Filtering

+Filters allow you to transform the columns in each sampled row using either a set of built-in filters or custom plugins. Built-in filters allow you to easily replace common sources of personally identifiable information with fake data using the [faker](https://faker.readthedocs.io/en/master/) library. Filters for names, emails, phone numbers, addresses, locations, and more come built in. See [subsetter.example.yaml](subsetter.example.yaml) for full details on which filters exist and how to create a custom filter plugin.
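As a purely illustrative sketch (not subsetter's plugin API), the effect of a PII filter on a single row can be pictured in Python using the faker library; the column names and the `filter_row` helper below are hypothetical:

```python
# Illustrative sketch only: shows what a PII filter conceptually does to a row.
# The column names ("name", "email") and the filter_row helper are hypothetical,
# not part of subsetter's actual plugin interface.
from faker import Faker

fake = Faker()

def filter_row(row: dict) -> dict:
    """Replace PII columns with fake values while leaving other columns untouched."""
    filtered = dict(row)
    if "name" in filtered:
        filtered["name"] = fake.name()
    if "email" in filtered:
        filtered["email"] = fake.email()
    return filtered

print(filter_row({"id": 17, "name": "Ada Lovelace", "email": "ada@example.com"}))
```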
+## Identifier Compaction

+Often tables make use of auto-incrementing integer identifiers to function as their primary key. Sometimes we may want the identifiers in our sampled data to be compact -- instead of retaining the value from the source database we may want our N sampled rows to have identifiers ranging from 1 to N. This is useful for sample data where we want to keep the identifiers easy to reference.

+Any other table that has a foreign key that references one of these compacted columns will automatically also have the column involved in that foreign key adjusted to maintain semantic consistency.

+Note that enabling compaction can have a noticeable impact on performance. Compaction both requires more tables to be materialized on the source database and requires more joins when streaming data into the destination database.
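A rough Python sketch of the compaction idea, assuming made-up `users` and `orders` tables (this is not subsetter's internal implementation): sampled primary keys are renumbered 1..N, and foreign keys that reference a compacted column are passed through the same mapping.

```python
# Rough sketch of identifier compaction; table and column names are invented
# for illustration and this is not how subsetter implements it internally.
users = [{"id": 1007, "name": "a"}, {"id": 1042, "name": "b"}, {"id": 1315, "name": "c"}]
orders = [{"id": 9001, "user_id": 1042}, {"id": 9002, "user_id": 1315}]

# Renumber the sampled primary keys into the compact range 1..N.
id_map = {
    row["id"]: new_id
    for new_id, row in enumerate(sorted(users, key=lambda r: r["id"]), start=1)
}

compact_users = [{**row, "id": id_map[row["id"]]} for row in users]
# Foreign keys referencing the compacted column are remapped with the same table.
compact_orders = [{**row, "user_id": id_map[row["user_id"]]} for row in orders]

print(compact_users)   # ids become 1..3
print(compact_orders)  # user_id values follow the new ids
```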
+## Merging

+By default the sampler expects no data to exist in the destination database. To get around this constraint we can turn on "merge" mode. To use merge mode all sampled tables must either be marked as "passthrough" or have a single-column, non-negative, integral primary key.

+When enabled, the sampler will calculate the largest existing primary key identifier for each non-passthrough table and automatically shift the primary key of each sampled row to be larger using the equation:

+```
+new_id = source_id + max(0, existing_ids...) + 1
+```
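The shift follows directly from the equation above; here is a small Python illustration with made-up identifier values:

```python
# Illustration of the merge-mode primary key shift described above.
# existing_ids are keys already present in the destination table; values are made up.
existing_ids = [3, 7, 12]
offset = max([0, *existing_ids]) + 1   # max(0, existing_ids...) + 1 -> 13

def remap(source_id: int) -> int:
    return source_id + offset

print([remap(i) for i in [1, 2, 3]])  # [14, 15, 16] -- shifted past existing rows
```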
+Passthrough tables instead will be sampled as normal except they will use the 'skip' conflict strategy, which has the effect of only inserting a row into a passthrough table if no row with a matching primary key exists in the destination database.
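Conceptually, the 'skip' strategy inserts a sampled row only when its primary key is absent from the destination; a toy Python sketch with hypothetical data:

```python
# Sketch of the 'skip' conflict behavior for passthrough tables: a sampled row
# is inserted only when its primary key is absent from the destination.
# Purely illustrative; subsetter expresses this as an insert conflict strategy.
existing = {1: {"id": 1, "code": "US"}, 2: {"id": 2, "code": "CA"}}
sampled = [{"id": 2, "code": "CA"}, {"id": 3, "code": "MX"}]

for row in sampled:
    if row["id"] not in existing:  # skip rows whose key already exists
        existing[row["id"]] = row

print(sorted(existing))  # [1, 2, 3]
```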
+If merging multiple times it may be necessary to turn on identifier compaction to keep the largest identifier in each table from growing too quickly due to large gaps.
+## Multiplicity

Sampling usually means condensing a large dataset into a semantically consistent
small dataset. However, there are times that what you really want to do is

subsetter.example.yaml

Lines changed: 2 additions & 2 deletions
@@ -155,12 +155,12 @@ sampler:
# If merge is enabled subsetter will attempt to merge sampled data into
# existing tables. Passthrough tables will be inserted as normal except
# it will use the 'skip' conflict strategy. All other tables must have a
-# single-column, integral primary key.
+# single-column, non-negative, integral primary key.
#
# Non-passthrough tables will have their primary keys remapped using the
# below equation:
#
-# new_id = source_id + max(0, existing_pks...) + 1
+# new_id = source_id + max(0, existing_ids...) + 1
merge: false

# Alternative configuration to output JSON files within a directory

subsetter/_version.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "0.4.1"
+__version__ = "0.4.2"
