This repository was archived by the owner on May 17, 2024. It is now read-only.

Prevent type overflow #757

Merged
merged 8 commits into from
Nov 14, 2023

Conversation

pik94
Contributor

@pik94 pik94 commented Oct 28, 2023

Some databases, Teradata for example, impose a short maximum length on strings and require an explicit length for CHAR/VARCHAR columns, i.e. a column must be declared like this:

CREATE TABLE MyTable (
    col_name VARCHAR(n)
)

where n must be specified and cannot exceed a database-specific maximum.

To calculate the md5 hash of a row and convert it to an integer, the column values of that row are cast to strings and then concatenated. Let me clarify with an example:

CREATE TABLE MyTable (
    id INTEGER,
    data VARCHAR(n)
)

To concatenate the columns, we use a construction like this:

CAST(id AS VARCHAR(n1)) || '|'  || CAST(data AS VARCHAR(n2)) 

The question is what n1 and n2 should be. I believe each should be the maximum length allowed for the type, call it N, so that all customer data is kept without loss. However, such a concatenation leads to a type overflow, because the result would need VARCHAR(N + N), which is not allowed.
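
The overflow is plain arithmetic. Below is a minimal sketch, assuming a hypothetical helper `worst_case_concat_length` and an illustrative limit of N = 32000 symbols; neither name nor value is taken from the actual implementation.

```python
def worst_case_concat_length(n_columns: int, max_len: int, sep: str = "|") -> int:
    """Upper bound on the length of n_columns strings, each up to max_len
    symbols, joined with sep (hypothetical helper, for illustration only)."""
    if n_columns <= 0:
        return 0
    return n_columns * max_len + (n_columns - 1) * len(sep)


N = 32000  # assumed maximum n for VARCHAR(n); the real limit depends on the database
needed = worst_case_concat_length(2, N)
print(needed, needed > N)  # the two-column example already needs more than N
```

Even the two-column table above needs 2N + 1 symbols in the worst case, which is why casting each column to VARCHAR(N) cannot work when N is already the maximum.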

To avoid such an overflow, we should shorten the string values without losing information. I see one possible solution: hashing each item of the concat operation, i.e.

md5( CAST(id AS VARCHAR(n1)) ) || '|'  || md5( CAST(data AS VARCHAR(n2)) )
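
To make the two variants easier to compare, here is a small sketch that generates both expressions as SQL text. The function `concat_expr` and its parameters are hypothetical and only illustrate the idea; the SQL function name `md5` follows the snippet above, and the real hashing function differs between databases.

```python
from typing import Sequence


def concat_expr(columns: Sequence[str], max_len: int,
                hash_each: bool = False, sep: str = "|") -> str:
    """Build a SQL expression that casts each column to VARCHAR(max_len)
    and concatenates the results (hypothetical helper, not the PR's code)."""
    parts = []
    for col in columns:
        cast = f"CAST({col} AS VARCHAR({max_len}))"
        # With hash_each=True every item shrinks to a fixed-size digest
        # before concatenation, so the total no longer grows with max_len.
        parts.append(f"md5({cast})" if hash_each else cast)
    return f" || '{sep}' || ".join(parts)


print(concat_expr(["id", "data"], 100))
print(concat_expr(["id", "data"], 100, hash_each=True))
```

The second call produces the same shape as the expression above, with both n1 and n2 set to the illustrative value 100.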

Benefits:

  • It avoids data type overflow in most cases

Drawbacks:

  • Performance might decrease because hashes must be calculated much more often. However, the current implementation enables this only when necessary, i.e. only if at least one of the databases participating in the cross-database diff has a fixed VARCHAR length limit. This behavior is controlled by the PREVENT_OVERFLOW_WHEN_CONCAT flag.
  • The probability of collisions increases, but I do not think dramatically.
  • We may still overflow if there are many columns and the maximum n of VARCHAR(n) is low. I do not think this is a big problem, because a typical maximum N is 32000 or more symbols, so a customer would need roughly 1000 columns or more in a diff to hit it.
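
The estimate in the last drawback can be checked with a short calculation. This is a sketch under two assumptions that are not stated in the description: each md5 value is rendered as a 32-character hex string, and items are joined with a one-character separator. The helper name `max_hashed_columns` is hypothetical.

```python
MD5_HEX_LEN = 32  # assumption: md5 is rendered as 32 hex characters


def max_hashed_columns(max_len: int, sep: str = "|") -> int:
    """Largest k such that k digests plus (k - 1) separators fit into
    VARCHAR(max_len). Hypothetical helper, for illustration only."""
    # k * MD5_HEX_LEN + (k - 1) * len(sep) <= max_len
    return (max_len + len(sep)) // (MD5_HEX_LEN + len(sep))


print(max_hashed_columns(32000))
```

Under these assumptions the limit for N = 32000 comes out slightly below 1000 columns, which is consistent with the rough figure above.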

@pik94 pik94 force-pushed the prevent-type-overflow branch 4 times, most recently from 682e434 to caf766c Compare October 28, 2023 14:04
@dlawin dlawin requested review from nolar and dlawin October 29, 2023 18:25
nolar
nolar previously requested changes Nov 1, 2023
@pik94 pik94 force-pushed the prevent-type-overflow branch from c351010 to 4a3f703 Compare November 8, 2023 18:10
@pik94 pik94 requested review from nolar and dlawin November 11, 2023 08:19
@pik94 pik94 force-pushed the prevent-type-overflow branch from 83ce259 to 842481f Compare November 14, 2023 16:03
@pik94 pik94 requested a review from dlawin November 14, 2023 18:50
@dlawin dlawin dismissed nolar’s stale review November 14, 2023 21:59

dismissing "changes requested" per previous convo

@dlawin dlawin merged commit dfe3390 into datafold:master Nov 14, 2023
3 participants