fully support naive and tz-aware timestamp/time data types #2570
Conversation
```python
from dlt.sources.credentials import ConnectionStringCredentials


class MSSQLSourceDB:
```
I would've probably created a generic source db that uses sqlalchemy and maybe sqlmodel, so we can create an example dataset on any database, including an abstraction for manipulating rows. We already have more or less the exact same code for postgres. But that is something for the future, maybe.
right! we can have a few tables with standard types that should work on all source databases. for that we can extract a few standard tests.
tables with specific datatypes could at least have standardized names. but this is a significant amount of work. I did a first step by enabling more source databases here.
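Not part of this PR, but a minimal sketch of what such a generic source db helper could look like, using plain SQLAlchemy (table name, columns and the `create_example_dataset` helper are illustrative assumptions):

```python
import sqlalchemy as sa

metadata = sa.MetaData()

# one "standard types" table that should work on any SQLAlchemy-supported database
example_rows = sa.Table(
    "example_rows",
    metadata,
    sa.Column("id", sa.Integer, primary_key=True),
    sa.Column("name", sa.Text),
    sa.Column("created_at", sa.DateTime(timezone=True)),
)


def create_example_dataset(dsn: str) -> None:
    """Create the example table and seed a few rows on whatever database the DSN points to."""
    engine = sa.create_engine(dsn)
    metadata.create_all(engine)
    with engine.begin() as conn:
        conn.execute(
            sa.insert(example_rows),
            [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}],
        )
```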
dlt/extract/incremental/transform.py
Outdated
```python
def _adapt_if_datetime(row_value: Any, last_value: Any) -> Any:

# For datetime cursor, ensure the value is a timezone aware datetime.
# The object saved in state will always be a tz aware pendulum datetime so this ensures values are comparable
def _adapt_timezone(row_value: datetime, cursor_value: datetime, cursor_value_name: str) -> Any:
```
This is smart! If we have incoming rows with varying timezone awareness there will be a big mess though, right? Maybe we should somehow raise in the normalizer when we detect this. Or is there a mechanism somewhere? I have not seen this coming up in the community, but if there is unstructured data being loaded this could always be a possibility.
huh! why is this comment on outdated code? Anyway - if tz-awareness changes on the cursor column, the comparison will fail and the extract step will abort. we normalize data after the extract phase, so the user should use a map to normalize this data.
incremental is used mostly for sql databases and rest apis, so this problem is not popping up - like you say
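A sketch of the map-based workaround mentioned above; the resource, column name and sample data are made up, and `insert_at=1` is used assuming (as in the dlt docs pattern) that it places the map before the incremental step:

```python
import dlt
from datetime import datetime, timezone


def ensure_utc(item):
    # make the cursor value tz-aware so comparisons with the stored state never mix naive and aware datetimes
    ts = item["updated_at"]
    if ts.tzinfo is None:
        item["updated_at"] = ts.replace(tzinfo=timezone.utc)
    return item


@dlt.resource
def events(updated_at=dlt.sources.incremental("updated_at")):
    # deliberately mixed awareness to illustrate the problem
    yield {"id": 1, "updated_at": datetime(2024, 1, 1, 12, 0)}
    yield {"id": 2, "updated_at": datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)}


# run the normalization before the incremental step so values are comparable
events.add_map(ensure_utc, insert_at=1)
```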
```python
# limit works with chunks (by default)
limit = self.limit.limit(self.chunk_size)
if limit is not None:
    query = query.limit(limit)
```
Just out of curiosity, is it faster to apply a limit on the query even when you are not retrieving all rows from the cursor?
this is very dependent on the particular implementation. what really makes a difference is an index on the cursor column - then the query engine can stream that data in chunks without scanning, and in that case the limit will not change much AFAIK. it does however allow loading data from connectorx in chunks
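For illustration only (not the `sql_database` internals; table, column and DSN are placeholders): with an index on the cursor column, an ordered query plus LIMIT lets the engine serve each chunk without a full scan:

```python
import sqlalchemy as sa

engine = sa.create_engine("postgresql://user:pass@localhost/db")  # placeholder DSN
events = sa.table("events", sa.column("id"), sa.column("updated_at"))

chunk_size = 10_000
query = (
    sa.select(events)
    .where(events.c.updated_at > sa.bindparam("last_value"))
    .order_by(events.c.updated_at)  # an index on updated_at lets the engine stream this without scanning
    .limit(chunk_size)  # the LIMIT caps what a backend such as connectorx materializes per request
)
```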
```python
# TIMESTAMP is always timezone-aware in BigQuery
# DATETIME is always timezone-naive in BigQuery
# NOTE: we disable DATETIME because it does not work with parquet
return "TIMESTAMP" if timezone else "TIMESTAMP"
```
Is this line correct? The condition does not change anything.
it is, but I'll remove it to make it 100% clear. DATETIME does not work on BigQuery in practice
```python
    return value


def _apply_lag_to_datetime(
```
should be apply_lag_to_date I think
yeah, the typing is wrong.
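A sketch of how the two cases could be kept apart; these are not dlt's actual helpers, and the units (days for dates, seconds for datetimes) are an assumption based on how lag is documented:

```python
from datetime import date, datetime, timedelta


def apply_lag_to_date(value: date, lag: float) -> date:
    # for date cursors the lag is interpreted as whole days
    return value - timedelta(days=lag)


def apply_lag_to_datetime(value: datetime, lag: float) -> datetime:
    # for datetime cursors the lag is interpreted as seconds
    return value - timedelta(seconds=lag)
```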
Description
timestamp and time handling
This PR makes the way `dlt` handles timestamps with timezone or without (naive) consistent across normalizers and destinations. The core of the change is described in #2591.
It also fixes several edge cases for timestamps with precision (i.e. nanosecond or 100-nanosecond precision in mssql):
#2877
#2486
Standardizing timestamp behavior surfaced several problems with incremental datetime cursors. This PR makes the incremental cursor always preserve the exact timestamp type of the source data (see the sketch after the linked issues):
#2658
#2460
#2225
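A sketch of the behavior this enables (the resource and column are made up): a naive `initial_value` is kept naive instead of being coerced to a tz-aware UTC datetime, so it compares cleanly against naive source data:

```python
import dlt
from datetime import datetime


@dlt.resource
def events(
    updated_at=dlt.sources.incremental(
        "updated_at",
        initial_value=datetime(2024, 1, 1),  # naive on purpose; the cursor keeps it naive
    )
):
    yield {"id": 1, "updated_at": datetime(2024, 1, 2, 8, 30)}  # naive source data
```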
- handling of the `time` type was changed to behave as documented in all cases (always naive, in UTC)
- naive timestamps are enabled for all destinations that support them; destination capabilities define timestamp support with additional flags
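A sketch of declaring a naive timestamp explicitly, assuming the `timezone` column hint used by this line of work (destinations without naive timestamp support fall back according to their capabilities):

```python
import dlt


@dlt.resource(
    columns={"created_at": {"data_type": "timestamp", "timezone": False}}  # naive; timezone=True keeps it tz-aware
)
def orders():
    yield {"id": 1, "created_at": "2024-01-01T12:00:00"}
```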
sql_database timestamp cursor tests and extensions
This PR adds an mssql source test - all data types and tz-aware and naive cursors (as many bugs were reported here). To unify behavior on ConnectorX and other backends, `sql_database` will convert the `LimitItem` step into a LIMIT SQL clause (to load in chunks).
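A sketch of how this looks from the user side; credentials and table name are placeholders, and `add_limit` adds the `LimitItem` step that, per this PR, is pushed down as a LIMIT clause:

```python
import dlt
from dlt.sources.sql_database import sql_table

events = sql_table(
    credentials="mssql://user:pass@host:1433/db",  # placeholder credentials
    table="events",
    backend="connectorx",  # with the pushed-down LIMIT, connectorx can load in chunks
)

pipeline = dlt.pipeline(pipeline_name="sql_limit_demo", destination="duckdb")
pipeline.run(events.add_limit(2))
```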
Schema object cleanup
The data normalization part got kicked out from the Schema object and moved to `items_normalizer` (identifier normalization stays!). This separates the concerns better (we have many normalizers now, so the old concept of having everything in the Schema object is outdated). This should be followed up with removing the json relational normalizer from Schema.
Why this change happened here: because I started to normalize timestamps, and that requires destination capabilities to be present.
Resources maintain parent-child (transformer) relationship
Previously this relationship was maintained by the Pipe class (which represents data processing steps, without metadata). Now `DltResource` has a `_parent` field which points to the parent. This allows maintaining the full resource tree and providing resource metadata and the correct list of extracted resources when grouped in a source. Previously, "mock" resources were created for resources added to the source implicitly. Example: if you add a transformer named "TR" that has parent resource "R", only "TR" is visible in the source (explicit), but when extracted, both "TR" and "R" will be evaluated. "R" is then implicitly added to the source, and previously its metadata was not available.
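A sketch of the "TR"/"R" example using the standard dlt decorators (names and data are illustrative):

```python
import dlt


@dlt.resource(name="R")
def r():
    yield from ({"id": i} for i in range(3))


@dlt.transformer(data_from=r, name="TR")
def tr(item):
    item["doubled"] = item["id"] * 2
    yield item


@dlt.source
def my_source():
    # only the transformer is added explicitly; its parent "R" is pulled in implicitly
    # and, with the _parent link, its metadata is available too
    return [tr]


print(list(my_source()))  # evaluates both "TR" and its parent "R"
```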