
Conversation


@rudolfix rudolfix commented Apr 28, 2025

Description

timestamp and time handling

This PR makes the way dlt handles timestamps, both timezone-aware and naive, consistent across normalizers and destinations. The core of the change is described in
#2591

It also fixes several edge cases for timestamps with non-default precision (i.e. nanosecond, or 100-nanosecond precision in mssql):
#2877
#2486

Standardizing timestamp behavior surfaced several problems with incremental datetime cursors. This PR makes the incremental cursor always preserve the exact timestamp type of the source data:
#2658
#2460
#2225

Handling of the time type was changed to behave as documented in all cases (always naive, in UTC).

Naive timestamps are enabled for all destinations that support them. Destination capabilities define timestamp support with additional flags. A short sketch of the cursor behavior follows below.
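
To make the intended cursor behavior concrete, here is a minimal sketch (resource and column names are invented for illustration) of how an incremental cursor is expected to preserve the exact timestamp type of the source data: a naive initial value keeps the cursor naive, a tz-aware one keeps it tz-aware.

```python
from datetime import datetime, timezone

import dlt


@dlt.resource(name="events_naive")
def events_naive(
    updated_at=dlt.sources.incremental(
        "updated_at",
        # naive initial value: the cursor stays naive, matching the source data
        initial_value=datetime(2025, 1, 1),
    )
):
    yield {"id": 1, "updated_at": datetime(2025, 1, 2, 12, 30)}


@dlt.resource(name="events_aware")
def events_aware(
    updated_at=dlt.sources.incremental(
        "updated_at",
        # tz-aware initial value: the cursor stays tz-aware
        initial_value=datetime(2025, 1, 1, tzinfo=timezone.utc),
    )
):
    yield {"id": 1, "updated_at": datetime(2025, 1, 2, 12, 30, tzinfo=timezone.utc)}
```

Mixing naive and tz-aware values on the same cursor column is expected to fail when the values are compared (see the review discussion further down).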

sql_database timestamp cursor tests and extensions

This PR adds an mssql source test covering all data types and both tz-aware and naive cursors (many bugs were reported here). To unify behavior across ConnectorX and other backends, sql_database will convert the LimitItem step into a LIMIT SQL clause (to load in chunks). A hedged usage sketch follows below.
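
For example (connection string, table name, and chunk size below are placeholders), applying a limit to an sql_database resource should now be pushed down into the generated query instead of only being enforced in Python, which lets backends such as ConnectorX read in chunks. A rough sketch:

```python
import dlt
from dlt.sources.sql_database import sql_table

# hypothetical table read with the connectorx backend; credentials are a placeholder
messages = sql_table(
    credentials="mssql+pyodbc://user:password@localhost:1433/my_db",
    table="chat_message",
    backend="connectorx",
).add_limit(1000)  # the LimitItem step is expected to become a LIMIT clause in the SQL

pipeline = dlt.pipeline(pipeline_name="sql_limit_demo", destination="duckdb")
pipeline.run(messages)
```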

Schema object cleanup

The data normalization part was moved out of the Schema object into items_normalizer (identifier normalization stays!). This separates concerns better: we have many normalizers now, so the old concept of keeping everything in the Schema object is outdated. This should be followed up by removing the json relational normalizer from Schema.
Why this change happened here: because I started to normalize timestamps, and that requires destination capabilities to be present.

Resources maintain parent-child (transformer) relationship

Previously this relationship was maintained by the Pipe class (which represents data processing steps, without metadata). Now DltResource has a _parent field which points to the parent resource. This makes it possible to maintain the full resource tree and to provide resource metadata and a correct list of extracted resources when they are grouped in a source. Previously, "mock" resources were created for resources that were added to the source implicitly. Example:
if you add a transformer named "TR" that has parent resource "R", only "TR" is visible in the source (explicit), but when extracted both "TR" and "R" will be evaluated. "R" is then implicitly added to the source, and previously its metadata was not available. A sketch follows below.
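
A minimal sketch of that example (resource and source names are illustrative; inspecting the resource list just mirrors the description above, and _parent itself is an internal field):

```python
import dlt


@dlt.resource(name="R")
def r():
    yield from [{"id": 1}, {"id": 2}]


@dlt.transformer(data_from=r, name="TR")
def tr(item):
    yield {**item, "enriched": True}


@dlt.source
def my_source():
    # only the transformer is returned explicitly; its parent "R" is pulled in implicitly
    return tr


source = my_source()
# "TR" is the explicit resource; "R" is still evaluated on extract, and with the
# _parent field on DltResource its metadata is now available as well
print(list(source.resources.keys()))
```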


netlify bot commented Apr 28, 2025

Deploy Preview for dlt-hub-docs ready!

Name Link
🔨 Latest commit 349ca9b
🔍 Latest deploy log https://app.netlify.com/projects/dlt-hub-docs/deploys/68b4457118cd670008c009e7
😎 Deploy Preview https://deploy-preview-2570--dlt-hub-docs.netlify.app

@rudolfix rudolfix force-pushed the fix/2486-fixes-mssql-datetime-precision branch 3 times, most recently from 7359a1b to c33ed04 Compare July 27, 2025 09:23
@rudolfix rudolfix force-pushed the fix/2486-fixes-mssql-datetime-precision branch from c33ed04 to 365e851 Compare August 6, 2025 22:09
@rudolfix rudolfix force-pushed the fix/2486-fixes-mssql-datetime-precision branch 9 times, most recently from 3580072 to ddbf1e5 Compare August 11, 2025 21:01
@rudolfix rudolfix force-pushed the fix/2486-fixes-mssql-datetime-precision branch from 5838a6c to 74f220b Compare August 17, 2025 21:05
@rudolfix rudolfix changed the title from "Fix/2486 fixes mssql datetime precision" to "fully support naive and tz-aware timestamp/time data types" Aug 19, 2025
@rudolfix rudolfix marked this pull request as ready for review August 19, 2025 12:15
@rudolfix rudolfix requested a review from sh-rp August 21, 2025 15:07
@sh-rp sh-rp added the breaking (This issue introduces breaking change) label Aug 29, 2025
from dlt.sources.credentials import ConnectionStringCredentials


class MSSQLSourceDB:
Collaborator:

I would've probably created a generic source db that uses sqlalchemy and maybe sqlmodel, so we can create an example dataset on any database, including an abstraction for manipulating rows. We already have more or less the exact same code for postgres. But that is something for the future, maybe.

Collaborator Author:

Right! We can have a few tables with standard types that should work on all source databases; for that we can extract a few standard tests.

Tables with specific datatypes could at least have standardized names, but this is a significant amount of work. I did a first step by enabling more source databases here.

def _adapt_if_datetime(row_value: Any, last_value: Any) -> Any:
# For datetime cursor, ensure the value is a timezone aware datetime.
# The object saved in state will always be a tz aware pendulum datetime so this ensures values are comparable
def _adapt_timezone(row_value: datetime, cursor_value: datetime, cursor_value_name: str) -> Any:
Collaborator:

This is smart! If we have incoming rows with varying timezone awareness there will be a big mess though, right? Maybe we should somehow raise in the normalizer when we detect this. Or is there a mechanism somewhere? I have not seen this coming up in the community, but if there is unstructured data being loaded this could always be a possibility.

Collaborator Author:

Huh! Why is this comment on outdated code? Anyway - if tz-awareness changes on the cursor column, the comparison will fail and the extract step will abort. We normalize data after the extract phase, so the user should use a map to normalize this data (a sketch follows below).

Incremental is used mostly for sql databases and REST APIs, so this problem is not popping up - like you say.
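
For reference, a minimal sketch of such a map (the column name and the assumption that naive values are UTC are made up for illustration), using add_map with insert_at to run before the incremental step:

```python
from datetime import datetime, timezone

import dlt


def ensure_utc(item):
    # assume naive values are UTC and attach tzinfo so all cursor values compare cleanly
    ts = item.get("updated_at")
    if ts is not None and ts.tzinfo is None:
        item["updated_at"] = ts.replace(tzinfo=timezone.utc)
    return item


@dlt.resource
def events(updated_at=dlt.sources.incremental("updated_at")):
    # rows with mixed tz-awareness (illustrative)
    yield {"id": 1, "updated_at": datetime(2025, 1, 1, 10, 0)}
    yield {"id": 2, "updated_at": datetime(2025, 1, 1, 11, 0, tzinfo=timezone.utc)}


# insert_at=1 places the map before later steps such as incremental,
# so the cursor only ever sees tz-aware values
pipeline = dlt.pipeline(pipeline_name="tz_map_demo", destination="duckdb")
pipeline.run(events().add_map(ensure_utc, insert_at=1))
```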

# limit works with chunks (by default)
limit = self.limit.limit(self.chunk_size)
if limit is not None:
    query = query.limit(limit)
Collaborator:

Just out of curiosity, is it faster to apply a limit on the query even when you are not retrieving all rows from the cursor?

Collaborator Author:

This is very dependent on the particular implementation. What really makes a difference is an index on the cursor column; then the query engine can stream that data in chunks without scanning, and in that case the LIMIT will not change much, AFAIK. It does, however, allow loading data from ConnectorX in chunks (rough sketch below).
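
As a rough illustration of why the index matters (table, column, and connection string are hypothetical), a chunked read keeps re-issuing a bounded query ordered by the cursor column; with an index on that column the engine can serve it without a full scan. A sqlalchemy sketch:

```python
import sqlalchemy as sa

# placeholder connection string and hypothetical "events" table
engine = sa.create_engine("mssql+pyodbc://user:password@localhost:1433/my_db")
metadata = sa.MetaData()
events = sa.Table("events", metadata, autoload_with=engine)


def read_chunk(last_seen, chunk_size=1000):
    # ordered, bounded query: an index on updated_at lets this stream without scanning
    query = (
        sa.select(events)
        .where(events.c.updated_at > last_seen)
        .order_by(events.c.updated_at)
        .limit(chunk_size)
    )
    with engine.connect() as conn:
        return conn.execute(query).fetchall()
```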

# TIMESTAMP is always timezone-aware in BigQuery
# DATETIME is always timezone-naive in BigQuery
# NOTE: we disable DATETIME because it does not work with parquet
return "TIMESTAMP" if timezone else "TIMESTAMP"
Collaborator:

Is this line correct? The condition does not change anything.

Collaborator Author:

It is, but I'll remove it to make it 100% clear. DATETIME does not work on BigQuery in practice.

return value


def _apply_lag_to_datetime(
Collaborator:

should be apply_lag_to_date I think

Collaborator Author:

Yeah, the typing is wrong.

@rudolfix rudolfix force-pushed the fix/2486-fixes-mssql-datetime-precision branch from f51a1bc to 349ca9b Compare August 31, 2025 12:51
@rudolfix rudolfix merged commit 823bf38 into devel Aug 31, 2025
66 of 67 checks passed
@rudolfix rudolfix deleted the fix/2486-fixes-mssql-datetime-precision branch August 31, 2025 18:06

Labels

breaking: This issue introduces breaking change
ci full: Use to trigger CI on a PR for full load tests
