Skip to content

DuckDB - append adds extra rows even when there's no updated data #971

@karakanb

Description

@karakanb

dlt version

0.4.4

Describe the problem

The append disposition is supposed to not append if there's no new data and it works that way in postgres destination, but with duckdb it reinserts the data.

Expected behavior

in the second iteration, the output should not have an extra row since there's no new data, however it appends anyway.

Steps to reproduce

use the following code:

import os
import dlt
from dlt.sources.credentials import ConnectionStringCredentials
from dlt.common import pendulum
import shutil

import duckdb
from ingestr.src.sql_database import sql_table


def load_standalone_table_resource() -> None:
    pipeline = dlt.pipeline(
        pipeline_name="output",
        destination=dlt.destinations.duckdb(
            credentials=uri,
        ),
        pipelines_dir="pipeline_dataappendprob",
        dataset_name="testschema_append",
    )

    # Run the resources together
    info = pipeline.run(
        sql_table(
            credentials=uri,
            schema="testschema_append",
            table="input",
            incremental=dlt.sources.incremental(
                "updated_at",
            ),
        ),
        dataset_name="testschema_append",
        table_name="output",
        write_disposition="append",
    )
    print(info)


def get_abs_path(relative_path):
    return os.path.abspath(os.path.join(os.path.dirname(__file__), relative_path))

abs_db_path = get_abs_path("sample_append.db")
uri = f"duckdb:///sample_append.db"
conn = duckdb.connect(abs_db_path)
def setup_db():
    try:
        shutil.rmtree(get_abs_path("pipeline_dataappendprob"))
    except:
        pass

    conn.execute("DROP SCHEMA IF EXISTS testschema_append CASCADE")
    conn.execute("CHECKPOINT")

    conn.execute("CREATE SCHEMA testschema_append")
    conn.execute("CREATE TABLE testschema_append.input (id INTEGER, val VARCHAR, updated_at DATE)")
    conn.execute("INSERT INTO testschema_append.input VALUES (1, 'val1', '2022-01-01'), (2, 'val2', '2022-01-02')")
    conn.execute("CHECKPOINT")


def run(iteration: int):
    print("running iteration", iteration)
    load_standalone_table_resource()

    print("--------------")
    print("source:")
    conn.sql("select * from testschema_append.input").show()
    print()
    print("destination:")
    conn.sql("select * from testschema_append.output").show()
    print("--------------")

setup_db()
run(1)
run(2)

Operating system

macOS

Runtime environment

Local

Python version

3.11

dlt data source

dlt sql_table source

dlt destination

No response

Other deployment details

official duckdb

Additional information

No response

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Type

No type

Projects

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions