[SPARK-51919][PYTHON] Allow overwriting statically registered Python Data Source #50716

Open. wengh wants to merge 1 commit into master.

@wengh (Contributor) commented Apr 25, 2025

What changes were proposed in this pull request?

  • Allow overwriting static Python Data Sources during registration
  • Update documentation to clarify Python Data Source behavior and registration options

Why are the changes needed?

Static registration is a bit obscure and doesn't always work as expected (e.g. when the module providing DefaultSource is installed after lookup_data_sources already ran).
So in practice users (or LLM agents) often want to explicitly register the data source even if it is provided as a DefaultSource.
Raising an error in this case interrupts the workflow, making LLM agents spend extra tokens regenerating the same code but without registration.

This change also makes the behavior consistent with user data source registrations, which are already allowed to overwrite previous user registrations.
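
To make the timing problem above concrete, here is a minimal sketch of prefix-based discovery using only the standard library. It is an illustration, not Spark's actual `lookup_data_sources` code: the function name `discover_default_sources`, the `search_path` parameter, and the module `pyspark_demo_source` are hypothetical, and the real lookup is assumed to apply stricter checks (for example, that `DefaultSource` is really a data source class).

```python
import importlib
import pkgutil
import sys
import tempfile
from pathlib import Path


def discover_default_sources(prefix="pyspark_", search_path=None):
    """Map module name -> DefaultSource for top-level modules with the prefix.

    Hypothetical helper: it only mimics the general idea of static lookup.
    """
    found = {}
    for info in pkgutil.iter_modules(path=search_path):
        if not info.name.startswith(prefix):
            continue
        module = importlib.import_module(info.name)
        source = getattr(module, "DefaultSource", None)
        if source is not None:
            found[info.name] = source
    return found


def demo():
    """Take a snapshot, 'install' a module afterwards, then look again."""
    with tempfile.TemporaryDirectory() as tmp:
        sys.path.insert(0, tmp)
        try:
            # Snapshot taken before the package exists, like an early lookup.
            stale = discover_default_sources(search_path=[tmp])

            # Simulate installing a data source package after the lookup ran.
            Path(tmp, "pyspark_demo_source.py").write_text(
                "class DefaultSource:\n    pass\n"
            )
            importlib.invalidate_caches()

            # Only a fresh lookup sees the new module; the snapshot does not.
            fresh = discover_default_sources(search_path=[tmp])
        finally:
            sys.path.remove(tmp)
    return stale, fresh


stale_snapshot, fresh_snapshot = demo()
print(sorted(stale_snapshot))
print(sorted(fresh_snapshot))
```

In this sketch the first snapshot never learns about the module that appears later, which is the situation where a user would fall back to explicit registration.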

Does this PR introduce any user-facing change?

Yes. Previously, registering a Python Data Source with the same name as a statically registered one would throw an error. With this change, it will overwrite the static registration.
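
The difference in behavior can be sketched with a toy registry. This is a simplified model written for illustration only, not the code in this PR: the class `ToyRegistry`, its `allow_static_overwrite` flag, and the string stand-ins for data source classes are all hypothetical.

```python
class ToyRegistry:
    """Toy model of a registry with static and user-registered sources."""

    def __init__(self, static_sources, allow_static_overwrite):
        self._static = dict(static_sources)
        self._dynamic = {}
        self._allow_static_overwrite = allow_static_overwrite

    def register(self, name, source):
        # Old behavior (as described above): a clash with a static name fails.
        if name in self._static and not self._allow_static_overwrite:
            raise ValueError(f"data source {name!r} already exists")
        # User registrations always overwrite earlier user registrations.
        self._dynamic[name] = source

    def resolve(self, name):
        # In this model, a user registration shadows a static one.
        if name in self._dynamic:
            return self._dynamic[name]
        return self._static[name]


static = {"demo": "static DefaultSource"}

old = ToyRegistry(static, allow_static_overwrite=False)
try:
    old.register("demo", "user source")
    old_outcome = "registered"
except ValueError as exc:
    old_outcome = f"error: {exc}"

new = ToyRegistry(static, allow_static_overwrite=True)
new.register("demo", "user source v1")
new.register("demo", "user source v2")

print(old_outcome)
print(new.resolve("demo"))
```

In PySpark itself, explicit registration goes through `spark.dataSource.register(...)`; the model above only mirrors the before/after semantics stated in this description.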

How was this patch tested?

Added a test in PythonDataSourceSuite.scala to verify that static sources can be overwritten correctly.

Was this patch authored or co-authored using generative AI tooling?

No

@wengh (Contributor, Author) commented Apr 25, 2025

@allisonwang-db @HyukjinKwon please take a look

- During Data Source resolution, built-in and Scala/Java Data Sources take precedence over Python Data Sources with the same name; to explicitly use a Python Data Source, make sure its name does not conflict with the other Data Sources.
- During Data Source resolution, built-in and Scala/Java Data Sources take precedence over Python Data Sources with the same name; to explicitly use a Python Data Source, make sure its name does not conflict with the other non-Python Data Sources.
- It is allowed to register multiple Python Data Sources with the same name. Later registrations will overwrite earlier ones.
- To automatically register a data source, export it as ``DefaultSource`` in a top level module with name prefix ``pyspark_``. See `pyspark_huggingface <https://github.com/huggingface/pyspark_huggingface>`_ for an example.

@wengh (Contributor, Author) commented Apr 25, 2025

Should we mention the DefaultSource feature, which was previously undocumented?

@wengh wengh changed the title [PYTHON] Allow overwriting statically registered Python Data Source [SPARK-51919][PYTHON] Allow overwriting statically registered Python Data Source Apr 25, 2025

@HyukjinKwon (Member) commented

cc @allisonwang-db
