Why not use DataFusion as the backend engine (rather than rewriting it all from scratch)? #3282

@alamb

Is your feature request related to a problem?

I read this blog post. Very cool (and thanks @MrPowers for pointing it out):

https://dataengineeringcentral.substack.com/p/introduction-to-daft-vs-polars

The experience of reading from S3, handling globs, etc. looks pretty amazing with Daft 💯 Nice work!

I have a question: if your focus is on an amazing experience working with objects on remote object stores, why also build your own Rust / Arrow based vectorized query engine (yet again) when there are already existing engines?

I can't help but notice there is a substantial amount of similarity between the current structure of your code and DataFusion (Arrow, LogicalPlan, PhysicalPlan, etc) -- likely as a consequence of it being fairly well understood how to build columnar query engines. It is

I also saw several issues that have started enumerating features that are missing in Daft that are already in DataFusion (along with tests, etc)

Describe the solution you'd like

I propose (very selfishly, of course, as a PMC member) that you build on Apache DataFusion and, better yet, help us extend it to make it even better.

You would likely get a full-featured, fast engine, and instead of reinventing all the standard, low-level processing operations you could focus on making the DataFrame experience of processing files remotely on object store amazing.

I can tell you from experience that building and maturing an execution engine takes a lot of effort, and it is great to do it with a big and established community.

Describe alternatives you've considered

I would also love to know if you have thought about DataFusion and chose a different approach anyway, and any thoughts you are willing to share about why you made that choice (so we can see if we can make it easier for future startups to choose to build with DataFusion).

Additional Context

Happy hacking!

Would you like to implement a fix?

No
