Expose Python interface for other rust applications #1325

Closed
jg2562 opened this issue Sep 10, 2021 · 28 comments

@jg2562

jg2562 commented Sep 10, 2021

Currently the Python-Rust interface lives in py-polars and is only published to PyPI. It would be helpful for other applications that need to pass DataFrames across that interface to have access to the PyO3 wrapper type.

Is there any way to get access to the wrapper type, so that a DataFrame can be returned to Python using PyO3?

@ritchie46
Member

Hi @jg2562, what would you like to do? That would give me a better understanding of what is possible.

@jg2562
Author

jg2562 commented Sep 10, 2021

Hi @ritchie46, thanks for the reply. We are working on an application whose core is written in Rust. We use Python to call functions in Rust (as most of the legacy code is written in Python), and we also use Python for quick proofs of concept before finalizing them in Rust.

For a more concrete example, we use serde on a struct containing a DataFrame, combined with zstd, to create a compressed version of our data (which is non-homogeneous in terms of data types). Since Rust loads the data, we currently need to unpack it from the DataFrame into structs that can be passed back to Python.

I was wondering if there was a way to expose the Python interface as a Rust library, so that we could simply pass the DataFrame to Python directly. Other libraries written in Rust for Python that want to build on polars will likely run into this issue too, so it could help them as well!

@ritchie46
Member

The easiest thing to do is to use Arrow and pyarrow to communicate the memory. Those Arrow arrays can then be used to create polars DataFrames/Series in Python polars as well as in Rust polars.

This will mostly be zero copy. Here is the code polars uses to communicate between pyarrow/rust-arrow: https://github.com/pola-rs/polars/tree/master/py-polars/src/arrow_interop

@jg2562
Author

jg2562 commented Sep 10, 2021

Thank you so much! I will definitely look into that. Just out of curiosity, is there something that makes exposing the interface difficult?

@ritchie46
Member

Just out of curiosity, is there something that makes exposing the interface difficult?

Well, TBH, I don't really know what exposing the interface means. Do you mean compiling Rust against Python polars?

Or interact with a precompiled rust binary? Or using rust polars and send a dataframe to a python polars process?

@jg2562
Author

jg2562 commented Sep 13, 2021

That's fair, it's pretty vague. When I said exposing the interface, I was imagining the last one: having Rust polars and sending a DataFrame to a Python polars process.

@ritchie46
Member

I was imagining the last one of having rust polars and sending a dataframe to the python polars processes when I said exposing the interface.

In that case you should use pyo3 and some copy pasting of the code snippets I referenced. That should work!

@jg2562
Author

jg2562 commented Oct 1, 2021

Hey @ritchie46! I ended up working on a different project for a bit, but I finally got around to making a small example. I was able to get the snippets to work, so at least I can better show what I was thinking and why I was wondering if PyDataFrame could be exposed.

Here is the repo. The use case would be running example.py, but you can see that it took a lot of scripting just to emulate passing the DataFrame back and forth across the FFI boundary. Let me know what you think, and thank you so much for the direction and help!

@MarcoLugo

Not sure if this is related. I am looking to reuse PyDataFrame in my own library built with pyo3. Is the arrow conversion as @jg2562 did the best way to do it or is there something easier/more direct? Thank you.

I would like to do something like this:

use pyo3::prelude::*;

#[pyfunction]
fn read_my_format() -> PyResult<PyDataFrame> {
    Ok(read_my_format_into_polars_df("my_file"))
}

#[pymodule]
fn my_lib(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(read_my_format, m)?)?;
    Ok(())
}

@jg2562
Author

jg2562 commented Oct 28, 2021

@MarcoLugo, after a recent update the repo that I posted breaks if you try to use some data types (DateTime64, for example). I think it would still be valuable to have access to PyDataFrame if that's doable, since it would be properly tied to the library rather than a hack on top of it. However, I really do not know how difficult this is, so we should consult @ritchie46, who would know much more.

@gunjunlee
Contributor

gunjunlee commented Jun 10, 2022

@ritchie46 I wrote some code that converts a Rust DataFrame to a Python polars DataFrame:

pub fn rust_dataframe_to_py_dataframe(dataframe: &mut DataFrame) -> PyResult<PyObject> {
    let dataframe = dataframe.rechunk();

    let gil = Python::acquire_gil();
    let py = gil.python();

    let names = dataframe.get_column_names();

    let pyarrow = py.import("pyarrow")?;
    let polars = py.import("polars")?;
    let rbs: Vec<PyObject> = dataframe
        .iter_chunks()
        .map(|rb| to_py_rb(&rb, &names, py, pyarrow).unwrap())
        .collect::<Vec<PyObject>>();
    let rbs: PyObject = rbs.into_py(py);
    let rbs: &PyList = rbs.extract(py)?;
    let py_table = pyarrow.getattr("Table")?.call_method1("from_batches", (rbs, ))?;
    let py_df = polars.call_method1("from_arrow", (py_table, ))?;  // << This line takes much time
    Ok(py_df.to_object(py))
}

But this takes too much time.

I guess there is a much easier and faster way to convert a Rust DataFrame to a Python DataFrame, because the Python DataFrame is just a wrapper around the Rust DataFrame.

But I don't know how to implement it. Could you help me?

If it were possible to import py-polars in Rust, the idea above would be easy to implement, but for some reason I cannot import py-polars even when I add it as a Cargo dependency, for example:

[dependencies]
py-polars = { path = "polars/py-polars" }

@cavenditti

Hello, I was casually looking into this and just wanted to share some insight with @gunjunlee
I'm no Rust expert, so this may be inaccurate. If so, please correct me 🙂

py-polars uses cdylib as crate-type (have a look at linkage reference), this means it cannot be imported in other crates.
That specific crate-type is required by PyO3, because it needs to build a dynamic library to end up in the Python wheel.
I don't have enough understanding of PyO3 and CPython internals to tell you if (and how) it's possible to create some kind of interface to just write a Rust function returning a PyDataFrame from py-polars and make everything work.

I don't think there is any reasonable alternative to using Arrow and pyarrow.
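
For context, the crate type is set in the `[lib]` section of `Cargo.toml`. The fragment below is a minimal sketch with a hypothetical crate name, not the actual py-polars manifest. Listing `rlib` next to `cdylib` is the usual way to let a PyO3 crate also be linked from other Rust crates; whether that alone would be enough for py-polars is not established in this thread.

```toml
# Cargo.toml of a hypothetical PyO3 extension crate (not the real py-polars manifest)
[lib]
name = "my_extension"
# "cdylib" builds the shared library that ends up in the Python wheel.
# "rlib" additionally builds a Rust library that other crates can depend on.
crate-type = ["cdylib", "rlib"]
```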

@jg2562
Author

jg2562 commented Aug 5, 2022

I've seen this issue pop up a few times in the last few days (#4264, #4212, kinda #1830). I wanted to reopen the discussion to talk about creating an API that is tied to polars development for people to link against. While the current example works and is very helpful, it has to be reimplemented in every code base, which makes it not very ergonomic to use. Because it is reimplemented, it also isn't tied to the development of polars, so it falls out of sync and breaks during updates in different people's projects. @ritchie46 mentioned in #4212 that he was considering making an API if he had time. If you would like help with creating it, please let us know!

@jmrgibson
Contributor

jmrgibson commented Aug 8, 2022

The way I've done this for my projects is to split the Python content into multiple crates. For example, I have a py-interface rlib crate that contains the #[pyfunction], #[pyclass], etc. definitions, which can be used from other Rust projects (and would be published to crates.io). Then I have a py-module cdylib crate that simply includes the functions/classes from py-interface and exports them in a #[pymodule].

In this case, we could keep py-polars as the cdylib and make a new (rlib) crate that contains the pyo3 type definitions. I can work on this if people think this is the right direction to go.
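
The split described above could be laid out roughly as follows. This is only a sketch: the crate names `py-interface` and `py-module` come from the comment, while the version, edition, and relative path are assumptions, and the `pyo3` dependency both crates would need is left out to avoid guessing a version.

```toml
# py-interface/Cargo.toml: rlib that holds the #[pyclass] / #[pyfunction] definitions
[package]
name = "py-interface"
version = "0.1.0"   # assumed
edition = "2021"    # assumed

[lib]
crate-type = ["rlib"]

# py-module/Cargo.toml would then contain (shown as comments so this stays one valid file):
# [lib]
# crate-type = ["cdylib"]
#
# [dependencies]
# py-interface = { path = "../py-interface" }   # assumed relative path
```

Other Rust projects would depend on `py-interface` only, and the `cdylib` crate stays a thin shell that registers the types in a `#[pymodule]`.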

@jg2562
Author

jg2562 commented Aug 9, 2022

To me, that's exactly the right direction to go! Just separating them and making py-interface available on crates.io would, I think, greatly help the Rust community use polars.

@jmrgibson
Contributor

@ritchie46 Do you think this is the correct approach?

jmrgibson pushed a commit to jmrgibson/polars that referenced this issue Aug 25, 2022
@jmrgibson
Contributor

jmrgibson commented Aug 26, 2022

I'm working on this here: https://github.com/jmrgibson/polars/tree/user/jgibson/split_out_py_polars_as_rust_crate

It appears to work using the nightly compiler. Looks like newer polars relies on simd which is nightly only? I'll continue to investigate, I'd like to get this working on stable.

For example, the following code works:

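use py_polars_core::PyDataFrame;

// `time_ns`, `data`, and `pyfunc` are defined elsewhere in the project.
let time: Series = time_ns.into_iter().collect();
// `DataFrame::new` returns a `Result`; `?` assumes a compatible error type in the caller.
let df = DataFrame::new(vec![data.clone(), time])?;
let df = PyDataFrame { df };
let args = (df,);
let res = Python::with_gil(|py| -> PyResult<DataFrame> {
    // Call the Python function, then unwrap the returned PyDataFrame.
    let res = pyfunc.call1(py, args)?;
    let pdf = res.extract::<PyDataFrame>(py)?;
    Ok(pdf.df)
});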

@ritchie46
Member

I don't think we should ship the Python interface for that. We could use Arrow's C data interface instead. That is zero copy and much slimmer.

@jmrgibson
Contributor

I don't think we should ship the Python interface for that. We could use Arrow's C data interface instead. That is zero copy and much slimmer.

I don't think I understand enough about PyO3 to figure out where the copying happens in this case.

E.g. If I want to call a python function with a dataframe I create in rust, and get a dataframe back to rust:

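# module.py
import polars as pl

def manipulate_df(df: pl.DataFrame) -> pl.DataFrame:
    ...  # user writes manipulation function here

// main.rs
use polars::prelude::*;
use pyo3::prelude::*;
use py_polars_core::PyDataFrame;

fn main() -> PyResult<()> {
    let df = df!(
        "data" => [1.0, 2.0],
        "time" => [1.0, 2.0],
    )
    .expect("columns have equal length");

    let modified_df = Python::with_gil(|py| -> PyResult<DataFrame> {
        let module = PyModule::import(py, "module")?;
        let pydf: PyDataFrame = df.into();
        let args = (pydf,);
        // Look up the function on the imported module, call it, and extract the result.
        let result: PyDataFrame = module.getattr("manipulate_df")?.call1(args)?.extract()?;
        Ok(result.df)
    })?;
    Ok(())
}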

Based on the docs for Py::new, which is what the default #[pyclass] uses, this is creating a new object on the python heap. Does that mean the entire inner DataFrame is getting copied from the rust stack to the python heap?
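
One way to reason about this question, as a standard-library-only sketch rather than a statement about PyO3 internals: moving a Rust value that owns heap buffers relocates only the small handle, not the buffers themselves. `Frame` below is a hypothetical stand-in for a DataFrame (a vector of reference-counted columns), and `Box::new` stands in for placing the value inside a freshly allocated object. Whether `Py::new` behaves the same way for `PyDataFrame` is an assumption to check against the PyO3 documentation.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a DataFrame: a list of reference-counted columns.
struct Frame {
    columns: Vec<Arc<Vec<f64>>>,
}

impl Frame {
    /// Wraps each column in an `Arc`, so the column data lives on the heap.
    fn from_columns(columns: Vec<Vec<f64>>) -> Self {
        Frame {
            columns: columns.into_iter().map(Arc::new).collect(),
        }
    }
}

/// Moves `frame` into a new heap allocation and reports whether the first
/// column's data buffer is still at the same address afterwards.
/// Panics if `frame` has no columns.
fn buffer_survives_move(frame: Frame) -> bool {
    let before = frame.columns[0].as_slice().as_ptr();
    // Only the `Frame` handle (pointer, length, capacity of the Vec) is moved here.
    let boxed: Box<Frame> = Box::new(frame);
    let after = boxed.columns[0].as_slice().as_ptr();
    before == after
}
```

Calling `buffer_survives_move(Frame::from_columns(vec![vec![1.0, 2.0]]))` compares the buffer address before and after the move; under this model, the cost of the move is independent of how many rows the columns hold.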

@AnatolyBuga
Contributor

@ritchie46, do you think it's possible to convert a LazyFrame from Python to Rust and back, like you did here with the eager DataFrame?

@ritchie46
Member

@ritchie46, do you think it's possible to convert a LazyFrame from Python to Rust and back, like you did here with the eager DataFrame?

You'd need to serialize the query plan. This will copy data if you use df.lazy(). If you start your query with pl.scan_x then it won't.

@kylebarron
Contributor

I don't think we should ship the Python interface for that. We could use Arrow's C data interface instead. That is zero copy and much slimmer.

I think this is a good suggestion for making the Python interface easier to use from third-party bindings. The example code in the python_rust_compiled_function directory only shows how to transfer a single Series through the C Data Interface. The C Data Interface doesn't define how to transfer an entire DataFrame per se, but you can do it by convention: treat the DataFrame as a struct whose fields are the columns you wish to move. That would be useful helper code to make available to people who want to extend Polars but don't have a ton of Arrow experience.
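
The struct-of-columns convention can be sketched at the schema level as follows. `SchemaNode` and `dataframe_as_struct` are hypothetical names, and `SchemaNode` is a simplified, safe stand-in for the interface's `ArrowSchema` struct (which uses raw C pointers and a release callback). The format strings are the ones the C Data Interface defines: `+s` for a struct, `g` for float64, `l` for int64.

```rust
/// Simplified, safe stand-in for the C Data Interface's `ArrowSchema`.
/// Only the fields needed to show the nesting convention are kept.
#[derive(Debug, PartialEq)]
struct SchemaNode {
    /// Arrow format string, e.g. "g" for float64 or "+s" for a struct.
    format: String,
    /// Field name; empty for the top-level struct.
    name: String,
    /// Child fields; one per column for the top-level struct.
    children: Vec<SchemaNode>,
}

/// Describes a whole DataFrame as a single struct-typed array whose children
/// are the columns, given `(column name, Arrow format string)` pairs.
fn dataframe_as_struct(columns: &[(&str, &str)]) -> SchemaNode {
    SchemaNode {
        format: "+s".to_string(),
        name: String::new(),
        children: columns
            .iter()
            .map(|(name, format)| SchemaNode {
                format: format.to_string(),
                name: name.to_string(),
                children: Vec::new(),
            })
            .collect(),
    }
}
```

For example, `dataframe_as_struct(&[("data", "g"), ("time", "l")])` yields one struct node with two children. A real implementation would export a matching `ArrowArray` alongside the schema, with one child array per column.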

@ritchie46
Member

I have the beginnings of a crate that does this for you, hidden behind PyO3 bindings, but I haven't yet had the bandwidth/priority to finish it.

@AnatolyBuga
Contributor

I have a setup of a crate that does this for you hidden behind pyo3 bindings. But haven't yet had the bandwidth/priority to finish this.

@ritchie46 that would be really useful, especially for types beyond Series/DataFrame (like LazyFrame). I can try to help (although I am still a bit of a noob).

@iskandr

iskandr commented Dec 23, 2022

I just want to echo that a succinct example of how to create a PyDataFrame in a new Rust project and pass it back into Python code would be very helpful to me and @andyjslee.

@kylebarron
Contributor

@ritchie46 mentioned on discord: https://github.com/pola-rs/pyo3-polars

@ritchie46
Member

Yes, this is the way to go.

@OliverEvans96

Thanks, the pyo3-polars crate is exactly what I was looking for!
