Skip to content
This repository was archived by the owner on May 9, 2024. It is now read-only.

Add declarations and examples for Python HDK API. #170

Merged
merged 3 commits into from
Feb 14, 2023

Conversation

ienkovich
Copy link
Contributor

This is the first version of PyHDK API according to our latest discussions. There is no implementation, only declarations with inline documentation and some usage examples provided as disabled tests.

Please check if it matches your vision. Once we agree on this PR, I'll start implementation. Let me know if more detailed descriptions or more examples/tests are required to make the decision.

@ienkovich ienkovich force-pushed the ienkovich/declare-python-api branch from 1a6fd79 to e2387d2 Compare January 20, 2023 23:00
Copy link
Contributor

@kurapov-peter kurapov-peter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can we expose run parameters such as "use CPU only"? Are we going to have run() method parameters?

pass

@not_implemented
def cst(self, value, type=None, scale_decimal=True):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cst looks more like a cast, not constant. Do we really want to save 2 letters in a very clear const?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a creation of an expression, not simply a cast. E. g. proj("a") and proj(hdk.cst("a")) are completely different things. type parameter allows us to avoid additional cast. hdk.cst("2020-10-05", "date") is equal to hdk.cst("2020-10-05").cast("date").

I'm used to having const as a reserved word. But we can use it here I guess or create an alias to have both.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about .literal()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about introducing another alias for constants. Representing expression is hdk::ir::Constant, so it makes sense to use similar words in API.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I would probably spell it out then. Worth the extra typing. :)

Copy link
Contributor

@alexbaden alexbaden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall it looks reasonable to me. To @kurapov-peter 's comment, it seems like we need to break out some sort of per-instantiation state (e.g. when you create the hdk object) and per-query state (if you wanted to set per query execution parameters, for example). You could imagine something like:

query_state = hdk.getCurrentQueryState() 
query_state.device = hdk.device.cpu
hdk.sql("SELECT * FROM t;", query_state=query_state)

Would that work with the table aliasing scheme?

pass

@not_implemented
def cst(self, value, type=None, scale_decimal=True):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about .literal()?

pass

@not_implemented
def import_pydict(self, values, table_name=None, fragment_size=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this expected to be used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't expect it to be used in actual workloads but it can be convenient in small experiments. E.g. in my API tests I use it like the following:

hdk = pyhdk.init()
ht = hdk.import_pydict(
    {"a": [1, 2, 3, 4, 5], "b": [5, 4, 3, 2, 1], "x": [1.1, 2.2, 3.3, 4.4, 5.5]}
)
self.check_res(
    ht.proj(a1=ht["a"] // ht["b"], a2=ht["a"] // 2, a3=ht["x"] // 2.0).run(),
    {
        "a1": [0, 0, 1, 2, 5],
        "a2": [0, 1, 1, 2, 2],
        "a3": [0.0, 1.0, 1.0, 2.0, 2.0],
    },
)

In fact, it's simply a shorter version for import_arrow(pyarrow.Table.from_pydict(values)).

@ienkovich
Copy link
Contributor Author

query_state = hdk.getCurrentQueryState() 
query_state.device = hdk.device.cpu
hdk.sql("SELECT * FROM t;", query_state=query_state)

Would that work with the table aliasing scheme?

It should be OK to reserve some arg for query parameters. Do you think we need a dedicated structure for that? Our RelAlgExecutor.execute uses **kwargs to overwrite the current config while filling in execution options. We might add a dict arg like that:

def sql(self, sql_query, exec_opts = dict(), **kwargs):
    sql_query = self._add_aliases(sql_query, **kwargs)
    ra = self._calcite.process(sql_query)
    ra_executor = RelAlgExecutor(self._executor, self._storage, self._data_mgr, ra)
    return ra_executor.execute(**exec_opts)

@alexbaden
Copy link
Contributor

Ah I see, yes a simple dictionary is probably better. Though it would be nice to be able to generate that w/ default values, so I could print and know what was available (or what my config was).

@kurapov-peter
Copy link
Contributor

The **kwargs seem like a natural way to me. Something like exec_strategy="cpu_only" or exec_strategy="cost_based". An interesting option would be to explicitly specify which expression must be executed where. This can break operator pipelines though.

@ienkovich
Copy link
Contributor Author

The **kwargs seem like a natural way to me. Something like exec_strategy="cpu_only" or exec_strategy="cost_based". An interesting option would be to explicitly specify which expression must be executed where. This can break operator pipelines though.

We can use **kwargs for one thing only. Either options or aliases. The options set is fixed and aliases are not, so we should use a single object for options. Dictionary usage would be more convenient for in-place options but cannot provide a list of available options and default values (because they depend on the current config). We can introduce a new class for execution options with the opportunity to get the current default and pretty-print it. Might still accept dictionaries to pass options without extra lines of code.

@ienkovich ienkovich force-pushed the ienkovich/declare-python-api branch from e2387d2 to a78423d Compare February 9, 2023 20:20
@ienkovich ienkovich merged commit 7ed3990 into main Feb 14, 2023
@ienkovich ienkovich deleted the ienkovich/declare-python-api branch February 14, 2023 21:32
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants