
Commit 98b8d8d

Browse files
kevjumba and adchia authored
feat: Feast Spark Offline Store (feast-dev#2349)
* State of feast
* Clean up changes
* Fix random incorrect changes
* Fix lint
* Fix build errors
* Fix lint
* Add spark offline store components to test against current integration tests
* Fix lint
* Rename to pass checks
* Fix issues
* Fix type checking issues
* Fix lint
* Clean up print statements for first review
* Fix lint
* Fix flake 8 lint tests
* Add warnings for alpha version release
* Format
* Address review
* Address review
* Fix lint
* Add file store functionality
* lint
* Add example feature repo
* Update data source creator
* Make cli work for feast init with spark
* Update the docs
* Clean up code
* Clean up more code
* Uncomment repo configs
* Fix setup.py
* Update dependencies
* Fix ci dependencies
* Screwed up rebase
* Screwed up rebase
* Screwed up rebase
* Realign with master
* Fix accidental changes
* Make type map change cleaner
* Address review comments
* Fix tests accidentally broken
* Add comments
* Reformat
* Fix logger
* Remove unused imports
* Fix imports
* Fix CI dependencies
* Prefix destinations with project name
* Update comment
* Fix 3.8
* temporary fix
* rollback
* update
* Update ci?
* Move third party to contrib
* Fix imports
* Remove third_party refactor
* Revert ci requirements and update comment in type map
* Revert 3.8-requirements

Signed-off-by: Kevin Zhang <[email protected]>
Signed-off-by: Danny Chiao <[email protected]>
Co-authored-by: Danny Chiao <[email protected]>
1 parent 74f887f commit 98b8d8d

File tree

21 files changed: +1401 -208 lines

docs/reference/data-sources/README.md

Lines changed: 2 additions & 0 deletions

```diff
@@ -9,3 +9,5 @@ Please see [Data Source](../../getting-started/concepts/feature-view.md#data-sou
 {% page-ref page="bigquery.md" %}
 
 {% page-ref page="redshift.md" %}
+
+{% page-ref page="spark.md" %}
```

docs/reference/data-sources/spark.md

Lines changed: 45 additions & 0 deletions (new file)

# Spark

## Description

**NOTE**: The Spark data source API is currently in alpha development and is not completely stable. The API may change or update in the future.

The Spark data source API allows for the retrieval of historical feature values from file/database sources for building training datasets as well as materializing features into an online store.

* Either a table name, a SQL query, or a file path can be provided.

## Examples

Using a table reference from SparkSession (for example, either in memory or a Hive Metastore):

```python
from feast import SparkSource

my_spark_source = SparkSource(
    table="FEATURE_TABLE",
)
```

Using a query:

```python
from feast import SparkSource

my_spark_source = SparkSource(
    query="SELECT timestamp as ts, created, f1, f2 "
          "FROM spark_table",
)
```

Using a file reference:

```python
from feast import SparkSource

my_spark_source = SparkSource(
    path=f"{CURRENT_DIR}/data/driver_hourly_stats",
    file_format="parquet",
    event_timestamp_column="event_timestamp",
    created_timestamp_column="created",
)
```
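The doc above says a source takes either a table name, a SQL query, or a file path. As a minimal, purely hypothetical sketch (this helper is not part of the Feast codebase), the "exactly one of the three" rule could be enforced like this:

```python
# Hypothetical sketch, NOT Feast code: enforce that exactly one of
# `table`, `query`, or `path` is supplied for a Spark data source.
from typing import Optional


def pick_spark_source_kind(
    table: Optional[str] = None,
    query: Optional[str] = None,
    path: Optional[str] = None,
) -> str:
    # Collect the names of the arguments that were actually provided.
    provided = [
        name
        for name, value in [("table", table), ("query", query), ("path", path)]
        if value is not None
    ]
    if len(provided) != 1:
        raise ValueError(
            f"Exactly one of table, query, or path must be set; got {provided or 'none'}"
        )
    return provided[0]
```

For example, `pick_spark_source_kind(table="FEATURE_TABLE")` returns `"table"`, while supplying none or several of the arguments raises `ValueError`.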

docs/reference/offline-stores/README.md

Lines changed: 2 additions & 0 deletions

```diff
@@ -9,3 +9,5 @@ Please see [Offline Store](../../getting-started/architecture-and-components/off
 {% page-ref page="bigquery.md" %}
 
 {% page-ref page="redshift.md" %}
+
+{% page-ref page="spark.md" %}
```
docs/reference/offline-stores/spark.md

Lines changed: 38 additions & 0 deletions (new file)

# Spark

## Description

The Spark offline store is an offline store currently in alpha development that provides support for reading [SparkSources](../data-sources/spark.md).

## Disclaimer

The Spark offline store does not yet achieve full test coverage and continues to fail some integration tests when integrating with the feast universal test suite. Please do NOT assume complete stability of the API.

* Spark tables and views are allowed as sources that are loaded in from some Spark store (e.g. in Hive or in memory).
* Entity dataframes can be provided as a SQL query or as a Pandas dataframe. A Pandas dataframe will be converted to a Spark dataframe and processed as a temporary view.
* A `SparkRetrievalJob` is returned when calling `get_historical_features()`.
  * This allows you to call
    * `to_df` to retrieve a Pandas dataframe.
    * `to_arrow` to retrieve the dataframe as a PyArrow table.
    * `to_spark_df` to retrieve the dataframe as a Spark dataframe.

## Example

{% code title="feature_store.yaml" %}
```yaml
project: my_project
registry: data/registry.db
provider: local
offline_store:
    type: spark
    spark_conf:
        spark.master: "local[*]"
        spark.ui.enabled: "false"
        spark.eventLog.enabled: "false"
        spark.sql.catalogImplementation: "hive"
        spark.sql.parser.quotedRegexColumnNames: "true"
        spark.sql.session.timeZone: "UTC"
online_store:
    path: data/online_store.db
```
{% endcode %}
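The `spark_conf` block above is a plain key/value mapping. As a rough sketch (not the Feast implementation), each entry could be applied to a Spark session via repeated `SparkSession.builder.config(key, value)` calls; the helper below only flattens the mapping into those pairs so the idea can be shown without a running Spark cluster:

```python
# Illustrative only: flatten a spark_conf mapping (as parsed from
# feature_store.yaml) into the (key, value) pairs that would each be
# handed to SparkSession.builder.config(key, value), one call per entry.
from typing import Dict, List, Tuple


def spark_conf_pairs(spark_conf: Dict[str, str]) -> List[Tuple[str, str]]:
    # Sort for deterministic ordering; Spark itself does not care.
    return sorted(spark_conf.items())


conf = {
    "spark.master": "local[*]",
    "spark.ui.enabled": "false",
    "spark.sql.session.timeZone": "UTC",
}
pairs = spark_conf_pairs(conf)
```

With a real PySpark installation, the equivalent would be chaining `builder = builder.config(key, value)` over these pairs before calling `getOrCreate()`.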

sdk/python/feast/__init__.py

Lines changed: 4 additions & 0 deletions

```diff
@@ -3,6 +3,9 @@
 from pkg_resources import DistributionNotFound, get_distribution
 
 from feast.infra.offline_stores.bigquery_source import BigQuerySource
+from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import (
+    SparkSource,
+)
 from feast.infra.offline_stores.file_source import FileSource
 from feast.infra.offline_stores.redshift_source import RedshiftSource
 from feast.infra.offline_stores.snowflake_source import SnowflakeSource
@@ -47,4 +50,5 @@
     "RedshiftSource",
     "RequestFeatureView",
     "SnowflakeSource",
+    "SparkSource",
 ]
```

sdk/python/feast/cli.py

Lines changed: 3 additions & 1 deletion

```diff
@@ -477,7 +477,9 @@ def materialize_incremental_command(ctx: click.Context, end_ts: str, views: List
 @click.option(
     "--template",
     "-t",
-    type=click.Choice(["local", "gcp", "aws", "snowflake"], case_sensitive=False),
+    type=click.Choice(
+        ["local", "gcp", "aws", "snowflake", "spark"], case_sensitive=False
+    ),
     help="Specify a template for the created project",
     default="local",
 )
```

sdk/python/feast/inference.py

Lines changed: 5 additions & 2 deletions

```diff
@@ -8,6 +8,7 @@
     FileSource,
     RedshiftSource,
     SnowflakeSource,
+    SparkSource,
 )
 from feast.data_source import DataSource
 from feast.errors import RegistryInferenceFailure
@@ -84,7 +85,9 @@ def update_data_sources_with_inferred_event_timestamp_col(
     ):
         # prepare right match pattern for data source
         ts_column_type_regex_pattern = ""
-        if isinstance(data_source, FileSource):
+        if isinstance(data_source, FileSource) or isinstance(
+            data_source, SparkSource
+        ):
             ts_column_type_regex_pattern = r"^timestamp"
         elif isinstance(data_source, BigQuerySource):
             ts_column_type_regex_pattern = "TIMESTAMP|DATETIME"
@@ -97,7 +100,7 @@ def update_data_sources_with_inferred_event_timestamp_col(
             "DataSource",
             """
             DataSource inferencing of event_timestamp_column is currently only supported
-            for FileSource and BigQuerySource.
+            for FileSource, SparkSource, BigQuerySource, RedshiftSource, and SnowflakeSource.
             """,
         )
         # for informing the type checker
```
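The hunk above selects a regex per source type and matches it against each column's type string to infer the event timestamp column. A simplified, self-contained illustration of that pattern selection (not the real `update_data_sources_with_inferred_event_timestamp_col` implementation; the source-kind keys and column-type strings here are made up for the example):

```python
import re

# Simplified illustration of the inference rule: FileSource and SparkSource
# match the column's type string against r"^timestamp", while BigQuerySource
# matches "TIMESTAMP|DATETIME". Keys below are illustrative stand-ins for
# the isinstance checks in the real code.
PATTERNS = {
    "file": r"^timestamp",
    "spark": r"^timestamp",
    "bigquery": r"TIMESTAMP|DATETIME",
}


def is_event_timestamp_type(source_kind: str, column_type: str) -> bool:
    # re.search returns a Match object on success, None otherwise.
    return re.search(PATTERNS[source_kind], column_type) is not None
```

For instance, a Pandas/Spark column typed `"timestamp[ns, tz=UTC]"` matches under the `file`/`spark` rule, while a plain `"string"` column does not.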

sdk/python/feast/infra/offline_stores/contrib/spark_offline_store/__init__.py

Whitespace-only changes.
