
Better document that the Spark step launchers need all project dependencies installed #11476

@slopp

Description


What's the issue or suggestion?

The example code at https://docs.dagster.io/integrations/spark#submitting-pyspark-ops-on-emr leaves out a key requirement: all of the project's Python dependencies need to be installed on the cluster.

For Databricks this can be done in the step launcher config. The API config docs are, unfortunately, rather verbose and do not provide an easy example of the syntax. Here is an example:

https://github.com/dagster-io/hooli-data-eng-pipelines/blob/master/hooli_data_eng/resources/databricks.py#L29-L56
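A minimal sketch of what that linked example does, assuming the `databricks_pyspark_step_launcher` from `dagster-databricks`. The `libraries` list is where the project's Python dependencies get installed on the cluster. The cluster sizing, package names, bucket names, and secret names below are illustrative placeholders, not the exact values from the linked repo:

```python
from pathlib import Path

from dagster_databricks import databricks_pyspark_step_launcher

# Sketch: configure the step launcher so Databricks installs the project's
# Python dependencies before running the step. All concrete values below
# (cluster size, versions, packages, secret names) are placeholders.
step_launcher = databricks_pyspark_step_launcher.configured(
    {
        "run_config": {
            "run_name": "launch_step",
            "cluster": {
                "new": {
                    "size": {"num_workers": 1},
                    "spark_version": "11.2.x-scala2.12",
                    "nodes": {"node_types": {"node_type_id": "i3.xlarge"}},
                }
            },
            # Install every package the Dagster project imports at runtime;
            # without this, steps fail on the cluster with import errors.
            "libraries": [
                {"pypi": {"package": "dagster-aws"}},
                {"pypi": {"package": "dagster-pandas"}},
                {"pypi": {"package": "pandas"}},
            ],
        },
        "databricks_host": {"env": "DATABRICKS_HOST"},
        "databricks_token": {"env": "DATABRICKS_TOKEN"},
        # Path to the local Dagster project so its code is shipped to the cluster.
        "local_pipeline_package_path": str(Path(__file__).parent.parent),
        "staging_prefix": "/dbfs/tmp",
        "storage": {
            "s3": {
                "access_key_key": "aws_access_key",
                "secret_key_key": "aws_secret_key",
                "secret_scope": "dagster-scope",
            }
        },
    }
)
```

The key point for the docs: the `libraries` list must cover everything the project imports, not just `dagster` itself.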

For EMR, all of the Python dependencies in the Dagster project (from setup.py and requirements.txt) need to be installed manually. Normally this installation is done with a bootstrap.sh script, as documented in https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyterhub-install-kernels-libs.html
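A sketch of what such a bootstrap.sh could look like, run as an EMR bootstrap action on every node when the cluster starts. The S3 bucket path and package list are illustrative placeholders:

```bash
#!/bin/bash
# Sketch of an EMR bootstrap action that installs the Dagster project's
# Python dependencies on every cluster node. Bucket path and packages
# below are placeholders for the real project's requirements.
set -euo pipefail

# Pull the project's pinned requirements from S3 and install them.
aws s3 cp s3://my-bucket/bootstrap/requirements.txt /tmp/requirements.txt
sudo python3 -m pip install -r /tmp/requirements.txt

# Install the packages the step launcher itself needs cluster-side.
sudo python3 -m pip install dagster dagster-aws dagster-pyspark
```

The script is uploaded to S3 and referenced as a bootstrap action when the cluster is created; it cannot be applied to an already-running cluster.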

Currently it is painful to figure this out: users report iterating through run launches that fail with cryptic log errors (you need to view stderr to see the actual message), then fixing the missing dependencies package by package, run by run 😱

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

Metadata

Labels: area: docs (Related to documentation in general)