where we could try out ideas before proposing them as fixes to the upstream branch. It has grown into its own toolkit for implementing a reproducible data science workflow, and is the basis of our [Bus Number](https://github.com/hackalog/bus_number/) tutorial on **Reproducible Data Science**.
For most of us, data science is 5% science, 60% data cleaning, and 35% IT hell. Easydata focuses on delivering

* reproducible python environments,
* reproducible datasets, and
* reproducible workflows

in order to get you up and running with your data science quickly, and reproducibly.
Easydata is a python cookiecutter for building custom data science git repos that provides:
* an **opinionated workflow** for collaboration and storytelling,
* a **python framework** to support this workflow,
* a **makefile wrapper** for conda and pip environment management,
* prebuilt **dataset recipes**, and
* a vast library of training materials and documentation around doing reproducible data science.

Easydata is **not**

* an ETL toolkit,
* a data analysis pipeline,
* a containerization solution, or
* a prescribed data format.
### Requirements to use this cookiecutter template:
Once you've installed anaconda, you can install the remaining requirements (including cookiecutter) by doing:

```bash
conda create -n easydata python=3
conda activate easydata
python -m pip install -r requirements.txt
```
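With the requirements installed, you can generate a new project from the template. As a minimal sketch (this assumes you invoke cookiecutter directly against this template's GitHub repo; cookiecutter will then prompt for `project_name`, `repo_name`, `author_name`, and the other template values):

```shell
# Run inside the activated easydata environment;
# answer the prompts to generate your new project directory
cookiecutter https://github.com/hackalog/cookiecutter-easydata
```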
The directory structure of your new project looks like this:

* `catalog`
    * Data catalog. This is where config information such as data sources and data transformations are saved
* `catalog/config.ini`
    * Local Data Store. This configuration file is for local data only, and is never checked into the repo.
* `data`
    * Data directory. Often symlinked to a filesystem with lots of space
* `data/raw`
* …
    * The final, canonical data sets for modeling.
* `docs`
    * A default Sphinx project; see sphinx-doc.org for details
* `framework-docs`
    * Markdown documentation for using Easydata
* `models`
    * Trained and serialized models, model predictions, or model summaries
* `models/trained`
* …
    * Generated summary information to be used in reporting
* `environment.yml`
    * (if using conda) The YAML file for reproducing the analysis environment
* `environment.(platform).lock.yml`
    * Resolved versions, the result of processing `environment.yml`
* `setup.py`
    * Turns contents of `MODULE_NAME` into a pip-installable python module (`pip install -e .`) so it can be …
* …
* `MODULE_NAME/__init__.py`
    * Makes MODULE_NAME a Python module
* `MODULE_NAME/data`
    * Code to fetch raw data and generate Datasets from them
* `MODULE_NAME/analysis`
    * Code to turn datasets into output products
* `tox.ini`
    * Tox file with settings for running tox; see tox.testrun.org

…

In case you need to delete the environment later:
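One way to do this with plain conda, as a sketch (assuming the environment is named after your `repo_name`, as created above; your project's Makefile may also provide its own removal target):

```shell
# Leave the environment before removing it
conda deactivate
# Delete the environment and everything installed in it
conda env remove --name my_repo_name   # substitute your actual repo_name
```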
**`{{ cookiecutter.repo_name }}/README.md`**
{{cookiecutter.project_name}}
==============================
_Author: {{ cookiecutter.author_name }}_

{{cookiecutter.description}}
…

REQUIREMENTS
------------

…

GETTING STARTED
---------------

### Git Configuration and Checking Out the Repo

If you haven't yet done so, please follow the instructions in [Setting up git and Checking Out the Repo](framework-docs/git-configuration.md) in order to check out the code and set up your remote branches.

Note: These instructions assume you are using SSH keys (and not HTTPS authentication) with {{ cookiecutter.upstream_location }}.
If you haven't set up SSH access to {{ cookiecutter.upstream_location }}, see [Configuring SSH Access to {{cookiecutter.upstream_location}}](https://github.com/hackalog/cookiecutter-easydata/wiki/Configuring-SSH-Access-to-Github). This also includes instructions for using more than one account with SSH keys.

Once you've got your local, `origin`, and `upstream` branches configured, you can follow the instructions in this handy [Git Workflow Cheat Sheet](framework-docs/git-workflow.md) to keep your working copy of the repo in sync with the others.
### Setting up your environment

**WARNING**: If you have conda-forge listed as a channel in your `.condarc` (or any other channels other than defaults), remove it during the course of the workshop. Even better, don't use a `.condarc` for managing channels, as it overrides the `environment.yml` instructions and makes things less reproducible. Make the changes to the `environment.yml` file if necessary. We've had some conda-forge related issues with version conflicts. We also recommend [setting your channel priority to 'strict'](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-channels.html) to reduce package incompatibility problems.
**`{{ cookiecutter.repo_name }}/framework-docs/conda-environments.md`**
# Setting up and Maintaining your Conda Environment (Reproducibly)

The `{{ cookiecutter.repo_name }}` repo is set up with template code to make managing your conda environments easy and reproducible. Not only will _future you_ appreciate this, but so will anyone else who needs to work with your code after today.

If you haven't yet, configure your conda environment.

## Configuring your python environment

Easydata uses conda to manage python packages installed by both conda **and pip**.

### Adjust your `.condarc`

**WARNING FOR EXISTING CONDA USERS**: If you have `conda-forge` listed as a channel in your `.condarc` (or any other channels other than `defaults`), **remove them**. These channels should be specified in `environment.yml` instead.

We also recommend [setting your channel priority to 'strict'](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-channels.html) to reduce package incompatibility problems. This will be the default in conda 5.0, but in order to ensure reproducibility, we need to use this behavior now.

```
conda config --set channel_priority strict
```

Whenever possible, re-order your channels so that `defaults` is first.

```
conda config --prepend channels defaults
```

**Note for JupyterHub Users**: You will need to store your conda environments in your **home directory** so that they will be persisted across JupyterHub sessions.

```
conda config --prepend envs_dirs ~/.conda/envs  # Store environments in local dir for JupyterHub
```

### Fix the CONDA_EXE path
* Make note of the path to your conda binary:
```
$ which conda
~/miniconda3/bin/conda
```
* Ensure your `CONDA_EXE` environment variable is set correctly in `Makefile.include`:
```
export CONDA_EXE=~/miniconda3/bin/conda
```
### Create the conda environment

* Create and switch to the virtual environment:
```
cd {{ cookiecutter.repo_name }}
make create_environment
conda activate {{ cookiecutter.repo_name }}
make update_environment
```

**Note**: When creating the environment for the first time, you really do need to run **both** `make create_environment` and `make update_environment` for the `{{ cookiecutter.module_name }}` module to install correctly.

To activate the environment, simply `conda activate {{ cookiecutter.repo_name }}`.

To deactivate it and return to your base environment, use `conda deactivate`.

## Maintaining your Python environment

### Updating your conda and pip environments

The `make` commands `make create_environment` and `make update_environment` are wrappers that allow you to easily manage your conda and pip environments using the `environment.yml` file.

(If you ever forget which `make` command to run, you can run `make` by itself and it will provide a list of the commands that are available.)

When adding packages to your python environment, **do not `pip install` or `conda install` directly**. Always edit `environment.yml` and run `make update_environment` instead.
Your `environment.yml` file will look something like this:
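As an illustrative sketch only (the package names below are hypothetical examples, not the template's actual defaults), an `environment.yml` that manages both conda and pip dependencies has this shape:

```yaml
name: {{ cookiecutter.repo_name }}
channels:
  - defaults
dependencies:
  - python=3
  - pip
  - scikit-learn      # example conda-installable dependency
  - pip:
    - nbval           # example PyPI-only dependency
    - git+https://{{ cookiecutter.upstream_location }}/<my_git_handle>/<package>   # example git-hosted dependency
```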
To add any package available from conda, add it to the end of the list. If you have a PyPI dependency that's not available via conda, add it to the list of pip installable dependencies under ` - pip:`.
You can include any {{ cookiecutter.upstream_location }} python-based project in the `pip` section via `git+https://{{ cookiecutter.upstream_location }}/<my_git_handle>/<package>`.

In particular, if you're working off of a fork or a work-in-progress branch of a repo in {{ cookiecutter.upstream_location }} (say, your personal version of `<package>`), you can change `git+https://{{ cookiecutter.upstream_location }}/<my_git_handle>/<package>` to

* `git+https://{{ cookiecutter.upstream_location }}/<my_git_handle>/<package>.git` to point to the {{cookiecutter.default_branch}} branch of your fork, and
* `git+https://{{ cookiecutter.upstream_location }}/<my_git_handle>/<package>.git@<my branch>` to point to a specific branch.

Once you've made your edits, run `make update_environment` and voilà, you're updated.