Commit 0b185d2

Merge pull request #194 from hackalog/easydata_15
Easydata 1.5
2 parents f7b3b0d + 2ff01e7 commit 0b185d2

22 files changed: +839 -128 lines changed

README.md

Lines changed: 40 additions & 23 deletions
Original file line number | Diff line number | Diff line change
@@ -4,15 +4,30 @@
44

55
# Cookiecutter EasyData
66

7-
_A flexible (but opinionated) toolkit for doing and sharing reproducible data science._
7+
_A python framework and git template for data scientists, teams, and workshop organizers
8+
aimed at making your data science **reproducible**_
89

9-
EasyData started life as an experimental fork of
10-
[cookiecutter-data-science](http://drivendata.github.io/cookiecutter-data-science/)
11-
where we could try out ideas before proposing them as fixes to the upstream branch. It has grown into its own toolkit for implementing a reproducible data science workflow, and is the basis of our [Bus Number](https://github.com/hackalog/bus_number/) tutorial on **Reproducible Data Science**.
10+
For most of us, data science is 5% science, 60% data cleaning, and 35%
11+
IT hell. Easydata focuses on delivering
12+
* reproducible python environments,
13+
* reproducible datasets, and
14+
* reproducible workflows
15+
in order to get you up and running with your data science quickly, and reproducibly.
1216

13-
### Tutorial
14-
For a tutorial on making use of a previous version of this framework (available via the `bus_number` branch), visit:
15-
https://github.com/hackalog/bus_number/
17+
## What is Easydata?
18+
19+
Easydata is a python cookiecutter for building custom data science git repos that provides:
20+
* An **opinionated workflow** for collaboration and storytelling,
21+
* A **python framework** to support this workflow
22+
* A **makefile wrapper** for conda and pip environment management
23+
* prebuilt **dataset recipes**, and
24+
* a vast library of training materials and documentation around doing reproducible data science.
25+
26+
Easydata is **not**
27+
* an ETL toolkit
28+
* a data analysis pipeline
29+
* a containerization solution, or
30+
* a prescribed data format.
1631

1732

1833
### Requirements to use this cookiecutter template:
@@ -22,15 +37,12 @@ For a tutorial on making use of a previous version of this framework (available
2237

2338
- [Cookiecutter Python package](http://cookiecutter.readthedocs.org/en/latest/installation.html) >= 1.4.0: This can be installed with pip or conda, depending on how you manage your Python packages:
2439

25-
``` bash
26-
$ pip install cookiecutter
27-
```
28-
29-
or
40+
Once you've installed Anaconda, you can install the remaining requirements (including cookiecutter) by doing:
3041

31-
``` bash
32-
$ conda config --add channels conda-forge
33-
$ conda install cookiecutter
42+
```bash
43+
conda create -n easydata python=3
44+
conda activate easydata
45+
python -m pip install -r requirements.txt
3446
```
3547

3648

@@ -54,6 +66,8 @@ The directory structure of your new project looks like this:
5466
* `catalog`
5567
* Data catalog. This is where config information such as data sources
5668
and data transformations are saved
69+
* `catalog/config.ini`
70+
* Local Data Store. This configuration file is for local data only, and is never checked into the repo.
5771
* `data`
5872
* Data directory. often symlinked to a filesystem with lots of space
5973
* `data/raw`
@@ -64,6 +78,8 @@ The directory structure of your new project looks like this:
6478
* The final, canonical data sets for modeling.
6579
* `docs`
6680
* A default Sphinx project; see sphinx-doc.org for details
81+
* `framework-docs`
82+
* Markdown documentation for using Easydata
6783
* `models`
6884
* Trained and serialized models, model predictions, or model summaries
6985
* `models/trained`
@@ -86,6 +102,8 @@ The directory structure of your new project looks like this:
86102
* Generated summary information to be used in reporting
87103
* `environment.yml`
88104
* (if using conda) The YAML file for reproducing the analysis environment
105+
* `environment.(platform).lock.yml`
106+
* resolved versions, result of processing `environment.yml`
89107
* `setup.py`
90108
* Turns contents of `MODULE_NAME` into a
91109
pip-installable python module (`pip install -e .`) so it can be
@@ -95,15 +113,9 @@ The directory structure of your new project looks like this:
95113
* `MODULE_NAME/__init__.py`
96114
* Makes MODULE_NAME a Python module
97115
* `MODULE_NAME/data`
98-
* Scripts to fetch or generate data. In particular:
99-
* `MODULE_NAME/data/make_dataset.py`
100-
* Run with `python -m MODULE_NAME.data.make_dataset fetch`
101-
or `python -m MODULE_NAME.data.make_dataset process`
116+
* code to fetch raw data and generate Datasets from them
102117
* `MODULE_NAME/analysis`
103-
* Scripts to turn datasets into output products
104-
* `MODULE_NAME/models`
105-
* Scripts to train models and then use trained models to make predictions.
106-
e.g. `predict_model.py`, `train_model.py`
118+
* code to turn datasets into output products
107119
* `tox.ini`
108120
* tox file with settings for running tox; see tox.testrun.org
109121

@@ -128,3 +140,8 @@ In case you need to delete the environment later:
128140
conda deactivate
129141
make delete_environment
130142
```
143+
144+
145+
## History
146+
Early versions of Easydata were based on
147+
[cookiecutter-data-science](http://drivendata.github.io/cookiecutter-data-science/).

cookiecutter.json

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,10 +1,12 @@
11
{
22
"project_name": "project_name",
33
"repo_name": "{{ cookiecutter.project_name.lower().replace(' ', '_') }}",
4+
"default_branch": ["master", "main"],
45
"module_name": "src",
56
"author_name": "Your name (or your organization/company/team)",
67
"description": "A short description of this project.",
78
"open_source_license": ["MIT", "BSD-2-Clause", "Proprietary"],
89
"python_version": ["3.7", "3.6", "latest", "3.8"],
9-
"conda_path": "~/anaconda3/bin/conda"
10+
"conda_path": "~/anaconda3/bin/conda",
11+
"upstream_location": ["github.com", "gitlab.com", "bitbucket.org", "your-custom-repo"]
1012
}
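
The `repo_name` default above is a Jinja expression that calls ordinary Python string methods on `project_name`. A minimal sketch of the same transformation in plain Python may make the resulting default easier to predict; the function name `default_repo_name` is hypothetical and is not part of cookiecutter or Easydata:

```python
def default_repo_name(project_name: str) -> str:
    """Mirror the template expression: lowercase, then replace spaces with underscores."""
    # Only spaces are replaced; hyphens and other punctuation are left as-is.
    return project_name.lower().replace(" ", "_")


print(default_repo_name("My Data Project"))
```

Because only spaces are rewritten, a project name containing other characters that are awkward in a directory or module name would likely need a hand-edited `repo_name` at the prompt.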

{{ cookiecutter.repo_name }}/.post-create-environment.txt

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,8 @@
1+
2+
Now would be a good time to initialize a git repo; i.e.
3+
>>> git init
4+
>>> git add .
5+
>>> git commit -m 'initial import'
16
>>> git branch easydata # tag for future easydata upgrades
27

38
NOTE: By default, raw data is installed and unpacked in the

{{ cookiecutter.repo_name }}/Makefile

Lines changed: 2 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -27,11 +27,9 @@ unfinished:
2727
#
2828

2929
.PHONY: data
30-
## convert raw datasets into fully processed datasets
3130
data: transform_data
3231

3332
.PHONY: sources
34-
## Fetch, Unpack, and Process raw DataSources
3533
sources: process_sources
3634

3735
.PHONY: fetch_sources
@@ -56,7 +54,6 @@ process_sources: .make.process_sources
5654
touch .make.process_sources
5755

5856
.PHONY: transform_data
59-
## Apply Transformations to produce fully processed Datsets
6057
transform_data: .make.transform_data
6158

6259
.make.transform_data: .make.process_sources
@@ -71,17 +68,14 @@ clean:
7168
rm -f .make.*
7269

7370
.PHONY: clean_interim
74-
## Delete all interim (DataSource) files
7571
clean_interim:
7672
rm -rf data/interim/*
7773

7874
.PHONY: clean_raw
79-
## Delete the raw downloads directory
8075
clean_raw:
8176
rm -f data/raw/*
8277

8378
.PHONY: clean_processed
84-
## Delete all processed datasets
8579
clean_processed:
8680
rm -f data/processed/*
8781

@@ -103,7 +97,7 @@ lint:
10397
flake8 $(MODULE_NAME)
10498

10599
.PHONY: debug
106-
## Give a report on current status
100+
## dump useful debugging information to $(DEBUG_FILE)
107101
debug:
108102
@echo "\n\n======================"
109103
@echo "\nPlease include the contents $(DEBUG_FILE) when submitting an issue or support request.\n"
@@ -155,7 +149,7 @@ debug:
155149

156150
print-% : ; @echo $* = $($*)
157151

158-
HELP_VARS := PROJECT_NAME
152+
HELP_VARS := PROJECT_NAME DEBUG_FILE
159153

160154
help-prefix:
161155
@echo "To get started:"

{{ cookiecutter.repo_name }}/Makefile.envs

Lines changed: 8 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -14,17 +14,14 @@ else
1414
endif
1515

1616
.PHONY: create_environment
17-
## Set up virtual environment for this project
17+
## Set up virtual (conda) environment for this project
1818
create_environment: environment.$(ARCH).lock.yml
1919
ifeq (conda,$(VIRTUALENV))
20-
$(CONDA_EXE) env update -n $(PROJECT_NAME) -f environment.$(ARCH).lock.yml
20+
@touch environment.yml
21+
@echo
2122
@echo "New conda env created. Activate with:"
2223
@echo ">>> conda activate $(PROJECT_NAME)"
23-
@echo
24-
@echo "Now would be a good time to initialize a git repo; i.e."
25-
@echo ">>> git init"
26-
@echo ">>> git add ."
27-
@echo ">>> git commit -m 'initial import'"
24+
@echo ">>> make update_environment"
2825
ifneq ("X$(wildcard .post-create-environment.txt)","X")
2926
@cat .post-create-environment.txt
3027
endif
@@ -33,11 +30,12 @@ else
3330
endif
3431

3532
.PHONY: delete_environment
36-
## Delete the virtual environment for this project
33+
## Delete the virtual (conda) environment for this project
3734
delete_environment:
3835
ifeq (conda,$(VIRTUALENV))
3936
@echo "Deleting conda environment."
4037
$(CONDA_EXE) env remove -n $(PROJECT_NAME)
38+
rm environment.$(ARCH).lock.yml
4139
ifneq ("X$(wildcard .post-delete-environment.txt)","X")
4240
@cat .post-delete-environment.txt
4341
endif
@@ -46,17 +44,16 @@ else
4644
endif
4745

4846
.PHONY: update_environment
49-
## Install or update Python Dependencies
47+
## Install or update Python Dependencies in the virtual (conda) environment
5048
update_environment: test_environment environment.$(ARCH).lock.yml
5149
ifneq ("X$(wildcard .post-update-environment.txt)","X")
5250
@cat .post-update-environment.txt
5351
endif
5452

5553
.PHONY: test_environment
56-
## Test python environment is set-up correctly
5754
test_environment:
5855
ifeq (conda,$(VIRTUALENV))
59-
ifneq (${CONDA_DEFAULT_ENV}, $(PROJECT_NAME))
56+
ifneq ($(notdir ${CONDA_DEFAULT_ENV}), $(PROJECT_NAME))
6057
$(error Must activate `$(PROJECT_NAME)` environment before proceeding)
6158
endif
6259
else
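
The `test_environment` change above compares only the last path component of `CONDA_DEFAULT_ENV` against the project name. A rough Python analogue of that check follows. It assumes (the diff does not state the motivation) that the goal is to accept environments that conda reports by full path, such as ones stored under a custom `envs_dirs`; `env_matches` is a hypothetical name used only for illustration:

```python
import os.path


def env_matches(conda_default_env: str, project_name: str) -> bool:
    """Approximate `$(notdir ${CONDA_DEFAULT_ENV})` == `$(PROJECT_NAME)`."""
    # Like make's notdir, basename keeps only the final path component,
    # so a bare name and a full path to the same environment both match.
    return os.path.basename(conda_default_env) == project_name


print(env_matches("/home/user/.conda/envs/my_project", "my_project"))
print(env_matches("base", "my_project"))
```

With the previous direct comparison, the first case would have been rejected even though the correct environment was active.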

{{ cookiecutter.repo_name }}/README.md

Lines changed: 11 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
{{cookiecutter.project_name}}
22
==============================
3-
_Author: {{ cookiecutter.author_name }}
3+
_Author: {{ cookiecutter.author_name }}_
44

55
{{cookiecutter.description}}
66

@@ -20,18 +20,16 @@ REQUIREMENTS
2020

2121
GETTING STARTED
2222
---------------
23-
### Checking out the repo
24-
Note: These instructions assume you are using SSH keys (and not HTTPS authentication) with github.
25-
If you haven't set up SSH access to GitHub, see [Configuring SSH Access to Github](https://github.com/hackalog/cookiecutter-easydata/wiki/Configuring-SSH-Access-to-Github). This also includes instuctions for using more than one account with SSH keys.
26-
27-
1. Fork the repo (on GitHub) to your personal account
28-
1. Clone your fork to your local machine
29-
`git clone git@github.com:<your github handle>/{{cookiecutter.project_name}}.git`
30-
1. Add the main source repo as a remote branch called `upstream` (to make syncing easier):
31-
`cd {{cookiecutter.project_name}}`
32-
`git remote add upstream git@github.com:<upstream-repo>/{{cookiecutter.project_name}}.git`
33-
34-
You're all set for staying up-to-date with the project repo. Follow the instructions in this handy [Github Workflow Cheat Sheet](https://github.com/hackalog/cookiecutter-easydata/wiki/Github-Workflow-Cheat-Sheet) for keeping your working copy of the repo in sync.
23+
### Git Configuration and Checking Out the Repo
24+
25+
If you haven't yet done so, please follow the instructions
26+
in [Setting up git and Checking Out the Repo](framework-docs/git-configuration.md) in
27+
order to check out the code and set up your remote branches.
28+
29+
Note: These instructions assume you are using SSH keys (and not HTTPS authentication) with {{ cookiecutter.upstream_location }}.
30+
If you haven't set up SSH access to {{ cookiecutter.upstream_location }}, see [Configuring SSH Access to {{cookiecutter.upstream_location}}](https://github.com/hackalog/cookiecutter-easydata/wiki/Configuring-SSH-Access-to-Github). This also includes instructions for using more than one account with SSH keys.
31+
32+
Once you've got your local, `origin`, and `upstream` branches configured, you can follow the instructions in this handy [Git Workflow Cheat Sheet](framework-docs/git-workflow.md) to keep your working copy of the repo in sync with the others.
3533

3634
### Setting up your environment
3735
**WARNING**: If you have conda-forge listed as a channel in your `.condarc` (or any other channels other than defaults), remove it during the course of the workshop. Even better, don't use a `.condarc` for managing channels, as it overrides the `environment.yml` instructions and makes things less reproducible. Make the changes to the `environment.yml` file if necessary. We've had some conda-forge related issues with version conflicts. We also recommend [setting your channel priority to 'strict'](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-channels.html) to reduce package incompatibility problems.

{{ cookiecutter.repo_name }}/environment.yml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -34,4 +34,5 @@ dependencies:
3434
- pandas
3535
- requests
3636
- pathlib
37+
- fsspec
3738
{{ pyver()|indent(2, true) }}

{{ cookiecutter.repo_name }}/framework-docs/conda-environments.md

Lines changed: 40 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -1,40 +1,64 @@
11
# Setting up and Maintaining your Conda Environment (Reproducibly)
22

3-
The `{{ cookiecutter.repo_name }}` repo is set up with template code to make managing your conda environments easy and reproducible. Not only will future you appreciate this, but everyone else who tries to run your code will thank you.
3+
The `{{ cookiecutter.repo_name }}` repo is set up with template code to make managing your conda environments easy and reproducible. Not only will _future you_ appreciate this, but so will anyone else who needs to work with your code after today.
44

5-
If you haven't yet, get your initial environment set up.
5+
If you haven't yet, configure your conda environment.
66

7-
### Quickstart Instructions
8-
**WARNING FOR EXISTING CONDA USERS**: If you have conda-forge listed as a channel in your `.condarc` (or any other channels other than defaults), remove it during the course of the project. Even better, don't use a `.condarc` for managing channels, as it overrides the `environment.yml` instructions and makes things less reproducible. Make the changes to the `environment.yml` file if necessary. We've had some conda-forge related issues with version conflicts.
7+
## Configuring your python environment
8+
Easydata uses conda to manage python packages installed by both conda **and pip**.
99

10-
We also recommend [setting your channel priority to 'strict'](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-channels.html) to reduce package incompatibility problems. This will be default in future conda releases, but it is being rolled out gently.
10+
### Adjust your `.condarc`
11+
**WARNING FOR EXISTING CONDA USERS**: If you have `conda-forge` listed as a channel in your `.condarc` (or any channels other than `defaults`), **remove them**. These channels should be specified in `environment.yml` instead.
1112

13+
We also recommend [setting your channel priority to 'strict'](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-channels.html) to reduce package incompatibility problems. This will be the default in conda 5.0, but to ensure reproducibility, we need to use this behavior now.
14+
15+
```
16+
conda config --set channel_priority strict
17+
```
18+
Whenever possible, re-order your channels so that `defaults` is first.
19+
20+
```
21+
conda config --prepend channels defaults
22+
```
23+
24+
**Note for JupyterHub Users**: You will need to store your conda environments in your **home directory** so that they will be persisted across JupyterHub sessions.
25+
```
26+
conda config --prepend envs_dirs ~/.conda/envs # Store environments in local dir for JupyterHub
27+
```
28+
29+
### Fix the CONDA_EXE path
1230
* Make note of the path to your conda binary:
1331
```
1432
$ which conda
1533
~/miniconda3/bin/conda
1634
```
17-
* ensure your `CONDA_EXE` environment variable is set to this value (or edit `Makefile.include` directly)
35+
* ensure your `CONDA_EXE` environment variable is set correctly in `Makefile.include`
1836
```
1937
export CONDA_EXE=~/miniconda3/bin/conda
2038
```
39+
### Create the conda environment
2140
* Create and switch to the virtual environment:
2241
```
2342
cd {{ cookiecutter.repo_name }}
2443
make create_environment
2544
conda activate {{ cookiecutter.repo_name }}
2645
make update_environment
2746
```
28-
Note: you need to run `make update_environment` for the `{{ cookiecutter.module_name }}` module to install correctly.
47+
**Note**: When creating the environment the first time, you really do need to run **both** `make create_environment` and `make update_environment` for the `{{ cookiecutter.module_name }}` module to install correctly.
48+
49+
To activate the environment, simply `conda activate {{ cookiecutter.repo_name }}`
50+
51+
To deactivate it and return to your base environment, use `conda deactivate`
52+
53+
## Maintaining your Python environment
2954

30-
From here on, to use the environment, simply `conda activate {{ cookiecutter.repo_name }}` and `conda deactivate` to go back to the base environment.
55+
### Updating your conda and pip environments
56+
The `make` commands, `make create_environment` and `make update_environment` are wrappers that allow you to easily manage your conda and pip environments using the `environment.yml` file.
3157

32-
### Further Instructions
58+
(If you ever forget which `make` command to run, you can run `make` by itself and it will provide a list of commands that are available.)
3359

34-
#### Updating your environment
35-
The `make` commands, `make create_environment` and `make update_environment` are wrappers that allow you to easily manage your environment using the `environment.yml` file. If you want to make changes to your environment, do so by editing the `environment.yml` file and then running `make update_environment`.
3660

37-
If you ever forget which make command to run, you can run `make` and it will list a magic menu of which make commands are available.
61+
When adding packages to your python environment, **do not `pip install` or `conda install` directly**. Always edit `environment.yml` and run `make update_environment` instead.
3862

3963
Your `environment.yml` file will look something like this:
4064
```
@@ -64,12 +88,12 @@ name: {{ cookiecutter.repo_name }}
6488
```
6589
To add any package available from conda, add it to the end of the list. If you have a PyPI dependency that's not available via conda, add it to the list of pip installable dependencies under ` - pip:`.
6690

67-
You can include any GitHub python-based project in the `pip` section via `git+https://github.com/<github handle>/<package>`.
91+
You can include any {{ cookiecutter.upstream_location }} python-based project in the `pip` section via `git+https://{{ cookiecutter.upstream_location }}/<my_git_handle>/<package>`.
6892

69-
In particular, if you're working off of a fork or a work in progress branch of a repo in GitHub (say, your personal version of <package>), you can change `git+https://github.com/<github handle>/<package>` to
93+
In particular, if you're working off of a fork or a work in progress branch of a repo in {{ cookiecutter.upstream_location }} (say, your personal version of <package>), you can change `git+https://{{ cookiecutter.upstream_location }}/<my_git_handle>/<package>` to
7094

71-
* `git+https://github.com/<my github handle>/<package>.git` to point to the master branch of your fork and
72-
* `git+https://github.com/<my github handle>/<package>.git@<my branch>` to point to a specific branch.
95+
* `git+https://{{ cookiecutter.upstream_location }}/<my_git_handle>/<package>.git` to point to the {{cookiecutter.default_branch}} branch of your fork and
96+
* `git+https://{{ cookiecutter.upstream_location }}/<my_git_handle>/<package>.git@<my branch>` to point to a specific branch.
7397

7498
Once you're done your edits, run `make update_environment` and voila, you're updated.
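
The URL patterns described above can be assembled mechanically. The following is a small illustrative sketch, not part of Easydata: the helper name `pip_git_requirement` and the example host, handle, package, and branch values are all hypothetical, and the output is simply the string you would place under ` - pip:` in `environment.yml`:

```python
from typing import Optional


def pip_git_requirement(host: str, handle: str, package: str,
                        branch: Optional[str] = None) -> str:
    """Build a pip requirement string pointing at a git repo over HTTPS."""
    url = f"git+https://{host}/{handle}/{package}.git"
    if branch:
        # An "@<branch>" suffix pins the install to a specific branch;
        # without it, the repository's default branch is used.
        url += f"@{branch}"
    return url


print(pip_git_requirement("github.com", "my_git_handle", "some_package"))
print(pip_git_requirement("github.com", "my_git_handle", "some_package", "my-branch"))
```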
7599
