Dataset improvements #2

Merged (2 commits, Jul 13, 2018)
README.md: 145 changes (79 additions, 66 deletions)
@@ -1,14 +1,20 @@
# Cookiecutter Data Science
# Cookiecutter EasyData

_A logical, reasonably standardized, but flexible project structure for doing and sharing data science work._

This is an experimental fork of
[cookiecutter-data-science](http://drivendata.github.io/cookiecutter-data-science/)
where I try out ideas before proposing them for inclusion upstream.

#### [Project homepage](http://drivendata.github.io/cookiecutter-data-science/)


### Requirements to use the cookiecutter template:
### Requirements to use this cookiecutter template:
-----------
- Python 2.7 or 3.5
- anaconda (or miniconda)

- python3. Technically, it still prompts for a choice between python and python3,
  but I'm aiming to deprecate this and move all python version support
  to use either conda or pipenv.

- [Cookiecutter Python package](http://cookiecutter.readthedocs.org/en/latest/installation.html) >= 1.4.0: This can be installed with pip or conda, depending on how you manage your Python packages:

``` bash
@@ -26,73 +32,80 @@ $ conda install cookiecutter
### To start a new project, run:
------------

cookiecutter https://github.com/drivendata/cookiecutter-data-science


[![asciicast](https://asciinema.org/a/9bgl5qh17wlop4xyxu9n9wr02.png)](https://asciinema.org/a/9bgl5qh17wlop4xyxu9n9wr02)
cookiecutter https://github.com/hackalog/cookiecutter-easydata


### The resulting directory structure
------------

The directory structure of your new project looks like this:

```
├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
├── models             <- Trained and serialized models, model predictions, or model summaries
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
└── tox.ini            <- tox file with settings for running tox; see tox.testrun.org
```
The directory structure of your new project looks like this:


* `LICENSE`
* `Makefile`
    * Top-level Makefile. Type `make` for a list of valid commands.
* `README.md`
    * This file.
* `data`
    * Data directory, often symlinked to a filesystem with lots of space.
* `data/raw`
    * Raw (immutable) hash-verified downloads.
* `data/interim`
    * Extracted and interim data representations.
* `data/processed`
    * The final, canonical data sets for modeling.
* `docs`
    * A default Sphinx project; see sphinx-doc.org for details.
* `models`
    * Trained and serialized models, model predictions, or model summaries.
* `notebooks`
    * Jupyter notebooks. Naming convention is a number (for ordering),
      the creator's initials, and a short `-` delimited description,
      e.g. `1.0-jqp-initial-data-exploration`.
* `references`
    * Data dictionaries, manuals, and all other explanatory materials.
* `reports`
    * Generated analysis as HTML, PDF, LaTeX, etc.
* `reports/figures`
    * Generated graphics and figures to be used in reporting.
* `requirements.txt`
    * (if using pip+virtualenv) The requirements file for reproducing the
      analysis environment, e.g. generated with `pip freeze > requirements.txt`.
* `environment.yml`
    * (if using conda) The YAML file for reproducing the analysis environment.
* `setup.py`
    * Turns the contents of `MODULE_NAME` into a pip-installable python module
      (`pip install -e .`) so it can be imported in python code
      (see the sketch after this list).
* `MODULE_NAME`
    * Source code for use in this project.
* `MODULE_NAME/__init__.py`
    * Makes `MODULE_NAME` a Python module.
* `MODULE_NAME/data`
    * Scripts to fetch or generate data. In particular:
* `MODULE_NAME/data/make_dataset.py`
    * Run with `python -m MODULE_NAME.data.make_dataset fetch`
      or `python -m MODULE_NAME.data.make_dataset process`.
* `MODULE_NAME/features`
    * Scripts to turn raw data into features for modeling, notably `build_features.py`.
* `MODULE_NAME/models`
    * Scripts to train models and then use trained models to make predictions,
      e.g. `predict_model.py`, `train_model.py`.
* `MODULE_NAME/visualization`
    * Scripts to create exploratory and results oriented visualizations, e.g.
      `visualize.py`.
* `tox.ini`
    * tox file with settings for running tox; see tox.testrun.org
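
As a quick orientation, here is a minimal sketch of how these pieces are meant to fit together once a project is generated. It assumes the datasets module added later in this PR ends up importable as `MODULE_NAME.data.datasets` and uses a placeholder dataset name; neither is prescribed by the template itself.

```python
# Minimal sketch; MODULE_NAME and 'my-dataset' are placeholders, not part of this PR.
# After `pip install -e .`, the project module is importable from anywhere:
from MODULE_NAME.data.datasets import load_dataset   # assumed location of the datasets module

# First call fetches, processes, and caches the dataset; later calls reuse the cache.
dset = load_dataset('my-dataset')
print(dset['DESCR'])                                  # description text, if the load function provides one
X, y = load_dataset('my-dataset', return_X_y=True)    # scikit-learn style (data, target)
```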

## TODO

* Add pipenv support
* Remove python2 support (python2 can be supported via a pipenv/conda environment
  if absolutely needed)

## Contributing

We welcome contributions! [See the docs for guidelines](https://drivendata.github.io/cookiecutter-data-science/#contributing).

### Installing development requirements
------------

pip install -r requirements.txt

### Running the tests
------------

py.test tests
```
make requirements
```
@@ -1,10 +1,15 @@
import cv2
import glob
import logging
import os
import pathlib
import pandas as pd
import numpy as np
import json
from sklearn.datasets.base import Bunch
from scipy.io import loadmat
from functools import partial
from joblib import Memory
import joblib
import sys

from .utils import fetch_and_unpack, get_dataset_filename
@@ -14,19 +19,37 @@
_MODULE_DIR = pathlib.Path(os.path.dirname(os.path.abspath(__file__)))
logger = logging.getLogger(__name__)

jlmem = Memory(cachedir=str(interim_data_path))

def new_dataset(*, dataset_name):
"""Return an unpopulated dataset object.

Fills in LICENSE and DESCR if they are present.
Takes metadata from the url_list object if present. Otherwise, if
`*.license` or `*.readme` files are present in the module directory,
these will be used as LICENSE and DESCR respectively.
"""
global dataset_raw_files

dset = Bunch()
dset['metadata'] = {}
dset['LICENSE'] = None
dset['DESCR'] = None

filemap = {
'LICENSE': f'{dataset_name}.license',
'DESCR': f'{dataset_name}.readme'
}

# read metadata from disk if present
for metadata_type in filemap:
metadata_file = _MODULE_DIR / filemap[metadata_type]
if metadata_file.exists():
with open(metadata_file, 'r') as fd:
dset[metadata_type] = fd.read()

# Use downloaded metadata if available
ds = dataset_raw_files[dataset_name]
for fetch_dict in ds.get('url_list', []):
name = fetch_dict.get('name', None)
# if metadata is present in the URL list, use it
if name in ['DESCR', 'LICENSE']:
txtfile = get_dataset_filename(fetch_dict)
with open(raw_data_path / txtfile, 'r') as fr:
@@ -47,28 +70,88 @@ def add_dataset_by_urllist(dataset_name, url_list):
dataset_raw_files = read_datasets()
return dataset_raw_files[dataset_name]

@jlmem.cache
def load_dataset(dataset_name, return_X_y=False, **kwargs):
def add_dataset_metadata(dataset_name, from_file=None, from_str=None, kind='DESCR'):
"""Add metadata to a dataset

from_file: create metadata entry from contents of this file
from_str: create metadata entry from this string
kind: {'DESCR', 'LICENSE'}
"""
global dataset_raw_files

filename_map = {
'DESCR': f'{dataset_name}.readme',
'LICENSE': f'{dataset_name}.license',
}

if dataset_name not in dataset_raw_files:
raise Exception(f'No such dataset: {dataset_name}')

if kind not in filename_map:
raise Exception(f'Unknown kind: {kind}. Must be one of {filename_map.keys()}')

if from_file is not None:
with open(from_file, 'r') as fd:
meta_txt = fd.read()
elif from_str is not None:
meta_txt = from_str
else:
raise Exception(f'One of `from_file` or `from_str` is required')

with open(_MODULE_DIR / filename_map[kind], 'w') as fw:
fw.write(meta_txt)
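
As a usage note (not part of the diff), a minimal sketch of how `add_dataset_metadata` might be called; the dataset name and file path are hypothetical:

```python
# Hypothetical usage sketch: attach license and description text to a registered dataset.
# 'my-dataset' and 'docs/my-dataset-license.txt' are illustrative names only.
add_dataset_metadata('my-dataset', from_file='docs/my-dataset-license.txt', kind='LICENSE')
add_dataset_metadata('my-dataset', from_str='A short human-readable description.', kind='DESCR')
# The text lands in <module dir>/my-dataset.license and <module dir>/my-dataset.readme,
# which new_dataset() reads back on the next load.
```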

def load_dataset(dataset_name, return_X_y=False, force=False, **kwargs):
'''Loads a scikit-learn style dataset

dataset_name:
Name of dataset to load
return_X_y: boolean, default=False
if True, returns (data, target) instead of a Bunch object
force: boolean
if True, do complete fetch/process cycle. If False, will use cached object (if present)
'''

if dataset_name not in dataset_raw_files:
raise Exception(f'Unknown Dataset: {dataset_name}')

fetch_and_unpack(dataset_name)

dset = dataset_raw_files[dataset_name]['load_function'](**kwargs)
# check for cached version
cache_file = processed_data_path / f'{dataset_name}.jlib'
if cache_file.exists() and force is not True:
dset = joblib.load(cache_file)
else:
# no cache. Regenerate
fetch_and_unpack(dataset_name)
dset = dataset_raw_files[dataset_name]['load_function'](**kwargs)
with open(cache_file, 'wb') as fo:
joblib.dump(dset, fo)

if return_X_y:
return dset.data, dset.target
else:
return dset
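
Likewise, a short sketch of the new cache-aware `load_dataset` behaviour; the dataset name is again hypothetical:

```python
# Hypothetical usage sketch ('my-dataset' is illustrative).
dset = load_dataset('my-dataset')                    # fetch, process, and cache on first call
dset = load_dataset('my-dataset')                    # later calls load the cached .jlib file
X, y = load_dataset('my-dataset', return_X_y=True)   # scikit-learn style (data, target) tuple
dset = load_dataset('my-dataset', force=True)        # bypass the cache and regenerate
```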

def read_space_delimited(filename, skiprows=None, class_labels=True):
"""Read an space-delimited file

skiprows: list of rows to skip when reading the file.

Note: we can't use automatic comment detection, as
`#` characters are also used as data labels.
class_labels: boolean
if true, the last column is treated as the class label
"""
with open(filename, 'r') as fd:
df = pd.read_table(fd, skiprows=skiprows, skip_blank_lines=True, comment=None, header=None, sep=' ', dtype=str)
# targets are last column. Data is everything else
if class_labels is True:
target = df.loc[:,df.columns[-1]].values
data = df.loc[:,df.columns[:-1]].values
else:
data = df.values
target = np.zeros(data.shape[0])
return data, target
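
A small sketch of `read_space_delimited` under both settings of `class_labels`; the filename is hypothetical:

```python
# Hypothetical usage sketch ('data/raw/example.dat' is illustrative).
# With class_labels=True (the default), the last column becomes the target.
data, target = read_space_delimited('data/raw/example.dat', skiprows=[0])

# With class_labels=False, every column is data and target is a zero vector.
data, target = read_space_delimited('data/raw/example.dat', class_labels=False)
```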

def write_dataset(path=None, filename="datasets.json", indent=4, sort_keys=True):
"""Write a serialized (JSON) dataset file"""
if path is None:
@@ -7,6 +7,7 @@
import shutil
import zipfile
import gzip
import zlib

from ..paths import interim_data_path, raw_data_path

@@ -226,12 +227,18 @@ def unpack(filename, dst_dir=None, create_dst=True):
elif path.endswith('.gz'):
opener, mode = gzip.open, 'rb'
outfile, outmode = path[:-3], 'wb'
elif path.endswith('.Z'):
logger.warning(".Z files are only supported on systems that ship with gzip. Trying...")
os.system(f'gzip -d {path}')
opener, mode = open, 'rb'
path = path[:-2]
outfile, outmode = path, 'wb'
else:
opener, mode = open, 'rb'
outfile, outmode = path, 'wb'
logger.info("No compression detected. Copying...")

with opener(filename, mode) as f_in:
with opener(path, mode) as f_in:
if archive:
logger.info(f"Extracting {filename.name}")
f_in.extractall(path=dst_dir)
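
Finally, a brief sketch of how the updated `unpack` might be invoked; the archive name is hypothetical:

```python
# Hypothetical usage sketch ('example.txt.Z' is illustrative).
# On systems that ship with gzip, the .Z file is decompressed in place via `gzip -d`
# and the resulting file is then handled by the normal copy/extract path.
unpack(raw_data_path / 'example.txt.Z', dst_dir=interim_data_path)
```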