diff --git a/README.md b/README.md
index dff8cf3..5113c4f 100644
--- a/README.md
+++ b/README.md
@@ -1,14 +1,20 @@
-# Cookiecutter Data Science
+# Cookiecutter EasyData
 
 _A logical, reasonably standardized, but flexible project structure for doing and sharing data science work._
 
+This is an experimental fork of
+[cookiecutter-data-science](http://drivendata.github.io/cookiecutter-data-science/)
+where I try out ideas before proposing them for inclusion upstream.
 
-#### [Project homepage](http://drivendata.github.io/cookiecutter-data-science/)
-
-### Requirements to use the cookiecutter template:
+### Requirements to use this cookiecutter template:
 -----------
- - Python 2.7 or 3.5
+ - anaconda (or miniconda)
+
+ - python3. Technically, it still prompts for a choice between python and python3,
+   but I'm aiming to deprecate this and move all python version support
+   to either conda or pipenv
+
 - [Cookiecutter Python package](http://cookiecutter.readthedocs.org/en/latest/installation.html) >= 1.4.0: This can be installed with pip or conda, depending on how you manage your Python packages:
 
 ``` bash
 $ pip install cookiecutter
 ```
@@ -26,73 +32,80 @@ $ conda install cookiecutter
 
 ### To start a new project, run:
 ------------
 
-    cookiecutter https://github.com/drivendata/cookiecutter-data-science
-
-
-[![asciicast](https://asciinema.org/a/9bgl5qh17wlop4xyxu9n9wr02.png)](https://asciinema.org/a/9bgl5qh17wlop4xyxu9n9wr02)
+    cookiecutter https://github.com/hackalog/cookiecutter-easydata
 
 
 ### The resulting directory structure
 ------------
 
-The directory structure of your new project looks like this:
-
-```
-├── LICENSE
-├── Makefile           <- Makefile with commands like `make data` or `make train`
-├── README.md          <- The top-level README for developers using this project.
-├── data
-│   ├── external       <- Data from third party sources.
-│   ├── interim        <- Intermediate data that has been transformed.
-│   ├── processed      <- The final, canonical data sets for modeling.
-│   └── raw            <- The original, immutable data dump.
-│
-├── docs               <- A default Sphinx project; see sphinx-doc.org for details
-│
-├── models             <- Trained and serialized models, model predictions, or model summaries
-│
-├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
-│                         the creator's initials, and a short `-` delimited description, e.g.
-│                         `1.0-jqp-initial-data-exploration`.
-│
-├── references         <- Data dictionaries, manuals, and all other explanatory materials.
-│
-├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
-│   └── figures        <- Generated graphics and figures to be used in reporting
-│
-├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
-│                         generated with `pip freeze > requirements.txt`
-│
-├── src                <- Source code for use in this project.
-│   ├── __init__.py    <- Makes src a Python module
-│   │
-│   ├── data           <- Scripts to download or generate data
-│   │   └── make_dataset.py
-│   │
-│   ├── features       <- Scripts to turn raw data into features for modeling
-│   │   └── build_features.py
-│   │
-│   ├── models         <- Scripts to train models and then use trained models to make
-│   │   │                 predictions
-│   │   ├── predict_model.py
-│   │   └── train_model.py
-│   │
-│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
-│       └── visualize.py
-│
-└── tox.ini            <- tox file with settings for running tox; see tox.testrun.org
-```
+The directory structure of your new project looks like this:
+
+
+* `LICENSE`
+* `Makefile`
+  * Top-level makefile. Type `make` for a list of valid commands.
+* `README.md`
+  * This file.
+* `data`
+  * Data directory, often symlinked to a filesystem with lots of space
+  * `data/raw`
+    * Raw (immutable) hash-verified downloads
+  * `data/interim`
+    * Extracted and interim data representations
+  * `data/processed`
+    * The final, canonical data sets for modeling.
+* `docs`
+  * A default Sphinx project; see sphinx-doc.org for details
+* `models`
+  * Trained and serialized models, model predictions, or model summaries
+* `notebooks`
+  * Jupyter notebooks. Naming convention is a number (for ordering),
+    the creator's initials, and a short `-` delimited description,
+    e.g. `1.0-jqp-initial-data-exploration`.
+* `references`
+  * Data dictionaries, manuals, and all other explanatory materials.
+* `reports`
+  * Generated analysis as HTML, PDF, LaTeX, etc.
+  * `reports/figures`
+    * Generated graphics and figures to be used in reporting
+* `requirements.txt`
+  * (if using pip+virtualenv) The requirements file for reproducing the
+    analysis environment, e.g. generated with `pip freeze > requirements.txt`
+* `environment.yml`
+  * (if using conda) The YAML file for reproducing the analysis environment
+* `setup.py`
+  * Turns the contents of `MODULE_NAME` into a
+    pip-installable python module (`pip install -e .`) so it can be
+    imported in python code
+* `MODULE_NAME`
+  * Source code for use in this project.
+  * `MODULE_NAME/__init__.py`
+    * Makes MODULE_NAME a Python module
+  * `MODULE_NAME/data`
+    * Scripts to fetch or generate data. In particular:
+    * `MODULE_NAME/data/make_dataset.py`
+      * Run with `python -m MODULE_NAME.data.make_dataset fetch`
+        or `python -m MODULE_NAME.data.make_dataset process`
+  * `MODULE_NAME/features`
+    * Scripts to turn raw data into features for modeling, notably `build_features.py`
+  * `MODULE_NAME/models`
+    * Scripts to train models and then use trained models to make predictions,
+      e.g. `predict_model.py`, `train_model.py`
+  * `MODULE_NAME/visualization`
+    * Scripts to create exploratory and results-oriented visualizations, e.g.
+      `visualize.py`
+* `tox.ini`
+  * tox file with settings for running tox; see tox.testrun.org
+
+## TODO
+
+* Add pipenv support
+* Remove python2 support (python2 can be supported via a pipenv/conda environment
+  if absolutely needed)
 
-## Contributing
-We welcome contributions! [See the docs for guidelines](https://drivendata.github.io/cookiecutter-data-science/#contributing).
 
 ### Installing development requirements
 ------------
-
-    pip install -r requirements.txt
-
-### Running the tests
-------------
-
-    py.test tests
+```
+  make requirements
+```
\ No newline at end of file
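
The README above refers to the generated module's dataset API; the `datasets.py` diff below is where that API lives. As a rough usage sketch, assuming a generated project installed with `pip install -e .`, a generated module named `src`, and an already-registered dataset called `example_dataset` (both names are hypothetical placeholders):

```python
# Sketch only: `src` stands in for MODULE_NAME, and "example_dataset" is a
# hypothetical dataset already registered in the project's dataset list.
from src.data.datasets import load_dataset

# First call fetches, processes, and caches the dataset under processed_data_path;
# later calls reload the cached joblib file.
dset = load_dataset('example_dataset')
print(dset['DESCR'])

# Bypass the cache and redo the full fetch/process cycle.
dset = load_dataset('example_dataset', force=True)

# scikit-learn style (data, target) tuple.
X, y = load_dataset('example_dataset', return_X_y=True)
```
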
diff --git a/{{ cookiecutter.repo_name }}/{{ cookiecutter.module_name }}/data/datasets.py b/{{ cookiecutter.repo_name }}/{{ cookiecutter.module_name }}/data/datasets.py
index 53313cc..462f695 100644
--- a/{{ cookiecutter.repo_name }}/{{ cookiecutter.module_name }}/data/datasets.py
+++ b/{{ cookiecutter.repo_name }}/{{ cookiecutter.module_name }}/data/datasets.py
@@ -1,10 +1,15 @@
+import cv2
+import glob
 import logging
 import os
 import pathlib
+import pandas as pd
+import numpy as np
 import json
 from sklearn.datasets.base import Bunch
+from scipy.io import loadmat
 from functools import partial
-from joblib import Memory
+import joblib
 import sys
 
 from .utils import fetch_and_unpack, get_dataset_filename
@@ -14,19 +19,37 @@ _MODULE_DIR = pathlib.Path(os.path.dirname(os.path.abspath(__file__)))
 
 logger = logging.getLogger(__name__)
 
-jlmem = Memory(cachedir=str(interim_data_path))
-
 def new_dataset(*, dataset_name):
+    """Return an unpopulated dataset object.
+
+    Fills in LICENSE and DESCR if they are present.
+    Takes metadata from the url_list object if present. Otherwise, if
+    `*.license` or `*.readme` files are present in the module directory,
+    these will be used as LICENSE and DESCR, respectively.
+    """
     global dataset_raw_files
 
     dset = Bunch()
     dset['metadata'] = {}
     dset['LICENSE'] = None
     dset['DESCR'] = None
-
+    filemap = {
+        'LICENSE': f'{dataset_name}.license',
+        'DESCR': f'{dataset_name}.readme'
+    }
+
+    # read metadata from disk if present
+    for metadata_type in filemap:
+        metadata_file = _MODULE_DIR / filemap[metadata_type]
+        if metadata_file.exists():
+            with open(metadata_file, 'r') as fd:
+                dset[metadata_type] = fd.read()
+
+    # Use downloaded metadata if available
     ds = dataset_raw_files[dataset_name]
     for fetch_dict in ds.get('url_list', []):
         name = fetch_dict.get('name', None)
+        # if metadata is present in the URL list, use it
         if name in ['DESCR', 'LICENSE']:
             txtfile = get_dataset_filename(fetch_dict)
             with open(raw_data_path / txtfile, 'r') as fr:
@@ -47,28 +70,88 @@ def add_dataset_by_urllist(dataset_name, url_list):
     dataset_raw_files = read_datasets()
     return dataset_raw_files[dataset_name]
 
-@jlmem.cache
-def load_dataset(dataset_name, return_X_y=False, **kwargs):
+def add_dataset_metadata(dataset_name, from_file=None, from_str=None, kind='DESCR'):
+    """Add metadata to a dataset
+
+    from_file: create metadata entry from contents of this file
+    from_str: create metadata entry from this string
+    kind: {'DESCR', 'LICENSE'}
+    """
+    global dataset_raw_files
+
+    filename_map = {
+        'DESCR': f'{dataset_name}.readme',
+        'LICENSE': f'{dataset_name}.license',
+    }
+
+    if dataset_name not in dataset_raw_files:
+        raise Exception(f'No such dataset: {dataset_name}')
+
+    if kind not in filename_map:
+        raise Exception(f'Unknown kind: {kind}. Must be one of {filename_map.keys()}')
+
+    if from_file is not None:
+        with open(from_file, 'r') as fd:
+            meta_txt = fd.read()
+    elif from_str is not None:
+        meta_txt = from_str
+    else:
+        raise Exception('One of `from_file` or `from_str` is required')
+
+    with open(_MODULE_DIR / filename_map[kind], 'w') as fw:
+        fw.write(meta_txt)
+
+def load_dataset(dataset_name, return_X_y=False, force=False, **kwargs):
     '''Loads a scikit-learn style dataset
 
     dataset_name:
        Name of dataset to load
     return_X_y: boolean, default=False
        if True, returns (data, target) instead of a Bunch object
+    force: boolean
+       if True, do a complete fetch/process cycle. If False, use the cached object (if present)
     '''
 
     if dataset_name not in dataset_raw_files:
         raise Exception(f'Unknown Dataset: {dataset_name}')
 
-    fetch_and_unpack(dataset_name)
-
-    dset = dataset_raw_files[dataset_name]['load_function'](**kwargs)
+    # check for a cached version
+    cache_file = processed_data_path / f'{dataset_name}.jlib'
+    if cache_file.exists() and force is not True:
+        dset = joblib.load(cache_file)
+    else:
+        # no cache; regenerate
+        fetch_and_unpack(dataset_name)
+        dset = dataset_raw_files[dataset_name]['load_function'](**kwargs)
+        with open(cache_file, 'wb') as fo:
+            joblib.dump(dset, fo)
 
     if return_X_y:
         return dset.data, dset.target
     else:
         return dset
 
+def read_space_delimited(filename, skiprows=None, class_labels=True):
+    """Read a space-delimited file
+
+    skiprows: list of rows to skip when reading the file.
+       Note: we can't use automatic comment detection, as
+       `#` characters are also used as data labels.
+    class_labels: boolean
+       if True, the last column is treated as the class label
+    """
+    with open(filename, 'r') as fd:
+        df = pd.read_table(fd, skiprows=skiprows, skip_blank_lines=True, comment=None, header=None, sep=' ', dtype=str)
+        # targets are the last column; data is everything else
+        if class_labels is True:
+            target = df.loc[:, df.columns[-1]].values
+            data = df.loc[:, df.columns[:-1]].values
+        else:
+            data = df.values
+            target = np.zeros(data.shape[0])
+        return data, target
+
 def write_dataset(path=None, filename="datasets.json", indent=4, sort_keys=True):
     """Write a serialized (JSON) dataset file"""
     if path is None:
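
The new `add_dataset_metadata` helper above writes a `<dataset_name>.readme` or `<dataset_name>.license` file into the module directory, which `new_dataset` then picks up as `DESCR` or `LICENSE`. A short sketch of how it might be called (the module name `src`, the dataset name, and the file path are hypothetical):

```python
# Sketch only: assumes "example_dataset" is already registered.
from src.data.datasets import add_dataset_metadata

# Attach a description from an inline string; this is written to
# <module_dir>/example_dataset.readme and read back by new_dataset().
add_dataset_metadata('example_dataset',
                     from_str='Example dataset for demonstration purposes.',
                     kind='DESCR')

# Attach license text from a file instead.
add_dataset_metadata('example_dataset',
                     from_file='license.txt',
                     kind='LICENSE')

# Passing neither from_file nor from_str raises an exception.
```
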
Must be one of {filename_map.keys()}') + + if from_file is not None: + with open(from_file, 'r') as fd: + meta_txt = fd.read() + elif from_str is not None: + meta_txt = from_str + else: + raise Exception(f'One of `from_file` or `from_str` is required') + + with open(_MODULE_DIR / filename_map[kind], 'w') as fw: + fw.write(meta_txt) + +def load_dataset(dataset_name, return_X_y=False, force=False, **kwargs): '''Loads a scikit-learn style dataset dataset_name: Name of dataset to load return_X_y: boolean, default=False if True, returns (data, target) instead of a Bunch object + force: boolean + if True, do complete fetch/process cycle. If False, will use cached object (if present) ''' if dataset_name not in dataset_raw_files: raise Exception(f'Unknown Dataset: {dataset_name}') - fetch_and_unpack(dataset_name) - - dset = dataset_raw_files[dataset_name]['load_function'](**kwargs) + # check for cached version + cache_file = processed_data_path / f'{dataset_name}.jlib' + if cache_file.exists() and force is not True: + dset = joblib.load(cache_file) + else: + # no cache. Regenerate + fetch_and_unpack(dataset_name) + dset = dataset_raw_files[dataset_name]['load_function'](**kwargs) + with open(cache_file, 'wb') as fo: + joblib.dump(dset, fo) if return_X_y: return dset.data, dset.target else: return dset +def read_space_delimited(filename, skiprows=None, class_labels=True): + """Read an space-delimited file + + skiprows: list of rows to skip when reading the file. + + Note: we can't use automatic comment detection, as + `#` characters are also used as data labels. + class_labels: boolean + if true, the last column is treated as the class label + """ + with open(filename, 'r') as fd: + df = pd.read_table(fd, skiprows=skiprows, skip_blank_lines=True, comment=None, header=None, sep=' ', dtype=str) + # targets are last column. Data is everything else + if class_labels is True: + target = df.loc[:,df.columns[-1]].values + data = df.loc[:,df.columns[:-1]].values + else: + data = df.values + target = np.zeros(data.shape[0]) + return data, target + def write_dataset(path=None, filename="datasets.json", indent=4, sort_keys=True): """Write a serialized (JSON) dataset file""" if path is None: diff --git a/{{ cookiecutter.repo_name }}/{{ cookiecutter.module_name }}/data/utils.py b/{{ cookiecutter.repo_name }}/{{ cookiecutter.module_name }}/data/utils.py index 7fae0cc..afd1286 100644 --- a/{{ cookiecutter.repo_name }}/{{ cookiecutter.module_name }}/data/utils.py +++ b/{{ cookiecutter.repo_name }}/{{ cookiecutter.module_name }}/data/utils.py @@ -7,6 +7,7 @@ import shutil import zipfile import gzip +import zlib from ..paths import interim_data_path, raw_data_path @@ -226,12 +227,18 @@ def unpack(filename, dst_dir=None, create_dst=True): elif path.endswith('.gz'): opener, mode = gzip.open, 'rb' outfile, outmode = path[:-3], 'wb' + elif path.endswith('.Z'): + logger.warning(".Z files are only supported on systems that ship with gzip. Trying...") + os.system(f'gzip -d {path}') + opener, mode = open, 'rb' + path = path[:-2] + outfile, outmode = path, 'wb' else: opener, mode = open, 'rb' outfile, outmode = path, 'wb' logger.info("No compression detected. Copying...") - with opener(filename, mode) as f_in: + with opener(path, mode) as f_in: if archive: logger.info(f"Extracting {filename.name}") f_in.extractall(path=dst_dir)