Dataset improvements #2

Merged (2 commits, Jul 13, 2018)
README.md: 145 changes (79 additions, 66 deletions)
@@ -1,14 +1,20 @@
# Cookiecutter Data Science
# Cookiecutter EasyData

_A logical, reasonably standardized, but flexible project structure for doing and sharing data science work._

This is an experimental fork of
[cookiecutter-data-science](http://drivendata.github.io/cookiecutter-data-science/)
where I try out ideas before proposing them for inclusion upstream.

#### [Project homepage](http://drivendata.github.io/cookiecutter-data-science/)


### Requirements to use the cookiecutter template:
### Requirements to use this cookiecutter template:
-----------
- Python 2.7 or 3.5
- anaconda (or miniconda)

- python3. Technically, it still prompts for a choice between python and python3,
  but I'm aiming to deprecate this and move all python version support
  to use either conda or pipenv.

- [Cookiecutter Python package](http://cookiecutter.readthedocs.org/en/latest/installation.html) >= 1.4.0: This can be installed with pip or conda, depending on how you manage your Python packages:

``` bash
@@ -26,73 +32,80 @@ $ conda install cookiecutter
### To start a new project, run:
------------

cookiecutter https://github.com/drivendata/cookiecutter-data-science


[![asciicast](https://asciinema.org/a/9bgl5qh17wlop4xyxu9n9wr02.png)](https://asciinema.org/a/9bgl5qh17wlop4xyxu9n9wr02)
cookiecutter https://github.com/hackalog/cookiecutter-easydata


### The resulting directory structure
------------

The directory structure of your new project looks like this:

```
├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
├── models             <- Trained and serialized models, model predictions, or model summaries
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
└── tox.ini            <- tox file with settings for running tox; see tox.testrun.org
```
The directory structure of your new project looks like this:


* `LICENSE`
* `Makefile`
    * Top-level Makefile. Type `make` for a list of valid commands.
* `README.md`
    * This file.
* `data`
    * Data directory, often symlinked to a filesystem with lots of space.
* `data/raw`
    * Raw (immutable) hash-verified downloads.
* `data/interim`
    * Extracted and interim data representations.
* `data/processed`
    * The final, canonical data sets for modeling.
* `docs`
    * A default Sphinx project; see sphinx-doc.org for details.
* `models`
    * Trained and serialized models, model predictions, or model summaries.
* `notebooks`
    * Jupyter notebooks. Naming convention is a number (for ordering),
      the creator's initials, and a short `-` delimited description,
      e.g. `1.0-jqp-initial-data-exploration`.
* `references`
    * Data dictionaries, manuals, and all other explanatory materials.
* `reports`
    * Generated analysis as HTML, PDF, LaTeX, etc.
* `reports/figures`
    * Generated graphics and figures to be used in reporting.
* `requirements.txt`
    * (if using pip+virtualenv) The requirements file for reproducing the
      analysis environment, e.g. generated with `pip freeze > requirements.txt`.
* `environment.yml`
    * (if using conda) The YAML file for reproducing the analysis environment.
* `setup.py`
    * Turns the contents of `MODULE_NAME` into a pip-installable python module
      (`pip install -e .`) so it can be imported in python code
      (see the sketch after this list).
* `MODULE_NAME`
    * Source code for use in this project.
* `MODULE_NAME/__init__.py`
    * Makes `MODULE_NAME` a Python module.
* `MODULE_NAME/data`
    * Scripts to fetch or generate data. In particular:
* `MODULE_NAME/data/make_dataset.py`
    * Run with `python -m MODULE_NAME.data.make_dataset fetch`
      or `python -m MODULE_NAME.data.make_dataset process`.
* `MODULE_NAME/features`
    * Scripts to turn raw data into features for modeling, notably `build_features.py`.
* `MODULE_NAME/models`
    * Scripts to train models and then use trained models to make predictions,
      e.g. `predict_model.py`, `train_model.py`.
* `MODULE_NAME/visualization`
    * Scripts to create exploratory and results oriented visualizations, e.g.
      `visualize.py`.
* `tox.ini`
    * tox file with settings for running tox; see tox.testrun.org
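
As a quick orientation, here is a minimal sketch of how these pieces are meant to fit together once a project is generated. It assumes the datasets module added later in this PR ends up importable as `MODULE_NAME.data.datasets` and uses a placeholder dataset name; neither is prescribed by the template itself.

```python
# Minimal sketch; MODULE_NAME and 'my-dataset' are placeholders, not part of this PR.
# After `pip install -e .`, the project module is importable from anywhere:
from MODULE_NAME.data.datasets import load_dataset   # assumed location of the datasets module

# First call fetches, processes, and caches the dataset; later calls reuse the cache.
dset = load_dataset('my-dataset')
print(dset['DESCR'])                                  # description text, if the load function provides one
X, y = load_dataset('my-dataset', return_X_y=True)    # scikit-learn style (data, target)
```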

## TODO

* Add pipenv support
* Remove python2 support (python2 can be supported via a pipenv/conda environment
  if absolutely needed)

## Contributing

We welcome contributions! [See the docs for guidelines](https://drivendata.github.io/cookiecutter-data-science/#contributing).

### Installing development requirements
------------

pip install -r requirements.txt

### Running the tests
------------

py.test tests
```
make requirements
```
@@ -1,10 +1,15 @@
import cv2
import glob
import logging
import os
import pathlib
import pandas as pd
import numpy as np
import json
from sklearn.datasets.base import Bunch
from scipy.io import loadmat
from functools import partial
from joblib import Memory
import joblib
import sys

from .utils import fetch_and_unpack, get_dataset_filename
@@ -14,19 +19,37 @@
_MODULE_DIR = pathlib.Path(os.path.dirname(os.path.abspath(__file__)))
logger = logging.getLogger(__name__)

jlmem = Memory(cachedir=str(interim_data_path))

def new_dataset(*, dataset_name):
"""Return an unpopulated dataset object.

Fills in LICENSE and DESCR if they are present.
Takes metadata from the url_list object if present. Otherwise, if
`*.license` or `*.readme` files are present in the module directory,
these will be used as LICENSE and DESCR respectively.
"""
global dataset_raw_files

dset = Bunch()
dset['metadata'] = {}
dset['LICENSE'] = None
dset['DESCR'] = None

filemap = {
'LICENSE': f'{dataset_name}.license',
'DESCR': f'{dataset_name}.readme'
}

# read metadata from disk if present
for metadata_type in filemap:
metadata_file = _MODULE_DIR / filemap[metadata_type]
if metadata_file.exists():
with open(metadata_file, 'r') as fd:
dset[metadata_type] = fd.read()

# Use downloaded metadata if available
ds = dataset_raw_files[dataset_name]
for fetch_dict in ds.get('url_list', []):
name = fetch_dict.get('name', None)
# if metadata is present in the URL list, use it
if name in ['DESCR', 'LICENSE']:
txtfile = get_dataset_filename(fetch_dict)
with open(raw_data_path / txtfile, 'r') as fr:
@@ -47,28 +70,88 @@ def add_dataset_by_urllist(dataset_name, url_list):
dataset_raw_files = read_datasets()
return dataset_raw_files[dataset_name]

@jlmem.cache
def load_dataset(dataset_name, return_X_y=False, **kwargs):
def add_dataset_metadata(dataset_name, from_file=None, from_str=None, kind='DESCR'):
"""Add metadata to a dataset

from_file: create metadata entry from contents of this file
from_str: create metadata entry from this string
kind: {'DESCR', 'LICENSE'}
"""
global dataset_raw_files

filename_map = {
'DESCR': f'{dataset_name}.readme',
'LICENSE': f'{dataset_name}.license',
}

if dataset_name not in dataset_raw_files:
raise Exception(f'No such dataset: {dataset_name}')

if kind not in filename_map:
raise Exception(f'Unknown kind: {kind}. Must be one of {filename_map.keys()}')

if from_file is not None:
with open(from_file, 'r') as fd:
meta_txt = fd.read()
elif from_str is not None:
meta_txt = from_str
else:
raise Exception(f'One of `from_file` or `from_str` is required')

with open(_MODULE_DIR / filename_map[kind], 'w') as fw:
fw.write(meta_txt)
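
As a usage note (not part of the diff), a minimal sketch of how `add_dataset_metadata` might be called; the dataset name and file path are hypothetical:

```python
# Hypothetical usage sketch: attach license and description text to a registered dataset.
# 'my-dataset' and 'docs/my-dataset-license.txt' are illustrative names only.
add_dataset_metadata('my-dataset', from_file='docs/my-dataset-license.txt', kind='LICENSE')
add_dataset_metadata('my-dataset', from_str='A short human-readable description.', kind='DESCR')
# The text lands in <module dir>/my-dataset.license and <module dir>/my-dataset.readme,
# which new_dataset() reads back on the next load.
```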

def load_dataset(dataset_name, return_X_y=False, force=False, **kwargs):
'''Loads a scikit-learn style dataset

dataset_name:
Name of dataset to load
return_X_y: boolean, default=False
if True, returns (data, target) instead of a Bunch object
force: boolean
if True, do complete fetch/process cycle. If False, will use cached object (if present)
'''

if dataset_name not in dataset_raw_files:
raise Exception(f'Unknown Dataset: {dataset_name}')

fetch_and_unpack(dataset_name)

dset = dataset_raw_files[dataset_name]['load_function'](**kwargs)
# check for cached version
cache_file = processed_data_path / f'{dataset_name}.jlib'
if cache_file.exists() and force is not True:
dset = joblib.load(cache_file)
else:
# no cache. Regenerate
fetch_and_unpack(dataset_name)
dset = dataset_raw_files[dataset_name]['load_function'](**kwargs)
with open(cache_file, 'wb') as fo:
joblib.dump(dset, fo)

if return_X_y:
return dset.data, dset.target
else:
return dset
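
Likewise, a short sketch of the new cache-aware `load_dataset` behaviour; the dataset name is again hypothetical:

```python
# Hypothetical usage sketch ('my-dataset' is illustrative).
dset = load_dataset('my-dataset')                    # fetch, process, and cache on first call
dset = load_dataset('my-dataset')                    # later calls load the cached .jlib file
X, y = load_dataset('my-dataset', return_X_y=True)   # scikit-learn style (data, target) tuple
dset = load_dataset('my-dataset', force=True)        # bypass the cache and regenerate
```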

def read_space_delimited(filename, skiprows=None, class_labels=True):
"""Read an space-delimited file

skiprows: list of rows to skip when reading the file.

Note: we can't use automatic comment detection, as
`#` characters are also used as data labels.
class_labels: boolean
if true, the last column is treated as the class label
"""
with open(filename, 'r') as fd:
df = pd.read_table(fd, skiprows=skiprows, skip_blank_lines=True, comment=None, header=None, sep=' ', dtype=str)
# targets are last column. Data is everything else
if class_labels is True:
target = df.loc[:,df.columns[-1]].values
data = df.loc[:,df.columns[:-1]].values
else:
data = df.values
target = np.zeros(data.shape[0])
return data, target
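
A small sketch of `read_space_delimited` under both settings of `class_labels`; the filename is hypothetical:

```python
# Hypothetical usage sketch ('data/raw/example.dat' is illustrative).
# With class_labels=True (the default), the last column becomes the target.
data, target = read_space_delimited('data/raw/example.dat', skiprows=[0])

# With class_labels=False, every column is data and target is a zero vector.
data, target = read_space_delimited('data/raw/example.dat', class_labels=False)
```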

def write_dataset(path=None, filename="datasets.json", indent=4, sort_keys=True):
"""Write a serialized (JSON) dataset file"""
if path is None:
@@ -7,6 +7,7 @@
import shutil
import zipfile
import gzip
import zlib

from ..paths import interim_data_path, raw_data_path

@@ -226,12 +227,18 @@ def unpack(filename, dst_dir=None, create_dst=True):
elif path.endswith('.gz'):
opener, mode = gzip.open, 'rb'
outfile, outmode = path[:-3], 'wb'
elif path.endswith('.Z'):
logger.warning(".Z files are only supported on systems that ship with gzip. Trying...")
os.system(f'gzip -d {path}')
opener, mode = open, 'rb'
path = path[:-2]
outfile, outmode = path, 'wb'
else:
opener, mode = open, 'rb'
outfile, outmode = path, 'wb'
logger.info("No compression detected. Copying...")

with opener(filename, mode) as f_in:
with opener(path, mode) as f_in:
if archive:
logger.info(f"Extracting {filename.name}")
f_in.extractall(path=dst_dir)
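
Finally, a brief sketch of how the updated `unpack` might be invoked; the archive name is hypothetical:

```python
# Hypothetical usage sketch ('example.txt.Z' is illustrative).
# On systems that ship with gzip, the .Z file is decompressed in place via `gzip -d`
# and the resulting file is then handled by the normal copy/extract path.
unpack(raw_data_path / 'example.txt.Z', dst_dir=interim_data_path)
```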