Why don't you use X instead?

Amy Wooding edited this page Jan 1, 2023 · 10 revisions

"I notice you've implemented a feature. Why don't you use the X package instead?"

This question gets asked of us all the time. Often, the answer is simple: we haven't had a chance to give it a thorough test yet. Otherwise, we have given it a thorough test, and found some reasons why we needed a different solution.

This page collects tools that have been suggested for inclusion in (or as alternatives to) Easydata.

It also outlines known problems with the tools we currently use.

Virtual Environments

Conda

We do, most of the time. And we use Conda by default for package management in Easydata. There are a few headaches, though. First, in many recent versions, the conda binary is no longer added to the PATH when Anaconda is installed; instead, some shell magic is used to invoke conda. This doesn't work from a Makefile, so we have to ask the user for the path to their conda binary (which there's no reason they should have to know).
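As a minimal sketch of the workaround described above (the variable name and default path are illustrative, not necessarily what Easydata uses):

```makefile
# Because `conda activate` is shell magic rather than a binary on the PATH,
# a Makefile has to be told where the actual conda executable lives.
CONDA_EXE ?= $(HOME)/anaconda3/bin/conda

create_environment:
	$(CONDA_EXE) env create -f environment.yml
```

The user overrides CONDA_EXE on the command line (or in a local config) if their install lives elsewhere, which is exactly the detail we wish they didn't have to know.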

Also, we rely extensively on conda-env, and we've heard rumblings that there are plans to replace it with an entirely new mechanism.

Next, the lock files we generate from conda-env's environment.yml files are not platform-independent, so we can't claim reproducibility across platforms. "One nightmare at a time" is our only response at the moment.

Pipenv

We tried. We really did. In the end, while it had some nice features (notably pipenv shell, for use in Makefiles), it made some design decisions that make it less useful for our application. In particular, there's no way to specify the version of Python required. It's also explicitly designed for managing application dependencies, not libraries. These are pretty much fatal flaws for our use cases.

And for the record, no, it isn't Python's recommended packaging tool. We got bitten by that claim, too.

Poetry

We like the .toml format. However, the tendency (and recommendation) to pin versions in a manner similar to JavaScript leads to a dependency nightmare. We need all the flexibility we can get to resolve the environment.
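To illustrate the pinning problem, here is a hypothetical .toml dependency fragment (the package names and version numbers are made up for illustration):

```toml
[tool.poetry.dependencies]
python = "^3.8"

# A javascript-style exact pin: two libraries that disagree on the
# micro version will deadlock the dependency solver.
pandas = "1.2.3"

# What we'd rather declare: a flexible range that gives the solver
# room to find a combination that works for the whole environment.
# pandas = ">=1.0,<2.0"
```

The more exact pins a library declares, the fewer environments it can coexist in, which is why we prefer loose ranges everywhere we can get away with them.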

DAG Managers

DVC (Data Version Control)

Lets you store and share data artifacts and models (via a git repo and git-LFS). Essentially git-for-datasets, plus the plumbing to create DAGs for data refinement or model generation.

DVC uses Makefile-like pipeline files to describe how one data/model artifact was built from other data and code.
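For flavour, here is a minimal DVC pipeline stage in dvc.yaml form (the stage name, script, and paths are illustrative): each stage records the command that built an artifact, its input dependencies, and its outputs, much like a Makefile rule.

```yaml
stages:
  featurize:
    cmd: python src/featurize.py data/raw data/features
    deps:
      - src/featurize.py
      - data/raw
    outs:
      - data/features
```

Running the pipeline re-executes only the stages whose dependencies have changed, which is the same caching behaviour make gives you for free.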

The DAG piece of DVC is morally equivalent to our transformer graph.

Dataset Discovery

Intake

Intake is a Swiss Army knife for loading data from a variety of raw formats (see the current list of known plugins), and formatting them for various container data types (e.g. Pandas dataframes, Python lists, NumPy arrays).

It is also a catalog solution, allowing for datasets to be described via metadata, and explored via shared catalog information (using the Intake server).

We'd love a Dataset.from_intake() constructor.
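To make the wish concrete, here is a sketch of what such a constructor might look like. To be clear: Dataset.from_intake() does not exist, and the Dataset class below is a stand-in, not Easydata's real class. It only assumes the catalog entry exposes Intake's data-source interface, namely a read() method that materializes the container and a describe() method that returns a metadata dict.

```python
class Dataset:
    """Stand-in for an Easydata-style Dataset (illustrative only)."""

    def __init__(self, name, data, metadata=None):
        self.name = name
        self.data = data
        self.metadata = metadata or {}

    @classmethod
    def from_intake(cls, entry, name=None):
        """Build a Dataset from an Intake-style catalog entry.

        Assumes `entry.read()` returns the materialized container
        (e.g. a DataFrame) and `entry.describe()` returns a metadata
        dict, as Intake data sources do.
        """
        info = entry.describe()
        return cls(
            name=name or info.get("name", "unnamed"),
            data=entry.read(),
            metadata=info,
        )
```

Because the constructor is duck-typed on read()/describe(), anything catalog-shaped would work, not just Intake itself.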

Quilt Data

Data catalog and wrapper around AWS to make it easy to host, share, discover, preview, and version datasets.

It would be nice to have a Dataset.from_quilt() constructor.

Containers and Compute Infrastructure

JupyterHub

We love JupyterHub. We use it all the time. In fact, a number of minor improvements were made in Easydata 1.5 to make it work better with JupyterHub deployments.

Our longer-term goal is to integrate Easydata-derived git repos with a JupyterHub via continuous deployment (CD). Stay tuned.

Databricks

We don't currently make use of Databricks.

That said, we'd expect Easydata could be integrated with Databricks via CD.