An orchestration platform for Docker containers running data mining algorithms.
This project exposes a web interface to execute on demand data mining algorithms defined in Docker containers and implemented using any tool or language (R, Python, Java and more are supported).
It relies on a runtime environment containing Mesos and [Chronos](https://mesos.github.io/chronos/) to control and execute the Docker containers over a cluster.
```
docker run --rm --env [list of environment variables] --link woken hbpmip/woken:2.8.2
```

where the environment variables are:
- CLUSTER_IP: Name of this server advertised in the Akka cluster
- CLUSTER_PORT: Port of this server advertised in the Akka cluster
- CLUSTER_NAME: Name of Woken cluster, default to 'woken'
- WOKEN_PORT_8088_TCP_ADDR: Address of Woken master server
- WOKEN_PORT_8088_TCP_PORT: Port of Woken master server, default to 8088
- DOCKER_BRIDGE_NETWORK: Name of the Docker bridge network. Default to 'bridge'
- NETWORK_INTERFACE: IP address for listening to incoming HTTP connections. Default to '0.0.0.0'
- WEB_SERVICES_PORT: Port for the HTTP server in Docker container. Default to 8087
- WEB_SERVICES_SECURE: If yes, HTTPS with a custom certificate will be used. Default to no.
- WEB_SERVICES_USER: Username for the HTTP basic authentication protecting the web services. Default to 'admin'
- WEB_SERVICES_PASSWORD: Password for the HTTP basic authentication protecting the web services.
- LOG_LEVEL: Level for logs on standard output, default to WARNING
- LOG_CONFIG: on/off - log configuration on start, default to off
- VALIDATION_MIN_SERVERS: minimum number of servers with the 'validation' functionality in the cluster, default to 0
- SCORING_MIN_SERVERS: minimum number of servers with the 'scoring' functionality in the cluster, default to 0
- KAMON_ENABLED: enable monitoring with Kamon, default to no
- ZIPKIN_ENABLED: enable reporting traces to Zipkin, default to no. Requires Kamon enabled.
- ZIPKIN_IP: IP address of the Zipkin server. Requires Kamon and Zipkin enabled.
- ZIPKIN_PORT: Port of the Zipkin server. Requires Kamon and Zipkin enabled.
- PROMETHEUS_ENABLED: enable reporting metrics to Prometheus, default to no. Requires Kamon enabled.
- PROMETHEUS_IP: IP address of the Prometheus server. Requires Kamon and Prometheus enabled.
- PROMETHEUS_PORT: Port of the Prometheus server. Requires Kamon and Prometheus enabled.
- SIGAR_SYSTEM_METRICS: Enable collection of metrics of the system using Sigar native library, default to no. Requires Kamon enabled.
- JVM_SYSTEM_METRICS: Enable collection of metrics of the JVM using JMX, default to no. Requires Kamon enabled.
- MINING_LIMIT: Maximum number of concurrent mining operations. Default to 100
- EXPERIMENT_LIMIT: Maximum number of concurrent experiments. Default to 100
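Putting the environment variables above together, a single-node launch could be sketched as follows (hostnames, ports and credentials are illustrative assumptions, not defaults to copy verbatim):

```shell
# Illustrative launch of the Woken container; every value below is an
# assumption -- adapt it to your own cluster and network setup.
docker run --rm \
  --env CLUSTER_IP=woken \
  --env CLUSTER_PORT=8088 \
  --env CLUSTER_NAME=woken \
  --env WOKEN_PORT_8088_TCP_ADDR=woken \
  --env WOKEN_PORT_8088_TCP_PORT=8088 \
  --env WEB_SERVICES_PORT=8087 \
  --env WEB_SERVICES_USER=admin \
  --env WEB_SERVICES_PASSWORD=changeme \
  --env LOG_LEVEL=INFO \
  --link woken \
  hbpmip/woken:2.8.2
```

This template requires a running Docker daemon and an existing 'woken' container to link against, so treat it as a deployment sketch rather than a ready-to-run script.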
Follow these steps to get started:

- Git-clone this repository:

  ```
  git clone https://github.com/LREN-CHUV/woken.git
  ```

- Change directory into your clone:

  ```
  cd woken
  ```

- Build the application.

  You need the following software installed:

  ```
  ./build.sh
  ```

- Run the application.

  You need the following software installed to execute some tests:

  ```
  cd tests
  ./run.sh
  ```

  tests/run.sh uses docker-compose to start a full environment with Mesos, Zookeeper and Chronos, all of which are required for the proper execution of Woken.

- Create a DNS alias in /etc/hosts:

  ```
  127.0.0.1 localhost frontend
  ```

- Browse to http://frontend:8087 or run one of the query* scripts located in the 'tests' folder.
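Once the test environment is up, a quick reachability check can be sketched from the shell (the credentials and the style of request are assumptions; adapt them to your WEB_SERVICES_USER / WEB_SERVICES_PASSWORD settings):

```shell
# Assumes tests/run.sh is running and /etc/hosts maps 'frontend' as above.
# Prints the HTTP status code returned by the server.
curl -s -o /dev/null -w '%{http_code}\n' -u admin:changeme http://frontend:8087/
```

This requires the full docker-compose environment to be running, so it is a sanity-check sketch rather than part of the build.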
The Docker containers that can be executed on this platform require a few specific features.
TODO: define those features - parameters passed as environment variables, in and out directories, entrypoint with a 'compute command', ...
The project algorithm-repository contains the Docker images that can be used with Woken.
Performs a data mining task.
Path: /mining/job
Verb: POST

Takes a JSON document in the body and returns a JSON document. The input should be of the form:
```
{
  "user": {"code": "user1"},
  "variables": [{"code": "var1"}],
  "covariables": [{"code": "var2"}, {"code": "var3"}],
  "grouping": [{"code": "var4"}],
  "filters": [],
  "algorithm": "",
  "datasets": [{"code": "dataset1"}, {"code": "dataset2"}]
}
```
where:
- variables is the list of variables
- covariables is the list of covariables
- grouping is the list of variables to group together
- filters is the list of filters. The format used here comes from JQuery QueryBuilder filters, for example:

  ```
  {"condition":"AND","rules":[{"id":"FULLNAME","field":"FULLNAME","type":"string","input":"text","operator":"equal","value":"Isaac Fulmer"}],"valid":true}
  ```

- datasets is an optional list of datasets. In distributed mode it can be used to select the nodes to query; in all cases it adds a filter rule of the form:

  ```
  {"condition":"OR","rules":[{"field":"dataset","operator":"equal","value":"dataset1"},{"field":"dataset","operator":"equal","value":"dataset2"}]}
  ```

- algorithm is the algorithm to use.
Currently, the following algorithms are supported:
- data: returns the raw data matching the query
- linearRegression: performs a linear regression
- summaryStatistics: computes summary statistics that can be used to draw box plots.
- knn
- naiveBayes
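The query document and endpoint above can be exercised from the shell. The sketch below builds an illustrative payload (the variable codes, dataset name, credentials and server address are all assumptions) and shows the corresponding curl call as a comment:

```shell
# Build an illustrative mining query for the linearRegression algorithm.
# All codes (var1, var2, ...) and the dataset name are placeholders.
QUERY='{
  "user": {"code": "user1"},
  "variables": [{"code": "var1"}],
  "covariables": [{"code": "var2"}, {"code": "var3"}],
  "grouping": [],
  "filters": [],
  "algorithm": "linearRegression",
  "datasets": [{"code": "dataset1"}]
}'
echo "$QUERY"

# Then POST it to a running Woken instance, for example:
#   curl -u admin:changeme -H 'Content-Type: application/json' \
#        -d "$QUERY" http://frontend:8087/mining/job
```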
Performs an experiment comprising several data mining tasks and an optional cross-validation step used to compute the fitness of each algorithm and select the best result.
TODO: document API
You need the following software installed:
Execute the following command to distribute Woken as a Docker container:

```
./publish.sh
```
For production, woken requires Mesos and Chronos. To install them, you can use either:
- mip-microservices-infrastructure, a collection of Ansible scripts deploying a full Mesos stack on Ubuntu servers.
- mantl.io, a microservice infrastructure by Cisco, based on Mesos.
- Mesosphere DC/OS (the datacenter operating system), an open-source, distributed operating system based on the Apache Mesos distributed systems kernel.
Woken:
- the Woken river in China - we were looking for rivers in China
- passive form of awake - it launches Docker containers and computations
- workflow - the previous name, not too different
This work has been funded by the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 604102 (HBP).
This work is part of SP8 of the Human Brain Project (SGA1).