SOCCER

SOCCER (Sampling Optimal Clustering Cost Estimation) is a fast and accurate distributed clustering algorithm.

Install Soccer

Install, create & activate a python 3.8+ Anaconda virtualenv.
Run:

pip install -r ./requirements.txt

Experiments

To show and compare SOCCER's performance, we scripted an experiment which reads a dataset, clusters it using a chosen algorithm and prints the results. This code allows to easily reproduce all experiments reported in the paper.

Usage

usage: run_all_soccer_paper_experiments.py [-h]
                                           [--blackbox {KMeans,MiniBatchKMeans,ScalableKMeans}]
                                           [--datasets {kdd,bigcross,census1990,higgs,gaussian} [{kdd,bigcross,census1990,higgs,gaussian} ...]]
                                           [--custom-dataset-csvs CUSTOM_DATASET_CSVS [CUSTOM_DATASET_CSVS ...]]
Runs all SOCCER experiments. If used without any parameters, it runs all experiments exactly as reported in the paper.

The one time prerequisites are:
1. Install Anaconda
2. create a conda env - `conda create --name soccer python=3.8`
3. Install requirements - `conda activate soccer && pip3 install -r requirements.txt`
4. run `scripts/download_and_extract_all_datasets.sh`.
Finally, remember before every run to execute - `conda activate soccer`


Examples:
1. *run all experiments*: `./run_all_soccer_paper_experiments.py`
2. run one dataset with scalable-kmeans blackbox: `./run_all_soccer_paper_experiments.py --datasets kdd --blackbox ScalableKMeans`
3. run a new custom csv dataset: `./run_all_soccer_paper_experiments.py --custom-dataset-csvs ./my/custom/data.csv`
optional arguments:
  -h, --help            show this help message and exit
  --blackbox {KMeans,MiniBatchKMeans,ScalableKMeans}
                        [optional] which internal algorithm to run as the black-box clustering algorithm used by SOCCER. The default is KMeans
  --datasets {kdd,bigcross,census1990,higgs,gaussian} [{kdd,bigcross,census1990,higgs,gaussian} ...]
                        [optional] run the experiment only on these datasets. If unspecified, runs all datasets.
  --custom-dataset-csvs CUSTOM_DATASET_CSVS [CUSTOM_DATASET_CSVS ...]
                        [optional] run on your custom local CSVs

Configuration

The experiment script is highly configurable - which algorithm, how to run it, which dataset to cluster, etc.

The most common configurations are:

K - The number of clusters to calculate
DATASET - Which known dataset to load and cluster. Alternatively, a new dataset can be specified as a csv file from commandline.
BLACKBOX - Which internal algorithm to run as the black-box clustering algorithm used by SOCCER
EPSILON - A value in range (0, 1) which links the coordinator size with the data set size
DELTA - The confidence parameter.
ALGO - Which clustering algorithm to run - SOCCER or Scalable KMeans.

You can configure many other parameters via environment variables. For the full list, see soccer/config.py.

Reading experiment output

Experiment outputs are written in 3 files:

{run_name}_{timestamp}.log - the full log output
{run_name}_{timestamp}_results.csv - the results of each single clustering iteration (out of 10)
{run_name}_{timestamp}_summary.csv - summary results of all (10) iterations

Gaussian Dataset Seed

For reproducibility, we used a default GAUSSIANS_RANDOM_SEED=1234 when generating gaussian datasets. To reuse our gaussian datasets, don't change the gaussians configurations (just choose e.g. DATASET=gaussians_50). To use a random seed (and dataset), set GAUSSIANS_RANDOM_SEED=0.

Name		Name	Last commit message	Last commit date
Latest commit History 309 Commits
.github/workflows		.github/workflows
datasets/higgs		datasets/higgs
scripts		scripts
soccer		soccer
spark		spark
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt
run_all_soccer_paper_experiments.py		run_all_soccer_paper_experiments.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SOCCER

Install Soccer

Experiments

Usage

Configuration

Reading experiment output

Gaussian Dataset Seed

About

Releases

Packages

Languages

selotape/distributed_k_means

Folders and files

Latest commit

History

Repository files navigation

SOCCER

Install Soccer

Experiments

Usage

Configuration

Reading experiment output

Gaussian Dataset Seed

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages