Readme update

diyabc · Jul 10, 2019 · 5a51cda
1 parent b151549
commit 5a51cda
Showing 3 changed files with 94 additions and 93 deletions.
diff --git a/README-ORIG.md b/README-ORIG.md
@@ -24,42 +24,54 @@ Libraries we use :
 
 As a mention, we use our own implementation of LDA and PLS from [@friedman2001elements{81, 114}].
 
-There is two sets of binaries, one for model choice [```ModelChoice```](#model-choice), another for parameter estimation [```EstimParam```](#parameter-estimation). Each set contains a Macos/Linux/Windows (x64 only) binary for each platform.
+There is one set of binaries, containing one x64 binary for each platform (macOS, Linux, Windows).
 There are available within the "[Releases](https://github.com/fradav/abcranger/releases)" tab, under "Assets" section (unfold it to see the list).
 
-Those are pure command line binaries, and they are no prerequisites or library dependencies in order to run them. Just download them and launch them from your terminal software of choice. The usual caveats with command line executable apply there : if you're not proficient with the command line interface of your platform, please learn some basics or ask someone who might help you in those matters. 
+This is a pure command line binary, and there are no prerequisites or library dependencies required to run it. Just download it and launch it from your terminal software of choice. The usual caveats with command line executables apply here : if you're not proficient with the command line interface of your platform, please learn some basics or ask someone who might help you in those matters.
 
 As a note, we may add a graphical interface in a near future.
 
-# Model Choice
-
-## Usage
+# Usage 
 
 ```text
- - ABC Random Forest/Model choice command line options
+ - ABC Random Forest - Model choice or parameter estimation command line options
 Usage:
-  ModelChoice [OPTION...]
+  abcranger [OPTION...]
 
   -h, --header arg        Header file (default: headerRF.txt)
   -r, --reftable arg      Reftable file (default: reftableRF.bin)
   -b, --statobs arg       Statobs file (default: statobsRF.txt)
-  -o, --output arg        Prefix output (default: modelchoice_out)
+  -o, --output arg        Prefix output (modelchoice_out or estimparam_out by
+                          default)
   -n, --nref arg          Number of samples, 0 means all (default: 0)
   -m, --minnodesize arg   Minimal node size. 0 means 1 for classification or
                           5 for regression (default: 0)
   -t, --ntree arg         Number of trees (default: 500)
   -j, --threads arg       Number of threads, 0 means all (default: 0)
-  -s, --seed arg          Seed, 0 means generated (default: 0)
+  -s, --seed arg          Seed, generated by default (default: 0)
   -c, --noisecolumns arg  Number of noise columns (default: 5)
-  -l, --lda               Enable LDA (default: true)
+      --nolinear          Disable LDA for model choice or PLS for parameter
+                          estimation
+      --chosenscen arg    Chosen scenario (mandatory for parameter
+                          estimation)
+      --ntest arg         number of testing samples (mandatory for parameter
+                          estimation)
+      --parameter arg     name of the parameter of interest (mandatory for
+                          parameter estimation)
       --help              Print help
 ```
 
+- If you provide `--chosenscen`, `--parameter` and `--ntest`, parameter estimation mode is selected.
+- Otherwise by default it's model choice mode.
+- Linear additions are LDA for model choice and PLS for parameter estimation; the `--nolinear` option disables them in both cases.
+
+# Model Choice
+
 ## Example
 
 Example :
 
-`ModelChoice -t 10000 -j 8`
+`abcranger -t 10000 -j 8`
 
 Header, reftable and statobs files should be in the current directory.
 
@@ -74,47 +86,31 @@ Four files are created :
 
 # Parameter Estimation
 
-Note : The Pls components are selected within 99% of the explained variance of the output.
-As in for the $m$th component and for $N$ samples and $M$ features:
+## A note about PLS heuristic
+
+The PLS components are selected so as to retain _at least_ 99% of the maximum explained variance of the output.
 
 $$Yvar^m = \frac{\sum_{i=1}^{N}{(\hat{y}^{m}_{i}-\bar{y})^2}}{\sum_{i=1}^{N}{(y_{i}-\hat{y})^2}}$$
 
 where $\hat{y}^{m}$ is the $Y$ scored by the pls for the $m$th component.
-We take only the first $n_{comp}$ components as in :
+We take only the first $n_{heur}$ components, stopping when :
 
-$$n_{comp} = \underset{Yvar^m \leq{} 0.99*Yvar^M, }{\operatorname{argmax}}$$
+$$\frac{Yvar^{k+1}+Yvar^{k}}{2} \geq 0.99(N-k)\left(Yvar^{k+1}-Yvar^ {k}\right)$$
 
-## Usage
+We can easily prove that $n_{heur}$ is greater than or equal to $n_{comp}$ :
+$$n_{heur} \ge n_{comp} = \underset{Yvar^m \leq{} 0.99*Yvar^M, }{\operatorname{argmax}}$$
 
-```text
- - ABC Random Forest/Model parameter estimation command line options
-Usage:
-  EstimParam [OPTION...]
+In practice, we find $n_{heur}$ close enough to $n_{comp}$.
 
-  -h, --header arg        Header file (default: headerRF.txt)
-  -r, --reftable arg      Reftable file (default: reftableRF.bin)
-  -b, --statobs arg       Statobs file (default: statobsRF.txt)
-  -o, --output arg        Prefix output (default: estimparam_out)
-  -n, --nref arg          Number of samples, 0 means all (default: 0)
-  -m, --minnodesize arg   Minimal node size. 0 means 1 for classification or
-                          5 for regression (default: 0)
-  -t, --ntree arg         Number of trees (default: 500)
-  -j, --threads arg       Number of threads, 0 means all (default: 0)
-  -s, --seed arg          Seed, 0 means generated (default: 0)
-  -c, --noisecolumns arg  Number of noise columns (default: 5)
-  -p, --pls               Enable PLS (default: true)
-      --chosenscen arg    Chosen scenario (mandatory)
-      --ntrain arg        number of training samples (mandatory)
-      --ntest arg         number of testing samples (mandatory)
-      --parameter arg     name of the parameter of interest (mandatory)
-      --help              Print help
-```
+## The meaning of the `ntest` parameter
+
+Computing weight predictions over the whole OOB set (see [@raynal2016abc]) is very costly, both memory- and CPU-wise, so we advise computing them for only a subset of size `ntest`.
 
 ## Example
 
 Example (working with the dataset in `test/data`) :
 
-`EstimParam -t 1000 -j 8 --parameter ra --chosenscen 1 --ntrain 1000 --ntest 50`
+`abcranger -t 1000 -j 8 --parameter ra --chosenscen 1 --ntest 50`
 
 Header, reftable and statobs files should be in the current directory.
 
@@ -143,7 +139,7 @@ if pls enabled :
 
 ## C++ standalone
 
-- [ ] Merge the two methodologies in a single executable with the (almost) the same options
+- [X] Merge the two methodologies in a single executable with (almost) the same options
 - [ ] \(Optional) Possibly move to another options parser (CLI?)
 
 ## External interfaces
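
The stopping rule introduced in the "A note about PLS heuristic" hunk above can be sketched in C++ as follows. This is only an illustration under stated assumptions, not the code from abcranger: the function name `pls_heuristic_ncomp` is hypothetical, `yvar[k-1]` is assumed to hold $Yvar^k$, and $n_{heur}$ is assumed to be the first 1-based index $k$ at which the inequality holds (the README does not say whether $k$ or $k+1$ is retained).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, not taken from abcranger.
// yvar[k-1] is assumed to hold Yvar^k, the explained-variance ratio obtained
// with the first k PLS components; n_samples is N.
// Returns the first 1-based k for which
//   (Yvar^{k+1} + Yvar^k) / 2 >= 0.99 * (N - k) * (Yvar^{k+1} - Yvar^k)
// or yvar.size() if the inequality never holds.
std::size_t pls_heuristic_ncomp(const std::vector<double>& yvar,
                                std::size_t n_samples) {
    for (std::size_t k = 1; k < yvar.size(); ++k) {
        const double cur = yvar[k - 1];  // Yvar^k
        const double next = yvar[k];     // Yvar^{k+1}
        const double lhs = (next + cur) / 2.0;
        // Convert to double before subtracting to avoid unsigned wrap-around.
        const double remaining =
            static_cast<double>(n_samples) - static_cast<double>(k);
        const double rhs = 0.99 * remaining * (next - cur);
        if (lhs >= rhs) {
            return k;  // assumption: n_heur is the k at which we stop
        }
    }
    return yvar.size();
}
```

With `yvar = {0.5, 0.9, 0.95, 0.951}` and `N = 100`, the gain from the fourth component is small enough that the inequality first holds at `k = 3`.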

diff --git a/README.md b/README.md
@@ -1,10 +1,13 @@
+  - [Usage](#usage)
   - [Model Choice](#model-choice)
   - [Parameter Estimation](#parameter-estimation)
   - [TODO](#todo)
   - [References](#references)
 
 <!-- pandoc -f markdown README-ORIG.md -t gfm -o README.md --bibliography=ref.bib -s --toc --toc-depth=1 -->
 
+<!-- pandoc --atx-headers -f markdown README-ORIG.md -t gfm -o README.md --bibliography=ref.bib -s --toc --toc-depth=1 --webtex=https://latex.codecogs.com/png.latex? -->
+
 [![Build
 Status](https://travis-ci.com/fradav/abcranger.svg)](https://travis-ci.com/fradav/abcranger)
 
@@ -28,52 +31,63 @@ As a mention, we use our own implementation of LDA and PLS from
 (Friedman, Hastie, and Tibshirani [2001](#ref-friedman2001elements),
 1:81, 114).
 
-There is two sets of binaries, one for model choice
-[`ModelChoice`](#model-choice), another for parameter estimation
-[`EstimParam`](#parameter-estimation). Each set contains a
-Macos/Linux/Windows (x64 only) binary for each platform. There are
-available within the
+There is one set of binaries, containing one x64 binary for each
+platform (macOS, Linux, Windows). They are available within the
 “[Releases](https://github.com/fradav/abcranger/releases)” tab, under
 “Assets” section (unfold it to see the list).
 
-Those are pure command line binaries, and they are no prerequisites or
-library dependencies in order to run them. Just download them and launch
+This is a pure command line binary, and there are no prerequisites or
+library dependencies required to run it. Just download them and launch
 them from your terminal software of choice. The usual caveats with
 command line executable apply there : if you’re not proficient with the
 command line interface of your platform, please learn some basics or ask
 someone who might help you in those matters.
 
 As a note, we may add a graphical interface in a near future.
 
-# Model Choice
-
-## Usage
+# Usage
 
 ``` text
- - ABC Random Forest/Model choice command line options
+ - ABC Random Forest - Model choice or parameter estimation command line options
 Usage:
-  ModelChoice [OPTION...]
+  abcranger [OPTION...]
 
   -h, --header arg        Header file (default: headerRF.txt)
   -r, --reftable arg      Reftable file (default: reftableRF.bin)
   -b, --statobs arg       Statobs file (default: statobsRF.txt)
-  -o, --output arg        Prefix output (default: modelchoice_out)
+  -o, --output arg        Prefix output (modelchoice_out or estimparam_out by
+                          default)
   -n, --nref arg          Number of samples, 0 means all (default: 0)
   -m, --minnodesize arg   Minimal node size. 0 means 1 for classification or
                           5 for regression (default: 0)
   -t, --ntree arg         Number of trees (default: 500)
   -j, --threads arg       Number of threads, 0 means all (default: 0)
-  -s, --seed arg          Seed, 0 means generated (default: 0)
+  -s, --seed arg          Seed, generated by default (default: 0)
   -c, --noisecolumns arg  Number of noise columns (default: 5)
-  -l, --lda               Enable LDA (default: true)
+      --nolinear          Disable LDA for model choice or PLS for parameter
+                          estimation
+      --chosenscen arg    Chosen scenario (mandatory for parameter
+                          estimation)
+      --ntest arg         number of testing samples (mandatory for parameter
+                          estimation)
+      --parameter arg     name of the parameter of interest (mandatory for
+                          parameter estimation)
       --help              Print help
 ```
 
+  - If you provide `--chosenscen`, `--parameter` and `--ntest`,
+    parameter estimation mode is selected.
+  - Otherwise by default it’s model choice mode.
+  - Linear additions are LDA for model choice and PLS for parameter
+    estimation; the `--nolinear` option disables them in both cases.
+
+# Model Choice
+
 ## Example
 
 Example :
 
-`ModelChoice -t 10000 -j 8`
+`abcranger -t 10000 -j 8`
 
 Header, reftable and statobs files should be in the current directory.
 
@@ -90,11 +104,10 @@ Four files are created :
 
 # Parameter Estimation
 
-Note : The Pls components are selected within 99% of the explained
-variance of the output. As in for the
-![m](https://latex.codecogs.com/png.latex?m "m")th component and for
-![N](https://latex.codecogs.com/png.latex?N "N") samples and
-![M](https://latex.codecogs.com/png.latex?M "M") features:
+## A note about PLS heuristic
+
+The PLS components are selected so as to retain *at least* 99% of the
+maximum explained variance of the output.
 
 
 ![Yvar^m =
@@ -106,46 +119,38 @@ where
 "\\hat{y}^{m}") is the ![Y](https://latex.codecogs.com/png.latex?Y "Y")
 scored by the pls for the ![m](https://latex.codecogs.com/png.latex?m
 "m")th component. We take only the first
-![n\_{comp}](https://latex.codecogs.com/png.latex?n_%7Bcomp%7D
-"n_{comp}") components as in :
+![n\_{heur}](https://latex.codecogs.com/png.latex?n_%7Bheur%7D
+"n_{heur}") components, stopping when :
 
 
-![n\_{comp} = \\underset{Yvar^m \\leq{} 0.99\*Yvar^M,
-}{\\operatorname{argmax}}](https://latex.codecogs.com/png.latex?n_%7Bcomp%7D%20%3D%20%5Cunderset%7BYvar%5Em%20%5Cleq%7B%7D%200.99%2AYvar%5EM%2C%20%7D%7B%5Coperatorname%7Bargmax%7D%7D
-"n_{comp} = \\underset{Yvar^m \\leq{} 0.99*Yvar^M, }{\\operatorname{argmax}}")  
+![\\frac{Yvar^{k+1}+Yvar^{k}}{2} \\geq 0.99(N-k)\\left(Yvar^{k+1}-Yvar^
+{k}\\right)](https://latex.codecogs.com/png.latex?%5Cfrac%7BYvar%5E%7Bk%2B1%7D%2BYvar%5E%7Bk%7D%7D%7B2%7D%20%5Cgeq%200.99%28N-k%29%5Cleft%28Yvar%5E%7Bk%2B1%7D-Yvar%5E%20%7Bk%7D%5Cright%29
+"\\frac{Yvar^{k+1}+Yvar^{k}}{2} \\geq 0.99(N-k)\\left(Yvar^{k+1}-Yvar^ {k}\\right)")  
 
-## Usage
+We can easily prove that
+![n\_{heur}](https://latex.codecogs.com/png.latex?n_%7Bheur%7D
+"n_{heur}") is greater than or equal to
+![n\_{comp}](https://latex.codecogs.com/png.latex?n_%7Bcomp%7D
+"n_{comp}") :   
+![n\_{heur} \\ge n\_{comp} = \\underset{Yvar^m \\leq{} 0.99\*Yvar^M,
+}{\\operatorname{argmax}}](https://latex.codecogs.com/png.latex?n_%7Bheur%7D%20%5Cge%20n_%7Bcomp%7D%20%3D%20%5Cunderset%7BYvar%5Em%20%5Cleq%7B%7D%200.99%2AYvar%5EM%2C%20%7D%7B%5Coperatorname%7Bargmax%7D%7D
+"n_{heur} \\ge n_{comp} = \\underset{Yvar^m \\leq{} 0.99*Yvar^M, }{\\operatorname{argmax}}")  
 
-``` text
- - ABC Random Forest/Model parameter estimation command line options
-Usage:
-  EstimParam [OPTION...]
+In practice, we find
+![n\_{heur}](https://latex.codecogs.com/png.latex?n_%7Bheur%7D
+"n_{heur}") close enough to ![n\_{comp}](https://latex.codecogs.com/png.latex?n_%7Bcomp%7D "n_{comp}").
 
-  -h, --header arg        Header file (default: headerRF.txt)
-  -r, --reftable arg      Reftable file (default: reftableRF.bin)
-  -b, --statobs arg       Statobs file (default: statobsRF.txt)
-  -o, --output arg        Prefix output (default: estimparam_out)
-  -n, --nref arg          Number of samples, 0 means all (default: 0)
-  -m, --minnodesize arg   Minimal node size. 0 means 1 for classification or
-                          5 for regression (default: 0)
-  -t, --ntree arg         Number of trees (default: 500)
-  -j, --threads arg       Number of threads, 0 means all (default: 0)
-  -s, --seed arg          Seed, 0 means generated (default: 0)
-  -c, --noisecolumns arg  Number of noise columns (default: 5)
-  -p, --pls               Enable PLS (default: true)
-      --chosenscen arg    Chosen scenario (mandatory)
-      --ntrain arg        number of training samples (mandatory)
-      --ntest arg         number of testing samples (mandatory)
-      --parameter arg     name of the parameter of interest (mandatory)
-      --help              Print help
-```
+## The meaning of the `ntest` parameter
+
+Computing weight predictions over the whole OOB set (see (Raynal et al.
+[2016](#ref-raynal2016abc))) is very costly, both memory- and CPU-wise, so
+we advise computing them for only a subset of size `ntest`.
 
 ## Example
 
 Example (working with the dataset in `test/data`) :
 
-`EstimParam -t 1000 -j 8 --parameter ra --chosenscen 1 --ntrain 1000
---ntest 50`
+`abcranger -t 1000 -j 8 --parameter ra --chosenscen 1 --ntest 50`
 
 Header, reftable and statobs files should be in the current directory.
 
@@ -180,7 +185,7 @@ if pls enabled :
 
 ## C++ standalone
 
-  - [ ] Merge the two methodologies in a single executable with the
+  - [x] Merge the two methodologies in a single executable with
     (almost) the same options
   - [ ] (Optional) Possibly move to another options parser (CLI?)
 

diff --git a/abcranger.cpp b/abcranger.cpp
@@ -29,7 +29,7 @@ int main(int argc, char* argv[]) {
             ("c,noisecolumns","Number of noise columns",cxxopts::value<size_t>()->default_value("5"))
             ("nolinear","Disable LDA for model choice or PLS for parameter estimation")
             ("chosenscen","Chosen scenario (mandatory for parameter estimation)", cxxopts::value<size_t>())
-            ("ntest","number of testing samples (mandatory for parameter estimation)",cxxopts::value<size_t>())
+            ("ntest","number of oob testing samples (mandatory for parameter estimation)",cxxopts::value<size_t>())
             ("parameter","name of the parameter of interest (mandatory for parameter estimation)",cxxopts::value<std::string>())
             ("help", "Print help")
             ;
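
For readers of this commit, the mode selection described in the README hunks (parameter estimation when `--chosenscen`, `--parameter` and `--ntest` are all given, model choice otherwise) could look roughly like the sketch below. It is a stand-alone illustration, not the logic in `abcranger.cpp`: the `Mode` enum and `select_mode` function are hypothetical, and the set of option names stands in for whatever the option parser reports as present.

```cpp
#include <set>
#include <string>

// Hypothetical names; the real executable may organise this differently.
enum class Mode { ModelChoice, ParameterEstimation };

// `given` is assumed to contain the long names of the options that were
// present on the command line (without the leading dashes).
Mode select_mode(const std::set<std::string>& given) {
    const bool has_all_estim_options = given.count("chosenscen") > 0 &&
                                       given.count("parameter") > 0 &&
                                       given.count("ntest") > 0;
    // Per the README: all three options select parameter estimation,
    // anything else falls back to model choice.
    return has_all_estim_options ? Mode::ParameterEstimation
                                 : Mode::ModelChoice;
}
```

Under this reading, `abcranger -t 10000 -j 8` runs model choice, while `abcranger -t 1000 -j 8 --parameter ra --chosenscen 1 --ntest 50` runs parameter estimation. How the real binary reacts when only some of the three options are given is not stated in the diff.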