Summary of functions

Functionality of tech.ml.dataset version 2.0-beta30

Based on article: https://atrebas.github.io/post/2019-03-03-datatable-dplyr/#addupdatedelete-columns

Full source code with results and R (through clojisr): https://github.com/genmeblog/techtest/blob/master/src/techtest/datatable_dplyr.clj

A few helper functions are defined at the beginning of the code to perform common operations:

| fn name | description |
|---------|-------------|
| `aggregate` | aggregate dataset and add result to the given (or empty) map |
| `aggregate->dataset` | convert result of `aggregate` to a dataset |
| `group-by-columns-or-fn-and-aggregate` | group dataset by column(s) or fn and aggregate, returns dataset |
| `sort-by-columns-with-orders` | sort by columns with given order (`:asc` or `:desc`) |
| `filter-by-external-values->indices` | filter sequence and return selected indices |
| `map-v` | apply fn to values of map, returns map |

None of these functions is optimized; they should be rewritten to use tech.ml.dataset internal functions. Issues have already been filed.
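For orientation, the two simplest helpers can be sketched in plain Clojure. These are illustrative re-implementations, not the code from the linked source file: only the names and one-line descriptions come from the table above, and the argument order (function first, collection last) is an assumption.

```clojure
;; Sketches only: argument order and return types are assumptions,
;; not the definitions used in the linked source.

(defn map-v
  "Apply f to every value of map m, keeping the keys."
  [f m]
  (reduce-kv (fn [acc k v] (assoc acc k (f v))) {} m))

(defn filter-by-external-values->indices
  "Return the indices of the elements of values that satisfy pred."
  [pred values]
  (keep-indexed (fn [idx v] (when (pred v) idx)) values))

(map-v inc {:a 1 :b 2})
;; => {:a 2, :b 3}

(filter-by-external-values->indices even? [1 2 3 4])
;; => (1 3)
```

Taking the collection as the last argument lets both helpers sit in the middle of a `->>` pipeline.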

Dataset creation

The dataset below is used in all snippets.

(require '[tech.ml.dataset :as ds])

(def DS (ds/name-values-seq->dataset {:V1 (take 9 (cycle [1 2]))
                                      :V2 (range 1 10)
                                      :V3 (take 9 (cycle [0.5 1.0 1.5]))
                                      :V4 (take 9 (cycle [\A \B \C]))}))

(class DS)
;; => tech.ml.dataset.impl.dataset.Dataset
DS
;; => _unnamed [9 4]:
;;    | :V1 | :V2 |    :V3 | :V4 |
;;    |-----+-----+--------+-----|
;;    |   1 |   1 | 0.5000 |   A |
;;    |   2 |   2 |  1.000 |   B |
;;    |   1 |   3 |  1.500 |   C |
;;    |   2 |   4 | 0.5000 |   A |
;;    |   1 |   5 |  1.000 |   B |
;;    |   2 |   6 |  1.500 |   C |
;;    |   1 |   7 | 0.5000 |   A |
;;    |   2 |   8 |  1.000 |   B |
;;    |   1 |   9 |  1.500 |   C |

Basic operations

Filter rows

| Operation | Code | Comments |
|-----------|------|----------|
| Filter rows using indices | `(ds/select-rows DS [2 3])` | |
| Discard rows using indices | `(ds/drop-rows DS (range 2 7))` | also `remove-rows` |
| Filter rows using a logical expression (single column) | `(ds/filter-column #(> ^long % 5) :V2 DS)` or `(ds/filter-column #{\A \C} :V4 DS)` | |
| Filter rows using multiple conditions | `(ds/filter #(and (= (:V1 %) 1) (= (:V4 %) \A)) DS)` | |
| Filter unique rows | `(ds/unique-by identity DS :column-name-seq (ds/column-names DS))`<br>`(ds/unique-by identity DS :column-name-seq [:V1 :V4])` | |
| Discard rows with missing values | `(ds/drop-rows DS (ds/missing DS))` | `missing` works on the whole dataset; to restrict it to certain columns, first create a dataset with only those columns. |
| 3 random rows | `(ds/sample 3 DS)` | |
| N/2 random rows | `(ds/sample (/ (ds/row-count DS) 2) DS)` | |
| Top N entries | `(->> (m/rank (map - (DS :V1)) :dense)`<br>`(filter-by-external-values->indices #(< ^long % 1))`<br>`(ds/select-rows DS))` | `rank` is defined in the `fastmath` library |
| Select by regex | `(ds/filter #(re-matches #"^B" (str (:V4 %))) DS)` | |
| Range selection | `(ds/filter (comp (partial m/between? 3 5) :V2) DS)` | `between?` is defined in `fastmath` |
| Range selection (pure clj) | `(ds/filter (comp #(< 3 % 5) :V2) DS)` | |
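The "Top N entries" pipeline is the least obvious entry, so here is a rough plain-Clojure walk-through of the idea on the `:V1` values alone. `dense-rank` is a hypothetical stand-in written for this illustration, not fastmath's `rank`; it assumes ranks start at 0, which is what the `< 1` predicate in the table suggests.

```clojure
;; Plain-Clojure stand-in for the "Top N entries" row; `dense-rank` is a
;; hypothetical helper, not fastmath's `rank`.

(defn dense-rank
  "0-based dense rank: equal values share a rank and ranks have no gaps."
  [xs]
  (let [rank-of (zipmap (sort (distinct xs)) (range))]
    (map rank-of xs)))

(def v1 (take 9 (cycle [1 2])))          ; the :V1 values of DS

;; Negating the values makes the largest value receive rank 0.
(def ranks (dense-rank (map - v1)))
;; => (1 0 1 0 1 0 1 0 1)

;; Indices of rows whose rank is below 1, i.e. rows holding the maximum of :V1.
(def top-indices
  (keep-indexed (fn [idx r] (when (< r 1) idx)) ranks))
;; => (1 3 5 7)
```

In the table's version, these indices are what gets passed to `ds/select-rows`.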

Sort rows

| Operation | Code | Comments |
|-----------|------|----------|
| Sort rows by column | `(ds/sort-by-column :V3 DS)` | |
| Sort rows by column using indices | `(ds/select-rows DS (m/order (DS :V3)))` | `order` is defined in `fastmath` |
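Sorting through indices relies on an "order" permutation: the row indices arranged so that the column values come out ascending. A minimal plain-Clojure sketch of that idea follows; this `order` is a hypothetical stand-in written for illustration and is not fastmath's implementation.

```clojure
;; Hypothetical stand-in for an "order" function, for illustration only.

(defn order
  "Indices that would sort xs ascending; ties keep their original order,
  because Clojure's sort-by is stable."
  [xs]
  (map first (sort-by second (map-indexed vector xs))))

(def v3 (take 9 (cycle [0.5 1.0 1.5])))  ; the :V3 values of DS

(order v3)
;; => (0 3 6 1 4 7 2 5 8)

;; Picking the values in that order yields the sorted column.
(map #(nth v3 %) (order v3))
;; => (0.5 0.5 0.5 1.0 1.0 1.0 1.5 1.5 1.5)
```

Selecting rows by such an index sequence reorders the whole dataset, not just the sorted column.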
