Summary of functions
Based on article: https://atrebas.github.io/post/2019-03-03-datatable-dplyr/#addupdatedelete-columns
Full source code with results and R (through clojisr): https://github.com/genmeblog/techtest/blob/master/src/techtest/datatable_dplyr.clj
Some helper functions are created to perform certain operations. They are placed at the beginning of the code:
fn name | description |
---|---|
aggregate | aggregate dataset and add result to the given (or empty) map |
aggregate->dataset | convert result of aggregate to a dataset |
group-by-columns-or-fn-and-aggregate | group dataset by column(s) or fn and aggregate, returns dataset |
sort-by-columns-with-orders | sort by columns with given order (:asc or :desc) |
filter-by-external-values->indices | filter sequence and return selected indices |
map-v | apply fn to values of map, returns map |
None of these functions are optimized; they should be rewritten to use tech.ml.dataset internal functions. Issues have already been filed.
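To make the table above more concrete, here is a minimal sketch of what two of the helpers might look like in plain Clojure. These are hypothetical reimplementations based only on the descriptions in the table; the actual definitions are in the linked source file and may differ. The argument order of `map-v` is an assumption, while the `(pred values)` order of `filter-by-external-values->indices` follows how it is called with `->>` in the snippets.

```clojure
;; Hypothetical sketches, not the definitions from the linked source.

(defn map-v
  "Apply f to every value of map m and return a map with the same keys.
  The (f m) argument order is an assumption."
  [f m]
  (reduce-kv (fn [acc k v] (assoc acc k (f v))) {} m))

(defn filter-by-external-values->indices
  "Return the indices of the elements of values for which pred is truthy."
  [pred values]
  (keep-indexed (fn [idx v] (when (pred v) idx)) values))

(map-v inc {:a 1 :b 2})
;; => {:a 2, :b 3}

(filter-by-external-values->indices even? [4 5 6 7 8])
;; => (0 2 4)
```

The returned indices can then be passed to a row-selection function, which is the pattern used for the "Top N entries" snippet.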
Dataset used in all snippets.
(def DS (ds/name-values-seq->dataset {:V1 (take 9 (cycle [1 2]))
                                      :V2 (range 1 10)
                                      :V3 (take 9 (cycle [0.5 1.0 1.5]))
                                      :V4 (take 9 (cycle [\A \B \C]))}))
(class DS)
;; => tech.ml.dataset.impl.dataset.Dataset
DS
;; => _unnamed [9 4]:
;; | :V1 | :V2 | :V3 | :V4 |
;; |-----+-----+--------+-----|
;; | 1 | 1 | 0.5000 | A |
;; | 2 | 2 | 1.000 | B |
;; | 1 | 3 | 1.500 | C |
;; | 2 | 4 | 0.5000 | A |
;; | 1 | 5 | 1.000 | B |
;; | 2 | 6 | 1.500 | C |
;; | 1 | 7 | 0.5000 | A |
;; | 2 | 8 | 1.000 | B |
;; | 1 | 9 | 1.500 | C |
Filter rows using indices
(ds/select-rows DS [2 3])
Discard rows using indices (also remove-rows)
(ds/drop-rows DS (range 2 7))
Filter rows using a logical expression (single column)
(ds/filter-column #(> ^long % 5) :V2 DS)
(ds/filter-column #{\A \C} :V4 DS)
Filter rows using multiple conditions
(ds/filter #(and (= (:V1 %) 1) (= (:V4 %) \A)) DS)
Filter unique rows
(ds/unique-by identity DS :column-name-seq (ds/column-names DS))
(ds/unique-by identity DS :column-name-seq [:V1 :V4])
Discard rows with missing values (missing works on the whole dataset; to check only some columns, first create a dataset with the selected columns)
(ds/drop-rows DS (ds/missing DS))
3 random rows
(ds/sample 3 DS)
N/2 random rows
(ds/sample (/ (ds/row-count DS) 2) DS)
Top N entries (rank calculation is defined in the fastmath library)
(->> (m/rank (map - (DS :V1)) :dense)
     (filter-by-external-values->indices #(< ^long % 1))
     (ds/select-rows DS))
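The pipeline above ranks the negated `:V1` values, keeps the indices whose rank is below 1, and selects those rows. The sketch below mimics that logic in plain Clojure so the steps can be followed without fastmath. `dense-rank` is a hypothetical stand-in, and it assumes ranks are 0-based, which is what the `(< % 1)` predicate suggests; it is not fastmath's implementation.

```clojure
;; Hypothetical stand-in for a dense rank, assuming 0-based ranks.
(defn dense-rank
  "Equal values share a rank and ranks have no gaps."
  [xs]
  (let [rank-of (zipmap (sort (distinct xs)) (range))]
    (map rank-of xs)))

;; Same values as column :V1 of DS.
(def v1 [1 2 1 2 1 2 1 2 1])

;; Negating the values makes the largest value get rank 0.
(dense-rank (map - v1))
;; => (1 0 1 0 1 0 1 0 1)

;; Indices of the rows holding the top value of :V1.
(keep-indexed (fn [idx r] (when (< r 1) idx))
              (dense-rank (map - v1)))
;; => (1 3 5 7)
```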
Select by regex
(ds/filter #(re-matches #"^B" (str (:V4 %))) DS)
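A note on the regex call, shown with plain `clojure.core` functions: `re-matches` succeeds only when the whole string matches the pattern, whereas `re-find` looks for a match anywhere. With single-character values such as those in `:V4` the two behave the same for `#"^B"`, but they differ on longer strings. The strings below are illustrative only and do not come from the dataset.

```clojure
;; Whole-string match: works for the single character "B".
(re-matches #"^B" "B")
;; => "B"

;; A longer string starting with B does not match as a whole.
(re-matches #"^B" "Bob")
;; => nil

;; re-find matches a prefix, which is closer to "starts with B".
(re-find #"^B" "Bob")
;; => "B"
```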
Selection by range (between? is defined in fastmath)
(ds/filter (comp (partial m/between? 3 5) :V2) DS)
Range selection (pure clj, exclusive bounds)
(ds/filter (comp #(< 3 % 5) :V2) DS)
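Clojure's comparison functions accept more than two arguments, which is what makes the range check compact. As a small illustration on a plain sequence with the same values as `:V2`, `<` excludes the bounds and `<=` includes them:

```clojure
;; Same values as column :V2 of DS.
(def v2 (range 1 10))

;; Strict comparison: bounds 3 and 5 are excluded.
(filter #(< 3 % 5) v2)
;; => (4)

;; Non-strict comparison: bounds 3 and 5 are included.
(filter #(<= 3 % 5) v2)
;; => (3 4 5)
```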
Sort rows by column
(ds/sort-by-column :V3 DS)
Sort rows by column using indices (order is defined in fastmath)
(ds/select-rows DS (m/order (DS :V3)))
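The idea behind sorting through indices is to compute the permutation that would sort the column and then select rows in that order. A minimal plain-Clojure sketch of such a permutation follows; `order-indices` is a hypothetical name, and its tie-breaking (original position, since Clojure's sort is stable) is not guaranteed to match fastmath's `order`.

```clojure
;; Hypothetical helper: indices that would sort xs in ascending order.
(defn order-indices
  [xs]
  (->> xs
       (map-indexed vector)   ; pairs of [index value]
       (sort-by second)       ; stable sort by value
       (map first)))          ; keep only the indices

;; Same values as column :V3 of DS.
(def v3 [0.5 1.0 1.5 0.5 1.0 1.5 0.5 1.0 1.5])

(order-indices v3)
;; => (0 3 6 1 4 7 2 5 8)
```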