Summary of functions
Based on article: https://atrebas.github.io/post/2019-03-03-datatable-dplyr/#addupdatedelete-columns
Full source code with results and R (through clojisr): https://github.com/genmeblog/techtest/blob/master/src/techtest/datatable_dplyr.clj
Some helper functions are created to perform certain operations; they are placed at the beginning of the code:

| fn name | description |
|---|---|
| `aggregate` | aggregate dataset and add result to the given (or empty) map |
| `aggregate->dataset` | convert result of `aggregate` to a dataset |
| `group-by-columns-or-fn-and-aggregate` | group dataset by column(s) or fn and aggregate, returns dataset |
| `sort-by-columns-with-orders` | sort by columns with given order (`:asc` or `:desc`) |
| `filter-by-external-values->indices` | filter sequence and return selected indices |
| `map-v` | apply fn to values of map, returns map |
None of these functions is optimized; they should be rewritten to use tech.ml.dataset internal functions. Issues have already been filed.
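To make the two simplest helpers concrete, here is a minimal sketch in plain Clojure, written only from the descriptions in the table. The bodies and the argument order of `map-v` are assumptions; the argument order of `filter-by-external-values->indices` (predicate first, values last) follows how it is threaded with `->>` in the snippets. The actual implementations in the linked source may differ.

```clojure
;; Hypothetical sketches inferred from the helper descriptions;
;; not copied from the linked source.

(defn map-v
  "Apply f to every value of map m, keeping the keys; returns a map."
  [f m]
  (reduce-kv (fn [acc k v] (assoc acc k (f v))) {} m))

(defn filter-by-external-values->indices
  "Return the indices of the elements of values for which pred is truthy."
  [pred values]
  (keep-indexed (fn [idx v] (when (pred v) idx)) values))

(map-v inc {:a 1 :b 2})
;; => {:a 2, :b 3}

(filter-by-external-values->indices odd? [1 2 3 4 5])
;; => (0 2 4)
```

The returned indices can then be passed to a row selector such as `ds/select-rows`, which is how the helper is used in the "Top N entries" snippet.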
Dataset used in all snippets.
```clojure
(def DS (ds/name-values-seq->dataset {:V1 (take 9 (cycle [1 2]))
                                      :V2 (range 1 10)
                                      :V3 (take 9 (cycle [0.5 1.0 1.5]))
                                      :V4 (take 9 (cycle [\A \B \C]))}))

(class DS)
;; => tech.ml.dataset.impl.dataset.Dataset

DS
;; => _unnamed [9 4]:
;;    | :V1 | :V2 |    :V3 | :V4 |
;;    |-----+-----+--------+-----|
;;    |   1 |   1 | 0.5000 |   A |
;;    |   2 |   2 |  1.000 |   B |
;;    |   1 |   3 |  1.500 |   C |
;;    |   2 |   4 | 0.5000 |   A |
;;    |   1 |   5 |  1.000 |   B |
;;    |   2 |   6 |  1.500 |   C |
;;    |   1 |   7 | 0.5000 |   A |
;;    |   2 |   8 |  1.000 |   B |
;;    |   1 |   9 |  1.500 |   C |
```
| Operation | Code | Comments |
|---|---|---|
| Filter rows using indices | `(ds/select-rows DS [2 3])` | |
| Discard rows using indices | `(ds/drop-rows DS (range 2 7))` | also `remove-rows` |
| Filter rows using a logical expression (single column) | `(ds/filter-column #(> ^long % 5) :V2 DS)`<br>or `(ds/filter-column #{\A \C} :V4 DS)` | |
| Filter rows using multiple conditions | `(ds/filter #(and (= (:V1 %) 1) (= (:V4 %) \A)) DS)` | |
| Filter unique rows | `(ds/unique-by identity DS :column-name-seq (ds/column-names DS))`<br>`(ds/unique-by identity DS :column-name-seq [:V1 :V4])` | |
| Discard rows with missing values | `(ds/drop-rows DS (ds/missing DS))` | `missing` works on the whole dataset; to restrict it to selected columns, first create a dataset with only those columns. |
| 3 random rows | `(ds/sample 3 DS)` | |
| N/2 random rows | `(ds/sample (/ (ds/row-count DS) 2) DS)` | |
| Top N entries | `(->> (m/rank (map - (DS :V1)) :dense)`<br>`(filter-by-external-values->indices #(< ^long % 1))`<br>`(ds/select-rows DS))` | `rank` calculation is defined in the `fastmath` library |
| Select by regex | `(ds/filter #(re-matches #"^B" (str (:V4 %))) DS)` | |
| Range selection | `(ds/filter (comp (partial m/between? 3 5) :V2) DS)` | `between?` is defined in `fastmath` |
| Range selection (pure clj) | `(ds/filter (comp #(< 3 % 5) :V2) DS)` | |
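The predicates in the range and regex rows can be checked on plain values before they are applied to a dataset. The sketch below uses only `clojure.core`; `v2` is a hypothetical stand-in for the `:V2` column. It shows that `(< 3 % 5)` excludes both bounds, so it is only equivalent to an inclusive range check (which, as far as I know, is how fastmath's `between?` behaves; treat that as an assumption) if `<=` is used instead. It also shows that `re-matches` requires the whole string to match, which is enough for the single-character values in `:V4` but not for longer strings.

```clojure
;; Stand-in for the values of column :V2 (assumption: plain seq, no dataset).
(def v2 (range 1 10))

;; Strict comparison: both bounds are excluded.
(filter #(< 3 % 5) v2)
;; => (4)

;; Non-strict comparison: both bounds are included.
(filter #(<= 3 % 5) v2)
;; => (3 4 5)

;; re-matches must consume the entire string ...
(re-matches #"^B" "B")    ;; => "B"
(re-matches #"^B" "Bob")  ;; => nil

;; ... while re-find accepts a match anywhere the pattern allows.
(re-find #"^B" "Bob")     ;; => "B"
```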
| Operation | Code | Comments |
|---|---|---|
| Sort rows by column | `(ds/sort-by-column :V3 DS)` | |
| Sort rows by column using indices | `(ds/select-rows DS (m/order (DS :V3)))` | `order` is defined in `fastmath` |
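Sorting "using indices" means computing the permutation that would sort a column and then selecting rows in that order. A minimal plain-Clojure sketch of the idea follows; `order-sketch`, `v2`, and `v3` are hypothetical names introduced here, and fastmath's `order` may handle ties or options differently.

```clojure
(defn order-sketch
  "Return the indices that would sort xs in ascending order.
   Clojure's sort is stable, so equal values keep their original order."
  [xs]
  (map first (sort-by second (map-indexed vector xs))))

;; Stand-ins for columns :V3 and :V2 of the example dataset.
(def v3 (take 9 (cycle [0.5 1.0 1.5])))
(def v2 (vec (range 1 10)))

(order-sketch v3)
;; => (0 3 6 1 4 7 2 5 8)

;; Selecting :V2 values by those indices reorders them to follow sorted :V3.
(map v2 (order-sketch v3))
;; => (1 4 7 2 5 8 3 6 9)
```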