Skip to content

Latest commit

 

History

History
62 lines (50 loc) · 2.7 KB

README.md

File metadata and controls

62 lines (50 loc) · 2.7 KB

unix-tools

Set of command line unix tools I created to speedup and simplify day-to-day work with data files. These tools provides missing features in standard unix tools like finding unique lines in unordered text stream, sorting by combination of columns, set-style operations on two text streams like union or intersection as well as significant performance improvement. Implementation heavily uses STL C++ containers like std::set and std::map.

  • uniq2: uniq implementation which can work on unordered stream
  • setops: set operations: union, difference, intersection, symmetric_difference
  • sortby: fast implementation of in-memory text file sorting with ability to specify columns as a key for sort operation

uniq2

uniq2 filters out repeated lines in an input stream.
Unlike unix uniq it can detect non-adjucent repeated lines i.e. there is no need to sort the lines first.
As a bonus uniq2 prints lines in lexicographic order (byte representation is used for unicode characters).
Arguments:
  -c - (optional) output number of occurances for each repeated line.
  -s - (optional) swap number of occurances and line text in -c mode.

setops

setops provides set operations for text files
Usage: setops [operation] <[file1] 3<[file2]
Available operations:
  union
  difference
  intersection
  symmetric_difference
You can use short prefix to specify operation like 'u' for 'union'.

sortby

sortby is fast implementation of in-memory text file sorting with ability to specify columns as a key for sort operation
sortby reads data from standard input and writes data to standard output.
Arguments:
  -d delim - Delimiter to use to separate fields in source stream.
  -f list - Defines combination of fields to be used as a a sorting key.
             Your can use 'i' suffix to interpret field value as in integer.
  -u - Leave only first occurance for lines with same key.
  -c - In combination with -u print number of lines with the same key.
  -s - Swap number of lines and line text in -c mode.

Example: cat example.txt | sortby -d\| -f3,1i
   Will sort lines from pipe ('|') delimited file example.txt using combination of third
   and fisrt columns as a sorting key. first column value will be interpreted as an integer.

Note: if fields list is empty sortby will work as a faster version of unix sort utility.

benchmarks

Sorting of 1,000,000 lines text files. Each line has about 70 characters

  • sortby: 6.251s
  • unix sort: 31.337s
  • 5x times performance improvemnt

Finding unique lines in 1,000,000 lines text file. Each line has 16 characters. 330,000 unique lines total

  • uniq2: 3.949s
  • unix sort+uniq: 20.354s
  • 5x times performance improvement