zawk Standard Library

Standard library for AWK with text, math, crypto, kv, database, network etc.

zawk stdlib Cheat Sheet: https://cheatography.com/linux-china/cheat-sheets/zawk/

Text functions

Text is encoded as UTF-8 by default.

length

Get the Unicode character length of text: length($1). For example, length("你好") returns 2.

strlen

Get the byte length of text: strlen($1). For example, strlen("你好") returns 6.

char_at

Get the char at an index, counting from 1: char_at($1, 1). If the index is out of range, an empty string is returned.

chars

Return the chars of text as an array indexed from 1: ar=chars($1).

match(text, re)

Tests whether the string text matches the regular expression re. If text matches, the RSTART variable is set to the start of the leftmost match of re, and RLENGTH is set to the length of this match.
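As an illustration of RSTART and RLENGTH, here is a minimal sketch that uses only POSIX awk, so it should behave the same way in zawk as far as the description above goes. The function name match_demo and the sample text are assumptions made for the example.

```shell
# Sketch: match() records where the leftmost match starts and how long it is.
match_demo() {
  awk 'BEGIN {
    text = "foo123bar"
    if (match(text, /[0-9]+/)) {
      # RSTART is the 1-based start of the match, RLENGTH its length
      print RSTART, RLENGTH, substr(text, RSTART, RLENGTH)
    }
  }'
}
match_demo   # prints: 4 3 123
```

Combining RSTART and RLENGTH with substr is the usual way to extract the matched text.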

substr(text, i[, j])

The 1-indexed substring of text starting from index i and continuing for the next j characters, or until the end of text if i+j exceeds the length of text or if j is not provided.
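The three cases described above (explicit length, omitted length, length past the end) can be sketched with POSIX awk as follows. The name substr_demo and the sample string are assumptions for illustration only.

```shell
# Sketch: substr() with an explicit length, no length, and an oversized length.
substr_demo() {
  awk 'BEGIN {
    s = "hello world"
    print substr(s, 7, 5)     # exactly five characters from index 7
    print substr(s, 7)        # no length: runs to the end of s
    print substr(s, 7, 100)   # length past the end is clamped
  }'
}
substr_demo   # prints "world" three times
```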

sub(re, text, s)

Substitutes text for the first matching occurrence of the regular expression re in the string s.

gsub(re, text, s)

Like sub, but with all occurrences substituted, not just the first.
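To make the difference between sub and gsub concrete, here is a small sketch in POSIX awk, where both functions modify the target variable in place and return the number of substitutions. The name sub_demo and the sample data are assumptions; whether zawk returns the same counts is not stated in this document.

```shell
# Sketch: sub() replaces only the first match, gsub() replaces every match.
sub_demo() {
  awk 'BEGIN {
    a = "a-b-c"; b = "a-b-c"
    n1 = sub(/-/, "+", a)    # a becomes "a+b-c"
    n2 = gsub(/-/, "+", b)   # b becomes "a+b+c"
    print a, n1
    print b, n2
  }'
}
sub_demo
```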

index(haystack, needle)/last_index()

  • index(text, "s"): the first index within haystack at which the string needle occurs, or 0 if needle does not appear.
  • last_index(text, "s"): the last index within haystack at which the string needle occurs, or 0 if needle does not appear.
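A short sketch of the index return values, using POSIX awk only. last_index is zawk-specific and is therefore not exercised here; the name index_demo is an assumption for the example.

```shell
# Sketch: index() returns the 1-based position of the first occurrence, or 0.
index_demo() {
  awk 'BEGIN {
    print index("banana", "an")   # first "an" starts at position 2
    print index("banana", "x")    # not found
  }'
}
index_demo
```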

split(text, arr[, fs])

Splits text according to fs, placing the results in the array arr. If fs is not specified, the FS variable is used to split text.
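The following sketch shows both forms of split in POSIX awk: with an explicit separator and with the FS fallback. The name split_demo and the sample strings are assumptions for illustration.

```shell
# Sketch: split() with an explicit separator, then relying on FS.
split_demo() {
  awk 'BEGIN {
    n = split("a,b,c", parts, ",")   # explicit separator
    print n, parts[1], parts[3]
    FS = ":"
    m = split("x:y", more)           # no third argument: FS is used
    print m, more[2]
  }'
}
split_demo
```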

last_part(s [, sep])

Get the last part of s after sep: last_part("a/b/c", "/") returns c.

If sep is not provided, zawk searches for / first; if it is not found, zawk searches for . instead.

  • last_part("a/b/c") returns c
  • last_part("a.b.c") returns c

sprintf(fmt, s, ...)

Returns a string formatted according to fmt and provided arguments. The goal is to provide the semantics of the libc sprintf function.
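Since sprintf aims at libc semantics, the common width, precision and padding flags should carry over. The sketch below uses POSIX awk; the name sprintf_demo and the chosen format string are assumptions for the example.

```shell
# Sketch: left-justified string, fixed-precision float, zero-padded integer.
sprintf_demo() {
  awk 'BEGIN {
    s = sprintf("%-5s|%5.2f|%03d", "ab", 3.14159, 7)
    print s
  }'
}
sprintf_demo   # prints: ab   | 3.14|007
```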

printf(fmt, s, ...) [>[>] out]

Like sprintf but the result of the operation is written to standard output, or to out according to the append or overwrite semantics specified by > or >>. Like print, printf can be called without parentheses around its arguments, though arguments are parsed differently in this mode to avoid ambiguities.
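The overwrite and append semantics can be sketched as below with POSIX awk. Each awk run opens the file afresh, which is what makes the difference between > and >> visible. The name redirect_demo and the temporary file are assumptions for the example.

```shell
# Sketch: ">" truncates the file when it is first opened, ">>" appends to it.
redirect_demo() {
  out=$(mktemp)
  awk -v out="$out" 'BEGIN { printf "%s\n", "first"  > out }'
  awk -v out="$out" 'BEGIN { printf "%s\n", "second" > out }'    # overwrites "first"
  awk -v out="$out" 'BEGIN { printf "%s\n", "third"  >> out }'   # appends after "second"
  cat "$out"
  rm -f "$out"
}
redirect_demo
```

Within a single awk run the file stays open after the first write, so repeated > redirections to the same file keep adding lines rather than truncating again.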

hex(s)

Returns the hexadecimal integer (e.g. 0x123abc) encoded in s, or 0 otherwise.

For example, hex("0xFF") returns 255. Prefer strtonum("0x11") instead.

join_fields(i, j[, sep])

Returns columns i through j (1-indexed, inclusive) concatenated together, joined by sep, or by OFS if sep is not provided.

join_csv(i, j)

Like join_fields but with columns joined by , and escaped using escape_csv.

join_tsv(i, j)

Like join_fields but with columns joined by tabs and escaped using escape_tsv.

tolower(s)

Returns a copy of s where all uppercase ASCII characters are replaced with their lowercase counterparts; other characters are unchanged.

toupper(s)

Returns a copy of s where all lowercase ASCII characters are replaced with their uppercase counterparts; other characters are unchanged.

strtonum:

Convert text to its numeric (decimal) value: strtonum("0x11") returns 17.

trim

Trims spaces by default: trim($1). To trim specific chars, pass them as the second argument: trim($1, "[]()").

truncate

truncate($1, 10) or truncate($1, 10, "...")

capitalize/uncapitalize:

capitalize("hello") # Hello or uncapitalize("Hello") # hello

camel_case: functions/attribute names

camel_case("hello World") # helloWorld

kebab_case: file names

kebab_case("hello world") # hello-world

snake_case: functions/attribute names

snake_case("hello world") # hello_world

pascal_case/title_case: Component

title_case("hello world") # Hello World

isint

isint("123")

isnum

isnum("1234.01")

is(format,txt)

Validate text format, such as: is("email", "demo@example.com"). Format list:

  • email
  • url
  • phone
  • ip: IP v4/v6

starts_with/ends_with/contains

The return value is 1 or 0.

  • starts_with($1, "https://")
  • ends_with($1, ".com")
  • contains($1, "//")

Why not use regex? Because starts_with/ends_with/contains are easier to use and understand. Most libraries include these functions, and the AWK stdlib should not be the odd one out.

Tips: You can use regular expressions in place of the is_xxx(), contains(), starts_with() and ends_with() functions:

  • is_int: /^\d+$/
  • contains: /xxxx/
  • starts_with: /^xxxx/
  • ends_with: /xxxx$/
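The regex equivalents listed above can be tried in any POSIX awk, as in this sketch. The name regex_demo and the sample URLs are assumptions for the example. Note that \d is not portable across awk implementations, so [0-9] is the safer spelling outside zawk.

```shell
# Sketch: regex tests that mirror starts_with, ends_with and contains.
regex_demo() {
  printf '%s\n' "https://example.com" "ftp://example.org" |
  awk '{
    starts = ($0 ~ /^https:\/\//)   # like starts_with($0, "https://")
    ends   = ($0 ~ /\.com$/)        # like ends_with($0, ".com")
    has    = ($0 ~ /\/\//)          # like contains($0, "//")
    print starts, ends, has
  }'
}
regex_demo
```

As with the functions, each test yields 1 or 0.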

mask

mask("abc@example.com"), mask("186612347")

pad

  • pad($1, 10, "*") # ***hello**
  • pad_start($1, 10, "*") # *****hello
  • pad_end($1, 10, "*") # hello*****

strcmp:

Compare text: strcmp($1, $2) returns -1, 0 or 1.

lines

Split text into non-empty lines: lines(text) returns an array of text.

words

Split text into words: words("hello world? 你好") # ["hello", "world", "你", "好"]

repeat

repeat("*",3) # ***

default_if_empty

Return the default value if text is empty or does not exist.

default_if_empty(" ", "demo") # demo or default_if_empty(var_is_null, "demo") # demo

append_if_missing/prepend_if_missing

Add a suffix/prefix if it is missing, or remove it if it is present.

  • append_if_missing("nats://example.com","/") # nats://example.com/
  • prepend_if_missing("example.com","https://") # https://example.com
  • remove_if_end("demo.json", ".json") # demo
  • remove_if_begin("file://./demo.json", "file://./") # demo.json

quote/double_quote

Wrap text in single or double quotes if it is not already quoted.

  • quote("hello world") # 'hello world'
  • double_quote("hello world") # "hello world"

parse/rparse

  • parse: use wild match - parse("Hello World","{greet} {name}")["greet"]
  • rparse: use regex group - rparse("Hello World","(\\w+) (\\w+)")[1]

format_bytes/to_bytes

Convert bytes to human-readable format, and vice versa. Units(case-insensitive): B, KB, MB, GB, TB, PB, EB, ZB, YB, kib, mib, gib, tib, pib, eib, zib, yib.

  • format_bytes(1024): 1 KB
  • to_bytes("2 KB"): 2048

mkpass

Generate password with numbers, lowercase/uppercase letters, and special chars.

  • mkpass(): 8 chars password
  • mkpass(12): 12 chars password

figlet

Generate ASCII art text with figlet: BEGIN { print figlet("Hello zawk"); }.

Attention: ASCII characters only; do not use i18n characters.

Text Escape

  • escape: escape("format", $1): support json, csv, tsv, xml, html, sql, shell
  • escape_csv(s): Returns s escaped as a CSV column, adding quotes if necessary, replacing quotes with double-quotes, and escaping other whitespace.
  • escape_tsv(s): Returns s escaped as a TSV column. There is less to do than with CSV, but tab and newline characters are replaced with \t and \n.

Text Parser

If you want to see the returned data structure, you can use the var_dump function, such as var_dump(semver("1.2.3-alpha.1+zstd.1.5.0")).

URL

url(url_text) parses a URL and returns an array with the following fields:

  • schema
  • user
  • password
  • host
  • port
  • path
  • query
  • fragment

examples: url("https://example.com/user/1"), url("jdbc:mysql://localhost:3306/test")

Data URL(MapStrStr):

data_url("data:text/plain;base64,SGVsbG8sIFdvcmxkIQ==")

  • data
  • mime_type
  • encoding

Command line(MapIntStr)

shlex("ls -l"), https://crates.io/crates/shlex

Path(MapStrStr)

path("./demo.txt")

  • exists: 0 or 1
  • full_path
  • parent
  • file_name
  • file_stem
  • file_ext
  • content_type

Semantic Versioning(MapStrStr):

semver("1.2.3-alpha"), semver("1.2.3-alpha.1+zstd.1.5.0")

array fields:

  • major
  • minor
  • patch
  • pre
  • build

Pairs

Parse pairs text to array(MapStrStr), for example:

  • URL query string id=1&name=Hello%20World1
  • Trace Context tracestate: congo=congosSecondPosition,rojo=rojosFirstPosition
  • Cookies: pairs(cookies_text, ";", "="), such as: _device_id=c49fdb13b5c41be361ee80236919ba50; user_session=qDSJ7GlA3aLriNnDG-KJsqw_QIFpmTBjt0vcLy5Vq2ay6StZ;

Usage: pairs("a=b,c=d"), pairs("id=1&name=Hello%20World","&"), pairs("a=b;c=d",";","=").

Tips: with pairs("id=1&name=Hello%20World","&"), the text is treated as a URL query string, and the values are URL-decoded automatically.

Records

Prometheus/OpenMetrics text format, such as http_requests_total{method="post",code="200"}

Usage:

  • record("http_requests_total{method='post',code=200}")
  • record("mysql{host=localhost user=root password=123456 database=test}")
  • record("table1(id int, age int)"): DB table design

Message

A message(record with body) always contains name, headers and body, and text format is like http_requests_total{method="post",code="200"}(100)

Usage:

  • message("http_requests_total{method='post',code=200}(100)")
  • message("login_event{method='post',code=200}('xxx@example.com')")

Function invocation

Parse a function invocation into IntMap<Str>; index 0 holds the function name.

  • arr=func("hello(1,2,3)"): arr[0]=>hello, arr[1]=>1
  • arr=func("welcome('Jackie Chan',3)"): arr[0]=>welcome, arr[1]=>Jackie Chan

ID generator

uuid

uuid : uuid(), uuid("v7")

ID specs:

  • length: 128 bits
  • version: v4, v7, and default is v4.

ulid

ulid: Universally Unique Lexicographically Sortable Identifier. Please refer to https://github.com/ulid/spec for details.

ulid() #01ARZ3NDEKTSV4RRFFQ69G5FAV

ID specs:

  • length: 128 bits

tsid

tsid: TSID generator tsid()

snowflake

Snowflake ID is a form of unique identifier used in distributed computing.

snowflake(machine_id), and max value for machine_id is 65535.

ID specs:

  • length: 64 bits
  • machine_id: 16 bits, and max value is 65535;

Array functions

length

length(arr)

delete

  • delete item: delete arr[1]
  • delete array: delete arr
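Both forms of delete can be sketched in plain awk as below. The name delete_demo is an assumption for the example, and the element count is taken with a for-in loop so the sketch does not depend on length(arr) being available.

```shell
# Sketch: delete one element, then delete the whole array.
delete_demo() {
  awk 'BEGIN {
    arr[1] = "a"; arr[2] = "b"; arr[3] = "c"
    delete arr[1]                     # remove a single item
    print ((1 in arr) ? "present" : "gone")
    n = 0; for (k in arr) n++         # count the remaining items
    print n
    delete arr                        # remove every item
    n = 0; for (k in arr) n++
    print n
  }'
}
delete_demo
```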

seq

seq(start, end, step): seq command compatible

uniq

uniq(arr): IntMap -> IntMap, uniq command compatible

asort

n = asort(arr): sort array, and return sorted array length

_max/_min/_sum/_mean

_max(arr): IntIntMap -> Int, IntFloatMap -> Float

_join

_join(arr, ",") IntMap -> Str

parse_array

parse_array("['first','second','third']"): IntMap

tuple

tuple("(1,2,'first','second')"): IntMap

variant

variant("week(5)"): StrMap

flags

flags("{vip,top20}"): StrMap

bloom filter

  • bf_insert(item) or bf_insert(item, group)
  • bf_contains(item) or bf_contains(item, group)
  • bf_icontains(item) or bf_icontains(item, group): Insert if not found. It's useful for duplication check.

Find unique phone numbers: !bf_icontains(phone) { }

Math

Floating-point operations: sin, cos, atan, atan2, log, log2, log10, sqrt, exp are delegated to the Rust standard library, or LLVM intrinsics where available.
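A quick sanity check of the floating-point functions that POSIX awk also provides (sin, cos, atan2, log, sqrt, exp) might look like this. log2 and log10 are left out because they are not part of POSIX awk; the name math_demo is an assumption for the example.

```shell
# Sketch: a few identities, rounded with printf to avoid float noise.
math_demo() {
  awk 'BEGIN {
    pi = atan2(0, -1)   # a common way to obtain pi in awk
    printf "%.4f %.1f %.1f %.1f\n", pi, sqrt(16), exp(0), log(exp(1))
    printf "%.1f %.1f\n", sin(0), cos(0)
  }'
}
math_demo
```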

rand()

Returns a uniform random floating-point number between 0 and 1.
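The sketch below exercises rand() together with srand(x) in POSIX awk: reseeding with the same value reproduces the same number, and srand returns the previous seed. The name rand_demo and the seed 42 are assumptions for the example.

```shell
# Sketch: the same seed gives the same sequence; srand() returns the old seed.
rand_demo() {
  awk 'BEGIN {
    srand(42); a = rand()
    prev = srand(42); b = rand()
    same    = (a == b)            # identical seed, identical first value
    old     = (prev == 42)        # srand returned the previous seed
    inrange = (a >= 0 && a < 1)   # rand() stays within [0, 1)
    print same, old, inrange
  }'
}
rand_demo   # prints: 1 1 1
```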

srand(x)

Seeds the random number generator used by rand, returns the old seed.

Bitwise operations

All of these operations coerce their operands to integers before being evaluated.

abs

abs(-1) # 1

floor

floor(4.5) # 4

ceil

ceil(4.5) # 5

round

round(4.4) # 4

eval

eval("1+2") or eval("a + 2", context). The return type is Float.

Please refer to https://github.com/isibboi/evalexpr for more.

Attention: Only Int/Float/Boolean are supported for now, and booleans are converted to 0/1.

fend

fend("1+2") # 3

Please refer to https://github.com/printfn/fend for more.

min/max

min(1,2,3), max("A","B")

bool

mkbool(s) returns 0 or 1.

Examples: mkbool("true"), mkbool("false"), mkbool("1"), mkbool("0"), mkbool("0.0"), mkbool(" 0 "), mkbool("Y"), mkbool("Yes"), mkbool(""), mkbool("✓")

int/float

int("11") # 11, float("11.2") # 11.2

Date/Time

UTC by default.

systime

systime(): current Unix time

strftime

https://docs.rs/chrono/latest/chrono/format/strftime/index.html

  • strftime("%Y-%m-%d %H:%M:%S")
  • strftime() or strftime("%+"): ISO 8601 / RFC 3339 date & time format.

mktime

Please refer to https://docs.rs/dateparser/latest/dateparser/#accepted-date-formats for the accepted formats.

  • mktime("2012 12 21 0 0 0"):
  • mktime("2019-11-29 08:08-08"):

Duration

Convert duration to seconds: duration("2min + 12sec") # 132. Time units: sec, secs, min, minute, minutes, hour, h, day, d, week, wk, month, mo, year, yr.

Color

Convert between hex and rgb.

  • hex2rgb("#FF0000") # [255,0,0]: result is array [r,g,b]
  • rgb2hex(255,0,0) # #FF0000

Fake

Generate fake data for testing: fake("name") or fake("name","cn").

  • locale: EN(default) and CN are supported now.
  • data: name, phone, cell, email, wechat, ip, creditcard, zipcode, plate, postcode, id(身份证).

JSON

from_json

from_json(json_text)

to_json

to_json(array)

json_value

json_value(json_text, json_path): return only one text value - json_value(json_text, '$.store.book[0].title')

Tips: RFC 9535 JSONPath: Query Expressions for JSON

json_query

json_query(json_text, json_path): return array with text value

CSV

from_csv

from_csv(csv_row): array of text values for one row

to_csv

to_csv(array): csv row

XML

xml_value

xml_value(xml_text, xpath): node's inner_text

Attention: Please refer to an XPath cheatsheet for the xpath syntax.

xml_query

xml_query(xml_text, xpath): array of element's string value

HTML

html_value

html_value(html_text, selector): node's inner_text

Attention: please follow standard CSS selector syntax.

html_query

html_query(html_text, selector): array of node's inner_text

Encoding/Decoding

encode("format",$1)

Formats:

  • hex
  • base32 (RFC 4648 without padding)
  • base58
  • base62
  • base64
  • base64url: URL safe without padding
  • zlib2base64url: zlib then base64url, good for online diagram services such as PlantUML and Kroki
  • url
  • hex-base64
  • hex-base64url
  • base64-hex
  • base64url-hex

Crypto

Digest

digest("algorithm",$1)

Algorithms:

  • md5
  • sha256
  • sha512
  • bcrypt
  • murmur3
  • xxh32 or xxh64
  • blake3
  • crc32: checksum
  • adler32: checksum

crypto

  • hmac: hmac("HmacSHA256","your-secret-key", $1) or hmac("HmacSHA512","your-secret-key", $1)
  • encrypt: encrypt("aes-128-cbc", "Secret Text", "your_pass_key"), encrypt("aes-256-gcm", "Secret Text", "your_pass_key")
  • decrypt: decrypt("aes-128-cbc", "7b9c07a4903c9768ceeeb922bcb33448", "your_pass_key")

Parameters for encrypt and decrypt:

  • mode: encryption mode. Only aes-128-cbc, aes-256-cbc, aes-128-gcm and aes-256-gcm are supported for now.
  • plaintext: text that needs to be encrypted.
  • key: encryption key. 16 bytes (16 ASCII chars) for 128 and 32 bytes (32 ASCII chars) for 256.

JWT

Hmac signature: HS256, HS384, HS512:

  • jwt: jwt("HS256","your-secret-key", payload_map)
  • dejwt: dejwt("your-secret-key", token), return payload map.

RSA/ECDSA/EdDSA: RS256, RS384, RS512, ES256, ES384, EdDSA:

  • jwt: jwt("RS256", private_key_pem_text, payload_map)
  • dejwt: dejwt(public_key_pem_text, token)

JWK: RS256, RS384, RS512, ES256, ES384, ES512,

  • dejwt: dejwt("http://example.com/jwks.json#kid", token): please add kid as anchor.

Tips: you can use https://jwkset.com/generate to generate JWK json and keys PEM text.

KV

Key/Value Functions:

  • kv_get(namespace, key)
  • kv_put(namespace, key, text)
  • kv_delete(namespace, key)
  • kv_clear(namespace)

KV with SQLite

namespace is SQLite db name, and db path is $HOME/.awk/sqlite.db.

examples: kv_get("namespace1", "nick").

KV with Redis

namespace is a Redis URL: redis://localhost:6379/namespace, or redis://localhost:6379/0/namespace. The namespace part is the key name of a Hash data structure.

kv_get("redis://user:password@host:6379/db/namespace")

KV with NATS

namespace is a NATS URL: nats://localhost:4222/bucket_name. Please use nats kv add bucket_name to create the bucket.

kv_get("nats://localhost:4222/bucket_name/nick")

Network

HTTP

http_get(url,headers), http_post(url, body, headers).

You can omit headers if they are not required.

response array:

  • status: such as 200,404. 0 means network error.
  • text: response as text
  • HTTP header names: response headers, such as Content-Type

Attention: If body is JSON text that starts with { or [ and ends with } or ], Content-Type: application/json is added as an HTTP header by default.

email

send_mail(from, to, subject, body) by REST API, and to could be multiple emails separated by ,.

Environment variables for email sending:

smtp_send(smtp_url, from, to, subject, body): send email by SMTP

SMTP URL format:

  • SMTP basic: smtp://localhost:1025
  • SMTP with auth: smtp://user:password@host:25
  • SMTP + TLS smtps://user:password@host:465

S3

  • s3_get(bucket, object_name): get object, and return value is text.
  • s3_put(bucket, object_name, body): put object, and body is text

Environment variables for S3 access:

  • S3_ENDPOINT
  • S3_ACCESS_KEY_ID
  • S3_ACCESS_KEY_SECRET
  • S3_REGION

NATS

Publish events to NATS: publish("nats://host:4222/topic", body)

MQTT

Publish events to MQTT: publish("mqtt://servername:1883/topic", body)

  • CloudFlare Pub/Sub: mqtts://BROKER_TOKEN@YOUR-BROKER.YOUR-NAMESPACE.cloudflarepubsub.com/topic

local_ip

local_ip() # 192.168.1.3

Database

SQLite

url: sqlite.db db path

  • sqlite_query("sqlite.db", "select nick,email,age from user"): sqlite_query("sqlite.db", "select nick,email,age from user")[1]
  • sqlite_execute("sqlite.db", "update users set nick ='demo' where id = 1")

libSQL

libSQL url: ./demo.db, http://127.0.0.1:8080 or libsql://db-name-your-name.turso.io?authToken=xxxx.

  • libsql_query(url, "select id, email from users"),
  • libsql_execute(url,"update users set nick ='demo' where id = 1")

Tip: If you don't want to put authToken in url, for example libsql://db-name-your-name.turso.io, you can set up LIBSQL_AUTH_TOKEN environment variable.

PostgreSQL

url: postgresql://postgres:postgres@localhost/db_name

  • pg_query(url, "select id, name from people"),
  • pg_execute(url,"update users set nick ='demo' where id = 1")

MySQL

url: mysql://root:123456@localhost:3306/test

  • mysql_query(url, "select id, name from people"),
  • mysql_execute(url,"update users set nick ='demo' where id = 1")

Date Time

UTC by default.

functions:

Date time parse

Parse date time text to array: datetime(), datetime(1621530000)["year"], datetime("2020-02-02")["year"]. datetime text format:

date/time array:

  • year: 2024
  • month: 1, 2
  • monthday: 24
  • hour
  • minute
  • second
  • yearday
  • weekday
  • hour: 1-24
  • althour: 1-12

OS

  • whoami()
  • os()
  • arch()
  • os_family()
  • pwd()
  • user_home()

getenv

getenv() is a function equivalent to ENVIRON["NAME"] with a default value: getenv("NAME", "default value").

Attention: zawk reads the .env file and injects its entries as environment variables by default.
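For comparison, the same fallback behaviour can be approximated in plain POSIX awk with ENVIRON. This is only a sketch: getenv_or is a hypothetical helper, not part of zawk, and the variable names are made up for the example. How zawk's getenv treats a variable that is set but empty is not stated in this document.

```shell
# Sketch: look up an environment variable, falling back to a default.
getenv_demo() {
  ZAWK_DEMO_SET=hello awk '
    # getenv_or is a hypothetical helper that mimics getenv(name, default)
    function getenv_or(name, dflt) {
      return (name in ENVIRON) ? ENVIRON[name] : dflt
    }
    BEGIN {
      print getenv_or("ZAWK_DEMO_SET", "fallback")         # set for this run
      print getenv_or("ZAWK_DEMO_UNSET_VAR", "fallback")   # assumed to be unset
    }'
}
getenv_demo
```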

system2

system2(cmd) differs from system(cmd): it returns an array with code, stdout and stderr.

To capture the output of a command, you can use getline with a pipe:

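function get_output(command) {
    # getline returns 1 per line read, 0 at end of output and -1 on error
    i = 0
    while ((command | getline line) > 0) {
        lines[i++] = line
    }
    close(command)  # release the pipe so the command can be run again
    return lines
}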
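In POSIX awk, where a function cannot return an array, a portable variant of the getline-and-pipe approach fills an array supplied by the caller and returns the line count. The sketch below assumes the names getline_demo and out, and uses echo as a stand-in for a real command.

```shell
# Sketch: capture a command's output line by line into a caller-supplied array.
getline_demo() {
  awk '
    function get_output(command, lines,    n, line) {
      n = 0
      # "> 0" stops the loop on both end of output (0) and error (-1)
      while ((command | getline line) > 0) lines[++n] = line
      close(command)
      return n
    }
    BEGIN {
      n = get_output("echo a; echo b", out)
      print n, out[1], out[2]
    }'
}
getline_demo   # prints: 2 a b
```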

With new system2 function, you can get the output directly:

    result = system2("curl ifconfig.me")
    print result["stdout"]

Attention: If you don't need to capture the output, you can use the system function.

I/O

File

  • read file into text: read_all(file_path), read_all("https://example.com/text.gz")
  • write text into file: write_all(file_path, text). The file is replaced if it exists.

Tips: the read_all function uses OneIO; remote sources (https or ftp) and compression formats (gz, bz, lz, xz) are supported.

read_config

Read config file to StrStrMap: read_config("tests/demo.ini"). Now only *.ini and *.properties are supported.

Tips: zawk will load .env as environment variables automatically if it exists in the current directory.

getline

Please visit: https://www.gnu.org/software/gawk/manual/html_node/Getline.html and http://awk.freeshell.org/AllAboutGetline

Misc

Diagnose

  • dump: var_dump(name)
  • logging: log_debug(msg), log_info(), log_warn(), log_error()

Attention: dump/logging output is directed to stderr to avoid polluting stdout.

Reflection

zawk

  • version(): return zawk version

Credits

thanks to: