NativeExtractor is a powerful tool that analyzes plaintext and extracts entities from it. Main features of NativeExtractor are:
- High performance (multithreaded by default).
- Support for practically unlimited plaintext inputs (2^48 bytes is the technical maximum).
- Highly optimized implementation of Patty trie. Includes saving to and loading from an mmapped file.
- Module system and its dynamic loading.
- RegExp compilation to native C .so modules - fastest RegExp you have ever seen!
- Fast glob pattern recognition.
- Bindings to Node.js and Python included.
- NativeExtractor
- Table of Contents
- Getting started
- Folder structure
- Programming style
- Extractor
- Miners
- Glob miner
- Native RegExps
- Patty trie
- Instant Examples
- Other language bindings
- Contributing
- Special thanks
The compilation process is fully tested on Ubuntu 18.04 and 20.04; other distros should work without complications. NativeExtractor is expected to run on BSDs as well. macOS is not supported.
These are general requirements:
- A GNU/Linux distro or Docker
- Makefiles
- glib2.0 + development packages
- python 2.7 + development package (python 3.0 planned soon)
- node.js >=13 + development packages (optional)
Dependencies installation on Ubuntu:
sudo apt install build-essential libglib2.0-dev libpython2.7-dev libcmocka-dev
make build
make install
Note that the install script also installs headers into your /usr/include.
You can simply use pkg-config to generate gcc/clang flags:
gcc main.c `pkg-config nativeextractor --libs --cflags` -o netest.bin
Declare the following includes in your main.c file for full functionality, or see the docs for individual headers.
#include <nativeextractor/extractor.h>
#include <nativeextractor/miner.h>
#include <nativeextractor/ner.h>
#include <nativeextractor/occurrence.h>
#include <nativeextractor/patricia.h>
#include <nativeextractor/stream.h>
#include <nativeextractor/unicode.h>
#include <nativeextractor/regex_generator.h>
#include <nativeextractor/finite_automaton.h>
#include <nativeextractor/terminal.h>
...
NativeExtractor/
|-- build/
| |-- debug/ - Binaries built with config=debug (default)
| | `-- lib/ - Built entity miners (*.so)
| |-- release/ - Binaries built with config=release
| | `-- lib/ - Built entity miners (*.so)
|-- include/
| `-- nativeextractor/ - Header files
|-- src/ - Source files
| |-- miners/ - Place source code of built-in entity miners here
| `-- example/ - Programmer-friendly examples to understand NativeExtractor basics
`-- Makefile
- Exclusive usage of standard C11.
- Indentation style in BSD KNF.
- Block indentation uses 2 spaces.
- Compilation should not produce warnings.
- Every published function, struct etc. should be documented.
NativeExtractor code is object-oriented, although it is implemented in standard C. We use structs whose methods are stored as function-pointer members. For example:
#include <stdio.h> // printf
#include <nativeextractor/common.h> // Includes macro ALLOC
/* C-style class definition. Names should end with _c postfix. */
typedef struct dog_c {
unsigned weight;
unsigned size;
void (*woof)(struct dog_c* self);
} dog_c;
/* Implements method woof: */
void dog_c_woof(dog_c * self) {
printf("I am unnamed dog, my weight is: %u and I am of size: %u\n",
self->weight, self->size);
}
/* Initializes an instance of dog_c. */
void dog_c_init(dog_c * self, unsigned weight, unsigned size) {
self->weight = weight;
self->size = size;
self->woof = dog_c_woof;
}
/* Creates a new instance of dog_c. */
dog_c * dog_c_new(unsigned weight, unsigned size) {
dog_c * self = ALLOC(dog_c);
dog_c_init(self, weight, size);
return self;
}
dog_c * generic_dog = dog_c_new(50, 35);
generic_dog->woof(generic_dog); // Prints "I am unnamed dog, my weight is: 50..."
/* This class inherits from dog_c: */
typedef struct named_dog_c {
dog_c dog; // Base class (first member, so a cast to dog_c* is valid)
const char * name;
void (*woof)(struct named_dog_c* self);
} named_dog_c;
/* Re-implement method woof: */
void named_dog_c_woof(named_dog_c * self) {
printf("My name is %s, my weight is: %u and I am of size: %u\n",
self->name, self->dog.weight, self->dog.size);
}
/* Initializes an instance of named_dog_c. */
void named_dog_c_init(named_dog_c * self, unsigned weight, unsigned size, const char * name) {
dog_c_init(&self->dog, weight, size); // Initialize base class
self->name = name;
self->woof = named_dog_c_woof;
}
/* Creates a new instance of named_dog_c. */
named_dog_c * named_dog_c_new(unsigned weight, unsigned size, const char * name) {
named_dog_c * self = ALLOC(named_dog_c);
named_dog_c_init(self, weight, size, name);
return self;
}
named_dog_c * named_dog = named_dog_c_new(50, 35, "Killer");
/* Now you can call re-implemented method woof: */
named_dog->woof(named_dog); // Prints "My name is Killer, ..."
/* Or method woof inherited from the base class: */
named_dog->dog.woof((dog_c*)named_dog); // Prints "I am unnamed dog, ..."
/* Or a bit longer way: */
((dog_c*)named_dog)->woof((dog_c*)named_dog); // Prints "I am unnamed dog, ..."
Extractor (extractor_c) is composed of these items:
- Stream - an instance of stream_c created from a file or from (mapped) memory.
- Marker - something like a Turing-machine head operating on the stream.
- List of miners - miners are small programs that accept a position in the stream. Every miner has its own instance of a head for each invocation. Miners can move their head to the left or to the right and set markers.
- Threads - every thread can be occupied by some number of miners. By default, the number of threads is equal to the number of logical CPU cores.
Extractor performs these operations (when calling the next() method):
- Executes each miner for the current position in the stream (in threads). A miner returns an instance of occurrence_t if a match is found at the given position.
- Accumulates found matches into a NULL-terminated array.
- Moves the head to the next UTF-8 character (from left to right) in the stream.
- Repeats these operations until it reaches the end of the stream or the maximum size of a bulk.
An extractor may have these flags enabled:
- E_SORT_RESULTS - Sorts returned occurrences by position (ascending) and length (descending).
- E_NO_ENCLOSED_OCCURRENCES - Filters out any enclosed occurrences.
  - An occurrence A is enclosed in occurrence B if A.start >= B.start and A.end <= B.end.
  - For example:
    - A: {start: 3, end: 6}
    - B: {start: 1, end: 9}
    - C: {start: 1, end: 7}
    - D: {start: 3, end: 9}
  - Or as ASCII-art:
        A:   |--|
        B: |-------|
        C: |-----|
        D:   |-----|
  - A, C and D are all enclosed in B.
  - Therefore, only B is returned.
To set or unset flags for an extractor, use the set_flags and unset_flags methods. For example:
extractor_c *ex = extractor_c_new(...);
ex->set_flags(ex, E_SORT_RESULTS | E_NO_ENCLOSED_OCCURRENCES);
// extractor now sorts results and discards enclosed occurrences
ex->unset_flags(ex, E_NO_ENCLOSED_OCCURRENCES);
// extractor no longer discards enclosed occurrences
Each miner consists of at least two functions - one for creating its instances and one for matching occurrences.
When writing a miner you have two options - either you can include it directly in your main program, or you can create your own .so modules containing one or more miners and load them dynamically. Benefits of .so modules are:
- Dynamic loading at runtime.
- Miners are built independently of the extractor routines.
- Programmatic miner creation with native efficiency.
- Usage in multiple binaries at a time.
If you decide to use dynamic libraries, all miner source files must be placed in the src/miners directory. One source file can contain definitions of multiple miners - if it makes sense. For example, miners of credit card numbers from Visa, MasterCard etc. could be put into a single card_entities.c source file.
The following is an example of a simple miner which finds all occurrences of "hello".
// hello.c
#include <nativeextractor/miner.h>
#include <nativeextractor/unicode.h>
/** Matches string "hello". */
static occurrence_t* match_hello_impl(miner_c* m) {
if (!m->mark_start(m)) return NULL;
if (!m->match_string(m, "hello", Right)) return NULL;
if (!m->mark_end(m)) return NULL;
return m->make_occurrence(m, 1.0);
}
/** Returns a miner which matches a string "hello". */
miner_c* match_hello() {
return miner_c_create("Hello", NULL, match_hello_impl);
}
/**
* Metainfo for the shared library.
* Format: [ "minerfn1", "label1", "minerfn2", "label2", ..., NULL ]
*/
const char* meta[] = {
"match_hello", "Hello",
NULL
};
For more info, have a look at the miner.h file, where you can find documentation of all available miner methods.
To build a single miner directly from NativeExtractor, use the following command:
make miner name=source (config=debug/release)
Here, source is the name of the source file without the extension (.c). The built miner can be found at build/<config>/<source>.so.
To build all existing miners, use:
make all-miners (config=debug/release)
To build miners from outside of NativeExtractor, you have to run make install first and then run:
gcc -O0 -g3 -DDEBUG `pkg-config --cflags glib-2.0 nativeextractor` \
`pkg-config --libs glib-2.0 nativeextractor` \
-shared -fPIC \
yourminer.c -o yourminer.so
Sometimes you would like to have configurable miners, especially when you want many similar miners that differ only in some parameter. We use this, for example, for the glob miner, where the glob expression is passed as a parameter. Look at the following fragment of the glob miner code.
/* This miner entrypoint accepts glob as a parameter: */
miner_c* match_glob(const char* glob) {
if (!is_glob(glob)) {
fprintf(stderr, "'%s' is not a syntactically correct glob!\n", glob);
return NULL;
}
/* create miner with data set to `glob` */
return miner_c_create("Glob", glob, match_glob_impl);
}
And its usage:
extractor->add_miner_so(extractor, "glob_entities.so", "match_glob", "hell*");
When you've built your miner, you can test it by adding it into the src/main.c file like so:
// main.c
#include <nativeextractor/common.h>
#include <nativeextractor/extractor.h>
int main(int argc, char ** argv) {
if (argc <= 1) {
printf("Fullpath not specified!\n");
return EXIT_FAILURE;
}
miner_c ** miners = calloc(1, sizeof(miner_c*));
extractor_c * e = extractor_c_new(1, miners);
// Add your miner here...
e->add_miner_so(e, "./build/debug/lib/hello.so", "match_hello", NULL);
stream_file_c * sfc = stream_file_c_new(argv[1]);
e->set_stream(e, (stream_c*)sfc);
while (!((e->stream->state_flags) & STREAM_EOF)) {
occurrence_t ** res = e->next(e, 1000000);
occurrence_t ** pres = res;
while (*pres) {
print_pos(*pres);
free(*pres);
++pres;
}
free(res);
}
DESTROY(e);
return EXIT_SUCCESS;
}
and then running:
make build && ./build/debug/NativeExtractor file.txt
NativeExtractor includes an implementation of a glob miner. Our glob miner supports matching of:
- Exact strings - matches only exactly equal strings, e.g. glob hello matches only the string hello and nothing else.
- Wildcard operator ? - ? matches any single character, e.g. hell? matches hello, but also hella, hellz, hell6 and any other single character at the end.
- Wildcard operator * - * works similarly to ?, except it matches any number of characters, including 0. Glob nat*tor matches nativeextractor for example, but also just nattor.
- Character sets [] - e.g. [abc] matches character a, b, or c.
- Character ranges x-y - are allowed within [] sets and stand for all characters in the specified range. E.g. [a-c] would match character a, b or c.
extractor_c * ex;
// Common extractor_c initialization code here...
/*
Add glob_entities to the extractor ex. Note that the path to
glob_entities.so may vary. This glob matches all strings ending
with .exe.
*/
ex->add_miner_so(ex, "./build/release/lib/glob_entities.so",
"match_glob", "*.exe");
/* Add glob *.bat to the extractor */
ex->add_miner_so(ex, "./build/release/lib/glob_entities.so",
"match_glob", "*.bat");
/* Add glob *.ini to the extractor */
ex->add_miner_so(ex, "./build/release/lib/glob_entities.so",
"match_glob", "*.ini");
The glob miner interprets globs on the fly using a backtracking technique, so performance may be somewhat worse than with native RegExps, but its main benefit is that you don't need to rebuild the .so module. For maximal performance, it should be possible to implement compilation of globs into native RegExps.
NativeExtractor implements fast RegExp matching - every RegExp is compiled into a native miner and then loaded, resulting in very high performance. This approach is especially useful when the input file is large or when a large number of input files is expected. Compiling and loading the .so takes only a moment, and the module can then be used an unlimited number of times.
You can specify these environment variables:
- CC - compiler, by default gcc is used.
- REGEX_HEADER_FILES - path to header files, by default "./src".
- REGEX_BUILD_PATH - target of output binaries, by default "/tmp".
// Compile RegExp into native miner re_email
regex_t * re_email = regex_compile("[^@ \\t\\r\\n]+@[^@ \\t\\r\\n]+\\.[^@ \\t\\r\\n]+", "simple_email", "EMAIL");
// Typical check
if (re_email->state) {
printf("Compilation terminated with these errors\n");
/* We use glib library here */
GList * errors = re_email->errors;
for (; errors != NULL; errors = errors->next) {
printf("%s\n", (const char *)errors->data); // never pass data as the format string
}
exit(-1);
}
// Create a RegExp module.
regex_module_c * g_module = regex_module_c_new("testative", NULL);
// Add previously compiled RegExp re_email to regex module
g_module->add_regex(g_module, re_email);
// Build module and check its result
if (!g_module->build(g_module)) {
/* handle errors here - errors are placed in ((GList*)g_module->errors) */
exit(-1);
}
// Standard Initialization of extractor
miner_c ** miners = calloc(1, sizeof(miner_c*));
extractor_c * g_e = extractor_c_new(-1, miners);
// Load module as miner to the extractor g_e
g_module->load(g_module, g_e);
// Open a file as stream
stream_file_c * strf = stream_file_c_new("/tmp/some_file.txt");
// Set the stream to the extractor (must go after miner addition!)
g_e->set_stream(g_e, (stream_c*)strf);
// Number of matches found so far
unsigned count = 0;
// Find all occurrences of email until EOF is reached
while (!((g_e->stream->state_flags) & STREAM_EOF)) {
// Analyze up to 1000 unicode chars and store results into res
occurrence_t ** res = g_e->next(g_e, 1000);
occurrence_t ** pres = res;
// Iterate over results and free
while (*pres) {
// Print match
print_pos(*pres);
free(*pres);
++pres;
++count;
}
free(res);
}
Patty trie is a highly optimized variant of a Radix tree. We define Patty trie as a Radix tree with the count of edges limited by the number of unicode characters. Patty trie works on UTF-8. The main properties of Patty trie are:
- Height - h of the trie is the length of the longest string.
- Width - w of the trie is at most the number of unicode chars.
- Search operation runs in O(log(n)).
- Insert operation runs in O(log(n)).
- Patty trie can be serialized into a file.
- Serialized files can be loaded on the fly with Mmap, using all its benefits.
- Implemented fast prefix matching and lookup.
patricia_c* index = patricia_c_create(NULL);
index->insert(index, "Patrick", 0, 7);
index->insert(index, "Michael", 0, 7);
index->insert(index, "Paul", 0, 4);
uint32_t ret = index->search(index, "Patrick", 7);
if(ret == 7) {
printf("Patrick FOUND\n");
}
index->save(index, "names.patty");
Programmer-friendly examples are included in the src/example directory and are fully documented. Build them with make examples. The built examples are located in build/debug/ and must be run from the project's root (.) directory.
We offer these examples:
- ngrep - a native grep tool that compiles regexps to native code and executes them on a given file.
- glob - interprets a glob on a given file.
- naive_email_miner - creates a simple miner able to extract a subset of RFC-defined email addresses. It is built both as a simple console application and as a loadable .so module.
NativeExtractor is produced by SpongeData.
This project adheres to a universal code of conduct. By participating, you are expected to uphold this code.
Before you send a pull request, search for previous discussions about the same feature or issue.
When contributing code, make sure the build process is functional and follow the Programming style.
Before sending a pull request, be sure to have tests for meaningful cases (new features, bugs, ...).
We thank the creator of the NativeExtractor logo - Adam Říha from Gaupi.
We also thank our code contributors.