Skip to content

retree is regular-expression-tree, which supports quickly and concurrently matching of lots of regex patterns.

License

Notifications You must be signed in to change notification settings

sisyphsu/retree-java

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

75 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

retree-java

Travis CI codecov

Introduce

retree could parse and combine lots of regular expressions as an regular-expression-tree, which is very similar to trie.

First, retree would parse regex as a node-chain, for example:

  • \d+a\W will be parsed as CurlyNode(\d+) -> CharNode(a) -> CharNode(\W),
  • \d+\s\W will be parsed as CurlyNode(\d+) -> CharNode(\s) -> CharNode(\W).

After that, retree will merge those two node-chain as one node-tree (there should have an image to explain how it looks)

When performing multiple regex matching, we could use retree to reduce useless scan and loop, and avoid lots of backtracking.

Maven Dependency

Add maven dependency:

<dependency>
    <groupId>com.github.sisyphsu</groupId>
    <artifactId>retree</artifactId>
    <version>1.0.4</version>
</dependency>

Usage

This example shows how retree works, it's very similar to java.util.regex:

String[] res = {"(\\d{4}-\\d{1,2}-\\d{1,2})", "<b>(?<name>.*)</b>", "(\\w+@\\w+\\.[a-z]+(\\.[a-z]+)?)"};
String input = "Today is 2019-09-05, from <b>sulin</b> (sisyphsu@gmail.com).";
ReMatcher matcher = new ReMatcher(new ReTree(res), input);

assert matcher.find();
assert "2019-09-05".contentEquals(matcher.group());

assert matcher.find();
assert "<b>sulin</b>".contentEquals(matcher.group());
assert "sulin".contentEquals(matcher.group("name"));

assert matcher.find();
assert "sisyphsu@gmail.com".contentEquals(matcher.group());

In this example, we only need to scan input once to complete three different regular expressions' matching:

  • (\d{4}-\d{1,2}-\d{1,2})
  • <b>(?<name>.*)</b>
  • (\\w+@\\w+\\.[a-z]+(\\.[a-z]+)?)

Showcase

dateparser

dateparser is a smart and high-performance date parser library, it supports hundreds of different format, nearly all format that we used.

dateparser use retree to perform the matching operation of lots of different date patterns.

Even if dateparser have thundreds of predefined regular expressions, it still can parse date very fast(1000~1500ns).

Performance & Benchmark

TODO

Multi-Language Support

I will transplant this library to golang and javascript in nearly future.

License

Apache-2.0

About

retree is regular-expression-tree, which supports quickly and concurrently matching of lots of regex patterns.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published