[Proposal] Rarity score from RegEx #238
Comments
This is cool! Is [...] If you do, thanks so much for this, this is absolutely great 🔥 A non-subjective, formal way to define rarity would be absolutely amazing :)
It seems promising, but it's just a prototype and needs a lot of work.
By the way, you can play with the script in Google Colab.

import sre_parse  # deprecated since Python 3.11, but still importable
import string

# calculate score from literal string (e.g. prefix, suffix)
# use weight metric (*_score) parameters
class RegExScore():
    def __init__(self,
                 repeat_score=0.01,  # score for quantifier `{0,}`
                 in_score=0.1,  # score for character set `[*]`
                 ascii_score=1.0,  # score for a fixed ascii letter `a-zA-Z`
                 digit_score=0.2,  # score for a fixed digit `0-9`
                 literal_default_score=0.01,  # score for any other literal (e.g. whitespace)
                 debug=False  # print the debug message
                 ):
        self.repeat_score = repeat_score
        self.in_score = in_score
        self.ascii_score = ascii_score
        self.digit_score = digit_score
        self.literal_default_score = literal_default_score
        self.debug = debug

    def calculate(self, regexp: str):
        return self.token_score(sre_parse.parse(regexp))

    def token_score(self, tokens):
        score = 0
        for _token in tokens:
            if self.debug:
                print("Loop: ", _token)
            # add the score from subpattern `()`
            if _token[0] == sre_parse.SUBPATTERN:
                _, _, _, child = _token[1]
                if self.debug:
                    print(_token[0], len(child))
                score += self.token_score(child)
            # add score from quantifier `{min,max}`
            elif _token[0] == sre_parse.MAX_REPEAT:
                _min, _max, child = _token[1]
                # The original expression `_min + 0 if ... else _max` is parsed by
                # Python as `(_min + 0) if ... else _max`: an unbounded repeat is
                # weighted by _min, a bounded one by _max. Made explicit here.
                repeat_count = _min if _max == sre_parse.MAXREPEAT else _max
                _score = self.repeat_score * repeat_count
                if self.debug:
                    print('\tscore:', _score)
                score += _score + self.token_score(child)
            # add score from mean of branch group `A|B|C|D`
            elif _token[0] == sre_parse.BRANCH:
                _, branch = _token[1]
                if self.debug:
                    print('\tbranch:', len(branch))
                sub_score = 0
                for child in branch:
                    sub_score += self.token_score(child)
                score += sub_score / float(len(branch))
            # add score from character set `[]`
            elif _token[0] == sre_parse.IN:
                if self.debug:
                    print('\tscore:', self.in_score)
                score += self.in_score
            # add score from fixed literal
            elif _token[0] == sre_parse.LITERAL:
                literal = chr(_token[1])
                if self.debug:
                    print('\tchr:', literal)
                if literal in string.ascii_letters:
                    score += self.ascii_score
                elif literal in string.digits:
                    score += self.digit_score
                else:
                    score += self.literal_default_score
        return score

Feel free to comment or suggest your thoughts.
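For readers unfamiliar with the parser output that `token_score` walks, here is a minimal sketch of what the token stream looks like. The helper names `top_level_ops` and `repeat_bounds` are hypothetical and only for illustration. The sketch also assumes that on Python 3.11+ the same parser is reachable as the private module `re._parser`, and falls back to `sre_parse` on older versions.

```python
try:
    # Assumption: Python 3.11+ exposes the parser as a private module.
    from re import _parser as sre_parse
except ImportError:
    # Python 3.10 and earlier.
    import sre_parse


def top_level_ops(pattern):
    """Return the opcodes of the top-level tokens of a pattern (hypothetical helper)."""
    return [op for op, _ in sre_parse.parse(pattern)]


def repeat_bounds(pattern):
    """Return (min, max) of the first top-level greedy quantifier (hypothetical helper)."""
    for op, arg in sre_parse.parse(pattern):
        if op == sre_parse.MAX_REPEAT:
            lo, hi, _child = arg
            return lo, hi
    return None


# Each token is an (opcode, argument) pair; nested groups, branches and
# quantifiers carry their children inside the argument.
for op, arg in sre_parse.parse(r"(ab|cd)x[0-9]{2,}"):
    print(op, arg)
```

An unbounded quantifier such as `{2,}` reports `sre_parse.MAXREPEAT` as its maximum, which is the value the prototype checks for when weighting repeats.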
Is your feature request related to a problem? Please describe.
Rarity is used to sort PyWhat's results.
However, I feel it's a subjective value with no formal way to define it.
Currently, I just look at the rarity of neighboring RegExps, use my own gut to decide, and wait for someone to reject or confirm it.
This is the current definition of rarity on the wiki page
I need some way to calculate the rarity of RegExps or tokens.
Describe the solution you'd like
A deterministic way to calculate rarity
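As a rough illustration of what "deterministic" could mean here, the sketch below condenses the scoring rules of the prototype in this thread into a pure function: the same pattern always yields the same number. The names `WEIGHTS`, `rarity_score`, `_score_tokens` and `_literal_weight` are hypothetical, the weights are simply the prototype's defaults, and the import fallback assumes the parser is available as the private `re._parser` on Python 3.11+. The result is a raw score; how it would map onto PyWhat's rarity values is not addressed here.

```python
import string

try:
    # Assumption: Python 3.11+ exposes the parser as a private module.
    from re import _parser as sre_parse
except ImportError:
    import sre_parse  # Python 3.10 and earlier

# Weights copied from the prototype's default parameters.
WEIGHTS = {"repeat": 0.01, "in": 0.1, "ascii": 1.0, "digit": 0.2, "other": 0.01}


def _literal_weight(code):
    """Weight of a single fixed character, given its code point."""
    ch = chr(code)
    if ch in string.ascii_letters:
        return WEIGHTS["ascii"]
    if ch in string.digits:
        return WEIGHTS["digit"]
    return WEIGHTS["other"]


def _score_tokens(tokens):
    """Recursively sum the weights of a parsed token sequence."""
    total = 0.0
    for op, arg in tokens:
        if op == sre_parse.SUBPATTERN:
            # The group's content is the last element of the argument.
            total += _score_tokens(arg[-1])
        elif op == sre_parse.MAX_REPEAT:
            lo, hi, child = arg
            # Same behaviour as the prototype: min if unbounded, else max.
            count = lo if hi == sre_parse.MAXREPEAT else hi
            total += WEIGHTS["repeat"] * count + _score_tokens(child)
        elif op == sre_parse.BRANCH:
            branches = arg[1]
            # Alternatives contribute their mean score.
            total += sum(_score_tokens(b) for b in branches) / len(branches)
        elif op == sre_parse.IN:
            total += WEIGHTS["in"]
        elif op == sre_parse.LITERAL:
            total += _literal_weight(arg)
    return total


def rarity_score(pattern):
    """Deterministic raw score for a regular expression string."""
    return _score_tokens(sre_parse.parse(pattern))


for pattern in ("abc", r"[a-z]+", r"\d{4}"):
    print(pattern, round(rarity_score(pattern), 4))
```

With these weights, fixed ASCII letters dominate the score, so a pattern made of literal letters ranks well above a loose character class with a quantifier. Whether that ordering matches the intended notion of rarity is a tuning question for the weights.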
Describe alternatives you've considered
N/A (I can't think of any alternatives)
Additional context