Configure
Yann Barraud edited this page Mar 7, 2014
·
5 revisions
The plugin uses Duke's comparators. To get useful results, read Duke's documentation carefully and choose the comparison configuration that suits your data.
To tune the plugin, you can try out various configurations, expressed in JSON:
{
  "size" : 4,
  "query" : {
    "function_score" : {
      "query" : {
        "match_all" : { }
      },
      "script_score" : {
        "script" : "entity-resolution",
        "lang" : "native",
        "params" : {
          "entity" : {
            "fields" : [ {
              "field" : "city",
              "value" : "South",
              "cleaners" : [ {
                "name" : "no.priv.garshol.duke.cleaners.TrimCleaner"
              }, {
                "name" : "no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"
              } ],
              "high" : 0.95,
              "comparator" : {
                "name" : "no.priv.garshol.duke.comparators.JaroWinkler"
              },
              "low" : 0.1
            }, {
              "field" : "state",
              "value" : "ME",
              "cleaners" : [ {
                "name" : "no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"
              } ],
              "high" : 0.95,
              "comparator" : {
                "name" : "no.priv.garshol.duke.comparators.JaroWinkler"
              },
              "low" : 0.1
            }, {
              "field" : "population",
              "value" : "26000",
              "cleaners" : [ {
                "name" : "no.priv.garshol.duke.cleaners.DigitsOnlyCleaner"
              } ],
              "high" : 0.95,
              "comparator" : {
                "name" : "no.priv.garshol.duke.comparators.NumericComparator"
              },
              "low" : 0.1
            }, {
              "field" : "position",
              "value" : "43,70",
              "cleaners" : [ {
                "name" : "no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"
              } ],
              "high" : 0.95,
              "comparator" : {
                "name" : "no.priv.garshol.duke.comparators.GeopositionComparator",
                "params" : {
                  "max-distance" : "100"
                }
              },
              "low" : 0.1
            } ]
          }
        }
      },
      "boost_mode" : "replace"
    }
  }
}
The first part of the query is a standard Elasticsearch function score query (size, query, ...). The only plugin-specific part is the script_score section, where you state that you want to use the entity-resolution native script.
You then give the parameters for each field you want to compare, as a JSON array of fields. For each field, the parameters select the right cleaners and comparator and set the scores:
- `field` is the name of the field you want to compare, within the results.
- `value` is the value you want each result to be compared to.
- `cleaners` is an array of cleaners. Each cleaner has a `name`, which is the fully qualified name of the class to use, and optionally `params`, structured as follows:
  - `paramname` is the name of the method to be called to set the param
  - `value` is the value to pass
- `comparator` is the comparator to use. A comparator has a `name`, which is the fully qualified name of the class to use, and optionally `params`, structured as follows:
  - `paramname` is the name of the method to be called to set the param
  - `value` is the value to pass

  Example:

      "comparator" : {
        "name" : "no.priv.garshol.duke.comparators.GeopositionComparator",
        "params" : {
          "max-distance" : "100"
        }
      }

- `low` is the lowest score the field gets if it does not match
- `high` is the highest score the field gets if it matches
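
When many configurations have to be tested, writing the JSON by hand gets tedious. The field parameters described above can also be assembled programmatically. The following is a minimal Python sketch, not part of the plugin: the helper names `field_spec` and `entity_resolution_query` are hypothetical, the default `low` and `high` values are simply those of the sample query, and cleaner `params` are left out for brevity.

```python
import json

DUKE = "no.priv.garshol.duke"  # package prefix taken from the sample query


def field_spec(field, value, comparator, cleaners=(), low=0.1, high=0.95,
               comparator_params=None):
    """Build one entry of the "fields" array (hypothetical helper)."""
    comp = {"name": comparator}
    if comparator_params:
        # e.g. {"max-distance": "100"} for GeopositionComparator
        comp["params"] = dict(comparator_params)
    return {
        "field": field,
        "value": value,
        "cleaners": [{"name": name} for name in cleaners],
        "high": high,
        "comparator": comp,
        "low": low,
    }


def entity_resolution_query(fields, size=4):
    """Wrap the field entries in the function score query shown above."""
    return {
        "size": size,
        "query": {
            "function_score": {
                "query": {"match_all": {}},
                "script_score": {
                    "script": "entity-resolution",
                    "lang": "native",
                    "params": {"entity": {"fields": list(fields)}},
                },
                "boost_mode": "replace",
            }
        },
    }


body = entity_resolution_query([
    field_spec("city", "South", f"{DUKE}.comparators.JaroWinkler",
               cleaners=[f"{DUKE}.cleaners.TrimCleaner",
                         f"{DUKE}.cleaners.LowerCaseNormalizeCleaner"]),
    field_spec("position", "43,70", f"{DUKE}.comparators.GeopositionComparator",
               cleaners=[f"{DUKE}.cleaners.LowerCaseNormalizeCleaner"],
               comparator_params={"max-distance": "100"}),
])

# The resulting JSON can be sent as the body of a regular search request.
print(json.dumps(body, indent=2))
```

Keeping the per-field defaults in one place makes it easier to vary a single `low` or `high` value between test runs and compare the rankings.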
To get the most out of the plugin, here are a few things you should know:
- A missing field is ignored (i.e. it gets a score of 0.5, which does not affect the final result)
- You can balance the weight of each field in the final result by tuning its high and low values
- A low value of 0 for one field will automatically discard the current document if that field does not match
- A high value of 1 for one field will set the probability of matching to 1 for the current document if that field matches
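
These four rules are what one would expect if the per-field scores are combined with the naive Bayes rule that Duke uses. The sketch below is a hedged illustration in Python, assuming the plugin follows Duke here; this page does not confirm it. The function names are hypothetical, and the mapping from comparator similarity to a score in `field_probability` is likewise an assumption modelled on Duke's behaviour.

```python
from functools import reduce


def field_probability(similarity, low, high):
    """Map a comparator similarity in [0, 1] to a per-field score.

    Assumption: similarities below 0.5 count as a non-match and yield `low`;
    otherwise the score grows from 0.5 towards `high`.
    """
    if similarity >= 0.5:
        return (high - 0.5) * similarity ** 2 + 0.5
    return low


def combine(p1, p2):
    """Naive Bayes combination of two match probabilities."""
    numerator = p1 * p2
    denominator = numerator + (1.0 - p1) * (1.0 - p2)
    if denominator == 0.0:
        # One score is 0 and the other is 1: the evidence is contradictory.
        raise ValueError("cannot combine a score of 0 with a score of 1")
    return numerator / denominator


def match_probability(scores):
    """Fold all per-field scores together, starting from the neutral 0.5."""
    return reduce(combine, scores, 0.5)


print(match_probability([0.95, 0.5]))  # second field missing
print(match_probability([0.95, 0.0]))  # one field with low = 0 did not match
print(match_probability([0.7, 1.0]))   # one field with high = 1 matched
```

Under these assumptions, a score of 0.5 leaves the result unchanged, a single 0 drives it to 0, and a single 1 drives it to 1, which is consistent with the behaviour listed above. Values of `high` closer to 1 and values of `low` closer to 0 therefore give a field more influence on the final result.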