
Configure

Yann Barraud edited this page Mar 7, 2014 · 5 revisions

Parametrization

The plugin uses Duke's comparators. To get useful results, read Duke's documentation carefully and choose the comparison configuration that suits your data.

To tune the plugin, you can try out different configurations, written in JSON.

Request structure

Example

{
  "size" : 4,
  "query" : {
    "function_score" : {
      "query" : {
        "match_all" : { }
      },
      "script_score" : {
        "script" : "entity-resolution",
        "lang" : "native",
        "params" : {
          "entity" : {
            "fields" : [ {
              "field" : "city",
              "value" : "South",
              "cleaners" : [ {
                "name" : "no.priv.garshol.duke.cleaners.TrimCleaner"
              }, {
                "name" : "no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"
              } ],
              "high" : 0.95,
              "comparator" : {
                "name" : "no.priv.garshol.duke.comparators.JaroWinkler"
              },
              "low" : 0.1
            }, {
              "field" : "state",
              "value" : "ME",
              "cleaners" : [ {
                "name" : "no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"
              } ],
              "high" : 0.95,
              "comparator" : {
                "name" : "no.priv.garshol.duke.comparators.JaroWinkler"
              },
              "low" : 0.1
            }, {
              "field" : "population",
              "value" : "26000",
              "cleaners" : [ {
                "name" : "no.priv.garshol.duke.cleaners.DigitsOnlyCleaner"
              } ],
              "high" : 0.95,
              "comparator" : {
                "name" : "no.priv.garshol.duke.comparators.NumericComparator"
              },
              "low" : 0.1
            }, {
              "field" : "position",
              "value" : "43,70",
              "cleaners" : [ {
                "name" : "no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"
              } ],
              "high" : 0.95,
              "comparator" : {
                "name" : "no.priv.garshol.duke.comparators.GeopositionComparator",
                "params" : {
                  "max-distance" : "100"
                }
              },
              "low" : 0.1
            } ]
          }
        }
      },
      "boost_mode" : "replace"
    }
  }
}

Explanations

The first part of the query is a standard Elasticsearch function score query (size, query, ...). In it, you simply specify that you want to use the entity-resolution native script.

Then, in the script params, you pass a JSON array of fields. For each field you want to compare, you set the appropriate cleaners and comparator.
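When trying out several configurations, it can help to generate the request body instead of editing it by hand. The following is a minimal sketch that only reproduces the structure shown in the example above; the helper names `field_spec` and `entity_resolution_query` are hypothetical and are not part of the plugin.

```python
import json

DUKE = "no.priv.garshol.duke."  # package prefix used in the example request


def field_spec(field, value, comparator, cleaners=(),
               low=0.1, high=0.95, comparator_params=None):
    """Build one entry of the "fields" array (hypothetical helper)."""
    comp = {"name": comparator}
    if comparator_params:
        comp["params"] = dict(comparator_params)
    return {
        "field": field,
        "value": value,
        "cleaners": [{"name": name} for name in cleaners],
        "high": high,
        "comparator": comp,
        "low": low,
    }


def entity_resolution_query(fields, size=4):
    """Wrap the field specs in the function score query from the example."""
    return {
        "size": size,
        "query": {
            "function_score": {
                "query": {"match_all": {}},
                "script_score": {
                    "script": "entity-resolution",
                    "lang": "native",
                    "params": {"entity": {"fields": list(fields)}},
                },
                "boost_mode": "replace",
            }
        },
    }


body = entity_resolution_query([
    field_spec("city", "South", DUKE + "comparators.JaroWinkler",
               cleaners=[DUKE + "cleaners.TrimCleaner",
                         DUKE + "cleaners.LowerCaseNormalizeCleaner"]),
    field_spec("position", "43,70", DUKE + "comparators.GeopositionComparator",
               cleaners=[DUKE + "cleaners.LowerCaseNormalizeCleaner"],
               comparator_params={"max-distance": "100"}),
])

# The resulting JSON can be sent as the body of a search request.
print(json.dumps(body, indent=2))
```

Changing `low`, `high` or the comparator for one field then becomes a one-argument change, which makes it easier to compare configurations side by side.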

Field details
  • field is the field name you want to compare, within the results.
  • value is the value you want each result to be compared to.
  • cleaners is an array of cleaners. Each cleaner has a name, which is the fully qualified name of the class to use, and optionally params, structured as follows:
      • paramname is the name of the method to be called to set the parameter
      • value is the value to pass
  • comparator is the comparator to use. A comparator has a name, which is the fully qualified name of the class to use, and optionally params, structured in the same way:
      • paramname is the name of the method to be called to set the parameter
      • value is the value to pass
      • Example:
"comparator" : {
  "name" : "no.priv.garshol.duke.comparators.GeopositionComparator",
  "params" : {
    "max-distance" : "100"
  }
}
  • low is the lowest score the field gets if it does not match
  • high is the highest score the field gets if it matches
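
To make the role of low and high more concrete, here is a minimal sketch of how a comparator similarity (between 0 and 1) could be turned into a per-field score. It is modeled on the mapping Duke itself applies to a property; that the plugin uses exactly this mapping is an assumption, and the function name `field_probability` is hypothetical.

```python
def field_probability(similarity, low, high):
    """Map a comparator similarity in [0, 1] to a per-field score.

    Assumption: mirrors Duke's property comparison. A similarity below 0.5
    counts as "no match" and yields `low`; otherwise the score grows from
    0.5 towards `high` with the square of the similarity.
    """
    if similarity >= 0.5:
        return (high - 0.5) * similarity ** 2 + 0.5
    return low


# With the values from the example request (low = 0.1, high = 0.95):
exact = field_probability(1.0, low=0.1, high=0.95)     # identical values
partial = field_probability(0.8, low=0.1, high=0.95)   # close but not equal
mismatch = field_probability(0.2, low=0.1, high=0.95)  # clearly different
print(exact, partial, mismatch)
```

Under this assumption, high is only reached for identical values, and any similarity below 0.5 falls straight to low.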
Tuning comparison

To get the most out of the comparison, here are a few things you should know:

  • A missing field is ignored (i.e. it gets a score of 0.5, which does not affect the final result)
  • You can balance the weight of each field in the final result by tuning its high and low values
  • A low value of 0 for a field automatically discards the current document if that field does not match
  • A high value of 1 for a field sets the matching probability of the current document to 1 if that field matches
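
The behaviour listed above is what you would expect if the per-field scores are combined with a naive Bayes rule, as Duke does. The sketch below assumes that rule; the function names `combine` and `match_probability` are hypothetical and do not come from the plugin.

```python
from functools import reduce


def combine(p1, p2):
    """Combine two probabilities with a naive Bayes rule (assumed here)."""
    numerator = p1 * p2
    denominator = numerator + (1.0 - p1) * (1.0 - p2)
    if denominator == 0.0:
        # One field says "certainly a match" (1), another "certainly not" (0).
        raise ValueError("contradictory evidence: scores of 0 and 1 combined")
    return numerator / denominator


def match_probability(field_scores):
    """Fold all per-field scores together, starting from the neutral 0.5."""
    return reduce(combine, field_scores, 0.5)


# A score of 0.5 (missing field) leaves the result unchanged.
neutral = match_probability([0.8, 0.5])
# A score of 0 (low = 0 on a non-matching field) discards the document.
discarded = match_probability([0.0, 0.95, 0.95])
# A score of 1 (high = 1 on a matching field) forces the probability to 1.
forced = match_probability([1.0, 0.1])
# Two agreeing fields reinforce each other beyond their individual scores.
reinforced = match_probability([0.95, 0.95])
print(neutral, discarded, forced, reinforced)
```

Under this rule, the further a field's low and high values are from 0.5, the more that field weighs in the final result, which is why low and high act as per-field weights.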