Joseph Martell edited this page Nov 1, 2018 · 11 revisions

Introduction

All data to be deduplicated or linked by Duke must come from a data source. A data source is a class which implements the DataSource interface, which consists of just a single method. Below we describe all the currently supported data sources, but it is also easy to implement and plug in your own data source.

Note that you can have any number of data sources in the same XML configuration file.
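For example, a single configuration could combine a CSV file and a database table (the file name and connection details here are hypothetical):

  <csv>
    <param name="input-file" value="customers.csv"/>
    ...
  </csv>

  <jdbc>
    <param name="connection-string" value="jdbc:..."/>
    ...
  </jdbc>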

Note that if you use ElasticSearch there is a separate project providing a plug-in for deduplicating data in ElasticSearch with Duke.

The built-in data sources

All the data sources are implemented by classes in the no.priv.garshol.duke package.

CSV data source

This source can read CSV files. Below is an example configuration:

<csv>
  <param name="input-file" value="customers.csv"/>

  <column name="ID"/>
  ...
</csv>

Unless header-line is set to false, this source assumes that the first line not skipped is a row of column names. These names can be used by the column references inside the csv element.

Columns can also be referred to by their indexes, with the first column referred to as 1. If there is no header line, this is the only way to refer to columns.
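For example, assuming a file with no header line, the first two columns could be mapped to hypothetical ID and NAME properties like this:

  <column name="1" property="ID"/>
  <column name="2" property="NAME"/>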

The source assumes that values are separated by commas, and that strings are enclosed in double quotes.

The configuration parameters are:

  • **input-file**: refers to the CSV file
    
  • **encoding**: the character encoding of the file
    
  • **skip-lines**: number of initial lines to skip over (default: 0)
    
  • **header-line**: is there a header line with column names? (true|false, default: true)
    
  • **separator**: column separator character (default ',')
    

If a CSV column can contain multiple values separated by some delimiter, configure the column with

  <column name="EMAIL" split-on=";" .../>

where the split-on attribute contains a regular expression to split the column on. Note that cleaning is applied to each individual value after splitting.

JDBC data source

This source can load data from a database over a JDBC connection. Below is an example configuration:

  <jdbc>
    <param name="driver-class" value="com.microsoft.sqlserver.jdbc.SQLServerDriver"/>
    <param name="connection-string" value="jdbc:sqlserver://127.0.0.1;databaseName=DATABASE_NAME"/>
    <param name="user-name" value="user"/>
    <param name="password" value="secret"/>
    <param name="query" value="
        select * from CUSTOMERS
      "/>

    <column property="ID"/>
    ...
  </jdbc>

If you are using SQL Server Integrated Security, add ;integratedSecurity=true to the connection-string param. You may need to put sqljdbc_auth.dll (bundled with the driver) in your System32 folder.

The source connects to the database over JDBC, then issues the given SQL query, and the column references refer to the column names in the SQL result set.

The configuration parameters are:

  • **driver-class**: the Java class name of the JDBC driver (depends on database)
    
  • **connection-string**: the JDBC connection URL
    
  • **user-name**: the user name to connect as
    
  • **password**: the password
    
  • **query**: the query to use to retrieve data 
    

If you need to split column values into multiple values, use the same solution as for CSV data (see above).
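For example, a hypothetical EMAILS result-set column holding semicolon-separated addresses could be declared like this:

  <column name="EMAILS" split-on=";" property="EMAIL"/>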

JNDI data source

This is the same as the JDBC data source, except that it gets an already configured JDBC connection via JNDI:

  <jndi>
    <param name="jndi-path" value="jdbc/CatalogDB"/>
    <param name="query" value="
        select * from CUSTOMERS
      "/>

    <column property="ID"/>
    ...
  </jndi>

The configuration parameters are:

  • **jndi-path**: the path to the JNDI resource
    
  • **query**: the query to use to retrieve data 
    

NTriples data source

The NTriples data source can read files in the NTriples format, which is a simple serialization format for RDF. We support only this format at the moment, in order to avoid having to add a dependency on an RDF engine, which would require a large number of .jar files to be added.

By default this data source will load the complete NTriples file into memory before processing begins, in order to collect all properties of a resource into a single record. If the NTriples file is sorted you can load in incremental mode instead, which keeps only one record in memory at a time.

Below is an example configuration:

  <ntriples>
    <param name="input-file" value="dogfood.ntriples"/>

    <column name="?uri" property="ID"/>
    <column name="http://xmlns.com/foaf/0.1/name"
            cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"
            property="NAME"/>
    ...
  </ntriples>

Note the special column name "?uri", which is used to add the URI of the resource to the created record. Other column names refer to the URIs of RDF properties.

The configuration parameters are:

  • **input-file**: the NTriples file to load
    
  • **accept-types**: the URI of the class to process (in case the file contains data about resources of more than one class, but you only want to process a single class)
    
  • **incremental-mode**: set to true if the file is sorted 
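For example, to process only foaf:Person resources from a sorted file (the file name is hypothetical), the parameters could be combined like this:

  <ntriples>
    <param name="input-file" value="dogfood.ntriples"/>
    <param name="accept-types" value="http://xmlns.com/foaf/0.1/Person"/>
    <param name="incremental-mode" value="true"/>

    <column name="?uri" property="ID"/>
    ...
  </ntriples>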
    

SPARQL data source

The SPARQL data source can load data via SPARQL queries to a SPARQL endpoint. It requires queries to return a tabular result with the resource URI as the first column, the property as the second column, and the value as the third column. Alternatively, you can set triple-mode to false, and map each result row to a record.

Below is an example configuration in triple-mode:

  <sparql>
    <param name="endpoint" value="http://localhost:8080/sparql/"/>
    <param name="query" value="
      SELECT DISTINCT ?instance ?p ?o
      WHERE {
        ?instance a &lt;http://xmlns.com/foaf/0.1/Person> .
        ?instance ?p ?o .
      }
      ORDER BY ?instance ?p ?o
    "/>

    <column name="?uri"
            property="ID"/>
    <column name="http://xmlns.com/foaf/0.1/name"
            cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"
            property="NAME"/>
    ...
  </sparql>

Note the special column name "?uri", which is used to add the URI of the resource to the created record. Other column names refer to the URIs of RDF properties.

The configuration parameters are:

  • **endpoint**: the URI of the SPARQL endpoint
    
  • **query**: the SPARQL query to send
    
  • **page-size**: how many rows to load per query (set to 0 to turn paging off)
    
  • **triple-mode**: whether to use triple mode or not 
    

Note that the data source uses paging to control the size of query results, so the query must order the results for paging to work consistently. The default page size is 1000 rows.

Below is the same example reformulated to tabular mode:

  <sparql>
    <param name="endpoint" value="http://localhost:8080/sparql/"/>
    <param name="triple-mode" value="false"/>
    <param name="query" value="
      SELECT DISTINCT ?uri ?name ...
      WHERE {
        ?uri a &lt;http://xmlns.com/foaf/0.1/Person> .
        ?uri foaf:name ?name .
      }
      ORDER BY ?uri
    "/>

    <column name="uri"
            property="ID"/>
    <column name="name"
            cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"
            property="NAME"/>
    ...
  </sparql>

JSON data source

(This data source is in the Git repo and will be included in the next release. Contributed by dmnpignaud.)

It can read JSON files from disk, and the API offers some more flexible alternatives. There is just one configuration option:

  • input-file: the file from which to read the JSON data.

To use it, write:

<data-source class="no.priv.garshol.duke.datasources.JsonDataSource">
  <param name="input-file" value="whatever.json"/>

  ...
</data-source>

Note: The dataset as a whole must not be wrapped in brackets, and the entries must not be separated by commas. Of course, you can use brackets inside each entry. For example:

{"entry": "1", "name": "John", "last_name": "Doe"}
{"entry": "2", "name": "Jon", "last_name": "Doe"}
{"entry": "3", "name": ["John", "Jone"], "last_name": "Do"}

MongoDB

Note: This data source was added after the 1.2 release, so it will not be available until 1.3 is released, unless you check out the source code from Github and compile it yourself.

Example of use.

Suppose you have a collection named "newUsers" in a "gatheredData" DB.

If this is the "schema":

{
 _id: ObjectId("54107bb33f2a38e1e44e9961"),
 name: "Adolf",
 address: {
  street: "Rue Av.",
  number: 102,
  "zip-code": 4106
 }
}

and you just want to consider the "name" and "address.zip-code" fields:

<data-source class="no.priv.garshol.duke.datasources.MongoDBDataSource">
 <param name="server-address" value="domain.com"/>
 <param name="port-number" value="27017"/>
 <param name="database" value="gatheredData"/>
 <param name="collection" value="newUsers"/>
 <param name="projection" value="{_id:0, name:1, address.zip-code:1}"/>

 <column property="Name" name="name"/>
 <column property="ZipCode" name="address.zip-code"/>
</data-source>

Note that the fields "name" and "address.zip-code" have been mapped to "Name" and "ZipCode" in the column definitions.

Parameters

  • Required:
    • database
    • collection
  • Optional (with default values):
    • server-address: "localhost"
    • port-number: "27017"
    • db-auth: "false" (other possible values: "true" and "admin", case-insensitive)
    • user-name: required if db-auth is set to "true" or "admin"
    • password: required if db-auth is set to "true" or "admin"
    • cursor-notimeout: "false" (can be set to "true")
    • query: "{}" (query all documents in the collection)
    • projection: if not set, no projection will be performed
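As an illustration, a source connecting to an authenticated server (the host name and credentials here are hypothetical) might combine the parameters like this:

  <data-source class="no.priv.garshol.duke.datasources.MongoDBDataSource">
   <param name="server-address" value="mongo.example.com"/>
   <param name="database" value="gatheredData"/>
   <param name="collection" value="newUsers"/>
   <param name="db-auth" value="admin"/>
   <param name="user-name" value="duke"/>
   <param name="password" value="secret"/>
   <param name="cursor-notimeout" value="true"/>

   <column property="Name" name="name"/>
   ...
  </data-source>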

Behavior

  1. If a parameter is invalid (port-number, server-address, collection, etc.), an error will be thrown. The value for port-number must be a valid port number (between 1 and 65535); otherwise the default value (27017) will be kept.
  2. Setting a parameter to "" is the same as not setting it (commenting out or omitting the assignment). The exception is port-number: if you are not going to specify it, it is preferable to comment out or omit the assignment, since an error with a specific message will be thrown if it is set to "" or to an unparseable string.
  3. When cursor-notimeout is not set (or set to "false") and the batch size is smaller than db[collection].count(query), the cursor may already have timed out by the time Duke tries to fetch the next batch (by performing a MongoDB getmore). To prevent this, set cursor-notimeout to "true".
  4. If db-auth is set to "true", the connection will look for the credentials (user name and password) in the database specified by the database parameter. When set to "admin", it will look for them in the "admin" database. If db-auth is set to a value other than "false", "true" or "admin" (case-insensitive), the setting will be ignored and the default value kept.
  5. The query and projection parameters must be valid JSON. There is no need to add quotes, even when they operate on a nested field. If the format is not valid, an error will be thrown.
  6. You could skip the projection parameter and perform the projection in the column definitions instead, but with many very large documents this would cause a significant slowdown due to network traffic.
  7. The query document can contain operators like $orderby, $showDiskLoc, etc. See http://docs.mongodb.org/manual/reference/operator/query-modifier/

Custom data sources

If you want to use a custom data source, simply implement the DataSource interface, and then configure it as follows in the XML file:

  <data-source class="class name here">
    <param name="bean property" value="value"/>
    ...
  </data-source>
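To give an idea of what an implementation involves, below is a minimal sketch of a custom source serving records from an in-memory list. To keep the sketch self-contained, simplified stand-ins for Duke's DataSource, Record, and RecordIterator types are declared inline; in real code you would implement the actual interfaces from the no.priv.garshol.duke package (which may declare additional methods, such as setLogger, depending on your version). The InMemoryDataSource name is hypothetical.

```java
import java.util.*;

// Simplified stand-ins for the real Duke types in no.priv.garshol.duke,
// declared here only to make the sketch self-contained.
interface DataSource {
  RecordIterator getRecords();
}

abstract class RecordIterator implements Iterator<Record> {
  public void remove() { throw new UnsupportedOperationException(); }
}

// Simplified record: a map from property names to their values.
class Record {
  private final Map<String, Collection<String>> props;
  Record(Map<String, Collection<String>> props) { this.props = props; }
  public Collection<String> getValues(String prop) { return props.get(prop); }
}

// Hypothetical custom source serving records from an in-memory list.
// Bean-style setters allow values to be injected via <param> in the XML file.
class InMemoryDataSource implements DataSource {
  private List<Record> records = new ArrayList<Record>();

  public void setRecords(List<Record> records) { this.records = records; }

  public RecordIterator getRecords() {
    final Iterator<Record> wrapped = records.iterator();
    return new RecordIterator() {
      public boolean hasNext() { return wrapped.hasNext(); }
      public Record next() { return wrapped.next(); }
    };
  }
}
```

Duke instantiates the class named in the class attribute and calls getRecords() to stream the records into the deduplication or linkage process.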