# Defining a Druid DataSource on a Flattened Dataset
Assume you have a flattened dataset that represents denormalized facts (i.e. all dimensional attributes are part of each fact row). For example, consider an Orders fact with the following schema:
```sql
CREATE TABLE IF NOT EXISTS orderLineItemPartSupplierBase(
  o_orderkey integer, o_custkey integer, o_orderstatus string,
  o_totalprice double, o_orderdate string, o_orderpriority string,
  o_clerk string, o_shippriority integer, o_comment string,
  l_partkey integer, l_suppkey integer, l_linenumber integer,
  l_quantity double, l_extendedprice double, l_discount double,
  l_tax double, l_returnflag string, l_linestatus string,
  l_shipdate string, l_commitdate string, l_receiptdate string,
  l_shipinstruct string, l_shipmode string, l_comment string,
  order_year string,
  ps_partkey integer, ps_suppkey integer, ps_availqty integer,
  ps_supplycost double, ps_comment string,
  s_name string, s_address string, s_phone string, s_acctbal double,
  s_comment string, s_nation string, s_region string,
  p_name string, p_mfgr string, p_brand string, p_type string,
  p_size integer, p_container string, p_retailprice double, p_comment string,
  c_name string, c_address string, c_phone string, c_acctbal double,
  c_mktsegment string, c_comment string, c_nation string, c_region string
)
```
Each row in this table represents an order line item. For each line item we capture metrics like the quantity, price, discount, and tax of the part sold, along with which supplier sold it, which customer bought it, which order it belongs to, and many attributes of the part, supplier, customer, and order that the line belongs to.
An analyst working with this dataset will want to slice-and-dice it by any combination of attributes: for example, Avg(price) by Customer Nation and Market Segment, the Top 10 buying customers in a particular Region, Avg(quantity) by Product category, and so on.
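For example, the first of these slices could be expressed as the query below. This is a sketch against the schema above; using `l_extendedprice` as the price metric is an assumption:

```sql
-- Average price by customer nation and market segment.
-- Assumes l_extendedprice is the "price" metric of interest.
SELECT c_nation, c_mktsegment, AVG(l_extendedprice) AS avg_price
FROM orderLineItemPartSupplierBase
GROUP BY c_nation, c_mktsegment
```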
Assume a Druid Index has been set up on this flattened dataset. To accelerate ad-hoc analysis on it, a Druid DataSource needs to be set up. Here is the basic definition:
```sql
CREATE TABLE IF NOT EXISTS orderLineItemPartSupplier
USING org.sparklinedata.druid
OPTIONS (
  sourceDataframe "orderLineItemPartSupplierBase",
  timeDimensionColumn "l_shipdate",
  druidDatasource "tpch",
  druidHost "localhost",
  zkQualifyDiscoveryNames "true",
  starSchema '{"factTable" : "orderLineItemPartSupplier","relations" : []}'
)
```
At a minimum, specify:
- the underlying dataset that this DataSource accelerates; this is the `sourceDataframe` option.
- details about the Druid Index on this dataset. This is specified by:
  - how to connect to Druid: specify the Druid cluster's Zookeeper ensemble (`druidHost` in the example above); for Druid 0.9.0 and above, set `zkQualifyDiscoveryNames` to true.
  - `druidDatasource`: the name of the druid index.
  - `timeDimensionColumn`: the source dataset column that is mapped to the druid timestamp.
  - `starSchema`: for flattened datasets this should always have `factTable` set to the DataSource name and the `relations` list empty. This becomes important when the source data is kept in a Star Schema structure.
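Once the DataSource is defined, analyst queries are issued against it rather than the base table, and aggregate queries like the sketch below can then be rewritten into Druid queries (again, using `l_extendedprice` as the price metric is an assumption):

```sql
-- Issued against the Druid-backed DataSource; the planner can
-- push this aggregation down to the "tpch" Druid index.
SELECT c_nation, AVG(l_extendedprice) AS avg_price
FROM orderLineItemPartSupplier
GROUP BY c_nation
```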
The complete list of options is in Druid-Datasource-Options; common settings are:
- set `queryGranularity` to match the Query Granularity of your Druid Index.
- if approximate TopN processing is acceptable for your analysis, set `allowTopN`.
- if you want to push non-aggregate queries (basic scans and filters) to Druid, set `nonAggQueryHandling` to `push_filters` or `push_project_and_filters`. By default, non-aggregate queries are executed in Spark only, by scanning the data of the underlying tables.
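Putting these together, a definition with the common options set might look like the following sketch; the `queryGranularity` value of "day" is an assumption and must match how the Druid index was built:

```sql
-- Same definition as above, with the common tuning options added.
CREATE TABLE IF NOT EXISTS orderLineItemPartSupplier
USING org.sparklinedata.druid
OPTIONS (
  sourceDataframe "orderLineItemPartSupplierBase",
  timeDimensionColumn "l_shipdate",
  druidDatasource "tpch",
  druidHost "localhost",
  zkQualifyDiscoveryNames "true",
  queryGranularity "day",
  allowTopN "true",
  nonAggQueryHandling "push_filters",
  starSchema '{"factTable" : "orderLineItemPartSupplier","relations" : []}'
)
```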