# Defining a Druid DataSource on a Flattened Dataset
Assume you have a flattened dataset that represents denormalized facts (i.e. all dimensional attributes are part of each fact row). For example, consider an Orders fact with the following schema:
```sql
CREATE TABLE IF NOT EXISTS orderLineItemPartSupplierBase(
  o_orderkey integer, o_custkey integer, o_orderstatus string,
  o_totalprice double, o_orderdate string, o_orderpriority string,
  o_clerk string, o_shippriority integer, o_comment string,
  l_partkey integer, l_suppkey integer, l_linenumber integer,
  l_quantity double, l_extendedprice double, l_discount double,
  l_tax double, l_returnflag string, l_linestatus string,
  l_shipdate string, l_commitdate string, l_receiptdate string,
  l_shipinstruct string, l_shipmode string, l_comment string,
  order_year string,
  ps_partkey integer, ps_suppkey integer, ps_availqty integer,
  ps_supplycost double, ps_comment string,
  s_name string, s_address string, s_phone string, s_acctbal double,
  s_comment string, s_nation string, s_region string,
  p_name string, p_mfgr string, p_brand string, p_type string,
  p_size integer, p_container string, p_retailprice double, p_comment string,
  c_name string, c_address string, c_phone string, c_acctbal double,
  c_mktsegment string, c_comment string, c_nation string, c_region string
)
```
Each row in this table represents an order line item. For each line item we capture metrics like the quantity, price, discount, and tax of the part sold, along with which supplier sold it, which customer bought it, which order it belongs to, and many attributes of the part, supplier, customer, and order that the line belongs to.
An analyst working with this dataset will want to slice-and-dice it by any combination of attributes: for example, Avg(price) by Customer Nation and Market Segment, the Top 10 buying customers in a particular Region, Avg(quantity) by Product category, and so on.
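For example, the first of these slices could be expressed as the query below. This is a sketch against the schema above; using `l_extendedprice` as the price metric is an assumption:

```sql
-- Average price by customer nation and market segment.
-- Assumes l_extendedprice is the "price" metric of interest.
SELECT c_nation, c_mktsegment, AVG(l_extendedprice) AS avg_price
FROM orderLineItemPartSupplierBase
GROUP BY c_nation, c_mktsegment
```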
Assume a Druid Index has been set up on this flattened dataset. To accelerate ad-hoc analysis on it, a Druid DataSource needs to be set up. Here is the basic definition:
```sql
CREATE TABLE IF NOT EXISTS orderLineItemPartSupplier
USING org.sparklinedata.druid
OPTIONS (
  sourceDataframe "orderLineItemPartSupplierBase",
  timeDimensionColumn "l_shipdate",
  druidDatasource "tpch",
  druidHost "localhost",
  zkQualifyDiscoveryNames "true",
  starSchema '{"factTable" : "orderLineItemPartSupplier","relations" : []}'
)
```
At a minimum, specify:
- the underlying dataset that this DataSource accelerates; this is the `sourceDataframe` option.
- details about the Druid Index on this dataset. This is specified by:
  - how to connect to Druid: specify the Druid cluster's Zookeeper ensemble (`druidHost` in the example above); for Druid 0.9.0 and above, set `zkQualifyDiscoveryNames` to true.
  - `druidDatasource`: the name of the druid index.
  - `timeDimensionColumn`: the source dataset column that is mapped to the druid timestamp.
  - `starSchema`: for flattened datasets this should always have `factTable` set to the DataSource name and the `relations` list empty. This becomes important when the source data is kept in a Star Schema structure.
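Once the DataSource is defined, analyst queries are issued against it rather than the base table, and aggregate queries like the sketch below can then be rewritten into Druid queries (again, using `l_extendedprice` as the price metric is an assumption):

```sql
-- Issued against the Druid-backed DataSource; the planner can
-- push this aggregation down to the "tpch" Druid index.
SELECT c_nation, AVG(l_extendedprice) AS avg_price
FROM orderLineItemPartSupplier
GROUP BY c_nation
```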
The complete list of options is in Druid-Datasource-Options; common settings are:
- set `queryGranularity` to match the Query Granularity of your Druid Index.
- if approximate TopN processing is acceptable for your analysis, set `allowTopN`.
- if you want to push non-aggregate queries (basic scans and filters) to Druid, set `nonAggQueryHandling` to `push_filters` or `push_project_and_filters`. By default, non-aggregate queries are executed in Spark only, by scanning the data of the underlying tables.
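Putting these together, a definition with the common options set might look like the following sketch; the `queryGranularity` value of "day" is an assumption and must match how the Druid index was built:

```sql
-- Same definition as above, with the common tuning options added.
CREATE TABLE IF NOT EXISTS orderLineItemPartSupplier
USING org.sparklinedata.druid
OPTIONS (
  sourceDataframe "orderLineItemPartSupplierBase",
  timeDimensionColumn "l_shipdate",
  druidDatasource "tpch",
  druidHost "localhost",
  zkQualifyDiscoveryNames "true",
  queryGranularity "day",
  allowTopN "true",
  nonAggQueryHandling "push_filters",
  starSchema '{"factTable" : "orderLineItemPartSupplier","relations" : []}'
)
```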