Tutorial and examples of Data Quality in Big Data System

bikash/DataQuality

DataQuality

Tutorial and examples of Data Quality in Big Data System.

Data Quality metrics:

  • completeness
      • commission
      • omission
  • thematic accuracy
      • thematic classification correctness
      • non-quantitative attribute correctness
      • quantitative attribute accuracy
  • logical consistency
      • conceptual consistency
      • domain consistency
      • format consistency
      • topological consistency
  • temporal quality
      • accuracy of a time measurement
      • temporal consistency
      • temporal validity
  • positional accuracy
      • absolute external positional accuracy
      • relative internal positional accuracy
      • gridded data positional accuracy
  • usability
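
A few of the metrics above can be made concrete with a short sketch. The Python below is illustrative only and is not taken from this repository or from any of the tools listed here: the function names, the sample identifiers, the allowed-value set, and the date pattern are all assumptions. It treats omission and commission as set differences against a reference list, and domain and format consistency as the share of values that pass a check.

```python
import re


def omission_rate(expected_ids, actual_ids):
    """Share of expected items that are missing from the data set."""
    expected, actual = set(expected_ids), set(actual_ids)
    return len(expected - actual) / len(expected) if expected else 0.0


def commission_rate(expected_ids, actual_ids):
    """Share of items in the data set that are not in the reference."""
    expected, actual = set(expected_ids), set(actual_ids)
    return len(actual - expected) / len(actual) if actual else 0.0


def domain_consistency(values, allowed):
    """Share of values that fall inside the allowed value domain."""
    values = list(values)
    if not values:
        return 1.0
    return sum(v in allowed for v in values) / len(values)


def format_consistency(values, pattern):
    """Share of string values that fully match the expected format."""
    values = list(values)
    if not values:
        return 1.0
    regex = re.compile(pattern)
    return sum(bool(regex.fullmatch(v)) for v in values) / len(values)


# Hypothetical sample data, for illustration only.
reference_ids = [1, 2, 3, 4]
dataset_ids = [1, 2, 3, 5, 6]
land_cover = ["road", "river", "lake", "rd"]
dates = ["2020-01-01", "2020/01/02"]

print(omission_rate(reference_ids, dataset_ids))    # 0.25 (id 4 is missing)
print(commission_rate(reference_ids, dataset_ids))  # 0.4  (ids 5 and 6 are extra)
print(domain_consistency(land_cover, {"road", "river", "lake"}))  # 0.75
print(format_consistency(dates, r"\d{4}-\d{2}-\d{2}"))            # 0.5
```

On a real big data system the same ratios would normally be computed with distributed aggregations rather than in-memory sets, but the definitions stay the same.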

Your contributions are always welcome!

  • Griffin - A data quality solution for distributed data systems at any scale, in both streaming and batch contexts. It measures accuracy, completeness, validity, and timeliness, and supports anomaly detection and data profiling. (Recommended)
  • drunken-data-quality - Provides data quality reports using Spark, Elasticsearch, Logstash, and Kibana (ELK). Demo: https://github.com/FRosner/ddq-demo-elk
  • DataQuality for BigData - A framework for building parallel and distributed quality checks in big data environments. It can be used to calculate metrics and perform checks that assure quality on structured or unstructured data. It relies entirely on Spark.
  • TopNotch - A system for quality-controlling large-scale data sets. It addresses three problems: how to define and measure data quality, how to efficiently ensure data quality across many data sets, and how to institutionalize existing knowledge of data sets.
  • Phasor Data Quality Tracker - The PDQ Tracker, administered by the Grid Protection Alliance (GPA), is a high-performance, real-time data processing engine designed to raise alarms, track states, store statistics, and generate reports on both the availability and accuracy of streaming synchrophasor data. Documentation: http://www.gridprotectionalliance.org/docs/products/PDQTracker/highlevelrequirements.pdf
  • DataCleaner - The premier open source data quality solution. Documentation is available.
  • data-quality - Talend Open Studio for Data Quality can be downloaded from the Talend website.
