
Data quality scoring

Marcus Bakker edited this page May 8, 2019 · 15 revisions

DeTT&CT describes five different data quality dimensions: device completeness, data completeness, timeliness, consistency and retention. These dimensions are explained in the table below: data quality dimensions.

Scoring your data quality means scoring each of these dimensions for every data source you have. ATT&CK has defined around 50 different types of data sources, all of which are included in this framework. The scoring table under "Data quality scores" guides you in choosing a score. The same scoring tables are also available in the following Excel file: scoring_table.xlsx.

A score may not always be a perfect fit. Use the score that fits best.
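
As a minimal sketch of what a scoring record could look like, the following Python snippet stores one score per dimension for a single data source. The class name `DataQuality`, its field names, and the example data source with its scores are assumptions made for illustration; they are not DeTT&CT's own file format.

```python
from dataclasses import dataclass, fields

# Labels for the scores 0-5, as used in the scoring table.
SCORE_LABELS = {0: "None", 1: "Poor", 2: "Fair", 3: "Good", 4: "Very good", 5: "Excellent"}


@dataclass
class DataQuality:
    """Hypothetical record holding one score (0-5) per data quality dimension."""
    device_completeness: int = 0
    data_completeness: int = 0
    timeliness: int = 0
    consistency: int = 0
    retention: int = 0

    def __post_init__(self):
        # Reject anything that is not one of the six defined scores.
        for f in fields(self):
            value = getattr(self, f.name)
            if value not in SCORE_LABELS:
                raise ValueError(f"{f.name} must be a score from 0 to 5, got {value!r}")

    def labelled(self):
        """Return each dimension with its score and label, e.g. '3 - Good'."""
        return {
            f.name: f"{getattr(self, f.name)} - {SCORE_LABELS[getattr(self, f.name)]}"
            for f in fields(self)
        }


# Illustrative, made-up scores for one data source.
scores = {"Process monitoring": DataQuality(3, 4, 5, 3, 1)}
for name, quality in scores.items():
    print(name, quality.labelled())
```

Leaving a dimension at its default of 0 matches the "do not know / not documented / not applicable" meaning of that score.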

Data quality dimensions

| Dimension | Description | Questions | Example |
| --- | --- | --- | --- |
| Device completeness | Indicates whether the required data is available for all involved devices. | When doing a hunting investigation, can we cover all devices/users that we need to? | We are missing event data for endpoints running an older version of Windows. |
| Data completeness | Indicates to what degree the data has the required information/fields. | Are all the required data fields in the event present to perform my investigation? | We have proxy logs, but the events do not contain the "Host" header. |
| Timeliness | Indicates when data is available, and how accurate the timestamps of the data are in relation to the actual time an event occurred. | Is the data available right away when we need it? Do the timestamps in the data represent the time the record was created or ingested? | We have a delay of 1-2 days to get the necessary data from all endpoints into the security data lake. Timestamps represent the ingestion time in the security data lake, not the time an event occurred. |
| Consistency | Indicates how well data field names and types are standardised. | Can we correlate the events with other data sources? Can we run queries across all data sources using standard naming conventions for specific fields? | Field names within this data source are not in line with those of other data sources. |
| Retention | Indicates how long the data is stored compared to the desired data retention period. | For how long is the data available? How long do you want to keep the data? | Data is stored for 30 days, but we ideally want to have it for 1 year. |

Data quality scores

| Score | Device completeness | Data field completeness | Timeliness | Consistency | Retention |
| --- | --- | --- | --- | --- | --- |
| 0 - None | Do not know / not documented / not applicable | Do not know / not documented / not applicable | Do not know / not documented / not applicable | Do not know / not documented / not applicable | Do not know / not documented / not applicable |
| 1 - Poor | Data source is available from 1-25% of the devices. | 1-25% of the required fields are available. | It takes a long time before the data is available. The timestamps in the data deviate substantially from the actual time events occurred. | 1-50% of the fields are standardized in name and type. | Data is retained for 1-25% of the desired period. |
| 2 - Fair | Data source is available from 26-50% of the devices. | 26-50% of the required fields are available. | It takes a long time before the data is available. The timestamps in the data deviate substantially from the actual time events occurred. | 1-50% of the fields are standardized in name and type. | Data is retained for 26-50% of the desired period. |
| 3 - Good | Data source is available from 51-75% of the devices. | 51-75% of the required fields are available. | It takes a while before the data is available, but the delay is acceptable. The timestamps in the data deviate slightly from the actual time events occurred. | 51-99% of the fields are standardized in name and type. | Data is retained for 51-75% of the desired period. |
| 4 - Very good | Data source is available from 76-99% of the devices. | 76-99% of the required fields are available. | It takes a while before the data is available, but the delay is acceptable. The timestamps in the data deviate slightly from the actual time events occurred. | 51-99% of the fields are standardized in name and type. | Data is retained for 76-99% of the desired period. |
| 5 - Excellent | Data source is available for 100% of the devices. | 100% of the required fields are available. | The data is available right away. The timestamps in the data are 100% accurate. | 100% of the fields are standardized in name and type. | Data is stored for 100% of the desired retention period. |
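
The percentage bands used for device completeness, data field completeness and retention can be expressed as a small lookup. The sketch below is one possible reading of those bands, not part of DeTT&CT itself: the function names `percentage_to_score` and `retention_score` are hypothetical, and the handling of 0%, fractional percentages and values above 100% is an assumption, because the table does not define those cases.

```python
from typing import Optional


def percentage_to_score(percentage: Optional[float]) -> int:
    """Map a coverage percentage to a 0-5 score using the bands in the scoring table.

    Assumptions not stated in the table: None (unknown) and 0% both map to 0,
    fractional values fall into the band above the boundary they exceed
    (for example 25.5% scores 2), and values above 100% are treated as 100%.
    """
    if percentage is None or percentage <= 0:
        return 0
    if percentage >= 100:
        return 5
    if percentage > 75:
        return 4
    if percentage > 50:
        return 3
    if percentage > 25:
        return 2
    return 1


def retention_score(stored_days: float, desired_days: float) -> int:
    """Score retention as the stored period relative to the desired period."""
    if desired_days <= 0:
        raise ValueError("desired_days must be positive")
    return percentage_to_score(100 * stored_days / desired_days)


# Example from the dimensions table: 30 days stored, 1 year desired.
print(retention_score(30, 365))  # about 8% of the desired period -> 1 (Poor)
```

Timeliness and consistency are not covered here: timeliness is described qualitatively, and consistency uses wider bands, so both still need a manual judgment.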