Uses a statistical profile of ODBC source data to parameterize a generic probability model, then generates a structurally equivalent random data set by sampling from that model.
Simulated data does not contain patient data. Simulation is one-way: there is no way to reproduce the source data from the simulated data.
ODBC-compatible source data (e.g. MS SQL database, Denodo)
- connection string: ODBC-compliant DSN of the source system
- parameter file specifying the list of tables that will be copied as-is from source to destination
- rerunnable DDL scripts to create empty structures
- rerunnable DML scripts to populate non-private data
- collection of C++ objects representing source schema objects:
  - database
  - schema
  - table
  - column
- parameterized multinomial probability model (i.e. a collection of distinct column values, a.k.a. outcome values, and their respective probabilities)
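The object model and probability model listed above could be laid out roughly as follows. This is only a sketch: every type, field, and function name here (`MultinomialModel`, `Column`, `qualified_name`, and so on) is hypothetical, since the source states only that database, schema, table, and column objects exist.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// All names below are illustrative assumptions, not the project's classes.

// Multinomial model for one column: distinct outcome values and their
// probabilities, stored in the same order.
struct MultinomialModel {
    std::vector<std::string> outcomes;
    std::vector<double> probabilities;
};

struct Column {
    std::string name;
    std::string sql_type;         // e.g. "INT" or "VARCHAR(50)"
    bool is_primary_key = false;  // primary keys are sequenced, not sampled
    MultinomialModel model;       // filled in by the profiling step
};

struct Table {
    std::string name;
    std::size_t row_count = 0;
    std::vector<Column> columns;
};

struct Schema {
    std::string name;
    std::vector<Table> tables;
};

struct Database {
    std::string name;
    std::vector<Schema> schemas;
};

// Builds "database.schema.table.column" for generated queries and logs.
inline std::string qualified_name(const Database& d, const Schema& s,
                                  const Table& t, const Column& c) {
    return d.name + "." + s.name + "." + t.name + "." + c.name;
}
```

Plain structs keep the hierarchy easy to walk when generating DDL and profiling queries; a real implementation may prefer classes with behaviour attached.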
- Extract metadata from the source schema
  a. Generate rerunnable DDL scripts
- Instantiate C++ object model of the schema from the metadata
- C++ column class decorated to add ODBC connectivity for querying the source system
- Generate and execute queries against the source data that determine pairwise functional dependencies between columns within each table
- Functional dependency hierarchies are modelled as a tree of column values.
  a. Only leaf-level columns are simulated.
  b. The value of each parent column is determined from the functional dependency tree.
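One way to test a pairwise functional dependency is to count the determinant values that map to more than one dependent value; a count of zero means the dependency holds in the current data. The sketch below generates such a query and shows a simple leaf-to-parent lookup. The function names, the `DependencyMap` alias, and the exact SQL are assumptions for illustration; the source does not show the queries it actually runs.

```cpp
#include <string>
#include <unordered_map>

// Builds a query returning the number of `determinant` values associated
// with more than one distinct `dependent` value. A result of 0 means
// `determinant` functionally determines `dependent`.
// Assumption: identifiers come from trusted schema metadata, so they are
// not quoted or escaped here. COUNT(DISTINCT ...) ignores NULLs.
std::string fd_violation_query(const std::string& table,
                               const std::string& determinant,
                               const std::string& dependent) {
    return "SELECT COUNT(*) FROM (SELECT " + determinant +
           " FROM " + table +
           " GROUP BY " + determinant +
           " HAVING COUNT(DISTINCT " + dependent + ") > 1) AS violations";
}

// Lookup from a simulated leaf value to the parent value it determines,
// e.g. city -> state. One map per (leaf column, parent column) edge.
using DependencyMap = std::unordered_map<std::string, std::string>;

// Returns the parent value for a leaf value, or an empty string if the
// leaf value was never seen in the source data.
std::string parent_value(const DependencyMap& map, const std::string& leaf) {
    auto it = map.find(leaf);
    return it == map.end() ? std::string() : it->second;
}
```

With this layout only the leaf column needs a probability model; each parent value is looked up after the leaf value has been drawn.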
- Generate and execute queries against the source data that return table row counts, column distinct counts, and column value histograms
- Assume columns and rows are pairwise statistically independent (within and between tables).
  a. Thus foreign key (FK) constraints are broken.
- Simulate primary key columns with an increasing sequence of values
- Model non-unique columns as multinomial probability distributions and estimate their parameters from the column value histograms
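As a sketch of the estimation step, each outcome's probability can be taken as its relative frequency in the column value histogram, which is the maximum-likelihood estimate for a multinomial model. The source does not say which estimator is used, so treat that choice as an assumption, along with the hypothetical names `Histogram`, `estimate_model`, and `histogram_query`.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// (value, count) pairs, as a histogram query would return them.
using Histogram = std::vector<std::pair<std::string, std::size_t>>;

// Outcome values and their probabilities, stored in the same order.
struct MultinomialModel {
    std::vector<std::string> outcomes;
    std::vector<double> probabilities;
};

// Assumed shape of the histogram query; identifiers are not escaped.
std::string histogram_query(const std::string& table,
                            const std::string& column) {
    return "SELECT " + column + ", COUNT(*) FROM " + table +
           " GROUP BY " + column;
}

// Relative-frequency estimate: p_i = count_i / total.
MultinomialModel estimate_model(const Histogram& histogram) {
    std::size_t total = 0;
    for (const auto& bucket : histogram) total += bucket.second;

    MultinomialModel model;
    if (total == 0) return model;  // empty column: nothing to estimate
    for (const auto& bucket : histogram) {
        model.outcomes.push_back(bucket.first);
        model.probabilities.push_back(static_cast<double>(bucket.second) /
                                      static_cast<double>(total));
    }
    return model;
}
```

For example, a histogram of 3 rows with value `F` and 1 row with value `M` yields probabilities 0.75 and 0.25.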
- output of Statistical Profile
- desired row counts for simulated tables
- Fake data
- Execute DDL creation scripts
- Execute DML insertion scripts
- populate destination tables by sampling from the column probability models
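The population step can be sketched with the standard library's `std::discrete_distribution`, which draws an index according to a list of weights. The function names and signatures below are hypothetical; the source does not say how sampling is implemented. Each column is sampled independently, in line with the independence assumption, and primary keys are generated as a sequence rather than sampled.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <string>
#include <vector>

// Draws `rows` values for one column from its multinomial model.
// `outcomes` and `probabilities` must have the same non-zero length.
std::vector<std::string> sample_column(
        const std::vector<std::string>& outcomes,
        const std::vector<double>& probabilities,
        std::size_t rows,
        std::mt19937& rng) {
    assert(!outcomes.empty() && outcomes.size() == probabilities.size());
    // discrete_distribution normalizes the weights and returns an index.
    std::discrete_distribution<std::size_t> pick(probabilities.begin(),
                                                 probabilities.end());
    std::vector<std::string> values;
    values.reserve(rows);
    for (std::size_t i = 0; i < rows; ++i) {
        values.push_back(outcomes[pick(rng)]);
    }
    return values;
}

// Primary key columns are an increasing sequence starting at `start`.
std::vector<long long> simulate_primary_key(std::size_t rows,
                                            long long start = 1) {
    std::vector<long long> keys(rows);
    std::iota(keys.begin(), keys.end(), start);
    return keys;
}
```

Passing the random engine by reference lets the caller seed it once, so a run can be repeated when a fixed seed is used.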