Skip to content

ignatovskiy/PersonalDataAnonymizer

Repository files navigation

Personal Data Anonymizer

Build Status made-with-python

pdanonymizer is a Python library for finding, masking or replacing personal data in files, IO streams, etc.

Installation

Use the package manager pip to install pdanonymizer.

pip install .

Usage

Training new model by dataset and saving it as folder:

python3 pdanonymizer -a train -d [dataset file] -o [model dir]

Continue training existing model by new dataset and saving it as folder:

python3 pdanonymizer -a post_train -m [model dir] -d [dataset file] -o [model folder]

Automatic validation (testing) of existing model on dataset:

python3 pdanonymizer -a test_auto -m [model dir] -d [dataset file]

Manual testing of existing model by inputting string:

python3 pdanonymizer -a test_manual -m [model dir] -d [string]

Converting mainstream json datasets (generated on websites) to specific pdanonymizer dataset format:

python3 pdanonymizer -a convert -d [dataset file]

Creating new pdanonymizer datasets with needed number of entities from converted one:

python3 pdanonymizer -a create -d [raw dataset file] -v [entities amount] -o [new dataset file]

Replacing personal data in file by fake data with the help of pdanonymizer model and saving edited file:

python3 pdanonymizer -a predict_file -m [model dir] -d [old file] -o [new file]

Masking personal data in file by fake data with the help of pdanonymizer model and saving edited file:

python3 pdanonymizer -a mask_file -m [model dir] -d [old file] -o [new file]

Masking personal data in images by black squares with the help of pdanonymizer model and saving edited image:

python3 pdanonymizer -a mask_image -d [input image filename] -o [output image filename]

Pseudo-GUI mode of application:

python3 pdanonymizer -a interact

Accuracy

Validated on this dataset

Model Accuracy (label) Accuracy (entity) Accuracy (total) Time
model_10 26.57% 29.17% 14.02% 1.33s
model_100 66.98% 65.33% 76.80% 1.33s
model_1000 93.73% 97.40% 93.43% 1.32s
model_10000 97.34% 98.05% 95.74% 1.31s
model_100000 95.92% 98.17% 94.79% 1.31s

Examples

Replacing personal data in SQL database with fake one. Compare outputs of this commands.

w/o pdanonymizer:

python3 examples/sql_example.py
Ivan Ivanov 0-345-43-43 25.03.1874
Rodger Wellington +7-950-434-43-43 05/03/2000
Stan Smith +1-900-456-43-34 01.02.1990

w/ pdanonymizer:

python3 examples/sql_example.py | python3 pdanonymizer
Helen Church 001-625-632-4152 21.10.1983
Perry Lucas 767-863-7211 13.02.2013
Katherine Sheppard 001-822-636-2875x5676 18.12.2016

Another example of data replacing in SQL file provided by test_db project. Compare outputs.

w/o pdanonymizer:

mysql < examples/sql/test_print.sql
emp_no	birth_date	first_name	last_name	gender	hire_date
10037	1963-07-22	Pradeep	Makrucki	M	1990-12-05
10038	1960-07-20	Huan	Lortz	M	1989-09-20
10039	1959-10-01	Alejandro	Brender	M	1988-01-19
10040	1959-09-13	Weiyi	Meriste	F	1993-02-14
10041	1959-08-27	Uri	Lenart	F	1989-11-12
10042	1956-02-26	Magy	Stamatiou	F	1993-03-21
10043	1960-09-19	Yishay	Tzvieli	M	1990-10-20
10044	1961-09-21	Mingsen	Casley	F	1994-05-21
10045	1957-08-14	Moss	Shanbhogue	M	1989-09-02
10046	1960-07-23	Lucien	Rosenbaum	M	1992-06-20
10047	1952-06-29	Zvonko	Nyanchama	M	1989-03-31
10048	1963-07-11	Florian	Syrotiuk	M	1985-02-24
10049	1961-04-24	Basil	Tramer	F	1992-05-04
10050	1958-05-21	Yinghua	Dredge	M	1990-12-25
10051	1953-07-28	Hidefumi	Caine	M	1992-10-15
10052	1961-02-26	Heping	Nitsch	M	1988-05-21
10053	1954-09-13	Sanjiv	Zschoche	F	1986-02-04
10054	1957-04-04	Mayumi	Schueller	M	1995-03-13
10055	1956-06-06	Georgy	Dredge	M	1992-04-27
10056	1961-09-01	Brendon	Bernini	F	1990-02-01
10057	1954-05-30	Ebbe	Callaway	F	1992-01-15

w/ pdanonymizer:

mysql < examples/sql/test_print.sql | python3 pdanonymizer
emp_no	birth_date	first_name	last_name	gender	hire_date
8787379297843	30.10.2004	Scott	Mathews	M	13.07.1981
5527296127105	25.06.2003	Andrade	Jacobson	M	20.01.2013
1436125488352	04.01.1976	George	Powers	M	23.06.2003
9691495824069	09.09.1981	MD	Moore	F	06.02.1988
3952266963094	24.01.1986	Uri	Ortiz	F	02.07.1971
6917352791620	29.05.2000	Dickson	Peterson	F	20.02.1991
2082483942457	20.01.1982	Black	Weaver	M	23.03.2002
4161563061665	10.03.1973	Gilbert	Larson	F	24.07.1986
0867367983250	02.09.2020	Tucker	Garza	M	22.07.1996
2841039452051	03.09.1989	Castaneda	Hickman	M	02.04.1972
1448254445397	04.11.1975	Petersen	Harris	M	02.11.1971
0604775503546	29.01.2015	Martin	Lawrence	M	17.01.2010
9446687385589	26.01.1994	Sutton	Osborne	F	22.03.1993
0447931894436	26.10.1981	Miller	Davis	M	28.08.1993
1497766755999	06.09.1985	Stout	Smith	M	03.12.2000
5108961087094	31.10.1986	Frey	Gillespie	M	31.03.2013
5188092281882	06.05.2015	Miller	Burch	F	07.01.2015
4237196973139	05.01.1979	Walker	Garza	M	25.02.1988
7840814674904	24.12.1983	James	Banks	M	11.12.1986
4594521752198	18.05.1982	Arnold	Crosby	F	26.12.1995
4071661846432	18.04.1974	Hernandez	Ward	F	14.05.1982

Testing on real CSV data:

w/o pdanonymizer:

"Name",     "Sex", "Age", "Height (in)", "Weight (lbs)"
"Alex",       "M",   41,       74,      170
"Bert",       "M",   42,       68,      166
"Carl",       "M",   32,       70,      155
"Dave",       "M",   39,       72,      167
"Elly",       "F",   30,       66,      124
"Fran",       "F",   33,       66,      115
"Gwen",       "F",   26,       64,      121
"Hank",       "M",   30,       71,      158
"Ivan",       "M",   53,       72,      175
"Jake",       "M",   32,       69,      143
"Kate",       "F",   47,       69,      139
"Luke",       "M",   34,       72,      163
"Myra",       "F",   23,       62,       98
"Neil",       "M",   36,       75,      160
"Omar",       "M",   38,       70,      145
"Page",       "F",   31,       67,      135
"Quin",       "M",   29,       71,      176
"Ruth",       "F",   28,       65,      131

w/ pdanonymizer (via web-app):

"Name",     "Sex", "Age", "Height (in)", "Weight (lbs)"
"Aguilar",       "M",   41,       74,      170
"Thompson",       "M",   42,       68,      166
"Williams",       "M",   32,       70,      155
"Tucker",       "M",   39,       72,      167
"Lindsey",       "F",   30,       66,      124
"Klein",       "F",   33,       66,      115
"Dean",       "F",   26,       64,      121
"Lane",       "M",   30,       71,      158
"Tyler",      "M",   53,       72,      175
"Brown",       "M",   32,       69,      143
"Phillips",       "F",   47,       69,      139
"Martinez",       "M",   34,       72,      163
"Maldonado",       "F",   23,       62,       98
"Patrick",       "M",   36,       75,      160
"Lawrence",       "M",   38,       70,      145
"Walker",       "F",   31,       67,      135
"Warren",       "M",   29,       71,      176
"Heath",       "F",   28,       65,      131

Example of image data masking:

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

License

MIT

About

anonymising personal data

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published