Redactor is a Python-based utility for redacting sensitive information using Natural Language Processing tools such as SpaCy and NLTK.
The project's Python code follows the PEP 8 style guide.
This utility uses a number of open source packages and tools:
- SpaCy - Industrial-strength Natural Language Processing (NLP) in Python.
- en_core_web_md - SpaCy's English pipeline optimized for CPU.
- nltk - A suite of open source tools, data sets, and tutorials for Natural Language Processing research.
- Pyap - Python address detector and parser.
- SpaCy-Wordnet - WordNet annotator for SpaCy, based on NLTK's WordNet interface.
- Pytest - Testing framework that supports complex functional testing.
- Pytest-cov - Coverage plugin for pytest.
- autopep8 - Tool that automatically formats Python code to conform to the PEP 8 style guide.
- Clone this repository and move into the folder.
$ git clone https://github.com/Biswas-N/Redactor.git
$ cd Redactor
- Install dependencies using Pipenv.
$ pipenv install
- Run the utility tool.
$ make
Note: The project includes a Makefile with commonly used commands. Running make executes the following command:
$ pipenv run python redactor.py --input '*.txt' --names --dates --phones --genders --address --concept 'war' --concept 'dog' --output 'files/' --stats 'process.log'
- The redacted files are stored in the files/ folder with the .redacted extension.
- Finally, the stats for the redaction process are stored in process.log.
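To make the flags in the command above easier to read, here is a minimal sketch of how such a command line could be parsed with Python's standard argparse module. It is not taken from redactor.py: the function names build_parser and redacted_path are hypothetical, and the choices to allow repeated --input and --concept flags and to append .redacted to the full file name are assumptions.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser accepting the flags shown in the make command (illustrative only)."""
    parser = argparse.ArgumentParser(prog="redactor.py")
    # Assumption: --input and --concept may be given several times,
    # so each occurrence is appended to a list.
    parser.add_argument("--input", action="append", required=True,
                        help="glob pattern of files to redact")
    parser.add_argument("--concept", action="append", default=[],
                        help="concept whose related words are redacted")
    # Each entity type is a simple on/off switch.
    for flag in ("names", "dates", "phones", "genders", "address"):
        parser.add_argument(f"--{flag}", action="store_true")
    parser.add_argument("--output", required=True,
                        help="folder for the redacted files")
    parser.add_argument("--stats", help="file that receives redaction stats")
    return parser


def redacted_path(source: Path, output_dir: Path) -> Path:
    """Hypothetical helper: keep the file name and append .redacted to it."""
    return output_dir / (source.name + ".redacted")
```

With this sketch, parsing `--input '*.txt' --names --concept war --concept dog --output files/` would yield a list of two concepts and a true names switch, while the flags that were not passed stay false.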
The documentation about code structure and extraction algorithm can be found here.
This utility is tested using pytest.
Documentation about the tests can be found here. Use the commands below to run the tests on your local system.
- Install dev-dependencies.
$ pipenv install --dev
- Run tests using the Makefile.
$ make test
- Run test coverage.
$ make cov
- Names of people, organizations, geopolitical entities, nationalities, and religious or political groups are all treated as names and are redacted when the --names flag is used.
- This tool depends on SpaCy's en_core_web_md model and WordNet, so its accuracy and performance depend directly on the accuracy and performance of that model and of WordNet.
- This tool's accuracy and performance are improved by using regular expressions alongside SpaCy and WordNet, but the regular expressions do not cover every form of the supported entities (names, phones, genders, dates, and addresses). As a result, some information may not be redacted if it is not recognized by the SpaCy model, is not present in WordNet, and is not matched by the included regular expressions.
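The limitations above can be illustrated with a small sketch that combines entity spans with a regular expression. It is not the project's implementation. The phone pattern is an illustrative assumption covering only common US-style numbers, and the redact function is hypothetical. The entity spans are passed in as plain (start, end, label) tuples so the example runs without a model; in practice they could be built from SpaCy's doc.ents using ent.start_char, ent.end_char, and ent.label_. The labels PERSON, ORG, GPE, and NORP are SpaCy's labels for people, organizations, geopolitical entities, and nationalities or religious or political groups.

```python
import re

# SpaCy entity labels that correspond to the "names" category described above.
NAME_LABELS = {"PERSON", "ORG", "GPE", "NORP"}

# Assumption: an illustrative pattern for US-style phone numbers such as
# 405-555-0134 or (405) 555 0134; it is not the project's actual expression.
PHONE_RE = re.compile(r"(?:\+?1[\s.-]?)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}")

BLOCK = "\u2588"  # full block character used as the mask


def redact(text, entity_spans):
    """Mask name entities and phone numbers in text.

    entity_spans is an iterable of (start, end, label) character spans,
    for example produced by a named entity recognizer.
    """
    # Keep only the entity spans that count as names.
    spans = [(start, end) for start, end, label in entity_spans
             if label in NAME_LABELS]
    # Add every phone number the regular expression recognizes.
    spans += [match.span() for match in PHONE_RE.finditer(text)]

    chars = list(text)
    for start, end in spans:
        for i in range(start, end):
            chars[i] = BLOCK  # overlapping spans are harmless here
    return "".join(chars)
```

In this sketch a number written as 405-555-0134 is masked, but a seven-digit number such as 5550134 is left untouched because the pattern does not cover it, which is the kind of gap the notes above describe.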