The walkthrough notebook can also be run in Google Colab.
Developed by Fast Data Science. Fast Data Science develops products and offers consulting services and training courses in natural language processing (NLP). Subscribe to our blog for regular news from the NLP universe.
Source code at https://github.com/fastdatascience/faststylometry
Tutorial at https://fastdatascience.com/fast-stylometry-python-library/
Fast Stylometry is a Python library for calculating Burrows' Delta. Burrows' Delta is an algorithm for measuring how similar the writing styles of documents are, a task known as forensic stylometry.
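To give a feel for what the score measures, here is a minimal sketch of the textbook formulation of Burrows' Delta: take the most frequent words, convert each word's relative frequency to a z-score, and average the absolute z-score differences between two texts. The function names `relative_frequencies` and `burrows_delta` are illustrative only and are not part of the Fast Stylometry API; the library's own implementation may differ in details such as tokenisation and how the corpus statistics are computed.

```python
from collections import Counter
from statistics import mean, pstdev


def relative_frequencies(tokens, vocabulary):
    """Return the share of `tokens` taken up by each word in `vocabulary`."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[word] / total for word in vocabulary]


def burrows_delta(known_texts, unknown_tokens, n_words=50):
    """Return {label: delta} comparing one unknown text with each known text.

    `known_texts` maps a label (for example an author name) to a list of tokens.
    """
    # 1. Pick the most frequent words across all known texts.
    pooled = Counter()
    for tokens in known_texts.values():
        pooled.update(tokens)
    vocabulary = [word for word, _ in pooled.most_common(n_words)]

    # 2. Relative frequency of each of those words in every known text.
    freqs = {label: relative_frequencies(tokens, vocabulary)
             for label, tokens in known_texts.items()}

    # 3. Mean and standard deviation of each word's frequency across known texts.
    columns = list(zip(*freqs.values()))
    means = [mean(col) for col in columns]
    stdevs = [pstdev(col) or 1.0 for col in columns]  # avoid dividing by zero

    def z_scores(values):
        return [(v - m) / s for v, m, s in zip(values, means, stdevs)]

    # 4. Delta is the mean absolute difference between the z-scores.
    unknown_z = z_scores(relative_frequencies(unknown_tokens, vocabulary))
    return {
        label: mean([abs(a - b) for a, b in zip(z_scores(values), unknown_z)])
        for label, values in freqs.items()
    }
```

A lower delta means the two texts use common words in more similar proportions, so the candidate with the smallest delta is the closest stylistic match.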
You can install from PyPI.
pip install faststylometry
Demonstration of Burrows' Delta on a small corpus downloaded from Project Gutenberg.
We will test the Burrows' Delta code on two "unknown" texts: Sense and Sensibility by Jane Austen, and Villette by Charlotte Bronte. Both authors are in our training corpus.
You can get the training corpus by cloning https://github.com/fastdatascience/faststylometry; the data is in the data folder. Alternatively, you can call download_examples() from Python after importing Fast Stylometry:
from faststylometry import download_examples
download_examples()
The Burrows Delta Walkthrough.ipynb Jupyter notebook is the best place to start, but here are the basic commands to use the library:
To create a corpus and add books, the pattern is as follows:
from faststylometry import Corpus
corpus = Corpus()
corpus.add_book("Jane Austen", "Pride and Prejudice", [whole book text])
Here is the pattern for creating a corpus and adding books from a directory on your system. You can also use the method util.load_corpus_from_folder(folder, pattern).
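import os
import re

from faststylometry.corpus import Corpus

folder = "data/train"  # directory containing files named author_-_title.txt

corpus = Corpus()
for root, _, files in os.walk(folder):
    for filename in files:
        # Only read text files that follow the author_-_title.txt naming pattern
        if filename.endswith(".txt") and "_-_" in filename:
            with open(os.path.join(root, filename), "r", encoding="utf-8") as f:
                text = f.read()
            # Strip the extension, then split once on the author/title separator
            author, book = re.split("_-_", re.sub(r"\.txt$", "", filename), maxsplit=1)
            corpus.add_book(author, book, text)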
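The loop above relies on each file being named in the form author_-_title.txt. As a small sketch, the parsing step can be pulled out into a helper that returns None for files that do not follow the convention, so they can be skipped instead of raising an error. The name `parse_filename` is hypothetical and is not part of Fast Stylometry.

```python
import os


def parse_filename(filename):
    """Split 'author_-_title.txt' into (author, title), or return None."""
    stem, extension = os.path.splitext(filename)
    # Reject anything that is not a .txt file with the expected separator.
    if extension != ".txt" or "_-_" not in stem:
        return None
    # Split only once, so a title may itself contain the separator.
    author, title = stem.split("_-_", 1)
    return author, title
```

For example, `parse_filename("austen_-_emma.txt")` returns `("austen", "emma")`, while `parse_filename("notes.txt")` returns `None`.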
Download some example data (Project Gutenberg texts) from the Fast Stylometry repository:
from faststylometry import download_examples
download_examples()
Load a corpus and calculate Burrows' Delta
from faststylometry.util import load_corpus_from_folder
from faststylometry.en import tokenise_remove_pronouns_en
from faststylometry.burrows_delta import calculate_burrows_delta
train_corpus = load_corpus_from_folder("data/train")
train_corpus.tokenise(tokenise_remove_pronouns_en)
test_corpus_sense_and_sensibility = load_corpus_from_folder("data/test", pattern="sense")
test_corpus_sense_and_sensibility.tokenise(tokenise_remove_pronouns_en)
calculate_burrows_delta(train_corpus, test_corpus_sense_and_sensibility)
This returns a pandas DataFrame of Burrows' Delta scores. A lower score indicates a closer stylistic match.
Using the probability calibration functionality, you can calculate the probability of two books being by the same author.
from faststylometry.probability import predict_proba, calibrate
calibrate(train_corpus)
predict_proba(train_corpus, test_corpus_sense_and_sensibility)
This outputs a pandas DataFrame of probabilities.
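To illustrate the general idea of turning a distance into a probability, here is a toy sketch that fits a one-variable logistic curve to pairs of (delta, same author or not). This is only one common way to calibrate a score; whether `calibrate` works exactly like this is an assumption, and the helper names and the numbers below are made up for illustration rather than taken from the library or from real texts.

```python
import math


def fit_logistic(deltas, same_author, learning_rate=0.1, epochs=5000):
    """Fit p(same author | delta) = sigmoid(w * delta + b) by gradient descent."""
    w, b = 0.0, 0.0
    n = len(deltas)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(deltas, same_author):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += p - y
        # Step against the average gradient of the log loss.
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b


def probability_same_author(delta, w, b):
    """Map a delta score to a probability using the fitted curve."""
    return 1.0 / (1.0 + math.exp(-(w * delta + b)))


# Made-up calibration pairs: small deltas labelled 1 (same author), large ones 0.
toy_deltas = [0.4, 0.6, 0.7, 0.9, 1.3, 1.5, 1.8, 2.0]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
w, b = fit_logistic(toy_deltas, toy_labels)
```

With data like this the fitted weight is negative, so a small delta maps to a high probability of shared authorship and a large delta to a low one.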
Thomas Wood at Fast Data Science
If you'd like to contribute to this project, you can contact us at https://fastdatascience.com/ or make a pull request on our GitHub repository. You can also raise an issue.
Test code is in the tests/ folder and uses unittest.
The testing tool tox is used in the automation with GitHub Actions CI/CD.
Install tox and run it:
pip install tox
tox
In our configuration, tox runs a check of the source distribution using check-manifest (which requires your repo to be git-initialized (git init) and added (git add .) at least), setuptools's check, and unit tests using pytest. You don't need to install check-manifest and pytest, though; tox will install them in a separate environment.
The automated tests are run against several Python versions, but on your machine you might be using only one version of Python. If that is Python 3.9, then run:
tox -e py39
Thanks to GitHub Actions' automated process, you don't need to generate distribution files locally.
This package is based on the template https://pypi.org/project/example-pypi-package/
This package
- uses GitHub Actions for both testing and publishing
- is tested when pushing to the master or main branch, and is published when a release is created
- includes test files in the source distribution
- uses setup.cfg for version single-sourcing (setuptools 46.4.0+)
The code to re-release Fast Stylometry on PyPI is as follows:
source activate py311
pip install twine
rm -rf dist
python setup.py sdist
twine upload dist/*
The tool was developed by:
- Thomas Wood, Natural Language Processing consultant and data scientist at Fast Data Science.
MIT License. Copyright (c) 2023 Fast Data Science
If you are undertaking research in AI, NLP, or other areas, and are publishing your findings, I would be grateful if you could please cite the project.
Wood, T.A., Fast Stylometry [Computer software] (1.0.4). Data Science Ltd. DOI: 10.5281/zenodo.11096941, accessed at https://fastdatascience.com/fast-stylometry-python-library, Fast Data Science (2024)
A BibTeX entry for LaTeX users is:
@software{faststylometry,
  author = {Wood, T.A.},
  title = {Fast Stylometry (Computer software), Version 1.0.4},
  year = {2024},
  url = {https://fastdatascience.com/fast-stylometry-python-library/},
  doi = {10.5281/zenodo.11096941},
}