Operationalizing conflict and cooperation between automated software agents in Wikipedia: A replication and expansion of "Even Good Bots Fight"
This is the code and partial data repository for a paper published in Proceedings of the ACM on Human-Computer Interaction (November 2017) and to be presented at the ACM conference on Computer-Supported Cooperative Work and Social Computing (CSCW) in 2018. The paper explores the phenomenon of bot-bot reverts in Wikipedia, in which automated software agents undo each other's edits. While previous research raised caution about these cases and used them as evidence of Wikipedia's failure to properly govern automation, our collaborative computational/ethnographic approach explores these cases in rich detail and nuance, asking when it is appropriate to classify bot-bot reverts as conflict. We find that an overwhelming proportion of bot-bot reverts are cases in which bots are doing important, productive work that presents as reverts because of the ways in which work in Wikipedia is organized.
We are releasing our code and data from this research project. This GitHub repository contains our processed data files and the various Jupyter notebooks we used to explore and analyze the phenomenon of bot-bot reverts in Wikipedia. Due to GitHub storage limits, not all of our source datasets and intermediate data files can be stored on GitHub, but we have uploaded them to various open data repositories. See the Datasets section below for more details.
If you want to play around with our analyses, you can launch this repository now in a free mybinder JupyterHub server by clicking the button below (note that this server is temporary and will expire after an hour or so). All the notebooks in `analysis/main/` can be run in your browser from the mybinder server without any additional setup or data processing. Or you can open any of the notebooks in the `analysis/` folder on GitHub and see static renderings of the analysis.
Python >=3.3, with the following packages (versions specified in `requirements.txt`):
pip install mwcli mwreverts mwtypes mwxml jsonable docopt mysqltsv pandas seaborn
R >= 3.2, with the packages:
install.packages("ggplot2")
install.packages("data.table")
Jupyter Notebooks >=4.0 for running notebooks, with the IRKernel for the R notebooks, and `xz-utils` for compression.
The `environment.yml` file contains specifications for a standardized conda environment (see also Anaconda), with the packages needed to run the analyses (including Jupyter notebooks). A Dockerfile is available at `Dockerfile.old` (named to avoid incompatibilities with mybinder.org), which is based on the `jupyter/datascience-notebook` Dockerfile. For Python packages, there is also a `requirements.txt`, which specifies required version numbers.
The folder `/environment/` contains a Jupyter notebook displaying various information about one of the computational environments in which this pipeline was run, as well as the outputs of `pip freeze`, `conda list`, and `conda env export`. Note that this environment has far more packages installed than are needed to run this project.
We have two datasets of bots across language versions of Wikipedia:
- `datasets/crosswiki_category_bot_20170328.tsv` is generated by `get_category_bots.py` (also made in the `Makefile`) and contains a list of bots based on Wikidata categories for various language versions of Wikipedia's equivalent of Category:All Wikipedia bots.
- `datasets/crosswiki_unified_bot_20170328.tsv` is made in the `Makefile` and contains the above dataset combined with lists of bots from the `user_groups` and `former_user_groups` database tables ("the bot flag") in our seven language versions of Wikipedia. This dataset is as complete a list of current and historical bots (including unauthorized bots) as it is possible to automatically generate for these language versions of Wikipedia. A minimal loading sketch for this file appears after this list.
Note that if you re-run these scripts and generate new bot lists, they will likely differ from the ones we used in our analysis, as Wikipedians are continually updating these source bot lists. So if you generate new bot lists and then re-run the rest of our data collection, processing, and analysis pipeline, you may get slightly different results, because you will be using a different list of bot accounts.
This project begins with the stub-meta-history.xml.gz database dumps from the Wikimedia Foundation, which contain metadata for every edit made to every page in particular language versions of Wikipedia (see here for details about the dumps). The BASH script we used to download the April 20th, 2017 database dumps from the Wikimedia Foundation's servers is `download_dumps.sh`. However, Wikimedia takes down database dumps after about six months, so the files we used are no longer accessible using this script. We include it for purposes of computational reproducibility, and the script can be modified to download newer database dumps for future research. We have also archived the April 20th, 2017 database dumps we used at the California Digital Library's DASH project -- you must enter your e-mail address and DASH will send you a link to download them.
Note that these files are large -- approximately 93GB compressed -- and on a 16-core Xeon workstation, the first stage of parsing all reverts from the dumps can take a week. As we are not taking issue with how previous researchers have computationally identified reverts (only with how to interpret reverts as conflict), replicating this step is not crucial. We recommend that those interested in replication start with the bot-bot revert datasets, described below.
We then processed the database dumps using `mwreverts dump2reverts` (passing in the list of bots as a parameter) to generate .json.bz2 formatted datasets of all reverts to all pages across the languages we analyzed, stored in `datasets/reverts/`. These files are a single .json.bz2 file for each non-English Wikipedia, but are split into five parts for English Wikipedia. They are then used to generate the monthly bot revert tables in step 3.1 and the bot-bot revert datasets in step 4. These data are not included in the GitHub repo because they are multiple gigabytes in size, but we have publicly released them on Figshare.
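For readers unfamiliar with how reverts are identified in this literature, the sketch below illustrates the general idea of identity-revert detection: a revision is treated as a revert if it restores the page to a state with an identical SHA1 checksum within a bounded window of prior revisions. This is a simplified illustration of the concept, not the mwreverts implementation; the function name, input format, and window size are ours.

```python
def detect_identity_reverts(revisions, radius=15):
    """Simplified identity-revert detection over one page's revision history.

    `revisions` is an iterable of dicts with at least `rev_id` and `sha1`,
    ordered oldest to newest. A revision counts as a revert if its checksum
    matches an earlier revision within the last `radius` edits; everything
    in between is treated as reverted. Illustrative only.
    """
    window = []   # recent (sha1, rev_id) pairs, newest last
    reverts = []
    for rev in revisions:
        sha1, rev_id = rev["sha1"], rev["rev_id"]
        # Scan the window from newest to oldest for a matching checksum.
        for i in range(len(window) - 1, -1, -1):
            if window[i][0] == sha1:
                reverted_ids = [rid for _, rid in window[i + 1:]]
                if reverted_ids:  # skip null edits that re-save identical text
                    reverts.append({
                        "reverting_id": rev_id,
                        "reverted_to_rev_id": window[i][1],
                        "reverted_rev_ids": reverted_ids,
                    })
                break
        window.append((sha1, rev_id))
        window = window[-radius:]  # keep only the last `radius` revisions
    return reverts
```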
Note that the version of the `Makefile` in this directory will automatically download these datasets from Figshare if they are not found in the appropriate sub-directory. We have left the original make commands that process the database dumps to generate these .json.bz2 files of all reverts in the `Makefile`, but commented them out. To re-run this stage of the process, uncomment those commands and comment out the commands that download from Figshare.
The `Makefile` loads the full table of reverts and runs `bot_revert_monthly_stats.py` to generate TSV formatted tables for each language, which contain monthly, namespace-grouped counts of: the number of reverts (reverts), reverts by bots (bot_reverts), bot edits that were reverted (bot_reverteds), and bot-bot reverts (bot2bot_reverts). These tables are stored in `datasets/monthly_bot_reverts/` and included in this repo.
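To illustrate the kind of aggregation this step performs (a sketch of the idea, not the script itself), the pandas snippet below assumes a per-revert DataFrame with `reverting_timestamp` and `page_namespace` columns as in the data dictionary further down, plus two hypothetical boolean columns, `reverting_user_is_bot` and `reverted_user_is_bot`, marking whether each side of the revert is a bot.

```python
import pandas as pd

def monthly_revert_stats(reverts: pd.DataFrame) -> pd.DataFrame:
    """Illustrative monthly/namespace revert counts, not the original script."""
    df = reverts.copy()
    # Parse the 14-digit MediaWiki timestamp into a monthly period.
    df["month"] = pd.to_datetime(
        df["reverting_timestamp"].astype(str), format="%Y%m%d%H%M%S"
    ).dt.to_period("M")
    # `reverting_user_is_bot` / `reverted_user_is_bot` are hypothetical flags.
    df["bot2bot"] = df["reverting_user_is_bot"] & df["reverted_user_is_bot"]
    grouped = df.groupby(["page_namespace", "month"])
    out = pd.DataFrame({
        "reverts": grouped.size(),
        "bot_reverts": grouped["reverting_user_is_bot"].sum(),
        "bot_reverteds": grouped["reverted_user_is_bot"].sum(),
        "bot2bot_reverts": grouped["bot2bot"].sum(),
    })
    return out.reset_index()
```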
The `Makefile` downloads TSV files containing the results of SQL queries run against Wikipedia's databases using the open querying platform Quarry. These queries create one TSV file for each language, containing monthly counts by namespace of the total number of bot edits (the variable name for these counts in the TSV is `n`). These files are stored in `datasets/monthly_bot_edits/` and included in this repo. You can go to the links to Quarry left in comments in the `Makefile` to see the SQL queries. Note that the query for English Wikipedia exceeds Quarry's time-out limit and was manually run against Wikipedia's databases, then uploaded to GitHub. The `Makefile` downloads the English Wikipedia file from GitHub, but the SQL query can still be seen on Quarry.
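For a quick sanity check after this step, something like the following can load one of these per-language tables. The filename here is hypothetical, and apart from the count column `n`, the exact column names should be taken from the file header.

```python
import pandas as pd

# Hypothetical filename -- one TSV per language is downloaded by the Makefile.
edits = pd.read_csv("datasets/monthly_bot_edits/frwiki_monthly_bot_edits.tsv", sep="\t")
print(edits.columns.tolist())   # inspect the header; only `n` is documented above
print(edits["n"].sum())         # total bot edits across all months and namespaces
```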
The `Makefile` loads the revert datasets in `datasets/reverts/` and the bot list, and runs `revert_json_2_tsv.py` to generate TSV formatted, bz2-compressed datasets of every bot-bot revert across pages in all namespaces for each language. These are stored in `datasets/reverted_bot2bot/` and included in this repo. The format of these datasets can be seen in `analysis/0-load-process-data.ipynb`. Starting with these datasets lets you reproduce the novel parts of our analysis pipeline, and so we recommend starting here.
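If you start from these files, pandas can read the bz2-compressed TSVs directly. The filename below is a hypothetical example; see `datasets/reverted_bot2bot/` for the actual per-language file names.

```python
import pandas as pd

# Hypothetical per-language filename; check datasets/reverted_bot2bot/ for real ones.
reverts = pd.read_csv(
    "datasets/reverted_bot2bot/frwiki_20170427.tsv.bz2",
    sep="\t",
    compression="bz2",   # pandas would also infer this from the .bz2 extension
)
print(len(reverts), "bot-bot reverts loaded")
```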
Datasets in `datasets/parsed_dataframes/` are created by the analyses in the Jupyter notebooks in the `analysis/` folder. If you are primarily interested in exploring our results and conducting further analysis, we recommend starting with `df_all_comments_parsed_2016.pickle.xz`. These datasets are compressed with xz (at a high compression level to keep them under GitHub's 100MB limit). The decompressed pickle file is a serialized pandas dataframe that can be loaded in Python, as seen in the notebooks in the `analysis/paper_plots` folder and in the loading sketch after the list below. The Jupyter notebooks also generate a TSV formatted version of this file, but it is too large to be included in GitHub.
- `df_all_2016.pickle.xz` is a pandas dataframe of all bot-bot reverts in the languages in our dataset. It is generated by running the Jupyter notebook `analysis/main/0-load-process-data.ipynb`, which also shows the variables in this dataset.
- `df_all_comments_parsed_2016.pickle.xz` extends `df_all_2016.pickle.xz` with classifications of reverts. It is generated by `analysis/main/7-2-comment-parsing.ipynb`, which also shows the variables in this dataset.
- `possible_botfights.pickle.bz2` and `possible_botfights.tsv.bz2` are bzip2-compressed filtered datasets of `df_all_comments_parsed_2016.pickle`, containing reverts from all languages in our analysis that are possible cases of bot-bot conflict (part of a bot-bot reciprocation, with time to revert under 180 days). They are generated by `analysis/main/8-comments-analysis.ipynb`.
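As a minimal loading sketch (assuming a reasonably recent pandas, which infers xz compression from the file extension):

```python
import pandas as pd

# Recent pandas versions infer xz compression from the .xz extension;
# otherwise decompress first (e.g. with `xz -dk`) and load the .pickle file.
df = pd.read_pickle("datasets/parsed_dataframes/df_all_comments_parsed_2016.pickle.xz")
print(df.shape)
print(df["bottype_group"].value_counts().head())
```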
Analyses presented in the paper are in the `analysis/main/` folder, with a Jupyter notebook for each paper section (for example, section 5.2 on time to revert is presented in `5-2-time-to-revert.ipynb`). Some of these notebooks include more plots, tables, data, and analyses than we were able to fit in the paper, but we kept them because they could be informative.
We also have various supplemental and exploratory analyses in `analyses/exploratory/`.
These tables are accessible here, and in raw HTML form at `analysis/sample_tables/`. They were generated by `analysis/7-3-comments-sample-diffs.ipynb`.
This is the data dictionary for df_all_comments_parsed_2016.csv/.pickle, which are the final datasets. Intermediate datasets (like df_all_2016.csv/.pickle) have fewer fields, but these descriptions are accurate for fields that appear in those datasets as well.
field name | description | created in | example row 1 | example row 2 |
---|---|---|---|---|
archived | Has the reverting edit been archived? | Makefile, from the database dumps | FALSE | FALSE |
language | two letter country code (*.wikipedia.org subdomain) | Makefile, from the database dumps | fr | fr |
page_namespace | Integer namespace of the page where the edit was made. Matches page_namespace in the page database table | Makefile, from the database dumps | 0 | 0 |
rev_deleted | Has the reverted edit been deleted? | Makefile, from the database dumps | FALSE | FALSE |
rev_id | Integer revision ID of the reverted edit, matches rev_id in the revision database table | Makefile, from the database dumps | 88656915 | 70598552 |
rev_minor_edit | Was the reverted edit flagged as a minor edit by the user who made it? | Makefile, from the database dumps | TRUE | TRUE |
rev_page | Integer page ID of the page where the reverted edit was made, matches rev_page in the revision database table and page_id in the page database table | Makefile, from the database dumps | 4419903 | 412311 |
rev_parent_id | The revision ID of the revision immediately prior to the reverted edit | Makefile, from the database dumps | 8.86E+07 | 6.75E+07 |
rev_revert_offset | distance of the reverted revision from the reverting revision (1 == the most recent reverted revision) | Makefile, from the database dumps | 1 | 1 |
rev_sha1 | The SHA1 hash of the page text made by the reverted edit | Makefile, from the database dumps | lgtqatftj6rma9ezkyy56rsqethdoqf | 0zw28ur2rlxg207ms6w3krqd4qzozq3 |
rev_timestamp | Timestamp of the reverted edit in UTC (YYYYMMDDHHMMSS) | Makefile, from the database dumps | 20130211173947 | 20110930180432 |
rev_user | User ID of the user who made the reverted edit | Makefile, from the database dumps | 1019240 | 414968 |
rev_user_text | Username of the user who made the reverted edit | Makefile, from the database dumps | MerlIwBot | Luckas-bot |
reverted_to_rev_id | Revision ID of the revision that the reverting edit reverted back to | Makefile, from the database dumps | 88597754 | 67506906 |
reverting_archived | Has the reverting edit been archived? | Makefile, from the database dumps | FALSE | FALSE |
reverting_comment | Edit summary of the reverting edit | Makefile, from the database dumps | r2.7.2+) (robot Retire : [[cbk-zam:Tortellá]] | robot Retire: [[hy:Հակատանկային կառավարվող հրթ... |
reverting_deleted | Has the reverting edit been deleted? | Makefile, from the database dumps | FALSE | FALSE |
reverting_id | Revision ID of the reverting edit, matches rev_id in the revision database table | Makefile, from the database dumps | 89436503 | 70750839 |
reverting_minor_edit | Was the reverting edit flagged as a minor edit by the user who made it? | Makefile, from the database dumps | TRUE | TRUE |
reverting_page | Integer page ID of the page where the reverting edit was made, matches rev_page in the revision database table and page_id in the page database table | Makefile, from the database dumps | 4419903 | 412311 |
reverting_parent_id | The revision ID of the revision immediately prior to the reverting edit | Makefile, from the database dumps | 8.87E+07 | 7.06E+07 |
reverting_sha1 | The SHA1 hash of the page text made by the reverting edit | Makefile, from the database dumps | gjz9jni8w2jiccksgid7tbofddevhu0 | myxsvdiky34vgddnhrclg9237cus7nn |
reverting_timestamp | Timestamp of the reverting edit in UTC (YYYYMMDDHHMMSS) | Makefile, from the database dumps | 20130302203329 | 20111004215328 |
reverting_user | User ID of the user who made the reverting edit | Makefile, from the database dumps | 757129 | 1019240 |
reverting_user_text | Username of the user who made the reverting edit | Makefile, from the database dumps | EmausBot | MerlIwBot |
revisions_reverted | Number of revisions reverted by the reverting edit | Makefile, from the database dumps | 1 | 1 |
namespace_type | Text description of the namespace (0: "article", 14: "category", other odd: "other talk"; other even: "other page") | 0-load-process-data.ipynb | article | article |
reverted_timestamp_dt | Datetime64 version of rev_timestamp | 0-load-process-data.ipynb | 2013-02-11 17:39:47 | 2011-09-30 18:04:32 |
reverting_timestamp_dt | Datetime64 version of reverting_timestamp | 0-load-process-data.ipynb | 2013-03-02 20:33:29 | 2011-10-04 21:53:28 |
time_to_revert | Time between the reverted and reverting revision: Timedelta64 of the difference between reverting_timestamp and rev_timestamp | 0-load-process-data.ipynb | 19 days 02:53:42 | 4 days 03:48:56 |
time_to_revert_hrs | Float conversion of time_to_revert in hours | 0-load-process-data.ipynb | 458.9 | 99.82 |
time_to_revert_days | Float conversion of time_to_revert in days | 0-load-process-data.ipynb | 19.12 | 4.159 |
reverting_year | Integer year of the reverting revision | 0-load-process-data.ipynb | 2013 | 2011 |
time_to_revert_days_log10 | Float log10(time_to_revert_days) | 0-load-process-data.ipynb | 1.282 | 0.619 |
time_to_revert_hrs_log10 | Float log10(time_to_revert_hours) | 0-load-process-data.ipynb | 2.662 | 1.999 |
reverting_comment_nobracket | Reverting comment text with all text inside brackets, parentheses, and braces removed | 0-load-process-data.ipynb | r2.7.2+) | robot Retire: |
botpair | String concatenation of reverting_user_text + " rv " + rev_user_text | 0-load-process-data.ipynb | EmausBot rv MerlIwBot | MerlIwBot rv Luckas-bot |
botpair_sorted | Sorted list of [reverting_user_text, rev_user_text] | 0-load-process-data.ipynb | ['EmausBot', 'MerlIwBot'] | ['Luckas-bot', 'MerlIwBot'] |
reverts_per_page_botpair | Total number of reverts in this dataset with the same botpair value on this page in this language | 0-load-process-data.ipynb | 1 | 1 |
reverts_per_page_botpair_sorted | Total number of reverts in this dataset with the same botpair_sorted value on this page in this language | 0-load-process-data.ipynb | 1 | 1 |
bottype | Classified type of bot-bot interaction, more granular than bottype_group | 7-2-comment-parsing.ipynb | interwiki link cleanup -- method2 | interwiki link cleanup -- method2 |
bottype_group | Classified type of bot-bot interaction, consolidated from bottype | 7-2-comment-parsing.ipynb | interwiki link cleanup -- method2 | interwiki link cleanup -- method2 |
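As a small illustration of how the derived time fields in this dictionary relate to the raw timestamps (a sketch of the transformation described above, not a copy of the notebook code):

```python
import numpy as np
import pandas as pd

# Sketch of the derived fields described in the data dictionary, assuming `df`
# already has the raw rev_timestamp / reverting_timestamp columns from the dumps.
def add_time_fields(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["reverted_timestamp_dt"] = pd.to_datetime(
        df["rev_timestamp"].astype(str), format="%Y%m%d%H%M%S"
    )
    df["reverting_timestamp_dt"] = pd.to_datetime(
        df["reverting_timestamp"].astype(str), format="%Y%m%d%H%M%S"
    )
    df["time_to_revert"] = df["reverting_timestamp_dt"] - df["reverted_timestamp_dt"]
    df["time_to_revert_hrs"] = df["time_to_revert"].dt.total_seconds() / 3600
    df["time_to_revert_days"] = df["time_to_revert_hrs"] / 24
    df["time_to_revert_hrs_log10"] = np.log10(df["time_to_revert_hrs"])
    df["time_to_revert_days_log10"] = np.log10(df["time_to_revert_days"])
    df["reverting_year"] = df["reverting_timestamp_dt"].dt.year
    return df
```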