The philosophy of paperboy is to be a comprehensive collection of webscraping scripts for news media sites. Many data scientists and researchers write their own code when they have to retrieve news media content from websites. At the end of research projects, this code often ends up collecting digital dust on researchers' hard drives instead of being made public for others to use. paperboy offers writers of webscraping scripts a clear path to publishing their code and earning co-authorship on the package (see the For developers section below). For users, the promise is simple: paperboy delivers news media data from many websites in a consistent format. Check which domains are already supported in the table below or with the command `pb_available()`.
paperboy is not on CRAN yet. Install it via remotes (first install remotes with `install.packages("remotes")`):

```r
remotes::install_github("JBGruber/paperboy")
```
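Once the package is installed, you can also list the supported domains directly from R:

```r
library(paperboy)
# check which domains are already supported
pb_available()
```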
Say you have a link to a news media article, for example, from mediacloud.org. Simply supply one or multiple links to media articles to the main function, `pb_deliver()`:

```r
library(paperboy)
df <- pb_deliver("https://tinyurl.com/386e98k5")
df
```
url | expanded_url | domain | status | datetime | author | headline | text | misc |
---|---|---|---|---|---|---|---|---|
https://tinyurl.com/386e98k5 | https://www.theguardian.com/tv-and-radio/2021/jul/12/should-marge-divorce-homer | theguardian.com | 200 | 2021-07-12 12:00:13 | https://www.theguardian.com/profile/stuart-heritage | ’A woman trapped in an… | In the Guide’s weekly Solved!… | NULL |
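Since `pb_deliver()` accepts a whole vector of links, you can also fetch several articles in one call; the URLs below are placeholders for illustration only:

```r
# a character vector of article links (placeholder URLs)
urls <- c(
  "https://www.theguardian.com/world/2021/jul/12/some-article",
  "https://www.bbc.co.uk/news/some-article"
)

# returns one row per article, with the same columns as above
articles <- pb_deliver(urls)
```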
The returned `data.frame` contains important meta information about the news items and their full text. Notice that the function had no problem reading the link, even though it was shortened. paperboy is an unfinished and highly experimental package at the moment. You will therefore often encounter this warning:

```r
pb_deliver("google.com")
#> ! No parser for domain google.com yet, attempting generic approach.
```
url | expanded_url | domain | status | datetime | author | headline | text | misc |
---|---|---|---|---|---|---|---|---|
google.com | http://www.google.com/ | google.com | 200 | NA | NA | © 2023 - Datenschutzerklärung - Nutzungsbedingungen | | NULL |
The function still returns a data.frame, but important information is missing, in this case simply because it is not present on the page. The other URLs in the same call are processed normally, though. If you have a dead link in your `url` vector, the `status` column will be different from 200 and the remaining columns will contain `NA`s.
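If you only want to keep rows that were retrieved successfully, a simple filter on the `status` column is enough; a minimal sketch in base R:

```r
# drop rows with a missing or non-200 HTTP status code
ok <- df[!is.na(df$status) & df$status == 200, ]
```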
If you are unhappy with the results of the generic approach, you can still use the package's second function to download the raw HTML and parse it yourself later:

```r
pb_collect("google.com")
```
url | expanded_url | domain | status | content_raw |
---|---|---|---|---|
google.com | http://www.google.com/ | google.com | 200 | <!doctype html><html itemscope… |
`pb_collect()` uses concurrent requests to download many pages at the same time, which makes it very quick at collecting large amounts of data. You can then experiment with rvest or another package to extract the information you want from `df$content_raw`.
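As a rough sketch of what that could look like (the CSS selectors below are made up and will differ from site to site):

```r
library(paperboy)
library(rvest)

raw <- pb_collect("https://www.example.com/some-article")  # placeholder URL

# parse the raw HTML of the first collected page;
# the selectors are purely illustrative
html <- read_html(raw$content_raw[1])
headline <- html |> html_element("h1") |> html_text2()
body     <- html |> html_elements("article p") |> html_text2() |> paste(collapse = "\n")
```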
If there is no scraper for a news site yet and you want to contribute one to this project, you can become a co-author of this package by adding it via a pull request. First check the available scrapers and the open issues and pull requests. Open a new issue or comment on an existing one to communicate that you are working on a scraper (so that work isn't done twice). Then start by pulling a few articles with `pb_collect()` and parse the HTML code in the `content_raw` column (preferably with rvest).
Every webscraper should return a tibble with the following format:
url | expanded_url | domain | status | datetime | headline | author | text | misc |
---|---|---|---|---|---|---|---|---|
character | character | character | integer | as.POSIXct | character | character | character | list |
the original url fed to the scraper | the full url | the domain | http status code | publication datetime | the headline | the author | the full text | all other information that can be consistently found on a specific outlet |
Since some outlets will give you additional information, the `misc` column was included so it can be retained.
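As an orientation, a parser for a single outlet could roughly look like the sketch below; the function name, the way a row of `pb_collect()` output is passed in, and all selectors are hypothetical, so check the existing parsers in the package for the actual conventions:

```r
library(rvest)
library(tibble)

# hypothetical sketch: `page` is assumed to be one row of the
# data.frame returned by pb_collect()
parse_example_com <- function(page) {
  html <- read_html(page$content_raw)

  # the datetime format differs from site to site; adjust accordingly
  published <- html |> html_element("time") |> html_attr("datetime")

  tibble(
    url          = page$url,
    expanded_url = page$expanded_url,
    domain       = page$domain,
    status       = page$status,
    datetime     = as.POSIXct(published, format = "%Y-%m-%dT%H:%M:%S", tz = "UTC"),
    headline     = html |> html_element("h1") |> html_text2(),
    author       = html |> html_element(".author") |> html_text2(),
    text         = html |> html_elements("article p") |> html_text2() |> paste(collapse = "\n"),
    misc         = list(NULL)
  )
}
```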
domain | author | issues |
---|---|---|
ac24.cz | @JBGruber | |
ad.nl | @JBGruber | |
aktualne.cz | @JBGruber | |
anotherangryvoice.blogspot.com | @JBGruber | |
bbc.co.uk | @JBGruber | |
blesk.cz | @JBGruber | |
boston.com | | #1 |
bostonglobe.com | | #1 |
breakingnews.ie | @JBGruber | |
breitbart.com | @JBGruber | |
buzzfeed.com | @JBGruber | |
cbslnk.cbsileads.com | | #1 |
cbsnews.com | @JBGruber | |
ceskatelevize.cz | @JBGruber | |
cnet.com | @JBGruber | |
dailymail.co.uk | @JBGruber | |
decider.com | | #1 |
denikn.cz | @JBGruber | |
edition.cnn.com | @JBGruber | |
eu.usatoday.com | @JBGruber | |
evolvepolitics.com | @JBGruber | |
faz.net | @JBGruber | |
forbes.com | @JBGruber | #2 |
fortune.com | | #1 |
foxbusiness.com | @JBGruber | |
foxnews.com | @JBGruber | |
ftw.usatoday.com | @JBGruber | |
geenstijl.nl | @JBGruber | |
hn.cz | @JBGruber | |
huffingtonpost.co.uk | @JBGruber | |
idnes.cz | @JBGruber | |
independent.co.uk | @JBGruber | |
independent.ie | @JBGruber | |
irishexaminer.com | @JBGruber | |
irishmirror.ie | @JBGruber | |
irishtimes.com | @JBGruber | |
irozhlas.cz | @JBGruber | |
joe.ie | @JBGruber | |
latimes.com | @JBGruber | |
lidovky.cz | @JBGruber | |
lnk.techrepublic.com | | #1 |
marketwatch.com | @JBGruber | |
mediacourant.nl | @JBGruber | |
metronieuws.nl | @JBGruber | |
msnbc.com | | #1 |
newstatesman.com | @JBGruber | |
newsweek.com | @JBGruber | |
nos.nl | @JBGruber | |
novinky.cz | @JBGruber | |
nrc.nl | @JBGruber | |
nu.nl | @JBGruber | |
nypost.com | @JBGruber | |
nytimes.com | @JBGruber | #17 |
pagesix.com | | #1 |
parlamentnilisty.cz | @JBGruber | |
rte.ie | @JBGruber | |
rtl.nl | @JBGruber | |
seznamzpravy.cz | @JBGruber | |
sfgate.com | @JBGruber | |
skwawkbox.org | @JBGruber | |
sky.com | @JBGruber | |
telegraaf.nl | @JBGruber | #17 |
telegraph.co.uk | @JBGruber | |
thecanary.co | @JBGruber | |
theguardian.com | @JBGruber | |
thejournal.ie | @JBGruber | |
thelily.com | | #1 |
thestreet.com | @JBGruber | |
thesun.ie | @JBGruber | |
thismorningwithgordondeal.com | | #1 |
time.com | | #1 |
tribpub.com | | #1 |
us.cnn.com | @JBGruber | |
usatoday.com | @JBGruber | |
volkskrant.nl | @JBGruber | |
washingtonpost.com | @JBGruber | |
wsj.com | @JBGruber | |
yahoo.com | @JBGruber | |
zeit.de | @JBGruber | |