This repository has been archived by the owner on Dec 30, 2023. It is now read-only.
forked from alastairrushworth/htmldf
---
output: github_document
---
# htmldf <img src="man/figures/hex.png" align="right" width="150" />
[Code coverage](https://app.codecov.io/gh/alastairrushworth/htmldf)
[CRAN package page](https://CRAN.R-project.org/package=htmldf)
[CRAN check results](https://cran.r-project.org/web/checks/check_results_htmldf.html)
Overview
---
The package `htmldf` contains a single function, `html_df()`, which accepts a vector of URLs and, for each one, attempts to download the page and extract and parse its HTML. The result is returned as a `tibble` in which each row corresponds to a document and the columns contain page attributes and metadata extracted from the HTML, including:
+ page title
+ inferred language (uses Google's compact language detector)
+ RSS feeds
+ tables coerced to tibbles, where possible
+ hyperlinks
+ image links
+ social media profiles
+ the inferred programming language of any text with code tags
+ page size, generator and server
+ page accessed date
+ page published or last updated dates
+ HTTP status code
+ full page source html
Installation
---
To install the CRAN version of the package:
```{r, eval=FALSE}
install.packages('htmldf')
```
To install the development version of the package:
```{r, eval=FALSE}
remotes::install_github('alastairrushworth/htmldf')
```
Usage
---
First define a vector of URLs you want to gather information from. The function `html_df()` returns a `tibble` where each row corresponds to a webpage, and each column corresponds to an attribute of that webpage:
```{r, message=FALSE, warning=FALSE}
library(htmldf)
library(dplyr)
# An example vector of URLs to fetch data for
urlx <- c("https://alastairrushworth.github.io/Visualising-Tour-de-France-data-in-R/",
          "https://medium.com/dair-ai/pytorch-1-2-introduction-guide-f6fa9bb7597c",
          "https://www.tensorflow.org/tutorials/images/cnn",
          "https://www.analyticsvidhya.com/blog/2019/09/introduction-to-pytorch-from-scratch/")
# use html_df() to gather data
z <- html_df(urlx, show_progress = FALSE)
# have a quick look at the first page
glimpse(z[1, ])
```
To see the page titles, look at the `title` column:
```{r}
z %>% select(title, url2)
```
Where a page contains tables in `<table>` tags, these are gathered into the list column `tables`. `html_df()` attempts to coerce each table to a `tibble`; where that isn't possible, the raw HTML is returned instead.
```{r}
z$tables
```
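As a rough sketch of how such a list column might be handled, the base R snippet below counts the tables per page and keeps only the entries that were successfully coerced to data frames. The object `tables_col` is a hypothetical stand-in for `z$tables`, and its shape (one list element per page, each holding zero or more tables) is an assumption rather than documented package output.

```r
# Hypothetical stand-in for z$tables: one list element per page, each holding
# zero or more tables (an assumed shape, not actual package output).
tables_col <- list(
  list(data.frame(rider = c("A", "B"), stage_wins = c(2, 1))),
  list(),
  list(data.frame(x = 1:3), data.frame(y = letters[1:2]))
)

# Number of tables recovered from each page
n_tables <- lengths(tables_col)

# Keep only entries that were coerced to a data frame
# (tibbles also inherit from "data.frame"); raw HTML fallbacks are dropped
parsed <- lapply(tables_col, function(page) Filter(is.data.frame, page))

# Dimensions (rows, columns) of every parsed table on the third page
dims_page3 <- lapply(parsed[[3]], dim)
```

Filtering on `is.data.frame()` is one simple way to separate parsed tables from any raw HTML fallbacks before further processing.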
`html_df()` does its best to find RSS feeds embedded in the page:
```{r}
z$rss
```
`html_df()` will try to parse out any social profiles embedded or mentioned on the page. Currently, this includes profiles for the following sites:
+ bitbucket
+ devto
+ facebook
+ github
+ gitlab
+ instagram
+ keybase
+ linkedin
+ mastodon
+ orcid
+ patreon
+ researchgate
+ stackoverflow
+ twitter
+ youtube
```{r}
z$social
```
Code language is inferred from `<code>` chunks using a predictive model. The `code_lang` column contains a numeric score: values near 1 indicate mostly R code, and values near -1 indicate mostly Python code:
```{r}
z %>% select(code_lang, url2)
```
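To make the score easier to read, it could be mapped to a coarse label. The helper below is a minimal sketch: `label_code_lang()` and its `cutoff` argument are hypothetical names, the 0.5 threshold is an arbitrary illustration rather than a package default, and the scores are made up. It also assumes that pages without code may carry a missing score, which is passed through as `NA`.

```r
# Hypothetical helper: map a code_lang score in [-1, 1] to a coarse label.
# The 0.5 cutoff is an arbitrary illustration, not a package default.
label_code_lang <- function(score, cutoff = 0.5) {
  ifelse(is.na(score), NA_character_,
    ifelse(score >= cutoff, "mostly R",
      ifelse(score <= -cutoff, "mostly Python", "mixed or unclear")))
}

# Made-up scores standing in for z$code_lang
scores <- c(0.98, -0.87, 0.10, NA)
labels <- label_code_lang(scores)
```

With real output, the same helper could be applied to the `code_lang` column, for example inside `dplyr::mutate()`.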
Publication or last-updated dates, where they can be found, are stored in the `published` column:
```{r}
z %>% select(published, url2)
```
Comments? Suggestions? Issues?
---
Any feedback is welcome! Feel free to open a GitHub issue or send me a message on [Twitter](https://twitter.com/rushworth_a).