# Part of Speech Tagging
If you need a refresher on parts of speech, the following Schoolhouse Rock videos should get you squared away:
- [A noun is a person, place, or thing.](https://youtu.be/h0m89e9oZko)
- [Interjections](https://youtu.be/YkAX7Vk3JEw)
- [Pronouns](https://youtu.be/Eu1ciVFbecw)
- [Verbs](https://youtu.be/US8mGU1MzYw)
- [Unpack your adjectives](https://youtu.be/NkuuZEey_bs)
- [Lolly Lolly Lolly Get Your Adverbs Here](https://youtu.be/14fXm4FOMPM)
- [Conjunction Junction](https://youtu.be/RPoBE-E8VOc) (personal fave)
Aside from those, you can also learn how bills get passed, what it's like to be a victim of gravity, how the decimal system compares to the numeric systems used by alien species (I recommend the Chavez remix), and a host of other useful things.
## Basic idea
With <span class="emph">part-of-speech</span> tagging, we classify each word according to its part of speech. The following provides an example.
```{r pos_example, echo=FALSE}
rbind(c('JJ', 'JJ', 'NNS', 'VBP', 'RB'),
c('Colorless', 'green', 'ideas', 'sleep', 'furiously.')) %>%
kable_df()
```
We have two adjectives (JJ), a plural noun (NNS), a verb (VBP), and an adverb (RB).
Common analyses might then include predicting the POS of upcoming words given the current state of the text, comparing the grammar of different texts, human-computer interaction, or translation from one language to another. In addition, POS information can make for richer sentiment analysis as well.
## POS Examples
The following approach to POS-tagging is very similar to what we did for sentiment analysis as depicted previously. We have a POS dictionary, and can use an inner join to attach the words to their POS. Unfortunately, this approach is unrealistically simplistic, as additional steps would need to be taken to ensure words are correctly classified. For example, without more information, we are unable to tell if some words are being used as nouns or verbs (human being vs. being a problematic part of speech). However, this example can serve as a starting point.
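To see the ambiguity for yourself, you can look up words in the <span class="objclass">parts_of_speech</span> dictionary that comes with <span class="pack">tidytext</span>; many words map to more than one part of speech. This is just a quick sketch for inspection.
```{r pos_ambiguity, eval=FALSE}
library(tidytext)

# words with multiple entries have more than one possible part of speech
parts_of_speech %>%
  count(word, sort = TRUE) %>%
  head()

# e.g. 'being' shows up under more than one part of speech
parts_of_speech %>%
  filter(word == 'being')
```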
### Barthelme & Carver
In the following we'll compare three texts from Donald Barthelme:
- *The Balloon*
- *The First Thing The Baby Did Wrong*
- *Some Of Us Had Been Threatening Our Friend Colby*
As another comparison, I've included Raymond Carver's *What we talk about when we talk about love*, the unedited version. First we'll load the <span class="objclass">barth</span> object of sentences created during the sentiment analysis. Then for each work we create a sentence id, unnest the data to words, join the POS data, and create counts/proportions for each POS.
```{r barthelme_pos}
load('data/barth_sentences.RData')
barthelme_pos = barth %>%
  mutate(work = str_replace(work, '.txt', '')) %>% # remove file extension
  group_by(work) %>%
  mutate(sentence_id = 1:n()) %>%                  # create a sentence id
  unnest_tokens(word, sentence, drop = FALSE) %>%  # get words
  inner_join(parts_of_speech) %>%                  # join tidytext's POS dictionary
  count(pos) %>%                                   # count tags within each work
  mutate(prop = n/sum(n))                          # proportions within each work
```
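Before plotting, a quick way to inspect the result (assuming the chunk above has run) is to peek at the most common tags within each work.
```{r pos_peek, eval=FALSE}
# most common parts of speech within each work (result is still grouped by work)
barthelme_pos %>%
  top_n(3, prop) %>%
  arrange(work, desc(prop))
```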
Next we read in and process the Carver text in the same manner.
```{r carver}
carver_pos =
  data_frame(file = dir('data/texts_raw/carver/', full.names = TRUE)) %>%
  mutate(text = map(file, read_lines)) %>%         # read each file
  transmute(work = basename(file), text) %>%
  unnest(text) %>%                                 # one row per line
  unnest_tokens(word, text, token = 'words') %>%   # get words
  inner_join(parts_of_speech) %>%                  # join POS
  count(pos) %>%                                   # count
  mutate(work = 'love',
         prop = n/sum(n))
```
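If you'd rather work with a single tidy object for comparison or plotting, the two results can be combined. This is just a sketch using the objects created above.
```{r pos_combine, eval=FALSE}
# one object covering all four works
all_pos = bind_rows(barthelme_pos, carver_pos)

all_pos %>%
  arrange(pos, work) %>%
  head()
```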
This visualization depicts the proportion of occurrence for each part of speech across the works. It would appear Barthelme is fairly consistent across his texts, and that relative to them, Carver preferred nouns and pronouns.
```{r barthelme_pos_vis, echo=FALSE, out.height='600px', fig.height=6}
carver_pos %>%
group_by(work) %>%
plot_ly() %>%
add_markers(x=~pos, y=~prop, color=I('gray50'), opacity=.5, size=I(20), name='love') %>%
add_markers(x=~pos, y=~prop, color=~work, size=I(10), data=barthelme_pos) %>%
  theme_plotly() %>%
  layout(xaxis = list(title = FALSE)) %>%
config(displayModeBar = F)
```
<br>
### More taggin'
More sophisticated POS tagging requires the context of the sentence structure. Luckily there are tools to help with that, in particular the <span class="pack">openNLP</span> package. It also requires a language model to be installed (English is only one of many available). I don't recommend installing one unless you are really interested in this (the <span class="pack">openNLPmodels.en</span> package is fairly large).
We'll reexamine the Barthelme texts with this more involved approach. First we'll need to get the English-based tagger and load the libraries.
```{r koRpus, eval=FALSE, echo=FALSE}
# POS tagging in R with koRpus; requires installation of TreeTagger or using
# its own tokenizer, e.g. as used in the textual diversity section
# activate library
library(koRpus)
# perform POS tagging
text.tagged <- treetag("data/texts_raw/carver/beginners.txt",
treetagger="manual",
lang="en",
TT.options=list(path="C:\\TreeTagger", preset="en"))
```
```{r openNLP, eval=F}
# install.packages("openNLPmodels.en", repos = "http://datacube.wu.ac.at/", type = "source")
library(NLP)
library(tm) # make sure to load this prior to openNLP
library(openNLP)
library(openNLPmodels.en)
```
Next comes the processing. This more or less follows the help file example for `?Maxent_POS_Tag_Annotator`. Given the several steps involved, I show the processing for only one text for clarity. Ideally you'd write a function and use a <span class="func">group_by</span> approach to process each of the texts of interest; a sketch of such a function follows the chunk below.
```{r baby_pos, eval=F}
load('data/barthelme_start.RData')

baby_string0 = barth0 %>%
  filter(id == 'baby.txt')

# collapse the lines into a single NLP String object
baby_string = unlist(baby_string0$text) %>%
  paste(collapse = ' ') %>%
  as.String()

# annotate sentences and words first, then POS
init_s_w = annotate(baby_string, list(Maxent_Sent_Token_Annotator(),
                                      Maxent_Word_Token_Annotator()))
pos_res = annotate(baby_string, Maxent_POS_Tag_Annotator(), init_s_w)

# keep the word annotations and extract their POS tags
word_subset = subset(pos_res, type == 'word')
tags = sapply(word_subset$features, '[[', "POS")

baby_pos = data_frame(word = baby_string[word_subset], pos = tags) %>%
  filter(!str_detect(pos, pattern = '[[:punct:]]'))  # drop punctuation 'tags'
```
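Here's a minimal sketch of such a function, assuming <span class="objclass">barth0</span> is structured as in the chunk above (an <span class="objclass">id</span> column and a text column); the helper name <span class="func">tag_text</span> is just for illustration, not from any package.
```{r tag_text_sketch, eval=FALSE}
# hypothetical helper wrapping the steps above; assumes NLP, openNLP, and the
# English model are loaded, and that `text` is a character vector or list
tag_text = function(text) {
  doc = unlist(text) %>%
    paste(collapse = ' ') %>%
    as.String()

  s_w = annotate(doc, list(Maxent_Sent_Token_Annotator(),
                           Maxent_Word_Token_Annotator()))
  pos_res = annotate(doc, Maxent_POS_Tag_Annotator(), s_w)

  words = subset(pos_res, type == 'word')

  data_frame(word = doc[words],
             pos  = sapply(words$features, '[[', 'POS')) %>%
    filter(!str_detect(pos, pattern = '[[:punct:]]'))
}

# e.g. tag each text: map(split(barth0$text, barth0$id), tag_text)
```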
```{r other_pos, eval=FALSE, echo=FALSE}
colby_string0 = barth0 %>%
filter(work=='colby.txt')
colby_string = unlist(colby_string0$text) %>%
paste(collapse=' ') %>%
as.String
init_s_w = annotate(colby_string, list(Maxent_Sent_Token_Annotator(),
Maxent_Word_Token_Annotator()))
pos_res = annotate(colby_string, Maxent_POS_Tag_Annotator(), init_s_w)
word_subset = subset(pos_res, type=='word')
tags = sapply(word_subset$features , '[[', "POS")
colby_pos = data_frame(word=colby_string[word_subset], pos=tags) %>%
filter(!str_detect(pos, pattern='[[:punct:]]')) %>%
  mutate(text = 'colby')

balloon_string0 = barth0 %>%
filter(work=='balloon.txt')
balloon_string = unlist(balloon_string0$text) %>%
paste(collapse=' ') %>%
as.String
init_s_w = annotate(balloon_string, list(Maxent_Sent_Token_Annotator(),
Maxent_Word_Token_Annotator()))
pos_res = annotate(balloon_string, Maxent_POS_Tag_Annotator(), init_s_w)
word_subset = subset(pos_res, type=='word')
tags = sapply(word_subset$features , '[[', "POS")
balloon_pos = data_frame(word=balloon_string[word_subset], pos=tags) %>%
filter(!str_detect(pos, pattern='[[:punct:]]')) %>%
  mutate(text = 'balloon')

barthelme_pos = baby_pos %>%
mutate(text='baby') %>%
bind_rows(colby_pos, balloon_pos) %>%
filter(pos != '``') %>%
data.frame # because pander/dplyr issue
save(barthelme_pos, file='data/POS_results.RData')
```
Let's take a look. I've done the other Barthelme texts as well for comparison.
```{r examine_baby_pos, echo=F}
load('data/POS_results.RData')
kable_df(barthelme_pos %>% head(15))
```
As we can see, we have quite a few more POS tags to deal with here. They come from the [Penn Treebank](https://en.wikipedia.org/wiki/Treebank). The following table notes what the acronyms stand for. I don't pretend to know all the facets of this.
<img src="img/POS-Tags.png" style="display:block; margin: 0 auto;">
Plotting the differences, we now see a little more distinction between *The Balloon* and the other two texts. It is more likely to use determiners, adjectives, and singular nouns, and less likely to use personal pronouns and verbs (including past tense verbs).
```{r barth_pos, eval=T, echo=F}
load('data/POS_results.RData')
balloon_subset = barthelme_pos %>%
group_by(text) %>%
count(pos) %>%
mutate(prop = n/sum(n)) %>%
filter(text=='balloon', pos %in% c('DT', 'JJ', 'NN', 'PRP', 'VB', 'VBD'))
barthelme_pos %>%
group_by(text) %>%
count(pos) %>%
mutate(prop = n/sum(n)) %>%
plot_ly(width=800) %>%
add_markers(x=~pos, y=~prop, color=~text) %>%
add_markers(x=~pos, y=~prop, color=~text, size=I(15),
opacity=.5, data=balloon_subset, showlegend=F) %>%
theme_plotly() %>%
layout(xaxis = list(showgrid=T,
gridcolor='#0000000D',
title=F)) %>%
config(displayModeBar = F)
```
<br>
## Tagging summary
For more information, consult the following:
- [Penn Treebank](http://repository.upenn.edu/cgi/viewcontent.cgi?article=1603&context=cis_reports)
- [Maxent function](http://maxent.sourceforge.net/about.html)
As with the sentiment analysis demos, the above should be seen as only a starting point for getting a sense of what you're dealing with. The 'maximum entropy' approach is just one way to go about things. Other models include hidden Markov models, conditional random fields, and more recently, deep learning techniques. Goals might include text prediction (i.e. the thing your phone always gets wrong), translation, and more.
## POS Exercise
This is a more involved sort of analysis, if nothing else in terms of the tools required. As an exercise, I would suggest starting with a cleaned text and seeing if the code in the last example can get you to a parsed, POS-tagged result. Otherwise, assuming you've installed the appropriate packages, feel free to play around with some strings of your choosing as follows.
```{r eval=FALSE}
string = 'Colorless green ideas sleep furiously'

initial_result = string %>%
  annotate(list(Maxent_Sent_Token_Annotator(),
                Maxent_Word_Token_Annotator())) %>%   # sentence/word annotations
  annotate(string, Maxent_POS_Tag_Annotator(), .) %>% # add POS annotations
  subset(type == 'word')                              # keep word annotations

sapply(initial_result$features, '[[', "POS") %>% table
```
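If you want the words themselves alongside their tags rather than just the counts, you can index an NLP <span class="objclass">String</span> with the word annotations, just as we did with the Barthelme text. A small sketch using the objects above:
```{r word_tag_pairs, eval=FALSE}
# pair each word with its tag; indexing a String by annotations
# returns the corresponding substrings
s = as.String(string)

data_frame(word = s[initial_result],
           pos  = sapply(initial_result$features, '[[', 'POS'))
```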