Some penalty data is missing penalty type #223
This is inherited from nflscrapR and https://github.com/mrcaseb/nflfastR/blob/master/R/helper_add_nflscrapr_mutations.R#L80
I wrote a new parser that fixes most of the problems:

```r
penalty_type = .data$desc %>%
  stringr::str_extract("(?<=PENALTY on .{1,50}, ).{1,50}(?=, [0-9]{1,2} yard)") %>%
  # For some reason Face Mask penalties include the string "(15 Yards)";
  # I thought it's ok to drop that with the next two lines.
  stringr::str_remove("\\(.+\\)") %>%
  stringr::str_squish()
```
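For comparison, here is a rough Python sketch of the same extraction logic. Python's built-in `re` module does not support the variable-length lookbehind that stringr's ICU engine allows, so the sketch uses a capture group instead; the function name and sample description are hypothetical:

```python
import re

def parse_penalty_type(desc):
    """Extract the first penalty type from a play description.

    Mirrors the stringr logic: grab the text between
    'PENALTY on <player>, ' and ', <n> yard', then drop any
    parenthetical such as '(15 Yards)' and squish whitespace.
    """
    m = re.search(r"PENALTY on .{1,50}?, (.{1,50}?), \d{1,2} yard", desc)
    if m is None:
        return None
    # Remove parenthetical yardage notes and collapse whitespace
    cleaned = re.sub(r"\(.+\)", "", m.group(1))
    return " ".join(cleaned.split())

desc = ("C.Hentrich kicks 72 yards. PENALTY on TEN-F.Wycheck, "
        "Face Mask (15 Yards), 15 yards, enforced at TEN 20.")
print(parse_penalty_type(desc))  # Face Mask
```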
@guga31bb any ideas how to handle multiple penalties in one play? There are 5 instances with 3 penalties and 139 instances with 2 penalties (I think mostly Unsportsmanlike Conduct). If we don't want to drop them, we need one or more penalty_type columns, and we have to parse the …

```r
library(dplyr, warn.conflicts = FALSE)

new_parser <- pbp_db %>%
  dplyr::filter(penalty == 1) %>%
  dplyr::select(game_id, play_id, desc, penalty_type) %>%
  dplyr::collect() %>%
  dplyr::mutate(
    new = desc %>%
      stringr::str_extract("(?<=PENALTY on .{1,50}, ).{1,50}(?=, [0-9]{1,2} yard)") %>%
      stringr::str_remove("\\(.+\\)") %>%
      stringr::str_squish(),
    multiple_pens = desc %>%
      stringr::str_extract("(?<=PENALTY on .{1,50}, ).{50,}(?=, [0-9]{1,2} yard)") %>%
      stringr::str_remove("\\(.+\\)") %>%
      stringr::str_squish()
  )
```
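If we do want to keep multiple penalties, one option is to extract every match instead of only the first. A rough Python sketch (hypothetical helper and sample data, not the package's actual approach), using `re.findall` as a stand-in for `stringr::str_extract_all`:

```python
import re

PEN_RE = re.compile(r"PENALTY on .{1,50}?, (.{1,50}?), \d{1,2} yard")

def parse_all_penalty_types(desc):
    """Return every penalty type found in a play description,
    cleaned the same way as the single-penalty parser."""
    types = []
    for raw in PEN_RE.findall(desc):
        # Drop parenthetical yardage notes and collapse whitespace
        cleaned = " ".join(re.sub(r"\(.+\)", "", raw).split())
        types.append(cleaned)
    return types

desc = ("(3:50) (Shotgun) pass incomplete. "
        "PENALTY on KC-T.Hali, Unsportsmanlike Conduct, 15 yards, enforced. "
        "PENALTY on DEN-L.Vasquez, Unnecessary Roughness, 15 yards, declined.")
print(parse_all_penalty_types(desc))
# ['Unsportsmanlike Conduct', 'Unnecessary Roughness']
```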
```r
# new parser fixes penalty_type for 3.3k plays
new_parser %>%
  dplyr::filter(is.na(penalty_type), !is.na(new))
#> # A tibble: 3,336 x 6
#> game_id play_id desc penalty_type new multiple_pens
#> <chr> <dbl> <chr> <chr> <chr> <chr>
#> 1 1999_01_~ 2142 TWO-POINT CONVERSION ~ <NA> Defensiv~ <NA>
#> 2 1999_01_~ 2243 C.Hentrich kicks 72 y~ <NA> Face Mask <NA>
#> 3 1999_01_~ 1036 (14:55) B.Johnson pas~ <NA> Face Mask <NA>
#> 4 1999_01_~ 3324 (15:00) T.Aikman pass~ <NA> Roughing~ <NA>
#> 5 1999_01_~ 775 (4:23) K.Anders left ~ <NA> Face Mask <NA>
#> 6 1999_01_~ 243 J.Elam extra point is~ <NA> Defensiv~ <NA>
#> 7 1999_01_~ 2116 (15:00) PENALTY on OA~ <NA> Offensiv~ <NA>
#> 8 1999_01_~ 2025 S.Lindsey kicks 68 ya~ <NA> Face Mask <NA>
#> 9 1999_02_~ 1207 (10:23) C.Hentrich pu~ <NA> Face Mask <NA>
#> 10 1999_02_~ 438 (8:03) L.Johnson punt~ <NA> Face Mask <NA>
#> # ... with 3,326 more rows
```
```r
# new and old parser differ for 8 plays, mostly because of multiple penalties
new_parser %>%
  dplyr::filter(penalty_type != new)
#> # A tibble: 8 x 6
#> game_id play_id desc penalty_type new multiple_pens
#> <chr> <dbl> <chr> <chr> <chr> <chr>
#> 1 2001_16_~ 2548 (1:34) (Shotgun~ Personal Foul Face~ Face Mask 9-S.McNair ~
#> 2 2004_12_~ 2042 (:46) 8-Jo.Mill~ Unsportsmanli~ Face~ Face Mask , 15 yards,~
#> 3 2004_14_~ 2156 (14:22) 4-B.Fav~ Unsportsmanli~ Face~ Face Mask , 15 yards,~
#> 4 2005_19_~ 3426 (12:34) 20-T.Jo~ ball out of b~ Face~ Face Mask
#> 5 2006_06_~ 407 3-K.Brown kicks~ Personal Foul Face~ Face Mask . PENALTY o~
#> 6 2006_07_~ 3251 (6:32) 7-B.Grad~ Unsportsmanli~ Face~ Face Mask , 15 yards,~
#> 7 2014_13_~ 335 (10:55) 2-D.But~ Unsportsmanli~ Face~ Face Mask , 15 yards,~
#> 8 2016_15_~ 3735 (5:12) 34-T.Raw~ Unsportsmanli~ Face~ Face Mask , 15 yards,~
```
```r
# ~150 plays with multiple penalties. The new parser shows the first penalty in desc
new_parser %>%
  dplyr::mutate(n = stringr::str_count(desc, "PENALTY on")) %>%
  dplyr::filter(n > 1) %>%
  dplyr::arrange(-n)
#> # A tibble: 144 x 7
#> game_id play_id desc penalty_type new multiple_pens n
#> <chr> <dbl> <chr> <chr> <chr> <chr> <int>
#> 1 2002_04~ 1602 (1:50) 88-B~ Illegal Forw~ Illega~ Illegal Forward Pa~ 3
#> 2 2007_16~ 2167 (:25) (Shot~ Illegal Use ~ Illega~ Illegal Use of Han~ 3
#> 3 2013_02~ 4367 (:48) (Shot~ Personal Foul Person~ Personal Foul, 15 ~ 3
#> 4 2013_17~ 2692 (3:50) (Sho~ Unsportsmanl~ Unspor~ Unsportsmanlike Co~ 3
#> 5 2015_09~ 3205 (12:22) 9-D~ Roughing the~ Roughi~ Roughing the Passe~ 3
#> 6 1999_02~ 3957 (4:20) (Sho~ Unnecessary ~ Unnece~ Unnecessary Roughn~ 2
#> 7 1999_06~ 1813 (:10) (Shot~ Illegal Moti~ Illega~ Illegal Motion, 5 ~ 2
#> 8 1999_08~ 2761 (10:30) B.M~ Unsportsmanl~ Unspor~ Unsportsmanlike Co~ 2
#> 9 1999_11~ 3695 (9:24) B.Fa~ Defensive Pa~ Defens~ Defensive Pass Int~ 2
#> 10 1999_11~ 3432 (9:36) M.Br~ Unnecessary ~ Unnece~ Unnecessary Roughn~ 2
#> # ... with 134 more rows
```

Created on 2021-03-16 by the reprex package (v1.0.0)
The NFL used to have two different kinds of facemask penalties. Before the 2008 season they eliminated the incidental/accidental 5-yard penalty and kept only the 15-yard variety. They wanted to eliminate judgement calls from the game, but whether a facemask is a personal foul or incidental is still a judgement call; they changed the penalty for incidental contact rather than removing the referee's judgement. I take it the NFL wanted to still be able to search for the same term when finding these penalties in their data, or considered it not worth the effort to change it, which is why you can still find "(15 yards)" to this day. Since you have data going back to 1999, I wouldn't be surprised if you could find some "5 yards" facemasks. If so, you may need to keep the "15 yard" string to distinguish it from the "5 yard" string. So, maybe you need to add back that detail.
We'll keep the yardage info for the 5 yard Face Mask penalties. So there will be …
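A minimal sketch of what keeping the yardage info could look like, assuming the raw extracted type still carries the parenthetical (hypothetical helper, not the actual nflfastR implementation): preserve the parenthetical only for Face Mask, so the pre-2008 5-yard incidental variety stays distinguishable from the 15-yard personal foul, and strip it everywhere else:

```python
import re

def facemask_label(raw_type):
    """Keep the parenthetical yardage for Face Mask penalties so the
    5-yard and 15-yard varieties stay distinguishable; strip
    parentheticals from every other penalty type."""
    m = re.fullmatch(r"Face Mask \((\d{1,2}) Yards?\)", raw_type.strip())
    if m:
        return f"Face Mask ({m.group(1)} Yards)"
    # Non-facemask penalties: drop parentheticals and squish whitespace
    return " ".join(re.sub(r"\(.+\)", "", raw_type).split())

print(facemask_label("Face Mask (15 Yards)"))  # Face Mask (15 Yards)
print(facemask_label("Face Mask (5 Yards)"))   # Face Mask (5 Yards)
print(facemask_label("Offside (5 Yards)"))     # Offside
```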
When playing around with play-by-play data from the 2019 season, looking only at penalties (penalty = 1), I noticed 127 rows were missing penalty_type. However, from reading the desc it's possible to see what the penalty is.
The plays range from two-point conversions and PATs to kickoffs, pass plays, etc.
The first ten rows are:
For example, in row 1 the penalty is "face mask", in row 2 it's "False Start" etc.
When I try the same code with 2018 data, there are 120 similar issues.
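The check described above (filter to penalty plays, then count rows where penalty_type is missing even though desc clearly names a penalty) can be sketched in plain Python over hypothetical rows:

```python
import re

# Hypothetical sample rows standing in for 2019 play-by-play data
plays = [
    {"penalty": 1, "penalty_type": None,
     "desc": "PENALTY on NE-D.Andrews, False Start, 5 yards, enforced at NE 25."},
    {"penalty": 1, "penalty_type": "Offensive Holding",
     "desc": "(12:10) PENALTY on NE-J.Thuney, Offensive Holding, 10 yards."},
    {"penalty": 0, "penalty_type": None,
     "desc": "(14:55) T.Brady pass incomplete."},
]

# Penalty plays with no penalty_type even though desc names one
missing = [
    p for p in plays
    if p["penalty"] == 1
    and p["penalty_type"] is None
    and re.search(r"PENALTY on", p["desc"])
]
print(len(missing))  # 1
```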