Some penalty data is missing penalty type #223

Closed
matthewglen opened this issue Mar 15, 2021 · 4 comments · Fixed by #224

Comments

@matthewglen

While playing around with play-by-play data from the 2019 season and looking only at penalties (penalty == 1), I noticed that 127 rows were missing penalty_type. However, the penalty can be identified by reading desc.

The affected plays range from two-point conversions and PATs to kickoffs and pass plays.

# Packages ####
library(tidyverse)
library(nflfastR)
library(reprex)

# 2019 season from the data repository
data_2019 <- readRDS(url('https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2019.rds'))
# Take just the penalty related columns
needed_2019 <- data_2019 %>%
  select(play_id, game_id, penalty_type, desc, penalty_player_name, 
         penalty_player_id, penalty) %>%
  filter(penalty == 1)

# Noticed some NA penalty type despite the desc showing it
# Filter for just NA penalty type
needed_2019 %>%
  select(play_id, penalty_type, desc, game_id, penalty, penalty_player_name, 
         penalty_player_id) %>%
  filter(is.na(penalty_type)) %>%
  view()

The first ten rows are:

[image: screenshot of the first ten rows]

For example, in row 1 the penalty is "Face Mask", in row 2 it's "False Start", and so on.

When I try the same code with 2018 data, there are 120 similar issues.

@guga31bb
Member

This is inherited from nflscrapR: penalty_type is created by parsing desc (perhaps poorly):

https://github.com/mrcaseb/nflfastR/blob/master/R/helper_add_nflscrapr_mutations.R#L80

@mrcaseb
Member

mrcaseb commented Mar 16, 2021

I wrote a new parser that fixes most of the problems.

penalty_type = .data$desc %>%
  stringr::str_extract("(?<=PENALTY on .{1,50}, ).{1,50}(?=, [0-9]{1,2} yard)") %>%
  # For some reason Face Mask penalties include the string (15 Yards)
  # I thought it's ok to drop that with the next two lines
  stringr::str_remove("\\(.+\\)") %>%
  stringr::str_squish()

The .{1,50} between the two lookaround groups limits the extracted penalty string to at most 50 characters, because an unlimited match gets us in trouble on plays with multiple penalties. I will show this below.
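To illustrate the length limit, here is a minimal sketch on a made-up play description (the players, teams, and wording are hypothetical, not taken from the data). It compares the bounded pattern with an unbounded `.+` variant, which is only an assumption about the kind of trouble meant above.

```r
library(stringr)

# Hypothetical play description with two penalties (not real data)
desc <- paste(
  "(1:50) A.Player pass incomplete.",
  "PENALTY on GB-A.Smith, Unnecessary Roughness, 15 yards, enforced at GB 20.",
  "PENALTY on CHI-B.Jones, Unsportsmanlike Conduct, 15 yards, enforced between downs."
)

# Bounded pattern from the new parser: at most 50 characters are captured,
# so the match stops at the first ", NN yard" that follows the penalty name
bounded <- str_extract(
  desc, "(?<=PENALTY on .{1,50}, ).{1,50}(?=, [0-9]{1,2} yard)"
)

# Assumed unbounded variant: the greedy .+ runs on to the last ", NN yard"
# and swallows the text of the second penalty as well
unbounded <- str_extract(
  desc, "(?<=PENALTY on .{1,50}, ).+(?=, [0-9]{1,2} yard)"
)

bounded                             # "Unnecessary Roughness"
str_detect(unbounded, "PENALTY on") # TRUE: the second penalty leaked in
```

The 50-character cap works here because the first penalty name plus its enforcement text is longer than 50 characters before the next yardage phrase appears; shorter descriptions could still slip through, so the cap is a heuristic rather than a guarantee.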

Summary: The new parser

  • fixes penalty_type for 3.3k plays in the nflfastR era
  • differs from the old penalty_type in 8 plays (7 of them because of multiple penalties)
  • uses the first penalty type in desc and drops any others when a play has more than one

@guga31bb any ideas on how to handle multiple penalties in one play? There are 5 instances with 3 penalties and 139 instances with 2 penalties (I think mostly Unsportsmanlike Conduct). If we don't want to drop them, we need one or more additional penalty_type columns, and we would have to parse the multiple_pens variable from the example below.

library(dplyr, warn.conflicts = FALSE)

# pbp_db is assumed to be a lazy table of play-by-play data (its setup is not shown here)
new_parser <- pbp_db %>%
  dplyr::filter(penalty == 1) %>%
  dplyr::select(game_id, play_id, desc, penalty_type) %>%
  dplyr::collect() %>%
  dplyr::mutate(
    new = desc %>% 
      stringr::str_extract("(?<=PENALTY on .{1,50}, ).{1,50}(?=, [0-9]{1,2} yard)") %>% 
      stringr::str_remove("\\(.+\\)") %>% 
      stringr::str_squish(),
    multiple_pens = desc %>% 
      stringr::str_extract("(?<=PENALTY on .{1,50}, ).{50,}(?=, [0-9]{1,2} yard)") %>% 
      stringr::str_remove("\\(.+\\)") %>% 
      stringr::str_squish()
  )

# new parser fixes penalty_type for 3.3k plays
new_parser %>% 
  dplyr::filter(is.na(penalty_type), !is.na(new))
#> # A tibble: 3,336 x 6
#>    game_id   play_id desc                   penalty_type new       multiple_pens
#>    <chr>       <dbl> <chr>                  <chr>        <chr>     <chr>        
#>  1 1999_01_~    2142 TWO-POINT CONVERSION ~ <NA>         Defensiv~ <NA>         
#>  2 1999_01_~    2243 C.Hentrich kicks 72 y~ <NA>         Face Mask <NA>         
#>  3 1999_01_~    1036 (14:55) B.Johnson pas~ <NA>         Face Mask <NA>         
#>  4 1999_01_~    3324 (15:00) T.Aikman pass~ <NA>         Roughing~ <NA>         
#>  5 1999_01_~     775 (4:23) K.Anders left ~ <NA>         Face Mask <NA>         
#>  6 1999_01_~     243 J.Elam extra point is~ <NA>         Defensiv~ <NA>         
#>  7 1999_01_~    2116 (15:00) PENALTY on OA~ <NA>         Offensiv~ <NA>         
#>  8 1999_01_~    2025 S.Lindsey kicks 68 ya~ <NA>         Face Mask <NA>         
#>  9 1999_02_~    1207 (10:23) C.Hentrich pu~ <NA>         Face Mask <NA>         
#> 10 1999_02_~     438 (8:03) L.Johnson punt~ <NA>         Face Mask <NA>         
#> # ... with 3,326 more rows
  
# new and old parser differ for 8 plays, mostly because of multiple penalties
new_parser %>% 
  dplyr::filter(penalty_type != new) 
#> # A tibble: 8 x 6
#>   game_id   play_id desc             penalty_type   new   multiple_pens         
#>   <chr>       <dbl> <chr>            <chr>          <chr> <chr>                 
#> 1 2001_16_~    2548 (1:34) (Shotgun~ Personal Foul  Face~ Face Mask 9-S.McNair ~
#> 2 2004_12_~    2042 (:46) 8-Jo.Mill~ Unsportsmanli~ Face~ Face Mask , 15 yards,~
#> 3 2004_14_~    2156 (14:22) 4-B.Fav~ Unsportsmanli~ Face~ Face Mask , 15 yards,~
#> 4 2005_19_~    3426 (12:34) 20-T.Jo~ ball out of b~ Face~ Face Mask             
#> 5 2006_06_~     407 3-K.Brown kicks~ Personal Foul  Face~ Face Mask . PENALTY o~
#> 6 2006_07_~    3251 (6:32) 7-B.Grad~ Unsportsmanli~ Face~ Face Mask , 15 yards,~
#> 7 2014_13_~     335 (10:55) 2-D.But~ Unsportsmanli~ Face~ Face Mask , 15 yards,~
#> 8 2016_15_~    3735 (5:12) 34-T.Raw~ Unsportsmanli~ Face~ Face Mask , 15 yards,~

# 144 plays with multiple penalties. The new parser shows the first penalty in desc
new_parser %>% 
  dplyr::mutate(n = stringr::str_count(desc, "PENALTY on")) %>% 
  dplyr::filter(n > 1) %>% 
  dplyr::arrange(-n)
#> # A tibble: 144 x 7
#>    game_id  play_id desc         penalty_type  new     multiple_pens           n
#>    <chr>      <dbl> <chr>        <chr>         <chr>   <chr>               <int>
#>  1 2002_04~    1602 (1:50) 88-B~ Illegal Forw~ Illega~ Illegal Forward Pa~     3
#>  2 2007_16~    2167 (:25) (Shot~ Illegal Use ~ Illega~ Illegal Use of Han~     3
#>  3 2013_02~    4367 (:48) (Shot~ Personal Foul Person~ Personal Foul, 15 ~     3
#>  4 2013_17~    2692 (3:50) (Sho~ Unsportsmanl~ Unspor~ Unsportsmanlike Co~     3
#>  5 2015_09~    3205 (12:22) 9-D~ Roughing the~ Roughi~ Roughing the Passe~     3
#>  6 1999_02~    3957 (4:20) (Sho~ Unnecessary ~ Unnece~ Unnecessary Roughn~     2
#>  7 1999_06~    1813 (:10) (Shot~ Illegal Moti~ Illega~ Illegal Motion, 5 ~     2
#>  8 1999_08~    2761 (10:30) B.M~ Unsportsmanl~ Unspor~ Unsportsmanlike Co~     2
#>  9 1999_11~    3695 (9:24) B.Fa~ Defensive Pa~ Defens~ Defensive Pass Int~     2
#> 10 1999_11~    3432 (9:36) M.Br~ Unnecessary ~ Unnece~ Unnecessary Roughn~     2
#> # ... with 134 more rows

Created on 2021-03-16 by the reprex package (v1.0.0)
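
On the open question of plays with several penalties, one possible approach is to collect every penalty type per play instead of only the first. The sketch below is not the parser discussed in this thread: the helper name `extract_all_penalty_types` and its capture-group pattern are hypothetical, and it assumes that neither the penalized player nor the penalty name contains a comma.

```r
library(stringr)

# Hypothetical helper: returns all penalty types found in one play description.
# Assumes each penalty reads "PENALTY on <who>, <type>, <NN> yard..."
extract_all_penalty_types <- function(desc) {
  # One row per penalty; column 2 is the player/team, column 3 the penalty type
  m <- str_match_all(desc, "PENALTY on ([^,]+), ([^,]+), [0-9]{1,2} yard")[[1]]
  # Same cleanup as the single-penalty parser: drop "(...)" and extra spaces
  str_squish(str_remove(m[, 3], "\\(.+\\)"))
}

# Made-up description with two penalties (not real data)
desc <- paste(
  "PENALTY on GB-A.Smith, Unnecessary Roughness, 15 yards, enforced at GB 20.",
  "PENALTY on CHI-B.Jones, Unsportsmanlike Conduct, 15 yards, enforced between downs."
)

all_types <- extract_all_penalty_types(desc)
all_types
```

The resulting character vector could be stored in a list column or spread into columns such as penalty_type_2 and penalty_type_3 (hypothetical names), depending on how the extra penalties should be exposed.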

@JoeMarino2021

> For some reason Face Mask penalties include the string (15 Yards)

The NFL used to have two different kinds of facemask penalties. Before the 2008 season they eliminated the incidental/accidental 5-yard penalty and kept only the 15-yard variety.

They wanted to eliminate judgement calls from the game, but it is still a judgement call whether the facemask is a personal foul or incidental. They changed only the penalty for incidental contact; they did not remove the referee's judgement.

I take it the NFL wanted to still be able to search for the same term when finding these penalties in their data, or they considered it not worth their effort to change it. That's why you can still find "(15 yards)" to this day. I wouldn't be surprised, since you have data going back to 1999, if you could find some "5 yards" facemasks.

Now that I think about it, since you have data going back to 1999, you may need to keep the "15 yard" string, to distinguish it from the "5 yard" string, if that is what the NFL did. So, maybe you need to add back that detail.

@mrcaseb
Member

mrcaseb commented Mar 22, 2021

We'll keep the yardage info for the 5-yard Face Mask penalties, so there will be "Face Mask" and "Face Mask (5 Yards)".
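
A minimal sketch of how that distinction could be kept, assuming the cleanup step removes only the "(15 Yards)" suffix rather than every parenthesized string. The function name `parse_penalty_type` and the sample descriptions are hypothetical, and this is not necessarily how the merged fix implements it.

```r
library(stringr)

# Hypothetical parser variant: keeps "(5 Yards)" but drops "(15 Yards)"
parse_penalty_type <- function(desc) {
  type <- str_extract(
    desc, "(?<=PENALTY on .{1,50}, ).{1,50}(?=, [0-9]{1,2} yard)"
  )
  # Remove only the 15-yard marker so the 5-yard variant stays distinguishable
  type <- str_remove(type, regex("\\(15 yards\\)", ignore_case = TRUE))
  str_squish(type)
}

# Made-up descriptions (not real data)
desc_15 <- "PENALTY on NE-T.Law, Face Mask (15 Yards), 15 yards, enforced at NYJ 30."
desc_5  <- "PENALTY on NE-T.Law, Face Mask (5 Yards), 5 yards, enforced at NYJ 30."

parse_penalty_type(desc_15) # "Face Mask"
parse_penalty_type(desc_5)  # "Face Mask (5 Yards)"
```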
