ds_against_traffic_accidents.Rmd

---
title: "Data science applied to the fight against traffic accidents"
author: "Anton Barrera Mora (me@antonio-barrera.cyou)"
date: "April 2023"
output:
  html_document:
    highlight: default
    number_sections: yes
    theme: cosmo
    toc: yes
    toc_depth: 2
    includes:
      in_header: p_brand.html
  word_document: default
  github_document:
    preserve_yaml: true
  pdf_document:
    highlight: zenburn
    toc: yes
editor_options: 
  markdown: 
    wrap: sentence
bibliography: references.bib
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Introduction

In this project, we intend to show how the "life cycle" applies to data mining (DM) projects.The 'project life cycle' refers to the sequential phases that a project goes through from its initiation to its completion.
It provides a structured framework for managing projects and encompasses all the activities, processes, and deliverables involved in the project's lifespan.
We are talking abour a general framework that establishes the stages or phases a project goes through from initiation to completion.
The project lifecycle provides a structure for managing the project as a whole, from defining the objectives to delivering the results.

In the other hand, while the CRISP-DM @whatis model specifically focuses on the processes and steps involved in data mining, the project lifecycle is a broader framework that can be applied to different types of projects, including data mining projects.
Then the CRISP-DM model can be considered as a specific methodology within the project lifecycle of a data mining project.
It provides detailed guidelines on the specific stages to be followed when conducting a data mining project, such as understanding the business, understanding the data, preparing the data, modeling, evaluating, and deploying.

We will limit ourselves to the first three phases:

1.  Understanding the business.

2.  Understanding the data.

3.  Data preparation.To achieve our objectives, we selected dataset sourced from the National Highway Traffic Safety Administration (NHTSA).
    [The Fatality Analysis Reporting System (FARS)](https://www.nhtsa.gov/crash-data-systems/fatality-analysis-reporting-system) was developed by the NHTSA in the United States to provide a comprehensive measure of road safety.
    This dataset specifically pertains to the year 2020 and consists of accident records that capture significant descriptive data.
    Each accident in the dataset involves at least one fatality.

## Phase 1. Business understanding

### Problem:

We are requested to analyze the causes of road accidents in the United States, whether they are of human, material, or environmental origin.
With this information, they hope to annually review the trends in this matter with the help of the model and adjust the intervention plan, whether through investment, campaigns, or training.
They are also very interested in identifying specific states and cities where the intervention in road safety should be increased or modified, as well as any aspects related to infrastructure.
Finally, we are asked to review the three-year time series to understand the evolution of road accidents.

### Objective:

This data mining project aims to explore the dataset, uncover hidden patterns, and in future if potentially possible, develop predictive models to identify factors contributing to severe accidents.
The insights gained from this analysis will provide valuable information for road safety initiatives and support evidence-based decision-making in accident prevention strategies.

The primary analytical objective of this project is to gain insights into the factors that contribute to the severity of an accident and to define what constitutes a severe accident.
By applying data mining techniques, we aim to uncover patterns and relationships within the dataset that can help us understand the key factors associated with severe accidents.

Based on the above and in summary, we will undertake the initial phases to design a data mining model that allows us to understand the following in an updated manner:

1.  The evolution of the time series of fatal accidents.

2.  Major causes of accidents.

3.  Incident volume by states and cities, including "black spots" in the road network.

### Product:

Documented data mining model tailored to extract relevant information in the defined areas of interest (time series, black spots, and causes).

### Tasks:

[Definition of the study population:]{.underline} Traffic accidents in the USA from 2018 to 2020.

[Data collection:]{.underline}

Primary data (provided by the contracting entity).\
For more information on the dataset and the FARS, please refer to the National Highway Traffic Safety Administration's Crash Data Systems website: FARS.

[Relevant variables:]{.underline}

In a holistic approach, the variables or factors in this project are established around the number of traffic accidents, human causes, material causes, and environmental causes that can influence their occurrence.

[Success criteria:]{.underline}

An informative data mining model that fulfills its function of helping to prevent accidents by anticipating material wear (emergence of new black spots in the road network of states or cities), the appearance of new drugs and areas of incidence, and serves as a primary tool for decision-making.

## Phase 2. Understanding the Data

In this phase, we will identify the necessary dataset to achieve the set objective and gain a holistic understanding of its structure.
We will also address ethical considerations in data usage.

### Objectives:

-   Selection and identification of the necessary dataset in relation to the set objectives.

-   Initial filtering of the data to eliminate redundant tables and records identified during the initial analysis.

-   Attention to data ethics aspects:

    -   Ownership of the dataset
    -   Transparency in transactions
    -   Consent
    -   Currency
    -   Privacy
    -   Openness
    -   Compliance with legal aspects of data usage, including storage and security.

### Product:

-   Descriptive document of the available data, including details about the data itself, its origin, availability, storage, privacy, security, and other legal aspects. Also includes information about variables: their nature, structure, problems, validity, and appropriateness.
-   Final datasets suitable for the next phase to achieve the project objectives.

### Tasks:

#### [Selection of the dataset:]{.underline}

The National Highway Traffic Safety Administration (NHTSA) in the United States provides a public dataset on traffic accidents called the Fatality Analysis Reporting System (FARS), which collects detailed information on fatal traffic accidents nationwide since 1975.

FARS contains information about the characteristics of vehicles involved in accidents, drivers and passengers, road and weather conditions, as well as the cause and consequences of the accidents.
The information is collected through reports from law enforcement authorities, accident investigations, and medical and autopsy records.

FARS data is publicly available in CSV format and can be downloaded from the NHTSA website.
There are also online tools and software packages that can assist in analyzing FARS data.

It should be noted that FARS only includes information on traffic accidents with fatalities, so it does not provide data on the entirety of traffic accidents in the United States.
However, it can still be a valuable source of information for research on road safety and accident prevention.

The analytical objective is to understand the main causes of accidents, black spots in the road network, and the temporal evolution of accidents.
We have selected the main dataset for accidents in 2020, called 'FARS2020NationalCSV'.
It is a version of the FARS dataset that includes information on fatal traffic accidents that occurred in the year 2020 in the USA.
It includes several tables that provide detailed information about vehicles, drivers, and passengers involved in the accidents, as well as the circumstances of the accidents.

The selected tables from the overall dataset that will substantiate the project objectives are:

[**'accident'**]{.underline}

Information about the fatal traffic accident.
Some of the variables included in this table are:

"ST_CASE": the case number assigned to the accident.

"STATE": numeric code of the state where the accident occurred.

"STATENAME": name of the state where the accident occurred corresponding to its numeric code.

"COUNTY": numeric code of the county where the accident occurred.

"COUNTYNAME": names of the counties corresponding to their numeric codes.

"CITY": numeric code of cities.

"CITYNAME": names of the cities corresponding to their numeric codes.

"FATALS": number of fatalities.

"DAY": day of the accident.

"MONTH": month in which the accident occurred.

"HOUR": the hour of the day when the accident occurred (in 24-hour format).

"MINUTE": minutes of the hour when the accident occurred.

"NHS": codes indicating if the accident occurred on a National Highway System (NHS) road.

"NHSNAME": confirmation if the accident belongs to an NHS road.

"ROUTE": code indicating the type of route where the accident occurred.

"ROUTENAME": name of the route corresponding to the numeric code.

"TWAY_ID": the identification of the roadway where the accident occurred.

"TWAY_ID2": the identification of a second roadway if necessary.

"MILEPT": the mile at which the accident occurred.

"LATITUDE": the latitude of the accident location.

"LONGITUDE": the longitude of the accident location.

"MAN_COLL": code indicating the type of collision that occurred.

"MAN_COLLNAME": description of the collision type.

"RUR_URB": codes specifying the area of occurrence.

"RUR_URBNAME": confirmation of the area where the accident occurred.

"LGT_COND": code indicating the lighting condition of the vehicle at the time of the accident.

"LG_CONDNAME": description of the lighting condition based on the previous code.

"WEATHER": code indicating the weather conditions.

"WEATHERNAME": name of the adverse weather condition.

"DRUNK_DR": number of drunk drivers.

This table provides valuable information about the characteristics of fatal traffic accidents, including the date, time, location, and number of victims.
These data can be used to identify patterns and trends in traffic accidents, which in turn can be utilized to develop more effective accident prevention strategies.
Additionally, it serves as the basis for calculating the time series and other statistics related to the total number of victims and forecasting.

[**'driverrf'**]{.underline}

Contains information about the drivers involved in the fatal traffic accident.
Some of the variables included in this table are:

"ST_CASE": the case number assigned to the accident.

"VEH_NO": the vehicle number involved in the accident.

"DRIVERRF": numeric code indicating references to driver circumstances, such as driver's license information, improper vehicle handling, and various aspects related to the driver.

"DRIVERRFNAME": description of the above codes regarding relevant driver and driving aspects at the time of the accident.

This table provides valuable information about the characteristics of drivers involved in fatal traffic accidents.
Variables related to the driver's condition and substance use are important for our analytical objectives.

[**'vehicle'**]{.underline}

Contains information about the vehicles involved in the fatal traffic accident.
Some of the variables considered for the analytical objective are:

"ST_CASE": the case number assigned to the accident.

"VEH_NO": the vehicle number involved in the accident.

"MAKE": numeric code for the vehicle's make.

"MAKENAME": make name.

"MODEL": numeric code for the vehicle's model.

"MAK_MOD": numeric code for the make and model.

"MAK_MODNAME": make and model name.

"MOD_YEAR": the vehicle's model year.

"TOW_VEH": whether the vehicle was being towed at the time of the accident.

"MCARR_ID": identification of the commercial vehicle involved in the accident.

"GVWR": gross vehicle weight rating.

"VEH_YEAR": the vehicle's manufacturing year.

This table provides valuable information about the characteristics of vehicles involved in traffic accidents with fatalities.
Variables related to the vehicle's manufacturing date and model can help identify and characterize issues with the vehicles in use.

[**'weather'**]{.underline}

In FARS, it contains information about the weather conditions at the time of the fatal traffic accident.
Some of the variables included in this table are:

"ST_CASE": the case number assigned to the accident.

"STATENAME": the state where the accident occurred.

"WEATHER": the numeric code describing the weather conditions at the time of the accident.

"WEATHERNAME": description of the weather conditions described in the previous variable at the time of the accident.

It provides information about the weather conditions at the time of the fatal traffic accident and can be used to conduct more detailed analyses of road safety and accident prevention under different weather conditions.
In our case, we are interested in understanding the multiple variables involved in accidents and their effects, but this type of information can also enable corrective measures in areas where certain weather phenomena are more frequent.

[**'distract'**]{.underline}

Contains data on the distractions that drivers were experiencing at the time of the fatal traffic accident.
Some of the variables included are:

"ST_CASE": the case number assigned to the accident.

"VEH_NO": the vehicle number involved in the accident.

"DRDISTRACT": the numeric code describing the distraction the driver was experiencing at the time of the accident.

"DRDISTRACTNAME": description of the type of distraction.

It provides information about a wide range of distractions that drivers experienced at the time of the accident.
Some of the distractions captured in the table include:

-   Talking or interacting with a passenger in the vehicle.

-   Eating or drinking while driving.

-   Using a mobile phone or electronic device.

-   Observing something outside the vehicle, such as an accident or scenery.

-   Touching or adjusting the radio or vehicle controls.

-   Looking at or reaching for something inside the vehicle.

-   Other types of distractions, such as being fatigued, being emotionally disturbed, or being distracted by something outside the vehicle.

[**'drugs'**]{.underline}

Contains information about the presence of drugs in drivers involved in fatal traffic accidents.
Some of the variables included in this table are:

"ST_CASE": the case number assigned to the accident.

"VEH_NO": the vehicle number involved in the accident.

"DRUGSPEC": numeric code for the type of drug test conducted.

"DRUGSPECNAME": descriptive specification of the type of drug test conducted.

"DRUGRES": numeric code for the test results.

"DRUGRESNAME": descriptive specification of the test results, specifying the name of the substance.

Understanding the main causes of accidents is the analytical objective, but the variables included in this table can help us understand trends in drug use and highlight areas where this occurs, enabling targeted awareness and information campaigns among the population.

[**'vision'**]{.underline}

Contains information about vision problems of drivers involved in fatal traffic accidents.
Some of the variables included in this table are:

"ST_CASE": the case number assigned to the accident.

"VEH_NO": the vehicle number involved in the accident.

"VISION": the numeric code describing the vision problem the driver had at the time of the accident.

"VISION_NAME": detailed description of the vision problem.

Understanding the variables in this table can be useful for analyzing the relationship between drivers' vision problems and fatal traffic accidents.

[**'Factor'**]{.underline}

Contains information about different factors contributing to fatal traffic accidents.
Vehicle conditions, environmental conditions, driver behavior, and other external factors.

"ST_CASE": the case number assigned to the accident.

"VEH_NO": the vehicle number involved in the accident.

"VEHICLECC": numeric code describing the factor that contributed to the accident.
There are several codes available for mechanical failures observed in the vehicle.

"VEHICLECCNAME": provides descriptive information about the mechanical problem referred to by each code.

It can help us understand the main mechanical causes of traffic accidents, relate them to vehicle models and age, and help develop prevention and control measures.

#### Exploratory Analysis

We are going to make a first approach to explore the chosen dataset and get to know its structure as the number of records, variables, type of variables, problems, determine the value of the data and other important aspects before starting to work with the dataset.

```{r Routes and data loading, echo=TRUE, message=FALSE, warning=FALSE}

# Set paths to different accident datasets
  # 2020 Dataset
path_acc20 = "./FARS2020NationalCSV/accident.CSV"
    # 2019 Dataset
path_acc19 = "./FARS2019NationalAuxiliaryCSV/ACC_AUX.CSV"
      # 2018 Dataset
path_acc18 = "./FARS2018NationalAuxiliaryCSV/ACC_AUX.CSV"

# Set paths to additional 2020 files for further analysis
path_veh = "./FARS2020NationalCSV/vehicle.CSV"
path_per = "./FARS2020NationalCSV/person.CSV"
path_driverrf = "./FARS2020NationalCSV/driverrf.CSV"
path_distract = "./FARS2020NationalCSV/Distract.CSV"
path_drimpair = "./FARS2020NationalCSV/Drimpair.CSV"
path_factor = "./FARS2020NationalCSV/Factor.CSV"
path_vision = "./FARS2020NationalCSV/Vision.CSV"
path_wea = "./FARS2020NationalCSV/weather.CSV"
   

# Load accident data for each year for time series and other analyses
# 2020 Accidents - complete data
acc20 <- read.csv(path_acc20, row.names=NULL)
# 2019 Accidents - auxiliary data
acc19 <- read.csv(path_acc19, row.names=NULL)
# 2018 Accidents - auxiliary data
acc18 <- read.csv(path_acc18, row.names=NULL)

# Load additional 2020 data
# Vehicles
veh <- read.csv(path_veh, row.names=NULL)

# Persons
per <- read.csv(path_per, row.names=NULL)

# Additional information about the driver
drivrf <- read.csv(path_driverrf, row.names=NULL)

# Information about driver impairments
impair <- read.csv(path_drimpair, row.names=NULL)

# Information about mechanical factors
fact <- read.csv(path_factor, row.names=NULL)

# Information about visibility problems
visio <- read.csv(path_vision, row.names=NULL)

# Information about driver distractions
distract <- read.csv(path_distract, row.names=NULL)

# Information about weather conditions
wea <- read.csv(path_wea, row.names=NULL)

```

Once the dataset is loaded, we proceed to the exploration of the dataset.

##### 1. Verifying the structure of the accident set from 2018 to 2020 inclusive:

ACCIDENTS 2020

```{r exp_2020_accident, echo=TRUE, message=TRUE, warning=FALSE}
# Main file acc20

# Check the first few rows of data for each variable
head(acc20)

# Analyze numeric variables
summary(acc20)

# Get the names of variables in the accident table
names(acc20)

# Perform a general analysis of the structure
summario20 <- as.data.frame(str(acc20, give.attr = FALSE, strict.width = "cut"))  
knitr::kable(summario20, caption = "Summary of the acc20 table")

```

We can observe that the 2020 accident table has 3576 records or instances of accidents with 81 variables.
At first glance, we can already see that some variables are not necessary for the project as they provide unnecessary data.
All numeric variables are of integral type, there are no date records, and the remaining variables can be categorized as text strings.

**ACCIDENTS 2019**

```{r exp_acc_2019, echo=TRUE, message=TRUE, warning=FALSE}
# Auxiliary file acc19

# Let's check the first rows of data for each variable
head(acc19)

# An analysis of numeric variables
summary(acc19)

# The names of the variables in the accident table
names(acc19)

# and perform a general analysis of the structure
summario19 <- as.data.frame(str(acc19, give.attr = FALSE, strict.width = "cut"))  
knitr::kable(summario19, caption = "Summary of the acc20 table")

```

We can observe that in this case, the data in the 2019 'Auxiliary' file are of integral type, and we have 33,487 instances or accidents with 42 variables.
The file lacks descriptions in text strings, but the variable names and primary key match the previous one, so we can unify them using the table as a child table and using the primary key as a foreign key.

ACCIDENTS 2018

```{r exploracion accident 2018, echo=TRUE, message=TRUE, warning=FALSE}

# Auxiliary file acc18

# Let's check the first rows of data for each variable
head(acc18)

# An analysis of numeric variables
summary(acc18)

# The names of the variables in the accidents table
names(acc18)

# And let's perform a general analysis of the structure
summary18 <- as.data.frame(str(acc18, give.attr = FALSE, strict.width = "cut"))
knitr::kable(summary18, caption = "Summary of the acc18 table")

```

We observe that the accidents table for 2018 contains 33,919 instances and the same 42 variables as in the previous case, with analogous structures.

Furthermore, the primary/foreign key is the case number:

ST_CASE - Accident identifier

We will study the following aspects of the data:

TEMPORAL TREND STUDY

FATAL - Annual fatalities (2018 to 2020)

CAUSE STUDY

DRUNK_DR - Number of drunk drivers

DAY_WEEKNAME - Interest in weekends

HOUR - Hour

HOURNAME - Time slot

MINUTE - Minute

DRDISTRAC - Code for driver distractions

DRDISTRACNAME - Distraction specification

MOD_YEAR - Year of vehicle model

L_TYPE - License type, but we are interested only in those without a license

WEATHER - Code for weather conditions at the time

WEATHERNAME - Weather condition specification

DRUNK_DR - Number of positive alcohol tests for drivers involved

DRIVERRF - Type of driving infractions or aspects.
This record comes from a table that allows us to select the involved or responsible drivers

AGE - Age, filtering the age of officially responsible drivers

DRIVERRFNAME - Textual specification of the responsible driver

AIR_BAG - Code to specify airbag behavior

AIR_BAGNAME - Textual specification of airbag behavior

DRUGS - Codes for the presence of drugs

DRUGSNAME - Textual specification

VISION - Codes for visibility elements (mirrors, windows, etc.)

VISIONNAME - Specification of the type of anomaly in vision-related elements

VEHICLECC - Code for factors that may contribute to the accident

VEHICLECCNAME - Specification of the factors

DRIMPAIR - Codes for detected physical impairments

DRIMPAIRNAME - Specification of detected psychophysical impairment aspects

LOCATION OF BLACK SPOTS

COUNTY - County code

COUNTYNAME - County name

CITY - City code

CITYNAME - City name

ROUTE - Route code or road type

ROUTE - Specification of the road type

RUR_URB - Code 1-2 to specify if it is rural or urban, respectively

MILEPT - Mile point number

LATITUDE

LONGITUDE

##### 2. Feature processing and management I:

[**Cleaning**]{.underline}

The next step is to ensure that there are no empty or null values.

```{r cleaning, echo=TRUE, message=TRUE, warning=FALSE}


print('NA')

# Checking the accident files
print ('Accidents 2020')
colSums(is.na(acc20))

print('Accidents 2019')
colSums(is.na(acc19))

print('Accidents 2018')
colSums(is.na(acc18))

# Checking the auxiliary files
print('driverrf')
colSums(is.na(drivrf))

print('Factor')
colSums(is.na(fact))

print('impair')
colSums(is.na(impair))

print('Person')
colSums(is.na(per))

print('Vehicles')
colSums(is.na(veh))

print('vision')
colSums(is.na(visio))

print('Blanks')

# Checking the accident files
print ('Accidents 2020')
colSums(acc20=="") 

print('Accidents 2019')
colSums(acc19=="")

print('Accidents 2018')
colSums(acc18=="")

# Checking the auxiliary files
print('driverrf')
colSums(drivrf=="")

print('Factor')
colSums(fact=="")

print('impair')
colSums(impair=="")

print('Person')
colSums(per=="")

print('Vehicles')
colSums(veh=="")

print('vision')
colSums(visio=="")


```

Certain variables in the 'per' (person) table contain blank values, such as MAKENAME or MAK_MOD, which refer to the vehicle models and other specifications in the 'veh' (vehicles) table.
However, these variables are not relevant for our project's objective, so we will discard those records.

At this point, working with the large number of tables and variables becomes challenging.
Therefore, we will need to create the tables on which we will work.
By doing so, we will implicitly remove the variables that are not necessary for the project's objective.

In this initial feature engineering phase (second phase), we will select variables based on the descriptive work conducted earlier.
In the next phase, we will delve deeper into feature engineering based on the specific requirements of each objective.

TIME SERIES ANALYSIS 2018-2020

We need to create a table that includes the number of victims (FATALS), the year (YEAR), and the accident number or case code (ST_CASE), which serves as the key.

```{r install_libraries, echo=TRUE, message=FALSE, warning=FALSE}

#Install various packages as needed
if(!require('ggplot2')) install.packages('ggplot2'); library('ggplot2')
if(!require('Rmisc')) install.packages('Rmisc'); library('Rmisc')
if(!require('dplyr')) install.packages('dplyr'); library('dplyr')
if(!require('xfun')) install.packages('xfun'); library('xfun')
if(!require('magrittr')) install.packages('magrittr'); library('magrittr')
if (!require('factoextra')) install.packages('factoextra'); library('factoextra')
if(!require('pracma')) install.packages('pracma'); library('pracma')

```

```{r creando tabla analisis18_20, echo=TRUE, message=TRUE, warning=FALSE}


print('Creating table for the 2018-2020 FATALS time series')

# Select the necessary columns
accidentData18_20 <- rbind(
  acc18 %>% select(ST_CASE, YEAR, FATALS),
  acc19 %>% select(ST_CASE, YEAR, FATALS),
  acc20 %>% select(ST_CASE, YEAR, FATALS)
)

# Sort the table by year
analysis18_20 <- accidentData18_20 %>% arrange(YEAR, ST_CASE)

# Show the content of the new table
print('New analysis18_20 table')

# Remove the accidentData18_20 table
rm(accidentData18_20)

# Analyze the headers of the new table
head(analysis18_20)


```

FACTORS PRESENT IN ACCIDENTS (2020)

Previously, we have described the variables to consider or study in order to understand their influence on the outcome.
Now, we will create a new table that only includes the variables of interest, discarding others.

```{r creando tabla de causas y factores, echo=TRUE, message=TRUE, warning=FALSE}


print('Creating tables with factors related to accidents and excluding the rest')


# Selecting the necessary columns from each table
# Unique primary key records
acc20_select <- acc20 %>% select(ST_CASE, DAY_WEEKNAME, HOUR, MINUTE, LGT_COND, LGT_CONDNAME)
wea20_select <- wea %>% select(ST_CASE, WEATHER, WEATHERNAME)

# Merging both tables using the merge function
accidentWeather20 <- merge(acc20_select, wea20_select, by = "ST_CASE")


# Composite primary key records
veh_select <- veh %>% select(ST_CASE, VEH_NO, MOD_YEAR)
per_select <- per %>% select(ST_CASE, VEH_NO, AGE, AIR_BAG, AIR_BAGNAME, DRINKING, DRINKINGNAME, DRUGS, DRUGSNAME)
drivrf_select <- drivrf %>% select(ST_CASE, VEH_NO, DRIVERRF, DRIVERRFNAME)
fact_select <- fact %>% select(ST_CASE, VEH_NO, VEHICLECC, VEHICLECCNAME)
impair_select <- impair %>% select(ST_CASE, VEH_NO, DRIMPAIR, DRIMPAIRNAME)
visio_select <- visio %>% select(ST_CASE, VEH_NO, VISION, VISIONNAME)
distract_select <- distract %>% select(ST_CASE,VEH_NO, DRDISTRACT, DRDISTRACTNAME)

# Combining each table using the merge function
accidentFacts20 <- merge(veh_select, per_select, by = c("ST_CASE", "VEH_NO"), all = TRUE)
accidentFacts20 <- merge(accidentFacts20, drivrf_select, by = c("ST_CASE", "VEH_NO"), all = TRUE)
accidentFacts20 <- merge(accidentFacts20, fact_select, by = c("ST_CASE", "VEH_NO"), all = TRUE)
accidentFacts20 <- merge(accidentFacts20, impair_select, by = c("ST_CASE", "VEH_NO"), all = TRUE)
accidentFacts20 <- merge(accidentFacts20, visio_select, by = c("ST_CASE", "VEH_NO"), all = TRUE)
accidentFacts20 <- merge(accidentFacts20, distract_select, by = c("ST_CASE", "VEH_NO"), all = TRUE)

# Removing datasets that have been merged
rm(acc20_select)
rm(distract_select)
rm(drivrf_select)
rm(fact_select)
rm(impair_select)
rm(per_select)
rm(veh_select)
rm(visio_select)
rm(wea20_select)

# Analyzing the result using the head() function
print("Accident Facts Table")
head(accidentFacts20)


print("Accident Weather Table")
head(accidentWeather20)

```

We have now two tables that collect different variables that may be related to the occurrence of accidents.
We decided to unify them into a single table:

```{r Creando la tabla final con factores de accidentes, echo=TRUE, message=TRUE, warning=FALSE}
print("Merging tables with different keys")
accFacts20 <- merge(accidentFacts20, accidentWeather20, by = "ST_CASE", all = TRUE)


# Removing the tables that have been merged
rm(accidentFacts20)
rm(accidentWeather20)

# Viewing the result using the head() function
print("Applying head() and names() to the new table accFacts20")
head(accFacts20)
names(accFacts20)

```

ROAD BLACK SPOTS

In line with the two previous tables that collect the different aspects we intend to analyze, we proceed to create a third table with geographic data to later determine where the road black spots are located.
We start with the base of the 2020 accident table, which contains all the information we need:

```{r black_points, echo=TRUE, message=TRUE, warning=FALSE}

print("Creating the table with the location of accidents")
accBpoint20 <- acc20 %>% select(ST_CASE, STATE, STATENAME, COUNTY, COUNTYNAME, CITY, CITYNAME, ROUTE, ROUTENAME, RUR_URB, RUR_URBNAME, MILEPT, LATITUDE, LONGITUD)


# Analyzing the content with head() and the variables with names()
print("Table head")
head(accBpoint20)


print("Variable Names")
names(accBpoint20)

```

At this point, we have achieved the objectives for this phase, as we have: 1.
Description of the dataset and the variables represented in it.
2.
Clean dataset with an initial phase of feature management.

Finally, we proceed to clean the workspace before moving on to the third phase.

```{r limpieza del entorno de trabajo previo a Fase 3, echo=TRUE, message=TRUE, warning=FALSE}


# Removing unnecessary elements except for the 3 tables we will work with
rm(acc18, acc19, acc20, distract, drivrf, fact, impair, per, veh, visio, wea)

# Removing 'paths' and 'summaries'
rm(list = ls(pattern = "^pat"))
rm(list = ls(pattern = "^summa"))


```

## Phase 3. Data Preparation

### Objectives:

In this phase, we will continue with feature engineering, which involves selecting and transforming variables or features of the data to improve the performance of the machine learning model.
In this case, we will adapt the different tables and variables to the project's needs.

### Deliverable:

Obtain a dataset from 3 tables that contain the relevant variables for the 3 aspects of the project:

1.  Analyze the evolution of accidents in the 2018 to 2020 series.
2.  Identify blackspots in the road network.
3.  Explore factors that may influence the occurrence of accidents.

These tables will have undergone data transformation and dimensionality reduction methods.

### Tasks:

##### 1. Feature processing and management II

TABLE FOR THE 2018-2020 TIME SERIES

The table for the analysis of the time series, 'analisis18_20', contains only 3 variables as we saw earlier.

```{r fase 3 table_analysis, echo=TRUE, message=TRUE, warning=FALSE}

# Table analysis
head(analysis18_20)

names(analysis18_20)

```

Feature processing (or engineering) is typically applied when there is a large number of variables and the goal is to reduce their dimensionality by eliminating irrelevant or highly correlated variables.
However, in this case, there are only four variables in the table and all of them appear to be relevant for the analysis.
Therefore, we will not perform any feature engineering on this table.

TABLE OF FACTORS IN ACCIDENTS

The resulting table from Phase 2, 'accFacts20', contains, as we will see below, 96966 records and 27 variables.

```{r fase 3: Table_factors, echo=TRUE, message=TRUE, warning=FALSE}

#Table factors
head(accFacts20)
names(accFacts20)
```

The large number of variables makes it challenging to work with the data.
At this point, we will perform a Principal Component Analysis (PCA).
The goal is to find linear combinations of the original variables that explain the most variability in the data.
By doing so, we can reduce the number of variables to those that truly contribute relevant information:

```{r previos_PCA_1, echo=TRUE, message=TRUE, warning=FALSE}

# To perform a PCA we need to work only with the numerical variables.
print("Filtering variables to exclude numerical variables")
accFacts20_num <- accFacts20[, !grepl("NAME", names(accFacts20))]


# Looking for missing values or NA
sum(is.na(accFacts20_num))

```

We observe that there are indeed missing values in the data.
Additionally, upon visual inspection and as a result of merging different tables, we can see values that are completely out of range (9999, 99, 0.99), which require further handling.

```{r previos_PCA_2, echo=TRUE, message=TRUE, warning=FALSE}

library("pracma")
print("Revisamos cada variable para conocer como se distribuyen los valores ausentes")


# establecemos un limite razonable de impresiones para no saturar el documento
options(max.print=20)


print("MOD_YEAR")
which(is.na(accFacts20_num$MOD_YEAR))


print("AGE")
which(is.na(accFacts20_num$AGE))


print("DRINKING")
which(is.na(accFacts20_num$DRINKING))


print("DRUGS")
which(is.na(accFacts20_num$DRUGS))


print("DRIVERRF")
which(is.na(accFacts20_num$DRIVERRF))


print("HOUR")
which(is.na(accFacts20_num$MINUTE))


print("MINUTE")
which(is.na(accFacts20_num$MINUTE))


print("WEATHER")
which(is.na(accFacts20_num$WEATHER))

```

In general, when using the 'which' function without limits, we observe that there are approximately 8579 NA values.
In the case of the vehicle's age, it is clear that not all records capture this information, as there are accidents involving non-motorized vehicles.
Therefore, we know that accidents with 'VEH_NO' = 0 correspond to accidents involving non-motorized vehicles.
Additionally, we know that among the vehicles involved, 'VEH_NO' represents the vehicle considered to have caused the accident.
Hence, during the statistical analysis in the upcoming phases, we need to take this into account.

We also observe that there are a large number of discrete variables, especially related to technical aspects of the vehicle at the time of the accident, driver-related details, substance consumption, etc., which need to be transformed into continuous variables.

Another aspect to consider is the presence of attributes that have taken on values like "type 9999" due to the merging of tables.
We will address each case individually.

```{r previos_PCA_3, echo=TRUE, message=TRUE, warning=FALSE}

print("Filtramos valores = 0 en VEH_NO que se referen a otros involucrados o vehiculos sin motor ")
accFacts20_num_filtrado <- accFacts20_num[accFacts20_num$VEH_NO != 0, ]

print("imputamos a los valores 'MOD_YEAR'= NA el valor del vehiculo principal en el mismo caso")
for (i in unique(accFacts20_num_filtrado$ST_CASE)) {
  vehiculo_principal_mod_year <- na.omit(accFacts20_num_filtrado$MOD_YEAR[accFacts20_num_filtrado$ST_CASE == i & accFacts20_num_filtrado$VEH_NO == 1])
  if (length(vehiculo_principal_mod_year) > 0) {
    accFacts20_num_filtrado$MOD_YEAR[accFacts20_num_filtrado$ST_CASE == i & is.na(accFacts20_num_filtrado$MOD_YEAR)] <- vehiculo_principal_mod_year[1]
  }
}

print("Comprobamos los valores NA en 'MOD_YEAR")
print("MOD_YEAR")
which(is.na(accFacts20_num$MOD_YEAR))

```

We observed different problems, such as "9999" values in the dates of the models and others in the variable AGE.

```{r previos_pca_4, echo=TRUE, message=TRUE, warning=FALSE}

print("Eliminamos valores imposibles en MOD_YEAR y AGE")
accFacts20_num_filtrado <- subset(accFacts20_num_filtrado, !(MOD_YEAR > 2020 & AGE > 90))
print("eliminamos edades imposibles o casi, para conducir")
accFacts20_num_filtrado <- subset(accFacts20_num_filtrado, !(AGE > 90))
```

We realise that for the purposes of the analysis it is of little relevance to analyse those involved other than the main vehicle, so we rectify to correct and obtain a table containing the causal cars and drivers, including all the factors we want to study in the following phases:

```{r previos_pca_5, echo=TRUE, message=TRUE, warning=FALSE}


print("Estableciendo el numero de caso = VEH_NO=1")

accFacts20_num_filtrado_veh1 <- accFacts20_num_filtrado %>%
  filter(VEH_NO == 1) %>%
  distinct(ST_CASE, .keep_all = TRUE)

print("cambiamos el nombre a la tabla y limpiamos el entorno de trabajo")
accFacts20v2 <- accFacts20_num_filtrado_veh1
rm(accFacts20_num, accFacts20_num_filtrado, accFacts20_num_filtrado_veh1)

print("Eliminamos la variable AIR_BAG porque cometimos un error seleccionandola y no tiene utilidad para el objetivo del proyecto")
accFacts20v2 <- accFacts20v2 %>%
  select(-AIR_BAG)
```

```{r previos_pca_6, echo=TRUE, message=TRUE, warning=FALSE}

#Tenemos que convertir en binaria la variable "Drinking"
print("Creamos una nueva variable binaria 0/1 (1= para conductor bebido (probado) y 0= no bebido o no probado)")
# Codificar la variable DRINKING como binaria (0/1)
accFacts20v2$DRINKING <- ifelse(accFacts20v2$DRINKING == 1, 1, 0)
accFacts20v2$DRINKING <- ifelse(accFacts20v2$DRINKING %in% c(8, 9), 0, accFacts20v2$DRINKING)

print('En la nueva variable, hemos respetado la presuncion de inocencia, es decir, los valores 8 y 9 que eran respectivamente "no informado" o "informado desconocido" los hemos codificado como 0')
```

We have coded the variable 'Drinking' as binary: YES or No alcohol 1/0.
We repeat the process with the variable 'Drugs':

```{r previos_pca_6.2, echo=TRUE, message=TRUE, warning=FALSE}

#Tenemos que convertir en binaria la variable "Drugs"
print("Creamos una nueva variable binaria 0/1 (1= para conductor drogado (probado) y 0= no drogado o no probado)")
# Codificar la variable DRINKING como binaria (0/1)
accFacts20v2$DRUGS <- ifelse(accFacts20v2$DRUGS == 1, 1, 0)
accFacts20v2$DRUGS <- ifelse(accFacts20v2$DRUGS %in% c(8, 9), 0, accFacts20v2$DRUGS)

print('En la nueva variable, hemos respetado la presuncion de inocencia, es decir, los valores 8 y 9 que eran respectivamente "no informado" o "informado desconocido" los hemos codificado como 0')
```

On the other hand, we initially selected the variable 'DRIVERRF' because it contains valuable information.
It is important to exclude accidents involving fire or police vehicles, as well as accidents where vehicles are being towed.
These correspond to the codes 97, 96, 95, 94, 86, and 16.
Afterward, we will remove the 'DRIVERRF' variable as it is no longer needed for our analysis.

```{r previo 7 pca, echo=TRUE, message=TRUE, warning=FALSE}

print("creamos un vector con los valores que pretendemos excluir de la variable DRIVERRF")
valores_no_deseados <- c(97, 96, 95, 94, 86, 16)
accFacts20v2_filtrado <- accFacts20v2[!accFacts20v2$DRIVERRF %in% valores_no_deseados, ]


print("Eliminamos la variable DRIVERRF")
accFacts20v2 <- accFacts20v2_filtrado %>%
  select(-DRIVERRF)
rm(accFacts20v2_filtrado)
```

The variable VEHICLECC refers to mechanical problems in the vehicle and specifies the type, we convert it to binary, 0= no problems and 1= mechanical problems.

```{r previo 8 pca, echo=TRUE, message=TRUE, warning=FALSE}

#Tenemos que convertir en binaria la variable "VEHICLECC"
print("Creamos una nueva variable binaria 0/1 (1= para problemas mecanicos y 0= no problemas mecanicos)")
# Codificar la variable DRINKING como binaria (0/1)
accFacts20v2$VEHICLECC <- ifelse(accFacts20v2$VEHICLECC == 1, 1, 0)
accFacts20v2$VEHICLECC <- ifelse(accFacts20v2$VEHICLECC %in% c(2,3,4,5,6,7,8,9,10,12,13,14,15,16,17,97), 1, accFacts20v2$VEHICLECC)
accFacts20v2$VEHICLECC <- ifelse(accFacts20v2$VEHICLECC %in% c(98, 99), 0, accFacts20v2$VEHICLECC)


print('En la nueva variable o recodificacion, hemos respetado la "presuncion de inocencia mecanica", es decir, los valores codificados como 99 "desconocido" los hemos codificado como 0')
```

The variable 'DRIMPAIR' is very interesting as it contains codes to describe different situations where the drivers' psychophysical abilities are affected.
These situations may include blindness, deafness, physical injuries, and more.
We will exclude the effects of alcohol and drugs by setting their values to 0, as we have encoded them in other specific variables.
The 'DRIMPAIR' variable will encompass all the psychological and physical aspects that may have influenced the accident.

```{r previos_pca_9, echo=TRUE, message=TRUE, warning=FALSE}
#Tenemos que convertir en binaria la variable "DRIMPAIR"
print("Creamos una nueva variable binaria 0/1 (1= para problemas fisicos y psicologicos (no por consumo de sustancias ilegales) y 0= no presentes)")
# Codificar la variable DRIMPAIR como binaria (0/1)
accFacts20v2$DRIMPAIR <- ifelse(accFacts20v2$DRIMPAIR == 1, 1, 0)
accFacts20v2$DRIMPAIR <- ifelse(accFacts20v2$DRIMPAIR %in% c(4,5,6,7,8,10,96), 1, accFacts20v2$DRIMPAIR)
accFacts20v2$VEHICLECC <- ifelse(accFacts20v2$VEHICLECC %in% c(98, 99, 9, 95), 0, accFacts20v2$VEHICLECC)


print('En la nueva variable o recodificacion, los valores codificados como 99 "desconocido" los hemos codificado como 0')
print("La nueva variable es por tanto representantiva de aquellos problemas fisicos o psicologicos no directamente relacionados con el consumo de alcohol o drogas de abuso")

```

```{r previo 10 pca, echo=TRUE, message=TRUE, warning=FALSE}

print("una simple revision visual nos permite ver que aun tenemos algunos valores NA o infinitos, por lo que volvemos a eliminarlos pero teniendo en cuenta que esten presentes en otras variables diferentes de 'AGE' y 'MOD_YEAR'")

print("Eliminamos valores NA comunes a VEHICLECC y DRIMPAIR")
accFacts20v2 <- subset(accFacts20v2, !is.na(MOD_YEAR) & !is.na(AGE))

```

We will now proceed with the variable 'VISION', which refers to difficulties in visibility such as fog, smoke, or defective reflective elements, among others.
We will convert this variable into a binary variable, indicating whether or not visibility problems were present at the time of the accident.
Visibility problems may include issues related to the design of the road (structural).

```{r previos_pca_11, echo=TRUE, message=TRUE, warning=FALSE}

#Tenemos que convertir en binaria la variable "VISION"
print("Creamos una nueva variable binaria 0/1 (1= para problemas de visibildad y 0= no presentes)")
# Codificar la variable DRIMPAIR como binaria (0/1)
accFacts20v2$VISION <- ifelse(accFacts20v2$VISION == 1, 1, 0)
accFacts20v2$VISION <- ifelse(accFacts20v2$VISION %in% c(2,3,4,5,6,7,8,9,10,11,12,13,14,97,98), 1, accFacts20v2$VISION)
accFacts20v2$VISION <- ifelse(accFacts20v2$VISION %in% c(95, 99), 0, accFacts20v2$VISION)


print('En la nueva variable o recodificacion, los valores codificados como 99 "desconocido" los hemos codificado como 0')
print("La nueva variable es por tanto representantiva de aquellos problemas fisicos o psicologicos no directamente relacionados con el consumo de alcohol o drogas de abuso")
```

We need to express the years of manufacture of the vehicles in years of age.
So let's create a new variable OLD and delete MOD_YEAR

```{r previos_pca_12, echo=TRUE, message=TRUE, warning=FALSE}
print("vamos a crear una variable nueva desde el punto de partida de MOD_YEAR")

# Obtener el año actual
anyo_actual <- as.numeric(format(Sys.Date(), "%Y"))


# Calcular la antigüedad en años
accFacts20v2$ANTIGUEDAD <- anyo_actual - accFacts20v2$MOD_YEAR


# Convertir la antigüedad a valor absoluto como precaución, aunque no espero valores negativos
accFacts20v2$ANTIGUEDAD <- abs(accFacts20v2$ANTIGUEDAD)


print("Una vez creada la variable 'ANTIGUEDAD', nos deshacemos de la variable 'MOD_YEAR'")
accFacts20v2 <- accFacts20v2 %>%
  select(-MOD_YEAR)
```

We consider making the new variable binary:

```{r previos_pca_13, echo=TRUE, message=TRUE, warning=FALSE}
print("consideraremos un coche antiguo cuando tenga mas de 10 años, por lo que convertiremos en binaria la variable ANTIGUEDAD, 10 o más años=1 y menos=0")

# Creamos una nueva variable binaria que indique si el vehículo tiene 10 años o más
accFacts20v2$OLD <- ifelse(accFacts20v2$ANTIGUEDAD >= 10, 1, 0)
```

```{r previos_pca_14,echo=TRUE, message=TRUE, warning=FALSE}

print("Una vez creada la variable 'OLD', nos deshacemos de la variable 'ANTIGUEDAD'")
accFacts20v2 <- accFacts20v2 %>%
  select(-ANTIGUEDAD)
```

Another factor to consider is distractions.
In a world where technology makes us increasingly hyperconnected, these devices have become a double-edged sword.
Additionally, apart from technology-related distractions, other "classic" distractions have always been present.
We want to evaluate the impact of distractions on accidents, so we will convert the variable 'DRDISTRAC' into another binary variable, considering 0 for no distractions or unspecified distractions, and 1 for reported distractions.

```{r previos_pca_15,echo=TRUE, message=TRUE, warning=FALSE}

print("Modificamos la variable a binaria 0/1 (1= para problemas de distracciones al volante y 0= no presentes)")
# Codificamos la variable DRDISTRAC como binaria (0/1)
accFacts20v2$DRDISTRACT <- ifelse(accFacts20v2$DRDISTRACT == 99, 0, 1)
```

The variable 'HOUR' is relevant as it signifies a possible cause that can influence the psychophysical conditions of those involved in accidents.
We will consider the hours between 6:00 PM (18:00) and 6:00 AM (6:00) as nighttime.
We will create a new variable called 'NIGHT_HOUR' to represent this.

```{r previo 16 pca, echo=TRUE, message=TRUE, warning=FALSE}

# Codificar la variable HOUR como binaria (0/1)
print("creamos la nueva variable binaria NIGHT_HOUR")
accFacts20v2$NIGHT_HOUR <- ifelse(accFacts20v2$HOUR >= 18 | accFacts20v2$HOUR < 6, 1, 0)

#Eliminamos la variable 'HOUR' y 'MINUTE'
accFacts20v2 <- accFacts20v2 %>%
  select(-HOUR)
accFacts20v2 <- accFacts20v2 %>%
  select(-MINUTE)

```

The variable 'LGT_COND' refers to the road conditions, but we will discard it because we can study the nighttime factor using the 'NIGHT_HOUR' variable.
Additionally, we will address the 'AGE' column, which pertains to the age of the primary individuals involved, in intervals.

```{r previo_pca_17, echo=TRUE, message=TRUE, warning=FALSE}

#Eliminamos la variable 'LGT_COND'
print("eliminamos la variable 'LGT_COND'")
accFacts20v2 <- accFacts20v2 %>%
  select(-LGT_COND)

print("Dividimos en intervalos la variable 'AGE'")
# Dividir la columna 'AGE' en intervalos usando la función cut()
accFacts20v2 <- accFacts20v2 %>%
  mutate(AGE_GRUP = cut(AGE, breaks = c(-Inf, 16, 44, 72, 100, Inf), labels = c("<=16", "17-44", "45-72", "73-100", ">100"), right = FALSE))
```

As a precaution before removing the 'AGE' variable, we will create another binary variable called 'AGE_BIN' to categorize individuals as young (0) or senior (1).
We will consider individuals with an age less than or equal to 25 as young, and those with an age greater than 25 as senior:

```{r previo_pca_18, echo=TRUE, message=TRUE, warning=FALSE}
# Crear la variable binaria 'AGE_BIN'
print("creamos la variable 'AGE_BIN'")
accFacts20v2 <- accFacts20v2 %>%
  mutate(AGE_BIN = ifelse(AGE <= 25, 0, 1))

```

Finally, we will address the 'WEATHER' variable, which we will also convert into a binary variable called 'WEA_BIN'.
We will assign a value of 0 for clear weather conditions and a value of 1 for inclement weather conditions.

```{r previos_pca_19, echo=TRUE, message=TRUE, warning=FALSE}

print("Creamos la variable convirtiendo los registros segun se naturaleza 0= despejado y 1= No despejado")
# Crear la variable binaria 'WEA_BIN'
accFacts20v2 <- accFacts20v2 %>%
  mutate(WEA_BIN = ifelse(WEATHER %in% c(1, 98, 99), 0, 1))
```

We eliminate the variables no longer needed:

```{r previos_pca_20, echo=TRUE, message=TRUE, warning=FALSE}

#Eliminamos la variable 'AGE'
print("eliminamos la variable 'AGE'")
accFacts20v2 <- accFacts20v2 %>%
  select(-AGE)

print("eliminamos la variable 'VEH_NO'")
accFacts20v2 <- accFacts20v2 %>%
  select(-VEH_NO)

print("eliminamos la variable 'WEATHER'")
accFacts20v2 <- accFacts20v2 %>%
  select(-WEATHER)

#copia de seguridad
accFacts20vBACK <- accFacts20v2

#retomamos la tabla principal
accFacts20 <- accFacts20v2

```

We continue with the PCA once all the variables of interest have been reviewed, cleaned and transformed:

```{r PCA, echo=TRUE, message=TRUE, warning=FALSE}

# Cargar librería para PCA
library(stats)
options(max.print = 1000)

# Creamos una nueva tabla excluyendo las variables 'ST_CASE' y 'AGE_GRUP'
accFacts_pca <- accFacts20[, !(colnames(accFacts20) %in% c("ST_CASE", "AGE_GRUP"))]

# Realizar el PCA
pca_result <- prcomp(accFacts_pca, scale. = TRUE)

# Obtener los resultados del PCA
pca_variances <- pca_result$sdev^2
pca_proportions <- pca_variances / sum(pca_variances)
pca_loadings <- pca_result$rotation

# Imprimir los resultados
cat("Varianzas explicadas por cada componente principal:\n")
cat(pca_variances, "\n")
cat("\nProporciones de varianzas explicadas por cada componente principal:\n")
cat(pca_proportions, "\n")
cat("\nLoadings de cada variable en cada componente principal:\n")
print(pca_loadings)


```

In the principal component analysis conducted in our study, it was found that the first three principal components explain a significant portion of the total variance, with the first principal component explaining approximately 14% of the variance, the second principal component explaining around 12%, and the third principal component explaining around 11%.
This suggests that these three principal components capture the majority of the variability in the original data.
Furthermore, loading patterns of the variables on each principal component were identified, indicating the direction and magnitude of the influence of each variable on the principal components.
For example, it was found that the variable 'DRINKING' has a strong positive influence on the first principal component, while the variable 'DRUGS' has a strong negative influence on the second principal component.
These results help us understand how the original variables contribute to the structure of the principal components and how they relate to each other in our analysis @jolliffe2002. We have decided to keep all the variables.

Finally, the table 'accBpoint20', we look for NA or infinite values:

```{r accBpoint20, echo=TRUE, message=TRUE, warning=FALSE}

print("Comprobando NA")
sum(is.na(accBpoint20))


print("Comprobando encabezados")
head(accBpoint20)


print("nombres de variables")
names(accBpoint20)

```

We observe that there are no outliers or special issues with this final table.
Additionally, all the variables are relevant for creating a map of black spots, so we have decided to keep all the variables, concluding Phase 3.
In this project, we are not going to model, evaluate, or deploy, as the objective was to put into practice in a comprehensible way the feature engineering process, which is considered one of the most complex and time-consuming parts of a data mining project.

## BIBLIOGRAPHY