-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathds_against_traffic_accidents.Rmd
1291 lines (819 loc) · 50.6 KB
/
ds_against_traffic_accidents.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Data science applied to the fight against traffic accidents"
author: "Anton Barrera Mora (me@antonio-barrera.cyou)"
date: "April 2023"
output:
html_document:
highlight: default
number_sections: yes
theme: cosmo
toc: yes
toc_depth: 2
includes:
in_header: p_brand.html
word_document: default
github_document:
preserve_yaml: true
pdf_document:
highlight: zenburn
toc: yes
editor_options:
markdown:
wrap: sentence
bibliography: references.bib
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Introduction
In this project, we intend to show how the "life cycle" applies to data mining (DM) projects.The 'project life cycle' refers to the sequential phases that a project goes through from its initiation to its completion.
It provides a structured framework for managing projects and encompasses all the activities, processes, and deliverables involved in the project's lifespan.
We are talking abour a general framework that establishes the stages or phases a project goes through from initiation to completion.
The project lifecycle provides a structure for managing the project as a whole, from defining the objectives to delivering the results.
In the other hand, while the CRISP-DM @whatis model specifically focuses on the processes and steps involved in data mining, the project lifecycle is a broader framework that can be applied to different types of projects, including data mining projects.
Then the CRISP-DM model can be considered as a specific methodology within the project lifecycle of a data mining project.
It provides detailed guidelines on the specific stages to be followed when conducting a data mining project, such as understanding the business, understanding the data, preparing the data, modeling, evaluating, and deploying.
We will limit ourselves to the first three phases:
1. Understanding the business.
2. Understanding the data.
3. Data preparation.To achieve our objectives, we selected dataset sourced from the National Highway Traffic Safety Administration (NHTSA).
[The Fatality Analysis Reporting System (FARS)](https://www.nhtsa.gov/crash-data-systems/fatality-analysis-reporting-system) was developed by the NHTSA in the United States to provide a comprehensive measure of road safety.
This dataset specifically pertains to the year 2020 and consists of accident records that capture significant descriptive data.
Each accident in the dataset involves at least one fatality.
## Phase 1. Business understanding
### Problem:
We are requested to analyze the causes of road accidents in the United States, whether they are of human, material, or environmental origin.
With this information, they hope to annually review the trends in this matter with the help of the model and adjust the intervention plan, whether through investment, campaigns, or training.
They are also very interested in identifying specific states and cities where the intervention in road safety should be increased or modified, as well as any aspects related to infrastructure.
Finally, we are asked to review the three-year time series to understand the evolution of road accidents.
### Objective:
This data mining project aims to explore the dataset, uncover hidden patterns, and in future if potentially possible, develop predictive models to identify factors contributing to severe accidents.
The insights gained from this analysis will provide valuable information for road safety initiatives and support evidence-based decision-making in accident prevention strategies.
The primary analytical objective of this project is to gain insights into the factors that contribute to the severity of an accident and to define what constitutes a severe accident.
By applying data mining techniques, we aim to uncover patterns and relationships within the dataset that can help us understand the key factors associated with severe accidents.
Based on the above and in summary, we will undertake the initial phases to design a data mining model that allows us to understand the following in an updated manner:
1. The evolution of the time series of fatal accidents.
2. Major causes of accidents.
3. Incident volume by states and cities, including "black spots" in the road network.
### Product:
Documented data mining model tailored to extract relevant information in the defined areas of interest (time series, black spots, and causes).
### Tasks:
[Definition of the study population:]{.underline} Traffic accidents in the USA from 2018 to 2020.
[Data collection:]{.underline}
Primary data (provided by the contracting entity).\
For more information on the dataset and the FARS, please refer to the National Highway Traffic Safety Administration's Crash Data Systems website: FARS.
[Relevant variables:]{.underline}
In a holistic approach, the variables or factors in this project are established around the number of traffic accidents, human causes, material causes, and environmental causes that can influence their occurrence.
[Success criteria:]{.underline}
An informative data mining model that fulfills its function of helping to prevent accidents by anticipating material wear (emergence of new black spots in the road network of states or cities), the appearance of new drugs and areas of incidence, and serves as a primary tool for decision-making.
## Phase 2. Understanding the Data
In this phase, we will identify the necessary dataset to achieve the set objective and gain a holistic understanding of its structure.
We will also address ethical considerations in data usage.
### Objectives:
- Selection and identification of the necessary dataset in relation to the set objectives.
- Initial filtering of the data to eliminate redundant tables and records identified during the initial analysis.
- Attention to data ethics aspects:
- Ownership of the dataset
- Transparency in transactions
- Consent
- Currency
- Privacy
- Openness
- Compliance with legal aspects of data usage, including storage and security.
### Product:
- Descriptive document of the available data, including details about the data itself, its origin, availability, storage, privacy, security, and other legal aspects. Also includes information about variables: their nature, structure, problems, validity, and appropriateness.
- Final datasets suitable for the next phase to achieve the project objectives.
### Tasks:
#### [Selection of the dataset:]{.underline}
The National Highway Traffic Safety Administration (NHTSA) in the United States provides a public dataset on traffic accidents called the Fatality Analysis Reporting System (FARS), which collects detailed information on fatal traffic accidents nationwide since 1975.
FARS contains information about the characteristics of vehicles involved in accidents, drivers and passengers, road and weather conditions, as well as the cause and consequences of the accidents.
The information is collected through reports from law enforcement authorities, accident investigations, and medical and autopsy records.
FARS data is publicly available in CSV format and can be downloaded from the NHTSA website.
There are also online tools and software packages that can assist in analyzing FARS data.
It should be noted that FARS only includes information on traffic accidents with fatalities, so it does not provide data on the entirety of traffic accidents in the United States.
However, it can still be a valuable source of information for research on road safety and accident prevention.
The analytical objective is to understand the main causes of accidents, black spots in the road network, and the temporal evolution of accidents.
We have selected the main dataset for accidents in 2020, called 'FARS2020NationalCSV'.
It is a version of the FARS dataset that includes information on fatal traffic accidents that occurred in the year 2020 in the USA.
It includes several tables that provide detailed information about vehicles, drivers, and passengers involved in the accidents, as well as the circumstances of the accidents.
The selected tables from the overall dataset that will substantiate the project objectives are:
[**'accident'**]{.underline}
Information about the fatal traffic accident.
Some of the variables included in this table are:
"ST_CASE": the case number assigned to the accident.
"STATE": numeric code of the state where the accident occurred.
"STATENAME": name of the state where the accident occurred corresponding to its numeric code.
"COUNTY": numeric code of the county where the accident occurred.
"COUNTYNAME": names of the counties corresponding to their numeric codes.
"CITY": numeric code of cities.
"CITYNAME": names of the cities corresponding to their numeric codes.
"FATALS": number of fatalities.
"DAY": day of the accident.
"MONTH": month in which the accident occurred.
"HOUR": the hour of the day when the accident occurred (in 24-hour format).
"MINUTE": minutes of the hour when the accident occurred.
"NHS": codes indicating if the accident occurred on a National Highway System (NHS) road.
"NHSNAME": confirmation if the accident belongs to an NHS road.
"ROUTE": code indicating the type of route where the accident occurred.
"ROUTENAME": name of the route corresponding to the numeric code.
"TWAY_ID": the identification of the roadway where the accident occurred.
"TWAY_ID2": the identification of a second roadway if necessary.
"MILEPT": the mile at which the accident occurred.
"LATITUDE": the latitude of the accident location.
"LONGITUDE": the longitude of the accident location.
"MAN_COLL": code indicating the type of collision that occurred.
"MAN_COLLNAME": description of the collision type.
"RUR_URB": codes specifying the area of occurrence.
"RUR_URBNAME": confirmation of the area where the accident occurred.
"LGT_COND": code indicating the lighting condition of the vehicle at the time of the accident.
"LG_CONDNAME": description of the lighting condition based on the previous code.
"WEATHER": code indicating the weather conditions.
"WEATHERNAME": name of the adverse weather condition.
"DRUNK_DR": number of drunk drivers.
This table provides valuable information about the characteristics of fatal traffic accidents, including the date, time, location, and number of victims.
These data can be used to identify patterns and trends in traffic accidents, which in turn can be utilized to develop more effective accident prevention strategies.
Additionally, it serves as the basis for calculating the time series and other statistics related to the total number of victims and forecasting.
[**'driverrf'**]{.underline}
Contains information about the drivers involved in the fatal traffic accident.
Some of the variables included in this table are:
"ST_CASE": the case number assigned to the accident.
"VEH_NO": the vehicle number involved in the accident.
"DRIVERRF": numeric code indicating references to driver circumstances, such as driver's license information, improper vehicle handling, and various aspects related to the driver.
"DRIVERRFNAME": description of the above codes regarding relevant driver and driving aspects at the time of the accident.
This table provides valuable information about the characteristics of drivers involved in fatal traffic accidents.
Variables related to the driver's condition and substance use are important for our analytical objectives.
[**'vehicle'**]{.underline}
Contains information about the vehicles involved in the fatal traffic accident.
Some of the variables considered for the analytical objective are:
"ST_CASE": the case number assigned to the accident.
"VEH_NO": the vehicle number involved in the accident.
"MAKE": numeric code for the vehicle's make.
"MAKENAME": make name.
"MODEL": numeric code for the vehicle's model.
"MAK_MOD": numeric code for the make and model.
"MAK_MODNAME": make and model name.
"MOD_YEAR": the vehicle's model year.
"TOW_VEH": whether the vehicle was being towed at the time of the accident.
"MCARR_ID": identification of the commercial vehicle involved in the accident.
"GVWR": gross vehicle weight rating.
"VEH_YEAR": the vehicle's manufacturing year.
This table provides valuable information about the characteristics of vehicles involved in traffic accidents with fatalities.
Variables related to the vehicle's manufacturing date and model can help identify and characterize issues with the vehicles in use.
[**'weather'**]{.underline}
In FARS, it contains information about the weather conditions at the time of the fatal traffic accident.
Some of the variables included in this table are:
"ST_CASE": the case number assigned to the accident.
"STATENAME": the state where the accident occurred.
"WEATHER": the numeric code describing the weather conditions at the time of the accident.
"WEATHERNAME": description of the weather conditions described in the previous variable at the time of the accident.
It provides information about the weather conditions at the time of the fatal traffic accident and can be used to conduct more detailed analyses of road safety and accident prevention under different weather conditions.
In our case, we are interested in understanding the multiple variables involved in accidents and their effects, but this type of information can also enable corrective measures in areas where certain weather phenomena are more frequent.
[**'distract'**]{.underline}
Contains data on the distractions that drivers were experiencing at the time of the fatal traffic accident.
Some of the variables included are:
"ST_CASE": the case number assigned to the accident.
"VEH_NO": the vehicle number involved in the accident.
"DRDISTRACT": the numeric code describing the distraction the driver was experiencing at the time of the accident.
"DRDISTRACTNAME": description of the type of distraction.
It provides information about a wide range of distractions that drivers experienced at the time of the accident.
Some of the distractions captured in the table include:
- Talking or interacting with a passenger in the vehicle.
- Eating or drinking while driving.
- Using a mobile phone or electronic device.
- Observing something outside the vehicle, such as an accident or scenery.
- Touching or adjusting the radio or vehicle controls.
- Looking at or reaching for something inside the vehicle.
- Other types of distractions, such as being fatigued, being emotionally disturbed, or being distracted by something outside the vehicle.
[**'drugs'**]{.underline}
Contains information about the presence of drugs in drivers involved in fatal traffic accidents.
Some of the variables included in this table are:
"ST_CASE": the case number assigned to the accident.
"VEH_NO": the vehicle number involved in the accident.
"DRUGSPEC": numeric code for the type of drug test conducted.
"DRUGSPECNAME": descriptive specification of the type of drug test conducted.
"DRUGRES": numeric code for the test results.
"DRUGRESNAME": descriptive specification of the test results, specifying the name of the substance.
Understanding the main causes of accidents is the analytical objective, but the variables included in this table can help us understand trends in drug use and highlight areas where this occurs, enabling targeted awareness and information campaigns among the population.
[**'vision'**]{.underline}
Contains information about vision problems of drivers involved in fatal traffic accidents.
Some of the variables included in this table are:
"ST_CASE": the case number assigned to the accident.
"VEH_NO": the vehicle number involved in the accident.
"VISION": the numeric code describing the vision problem the driver had at the time of the accident.
"VISION_NAME": detailed description of the vision problem.
Understanding the variables in this table can be useful for analyzing the relationship between drivers' vision problems and fatal traffic accidents.
[**'Factor'**]{.underline}
Contains information about different factors contributing to fatal traffic accidents.
Vehicle conditions, environmental conditions, driver behavior, and other external factors.
"ST_CASE": the case number assigned to the accident.
"VEH_NO": the vehicle number involved in the accident.
"VEHICLECC": numeric code describing the factor that contributed to the accident.
There are several codes available for mechanical failures observed in the vehicle.
"VEHICLECCNAME": provides descriptive information about the mechanical problem referred to by each code.
It can help us understand the main mechanical causes of traffic accidents, relate them to vehicle models and age, and help develop prevention and control measures.
#### Exploratory Analysis
We are going to make a first approach to explore the chosen dataset and get to know its structure as the number of records, variables, type of variables, problems, determine the value of the data and other important aspects before starting to work with the dataset.
```{r Routes and data loading, echo=TRUE, message=FALSE, warning=FALSE}
# Set paths to different accident datasets
# 2020 Dataset
path_acc20 = "./FARS2020NationalCSV/accident.CSV"
# 2019 Dataset
path_acc19 = "./FARS2019NationalAuxiliaryCSV/ACC_AUX.CSV"
# 2018 Dataset
path_acc18 = "./FARS2018NationalAuxiliaryCSV/ACC_AUX.CSV"
# Set paths to additional 2020 files for further analysis
path_veh = "./FARS2020NationalCSV/vehicle.CSV"
path_per = "./FARS2020NationalCSV/person.CSV"
path_driverrf = "./FARS2020NationalCSV/driverrf.CSV"
path_distract = "./FARS2020NationalCSV/Distract.CSV"
path_drimpair = "./FARS2020NationalCSV/Drimpair.CSV"
path_factor = "./FARS2020NationalCSV/Factor.CSV"
path_vision = "./FARS2020NationalCSV/Vision.CSV"
path_wea = "./FARS2020NationalCSV/weather.CSV"
# Load accident data for each year for time series and other analyses
# 2020 Accidents - complete data
acc20 <- read.csv(path_acc20, row.names=NULL)
# 2019 Accidents - auxiliary data
acc19 <- read.csv(path_acc19, row.names=NULL)
# 2018 Accidents - auxiliary data
acc18 <- read.csv(path_acc18, row.names=NULL)
# Load additional 2020 data
# Vehicles
veh <- read.csv(path_veh, row.names=NULL)
# Persons
per <- read.csv(path_per, row.names=NULL)
# Additional information about the driver
drivrf <- read.csv(path_driverrf, row.names=NULL)
# Information about driver impairments
impair <- read.csv(path_drimpair, row.names=NULL)
# Information about mechanical factors
fact <- read.csv(path_factor, row.names=NULL)
# Information about visibility problems
visio <- read.csv(path_vision, row.names=NULL)
# Information about driver distractions
distract <- read.csv(path_distract, row.names=NULL)
# Information about weather conditions
wea <- read.csv(path_wea, row.names=NULL)
```
Once the dataset is loaded, we proceed to the exploration of the dataset.
##### 1. Verifying the structure of the accident set from 2018 to 2020 inclusive:
ACCIDENTS 2020
```{r exp_2020_accident, echo=TRUE, message=TRUE, warning=FALSE}
# Main file acc20
# Check the first few rows of data for each variable
head(acc20)
# Analyze numeric variables
summary(acc20)
# Get the names of variables in the accident table
names(acc20)
# Perform a general analysis of the structure
summario20 <- as.data.frame(str(acc20, give.attr = FALSE, strict.width = "cut"))
knitr::kable(summario20, caption = "Summary of the acc20 table")
```
We can observe that the 2020 accident table has 3576 records or instances of accidents with 81 variables.
At first glance, we can already see that some variables are not necessary for the project as they provide unnecessary data.
All numeric variables are of integral type, there are no date records, and the remaining variables can be categorized as text strings.
**ACCIDENTS 2019**
```{r exp_acc_2019, echo=TRUE, message=TRUE, warning=FALSE}
# Auxiliary file acc19
# Let's check the first rows of data for each variable
head(acc19)
# An analysis of numeric variables
summary(acc19)
# The names of the variables in the accident table
names(acc19)
# and perform a general analysis of the structure
summario19 <- as.data.frame(str(acc19, give.attr = FALSE, strict.width = "cut"))
knitr::kable(summario19, caption = "Summary of the acc20 table")
```
We can observe that in this case, the data in the 2019 'Auxiliary' file are of integral type, and we have 33,487 instances or accidents with 42 variables.
The file lacks descriptions in text strings, but the variable names and primary key match the previous one, so we can unify them using the table as a child table and using the primary key as a foreign key.
ACCIDENTS 2018
```{r exploracion accident 2018, echo=TRUE, message=TRUE, warning=FALSE}
# Auxiliary file acc18
# Let's check the first rows of data for each variable
head(acc18)
# An analysis of numeric variables
summary(acc18)
# The names of the variables in the accidents table
names(acc18)
# And let's perform a general analysis of the structure
summary18 <- as.data.frame(str(acc18, give.attr = FALSE, strict.width = "cut"))
knitr::kable(summary18, caption = "Summary of the acc18 table")
```
We observe that the accidents table for 2018 contains 33,919 instances and the same 42 variables as in the previous case, with analogous structures.
Furthermore, the primary/foreign key is the case number:
ST_CASE - Accident identifier
We will study the following aspects of the data:
TEMPORAL TREND STUDY
FATAL - Annual fatalities (2018 to 2020)
CAUSE STUDY
DRUNK_DR - Number of drunk drivers
DAY_WEEKNAME - Interest in weekends
HOUR - Hour
HOURNAME - Time slot
MINUTE - Minute
DRDISTRAC - Code for driver distractions
DRDISTRACNAME - Distraction specification
MOD_YEAR - Year of vehicle model
L_TYPE - License type, but we are interested only in those without a license
WEATHER - Code for weather conditions at the time
WEATHERNAME - Weather condition specification
DRUNK_DR - Number of positive alcohol tests for drivers involved
DRIVERRF - Type of driving infractions or aspects.
This record comes from a table that allows us to select the involved or responsible drivers
AGE - Age, filtering the age of officially responsible drivers
DRIVERRFNAME - Textual specification of the responsible driver
AIR_BAG - Code to specify airbag behavior
AIR_BAGNAME - Textual specification of airbag behavior
DRUGS - Codes for the presence of drugs
DRUGSNAME - Textual specification
VISION - Codes for visibility elements (mirrors, windows, etc.)
VISIONNAME - Specification of the type of anomaly in vision-related elements
VEHICLECC - Code for factors that may contribute to the accident
VEHICLECCNAME - Specification of the factors
DRIMPAIR - Codes for detected physical impairments
DRIMPAIRNAME - Specification of detected psychophysical impairment aspects
LOCATION OF BLACK SPOTS
COUNTY - County code
COUNTYNAME - County name
CITY - City code
CITYNAME - City name
ROUTE - Route code or road type
ROUTE - Specification of the road type
RUR_URB - Code 1-2 to specify if it is rural or urban, respectively
MILEPT - Mile point number
LATITUDE
LONGITUDE
##### 2. Feature processing and management I:
[**Cleaning**]{.underline}
The next step is to ensure that there are no empty or null values.
```{r cleaning, echo=TRUE, message=TRUE, warning=FALSE}
print('NA')
# Checking the accident files
print ('Accidents 2020')
colSums(is.na(acc20))
print('Accidents 2019')
colSums(is.na(acc19))
print('Accidents 2018')
colSums(is.na(acc18))
# Checking the auxiliary files
print('driverrf')
colSums(is.na(drivrf))
print('Factor')
colSums(is.na(fact))
print('impair')
colSums(is.na(impair))
print('Person')
colSums(is.na(per))
print('Vehicles')
colSums(is.na(veh))
print('vision')
colSums(is.na(visio))
print('Blanks')
# Checking the accident files
print ('Accidents 2020')
colSums(acc20=="")
print('Accidents 2019')
colSums(acc19=="")
print('Accidents 2018')
colSums(acc18=="")
# Checking the auxiliary files
print('driverrf')
colSums(drivrf=="")
print('Factor')
colSums(fact=="")
print('impair')
colSums(impair=="")
print('Person')
colSums(per=="")
print('Vehicles')
colSums(veh=="")
print('vision')
colSums(visio=="")
```
Certain variables in the 'per' (person) table contain blank values, such as MAKENAME or MAK_MOD, which refer to the vehicle models and other specifications in the 'veh' (vehicles) table.
However, these variables are not relevant for our project's objective, so we will discard those records.
At this point, working with the large number of tables and variables becomes challenging.
Therefore, we will need to create the tables on which we will work.
By doing so, we will implicitly remove the variables that are not necessary for the project's objective.
In this initial feature engineering phase (second phase), we will select variables based on the descriptive work conducted earlier.
In the next phase, we will delve deeper into feature engineering based on the specific requirements of each objective.
TIME SERIES ANALYSIS 2018-2020
We need to create a table that includes the number of victims (FATALS), the year (YEAR), and the accident number or case code (ST_CASE), which serves as the key.
```{r install_libraries, echo=TRUE, message=FALSE, warning=FALSE}
#Install various packages as needed
if(!require('ggplot2')) install.packages('ggplot2'); library('ggplot2')
if(!require('Rmisc')) install.packages('Rmisc'); library('Rmisc')
if(!require('dplyr')) install.packages('dplyr'); library('dplyr')
if(!require('xfun')) install.packages('xfun'); library('xfun')
if(!require('magrittr')) install.packages('magrittr'); library('magrittr')
if (!require('factoextra')) install.packages('factoextra'); library('factoextra')
if(!require('pracma')) install.packages('pracma'); library('pracma')
```
```{r creando tabla analisis18_20, echo=TRUE, message=TRUE, warning=FALSE}
print('Creating table for the 2018-2020 FATALS time series')
# Select the necessary columns
accidentData18_20 <- rbind(
acc18 %>% select(ST_CASE, YEAR, FATALS),
acc19 %>% select(ST_CASE, YEAR, FATALS),
acc20 %>% select(ST_CASE, YEAR, FATALS)
)
# Sort the table by year
analysis18_20 <- accidentData18_20 %>% arrange(YEAR, ST_CASE)
# Show the content of the new table
print('New analysis18_20 table')
# Remove the accidentData18_20 table
rm(accidentData18_20)
# Analyze the headers of the new table
head(analysis18_20)
```
FACTORS PRESENT IN ACCIDENTS (2020)
Previously, we have described the variables to consider or study in order to understand their influence on the outcome.
Now, we will create a new table that only includes the variables of interest, discarding others.
```{r creando tabla de causas y factores, echo=TRUE, message=TRUE, warning=FALSE}
print('Creating tables with factors related to accidents and excluding the rest')
# Selecting the necessary columns from each table
# Unique primary key records
acc20_select <- acc20 %>% select(ST_CASE, DAY_WEEKNAME, HOUR, MINUTE, LGT_COND, LGT_CONDNAME)
wea20_select <- wea %>% select(ST_CASE, WEATHER, WEATHERNAME)
# Merging both tables using the merge function
accidentWeather20 <- merge(acc20_select, wea20_select, by = "ST_CASE")
# Composite primary key records
veh_select <- veh %>% select(ST_CASE, VEH_NO, MOD_YEAR)
per_select <- per %>% select(ST_CASE, VEH_NO, AGE, AIR_BAG, AIR_BAGNAME, DRINKING, DRINKINGNAME, DRUGS, DRUGSNAME)
drivrf_select <- drivrf %>% select(ST_CASE, VEH_NO, DRIVERRF, DRIVERRFNAME)
fact_select <- fact %>% select(ST_CASE, VEH_NO, VEHICLECC, VEHICLECCNAME)
impair_select <- impair %>% select(ST_CASE, VEH_NO, DRIMPAIR, DRIMPAIRNAME)
visio_select <- visio %>% select(ST_CASE, VEH_NO, VISION, VISIONNAME)
distract_select <- distract %>% select(ST_CASE,VEH_NO, DRDISTRACT, DRDISTRACTNAME)
# Combining each table using the merge function
accidentFacts20 <- merge(veh_select, per_select, by = c("ST_CASE", "VEH_NO"), all = TRUE)
accidentFacts20 <- merge(accidentFacts20, drivrf_select, by = c("ST_CASE", "VEH_NO"), all = TRUE)
accidentFacts20 <- merge(accidentFacts20, fact_select, by = c("ST_CASE", "VEH_NO"), all = TRUE)
accidentFacts20 <- merge(accidentFacts20, impair_select, by = c("ST_CASE", "VEH_NO"), all = TRUE)
accidentFacts20 <- merge(accidentFacts20, visio_select, by = c("ST_CASE", "VEH_NO"), all = TRUE)
accidentFacts20 <- merge(accidentFacts20, distract_select, by = c("ST_CASE", "VEH_NO"), all = TRUE)
# Removing datasets that have been merged
rm(acc20_select)
rm(distract_select)
rm(drivrf_select)
rm(fact_select)
rm(impair_select)
rm(per_select)
rm(veh_select)
rm(visio_select)
rm(wea20_select)
# Analyzing the result using the head() function
print("Accident Facts Table")
head(accidentFacts20)
print("Accident Weather Table")
head(accidentWeather20)
```
We have now two tables that collect different variables that may be related to the occurrence of accidents.
We decided to unify them into a single table:
```{r Creando la tabla final con factores de accidentes, echo=TRUE, message=TRUE, warning=FALSE}
print("Merging tables with different keys")
accFacts20 <- merge(accidentFacts20, accidentWeather20, by = "ST_CASE", all = TRUE)
# Removing the tables that have been merged
rm(accidentFacts20)
rm(accidentWeather20)
# Viewing the result using the head() function
print("Applying head() and names() to the new table accFacts20")
head(accFacts20)
names(accFacts20)
```
ROAD BLACK SPOTS
In line with the two previous tables that collect the different aspects we intend to analyze, we proceed to create a third table with geographic data to later determine where the road black spots are located.
We start with the base of the 2020 accident table, which contains all the information we need:
```{r black_points, echo=TRUE, message=TRUE, warning=FALSE}
print("Creating the table with the location of accidents")
accBpoint20 <- acc20 %>% select(ST_CASE, STATE, STATENAME, COUNTY, COUNTYNAME, CITY, CITYNAME, ROUTE, ROUTENAME, RUR_URB, RUR_URBNAME, MILEPT, LATITUDE, LONGITUD)
# Analyzing the content with head() and the variables with names()
print("Table head")
head(accBpoint20)
print("Variable Names")
names(accBpoint20)
```
At this point, we have achieved the objectives for this phase, as we have: 1.
Description of the dataset and the variables represented in it.
2.
Clean dataset with an initial phase of feature management.
Finally, we proceed to clean the workspace before moving on to the third phase.
```{r limpieza del entorno de trabajo previo a Fase 3, echo=TRUE, message=TRUE, warning=FALSE}
# Removing unnecessary elements except for the 3 tables we will work with
rm(acc18, acc19, acc20, distract, drivrf, fact, impair, per, veh, visio, wea)
# Removing 'paths' and 'summaries'
rm(list = ls(pattern = "^pat"))
rm(list = ls(pattern = "^summa"))
```
## Phase 3. Data Preparation
### Objectives:
In this phase, we will continue with feature engineering, which involves selecting and transforming variables or features of the data to improve the performance of the machine learning model.
In this case, we will adapt the different tables and variables to the project's needs.
### Deliverable:
Obtain a dataset from 3 tables that contain the relevant variables for the 3 aspects of the project:
1. Analyze the evolution of accidents in the 2018 to 2020 series.
2. Identify blackspots in the road network.
3. Explore factors that may influence the occurrence of accidents.
These tables will have undergone data transformation and dimensionality reduction methods.
### Tasks:
##### 1. Feature processing and management II
TABLE FOR THE 2018-2020 TIME SERIES
The table for the analysis of the time series, 'analisis18_20', contains only 3 variables as we saw earlier.
```{r fase 3 table_analysis, echo=TRUE, message=TRUE, warning=FALSE}
# Table analysis
head(analysis18_20)
names(analysis18_20)
```
Feature processing (or engineering) is typically applied when there is a large number of variables and the goal is to reduce their dimensionality by eliminating irrelevant or highly correlated variables.
However, in this case, there are only four variables in the table and all of them appear to be relevant for the analysis.
Therefore, we will not perform any feature engineering on this table.
TABLE OF FACTORS IN ACCIDENTS
The resulting table from Phase 2, 'accFacts20', contains, as we will see below, 96966 records and 27 variables.
```{r fase 3: Table_factors, echo=TRUE, message=TRUE, warning=FALSE}
#Table factors
head(accFacts20)
names(accFacts20)
```
The large number of variables makes it challenging to work with the data.
At this point, we will perform a Principal Component Analysis (PCA).
The goal is to find linear combinations of the original variables that explain the most variability in the data.
By doing so, we can reduce the number of variables to those that truly contribute relevant information:
```{r previos_PCA_1, echo=TRUE, message=TRUE, warning=FALSE}
# To perform a PCA we need to work only with the numerical variables.
print("Filtering variables to exclude numerical variables")
accFacts20_num <- accFacts20[, !grepl("NAME", names(accFacts20))]
# Looking for missing values or NA
sum(is.na(accFacts20_num))
```
We observe that there are indeed missing values in the data.
Additionally, upon visual inspection and as a result of merging different tables, we can see values that are completely out of range (9999, 99, 0.99), which require further handling.
```{r previos_PCA_2, echo=TRUE, message=TRUE, warning=FALSE}
library("pracma")
print("Revisamos cada variable para conocer como se distribuyen los valores ausentes")
# establecemos un limite razonable de impresiones para no saturar el documento
options(max.print=20)
print("MOD_YEAR")
which(is.na(accFacts20_num$MOD_YEAR))
print("AGE")
which(is.na(accFacts20_num$AGE))
print("DRINKING")
which(is.na(accFacts20_num$DRINKING))
print("DRUGS")
which(is.na(accFacts20_num$DRUGS))
print("DRIVERRF")
which(is.na(accFacts20_num$DRIVERRF))
print("HOUR")
which(is.na(accFacts20_num$MINUTE))
print("MINUTE")
which(is.na(accFacts20_num$MINUTE))
print("WEATHER")
which(is.na(accFacts20_num$WEATHER))
```
In general, when using the 'which' function without limits, we observe that there are approximately 8579 NA values.
In the case of the vehicle's age, it is clear that not all records capture this information, as there are accidents involving non-motorized vehicles.
Therefore, we know that accidents with 'VEH_NO' = 0 correspond to accidents involving non-motorized vehicles.
Additionally, we know that among the vehicles involved, 'VEH_NO' represents the vehicle considered to have caused the accident.
Hence, during the statistical analysis in the upcoming phases, we need to take this into account.
We also observe that there are a large number of discrete variables, especially related to technical aspects of the vehicle at the time of the accident, driver-related details, substance consumption, etc., which need to be transformed into continuous variables.
Another aspect to consider is the presence of attributes that have taken on values like "type 9999" due to the merging of tables.
We will address each case individually.
```{r previos_PCA_3, echo=TRUE, message=TRUE, warning=FALSE}
print("Filtramos valores = 0 en VEH_NO que se referen a otros involucrados o vehiculos sin motor ")
accFacts20_num_filtrado <- accFacts20_num[accFacts20_num$VEH_NO != 0, ]
print("imputamos a los valores 'MOD_YEAR'= NA el valor del vehiculo principal en el mismo caso")
for (i in unique(accFacts20_num_filtrado$ST_CASE)) {
vehiculo_principal_mod_year <- na.omit(accFacts20_num_filtrado$MOD_YEAR[accFacts20_num_filtrado$ST_CASE == i & accFacts20_num_filtrado$VEH_NO == 1])
if (length(vehiculo_principal_mod_year) > 0) {
accFacts20_num_filtrado$MOD_YEAR[accFacts20_num_filtrado$ST_CASE == i & is.na(accFacts20_num_filtrado$MOD_YEAR)] <- vehiculo_principal_mod_year[1]
}
}
print("Comprobamos los valores NA en 'MOD_YEAR")
print("MOD_YEAR")
which(is.na(accFacts20_num$MOD_YEAR))
```
We observed different problems, such as "9999" values in the dates of the models and others in the variable AGE.
```{r previos_pca_4, echo=TRUE, message=TRUE, warning=FALSE}
print("Eliminamos valores imposibles en MOD_YEAR y AGE")
accFacts20_num_filtrado <- subset(accFacts20_num_filtrado, !(MOD_YEAR > 2020 & AGE > 90))
print("eliminamos edades imposibles o casi, para conducir")
accFacts20_num_filtrado <- subset(accFacts20_num_filtrado, !(AGE > 90))
```
We realise that for the purposes of the analysis it is of little relevance to analyse those involved other than the main vehicle, so we rectify to correct and obtain a table containing the causal cars and drivers, including all the factors we want to study in the following phases:
```{r previos_pca_5, echo=TRUE, message=TRUE, warning=FALSE}
print("Estableciendo el numero de caso = VEH_NO=1")
accFacts20_num_filtrado_veh1 <- accFacts20_num_filtrado %>%
filter(VEH_NO == 1) %>%
distinct(ST_CASE, .keep_all = TRUE)
print("cambiamos el nombre a la tabla y limpiamos el entorno de trabajo")
accFacts20v2 <- accFacts20_num_filtrado_veh1
rm(accFacts20_num, accFacts20_num_filtrado, accFacts20_num_filtrado_veh1)
print("Eliminamos la variable AIR_BAG porque cometimos un error seleccionandola y no tiene utilidad para el objetivo del proyecto")
accFacts20v2 <- accFacts20v2 %>%
select(-AIR_BAG)
```
```{r previos_pca_6, echo=TRUE, message=TRUE, warning=FALSE}
#Tenemos que convertir en binaria la variable "Drinking"