Parsing error warning #4

Open
dmurdoch opened this issue Jan 1, 2024 · 1 comment · May be fixed by #5

dmurdoch commented Jan 1, 2024

When I download the CPI table, I get a warning about a parsing issue:

statcanR::statcan_data("18-10-0006-01", "eng")
#> statcanR: downloading remote table.
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#>   dat <- vroom(...)
#>   problems(dat)
#> Rows: 47 Columns: 10
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (10): Cube Title, Product Id, CANSIM Id, URL, Cube Notes, Archive Status...
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#>         REF_DATE    GEO          DGUID
#>    1: 1992-01-01 Canada 2016A000011124
#>    2: 1992-01-01 Canada 2016A000011124
#>    3: 1992-01-01 Canada 2016A000011124
#>    4: 1992-01-01 Canada 2016A000011124
#>    5: 1992-01-01 Canada 2016A000011124
#>   ---                                 
#> 4209: 2023-11-01 Canada 2016A000011124
#> 4210: 2023-11-01 Canada 2016A000011124
#> 4211: 2023-11-01 Canada 2016A000011124
#> 4212: 2023-11-01 Canada 2016A000011124
#> 4213: 2023-11-01 Canada 2016A000011124
#>                                           Products and product groups      UOM
#>    1:                                                       All-items 2002=100
#>    2:                                                            Food 2002=100
#>    3:                                                         Shelter 2002=100
#>    4:                 Household operations, furnishings and equipment 2002=100
#>    5:                                           Clothing and footwear 2002=100
#>   ---                                                                         
#> 4209:                                        Health and personal care 2002=100
#> 4210:                               Recreation, education and reading 2002=100
#> 4211: Alcoholic beverages, tobacco products and recreational cannabis 2002=100
#> 4212:                                        All-items excluding food 2002=100
#> 4213:                             All-items excluding food and energy 2002=100
#>       UOM_ID SCALAR_FACTOR SCALAR_ID    VECTOR COORDINATE VALUE STATUS SYMBOL
#>    1:     17         units         0 v41690914        1.1  83.1     NA     NA
#>    2:     17         units         0 v41690915        1.2  82.0     NA     NA
#>    3:     17         units         0 v41690916        1.3  87.6     NA     NA
#>    4:     17         units         0 v41690917        1.4  87.7     NA     NA
#>    5:     17         units         0 v41690918        1.5  94.1     NA     NA
#>   ---                                                                        
#> 4209:     17         units         0 v41690920        1.7 147.3     NA     NA
#> 4210:     17         units         0 v41690921        1.8 129.0     NA     NA
#> 4211:     17         units         0 v41690922        1.9 193.3     NA     NA
#> 4212:     17         units         0 v41690923        1.1 154.0     NA     NA
#> 4213:     17         units         0 v41690924       1.11 149.4     NA     NA
#>       TERMINATED DECIMALS                                          INDICATOR
#>    1:         NA        1 Consumer Price Index, monthly, seasonally adjusted
#>    2:         NA        1 Consumer Price Index, monthly, seasonally adjusted
#>    3:         NA        1 Consumer Price Index, monthly, seasonally adjusted
#>    4:         NA        1 Consumer Price Index, monthly, seasonally adjusted
#>    5:         NA        1 Consumer Price Index, monthly, seasonally adjusted
#>   ---                                                                       
#> 4209:         NA        1 Consumer Price Index, monthly, seasonally adjusted
#> 4210:         NA        1 Consumer Price Index, monthly, seasonally adjusted
#> 4211:         NA        1 Consumer Price Index, monthly, seasonally adjusted
#> 4212:         NA        1 Consumer Price Index, monthly, seasonally adjusted
#> 4213:         NA        1 Consumer Price Index, monthly, seasonally adjusted

Created on 2024-01-01 with reprex v2.0.2

The advice from vroom to "call problems() on your data frame for details" doesn't work, because the details have been removed by the time the dataset is returned. I also don't see a way to follow the advice to "Specify the column types or set show_col_types = FALSE to quiet this message."
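One possible user-side workaround for the column specification message is readr's global option `readr.show_col_types`. This is only a sketch: it assumes that statcanR reads the metadata with `readr::read_csv()` and does not set `show_col_types` itself. It would not silence the parsing warning, only the column specification message.

```r
# Assumption: statcanR calls readr::read_csv() without passing show_col_types,
# so readr falls back to this global option.
old <- options(readr.show_col_types = FALSE)

# The download itself needs network access, so it is left commented out here.
# dat <- statcanR::statcan_data("18-10-0006-01", "eng")

# Restore the previous setting afterwards.
options(old)
```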


dmurdoch commented Jan 1, 2024

I've taken a closer look, and I see this in the metadata file being read here:

"Cube Title","Product Id","CANSIM Id",URL,"Cube Notes","Archive Status",Frequency,"Start Reference Period","End Reference Period","Total number of dimensions"
"Consumer Price Index, monthly, seasonally adjusted","18100006","326-0022","https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1810000601",1;2;3;4;6;7;10,"CURRENT - a cube available to the public and that is current","Monthly","1992-01-01","2023-11-01","2",

"Dimension ID","Dimension name","Dimension Notes","Dimension Definitions"
"1","Geography",,""
"2","Products and product groups",10,""

followed by more lines defining other things. I think there are two issues here that cause the warning:

  1. The statcan_data function only uses the first two lines at this point, and shouldn't be reading the rest of the file. This can be fixed by setting n_max = 1 in the read_csv call.
  2. The metadata has 10 fields in the header on line 1, and 10 fields followed by a comma on line 2, so read_csv sees it as 11 fields.
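
The field counts behind these two points can be checked offline. The following is a minimal sketch in base R; `meta_lines` is a hypothetical variable holding the lines of the metadata file quoted above, not something statcanR exposes.

```r
# First lines of the metadata file, copied from the excerpt above.
meta_lines <- c(
  '"Cube Title","Product Id","CANSIM Id",URL,"Cube Notes","Archive Status",Frequency,"Start Reference Period","End Reference Period","Total number of dimensions"',
  '"Consumer Price Index, monthly, seasonally adjusted","18100006","326-0022","https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1810000601",1;2;3;4;6;7;10,"CURRENT - a cube available to the public and that is current","Monthly","1992-01-01","2023-11-01","2",',
  '',
  '"Dimension ID","Dimension name","Dimension Notes","Dimension Definitions"',
  '"1","Geography",,""',
  '"2","Products and product groups",10,""'
)

# Count comma-separated fields on each non-blank line, respecting double quotes.
con <- textConnection(meta_lines)
n_fields <- count.fields(con, sep = ",", quote = "\"", comment.char = "")
close(con)

n_fields
```

The second count should be one higher than the first because of the trailing comma, and the later lines belong to a different block with fewer fields, which is why reading past the first data row is also a problem.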

Problem 2 is harder to deal with. The User Guide https://www.statcan.gc.ca/en/developers/csv/user-guide is unclear about whether this is normal or an error at StatCan. It says there are two kinds of metadata: non-census cubes and census cubes, with different numbers of fields (10 vs 12), so reading exactly 10 fields would mess up census cubes.
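
One way to avoid hard-coding 10 or 12 fields is to compare the data row against the header and drop the trailing comma only when it accounts for exactly one extra field. The sketch below uses base R; `read_cube_header()` is a hypothetical helper, not part of statcanR and not necessarily what the linked pull request does, and it has not been tried on census cubes.

```r
# Hypothetical helper: parse only the cube header (line 1) and its single
# data row (line 2) from the lines of a metadata file.
read_cube_header <- function(lines) {
  first_two <- lines[1:2]

  # Count fields per line, respecting double quotes.
  con <- textConnection(first_two)
  on.exit(close(con))
  n <- count.fields(con, sep = ",", quote = "\"", comment.char = "")

  # Drop a trailing comma only when it creates exactly one extra field.
  if (n[2] == n[1] + 1L && grepl(",\\s*$", first_two[2])) {
    first_two[2] <- sub(",\\s*$", "", first_two[2])
  }

  read.csv(
    text = first_two,
    colClasses = "character",  # keep IDs such as "18100006" as text
    check.names = FALSE        # keep names like "Cube Title" unchanged
  )
}

# Usage with the two lines quoted earlier in this issue.
meta_lines <- c(
  '"Cube Title","Product Id","CANSIM Id",URL,"Cube Notes","Archive Status",Frequency,"Start Reference Period","End Reference Period","Total number of dimensions"',
  '"Consumer Price Index, monthly, seasonally adjusted","18100006","326-0022","https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1810000601",1;2;3;4;6;7;10,"CURRENT - a cube available to the public and that is current","Monthly","1992-01-01","2023-11-01","2",'
)
meta <- read_cube_header(meta_lines)
names(meta)
```

Because the column count comes from the header line itself, the same logic should apply to a 12-field census header, provided the extra comma appears in the same position there.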

dmurdoch linked a pull request on Jan 1, 2024 that will close this issue