`require_all_on` fails with iterables #434

aulemahal · 2022-01-18T20:39:17Z

Description

When searching a catalog with a query on a column that stores iterables while also passing require_all_on (on a column other than the one holding the iterables), the search returns an empty catalog.

csv file:

simulation_id,ensemble_id,model_institution_id,model_id,experiment_id,timestep_id,domain_id,member_id,variable_id,file
A,CMIP6,CCCma,CanESM,historical,day,NAM,r1i1p1,"('tasmax', 'tasmin')",file1
A,CMIP6,CCCma,CanESM,historical,day,NAM,r1i1p1,"('pr', 'prsn')",file2
B,CMIP6,CCCmb,CanESM,historical,day,NAM,r1i1p1,"('tasmax', 'tasmin')",file3
C,CMIP6,CCCmc,CanESM,historical,day,NAM,r1i1p1,"('tasmax', 'tasmin')",file4
D,CMIP6,CCCmd,CanESM,historical,day,NAM,r1i1p1,"('pr', 'tasmin')",file5
D,CMIP6,CCCme,CanESM,historical,day,NAM,r1i1p1,"('tasmax', 'prsn')",file6
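To make "columns storing iterables" concrete, here is a small stdlib-only sketch of what the quoted variable_id cells become once ast.literal_eval is applied to them. It only uses two rows copied from the CSV above with the columns trimmed; the variable names are mine and nothing here is intake-esm code.

```python
import ast
import csv
import io

# Two rows copied from the CSV above (columns trimmed for brevity).
sample = '''simulation_id,variable_id,file
A,"('tasmax', 'tasmin')",file1
D,"('tasmax', 'prsn')",file6
'''

rows = list(csv.DictReader(io.StringIO(sample)))
for row in rows:
    # Turn the string "('tasmax', 'tasmin')" into a real Python tuple.
    row["variable_id"] = ast.literal_eval(row["variable_id"])

print(type(rows[0]["variable_id"]).__name__, rows[0]["variable_id"])
```

Each variable_id cell is therefore a tuple of variable names rather than a single string.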

json file:

{
    "esmcat_version": "0.1.0",
    "assets": {
        "column_name": "file",
        "format": "netCDF"
    },
    "aggregation_control": {
        "variable_column_name": "variable_id",
        "groupby_attrs": ["simulation_id", "domain_id", "timestep_id"],
        "aggregations": [
            {"type": "join_new", "attribute_name": "member_id"},
            {"type": "union", "attribute_name": "variable_id"}
        ]
    },
    "attributes": [],
    "catalog_file": "test.csv"
}

What I Did

import ast
import intake

cat = intake.open_esm_datastore('test.json', read_csv_kwargs={'converters': {'variable_id': ast.literal_eval}})
cat.search(variable_id=['tasmax', 'prsn'], require_all_on=['simulation_id'])
# got: < catalog with 0 dataset(s) from 0 asset(s)>
# expected : < catalog with 2 dataset(s) from 3 asset(s)>

The culprit is intake_esm._search.search_apply_require_all_on, which does not handle columns containing iterables. When it groups the rows and builds, for each group, the set of values to compare against the expected set, the values from iterable columns are still whole iterables (e.g. ('tasmax', 'tasmin')), whereas the expected set contains the individual elements (e.g. 'tasmax'). The comparison therefore never succeeds.
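The mismatch can be reproduced without intake-esm. The sketch below is plain Python, not the library's actual implementation: groups_with_all and its flatten flag are hypothetical names used only to contrast comparing whole tuples against comparing their elements, using the rows from the CSV above.

```python
from collections import defaultdict

# (simulation_id, variable_id, file) taken from the CSV in this issue.
rows = [
    ("A", ("tasmax", "tasmin"), "file1"),
    ("A", ("pr", "prsn"), "file2"),
    ("B", ("tasmax", "tasmin"), "file3"),
    ("C", ("tasmax", "tasmin"), "file4"),
    ("D", ("pr", "tasmin"), "file5"),
    ("D", ("tasmax", "prsn"), "file6"),
]
requested = {"tasmax", "prsn"}


def groups_with_all(rows, requested, flatten):
    """Hypothetical helper: simulation_ids whose rows cover every requested variable."""
    found = defaultdict(set)
    for sim, variables, _file in rows:
        # Skip rows that do not match the query at all.
        if not requested.intersection(variables):
            continue
        if flatten:
            # Compare on the individual elements of the iterable.
            found[sim].update(variables)
        else:
            # Compare on the iterable itself, which mirrors the reported bug.
            found[sim].add(variables)
    return sorted(sim for sim, seen in found.items() if requested <= seen)


naive = groups_with_all(rows, requested, flatten=False)
fixed = groups_with_all(rows, requested, flatten=True)
print(naive)  # []
print(fixed)  # ['A', 'D']
```

With flattening, simulations A and D satisfy the condition, which agrees with the 2 datasets expected above.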

PR coming sooner than not.

Version information: output of `intake_esm.show_versions()`

INSTALLED VERSIONS

cftime: 1.5.1.1
dask: 2021.12.0
fastprogress: 0.2.7
fsspec: 2021.11.1
gcsfs: 2021.11.1
intake: 0.6.4
intake_esm: 2021.8.17.post43+dirty
netCDF4: 1.5.8
pandas: 1.3.5
requests: 2.26.0
s3fs: 2021.11.1
xarray: 0.20.2
zarr: 2.10.3

aulemahal mentioned this issue Jan 18, 2022

Support iterable columns with require_all_on #435

Merged

3 tasks

andersy005 closed this as completed in #435 Feb 1, 2022
