API Brainstorm Thread #252

Open
khusmann opened this issue Jul 18, 2024 · 0 comments

Comments

@khusmann
Contributor

I'm starting this thread to brainstorm some of the ideas I mentioned in #198 and #251. It leans into the idea of data packages and table resources being their own classes, not just lightweight descriptors. In this approach:

  • A data package object would be a list of resource objects. Properties would be stored in its attributes, and be read and modified with get_prop() and set_props(). These functions would ensure the object was always valid.
  • A table resource object would be a list of field objects. Properties would be stored in its attributes, as with data package objects.
  • When a table resource object was read with read_resource(), the result would be both a tibble AND a table resource object. This would let you manage a data frame and its frictionless metadata simultaneously.
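
To make the first two bullets concrete, here is a minimal base R sketch of a list-based object that keeps its properties in an attribute and only changes them through a validating setter. Only the names `get_prop()` and `set_props()` come from the proposal; the constructor `new_data_package()`, the validator `validate_props()`, the `"props"` attribute, and the class name are hypothetical, and the validation rule is a placeholder rather than a real Data Package check.

```r
# Hypothetical constructor: resources are the list items, properties live in
# a single "props" attribute.
new_data_package <- function(resources = list(), props = list()) {
  structure(resources, props = props, class = "datapackage")
}

# Hypothetical validator: only checks that `id`, when present, is one string.
validate_props <- function(props) {
  id <- props[["id"]]
  if (!is.null(id) && !(is.character(id) && length(id) == 1)) {
    stop("`id` must be a single string.", call. = FALSE)
  }
  invisible(props)
}

get_prop <- function(x, name) {
  attr(x, "props", exact = TRUE)[[name]]
}

# Merge the new values, validate the result, and only then store it.
# Passing `name = NULL` removes a property.
set_props <- function(x, ...) {
  props <- utils::modifyList(attr(x, "props", exact = TRUE), list(...))
  validate_props(props)
  attr(x, "props") <- props
  x
}

pkg <- new_data_package(
  resources = list(deployments = list(), observations = list()),
  props = list(name = "example_package")
)
pkg <- set_props(pkg, id = "new-id")
get_prop(pkg, "id")
#> [1] "new-id"
```

Because the properties never occupy list slots, `names(pkg)` only ever lists the resources, which is what would let `pkg$deployments` refer to a child resource.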

Although it does introduce a lot of implementation complexity in some areas, I think it potentially simplifies user experience and reduces complexity in other areas:

  • users no longer have to keep their loaded data frames synchronized with their descriptor metadata, because a loaded resource tibble IS a table resource object in all of its metadata glory
  • we can easily carry context around in an object (e.g. the working directory of a descriptor; see "Should we use resource as an argument?" #251), without it polluting the rest of the descriptor attributes
  • validation is streamlined because properties are always modified through functions that ensure the object stays valid
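
As a rough illustration of the second bullet, the sketch below keeps the descriptor's directory in its own attribute, so it is available for resolving paths but never appears in the descriptor itself. The constructor `new_resource()`, the helper `resolve_path()`, and the attribute names are hypothetical; `get_descriptor()` is the name used in the proposal, implemented here in the simplest possible way.

```r
# Hypothetical constructor: descriptor properties and context (the directory
# the descriptor was read from) are stored in separate attributes.
new_resource <- function(descriptor, directory) {
  structure(
    list(),
    props = descriptor,
    directory = directory,
    class = "tableresource"
  )
}

# The descriptor is just the stored properties; context is left out.
get_descriptor <- function(x) {
  attr(x, "props", exact = TRUE)
}

# Hypothetical helper: resolve the resource path against the stored directory.
resolve_path <- function(x) {
  file.path(attr(x, "directory", exact = TRUE), get_descriptor(x)[["path"]])
}

rsc <- new_resource(
  descriptor = list(name = "deployments", path = "deployments.csv"),
  directory = "data"
)

names(get_descriptor(rsc))
#> [1] "name" "path"
resolve_path(rsc)
#> [1] "data/deployments.csv"
```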

It's also a pretty big departure from the current architecture, so I totally understand if you're not wanting to go this direction... I'm mostly sharing this to just get more ideas / possibilities flowing.

pkg <- example_package()

pkg
#> A Data Package with 3 resources:
#> • deployments
#> • observations
#> • media
#> Use `get_descriptor()` to print the Data Package as a list.

# Instead of using `unclass()`, we use `get_descriptor()` to convert the
# data package object into a raw descriptor object (list)

get_descriptor(pkg)
#> $name
#> [1] "example_package"
#> 
#> $id
#> [1] "115f49c1-8603-463e-a908-68de98327266"
#> 
#> $created
#> [1] "2021-03-02T17:22:33Z"
#> 
#> $image
#> ...

# Instead of setting properties directly on the data package object, we get
# and set properties using `get_prop()` and `set_props()`. This allows us to
# validate the properties before setting them, so the data package object
# is always guaranteed to be valid.

get_prop(pkg, "id")
#> [1] "115f49c1-8603-463e-a908-68de98327266"

pkg <- set_props(pkg, id = "new-id")

get_prop(pkg, "id")
#> [1] "new-id"

# Because all properties are stored as attributes in the data package object,
# we can have the object's items refer directly to the child resources
# of the data package:

pkg$deployments
#> A Table Resource with 5 fields:
#> • deployment_id (string)
#> • longitude (number)
#> • latitude (number)
#> • start (date)
#> • comments (string)
#> Use `get_descriptor()` to print the Table Resource as a list.
#> Use `read_resource()` to load the data of this Table Resource.

# As with a data package object, we can use `get_descriptor()` to convert
# the resource object into a raw descriptor object (list)

get_descriptor(pkg$deployments)
#> $name
#> [1] "deployments"
#> 
#> $path
#> [1] "<...>"
#> 
#> $profile
#> [1] "tabular-data-resource"
#> 
#> $title
#> [1] "Camera trap deployments"
#> ...

# As with data package objects, we use get_prop() and set_props() to work with
# properties:

get_prop(pkg$deployments, "title")
#> [1] "Camera trap deployments"

pkg$deployments <- set_props(pkg$deployments, title = "Camera trap deployments (modified)")

get_prop(pkg$deployments, "title")
#> [1] "Camera trap deployments (modified)"

# We let the child items of table resource objects refer to field objects:

pkg$deployments$deployment_id
#> A Field:
#> • name: deployment_id
#> • type: string
#> • constraints: {required: TRUE, unique: TRUE}
#> Use `get_descriptor()` to print the Field as a list.

# And as usual, we can convert to raw descriptor via `get_descriptor()`:

get_descriptor(pkg$deployments$deployment_id)
#> $name
#> [1] "deployment_id"
#> 
#> $type
#> [1] "string"
#> 
#> $constraints
#> $constraints$required
#> [1] TRUE
#> 
#> $constraints$unique
#> [1] TRUE

# (Also, `get_prop()` and `set_props()` would work with field objects)

# Where this approach gets really interesting is when we start loading the data
# from resources:

rsc <- read_resource(pkg$deployments)

rsc
#> # A Table Resource tibble: 3 × 5
#>   deployment_id longitude latitude start      comments
#>   <chr>             <dbl>    <dbl> <date>     <chr>
#> 1 1                  4.62     50.8 2020-09-25  NA
#> 2 2                  4.64     50.8 2020-10-01 "On \"forêt\" road."
#> 3 3                  4.65     50.8 2020-10-05 "Malfunction/no photos, data"

# Notice the header in the printout -- this is not your average tibble!
# What we get here is a subclassed tibble, which is both a tibble AND a
# carrier of the resource metadata. This means `get_prop()` and `set_props()`
# can still be used!

get_prop(rsc, "title")
#> [1] "Camera trap deployments (modified)"

rsc <- set_props(rsc, title = "Camera trap deployments (modified again)")

get_prop(rsc, "title")
#> [1] "Camera trap deployments (modified again)"

# We can also still use `get_descriptor()` with the tibble!

get_descriptor(rsc)
#> $name
#> [1] "deployments"
#> 
#> $path
#> [1] "<...>"
#> 
#> $profile
#> [1] "tabular-data-resource"
#> 
#> $title
#> [1] "Camera trap deployments (modified again)"
#> ...

# Properties of fields could be set in tidy pipelines (here with dplyr's
# `mutate()`), and new fields could be created by adding columns:

rsc <- rsc |>
  mutate(
    deployment_id = set_props(deployment_id, title = "New deployment ID title"),
  ) |>
  mutate(
    new_field = start + 1,
    new_field = set_props(new_field, title = "The day after the start day"),
  )

# What's cool about this is that we can now use `get_descriptor()` to get the
# descriptor of the resource tibble, and it will include the new field in the
# resulting schema.

# And we can update our package with the new resource at any time:

pkg$deployments <- rsc

# We could also update the resource's path to control how the resource
# will be saved when we write the package to disk:

pkg$deployments <- set_props(pkg$deployments, path = "deployments_new.csv")

# Or set the path to NULL to have the resource embed the tibble data in the
# "data" prop when it's converted to a descriptor:

pkg$deployments <- set_props(pkg$deployments, path = NULL)
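
The walkthrough above relies on a data frame that also carries resource metadata, with a schema derived from its current columns. The sketch below shows one way that could work. It subclasses a base R data.frame rather than a tibble to stay dependency-free, and it assigns the new column with `$<-` instead of `mutate()`. The helpers `as_table_resource()` and `set_field_props()` and the attribute names are hypothetical; `get_descriptor()` is the name used in the proposal.

```r
# Hypothetical: wrap a data frame so it also carries resource properties.
as_table_resource <- function(df, props) {
  structure(df, props = props, class = c("tableresource", class(df)))
}

# Hypothetical: attach field-level properties to a column vector.
set_field_props <- function(col, ...) {
  attr(col, "field_props") <- utils::modifyList(
    as.list(attr(col, "field_props", exact = TRUE)),
    list(...)
  )
  col
}

# Build a descriptor whose schema reflects whatever columns exist right now.
get_descriptor <- function(x) {
  fields <- lapply(names(x), function(nm) {
    c(list(name = nm), attr(x[[nm]], "field_props", exact = TRUE))
  })
  c(attr(x, "props", exact = TRUE), list(schema = list(fields = fields)))
}

rsc <- as_table_resource(
  data.frame(
    deployment_id = c("1", "2"),
    start = as.Date(c("2020-09-25", "2020-10-01"))
  ),
  props = list(name = "deployments")
)

# Adding a column adds a field; its properties travel with the column.
rsc$new_field <- set_field_props(
  rsc$start + 1,
  title = "The day after the start day"
)

desc <- get_descriptor(rsc)
field_names <- vapply(desc$schema$fields, function(f) f$name, character(1))
# field_names now holds deployment_id, start and new_field
desc$schema$fields[[3]]$title
#> [1] "The day after the start day"
```

A real implementation would also need to decide how column attributes and the extra class survive row subsetting, joins, and other verbs; this sketch does not address that.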