Serialization Type for Fields Containing Python Type object #138

Open
robodair opened this issue Nov 7, 2024 · 0 comments

I am using py-ocsf-models in combination with svdimchenko/pydantic-glue to generate the schema for tables in AWS Glue, and currently I need to subclass and override field serialization for Pydantic to generate a compatible schema for fields with an arbitrary type.

Would you be interested in a PR to bring support for this upstream?

The OCSF Schema defines several attributes only as type Object, which it describes as an unordered collection of attributes.

It also defines the type of a few attributes as JSON, as is the case with the Enrichment object.

Currently in py-ocsf-models these types are handled by using the python type name object:

unmapped: Optional[object]

data: Optional[dict[str, object]]

Pydantic generates a JSON Schema element such as {"type": "object"} for such fields, and the only way to represent that in AWS Glue type definitions would be struct<>, which is not valid.
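To make the failure mode concrete, here is a minimal sketch of a JSON Schema to Glue type mapping. The `glue_type` helper is hypothetical and heavily simplified; it is not pydantic-glue's actual implementation, only an illustration of why an object schema without a "properties" key ends up as an empty struct.

```python
def glue_type(schema: dict) -> str:
    """Hypothetical, simplified JSON Schema -> Glue type mapping (illustration only)."""
    kind = schema.get("type")
    if kind == "object":
        # Each declared property becomes a struct member; with no
        # "properties" key there is nothing to put inside the struct.
        fields = ",".join(
            f"{name}:{glue_type(sub)}"
            for name, sub in schema.get("properties", {}).items()
        )
        return f"struct<{fields}>"
    if kind == "array":
        return f"array<{glue_type(schema['items'])}>"
    # Scalar types map directly to a Glue primitive.
    scalars = {"string": "string", "integer": "int", "number": "float", "boolean": "boolean"}
    return scalars[kind]


print(glue_type({"type": "object", "properties": {"name": {"type": "string"}}}))  # struct<name:string>
print(glue_type({"type": "object"}))  # struct<>  (not a valid Glue type)
print(glue_type({"type": "string"}))  # string
```

Once the field is serialized as a string, the schema entry becomes a plain scalar and the mapping is trivial.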

The workaround is to use a string-typed column, store the value as a JSON string, and then parse and query it in whichever query engine you use, for example via the AWS Athena support for querying JSON data.
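As a rough Python analogue of that round trip (the sample payload and its keys are made up for illustration), the value is dumped to a string on write and parsed again on read, which is the step the query engine would perform:

```python
import json

# Hypothetical payload for an arbitrary-typed field such as `unmapped`.
unmapped = {"vendor_field": {"score": 7}, "tags": ["a", "b"]}

# On write: the column holds a plain JSON string.
column_value = json.dumps(unmapped)
print(type(column_value).__name__)  # str

# On read: parse the string, then pick out nested values.
parsed = json.loads(column_value)
print(parsed["vendor_field"]["score"])  # 7
```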

This is an example of how I add this serialization typing in a subclass for BaseEvent:

import json
from typing import Optional

from pydantic import field_serializer

from py_ocsf_models.events.base_event import BaseEvent
from py_ocsf_models.objects.enrichment import Enrichment


def jsonable_object_serializer(value: object) -> str:
    """
    Serialize a JSON-able field that contains a type of `object` to `str` by dumping as JSON.

    For fields that contain a type of 'object' a reasonable schema for conversion to AWS Glue column
    type definitions cannot be provided by Pydantic, which when providing a JSON schema will use an
    entry of type "object" but with no "properties" key, which if we convert to Glue schema, will
    type as `struct<>` which is not valid.

    The workaround is to use a string typed column, and store as a string, and then parse and query
    the JSON in the query engine you use, such as the AWS Athena support for querying JSON data.

    References:
    - https://docs.aws.amazon.com/athena/latest/ug/querying-JSON.html
    - https://repost.aws/questions/QU0CQ6q_tkSwGCd_vQ36M0TA/best-glue-catalog-table-column-type-to-store-variable-json-docs
    """
    return json.dumps(value)


class OCSFEnrichment(Enrichment):

    @field_serializer("data")
    def data_serializer(self, value: object) -> str:
        return jsonable_object_serializer(value)


class OCSFBaseEvent(BaseEvent):
    enrichments: Optional[list[OCSFEnrichment]]

    @field_serializer("unmapped")
    def unmapped_serializer(self, value: object) -> str:
        return jsonable_object_serializer(value)

When the JSON Schema is then generated, dumped, and processed by pydantic-glue, it returns the Glue column types as desired:

schema = json.dumps(
    OCSFBaseEvent.model_json_schema(
        mode="serialization",
    ),
)

import pydantic_glue
import pprint

pprint.pprint(pydantic_glue.convert(schema))

[('enrichments',
  'array<struct<data:string,name:string,provider:string,type:string,value:string>>'),
 ('message', 'string'),
 ('metadata',
  'struct<correlation_uid:string,event_code:string,uid:string,labels:array<string>,log_level:string,log_name:string,log_provider:string,log_version:string,logged_time:timestamp,loggers:array<struct<device:struct<uid_alt:string,autoscale_uid:string,is_compliant:boolean,created_time:timestamp,desc:string,domain:string,first_seen_time:timestamp,location:struct<city:string,continent:string,coordinates:array<float>,country:string,desc:string,isp:string,is_on_premises:boolean,postal_code:string,provider:string,region:string>,groups:array<struct<type:string,desc:string,domain:string,name:string,privileges:array<string>,uid:string>>,hw_info:struct<bios_date:string,bios_manufacturer:string,bios_ver:string,cpu_bits:int,cpu_cores:int,cpu_count:int,chassis:string,desktop_display:string,keyboard_info:string,cpu_speed:int,cpu_type:string,ram_size:int,serial_number:string>,hostname:string,hypervisor:string,imei:string,ip:string,image:struct<tag:string,labels:array<string>,name:string,path:string,uid:string>,instance_uid:string,last_seen_time:timestamp,mac:string,is_managed:boolean,modified_time:timestamp,name:string,interface_uid:string,interface_name:string,network_interfaces:array<struct<hostname:string,ip:string,mac:string,name:string,namespace:string,subnet_prefix:int,type:string,type_id:int,uid:string>>,zone:string,os:struct<cpu_bits:int,country:string,lang:string,name:string,build:string,edition:string,sp_name:string,sp_ver:int,cpe_name:string,type:string,type_id:int,version:string>,org:struct<name:string,ou_uid:string,ou_name:string,uid:string>,is_personal:boolean,region:string,risk_level:string,risk_level_id:int,risk_score:int,subnet:string,subnet_uid:string,is_trusted:boolean,type:string,type_id:int,uid:string,vlan_uid:string,vpc_uid:string>,log_level:string,log_name:string,log_provider:string,log_version:string,logged_time:timestamp,name:string,product:struct<feature:struct<name:string,uid:string,version:string>,lang:string,name:string,path:string,cpe_name:string,url_string:string,uid:string,vendor_name:string,version:string>,transmit_time:timestamp,uid:string,version:string>>,modified_time:timestamp,original_time:string,processed_time:timestamp,product:struct<feature:struct<name:string,uid:string,version:string>,lang:string,name:string,path:string,cpe_name:string,url_string:string,uid:string,vendor_name:string,version:string>,profiles:array<string>,extensions:array<struct<name:string,uid:string,version:string>>,sequence:int,tenant_uid:string,version:string>'),
 ('observables',
  'array<struct<name:string,reputation:struct<provider:string,base_score:float,score:string,score_id:int>,type:string,type_id:int,value:string>>'),
 ('raw_data', 'string'),
 ('severity_id', 'int'),
 ('severity', 'string'),
 ('status', 'string'),
 ('status_code', 'string'),
 ('status_detail', 'string'),
 ('status_id', 'int'),
 ('unmapped', 'string')]

Is this something that you'd consider supporting in py-ocsf-models?

Further note: Pydantic's field_serializer supports a return_type argument, so it should be possible to make this behaviour optional, and controlled by an environment variable (say PY_OCSF_MODELS_OBJECT_SERIALIZATION_FORMAT=dumped_json) set before import if the existing behaviour is relied upon.
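A minimal sketch of how that opt-in switch might look, using only the standard library. The helper names and the "native" default value are assumptions for illustration; only the environment variable name and the dumped_json value come from the note above. In the models themselves, the same flag could also choose the return_type passed to field_serializer (str when dumping, the original annotation otherwise).

```python
import json
import os

# Variable name as suggested above; "native" as the default is an assumption.
ENV_VAR = "PY_OCSF_MODELS_OBJECT_SERIALIZATION_FORMAT"


def object_serialization_format(environ=None) -> str:
    """Read the requested format, falling back to the existing behaviour."""
    environ = os.environ if environ is None else environ
    return environ.get(ENV_VAR, "native")


def serialize_object(value: object, fmt: str) -> object:
    """Dump to a JSON string only when the opt-in format is selected."""
    if fmt == "dumped_json":
        return json.dumps(value)
    return value


# Opted in: the field is emitted as a JSON string.
print(serialize_object({"a": 1}, object_serialization_format({ENV_VAR: "dumped_json"})))  # {"a": 1}
# Not set: the value passes through unchanged.
print(serialize_object({"a": 1}, object_serialization_format({})))  # {'a': 1}
```

Reading the variable once at import time would keep the generated schema stable for the lifetime of the process, which is why the note suggests setting it before import.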
