
Validate Business Role yaml using pydantic #161

Draft: kokorin wants to merge 1 commit into master from feature/pydantic-based-configuration
Conversation

@kokorin commented Dec 4, 2024

No description provided.

@kokorin force-pushed the feature/pydantic-based-configuration branch from 2bebf90 to a38b9a8 on December 4, 2024 18:56
@littleK0i (Owner)
Ok, I've looked into the differences between jsonschema and the current pydantic validation, and I've also checked comments from pydantic contributors on this. It seems jsonschema is still a slightly better option for this specific use case.

Reasons:

(1) Nested structures
Pydantic is "flat". When we have to define a nested structure for Pydantic, it means we have to define multiple models in reverse order, from inner models to outer models.

JSON schema is "nested" by design. All layers are defined in the same data structure and in natural order.

We have a lot of nested and complex structures in other configs, e.g. TABLE: https://github.com/littleK0i/SnowDDL/blob/master/snowddl/parser/table.py#L34-L226
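For illustration, here is a minimal jsonschema sketch (invented for this comment, not the actual TABLE schema) showing how all nesting levels live inline in one data structure, in natural outer-to-inner order:

```python
# Illustrative only: a nested schema declared inline in natural order.
# No separate model class is needed for each nesting level.
from jsonschema import validate

table_schema = {
    "type": "object",
    "properties": {
        "columns": {
            "type": "object",
            "additionalProperties": {
                "type": "object",
                "properties": {
                    "type": {"type": "string"},
                    "comment": {"type": "string"},
                },
                "required": ["type"],
            },
        },
    },
    "required": ["columns"],
}

# Passes validation (validate() returns None on success)
validate({"columns": {"id": {"type": "NUMBER(38,0)"}}}, table_schema)
```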

(2) Conditional keywords
JSON schema has some nice built-in keywords: allOf, anyOf, oneOf. These keywords can be used anywhere in the schema. JSON schema also has a natural dependentRequired construct.

As far as I know, implementing similar logic with pydantic would require custom code and manual exception handling, similar to how you implemented XOR for technical_roles / tech_roles.
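For context, this is roughly what such manual XOR handling could look like in pydantic v2. The field names are borrowed from the business role config; the validator itself is a hypothetical sketch, not the actual implementation from this PR:

```python
# Hypothetical sketch of an XOR constraint in pydantic v2; this is the
# manual validator code that jsonschema's oneOf expresses declaratively.
from typing import Optional
from pydantic import BaseModel, ValidationError, model_validator

class BusinessRole(BaseModel):
    technical_roles: Optional[list[str]] = None
    tech_roles: Optional[list[str]] = None

    @model_validator(mode="after")
    def check_xor(self):
        # Exactly one of the two aliases must be provided
        if (self.technical_roles is None) == (self.tech_roles is None):
            raise ValueError("exactly one of technical_roles / tech_roles is required")
        return self

BusinessRole(tech_roles=["ROLE_A"])  # passes validation
```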

(3) Used for validation only
Using pydantic for blueprints is a no-brainer, since blueprint objects are very widely used in the codebase. But YAML config validations are usually defined and used only once in the same parser.


Currently I would prefer to stay on jsonschema, since changing it seems to have low or even slightly negative ROI, unless we find a genuinely better data structure validation library.

@kokorin (Author) commented Dec 5, 2024

(1) Nested structures
Pydantic is "flat". When we have to define a nested structure for Pydantic, it means we have to define multiple models in reverse order, from inner models to outer models.

In the table parser, the JSON schema takes 190 lines; equivalent Python dataclasses would take fewer lines and be clearer to developers who are not very proficient with JSON Schema.

(2) Conditional keywords
JSON schema has some nice built-in keywords: allOf, anyOf, oneOf. These keywords can be used anywhere in the schema. JSON schema also has a natural dependentRequired construct.

Defining a dataclass property as a Union type like Union[str, ColumnSpec], or even Union[ColumnName, ColumnSpec] (where ColumnName is configured as an alias of str), would make the intent much clearer. The generated JSON Schema will contain an anyOf.

Here is an example:

import json
from dataclasses import dataclass
from typing import Annotated, Literal, Optional

from pydantic import Field, TypeAdapter

ColumnName = Annotated[str, Field(min_length=1, description="Column name")]
VarcharType = Annotated[str, Field(pattern=r"varchar\(\d+\, ?\d+\)")]
BoolType = Literal["bool"]
ColumnType = VarcharType | BoolType  # other types omitted

@dataclass
class ColumnSpec:
    name: ColumnName
    type: ColumnType
    comment: Optional[str] = None

TableName = Annotated[str, Field(min_length=1)]

@dataclass
class Table:
    name: TableName
    columns: list[ColumnName | ColumnSpec] = Field(min_length=1)

ta = TypeAdapter(Table)
print(json.dumps(ta.json_schema(), indent=2))

Gives:

{
  "$defs": {
    "ColumnSpec": {
      "properties": {
        "name": {
          "description": "Column name",
          "minLength": 1,
          "title": "Name",
          "type": "string"
        },
        "type": {
          "anyOf": [
            {
              "pattern": "varchar\\(\\d+\\, ?\\d+\\)",
              "type": "string"
            },
            {
              "const": "bool",
              "type": "string"
            }
          ],
          "title": "Type"
        },
        "comment": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Comment"
        }
      },
      "required": [
        "name",
        "type"
      ],
      "title": "ColumnSpec",
      "type": "object"
    }
  },
  "properties": {
    "name": {
      "minLength": 1,
      "title": "Name",
      "type": "string"
    },
    "columns": {
      "items": {
        "anyOf": [
          {
            "description": "Column name",
            "minLength": 1,
            "type": "string"
          },
          {
            "$ref": "#/$defs/ColumnSpec"
          }
        ]
      },
      "minItems": 1,
      "title": "Columns",
      "type": "array"
    }
  },
  "required": [
    "name",
    "columns"
  ],
  "title": "Table",
  "type": "object"
}

(3) Used for validation only
Using pydantic for blueprints is a no-brainer, since blueprint objects are very widely used in the codebase. But YAML config validations are usually defined and used only once in the same parser.

Agreed; after the recent changes the parsers became much simpler. My last argument in defense of dataclasses is that property names are defined as strings in a manually written JSON schema, and the Blueprint constructor also uses strings to access parsed properties. Having dataclasses would reveal potential typos.
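As a hypothetical illustration (all names invented for this sketch): a typo in a string key fails silently, while a typo in a dataclass attribute raises immediately:

```python
# Hypothetical sketch: attribute access on a dataclass fails loudly on a
# typo, while string-keyed access on a parsed dict just returns None.
from dataclasses import dataclass

@dataclass
class BusinessRoleParams:
    tech_roles: list[str]

parsed = {"tech_roles": ["ROLE_A"]}

parsed.get("tech_rolse")  # typo in string key: silently yields None
params = BusinessRoleParams(tech_roles=parsed["tech_roles"])
try:
    params.tech_rolse     # typo in attribute: raises AttributeError
except AttributeError:
    pass
```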
