
Validate Business Role yaml using pydantic #161

Draft: kokorin wants to merge 1 commit into master from feature/pydantic-based-configuration
Conversation

@kokorin commented Dec 4, 2024

No description provided.

@kokorin force-pushed the feature/pydantic-based-configuration branch from 2bebf90 to a38b9a8 on December 4, 2024 18:56
@littleK0i (Owner)
Ok, I've looked into the differences between jsonschema and the current pydantic validation, and I've also checked comments from pydantic contributors on this. It seems jsonschema is still a slightly better option for this specific use case.

Reasons:

(1) Nested structures
Pydantic is "flat". When we have to define a nested structure for Pydantic, it means we have to define multiple models in reverse order, from inner models to outer models.

JSON schema is "nested" by design. All layers are defined in the same data structure and in natural order.

We have a lot of nested and complex structures in other configs, e.g. TABLE: https://github.com/littleK0i/SnowDDL/blob/master/snowddl/parser/table.py#L34-L226
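For illustration, here is a minimal jsonschema sketch (invented for this comment, not the actual TABLE schema) showing how all nesting levels live inline in one data structure, in natural outer-to-inner order:

```python
# Illustrative only: a nested schema declared inline in natural order.
# No separate model class is needed for each nesting level.
from jsonschema import validate

table_schema = {
    "type": "object",
    "properties": {
        "columns": {
            "type": "object",
            "additionalProperties": {
                "type": "object",
                "properties": {
                    "type": {"type": "string"},
                    "comment": {"type": "string"},
                },
                "required": ["type"],
            },
        },
    },
    "required": ["columns"],
}

# Passes validation (validate() returns None on success)
validate({"columns": {"id": {"type": "NUMBER(38,0)"}}}, table_schema)
```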

(2) Conditional keywords
JSON schema has some nice built-in keywords: allOf, anyOf, oneOf. These keywords can be used anywhere in the schema. JSON schema also has a natural dependentRequired construct.

As far as I know, implementing similar logic with pydantic would require custom code and manual exception handling, similar to how you implemented XOR for technical_roles / tech_roles.
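For context, this is roughly what such manual XOR handling could look like in pydantic v2. The field names are borrowed from the business role config; the validator itself is a hypothetical sketch, not the actual implementation from this PR:

```python
# Hypothetical sketch of an XOR constraint in pydantic v2; this is the
# manual validator code that jsonschema's oneOf expresses declaratively.
from typing import Optional
from pydantic import BaseModel, ValidationError, model_validator

class BusinessRole(BaseModel):
    technical_roles: Optional[list[str]] = None
    tech_roles: Optional[list[str]] = None

    @model_validator(mode="after")
    def check_xor(self):
        # Exactly one of the two aliases must be provided
        if (self.technical_roles is None) == (self.tech_roles is None):
            raise ValueError("exactly one of technical_roles / tech_roles is required")
        return self

BusinessRole(tech_roles=["ROLE_A"])  # passes validation
```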

(3) Used for validation only
Using pydantic for blueprints is a no-brainer, since blueprint objects are very widely used in the codebase. But YAML config validations are usually defined and used only once in the same parser.


Currently I would prefer to stay on jsonschema, since changing it seems to have low or even slightly negative ROI, unless we find a genuinely better data structure validation library.

@kokorin (Author) commented Dec 5, 2024

(1) Nested structures
Pydantic is "flat". When we have to define a nested structure for Pydantic, it means we have to define multiple models in reverse order, from inner models to outer models.

In the table parser, the JSON schema takes 190 lines; equivalent Python dataclasses would take fewer lines and be clearer to developers who are not very proficient with JSON Schema.

(2) Conditional keywords
JSON schema has some nice built-in keywords: allOf, anyOf, oneOf. These keywords can be used anywhere in the schema. JSON schema also has a natural dependentRequired construct.

Defining a dataclass property as a Union type like Union[str, ColumnSpec], or even Union[ColumnName, ColumnSpec] (where ColumnName is configured as an alias of str), would make the intent much clearer. The generated JSON Schema will contain an anyOf.

Here is an example:

import json
from dataclasses import dataclass
from typing import Annotated, Literal, Optional

from pydantic import Field, TypeAdapter

ColumnName = Annotated[str, Field(min_length=1, description="Column name")]
VarcharType = Annotated[str, Field(pattern=r"varchar\(\d+\, ?\d+\)")]
BoolType = Literal["bool"]
ColumnType = VarcharType | BoolType  # other types omitted

@dataclass
class ColumnSpec:
    name: ColumnName
    type: ColumnType
    comment: Optional[str] = None

TableName = Annotated[str, Field(min_length=1)]

@dataclass
class Table:
    name: TableName
    columns: list[ColumnName | ColumnSpec] = Field(min_length=1)

ta = TypeAdapter(Table)
print(json.dumps(ta.json_schema(), indent=2))

Gives:

{
  "$defs": {
    "ColumnSpec": {
      "properties": {
        "name": {
          "description": "Column name",
          "minLength": 1,
          "title": "Name",
          "type": "string"
        },
        "type": {
          "anyOf": [
            {
              "pattern": "varchar\\(\\d+\\, ?\\d+\\)",
              "type": "string"
            },
            {
              "const": "bool",
              "type": "string"
            }
          ],
          "title": "Type"
        },
        "comment": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Comment"
        }
      },
      "required": [
        "name",
        "type"
      ],
      "title": "ColumnSpec",
      "type": "object"
    }
  },
  "properties": {
    "name": {
      "minLength": 1,
      "title": "Name",
      "type": "string"
    },
    "columns": {
      "items": {
        "anyOf": [
          {
            "description": "Column name",
            "minLength": 1,
            "type": "string"
          },
          {
            "$ref": "#/$defs/ColumnSpec"
          }
        ]
      },
      "minItems": 1,
      "title": "Columns",
      "type": "array"
    }
  },
  "required": [
    "name",
    "columns"
  ],
  "title": "Table",
  "type": "object"
}

(3) Used for validation only
Using pydantic for blueprints is a no-brainer, since blueprint objects are very widely used in the codebase. But YAML config validations are usually defined and used only once in the same parser.

Agreed; after the recent changes the parsers became much simpler. My last argument in defense of dataclasses is that property names are defined as strings in a manually written JSON schema, and the Blueprint constructor also uses strings to access parsed properties. Having dataclasses would reveal potential typos.
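As a hypothetical illustration (all names invented for this sketch): a typo in a string key fails silently, while a typo in a dataclass attribute raises immediately:

```python
# Hypothetical sketch: attribute access on a dataclass fails loudly on a
# typo, while string-keyed access on a parsed dict just returns None.
from dataclasses import dataclass

@dataclass
class BusinessRoleParams:
    tech_roles: list[str]

parsed = {"tech_roles": ["ROLE_A"]}

parsed.get("tech_rolse")  # typo in string key: silently yields None
params = BusinessRoleParams(tech_roles=parsed["tech_roles"])
try:
    params.tech_rolse     # typo in attribute: raises AttributeError
except AttributeError:
    pass
```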
