Declarative schema definition outside Python code
Problem
The Dataface YAML spec is implicitly defined by Python Pydantic models spread across `types.py`, `compiled_types.py`, and `query_types.py`. There is no single, language-agnostic document that says "this is what valid Dataface YAML looks like." This creates multiple problems:
- No versioning anchor: Without a formal schema, there's no way to diff schema versions or write automated migrations (see task-m2-yaml-version-migrations).
- Hand-maintained AI prompts: `schema.py` contains hand-written markdown summaries that drift from the actual types.
- No editor integration: VS Code/Cursor can't provide YAML autocompletion or validation without a JSON Schema or equivalent.
- No extensibility contract: Users can't register custom chart types because there's no schema to extend.
- No external tooling: Non-Python tools (IDE extensions, CI validators, documentation generators) can't consume the spec definition.
json-render solved this with defineCatalog — a single typed definition that powers validation, AI prompts, editor tooling, and the renderer simultaneously.
Context
- Research origin: `ai_notes/research/json-render-deep-dive.md` — "Typed Catalog with Validation" section.
- Current Pydantic types: `dataface/core/compile/types.py` (input types), `compiled_types.py` (output types), `query_types.py`, `source_types.py`
- Hand-maintained schema summary: `dataface/core/compile/schema.py`
- Normalizer: `dataface/core/compile/normalizer.py` — the compilation pipeline that enforces the schema
- Pairs with: `task-m2-yaml-version-migrations` — versioning needs a formal schema to diff against
- Enables: `schema-derived-ai-prompts-from-compiled-types` — auto-generated prompts from the schema
- Enables: `extensible-schema-with-custom-elements-and-chart-types` — extensibility needs a schema contract to extend
Possible Solutions
A. JSON Schema generated from Pydantic models
Use Pydantic's built-in `model_json_schema()` to export JSON Schema, then treat the exported schema as the canonical artifact. Enrich it with `json_schema_extra` for AI descriptions and editor hints.
Pros: Minimal new code. Pydantic does the heavy lifting. JSON Schema is widely supported (VS Code YAML extension, ajv, etc.). Can start immediately. Cons: JSON Schema is verbose and hard to read. The "source of truth" is still the Python code — the schema is a derived artifact. Some Dataface concepts (layout unions, chart type polymorphism) are awkward in JSON Schema.
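A minimal sketch of Option A, assuming Pydantic v2. The `Chart` model, its fields, and the `ai_hint` key are illustrative stand-ins for the real types in `dataface/core/compile/types.py`, not the actual Dataface definitions:

```python
import json
from typing import Literal

from pydantic import BaseModel, Field


class Chart(BaseModel):
    # json_schema_extra entries are merged into the exported field schema,
    # which is where AI descriptions and editor hints would live.
    type: Literal["bar", "line", "area"] = Field(
        description="Chart visualization type",
        json_schema_extra={"ai_hint": "Prefer 'bar' for categorical x-axes"},
    )
    query: str = Field(description="Query name or inline query definition")


# The exported JSON Schema becomes the canonical artifact; the VS Code
# YAML extension can consume it via a $schema association.
schema = Chart.model_json_schema()
print(json.dumps(schema["properties"]["type"], indent=2))
```

The exported `properties.type` entry carries the enum values, the description, and the custom `ai_hint` key side by side, so one artifact serves validation, editors, and prompt generation.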
B. Custom YAML schema format (Recommended)
Define a Dataface-native schema format in YAML that describes all element types, their fields, types, defaults, descriptions, and constraints. Generate Pydantic models and JSON Schema from this definition.
```yaml
version: "1.0"
elements:
  chart:
    description: "A data visualization"
    fields:
      type:
        type: enum
        values: [bar, line, area, scatter, pie, donut, kpi, table, ...]
        description: "Chart visualization type"
      query:
        type: string | inline_query
        required: true
        description: "Query name or inline query definition"
      x:
        type: field_name
        description: "Field for x-axis encoding"
      y:
        type: field_name
        description: "Field for y-axis encoding"
      # ...
```
Pros: Human-readable. Easy to version/diff. Can include AI-specific metadata (descriptions, examples, common mistakes). Single source of truth. Generates both Python models and JSON Schema. Cons: Requires building a schema-to-code generator. New abstraction to maintain. Must be kept in sync — but that's the point (it IS the source, code is derived).
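The schema-to-code generator this option requires could be quite small. Below is a hedged sketch, assuming Pydantic v2 and PyYAML; the field-type vocabulary (`enum`, `field_name`) and the `build_model` helper are hypothetical, not an existing Dataface API:

```python
from typing import Literal, Optional

import yaml
from pydantic import Field, ValidationError, create_model

# A tiny slice of the proposed YAML schema format, inlined for the sketch.
SCHEMA_YAML = """
version: "1.0"
elements:
  chart:
    fields:
      type:
        type: enum
        values: [bar, line, area]
        required: true
        description: "Chart visualization type"
      x:
        type: field_name
        description: "Field for x-axis encoding"
"""


def build_model(name: str, spec: dict):
    """Generate a Pydantic model from one element definition."""
    fields = {}
    for fname, fspec in spec["fields"].items():
        if fspec["type"] == "enum":
            annotation = Literal[tuple(fspec["values"])]
        else:
            annotation = str  # field_name, string, ... all map to str here
        if fspec.get("required"):
            default = ...  # Ellipsis marks the field as required
        else:
            default = None
            annotation = Optional[annotation]
        fields[fname] = (annotation, Field(default, description=fspec.get("description")))
    return create_model(name.capitalize(), **fields)


defn = yaml.safe_load(SCHEMA_YAML)
Chart = build_model("chart", defn["elements"]["chart"])
chart = Chart(type="bar", x="region")  # validates against the YAML-defined schema
```

The same traversal that emits `create_model` calls can emit JSON Schema, so both derived artifacts stay mechanically in sync with the YAML source.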
C. Pydantic as source of truth, with enriched metadata
Keep Pydantic models as the canonical definition, but add rich `Field(description=..., json_schema_extra=...)` metadata. Export JSON Schema for editors, and export AI prompts via introspection.
Pros: No new format. Incremental improvement. Metadata lives next to the code it describes. Cons: Python remains the source of truth — non-Python tools must parse exported JSON Schema. Extensibility is harder (users would need to write Python to extend).
Plan
- Choose the canonical schema format and the minimum metadata it must carry.
- Map the current compile types into that schema with a small representative slice first.
- Prove generation of at least one downstream artifact such as JSON Schema or model code.
- Define migration and rollout steps for replacing handwritten schema summaries.
Implementation Progress
Review Feedback
- [ ] Review cleared