Dataface Tasks

Declarative schema definition outside Python code

ID: DFT_CORE-DECLARATIVE_SCHEMA_DEFINITION_OUTSIDE_PYTHON_CODE
Status: not_started
Priority: p1
Milestone: m3-public-launch
Owner: sr-engineer-architect

Problem

The Dataface YAML spec is implicitly defined by Python Pydantic models spread across types.py, compiled_types.py, and query_types.py. There is no single, language-agnostic document that says "this is what valid Dataface YAML looks like." This creates multiple problems:

  1. No versioning anchor: Without a formal schema, there's no way to diff schema versions or write automated migrations (see task-m2-yaml-version-migrations).
  2. Hand-maintained AI prompts: schema.py contains hand-written markdown summaries that drift from the actual types.
  3. No editor integration: VS Code/Cursor can't provide YAML autocompletion or validation without a JSON Schema or equivalent.
  4. No extensibility contract: Users can't register custom chart types because there's no schema to extend.
  5. No external tooling: Non-Python tools (IDE extensions, CI validators, documentation generators) can't consume the spec definition.

json-render solved this with defineCatalog — a single typed definition that powers validation, AI prompts, editor tooling, and the renderer simultaneously.

Context

Possible Solutions

A. JSON Schema generated from Pydantic models

Use Pydantic's built-in model_json_schema() to export JSON Schema, then treat the exported schema as the canonical artifact. Enrich with json_schema_extra for AI descriptions and editor hints.

Pros: Minimal new code. Pydantic does the heavy lifting. JSON Schema is widely supported (VS Code YAML extension, ajv, etc.). Can start immediately.
Cons: JSON Schema is verbose and hard to read. The "source of truth" is still the Python code — the schema is a derived artifact. Some Dataface concepts (layout unions, chart type polymorphism) are awkward to express in JSON Schema.
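A minimal sketch of option A, assuming Pydantic v2; the `Chart` model here is a hypothetical stand-in for the real models in types.py:

```python
# Sketch of option A, assuming Pydantic v2. "Chart" is a hypothetical
# stand-in for the real models in types.py.
from pydantic import BaseModel, Field

class Chart(BaseModel):
    type: str = Field(
        description="Chart visualization type",
        # json_schema_extra is merged into the exported schema, so editor
        # and AI hints travel with the field definition.
        json_schema_extra={"examples": ["bar", "line"]},
    )
    query: str = Field(description="Query name or inline query definition")

# The exported dict can be dumped to a .json file and consumed by the
# VS Code YAML extension, ajv, or any other JSON Schema tooling.
schema = Chart.model_json_schema()
```

The descriptions and extras land directly in the exported schema, so one annotation feeds validation, editor hints, and AI prompts at once.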

B. Dataface-native schema format in YAML

Define a Dataface-native schema format in YAML that describes all element types, their fields, types, defaults, descriptions, and constraints. Generate Pydantic models and JSON Schema from this definition.

version: "1.0"
elements:
  chart:
    description: "A data visualization"
    fields:
      type:
        type: enum
        values: [bar, line, area, scatter, pie, donut, kpi, table, ...]
        description: "Chart visualization type"
      query:
        type: string | inline_query
        required: true
        description: "Query name or inline query definition"
      x:
        type: field_name
        description: "Field for x-axis encoding"
      y:
        type: field_name
        description: "Field for y-axis encoding"
      # ...

Pros: Human-readable. Easy to version/diff. Can include AI-specific metadata (descriptions, examples, common mistakes). Single source of truth. Generates both Python models and JSON Schema.
Cons: Requires building a schema-to-code generator. New abstraction to maintain. Must be kept in sync — but that's the point (it IS the source; code is derived).

C. Pydantic as source of truth, with enriched metadata

Keep Pydantic models as the canonical definition, but add rich Field(description=..., json_schema_extra=...) metadata. Export JSON Schema for editors, export AI prompts via introspection.

Pros: No new format. Incremental improvement. Metadata lives next to the code it describes.
Cons: Python remains the source of truth — non-Python tools must parse exported JSON Schema. Extensibility is harder (users would need to write Python to extend).

Plan

  1. Choose the canonical schema format and the minimum metadata it must carry.
  2. Map the current compile types into that schema with a small representative slice first.
  3. Prove generation of at least one downstream artifact such as JSON Schema or model code.
  4. Define migration and rollout steps for replacing handwritten schema summaries.

Implementation Progress

Review Feedback

  • [ ] Review cleared