Dataface Tasks

dft inspect native CSV support via ephemeral DuckDB

ID: CONTEXT_CATALOG_NIMBLE-DFT_INSPECT_NATIVE_CSV_SUPPORT_VIA_EPHEMERAL_DUCKDB
Status: completed
Priority: p1
Milestone: m1-ft-analytics-analyst-pilot
Owner: data-ai-engineer-architect
Completed by: dave
Completed: 2026-03-22

Problem

dft inspect only works on SQL databases. Users with CSV sources (common for non-dbt projects, prototyping, and the playground examples) cannot profile their data at all. The CsvAdapter uses Python's csv.DictReader — completely separate from the inspect pipeline.

Context

  • InspectConnection supports: DuckDB, PostgreSQL, SQLite, BigQuery, Snowflake, Databricks, Redshift, MySQL, SQL Server
  • CsvAdapter (dataface/core/execute/adapters/csv_adapter.py) reads CSVs with stdlib, no SQL
  • DuckDB can natively read CSVs: SELECT * FROM 'file.csv' or CREATE TABLE t AS SELECT * FROM read_csv_auto('file.csv')
  • DuckDB is already a dependency (used for dbt projects and as default dialect)
  • CSV sources are defined in dataface.yml or _sources.yaml with type: csv and file: path
  • The playground examples are all CSV-based today
  • Depends on: complements the dft inspect catalog builder task

Possible Solutions

A. Load CSVs into an ephemeral in-memory DuckDB

When dft inspect encounters a CSV source, spin up a :memory: DuckDB connection, load the CSV with read_csv_auto(), and profile via the existing DuckDB inspector path. No persistent DuckDB file is created — the CSV remains the source of truth.

  • Pros: Zero config, reuses existing DuckDB inspector, no new files to manage
  • Cons: Re-reads the CSV on every inspect run (fine — CSVs are small, and the incremental inspect task handles caching)

B. Require users to load CSVs into DuckDB first

  • Pros: Simple
  • Cons: Bad UX, extra manual step, defeats the purpose

Plan

Approach A.

  1. Detect CSV sources — when dft inspect discovers sources from dataface.yml / _sources.yaml, identify type: csv entries
  2. Create ephemeral DuckDB — duckdb.connect(':memory:'), load each CSV via CREATE TABLE {name} AS SELECT * FROM read_csv_auto('{path}')
  3. Profile via existing path — pass the DuckDB connection to TableInspector with dialect='duckdb'
  4. Store in inspect.json — same format as any other table, with source metadata indicating CSV origin
  5. Also support dft inspect table file.csv — direct CSV path as argument, auto-detect as CSV by extension
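
Steps 1 and 5 amount to a small detection-and-routing layer. A sketch using the names from this plan (route_inspect and its string return values are hypothetical stand-ins for the real CLI wiring):

```python
from pathlib import Path


def is_csv_path(table_arg: str) -> bool:
    """Treat a table argument as a direct CSV source when it ends in .csv."""
    return Path(table_arg).suffix.lower() == ".csv"


def route_inspect(table_arg: str) -> str:
    # Hypothetical dispatcher: the real command would build a TableInspector
    # via from_csv() or the normal constructor instead of returning a label.
    if is_csv_path(table_arg):
        if not Path(table_arg).exists():
            raise FileNotFoundError(f"CSV source not found: {table_arg}")
        return "csv"
    return "sql"
```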

Files to modify:

  • dataface/core/inspect/inspector.py — CSV detection + ephemeral DuckDB setup
  • dataface/cli/commands/inspect.py — accept CSV file paths as arguments
  • dataface/core/inspect/connection.py — possibly no changes if DuckDB path is reused

Implementation Progress

M1: Core CSV inspect via ephemeral DuckDB (done)

  • [x] is_csv_path(table_arg) helper in inspector.py — detects .csv extension
  • [x] TableInspector.from_csv(path) classmethod — creates :memory: DuckDB, loads CSV via read_csv_auto(), returns inspector ready for profiling
  • [x] CLI routing in inspect_command() — when table arg is an existing .csv file, uses from_csv instead of normal TableInspector constructor
  • [x] Validation: FileNotFoundError for missing files, ValueError for non-CSV extensions
  • [x] 15 tests in tests/core/test_inspect_csv.py covering factory, full profile pipeline, path detection, CLI routing
  • [x] No changes needed to connection.py — reuses existing DuckDB path entirely

Key decision: from_csv is a classmethod factory (like from_bigquery_client) rather than a modification to __init__. This keeps the constructor clean and the CSV concern isolated.

Files changed:

  • dataface/core/inspect/inspector.py — added is_csv_path() + TableInspector.from_csv()
  • dataface/cli/commands/inspect.py — CSV auto-detection in inspect_command()
  • tests/core/test_inspect_csv.py — new test file (15 tests)

Review Feedback

  • [ ] Review cleared