dft inspect native CSV support via ephemeral DuckDB
Problem
dft inspect only works on SQL databases. Users with CSV sources (common for non-dbt projects, prototyping, and the playground examples) cannot profile their data at all. The CsvAdapter uses Python's csv.DictReader — completely separate from the inspect pipeline.
Context
- InspectConnection supports: DuckDB, PostgreSQL, SQLite, BigQuery, Snowflake, Databricks, Redshift, MySQL, SQL Server
- CsvAdapter (dataface/core/execute/adapters/csv_adapter.py) reads CSVs with stdlib, no SQL
- DuckDB can natively read CSVs: SELECT * FROM 'file.csv' or CREATE TABLE t AS SELECT * FROM read_csv_auto('file.csv')
- DuckDB is already a dependency (used for dbt projects and as default dialect)
- CSV sources are defined in dataface.yml or _sources.yaml with type: csv and a file: path
- The playground examples are all CSV-based today
- Depends on: complements the dft inspect catalog builder task
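As an illustration of how type: csv entries might be picked out of an already-parsed config, here is a minimal sketch. The top-level sources mapping, the find_csv_sources name, and the choice to resolve file: relative to the config's directory are all assumptions made for the example; the real dataface.yml / _sources.yaml schema may differ.

```python
from pathlib import Path
from typing import Any


def find_csv_sources(config: dict[str, Any], base_dir: Path) -> dict[str, Path]:
    """Return {source_name: csv_path} for every source declared with type: csv.

    Hypothetical helper: assumes sources live under a top-level "sources"
    mapping and that "file" is relative to the directory holding the config.
    """
    csv_sources: dict[str, Path] = {}
    for name, spec in (config.get("sources") or {}).items():
        # Skip anything that is not a mapping with type: csv and a file: entry.
        if isinstance(spec, dict) and spec.get("type") == "csv" and "file" in spec:
            csv_sources[name] = base_dir / spec["file"]
    return csv_sources
```

Entries of any other type are ignored, so SQL-backed sources would keep going through the existing inspect path.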
Possible Solutions
A. Ephemeral in-memory DuckDB — Recommended
When dft inspect encounters a CSV source, spin up a :memory: DuckDB connection, load the CSV with read_csv_auto(), and profile via the existing DuckDB inspector path. No persistent DuckDB file created — the CSV remains the source of truth.
- Pros: Zero config, reuses existing DuckDB inspector, no new files to manage
- Cons: Re-reads the CSV on every inspect run (acceptable: CSVs are typically small, and the incremental inspect task handles caching)
B. Require users to load CSVs into DuckDB first
- Pros: Simple
- Cons: Bad UX, extra manual step, defeats the purpose
Plan
Approach A.
- Detect CSV sources: when dft inspect discovers sources from dataface.yml / _sources.yaml, identify type: csv entries
- Create ephemeral DuckDB: duckdb.connect(':memory:'), load each CSV via CREATE TABLE {name} AS SELECT * FROM read_csv_auto('{path}')
- Profile via existing path: pass the DuckDB connection to TableInspector with dialect='duckdb'
- Store in inspect.json: same format as any other table, with source metadata indicating CSV origin
- Also support dft inspect table file.csv: direct CSV path as argument, auto-detect as CSV by extension
Files to modify:
- dataface/core/inspect/inspector.py — CSV detection + ephemeral DuckDB setup
- dataface/cli/commands/inspect.py — accept CSV file paths as arguments
- dataface/core/inspect/connection.py — possibly no changes if DuckDB path is reused
Implementation Progress
M1: Core CSV inspect via ephemeral DuckDB (done)
- [x] is_csv_path(table_arg) helper in inspector.py: detects .csv extension
- [x] TableInspector.from_csv(path) classmethod: creates :memory: DuckDB, loads CSV via read_csv_auto(), returns inspector ready for profiling
- [x] CLI routing in inspect_command(): when table arg is an existing .csv file, uses from_csv instead of normal TableInspector constructor
- [x] Validation: FileNotFoundError for missing files, ValueError for non-CSV extensions
- [x] 15 tests in tests/core/test_inspect_csv.py covering factory, full profile pipeline, path detection, CLI routing
- [x] No changes needed to connection.py: reuses existing DuckDB path entirely
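The path detection, validation, and routing checks listed above might look something like the sketch below. Only is_csv_path is a name taken from this task. validate_csv_path and is_existing_csv are hypothetical helpers, and the case-insensitive extension match and the order of the two checks are assumptions.

```python
from pathlib import Path


def is_csv_path(table_arg: str) -> bool:
    """Return True when the argument has a .csv extension (case-insensitive here)."""
    return Path(table_arg).suffix.lower() == ".csv"


def validate_csv_path(path: str) -> Path:
    """Hypothetical helper raising the two errors named in the checklist."""
    csv_path = Path(path)
    if csv_path.suffix.lower() != ".csv":
        raise ValueError(f"Not a CSV file: {path}")
    if not csv_path.is_file():
        raise FileNotFoundError(f"CSV file not found: {path}")
    return csv_path


def is_existing_csv(table_arg: str) -> bool:
    """Routing check: only treat the argument as a CSV if the file really exists."""
    return is_csv_path(table_arg) and Path(table_arg).is_file()
```

Requiring the file to exist before routing means an ordinary table argument such as public.orders, or a table that merely happens to be named something.csv, still reaches the normal TableInspector constructor.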
Key decision: from_csv is a classmethod factory (like from_bigquery_client) rather than a modification to __init__. This keeps the constructor clean and the CSV concern isolated.
Files changed:
- dataface/core/inspect/inspector.py — added is_csv_path() + TableInspector.from_csv()
- dataface/cli/commands/inspect.py — CSV auto-detection in inspect_command()
- tests/core/test_inspect_csv.py — new test file (15 tests)
Review Feedback
- [ ] Review cleared