Data Lake Queries is a data workflow that chains DuckDB and S3 to query Parquet files directly from S3, with no ETL step. Results come back in seconds for ad-hoc analytics. Once configured, it saves ~20 hours/week of analyst wait time plus $5-20k/month in warehouse costs, and it runs through Claude Code, Cursor, Windsurf, or any MCP-compatible AI agent.
None of these MCPs are hosted yet. Install and run the recipe locally:
mcpizy recipe install duckdb-s3-data-lake
S3 stores your Parquet data lake cheaply at any scale; DuckDB queries it with full SQL semantics and columnar performance without a cluster. Together they give you BigQuery-style analytics on your own data without the cost or complexity of a managed warehouse.
Spin up an EMR cluster, wait 10 minutes, run a Spark job, get results, shut down the cluster. Cost: $30. Time: 45 minutes.
DuckDB scans S3 Parquet directly. Same query. 8 seconds. Zero cluster management.
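To make "scans S3 Parquet directly" concrete, here is a minimal sketch of how such a query might be assembled. The helper name `build_scan_query`, the `event_date` column in the usage note, and the choice to enable `hive_partitioning` are illustrative assumptions, not part of the recipe; the default row cap mirrors the LIMIT 10000 guardrail in the prompt.

```python
def build_scan_query(bucket: str, prefix: str, where: str, limit: int = 10_000) -> str:
    """Build a DuckDB SELECT that reads Parquet files straight from S3.

    `where` is a caller-supplied predicate (ideally on a partition or
    time column) so DuckDB can skip files and row groups it does not need.
    """
    # Strip slashes so a prefix like "events/year=2026/" never yields "//".
    path = f"s3://{bucket}/{prefix.strip('/')}/*.parquet"
    return (
        "SELECT * "
        f"FROM read_parquet('{path}', hive_partitioning = true) "
        f"WHERE {where} "
        f"LIMIT {limit}"
    )
```

For example, `build_scan_query("acme-datalake", "events/year=2026/", "event_date >= DATE '2026-01-01'")` produces a single statement that could be passed to a DuckDB connection once the httpfs extension is loaded and S3 credentials are set.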
Concrete ROI — not marketing fluff.
Time saved
~20 hours/week of analyst wait time, plus $5-20k/month saved on warehouse costs
This prompt is the workflow. Paste into Claude Code, Cursor, or Windsurf.
You are a data-lake query agent. Invoked ad-hoc with a natural-language question.
Given a question about data in s3://${S3_BUCKET}/${S3_PREFIX}/:
1. Call aws.s3_list_objects(bucket, prefix) to enumerate Parquet files and their partition keys
2. Call duckdb.execute("DESCRIBE SELECT * FROM read_parquet('s3://bucket/prefix/*.parquet')") to get schema
3. Translate the question into a SQL query with explicit partition predicates to enable pushdown
4. Call duckdb.execute(sql) — ensure httpfs + s3 credentials are configured via SET s3_region, s3_access_key_id, s3_secret_access_key
5. Format result as a markdown table if <50 rows, else describe with aggregate stats + link to full CSV export
Always add LIMIT 10000 and a time-range filter unless the user explicitly asks for full-scan.
How this workflow fires and what env vars you need.
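The formatting rule in step 5 of the prompt (a markdown table for fewer than 50 rows, otherwise aggregate stats) can be sketched roughly as follows. The function name `format_result` and the choice of min/max/mean as the "aggregate stats" are assumptions made for illustration; the CSV export link is left out because the recipe does not say where exports are stored.

```python
def format_result(columns, rows, max_table_rows=50):
    """Render query output as a markdown table, or summarise large results."""
    if len(rows) < max_table_rows:
        header = "| " + " | ".join(columns) + " |"
        divider = "| " + " | ".join("---" for _ in columns) + " |"
        body = ["| " + " | ".join(str(v) for v in row) + " |" for row in rows]
        return "\n".join([header, divider, *body])

    # Too many rows for a readable table: report per-column numeric stats.
    lines = [f"{len(rows)} rows returned; showing per-column summary."]
    for i, name in enumerate(columns):
        values = [
            row[i] for row in rows
            if isinstance(row[i], (int, float)) and not isinstance(row[i], bool)
        ]
        if values:
            mean = sum(values) / len(values)
            lines.append(
                f"- {name}: min={min(values)}, max={max(values)}, mean={mean:.2f}"
            )
    return "\n".join(lines)
```

Here `columns` is a list of column names and `rows` a list of tuples, which is the shape a DuckDB `fetchall()` call returns.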
Run in Claude Code when you need ad-hoc analytics
DUCKDB_DATABASE: Path to local DuckDB file (or :memory:), e.g. ./analytics.duckdb
AWS_SECRET_ACCESS_KEY: AWS secret access key, e.g. wJalrXUtnFEMI/...
AWS_REGION: Bucket region, e.g. us-east-1
S3_BUCKET: Target S3 bucket with Parquet data, e.g. acme-datalake
S3_PREFIX: Key prefix for the dataset, e.g. events/year=2026/
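As a rough sketch of how these variables could map onto the DuckDB settings named in step 4 of the prompt, the helper below turns an environment mapping into the statements to run before the first query. The function names are hypothetical, and `AWS_ACCESS_KEY_ID` is an assumption: it is the standard AWS variable name but does not appear in the list above.

```python
import os


def sql_literal(value: str) -> str:
    """Quote a value as a SQL string literal, doubling embedded quotes."""
    return "'" + value.replace("'", "''") + "'"


def s3_setup_statements(env=os.environ):
    """Return the DuckDB statements that enable S3 access for a session."""
    statements = ["INSTALL httpfs", "LOAD httpfs"]
    # Environment variable -> DuckDB setting (names from step 4 of the prompt).
    mapping = {
        "AWS_REGION": "s3_region",
        "AWS_ACCESS_KEY_ID": "s3_access_key_id",  # assumed; not in the env list
        "AWS_SECRET_ACCESS_KEY": "s3_secret_access_key",
    }
    for var, setting in mapping.items():
        if var in env:
            statements.append(f"SET {setting} = {sql_literal(env[var])}")
    return statements
```

Each returned statement can then be executed in order on a connection opened against the `DUCKDB_DATABASE` path (or `:memory:`) before running any `read_parquet` query.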
Install everything — MCPs, prompt, env template — in a single call.
$ mcpizy recipe install duckdb-s3-data-lake
✓ Installs all 2 MCP servers
✓ Writes prompt to ~/.mcpizy/prompts/duckdb-s3-data-lake.md
✓ Generates .env.example in current directory
✓ Ready to paste into Claude Code
Requires mcpizy CLI v1.1+ — install via npm i -g mcpizy.
$ mcpizy install duckdb && mcpizy install aws
Schedule a Firecrawl scrape of any website and store the structured results directly in a Supabase table for analysis.
Run Tavily searches on scheduled topics and index the results in Supabase for trend analysis and content research.
When a Supabase row changes, the corresponding Redis cache key is automatically invalidated to keep your API fresh.
Parse your GitHub repos and build a Neo4j knowledge graph of files, functions, imports, and authors for code intelligence.
Data Lake Queries is a data automation that uses DuckDB + S3 together via the Model Context Protocol. Query Parquet files directly from S3 using DuckDB without any ETL. Results are returned in seconds for ad-hoc analytics.
Setup takes around 10 minutes; after that, ad-hoc queries return in seconds. You install the required MCP servers with `mcpizy install duckdb && mcpizy install aws`, connect your accounts, and the workflow is ready to run.
Once running, this workflow saves ~20 hours/week of analyst wait time, plus $5-20k/month in warehouse costs. The concrete business value: it replaces Snowflake/BigQuery for early-stage analytics, saving $60-240k/year in warehouse bills, and it cuts query time from 45 min (EMR spin-up) to 8 seconds, so analysts iterate 10x faster on hypotheses.
You need 2 MCP servers: DuckDB (`mcpizy install duckdb`) and S3 (`mcpizy install aws`). Both are installable in one command via the MCPizy CLI and configured in your `.claude.json` or `.cursor/mcp.json`.
Yes. The workflow runs with any MCP-compatible AI agent — Claude Code, Claude Desktop, Cursor, Windsurf, VS Code with Copilot, and custom agents built on the MCP SDK. The MCP servers are identical across clients; only the config file path (`.claude.json` vs `.cursor/mcp.json`) changes.
Install the required MCPs from the marketplace and automate this with about 10 minutes of setup.
$ mcpizy install duckdb && mcpizy install aws
Free to install. Connect your accounts and this workflow runs itself.