HomeBack to recipes
RecipesDataData Lake Queries

TL;DR

Data Lake Queries is a data workflow that chains DuckDB + S3 to automate a common task. Query Parquet files directly from S3 using DuckDB without any ETL. Results are returned in seconds for ad-hoc analytics. Once configured, it saves ~20 hours/week of analyst wait time, plus $5-20k/month saved on warehouse costs and runs through Claude Code, Cursor, Windsurf or any MCP-compatible AI agent.

🦆☁️
DataIntermediate

Data Lake Queries

Query Parquet files directly from S3 using DuckDB without any ETL. Results are returned in seconds for ad-hoc analytics.

10 min setup, ad-hoc queries in seconds2 MCPs requiredSaves ~20 hours/week of analyst wait time, plus $5-20k/month saved on warehouse costs

How it works

🦆DuckDB
☁️S3
Automated
1Connect DuckDB to S3 bucket via httpfs2Scan Parquet file metadata for schema3Run analytical SQL query with predicate pushdown+2 more steps

Local CLI only

None of these MCPs are hosted yet. Install and run the recipe locally:

mcpizy recipe install duckdb-s3-data-lake

Why this combo?

S3 stores your Parquet data lake cheaply at any scale; DuckDB queries it with full SQL semantics and columnar performance without a cluster. Together they give you BigQuery-style analytics on your own data without the cost or complexity of a managed warehouse.

Without this workflow

Spin up an EMR cluster, wait 10 minutes, run a Spark job, get results, shut down the cluster. Cost: $30. Time: 45 minutes.

With MCPizy

DuckDB scans S3 Parquet directly. Same query. 8 seconds. Zero cluster management.

Business value

Concrete ROI — not marketing fluff.

Time saved

~20 hours/week of analyst wait time, plus $5-20k/month saved on warehouse costs

  • Replaces Snowflake/BigQuery for early-stage analytics — saves $60-240k/year in warehouse bills
  • Cuts query time from 45min (EMR spin-up) to 8 seconds — analysts iterate 10x faster on hypotheses
  • Zero infrastructure to manage: no cluster tuning, no ops team needed, no off-hours warehouse babysitting
  • Enables product analytics without hiring a full data team — founders can self-serve on a laptop

Workflow steps

  1. 1
    Connect DuckDB to S3 bucket via httpfs
  2. 2
    Scan Parquet file metadata for schema
  3. 3
    Run analytical SQL query with predicate pushdown
  4. 4
    Return result set to caller
  5. 5
    Cache query plan for repeat queries

Use cases

  • Ad-hoc analytics on event logs stored as Parquet in S3
  • Scheduled aggregate reports without loading data into a warehouse
  • Cross-partition queries across months of data with predicate pushdown
  • Fast prototyping of analytics before committing to a schema

MCPs required

🦆

DuckDB

DuckDB MCP Server

View
☁️

S3

S3 MCP Server

View

Agent prompt (copy into Claude Code)

This prompt is the workflow. Paste into Claude Code, Cursor, or Windsurf.

You are a data-lake query agent. Invoked ad-hoc with a natural-language question.

Given a question about data in s3://${S3_BUCKET}/${S3_PREFIX}/:
1. Call aws.s3_list_objects(bucket, prefix) to enumerate Parquet files and their partition keys
2. Call duckdb.execute("DESCRIBE SELECT * FROM read_parquet('s3://bucket/prefix/*.parquet')") to get schema
3. Translate the question into a SQL query with explicit partition predicates to enable pushdown
4. Call duckdb.execute(sql) — ensure httpfs + s3 credentials are configured via SET s3_region, s3_access_key_id, s3_secret_access_key
5. Format result as a markdown table if <50 rows, else describe with aggregate stats + link to full CSV export

Always add LIMIT 10000 and a time-range filter unless the user explicitly asks for full-scan.

Trigger & credentials

How this workflow fires and what env vars you need.

.env.example
ManualTrigger
Run in Claude Code when you need ad-hoc analytics
🦆DuckDB· 1 var
DUCKDB_DATABASE

Path to local DuckDB file (or :memory:)

e.g. ./analytics.duckdb

☁️S3· 5 vars
AWS_ACCESS_KEY_IDGet key

AWS access key with s3:GetObject on the bucket

e.g. AKIA...

AWS_SECRET_ACCESS_KEY

AWS secret access key

e.g. wJalrXUtnFEMI/...

AWS_REGION

Bucket region

e.g. us-east-1

S3_BUCKET

Target S3 bucket with Parquet data

e.g. acme-datalake

S3_PREFIX

Key prefix for the dataset

e.g. events/year=2026/

One-command deploy

Install everything — MCPs, prompt, env template — in a single call.

$ mcpizy recipe install duckdb-s3-data-lake

✓ Installs all 2 MCP servers
✓ Writes prompt to ~/.mcpizy/prompts/duckdb-s3-data-lake.md
✓ Generates .env.example in current directory
✓ Ready to paste into Claude Code

Requires mcpizy CLI v1.1+ — install via npm i -g mcpizy.

Quick install (MCPs only)

10 min setup, ad-hoc queries in seconds
$ mcpizy install duckdb && mcpizy install aws

More Data recipes

🔥🟢

Web Scraping to Database

Schedule a Firecrawl scrape of any website and store the structured results directly in a Supabase table for analysis.

🔍🟢

Search Results Indexing

Run Tavily searches on scheduled topics and index the results in Supabase for trend analysis and content research.

🔴🟢

Cache Invalidation Pipeline

When a Supabase row changes, the corresponding Redis cache key is automatically invalidated to keep your API fresh.

🕸️🐙

Knowledge Graph from Code

Parse your GitHub repos and build a Neo4j knowledge graph of files, functions, imports, and authors for code intelligence.

Frequently asked questions

What is this workflow?

Data Lake Queries is a data automation that uses DuckDB + S3 together via the Model Context Protocol. Query Parquet files directly from S3 using DuckDB without any ETL. Results are returned in seconds for ad-hoc analytics.

How long does setup take?

Setup takes around 10 min setup, ad-hoc queries in seconds. You install the required MCP servers with `mcpizy install duckdb && mcpizy install aws`, connect your accounts, and the workflow is ready to run.

How much time does this workflow save?

Once running, this workflow saves ~20 hours/week of analyst wait time, plus $5-20k/month saved on warehouse costs. The concrete business value: Replaces Snowflake/BigQuery for early-stage analytics — saves $60-240k/year in warehouse bills; Cuts query time from 45min (EMR spin-up) to 8 seconds — analysts iterate 10x faster on hypotheses.

Which MCP servers do I need for this?

You need 2 MCP servers: DuckDB (mcpizy install duckdb), S3 (mcpizy install aws). All are installable in one command via the MCPizy CLI and configured in your `.claude.json` or `.cursor/mcp.json`.

Does this work with Claude Code, Cursor, and Windsurf?

Yes. The workflow runs with any MCP-compatible AI agent — Claude Code, Claude Desktop, Cursor, Windsurf, VS Code with Copilot, and custom agents built on the MCP SDK. The MCP servers are identical across clients; only the config file path (`.claude.json` vs `.cursor/mcp.json`) changes.

Start building this workflow

Install the required MCPs from the marketplace and automate this in 10 min setup.

$ mcpizy install duckdb && mcpizy install aws

🦆Install DuckDB☁️Install S3

Free to install. Connect your accounts and this workflow runs itself.