KevsRobots Learning Platform

DuckDB - Fast, free analytics

Introduction

Start here to learn what DuckDB is, why it shines for analytics, and how this course is structured.

By Kevin McAleer, 3 minutes


DuckDB for Analytics — Introduction

DuckDB is an in-process, columnar SQL database that runs anywhere you can run your app. It excels at fast analytical queries on local files like Parquet and CSV, without needing a server, a cluster, or cloud credits.

New to terms like Parquet, CTE, or materialize? See the Beginner glossary.

This course is part of the Databases learning pathway, focused on practical analytics: querying files, creating small analytical datasets, and integrating with Python.

What you’ll learn

  • Install and run DuckDB (CLI and Python)
  • Query CSV/Parquet directly with SQL
  • Create and persist a local analytics database
  • Work with DuckDB from Python and Pandas/Polars
  • Read data from data lakes (local, HTTP/S3) and tune performance
  • Use extensions and best practices for analytical workloads

Who this is for

  • Makers, analysts, and developers who want fast local analytics
  • Beginners to databases and data lakes who prefer practical examples

Prerequisites

  • Basic command line and Python familiarity
  • macOS, Linux, or Windows
  • Optional: Python 3.10+ and VS Code for notebook-style work

Why DuckDB

  • No server: embed it in your scripts and apps
  • Fast analytical SQL: vectorized, columnar execution
  • File-first: query Parquet/CSV directly, no ETL required
  • Great with Python: query DataFrames and return results as DataFrames
  • Portable: a single .duckdb file you can version and ship

When to choose it:

  • You need quick, local analytics (OLAP) on files
  • You want to prototype or build repeatable data workflows without infra
  • You want SQL over DataFrames, or to complement Pandas/Polars

Quick start (macOS)

  • CLI: brew install duckdb
  • Python: python -m pip install duckdb

Try the CLI:

duckdb

Then run a tiny analytics query directly on a CSV from the web:

-- Create a table from a remote CSV
CREATE TABLE tips AS
SELECT * FROM read_csv_auto('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv');

-- Simple aggregate
SELECT day, ROUND(SUM(total_bill), 2) AS revenue, COUNT(*) AS orders
FROM tips
GROUP BY day
ORDER BY revenue DESC;

Exit with .quit.

Or from Python:

import duckdb, pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv')

# Query a DataFrame directly
result = duckdb.query("""
  SELECT day, AVG(total_bill) AS avg_bill
  FROM df
  GROUP BY day
  ORDER BY avg_bill DESC
""").df()

print(result)

What we’ll build in this course

  • A small, local analytics workspace using DuckDB
  • Reproducible queries over CSV/Parquet (local and remote)
  • A persisted .duckdb database and lightweight “data mart”
  • Python integrations for analysis and notebooks (e.g., Jupyter)
  • Practical performance tips (statistics, Parquet predicate pushdown, and more)

Course structure

  • 00 intro: what DuckDB is and why use it
  • 01 installing: CLI, Python, VS Code setup
  • 02 basic queries: selecting, filtering, aggregations, joins
  • 03 advanced features: views, CTEs, extensions, persistence
  • 04 python integration: DataFrames, results, notebooks
  • 05 data analysis: end-to-end mini analysis
  • 06 data lakes: reading from local/remote object storage
  • 07 parquet: formats, partitioning, predicates
  • 08 performance tuning: tips for speed and memory
  • 09 conclusion: recap and patterns
  • 10 additional resources: docs, tools, datasets

Next up

If you want a one-page setup and first queries, see the Quickstart. Otherwise, head to 01 installing to set up DuckDB and the Python environment.

