By Adrien Payong and Shaoni Mukherjee

Pandas has been the standard choice for data manipulation in Python for more than a decade. If you’ve been cleaning features for a machine learning model, exploring a CSV in a notebook, or writing a lightweight ETL pipeline, it’s likely that your code has revolved around the Pandas DataFrame. It has occupied a central place in Python’s data landscape.
However, larger datasets and more compute-intensive analytical workloads have long been stretching the in-memory, single-threaded limits of Pandas. Enter DuckDB, an in-process SQL OLAP (Online Analytical Processing) engine whose “SQLite for analytics” tagline reflects its promise: fast, columnar, multi-threaded analytical SQL running inside your application, with no server to install or manage.
This article presents these two tools through a practitioner-focused, on-the-ground comparison. We examine how the tools differ in architecture, performance, scalability, and developer experience. We show where DuckDB can meaningfully replace Pandas and where Pandas still has the edge.
Understanding what each tool represents will better prepare you for comparing performance and workflow capabilities. Pandas is a DataFrame library that is very tightly coupled to Python. DuckDB, on the other hand, is an embedded SQL analytics engine that runs in your Python process.
Pandas is the de facto standard library for data wrangling in Python. It brought the DataFrame to Python back in 2008 and has become one of the key tools of the data science ecosystem. A Pandas DataFrame is a 2D, in-memory tabular data structure with labeled rows and columns, providing fast, indexed access through a Python API. Operations on Pandas DataFrames (e.g., slicing, filtering, transformations) are backed by C/NumPy routines for vectorized execution.
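As a quick, minimal sketch of that style of work (the column names and values below are made up purely for illustration):

import pandas as pd

# A tiny, purely illustrative DataFrame
sales = pd.DataFrame(
    {
        "region": ["EU", "EU", "US", "US"],
        "revenue": [120.0, 80.0, 200.0, 150.0],
    }
)

# Vectorized, column-wise operations: no explicit Python loop over rows
sales["revenue_k"] = sales["revenue"] / 1000
big_eu = sales[(sales["region"] == "EU") & (sales["revenue"] > 100)]
by_region = sales.groupby("region")["revenue"].sum()

print(by_region)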
Key characteristics of Pandas: eager, in-memory execution on a single machine; an imperative, method-chaining Python API; and deep integration with the scientific Python ecosystem.
DuckDB is sometimes described as the “SQLite for analytics”. Like Pandas, it lets you query data inside your Python process, but under the hood it is very different: it is a columnar, vectorized SQL engine written in C++ that runs multi-threaded and can handle datasets larger than memory.
Comparing DuckDB and pandas reveals several critical architectural differences:
| Feature | pandas (DataFrame library) | DuckDB (in-process SQL engine) |
|---|---|---|
| Core abstraction | Python DataFrame and Series objects with an imperative API. | Relational tables queried with declarative SQL; exposed through a Python API. |
| Storage format | In-memory DataFrame with columns backed by NumPy/Arrow arrays; not a purpose-built columnar analytics store, and many operations materialize full intermediate copies. | Columnar storage: data stored in contiguous column blocks, optimized for scans and aggregations. |
| Execution model | Eager execution via vectorized NumPy/Arrow operations (plus Python loops for apply-style code); single-threaded by default (multi-threaded only in specific cases). | A vectorized query engine that processes batches of values (vectors) per operator, using a C++ engine with multi-threading across all available CPU cores. |
| Memory & scale | Entire DataFrame (or groups in a group-by) must fit in RAM; out-of-core patterns require manual workarounds or other tools. | Designed for larger-than-memory analytics; can spill to disk transparently and stream data from columnar files (e.g., Parquet). |
| Concurrency & transactions | No ACID transaction layer; operations mutate in-memory objects within a single Python process. | Provides ACID transactions with MVCC; supports safe, transactional updates and consistent reads, even when persisting to a .duckdb file (see the sketch below this table). |
| Query language & ergonomics | Python DataFrame API: method chains (.groupby(), .agg(), .merge(), etc.); very flexible but can become verbose for complex pipelines. | Standard SQL with joins, window functions, CTEs, and macros; often more concise and readable for complex analytics if you are comfortable with SQL. |
| Integration with DataFrames | Operates directly on in-memory DataFrames; ecosystem built around the pandas API. | Can query pandas / Polars DataFrames via replacement scans (zero-copy where possible), and return results as pandas, Polars, Arrow, or NumPy. |
| File & external data access | Relies on separate Python libraries (e.g., pyarrow, boto3) for Parquet, cloud storage, and other formats; I/O patterns are usually explicit Python code. | Built-in connectors and extensions: query CSV/Parquet/JSON and many other formats, including S3 and HTTP, directly from SQL without extra Python packages. |
| Ecosystem & tooling | Large, mature ecosystem; first-class support in many visualization and ML libraries (e.g., Matplotlib, Seaborn, scikit-learn, statsmodels). | Growing ecosystem focused on analytics; integrates with tools via Arrow and connectors, and can act as a fast local engine under BI tools, dbt, or custom applications. |
| Typical sweet spot | Interactive analysis, feature engineering, and small-to-medium datasets that fit comfortably in memory; quick experimentation in notebooks. | Heavy analytical queries (joins, aggregations, window functions) on large local datasets; “SQL-first” workflows, local OLAP, and offloading expensive operations from pandas. |
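To make the transactions-and-persistence row concrete, here is a minimal sketch; the file name analytics.duckdb and the table contents are arbitrary examples:

import duckdb

# Open (or create) a persistent database file instead of an in-memory database.
con = duckdb.connect("analytics.duckdb")

# CREATE TABLE ... AS SELECT persists the result; the write is transactional.
con.sql("""
    CREATE OR REPLACE TABLE daily_totals AS
    SELECT DATE '2025-01-02' AS day, 42.0 AS total
""")

# Later sessions see a consistent, durable view of the table.
print(con.sql("SELECT * FROM daily_totals").df())
con.close()

Deleting the .duckdb file removes the database entirely; there is no server process to stop.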
DuckDB is designed as a columnar database, so an aggregation such as SELECT SUM(value) reads only the relevant column instead of entire rows. Operations are applied to large blocks of values at once using vectorized execution, which reduces per-operation overhead. Pandas, by contrast, executes each step eagerly on whole in-memory columns and often materializes intermediate copies along the way. The end result is that DuckDB is often faster and more memory-efficient for operations such as aggregation, filtering, and joins, particularly on larger datasets.
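As a minimal sketch of what that means in practice (assuming a hypothetical events.parquet file that contains a numeric value column among others):

import duckdb
import pandas as pd

# Pandas: the whole file is loaded into memory before the aggregation
# (unless you remember to pass columns=["value"]).
df = pd.read_parquet("events.parquet")
pandas_total = df["value"].sum()

# DuckDB: only the 'value' column is scanned, in parallel, and the full
# dataset never has to fit into a DataFrame.
duckdb_total = duckdb.sql(
    "SELECT SUM(value) AS total FROM 'events.parquet'"
).fetchone()[0]

print(pandas_total, duckdb_total)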
Consider a recent benchmark comparing DuckDB, Polars, and Pandas, run on a MacBook Pro M2 with 16 GB RAM. The synthetic dataset represented 1 million persons and 10 million position records. DuckDB was the fastest, completing in an average of 3.84 seconds; Polars came in a close second at an average of 5.77 seconds; Pandas was far slower, averaging 19.57 seconds, over five times slower than DuckDB.
This result can inform discussions about the use cases and limitations of Pandas, especially for large datasets or performance-sensitive data processing tasks. It also highlights the increasing importance of newer, more specialized data processing libraries such as DuckDB and Polars, which may be better suited to certain workloads and offer significant performance advantages over Pandas.
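Benchmark numbers always depend on hardware and data, so treat published figures as indicative. If you want a rough sense of the gap on your own machine, a simple harness along these lines (synthetic data, illustrative sizes, not a reproduction of the benchmark above) is enough to compare the same group-by in both engines:

import time
import duckdb
import numpy as np
import pandas as pd

# Synthetic data: 10 million rows spread over 1,000 groups (illustrative sizes).
n = 10_000_000
df = pd.DataFrame(
    {
        "group_id": np.random.randint(0, 1_000, size=n),
        "value": np.random.rand(n),
    }
)

t0 = time.perf_counter()
pandas_result = df.groupby("group_id")["value"].mean()
t1 = time.perf_counter()

# DuckDB can query the local DataFrame 'df' directly (explained below).
duckdb_result = duckdb.sql(
    "SELECT group_id, AVG(value) AS value FROM df GROUP BY group_id"
).df()
t2 = time.perf_counter()

print(f"pandas: {t1 - t0:.2f}s, DuckDB: {t2 - t1:.2f}s")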
One of DuckDB’s biggest advantages is that it does not force you to choose between SQL and pandas. Instead, it plugs into your existing Python workflow: it can query DataFrames in place, return results as DataFrames, and run entirely inside your Python process.
DuckDB becomes a “SQL engine inside your notebook” rather than a separate system to manage.
With DuckDB’s replacement scan mechanism, you can treat a pandas DataFrame like a SQL table by just referencing the variable name in the query. No explicit registration required.
pip install duckdb
import duckdb
import pandas as pd
# A small DataFrame in pandas
df = pd.DataFrame(
    {
        "city": ["Paris", "Berlin", "Lagos", "Delhi"],
        "temp_c": [21, 19, 30, 35],
    }
)
# DuckDB sees 'df' as a table via replacement scans
result = duckdb.sql("""
SELECT
city,
temp_c,
temp_c * 9.0/5.0 + 32 AS temp_f
FROM df
WHERE temp_c >= 25
""").df()
print(result)
Key points: DuckDB resolves the name df via its replacement scan mechanism, so no registration call is needed, and .df() converts the query result back into a pandas DataFrame.
You may also use an explicit connection and registration if you want more control:
con = duckdb.connect()
con.register("cities", df)
hot_cities = con.sql("""
SELECT city, temp_c
FROM cities
WHERE temp_c >= 25
""").df()
After running a SQL query, you typically want to continue working in your favorite Python data library. For this purpose, DuckDB’s Python API exposes convenient conversion methods:
import duckdb
import pandas as pd
import polars as pl
import pyarrow as pa
# Simple demo dataset
df = pd.DataFrame(
    {
        "product": ["A", "A", "B", "B"],
        "amount": [100, 150, 80, 120],
    }
)
rel = duckdb.sql("""
SELECT
product,
SUM(amount) AS total_amount
FROM df
GROUP BY product
ORDER BY total_amount DESC
""")
# 1) Get results as pandas
pandas_result = rel.df()
# 2) Get results as Polars
polars_result = rel.pl()
# 3) Get results as an Arrow table
arrow_result = rel.arrow()
print("Pandas:\n", pandas_result)
print("Polars:\n", polars_result)
print("Arrow schema:\n", arrow_result.schema)
In a real workflow, you might let DuckDB do the heavy aggregation and then hand the result to pandas for plotting, to Polars for further transformation, or to Arrow for interchange with other tools.
Where DuckDB excels, however, is in hybrid workflows that mix data sitting in files (such as Parquet) with in-memory DataFrames. You can join them in a single SQL statement, without having to write custom ETL steps. For example, you might have a large orders table stored as Parquet and a small customer-segments lookup living in a pandas DataFrame:
import os
import duckdb
import pandas as pd
from pathlib import Path
# 1. Ensure the 'data' directory exists
data_dir = Path("data")
data_dir.mkdir(exist_ok=True)
# 2. Create a small demo "orders" dataset in pandas
orders = pd.DataFrame(
    {
        "order_id": [101, 102, 103, 104, 105],
        "customer_id": [1, 2, 1, 3, 2],
        "order_date": pd.to_datetime(
            ["2025-01-02", "2025-01-03", "2025-01-05", "2025-02-01", "2025-02-10"]
        ),
        "order_amount": [200.0, 150.0, 300.0, 80.0, 120.0],
    }
)
parquet_path = data_dir / "orders.parquet"
# 3. Save as Parquet (requires pyarrow or fastparquet installed)
orders.to_parquet(parquet_path, index=False)
print(f"Saved demo file at: {parquet_path.resolve()}")
# 4. Create a small customer segments DataFrame
customer_segments = pd.DataFrame(
    {
        "customer_id": [1, 2, 3],
        "segment": ["Enterprise", "SMB", "Consumer"],
    }
)
# 5. Connect DuckDB and register the dimension table
con = duckdb.connect()
con.register("customer_segments", customer_segments)
# 6. Run the hybrid query joining Parquet + pandas
query = f"""
SELECT
o.customer_id,
s.segment,
SUM(o.order_amount) AS total_spend
FROM '{parquet_path.as_posix()}' AS o
JOIN customer_segments AS s
ON o.customer_id = s.customer_id
WHERE o.order_date >= DATE '2025-01-01'
GROUP BY o.customer_id, s.segment
ORDER BY total_spend DESC
"""
top_customers = con.sql(query).df()
print(top_customers)
This pattern illustrates the “best-of-both-worlds” approach: the large fact table stays on disk as Parquet, the small dimension table stays in pandas, and DuckDB joins and aggregates them in a single query, handing back only the small result as a DataFrame.
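If the aggregated result should go back to the data lake rather than into pandas, DuckDB can also write Parquet directly from SQL. A minimal sketch, reusing the con, customer_segments, and parquet_path objects from the example above (the output path is arbitrary):

# Write the joined, aggregated result straight back to Parquet,
# without materializing it as a pandas DataFrame first.
con.sql(f"""
    COPY (
        SELECT
            o.customer_id,
            s.segment,
            SUM(o.order_amount) AS total_spend
        FROM '{parquet_path.as_posix()}' AS o
        JOIN customer_segments AS s
          ON o.customer_id = s.customer_id
        GROUP BY o.customer_id, s.segment
    ) TO 'data/customer_spend.parquet' (FORMAT PARQUET)
""")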
DuckDB plays well with pandas, but that is not the end of the story. It also integrates very nicely with Polars and Apache Arrow, which is essential if you are using a heterogeneous stack.
There are two common scenarios: querying a Polars DataFrame directly with SQL, and running SQL over an Apache Arrow table.
import duckdb
import polars as pl
# Polars DataFrame
pl_df = pl.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "value": [10, 20, 30, 40],
    }
)
# DuckDB can see 'pl_df' similarly to pandas via replacement scans
rel = duckdb.sql("""
SELECT
id,
value,
value * 2 AS value_x2
FROM pl_df
WHERE value >= 20
""")
# Get the result as Polars again
result_pl = rel.pl()
print(result_pl)
import duckdb
import pyarrow as pa
# Create a small Arrow table
arrow_table = pa.table(
    {
        "event_type": ["click", "view", "click", "purchase"],
        "user_id": [10, 10, 11, 10],
        "amount": [None, None, None, 99.0],
    }
)
# Register the Arrow table explicitly (optional but clear)
con = duckdb.connect()
con.register("events", arrow_table)
# Run SQL on the Arrow table
agg = con.sql("""
SELECT
user_id,
COUNT(*) AS num_events,
SUM(CASE WHEN event_type = 'purchase' THEN amount ELSE 0 END) AS revenue
FROM events
GROUP BY user_id
""").arrow()
print(agg)
In both cases, DuckDB operates on the data where it already lives (zero-copy via Arrow where possible) and hands the result back in the format you started with, with no detour through pandas required.
Should you drop Pandas entirely in favor of DuckDB? In some cases, the answer is yes, but not always. To clarify, let’s look at the types of problems DuckDB can (and should) solve, and those that are better left to Pandas:
| Tool | Use Case | When to Prefer It / Description |
|---|---|---|
| DuckDB | Huge datasets / “big data on a laptop” | Analyzing datasets too large for Pandas (tens of millions of rows, multi-GB files). Ideal for log files, telemetry, and other big tables where Pandas would require sampling/downcasting or fail. Often handles 100M+ rows on a typical 16 GB RAM laptop. |
| DuckDB | Analytical queries & OLAP-style reporting | Workflows that look like SQL: grouping, joining, filtering, aggregates, window functions. DuckDB can fully replace df.groupby().agg() + merge chains with concise SQL. Commonly used for reports, exploratory analysis, and transformations in notebooks. |
| DuckDB | Columnar file formats (Parquet / Arrow) | Pipelines where data is stored mainly as Parquet/Arrow. DuckDB can query these files in place without loading entire DataFrames, and can also write Parquet (e.g., COPY TO). Acts as a local query engine over “data lake” files, faster and more memory-efficient. |
| DuckDB | Interactive data exploration with SQL | Users prefer exploring data via SQL (SELECT ... WHERE ... LIMIT 100). DuckDB allows iterative querying in notebooks/scripts, similar to a SQL IDE but local and offline. Can replace many Pandas-based exploratory steps for SQL-oriented users. |
| DuckDB | Memory-bound computations | Operations that are very memory-intensive in Pandas (large group-bys with many categories, multiple big joins, wide tables). DuckDB uses streaming aggregation and disk-spilling join strategies, making these workloads feasible where Pandas might blow up. |
| DuckDB | Production / data engineering pipelines | Recurring analytics in scripts/apps on large data. More robust and predictable at scale than Pandas, which may fail as data grows. Good for nightly jobs aggregating tens of millions of rows, batch ETL, or embedded analytics; usable from Python, R, Java, etc. |
| Pandas | Small data and quick scripts | Tiny or small datasets (e.g., a 5,000-row CSV) and quick one-off analyses. Loading and manipulating data in Pandas is instantaneous, and writing a short snippet is often faster than formulating a SQL query, especially for Python-fluent users. |
| Pandas | Complex data manipulation & feature engineering | Tasks needing arbitrary Python logic, external libraries, or iterative algorithms. Easy to generate new columns via loops, vectorization, or apply. Ideal for custom text processing, complex ranking rules, bespoke pivots, and advanced feature engineering. |
| Pandas | Ecosystem and library compatibility | Many libraries expect Pandas DataFrames or NumPy arrays (e.g., scikit-learn, Matplotlib, seaborn). Even if DuckDB prepares data, you typically end up in Pandas at the edges (modeling, visualization). Staying in Pandas can be simpler when every step needs it. |
| Pandas | Real-time interactive tweaks in notebooks | Very interactive data cleaning and inspection: df.head(), boolean filters, manual edits like df.iloc[5, 3] = None. Data is fully in memory and directly editable, making ad-hoc tweaks more convenient than re-running SQL queries. |
| Pandas | Team habits and existing codebase | Large existing Pandas codebase or a team skilled in Pandas but not SQL. Switching everything to DuckDB has a cost; incremental adoption (using DuckDB only for pain points) is usually wiser. Pandas’ maturity and familiarity remain strong advantages. |
Pandas retains an edge for flexibility and for the long tail of small or medium data workloads where its overhead is unnoticeable, and its native Python integration is a benefit. DuckDB doesn’t aim to be a drop-in replacement for that; it’s designed for cases where Pandas struggles (large data, intensive analytics, SQL users).
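To make the “query files in place” and interactive-SQL rows above concrete, a sketch like the following (the logs/*.parquet glob and the column names are invented for illustration) previews a directory of Parquet files without ever loading them into pandas:

import duckdb

# Scan many Parquet files at once via a glob, filter, and return only a preview.
preview = duckdb.sql("""
    SELECT *
    FROM 'logs/*.parquet'
    WHERE status_code >= 500
    ORDER BY event_time DESC
    LIMIT 100
""").df()

print(preview)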
Despite its advantages, DuckDB is not a silver bullet. Some limitations include: it is a single-node engine, so truly massive workloads still call for a distributed system or warehouse; it is SQL-first, so arbitrary row-by-row Python logic and bespoke feature engineering remain easier in pandas; and its ecosystem, while growing quickly, is younger than pandas’, so results are usually converted back to DataFrames for visualization and machine learning.
DuckDB is not here to kill pandas; rather, it is here to complement it. As data workloads outgrow a single, in-memory DataFrame, DuckDB brings a modern, high-performance engine right into your Python workflow: it reads Parquet files, pushes filters down to storage, scales across all CPU cores, and works with datasets much larger than RAM. It does all of this without deploying a server, changing your environment, or taking on the other hassles of a distributed database.
Pandas remains a great fit for day-to-day exploration, prototyping, and the kind of quick feature engineering that benefits from Python’s flexibility. However, when the job requires scanning millions of rows, joining large Parquet files, or simply working with Arrow-backed datasets, DuckDB provides the speed and scale that pandas does not.
DuckDB gives data scientists the speed and power of a local analytical database from within a notebook, while integrating seamlessly with pandas, Polars, Arrow, and dbt.
In practice, the future is not “DuckDB versus pandas,” but DuckDB and pandas together. Each has a role to play in the modern analytics stack. Together, they compose a workflow that is fast, flexible, and fit for the scale of today’s data.