# Introduction
Working with large datasets in Python often leads to a common problem: you load your data with Pandas, and your program slows to a crawl or crashes entirely. This typically occurs because you are attempting to load everything into memory simultaneously.
Most memory issues stem from how you load and process data. With a handful of practical techniques, you can handle datasets much larger than your available memory.
In this article, you will learn seven techniques for working with large datasets efficiently in Python. We will start simply and build up, so by the end, you will know exactly which approach fits your use case.
🔗 You can find the code on GitHub. If you’d like, you can run this sample data generator Python script to get sample CSV files and use the code snippets to process them.
# 1. Read Data in Chunks
The most beginner-friendly approach is to process your data in smaller pieces instead of loading everything at once.
Consider a scenario where you have a large sales dataset and you want to find the total revenue. The following code demonstrates this approach:
```python
import pandas as pd

# Define chunk size (number of rows per chunk)
chunk_size = 100000
total_revenue = 0

# Read and process the file in chunks
for chunk in pd.read_csv('large_sales_data.csv', chunksize=chunk_size):
    # Process each chunk
    total_revenue += chunk['revenue'].sum()

print(f"Total Revenue: ${total_revenue:,.2f}")
```
Instead of loading all 10 million rows at once, we are loading 100,000 rows at a time. We calculate the sum for each chunk and add it to our running total. Your RAM only ever holds 100,000 rows, no matter how big the file is.
When to use this: When you need to perform aggregations (sum, count, average) or filtering operations on large files.
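A sum or count can be accumulated directly, but an average cannot be computed by averaging the per-chunk averages, since chunks may differ in size. Here is a minimal sketch of the chunked-average case, using the same hypothetical sales file and keeping a running sum and row count:

```python
import pandas as pd

# A chunked average needs a running sum and a running row count;
# averaging per-chunk averages would weight unequal chunks incorrectly.
running_sum = 0.0
running_count = 0

for chunk in pd.read_csv('large_sales_data.csv', chunksize=100_000):
    running_sum += chunk['revenue'].sum()
    running_count += len(chunk)

average_revenue = running_sum / running_count
print(f"Average Revenue: ${average_revenue:,.2f}")
```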
# 2. Use Specific Columns Only
Often, you do not need every column in your dataset. Loading only what you need can reduce memory usage significantly.
Suppose you are analyzing customer data, but you only need the customer ID, age, and purchase amount rather than the numerous other columns:
```python
import pandas as pd

# Only load the columns you actually need
columns_to_use = ['customer_id', 'age', 'purchase_amount']
df = pd.read_csv('customers.csv', usecols=columns_to_use)

# Now work with a much lighter dataframe
average_purchase = df.groupby('age')['purchase_amount'].mean()
print(average_purchase)
```
By specifying usecols, Pandas only loads those three columns into memory. If your original file had 50 columns, you have just cut your memory usage by roughly 94%.
When to use this: When you know exactly which columns you need before loading the data.
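If you are not sure which columns the file even contains, you can peek at just the header before committing to a full load. A quick sketch, again against the hypothetical customers.csv:

```python
import pandas as pd

# nrows=0 parses only the header row, so this costs almost nothing
header_only = pd.read_csv('customers.csv', nrows=0)
print(list(header_only.columns))

# Then load only the columns you decided you need
df = pd.read_csv('customers.csv', usecols=['customer_id', 'age', 'purchase_amount'])
```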
# 3. Optimize Data Types
By default, Pandas might use more memory than necessary. A column of integers might be stored as 64-bit when 8-bit would work fine.
For instance, if you are loading a dataset with product ratings (1-5 stars) and user IDs:
```python
import pandas as pd

# First, let's see the default memory usage
df = pd.read_csv('ratings.csv')
print("Default memory usage:")
print(df.memory_usage(deep=True))

# Now optimize the data types
df['rating'] = df['rating'].astype('int8')     # Ratings are 1-5, so int8 is enough
df['user_id'] = df['user_id'].astype('int32')  # Assuming user IDs fit in int32

print("\nOptimized memory usage:")
print(df.memory_usage(deep=True))
```
By converting the rating column from the default int64 (8 bytes per value) to int8 (1 byte per value), we achieve an 8x memory reduction for that column.
Common conversions include:
- `int64` → `int8`, `int16`, or `int32` (depending on the range of numbers)
- `float64` → `float32` (if you do not need extreme precision)
- `object` → `category` (for columns with repeated values)
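Below is a small, generic helper that applies these conversions using pandas' built-in downcasting. The name `shrink_dataframe` and the 50% cardinality threshold are illustrative choices, not part of the original example, so adjust them to your data:

```python
import pandas as pd

def shrink_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast numeric columns and convert low-cardinality text to category."""
    for col in df.columns:
        col_type = df[col].dtype
        if pd.api.types.is_integer_dtype(col_type):
            df[col] = pd.to_numeric(df[col], downcast='integer')
        elif pd.api.types.is_float_dtype(col_type):
            df[col] = pd.to_numeric(df[col], downcast='float')
        elif col_type == object and df[col].nunique() < 0.5 * len(df):
            # Only convert text columns that repeat values heavily
            df[col] = df[col].astype('category')
    return df

df = shrink_dataframe(pd.read_csv('ratings.csv'))
print(df.memory_usage(deep=True))
```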
# 4. Use Categorical Data Types
When a column contains repeated text values (like country names or product categories), Pandas stores each value separately. The category dtype stores the unique values once and uses efficient codes to reference them.
Suppose you are working with a product inventory file where the category column has only 20 unique values, but they repeat across all rows in the dataset:
```python
import pandas as pd

df = pd.read_csv('products.csv')

# Check memory before conversion
print(f"Before: {df['category'].memory_usage(deep=True) / 1024**2:.2f} MB")

# Convert to category
df['category'] = df['category'].astype('category')

# Check memory after conversion
print(f"After: {df['category'].memory_usage(deep=True) / 1024**2:.2f} MB")

# It still works like normal text
print(df['category'].value_counts())
```
This conversion can substantially reduce memory usage for columns with low cardinality (few unique values). The column still functions similarly to standard text data: you can filter, group, and sort as usual.
When to use this: For any text column where values repeat frequently (categories, states, countries, departments, and the like).
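If you know a column is categorical before loading, you can also declare the dtype while reading, so the full-size text column never lands in memory in the first place. A minimal sketch with the same products.csv:

```python
import pandas as pd

# Declaring the dtype up front means the raw strings are never held in full
df = pd.read_csv('products.csv', dtype={'category': 'category'})

print(df['category'].dtype)  # category
print(df.memory_usage(deep=True))
```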
# 5. Filter While Reading
Sometimes you know you only need a subset of rows. Instead of loading everything and then filtering, you can filter during the load process.
For example, if you only care about transactions from the year 2024:
```python
import pandas as pd

# Read in chunks and filter
chunk_size = 100000
filtered_chunks = []

for chunk in pd.read_csv('transactions.csv', chunksize=chunk_size):
    # Filter each chunk before storing it
    filtered = chunk[chunk['year'] == 2024]
    filtered_chunks.append(filtered)

# Combine the filtered chunks
df_2024 = pd.concat(filtered_chunks, ignore_index=True)
print(f"Loaded {len(df_2024)} rows from 2024")
```
We are combining chunking with filtering. Each chunk is filtered before being added to our list, so we never hold the full dataset in memory, only the rows we actually want.
When to use this: When you need only a subset of rows based on some condition.
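These techniques stack. The sketch below reads only the columns it needs and filters each chunk as it goes; the `amount` column is assumed for illustration, so swap in whatever columns your file actually has:

```python
import pandas as pd

# 'year' comes from the example above; 'amount' is an assumed column name
filtered_chunks = []
for chunk in pd.read_csv('transactions.csv',
                         usecols=['year', 'amount'],
                         chunksize=100_000):
    filtered_chunks.append(chunk[chunk['year'] == 2024])

df_2024 = pd.concat(filtered_chunks, ignore_index=True)
print(f"Loaded {len(df_2024)} rows with {df_2024.shape[1]} columns")
```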
# 6. Use Dask for Parallel Processing
For datasets that are truly massive, Dask provides a Pandas-like API but handles all the chunking and parallel processing automatically.
Here is how you would calculate the average of a column across a huge dataset:
```python
import dask.dataframe as dd

# Read with Dask (it handles chunking automatically)
df = dd.read_csv('huge_dataset.csv')

# Operations look just like pandas
result = df['sales'].mean()

# Dask is lazy - compute() actually executes the calculation
average_sales = result.compute()
print(f"Average Sales: ${average_sales:,.2f}")
```
Dask does not load the entire file into memory. Instead, it creates a plan for how to process the data in chunks and executes that plan when you call .compute(). It can even use multiple CPU cores to speed up computation.
When to use this: When your dataset is too large for Pandas, even with chunking, or when you want parallel processing without writing complex code.
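Grouped aggregations follow the same lazy pattern. Here is a brief sketch in which `region` is an assumed column name and `blocksize` simply controls how much of the file goes into each partition:

```python
import dask.dataframe as dd

# blocksize sets the partition size; 'region' is an assumed column name
df = dd.read_csv('huge_dataset.csv', blocksize='64MB')

# Nothing is read until compute() is called
sales_by_region = df.groupby('region')['sales'].sum().compute()
print(sales_by_region)
```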
# 7. Sample Your Data for Exploration
When you are just exploring or testing code, you do not need the full dataset. Load a sample first.
Suppose you are building a machine learning model and want to test your preprocessing pipeline. You can sample your dataset as shown:
```python
import random

import pandas as pd

# Read just the first 50,000 rows
df_sample = pd.read_csv('huge_dataset.csv', nrows=50000)

# Or read a random sample using skiprows
skip_rows = lambda x: x > 0 and random.random() > 0.01  # Keep ~1% of rows
df_random_sample = pd.read_csv('huge_dataset.csv', skiprows=skip_rows)

print(f"Sample size: {len(df_random_sample)} rows")
```
The first approach loads the first N rows, which is suitable for rapid exploration. The second approach randomly samples rows throughout the file, which is better for statistical analysis or when the file is sorted in a way that makes the top rows unrepresentative.
When to use this: During development, testing, or exploratory analysis before running your code on the full dataset.
# Conclusion
Handling large datasets does not require expert-level skills. Here is a quick summary of techniques we have discussed:
| Technique | When to use it |
|---|---|
| Chunking | For aggregations, filtering, and processing data you cannot fit in RAM. |
| Column selection | When you need only a few columns from a wide dataset. |
| Data type optimization | Always; do this after loading to save memory. |
| Categorical types | For text columns with repeated values (categories, states, etc.). |
| Filter while reading | When you need only a subset of rows. |
| Dask | For very large datasets or when you want parallel processing. |
| Sampling | During development and exploration. |
The first step is knowing both your data and your task. Most of the time, a combination of chunking and smart column selection will get you 90% of the way there.
As your needs grow, move to more advanced tools like Dask or consider converting your data to more efficient file formats like Parquet or HDF5.
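For example, a one-time conversion to Parquet lets later reads pull in only the columns they need. The rough sketch below assumes the CSV fits in memory for the conversion (for larger files, the conversion itself can be done in chunks or with Dask) and that pyarrow or fastparquet is installed:

```python
import pandas as pd

# One-time conversion to Parquet (requires pyarrow or fastparquet)
df = pd.read_csv('large_sales_data.csv')
df.to_parquet('large_sales_data.parquet')

# Later reads can load just the columns they need, much faster than CSV
revenue = pd.read_parquet('large_sales_data.parquet', columns=['revenue'])
print(revenue.head())
```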
Now go ahead and start working with those massive datasets. Happy analyzing!
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.

