Data Handling
When working with large datasets that exceed the memory limits of local machines or platforms like Google Colab, you can adopt the following strategies to handle and process your data efficiently:
Switch to cloud-based platforms that are specifically designed to handle large datasets. Some options are:
a) Google Cloud Platform (GCP)
BigQuery: Upload the data to Google BigQuery and perform SQL-based analysis directly in the cloud.
Cloud Storage: Store your large files in buckets and use the pandas-gbq or google-cloud-bigquery Python libraries for analysis (see the sketch after this list).
b) Amazon Web Services (AWS)
S3 for Storage: Store your large files in S3 buckets.
Athena: Query the data directly in S3 using Athena, which allows SQL-based querying without moving the data.
c) Azure
Data Lake: Use Azure Data Lake for efficient storage and querying.
Azure Synapse Analytics: Process large datasets using distributed computing.
d) Databricks
Use Databricks, which supports distributed processing with Apache Spark, for efficient handling of big data.
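As a minimal sketch of option (a), pandas-gbq can pull BigQuery query results straight into a DataFrame; the project, dataset, and table names below are placeholders, not part of the original datasets:
import pandas_gbq
# Placeholder project and table names for illustration
query = "SELECT courier_id, COUNT(*) AS deliveries FROM `my-project.logistics.deliveries` GROUP BY courier_id"
df = pandas_gbq.read_gbq(query, project_id='my-project')
print(df.head())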
If you must work locally or on Colab, process the data in chunks to avoid memory overflows.
Example for Chunking CSV Files:
import pandas as pd
# Process the data in chunks
chunk_size = 1_000_000 # Number of rows to process at a time
file_path = 'large_file.csv'
# Iterate through chunks
for chunk in pd.read_csv(file_path, chunksize=chunk_size):
    # Perform your processing here
    print(chunk.shape)
Benefits:
Reduces memory usage.
Processes one part of the data at a time.
Frameworks like Apache Spark or Dask can efficiently handle large datasets by splitting the workload across multiple nodes or threads.
a) Dask:
Install Dask and use it as a near drop-in replacement for pandas:
import dask.dataframe as dd
# Load the data using Dask
df = dd.read_csv('large_file.csv')
# Perform processing (e.g., filtering, grouping)
result = df[df['column_name'] > 1000].compute()
b) PySpark:
Use PySpark for distributed processing:
from pyspark.sql import SparkSession
# Initialize Spark session
spark = SparkSession.builder.appName('Large Data Processing').getOrCreate()
# Load the data
df = spark.read.csv('large_file.csv', header=True, inferSchema=True)
# Perform operations
df_filtered = df.filter(df['column_name'] > 1000)
df_filtered.show()
Use more efficient file formats, such as Parquet or ORC, for faster reading and reduced storage size.
a) Convert CSV to Parquet:
import pandas as pd
# Read CSV
df = pd.read_csv('large_file.csv')
# Save as Parquet
df.to_parquet('large_file.parquet')
b) Load Parquet with Pandas or Dask:
import pandas as pd
df = pd.read_parquet('large_file.parquet')
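With Dask, the same Parquet file can be read lazily; a minimal sketch, assuming the file produced above:
import dask.dataframe as dd
# Lazy read: nothing is loaded until a result is computed
df = dd.read_parquet('large_file.parquet')
print(df.head())  # head() triggers computation on a small slice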
Split Files: Divide large files into smaller chunks before loading, e.g., using shell commands or Python (a sketch follows below).
Sampling: Load only a sample of the data for initial analysis:
df = pd.read_csv('large_file.csv', nrows=10_000) # Load only 10k rows
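A minimal Python sketch of the file-splitting approach, with illustrative output file names; each chunk of the source CSV is written to its own smaller file:
import pandas as pd
# Stream the large CSV and write each ~1M-row chunk to a separate file
for i, chunk in enumerate(pd.read_csv('large_file.csv', chunksize=1_000_000)):
    chunk.to_csv(f'large_file_part_{i}.csv', index=False)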
Store your large datasets in a relational database (e.g., MySQL, PostgreSQL) or NoSQL database (e.g., MongoDB, Cassandra). Query only the necessary portions of the data using SQL or the database's query language.
Example for MySQL:
import pymysql
import pandas as pd
# Connect to the database
connection = pymysql.connect(host='localhost', user='user', password='password', database='database_name')
# Query only the rows you need
query = "SELECT * FROM large_table LIMIT 1000"
df = pd.read_sql(query, connection)
connection.close()
Upload the large files to Google Drive or Google Cloud Storage.
Use the following code to load data from Google Drive:
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')
# Load the data from Drive
file_path = '/content/drive/My Drive/large_file.csv'
df = pd.read_csv(file_path)
Compress large files using tools like gzip or bz2 and load them directly:
import pandas as pd
# Load a compressed file
df = pd.read_csv('large_file.csv.gz', compression='gzip')
Build an ETL (Extract, Transform, Load) pipeline (a minimal sketch follows this list) to:
Load the data in chunks.
Perform preprocessing and feature engineering incrementally.
Store processed data in a database or an optimized file format.
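A minimal sketch of such a pipeline, using illustrative file names and a simple dropna() as a stand-in transformation step:
import os
import pandas as pd
os.makedirs('processed', exist_ok=True)
# Extract in chunks, transform incrementally, load into an optimized format
for i, chunk in enumerate(pd.read_csv('large_file.csv', chunksize=1_000_000)):
    chunk = chunk.dropna()                           # example preprocessing step
    chunk.to_parquet(f'processed/part_{i}.parquet')  # Parquet keeps storage compact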
If working with the road, trajectory, pickup, and delivery datasets together, partition the data by date, region, or courier_id to reduce the working set size during analysis.
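As a sketch, assuming the pickup data has region and pickup_time columns (the file and column names here are assumptions for illustration), writing partitioned Parquet lets later analysis load only the relevant slices:
import pandas as pd
df = pd.read_csv('pickup.csv', parse_dates=['pickup_time'])
df['date'] = df['pickup_time'].dt.date
# Each (region, date) combination goes into its own directory,
# so downstream queries read only the partitions they need
df.to_parquet('pickup_partitioned', partition_cols=['region', 'date'])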
Decide whether to use a cloud-based solution or process locally with chunking or Dask/Spark.
Optimize the data format (e.g., Parquet) and compress files where possible.
Use databases or external storage systems if the dataset needs persistent querying.
Start with exploratory data analysis (EDA) on smaller samples to understand the data structure and refine your approach.