CSV ETL Automation: Extract, Transform, and Load Data Like a Pro

Published: October 2, 2025

ETL — Extract, Transform, Load — is the backbone of every data pipeline. And CSV files are where most ETL pipelines start or end. Whether you are pulling data from APIs, cleaning messy exports, or loading records into a database, mastering CSV ETL automation saves hours of manual work every week.

What Is CSV ETL?

ETL applied to CSV files means:

  • Extract: Get CSV data from sources — APIs, databases, email attachments, FTP servers, web scraping, or manual exports
  • Transform: Clean, reshape, enrich, validate, and standardize the data
  • Load: Push the processed data to a destination — database, data warehouse, another CSV, spreadsheet, or API

The goal is turning raw, messy CSV files into clean, reliable data that feeds your analytics, reports, or applications.
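In its smallest form, the three stages are just a read, a cleanup, and a write. A minimal pandas sketch (hypothetical file names and columns):

```python
import pandas as pd

# Hypothetical raw export containing a duplicate row and a fully empty row
pd.DataFrame({'id': [1, 1, None], 'amount': [10, 10, None]}).to_csv('raw.csv', index=False)

# EXTRACT
df = pd.read_csv('raw.csv')

# TRANSFORM: drop exact duplicates and rows that are entirely empty
df = df.drop_duplicates().dropna(how='all')

# LOAD
df.to_csv('clean.csv', index=False)
```

Everything else in this post is an elaboration of these three steps.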

Extract: Getting CSV Data from Anywhere

From APIs

Many services offer CSV exports via API:

```python
import requests

# Download CSV from an API endpoint
response = requests.get(
    'https://api.example.com/reports/sales',
    headers={'Authorization': 'Bearer YOUR_TOKEN'},
    params={'format': 'csv', 'date_from': '2024-01-01'}
)

with open('sales_export.csv', 'w') as f:
    f.write(response.text)
```

From Databases

Export query results directly to CSV:

```python
import duckdb

# Query PostgreSQL and export to CSV
con = duckdb.connect()
con.execute("""
    INSTALL postgres; LOAD postgres;
    COPY (
        SELECT * FROM postgres_scan('dbname=mydb', 'public', 'orders')
        WHERE order_date >= '2024-01-01'
    ) TO 'orders_export.csv' (HEADER, DELIMITER ',')
""")
```

From Email Attachments

```python
import imaplib
import email

mail = imaplib.IMAP4_SSL('imap.gmail.com')
mail.login('user@gmail.com', 'app_password')
mail.select('inbox')

_, messages = mail.search(None, 'SUBJECT "Daily Report" UNSEEN')
for num in messages[0].split():
    _, data = mail.fetch(num, '(RFC822)')
    msg = email.message_from_bytes(data[0][1])
    for part in msg.walk():
        if part.get_filename() and part.get_filename().endswith('.csv'):
            with open(f'incoming/{part.get_filename()}', 'wb') as f:
                f.write(part.get_payload(decode=True))
```

From Cloud Storage

```python
import boto3

s3 = boto3.client('s3')
s3.download_file('my-bucket', 'exports/daily-report.csv', 'daily-report.csv')
```

Transform: Cleaning and Reshaping CSV Data

Transformation is where most of the work happens. Here are the most common operations:

Basic Cleaning with pandas

```python
import pandas as pd

df = pd.read_csv('raw_data.csv', encoding='utf-8')

# Remove duplicate rows
df = df.drop_duplicates()

# Strip whitespace from string columns
str_cols = df.select_dtypes(include='object').columns
df[str_cols] = df[str_cols].apply(lambda x: x.str.strip())

# Standardize date formats
df['date'] = pd.to_datetime(df['date'], format='mixed')

# Fill missing values
df['category'] = df['category'].fillna('Uncategorized')

# Remove rows with negative amounts
df = df[df['amount'] >= 0]

df.to_csv('cleaned_data.csv', index=False)
```

Advanced Transformation with DuckDB

For larger files or SQL-preferred workflows, DuckDB transforms CSV data efficiently:

```sql
-- Aggregate, filter, and reshape in one query
COPY (
    SELECT
        date_trunc('month', date) AS month,
        region,
        product_category,
        COUNT(*) AS order_count,
        SUM(amount) AS total_revenue,
        AVG(amount) AS avg_order_value
    FROM read_csv_auto('raw_orders.csv')
    WHERE date >= '2024-01-01'
      AND status != 'cancelled'
    GROUP BY 1, 2, 3
    ORDER BY 1, 4 DESC
) TO 'monthly_summary.csv' (HEADER)
```

Command-Line Transformation with csvkit

csvkit is a suite of command-line tools for CSV manipulation:

```bash
# Preview structure
csvstat sales.csv

# Filter rows
csvgrep -c region -m "North" sales.csv > north_sales.csv

# Select specific columns
csvcut -c date,product,revenue sales.csv > slim_sales.csv

# Sort by column
csvsort -c revenue -r sales.csv > sorted_sales.csv

# Join two CSV files
csvjoin -c product_id sales.csv products.csv > enriched_sales.csv

# Convert to JSON
csvjson sales.csv > sales.json
```

Data Validation

Validate before loading to catch problems early:

```python
import pandas as pd

def validate_csv(filepath):
    df = pd.read_csv(filepath)
    errors = []

    # Check required columns
    required = ['order_id', 'date', 'amount']
    missing = [col for col in required if col not in df.columns]
    if missing:
        errors.append(f"Missing columns: {missing}")

    # Check for nulls in required fields
    for col in required:
        if col in df.columns:
            null_count = df[col].isnull().sum()
            if null_count > 0:
                errors.append(f"{col}: {null_count} null values")

    # Check data types
    if 'amount' in df.columns:
        non_numeric = pd.to_numeric(df['amount'], errors='coerce').isnull().sum()
        if non_numeric > 0:
            errors.append(f"amount: {non_numeric} non-numeric values")

    return errors
```
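Validation does not have to be all-or-nothing: a common pattern is to quarantine bad rows and load the rest. A sketch with hypothetical rows and column names:

```python
import pandas as pd

# Hypothetical raw rows, one with a non-numeric amount
df = pd.DataFrame({'order_id': [1, 2, 3],
                   'amount': ['10.5', 'oops', '20']})

# Coerce to numeric; values that fail to parse become NaN
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')

bad = df[df['amount'].isnull()]
good = df[df['amount'].notnull()]

bad.to_csv('quarantine.csv', index=False)   # set aside for review
good.to_csv('to_load.csv', index=False)     # safe to load
```

This way a handful of malformed records does not block the rest of the day's data.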

Load: Delivering Processed Data

To a Database

```python
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:pass@localhost/mydb')
df.to_sql('orders', engine, if_exists='append', index=False)
```
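For large DataFrames, `to_sql` can write in batches via its `chunksize` parameter rather than issuing one enormous insert. A sketch using an in-memory SQLite engine as a stand-in for the PostgreSQL URL above:

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite stands in for the real database connection
engine = create_engine('sqlite://')
df = pd.DataFrame({'order_id': [1, 2, 3], 'amount': [10.0, 20.0, 30.0]})

# chunksize batches the INSERT statements; if_exists='append' keeps existing rows
df.to_sql('orders', engine, if_exists='append', index=False, chunksize=1000)

count = pd.read_sql('SELECT COUNT(*) AS n FROM orders', engine)['n'][0]
```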

To Google Sheets

```python
import gspread

gc = gspread.service_account(filename='credentials.json')
sh = gc.open('Monthly Report')
worksheet = sh.sheet1
worksheet.update([df.columns.values.tolist()] + df.values.tolist())
```

To Another CSV (with different structure)

```python
# Reshape for the target system's expected format
output = df[['order_id', 'date', 'total']].copy()
output.columns = ['OrderNumber', 'OrderDate', 'Amount']  # Rename for target schema
output['OrderDate'] = output['OrderDate'].dt.strftime('%m/%d/%Y')  # Reformat dates
output.to_csv('for_import.csv', index=False)
```

Building a Complete ETL Pipeline

Here is a production-ready pipeline script:

```python
import pandas as pd
import logging
from pathlib import Path
from datetime import datetime

logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)

def run_pipeline():
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')

    # EXTRACT
    log.info("Extracting data...")
    raw_path = Path('incoming/daily_export.csv')
    if not raw_path.exists():
        log.error("Source file not found")
        return False
    df = pd.read_csv(raw_path, encoding='utf-8')
    log.info(f"Extracted {len(df)} rows")

    # TRANSFORM
    log.info("Transforming...")
    initial_count = len(df)
    df = df.drop_duplicates()
    df = df.dropna(subset=['order_id', 'amount'])
    df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
    df = df[df['amount'] > 0]
    log.info(f"Kept {len(df)}/{initial_count} rows after cleaning")

    # LOAD
    log.info("Loading...")
    output_path = f'processed/clean_{timestamp}.csv'
    df.to_csv(output_path, index=False)
    log.info(f"Saved to {output_path}")

    # ARCHIVE
    raw_path.rename(f'archive/daily_export_{timestamp}.csv')
    log.info("Pipeline complete")
    return True

if __name__ == '__main__':
    run_pipeline()
```

Quick Tools for Manual ETL Steps

Not every step needs code. Use browser-based tools for the parts between automated processes:

  • Inspect source files: Open in CSV Viewer to verify structure before building your pipeline
  • Convert formats: Use the Excel ↔ CSV converter when your source delivers Excel files but your pipeline expects CSV
  • Generate test data: Build sample CSV files with the CSV Creator to test your pipeline without using production data
  • Validate output visually: Check your pipeline's output in CSV Charts to catch issues that row counts alone won't reveal
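Test fixtures can also be generated in code. A sketch that writes a reproducible 100-row sample file using only the standard library (hypothetical columns):

```python
import csv
import random

random.seed(42)  # fixed seed so the fixture is reproducible
with open('test_orders.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['order_id', 'date', 'amount'])
    for i in range(100):
        writer.writerow([1000 + i,
                         f'2024-01-{random.randint(1, 28):02d}',
                         round(random.uniform(5, 500), 2)])
```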

Scheduling Your Pipeline

cron (Linux/macOS)

```bash
# Run ETL pipeline every day at 6 AM
0 6 * * * cd /home/user/etl && python pipeline.py >> /var/log/etl.log 2>&1
```

GitHub Actions

```yaml
name: Daily ETL

on:
  schedule:
    - cron: '0 6 * * *'

jobs:
  etl:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install pandas
      - run: python pipeline.py
```

Conclusion

CSV ETL automation is not just for data engineers. With pandas for transformation, DuckDB for SQL-based processing, csvkit for command-line operations, and browser tools like CSV Viewer for inspection, anyone who works with data regularly can build reliable pipelines. Start with a single script that automates your most repetitive workflow, then expand from there.