pandas DataFrames, backed by NumPy and visualized with Matplotlib or Seaborn, give you a direct path from raw records to trustworthy insights without manual, error-prone steps. The approach below prioritizes a single, reproducible pipeline you can re-run on new data and share with teammates.
Prerequisites
- Python 3.9+ installed via Anaconda or your system package manager.
- JupyterLab or Jupyter Notebook for iterative analysis.
- Libraries: pandas, numpy, matplotlib, seaborn, scikit-learn; plus drivers for any external sources you use (for example, pyodbc or psycopg2).
Method 1 — Build a Reproducible pandas Pipeline (Jupyter)
Step 1: Create and activate a dedicated Python environment.
conda create -n data-pipeline python=3.11 -y
conda activate data-pipeline
# or with pip + venv:
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
Step 2: Install core analysis libraries.
python -m pip install pandas numpy matplotlib seaborn scikit-learn pyarrow openpyxl lxml
Step 3: Start JupyterLab or Notebook and create a new notebook.
jupyter lab
# Run cells with Shift+Enter, which executes the current cell and moves to the next.
Step 4: Load a CSV into a DataFrame with dtypes inferred.
import pandas as pd
df = pd.read_csv("data.csv").convert_dtypes()
df.head()
Step 5: Standardize column names to snake_case for consistent code.
df.columns = (
df.columns
.str.strip()
.str.lower()
.str.replace(r"[^0-9a-zA-Z]+", "_", regex=True)
.str.replace(r"_+", "_", regex=True)
.str.strip("_")  # drop stray leading and trailing underscores
)
Step 6: Inspect types and missing values to plan fixes.
df.info()
df.isna().sum()
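To prioritize which columns to fix first, the share of missing values per column is often more useful than the raw counts; a quick sketch:
# Percentage of missing values per column, largest first
(df.isna().mean() * 100).round(1).sort_values(ascending=False)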
Step 7: Patch known missing values without overwriting good rows.
# Example: patch specific cells by index; combine_first only fills NA cells in df, leaving existing values intact
patch = pd.DataFrame({"score_a": {42: 7.1}, "score_b": {42: 6.8}})
df = df.combine_first(patch)
Step 8: Clean currency strings and cast to numeric in one chain.
df = df.assign(
revenue_usd=lambda d: d["revenue_usd"]
.astype("string")
.str.replace(r"[$,]", "", regex=True)
.astype("Float64"),
budget_usd=lambda d: d["budget_usd"]
.astype("string")
.str.replace(r"[$,]", "", regex=True)
.astype("Float64"),
)
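If more currency columns turn up later, the repeated string-cleaning logic can be factored into a small helper; a minimal sketch, equivalent to the chain above and assuming every column uses the same $-and-comma formatting:
def clean_currency(s):
    """Strip $ and thousands separators, then cast to nullable float."""
    return s.astype("string").str.replace(r"[$,]", "", regex=True).astype("Float64")

df = df.assign(
    revenue_usd=lambda d: clean_currency(d["revenue_usd"]),
    budget_usd=lambda d: clean_currency(d["budget_usd"]),
)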
Step 9: Convert duration text like “130 mins” to integer minutes.
df = df.assign(
duration_min=lambda d: d["duration_min"]
.astype("string")
.str.replace(" mins", "", regex=False)
.astype("Int64")
)
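If the raw text is less uniform than "130 mins" (say, "130 min" or trailing spaces), extracting the first run of digits is a more forgiving alternative to the literal replace above:
df = df.assign(
    duration_min=lambda d: d["duration_min"]
    .astype("string")
    .str.extract(r"(\d+)", expand=False)  # grab the first run of digits
    .astype("Int64")
)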
Step 10: Parse human-readable dates to a proper datetime column.
df = df.assign(
release_date=lambda d: pd.to_datetime(d["release_date"], format="%B, %Y")
)
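Dates that do not match the expected format will raise an error with this call; passing errors="coerce" instead turns them into NaT so they can be counted and reviewed:
df = df.assign(
    release_date=lambda d: pd.to_datetime(d["release_date"], format="%B, %Y", errors="coerce")
)
df["release_date"].isna().sum()  # how many dates failed to parse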
Step 11: Derive useful features (for example, release year) for grouping.
df = df.assign(release_year=lambda d: d["release_date"].dt.year.astype("Int64"))
Step 12: Fix typos and inconsistent categories for reliable grouping.
df = df.assign(
lead_actor=lambda d: d["lead_actor"]
.str.replace(r"^Shawn", "Sean", regex=True)
.str.replace("MOORE", "Moore"),
car_brand=lambda d: d["car_brand"].str.replace("Astin", "Aston"),
)
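Before hard-coding replacements like these, it helps to list the distinct values so every variant gets caught; for example:
df["lead_actor"].value_counts(dropna=False)  # surfaces variants such as "Shawn" vs "Sean"
df["car_brand"].value_counts(dropna=False)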
Step 13: Identify improbable outliers with quick stats.
df[["duration_min", "martinis"]].describe()
Step 14: Correct verified bad values to realistic numbers.
df = df.assign(
duration_min=lambda d: d["duration_min"].replace({1200: 120}),
martinis=lambda d: d["martinis"].replace({-6: 6})
)
Step 15: Remove duplicate rows and reindex for a clean dataset.
df = df.drop_duplicates(ignore_index=True)
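If only some columns define a record's identity, dedup on that subset instead of whole rows; a sketch assuming a hypothetical title key column alongside the release_year derived earlier:
df = df.drop_duplicates(subset=["title", "release_year"], keep="first", ignore_index=True)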
Step 16: Persist the cleansed dataset for future analysis.
# Parquet keeps types and compresses well
df.to_parquet("clean.parquet", index=False)
# CSV is broadly compatible
df.to_csv("clean.csv", index=False)
Why this method first: a single, chained pipeline reduces manual steps, prevents hidden Excel edits, and lets teammates reproduce results on demand. It also scales better than spreadsheets when your dataset grows beyond a few hundred thousand rows.
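To make that reproducibility concrete, the steps above can be wrapped in a single function that takes a raw frame and returns a clean one, so a re-run on new data is one call; a condensed sketch (only one cleaning step shown):
def clean(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps above in one chained, re-runnable pass."""
    return (
        raw.convert_dtypes()
        .assign(
            revenue_usd=lambda d: d["revenue_usd"]
            .astype("string")
            .str.replace(r"[$,]", "", regex=True)
            .astype("Float64"),
        )
        .drop_duplicates(ignore_index=True)
    )

df = clean(pd.read_csv("data.csv"))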
Method 2 — Query Data Directly from a Database Into pandas
Step 1: Install a DB driver and SQL toolkit for your engine.
# SQL Server example
python -m pip install pyodbc sqlalchemy
# PostgreSQL example
python -m pip install psycopg2-binary sqlalchemy
Step 2: Create a SQLAlchemy engine with a secure connection string.
from sqlalchemy import create_engine
# Example for SQL Server with ODBC Driver 17:
engine = create_engine(
"mssql+pyodbc://username:password@SERVER/DB?driver=ODBC+Driver+17+for+SQL+Server",
fast_executemany=True
)
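Hard-coding credentials in the URL is risky; reading them from environment variables and letting SQLAlchemy assemble the URL keeps secrets out of notebooks. A sketch assuming DB_USER and DB_PASS are set in the environment:
import os
from sqlalchemy.engine import URL

url = URL.create(
    "mssql+pyodbc",
    username=os.environ["DB_USER"],
    password=os.environ["DB_PASS"],  # never committed to source control
    host="SERVER",
    database="DB",
    query={"driver": "ODBC Driver 17 for SQL Server"},
)
engine = create_engine(url, fast_executemany=True)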
Step 3: Pull data straight into pandas without exporting CSVs.
import pandas as pd
sql = "SELECT col_a, col_b, created_at FROM schema.table WHERE created_at >= '2024-01-01';"
df = pd.read_sql(sql, engine).convert_dtypes()
Step 4: Parameterize queries to avoid SQL injection risks.
from sqlalchemy import text
stmt = text("SELECT * FROM sales WHERE region = :region AND dt >= :start")
df = pd.read_sql(stmt, engine, params={"region": "EMEA", "start": "2025-01-01"})
Step 5: Close connections after use or rely on context managers.
engine.dispose()
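A context manager scopes a connection to one block and releases it automatically, even if the query fails; for example:
with engine.connect() as conn:
    df = pd.read_sql(text("SELECT * FROM sales"), conn).convert_dtypes()
# connection is returned to the pool here; call engine.dispose() at shutdown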
This approach eliminates manual exports, supports scheduled jobs, and keeps source-of-truth logic in SQL where appropriate. It’s ideal when your data already lives in relational systems used by BI tools.
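Keeping logic in SQL can also mean pulling pre-summarized results instead of raw rows, which is lighter on the network and matches what BI tools see; a sketch against the hypothetical sales table above:
summary_sql = text("""
    SELECT region, SUM(amount) AS total_amount, COUNT(*) AS orders
    FROM sales
    WHERE dt >= :start
    GROUP BY region
""")
summary = pd.read_sql(summary_sql, engine, params={"start": "2025-01-01"})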
Method 3 — Read From Common File and Web Sources
Step 1: Load columnar Parquet files for speed and preserved dtypes.
df = pd.read_parquet("data.parquet").convert_dtypes()
Step 2: Import Excel sheets when teams share .xlsx files.
df = pd.read_excel("workbook.xlsx", sheet_name="Sheet1").convert_dtypes()
Step 3: Read JSON documents into tidy tables.
df = pd.read_json("records.json").convert_dtypes()
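Nested JSON (records containing dictionaries or lists) usually needs flattening before it behaves like a table; pandas ships json_normalize for this. A sketch assuming records.json holds a list of objects with nested fields:
import json
with open("records.json") as f:
    records = json.load(f)
df = pd.json_normalize(records, sep="_")  # nested keys become e.g. user_name, user_city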
Step 4: Scrape simple HTML tables when an API is unavailable.
tables = pd.read_html("https://example.com/tables-page")
df = tables[0].convert_dtypes()
Tip: Prefer Parquet for intermediate storage because it compresses and retains types; if you must share with tools lacking Parquet support, export CSV as a fallback.
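To see why the tip matters, compare dtypes after a round-trip: Parquet preserves the nullable types built earlier, while CSV forces pandas to re-infer everything. A quick check:
df.to_parquet("tmp.parquet", index=False)
df.to_csv("tmp.csv", index=False)
pd.read_parquet("tmp.parquet").dtypes  # Int64 / Float64 / datetime survive
pd.read_csv("tmp.csv").dtypes          # re-inferred, often object / float64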
Method 4 — Analyze, Visualize, and Model
Step 1: Compute quick descriptive statistics to spot ranges and anomalies.
df.describe(numeric_only=True)
Step 2: Create a scatter plot to assess correlation between two metrics.
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.scatter(df["metric_x"], df["metric_y"], alpha=0.7)
ax.set_title("Metric Y vs Metric X")
ax.set_xlabel("Metric X")
ax.set_ylabel("Metric Y")
plt.show()
Step 3: Fit a simple linear regression and plot the best-fit line.
from sklearn.linear_model import LinearRegression
import numpy as np
X = df[["metric_x"]].to_numpy()
y = df["metric_y"].to_numpy()
model = LinearRegression().fit(X, y)
r2 = model.score(X, y)
y_pred = model.predict(X)
fig, ax = plt.subplots()
ax.scatter(X, y, alpha=0.7)
ax.plot(X, y_pred, color="red")
ax.set_title(f"Linear Fit (R²={r2:.2f})")
ax.set_xlabel("Metric X")
ax.set_ylabel("Metric Y")
plt.show()
Step 4: Inspect distributions with binned counts to see typical ranges.
counts = df["duration_min"].value_counts(bins=7).sort_index()
ax = counts.plot.bar(title="Duration Distribution", xlabel="Minutes (bins)", ylabel="Count")
plt.show()
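Seaborn, installed in the prerequisites but not yet used, wraps the same idea in one call with sensible styling; an equivalent sketch:
import seaborn as sns
ax = sns.histplot(df["duration_min"].dropna(), bins=7)
ax.set(title="Duration Distribution", xlabel="Minutes", ylabel="Count")
plt.show()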
Step 5: Group and aggregate to compare segments.
(df.groupby("release_year")["revenue_usd"]
.agg(["count", "mean", "sum"])
.sort_index())
Reading the plots: a visible upward trend in the scatter indicates a positive relationship, while a cloud with no slope suggests little to no linear relationship. Use R² to quantify how well the regression line explains variance in the data.
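A quick numeric companion to the scatter plot is the Pearson correlation, which pandas computes directly; values near +1 or -1 indicate a strong linear relationship:
# For simple linear regression, R² is the square of this coefficient
df["metric_x"].corr(df["metric_y"])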
Operational Tips and Cautions
- Automate recurring work by keeping everything in one notebook or script and running it on a schedule with your orchestrator of choice (see the sketch after this list).
- Use method chaining (for example, df.assign(...).drop_duplicates(...)) to keep code readable and to avoid accidental intermediate edits.
- Prefer typed integers (Int64) and floats (Float64) for math; strings won't sum or average correctly.
- Validate at each stage with df.info(), df.head(), and spot-checks to catch mistakes early.
- When datasets exceed Excel comfort limits, pandas typically loads, filters, and groups millions of rows faster and with fewer crashes.
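For the automation tip above, one pattern is to keep the pipeline in a plain script with an entry point, so any scheduler (cron, Task Scheduler, or an orchestrator) can invoke it; a minimal sketch, assuming a hypothetical pipeline.py that wraps the Method 1 steps:
# pipeline.py
import pandas as pd

def main() -> None:
    df = pd.read_csv("data.csv").convert_dtypes()
    # ... cleaning steps from Method 1 ...
    df.to_parquet("clean.parquet", index=False)

if __name__ == "__main__":
    main()  # lets a scheduler run: python pipeline.py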
By centering your work in a single pandas pipeline, you cut busywork, speed up repeat runs, and make results easier to audit and reuse. Add direct SQL reads when available, and keep Parquet snapshots to move data through your workflow quickly.