Dr Data Insight - M2: Healthcare Data Manipulation with Pandas & NumPy

Module 2: Healthcare Data Manipulation with Pandas & NumPy

In healthcare, raw data is rarely clean or immediately useful. It is messy, inconsistent, and often overwhelming. In this module, you'll learn to use two of Python's most widely used data libraries: Pandas for data manipulation and NumPy for efficient numerical operations. You'll clean real-world datasets, handle missing patient records, group data by diagnosis codes, and compute statistical summaries. By the end of this module, you'll understand how to structure data for analysis and begin to extract patterns that support better clinical and operational decisions.

Module Overview

This module focuses on using the Python libraries Pandas and NumPy to clean, organize, and transform healthcare data. Whether you're working with EHR exports, claims data, or lab results, these tools are essential for real-world healthcare analytics.

Learning Objectives

By the end of this module, learners will be able to:

Read, explore, and structure healthcare datasets using Pandas
Clean and standardize data (missing values, duplicates, data types)
Use grouping, filtering, and aggregation to uncover patterns
Apply NumPy to perform fast, vectorized calculations
Join multiple datasets to build complete patient views

2.1 Introduction to Pandas and NumPy

Key Concepts:

What are Pandas and NumPy?
How they are used in healthcare analytics

Code Example:

import pandas as pd

import numpy as np

Activity:

Install both libraries via pip or Anaconda
Import them into a Jupyter Notebook and confirm with pd.__version__ and np.__version__
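
As a quick check for the activity above, the following sketch prints the installed versions and shows one small use of each library. The blood pressure readings and the names `systolic`, `above_140`, and `vitals` are made up for illustration and are not part of the course datasets.

```python
import numpy as np
import pandas as pd

# Confirm that both libraries are installed and importable
print("pandas version:", pd.__version__)
print("NumPy version:", np.__version__)

# NumPy: a vectorized comparison over a whole array at once
# (hypothetical systolic readings in mmHg)
systolic = np.array([118, 135, 142, 160])
above_140 = systolic > 140  # element-wise, no explicit loop
print(above_140.sum(), "of", systolic.size, "readings are above 140")

# Pandas: the same values as a labeled table
vitals = pd.DataFrame({"patient_id": [1, 2, 3, 4], "systolic": systolic})
print(vitals)
```

If either import fails, the library is most likely not installed in the environment the notebook is using.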

2.2 Loading and Exploring Healthcare Data

Key Concepts:

Reading CSV, Excel, and JSON files
Exploring with .head(), .info(), .describe()

Code Example:

df = pd.read_csv("patient_data.csv")

print(df.head())

df.info()  # prints its summary directly and returns None, so no print() is needed

Practice:

Load a dataset of 1,000 patient records
Inspect column types, null values, and basic statistics
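
The practice steps above can be sketched roughly as follows. Because the contents of patient_data.csv are not shown here, this example reads a tiny in-memory CSV instead; the column names and values are assumptions chosen for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for patient_data.csv (columns are assumptions)
csv_text = """Patient ID,Age,Blood Pressure,Cholesterol,Diagnosis
1,34,120,180,Asthma
2,67,150,,Hypertension
3,,135,245,Diabetes
4,52,,210,Diabetes
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.head())      # first rows of the table
df.info()             # column types and non-null counts (prints directly)
print(df.describe())  # basic statistics for the numeric columns

# Count missing values in each column
null_counts = df.isnull().sum()
print(null_counts)
```

With a real file, replacing `io.StringIO(csv_text)` with the file path should give the same kind of output. Note that columns containing missing values are read as floating-point numbers, which is why Age may appear as float64.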

2.3 Data Cleaning and Standardization

Key Concepts:

Handling missing values (.isnull(), .fillna(), .dropna())
Removing duplicates
Changing column types
Renaming and reordering columns

Code Example:

df['Age'] = df['Age'].fillna(df['Age'].median())

df = df.drop_duplicates()

df['Admission Date'] = pd.to_datetime(df['Admission Date'])

Practice:

Clean a dataset with missing BP and cholesterol values
Remove duplicate entries
Convert all column names to lowercase and replace spaces with underscores
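
One possible way to approach the practice steps above is sketched below. The records are made up, and filling gaps with the column median is only one option; whether imputation is appropriate at all depends on the analysis and should be decided case by case.

```python
import numpy as np
import pandas as pd

# Hypothetical records: one exact duplicate row and two missing readings
df = pd.DataFrame({
    "Patient ID": [1, 2, 2, 3, 4],
    "Blood Pressure": [120.0, 140.0, 140.0, np.nan, 130.0],
    "Cholesterol": [180.0, 220.0, 220.0, 250.0, np.nan],
})

# Standardize column names: lowercase, underscores instead of spaces
df.columns = df.columns.str.lower().str.replace(" ", "_", regex=False)

# Remove exact duplicate rows and renumber the index
df = df.drop_duplicates().reset_index(drop=True)

# Fill missing readings with each column's median (an assumed strategy)
for col in ["blood_pressure", "cholesterol"]:
    df[col] = df[col].fillna(df[col].median())

print(df)
```

Renaming the columns first keeps the later steps free of quoted names with spaces, which tends to make the rest of a notebook easier to read.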

2.4 Filtering, Sorting, and Conditional Logic

Key Concepts:

Filtering rows by conditions
Combining multiple filters
Sorting data
Creating new calculated columns

Code Example:

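# Build the boolean mask once so it can be reused for filtering and flagging
high_risk_mask = (df['Blood Pressure'] > 140) | (df['Cholesterol'] > 240)

high_risk = df[high_risk_mask]

# np.where needs the boolean mask, not the filtered DataFrame
df['Risk Flag'] = np.where(high_risk_mask, 'High', 'Normal')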

Practice:

Filter patients older than 60 with high BP
Create a new column: BMI = weight_kg / (height_m ** 2)
Sort patients by admission date (newest to oldest)
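
The three practice tasks above could look something like this. The patients, the column names, and the 140 threshold for "high BP" are assumptions made for the sake of the example.

```python
import pandas as pd

# Hypothetical patients; column names and values are assumptions
df = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "age": [72, 45, 66, 61],
    "blood_pressure": [155, 150, 128, 142],
    "weight_kg": [80.0, 95.0, 60.0, 70.0],
    "height_m": [1.75, 1.80, 1.60, 1.70],
    "admission_date": pd.to_datetime(
        ["2024-01-15", "2024-03-02", "2024-02-10", "2024-03-20"]
    ),
})

# Combine conditions with & (and); each condition needs its own parentheses
older_high_bp = df[(df["age"] > 60) & (df["blood_pressure"] > 140)]

# New calculated column, computed for every row at once
df["bmi"] = df["weight_kg"] / (df["height_m"] ** 2)

# Newest admissions first
by_admission = df.sort_values("admission_date", ascending=False)

print(older_high_bp[["patient_id", "age", "blood_pressure"]])
print(by_admission[["patient_id", "admission_date", "bmi"]])
```

Use `&` when every condition must hold and `|` when any one of them is enough. Python's `and` and `or` keywords do not work element-wise on Pandas columns.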

2.5 Grouping and Aggregation

Key Concepts:

Grouping by categorical variables (e.g., Diagnosis)
Aggregation: .mean(), .sum(), .count(), .agg()
Multi-level grouping (e.g., by Hospital and Department)

Code Example:

cost_by_diagnosis = df.groupby('Diagnosis')['Cost'].mean()

outcomes_by_age = df.groupby('Age Group')['Outcome'].value_counts()

Practice:

Group by Diagnosis and find the average treatment cost
Group by Age Group and count outcomes
Create a summary table of treatment costs per provider per month
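
A minimal sketch of `.agg()`, multi-level grouping, and a provider-by-month summary table follows. The claims data, provider labels, and column names are invented for illustration, and summing cost per month is just one reasonable reading of "summary table".

```python
import pandas as pd

# Hypothetical claims; providers, dates, and costs are made up
df = pd.DataFrame({
    "provider": ["A", "A", "B", "B", "A"],
    "diagnosis": ["Diabetes", "Asthma", "Diabetes", "Diabetes", "Asthma"],
    "treatment_date": pd.to_datetime(
        ["2024-01-05", "2024-01-20", "2024-01-11", "2024-02-03", "2024-02-14"]
    ),
    "cost": [1200.0, 300.0, 1500.0, 1100.0, 500.0],
})

# Several aggregations at once with .agg()
cost_stats = df.groupby("diagnosis")["cost"].agg(["mean", "sum", "count"])

# Multi-level grouping: total cost per provider per calendar month
df["month"] = df["treatment_date"].dt.strftime("%Y-%m")
monthly = df.groupby(["provider", "month"])["cost"].sum()

# Reshape into a table: one row per provider, one column per month
summary = monthly.unstack(fill_value=0)

print(cost_stats)
print(summary)
```

Grouping by a list of columns produces a result indexed by both levels; `.unstack()` then moves the inner level (month) into columns so the result reads like a report table.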

2.6 Merging and Joining Datasets

Key Concepts:

Combining datasets using .merge(), .concat()
Understanding inner, left, right, and outer joins
Joining patient demographics with lab results or appointments

Code Example:

merged_df = pd.merge(patients, labs, on="Patient ID", how="inner")

Practice:

Merge a patient table with test results using Patient ID
Concatenate 3 months of admissions data into a single DataFrame
Perform a left join to retain all patients even if lab data is missing
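
To make the difference between join types concrete, here is a small sketch using two made-up tables in which one patient has no lab results, followed by a concatenation of three monthly extracts. All table contents and the `Test`, `Result`, and `Admitted` column names are assumptions.

```python
import pandas as pd

# Hypothetical tables; patient 3 has no lab results
patients = pd.DataFrame({
    "Patient ID": [1, 2, 3],
    "Age": [34, 67, 52],
})
labs = pd.DataFrame({
    "Patient ID": [1, 2, 2],
    "Test": ["HbA1c", "HbA1c", "LDL"],
    "Result": [5.4, 7.1, 160.0],
})

# Inner join: only patients that appear in both tables
inner = pd.merge(patients, labs, on="Patient ID", how="inner")

# Left join: every patient is kept; missing lab fields become NaN
left = pd.merge(patients, labs, on="Patient ID", how="left")

# Stack monthly extracts that share the same columns
jan = pd.DataFrame({"Patient ID": [1], "Admitted": ["2024-01-09"]})
feb = pd.DataFrame({"Patient ID": [2], "Admitted": ["2024-02-17"]})
mar = pd.DataFrame({"Patient ID": [3], "Admitted": ["2024-03-03"]})
admissions = pd.concat([jan, feb, mar], ignore_index=True)

print(inner)
print(left)
print(admissions)
```

Patient 2 appears twice in both joins because there are two matching lab rows. A merge can increase the row count, so it is worth comparing row counts before and after joining.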

2.7 Summary and Assessment

Summary:

In this module, learners have built the ability to:

Clean, structure, and manipulate healthcare datasets using Pandas
Perform calculations and conditional logic using NumPy
Extract insights through grouping and filtering
Merge and combine multiple datasets into analysis-ready views

Assessment Tasks:

Load and clean a dataset of 500 patient records with missing lab results
Create a summary showing average cost per diagnosis and per age group
Identify high-risk patients using blood pressure and cholesterol thresholds
Merge demographic data with vitals data to create a unified patient profile
Write a function that flags patients with both high BP and high cholesterol

Resources Provided

patient_data.csv – Practice dataset
lab_results.csv – Joinable dataset for exercises
Sample notebook with code scaffolding
Cheat sheet for Pandas/NumPy operations

Healthcare Pandas & NumPy Quiz

Disclaimer: The videos included have been thoughtfully selected to support and enrich the learning experience. While not essential to the completion of the course, they offer valuable insights that may deepen your understanding of the module content. Dr Data Insights does not claim authorship or involvement in the creation of these tutorials.