Module 2: Healthcare Data Manipulation with Pandas and NumPy
Module Overview
This module focuses on using the Python libraries Pandas and NumPy to clean, organize, and transform healthcare data. Whether you’re working with EHR exports, claims data, or lab results, these tools are essential for real-world healthcare analytics.
Learning Objectives
By the end of this module, learners will be able to:
Read, explore, and structure healthcare datasets using Pandas
Clean and standardize data (missing values, duplicates, data types)
Use grouping, filtering, and aggregation to uncover patterns
Apply NumPy to perform fast, vectorized calculations
Join multiple datasets to build complete patient views
2.1 Introduction to Pandas and NumPy
Key Concepts:
Code Example:
import pandas as pd
import numpy as np
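To make the two imports concrete, here is a minimal sketch of a Pandas DataFrame combined with a vectorized NumPy calculation. The column names, the patient values, and the choice of BMI as the calculation are assumptions made for illustration; they do not come from a course dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical vitals for four patients (illustrative values only).
vitals = pd.DataFrame({
    "patient_id": [101, 102, 103, 104],
    "weight_kg": [70.0, 82.5, 55.0, 95.0],
    "height_cm": [175.0, 180.0, 160.0, 170.0],
})

# Vectorized arithmetic: the whole column is converted at once,
# with no explicit Python loop over rows.
height_m = vitals["height_cm"].to_numpy() / 100.0

# BMI = weight (kg) / height (m) squared, rounded to one decimal place.
vitals["bmi"] = np.round(vitals["weight_kg"].to_numpy() / height_m**2, 1)

print(vitals)
```

The DataFrame holds the labeled table, while NumPy does the element-wise arithmetic on the underlying arrays.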
Activity:
2.2 Loading and Exploring Healthcare Data
Key Concepts:
Reading CSV, Excel, and JSON files
Exploring with .head(), .info(), .describe()
Code Example:
df = pd.read_csv("patient_data.csv")
print(df.head())
df.info()  # info() prints its summary directly and returns None
print(df.describe())
Practice:
Load a dataset of 1,000 patient records
Inspect column types, null values, and basic statistics
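If no real file is at hand, the practice steps can be tried on synthetic data. The sketch below builds 1,000 made-up patient records in memory rather than loading them from disk; the column names, value ranges, and the number of blanked-out ages are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a real export; every value here is made up.
rng = np.random.default_rng(seed=0)
n = 1000
patients = pd.DataFrame({
    "patient_id": np.arange(1, n + 1),
    "age": rng.integers(18, 90, size=n).astype(float),
    "sex": rng.choice(["F", "M"], size=n),
})

# Blank out 25 distinct ages so the null check has something to find.
missing_rows = rng.choice(n, size=25, replace=False)
patients.loc[missing_rows, "age"] = np.nan

# Column types
print(patients.dtypes)

# Null values per column
null_counts = patients.isnull().sum()
print(null_counts)

# Basic statistics for the numeric columns
print(patients.describe())
```

With a real dataset, the DataFrame construction would be replaced by a call such as pd.read_csv, and the three inspection steps would stay the same.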
2.3 Data Cleaning and Standardization
Key Concepts:
Handling missing values (.isnull(), .fillna(), .dropna())
Removing duplicates
Changing column types
Renaming and reordering columns
Code Example:
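# Fill missing ages with the median age
df['Age'] = df['Age'].fillna(df['Age'].median())
# Drop rows that are exact duplicates of an earlier row
df = df.drop_duplicates()
# Parse admission dates from text into datetime values
df['Admission Date'] = pd.to_datetime(df['Admission Date'])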
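The remaining key concepts (changing column types, renaming, and reordering columns) can be sketched the same way. The small table below is hypothetical, as are the new column names; it only illustrates one possible order of operations.

```python
import pandas as pd

# Hypothetical raw export: everything arrives as text, with one repeated row.
raw = pd.DataFrame({
    "Admission Date": ["2024-01-05", "2024-01-05", "2024-02-10"],
    "Patient ID": ["001", "001", "002"],
    "Age": ["34", "34", "61"],
})

clean = (
    raw.drop_duplicates()                      # remove the repeated row
       .rename(columns={                       # consistent snake_case names
           "Patient ID": "patient_id",
           "Age": "age",
           "Admission Date": "admission_date",
       })
       .astype({"age": "int64"})               # text -> integer
)

# Text -> datetime
clean["admission_date"] = pd.to_datetime(clean["admission_date"])

# Reorder columns so the identifier comes first, and renumber the rows.
clean = clean[["patient_id", "age", "admission_date"]].reset_index(drop=True)

print(clean.dtypes)
```

Patient IDs are left as text here on the assumption that leading zeros are meaningful; converting them to integers would drop those zeros.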
Practice:
2.4 Filtering, Sorting, and Conditional Logic
Key Concepts:
Filtering rows by conditions
Combining multiple filters
Sorting data
Creating new calculated columns
Code Example:
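# Boolean mask: True where either reading exceeds its threshold
is_high_risk = (df['Blood Pressure'] > 140) | (df['Cholesterol'] > 240)
high_risk = df[is_high_risk]
# np.where needs the row-by-row boolean mask, not the filtered DataFrame
df['Risk Flag'] = np.where(is_high_risk, 'High', 'Normal')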
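Combining filters with "and" logic and sorting the result follow the same pattern. The sketch below uses a small hypothetical table; the blood pressure and cholesterol thresholds match the example above, while the age cutoff of 65 is an assumption added for illustration.

```python
import pandas as pd

# Hypothetical readings for four patients (illustrative values only).
labs = pd.DataFrame({
    "Patient ID": [1, 2, 3, 4],
    "Age": [72, 45, 66, 58],
    "Blood Pressure": [150, 120, 145, 135],
    "Cholesterol": [250, 200, 190, 260],
})

# & keeps rows where BOTH conditions hold; each condition needs parentheses.
both_elevated = labs[(labs["Blood Pressure"] > 140) & (labs["Cholesterol"] > 240)]

# A second combined filter: patients aged 65+ with elevated blood pressure.
older_high_bp = labs[(labs["Age"] >= 65) & (labs["Blood Pressure"] > 140)]

# Sort the filtered rows so the highest readings come first.
ranked = older_high_bp.sort_values("Blood Pressure", ascending=False)

print(both_elevated)
print(ranked)
```

Use | for "either condition" and & for "both conditions"; Python's plain `and` and `or` do not work element-wise on Pandas columns.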