Pandas and Exploratory Data Analysis

What You’ll Learn

  • What Exploratory Data Analysis (EDA) is and why you should always explore before you analyze
  • The EDA workflow: inspect, clean, explore, summarize
  • What pandas is: the Series and DataFrame data structures
  • Why pandas is a massive upgrade over raw CSV files and dictionaries
  • How to load, inspect, clean, merge, group, and summarize tabular data with pandas
  • How to create new columns and compute derived statistics from the golf dataset

Concept

What Is Exploratory Data Analysis?

Exploratory Data Analysis (EDA) is the practice of examining a dataset before you try to model it, test hypotheses, or draw conclusions. The term was coined by statistician John Tukey in the 1970s, and the core idea is simple: look at the data before you do anything else.

Why? Because real-world data is messy. Columns have missing values. Dates come in inconsistent formats. IDs that should match across tables sometimes don’t. Distributions are skewed in ways you did not expect. If you jump straight to modeling without exploring first, you will build models on flawed assumptions and get misleading results.

In our golf context: before we can answer questions like “Which player performs best in bad weather?” or “Which club gains the most strokes?”, we need to understand what the data actually looks like. How many rounds does each player have? Are there missing shot records? Are the dates parsed correctly? EDA answers these questions.

The EDA Workflow

EDA follows a general pattern:

  1. Inspect – Load the data and look at its shape, types, and first few rows. Get oriented.
  2. Clean – Handle missing values, fix data types (especially dates and numbers stored as strings), rename confusing columns.
  3. Explore – Filter, sort, and group the data. Ask questions. Look at distributions, counts, and relationships between columns.
  4. Summarize – Compute descriptive statistics, build summary tables, and identify the key patterns you will carry into deeper analysis.

This is not a strict linear process. You will loop back frequently – exploring might reveal a data quality issue that sends you back to cleaning, or a summary might raise a new question that sends you back to exploring. That is normal.
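
To make the workflow concrete, here is a minimal sketch of what each step looks like in pandas, using the rounds table we will load properly in the Code section below:

import pandas as pd

rounds = pd.read_csv('../../data/rounds.csv')

# 1. Inspect -- shape, types, and the first few rows
print(rounds.shape)
print(rounds.dtypes)
print(rounds.head())

# 2. Clean -- fix types and check for missing values
rounds['date'] = pd.to_datetime(rounds['date'])
print(rounds.isna().sum())

# 3. Explore -- filter, sort, and group
print(rounds[rounds['weather'] == 'sunny'].sort_values('total_score').head())

# 4. Summarize -- descriptive statistics and summary tables
print(rounds.describe())
print(rounds.groupby('player_id')['total_score'].mean())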

What Is pandas?

pandas is a Python library for working with tabular data – the kind of data that lives in spreadsheets, CSV files, and database tables. It was created by Wes McKinney in 2008 while he was working at a hedge fund, because he needed a tool that could handle the messy, real-world data wrangling that Excel could not scale to.

pandas has two core data structures:

Series – A one-dimensional labeled array. Think of it as a single column from a spreadsheet: a sequence of values with an index that labels each value. A Series of player handicaps, for example, or a Series of total scores.

DataFrame – A two-dimensional labeled table. Think of it as a spreadsheet or a SQL table: rows and columns, where each column is a Series. A DataFrame of rounds has columns for round_id, player_id, course_id, date, total_score, and weather.

The relationship is straightforward: a DataFrame is a collection of Series that share the same index.
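
To make that concrete, here is a tiny hand-built example (the values are made up for illustration; the real golf tables are loaded in the Code section below):

import pandas as pd

# A Series: one labeled column of values
handicaps = pd.Series([2.1, 8.5, 14.0, 20.3], name='handicap')
print(handicaps)

# A DataFrame: several Series that share the same row index
players_demo = pd.DataFrame({
    'player_id': [1, 2, 3, 4],
    'name': ['A', 'B', 'C', 'D'],        # placeholder names
    'handicap': [2.1, 8.5, 14.0, 20.3],
})

print(players_demo['handicap'])        # selecting one column returns a Series
print(type(players_demo['handicap']))  # <class 'pandas.core.series.Series'>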

Why pandas Over Raw CSV and Dicts?

In Topics 03 and 05, we loaded CSV files with csv.DictReader and built lists of dictionaries (or dataclass instances). That approach works, but it has real limitations as your data grows:

| Task | Raw CSV / Dicts | pandas |
| --- | --- | --- |
| Load a CSV | 4-5 lines with open(), DictReader, list comprehension | pd.read_csv('file.csv') – one line |
| Get average score | sum(int(r['total_score']) for r in rounds) / len(rounds) | rounds['total_score'].mean() |
| Filter rows | List comprehension: [r for r in rounds if r['weather'] == 'sunny'] | rounds[rounds['weather'] == 'sunny'] |
| Join two tables | Build a lookup dict, loop through rows, match keys manually | pd.merge(rounds, players, on='player_id') |
| Group and aggregate | Nested loops, manual dict accumulation | rounds.groupby('player_id')['total_score'].mean() |
| Handle missing data | Check every field manually with if statements | .isna().sum() to see all missing values at once |

pandas is not magic – under the hood it uses NumPy arrays for fast vectorized operations. But it gives you a high-level, expressive API that makes common data tasks concise and readable.

The tradeoff: pandas has a learning curve. The API is large, and some operations have multiple ways to accomplish the same thing. This notebook will focus on the core operations you will use in 90% of your work.


Code

1. Loading Data with pandas

The first thing you do with pandas is load data. pd.read_csv() reads a CSV file and returns a DataFrame in one line. Compare this to the multi-line open() + csv.DictReader + list comprehension pattern from Topic 03.

import pandas as pd
import numpy as np
# Load all five golf CSVs -- one line each
players = pd.read_csv('../../data/players.csv')
courses = pd.read_csv('../../data/courses.csv')
holes = pd.read_csv('../../data/holes.csv')
rounds = pd.read_csv('../../data/rounds.csv')
shots = pd.read_csv('../../data/shots.csv')

print(f'Players:  {players.shape[0]:>5,} rows x {players.shape[1]} columns')
print(f'Courses:  {courses.shape[0]:>5,} rows x {courses.shape[1]} columns')
print(f'Holes:    {holes.shape[0]:>5,} rows x {holes.shape[1]} columns')
print(f'Rounds:   {rounds.shape[0]:>5,} rows x {rounds.shape[1]} columns')
print(f'Shots:    {shots.shape[0]:>5,} rows x {shots.shape[1]} columns')

That replaced all of this from Topic 03:

import csv

def read_csv(filepath):
    with open(filepath, 'r') as f:
        reader = csv.DictReader(f)
        return list(reader)

players = read_csv('../../data/players.csv')

And unlike DictReader, pandas automatically infers column types – numbers are numbers, not strings.

# .head() shows the first 5 rows by default
players.head()
# .tail() shows the last rows -- useful for checking the end of large files
rounds.tail()
# .shape gives (rows, columns) as a tuple
print('Shots shape:', shots.shape)
print()

# .dtypes shows the data type of each column
# Notice: pandas inferred int64, float64, and object (string) types automatically
print('Rounds dtypes:')
print(rounds.dtypes)
# .info() gives a comprehensive summary: column names, types, non-null counts, memory usage
shots.info()

2. Series Basics

A Series is a single column of a DataFrame. You select it using bracket notation with the column name. A Series has its own methods for summary statistics, counting, and element-wise operations.

# Select a single column -- returns a Series
scores = rounds['total_score']

print(type(scores))
print()
print(scores)
# Basic statistics on the total_score Series
print(f'Mean score:   {scores.mean():.1f}')
print(f'Median score: {scores.median():.1f}')
print(f'Std dev:      {scores.std():.1f}')
print(f'Min score:    {scores.min()}')
print(f'Max score:    {scores.max()}')
print(f'Count:        {scores.count()}')

Compare that to the manual approach from earlier topics:

# Without pandas -- lots of manual work
# (here rounds is the Topic 03 list of dicts, not the DataFrame)
score_list = [int(r['total_score']) for r in rounds]
avg = sum(score_list) / len(score_list)

With pandas, the types were already inferred at load time, and the aggregation is a single method call.

# value_counts() is one of the most useful Series methods
# It counts how many times each unique value appears

# What clubs appear most often in the shots data?
print('Club usage across all shots:')
print(shots['club'].value_counts())
# How many rounds were played in each weather condition?
print('Rounds by weather:')
print(rounds['weather'].value_counts())
# .unique() returns the distinct values in a Series
print('Unique lies in the shots data:')
print(shots['start_lie'].unique())
print()
print(f'Number of unique clubs: {shots["club"].nunique()}')

3. DataFrame Basics: Selecting and Filtering

A DataFrame is a collection of Series (columns) that share the same row index. You can select columns, filter rows, and slice in multiple ways.

# Select multiple columns -- pass a list of column names
# This returns a new DataFrame (not a Series)
rounds[['player_id', 'total_score', 'weather']].head()
# Boolean indexing -- filter rows based on a condition
# Find all rounds where the score was under 85
sub_85 = rounds[rounds['total_score'] < 85]

print(f'{len(sub_85)} rounds with score under 85:')
sub_85
# How boolean indexing works under the hood:
# The comparison creates a Series of True/False values
mask = rounds['total_score'] < 85
print(mask)
print()
print(f'Type: {type(mask)}')
print(f'True count: {mask.sum()}')
# Filter shots by player -- but wait, shots.csv does not have player_id directly.
# It has round_id, and we need to know which round_ids belong to Bear Woods.
# We will do this properly with merge in section 5, but for now:

# Bear Woods is player_id = 1
bear_round_ids = rounds[rounds['player_id'] == 1]['round_id']
print('Bear Woods round IDs:')
print(bear_round_ids.values)

# Filter shots to only Bear Woods' rounds using .isin()
bear_shots = shots[shots['round_id'].isin(bear_round_ids)]
print(f'\nBear Woods has {len(bear_shots)} total shots across {bear_round_ids.count()} rounds')
# Filter shots hit with Driver
driver_shots = shots[shots['club'] == 'Driver']
print(f'{len(driver_shots)} Driver shots in the dataset')
print(f'Average distance to pin at start: {driver_shots["start_distance_to_pin"].mean():.1f} yards')
print(f'Average distance to pin at end:   {driver_shots["end_distance_to_pin"].mean():.1f} yards')
# Multiple conditions: use & (and), | (or), ~ (not)
# Each condition must be wrapped in parentheses

# Shots from the tee with Driver that ended in the rough
bad_drives = shots[
    (shots['club'] == 'Driver') &
    (shots['start_lie'] == 'tee') &
    (shots['end_lie'] == 'rough')
]

print(f'{len(bad_drives)} drives that ended in the rough (out of {len(driver_shots)} total drives)')
print(f'Average strokes gained on those bad drives: {bad_drives["strokes_gained"].mean():.3f}')
# .loc[] -- label-based indexing (by row label and column name)
# .iloc[] -- integer-based indexing (by position)

# Get the first 3 rows, specific columns with .loc
print('Using .loc[0:2, [columns]]:')
print(rounds.loc[0:2, ['player_id', 'course_id', 'total_score']])
print()

# Get rows 0-2, columns 0-2 with .iloc (purely positional)
print('Using .iloc[0:3, 0:3]:')
print(rounds.iloc[0:3, 0:3])

Important distinction: .loc slicing is inclusive on both ends (0:2 gives rows 0, 1, 2). .iloc slicing follows Python convention and is exclusive on the right (0:3 gives rows at positions 0, 1, 2). This is a common source of confusion.

4. Data Cleaning

Real data is never perfectly clean. The first step of any analysis is checking for problems: missing values, wrong types, inconsistent formatting. pandas makes this straightforward.

# Check for missing values in every DataFrame
print('Missing values per column:')
print('\nPlayers:')
print(players.isna().sum())
print('\nCourses:')
print(courses.isna().sum())
print('\nRounds:')
print(rounds.isna().sum())
print('\nShots:')
print(shots.isna().sum())

Our data is well-structured (it was generated, not scraped), so you may see zero missing values. In the real world, you would almost always find some. The .isna().sum() pattern is one you will use at the start of every project.

# Check the date column -- pandas loaded it as a string (object), not a date
print('Date column type:', rounds['date'].dtype)
print()
print('First few dates:')
print(rounds['date'].head())
print()
print('Type of first value:', type(rounds['date'].iloc[0]))
# Convert the date column to proper datetime objects
rounds['date'] = pd.to_datetime(rounds['date'])

print('Date column type after conversion:', rounds['date'].dtype)
print()
print(rounds['date'].head())
# With proper datetime types, we can do date arithmetic and extraction
print('Date range:', rounds['date'].min(), 'to', rounds['date'].max())
print()

# Extract the day of the week for each round
print('Day of week for each round:')
print(rounds['date'].dt.day_name().value_counts())
# Pro tip: you can parse dates at load time with the parse_dates parameter
# Let's reload rounds with the date parsed automatically
rounds = pd.read_csv('../../data/rounds.csv', parse_dates=['date'])

print('Date type after reload:', rounds['date'].dtype)
print(rounds.dtypes)
# Check column names for consistency across DataFrames
print('Players columns:', list(players.columns))
print('Courses columns:', list(courses.columns))
print('Holes columns:  ', list(holes.columns))
print('Rounds columns: ', list(rounds.columns))
print('Shots columns:  ', list(shots.columns))

Our column names are already clean and consistent – lowercase, underscored, matching across tables (e.g., player_id in both players and rounds). In messy real-world data, you would often need to rename columns with .rename(columns={'old_name': 'new_name'}) or standardize them with something like df.columns = df.columns.str.lower().str.replace(' ', '_').
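
Here is what that cleanup looks like on a made-up messy table (the column names below are invented for illustration):

import pandas as pd

messy = pd.DataFrame({'Player ID': [1, 2], 'Total Score': [85, 92]})

# Option 1: rename specific columns
renamed = messy.rename(columns={'Player ID': 'player_id', 'Total Score': 'total_score'})
print(list(renamed.columns))   # ['player_id', 'total_score']

# Option 2: standardize every column name in one pass
messy.columns = messy.columns.str.lower().str.replace(' ', '_')
print(list(messy.columns))     # ['player_id', 'total_score']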

5. Merging DataFrames

In Topic 03, we joined data across CSV files by building lookup dictionaries manually:

# The old way -- manual dict-based join
player_lookup = {p['player_id']: p['name'] for p in players}
for r in rounds:
    name = player_lookup[r['player_id']]

pandas replaces all of that with pd.merge(), which works like a SQL JOIN.

# Merge rounds with players on player_id
# This adds the player's name and handicap to each round row
rounds_with_players = pd.merge(rounds, players, on='player_id')

print(f'Shape before merge: {rounds.shape}')
print(f'Shape after merge:  {rounds_with_players.shape}')
print()
rounds_with_players.head()
# Now merge with courses to add course details
# Note: both players and courses have a 'name' column, so we pass suffixes=
# to disambiguate them (without it, pandas would default to '_x' and '_y')
round_detail = pd.merge(rounds_with_players, courses, on='course_id', suffixes=('_player', '_course'))

print('Columns after both merges:')
print(list(round_detail.columns))
print()
round_detail.head()
# Clean up the merged DataFrame -- rename the suffixed columns for clarity
round_detail = round_detail.rename(columns={
    'name_player': 'player_name',
    'name_course': 'course_name',
})

# Select the columns we care about for a clean view
round_detail[['round_id', 'player_name', 'course_name', 'date', 'total_score',
               'weather', 'handicap', 'slope_rating', 'course_rating']].head(10)
# Compare: the manual join from Topic 03 required 10+ lines.
# The pandas version is two merge calls and a rename.

# Let's verify the merge is correct by spot-checking a known round
# Round 1: Bear Woods at North Park, score 85, sunny
r1 = round_detail[round_detail['round_id'] == 1].iloc[0]
print(f'Round 1: {r1["player_name"]} at {r1["course_name"]}')
print(f'Score: {r1["total_score"]}, Weather: {r1["weather"]}')
print(f'Course rating: {r1["course_rating"]}, Slope: {r1["slope_rating"]}')
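
One detail to keep in mind: pd.merge() defaults to an inner join, so rows without a match in the other table are silently dropped. Our generated data has a match for every key, but with messier data you can pass how='left' (or 'right' / 'outer') to keep unmatched rows. A quick check, using the DataFrames loaded above:

# A left join keeps every row of the left DataFrame and fills NaN where no match exists
rounds_left = pd.merge(rounds, players, on='player_id', how='left')

print(f'Inner join rows: {len(rounds_with_players)}')
print(f'Left join rows:  {len(rounds_left)}')
print('Rounds with no matching player:', rounds_left['name'].isna().sum())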

6. Groupby Operations

Groupby is where pandas truly shines. The pattern is:

  1. Split the data into groups based on one or more columns.
  2. Apply a function to each group (mean, sum, count, etc.).
  3. Combine the results into a new DataFrame or Series.

This is the pandas equivalent of writing nested loops with manual dict accumulation.
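
For comparison, here is roughly what that manual version looks like, assuming rounds_list is the Topic 03 list of dicts (the variable name is just for illustration):

# Manual split-apply-combine: average score per player with plain dicts
totals, counts = {}, {}
for r in rounds_list:                      # rounds_list: list of dicts from csv.DictReader
    pid = r['player_id']
    totals[pid] = totals.get(pid, 0) + int(r['total_score'])
    counts[pid] = counts.get(pid, 0) + 1

avg_by_player_manual = {pid: totals[pid] / counts[pid] for pid in totals}

# The pandas equivalent of all of the above:
# rounds.groupby('player_id')['total_score'].mean()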

# Average score by player
avg_by_player = round_detail.groupby('player_name')['total_score'].mean()

print('Average score by player:')
print(avg_by_player.sort_values())
# Average score by course
avg_by_course = round_detail.groupby('course_name')['total_score'].mean()

print('Average score by course:')
print(avg_by_course.sort_values())
# Average score by weather condition
avg_by_weather = round_detail.groupby('weather')['total_score'].mean()

print('Average score by weather:')
print(avg_by_weather.sort_values())
print()

# Also show the count so we know how much data backs each average
weather_summary = round_detail.groupby('weather')['total_score'].agg(['mean', 'count', 'std'])
print('Weather summary with count and std:')
print(weather_summary.sort_values('mean'))
# Club usage frequency from shots.csv
# value_counts() is essentially a groupby + count
club_counts = shots['club'].value_counts()

print('Club usage frequency:')
print(club_counts)
print(f'\nTotal shots: {club_counts.sum()}')
# Average strokes gained by club
# Positive = gaining strokes vs field, Negative = losing strokes
sg_by_club = shots.groupby('club')['strokes_gained'].mean().sort_values()

print('Average strokes gained by club:')
print(sg_by_club)
# Average starting distance to pin by club
# This tells us how far each club is typically used from
dist_by_club = shots.groupby('club')['start_distance_to_pin'].mean().sort_values(ascending=False)

print('Average distance to pin when club is used (yards):')
print(dist_by_club.round(1))
# Multiple aggregations at once with .agg()
club_summary = shots.groupby('club').agg(
    shot_count=('strokes_gained', 'count'),
    avg_strokes_gained=('strokes_gained', 'mean'),
    avg_start_dist=('start_distance_to_pin', 'mean'),
    avg_end_dist=('end_distance_to_pin', 'mean'),
).sort_values('avg_start_dist', ascending=False)

print('Club summary:')
print(club_summary.round(2))
# Best (lowest) round per player
# Method: sort by score, then take the first row per group
best_rounds = (
    round_detail
    .sort_values('total_score')
    .groupby('player_name')
    .first()
    [['course_name', 'date', 'total_score', 'weather']]
    .sort_values('total_score')
)

print('Best round per player:')
print(best_rounds)
# Alternative approach using .idxmin() to find the index of the minimum score
# This is more direct when you just need the index of the best row
idx_of_best = round_detail.groupby('player_name')['total_score'].idxmin()

print('Index of each player\'s best round:')
print(idx_of_best)
print()

# Use those indices to pull the full rows
best_detail = round_detail.loc[idx_of_best, ['player_name', 'course_name', 'date', 'total_score', 'weather']]
print('Best round details:')
print(best_detail.to_string(index=False))

7. Descriptive Statistics

pandas provides .describe() as a one-call summary of all numeric columns. You can also compute correlations, which tell you whether two columns tend to move together.

# .describe() for the rounds detail DataFrame
round_detail.describe()
# .describe() for shots -- shows distributions of distances and strokes gained
shots.describe()
# Correlation between numeric columns in rounds
# Is there a relationship between handicap and total_score?
numeric_cols = round_detail[['total_score', 'handicap', 'slope_rating', 'course_rating']]
print('Correlation matrix:')
print(numeric_cols.corr().round(3))
# The handicap-score correlation should be positive (higher handicap = higher scores)
corr = round_detail['handicap'].corr(round_detail['total_score'])
print(f'Correlation between handicap and total score: {corr:.3f}')
print()
print('Interpretation: a correlation near +1 means players with higher handicaps')
print('tend to shoot higher scores, which is exactly what we would expect.')
# NumPy operations on pandas data
# pandas is built on NumPy, so NumPy functions work directly on Series
scores_array = round_detail['total_score'].values  # extract the underlying NumPy array

print(f'Type: {type(scores_array)}')
print(f'np.mean:   {np.mean(scores_array):.2f}')
# (np.std defaults to ddof=0, the population std; Series.std() uses ddof=1, the sample std)
print(f'np.std:    {np.std(scores_array):.2f}')
print(f'np.median: {np.median(scores_array):.1f}')
print(f'np.percentile (25th, 50th, 75th): {np.percentile(scores_array, [25, 50, 75])}')

8. Creating New Columns

A big part of EDA is engineering new features – deriving new columns from existing ones. pandas makes this easy because operations on Series are element-wise by default.

# Add a relative_to_par column
# First, we need each course's par. Let's compute it from the holes table.
course_par = holes.groupby('course_id')['par'].sum().reset_index()
course_par.columns = ['course_id', 'total_par']
print('Course pars:')
print(course_par)
print()

# Merge total_par into our round_detail DataFrame
round_detail = pd.merge(round_detail, course_par, on='course_id')

# Now compute relative to par
round_detail['relative_to_par'] = round_detail['total_score'] - round_detail['total_par']

print('Rounds with relative_to_par:')
print(round_detail[['player_name', 'course_name', 'total_score', 'total_par',
                     'relative_to_par']].to_string(index=False))
# Add a scoring_category column to the shots DataFrame
# First, we need to know the par for each hole of each shot.
# We need to figure out which course each shot was played on (via rounds),
# then look up the par for that course and hole number.

# Step 1: Get course_id for each shot via rounds
shot_rounds = pd.merge(shots, rounds[['round_id', 'course_id']], on='round_id')

# Step 2: Get par for each hole via the holes table
shot_detail = pd.merge(shot_rounds, holes[['course_id', 'hole_number', 'par']],
                       left_on=['course_id', 'hole'],
                       right_on=['course_id', 'hole_number'])

print(f'Shot detail shape: {shot_detail.shape}')
shot_detail[['round_id', 'hole', 'shot_number', 'club', 'par']].head(10)
# To classify each hole as birdie/par/bogey, we need the number of strokes per hole.
# Strokes per hole = the highest shot_number recorded for each (round_id, hole)
strokes_per_hole = shots.groupby(['round_id', 'hole'])['shot_number'].max().reset_index()
strokes_per_hole.columns = ['round_id', 'hole', 'strokes']

# Merge with hole par info
hole_scores = pd.merge(strokes_per_hole,
                       shot_detail[['round_id', 'hole', 'par']].drop_duplicates(),
                       on=['round_id', 'hole'])

hole_scores['score_vs_par'] = hole_scores['strokes'] - hole_scores['par']

hole_scores.head(10)
# Use .apply() to map the numeric difference to a category name
def scoring_category(score_vs_par):
    """Convert a score relative to par into a category name."""
    categories = {
        -3: 'albatross',
        -2: 'eagle',
        -1: 'birdie',
         0: 'par',
         1: 'bogey',
         2: 'double bogey',
         3: 'triple bogey',
    }
    return categories.get(score_vs_par, f'+{score_vs_par}' if score_vs_par > 0 else str(score_vs_par))


hole_scores['scoring_category'] = hole_scores['score_vs_par'].apply(scoring_category)

print('Scoring category distribution across all rounds:')
print(hole_scores['scoring_category'].value_counts())
# Note: .apply() is flexible but slow on large datasets because it runs a
# Python function once per row. For simple mappings, pandas has faster
# vectorized alternatives like .map() with a dictionary:

category_map = {-3: 'albatross', -2: 'eagle', -1: 'birdie', 0: 'par',
                1: 'bogey', 2: 'double bogey', 3: 'triple bogey'}

hole_scores['scoring_category_v2'] = hole_scores['score_vs_par'].map(category_map)

# .map() returns NaN for values not in the dict, which is fine here --
# any score worse than triple bogey is rare enough that NaN is acceptable.
print('Check: do both methods agree?')
print(hole_scores[['scoring_category', 'scoring_category_v2']].head(10))
# Scoring category breakdown per player
# First, add player info to hole_scores via rounds
hole_scores_with_player = pd.merge(
    hole_scores,
    pd.merge(rounds[['round_id', 'player_id']], players[['player_id', 'name']], on='player_id'),
    on='round_id'
)

# Crosstab: rows = players, columns = scoring categories
scoring_crosstab = pd.crosstab(hole_scores_with_player['name'],
                                hole_scores_with_player['scoring_category'])

# Reorder columns from best to worst
col_order = ['albatross', 'eagle', 'birdie', 'par', 'bogey', 'double bogey', 'triple bogey']
col_order = [c for c in col_order if c in scoring_crosstab.columns]
remaining = [c for c in scoring_crosstab.columns if c not in col_order]
scoring_crosstab = scoring_crosstab[col_order + remaining]

print('Scoring category breakdown by player:')
print(scoring_crosstab)

AI

Exercise 1: Ask AI to Load, Merge, and Find Patterns

Give an AI assistant the following prompt:

I have five CSV files for a golf dataset:

  • players.csv: player_id, name, handicap (4 players)
  • courses.csv: course_id, name, city, state, slope_rating, course_rating (3 courses)
  • holes.csv: course_id, hole_number, par, yardage, handicap_index (54 rows, 18 holes per course)
  • rounds.csv: round_id, player_id, course_id, date, total_score, weather (24 rows)
  • shots.csv: round_id, hole, shot_number, club, start_lie, start_distance_to_pin, end_lie, end_distance_to_pin, strokes_gained (~2100 rows)

Load these files with pandas, merge them appropriately, and tell me what is interesting about the data. Show your work.

Evaluate the AI’s response:

  • Did it load all five files?
  • Did it merge on the correct keys (player_id, course_id, round_id)?
  • Did it find real patterns in the data? (For example: the relationship between handicap and score, differences between courses, club usage patterns, strokes gained insights.)
  • Did it miss anything obvious? (Check against the groupby results we computed above.)
  • Did it make any wrong assumptions about the data? (For example, assuming strokes_gained is per-round rather than per-shot, or confusing handicap_index on holes with player handicap.)
# Paste the AI-generated code here and run it.
# Compare its findings to what we discovered in Sections 6-8 above.

Exercise 2: Ask AI to Write a Strokes Gained Groupby

Give an AI assistant this prompt:

Using pandas, write code that finds each player’s average strokes gained by club. The data is in these files:

  • players.csv has player_id and name
  • rounds.csv has round_id and player_id
  • shots.csv has round_id, club, and strokes_gained

The result should show player names (not IDs) and be easy to read.

Evaluate the AI’s response:

  • Does it correctly merge shots -> rounds -> players? (The join chain must go through rounds to connect shots to players.)
  • Does it use groupby(['name', 'club']) or equivalent?
  • Does it handle the groupby output in a readable way? (A pivot table or unstacked DataFrame is much easier to read than a multi-index Series.)
  • Does it round the output to a reasonable number of decimal places?
  • Test it with our data and verify the numbers make sense.
# Paste the AI-generated code here and run it.
# Verify the output is correct by spot-checking a few values.
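
If you want something to check the AI's numbers against, here is one possible reference sketch (not the only correct answer) that joins shots through rounds to players and unstacks the result into a player-by-club table:

# One possible reference solution to compare against
shots_players = (
    shots
    .merge(rounds[['round_id', 'player_id']], on='round_id')
    .merge(players[['player_id', 'name']], on='player_id')
)

sg_by_player_club = (
    shots_players
    .groupby(['name', 'club'])['strokes_gained']
    .mean()
    .unstack('club')   # one row per player, one column per club
    .round(3)
)
print(sg_by_player_club)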

Exercise 3: Ask AI to Suggest Questions, Then Analyze One

Give an AI assistant this prompt:

I have a golf dataset with players, courses, rounds, holes, and shots (including strokes gained per shot). What are 10 interesting questions I could explore with this data?

Pick one of the AI’s suggested questions that sounds interesting to you, then give the AI a follow-up prompt:

Write a pandas analysis that answers: [the question you picked]. Use the CSV files described above.

Evaluate the result:

  • Were the suggested questions actually answerable with the data we have? (The AI might suggest questions that require data we do not have, like course conditions or green speed.)
  • Does the analysis code run without errors?
  • Does the analysis actually answer the question, or does it just compute tangentially related statistics?
  • Could you improve the analysis with what you learned in this notebook?
# Paste the AI's suggested questions here as a comment,
# then paste and run the analysis code for the question you chose.

Summary

Quick Reference: Key pandas Operations

| Operation | Code | What It Does |
| --- | --- | --- |
| Load CSV | pd.read_csv('file.csv') | Read a CSV file into a DataFrame |
| Load CSV with dates | pd.read_csv('file.csv', parse_dates=['date']) | Parse date columns automatically |
| First rows | df.head() | Show first 5 rows (pass n for more) |
| Last rows | df.tail() | Show last 5 rows |
| Shape | df.shape | Returns (rows, columns) tuple |
| Column types | df.dtypes | Data type of each column |
| Full summary | df.info() | Types, non-null counts, memory usage |
| Select column | df['col'] | Returns a Series |
| Select columns | df[['col1', 'col2']] | Returns a DataFrame |
| Filter rows | df[df['col'] > value] | Boolean indexing |
| Multiple filters | df[(cond1) & (cond2)] | Use & (and), \| (or), ~ (not) |
| Label indexing | df.loc[rows, cols] | Select by label (inclusive slicing) |
| Position indexing | df.iloc[rows, cols] | Select by position (exclusive end) |
| Missing values | df.isna().sum() | Count NaN per column |
| Parse dates | pd.to_datetime(df['col']) | Convert string column to datetime |
| Rename columns | df.rename(columns={'old': 'new'}) | Rename specific columns |
| Merge/Join | pd.merge(df1, df2, on='key') | SQL-style join on a shared column |
| Group + aggregate | df.groupby('col')['val'].mean() | Split-apply-combine |
| Multiple aggregations | df.groupby('col').agg(...) | Multiple stats per group |
| Value counts | df['col'].value_counts() | Count unique value occurrences |
| Unique values | df['col'].unique() | Array of distinct values |
| Descriptive stats | df.describe() | Mean, std, min, max, quartiles |
| Correlation | df.corr() | Pairwise correlation of columns |
| New column | df['new'] = df['a'] - df['b'] | Vectorized column creation |
| Apply function | df['col'].apply(func) | Run a function on each value |
| Map values | df['col'].map(dict) | Replace values using a dictionary |
| Crosstab | pd.crosstab(df['a'], df['b']) | Frequency table of two columns |
| Sort | df.sort_values('col') | Sort rows by a column |
| Index of min | df.groupby('g')['v'].idxmin() | Row index of minimum value per group |

Key Takeaways

  1. Always explore before you analyze. Use .head(), .info(), .describe(), and .isna().sum() before writing any analysis code.
  2. pandas replaces manual loops. Loading, filtering, joining, and grouping data that took 10-20 lines with raw Python now takes 1-2 lines.
  3. Groupby is the workhorse of EDA. Most interesting questions (“what is the average X by Y?”) translate directly into a .groupby().agg() call.
  4. Merging connects your data. Real datasets are split across tables. pd.merge() is how you reassemble them.
  5. Create new columns to answer new questions. Derived columns like relative_to_par and scoring_category turn raw data into insights.

Next up: Topic 07 – Data Visualization.
