Introduction to Strokes Gained

What You’ll Learn

  • What strokes gained is and why it is the most important statistic in modern golf analytics
  • How the strokes gained formula works: expected strokes from a given lie and distance, minus what actually happened
  • How to implement the expected strokes lookup with interpolation in Python
  • How to categorize shots into the four strokes gained categories: off the tee, approach, around the green, putting
  • How to use pandas groupby to analyze player performance by category and by club
  • How to build visualizations that tell a player’s strokes gained story
  • How to produce a personal performance report that turns data into actionable coaching insights

This is the capstone topic. It ties together everything from the course: file I/O (Topic 03), comprehensions (Topic 04), classes and data modeling (Topic 05), pandas (Topic 06), and visualization (Topic 07). Strokes gained is the thread that connects all of the golf data we have been building throughout the semester.


Concept

What Is Strokes Gained?

Every shot in golf starts from a specific situation: a lie (tee, fairway, rough, sand, green, recovery) and a distance to the hole. For every possible combination of lie and distance, there is a statistical expected number of strokes it takes to finish the hole from that spot. These expectations come from analyzing millions of shots hit by PGA Tour professionals.

Strokes gained measures how much better or worse a single shot was compared to that expectation. The formula is simple:

strokes_gained = expected_strokes(start) - expected_strokes(end) - 1

The - 1 accounts for the stroke you just took. If you started in a position that was expected to take 3.25 strokes to finish, and you hit a shot that left you in a position expected to take 1.56 strokes to finish, you gained:

SG = 3.25 - 1.56 - 1 = +0.69 strokes

That shot was 0.69 strokes better than the baseline expectation. Positive strokes gained means you gained on the field. Negative means you lost.

When a shot is holed out (the ball goes in the cup), the expected strokes at the end is 0:

SG = expected_strokes(start) - 0 - 1
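
The formula can be sketched as a small helper. This is a minimal illustration rather than the course's implementation: the function name `strokes_gained_for_shot` and its parameters are assumptions introduced here, and the expected-strokes values are passed in directly instead of being looked up.

```python
def strokes_gained_for_shot(expected_start, expected_end=0.0):
    """Strokes gained for one shot, given the baseline expectations.

    expected_end defaults to 0.0, which represents a holed shot.
    (Hypothetical helper: the name and signature are illustrative only.)
    """
    return expected_start - expected_end - 1


# The example from the text: 3.25 expected before the shot, 1.56 after
print(round(strokes_gained_for_shot(3.25, 1.56), 2))   # 0.69

# A holed shot from a position with a 1.04 baseline
print(round(strokes_gained_for_shot(1.04), 2))         # 0.04
```

Giving the end expectation a default of zero keeps the holed-out case from needing a separate formula.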

Worked Example

You are 100 yards out in the fairway. According to the baseline table, a PGA Tour player takes an average of 2.82 strokes to finish the hole from this position.

You hit your approach shot and it lands on the green, 8 feet (about 3 yards) from the hole. From 8 feet on the green, the baseline is 1.33 strokes to finish.

Your strokes gained on that approach shot:

SG = 2.82 - 1.33 - 1 = +0.49

Good shot. You gained about half a stroke on the field.

Now suppose instead you hit a poor approach from 100 yards that ends up in the rough, 40 yards from the pin. The baseline from 40 yards in the rough is 2.65 strokes.

SG = 2.82 - 2.65 - 1 = -0.83

That shot cost you 0.83 strokes relative to the baseline. It barely advanced your expected outcome at all, and you used up a stroke doing it.

Then you chip from 40 yards in the rough onto the green, 3 feet from the hole. Baseline from 3 feet on the green is 1.04 strokes.

SG = 2.65 - 1.04 - 1 = +0.61

Nice recovery. You gained back more than half a stroke with a good chip.

Then you sink the 3-foot putt:

SG = 1.04 - 0 - 1 = +0.04

Making a 3-footer is expected, so it barely moves the needle – a tiny positive gain because the baseline from 3 feet is slightly above 1.0.

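Notice that the per-shot values add up to a simple total. In the second sequence above, -0.83 + 0.61 + 0.04 = -0.18, which is exactly 2.82 - 3: the expected strokes from the starting position minus the three strokes actually taken. This is a fundamental property: total strokes gained across all shots on a hole equals the expected strokes from the tee minus the player’s score on that hole. Since the expected strokes from the tee are usually in the neighborhood of par, a positive total roughly corresponds to beating par and a negative total to losing to it.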
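
The reason the total is so simple is that the intermediate expectations cancel when the per-shot values are added, leaving only the starting expectation and the number of strokes taken. A short sketch using the numbers from the poor-approach sequence in the worked example (the variable names are illustrative assumptions):

```python
# Baseline expectation before each shot, then 0.0 once the ball is holed.
# Values come from the poor-approach sequence in the worked example.
expected = [2.82, 2.65, 1.04, 0.0]

# Strokes gained per shot: start expectation - end expectation - 1
per_shot_sg = [start - end - 1 for start, end in zip(expected, expected[1:])]

strokes_taken = len(per_shot_sg)
total_sg = sum(per_shot_sg)

print([round(sg, 2) for sg in per_shot_sg])   # [-0.83, 0.61, 0.04]

# The total equals the starting expectation minus the strokes taken
print(round(total_sg, 2), round(expected[0] - strokes_taken, 2))   # -0.18 -0.18
```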

Why Strokes Gained Matters

Traditional golf statistics are deeply flawed:

  • Fairways hit treats all misses equally. A 310-yard drive that runs into the first cut of rough is not the same as a 240-yard drive hooked into the trees, but they both count as a “miss.”
  • Greens in regulation (GIR) does not distinguish between a shot that lands 5 feet from the pin and one that lands 50 feet away. Both are “on the green.”
  • Putts per round ignores how long each putt was, so it mixes putting skill with approach and short-game results. A player who two-putts from 40 feet records the same two putts as a player who two-putts from 10 feet, even though the first putted far better relative to expectation. Worse, if you miss a green entirely and chip to 2 feet, that hole counts as only one putt, which improves your putting stats even though it was your approach that failed.

Strokes gained fixes all of these problems because it accounts for where the ball starts and where it ends up. Every shot is evaluated in context. A drive that goes 280 yards into the rough from a 400-yard hole is not the same as a drive that goes 280 yards into the rough from a 550-yard hole, and strokes gained captures that difference.

Mark Broadie, the Columbia professor who developed strokes gained, showed that the traditional belief – “driving is about distance, scoring is about putting” – was wrong. His analysis of PGA Tour data revealed that long game performance (tee shots and approach shots) explains roughly twice as much of the scoring difference between players as short game and putting combined. Strokes gained made this visible for the first time.

The Four Strokes Gained Categories

Strokes gained is typically broken into four categories that correspond to different phases of play:

Category         | Which Shots                                                     | What It Measures
-----------------|-----------------------------------------------------------------|----------------------------------------------------
Off the Tee      | Tee shots (shot_number == 1 on any hole)                        | Driving performance: distance and accuracy combined
Approach         | Non-tee shots that are not on the green and not within 30 yards | Approach shot quality: iron play and long game
Around the Green | Shots within 30 yards of the pin that are not on the green      | Short game: chipping, pitching, bunker play
Putting          | Shots hit from the green                                        | Putting: reads, speed control, holing out

By summing strokes gained within each category, you can see exactly where a player is gaining or losing strokes. A player might be an excellent putter (positive SG: Putting) but a poor iron player (negative SG: Approach). Traditional stats would mask this because the good putting would offset the poor approach play in the final score.

Baseline Tables

The expected strokes values come from baseline tables that represent PGA Tour average performance. These tables map each (lie, distance) combination to the average number of strokes it takes a tour player to finish the hole from that spot.

Our baseline tables (simplified from Broadie’s research) are:

  • Tee: 150-600 yards. A 400-yard tee shot starts at 3.75 expected strokes.
  • Fairway: 20-260 yards. A 140-yard fairway shot starts at 2.97 expected strokes.
  • Rough: 20-240 yards. A 140-yard rough shot starts at 3.18 – slightly worse than the same distance from the fairway.
  • Sand: 10-100 yards. Sand is penalized more heavily, especially at longer distances.
  • Green: 2-90 feet. A 10-foot putt starts at 1.41 expected strokes.
  • Recovery: 20-200 yards. The worst lie – behind trees, deep rough, or other trouble.

When the exact distance is not in the table, we interpolate between the two nearest entries. For example, if the table has entries for 100 yards (2.82) and 120 yards (2.90), then 110 yards would be approximately 2.86.

A positive strokes gained value means the player performed better than this PGA Tour baseline. For amateurs, individual shots can still come out positive, but totals by round or by category will almost always be negative – that is expected. The interesting part is which categories are more negative than others, which reveals where the player has the most room for improvement.


Code

Setup: Imports and Data Loading

We will use pandas for data wrangling and matplotlib and seaborn for visualization. This is the same setup as Topics 06 and 07.

%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style='whitegrid')

# Load all five golf CSVs
players = pd.read_csv('../../data/players.csv')
courses = pd.read_csv('../../data/courses.csv')
holes = pd.read_csv('../../data/holes.csv')
rounds = pd.read_csv('../../data/rounds.csv', parse_dates=['date'])
shots = pd.read_csv('../../data/shots.csv')

print(f'Players:  {players.shape[0]:>5,} rows x {players.shape[1]} columns')
print(f'Courses:  {courses.shape[0]:>5,} rows x {courses.shape[1]} columns')
print(f'Holes:    {holes.shape[0]:>5,} rows x {holes.shape[1]} columns')
print(f'Rounds:   {rounds.shape[0]:>5,} rows x {rounds.shape[1]} columns')
print(f'Shots:    {shots.shape[0]:>5,} rows x {shots.shape[1]} columns')

# Quick look at each table
print('--- Players ---')
print(players.to_string(index=False))
print()
print('--- Courses ---')
print(courses.to_string(index=False))
print()
print('--- Shots (first 10 rows) ---')
print(shots.head(10).to_string(index=False))

1. Load and Merge All Data

To analyze strokes gained per player, we need to connect shots to players. The join chain is: shots -> rounds (via round_id) -> players (via player_id). We also join to courses for course context and to holes for par information.

# Step 1: Merge shots with rounds to get player_id and course_id for each shot
shot_data = pd.merge(shots, rounds[['round_id', 'player_id', 'course_id', 'date', 'weather']], on='round_id')

# Step 2: Merge with players to get player names and handicaps
shot_data = pd.merge(shot_data, players, on='player_id')

# Step 3: Merge with courses to get course names
shot_data = pd.merge(shot_data, courses[['course_id', 'name']], on='course_id')
shot_data = shot_data.rename(columns={'name_x': 'player_name', 'name_y': 'course_name'})

# Step 4: Merge with holes to get par for each hole
shot_data = pd.merge(shot_data, holes[['course_id', 'hole_number', 'par', 'yardage']],
                     left_on=['course_id', 'hole'],
                     right_on=['course_id', 'hole_number'])

print(f'Merged shot_data: {shot_data.shape[0]:,} rows x {shot_data.shape[1]} columns')
print(f'Columns: {list(shot_data.columns)}')
print()
shot_data[['round_id', 'player_name', 'course_name', 'hole', 'shot_number', 'club',
           'start_lie', 'start_distance_to_pin', 'end_lie', 'end_distance_to_pin',
           'strokes_gained', 'par']].head(10)
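
The merges above use the default inner join, which silently drops any shot whose round, player, course, or hole has no match. One way to guard against that is sketched below on toy data. The frames `shots_demo` and `rounds_demo` and their values are hypothetical stand-ins, not the course dataset; the `how`, `validate`, and `indicator` arguments are standard `pd.merge` options.

```python
import pandas as pd

# Toy stand-ins for the real tables (hypothetical values)
shots_demo = pd.DataFrame({
    'round_id': [1, 1, 2, 3],
    'hole': [1, 1, 1, 1],
    'shot_number': [1, 2, 1, 1],
})
rounds_demo = pd.DataFrame({
    'round_id': [1, 2],
    'player_id': [10, 20],
})

# A left join keeps every shot. indicator=True adds a '_merge' column that
# records whether a matching round was found, and validate='many_to_one'
# raises an error if round_id is duplicated in rounds_demo.
checked = pd.merge(shots_demo, rounds_demo, on='round_id', how='left',
                   validate='many_to_one', indicator=True)

unmatched = checked[checked['_merge'] == 'left_only']
print(f'Shots without a matching round: {len(unmatched)}')
```

If the count is zero on the real data, the inner joins lose nothing and the simpler form used above is safe.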

We also build the round_detail DataFrame from Topics 06 and 07 for round-level analysis.

# Build round_detail with player names and course pars
rounds_with_players = pd.merge(rounds, players, on='player_id')
round_detail = pd.merge(rounds_with_players, courses, on='course_id', suffixes=('_player', '_course'))
round_detail = round_detail.rename(columns={'name_player': 'player_name', 'name_course': 'course_name'})

course_par = holes.groupby('course_id')['par'].sum().reset_index()
course_par.columns = ['course_id', 'total_par']
round_detail = pd.merge(round_detail, course_par, on='course_id')
round_detail['relative_to_par'] = round_detail['total_score'] - round_detail['total_par']

print('round_detail ready:')
round_detail[['round_id', 'player_name', 'course_name', 'date', 'total_score',
              'total_par', 'relative_to_par']].head(8)

2. Implement the Expected Strokes Interpolation Function

The baseline tables give expected strokes at specific distances. But a shot might start or end at a distance that is not exactly in the table (e.g., 135 yards from the fairway). We need an interpolation function that finds the two nearest entries in the table and computes a weighted average.

This is a linear interpolation: if 120 yards = 2.90 and 140 yards = 2.97, then 130 yards = 2.90 + (130-120)/(140-120) * (2.97-2.90) = 2.935.

# Baseline expected strokes tables
# Keys are distances (yards for non-green lies, feet for green)
# Values are expected strokes to hole out from that position

BASELINE_STROKES = {
    "tee": {
        150: 2.99, 175: 3.07, 200: 3.15, 225: 3.25, 250: 3.35,
        300: 3.45, 350: 3.55, 400: 3.75, 450: 3.95, 500: 4.15,
        550: 4.35, 600: 4.70
    },
    "fairway": {
        20: 2.40, 40: 2.52, 60: 2.60, 80: 2.72, 100: 2.82,
        120: 2.90, 140: 2.97, 160: 3.05, 180: 3.15, 200: 3.25,
        220: 3.45, 240: 3.65, 260: 3.85
    },
    "rough": {
        20: 2.55, 40: 2.65, 60: 2.75, 80: 2.87, 100: 2.97,
        120: 3.08, 140: 3.18, 160: 3.28, 180: 3.40, 200: 3.55,
        220: 3.70, 240: 3.90
    },
    "sand": {
        10: 2.43, 20: 2.53, 30: 2.73, 40: 2.93, 60: 3.10,
        80: 3.25, 100: 3.40
    },
    "green": {
        2: 1.01, 3: 1.04, 5: 1.15, 8: 1.33, 10: 1.41,
        15: 1.56, 20: 1.67, 30: 1.82, 40: 1.93, 50: 2.02,
        60: 2.09, 90: 2.24
    },
    "recovery": {
        20: 2.80, 40: 2.95, 60: 3.10, 80: 3.25, 100: 3.45,
        150: 3.75, 200: 4.00
    },
}

def expected_strokes(lie, distance):
    """
    Look up the expected strokes to hole out from a given lie and distance.
    Uses linear interpolation between the two nearest baseline entries.
    
    Parameters:
        lie (str): One of 'tee', 'fairway', 'rough', 'sand', 'green', 'recovery'
        distance (float): Distance to the pin in yards (or feet for green)
    
    Returns:
        float: Expected strokes to hole out
    """
    if lie == 'holed':
        return 0.0
    
    table = BASELINE_STROKES[lie]
    distances = sorted(table.keys())
    
    # Clamp to the table range
    if distance <= distances[0]:
        return table[distances[0]]
    if distance >= distances[-1]:
        return table[distances[-1]]
    
    # Find the two bracketing distances
    for i in range(len(distances) - 1):
        d_low = distances[i]
        d_high = distances[i + 1]
        if d_low <= distance <= d_high:
            # Linear interpolation
            fraction = (distance - d_low) / (d_high - d_low)
            return table[d_low] + fraction * (table[d_high] - table[d_low])
    
    # Should not reach here, but return the last value as fallback
    return table[distances[-1]]

# Test the function at known table values
print('--- Exact table lookups (should match the baseline tables) ---')
print(f'Tee, 400 yards:     {expected_strokes("tee", 400):.2f}  (table: 3.75)')
print(f'Fairway, 100 yards: {expected_strokes("fairway", 100):.2f}  (table: 2.82)')
print(f'Green, 10 feet:     {expected_strokes("green", 10):.2f}  (table: 1.41)')
print(f'Sand, 30 yards:     {expected_strokes("sand", 30):.2f}  (table: 2.73)')
print()

# Test interpolation between table values
print('--- Interpolated lookups ---')
print(f'Fairway, 110 yards: {expected_strokes("fairway", 110):.3f}  (between 100=2.82 and 120=2.90)')
print(f'Fairway, 130 yards: {expected_strokes("fairway", 130):.3f}  (between 120=2.90 and 140=2.97)')
print(f'Green, 6 feet:      {expected_strokes("green", 6):.3f}  (between 5=1.15 and 8=1.33)')
print(f'Rough, 150 yards:   {expected_strokes("rough", 150):.3f}  (between 140=3.18 and 160=3.28)')
print()

# Test edge cases: clamping to table boundaries
print('--- Edge cases (clamped to table range) ---')
print(f'Tee, 100 yards:     {expected_strokes("tee", 100):.2f}  (below min 150, clamped)')
print(f'Fairway, 300 yards: {expected_strokes("fairway", 300):.2f}  (above max 260, clamped)')
print(f'Holed:              {expected_strokes("holed", 0):.2f}  (ball in the cup)')
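
As a cross-check, the same lookup for a single lie can be sketched with NumPy's built-in `np.interp`, which also accepts whole arrays of distances at once. This is an alternative shown for comparison, not the method the dataset was built with; the list names are assumptions, and the fairway values are copied from the baseline table above.

```python
import numpy as np

# Fairway entries copied from the baseline table
fairway_distances = [20, 40, 60, 80, 100, 120, 140, 160, 180, 200, 220, 240, 260]
fairway_strokes = [2.40, 2.52, 2.60, 2.72, 2.82, 2.90, 2.97, 3.05, 3.15, 3.25,
                   3.45, 3.65, 3.85]

# np.interp needs increasing x-coordinates. For inputs outside the table
# range it returns the first or last y-value by default, which matches the
# clamping behavior of expected_strokes.
queries = np.array([110, 130, 300])
interpolated = np.interp(queries, fairway_distances, fairway_strokes)

print(interpolated.round(3))
```

The three results should agree with the 110-yard, 130-yard, and clamped 300-yard fairway lookups printed by the tests above.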

3. Verify the Strokes Gained Calculation

The shots.csv file already has strokes_gained pre-computed. Let’s verify those values by recalculating a few shots step by step using our expected_strokes function.

# Take the first few shots from the dataset and verify
print('Verifying strokes gained calculations for Bear Woods, Round 1, Hole 1:\n')

hole_1_shots = shots[(shots['round_id'] == 1) & (shots['hole'] == 1)].copy()

for _, shot in hole_1_shots.iterrows():
    start_exp = expected_strokes(shot['start_lie'], shot['start_distance_to_pin'])
    end_exp = expected_strokes(shot['end_lie'], shot['end_distance_to_pin'])
    calculated_sg = start_exp - end_exp - 1
    
    print(f"Shot {shot['shot_number']}: {shot['club']}")
    print(f"  Start: {shot['start_lie']} at {shot['start_distance_to_pin']} -> expected {start_exp:.3f}")
    print(f"  End:   {shot['end_lie']} at {shot['end_distance_to_pin']} -> expected {end_exp:.3f}")
    print(f"  SG calculated: {start_exp:.3f} - {end_exp:.3f} - 1 = {calculated_sg:.3f}")
    print(f"  SG in dataset: {shot['strokes_gained']:.3f}")
    print(f"  Match: {abs(calculated_sg - shot['strokes_gained']) < 0.01}")
    print()

# Verify across the entire dataset
# Calculate strokes gained for every shot and compare to the pre-computed values

shots['sg_calculated'] = shots.apply(
    lambda row: expected_strokes(row['start_lie'], row['start_distance_to_pin'])
                - expected_strokes(row['end_lie'], row['end_distance_to_pin'])
                - 1,
    axis=1
)

shots['sg_diff'] = abs(shots['sg_calculated'] - shots['strokes_gained'])

print(f'Total shots checked: {len(shots):,}')
print(f'Max difference:      {shots["sg_diff"].max():.6f}')
print(f'Mean difference:     {shots["sg_diff"].mean():.6f}')
print(f'Shots within 0.01:   {(shots["sg_diff"] < 0.01).sum():,} / {len(shots):,}')

# Clean up the temporary columns
shots = shots.drop(columns=['sg_calculated', 'sg_diff'])

The calculated values match the pre-computed values in the dataset. Small floating-point differences are expected and negligible. This confirms that the strokes_gained column in shots.csv was computed using the same baseline tables and interpolation method.
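
When a comparison like this does not come out clean, it helps to pull out the specific rows that disagree. A minimal sketch using `np.isclose` on toy values follows. The `demo` frame and its numbers are hypothetical, and the 0.01 tolerance simply mirrors the threshold used in the check above.

```python
import numpy as np
import pandas as pd

# Toy values standing in for the two columns (hypothetical numbers)
demo = pd.DataFrame({
    'strokes_gained': [0.49, -0.83, 0.61, 0.04],
    'sg_calculated': [0.4900001, -0.83, 0.61, 0.14],
})

# atol is the absolute tolerance; rtol=0 turns off the relative component
demo['matches'] = np.isclose(demo['sg_calculated'], demo['strokes_gained'],
                             atol=0.01, rtol=0)

# Rows that disagree are the ones worth inspecting by hand
print(demo[~demo['matches']])
```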

4. Categorize Shots into Strokes Gained Categories

Now we classify every shot into one of the four standard strokes gained categories:

  • Off the Tee: shot_number == 1 (tee shots on every hole)
  • Putting: start_lie == "green" (any shot from the putting surface)
  • Around the Green: start_distance_to_pin <= 30 and start_lie != "green" (chips, pitches, bunker shots near the green)
  • Approach: everything else (iron shots, layups, long approach shots)

def categorize_sg(row):
    """Classify a shot into one of the four strokes gained categories."""
    if row['shot_number'] == 1:
        return 'off_the_tee'
    elif row['start_lie'] == 'green':
        return 'putting'
    elif row['start_distance_to_pin'] <= 30:
        return 'around_the_green'
    else:
        return 'approach'


shot_data['sg_category'] = shot_data.apply(categorize_sg, axis=1)

print('Strokes gained category distribution:')
print(shot_data['sg_category'].value_counts())
print()

# Show average strokes gained by category across all players
print('Average strokes gained by category (all players):')
print(shot_data.groupby('sg_category')['strokes_gained'].mean().round(3))

Putting usually has the most shots because nearly every hole ends with at least one putt, and often two or more. The negative averages across all categories are expected – our players are amateurs being measured against a PGA Tour baseline. The question is not whether they lose strokes (they do), but where they lose the most.
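
Row-wise `apply` is easy to read but slow on large tables. The same rules can be sketched in vectorized form with `np.select`, shown here on toy data. The `demo` frame and its values are hypothetical; the column names and category labels follow `categorize_sg`.

```python
import numpy as np
import pandas as pd

# Toy shots (hypothetical values) using the same column names as shot_data
demo = pd.DataFrame({
    'shot_number': [1, 2, 3, 4],
    'start_lie': ['tee', 'fairway', 'sand', 'green'],
    'start_distance_to_pin': [380, 150, 15, 12],
})

# Conditions are checked in order and the first match wins, so the ordering
# mirrors the if/elif chain in categorize_sg. The green check must come
# before the distance check because green distances are in feet.
conditions = [
    demo['shot_number'] == 1,
    demo['start_lie'] == 'green',
    demo['start_distance_to_pin'] <= 30,
]
choices = ['off_the_tee', 'putting', 'around_the_green']

demo['sg_category'] = np.select(conditions, choices, default='approach')
print(demo)
```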

5. Player Strokes Gained Analysis by Category

This is the core analysis. By grouping strokes gained by player and category, we can see each player’s strengths and weaknesses.

# Average strokes gained per shot, by player and category
sg_by_player_cat = shot_data.groupby(['player_name', 'sg_category'])['strokes_gained'].mean().round(3)

# Reshape into a readable table: players as rows, categories as columns
sg_pivot = sg_by_player_cat.unstack()

# Reorder columns in a logical order
col_order = ['off_the_tee', 'approach', 'around_the_green', 'putting']
sg_pivot = sg_pivot[col_order]

print('Average Strokes Gained per Shot by Player and Category:')
print(sg_pivot)
print()

# Also show total strokes gained per round by category
# This is more intuitive: "how many strokes does this player gain/lose per round in each category?"
sg_per_round = shot_data.groupby(['player_name', 'round_id', 'sg_category'])['strokes_gained'].sum().reset_index()
sg_per_round_avg = sg_per_round.groupby(['player_name', 'sg_category'])['strokes_gained'].mean().unstack()
sg_per_round_avg = sg_per_round_avg[col_order].round(2)
sg_per_round_avg['total'] = sg_per_round_avg.sum(axis=1).round(2)

print('Average Strokes Gained per Round by Player and Category:')
print(sg_per_round_avg)

The “per round” table is the most actionable view. It tells you: on an average round, how many strokes does each player gain or lose in each phase of play, compared to a PGA Tour baseline?

Remember, all values are negative because these are amateurs compared to PGA Tour players. The player with values closest to zero is performing most like a tour pro in that category. The player with the most negative values has the most room for improvement.
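
The per-shot and per-round tables answer different questions, and the difference comes from the order of aggregation. A small sketch on toy data (hypothetical values, one player, putting only) makes the two-step calculation explicit:

```python
import pandas as pd

# Toy data (hypothetical): three putts in round 1, two putts in round 2
demo = pd.DataFrame({
    'round_id': [1, 1, 1, 2, 2],
    'sg_category': ['putting'] * 5,
    'strokes_gained': [-0.2, -0.1, -0.3, -0.4, -0.2],
})

# Per-shot view: average over all five putts
per_shot = demo.groupby('sg_category')['strokes_gained'].mean()

# Per-round view: total within each round first, then average those totals
round_totals = demo.groupby(['round_id', 'sg_category'])['strokes_gained'].sum()
per_round = round_totals.groupby(level='sg_category').mean()

print(round(per_shot['putting'], 2), round(per_round['putting'], 2))   # -0.24 -0.6
```

The per-round figure is larger in magnitude because it accumulates every shot in the category, which is why it reads more naturally as "strokes lost per round."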

6. Strokes Gained by Club per Player

Breaking down strokes gained by club shows which specific clubs are helping or hurting each player. A player might lose strokes overall on approach shots, but the analysis might reveal that their 7-Iron is fine – it is their long irons that are the problem.

# Average strokes gained by club for each player
sg_by_club = shot_data.groupby(['player_name', 'club']).agg(
    avg_sg=('strokes_gained', 'mean'),
    shot_count=('strokes_gained', 'count'),
    total_sg=('strokes_gained', 'sum')
).round(3)

# Show for each player, sorted by average strokes gained
for player in sorted(shot_data['player_name'].unique()):
    player_clubs = sg_by_club.loc[player].sort_values('avg_sg')
    print(f'\n--- {player} ---')
    print(player_clubs.to_string())

# Create a pivot table of average SG by player and club
# This gives a compact view for comparison
club_pivot = shot_data.groupby(['player_name', 'club'])['strokes_gained'].mean().unstack()

# Order clubs roughly by distance (long to short)
club_order = ['Driver', '3-Wood', '5-Wood', 'Hybrid', '3-Iron', '4-Iron', '5-Iron',
              '6-Iron', '7-Iron', '8-Iron', '9-Iron', 'PW', 'GW', 'SW', 'LW', 'Putter']
available_clubs = [c for c in club_order if c in club_pivot.columns]
club_pivot = club_pivot[available_clubs].round(3)

print('Average Strokes Gained by Player and Club:')
print(club_pivot.to_string())
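
Averages for rarely used clubs rest on very few shots and can be misleading. One hedge is to require a minimum shot count before ranking clubs, sketched here on toy data. The `demo` values and the `MIN_SHOTS` threshold are illustrative assumptions.

```python
import pandas as pd

# Toy data (hypothetical): the 3-Wood was only hit once
demo = pd.DataFrame({
    'club': ['Driver', 'Driver', 'Driver', '3-Wood', 'Putter', 'Putter'],
    'strokes_gained': [-0.3, -0.1, -0.2, -1.1, -0.1, 0.1],
})

by_club = demo.groupby('club').agg(
    avg_sg=('strokes_gained', 'mean'),
    shot_count=('strokes_gained', 'count'),
)

MIN_SHOTS = 2  # illustrative threshold; pick one that suits the data size
reliable = by_club[by_club['shot_count'] >= MIN_SHOTS].sort_values('avg_sg')
print(reliable.round(3))
```

The single bad 3-Wood shot drops out of the ranking instead of appearing as the worst club.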

7. Visualizations

Numbers are essential, but charts make patterns visible at a glance. We will create three key strokes gained visualizations.

Visualization 1: Strokes Gained by Category per Player (Grouped Bar Chart)

This is the most important strokes gained chart. It shows each player’s average strokes gained per round in each of the four categories, side by side.

# Grouped bar chart: SG by category per player
# Use the per-round averages for a more intuitive scale
sg_plot_data = sg_per_round_avg.drop(columns='total')

fig, ax = plt.subplots(figsize=(10, 6))

# Set up bar positions
categories = sg_plot_data.columns.tolist()
players_list = sg_plot_data.index.tolist()
x = np.arange(len(categories))
width = 0.18
offsets = np.arange(len(players_list)) - (len(players_list) - 1) / 2

colors = ['#2ecc71', '#3498db', '#e67e22', '#9b59b6']

for i, (player, color) in enumerate(zip(players_list, colors)):
    values = sg_plot_data.loc[player].values
    ax.bar(x + offsets[i] * width, values, width, label=player, color=color, alpha=0.85)

ax.axhline(y=0, color='gray', linestyle='--', linewidth=0.8)
ax.set_xlabel('Strokes Gained Category')
ax.set_ylabel('Average Strokes Gained per Round')
ax.set_title('Strokes Gained by Category per Player')
ax.set_xticks(x)
ax.set_xticklabels(['Off the Tee', 'Approach', 'Around Green', 'Putting'])
ax.legend(title='Player')
plt.tight_layout()
plt.show()

This chart immediately shows the relative strengths and weaknesses of each player. Bear Woods (low handicap) should have bars closest to zero across all categories. Bobby Bogey (high handicap) should have the most negative bars, especially in the categories that separate high and low handicap players.

Look for patterns: is there a category where all players lose about the same amount? That might indicate a universal challenge. Is there a category where the gap between the best and worst player is largest? That is where skill differentiation is greatest.
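
The gap between the best and worst player can also be computed directly rather than read off the chart. A minimal sketch on toy per-round averages follows. The numbers and player labels are hypothetical; on the real data the same expression could be applied to `sg_plot_data`.

```python
import pandas as pd

# Toy per-round averages (hypothetical numbers and player labels)
demo = pd.DataFrame(
    {
        'off_the_tee': [-1.0, -3.5],
        'approach': [-2.0, -7.0],
        'around_the_green': [-1.5, -2.5],
        'putting': [-1.0, -1.5],
    },
    index=['Player A', 'Player B'],
)

# Gap between the best and worst player within each category
gap = (demo.max() - demo.min()).sort_values(ascending=False)
print(gap)
print('Largest gap:', gap.idxmax())
```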

Visualization 2: Total Strokes Gained per Round for Brian Kolowitz

Let’s zoom in on one player and track their strokes gained across rounds. This shows consistency and trends.

# Total strokes gained per round for Brian Kolowitz
brian_rounds = shot_data[shot_data['player_name'] == 'Brian Kolowitz']

brian_sg_per_round = brian_rounds.groupby(['round_id', 'date']).agg(
    total_sg=('strokes_gained', 'sum')
).reset_index().sort_values('date')

# Create labels for each round
brian_sg_per_round['round_label'] = brian_sg_per_round['date'].dt.strftime('%m/%d')

fig, ax = plt.subplots(figsize=(10, 5))

bar_colors = ['#2ecc71' if v >= 0 else '#e74c3c' for v in brian_sg_per_round['total_sg']]
ax.bar(brian_sg_per_round['round_label'], brian_sg_per_round['total_sg'], color=bar_colors, alpha=0.85)
ax.axhline(y=0, color='gray', linestyle='--', linewidth=0.8)
ax.set_xlabel('Round Date')
ax.set_ylabel('Total Strokes Gained')
ax.set_title('Brian Kolowitz: Total Strokes Gained per Round')

# Add value labels on each bar
for i, (_, row) in enumerate(brian_sg_per_round.iterrows()):
    offset = -0.8 if row['total_sg'] < 0 else 0.3
    ax.text(i, row['total_sg'] + offset, f"{row['total_sg']:.1f}",
            ha='center', va='bottom' if row['total_sg'] >= 0 else 'top', fontsize=9)

plt.tight_layout()
plt.show()

Green bars mean Brian gained strokes overall in that round (unlikely against a PGA Tour baseline for an amateur, but possible on a shorter course). Red bars mean he lost strokes. The height of each bar shows the magnitude. This gives a quick visual of consistency – are the bars roughly the same height, or does Brian have wild swings between good and bad rounds?

Visualization 3: Strokes Gained Heatmap (Player x Category)

A heatmap gives a dense, color-coded view of the same data. Red cells are where players lose the most strokes; cells closer to yellow or green are where they lose the least (or gain).

# Heatmap of average strokes gained per round by player and category
heatmap_data = sg_per_round_avg.drop(columns='total')
heatmap_data.columns = ['Off the Tee', 'Approach', 'Around Green', 'Putting']

fig, ax = plt.subplots(figsize=(9, 5))
sns.heatmap(heatmap_data, annot=True, fmt='.2f', cmap='RdYlGn', center=0,
            linewidths=1, ax=ax, cbar_kws={'label': 'Avg SG per Round'})
ax.set_title('Strokes Gained per Round: Player x Category')
ax.set_ylabel('Player')
ax.set_xlabel('SG Category')
plt.tight_layout()
plt.show()

The heatmap uses a diverging Red-Yellow-Green colormap centered at zero. The darkest red cells represent the biggest strokes lost per round; greener cells represent areas where the player is closer to (or better than) the baseline. You can instantly see the overall pattern: which player-category combinations are the brightest red, and which are closest to neutral.

This is the same data as the grouped bar chart, but the heatmap is often better for spotting patterns across a matrix of values – your eye naturally finds the darkest and lightest cells.

8. Personal Performance Report: Brian Kolowitz

The ultimate goal of strokes gained analysis is to produce actionable insights for a specific player. Let’s build a complete performance report for Brian Kolowitz that a golf coach could use to design a practice plan.

Brian Kolowitz has a 13.9 handicap – a mid-handicap player. He is beyond the beginner stage and is looking to break into single digits. The question is: where should he focus his practice time?

# Brian Kolowitz performance overview
brian_shots = shot_data[shot_data['player_name'] == 'Brian Kolowitz']
brian_rounds_detail = round_detail[round_detail['player_name'] == 'Brian Kolowitz']

print('=' * 60)
print('PERFORMANCE REPORT: Brian Kolowitz')
print('=' * 60)
print()
print(f'Handicap: {players[players["name"] == "Brian Kolowitz"]["handicap"].values[0]}')
print(f'Rounds played: {len(brian_rounds_detail)}')
print(f'Total shots: {len(brian_shots):,}')
print()

# Scoring summary
print('--- Scoring Summary ---')
print(f'Average score:     {brian_rounds_detail["total_score"].mean():.1f}')
print(f'Best round:        {brian_rounds_detail["total_score"].min()}')
print(f'Worst round:       {brian_rounds_detail["total_score"].max()}')
print(f'Std deviation:     {brian_rounds_detail["total_score"].std():.1f}')
print(f'Avg vs par:        {brian_rounds_detail["relative_to_par"].mean():+.1f}')

# Strokes gained by category
brian_sg_cat = brian_shots.groupby('sg_category')['strokes_gained'].agg(['mean', 'sum', 'count'])
brian_sg_cat.columns = ['avg_sg_per_shot', 'total_sg', 'shot_count']

# Per-round averages
brian_sg_round = brian_shots.groupby(['round_id', 'sg_category'])['strokes_gained'].sum().reset_index()
brian_sg_round_avg = brian_sg_round.groupby('sg_category')['strokes_gained'].mean()
brian_sg_cat['avg_sg_per_round'] = brian_sg_round_avg

cat_order = ['off_the_tee', 'approach', 'around_the_green', 'putting']
brian_sg_cat = brian_sg_cat.reindex(cat_order)

print('--- Strokes Gained by Category ---')
print(brian_sg_cat.round(3).to_string())
print()
print(f'Total SG per round (all categories): {brian_sg_cat["avg_sg_per_round"].sum():.2f}')

# Strokes gained by club
brian_club_sg = brian_shots.groupby('club').agg(
    avg_sg=('strokes_gained', 'mean'),
    total_sg=('strokes_gained', 'sum'),
    shot_count=('strokes_gained', 'count')
).sort_values('avg_sg')

print('--- Strokes Gained by Club ---')
print(brian_club_sg.round(3).to_string())

# Brian Kolowitz performance dashboard: 2x2 grid
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# --- Top left: SG by category ---
ax1 = axes[0, 0]
cat_labels = ['Off the Tee', 'Approach', 'Around Green', 'Putting']
cat_values = brian_sg_cat['avg_sg_per_round'].values
bar_colors = ['#2ecc71' if v >= 0 else '#e74c3c' for v in cat_values]
ax1.barh(cat_labels, cat_values, color=bar_colors, alpha=0.85)
ax1.axvline(x=0, color='gray', linestyle='--', linewidth=0.8)
ax1.set_xlabel('Avg Strokes Gained per Round')
ax1.set_title('SG by Category')
for i, v in enumerate(cat_values):
    ax1.text(v + (0.1 if v >= 0 else -0.1), i, f'{v:.2f}',
             ha='left' if v >= 0 else 'right', va='center', fontsize=10)

# --- Top right: SG by club ---
ax2 = axes[0, 1]
club_data = brian_club_sg[brian_club_sg['shot_count'] >= 5].sort_values('avg_sg')
club_colors = ['#2ecc71' if v >= 0 else '#e74c3c' for v in club_data['avg_sg']]
ax2.barh(club_data.index, club_data['avg_sg'], color=club_colors, alpha=0.85)
ax2.axvline(x=0, color='gray', linestyle='--', linewidth=0.8)
ax2.set_xlabel('Avg Strokes Gained per Shot')
ax2.set_title('SG by Club (min 5 shots)')

# --- Bottom left: SG per round trend ---
ax3 = axes[1, 0]
brian_round_trend = brian_shots.groupby(['round_id', 'date'])['strokes_gained'].sum().reset_index()
brian_round_trend = brian_round_trend.sort_values('date')
round_colors = ['#2ecc71' if v >= 0 else '#e74c3c' for v in brian_round_trend['strokes_gained']]
ax3.bar(range(len(brian_round_trend)), brian_round_trend['strokes_gained'], color=round_colors, alpha=0.85)
ax3.axhline(y=0, color='gray', linestyle='--', linewidth=0.8)
ax3.set_xticks(range(len(brian_round_trend)))
ax3.set_xticklabels(brian_round_trend['date'].dt.strftime('%m/%d'), rotation=30)
ax3.set_xlabel('Round Date')
ax3.set_ylabel('Total Strokes Gained')
ax3.set_title('SG per Round Over Time')

# --- Bottom right: SG distribution by category (box plot) ---
ax4 = axes[1, 1]
brian_cat_data = brian_shots.copy()
brian_cat_data['sg_category'] = pd.Categorical(
    brian_cat_data['sg_category'],
    categories=cat_order,
    ordered=True
)
cat_box_data = [brian_cat_data[brian_cat_data['sg_category'] == cat]['strokes_gained'].values
                for cat in cat_order]
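bp = ax4.boxplot(cat_box_data, patch_artist=True)
# Set tick labels separately: the boxplot `labels` keyword was renamed to
# `tick_labels` in newer Matplotlib releases, so this form works on both.
ax4.set_xticklabels(['Tee', 'Approach', 'A.Green', 'Putting'])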
box_colors = ['#2ecc71', '#3498db', '#e67e22', '#9b59b6']
for patch, color in zip(bp['boxes'], box_colors):
    patch.set_facecolor(color)
    patch.set_alpha(0.7)
ax4.axhline(y=0, color='gray', linestyle='--', linewidth=0.8)
ax4.set_ylabel('Strokes Gained per Shot')
ax4.set_title('SG Distribution by Category')

fig.suptitle('Brian Kolowitz -- Strokes Gained Performance Report', fontsize=14, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()

# Generate actionable insights
print('=' * 60)
print('ACTIONABLE INSIGHTS: Brian Kolowitz')
print('=' * 60)
print()

# Find the weakest and strongest categories
worst_cat = brian_sg_cat['avg_sg_per_round'].idxmin()
best_cat = brian_sg_cat['avg_sg_per_round'].idxmax()

# Print the signed value so the text stays correct even if a category is positive
print(f'1. BIGGEST WEAKNESS: {worst_cat.replace("_", " ").title()}')
print(f'   Brian averages {brian_sg_cat.loc[worst_cat, "avg_sg_per_round"]:+.2f} strokes gained per round')
print(f'   in this category. This is where practice will have the biggest impact.')
print()

print(f'2. RELATIVE STRENGTH: {best_cat.replace("_", " ").title()}')
print(f'   Brian averages {brian_sg_cat.loc[best_cat, "avg_sg_per_round"]:+.2f} strokes gained per round')
print(f'   here -- his best category relative to the PGA Tour baseline.')
print()

# Find worst and best clubs
# Only consider clubs with at least 5 shots; filter once and reuse
eligible_clubs = brian_club_sg[brian_club_sg['shot_count'] >= 5]
worst_club = eligible_clubs['avg_sg'].idxmin()
worst_club_sg = eligible_clubs.loc[worst_club, 'avg_sg']
best_club = eligible_clubs['avg_sg'].idxmax()
best_club_sg = eligible_clubs.loc[best_club, 'avg_sg']

print(f'3. WORST CLUB: {worst_club}')
print(f'   Average SG per shot: {worst_club_sg:.3f}')
print(f'   Consider lessons focused specifically on this club, or explore')
print(f'   whether a different club or strategy could replace some of these shots.')
print()

print(f'4. BEST CLUB: {best_club}')
print(f'   Average SG per shot: {best_club_sg:.3f}')
print(f'   This is Brian\'s most reliable club. Course strategy should maximize')
print(f'   opportunities to use it.')
print()

# Consistency analysis
brian_round_sgs = brian_shots.groupby('round_id')['strokes_gained'].sum()
print(f'5. CONSISTENCY:')
print(f'   SG per round range: {brian_round_sgs.min():.1f} to {brian_round_sgs.max():.1f}')
print(f'   Std deviation:      {brian_round_sgs.std():.1f}')
sg_range = brian_round_sgs.max() - brian_round_sgs.min()
if sg_range > 15:
    print(f'   Brian has high variance ({sg_range:.1f} stroke spread). Reducing blowup')
    print(f'   holes would lower his average more than incremental skill improvement.')
else:
    print(f'   Brian is fairly consistent ({sg_range:.1f} stroke spread).')
    print(f'   Focus on systematic improvement in the weakest category.')
print()

# Compare to the other players
print(f'6. RANKING vs. THIS GROUP:')
all_sg = shot_data.groupby(['player_name', 'round_id'])['strokes_gained'].sum().reset_index()
all_sg_avg = all_sg.groupby('player_name')['strokes_gained'].mean().sort_values(ascending=False)
for i, (name, sg) in enumerate(all_sg_avg.items(), 1):
    marker = '  <-- Brian' if name == 'Brian Kolowitz' else ''
    print(f'   {i}. {name:20s} {sg:+.2f} SG/round{marker}')

This is the payoff of the entire course. We started with raw CSV files in Topic 03, learned to model them with classes in Topic 05, wrangled them with pandas in Topic 06, visualized patterns in Topic 07, and now in Topic 08 we have produced a personalized, data-driven coaching report.

The insights are specific and actionable. “Practice your short game” is vague. “You lose 3.2 strokes per round on approach shots, primarily with your 5-Iron and 6-Iron” is something a coach can build a practice plan around.
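As a rough illustration of how a sentence like that can be produced from shot-level data, here is a minimal sketch. The DataFrame contents are invented, and the `club` column name and the category labels (`approach`, `putting`, `off_the_tee`) are assumptions made for the example; only `round_id`, `sg_category`, and `strokes_gained` mirror names used in this notebook.

```python
import pandas as pd

# Hypothetical shot-level data (values invented for illustration).
shots = pd.DataFrame({
    'round_id':       [1, 1, 1, 1, 2, 2, 2, 2],
    'sg_category':    ['approach', 'approach', 'approach', 'putting',
                       'approach', 'approach', 'off_the_tee', 'putting'],
    'club':           ['5-Iron', '6-Iron', '9-Iron', 'Putter',
                       '5-Iron', '6-Iron', 'Driver', 'Putter'],
    'strokes_gained': [-0.8, -0.6, -0.1, -0.3, -0.7, -0.5, -0.4, -0.2],
})

n_rounds = shots['round_id'].nunique()

# Strokes gained per round in each category; the most negative is the weakest.
per_round = shots.groupby('sg_category')['strokes_gained'].sum() / n_rounds
worst = per_round.idxmin()

# Within the weakest category, find the two clubs that cost the most per round.
by_club = (
    shots[shots['sg_category'] == worst]
    .groupby('club')['strokes_gained'].sum() / n_rounds
)
top_clubs = by_club.nsmallest(2).index.tolist()

print(f'You lose {abs(per_round[worst]):.1f} strokes per round on {worst} shots, '
      f'primarily with your {" and ".join(top_clubs)}.')
```

The same two-step pattern (find the weakest category, then drill into clubs within it) works on the real `brian_shots` data, provided the column names match.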


AI

Exercise 1: Ask AI to Explain Strokes Gained to a Beginner

Give an AI assistant the following prompt:

Explain the concept of “strokes gained” in golf to someone who plays golf casually but has never heard of strokes gained. Use a concrete example to illustrate the idea.

Evaluate the AI’s response:

  • Accuracy: Does it correctly describe the formula? Strokes gained = expected strokes at start - expected strokes at end - 1. If the AI says something like “strokes gained = par - actual strokes” or confuses it with simple scoring, that is wrong.
  • Baseline explanation: Does it explain where the expected strokes numbers come from? (PGA Tour averages, millions of shots analyzed.) If the AI skips this, the concept loses its foundation.
  • Example quality: Does the worked example use realistic numbers? A 150-yard approach from the fairway should reference a baseline around 2.8-3.0 strokes, not 1.5 or 5.0. If the numbers are wildly unrealistic, the AI is making them up.
  • Key insight: Does it explain why strokes gained is better than traditional stats? The core insight is that SG accounts for the difficulty of the starting position, which stats like fairways hit and putts per round do not.
  • What it might get wrong: AI often says “strokes gained measures how many fewer strokes a player takes than average.” This is only true at the round level, not the shot level. At the shot level, it measures how much one shot improved (or worsened) the expected outcome. Watch for this conflation.

# Paste the AI-generated explanation here as a comment or markdown.
# Then write your evaluation:
# - Was the formula correct?
# - Were the example numbers realistic?
# - Did it explain the baseline source?
# - Did it explain why SG > traditional stats?
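To make the shot-level versus round-level distinction concrete when evaluating the AI's answer, here is a minimal sketch. The expected-strokes values are illustrative numbers chosen for the example, not entries from the baseline tables, and `shot_sg` is a hypothetical helper name.

```python
# Illustrative only: the expected-strokes values below are made up.

def shot_sg(expected_start, expected_end):
    """Strokes gained for one shot: expected(start) - expected(end) - 1."""
    return expected_start - expected_end - 1

# One hole played in 4 strokes. Each entry is the expected strokes to hole out
# before a shot; the final 0.0 means the ball is in the hole.
expected_before_each_shot = [4.10, 2.90, 2.60, 1.05, 0.0]

per_shot = [
    shot_sg(start, end)
    for start, end in zip(expected_before_each_shot, expected_before_each_shot[1:])
]
hole_total = sum(per_shot)
strokes_taken = len(per_shot)

for i, sg in enumerate(per_shot, 1):
    print(f'Shot {i}: {sg:+.2f}')
print(f'Hole total: {hole_total:+.2f}')

# The shot-level values telescope: their sum equals the expected strokes from
# the tee minus the strokes actually taken.
print(f'Check:      {expected_before_each_shot[0] - strokes_taken:+.2f}')
```

Individual shots can be well above or below zero even when the hole total is close to the baseline, which is why "fewer strokes than average" only describes the sum.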

Exercise 2: Ask AI to Implement Expected Strokes Interpolation

Give an AI assistant this prompt:

I have a Python dictionary that maps distances to expected strokes for a golf lie type:

fairway_baseline = {20: 2.40, 40: 2.52, 60: 2.60, 80: 2.72, 100: 2.82, 120: 2.90, 140: 2.97}

Write a function expected_strokes(distance, baseline) that returns the expected strokes for any distance, using linear interpolation between the two nearest entries. Handle edge cases where the distance is outside the table range.

Evaluate the AI’s response:

  • Does the interpolation work correctly? Test it: expected_strokes(110, fairway_baseline) should return something between 2.82 and 2.90 (specifically, 2.86). If the AI’s function returns 2.82 (nearest neighbor without interpolation) or errors out, it failed.
  • Edge case: below the table minimum. What does it do for expected_strokes(5, fairway_baseline)? Reasonable options: return the value at 20 (clamping), raise an error, or extrapolate. Clamping is what our implementation does.
  • Edge case: above the table maximum. What does it do for expected_strokes(200, fairway_baseline)? Same choices as above.
  • Edge case: exact table value. expected_strokes(100, fairway_baseline) should return exactly 2.82.
  • Code quality: Is the function clean and readable? Does it use bisect or manual search? Either is fine, but the logic should be clear.

# Paste the AI-generated function here and test it:

# fairway_baseline = {20: 2.40, 40: 2.52, 60: 2.60, 80: 2.72, 100: 2.82, 120: 2.90, 140: 2.97}

# Test cases:
# print(expected_strokes(110, fairway_baseline))  # Should be ~2.86
# print(expected_strokes(100, fairway_baseline))  # Should be exactly 2.82
# print(expected_strokes(5, fairway_baseline))    # Edge case: below minimum
# print(expected_strokes(200, fairway_baseline))  # Edge case: above maximum
# print(expected_strokes(20, fairway_baseline))   # Exact boundary
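For comparison with whatever the AI produces, here is one possible reference version using `bisect` and clamping. It is a sketch written for this exercise and is not necessarily identical to the implementation used earlier in the notebook; the early return for exact table hits is a design choice so that table values come back unchanged rather than through floating-point arithmetic.

```python
from bisect import bisect_left

def expected_strokes(distance, baseline):
    """Linearly interpolate expected strokes, clamping outside the table range."""
    # Exact table entries are returned as-is (no floating-point round trip).
    if distance in baseline:
        return baseline[distance]
    distances = sorted(baseline)
    # Clamp: below the smallest or above the largest entry, use the edge value.
    if distance < distances[0]:
        return baseline[distances[0]]
    if distance > distances[-1]:
        return baseline[distances[-1]]
    # Index of the first table distance greater than the requested distance.
    hi = bisect_left(distances, distance)
    d_lo, d_hi = distances[hi - 1], distances[hi]
    # Fraction of the way from d_lo to d_hi.
    t = (distance - d_lo) / (d_hi - d_lo)
    return baseline[d_lo] + t * (baseline[d_hi] - baseline[d_lo])

fairway_baseline = {20: 2.40, 40: 2.52, 60: 2.60, 80: 2.72, 100: 2.82, 120: 2.90, 140: 2.97}

for d in (110, 100, 5, 200, 20):
    print(d, round(expected_strokes(d, fairway_baseline), 2))
```

If the AI's function disagrees with this one on the edge cases, that is not automatically a bug: raising an error or extrapolating are also defensible choices, as long as the behavior is deliberate.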

Exercise 3: Ask AI for Insights from Brian Kolowitz’s SG Data

Give an AI assistant the strokes gained summary data we computed for Brian Kolowitz. Copy and paste the output from section 8 (the category breakdown and club breakdown) into your prompt:

Here is the strokes gained data for a golfer named Brian Kolowitz (13.9 handicap). All values are compared to a PGA Tour baseline (so negative values are expected for an amateur).

[Paste the category and club SG tables here]

Based on this data, what are the top 3 things Brian should focus on to lower his scores? Be specific and actionable.

Evaluate the AI’s response:

  • Does it identify the weakest category correctly? The AI should point to whichever category has the most negative strokes gained per round.
  • Are the recommendations actionable? “Practice more” is useless. “Spend 30 minutes per practice session on 100-150 yard approach shots with your 7-Iron and 8-Iron” is actionable. Does the AI get specific enough?
  • Does it account for volume? A club with -0.5 average SG but only 3 shots matters less than one with -0.2 average SG and 50 shots. Does the AI consider the shot count column?
  • Does it acknowledge the PGA Tour baseline? Since Brian is a 13.9 handicap, being negative against a PGA Tour baseline is completely normal. Does the AI frame the advice in terms of relative weakness (where Brian loses the most compared to his other categories), or does it just say “everything is bad”?
  • What it might miss: The AI probably cannot infer course-specific or weather-specific patterns from the summary tables alone. If it makes claims about Brian’s performance in wind or on specific holes, it is hallucinating beyond the data provided.

# Paste the AI's recommendations here as a comment.
# Then write your evaluation:
# - Did it identify the correct weakest category?
# - Were the recommendations specific and actionable?
# - Did it consider shot count (volume)?
# - Did it properly frame the PGA Tour baseline context?
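The volume point in the checklist above can be checked numerically. The sketch below uses an invented club summary with the same `avg_sg` and `shot_count` column names as the club table in this notebook; the `total_sg` column is an addition made for this example.

```python
import pandas as pd

# Hypothetical club summary; the numbers are invented for illustration.
club_sg = pd.DataFrame(
    {'avg_sg': [-0.50, -0.20, -0.05], 'shot_count': [3, 50, 40]},
    index=['3-Wood', '7-Iron', 'Putter'],
)

# Total strokes lost = average per shot x number of shots.
club_sg['total_sg'] = club_sg['avg_sg'] * club_sg['shot_count']

# Sort so the club costing the most strokes overall comes first.
print(club_sg.sort_values('total_sg'))
```

Here the 3-Wood has the worst average, but the 7-Iron costs far more strokes in total because it is hit so much more often. An AI that ranks clubs by `avg_sg` alone would miss this.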

Summary

Key Strokes Gained Concepts

| Concept | What It Means |
|---|---|
| Expected strokes | The average number of strokes a PGA Tour player needs to hole out from a given lie and distance |
| Strokes gained (shot) | expected_strokes(start) - expected_strokes(end) - 1: how much one shot beat (or missed) the baseline |
| Strokes gained (round) | Sum of all shot-level SG values in a round: total strokes gained or lost vs. baseline |
| Positive SG | The shot (or category, or round) was better than the PGA Tour baseline |
| Negative SG | The shot (or category, or round) was worse than the baseline; strokes were lost |
| SG: Off the Tee | Tee shot performance, distance and accuracy combined |
| SG: Approach | Iron play and long approach shots (not tee shots, not near the green, not putting) |
| SG: Around the Green | Short game within 30 yards, excluding putts |
| SG: Putting | All shots from the putting surface |
| Interpolation | Estimating expected strokes between table entries using a weighted average |
| Baseline | The PGA Tour average that all strokes gained values are measured against |

How This Capstone Ties All Eight Topics Together

This notebook used skills from every previous topic in the course:

| Topic | Skill Used Here |
|---|---|
| Topic 01: Getting Started | Python environment, Jupyter notebooks, running code cells |
| Topic 02: Python Basics | Variables, data types, conditionals, loops, f-strings |
| Topic 03: Working with Files | CSV data loaded with pd.read_csv() (the pandas upgrade from csv.DictReader) |
| Topic 04: Comprehensions | List comprehensions for bar colors, club ordering, data filtering |
| Topic 05: Classes and Data Modeling | Data modeling concepts: understanding how player, course, round, and shot data relate to each other |
| Topic 06: Pandas and EDA | DataFrames, merge, groupby, agg, filtering, pivot tables: the backbone of the entire analysis |
| Topic 07: Data Visualization | matplotlib bar charts, seaborn heatmaps, subplots, dashboards: turning numbers into insight |
| Topic 08: Strokes Gained | Domain-specific analysis: implementing the SG formula, categorizing shots, building a performance report |

What’s Next: Version 2 Ideas

This analysis is a solid foundation, but there is much more you could build on top of it:

  1. Custom baselines. Our baselines are PGA Tour averages. For amateur analysis, you could build baselines from the players’ own data (or from a pool of similar-handicap players). This would make the strokes gained values relative to peer performance rather than tour performance.

  2. Weather and course adjustments. Does Brian lose more strokes in windy conditions? Does he perform differently on the harder courses (higher slope rating)? Filtering the SG analysis by weather or course could reveal situational weaknesses.

  3. Hole-by-hole analysis. Which specific holes cost Brian the most strokes? Combining hole difficulty (handicap index from holes.csv) with strokes gained per hole could identify whether Brian struggles on long par 4s, short par 3s, or something else entirely.

  4. Time-series tracking. As Brian practices and plays more rounds, does his strokes gained improve over time? A rolling average of SG per round would show whether practice is translating to on-course improvement.

  5. Interactive dashboard. Using a library like Plotly or Streamlit, you could build a web-based dashboard where any player can select their name and see their personalized SG report update in real time.

  6. Shot pattern maps. If GPS data were available for each shot (landing position on the course), you could create visual shot maps showing dispersion patterns off the tee, approach shot clustering, and putting heat maps.

Each of these extensions uses the same core skills – pandas, visualization, and the strokes gained framework – just applied to a more specific or more ambitious question.
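As a starting point for the time-series idea (item 4), here is a minimal sketch of a rolling average of SG per round. The dates and totals are invented, the window of 3 rounds is an arbitrary choice, and `sg_rolling_3` is a hypothetical column name; in the notebook the input would come from grouping shot-level `strokes_gained` by round and sorting by date.

```python
import pandas as pd

# Hypothetical per-round SG totals (values invented for illustration).
round_sg = pd.DataFrame({
    'date': pd.to_datetime(['2024-04-06', '2024-04-20', '2024-05-04',
                            '2024-05-18', '2024-06-01']),
    'strokes_gained': [-18.0, -15.0, -21.0, -14.0, -12.0],
}).sort_values('date')

# 3-round rolling mean; min_periods=1 keeps the first rounds instead of NaN.
round_sg['sg_rolling_3'] = (
    round_sg['strokes_gained'].rolling(window=3, min_periods=1).mean()
)
print(round_sg)
```

A rolling mean that drifts toward zero suggests real improvement; a single good round in the raw series does not.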
