Data Visualization

What You’ll Learn

  • Why visualization matters: patterns that are invisible in tables become obvious in charts
  • The major chart types and when to use each one: bar, histogram, line, scatter, box plot, heatmap
  • How to build charts with matplotlib, the foundational Python plotting library
  • How to use seaborn for statistical plots with less code and better defaults
  • How to combine multiple charts into a single dashboard-style figure
  • How to spot (and avoid) misleading charts

Concept

Why Visualize?

In Topic 06 we used pandas to compute summary statistics – means, medians, counts, correlations. Those numbers are essential, but they can hide important patterns.

The classic example is Anscombe’s quartet: four datasets that have nearly identical means, standard deviations, and correlations, but look completely different when plotted. One is a straight line, one is a curve, one has an outlier pulling the correlation, and one is a vertical cluster with a single extreme point. The summary statistics say “these are the same.” The charts say “these are nothing alike.”

The lesson: always plot your data. A table of numbers can confirm what you already suspect, but a chart can reveal what you never thought to ask about.
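The claim about Anscombe's quartet can be checked numerically. The sketch below uses the published quartet values and only the Python standard library; `pearson_r` is a small helper written for this illustration, not a library function.

```python
from statistics import mean, stdev

# Anscombe's quartet (Anscombe, 1973). Sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Collect (mean x, mean y, sd y, correlation) for each of the four sets
summary = {}
for name, (xs, ys) in quartet.items():
    summary[name] = (mean(xs), mean(ys), stdev(ys), pearson_r(xs, ys))
    mx, my, sy, r = summary[name]
    print(f"Set {name:>3}: mean x = {mx:.2f}, mean y = {my:.2f}, sd y = {sy:.2f}, r = {r:.3f}")
```

The four printed rows agree to about two decimal places, which is exactly why the summary table alone cannot tell the datasets apart.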

In our golf context: you can compute that Bear Woods averages 82.4 strokes per round and Bobby Bogey averages 95.4. But a chart can show you how those scores are distributed – is Bear Woods consistently in the low 80s, or does he swing between 75 and 90? Does Bobby Bogey have a few terrible rounds dragging up his average, or is he consistently high? These are the kinds of questions that charts answer at a glance.

Chart Types and When to Use Them

There is no single “best” chart. The right chart depends on the question you are trying to answer and the type of data you have.

Chart Type   | Best For                                                   | Golf Example
Bar chart    | Comparing categories                                       | Average score by player, club usage frequency
Histogram    | Showing the distribution of a single numeric variable      | Distribution of all shot distances, spread of total scores
Line chart   | Showing trends over time or a sequence                     | How each player’s scores change across rounds over the season
Scatter plot | Showing the relationship between two numeric variables     | Handicap vs. average score, distance to pin vs. strokes gained
Box plot     | Comparing distributions across categories, spotting outliers | Score distributions by player, strokes gained by club type
Heatmap      | Showing patterns in matrix or correlation data             | Correlation between handicap, score, slope rating, course rating

A useful rule of thumb:

  • One categorical variable, one numeric -> bar chart or box plot
  • One numeric variable alone -> histogram
  • Two numeric variables -> scatter plot
  • Numeric variable over time -> line chart
  • Matrix of numbers -> heatmap
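The rule of thumb can be written down as a simple lookup. This is only a sketch: `suggest_chart`, `RULES`, and the kind names ('categorical', 'numeric', 'time', 'matrix') are hypothetical names made up for this illustration, and the result is a starting point rather than a verdict.

```python
# Hypothetical lookup table encoding the rule of thumb above.
# Keys are (kind of first variable, kind of second variable or None).
RULES = {
    ("categorical", "numeric"): "bar chart or box plot",
    ("numeric", None): "histogram",
    ("numeric", "numeric"): "scatter plot",
    ("time", "numeric"): "line chart",
    ("matrix", None): "heatmap",
}


def suggest_chart(x_kind, y_kind=None):
    """Return a starting-point chart type for the given variable kinds."""
    return RULES.get((x_kind, y_kind), "no rule of thumb applies; start from the question")


print(suggest_chart("categorical", "numeric"))  # e.g. player vs. total score
print(suggest_chart("numeric"))                 # e.g. shot distance on its own
print(suggest_chart("time", "numeric"))         # e.g. date vs. total score
```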

How to Lie with Charts

Charts can mislead as easily as they can inform. Here are the most common tricks – learn them so you can spot them in the wild and avoid them in your own work:

Truncated y-axis. If a bar chart shows scores of 80, 82, 84, 86 but the y-axis starts at 78 instead of 0, the visual difference between bars is exaggerated. The bar for 82 is drawn twice as tall as the bar for 80, even though the scores differ by only 2.5%. Always check where the axis starts.
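The size of the distortion is easy to work out by hand. The sketch below compares the drawn heights of the bars for 80 and 82 strokes under both axis choices; `drawn_height_ratio` is a hypothetical helper for this illustration.

```python
def drawn_height_ratio(a, b, axis_start=0.0):
    """Ratio of drawn bar heights for values a and b when the axis starts at axis_start."""
    return (b - axis_start) / (a - axis_start)


scores = (80, 82)
honest = drawn_height_ratio(*scores, axis_start=0)      # 82 / 80
truncated = drawn_height_ratio(*scores, axis_start=78)  # (82 - 78) / (80 - 78)

print(f"Axis from 0:  second bar is {honest:.3f}x the height of the first")
print(f"Axis from 78: second bar is {truncated:.3f}x the height of the first")
```

The data did not change between the two lines of output; only the baseline did.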

Cherry-picked time range. Show only the rounds where a player improved and it looks like a steady upward trend. Show the full season and the “trend” might disappear into noise.

Misleading scales. Plotting two different metrics on the same chart with different y-axis scales can make unrelated things look correlated. Dual-axis charts are almost always confusing.

Omitting context. A bar showing “Bobby Bogey: 95 average” looks bad on its own. Shown next to a course average of 92, it looks far less alarming; leave that comparison out and the context is lost.

The antidote to all of these is simple: be honest. Start axes at zero when comparing magnitudes. Show the full data range. Label everything clearly. Let the data tell its own story.

Principles of Good Charts

A good chart follows a few basic principles:

  1. Clear title. The reader should know what the chart shows without reading any other text. “Average Score by Player” is good. “Figure 1” is not.
  2. Labeled axes. Every axis needs a label with units. “Score (strokes)” not just “Score.” “Distance (yards)” not just a number line.
  3. Appropriate scale. Start the y-axis at zero for bar charts. Use a consistent scale when comparing panels. Do not stretch or compress axes to exaggerate effects.
  4. Minimal clutter. Remove unnecessary gridlines, borders, and decorations. Every element on the chart should help the reader understand the data.
  5. Honest representation. Do not cherry-pick data to support a narrative. Show the full picture. If your chart requires a paragraph of caveats to be accurate, it is the wrong chart.

The Tools: matplotlib and seaborn

matplotlib is the foundational plotting library in Python. Many other plotting tools, including seaborn and the pandas .plot() method, are built on top of it. matplotlib gives you full control over every element of a chart (axes, ticks, colors, fonts, layout), but that control comes at the cost of verbosity. Even simple charts take several lines of code.

seaborn is a statistical visualization library built on matplotlib. It provides high-level functions that produce polished charts with less code and better default styling. seaborn is especially good at charts that show distributions and relationships in data – exactly the kind of charts you need for exploratory analysis.

We will learn matplotlib first (because you need to understand the fundamentals) and then layer seaborn on top (because you need to be productive).


Code

Setup: Imports and Data Loading

We will use pandas for data wrangling (you already know this from Topic 06), matplotlib for plotting, and seaborn for statistical charts. The %matplotlib inline magic command tells Jupyter to display charts directly in the notebook.

%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load all five golf CSVs
players = pd.read_csv('../../data/players.csv')
courses = pd.read_csv('../../data/courses.csv')
holes = pd.read_csv('../../data/holes.csv')
rounds = pd.read_csv('../../data/rounds.csv', parse_dates=['date'])
shots = pd.read_csv('../../data/shots.csv')

print(f'Players:  {players.shape[0]:>5,} rows x {players.shape[1]} columns')
print(f'Courses:  {courses.shape[0]:>5,} rows x {courses.shape[1]} columns')
print(f'Holes:    {holes.shape[0]:>5,} rows x {holes.shape[1]} columns')
print(f'Rounds:   {rounds.shape[0]:>5,} rows x {rounds.shape[1]} columns')
print(f'Shots:    {shots.shape[0]:>5,} rows x {shots.shape[1]} columns')

# Build the merged round_detail DataFrame from Topic 06 -- we will use this throughout
rounds_with_players = pd.merge(rounds, players, on='player_id')
round_detail = pd.merge(rounds_with_players, courses, on='course_id', suffixes=('_player', '_course'))
round_detail = round_detail.rename(columns={'name_player': 'player_name', 'name_course': 'course_name'})

# Add total par per course
course_par = holes.groupby('course_id')['par'].sum().reset_index()
course_par.columns = ['course_id', 'total_par']
round_detail = pd.merge(round_detail, course_par, on='course_id')
round_detail['relative_to_par'] = round_detail['total_score'] - round_detail['total_par']

print('round_detail ready:')
round_detail[['round_id', 'player_name', 'course_name', 'date', 'total_score', 'total_par', 'relative_to_par']].head(8)

1. Matplotlib Fundamentals

matplotlib uses a figure/axes model. A figure is the entire image – think of it as the canvas. Axes are the individual plots within that figure. A figure can contain one set of axes (a single chart) or many (a grid of charts).
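To make the figure/axes distinction concrete, here is a minimal sketch that creates one figure with a single set of axes and another with three, then inspects them. It uses the non-interactive Agg backend so it also runs outside Jupyter; that line is an assumption about where you run it and should be left out inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt

# One figure (the canvas) holding a single set of axes
fig_single, ax = plt.subplots(figsize=(6, 4))
print(len(fig_single.axes))  # how many axes live on this figure

# One figure holding a row of three axes
fig_grid, axes = plt.subplots(1, 3, figsize=(12, 4))
print(len(fig_grid.axes))

# Every axes object knows which figure it belongs to
print(axes[0].figure is fig_grid)

plt.close("all")  # release the figures when done
```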

The simplest workflow uses the plt interface directly:

  1. Create a figure with plt.figure()
  2. Call a plotting function like plt.bar(), plt.plot(), or plt.hist()
  3. Add labels with plt.xlabel(), plt.ylabel(), plt.title()
  4. Call plt.tight_layout() to prevent labels from overlapping
  5. Call plt.show() to display the chart

Let’s start with a bar chart showing average scores by player.

# Compute average score per player
avg_by_player = round_detail.groupby('player_name')['total_score'].mean().sort_values()

print('Data we are plotting:')
print(avg_by_player)

# Our first matplotlib chart -- a vertical bar chart
plt.figure(figsize=(8, 5))
plt.bar(avg_by_player.index, avg_by_player.values, color='steelblue')
plt.xlabel('Player')
plt.ylabel('Average Score (strokes)')
plt.title('Average Score by Player')
plt.tight_layout()
plt.show()

Let’s break down what each line does:

  • plt.figure(figsize=(8, 5)) – Creates a new figure that is 8 inches wide and 5 inches tall. Without this, matplotlib uses a default size that is often too small.
  • plt.bar(x, height, color) – Draws a bar chart. The first argument is the category labels (player names), the second is the bar heights (average scores), and color sets the fill color.
  • plt.xlabel() and plt.ylabel() – Label the axes. Always include units.
  • plt.title() – Title for the chart. Should describe what the chart shows.
  • plt.tight_layout() – Adjusts spacing so labels do not get clipped. Always call this.
  • plt.show() – Renders and displays the chart. In Jupyter with %matplotlib inline, the chart appears below the cell.

2. Histograms

A histogram shows the distribution of a single numeric variable. It divides the range of values into bins and counts how many values fall in each bin. This tells you whether the data is symmetric, skewed, has outliers, or has multiple peaks.

Let’s look at two distributions: the starting distance to pin for all shots, and the total scores across all rounds.

# Histogram of starting distance to pin (all shots)
plt.figure(figsize=(8, 5))
plt.hist(shots['start_distance_to_pin'], bins=30, color='steelblue', edgecolor='white')
plt.xlabel('Starting Distance to Pin (yards)')
plt.ylabel('Number of Shots')
plt.title('Distribution of Starting Distance to Pin')
plt.tight_layout()
plt.show()

Notice the shape: there is a large spike near zero (putts on the green) and a spread of longer shots. This is not a normal bell curve – it is bimodal, with one cluster of short shots (putting) and another cluster of longer approach and tee shots. A histogram reveals this instantly; shots['start_distance_to_pin'].mean() would hide it behind a single number.

The bins parameter controls how many bars the histogram has. More bins show finer detail; fewer bins show the overall shape. Try changing bins=30 to bins=10 or bins=60 to see the difference.
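The effect of bins can also be seen without drawing anything, by calling np.histogram, which performs the same kind of binning. The data below is made up for illustration (a cluster of short shots plus a spread of longer ones) and is not the course dataset.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Made-up bimodal data: many short shots plus a spread of longer ones
distances = np.concatenate([
    rng.uniform(0, 10, size=300),    # short shots near the pin
    rng.normal(150, 40, size=200),   # longer approach and tee shots
])

for bins in (10, 30, 60):
    counts, edges = np.histogram(distances, bins=bins)
    width = edges[1] - edges[0]
    print(f"bins={bins:>2}: bin width = {width:6.1f}, tallest bar = {counts.max()}")
```

More bins means narrower bins and shorter bars, but every histogram still accounts for all 500 values.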

# Histogram of total scores across all rounds
plt.figure(figsize=(8, 5))
plt.hist(rounds['total_score'], bins=12, color='goldenrod', edgecolor='white')
plt.xlabel('Total Score (strokes)')
plt.ylabel('Number of Rounds')
plt.title('Distribution of Total Scores Across All Rounds')
plt.tight_layout()
plt.show()

With only 24 rounds the histogram is a bit sparse, but you can still see the spread. Scores cluster in the low-to-mid 80s with a tail reaching past 100 – that tail is mostly Bobby Bogey. With more data, this shape would become clearer.

The bins parameter matters a lot with small datasets. Too many bins and every bar has 1-2 counts; too few and you lose all detail. For 24 data points, somewhere between 8 and 15 bins is usually reasonable.

3. Bar Charts

Bar charts compare a numeric value across categories. We already made one (average score by player). Let’s make two more: average score by course, and club usage frequency.

When category labels are long, horizontal bars (plt.barh()) are easier to read because the labels do not overlap.

# Average score by course
avg_by_course = round_detail.groupby('course_name')['total_score'].mean().sort_values()

plt.figure(figsize=(8, 5))
plt.barh(avg_by_course.index, avg_by_course.values, color='forestgreen')
plt.xlabel('Average Score (strokes)')
plt.ylabel('Course')
plt.title('Average Score by Course')
plt.tight_layout()
plt.show()

Bob O’Connor Golf Course has the lowest average score, which makes sense: it is a par 68 course with a lower slope rating (104) compared to the par 72 courses at North Park (slope 117) and South Park (slope 123). This is a case where the chart immediately raises a follow-up question: is the lower score because the course is easier, or because different players happened to play there? You would need to control for player ability to answer that properly.

# Club usage frequency -- horizontal bars for readability with many labels
club_counts = shots['club'].value_counts().sort_values()

plt.figure(figsize=(8, 6))
plt.barh(club_counts.index, club_counts.values, color='steelblue')
plt.xlabel('Number of Shots')
plt.ylabel('Club')
plt.title('Club Usage Frequency Across All Rounds')
plt.tight_layout()
plt.show()

The Putter dominates – which is expected, since every hole ends with at least one putt and most holes end with several. The Driver is the next most common club because it is used on the tee of most par 4s and par 5s. This chart gives you a quick sense of the data composition: when you compute average strokes gained by club later, you will know that Putter and Driver averages are backed by hundreds of shots, while some irons might have only a few dozen.

4. Line Charts

Line charts show trends over a sequence – usually time. They connect data points with lines, making it easy to see whether values are going up, down, or staying flat.

Let’s track each player’s scoring trend across their rounds. This requires sorting the data by date and plotting one line per player.

# Scoring trends over time -- one line per player
plt.figure(figsize=(10, 6))

for player_name in sorted(round_detail['player_name'].unique()):
    player_data = round_detail[round_detail['player_name'] == player_name].sort_values('date')
    plt.plot(player_data['date'], player_data['total_score'], marker='o', label=player_name)

plt.xlabel('Date')
plt.ylabel('Total Score (strokes)')
plt.title('Scoring Trends Over Time by Player')
plt.legend()
plt.tight_layout()
plt.show()

A few things to notice:

  • plt.plot() draws a line chart. The marker='o' argument adds dots at each data point so you can see exactly where the rounds fall.
  • label=player_name assigns a name to each line, and plt.legend() displays the legend so you can tell the lines apart.
  • Bear Woods (low handicap) consistently stays at the bottom of the chart. Bobby Bogey (high handicap) consistently stays at the top. The lines rarely cross – handicap is a strong predictor of scoring.

pandas DataFrames also have a built-in .plot() method that wraps matplotlib. This is often the fastest way to get a chart from grouped data.

# Using pandas .plot() -- pivot so each player is a column, then plot
score_pivot = round_detail.pivot_table(
    index='date', columns='player_name', values='total_score'
)

score_pivot.plot(figsize=(10, 6), marker='o', linewidth=2)
plt.ylabel('Total Score (strokes)')
plt.title('Scoring Trends Over Time by Player (pandas .plot())')
plt.tight_layout()
plt.show()

The pandas .plot() method automatically creates a legend from the column names and labels the x-axis from the index. It produces the same chart with less code. Under the hood, it is calling matplotlib – so everything you learn about plt.xlabel(), plt.title(), etc. still applies.

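Note: pivot_table reshapes the data so that each player becomes a column, which is exactly the shape that .plot() expects for multiple lines. If the players did not all play on the same dates, the pivoted table contains NaN for the missing player-date combinations, and the lines will show gaps at those dates.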
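A tiny example makes the long-to-wide reshape easier to see. The rows below are hypothetical (the dates and scores are invented), and one player is deliberately missing from two of the dates.

```python
import pandas as pd

# Hypothetical long-format data: one row per (date, player) round
long_df = pd.DataFrame({
    "date": pd.to_datetime(["2024-05-01", "2024-05-01", "2024-05-08", "2024-05-15"]),
    "player_name": ["Bear Woods", "Bobby Bogey", "Bear Woods", "Bobby Bogey"],
    "total_score": [81, 96, 83, 94],
})

# Wide format: one row per date, one column per player
wide_df = long_df.pivot_table(index="date", columns="player_name", values="total_score")
print(wide_df)
```

Each player is now a column, and the dates a player did not play show up as NaN.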

5. Scatter Plots

Scatter plots show the relationship between two numeric variables. Each point represents one observation, positioned by its x-value and y-value. Patterns in the scatter (upward slope, downward slope, clusters, no pattern) tell you how the two variables are related.

Let’s start with handicap vs. average score – we expect a positive relationship (higher handicap = higher scores).

# Scatter plot: handicap vs. average score per player
player_stats = round_detail.groupby('player_name').agg(
    avg_score=('total_score', 'mean'),
    handicap=('handicap', 'first')
).reset_index()

plt.figure(figsize=(8, 5))
plt.scatter(player_stats['handicap'], player_stats['avg_score'], s=100, color='steelblue', zorder=5)

# Label each point with the player name
for _, row in player_stats.iterrows():
    plt.annotate(row['player_name'], (row['handicap'], row['avg_score']),
                 textcoords='offset points', xytext=(8, 5), fontsize=10)

# Add a trend line using numpy polyfit
z = np.polyfit(player_stats['handicap'], player_stats['avg_score'], 1)
p = np.poly1d(z)
x_range = np.linspace(player_stats['handicap'].min() - 1, player_stats['handicap'].max() + 1, 100)
plt.plot(x_range, p(x_range), '--', color='gray', alpha=0.7, label=f'Trend (slope: {z[0]:.2f})')

plt.xlabel('Handicap')
plt.ylabel('Average Score (strokes)')
plt.title('Handicap vs. Average Score')
plt.legend()
plt.tight_layout()
plt.show()

The positive trend is clear: higher handicap, higher average score. The dashed trend line is a linear fit computed with np.polyfit(). With only four players this is not statistically rigorous, but it confirms the expected direction.
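To see what np.polyfit returns, it helps to run it on a handful of numbers in isolation. The handicaps and averages below are hypothetical values chosen for illustration, not the players in the dataset.

```python
import numpy as np

# Hypothetical handicaps and average scores for four players
handicap = np.array([5.0, 12.0, 18.0, 25.0])
avg_score = np.array([80.0, 86.0, 91.0, 97.0])

# Degree 1 = straight line; coefficients come back highest power first
slope, intercept = np.polyfit(handicap, avg_score, 1)
trend = np.poly1d([slope, intercept])

print(f"slope = {slope:.2f} strokes per handicap point, intercept = {intercept:.1f}")
print(f"predicted average at handicap 15: {trend(15):.1f}")
```

The slope reads as "extra strokes per extra handicap point", which is the number shown in the legend of the chart above.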

plt.annotate() adds text labels near each point – critical when you have a small number of labeled data points. With hundreds of points, you would skip annotations and use color or size to encode categories instead.

Now let’s look at a scatter with many more points: distance to pin vs. strokes gained for each shot.

# Scatter plot: distance to pin vs. strokes gained (all shots)
# Filter out putts to focus on approach/tee shots where distance varies more
non_putts = shots[shots['club'] != 'Putter']

plt.figure(figsize=(10, 6))
plt.scatter(non_putts['start_distance_to_pin'], non_putts['strokes_gained'],
            alpha=0.4, s=20, color='steelblue')
plt.axhline(y=0, color='gray', linestyle='--', linewidth=0.8)
plt.xlabel('Starting Distance to Pin (yards)')
plt.ylabel('Strokes Gained')
plt.title('Distance to Pin vs. Strokes Gained (Non-Putt Shots)')
plt.tight_layout()
plt.show()

A few things to notice:

  • alpha=0.4 makes each point semi-transparent, so overlapping points appear darker. This helps when you have many data points in the same area.
  • plt.axhline(y=0) draws a horizontal reference line at zero strokes gained. Points above the line are shots where the player gained strokes (performed better than expected); points below lost strokes.
  • The spread of strokes gained is wider at longer distances, which makes sense: there is more variance in outcomes on a 200-yard approach than a 50-yard chip.

6. Box Plots

A box plot shows the distribution of a numeric variable, broken down by category. The box spans the 25th to 75th percentile (the interquartile range, or IQR), with a line at the median. By default, each whisker extends to the most extreme data point within 1.5x the IQR of the box edge, and points beyond that are shown as individual dots (outliers).

Box plots are excellent for comparing distributions across groups – you can see the center, spread, and outliers for each group at a glance.
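The pieces of a box plot can be computed by hand, which is a good way to check your reading of one. The scores below are hypothetical (one player, eight rounds, one unusually bad day), and the 1.5x IQR rule is the common default.

```python
import numpy as np

# Hypothetical scores for one player, including one unusually bad round
scores = np.array([79, 81, 82, 82, 83, 84, 85, 98])

q1, median, q3 = np.percentile(scores, [25, 50, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points still inside the fences
inside = scores[(scores >= lower_fence) & (scores <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything outside the fences is drawn as an individual outlier dot
outliers = scores[(scores < lower_fence) | (scores > upper_fence)]

print(f"box: {q1} to {q3} (median {median}), IQR = {iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}")
print(f"outliers: {outliers.tolist()}")
```

Note that the whiskers end at actual data points, not at the fence values themselves.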

# Box plot: score distributions by player
# Sort players by median score for a cleaner chart
player_order = round_detail.groupby('player_name')['total_score'].median().sort_values().index

plt.figure(figsize=(8, 5))

# Prepare data as a list of arrays (one per player) for matplotlib boxplot
data_by_player = [round_detail[round_detail['player_name'] == name]['total_score'].values
                  for name in player_order]

# Note: matplotlib 3.9+ renames labels= to tick_labels= (labels= still works but warns)
bp = plt.boxplot(data_by_player, labels=player_order, patch_artist=True)

# Color the boxes
for patch in bp['boxes']:
    patch.set_facecolor('steelblue')
    patch.set_alpha(0.7)

plt.ylabel('Total Score (strokes)')
plt.title('Score Distributions by Player')
plt.tight_layout()
plt.show()

The box plot shows not just averages, but the spread of each player’s scores. Bear Woods has a tight box (consistent scoring), while Bobby Bogey has a wider box (more variable). This is information that a simple bar chart of averages would hide completely.

Now let’s look at strokes gained by club category. We will group clubs into four categories: woods, irons, wedges, and putter.

# Define club categories
def categorize_club(club):
    """Group clubs into broad categories."""
    if club == 'Putter':
        return 'Putter'
    elif club in ('Driver', '3-Wood', '5-Wood'):
        return 'Wood'
    elif 'Iron' in club or club == 'Hybrid':
        return 'Iron'
    else:
        return 'Wedge'

shots['club_category'] = shots['club'].apply(categorize_club)

print('Club category counts:')
print(shots['club_category'].value_counts())

# Box plot: strokes gained by club category
cat_order = ['Wood', 'Iron', 'Wedge', 'Putter']
data_by_cat = [shots[shots['club_category'] == cat]['strokes_gained'].values for cat in cat_order]

plt.figure(figsize=(8, 5))
bp = plt.boxplot(data_by_cat, labels=cat_order, patch_artist=True)

colors = ['#2ecc71', '#3498db', '#e67e22', '#9b59b6']
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)
    patch.set_alpha(0.7)

plt.axhline(y=0, color='gray', linestyle='--', linewidth=0.8)
plt.ylabel('Strokes Gained')
plt.title('Strokes Gained by Club Category')
plt.tight_layout()
plt.show()

The horizontal reference line at zero is important context: it represents the benchmark. Positive strokes gained means the shot was better than expected; negative means worse. The boxes show the typical range, and the outlier dots show unusually good or bad shots.

Notice how the Putter category has many outliers below zero – those are three-putts and missed short putts, which cost strokes and stick out from the typical putting distribution.

7. Seaborn

seaborn is a visualization library built on top of matplotlib. It provides higher-level functions that produce polished statistical charts with less code. Where matplotlib gives you raw control (“draw a bar here, draw a line there”), seaborn thinks in terms of statistical concepts (“show me the distribution of X grouped by Y”).

Let’s start by applying seaborn’s default theme, which instantly improves the look of all charts.

# Apply seaborn's default theme -- this changes the look of ALL subsequent charts
sns.set_theme(style='whitegrid')

sns.set_theme(style='whitegrid') applies a clean white background with light gridlines. Other options include 'darkgrid', 'white', 'dark', and 'ticks'. The 'whitegrid' style is a good default for most charts.

Now let’s recreate the average-score-by-player bar chart with seaborn’s barplot. The key advantage: seaborn automatically adds error bars (a 95% confidence interval by default), so you can see not just the average but how uncertain it is. In the code below we switch the error bars to standard deviation with errorbar='sd'.

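# seaborn barplot -- average score by player with error bars
# Sort by average score using the order parameter
# Note: seaborn 0.13+ warns when palette is set without hue; adding
# hue='player_name', legend=False to the call avoids the warning there.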
player_order = round_detail.groupby('player_name')['total_score'].mean().sort_values().index

plt.figure(figsize=(8, 5))
sns.barplot(data=round_detail, x='player_name', y='total_score', order=player_order,
            palette='Blues_d', errorbar='sd')
plt.xlabel('Player')
plt.ylabel('Average Score (strokes)')
plt.title('Average Score by Player (with Std Dev Error Bars)')
plt.tight_layout()
plt.show()

Notice the difference from our earlier matplotlib bar chart:

  • seaborn computes the mean automatically. You pass the raw data (data=round_detail) and tell it which columns to use for x and y. No need to pre-compute the average yourself.
  • Error bars are included by default. Here we set errorbar='sd' to show standard deviation. The default is a 95% confidence interval (errorbar='ci'). Error bars show how much the data varies – essential context that a plain bar chart hides.
  • The palette parameter controls the color scheme. 'Blues_d' is a sequential blue palette. seaborn has dozens of built-in palettes.
  • The order parameter controls the left-to-right ordering of bars.
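Standard deviation and a confidence interval answer different questions, and the difference is easier to see with numbers. The sketch below uses six hypothetical scores and a simple percentile bootstrap written with numpy; it illustrates the idea and is not seaborn's own implementation.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical scores for one player over six rounds
scores = np.array([80, 83, 81, 85, 82, 84])

mean = scores.mean()
sd = scores.std(ddof=1)  # sample standard deviation: spread of individual rounds

# Percentile bootstrap for the mean: resample with replacement, recompute the mean
boot_means = np.array([
    rng.choice(scores, size=scores.size, replace=True).mean()
    for _ in range(2000)
])
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(f"mean = {mean:.2f}")
print(f"mean +/- sd:      {mean - sd:.2f} to {mean + sd:.2f}")
print(f"95% bootstrap CI: {ci_low:.2f} to {ci_high:.2f}")
```

The sd band describes how much single rounds vary; the confidence interval describes how uncertain the average itself is, so it shrinks as more rounds are added.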
# seaborn boxplot -- score distributions by player
plt.figure(figsize=(8, 5))
sns.boxplot(data=round_detail, x='player_name', y='total_score', order=player_order,
            palette='Set2')
plt.xlabel('Player')
plt.ylabel('Total Score (strokes)')
plt.title('Score Distributions by Player')
plt.tight_layout()
plt.show()

Compare this to the matplotlib boxplot we built earlier. seaborn’s version is more concise: one line for the plot, and it automatically handles colors, layout, and grouping. The data parameter accepts a DataFrame directly, and x and y specify columns by name.

seaborn also offers violin plots, which combine a box plot with a kernel density estimate to show the full shape of the distribution.

# seaborn violin plot -- shows the full distribution shape
plt.figure(figsize=(8, 5))
sns.violinplot(data=round_detail, x='player_name', y='total_score', order=player_order,
               palette='Set2', inner='point')
plt.xlabel('Player')
plt.ylabel('Total Score (strokes)')
plt.title('Score Distributions by Player (Violin Plot)')
plt.tight_layout()
plt.show()

The violin plot shows the density of scores at each level – wider areas mean more rounds at that score. The inner='point' parameter shows individual data points inside the violin. With our small dataset (6 rounds per player), the violins are a bit sparse, but with larger datasets they become very informative.

# seaborn heatmap -- correlation matrix
numeric_cols = round_detail[['total_score', 'handicap', 'slope_rating', 'course_rating', 'total_par']]
corr_matrix = numeric_cols.corr()

plt.figure(figsize=(7, 6))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0,
            square=True, linewidths=1)
plt.title('Correlation Matrix of Round Statistics')
plt.tight_layout()
plt.show()

A heatmap turns a matrix of numbers into a color-coded grid. Here, each cell shows the correlation between two variables:

  • annot=True prints the numeric value in each cell.
  • fmt='.2f' formats to two decimal places.
  • cmap='coolwarm' uses a diverging color scale: blue for negative correlations, red for positive, and a pale neutral gray near zero.
  • center=0 ensures zero correlation is the neutral color.

The strongest correlation should be between handicap and total score – higher handicap players score higher. The diagonal is always 1.0 (every variable is perfectly correlated with itself).
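The structure of a correlation matrix (a diagonal of ones, symmetric about it) can be checked on a few made-up numbers. The values below are hypothetical, and np.corrcoef is used here as a numpy stand-in for the Pearson correlation that DataFrame.corr() computes by default.

```python
import numpy as np

# Hypothetical per-round values (not the course dataset)
handicap = np.array([5, 5, 12, 12, 25, 25])
total_score = np.array([80, 82, 86, 88, 95, 98])
slope_rating = np.array([117, 123, 104, 117, 123, 104])

# Each input row is one variable; each column is one observation
corr = np.corrcoef([handicap, total_score, slope_rating])
print(np.round(corr, 2))
```

Reading the result: row 0, column 1 is the handicap-to-score correlation, and it equals row 1, column 0 because the matrix is symmetric.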

# seaborn scatterplot with hue -- color points by player
# Merge player info into shots via rounds
shots_with_player = pd.merge(
    shots,
    pd.merge(rounds[['round_id', 'player_id']], players[['player_id', 'name']], on='player_id'),
    on='round_id'
)

# Filter to non-putt shots for clearer visualization
non_putts_player = shots_with_player[shots_with_player['club'] != 'Putter']

plt.figure(figsize=(10, 6))
sns.scatterplot(data=non_putts_player, x='start_distance_to_pin', y='strokes_gained',
                hue='name', alpha=0.5, s=30)
plt.axhline(y=0, color='gray', linestyle='--', linewidth=0.8)
plt.xlabel('Starting Distance to Pin (yards)')
plt.ylabel('Strokes Gained')
plt.title('Distance to Pin vs. Strokes Gained by Player')
plt.legend(title='Player')
plt.tight_layout()
plt.show()

The hue parameter is one of seaborn’s most powerful features. It automatically assigns a different color to each category (player) and adds a legend. This lets you see whether certain players cluster above or below the zero line at different distances.

Bear Woods (low handicap) should have more points above zero (gaining strokes), while Bobby Bogey should have more below. The chart makes this pattern visible even without computing a single number.

8. Subplots: Building a Dashboard

So far, every chart has been a standalone figure. But often you want to show several related charts side by side – a dashboard that tells a complete story.

plt.subplots() creates a figure with a grid of axes. You specify the number of rows and columns, and it returns the figure and an array of axes objects. Then you plot on each axis individually.

Let’s build a 2x2 “Round Report” dashboard with four complementary charts.

# 2x2 dashboard: Round Report
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# --- Top left: Average score by player (bar chart) ---
ax1 = axes[0, 0]
avg_scores = round_detail.groupby('player_name')['total_score'].mean().sort_values()
ax1.barh(avg_scores.index, avg_scores.values, color='steelblue')
ax1.set_xlabel('Average Score (strokes)')
ax1.set_title('Average Score by Player')

# --- Top right: Club distance distribution (box plot) ---
ax2 = axes[0, 1]
cat_order = ['Wood', 'Iron', 'Wedge', 'Putter']
box_data = [shots[shots['club_category'] == cat]['start_distance_to_pin'].values for cat in cat_order]
bp = ax2.boxplot(box_data, labels=cat_order, patch_artist=True)
colors = ['#2ecc71', '#3498db', '#e67e22', '#9b59b6']
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)
    patch.set_alpha(0.7)
ax2.set_ylabel('Distance to Pin (yards)')
ax2.set_title('Starting Distance by Club Category')

# --- Bottom left: Strokes gained by club category (bar chart) ---
ax3 = axes[1, 0]
sg_by_cat = shots.groupby('club_category')['strokes_gained'].mean().reindex(cat_order)
bar_colors = ['#2ecc71' if v >= 0 else '#e74c3c' for v in sg_by_cat.values]
ax3.bar(sg_by_cat.index, sg_by_cat.values, color=bar_colors)
ax3.axhline(y=0, color='gray', linestyle='--', linewidth=0.8)
ax3.set_ylabel('Avg Strokes Gained')
ax3.set_title('Average Strokes Gained by Club Category')

# --- Bottom right: Scoring trend over time (line chart) ---
ax4 = axes[1, 1]
for player_name in sorted(round_detail['player_name'].unique()):
    player_data = round_detail[round_detail['player_name'] == player_name].sort_values('date')
    ax4.plot(player_data['date'], player_data['total_score'], marker='o', label=player_name)
ax4.set_xlabel('Date')
ax4.set_ylabel('Total Score (strokes)')
ax4.set_title('Scoring Trend Over Time')
ax4.legend(fontsize=8)
ax4.tick_params(axis='x', rotation=30)

fig.suptitle('Golf Round Report Dashboard', fontsize=16, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()

Key differences when using subplots:

  • fig, axes = plt.subplots(2, 2) creates a 2-by-2 grid. axes is a 2D array: axes[0, 0] is top-left, axes[0, 1] is top-right, etc.
  • Plot on a specific axis by calling methods on the axis object (ax1.barh(...)) instead of the global plt.bar(). This is the “object-oriented” matplotlib interface.
  • Axis-level labeling uses ax.set_xlabel(), ax.set_ylabel(), ax.set_title() instead of plt.xlabel(), etc.
  • fig.suptitle() adds an overall title above all four panels.
  • plt.tight_layout() is especially important with subplots – without it, labels will overlap between panels.
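The axes array and the axis-level methods can be explored in a stripped-down sketch. The plotted numbers and panel titles below are placeholders, and the Agg backend line is only there so the code runs outside Jupyter; leave it out in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
print(axes.shape)  # grid of axes objects, indexed [row, column]

# Placeholder data, just to have something to draw in each panel
panel_titles = ["Top left", "Top right", "Bottom left", "Bottom right"]
for ax, title in zip(axes.flat, panel_titles):  # axes.flat walks the grid row by row
    ax.plot([1, 2, 3], [3, 1, 2], marker="o")
    ax.set_title(title)                         # axis-level call, not plt.title()

fig.suptitle("Four panels on one figure")
fig.tight_layout()
plt.close(fig)
```

Looping over axes.flat is handy when every panel gets the same kind of chart; indexing with axes[row, col], as in the dashboard above, is clearer when each panel is different.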

This dashboard tells a coherent story: who scores well (top-left), how far each club type is used from (top-right), which club categories gain or lose strokes (bottom-left), and how scores trend over the season (bottom-right). A good dashboard answers multiple related questions in a single view.


AI

Exercise 1: Ask AI to Visualize Scoring Distributions

Give an AI assistant the following prompt:

I have a golf dataset loaded in pandas. round_detail is a DataFrame with columns player_name and total_score (24 rows, 4 players, 6 rounds each). Create a visualization that shows each player’s scoring distribution. Use matplotlib or seaborn.

Evaluate the AI’s response:

  • Chart type choice: Did it pick an appropriate chart type? A box plot, violin plot, or overlapping histograms are all reasonable choices. A simple bar chart of averages is not a distribution chart – that hides the spread. If the AI chose a bar chart, that is a red flag.
  • Labeling: Does the chart have a title, axis labels, and a legend (if applicable)? Missing labels are a sign of careless code.
  • Color usage: Are the players distinguishable? Are the colors accessible (not relying on red/green distinctions that color-blind readers cannot see)?
  • Honesty: Does the y-axis start at zero (if it is a bar chart)? Are the axes scaled fairly? Does the chart accurately represent the data, or could it mislead a reader?
  • Code quality: Is the code clean and readable? Does it run without errors?

# Paste the AI-generated code here and run it.
# Then answer: what chart type did it choose? Is that a good choice for showing distributions?
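For a baseline to compare the AI's answer against, here is a minimal sketch of one reasonable response: a box plot per player with the individual rounds drawn on top. It assumes nothing about the real data. The round_detail built below is a synthetic stand-in with the same two columns and the same shape (4 players, 6 rounds each), and the player names and score ranges are invented.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for round_detail (assumption: same columns, made-up values).
rng = np.random.default_rng(seed=0)
names = ["Player A", "Player B", "Player C", "Player D"]
round_detail = pd.DataFrame({
    "player_name": np.repeat(names, 6),
    "total_score": np.concatenate(
        [rng.normal(mean, 3, size=6).round() for mean in (82, 87, 91, 95)]
    ),
})

# One array of scores per player, in alphabetical order.
labels = sorted(round_detail["player_name"].unique())
data = [
    round_detail.loc[round_detail["player_name"] == name, "total_score"].to_numpy()
    for name in labels
]

fig, ax = plt.subplots(figsize=(8, 5))
ax.boxplot(data)  # boxes are drawn at x positions 1..len(data)
ax.set_xticks(range(1, len(labels) + 1))
ax.set_xticklabels(labels)

# With only 6 rounds per player, also show every round as a point.
for position, scores in enumerate(data, start=1):
    ax.scatter([position] * len(scores), scores, color="black", alpha=0.5, zorder=3)

ax.set_xlabel("Player")
ax.set_ylabel("Total score")
ax.set_title("Scoring distribution by player (synthetic data)")
plt.tight_layout()
```

Unlike a bar chart of averages, this view keeps the spread visible: box height shows consistency, and the overlaid points show how few rounds each box summarizes.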

Exercise 2: Ask AI to Improve a Chart

Take one of the charts you made in this notebook and describe it to an AI assistant. For example:

I created a horizontal bar chart showing club usage frequency in a golf dataset. It uses plt.barh() with a single color (steelblue). The x-axis is “Number of Shots” and the y-axis lists club names. What are 5 specific ways I could improve this chart?

Evaluate the AI’s suggestions:

  • Are the suggestions actually improvements, or just cosmetic changes that add clutter?
  • Does it suggest anything that would make the chart less honest (like a dual axis or 3D effects)?
  • Does it suggest useful additions like sorting the bars, adding data labels, or grouping related clubs by color?
  • Does it suggest removing unnecessary elements (chart junk) or only adding more?
  • Try implementing 2-3 of the suggestions. Do they actually make the chart better?
# Paste the AI's suggestions here as a comment.
# Then implement 2-3 of them below and re-create the improved chart.
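As a reference point, two of the improvements mentioned above (sorting the bars and adding data labels) might look like the sketch below. The club names and shot counts are hypothetical, and ax.bar_label() assumes matplotlib 3.4 or newer.

```python
import matplotlib.pyplot as plt

# Hypothetical shot counts for illustration; substitute your own club usage data.
club_counts = {"Driver": 56, "7 Iron": 38, "Putter": 140, "Sand Wedge": 22, "5 Iron": 30}

# barh draws the first item at the bottom, so sorting ascending
# puts the most-used club at the top of the chart.
ordered = sorted(club_counts.items(), key=lambda item: item[1])
clubs = [club for club, _ in ordered]
counts = [count for _, count in ordered]

fig, ax = plt.subplots(figsize=(8, 5))
bars = ax.barh(clubs, counts, color="steelblue")

# Print the exact count at the end of each bar (matplotlib 3.4+).
ax.bar_label(bars, padding=3)

# Remove two frame lines that carry no information.
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)

ax.set_xlabel("Number of Shots")
ax.set_title("Club Usage Frequency (sorted)")
plt.tight_layout()
```

Both changes add information or remove clutter without altering what the bars represent, which is a useful test to apply to each of the AI's suggestions.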

Exercise 3: Ask AI to Create a Player Comparison Dashboard

Give an AI assistant this prompt:

Using the golf dataset (round_detail DataFrame with player_name, total_score, course_name, date, handicap; and shots DataFrame with round_id, club, strokes_gained, start_distance_to_pin), create a “Player Comparison Dashboard” with 4 charts in a 2x2 grid using plt.subplots(). The dashboard should help a golf coach quickly compare the 4 players.

Evaluate the AI’s response:

  • Does it use subplots? If it creates 4 separate figures instead of a 2x2 grid, it missed the point of the exercise.
  • Are the 4 charts complementary? A good dashboard answers different but related questions. Four bar charts showing slight variations of the same metric is redundant. Look for a mix of chart types (bar + line + scatter + box, for example).
  • Does it tell a coherent story? Can you look at the dashboard and quickly understand who the best player is, how consistent they are, and what their strengths and weaknesses are?
  • Technical quality: Does the code run without errors? Is tight_layout() called? Are labels readable and not overlapping?
# Paste the AI-generated dashboard code here and run it.
# Then answer: does the dashboard tell a coherent story about the 4 players?
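The first check, whether the code produced one 2x2 figure or four separate ones, can also be done programmatically. The sketch below uses a stand-in for the AI-generated code. It assumes the check runs in the same cell as that code and before any plt.show() call, since notebook backends typically close figures once they are displayed.

```python
import matplotlib.pyplot as plt

plt.close("all")  # start clean so figures from earlier cells are not counted

# --- Stand-in for the AI-generated dashboard code (replace with the real code) ---
fig, axes = plt.subplots(2, 2, figsize=(12, 9))
for ax in axes.flat:
    ax.plot([1, 2, 3], [3, 2, 1])
fig.suptitle("Player Comparison Dashboard")
plt.tight_layout()
# --- End of stand-in ---

# Count open figures and the number of panels (Axes) in each one.
open_figures = plt.get_fignums()
panels_per_figure = [len(plt.figure(num).axes) for num in open_figures]

uses_grid = len(open_figures) == 1 and panels_per_figure == [4]
print(f"Figures: {len(open_figures)}, panels per figure: {panels_per_figure}")
print("Single 2x2 dashboard" if uses_grid else "Not a single 2x2 grid")
```

A result of four figures with one panel each means the AI ignored the plt.subplots() requirement. Note that a colorbar adds an extra Axes to a figure, so a dashboard containing a heatmap may legitimately report more than four.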

Summary

Quick Reference: Chart Types and Functions

| Chart Type | When to Use | matplotlib | seaborn |
|---|---|---|---|
| Bar chart | Compare a numeric value across categories | plt.bar() / plt.barh() | sns.barplot() |
| Histogram | Distribution of a single numeric variable | plt.hist() | sns.histplot() |
| Line chart | Trends over time or a sequence | plt.plot() | sns.lineplot() |
| Scatter plot | Relationship between two numeric variables | plt.scatter() | sns.scatterplot() |
| Box plot | Distribution by category, with outliers | plt.boxplot() | sns.boxplot() |
| Violin plot | Full distribution shape by category | | sns.violinplot() |
| Heatmap | Correlation matrix or other matrix data | | sns.heatmap() |

Quick Reference: Common matplotlib Commands

| Command | What It Does |
|---|---|
| plt.figure(figsize=(w, h)) | Create a new figure with specified size in inches |
| plt.xlabel('label') | Label the x-axis |
| plt.ylabel('label') | Label the y-axis |
| plt.title('title') | Set the chart title |
| plt.legend() | Display the legend |
| plt.tight_layout() | Adjust spacing to prevent label overlap |
| plt.show() | Render and display the chart |
| plt.axhline(y=val) | Draw a horizontal reference line |
| plt.annotate(text, xy) | Add a text annotation at a specific point |
| fig, axes = plt.subplots(r, c) | Create a figure with an r-by-c grid of axes |
| ax.set_xlabel('label') | Label x-axis on a specific axis (subplot) |
| fig.suptitle('title') | Add an overall title to a multi-panel figure |
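Several of these commands are easiest to remember when seen together. The sketch below combines plt.figure(), plt.plot(), plt.axhline(), plt.annotate(), and the labeling functions on one chart; the six scores are invented for illustration.

```python
import matplotlib.pyplot as plt

# Made-up scores for six rounds (illustration only).
rounds = [1, 2, 3, 4, 5, 6]
scores = [88, 85, 91, 84, 83, 86]
average = sum(scores) / len(scores)

# Locate the highest (worst) score so it can be annotated.
worst = max(scores)
worst_round = rounds[scores.index(worst)]

plt.figure(figsize=(8, 5))
plt.plot(rounds, scores, marker="o", label="Total score")

# Horizontal reference line at the season average.
plt.axhline(y=average, color="gray", linestyle="--", label=f"Average ({average:.1f})")

# Text label with an arrow pointing at the worst round.
plt.annotate(
    "Worst round",
    xy=(worst_round, worst),
    xytext=(worst_round + 0.5, worst + 0.5),
    arrowprops={"arrowstyle": "->"},
)

plt.xlabel("Round")
plt.ylabel("Total score")
plt.title("Scores by round (made-up data)")
plt.legend()
plt.tight_layout()
```

The xy argument is the point being annotated and xytext is where the label sits; both are given in data coordinates by default.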

Key Takeaways

  1. Always plot your data. Summary statistics hide distributions, outliers, and patterns. Charts reveal them.
  2. Choose the right chart for the question. Bar charts for categorical comparisons, histograms for distributions, line charts for trends, scatter plots for relationships, box plots for grouped distributions.
  3. matplotlib is the foundation. It is verbose but gives you full control. Learn plt.figure(), plt.bar(), plt.plot(), plt.hist(), plt.scatter(), plt.boxplot(), and the labeling functions.
  4. seaborn is the accelerator. It produces polished statistical charts with less code. sns.barplot() adds error bars automatically, sns.scatterplot() with hue colors points by category, and sns.heatmap() turns correlation matrices into readable visuals.
  5. Subplots create dashboards. plt.subplots(rows, cols) lets you combine multiple charts into a single figure that tells a complete story.
  6. Be honest with your charts. Start axes at zero for magnitude comparisons. Label everything. Do not cherry-pick data or truncate axes to exaggerate effects.

Next up: Topic 08 – Introduction to Strokes Gained.
