# Working with CSV Files

```python
# Reading a file
with open('../../data/players.csv', 'r') as f:
    contents = f.read()
print(contents)
```
## What You’ll Learn

- How files work in Python: text vs binary, opening and closing
- What CSV files are and why they dominate data science
- Reading and writing CSVs with the `csv` module
- Combining data from multiple CSV files
- Using `pathlib.Path` for clean file path handling
## Concept

### What Are Files?
Everything on your computer is stored in files. For programming, there are two categories:
- **Text files** contain human-readable characters (letters, numbers, punctuation). You can open them in any text editor. Examples: `.txt`, `.csv`, `.py`, `.json`.
- **Binary files** contain raw bytes that only specific programs can interpret. Examples: `.xlsx`, `.pdf`, `.jpg`, `.sqlite`.
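The difference is visible directly from Python: text mode (`'r'`) decodes the file into a `str`, while binary mode (`'rb'`) hands you the raw `bytes`. A minimal sketch (it writes a throwaway file named `demo.txt` first so it is self-contained):

```python
# Create a small text file so the example is self-contained
with open('demo.txt', 'w') as f:
    f.write('par,4\n')

# Text mode: Python decodes the bytes into str for you
with open('demo.txt', 'r') as f:
    text = f.read()
print(type(text), repr(text))   # <class 'str'> 'par,4\n'

# Binary mode: you get the raw bytes exactly as stored on disk
with open('demo.txt', 'rb') as f:
    raw = f.read()
print(type(raw), repr(raw))     # <class 'bytes'> b'par,4\n'
```

For a plain ASCII file the two look similar; open a `.jpg` or `.xlsx` in binary mode and you will see why text mode cannot handle it.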
### What Is CSV?

CSV stands for Comma-Separated Values. It is the simplest structured data format:

```
player_id,name,handicap
1,Bear Woods,2.1
2,Brian Kolowitz,13.9
3,Sam Shanks,18.7
4,Bobby Bogey,25.3
```
- The first row is typically the header (column names).
- Each subsequent row is a record.
- Values are separated by commas (the delimiter).
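The delimiter is not always a comma: tab- and semicolon-separated files are common in the wild. As a quick sketch (using an in-memory string via `io.StringIO`, so no file on disk is assumed), `csv.reader` accepts a `delimiter` argument:

```python
import csv
import io

# The same kind of records, tab-separated instead of comma-separated
tsv_data = 'player_id\tname\thandicap\n1\tBear Woods\t2.1\n'

reader = csv.reader(io.StringIO(tsv_data), delimiter='\t')
rows = list(reader)
for row in rows:
    print(row)
# ['player_id', 'name', 'handicap']
# ['1', 'Bear Woods', '2.1']
```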
### Why CSV Is Everywhere in Data Science
- Universal – every tool reads CSV: Python, R, Excel, databases, cloud services.
- Simple – no special software needed to create or inspect one.
- Lightweight – plain text with zero overhead.
- Version-control friendly – diffs are meaningful because it is just text.
### A Note on File Paths

Our golf data lives in `../../data/` relative to this notebook. We will start with plain strings for paths, then switch to Python’s `pathlib.Path` at the end of this notebook. `pathlib` gives you an object-oriented way to build and manipulate file paths that works across operating systems.
## Code

### 1. File I/O Basics with `open()`

Python’s built-in `open()` function returns a file object. The `with` statement ensures the file is automatically closed when the block ends, even if an error occurs.
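Under the hood, `with open(...)` is roughly equivalent to a `try`/`finally` that calls `close()`. A sketch of what you would otherwise write by hand (the filename `manual_demo.txt` is just a throwaway for illustration):

```python
# What `with open(...)` saves you from writing manually:
f = open('manual_demo.txt', 'w')
try:
    f.write('Hole 1: par 4\n')
finally:
    f.close()      # runs even if write() raised an exception

print(f.closed)    # True -- the file is closed either way
```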
```python
# The file is closed automatically after the with block
# (f is still bound from the opening example above)
print(f'File closed? {f.closed}')
```

```python
# Reading line by line
with open('../../data/players.csv', 'r') as f:
    for line in f:
        print(repr(line))  # repr() shows the newline characters
```

```python
# Writing a file
with open('sample_output.txt', 'w') as f:
    f.write('Round 1: Bear Woods scored 79\n')
    f.write('Round 2: Brian Kolowitz scored 97\n')

# Verify it wrote correctly
with open('sample_output.txt', 'r') as f:
    print(f.read())
```

### 2. Reading CSV Manually (and Why It Breaks)
The naive approach: split each line on commas.
```python
with open('../../data/players.csv', 'r') as f:
    for line in f:
        fields = line.strip().split(',')
        print(fields)
```

This works for simple data, but it is fragile. Consider what happens when a value itself contains a comma:
```
course_id,name,location
1,"Bob O'Connor Golf Course, Schenley Park",Pittsburgh
```
A naive `split(',')` would break `"Bob O'Connor Golf Course, Schenley Park"` into two fields. The `csv` module handles quoting, escaping, and edge cases correctly.
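You can see the correct behaviour by feeding the same tricky line to `csv.reader` through an in-memory buffer (a sketch using `io.StringIO`, so it does not depend on the data files):

```python
import csv
import io

tricky = '1,"Bob O\'Connor Golf Course, Schenley Park",Pittsburgh'

# csv.reader respects the quotes around the second field
row = next(csv.reader(io.StringIO(tricky)))
print(row)
# ['1', "Bob O'Connor Golf Course, Schenley Park", 'Pittsburgh']
```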
```python
# Demonstrating the problem
bad_line = '1,"Bob O\'Connor Golf Course, Schenley Park",Pittsburgh'
print('Naive split:', bad_line.split(','))
# Expected: ['1', "Bob O'Connor Golf Course, Schenley Park", 'Pittsburgh']
# Got:      ['1', '"Bob O\'Connor Golf Course', ' Schenley Park"', 'Pittsburgh']
```

### 3. The `csv` Module
Python’s standard library includes the `csv` module. Two main readers:

- `csv.reader` – returns each row as a list of strings.
- `csv.DictReader` – returns each row as a dictionary with header names as keys.
```python
import csv
```

```python
# csv.reader -- rows as lists
with open('../../data/players.csv', 'r') as f:
    reader = csv.reader(f)
    header = next(reader)  # pull out the header row
    print('Header:', header)
    print()
    for row in reader:
        print(row)
```

```python
# csv.DictReader -- rows as dictionaries
with open('../../data/players.csv', 'r') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row)
```

```python
# DictReader makes field access readable
with open('../../data/players.csv', 'r') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(f"{row['name']} has a handicap of {row['handicap']}")
```

**Key difference:** With `csv.reader`, you access fields by index (`row[1]`). With `csv.DictReader`, you access them by name (`row['name']`). `DictReader` is almost always preferable because the code is self-documenting and won’t break if columns are reordered.
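That robustness to column order is easy to demonstrate with two in-memory CSVs that carry the same columns in different orders (a sketch with made-up one-row data):

```python
import csv
import io

v1 = 'player_id,name\n1,Bear Woods\n'
v2 = 'name,player_id\nBear Woods,1\n'   # same columns, swapped order

for data in (v1, v2):
    row = next(csv.DictReader(io.StringIO(data)))
    # Access by name works regardless of column order
    print(row['name'], row['player_id'])
# Bear Woods 1
# Bear Woods 1
```

The equivalent index-based code (`row[1]`) would silently read the wrong field for the second file.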
### 4. Reading the Golf Data Files
Let’s explore the course’s golf datasets. We will build a reusable pattern: read a CSV into a list of dictionaries.
```python
def read_csv(filepath):
    """Read a CSV file and return a list of dictionaries."""
    with open(filepath, 'r') as f:
        reader = csv.DictReader(f)
        return list(reader)
```

```python
players = read_csv('../../data/players.csv')
print(f'{len(players)} players loaded')
for p in players:
    print(f"  {p['name']:20s} handicap: {p['handicap']}")
```

```python
rounds = read_csv('../../data/rounds.csv')
print(f'{len(rounds)} rounds loaded')
print()

# Show the first 5 rounds
for r in rounds[:5]:
    print(r)
```

```python
courses = read_csv('../../data/courses.csv')
print(f'{len(courses)} courses loaded')
for c in courses:
    print(f"  {c['name']:30s} {c['city']}, {c['state']} (slope: {c['slope_rating']})")
```

**Important:** Notice that every value from `csv.DictReader` is a string. If you need to do math, you must convert types explicitly.
```python
# All values are strings
print(type(rounds[0]['total_score']))  # <class 'str'>
print(type(rounds[0]['player_id']))    # <class 'str'>
```

```python
# Calculate average score -- must convert to int first
scores = [int(r['total_score']) for r in rounds]
avg_score = sum(scores) / len(scores)
print(f'Average score across all rounds: {avg_score:.1f}')
```

### 5. Combining Data from Multiple CSVs
In a relational dataset, information is split across files and connected by shared keys (like `player_id`). This is a join. Let’s combine `rounds.csv` and `players.csv` to see player names alongside their scores.
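The pattern in miniature, with made-up inline rows standing in for the CSV files (the ids and scores here are invented for illustration). Note that `dict.get` with a default guards against a round whose `player_id` has no matching player:

```python
# Tiny stand-ins for players.csv and rounds.csv
players_mini = [{'player_id': '1', 'name': 'Bear Woods'},
                {'player_id': '2', 'name': 'Sam Shanks'}]
rounds_mini = [{'player_id': '1', 'total_score': '79'},
               {'player_id': '9', 'total_score': '88'}]  # no player 9

# Build the lookup once, then join each round against it
lookup = {p['player_id']: p['name'] for p in players_mini}
for r in rounds_mini:
    name = lookup.get(r['player_id'], '<unknown player>')
    print(name, r['total_score'])
# Bear Woods 79
# <unknown player> 88
```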
```python
# Build a lookup dictionary: player_id -> player name
player_lookup = {}
for p in players:
    player_lookup[p['player_id']] = p['name']
print(player_lookup)
```

```python
# Join rounds with player names
for r in rounds:
    name = player_lookup[r['player_id']]
    print(f"{r['date']}  {name:20s} score: {r['total_score']} ({r['weather']})")
```

```python
# Build a richer joined dataset as a list of dicts
course_lookup = {c['course_id']: c['name'] for c in courses}

joined_rounds = []
for r in rounds:
    joined_rounds.append({
        'date': r['date'],
        'player': player_lookup[r['player_id']],
        'course': course_lookup[r['course_id']],
        'score': int(r['total_score']),
        'weather': r['weather']
    })

# Show the first 5 joined records
for jr in joined_rounds[:5]:
    print(jr)
```

```python
# Find each player's best (lowest) round
best_rounds = {}
for jr in joined_rounds:
    player = jr['player']
    if player not in best_rounds or jr['score'] < best_rounds[player]['score']:
        best_rounds[player] = jr

print('Best round per player:')
print('-' * 60)
for player, info in best_rounds.items():
    print(f"{player:20s} {info['score']} at {info['course']} ({info['date']})")
```

### 6. Writing CSV Files
The `csv` module provides `csv.writer` and `csv.DictWriter` for creating CSV files.
```python
# csv.writer -- write rows as lists
with open('sunny_rounds.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['date', 'player', 'course', 'score'])  # header
    for jr in joined_rounds:
        if jr['weather'] == 'sunny':
            writer.writerow([jr['date'], jr['player'], jr['course'], jr['score']])

# Verify output
with open('sunny_rounds.csv', 'r') as f:
    print(f.read())
```

```python
# csv.DictWriter -- write rows as dictionaries
fieldnames = ['date', 'player', 'course', 'score', 'weather']

with open('all_rounds_joined.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(joined_rounds)

# Verify output
with open('all_rounds_joined.csv', 'r') as f:
    print(f.read())
```

Note the `newline=''` parameter when opening files for writing with the `csv` module. This prevents the writer from adding extra blank lines on Windows. It is a good habit to always include it.
### 7. Using `pathlib.Path` for File Paths

String concatenation for file paths is error-prone (forward slashes vs backslashes, missing separators). `pathlib.Path` solves this.
```python
from pathlib import Path
```

```python
# Build a path using the / operator
data_dir = Path('..') / '..' / 'data'
print(data_dir)
print(data_dir.resolve())  # absolute path
```

```python
# List all CSV files in the data directory
for csv_file in sorted(data_dir.glob('*.csv')):
    print(f'{csv_file.name:20s} {csv_file.stat().st_size:>8,} bytes')
```

```python
# Use Path objects directly with open() and csv
players_path = data_dir / 'players.csv'
with open(players_path, 'r') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row['name'])
```

```python
# Useful Path methods
p = data_dir / 'rounds.csv'
print(f'exists:  {p.exists()}')
print(f'is_file: {p.is_file()}')
print(f'suffix:  {p.suffix}')
print(f'stem:    {p.stem}')
print(f'parent:  {p.parent}')
```

```python
# Updated read_csv function using pathlib
def read_csv_path(filepath):
    """Read a CSV file and return a list of dictionaries. Accepts str or Path."""
    filepath = Path(filepath)
    if not filepath.exists():
        raise FileNotFoundError(f'No such file: {filepath}')
    with open(filepath, 'r') as f:
        return list(csv.DictReader(f))

# Use it
holes = read_csv_path(data_dir / 'holes.csv')
print(f'{len(holes)} hole records loaded')
print(f'First record: {holes[0]}')
```

## AI
### Exercise 1: Ask AI to Join and Analyze

Give an AI assistant this prompt:

> Write a Python script that reads `rounds.csv` and `players.csv` from `../../data/`, joins them by `player_id`, and prints each player’s best (lowest) round with the date and score.

What to look for in the AI’s response:

- Does it use `with open()` so files are properly closed?
- Does it use `csv.DictReader` or `csv.reader`?
- Does it convert `total_score` to `int` before comparing?
- Does it build a lookup dictionary for the join, or does it use a nested loop? (Both work, but the lookup is more efficient.)
```python
# Paste the AI-generated code here and run it.
# Compare the output to our results from Section 5 above.
```

### Exercise 2: Ask AI to Explain DictReader vs reader
Give an AI assistant this prompt:

> Explain the difference between `csv.reader` and `csv.DictReader` in Python. When would you choose one over the other? Give a short code example of each reading the same file.

What to look for in the AI’s response:

- Does it mention that `reader` returns lists and `DictReader` returns dictionaries?
- Does it explain that `DictReader` uses the first row as keys by default?
- Does it note that `DictReader` is better for readability and column-order independence?
- Does it mention any performance difference? (There is a small overhead for `DictReader`, but it is negligible for most use cases.)
### Exercise 3: Evaluate AI Code Quality

Give an AI assistant this prompt:

> Write a Python script that reads all CSV files in `../../data/`, prints each filename and its row count, and writes a summary to `data_summary.csv`.

Evaluate the response against these criteria:

- **File paths** – Does it use `pathlib` or string concatenation? Does it handle the relative path correctly?
- **File closing** – Does it use `with` statements for every file it opens?
- **CSV writing** – Does it include `newline=''` in the `open()` call for writing?
- **Edge cases** – What happens if the data directory does not exist? Does the code handle that?
```python
# Paste the AI-generated code here and run it.
# Check: does the output match what you'd expect from our data files?
```

## Summary

| Topic | Key Takeaway |
|---|---|
| Files | Use `with open()` to ensure files are always closed properly |
| CSV format | First row = headers, subsequent rows = records, comma-delimited |
| Manual parsing | `split(',')` breaks on quoted fields – do not use it for real data |
| `csv.reader` | Returns rows as lists; access fields by index |
| `csv.DictReader` | Returns rows as dicts; access fields by name (preferred) |
| `csv.writer` / `DictWriter` | Write CSVs; always use `newline=''` when opening the file |
| Joining CSVs | Build a lookup dictionary on the shared key, then match records |
| `pathlib.Path` | Object-oriented file paths; use the `/` operator to build paths |
| Type conversion | All CSV values are strings – convert with `int()`, `float()` as needed |
Next up: Topic 03, Notebook 02 – Working with JSON files.