# Working with CSV Files

```python
# Reading a file
with open('../../data/players.csv', 'r') as f:
    contents = f.read()
print(contents)
```
## What You’ll Learn

- How files work in Python: text vs binary, opening and closing
- What CSV files are and why they dominate data science
- Reading and writing CSVs with the `csv` module
- Combining data from multiple CSV files
- Using `pathlib.Path` for clean file path handling
## Concept

### What Are Files?
Everything on your computer is stored in files. For programming, there are two categories:
- **Text files** contain human-readable characters (letters, numbers, punctuation). You can open them in any text editor. Examples: `.txt`, `.csv`, `.py`, `.json`.
- **Binary files** contain raw bytes that only specific programs can interpret. Examples: `.xlsx`, `.pdf`, `.jpg`, `.sqlite`.
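The difference is visible directly from Python: text mode (`'r'`) decodes the file into a `str`, while binary mode (`'rb'`) hands you the raw `bytes`. A minimal sketch (it writes a throwaway file named `demo.txt` first so it is self-contained):

```python
# Create a small text file so the example is self-contained
with open('demo.txt', 'w') as f:
    f.write('par,4\n')

# Text mode: Python decodes the bytes into str for you
with open('demo.txt', 'r') as f:
    text = f.read()
print(type(text), repr(text))   # <class 'str'> 'par,4\n'

# Binary mode: you get the raw bytes exactly as stored on disk
with open('demo.txt', 'rb') as f:
    raw = f.read()
print(type(raw), repr(raw))     # <class 'bytes'> b'par,4\n'
```

For a plain ASCII file the two look similar; open a `.jpg` or `.xlsx` in binary mode and you will see why text mode cannot handle it.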
### What Is CSV?

CSV stands for Comma-Separated Values. It is the simplest structured data format:

```
player_id,name,handicap
1,Bear Woods,2.1
2,Brian Kolowitz,13.9
3,Sam Shanks,18.7
4,Bobby Bogey,25.3
```
- The first row is typically the header (column names).
- Each subsequent row is a record.
- Values are separated by commas (the delimiter).
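The delimiter is not always a comma: tab- and semicolon-separated files are common in the wild. As a quick sketch (using an in-memory string via `io.StringIO`, so no file on disk is assumed), `csv.reader` accepts a `delimiter` argument:

```python
import csv
import io

# The same kind of records, tab-separated instead of comma-separated
tsv_data = 'player_id\tname\thandicap\n1\tBear Woods\t2.1\n'

reader = csv.reader(io.StringIO(tsv_data), delimiter='\t')
rows = list(reader)
for row in rows:
    print(row)
# ['player_id', 'name', 'handicap']
# ['1', 'Bear Woods', '2.1']
```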
### Why CSV Is Everywhere in Data Science
- Universal – every tool reads CSV: Python, R, Excel, databases, cloud services.
- Simple – no special software needed to create or inspect one.
- Lightweight – plain text with zero overhead.
- Version-control friendly – diffs are meaningful because it is just text.
### A Note on File Paths

Our golf data lives in `../../data/` relative to this notebook. We will start with plain strings for paths, then switch to Python’s `pathlib.Path` at the end of this notebook. `pathlib` gives you an object-oriented way to build and manipulate file paths that works across operating systems.
## Code

### 1. File I/O Basics with `open()`

Python’s built-in `open()` function returns a file object. The `with` statement ensures the file is automatically closed when the block ends, even if an error occurs.
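Under the hood, `with open(...)` is roughly equivalent to a `try`/`finally` that calls `close()`. A sketch of what you would otherwise write by hand (the filename `manual_demo.txt` is just a throwaway for illustration):

```python
# What `with open(...)` saves you from writing manually:
f = open('manual_demo.txt', 'w')
try:
    f.write('Hole 1: par 4\n')
finally:
    f.close()      # runs even if write() raised an exception

print(f.closed)    # True -- the file is closed either way
```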
```python
# The file is closed automatically after the with block
# (f is still bound from the opening example above)
print(f'File closed? {f.closed}')
```

```python
# Reading line by line
with open('../../data/players.csv', 'r') as f:
    for line in f:
        print(repr(line))  # repr() shows the newline characters
```

```python
# Writing a file
with open('sample_output.txt', 'w') as f:
    f.write('Round 1: Bear Woods scored 79\n')
    f.write('Round 2: Brian Kolowitz scored 97\n')

# Verify it wrote correctly
with open('sample_output.txt', 'r') as f:
    print(f.read())
```

### 2. Reading CSV Manually (and Why It Breaks)
The naive approach: split each line on commas.
```python
with open('../../data/players.csv', 'r') as f:
    for line in f:
        fields = line.strip().split(',')
        print(fields)
```

This works for simple data, but it is fragile. Consider what happens when a value itself contains a comma:
```
course_id,name,location
1,"Bob O'Connor Golf Course, Schenley Park",Pittsburgh
```
A naive `split(',')` would break `"Bob O'Connor Golf Course, Schenley Park"` into two fields. The `csv` module handles quoting, escaping, and edge cases correctly.
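You can see the correct behaviour by feeding the same tricky line to `csv.reader` through an in-memory buffer (a sketch using `io.StringIO`, so it does not depend on the data files):

```python
import csv
import io

tricky = '1,"Bob O\'Connor Golf Course, Schenley Park",Pittsburgh'

# csv.reader respects the quotes around the second field
row = next(csv.reader(io.StringIO(tricky)))
print(row)
# ['1', "Bob O'Connor Golf Course, Schenley Park", 'Pittsburgh']
```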
```python
# Demonstrating the problem
bad_line = '1,"Bob O\'Connor Golf Course, Schenley Park",Pittsburgh'
print('Naive split:', bad_line.split(','))
# Expected: ['1', "Bob O'Connor Golf Course, Schenley Park", 'Pittsburgh']
# Got:      ['1', '"Bob O\'Connor Golf Course', ' Schenley Park"', 'Pittsburgh']
```

### 3. The `csv` Module
Python’s standard library includes the `csv` module. Two main readers:

- `csv.reader` – returns each row as a list of strings.
- `csv.DictReader` – returns each row as a dictionary with header names as keys.
```python
import csv
```

```python
# csv.reader -- rows as lists
with open('../../data/players.csv', 'r') as f:
    reader = csv.reader(f)
    header = next(reader)  # pull out the header row
    print('Header:', header)
    print()
    for row in reader:
        print(row)
```

```python
# csv.DictReader -- rows as dictionaries
with open('../../data/players.csv', 'r') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row)
```

```python
# DictReader makes field access readable
with open('../../data/players.csv', 'r') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(f"{row['name']} has a handicap of {row['handicap']}")
```

**Key difference:** With `csv.reader`, you access fields by index (`row[1]`). With `csv.DictReader`, you access them by name (`row['name']`). `DictReader` is almost always preferable because the code is self-documenting and won’t break if columns are reordered.
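That robustness to column order is easy to demonstrate with two in-memory CSVs that carry the same columns in different orders (a sketch with made-up one-row data):

```python
import csv
import io

v1 = 'player_id,name\n1,Bear Woods\n'
v2 = 'name,player_id\nBear Woods,1\n'   # same columns, swapped order

for data in (v1, v2):
    row = next(csv.DictReader(io.StringIO(data)))
    # Access by name works regardless of column order
    print(row['name'], row['player_id'])
# Bear Woods 1
# Bear Woods 1
```

The equivalent index-based code (`row[1]`) would silently read the wrong field for the second file.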
### 4. Reading the Golf Data Files
Let’s explore the course’s golf datasets. We will build a reusable pattern: read a CSV into a list of dictionaries.
```python
def read_csv(filepath):
    """Read a CSV file and return a list of dictionaries."""
    with open(filepath, 'r') as f:
        reader = csv.DictReader(f)
        return list(reader)
```

```python
players = read_csv('../../data/players.csv')
print(f'{len(players)} players loaded')
for p in players:
    print(f"  {p['name']:20s} handicap: {p['handicap']}")
```

```python
rounds = read_csv('../../data/rounds.csv')
print(f'{len(rounds)} rounds loaded')
print()

# Show the first 5 rounds
for r in rounds[:5]:
    print(r)
```

```python
courses = read_csv('../../data/courses.csv')
print(f'{len(courses)} courses loaded')
for c in courses:
    print(f"  {c['name']:30s} {c['city']}, {c['state']} (slope: {c['slope_rating']})")
```

**Important:** Notice that every value from `csv.DictReader` is a string. If you need to do math, you must convert types explicitly.
```python
# All values are strings
print(type(rounds[0]['total_score']))  # <class 'str'>
print(type(rounds[0]['player_id']))    # <class 'str'>
```

```python
# Calculate average score -- must convert to int first
scores = [int(r['total_score']) for r in rounds]
avg_score = sum(scores) / len(scores)
print(f'Average score across all rounds: {avg_score:.1f}')
```

### 5. Combining Data from Multiple CSVs
In a relational dataset, information is split across files and connected by shared keys (like `player_id`). This is a join. Let’s combine `rounds.csv` and `players.csv` to see player names alongside their scores.
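The pattern in miniature, with made-up inline rows standing in for the CSV files (the ids and scores here are invented for illustration). Note that `dict.get` with a default guards against a round whose `player_id` has no matching player:

```python
# Tiny stand-ins for players.csv and rounds.csv
players_mini = [{'player_id': '1', 'name': 'Bear Woods'},
                {'player_id': '2', 'name': 'Sam Shanks'}]
rounds_mini = [{'player_id': '1', 'total_score': '79'},
               {'player_id': '9', 'total_score': '88'}]  # no player 9

# Build the lookup once, then join each round against it
lookup = {p['player_id']: p['name'] for p in players_mini}
for r in rounds_mini:
    name = lookup.get(r['player_id'], '<unknown player>')
    print(name, r['total_score'])
# Bear Woods 79
# <unknown player> 88
```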
```python
# Build a lookup dictionary: player_id -> player name
player_lookup = {}
for p in players:
    player_lookup[p['player_id']] = p['name']
print(player_lookup)
```

```python
# Join rounds with player names
for r in rounds:
    name = player_lookup[r['player_id']]
    print(f"{r['date']}  {name:20s} score: {r['total_score']} ({r['weather']})")
```

```python
# Build a richer joined dataset as a list of dicts
course_lookup = {c['course_id']: c['name'] for c in courses}

joined_rounds = []
for r in rounds:
    joined_rounds.append({
        'date': r['date'],
        'player': player_lookup[r['player_id']],
        'course': course_lookup[r['course_id']],
        'score': int(r['total_score']),
        'weather': r['weather']
    })

# Show the first 5 joined records
for jr in joined_rounds[:5]:
    print(jr)
```

```python
# Find each player's best (lowest) round
best_rounds = {}
for jr in joined_rounds:
    player = jr['player']
    if player not in best_rounds or jr['score'] < best_rounds[player]['score']:
        best_rounds[player] = jr

print('Best round per player:')
print('-' * 60)
for player, info in best_rounds.items():
    print(f"{player:20s} {info['score']} at {info['course']} ({info['date']})")
```

### 6. Writing CSV Files
The `csv` module provides `csv.writer` and `csv.DictWriter` for creating CSV files.
```python
# csv.writer -- write rows as lists
with open('sunny_rounds.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['date', 'player', 'course', 'score'])  # header
    for jr in joined_rounds:
        if jr['weather'] == 'sunny':
            writer.writerow([jr['date'], jr['player'], jr['course'], jr['score']])

# Verify output
with open('sunny_rounds.csv', 'r') as f:
    print(f.read())
```

```python
# csv.DictWriter -- write rows as dictionaries
fieldnames = ['date', 'player', 'course', 'score', 'weather']

with open('all_rounds_joined.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(joined_rounds)

# Verify output
with open('all_rounds_joined.csv', 'r') as f:
    print(f.read())
```

Note the `newline=''` parameter when opening files for writing with the `csv` module. This prevents the writer from adding extra blank lines on Windows. It is a good habit to always include it.
### 7. Using `pathlib.Path` for File Paths

String concatenation for file paths is error-prone (forward slashes vs backslashes, missing separators). `pathlib.Path` solves this.
```python
from pathlib import Path
```

```python
# Build a path using the / operator
data_dir = Path('..') / '..' / 'data'
print(data_dir)
print(data_dir.resolve())  # absolute path
```

```python
# List all CSV files in the data directory
for csv_file in sorted(data_dir.glob('*.csv')):
    print(f'{csv_file.name:20s} {csv_file.stat().st_size:>8,} bytes')
```

```python
# Use Path objects directly with open() and csv
players_path = data_dir / 'players.csv'
with open(players_path, 'r') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row['name'])
```

```python
# Useful Path methods
p = data_dir / 'rounds.csv'
print(f'exists:  {p.exists()}')
print(f'is_file: {p.is_file()}')
print(f'suffix:  {p.suffix}')
print(f'stem:    {p.stem}')
print(f'parent:  {p.parent}')
```

```python
# Updated read_csv function using pathlib
def read_csv_path(filepath):
    """Read a CSV file and return a list of dictionaries. Accepts str or Path."""
    filepath = Path(filepath)
    if not filepath.exists():
        raise FileNotFoundError(f'No such file: {filepath}')
    with open(filepath, 'r') as f:
        return list(csv.DictReader(f))

# Use it
holes = read_csv_path(data_dir / 'holes.csv')
print(f'{len(holes)} hole records loaded')
print(f'First record: {holes[0]}')
```

## AI
### Exercise 1: Ask AI to Join and Analyze

Give an AI assistant this prompt:

> Write a Python script that reads `rounds.csv` and `players.csv` from `../../data/`, joins them by `player_id`, and prints each player’s best (lowest) round with the date and score.

What to look for in the AI’s response:

- Does it use `with open()` so files are properly closed?
- Does it use `csv.DictReader` or `csv.reader`?
- Does it convert `total_score` to `int` before comparing?
- Does it build a lookup dictionary for the join, or does it use a nested loop? (Both work, but the lookup is more efficient.)
```python
# Paste the AI-generated code here and run it.
# Compare the output to our results from Section 5 above.
```

### Exercise 2: Ask AI to Explain DictReader vs reader
Give an AI assistant this prompt:

> Explain the difference between `csv.reader` and `csv.DictReader` in Python. When would you choose one over the other? Give a short code example of each reading the same file.

What to look for in the AI’s response:

- Does it mention that `reader` returns lists and `DictReader` returns dictionaries?
- Does it explain that `DictReader` uses the first row as keys by default?
- Does it note that `DictReader` is better for readability and column-order independence?
- Does it mention any performance difference? (There is a small overhead for `DictReader`, but it is negligible for most use cases.)
### Exercise 3: Evaluate AI Code Quality

Give an AI assistant this prompt:

> Write a Python script that reads all CSV files in `../../data/`, prints each filename and its row count, and writes a summary to `data_summary.csv`.

Evaluate the response against these criteria:

- **File paths** – Does it use `pathlib` or string concatenation? Does it handle the relative path correctly?
- **File closing** – Does it use `with` statements for every file it opens?
- **CSV writing** – Does it include `newline=''` in the `open()` call for writing?
- **Edge cases** – What happens if the data directory does not exist? Does the code handle that?
```python
# Paste the AI-generated code here and run it.
# Check: does the output match what you'd expect from our data files?
```

## Summary

| Topic | Key Takeaway |
|---|---|
| Files | Use `with open()` to ensure files are always closed properly |
| CSV format | First row = headers, subsequent rows = records, comma-delimited |
| Manual parsing | `split(',')` breaks on quoted fields – do not use it for real data |
| `csv.reader` | Returns rows as lists; access fields by index |
| `csv.DictReader` | Returns rows as dicts; access fields by name (preferred) |
| `csv.writer` / `DictWriter` | Write CSVs; always use `newline=''` when opening the file |
| Joining CSVs | Build a lookup dictionary on the shared key, then match records |
| `pathlib.Path` | Object-oriented file paths; use the `/` operator to build paths |
| Type conversion | All CSV values are strings – convert with `int()`, `float()` as needed |
Next up: Topic 03, Notebook 02 – Working with JSON files.