Rahul conducted a survey in his school: "Which subjects do students find most difficult? How many hours do they study? Does study time actually improve marks?" He collected 120 responses in a Google Form and downloaded the CSV file. Now he had a spreadsheet — but no easy way to make sense of it.
His data science teacher showed him pandas and matplotlib in Google Colab. In 45 minutes, Rahul had 6 beautiful charts revealing patterns no one had noticed before: students who studied in the evening consistently scored higher than those who studied in the morning, and Maths was hardest for exactly the students who studied least. He presented this at the school assembly, and it changed the homework schedule.
We'll work with a student survey dataset throughout this lesson:
| Name | City | Class | Fav Subject | Study Hours/Day | Marks (%) | Study Time |
|---|---|---|---|---|---|---|
| Ananya | Jaipur | 9 | Maths | 4 | 85 | Evening |
| Rohit | Pune | 9 | Science | 2 | 62 | Morning |
| Priya | Delhi | 10 | English | 5 | 91 | Evening |
| Arjun | Chennai | 9 | Maths | 1.5 | 55 | Morning |
| Meera | Bengaluru | 10 | Science | 6 | 94 | Evening |
| Dev | Mumbai | 9 | History | 3 | 74 | Night |
| … 114 more rows | | | | | | |
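As a minimal sketch of how rows like these end up in pandas, the six sample rows above can be written out as CSV text and read with `pd.read_csv`. The short column names in the rename step and the file name mentioned in the comment are assumptions made for this illustration, not part of Rahul's actual export.

```python
import io
import pandas as pd

# The six sample rows from the table above, written out as CSV text.
csv_text = """Name,City,Class,Fav Subject,Study Hours/Day,Marks (%),Study Time
Ananya,Jaipur,9,Maths,4,85,Evening
Rohit,Pune,9,Science,2,62,Morning
Priya,Delhi,10,English,5,91,Evening
Arjun,Chennai,9,Maths,1.5,55,Morning
Meera,Bengaluru,10,Science,6,94,Evening
Dev,Mumbai,9,History,3,74,Night
"""

# With a downloaded file you would pass its path instead,
# e.g. pd.read_csv("survey_responses.csv") (hypothetical file name).
sample = pd.read_csv(io.StringIO(csv_text))

# Shorter column names are easier to type; these names are a choice
# made for this sketch.
sample = sample.rename(columns={
    "Name": "name",
    "City": "city",
    "Class": "class",
    "Fav Subject": "subject",
    "Study Hours/Day": "study_hours",
    "Marks (%)": "marks",
    "Study Time": "study_time",
})

print(sample.head())    # first rows of the table
print(sample.dtypes)    # pandas infers numeric vs text columns
```

Checking `dtypes` right after loading is a quick way to confirm that numeric columns such as study hours and marks were read as numbers and not as text.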
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Create the survey dataset
np.random.seed(42)
n = 120
subjects = ['Maths','Science','English','History','Hindi']
cities = ['Jaipur','Pune','Delhi','Chennai','Bengaluru','Mumbai','Hyderabad']
times = ['Morning','Evening','Night']
df = pd.DataFrame({
    'class': np.random.choice([9,10], n),
    'subject': np.random.choice(subjects, n),
    'city': np.random.choice(cities, n),
    'study_hours': np.round(np.random.uniform(1, 7, n), 1),
    'study_time': np.random.choice(times, n, p=[0.3, 0.5, 0.2]),
})
# Marks depend on study hours + evening boost
df['marks'] = (40 + df['study_hours'] * 7
               + (df['study_time'] == 'Evening') * 8
               + np.random.normal(0, 6, n)).clip(40, 100).round(1)
# ── Basic exploration ──
print(df.shape) # (120, 6): 120 rows, 6 columns (including marks)
print(df.describe()) # Stats for all numeric columns
# ── Filter: students who scored above 85 ──
high_scorers = df[df['marks'] > 85]
print(f"\nHigh scorers (>85): {len(high_scorers)} students")
# ── groupby: average marks per subject ──
subject_avg = df.groupby('subject')['marks'].mean().sort_values(ascending=False)
print("\nAverage marks by subject:")
print(subject_avg.round(1))
# ── value_counts: subject popularity ──
print("\nFavourite subject counts:")
print(df['subject'].value_counts())
# ── groupby: study time analysis ──
time_avg = df.groupby('study_time')['marks'].mean().sort_values(ascending=False)
print("\nAverage marks by study time:")
print(time_avg.round(1))
# ── Plotting ──
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
fig.suptitle("Student Survey Analysis", fontsize=14, fontweight='bold')
# Bar chart — avg marks by subject
subject_avg.plot(kind='bar', ax=axes[0], color='#f97316', edgecolor='white')
axes[0].set_title('Avg Marks by Subject')
axes[0].set_ylabel('Average Marks')
axes[0].tick_params(axis='x', rotation=30)
# Pie chart — subject popularity
df['subject'].value_counts().plot(kind='pie', ax=axes[1],
    autopct='%1.0f%%', colors=['#f97316','#ea580c','#c2410c','#9a3412','#7c2d12'])
axes[1].set_title('Subject Popularity')
axes[1].set_ylabel('')
# Scatter plot with a fitted trend line: study hours vs marks
axes[2].scatter(df['study_hours'], df['marks'], alpha=0.4, color='#f97316', s=20)
m, b = np.polyfit(df['study_hours'], df['marks'], 1)
x_line = np.linspace(df['study_hours'].min(), df['study_hours'].max(), 100)
axes[2].plot(x_line, m * x_line + b, color='#7c2d12', linewidth=2)
axes[2].set_title('Study Hours vs Marks')
axes[2].set_xlabel('Study Hours')
axes[2].set_ylabel('Marks')
plt.tight_layout()
plt.savefig('student_survey.png', dpi=150, bbox_inches='tight')
plt.show()
print("Chart saved as student_survey.png")

- Bar chart: Comparing values across categories (average marks per subject). Best for 2–8 categories.
- Pie chart: Showing how a total is split into parts (% of students per subject). Keep to 5 or fewer slices.
- Line chart: Showing trends over time (monthly test scores across the year).
- Scatter plot: Showing the relationship between two numeric variables (study hours vs marks). Reveals correlation.
- Histogram: Showing the distribution of one numeric variable (how marks are spread from 40–100).
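The lesson code draws a bar chart, a pie chart and a scatter plot, so here is a rough sketch of the two remaining chart types from the list: a histogram and a line chart. All numbers below are invented for illustration (the marks are randomly generated and the monthly averages are made up); they are not survey results.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data, invented only to demonstrate the two chart types.
rng = np.random.default_rng(0)
marks = rng.normal(70, 12, 120).clip(40, 100)      # 120 made-up marks
months = ['Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep']
monthly_avg = [61, 64, 63, 68, 72, 75]             # made-up monthly test averages

fig, (ax_hist, ax_line) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: how one numeric variable is distributed.
# 12 bins between 40 and 100 makes each bin 5 marks wide.
counts, bin_edges, _ = ax_hist.hist(marks, bins=12, range=(40, 100),
                                    color='#f97316', edgecolor='white')
ax_hist.set_title('Distribution of Marks')
ax_hist.set_xlabel('Marks')
ax_hist.set_ylabel('Number of Students')

# Line chart: how a value changes over time.
ax_line.plot(months, monthly_avg, marker='o', color='#7c2d12')
ax_line.set_title('Average Test Score by Month')
ax_line.set_xlabel('Month')
ax_line.set_ylabel('Average Marks')

fig.tight_layout()
# In Colab, end the cell with plt.show() to display the figure.
```

The bin count is a judgment call: too few bins hide the shape of the distribution, while too many make it look noisy.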