Lesson 9 — Data Analysis with Pandas and Charts | Class 9

Meet Rahul — Class 9, Jaipur

Rahul conducted a survey in his school: "Which subjects do students find most difficult? How many hours do they study? Does study time actually improve marks?" He collected 120 responses in a Google Form and downloaded the CSV file. Now he had a spreadsheet — but no easy way to make sense of it.

His data science teacher showed him pandas and matplotlib in Google Colab. In 45 minutes, Rahul had 6 beautiful charts revealing patterns no one had noticed before: students who studied in the evening consistently scored higher than those who studied in the morning, and Maths was hardest for exactly the students who studied the least. He presented this at the school assembly, and it changed the homework schedule.

The Sample Dataset We'll Analyse

We'll work with a student survey dataset throughout this lesson:

Name	City	Class	Fav Subject	Study Hours/Day	Marks (%)	Study Time
Ananya	Jaipur	9	Maths	4	85	Evening
Rohit	Pune	9	Science	2	62	Morning
Priya	Delhi	10	English	5	91	Evening
Arjun	Chennai	9	Maths	1.5	55	Morning
Meera	Bengaluru	10	Science	6	94	Evening
Dev	Mumbai	9	History	3	74	Night
… 114 more rows
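
A survey export like Rahul's can be loaded with pd.read_csv. The sketch below is a minimal illustration, assuming the six sample rows above have been saved as CSV text with short snake_case column names (name, subject, study_hours and so on). Those names are chosen here to match the code later in the lesson and are not the form's actual headers. With a real download, you would pass the file's name to pd.read_csv instead of the StringIO object.

```python
import io
import pandas as pd

# The six sample rows from the table, written as CSV text.
# Column names are an assumption, chosen to match the code later in the lesson.
csv_text = """name,city,class,subject,study_hours,marks,study_time
Ananya,Jaipur,9,Maths,4,85,Evening
Rohit,Pune,9,Science,2,62,Morning
Priya,Delhi,10,English,5,91,Evening
Arjun,Chennai,9,Maths,1.5,55,Morning
Meera,Bengaluru,10,Science,6,94,Evening
Dev,Mumbai,9,History,3,74,Night
"""

# StringIO lets read_csv treat the text like a file
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)    # (6, 7): 6 rows, 7 columns
print(df.head())   # first 5 rows
```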

Part 1

Key pandas Operations

df[df['marks'] > 80]

Filter rows — keep only students who scored above 80.
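
To see why the filter works, it helps to look at the condition on its own. This small sketch uses the marks of the six sample students (the variable names mask and top are just illustrative): the comparison produces one True/False per row, and df[...] keeps the True rows without changing df itself.

```python
import pandas as pd

df = pd.DataFrame({
    'name':  ['Ananya', 'Rohit', 'Priya', 'Arjun', 'Meera', 'Dev'],
    'marks': [85, 62, 91, 55, 94, 74],
})

# The condition alone gives one True/False value per row
mask = df['marks'] > 80
print(mask.tolist())          # [True, False, True, False, True, False]

# Passing the mask back to df keeps only the True rows
top = df[mask]
print(top['name'].tolist())   # ['Ananya', 'Priya', 'Meera']

# The original DataFrame still has all six rows
print(len(df))                # 6
```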

df.groupby('subject')['marks'].mean()

Group by subject and compute the average marks per subject.
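
A quick way to trust groupby is to check one group by hand. The sketch below assumes only the six sample students, so the averages are small enough to verify with a calculator.

```python
import pandas as pd

df = pd.DataFrame({
    'subject': ['Maths', 'Science', 'English', 'Maths', 'Science', 'History'],
    'marks':   [85, 62, 91, 55, 94, 74],
})

# One average per subject; the result is a Series indexed by subject
subject_avg = df.groupby('subject')['marks'].mean()
print(subject_avg)
# Maths:   (85 + 55) / 2 = 70.0
# Science: (62 + 94) / 2 = 78.0
# English and History have one student each, so the mean is that student's mark
```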

df['subject'].value_counts()

Count how many students picked each subject as favourite.
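
If you want shares instead of raw counts (handy before drawing a pie chart), value_counts also accepts normalize=True. A minimal sketch with the six sample students:

```python
import pandas as pd

df = pd.DataFrame({
    'subject': ['Maths', 'Science', 'English', 'Maths', 'Science', 'History'],
})

# Raw counts, largest first
counts = df['subject'].value_counts()
print(counts['Maths'])    # 2
print(counts.sum())       # 6: every student is counted once

# Fractions of the total instead of counts
share = df['subject'].value_counts(normalize=True)
print(round(share['Maths'] * 100))   # 33 (percent)
```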

df.sort_values('marks', ascending=False)

Sort the DataFrame from highest to lowest marks.
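
Note that sort_values returns a new, sorted DataFrame and leaves the original order alone. A small sketch with the six sample students (ranked is just an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({
    'name':  ['Ananya', 'Rohit', 'Priya', 'Arjun', 'Meera', 'Dev'],
    'marks': [85, 62, 91, 55, 94, 74],
})

# Highest marks first
ranked = df.sort_values('marks', ascending=False)
print(ranked['name'].tolist())
# ['Meera', 'Priya', 'Ananya', 'Dev', 'Rohit', 'Arjun']

# df itself is unchanged: Ananya is still the first row
print(df['name'].tolist()[0])   # Ananya
```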

df[['name','marks']].head(5)

Select only specific columns and show the first 5 rows.
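
The double square brackets matter here. As a rough illustration with the six sample students, single brackets return one column as a Series, while a list of names inside the brackets returns a smaller DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    'name':  ['Ananya', 'Rohit', 'Priya', 'Arjun', 'Meera', 'Dev'],
    'city':  ['Jaipur', 'Pune', 'Delhi', 'Chennai', 'Bengaluru', 'Mumbai'],
    'marks': [85, 62, 91, 55, 94, 74],
})

one_column = df['marks']               # single brackets: a Series
two_columns = df[['name', 'marks']]    # list of names: a DataFrame

print(type(one_column).__name__)       # Series
print(type(two_columns).__name__)      # DataFrame
print(two_columns.head(5).shape)       # (5, 2): first 5 rows, 2 columns
```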

df.describe()

Summary statistics for all numeric columns: count, mean, standard deviation, min, max, and quartiles.
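
The result of describe() is itself a DataFrame, so you can pick single values out of it. A minimal sketch, assuming the six sample students; note that the text column (name) is skipped automatically.

```python
import pandas as pd

df = pd.DataFrame({
    'name':        ['Ananya', 'Rohit', 'Priya', 'Arjun', 'Meera', 'Dev'],
    'study_hours': [4, 2, 5, 1.5, 6, 3],
    'marks':       [85, 62, 91, 55, 94, 74],
})

stats = df.describe()

# One row per statistic
print(stats.index.tolist())
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']

# One column per numeric column of df
print(stats.columns.tolist())      # ['study_hours', 'marks']

# Pick out a single value: the highest mark
print(stats.loc['max', 'marks'])   # 94.0
```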

Visual

Average Marks by Favourite Subject (Preview)

Subject	Average Marks (%)
Maths	78.4
Science	82.1
English	86.3
History	71.2
Hindi	74.8

This chart takes one line of pandas and one line of matplotlib. Students whose favourite subject is English or Science score highest on average.
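
As a rough sketch of the plotting half, the code below draws the preview bar chart directly from the five averages listed above. The numbers are typed in by hand here instead of being computed with groupby, and the output file name is only an example.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Averages copied from the preview values above (not recomputed here)
subject_avg = pd.Series(
    {'Maths': 78.4, 'Science': 82.1, 'English': 86.3, 'History': 71.2, 'Hindi': 74.8}
)

# Series.plot returns the matplotlib Axes, which we use to add labels
ax = subject_avg.plot(kind='bar', color='#f97316',
                      title='Average Marks by Favourite Subject')
ax.set_xlabel('Favourite Subject')
ax.set_ylabel('Average Marks (%)')

plt.tight_layout()
plt.savefig('subject_avg_preview.png')   # example file name

print(subject_avg.idxmax())   # English
```

In Colab the chart also appears under the cell once it finishes running.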

Part 2

Full Analysis in Python

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Create the survey dataset
np.random.seed(42)
n = 120
subjects = ['Maths','Science','English','History','Hindi']
cities = ['Jaipur','Pune','Delhi','Chennai','Bengaluru','Mumbai','Hyderabad']
times = ['Morning','Evening','Night']

df = pd.DataFrame({
    'class': np.random.choice([9,10], n),
    'subject': np.random.choice(subjects, n),
    'city': np.random.choice(cities, n),
    'study_hours': np.round(np.random.uniform(1, 7, n), 1),
    'study_time': np.random.choice(times, n, p=[0.3, 0.5, 0.2]),
})
# Marks depend on study hours + evening boost
df['marks'] = (40 + df['study_hours'] * 7
               + (df['study_time'] == 'Evening') * 8
               + np.random.normal(0, 6, n)).clip(40, 100).round(1)

# ── Basic exploration ──
print(df.shape)        # (120, 6): 120 rows, 6 columns (5 created above + marks)
print(df.describe())   # Stats for all numeric columns

# ── Filter: students who scored above 85 ──
high_scorers = df[df['marks'] > 85]
print(f"\nHigh scorers (>85): {len(high_scorers)} students")

# ── groupby: average marks per subject ──
subject_avg = df.groupby('subject')['marks'].mean().sort_values(ascending=False)
print("\nAverage marks by subject:")
print(subject_avg.round(1))

# ── value_counts: subject popularity ──
print("\nFavourite subject counts:")
print(df['subject'].value_counts())

# ── groupby: study time analysis ──
time_avg = df.groupby('study_time')['marks'].mean().sort_values(ascending=False)
print("\nAverage marks by study time:")
print(time_avg.round(1))

# ── Plotting ──
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
fig.suptitle("Student Survey Analysis", fontsize=14, fontweight='bold')

# Bar chart — avg marks by subject
subject_avg.plot(kind='bar', ax=axes[0], color='#f97316', edgecolor='white')
axes[0].set_title('Avg Marks by Subject')
axes[0].set_ylabel('Average Marks')
axes[0].tick_params(axis='x', rotation=30)

# Pie chart — subject popularity
df['subject'].value_counts().plot(kind='pie', ax=axes[1],
    autopct='%1.0f%%', colors=['#f97316','#ea580c','#c2410c','#9a3412','#7c2d12'])
axes[1].set_title('Subject Popularity')
axes[1].set_ylabel('')

# Scatter plot with trend line: study hours vs marks
axes[2].scatter(df['study_hours'], df['marks'], alpha=0.4, color='#f97316', s=20)
m, b = np.polyfit(df['study_hours'], df['marks'], 1)
x_line = np.linspace(df['study_hours'].min(), df['study_hours'].max(), 100)
axes[2].plot(x_line, m * x_line + b, color='#7c2d12', linewidth=2)
axes[2].set_title('Study Hours vs Marks')
axes[2].set_xlabel('Study Hours')
axes[2].set_ylabel('Marks')

plt.tight_layout()
plt.savefig('student_survey.png', dpi=150, bbox_inches='tight')
plt.show()
print("Chart saved as student_survey.png")
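
The trend line in the third chart comes from np.polyfit(x, y, 1), which fits a straight line y = m*x + b and returns the slope m and the intercept b. As a small check, the sketch below assumes made-up points that follow the lesson's marks formula (40 + 7 × study hours) with no noise and no evening boost, so the fit should recover a slope close to 7 and an intercept close to 40. On the real survey data the numbers will differ because of the random noise, the evening boost and the clipping.

```python
import numpy as np

hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
marks = 40 + 7 * hours        # noise-free version of the lesson's formula

# Degree 1 means a straight line: marks ≈ m * hours + b
m, b = np.polyfit(hours, marks, 1)
print(round(m, 2), round(b, 2))   # 7.0 40.0

# Use the line to estimate marks for 3.5 hours of study
print(round(m * 3.5 + b, 1))      # 64.5
```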

Part 3

Choosing the Right Chart Type

Bar chart: Comparing values across categories (average marks per subject). Best for 2–8 categories.
Pie chart: Showing how a total is split into parts (% of students per subject). Keep to 5 or fewer slices.
Line chart: Showing trends over time (monthly test scores across the year).
Scatter plot: Showing the relationship between two numeric variables (study hours vs marks). Reveals correlation.
Histogram: Showing the distribution of one numeric variable (how marks are spread from 40–100).

Golden rule: Always label your axes. Always add a title. A chart without labels is as useless as data without column names.
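
The histogram is the one chart type from this list that the full analysis does not draw. Here is a minimal sketch of one that follows the golden rule, assuming made-up marks for 120 students (random numbers for illustration, not survey results) and an example file name.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up marks for 120 students (an assumption for illustration, not survey data)
rng = np.random.default_rng(0)
marks = rng.normal(70, 12, 120).clip(40, 100)

fig, ax = plt.subplots(figsize=(6, 4))

# Bins of width 10: 40-50, 50-60, ..., 90-100
counts, edges, _ = ax.hist(marks, bins=np.arange(40, 101, 10),
                           color='#f97316', edgecolor='white')

# Golden rule: title and both axis labels
ax.set_title('Distribution of Marks')
ax.set_xlabel('Marks (%)')
ax.set_ylabel('Number of Students')

plt.tight_layout()
plt.savefig('marks_histogram.png', dpi=150)   # example file name

print(int(counts.sum()))   # 120: every student falls in exactly one bin
```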

🧪 Check Your Understanding — Lesson 9 Quiz

1. What does df[df['marks'] > 80] do?

a) Deletes all rows where marks are above 80

b) Selects only the 'marks' column

c) Returns a new DataFrame containing only rows where marks > 80

d) Sorts the DataFrame by marks descending

2. You want to find the average study hours for Class 9 students separately from Class 10. Which pandas method do you use?

a) df.filter('class')

b) df.groupby('class')['study_hours'].mean()

c) df['study_hours'].value_counts()

d) df.sort_values('study_hours')

3. df['subject'].value_counts() returns:

a) The unique subjects in alphabetical order

b) The number of rows with missing subject values

c) How many students chose each subject as their favourite, sorted by count

d) The average marks for each subject

4. Which chart type is best for showing how test scores have changed month by month over a school year?

a) Pie chart

b) Scatter plot

c) Line chart — it shows trends over continuous time

d) Bar chart

5. In matplotlib, plt.savefig('chart.png') does what?

a) Displays the chart on screen

b) Loads a chart from a file

c) Saves the current figure as a PNG image file named chart.png

d) Creates a new empty figure

6. You want to show the proportion of students from each city (as %). Which chart is best?

a) Scatter plot

b) Line chart

c) Histogram

d) Pie chart — it shows how a total is split into parts

7. df.describe() gives you:

a) The data types of each column

b) Count, mean, std, min, max, and quartiles for all numeric columns

c) The first 5 rows of the DataFrame

d) A list of column names

8. np.polyfit(x, y, 1) in the code is used to:

a) Normalise the data between 0 and 1

b) Find the polynomial degree of the dataset

c) Calculate the slope and intercept of the best-fit line through the scatter plot points

d) Create a random polynomial for visual decoration
