Data Analysis with Pandas and Charts 📊

Class 9 · Age 13–14 · Lesson 9 of 12 · 🆓 Free
[Image: Colourful bar and pie charts on a laptop screen showing student survey results, with a student taking notes beside it]

Meet Rahul — Class 9, Jaipur

Rahul conducted a survey in his school: "Which subjects do students find most difficult? How many hours do they study? Does study time actually improve marks?" He collected 120 responses in a Google Form and downloaded the CSV file. Now he had a spreadsheet — but no easy way to make sense of it.

His data science teacher showed him pandas and matplotlib in Google Colab. In 45 minutes, Rahul had 6 beautiful charts revealing patterns no one had noticed before: students who studied in the evening consistently scored higher than those who studied in the morning, and Maths was hardest for exactly the students who studied least. He presented this at the school assembly, and it changed the homework schedule.

Quick Recap
The Sample Dataset We'll Analyse

We'll work with a student survey dataset throughout this lesson:

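Name    | City      | Class | Fav Subject | Study Hours/Day | Marks (%) | Study Time
Ananya  | Jaipur    | 9     | Maths       | 4               | 85        | Evening
Rohit   | Pune      | 9     | Science     | 2               | 62        | Morning
Priya   | Delhi     | 10    | English     | 5               | 91        | Evening
Arjun   | Chennai   | 9     | Maths       | 1.5             | 55        | Morning
Meera   | Bengaluru | 10    | Science     | 6               | 94        | Evening
Dev     | Mumbai    | 9     | History     | 3               | 74        | Night
… 114 more rows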
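
To try this out before the full dataset appears, here is a minimal sketch that loads only the six rows shown above. The short column names (name, subject, study_hours and so on) are an assumption chosen to match the code later in this lesson, and "survey.csv" is a made-up filename.

```python
import io
import pandas as pd

# The six sample rows from the table above, written as CSV text.
# Column names are shortened here (an assumption) to match the later code.
csv_text = """name,city,class,subject,study_hours,marks,study_time
Ananya,Jaipur,9,Maths,4,85,Evening
Rohit,Pune,9,Science,2,62,Morning
Priya,Delhi,10,English,5,91,Evening
Arjun,Chennai,9,Maths,1.5,55,Morning
Meera,Bengaluru,10,Science,6,94,Evening
Dev,Mumbai,9,History,3,74,Night
"""

# With a real downloaded file you would write pd.read_csv("survey.csv")
# instead ("survey.csv" is a hypothetical filename).
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (6, 7): 6 rows, 7 columns
print(sample.head())  # first 5 rows
```

Reading from a text string keeps the example self-contained. A CSV downloaded from a form would load the same way once you pass its path to pd.read_csv.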
Part 1
Key pandas Operations
df[df['marks'] > 80]
Filter rows — keep only students who scored above 80.
df.groupby('subject')['marks'].mean()
Group by subject and compute the average marks per subject.
df['subject'].value_counts()
Count how many students picked each subject as favourite.
df.sort_values('marks', ascending=False)
Sort the DataFrame from highest to lowest marks.
df[['name','marks']].head(5)
Select only specific columns and show the first 5 rows.
df.describe()
Summary statistics: count, mean, std, min, max, quartiles for all numeric columns.
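
The operations above can be tried on the six sample rows from the table. This is a small sketch, not the full survey: the DataFrame name sample and the reduced set of columns are choices made here for illustration.

```python
import pandas as pd

# Six sample rows from the lesson's table (only the columns needed here)
sample = pd.DataFrame({
    "name": ["Ananya", "Rohit", "Priya", "Arjun", "Meera", "Dev"],
    "subject": ["Maths", "Science", "English", "Maths", "Science", "History"],
    "study_hours": [4, 2, 5, 1.5, 6, 3],
    "marks": [85, 62, 91, 55, 94, 74],
})

# Filter: keep only rows where marks are above 80
top = sample[sample["marks"] > 80]
print(top["name"].tolist())        # ['Ananya', 'Priya', 'Meera']
print(len(sample))                 # 6: filtering does not delete anything

# Group: average marks per subject
by_subject = sample.groupby("subject")["marks"].mean()
print(by_subject)

# Count: how many students picked each subject
counts = sample["subject"].value_counts()
print(counts)

# Sort, then select two columns and show the top 3
ranked = sample.sort_values("marks", ascending=False)
print(ranked[["name", "marks"]].head(3))

# Summary statistics for the numeric columns
print(sample.describe())
```

Note that the filter returns a new DataFrame and leaves sample unchanged, which is why len(sample) is still 6.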
Visual
Average Marks by Favourite Subject (Preview)
[Bar chart of average marks: Maths 78.4, Science 82.1, English 86.3, History 71.2, Hindi 74.8]

This chart = one line of pandas + one line of matplotlib. Science and English favourites score highest overall.
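
As a rough sketch of that "one line plus one line" idea, the code below groups the six sample rows by subject and draws a bar chart. Because it uses only six rows, the averages will not match the preview figures, and the output filename is a made-up example.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Six sample rows from the lesson's table (subject and marks only)
sample = pd.DataFrame({
    "subject": ["Maths", "Science", "English", "Maths", "Science", "History"],
    "marks": [85, 62, 91, 55, 94, 74],
})

# The pandas line: average marks for each favourite subject
avg = sample.groupby("subject")["marks"].mean()

# The plotting line: pandas hands the Series to matplotlib to draw bars
ax = avg.plot(kind="bar", title="Average Marks by Favourite Subject")

# Extra lines for labels and saving (hypothetical filename)
ax.set_ylabel("Average Marks")
plt.tight_layout()
plt.savefig("avg_marks_preview.png")
```

In Google Colab you can end with plt.show() instead of saving the file.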

Part 2
Full Analysis in Python
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Create the survey dataset
np.random.seed(42)
n = 120
subjects = ['Maths','Science','English','History','Hindi']
cities = ['Jaipur','Pune','Delhi','Chennai','Bengaluru','Mumbai','Hyderabad']
times = ['Morning','Evening','Night']

df = pd.DataFrame({
    'class': np.random.choice([9,10], n),
    'subject': np.random.choice(subjects, n),
    'city': np.random.choice(cities, n),
    'study_hours': np.round(np.random.uniform(1, 7, n), 1),
    'study_time': np.random.choice(times, n, p=[0.3, 0.5, 0.2]),
})
# Marks depend on study hours + evening boost
df['marks'] = (40 + df['study_hours'] * 7
               + (df['study_time'] == 'Evening') * 8
               + np.random.normal(0, 6, n)).clip(40, 100).round(1)

# ── Basic exploration ──
print(df.shape)        # (120, 6): 120 rows, 6 columns (5 survey columns + marks)
print(df.describe())   # Stats for all numeric columns

# ── Filter: students who scored above 85 ──
high_scorers = df[df['marks'] > 85]
print(f"\nHigh scorers (>85): {len(high_scorers)} students")

# ── groupby: average marks per subject ──
subject_avg = df.groupby('subject')['marks'].mean().sort_values(ascending=False)
print("\nAverage marks by subject:")
print(subject_avg.round(1))

# ── value_counts: subject popularity ──
print("\nFavourite subject counts:")
print(df['subject'].value_counts())

# ── groupby: study time analysis ──
time_avg = df.groupby('study_time')['marks'].mean().sort_values(ascending=False)
print("\nAverage marks by study time:")
print(time_avg.round(1))

# ── Plotting ──
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
fig.suptitle("Student Survey Analysis", fontsize=14, fontweight='bold')

# Bar chart — avg marks by subject
subject_avg.plot(kind='bar', ax=axes[0], color='#f97316', edgecolor='white')
axes[0].set_title('Avg Marks by Subject')
axes[0].set_ylabel('Average Marks')
axes[0].tick_params(axis='x', rotation=30)

# Pie chart — subject popularity
df['subject'].value_counts().plot(kind='pie', ax=axes[1],
    autopct='%1.0f%%', colors=['#f97316','#ea580c','#c2410c','#9a3412','#7c2d12'])
axes[1].set_title('Subject Popularity')
axes[1].set_ylabel('')

# Scatter plot with trend line: study hours vs marks
axes[2].scatter(df['study_hours'], df['marks'], alpha=0.4, color='#f97316', s=20)
m, b = np.polyfit(df['study_hours'], df['marks'], 1)
x_line = np.linspace(df['study_hours'].min(), df['study_hours'].max(), 100)
axes[2].plot(x_line, m * x_line + b, color='#7c2d12', linewidth=2)
axes[2].set_title('Study Hours vs Marks')
axes[2].set_xlabel('Study Hours')
axes[2].set_ylabel('Marks')

plt.tight_layout()
plt.savefig('student_survey.png', dpi=150, bbox_inches='tight')
plt.show()
print("Chart saved as student_survey.png")
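
The trend line in the third chart comes from np.polyfit(x, y, 1), which returns the slope and intercept of the best-fit straight line. The small sketch below uses four made-up points that sit exactly on the line marks = 40 + 7 × hours (the same rule the lesson uses to generate its data, without the random noise), so you can see polyfit recover those two numbers.

```python
import numpy as np

# Made-up points that lie exactly on marks = 40 + 7 * hours
hours = np.array([1.0, 2.0, 3.0, 4.0])
marks = 40 + 7 * hours            # 47, 54, 61, 68

# Degree 1 means "fit a straight line": returns slope m and intercept b
m, b = np.polyfit(hours, marks, 1)
print(f"slope = {m:.2f}, intercept = {b:.2f}")   # slope = 7.00, intercept = 40.00

# Use the line to estimate marks for 5 study hours
predicted = m * 5 + b
print(f"predicted marks for 5 hours = {predicted:.1f}")   # 75.0
```

With real survey data the points are scattered, so the slope and intercept are estimates, not exact values.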
Part 3
Choosing the Right Chart Type
Golden rule: Always label your axes. Always add a title. A chart without labels is as useless as data without column names.
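
Here is a minimal sketch of the golden rule in practice: a line chart with a title and both axes labelled. The monthly scores are invented for illustration and the filename is hypothetical.

```python
import matplotlib.pyplot as plt

# Made-up monthly test scores, for illustration only
months = ["Apr", "May", "Jun", "Jul", "Aug"]
scores = [62, 65, 71, 70, 78]

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(months, scores, marker="o")   # line chart: a trend over time

# The golden rule: a title and a label on each axis
ax.set_title("Test Scores by Month")
ax.set_xlabel("Month")
ax.set_ylabel("Score (%)")

fig.tight_layout()
fig.savefig("labelled_chart.png", dpi=100)   # hypothetical filename
```

Try commenting out the three label lines and look at the chart again. It becomes much harder to tell what the numbers mean.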

🧪 Check Your Understanding — Lesson 9 Quiz

1. What does df[df['marks'] > 80] do?
a) Deletes all rows where marks are above 80
b) Selects only the 'marks' column
c) Returns a new DataFrame containing only rows where marks > 80
d) Sorts the DataFrame by marks descending
2. You want to find the average study hours for Class 9 students separately from Class 10. Which pandas method do you use?
a) df.filter('class')
b) df.groupby('class')['study_hours'].mean()
c) df['study_hours'].value_counts()
d) df.sort_values('study_hours')
3. df['subject'].value_counts() returns:
a) The unique subjects in alphabetical order
b) The number of rows with missing subject values
c) How many students chose each subject as their favourite, sorted by count
d) The average marks for each subject
4. Which chart type is best for showing how test scores have changed month by month over a school year?
a) Pie chart
b) Scatter plot
c) Line chart — it shows trends over continuous time
d) Bar chart
5. In matplotlib, plt.savefig('chart.png') does what?
a) Displays the chart on screen
b) Loads a chart from a file
c) Saves the current figure as a PNG image file named chart.png
d) Creates a new empty figure
6. You want to show the proportion of students from each city (as %). Which chart is best?
a) Scatter plot
b) Line chart
c) Histogram
d) Pie chart — it shows how a total is split into parts
7. df.describe() gives you:
a) The data types of each column
b) Count, mean, std, min, max, and quartiles for all numeric columns
c) The first 5 rows of the DataFrame
d) A list of column names
8. np.polyfit(x, y, 1) in the code is used to:
a) Normalise the data between 0 and 1
b) Find the polynomial degree of the dataset
c) Calculate the slope and intercept of the best-fit line through the scatter plot points
d) Create a random polynomial for visual decoration