Meet Priya - Class 9, Pune
Priya's science teacher showed the class a government website called data.gov.in. "This is where real data lives," he said. Priya clicked on a dataset about rainfall in Maharashtra. It had 10,000 rows and columns called "district", "year", "rainfall_mm" and "season". She opened it in Excel and stared at it.
"But how does AI read this? What does this mean to a model?" she asked. Today we'll answer exactly that, and by the end you'll be able to load any CSV file, understand its structure, and describe what kind of AI problem it could solve.
Part 1
What Is a Dataset?
A dataset is a collection of examples, all in the same format, that an AI model learns from. Think of it like a very organised notebook: each page is one example (a row), and every page has the same columns (features).
Row
One example/observation. One student, one day of weather, one product review. Also called a sample or record.
Column (Feature)
One attribute measured for every row. "Height", "Marks", "City". Each column = one piece of information.
Label (Target)
The answer we want AI to predict. "Pass/Fail", "Spam/Not Spam", "Price". One special column: the output.
Shape
Rows × Columns. A dataset with 1,000 students and 5 features has shape (1000, 5). Written as n × m.
Part 2
What a Dataset Looks Like
Here is a tiny example: a class of 6 students with marks, attendance and outcome:
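| student_id | maths_marks | attendance_% | study_hours | passed |
|---|---|---|---|---|
| S001 | 72 | 91 | 5 | Yes |
| S002 | 45 | 62 | 2 | No |
| S003 | 88 | 95 | 7 | Yes |
| S004 | 55 | 74 | 3 | No |
| S005 | 93 | 98 | 8 | Yes |
| S006 | 38 | 55 | 1 | No |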
Features (X): maths_marks, attendance_%, study_hours - these are the inputs to the model
Label (y): passed - this is what we want the model to predict
Shape: (6, 4) - 6 rows and 4 columns (3 features plus the label). student_id only identifies each row, so we leave it out; with it included, the table has 5 columns.
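The table above can be rebuilt in pandas to check these shapes and to separate the features from the label. This is a minimal sketch: the names `students`, `feature_columns`, `X` and `y` are our own choices, and it assumes pandas is installed (it is by default in Colab).

```python
import pandas as pd

# The six students from the table above
students = pd.DataFrame({
    "student_id":   ["S001", "S002", "S003", "S004", "S005", "S006"],
    "maths_marks":  [72, 45, 88, 55, 93, 38],
    "attendance_%": [91, 62, 95, 74, 98, 55],
    "study_hours":  [5, 2, 7, 3, 8, 1],
    "passed":       ["Yes", "No", "Yes", "No", "Yes", "No"],
})

# student_id only identifies a row, so it is not an input to the model
feature_columns = ["maths_marks", "attendance_%", "study_hours"]
X = students[feature_columns]   # features (inputs)
y = students["passed"]          # label (what we want to predict)

print("Full table:", students.shape)                                  # (6, 5)
print("Without student_id:", students.drop(columns="student_id").shape)  # (6, 4)
print("Features X:", X.shape)                                         # (6, 3)
print("Label y:", y.shape)                                            # (6,)
```

Keeping X and y as separate variables is a common habit because most model-training functions expect the inputs and the answer to be passed in separately.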
Real datasets can be huge. MNIST (handwritten digits) has 70,000 rows and 784 columns (one per pixel). ImageNet has over 14 million rows (images). The Maharashtra rainfall dataset on data.gov.in has over 10,000 rows.
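To see where 784 comes from: each MNIST image is 28 pixels wide and 28 pixels tall, and 28 × 28 = 784. The sketch below uses a blank placeholder image (all zeros), not a real MNIST digit, just to show how a picture can become one row of a table. It assumes NumPy is installed.

```python
import numpy as np

# A blank 28 x 28 grayscale image. Placeholder only, not real MNIST data.
image = np.zeros((28, 28), dtype=np.uint8)

# Flatten the grid into one long row: one column per pixel
row = image.reshape(-1)

print(image.shape)  # (28, 28)
print(row.shape)    # (784,)

# Stacking 3 such rows gives a tiny dataset with 3 rows and 784 columns
tiny_dataset = np.stack([row, row, row])
print(tiny_dataset.shape)  # (3, 784)
```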
Part 3
Types of Data in Columns
Not all columns contain the same type of information. Knowing the type helps you decide how to process the column:
Numerical (Continuous)
Numbers that can take any value. Height, temperature, marks, rainfall in mm. Can be added, subtracted, averaged.
Categorical
Fixed set of categories. City name, subject, gender, district. Cannot average "Pune + Mumbai".
Ordinal
Ordered categories. Rating 1 to 5, grade A/B/C. Order matters but gap between values may not be equal.
Binary
Only two values. Yes/No, True/False, 1/0, Pass/Fail. The simplest label for classification.
Date/Time
Dates and times. Can extract day, month, year, season as new columns for the model.
Text (Unstructured)
Free-form sentences. Reviews, tweets, complaints. Needs special processing (NLP) before a model can use it.
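These column types map onto pandas fairly directly. The sketch below uses a small made-up table (every value is invented purely for illustration) and shows one typical step for each type. The column names and the grade order C < B < A are assumptions for this example.

```python
import pandas as pd

# A tiny made-up table with one column of each type (values invented for illustration)
df = pd.DataFrame({
    "rainfall_mm": [812.5, 640.0, 1023.4],                      # numerical
    "district":    ["Pune", "Mumbai", "Nagpur"],                # categorical
    "grade":       ["B", "A", "C"],                             # ordinal
    "passed":      ["Yes", "No", "Yes"],                        # binary
    "date":        ["2023-06-15", "2023-07-01", "2023-08-20"],  # date/time
    "review":      ["Good class", "Too fast", "Loved it"],      # text
})

# Numerical: can be averaged
print(df["rainfall_mm"].mean())

# Categorical: count each category instead of averaging
print(df["district"].value_counts())

# Ordinal: tell pandas the order of the categories (here C < B < A)
df["grade"] = pd.Categorical(df["grade"], categories=["C", "B", "A"], ordered=True)
print(df["grade"].min())   # lowest grade in the column

# Binary: turn Yes/No into 1/0 so a model can use it
df["passed_num"] = df["passed"].map({"Yes": 1, "No": 0})

# Date/Time: convert text to real dates, then pull out the month as a new column
df["date"] = pd.to_datetime(df["date"])
df["month"] = df["date"].dt.month

# The text column stays as plain strings; it would need NLP processing later
print(df.dtypes)
```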
Part 4
Structured vs Unstructured Data
The two biggest categories in AI data work are:
Structured
Rows and columns, clear format. CSV files, Excel sheets, SQL databases. Pandas loads this easily. Most ML models prefer structured data.
Unstructured
Photos, audio files, raw text, videos. No rows/columns. Needs Deep Learning (CNNs for images, NLP for text) to extract patterns.
By common estimates, around 80% of the world's data is unstructured: photos on Instagram, WhatsApp voice messages, YouTube comments. The AI models that work with unstructured data (like ChatGPT and Gemini) are far more complex than the structured-data models we'll build in Class 9.
Part 5
Finding Real Indian Datasets
You don't have to make up data. India has excellent free public datasets:
data.gov.in - Government of India's open data portal. Agriculture, health, education, weather, transport: thousands of datasets in CSV format.
kaggle.com/datasets - Global platform with datasets on cricket scores, Bollywood ratings, Indian food, stock markets, and more. Free to download.
NITI Aayog SDG India Index - State-level data on health, education, environment.
UCI Machine Learning Repository - Cleaned datasets perfect for practice (iris flowers, wine quality, car prices).
seaborn.load_dataset() - Sample datasets provided through the Python seaborn library. Available instantly in Colab: seaborn fetches them for you over the internet, so no manual download is needed.
Quick Start in Colab: Type import seaborn as sns; df = sns.load_dataset('tips') to instantly get a 244-row restaurant tips dataset, with no manual download needed!
Part 6
Loading and Inspecting a Dataset with pandas
pandas is Python's most popular library for working with structured data. It stores data in a DataFrame, which is essentially a programmable table. Here's how to load and explore a CSV file:
# In Google Colab - run each cell one at a time
import pandas as pd
import seaborn as sns
# Option 1: Use one of seaborn's sample datasets (fetched for you, no manual download)
df = sns.load_dataset('titanic')
# Option 2: Load from a CSV URL
# (this file may use different column names from the seaborn version)
# df = pd.read_csv('https://raw.githubusercontent.com/dsrscientist/dataset1/master/titanic.csv')
# --- EXPLORING YOUR DATASET ---
# How many rows and columns?
print("Shape:", df.shape)  # Output: Shape: (891, 15)
# Show the first 5 rows
print(df.head())
# Show last 5 rows
print(df.tail())
# What columns exist and what type is each?
print(df.dtypes)
# Basic statistics: min, max, mean, std for numeric columns
print(df.describe())
# How many missing values per column?
print(df.isnull().sum())
# How many rows fall into each value of the 'class' column?
print(df['class'].value_counts())
Run this in Colab now. Copy the code above, go to colab.research.google.com, create a new notebook, paste it into a cell and press Shift+Enter. You just loaded a real dataset and explored it!
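The same inspection steps work on any CSV. The sketch below builds a four-row CSV in memory using the column names from Priya's rainfall file. The rainfall values are invented for illustration and are not taken from data.gov.in. With a real file you would pass its path or URL to pd.read_csv instead of the in-memory text.

```python
import io
import pandas as pd

# A tiny CSV typed as text. Column names match Priya's file; the values are made up.
csv_text = """district,year,rainfall_mm,season
Pune,2021,710.5,Monsoon
Pune,2022,,Monsoon
Nagpur,2021,980.0,Monsoon
Nagpur,2022,1015.2,Monsoon
"""

# io.StringIO lets read_csv treat the text like a file
rain = pd.read_csv(io.StringIO(csv_text))

print(rain.shape)                        # (4, 4)
print(rain.dtypes)                       # type of each column
print(rain.isnull().sum())               # rainfall_mm has 1 missing value
print(rain["district"].value_counts())   # 2 rows each for Pune and Nagpur
```

The empty rainfall_mm cell in the second data row is deliberate: it shows how a gap in the file turns up as a missing value in isnull().sum().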
Part 7
Reading the .describe() Output
When you run df.describe(), pandas shows you a summary table for numeric columns. Here's what each row means:
| Statistic | What It Means | Example (Age column) |
|---|---|---|
| count | How many non-missing values | 714 (177 missing) |
| mean | Average value | 29.7 years |
| std | Standard deviation (spread) | 14.5 years: most ages lie within about 14.5 years of the mean |
| min | Smallest value in column | 0.42 years (infant) |
| 25% | 25% of values are below this | 20 years |
| 50% | Median (middle value) | 28 years |
| 75% | 75% of values are below this | 38 years |
| max | Largest value in column | 80 years |
Check min and max carefully. If a "marks" column has a minimum of -5 or a maximum of 150 (out of 100), there are data errors. describe() helps you catch these before training.
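One way to act on this check is to filter for values outside the allowed range. The sketch below uses a made-up marks column with two deliberate errors. The 0 to 100 range is an assumption about how the marks are scored, and the names `marks`, `summary` and `bad_rows` are our own.

```python
import pandas as pd

# Made-up marks out of 100, with two deliberate errors (-5 and 150)
marks = pd.DataFrame({
    "student_id": ["S001", "S002", "S003", "S004", "S005"],
    "marks":      [72, -5, 88, 150, 55],
})

# describe() reveals the problem: min is -5 and max is 150
summary = marks["marks"].describe()
print(summary[["min", "max"]])

# Select the rows whose marks fall outside 0..100
bad_rows = marks[(marks["marks"] < 0) | (marks["marks"] > 100)]
print(bad_rows)   # the rows for S002 and S004
```

What to do with the bad rows (fix them, remove them, or ask whoever collected the data) depends on the dataset; the code only finds them.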
Check Your Understanding - Lesson 2 Quiz
1. In a dataset, each row represents:
a) One column of measurements
b) One feature/attribute
c) One example/observation (e.g., one student, one day of weather)
d) The label we want to predict
2. In supervised machine learning, the "label" column is:
a) The ID column that uniquely identifies each row
b) The output the model should learn to predict
c) The first column in the dataset
d) Any numerical column
3. A dataset with 500 students and 6 feature columns has what shape?
a) (6, 500)
b) (3000, 1)
c) (500, 6)
d) (500, 5)
4. Which of these is "unstructured" data?
a) A CSV with student marks
b) A Google Sheets rainfall table
c) WhatsApp voice messages
d) An Excel file with product prices
5. Which pandas function shows min, max, mean, and standard deviation for all numeric columns at once?
a) df.info()
b) df.head()
c) df.dtypes
d) df.describe()
6. You have a column "rating" with values: 1, 2, 3, 4, 5. What data type is this?
a) Unstructured
b) Ordinal (ordered categories)
c) Binary
d) Text
7. Which free resource would you use to find government datasets about rainfall in Maharashtra?