Lesson 2 — Your First Dataset | Class 9

Meet Priya — Class 9, Pune

Priya's science teacher showed the class a government website called data.gov.in. "This is where real data lives," he said. Priya clicked on a dataset about rainfall in Maharashtra. It had 10,000 rows and columns called "district", "year", "rainfall_mm" and "season". She opened it in Excel and stared at it.

"But how does AI read this? What does this mean to a model?" she asked. Today we'll answer exactly that — and by the end you'll be able to load any CSV file, understand its structure, and describe what kind of AI problem it could solve.

Part 1

What Is a Dataset?

A dataset is a collection of examples, all in the same format, that an AI model learns from. Think of it like a very organised notebook: each page is one example (a row), and every page has the same columns (features).

📋

Row

One example/observation. One student, one day of weather, one product review. Also called a sample or record.

📏

Column (Feature)

One attribute measured for every row. "Height", "Marks", "City". Each column = one piece of information.

🏷️

Label (Target)

The answer we want AI to predict. "Pass/Fail", "Spam/Not Spam", "Price". One special column — the output.

📐

Shape

Rows × Columns. A dataset with 1,000 students and 5 features has shape (1000, 5). Written as n × m.

Part 2

What a Dataset Looks Like

Here is a tiny example: a class of 6 students with maths marks, attendance, study hours and outcome:

student_id	maths_marks	attendance_%	study_hours	passed
S001	72	91	5	Yes
S002	45	62	2	No
S003	88	95	7	Yes
S004	55	74	3	No
S005	93	98	8	Yes
S006	38	55	1	No

Features (X): maths_marks, attendance_%, study_hours — these are the inputs to the model
Label (y): passed — this is what we want the model to predict
Shape: the full table is (6, 5), that is 6 rows and 5 columns. Leaving out student_id, which is an identifier and not a feature, gives (6, 4): 3 features plus 1 label.
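As a rough sketch, the table above can be rebuilt in pandas and split into features and label. The names students, X and y are just the usual convention, and which columns count as features is a choice made for this example.

```python
import pandas as pd

# The six-student table from this lesson, typed in by hand
students = pd.DataFrame({
    "student_id": ["S001", "S002", "S003", "S004", "S005", "S006"],
    "maths_marks": [72, 45, 88, 55, 93, 38],
    "attendance_%": [91, 62, 95, 74, 98, 55],
    "study_hours": [5, 2, 7, 3, 8, 1],
    "passed": ["Yes", "No", "Yes", "No", "Yes", "No"],
})

# Features (X): the input columns. student_id is only a name tag, so we leave it out
X = students[["maths_marks", "attendance_%", "study_hours"]]

# Label (y): the single column we want the model to predict
y = students["passed"]

print("Full table:", students.shape)   # 6 rows, 5 columns
print("Features X:", X.shape)          # 6 rows, 3 columns
print("Label y:", y.shape)             # 6 values
```

Keeping X and y as separate variables is how most machine learning libraries expect to receive the data.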

Real datasets can be huge. MNIST (handwritten digits) has 70,000 rows and 784 columns (one per pixel). ImageNet has over 14 million rows (images). The Maharashtra rainfall dataset on data.gov.in has over 10,000 rows.
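Where do 784 columns come from? Each MNIST image is a 28 × 28 grid of pixels, and 28 × 28 = 784. The sketch below uses blank placeholder images (all zeros, not real MNIST digits) to show how a grid of pixels becomes one row of a table.

```python
import numpy as np

# A blank 28 x 28 grayscale image standing in for one handwritten digit
# (all zeros here; a real digit would have pixel values from 0 to 255)
image = np.zeros((28, 28), dtype=np.uint8)

# Flatten the grid into one long row: one column per pixel
row = image.reshape(-1)

print(image.shape)  # (28, 28)
print(row.shape)    # (784,)

# Stacking many such rows gives the dataset table: here, 5 placeholder images
dataset = np.zeros((5, 28 * 28), dtype=np.uint8)
print(dataset.shape)  # (5, 784)
```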

Part 3

Types of Data in Columns

Not all columns contain the same type of information. Knowing the type helps you decide how to process the column:

🔢

Numerical (Continuous)

Numbers that can take any value. Height, temperature, marks, rainfall in mm. Can be added, subtracted, averaged.

🏷️

Categorical

Fixed set of categories. City name, subject, gender, district. Cannot average "Pune + Mumbai".

⚖️

Ordinal

Ordered categories. Rating 1–5, grade A/B/C. Order matters but gap between values may not be equal.

✅

Binary

Only two values. Yes/No, True/False, 1/0, Pass/Fail. The simplest label for classification.

📅

Date/Time

Dates and times. Can extract day, month, year, season as new columns for the model.

📝

Text (Unstructured)

Free-form sentences. Reviews, tweets, complaints. Needs special processing (NLP) before a model can use it.
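To see how these types might look in pandas, here is a small sketch. The table is made up for illustration (the values are not real measurements), with one column for each type described above.

```python
import pandas as pd

# A made-up mini table with one column per data type from this section
df = pd.DataFrame({
    "rainfall_mm": [812.5, 430.0, 1200.3],               # numerical
    "district": ["Pune", "Nagpur", "Mumbai"],            # categorical
    "grade": ["B", "A", "C"],                            # ordinal
    "passed": [True, True, False],                       # binary
    "date": ["2023-06-15", "2023-07-01", "2023-08-20"],  # date/time (still text here)
    "review": ["Good monsoon", "Very dry", "Floods"],    # free text
})

# Tell pandas that grade has an order: C is lowest, A is highest
df["grade"] = pd.Categorical(df["grade"], categories=["C", "B", "A"], ordered=True)

# Convert the date text into real dates, then pull out the month as a new column
df["date"] = pd.to_datetime(df["date"])
df["month"] = df["date"].dt.month

print(df.dtypes)
print(df["rainfall_mm"].mean())   # averaging makes sense for numbers
print(df["grade"].min())          # lowest grade, using the order we set
print(df["month"].tolist())       # [6, 7, 8]
```

Notice that pandas does not know a column is ordinal or a date until you tell it. Checking df.dtypes after loading is a good habit.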

Part 4

Structured vs Unstructured Data

The two biggest categories in AI data work are:

📋

Structured

Rows and columns, clear format. CSV files, Excel sheets, SQL databases. Pandas loads this easily. Most ML models prefer structured data.

🌐

Unstructured

Photos, audio files, raw text, videos. No rows/columns. Usually needs deep learning (for example CNNs for images, or NLP techniques for text) to extract patterns.

A commonly quoted estimate is that about 80% of the world's data is unstructured: photos on Instagram, WhatsApp voice messages, YouTube comments. The AI models that work with unstructured data (like ChatGPT and Gemini) are far more complex than the structured-data models we'll build in Class 9.

Part 5

Finding Real Indian Datasets

You don't have to make up data. India has excellent free public datasets:

data.gov.in — Government of India's open data portal. Agriculture, health, education, weather, transport — thousands of datasets in CSV format.
kaggle.com/datasets — Global platform with datasets on cricket scores, Bollywood ratings, Indian food, stock markets, and more. Free to download.
NITI Aayog SDG India Index — State-level data on health, education, environment.
UCI Machine Learning Repository — Cleaned datasets perfect for practice (iris flowers, wine quality, car prices).
seaborn.load_dataset() — Sample datasets that come with the Python seaborn library. One line of code fetches them over the internet in Colab, so there is no file to download by hand.

Quick Start in Colab: Type import seaborn as sns; df = sns.load_dataset('tips') to get a 244-row restaurant tips dataset straight away. You need an internet connection, but there is no file to download by hand.

Part 6

Loading and Inspecting a Dataset with pandas

pandas is Python's most popular library for working with structured data. It stores data in a DataFrame — essentially a programmable table. Here's how to load and explore a CSV file:

# In Google Colab — run each cell one at a time

import pandas as pd
import seaborn as sns

# Option 1: Use a seaborn sample dataset (fetched online, no manual download)
df = sns.load_dataset('titanic')

# Option 2: Load from a CSV URL (column names may differ from the seaborn version)
# df = pd.read_csv('https://raw.githubusercontent.com/dsrscientist/dataset1/master/titanic.csv')

# --- EXPLORING YOUR DATASET ---

# How many rows and columns?
print("Shape:", df.shape)           # Output: Shape: (891, 15)

# Show the first 5 rows
print(df.head())

# Show last 5 rows
print(df.tail())

# What columns exist and what type is each?
print(df.dtypes)

# Basic statistics: min, max, mean, std for numeric columns
print(df.describe())

# How many missing values per column?
print(df.isnull().sum())

# How many rows fall into each category of the 'class' column?
print(df['class'].value_counts())

Run this in Colab now. Copy the code above, go to colab.research.google.com, create a new notebook, paste it into a cell and press Shift+Enter. You just loaded a real dataset and explored it!
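The same steps work on any CSV file. The sketch below builds a tiny CSV as text so it runs without any download. The column names match Priya's rainfall file, but the values are invented for illustration and are not real measurements.

```python
import io
import pandas as pd

# A tiny CSV written as text. Column names follow Priya's rainfall file;
# the values are invented for illustration, not real measurements.
csv_text = """district,year,rainfall_mm,season
Pune,2020,722.4,Monsoon
Nagpur,2020,,Monsoon
Mumbai,2021,2205.1,Monsoon
Pune,2021,35.0,Winter
"""

# pd.read_csv accepts a file path, a URL, or a file-like object such as StringIO
rain = pd.read_csv(io.StringIO(csv_text))

print(rain.shape)            # (4, 4)
print(rain.isnull().sum())   # rainfall_mm has 1 missing value
rain.info()                  # column names, non-null counts and types in one view
```

With a real file, you would pass its path or URL to pd.read_csv in place of io.StringIO(csv_text).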

Part 7

Reading the .describe() Output

When you run df.describe(), pandas shows you a summary table for numeric columns. Here's what each row means:

Statistic	What It Means	Example (Age column)
count	How many non-missing values	714 (177 missing)
mean	Average value	29.7 years
std	Standard deviation (spread)	14.5 (a typical age is roughly 14.5 years away from the mean)
min	Smallest value in column	0.42 years (infant)
25%	25% of values are below this	20 years
50%	Median (middle value)	28 years
75%	75% of values are below this	38 years
max	Largest value in column	80 years
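To see where these numbers come from, here is a small sketch that runs describe() on the maths_marks of the six students from Part 2, then computes a few of the same statistics one at a time.

```python
import pandas as pd

# maths_marks for the six students in the table from Part 2
marks = pd.Series([72, 45, 88, 55, 93, 38], name="maths_marks")

summary = marks.describe()
print(summary)

# The same numbers computed one at a time, to show where each row comes from
print("count:", marks.count())       # 6
print("mean:", marks.mean())         # 391 / 6, about 65.17
print("min:", marks.min())           # 38
print("median:", marks.median())     # 63.5, halfway between 55 and 72
print("max:", marks.max())           # 93
```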

Check min and max carefully. If a "marks" column has a minimum of -5 or a maximum of 150 (out of 100), there are data errors. describe() helps you catch these before training.
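One possible way to find the faulty rows is sketched below. The marks are invented, with two deliberate mistakes, and the valid range of 0 to 100 is an assumption for this example.

```python
import pandas as pd

# Invented marks with two deliberate mistakes: -5 and 150
marks = pd.DataFrame({
    "student_id": ["S001", "S002", "S003", "S004"],
    "maths_marks": [72, -5, 150, 55],
})

# describe() reveals the problem: min is below 0 and max is above 100
print(marks["maths_marks"].describe())

# Keep only the rows whose marks fall outside the valid 0 to 100 range
bad_rows = marks[(marks["maths_marks"] < 0) | (marks["maths_marks"] > 100)]
print(bad_rows)
```

Here bad_rows contains students S002 and S003, the two rows that need checking before any model is trained.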

🧪 Check Your Understanding — Lesson 2 Quiz

1. In a dataset, each row represents:

a) One column of measurements

b) One feature/attribute

c) One example/observation (e.g., one student, one day of weather)

d) The label we want to predict

2. In supervised machine learning, the "label" column is:

a) The ID column that uniquely identifies each row

b) The output the model should learn to predict

c) The first column in the dataset

d) Any numerical column

3. A dataset with 500 students and 6 feature columns has what shape?

a) (6, 500)

b) (3000, 1)

c) (500, 6)

d) (500, 5)

4. Which of these is "unstructured" data?

a) A CSV with student marks

b) A Google Sheets rainfall table

c) WhatsApp voice messages

d) An Excel file with product prices

5. Which pandas function shows min, max, mean, and standard deviation for all numeric columns at once?

a) df.info()

b) df.head()

c) df.dtypes

d) df.describe()

6. You have a column "rating" with values: 1, 2, 3, 4, 5. What data type is this?

a) Unstructured

b) Ordinal (ordered categories)

c) Binary

d) Text

7. Which free resource would you use to find government datasets about rainfall in Maharashtra?

a) Google Drive

b) Wikipedia

c) data.gov.in

d) WhatsApp

8. df.isnull().sum() tells you:

a) The total number of rows in the DataFrame

b) How many missing values exist in each column

c) Whether the DataFrame has duplicate rows

d) The mean of all numeric columns
