Lesson 2 — Data: The Ingredient AI Needs | Class 8

Story · Rahul's School Survey Problem

The Data Was There — But Was It Useful? 🗂️

Rahul, 13, from Guntur, was doing a school project: "Find out which subject Class 8 students find most difficult and why." He surveyed 50 students and collected their answers in a notebook.

Then he tried to make sense of it. Some students had written "Maths" — others had written "math", "MATHS", "mathematics", and "maths (hate it!)". Some had left the column blank. One had written "all of them." A few had given two answers. And three forms had been filled in by the same person using different names.

Rahul had data — but the data was messy. Before he could find any patterns, he had to clean it: standardise the spelling, handle blanks, remove duplicates, and decide what to do with unusual entries.

His teacher nodded. "Rahul has just discovered the most important lesson in data science: collecting data is easy. Having clean, usable data is hard. And AI has the same problem — at a much larger scale."

👉 This lesson explains what data really is, how it is structured, what makes it clean or messy, and why the phrase "garbage in, garbage out" is the most important warning in all of AI.

Section 1 of 7

🗃️ What Is a Dataset?

A dataset is an organised collection of related information. It is the raw material that an AI model learns from.

Think of a dataset like a very large table. Each row is one example — called a record or instance. Each column is one property — called a feature or attribute. If the data is for supervised learning, one column is the label — the answer the model is trying to learn to predict.

Here is a small example — a toy dataset for predicting whether a student will pass or fail:

Student ID	Study hours/day	Attendance %	Previous score	Label (pass/fail)
S001	3.5	92	72	Pass
S002	1.0	65	44	Fail
S003	4.0	88	81	Pass
S004	? (missing)	74	58	Pass
S005	2.5	80	-99	Fail

⚠️ Two data quality issues are highlighted above: a missing value (S004 study hours) and a suspicious value (S005 previous score of -99 — probably an error).

Key vocabulary: Record (row = one example), Feature (column = one property), Label (the answer the model learns to predict), Dataset (the whole table).
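The toy table above can also be written as a short Python sketch. This is only an illustration: the names `dataset`, `FEATURES`, `LABEL`, and `find_problems` are made up for this example, and the rule that a previous score must lie between 0 and 100 is an assumption, not something the table states.

```python
# Each record (row) is a dictionary; each key is one column of the toy table.
# None stands for the missing value in S004's study hours.
dataset = [
    {"id": "S001", "study_hours": 3.5, "attendance": 92, "previous_score": 72, "label": "Pass"},
    {"id": "S002", "study_hours": 1.0, "attendance": 65, "previous_score": 44, "label": "Fail"},
    {"id": "S003", "study_hours": 4.0, "attendance": 88, "previous_score": 81, "label": "Pass"},
    {"id": "S004", "study_hours": None, "attendance": 74, "previous_score": 58, "label": "Pass"},
    {"id": "S005", "study_hours": 2.5, "attendance": 80, "previous_score": -99, "label": "Fail"},
]

FEATURES = ["study_hours", "attendance", "previous_score"]  # the input columns
LABEL = "label"  # the answer the model learns to predict


def find_problems(records):
    """Return a list of (student id, description) pairs for suspicious cells."""
    problems = []
    for record in records:
        for feature in FEATURES:
            value = record[feature]
            if value is None:
                problems.append((record["id"], f"{feature} is missing"))
            elif feature == "previous_score" and not 0 <= value <= 100:
                # Assumed rule: scores are marks out of 100.
                problems.append((record["id"], f"{feature} = {value} is out of range"))
    return problems


for student_id, description in find_problems(dataset):
    print(f"{student_id}: {description}")
```

Running this flags the same two cells as the warning above: the missing study hours for S004 and the score of -99 for S005.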

Section 2 of 7

🎨 Types of Data

Data can take many forms. Understanding the type of data helps you understand what kind of AI model can work with it.

Numerical

Numbers: temperature, exam score, age, rainfall in mm, distance in km.
Can be compared and calculated with.

Categorical

Fixed categories: subject name (Maths, Science), city, grade (A/B/C), type of crop.
Has a limited set of possible values.

Text

Free-form written language: product reviews, WhatsApp messages, news articles, student essays.
Needs special processing to use in AI.

Image

Pixels in a grid: photos of crops, X-rays, satellite images, handwritten digits.
Each image = thousands or millions of numbers.

Audio

Sound waves sampled over time: speech recordings, music, bird calls.
Used in voice AI and language translation.

Time-series

Values measured over time: daily rainfall, stock price, heartbeat, electricity usage.
Order matters — sequence has meaning.
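As a rough illustration of how some of these types might look once they are stored on a computer, here is a short Python sketch. Every value is invented for this example, and real systems use far larger and more specialised formats. Audio is left out, but it is stored in a similar way: as a long list of numbers measured many times per second.

```python
# Invented sample values, one for each of several data types described above.
exam_score = 72      # numerical: can be compared and averaged
subject = "Maths"    # categorical: one value from a fixed set
ALLOWED_SUBJECTS = {"Maths", "Science", "English", "Social Studies"}
review = "The chapter on fractions was hard to follow."  # text: free-form

# Image: a tiny 3x3 greyscale picture, where 0 is black and 255 is white.
image = [
    [0, 255, 0],
    [255, 255, 255],
    [0, 255, 0],
]

# Time-series: daily rainfall in mm, kept in the order it was measured.
rainfall_mm = [0.0, 12.5, 30.2, 4.1, 0.0]

pixel_count = sum(len(row) for row in image)
print("Is the subject a known category?", subject in ALLOWED_SUBJECTS)
print("Numbers in the tiny image:", pixel_count)
print("Words in the review:", len(review.split()))
print("Position of the wettest day:", rainfall_mm.index(max(rainfall_mm)))
```

A real photo works the same way as the tiny image, only with many more numbers: a 1000 x 1000 colour photo holds 3 million values, because each pixel stores three colour values.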

Section 3 of 7

🧹 Clean Data vs Messy Data

Real-world data is almost never perfectly clean when it is first collected. Data cleaning — fixing problems before training — is one of the most important and time-consuming parts of building any AI system.

✅ Clean data has...

Consistent formatting (dates, spellings, units)
No missing values (or clearly handled ones)
No duplicates
Sensible ranges (no age = -5, no temperature = 9999)
Labels that are correct and consistent
Representative coverage (not missing whole groups)

❌ Messy data has...

Inconsistent spelling: "Maths", "maths", "MATH"
Missing values: blank cells, "N/A", "-99"
Duplicates: same record entered twice
Outliers: age = 200, income = -1
Wrong labels: a photo of a cat labelled "dog"
Gaps: no data from rural areas, no data from women

Common data cleaning steps

Standardise format: make all city names, dates, and spellings consistent
Handle missing values: remove the row, fill the gap with the average, or mark it as unknown
Remove duplicates: check for and delete repeated entries
Fix outliers: investigate values that are impossible or extreme, then correct or remove them
Check labels: spot-check that categories are correctly assigned
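The steps above can be sketched in Python using answers like the ones in Rahul's notebook. This is a minimal sketch, not a complete cleaning tool: the student names, the `STANDARD_NAMES` mapping, and the helper `clean_subject` are all invented for this example, and the choice to set aside blank or unusual answers (rather than guess) is an assumption.

```python
from collections import Counter

# Invented raw answers in the spirit of Rahul's notebook: (student name, subject).
raw_answers = [
    ("Anil", "Maths"),
    ("Bhavna", "math"),
    ("Charan", "MATHS"),
    ("Divya", "mathematics"),
    ("Esha", "maths (hate it!)"),
    ("Farhan", ""),            # blank answer
    ("Gita", "Science"),
    ("Gita", "Science"),       # same record entered twice
    ("Hari", "all of them"),   # unusual entry
]

# Assumed mapping from spellings seen in the data to one standard name.
STANDARD_NAMES = {
    "maths": "Maths",
    "math": "Maths",
    "mathematics": "Maths",
    "science": "Science",
}


def clean_subject(text):
    """Standardise one answer; return None if it cannot be used."""
    # Drop anything in brackets, trim spaces, and ignore capital letters.
    key = text.split("(")[0].strip().lower()
    return STANDARD_NAMES.get(key)  # None for blanks and unusual entries


cleaned = []
seen = set()
for name, subject in raw_answers:
    record = (name, clean_subject(subject))
    if record in seen:
        continue  # remove exact duplicates
    seen.add(record)
    cleaned.append(record)

usable = [subject for _, subject in cleaned if subject is not None]
unknown = [name for name, subject in cleaned if subject is None]

print(Counter(usable))
print("Needs a follow-up question:", unknown)
```

After cleaning, the five different spellings all count as "Maths", the repeated record is counted once, and the blank and "all of them" answers are set aside for a follow-up. A sketch like this only catches exact duplicates. It would not catch forms filled in by the same person under different names, which still needs a human check.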

"Garbage in, garbage out" is one of the most important rules in AI and data science. A brilliant algorithm trained on dirty data will often produce worse results than a simple algorithm trained on clean data. No amount of model complexity can make up for poor data quality.

Section 4 of 7

⚖️ Representative Data — The Coverage Problem

Even clean data can be problematic if it does not represent the full picture. A dataset that works perfectly for some users might fail for others — because those users were not well-represented in the training data.

Three common coverage problems

Geographic bias: A crop disease detection AI trained only on photos from Maharashtra might not recognise diseases common in Tamil Nadu or Assam — because those regions were not in the training set.
Demographic gaps: A face recognition system trained mostly on light-skinned faces will perform worse on darker skin tones — as has been shown in multiple published studies.
Language gaps: AI language tools trained mostly on English and European languages perform worse on Telugu, Tamil, Kannada, and other Indian languages, because far less text in these languages was available for training.
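One simple way to spot a coverage problem is to count how many examples come from each group. The sketch below does this for an imaginary crop photo dataset. The photo counts, the 10% threshold `MIN_SHARE`, and the function `coverage_report` are all assumptions made for this illustration; real projects choose their own thresholds and check many more groups.

```python
from collections import Counter

# Invented example: the state each training photo came from.
photo_states = ["Maharashtra"] * 940 + ["Tamil Nadu"] * 45 + ["Assam"] * 15

MIN_SHARE = 0.10  # assumed threshold, chosen only for this illustration


def coverage_report(groups, min_share=MIN_SHARE):
    """Return {group: (count, share, is_under_represented)}."""
    counts = Counter(groups)
    total = sum(counts.values())
    report = {}
    for group, count in counts.items():
        share = count / total
        report[group] = (count, share, share < min_share)
    return report


for state, (count, share, low) in coverage_report(photo_states).items():
    flag = "  <-- under-represented" if low else ""
    print(f"{state}: {count} photos ({share:.1%}){flag}")
```

In this made-up dataset, Tamil Nadu and Assam together supply only 60 of the 1,000 photos, so both are flagged. A count like this cannot prove a model is fair, but it is a quick first warning that some groups may be poorly served.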

Why this matters for India: Many of the world's largest AI datasets were built in the US, UK, and China. Indian contexts (local names, addresses, regional accents, seasonal patterns, crop varieties, medical conditions common in India) are often underrepresented. This means AI tools can be less accurate for Indian users, even when they seem very capable in general.

What is being done

Projects like Bhashini (a government initiative for Indian language AI), AI4Bharat (IIT Madras) and several Indian startups are actively building large, diverse datasets specifically for Indian languages, voices, and contexts. This is essential work for making AI fair and useful in India.

Section 5 of 7

🔢 How Much Data Is Enough?

A question students always ask: how many examples does an AI really need?

The answer depends on the task, but here is a rough guide:

Type of task	Rough data size needed	Example
Simple classification (2–3 categories)	Hundreds to a few thousand examples	Spam filter with 2 categories
Image recognition (many categories)	Thousands to millions of images in total	ImageNet: 14 million labelled images
Language model (like GPT)	Hundreds of billions of text tokens	GPT-3: drawn from a corpus of roughly 500 billion tokens
Medical diagnosis AI	Tens of thousands of verified cases	Diabetic retinopathy detection: 128,000 images

The good news: You do not always need to start from scratch. Transfer learning lets you take a model already trained on a huge dataset (such as an image recognition model) and fine-tune it for your specific task with a much smaller dataset. This is how Indian language tools are often built: start from a large multilingual model, then fine-tune it with Indian language data.

Section 6 of 7

🔏 Data Ethics — Who Owns It and What Happens to It?

Every AI system was built using data collected from somewhere — and from someone. Before finishing this lesson, it is important to think about the ethics of data.

Three important questions about any dataset

Consent: Did the people whose data was used agree to it being used to train AI? Many early datasets scraped public websites without asking anyone.
Privacy: Does the dataset contain personal information? Even "anonymised" data can sometimes be re-identified by combining it with other data sources.
Purpose: Was the data collected for one purpose (e.g. hospital records) but is now being used for something else (e.g. insurance risk scoring)? This is called "repurposing" and raises ethical questions.
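The privacy point above, that "anonymised" data can sometimes be re-identified, is easier to see with a small example. The Python sketch below uses entirely invented data: a survey with the names removed, and a separate public list that still has names. The column names, the placeholder students, and the function `reidentify` are assumptions made only for this illustration.

```python
# Invented data for illustration only.
# "Anonymised" survey: names removed, but other details kept.
anonymised_survey = [
    {"class": "8A", "birth_month": "March", "town": "Tenali", "hardest_subject": "Maths"},
    {"class": "8B", "birth_month": "July", "town": "Guntur", "hardest_subject": "Science"},
]

# A separate public list (for example, a sports-day notice) that includes names.
public_list = [
    {"name": "Student X", "class": "8A", "birth_month": "March", "town": "Tenali"},
    {"name": "Student Y", "class": "8A", "birth_month": "May", "town": "Guntur"},
    {"name": "Student Z", "class": "8B", "birth_month": "July", "town": "Guntur"},
]

SHARED_COLUMNS = ("class", "birth_month", "town")


def reidentify(survey, known_people):
    """Match survey rows to names when exactly one person shares the same details."""
    matches = []
    for row in survey:
        candidates = [
            person["name"]
            for person in known_people
            if all(person[col] == row[col] for col in SHARED_COLUMNS)
        ]
        if len(candidates) == 1:
            matches.append((candidates[0], row["hardest_subject"]))
    return matches


print(reidentify(anonymised_survey, public_list))
```

Even though the survey contains no names, both rows can be linked back to a named student, because the combination of class, birth month, and town is unique in the public list. This is why removing names alone is often not enough to protect privacy.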

India's data law: India's Digital Personal Data Protection Act (DPDPA) 2023 gives Indian citizens rights over their personal data — including the right to know how it is being used. As AI users, students should understand that their activity data (searches, messages, app usage) can be collected and used to train or improve AI systems.

Section 7 of 7

🗺️ Data in Your Own Life

You generate data every day. Understanding what data you produce — and where it goes — makes you a more informed digital citizen.

Action you take	Data generated	Who might use it
Search something on Google	Search query, time, location, device	Google — to train search ranking and ad models
Watch a YouTube video	Video ID, watch time, skip points, device	YouTube — to train recommendation models
Use a voice assistant	Audio recording of your voice and command	Platform (Google/Amazon/Apple) — to improve speech recognition
Use an AI chatbot	Your questions and feedback	Platform — to improve the language model
Use navigation/maps app	GPS route, speed, time	Maps service — to improve traffic prediction

The key insight: AI is not separate from your life — it is built from data that people like you generated. Every time you use an AI tool and rate its response, click a recommended video, or speak to a voice assistant, you may be contributing to its training data. Understanding this helps you be a more conscious and empowered AI user.

📊 Quiz — Lesson 2

8 questions

1. What is a "feature" in a dataset?

2. "Garbage in, garbage out" means:

3. Which is an example of a "label" in a supervised learning dataset?

4. A crop disease detection AI trained only on photos from one region may fail in other regions because:

5. What is "transfer learning"?

6. In Rahul's survey, students wrote "Maths", "math", "MATHS", and "mathematics", all meaning the same subject. This is an example of:

7. The Bhashini project is important because:

8. When you click on a YouTube video recommendation, which of the following is MOST likely happening?

📝 Worksheet — The Mini Dataset Challenge

Complete this in your notebook. This is a hands-on data exercise.

Ask 10 classmates: "Which subject do you find hardest and why?" — record their answers in a simple table (name, subject, reason).
Look at your collected data. Find at least 2 data quality problems: inconsistent spelling, missing answers, unclear reasons.
Clean the data: standardise spelling, decide how to handle missing values, group similar reasons together.
Answer: Which subject appears most? What are the top 2 reasons given? Would you trust an AI trained on this raw (uncleaned) data?
Reflect: What would you do differently if you were designing this survey to produce cleaner data from the start?

📋 Note for Parents and Teachers

What this lesson covers: What a dataset is (records, features, labels), types of data, the difference between clean and messy data, the representative data problem with a focus on Indian contexts, data ethics and India's data protection law, and how students' daily actions generate AI training data.

Practical exercise: The worksheet mini-dataset challenge can be done as a classroom exercise. It gives students direct experience of why data cleaning is both necessary and time-consuming — building empathy for the challenges in real AI development.

Discussion prompts:

"What data do you generate every day? Who might be using it?"
"If you were building an AI to diagnose crop diseases in Telangana, what would make a good training dataset?"
