AI for Students · Class 8 · Age 12–13 · Lesson 2 of 12

Data: The Ingredient AI Needs 📊

AI is only as good as the data it learns from. This lesson explores what data really is, what makes it clean or messy, and why the quality of your data matters more than the sophistication of your model.

📘 Class 8 · Lesson 2 🕐 45–55 min 🚫 No coding needed 🆓 Free lesson
Illustrated scene: Indian student sorting colourful data cards into neat and messy piles, representing clean vs messy data

Story · Rahul's School Survey Problem

The Data Was There — But Was It Useful? 🗂️

Rahul, 13, from Guntur, was doing a school project: "Find out which subject Class 8 students find most difficult and why." He surveyed 50 students and collected their answers in a notebook.

Then he tried to make sense of it. Some students had written "Maths" — others had written "math", "MATHS", "mathematics", and "maths (hate it!)". Some had left the column blank. One had written "all of them." A few had given two answers. And three forms had been filled in by the same person using different names.

Rahul had data — but the data was messy. Before he could find any patterns, he had to clean it: standardise the spelling, handle blanks, remove duplicates, and decide what to do with unusual entries.

His teacher nodded. "Rahul has just discovered the most important lesson in data science: collecting data is easy. Having clean, usable data is hard. And AI has the same problem — at a much larger scale."

👉 This lesson explains what data really is, how it is structured, what makes it clean or messy, and why the phrase "garbage in, garbage out" is the most important warning in all of AI.
Section 1 of 7

🗃️ What Is a Dataset?

A dataset is an organised collection of related information. It is the raw material that an AI model learns from.

Think of a dataset like a very large table. Each row is one example — called a record or instance. Each column is one property — called a feature or attribute. If the data is for supervised learning, one column is the label — the answer the model is trying to learn to predict.

Here is a small example — a toy dataset for predicting whether a student will pass or fail:

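| Student ID | Study hours/day | Attendance % | Previous score | Label (pass/fail) |
|---|---|---|---|---|
| S001 | 3.5 | 92 | 72 | Pass |
| S002 | 1.0 | 65 | 44 | Fail |
| S003 | 4.0 | 88 | 81 | Pass |
| S004 | ? (missing) | 74 | 58 | Pass |
| S005 | 2.5 | 80 | -99 | Fail |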

⚠️ Two data quality issues are highlighted above: a missing value (S004 study hours) and a suspicious value (S005 previous score of -99 — probably an error).

Key vocabulary: Record (row = one example), Feature (column = one property), Label (the answer the model learns to predict), Dataset (the whole table).
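
Optional, for curious readers: the toy table above can be written as a short Python sketch. This is only one possible way to store it. The names `dataset`, `FEATURES`, and `find_problems` are made up for this illustration, and the rule "negative values are errors" is an assumption that happens to suit this table.

```python
# Toy dataset from the table above: each dictionary is one record (row).
# None stands for the missing study-hours value of S004.
dataset = [
    {"student_id": "S001", "study_hours": 3.5, "attendance": 92, "previous_score": 72, "label": "Pass"},
    {"student_id": "S002", "study_hours": 1.0, "attendance": 65, "previous_score": 44, "label": "Fail"},
    {"student_id": "S003", "study_hours": 4.0, "attendance": 88, "previous_score": 81, "label": "Pass"},
    {"student_id": "S004", "study_hours": None, "attendance": 74, "previous_score": 58, "label": "Pass"},
    {"student_id": "S005", "study_hours": 2.5, "attendance": 80, "previous_score": -99, "label": "Fail"},
]

# The feature columns: the properties a model would learn from.
FEATURES = ["study_hours", "attendance", "previous_score"]


def find_problems(records):
    """Return (student_id, feature, reason) for every suspicious value."""
    problems = []
    for record in records:
        for feature in FEATURES:
            value = record[feature]
            if value is None:
                problems.append((record["student_id"], feature, "missing"))
            elif value < 0:
                # Assumption: hours, percentages and scores can never be negative.
                problems.append((record["student_id"], feature, "negative"))
    return problems


# The label column: the answers a model would try to predict.
labels = [record["label"] for record in dataset]
problems = find_problems(dataset)

print("Records:", len(dataset))
print("Labels:", labels)
for problem in problems:
    print("Problem:", problem)
```

Running it should report the same two issues flagged in the warning above: the missing study hours for S004 and the negative previous score for S005.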
Section 2 of 7

🎨 Types of Data

Data can take many forms. Understanding the type of data helps you understand what kind of AI model can work with it.

  • Numerical. Numbers: temperature, exam score, age, rainfall in mm, distance in km. Can be compared and calculated with.
  • Categorical. Fixed categories: subject name (Maths, Science), city, grade (A/B/C), type of crop. Has a limited set of possible values.
  • Text. Free-form written language: product reviews, WhatsApp messages, news articles, student essays. Needs special processing to use in AI.
  • Image. Pixels in a grid: photos of crops, X-rays, satellite images, handwritten digits. Each image = thousands or millions of numbers.
  • Audio. Sound waves sampled over time: speech recordings, music, bird calls. Used in voice AI and language translation.
  • Time-series. Values measured over time: daily rainfall, stock price, heartbeat, electricity usage. Order matters: the sequence has meaning.
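
Optional, for curious readers: the claim that one image is thousands or millions of numbers can be checked with simple multiplication. The sketch below assumes one number per pixel for a greyscale image and three numbers per pixel (red, green, blue) for a colour image. The image sizes are illustrative choices, and `numbers_in_image` is a made-up helper name.

```python
def numbers_in_image(width, height, channels):
    """Count the numbers needed to store one image: one per pixel per colour channel."""
    return width * height * channels


# A tiny greyscale picture of a handwritten digit (assumed size: 28 x 28 pixels).
small_grey = numbers_in_image(28, 28, 1)

# A Full HD colour photo (1920 x 1080 pixels, 3 colour channels).
hd_colour = numbers_in_image(1920, 1080, 3)

# A 3 x 3 greyscale "image" written out by hand: 0 is black, 255 is white.
tiny_image = [
    [0, 255, 0],
    [255, 255, 255],
    [0, 255, 0],
]
tiny_count = sum(len(row) for row in tiny_image)

print(small_grey)  # 784
print(hd_colour)   # 6220800
print(tiny_count)  # 9
```

Even the small digit image needs 784 numbers, and the colour photo needs more than six million.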
Section 3 of 7

🧹 Clean Data vs Messy Data

Real-world data is almost never perfectly clean when it is first collected. Data cleaning — fixing problems before training — is one of the most important and time-consuming parts of building any AI system.

✅ Clean data has...

  • Consistent formatting (dates, spellings, units)
  • No missing values (or clearly handled ones)
  • No duplicates
  • Sensible ranges (no age = -5, no temperature = 9999)
  • Labels that are correct and consistent
  • Representative coverage (not missing whole groups)

❌ Messy data has...

  • Inconsistent spelling: "Maths", "maths", "MATH"
  • Missing values: blank cells, "N/A", "-99"
  • Duplicates: same record entered twice
  • Outliers: age = 200, income = -1
  • Wrong labels: a photo of a cat labelled "dog"
  • Gaps: no data from rural areas, no data from women

Common data cleaning steps

  1. Standardise format: make all city names, dates, and spellings consistent
  2. Handle missing values: remove the row, fill in the average, or mark the value as unknown
  3. Remove duplicates: check for and delete repeated entries
  4. Fix outliers: investigate values that are impossible or extreme, then correct or remove them
  5. Check labels: spot-check that categories are correctly assigned
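
Optional, for curious readers: the first three cleaning steps above can be sketched in Python on a survey like Rahul's. The names and answers below are invented for illustration, and the `SPELLINGS` table, `standardise`, and `clean` are hypothetical helpers rather than a standard tool. The sketch does not cover outliers or label checks.

```python
from collections import Counter

# Invented survey rows in the spirit of Rahul's notebook: (student name, answer).
raw_answers = [
    ("Asha", "Maths"),
    ("Ravi", "math"),
    ("Meena", "MATHS"),
    ("Kiran", "mathematics"),
    ("Sita", "maths (hate it!)"),
    ("Arjun", ""),            # left blank
    ("Ravi", "math"),         # same row entered twice
    ("Divya", "Science"),
    ("Farhan", "science "),
]

# Step 1 helper: every known spelling maps to one agreed spelling.
SPELLINGS = {
    "math": "Maths",
    "maths": "Maths",
    "mathematics": "Maths",
    "science": "Science",
}


def standardise(answer):
    """Drop comments in brackets, trim spaces, ignore capitals, handle blanks."""
    text = answer.split("(")[0].strip().lower()
    if text == "":
        return "Unknown"  # Step 2: mark a missing value instead of guessing
    return SPELLINGS.get(text, text.title())


def clean(rows):
    """Standardise every answer and keep only the first copy of repeated rows."""
    seen = set()
    cleaned_rows = []
    for name, answer in rows:
        row = (name, standardise(answer))
        if row in seen:  # Step 3: remove duplicates
            continue
        seen.add(row)
        cleaned_rows.append(row)
    return cleaned_rows


cleaned = clean(raw_answers)
counts = Counter(subject for _, subject in cleaned)
print(counts.most_common())
```

This simple duplicate check only catches rows that are exactly the same after cleaning. It would not catch one person filling in several forms under different names, which still needs a human to investigate.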
"Garbage in, garbage out": this is the most important rule in AI and data science. A brilliant algorithm trained on dirty data will often produce worse results than a simple algorithm trained on clean data. Improving data quality usually helps more than making the model more complex.
Section 4 of 7

⚖️ Representative Data — The Coverage Problem

Even clean data can be problematic if it does not represent the full picture. A dataset that works perfectly for some users might fail for others — because those users were not well-represented in the training data.
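
Optional, for curious readers: one simple way to spot a coverage problem is to count how many records come from each group. The sketch below uses invented numbers for a crop photo dataset. The 10% threshold and the name `coverage_report` are assumptions chosen for illustration, not an official rule.

```python
from collections import Counter

# Invented example: the state where each training photo was taken.
photo_regions = ["Punjab"] * 80 + ["Andhra Pradesh"] * 15 + ["Assam"] * 5


def coverage_report(groups, minimum_share=0.10):
    """Return each group's share of the data and whether it is below minimum_share."""
    counts = Counter(groups)
    total = len(groups)
    report = {}
    for group, count in counts.items():
        share = count / total
        report[group] = (share, share < minimum_share)
    return report


report = coverage_report(photo_regions)
for group, (share, too_small) in report.items():
    flag = "under-represented" if too_small else "ok"
    print(f"{group}: {share:.0%} of photos ({flag})")
```

In this made-up dataset only 5% of the photos come from Assam, so a model trained on it may perform worse on crops photographed there. A count like this only shows who is missing from the data. Deciding what counts as fair coverage still needs human judgement.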

Three common coverage problems

Why this matters for India: Most of the world's largest AI datasets were built in the US, UK, and China. Indian contexts — local names, addresses, regional accents, seasonal patterns, crop varieties, medical conditions common in India — are often underrepresented. This means AI tools can be less accurate for Indian users, even when they seem very capable in general.

What is being done

Projects like Bhashini (a government initiative for Indian language AI), AI4Bharat (IIT Madras) and several Indian startups are actively building large, diverse datasets specifically for Indian languages, voices, and contexts. This is essential work for making AI fair and useful in India.

Section 5 of 7

🔢 How Much Data Is Enough?

A question students always ask: how many examples does an AI really need?

The answer depends on the task, but here is a rough guide:

| Type of task | Rough data size needed | Example |
|---|---|---|
| Simple classification (2–3 categories) | Hundreds to a few thousand examples | Spam filter with 2 categories |
| Image recognition (many categories) | Hundreds to thousands per category, often millions in total | ImageNet: about 14 million labelled images in total |
| Language model (like GPT) | Hundreds of billions of text tokens | GPT-3: training dataset of ~500 billion tokens |
| Medical diagnosis AI | Tens of thousands of verified cases | Diabetic retinopathy detection: 128,000 images |
The good news: You do not always need to start from scratch. Transfer learning lets you take a model already trained on huge data (like an image recognition model) and fine-tune it for your specific task with a much smaller dataset. This is how Indian language tools are often built — start from a large multilingual model, then fine-tune with Indian language data.
Section 6 of 7

🔏 Data Ethics — Who Owns It and What Happens to It?

Every AI system was built using data collected from somewhere — and from someone. Before finishing this lesson, it is important to think about the ethics of data.

Three important questions about any dataset

India's data law: India's Digital Personal Data Protection Act (DPDPA) 2023 gives Indian citizens rights over their personal data — including the right to know how it is being used. As AI users, students should understand that their activity data (searches, messages, app usage) can be collected and used to train or improve AI systems.
Section 7 of 7

🗺️ Data in Your Own Life

You generate data every day. Understanding what data you produce — and where it goes — makes you a more informed digital citizen.

| Action you take | Data generated | Who might use it |
|---|---|---|
| Search something on Google | Search query, time, location, device | Google: to train search ranking and ad models |
| Watch a YouTube video | Video ID, watch time, skip points, device | YouTube: to train recommendation models |
| Use a voice assistant | Audio recording of your voice and command | Platform (Google/Amazon/Apple): to improve speech recognition |
| Use an AI chatbot | Your questions and feedback | Platform: to improve the language model |
| Use navigation/maps app | GPS route, speed, time | Maps service: to improve traffic prediction |
The key insight: AI is not separate from your life — it is built from data that people like you generated. Every time you use an AI tool and rate its response, click a recommended video, or speak to a voice assistant, you may be contributing to its training data. Understanding this helps you be a more conscious and empowered AI user.

📊 Quiz — Lesson 2

8 questions

1. What is a "feature" in a dataset?
2. "Garbage in, garbage out" means:
3. Which is an example of a "label" in a supervised learning dataset?
4. A crop disease detection AI trained only on photos from one region may fail in other regions because:
5. What is "transfer learning"?
6. In Rahul's survey, students wrote "Maths", "math", "MATHS", and "mathematics", all meaning the same subject. This is an example of:
7. The Bhashini project is important because:
8. When you click on a YouTube video recommendation, which of the following is MOST likely happening?

📝 Worksheet — The Mini Dataset Challenge

Complete this in your notebook. This is a hands-on data exercise.

  1. Ask 10 classmates: "Which subject do you find hardest and why?" — record their answers in a simple table (name, subject, reason).
  2. Look at your collected data. Find at least 2 data quality problems: inconsistent spelling, missing answers, unclear reasons.
  3. Clean the data: standardise spelling, decide how to handle missing values, group similar reasons together.
  4. Answer: Which subject appears most? What are the top 2 reasons given? Would you trust an AI trained on this raw (uncleaned) data?
  5. Reflect: What would you do differently if you were designing this survey to produce cleaner data from the start?

📋 Note for Parents and Teachers

What this lesson covers: What a dataset is (records, features, labels), types of data, the difference between clean and messy data, the representative data problem with a focus on Indian contexts, data ethics and India's data protection law, and how students' daily actions generate AI training data.

Practical exercise: The worksheet mini-dataset challenge can be done as a classroom exercise. It gives students direct experience of why data cleaning is both necessary and time-consuming — building empathy for the challenges in real AI development.

Discussion prompts:
