The Data Was There — But Was It Useful? 🗂️
Rahul, 13, from Guntur, was doing a school project: "Find out which subject Class 8 students find most difficult and why." He surveyed 50 students and collected their answers in a notebook.
Then he tried to make sense of it. Some students had written "Maths" — others had written "math", "MATHS", "mathematics", and "maths (hate it!)". Some had left the column blank. One had written "all of them." A few had given two answers. And three forms had been filled in by the same person using different names.
Rahul had data — but the data was messy. Before he could find any patterns, he had to clean it: standardise the spelling, handle blanks, remove duplicates, and decide what to do with unusual entries.
His teacher nodded. "Rahul has just discovered the most important lesson in data science: collecting data is easy. Having clean, usable data is hard. And AI has the same problem — at a much larger scale."
🗃️ What Is a Dataset?
A dataset is an organised collection of related information. It is the raw material that an AI model learns from.
Think of a dataset like a very large table. Each row is one example — called a record or instance. Each column is one property — called a feature or attribute. If the data is for supervised learning, one column is the label — the answer the model is trying to learn to predict.
Here is a small example — a toy dataset for predicting whether a student will pass or fail:
| Student ID | Study hours/day | Attendance % | Previous score | Label (pass/fail) |
|---|---|---|---|---|
| S001 | 3.5 | 92 | 72 | Pass |
| S002 | 1.0 | 65 | 44 | Fail |
| S003 | 4.0 | 88 | 81 | Pass |
| S004 | (missing) | 74 | 58 | Pass |
| S005 | 2.5 | 80 | -99 | Fail |
⚠️ Two data quality issues are highlighted above: a missing value (S004 study hours) and a suspicious value (S005 previous score of -99 — probably an error).
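The table above can be written out in code to make the vocabulary concrete. This is a minimal sketch, not a standard recipe: the field names, the `find_problems` helper, and the rule that a score must lie between 0 and 100 are all assumptions made for illustration.

```python
# Toy dataset from the table above: each dict is one record (row).
# None marks the missing value; -99 is kept exactly as it appears in the table.
records = [
    {"student_id": "S001", "study_hours": 3.5, "attendance": 92, "previous_score": 72, "label": "Pass"},
    {"student_id": "S002", "study_hours": 1.0, "attendance": 65, "previous_score": 44, "label": "Fail"},
    {"student_id": "S003", "study_hours": 4.0, "attendance": 88, "previous_score": 81, "label": "Pass"},
    {"student_id": "S004", "study_hours": None, "attendance": 74, "previous_score": 58, "label": "Pass"},
    {"student_id": "S005", "study_hours": 2.5, "attendance": 80, "previous_score": -99, "label": "Fail"},
]

# The columns the model would learn from (features).
FEATURES = ["study_hours", "attendance", "previous_score"]


def find_problems(rows):
    """Return (student_id, feature, reason) for each missing or out-of-range value."""
    problems = []
    for row in rows:
        for name in FEATURES:
            value = row[name]
            if value is None:
                problems.append((row["student_id"], name, "missing"))
            # Assumption: scores are percentages, so they must be 0 to 100.
            elif name == "previous_score" and not 0 <= value <= 100:
                problems.append((row["student_id"], name, "out of range"))
    return problems


# Split each record into features (inputs) and the label (answer to predict).
X = [[row[name] for name in FEATURES] for row in records]
y = [row["label"] for row in records]

problems = find_problems(records)
for problem in problems:
    print(problem)
```

Running the check flags the same two cells that the note above points out: the missing study hours for S004 and the score of -99 for S005.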
🎨 Types of Data
Data can take many forms. Understanding the type of data helps you understand what kind of AI model can work with it.
- Numerical data: Can be compared and calculated with.
- Categorical data: Has a limited set of possible values.
- Text data: Needs special processing to use in AI.
- Image data: Each image = thousands or millions of numbers.
- Audio data: Used in voice AI and language translation.
- Time-series data: Order matters; the sequence has meaning.
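Whatever form data starts in, a model ultimately works with numbers. The sketch below shows three small conversions under simple assumptions: a value from a limited set becomes a one-hot list, an image is counted as width × height × colour channels, and words in a sentence become ordered ID numbers. The subject list, image sizes, and sentence are made up for illustration, and real systems use more sophisticated processing.

```python
def one_hot(value, categories):
    """Encode a value from a limited set as a list containing a single 1."""
    return [1 if value == category else 0 for category in categories]


# Hypothetical set of possible values for a "subject" column.
subjects = ["Maths", "Science", "English"]
encoded = one_hot("Science", subjects)


def image_value_count(width, height, channels=3):
    """Count the numbers needed to store an image (3 channels = red, green, blue)."""
    return width * height * channels


small = image_value_count(28, 28, channels=1)  # small greyscale image
photo = image_value_count(1000, 1000)          # 1000 x 1000 colour photo

# Text: give each distinct word an ID, keeping the original word order.
sentence = "maths is hard but maths is useful"
vocab = {}
token_ids = []
for word in sentence.split():
    if word not in vocab:
        vocab[word] = len(vocab)  # next unused ID
    token_ids.append(vocab[word])

print(encoded)    # [0, 1, 0]
print(small)      # 784
print(photo)      # 3000000
print(token_ids)  # [0, 1, 2, 3, 0, 1, 4]
```

The repeated IDs in `token_ids` show why order matters: shuffling the list would keep the same words but lose the sentence.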
🧹 Clean Data vs Messy Data
Real-world data is almost never perfectly clean when it is first collected. Data cleaning — fixing problems before training — is one of the most important and time-consuming parts of building any AI system.
✅ Clean data has...
- Consistent formatting (dates, spellings, units)
- No missing values (or clearly handled ones)
- No duplicates
- Sensible ranges (no age = -5, no temperature = 9999)
- Labels that are correct and consistent
- Representative coverage (not missing whole groups)
❌ Messy data has...
- Inconsistent spelling: "Maths", "maths", "MATH"
- Missing values: blank cells, "N/A", "-99"
- Duplicates: same record entered twice
- Outliers: age = 200, income = -1
- Wrong labels: a photo of a cat labelled "dog"
- Gaps: no data from rural areas, no data from women
Common data cleaning steps
- Standardise format: make all city names, dates, and spellings consistent
- Handle missing values: remove the row, fill with average, or note as unknown
- Remove duplicates: check for and delete repeated entries
- Fix outliers: investigate values that are impossible or extreme, then correct or remove them
- Check labels: spot-check that categories are correctly assigned
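The first three steps above can be sketched on a survey like Rahul's. Everything here is illustrative: the names and answers are invented, the `SPELLINGS` table and the "Unknown" and "Other" categories are one possible set of choices, and the duplicate check only catches exact repeats (it would not catch one person answering under different names).

```python
import re
from collections import Counter

# Hypothetical raw survey rows: (student name, answer as written).
raw = [
    ("Asha", "Maths"),
    ("Ravi", "math"),
    ("Kiran", "MATHS"),
    ("Divya", "mathematics"),
    ("Sai", "maths (hate it!)"),
    ("Meena", ""),             # left blank
    ("Arjun", "all of them"),  # unusual entry
    ("Asha", "Maths"),         # same record entered twice
    ("Latha", "Science"),
]

# Known spellings mapped to one standard name (an assumed list).
SPELLINGS = {
    "math": "Maths",
    "maths": "Maths",
    "mathematics": "Maths",
    "science": "Science",
}


def standardise(answer):
    """Map a raw answer to a standard subject, 'Unknown' if blank, else 'Other'."""
    text = re.sub(r"\(.*?\)", "", answer)  # drop comments in brackets
    text = text.strip().lower()
    if not text:
        return "Unknown"
    return SPELLINGS.get(text, "Other")


def clean(rows):
    """Standardise every answer and drop exact duplicate records."""
    seen = set()
    cleaned_rows = []
    for name, answer in rows:
        record = (name.strip(), standardise(answer))
        if record in seen:
            continue  # duplicate: skip it
        seen.add(record)
        cleaned_rows.append(record)
    return cleaned_rows


cleaned = clean(raw)
counts = Counter(subject for _, subject in cleaned)
print(counts.most_common())
```

Marking blanks as "Unknown" instead of deleting the row keeps a record of how many students did not answer, which is itself useful information.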
⚖️ Representative Data — The Coverage Problem
Even clean data can be problematic if it does not represent the full picture. A dataset that works perfectly for some users might fail for others — because those users were not well-represented in the training data.
Three common coverage problems
- Geographic bias: A crop disease detection AI trained only on photos from Maharashtra might not recognise diseases common in Tamil Nadu or Assam — because those regions were not in the training set.
- Demographic gaps: A face recognition system trained mostly on light-skinned faces tends to perform worse on darker skin tones, as multiple published studies have shown.
- Language gaps: AI language tools trained mostly on English and European languages perform worse on Telugu, Tamil, Kannada, and other Indian languages — because far less text data existed for them during training.
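One simple way to spot a coverage problem is to count how many examples come from each group before training. The sketch below does this for the crop-photo example; the photo counts, the list of expected regions, and the 10% threshold are all assumptions chosen for illustration, and a real project would need to decide what fair coverage means for its own users.

```python
from collections import Counter


def coverage_report(regions, expected, min_share=0.10):
    """Summarise how well each expected region is covered in the data.

    regions   -- one region name per training example
    expected  -- regions the model is supposed to work in
    min_share -- assumed threshold below which a region is flagged
    """
    counts = Counter(regions)
    total = len(regions)
    report = {}
    for region in expected:
        count = counts[region]  # Counter returns 0 for absent regions
        share = count / total if total else 0.0
        report[region] = {
            "count": count,
            "share": share,
            "under_represented": share < min_share,
        }
    return report


# Hypothetical dataset: the state each crop photo was taken in.
photo_regions = ["Maharashtra"] * 90 + ["Tamil Nadu"] * 7 + ["Assam"] * 3
expected_regions = ["Maharashtra", "Tamil Nadu", "Assam", "Telangana"]

report = coverage_report(photo_regions, expected_regions)
for region, info in report.items():
    print(region, info)
```

Passing in the list of expected regions matters: a region with zero photos never shows up in a plain count, so it would go unnoticed unless the check asks for it by name.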
What is being done
Projects like Bhashini (a government initiative for Indian language AI), AI4Bharat (IIT Madras) and several Indian startups are actively building large, diverse datasets specifically for Indian languages, voices, and contexts. This is essential work for making AI fair and useful in India.
🔢 How Much Data Is Enough?
A question students always ask: how many examples does an AI really need?
The answer depends on the task, but here is a rough guide:
| Type of task | Rough data size needed | Example |
|---|---|---|
| Simple classification (2–3 categories) | Hundreds to a few thousand examples | Spam filter with 2 categories |
| Image recognition (many categories) | Hundreds to thousands per category, often millions in total | ImageNet: 14 million labelled images |
| Language model (like GPT) | Hundreds of billions of text tokens | GPT-3: trained on about 300 billion tokens, drawn from a corpus of roughly 500 billion |
| Medical diagnosis AI | Tens of thousands of verified cases | Diabetic retinopathy detection: 128,000 images |
🔏 Data Ethics — Who Owns It and What Happens to It?
Every AI system was built using data collected from somewhere — and from someone. Before finishing this lesson, it is important to think about the ethics of data.
Three important questions about any dataset
- Consent: Did the people whose data was used agree to it being used to train AI? Many early datasets scraped public websites without asking anyone.
- Privacy: Does the dataset contain personal information? Even "anonymised" data can sometimes be re-identified by combining it with other data sources.
- Purpose: Was the data collected for one purpose (e.g. hospital records) but is now being used for something else (e.g. insurance risk scoring)? This is called "repurposing" and raises ethical questions.
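The privacy point about re-identification can be shown with a tiny, entirely made-up example. The sketch assumes a survey that had its names removed but kept PIN code, birth year, and gender, plus a separate public list that has the same three details alongside names. The field names and the `link` helper are hypothetical.

```python
# "Anonymised" survey rows: names removed, other details kept (made-up data).
anonymised = [
    {"pin_code": "522001", "birth_year": 2011, "gender": "M", "hardest_subject": "Maths"},
    {"pin_code": "522002", "birth_year": 2010, "gender": "F", "hardest_subject": "Science"},
    {"pin_code": "522002", "birth_year": 2011, "gender": "F", "hardest_subject": "English"},
]

# A separate, public list that includes names (also made-up data).
public = [
    {"name": "Student A", "pin_code": "522001", "birth_year": 2011, "gender": "M"},
    {"name": "Student B", "pin_code": "522002", "birth_year": 2010, "gender": "F"},
    {"name": "Student C", "pin_code": "522003", "birth_year": 2012, "gender": "M"},
]

# Details shared by both sources that can act like a fingerprint.
KEYS = ("pin_code", "birth_year", "gender")


def link(anon_rows, public_rows):
    """Return {name: hardest_subject} where a person matches exactly one anonymised row."""
    matches = {}
    for person in public_rows:
        key = tuple(person[k] for k in KEYS)
        hits = [row for row in anon_rows if tuple(row[k] for k in KEYS) == key]
        if len(hits) == 1:  # a unique match re-identifies the person
            matches[person["name"]] = hits[0]["hardest_subject"]
    return matches


reidentified = link(anonymised, public)
print(reidentified)  # {'Student A': 'Maths', 'Student B': 'Science'}
```

No single column gives anyone away, but the combination of three ordinary details is unique for two of the three survey rows, which is why removing names alone does not guarantee privacy.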
🗺️ Data in Your Own Life
You generate data every day. Understanding what data you produce — and where it goes — makes you a more informed digital citizen.
| Action you take | Data generated | Who might use it |
|---|---|---|
| Search something on Google | Search query, time, location, device | Google — to train search ranking and ad models |
| Watch a YouTube video | Video ID, watch time, skip points, device | YouTube — to train recommendation models |
| Use a voice assistant | Audio recording of your voice and command | Platform (Google/Amazon/Apple) — to improve speech recognition |
| Use an AI chatbot | Your questions and feedback | Platform — to improve the language model |
| Use navigation/maps app | GPS route, speed, time | Maps service — to improve traffic prediction |
📝 Worksheet — The Mini Dataset Challenge
Complete this in your notebook. This is a hands-on data exercise.
- Ask 10 classmates: "Which subject do you find hardest and why?" — record their answers in a simple table (name, subject, reason).
- Look at your collected data. Find at least 2 data quality problems: inconsistent spelling, missing answers, unclear reasons.
- Clean the data: standardise spelling, decide how to handle missing values, group similar reasons together.
- Answer: Which subject appears most? What are the top 2 reasons given? Would you trust an AI trained on this raw (uncleaned) data?
- Reflect: What would you do differently if you were designing this survey to produce cleaner data from the start?
📋 Note for Parents and Teachers
What this lesson covers: What a dataset is (records, features, labels), types of data, the difference between clean and messy data, the representative data problem with a focus on Indian contexts, data ethics and India's data protection law, and how students' daily actions generate AI training data.
Practical exercise: The worksheet mini-dataset challenge can be done as a classroom exercise. It gives students direct experience of why data cleaning is both necessary and time-consuming — building empathy for the challenges in real AI development.
Discussion prompts:
- "What data do you generate every day? Who might be using it?"
- "If you were building an AI to diagnose crop diseases in Telangana, what would make a good training dataset?"