Lesson 6 — Regression: Predicting Numbers | Class 9

Meet Ananya — Class 9, Bengaluru

Ananya's family is looking to buy a flat. Her father told her: "Location, size, age of building — all these affect the price." Ananya opened Magicbricks.com and saw 200 flats in Whitefield. She thought: "Could an AI learn from these 200 listings and predict the price of any new flat in that area?"

That's exactly what regression does. Not categories (like pass/fail) — but a specific number (like ₹45 lakhs). She ran the numbers after this lesson and the model predicted within ₹3 lakhs of actual prices 80% of the time. Let's learn how.

Classification vs Regression

Two Types of Supervised Learning

You've already learned classification. Regression is its sibling — they're both supervised learning, but with a key difference:

Classification

Predicts a Category

Output: one label from a fixed list
Pass / Fail
Spam / Not Spam
Cat / Dog / Bird
Metric: Accuracy, F1

Regression

Predicts a Number

Output: any continuous number
Flat price: ₹42,50,000
Temperature tomorrow: 34.5°C
Student marks: 78.3
Metric: MAE, RMSE, R²
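To see the difference on one small dataset, here is a minimal sketch that trains both kinds of model on the same study hours. The data values and the pass mark of 40 are made up for illustration, and LogisticRegression is just one possible classifier, not something this lesson has covered.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Made-up data for illustration: study hours and the marks each student scored
hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
marks = np.array([22, 30, 41, 45, 55, 62, 70, 79])

# Classification target: a label from a fixed list (1 = pass, 0 = fail)
# The pass mark of 40 is an assumption for this example
passed = (marks >= 40).astype(int)

regressor = LinearRegression().fit(hours, marks)       # learns to output a number
classifier = LogisticRegression().fit(hours, passed)   # learns to output a label

new_student = np.array([[5.5]])
predicted_marks = regressor.predict(new_student)[0]    # a continuous number
predicted_label = classifier.predict(new_student)[0]   # 0 or 1

print(f"Regression output: {predicted_marks:.1f} marks")
print(f"Classification output: {'Pass' if predicted_label == 1 else 'Fail'}")
```

Same input, two different kinds of answer: the regressor returns a number on a continuous scale, while the classifier can only return one of its fixed labels.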

Part 1

Linear Regression: The Simplest Model

Linear regression finds the best straight line through your data points. That line is the model — and you use it to predict any new value.

[Figure: Study Hours vs Exam Marks, a scatter plot with a regression line. x-axis: study hours (1h to 9h); y-axis: exam marks (25 to 100).]

Each orange dot = one student. The dark line = the linear regression prediction line. A new student studying 6 hours → read up from 6h to the line → predicted marks ≈ 72.

The equation of the line is: y = mx + b

y = the prediction (marks)
x = the input feature (study hours)
m = slope — how much y increases per unit of x ("each extra study hour adds about 8 marks")
b = intercept — the predicted y when x = 0 ("a student who studies 0 hours scores ~20")

Linear regression learns the best m and b by minimising the total squared error between its predictions and the actual values. This is called Ordinary Least Squares (OLS) and is solved in milliseconds.
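For a single feature, the least-squares m and b can be worked out directly from the data with two short formulas. The sketch below uses five made-up data points (not the lesson's dataset) and checks that nudging the slope away from the computed value makes the total squared error larger.

```python
import numpy as np

# Made-up data for illustration
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
marks = np.array([30.0, 36.0, 45.0, 52.0, 62.0])

x_mean, y_mean = hours.mean(), marks.mean()

# Least-squares slope and intercept for one input feature
m = np.sum((hours - x_mean) * (marks - y_mean)) / np.sum((hours - x_mean) ** 2)
b = y_mean - m * x_mean

def total_squared_error(slope, intercept):
    """Sum of squared gaps between the line's predictions and the actual marks."""
    predictions = slope * hours + intercept
    return np.sum((marks - predictions) ** 2)

print(f"m = {m:.2f}, b = {b:.2f}")
print(f"Squared error of the fitted line: {total_squared_error(m, b):.2f}")
print(f"Squared error with a steeper slope: {total_squared_error(m + 0.5, b):.2f}")
```

scikit-learn's LinearRegression arrives at the same line for this kind of data; the formulas are shown only to make "minimising the squared error" concrete.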

Part 2

Regression Evaluation Metrics

Accuracy doesn't make sense for regression — there's no "correct category". Instead we use error metrics:

MAE

Mean Absolute Error

Average absolute difference between predicted and actual. "Off by ₹3 lakhs on average." Easy to understand.
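MAE is simple enough to compute by hand. A small sketch with four made-up flat prices (in lakhs):

```python
import numpy as np

# Made-up flat prices in ₹ lakhs, for illustration only
actual = np.array([45.0, 52.0, 38.0, 60.0])
predicted = np.array([42.0, 55.0, 40.0, 56.0])

# Size of each miss, ignoring whether the prediction was too high or too low
absolute_errors = np.abs(actual - predicted)

mae = absolute_errors.mean()
print(f"Absolute errors: {absolute_errors}")
print(f"MAE: ₹{mae:.1f} lakhs")
```

The individual misses here are 3, 3, 2 and 4 lakhs, so the MAE works out to 3 lakhs.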

RMSE

Root Mean Squared Error

Like MAE, but each error is squared before averaging and the square root is taken at the end, so large errors count for more. Good when big mistakes are especially costly.
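A quick way to see the extra penalty is to compare two sets of errors that have the same MAE. The numbers below are made up for illustration, and the helper functions mae and rmse are written just for this example.

```python
import numpy as np

def mae(errors):
    """Mean of the absolute errors."""
    return np.mean(np.abs(errors))

def rmse(errors):
    """Square each error, average them, then take the square root."""
    return np.sqrt(np.mean(np.square(errors)))

# Two made-up models: one is always a little off, one is usually right but has one big miss
steady_errors = np.array([3.0, 3.0, 3.0, 3.0])
one_big_miss = np.array([0.0, 0.0, 0.0, 12.0])

print(f"Steady errors: MAE = {mae(steady_errors):.1f}, RMSE = {rmse(steady_errors):.1f}")
print(f"One big miss:  MAE = {mae(one_big_miss):.1f}, RMSE = {rmse(one_big_miss):.1f}")
```

Both models have an MAE of 3, but the RMSE of the second one doubles to 6 because the single 12-point miss is squared before averaging.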

R² Score

R-squared (Coefficient of Determination)

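1 = perfect predictions. 0 = the model is no better than always predicting the mean. Below 0 = worse than predicting the mean. As a rough rule of thumb, R² > 0.85 indicates a good model, though the right target depends on the problem.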
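R² compares the model's squared error with the squared error of a "lazy" model that always predicts the mean. A minimal sketch with made-up numbers; the helper r_squared is written for this example and should agree with scikit-learn's r2_score on the same inputs.

```python
import numpy as np

def r_squared(actual, predicted):
    """1 minus (model's squared error / squared error of always predicting the mean)."""
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Made-up marks for illustration
actual = np.array([40.0, 50.0, 60.0, 70.0])

perfect = actual.copy()                             # every prediction exactly right
always_mean = np.full_like(actual, actual.mean())   # ignores the input entirely
close = np.array([42.0, 49.0, 61.0, 68.0])          # small misses

print(f"Perfect predictions: R² = {r_squared(actual, perfect):.2f}")
print(f"Always the mean:     R² = {r_squared(actual, always_mean):.2f}")
print(f"Close predictions:   R² = {r_squared(actual, close):.2f}")
```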

Part 3

Build a Marks Predictor in Python

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score
import matplotlib.pyplot as plt

# Generate student study/marks data
np.random.seed(42)
n = 100
study_hours = np.random.uniform(1, 9, n)
marks = 10 + 8.5 * study_hours + np.random.normal(0, 5, n)
marks = np.clip(marks, 20, 100)

df = pd.DataFrame({'study_hours': study_hours, 'marks': marks})

# Single feature regression
X = df[['study_hours']]   # 2D array (required by sklearn)
y = df['marks']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Train linear regression
model = LinearRegression()
model.fit(X_train, y_train)

print(f"Slope (m): {model.coef_[0]:.2f}  ← each extra hour adds ~{model.coef_[0]:.1f} marks")
print(f"Intercept (b): {model.intercept_:.2f}")

# Evaluate
y_pred = model.predict(X_test)
print(f"\nMAE: {mean_absolute_error(y_test, y_pred):.2f} marks")
print(f"R²:  {r2_score(y_test, y_pred):.3f}")

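# Predict for 5 hours study
# (pass a DataFrame with the same column name used in training,
#  otherwise scikit-learn warns about missing feature names)
pred = model.predict(pd.DataFrame({'study_hours': [5]}))[0]
print(f"\nPredicted marks for 5 hours study: {pred:.1f}")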

# Plot
plt.figure(figsize=(8,5))
plt.scatter(X_test['study_hours'], y_test, alpha=0.6, label='Actual', color='#f97316')
X_sorted = X_test.sort_values('study_hours')   # sort so the line is drawn left to right
plt.plot(X_sorted['study_hours'], model.predict(X_sorted),
         color='#7c2d12', linewidth=2, label='Predicted line')
plt.xlabel('Study Hours')
plt.ylabel('Marks')
plt.title('Linear Regression: Study Hours vs Marks')
plt.legend()
plt.show()

🎚 Try It: Study Hours → Marks Predictor

This uses the linear equation: marks = 10 + 8.5 × study_hours (approximately what the model above learns). For example:

Study Hours: 5 hours

Predicted Marks: 52.5
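The same equation can be tried for several values of study hours in a few lines of Python. The function name predict_marks is made up for this sketch; the slope 8.5 and intercept 10 come from the equation above.

```python
def predict_marks(study_hours, slope=8.5, intercept=10.0):
    """Apply marks = intercept + slope × study_hours."""
    return intercept + slope * study_hours

# Try a few different study times
for hours in [1, 3, 5, 7, 9]:
    print(f"{hours} hours -> {predict_marks(hours):.1f} marks")
```

Each extra hour moves the prediction up by the slope, 8.5 marks, which is exactly what m means in y = mx + b.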

🧪 Check Your Understanding — Lesson 6 Quiz

1. What is the key difference between classification and regression?

a) Classification needs more data than regression

b) Classification predicts a category; regression predicts a continuous number

c) Regression is only used for time series data

d) Classification uses neural networks; regression uses decision trees

2. In the linear equation y = mx + b, what does "b" represent?

a) The slope — how much y increases per unit of x

b) The number of data points

c) The intercept — the predicted y value when x = 0

d) The error of the model

3. A flat price predictor has MAE = ₹2.5 lakhs. This means:

a) The model is always wrong by exactly ₹2.5 lakhs

b) On average, predictions are off by ₹2.5 lakhs

c) 2.5% of predictions are wrong

d) The model is 97.5% accurate

4. An R² score of 0.92 means:

a) 92% of test rows were predicted correctly

b) The model explains 92% of the variance in the data — a very good fit

c) The model made errors in 8% of cases

d) 92 rows were in the training set

5. Which of these is a regression problem?

a) Will this email be spam or not spam?

b) What is the next word in this sentence?

c) What will the temperature be in Hyderabad tomorrow (°C)?

d) Is this image a cat, dog, or bird?

6. Why do we write X = df[['study_hours']] (double brackets) instead of X = df['study_hours'] in scikit-learn?

a) Double brackets are required for string columns

b) scikit-learn requires X to be a 2D array (DataFrame), not a 1D Series

c) It's just a Python convention, both work equally

d) To include multiple copies of the same column

7. RMSE penalises large errors more than MAE. When is this useful?

a) When all errors are equally important

b) When a few very large prediction errors are especially harmful (e.g., bridge load calculations)

c) Only for classification problems

d) When the dataset has fewer than 100 rows

8. Using the equation marks = 10 + 8.5 × study_hours, what would a student who studies 4 hours be predicted to score?

a) 34

b) 42

c) 44

d) 48
