Ananya's family is looking to buy a flat. Her father told her: "Location, size, age of building: all these affect the price." Ananya opened Magicbricks.com and saw 200 flats in Whitefield. She thought: "Could an AI learn from these 200 listings and predict the price of any new flat in that area?"
That's exactly what regression does. Not categories (like pass/fail), but a specific number (like ₹45 lakhs). Ananya ran the numbers after this lesson and the model predicted within ₹3 lakhs of actual prices 80% of the time. Let's learn how.
You've already learned classification. Regression is its sibling. Both are supervised learning, but with a key difference:
Classification: Predicts a Category
- Output: one label from a fixed list
- Pass / Fail
- Spam / Not Spam
- Cat / Dog / Bird
- Metric: Accuracy, F1
Regression: Predicts a Number
- Output: any continuous number
- Flat price: ₹42,50,000
- Temperature tomorrow: 34.5°C
- Student marks: 78.3
- Metric: MAE, RMSE, R²
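To make the contrast concrete, here is a minimal sketch that trains one model of each kind on the same made-up data. The eight students, the pass mark of 50, and the variable names are assumptions for illustration, and `LogisticRegression` is just one possible classifier.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical toy data: study hours and marks for eight students
hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
marks = np.array([28, 35, 45, 51, 60, 69, 75, 85])   # one number per student
passed = (marks >= 50).astype(int)                   # one label per student: 1 = pass, 0 = fail

# Same input, two different kinds of target
classifier = LogisticRegression().fit(hours, passed)
regressor = LinearRegression().fit(hours, marks)

new_student = np.array([[6.5]])
label = classifier.predict(new_student)[0]    # one label from a fixed list (0 or 1)
number = regressor.predict(new_student)[0]    # any continuous number

print("Classification output:", "Pass" if label == 1 else "Fail")
print(f"Regression output: {number:.1f} marks")
```

The classifier can only ever answer Pass or Fail, while the regressor returns a value anywhere on the number line.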
Linear regression finds the best straight line through your data points. That line is the model, and you use it to predict any new value.
Each orange dot = one student. The dark line = the linear regression prediction line. A new student studying 6 hours → read up from 6h to the line → predicted marks ≈ 72.
The equation of the line is: y = mx + b
- y = the prediction (marks)
- x = the input feature (study hours)
- m = slope: how much y increases per unit of x ("each extra study hour adds about 8 marks")
- b = intercept: the predicted y when x = 0 ("a student who studies 0 hours scores ~20")
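As a rough sketch of what "finding the best line" means for a single feature, the slope and intercept can be computed directly with the least-squares formulas. The five data points below are hypothetical and were chosen to lie exactly on marks = 20 + 8 × hours, so the result matches the slope and intercept quoted above.

```python
import numpy as np

# Hypothetical data lying exactly on the line marks = 20 + 8 * hours
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
marks = np.array([28.0, 36.0, 44.0, 52.0, 60.0])

# Least-squares estimates for one feature:
#   m = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
#   b = y_mean - m * x_mean
x_mean, y_mean = hours.mean(), marks.mean()
m = np.sum((hours - x_mean) * (marks - y_mean)) / np.sum((hours - x_mean) ** 2)
b = y_mean - m * x_mean

def predict(x):
    """Apply the line equation y = mx + b."""
    return m * x + b

print(f"m = {m:.1f}, b = {b:.1f}")
print(f"Predicted marks for 7 hours: {predict(7):.1f}")
```

With real, noisy data the points will not sit exactly on the line, but the same two formulas still give the line with the smallest total squared error.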
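Accuracy doesn't make sense for regression because there's no "correct category". Instead we use error metrics that measure how far predictions fall from the actual values: MAE (the average absolute error), RMSE (the square root of the average squared error, which penalises large misses more heavily), and R² (the share of the variation in y that the model explains).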
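To show where these numbers come from, here is a small sketch that computes MAE, RMSE, and R² by hand with NumPy. The four actual and predicted marks are made up for illustration.

```python
import numpy as np

# Hypothetical actual and predicted marks for four students
y_true = np.array([50.0, 60.0, 70.0, 80.0])
y_pred = np.array([52.0, 58.0, 73.0, 79.0])

errors = y_true - y_pred                       # [-2, 2, -3, 1]

# MAE: average size of the error, ignoring its sign
mae = np.mean(np.abs(errors))

# RMSE: square the errors, average them, then take the square root
rmse = np.sqrt(np.mean(errors ** 2))

# R²: 1 minus (model's squared error / squared error of always guessing the mean)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE:  {mae:.2f} marks")
print(f"RMSE: {rmse:.2f} marks")
print(f"R²:   {r2:.3f}")
```

MAE and RMSE are in the same unit as the target (marks here), so lower is better. R² has no unit, and values closer to 1 mean the line explains more of the variation.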
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score
import matplotlib.pyplot as plt
# Generate student study/marks data
np.random.seed(42)
n = 100
study_hours = np.random.uniform(1, 9, n)
marks = 10 + 8.5 * study_hours + np.random.normal(0, 5, n)
marks = np.clip(marks, 20, 100)
df = pd.DataFrame({'study_hours': study_hours, 'marks': marks})
# Single feature regression
X = df[['study_hours']] # 2D array (required by sklearn)
y = df['marks']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# Train linear regression
model = LinearRegression()
model.fit(X_train, y_train)
print(f"Slope (m): {model.coef_[0]:.2f} → each extra hour adds ~{model.coef_[0]:.1f} marks")
print(f"Intercept (b): {model.intercept_:.2f}")
# Evaluate
y_pred = model.predict(X_test)
print(f"\nMAE: {mean_absolute_error(y_test, y_pred):.2f} marks")
print(f"R²: {r2_score(y_test, y_pred):.3f}")
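# Predict for 5 hours of study (a DataFrame with the training column name avoids sklearn's feature-name warning)
pred = model.predict(pd.DataFrame({'study_hours': [5]}))[0]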
print(f"\nPredicted marks for 5 hours study: {pred:.1f}")
# Plot
plt.figure(figsize=(8,5))
plt.scatter(X_test, y_test, alpha=0.6, label='Actual', color='#f97316')
plt.plot(X_test.sort_values('study_hours'),
         model.predict(X_test.sort_values('study_hours')),
         color='#7c2d12', linewidth=2, label='Predicted line')
plt.xlabel('Study Hours')
plt.ylabel('Marks')
plt.title('Linear Regression: Study Hours vs Marks')
plt.legend()
plt.show()

Try It: Study Hours → Marks Predictor
This uses the linear equation marks = 10 + 8.5 × study_hours (approximately what the model above learns).
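One way to explore this equation without an interactive widget is a few lines of Python. The function name `predict_marks` and its default arguments are hypothetical; the slope 8.5 and intercept 10 come from the equation above.

```python
def predict_marks(study_hours, slope=8.5, intercept=10.0):
    """Approximate marks using marks = intercept + slope * study_hours."""
    return intercept + slope * study_hours

# Step through 1 to 9 study hours, the range used to generate the data above
for hours in range(1, 10):
    print(f"{hours} h -> {predict_marks(hours):.1f} marks")
```

Changing `slope` or `intercept` shows how each parameter moves the line: the slope changes how fast marks rise per hour, and the intercept shifts every prediction up or down by the same amount.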