Lesson 8 — Natural Language Processing Basics | Class 9

Meet Sneha — Class 9, Nagpur

Sneha runs a small YouTube channel about Indian street food. She gets hundreds of comments in both English and Hindi every week. One month she asked herself: "Are people mostly positive or negative about my chaat recipe video? I can't read 400 comments manually."

Her older sister, studying computer science, said: "Let's write a simple NLP script. It will read all 400 comments and tell you the overall sentiment." They spent 30 minutes coding and found 78% positive, 12% neutral, and 10% negative comments. Sneha now knew which recipes to continue and which needed improvement. Today you'll learn the core techniques that make this possible.

What Is NLP?

Teaching Computers to Understand Language

Natural Language Processing (NLP) is the branch of AI that handles human language — reading, understanding, translating, and generating text and speech. It's what powers Google Translate, Siri, Google Search autocomplete, spam filters, and ChatGPT.

The challenge: human language is ambiguous, sarcastic, context-dependent, and full of cultural references. The same word means different things in different sentences. NLP tools must handle all of this.

Part 1

Tokenisation: Breaking Text into Pieces

Before most processing, text must be split into tokens (individual units, usually words or characters). This is typically one of the first steps in an NLP pipeline.

Original sentence: "The biryani recipe was absolutely amazing!"

After tokenisation:

The | biryani | recipe | was | absolutely | amazing | !

"The" and "was" are likely stop words; words such as "biryani" and "amazing" are keyword tokens.
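The split above can be reproduced with a few lines of Python. This is a minimal sketch using a regular expression; the function name tokenise is made up for this example, and real tokenisers handle many more cases (contractions, emojis, URLs).

```python
import re

def tokenise(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters or digits; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenise("The biryani recipe was absolutely amazing!")
print(tokens)
# ['The', 'biryani', 'recipe', 'was', 'absolutely', 'amazing', '!']
```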

Part 2

NLP Text Processing Pipeline

Lowercasing

"Amazing" and "amazing" → both become "amazing". Without this, they'd be treated as different words.

Remove Stop Words

Stop words are very common words (the, is, a, was, and) that carry little meaning. Removing them reduces noise. Keep words like "not" — they change sentiment!
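Here is a small sketch of lowercasing followed by stop-word removal. The stop-word list is a tiny hand-picked set for illustration only, and the function name remove_stop_words is an assumption of this example. Notice that "not" is deliberately left out of the list so it survives.

```python
# Tiny illustrative stop-word list; real lists contain hundreds of words.
STOP_WORDS = {"the", "is", "a", "was", "and"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Lowercase each token, then drop the ones found in the stop-word list."""
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t not in stop_words]

cleaned = remove_stop_words(["The", "food", "was", "not", "bad"])
print(cleaned)
# ['food', 'not', 'bad']
```

Ready-made lists differ from one another. scikit-learn's built-in English list, for instance, includes "not", so it is worth checking a list before using it for sentiment work.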

Stemming / Lemmatisation

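Reduce words to their root form so that all variations count as the same word. Stemming chops off endings by rule ("running", "runs" → "run"). Lemmatisation uses a dictionary, so it can also map irregular forms such as "ran" → "run".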
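To see the difference between the two ideas, here is a toy sketch. The suffix rules and the IRREGULAR lookup table are invented for this example and are far cruder than a real stemmer or lemmatiser (such as those in NLTK or spaCy).

```python
# Toy lookup table for irregular forms (hypothetical, for illustration only).
IRREGULAR = {"ran": "run", "ate": "eat"}

def toy_stem(word):
    """Strip a few common suffixes by rule. Crude on purpose."""
    for suffix in ("ning", "ing", "s"):
        # Only strip if at least three letters remain
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemma(word):
    """Check the irregular-form table first, then fall back to the stemmer."""
    return IRREGULAR.get(word, toy_stem(word))

for w in ["running", "runs", "ran"]:
    print(w, "-> stem:", toy_stem(w), "| lemma:", toy_lemma(w))
```

The rule-based stemmer leaves "ran" unchanged, while the dictionary lookup maps it to "run".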

Vectorisation (Bag-of-Words or TF-IDF)

Convert text to numbers so ML models can process it. The computer can only work with numbers — text must be encoded.
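As a rough illustration of "text to numbers", the sketch below counts words by hand. The two comments are made up for this example, and the vocabulary is simply every distinct word in alphabetical order.

```python
from collections import Counter

# Two made-up comments, already lowercased and cleaned
comments = ["tasty biryani tasty raita", "cold biryani"]

# Vocabulary: every distinct word, sorted so each word has a fixed column
vocabulary = sorted({word for c in comments for word in c.split()})

# One row of word counts per comment
vectors = [[Counter(c.split())[word] for word in vocabulary] for c in comments]

print(vocabulary)  # ['biryani', 'cold', 'raita', 'tasty']
print(vectors)     # [[1, 0, 1, 2], [1, 1, 0, 0]]
```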

Feed into ML Model

The numeric representation goes into a classifier (Naive Bayes, Logistic Regression, etc.) for classification or into a regressor for sentiment scoring.

Part 3

Bag-of-Words vs TF-IDF

Bag-of-Words (BoW) counts how often each word appears in a document. Two sentences become rows in a word-count table:

Sentence	biryani	amazing	recipe	bad	chicken
"The biryani recipe is amazing"	1	1	1	0	0
"The chicken biryani was bad"	1	0	0	1	1

TF-IDF (Term Frequency–Inverse Document Frequency) is smarter. It gives higher weight to words that appear often in this document but rarely across all documents. "the" appears in every document → low weight. "biryani" appears mainly in food reviews → higher weight. For text classification, TF-IDF often gives better results than plain BoW.
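The weighting idea can be checked with a hand calculation. This sketch uses the basic textbook formula, tf × log(N / df); the three tiny documents and the function name tf_idf are assumptions for illustration. Libraries such as scikit-learn use a smoothed and normalised variant, so their numbers will differ.

```python
import math

# Three made-up documents, already tokenised and lowercased
docs = [
    ["the", "biryani", "was", "amazing"],
    ["the", "delivery", "was", "late"],
    ["the", "phone", "battery", "is", "weak"],
]

def tf_idf(term, doc, all_docs):
    """Textbook TF-IDF: term frequency times log(N / document frequency)."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in all_docs if term in d)
    if df == 0:
        return 0.0  # term never appears anywhere
    return tf * math.log(len(all_docs) / df)

score_the = tf_idf("the", docs[0], docs)
score_biryani = tf_idf("biryani", docs[0], docs)
print(f"the: {score_the:.3f}  biryani: {score_biryani:.3f}")
```

"the" appears in all three documents, so log(3 / 3) = 0 and its weight is zero. "biryani" appears in only one, so it gets a positive weight.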

Part 4

Sentiment Analysis

Sentiment analysis classifies text as positive, negative, or neutral. Let's see how three comments about the same samosa get very different scores:

Score	Label	Comment
+0.92	Very Positive	"This samosa is absolutely delicious, best I've ever had!"
+0.02	Neutral	"I ordered the samosa. It arrived in 30 minutes."
-0.85	Very Negative	"Terrible. The samosa was cold and completely tasteless."

Tricky case (negation): "The food was not bad at all" is actually positive. Basic BoW models struggle here because they count "not" and "bad" as separate words, and "bad" pulls the score negative. Modern models like BERT handle negation much better because they look at word context, not just word counts.
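The negation problem is easy to reproduce with a toy word-score approach. In this sketch the LEXICON scores are made-up numbers, not from any real sentiment lexicon, and the negation rule (flip the next sentiment word after "not") is a deliberately crude assumption.

```python
import re

# Hypothetical word scores invented for this example
LEXICON = {"delicious": 1.0, "best": 0.8, "good": 0.6,
           "bad": -0.7, "cold": -0.4, "terrible": -1.0, "tasteless": -0.8}
NEGATIONS = {"not", "never", "no"}

def score(text, handle_negation=True):
    """Add up word scores; optionally flip the first sentiment word after a negation."""
    total = 0.0
    negate = False
    for word in re.findall(r"[a-z']+", text.lower()):
        if word in NEGATIONS:
            negate = handle_negation
            continue
        value = LEXICON.get(word, 0.0)
        if negate and value != 0.0:
            value = -value   # "not bad" counts as positive
            negate = False
        total += value
    return total

comment = "The food was not bad at all"
print(score(comment, handle_negation=False))  # -0.7: only "bad" is counted
print(score(comment, handle_negation=True))   # 0.7: "bad" is flipped by "not"
```

Hand-written rules like this break quickly on sarcasm or longer sentences, which is one reason context-aware models are preferred.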

Try It in Colab

Build a Simple Sentiment Classifier

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Small dataset of food review comments (label: 1=positive, 0=negative)
reviews = [
    "This biryani is absolutely amazing and delicious",
    "The best dosa I have ever tasted",
    "Wonderful service and great food",
    "Loved the chole bhature completely",
    "Excellent taste, will order again",
    "Terrible food, very bad experience",
    "The samosa was cold and tasteless",
    "Horrible service, never ordering again",
    "Food arrived late and was completely soggy",
    "Worst delivery experience, very disappointing",
    "Good food but slightly expensive",
    "Decent taste, not the best but okay",
    "Average experience, nothing special",
    "Pretty good value for the price",
    "Nice flavours but could be better",
]
labels = [1,1,1,1,1, 0,0,0,0,0, 1,0,0,1,1]

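# Step 1: Train/test split on the raw text
# stratify keeps both classes in the training and test sets
train_text, test_text, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.3, random_state=42, stratify=labels)

# Step 2: TF-IDF vectorisation, with the vocabulary learned from training text only
# Note: the built-in 'english' stop-word list also removes "not"
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1,2))
X_train = vectorizer.fit_transform(train_text)
X_test = vectorizer.transform(test_text)

# Step 3: Train Naive Bayes classifier
model = MultinomialNB()
model.fit(X_train, y_train)

# Step 4: Evaluate (with so few examples, the scores are only a rough guide)
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(classification_report(y_test, y_pred, labels=[0, 1],
      target_names=['Negative', 'Positive'], zero_division=0))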

# Step 5: Predict new comments
new_reviews = [
    "The paneer tikka was fantastic",
    "Very disappointing, food was cold"
]
new_X = vectorizer.transform(new_reviews)
predictions = model.predict(new_X)
for r, p in zip(new_reviews, predictions):
    label = "Positive ✓" if p == 1 else "Negative ✗"
    print(f"  {label}: {r}")

Try extending this: Replace the reviews list with real YouTube comments (copy-paste them). Add more examples in Hindi/Hinglish. See how the model performs on mixed-language text — this is a real NLP research challenge!

Indian Context

NLP Challenges for Indian Languages

Code-switching: Many Indians mix English and their regional language mid-sentence ("Yeh biryani toh bahut tasty hai!"). Models trained only on English text often fail on this.
Script variety: India's 22 scheduled languages, including Telugu, Hindi, Tamil and Bengali, are written in many different scripts. Text processing requires Unicode-aware tools.
Low-resource languages: English NLP datasets can contain billions of examples, while datasets for a language like Telugu are much smaller. Models for low-resource languages need more careful design.
Tools that help: IndicBERT (AI4Bharat project), Hugging Face Indian language models, Google Translate API
Opportunity: India needs more NLP researchers and datasets for its 22 scheduled languages. This is an exciting area for young Indian AI builders!
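As a small taste of Unicode-aware processing, the sketch below detects which scripts appear in a comment using Python's standard unicodedata module. The function name scripts_in is made up for this example, and the test sentence is a variation of the Hinglish example above with one word written in Devanagari. It relies on the fact that Unicode character names begin with the script name.

```python
import unicodedata

def scripts_in(text):
    """Return the set of script names used by the letters in text."""
    found = set()
    for ch in text:
        if ch.isalpha():
            # Character names start with the script, e.g. "DEVANAGARI LETTER BA"
            name = unicodedata.name(ch, "UNKNOWN")
            found.add(name.split()[0])
    return found

mixed = scripts_in("Yeh biryani toh बहुत tasty hai!")
print(sorted(mixed))  # ['DEVANAGARI', 'LATIN']
```

A check like this could be a first step before deciding which tokeniser or model to use for a comment.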

🧪 Check Your Understanding — Lesson 8 Quiz

1. What is tokenisation in NLP?

a) Giving a unique ID number to each document

b) Breaking text into individual units (usually words or characters) for processing

c) Translating text from one language to another

d) Removing all punctuation marks from text

2. Why do we remove stop words in NLP preprocessing?

a) They cause programming errors in Python

b) They are very common words (the, is, a) that add noise but carry little meaning for classification

c) They slow down tokenisation

d) They make the dataset too large

3. Bag-of-Words (BoW) represents a sentence as:

a) A sequence of word embeddings

b) The sentence split character by character

c) A count of how many times each word appears in the sentence

d) The sentiment score of each word

4. TF-IDF is better than plain Bag-of-Words because:

a) It uses fewer features and is always faster

b) It gives higher weight to words that are important in a specific document but rare across all documents

c) It handles grammar and sentence structure

d) It automatically translates text

5. A sentiment analysis model gives the comment "The food was not bad at all" a negative score. This is most likely because:

a) The model is broken

b) The model correctly identified negative sentiment

c) A simple BoW model saw the word "bad" and scored it negative, missing the negation "not"

d) Food reviews are always negative

6. In the Colab code, why do we use vectorizer.transform() for new_reviews instead of vectorizer.fit_transform()?

a) transform() is faster than fit_transform()

b) We must use the same vocabulary learned from training data. fit_transform() would create a new vocabulary.

c) fit_transform() doesn't work on lists

d) There's no difference — both give the same result

7. "Code-switching" in Indian NLP refers to:

a) Switching between Python and JavaScript code

b) Changing the programming language used for NLP tasks

c) Mixing two languages in a single sentence (e.g., English and Hindi mid-sentence)

d) Writing code in Telugu script

8. Stemming and lemmatisation both aim to:

a) Remove all words shorter than 3 characters

b) Translate words to English

c) Reduce different word forms to a common base so "running", "ran", "runs" all count as the same word

d) Convert words to their TF-IDF scores
