Sneha runs a small YouTube channel about Indian street food. She gets hundreds of comments in both English and Hindi every week. One month she asked herself: "Are people mostly positive or negative about my chaat recipe video? I can't read 400 comments manually."
Her older sister, who is studying computer science, said: "Let's write a simple NLP script. It will read all 400 comments and tell you the overall sentiment." They spent 30 minutes coding and found that the comments were 78% positive, 12% neutral, and 10% negative. Sneha now knew which recipes to continue and which needed improvement. Today you'll learn the core techniques that make this possible.
Natural Language Processing (NLP) is the branch of AI that handles human language: reading, understanding, translating, and generating text and speech. It powers tools such as Google Translate, Siri, Google Search autocomplete, spam filters, and ChatGPT.
The challenge: human language is ambiguous, sarcastic, context-dependent, and full of cultural references. The same word can mean different things in different sentences. NLP tools must handle all of this.
Before any processing, text must be split into tokens (individual units, usually words or characters). This is the first step in almost every NLP pipeline.
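After tokenisation, each token can be marked as either a keyword or a likely stop word: a very common word such as "the" or "is" that carries little meaning on its own.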
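To make the idea concrete, here is a minimal sketch of tokenisation using only Python's standard library. The `STOP_WORDS` set and the `tokenize` and `tag_tokens` helpers are illustrative assumptions, not part of any library; real projects usually rely on a much longer stop word list and a proper tokeniser.

```python
import re

# Tiny, hand-picked stop word list (an assumption for this demo only)
STOP_WORDS = {"the", "is", "a", "an", "and", "was", "of", "to", "in", "this"}

def tokenize(text):
    """Lowercase the text and split it into runs of English letters."""
    return re.findall(r"[a-z]+", text.lower())

def tag_tokens(text):
    """Label each token as a keyword or a likely stop word."""
    return [
        (token, "stop word" if token in STOP_WORDS else "keyword")
        for token in tokenize(text)
    ]

comment = "The biryani recipe is amazing!"
tokens = tokenize(comment)
tagged = tag_tokens(comment)

print(tokens)  # ['the', 'biryani', 'recipe', 'is', 'amazing']
for token, kind in tagged:
    print(f"{token:10s} {kind}")
```

This simple pattern only handles English letters. Text in Hindi or other Indian scripts needs a tokeniser built for those scripts.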
Bag-of-Words (BoW) counts how often each word appears in a document. Two sentences become rows in a word-count table (common words such as "the", "is", and "was" are left out here):
| Sentence | biryani | amazing | recipe | bad | chicken |
|---|---|---|---|---|---|
| "The biryani recipe is amazing" | 1 | 1 | 1 | 0 | 0 |
| "The chicken biryani was bad" | 1 | 0 | 0 | 1 | 1 |
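The table above can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the `bag_of_words` helper and the fixed `vocabulary` list are assumptions made for illustration. In practice, scikit-learn's `CountVectorizer` builds the vocabulary and the counts for you.

```python
import re
from collections import Counter

sentences = [
    "The biryani recipe is amazing",
    "The chicken biryani was bad",
]

# Columns of the table, in the same order as above
vocabulary = ["biryani", "amazing", "recipe", "bad", "chicken"]

def bag_of_words(text, vocabulary):
    """Count how many times each vocabulary word occurs in the text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[word] for word in vocabulary]

bow_rows = [bag_of_words(sentence, vocabulary) for sentence in sentences]

print(vocabulary)
for sentence, row in zip(sentences, bow_rows):
    print(row, "<-", sentence)
# [1, 1, 1, 0, 0] <- The biryani recipe is amazing
# [1, 0, 0, 1, 1] <- The chicken biryani was bad
```

Notice that BoW ignores word order: each sentence is reduced to a row of counts.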
TF-IDF (Term Frequency-Inverse Document Frequency) is smarter. It gives higher weight to words that appear often in this document but rarely across all documents. "the" appears in every document, so it gets a low weight. "biryani" appears mainly in food reviews, so it gets a higher weight. TF-IDF often produces better results than plain BoW.
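Here is a minimal sketch of the textbook TF-IDF calculation, where the weight of a word is (count in document / words in document) × log(number of documents / documents containing the word). The `tf_idf` helper and the three sample comments are assumptions for illustration. Libraries such as scikit-learn use a smoothed variant of this formula, so their numbers will differ.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: weight} dict per document using plain TF-IDF."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights

docs = [
    "the biryani was amazing",
    "the service was slow",
    "the delivery was late",
]
weights = tf_idf(docs)

for word, weight in sorted(weights[0].items(), key=lambda item: -item[1]):
    print(f"{word:10s} {weight:.3f}")
```

In this tiny example, "the" and "was" appear in all three comments, so their weight is 0, while "biryani" and "amazing" appear in only one comment and get the highest weights.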
Sentiment analysis classifies text as positive, negative, or neutral. The script below trains a small positive/negative classifier on food reviews, using TF-IDF features and a Naive Bayes model:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Small dataset of food review comments (label: 1=positive, 0=negative)
reviews = [
    "This biryani is absolutely amazing and delicious",
    "The best dosa I have ever tasted",
    "Wonderful service and great food",
    "Loved the chole bhature completely",
    "Excellent taste, will order again",
    "Terrible food, very bad experience",
    "The samosa was cold and tasteless",
    "Horrible service, never ordering again",
    "Food arrived late and was completely soggy",
    "Worst delivery experience, very disappointing",
    "Good food but slightly expensive",
    "Decent taste, not the best but okay",
    "Average experience, nothing special",
    "Pretty good value for the price",
    "Nice flavours but could be better",
]
labels = [1, 1, 1, 1, 1,  0, 0, 0, 0, 0,  1, 0, 0, 1, 1]

# Step 1: TF-IDF vectorisation (single words and word pairs, English stop words removed)
# Note: fitting on all reviews before splitting keeps this demo short. In a real
# project, fit the vectorizer on the training set only.
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
X = vectorizer.fit_transform(reviews)

# Step 2: Train/test split (stratify keeps both classes in the small test set)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, random_state=42, stratify=labels)

# Step 3: Train Naive Bayes classifier
model = MultinomialNB()
model.fit(X_train, y_train)

# Step 4: Evaluate (with so few test comments, treat the scores as a rough guide)
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(classification_report(y_test, y_pred,
                            labels=[0, 1],
                            target_names=['Negative', 'Positive'],
                            zero_division=0))

# Step 5: Predict new comments
new_reviews = [
    "The paneer tikka was fantastic",
    "Very disappointing, food was cold"
]
new_X = vectorizer.transform(new_reviews)
predictions = model.predict(new_X)
for r, p in zip(new_reviews, predictions):
    label = "Positive" if p == 1 else "Negative"
    print(f"  {label}: {r}")
```

NLP for Indian languages brings its own challenges and opportunities:

- Code-switching: Many Indian speakers mix English and their regional language mid-sentence ("Yeh biryani toh bahut tasty hai!"). Models trained only on English often struggle with this.
- Script variety: Hindi (written in Devanagari), Telugu, Tamil, Bengali, and more. India's 22 scheduled languages are written in many different scripts, so text processing requires Unicode-aware tools.
- Low-resource languages: English NLP datasets can have billions of examples, while datasets for languages such as Telugu are far smaller, often only millions. Models for low-resource languages need more careful design.
- Tools that help: IndicBERT (AI4Bharat project), Hugging Face Indian language models, Google Translate API
- Opportunity: India needs more NLP researchers and datasets for its 22 scheduled languages. This is an exciting area for young Indian AI builders!
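To see what "Unicode-aware" means in practice, here is a small sketch that guesses which scripts appear in a comment, using Python's built-in `unicodedata` module. The `scripts_in` helper is an illustrative assumption, and taking the first word of each character's Unicode name is only a rough heuristic, not an official script lookup.

```python
import unicodedata

def scripts_in(text):
    """Guess the scripts used in text from each character's Unicode name."""
    scripts = set()
    for ch in text:
        # Only look at letters (L*) and combining marks (M*), e.g. vowel signs
        if unicodedata.category(ch)[0] in ("L", "M"):
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts

mixed_script = "Yeh biryani बहुत tasty है"
romanised = "Yeh biryani toh bahut tasty hai!"

print(sorted(scripts_in(mixed_script)))  # ['DEVANAGARI', 'LATIN']
print(sorted(scripts_in(romanised)))     # ['LATIN']
```

The second comment is also code-switched, but because the Hindi words are typed in Latin letters, a script check alone cannot spot it. That is one reason code-switched text is hard for models trained only on English.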