NLP · Information Retrieval
Semantic Search and Information Retrieval Engine
A modular information-retrieval case study comparing a transparent lexical baseline with context-aware semantic search, including a multilingual workflow.
Individual modular retrieval research build
01 Clean: Text preprocessing
02 Compare: TF-IDF baseline
03 Encode: Dense embeddings
04 Rank: Evaluate relevance
Classical-to-semantic retrieval map
The visual explains the comparison path without inventing benchmark improvements.
Scope
Role and problem
My role: Built the reusable preprocessing, retrieval, ranking, and evaluation workflow.
Keyword matching is inspectable and fast, but the same meaning can be expressed with different phrasing or in a different language, which exact-term matching misses. The system explores how classical TF-IDF baselines and transformer embeddings can be compared honestly rather than treated as competing slogans.
Architecture
System flow
Raw text and documents
Cleaning and tokenisation
TF-IDF baseline
Sentence-transformer embeddings
Vector search
Ranking logic
Retrieval evaluation
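To make the lexical half of this flow concrete, here is a minimal sketch of cleaning, TF-IDF weighting, and cosine ranking using only the Python standard library. It is an illustration under stated assumptions, not the project's implementation: the function names (`tokenise`, `build_tfidf`, `weight`, `cosine`, `search`), the smoothed IDF formula, and the three sample documents are all hypothetical. In the dense path, the sparse vectors would be replaced by sentence-transformer embeddings while the ranking step stays conceptually the same.

```python
import math
import re
from collections import Counter


def tokenise(text):
    """Lowercase the text and keep runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def weight(tokens, idf):
    """Sparse TF-IDF vector; terms never seen in the corpus are dropped."""
    tf = Counter(tokens)
    return {term: count * idf[term] for term, count in tf.items() if term in idf}


def build_tfidf(docs):
    """Return (document vectors, idf table) with smoothed IDF (an assumed choice)."""
    tokenised = [tokenise(doc) for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }
    return [weight(tokens, idf) for tokens in tokenised], idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, docs, k=3):
    """Rank documents by cosine similarity; ties fall back to document order."""
    vectors, idf = build_tfidf(docs)
    query_vec = weight(tokenise(query), idf)
    scored = [(cosine(query_vec, vec), i) for i, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(i, round(score, 3)) for score, i in scored[:k]]


# Hypothetical sample documents, for illustration only.
docs = [
    "Reset your password from the account settings page.",
    "Shipping times vary by region and carrier.",
    "Recover access to a locked account by verifying your email.",
]
results = search("how do I reset my password", docs, k=2)
```

In this toy corpus the third document is about regaining account access but shares no tokens with the query, so the lexical baseline scores it zero. That is the kind of gap the embedding path is meant to probe.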
Evidence
Measured signals
TF-IDF: Transparent lexical baseline. Maintains an interpretable keyword-oriented comparison path.
Dense: Semantic retrieval path. Uses embedding-based similarity for context-aware matching.
Top-k: Retrieval evaluation. Evaluates ranked results across representative queries.
Public scope: Focuses on architecture and evaluation design without claiming unpublished comparative performance.
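As an illustration of what top-k evaluation over ranked results can look like, the sketch below computes precision@k, recall@k, and reciprocal rank per query and averages them. The metric choice, the function names, and the document IDs and relevance judgments are assumptions made for the example. They are not figures or methods reported by this project.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Share of the top-k slots holding a relevant document (divides by k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def recall_at_k(ranked_ids, relevant_ids, k):
    """Share of all relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def evaluate(runs, judgments, k):
    """Average each metric over the judged queries; missing runs count as empty."""
    totals = {f"precision@{k}": 0.0, f"recall@{k}": 0.0, "mrr": 0.0}
    for query_id, relevant in judgments.items():
        ranked = runs.get(query_id, [])
        totals[f"precision@{k}"] += precision_at_k(ranked, relevant, k)
        totals[f"recall@{k}"] += recall_at_k(ranked, relevant, k)
        totals["mrr"] += reciprocal_rank(ranked, relevant)
    n_queries = len(judgments)
    return {name: value / n_queries for name, value in totals.items()} if n_queries else totals


# Made-up ranked lists and judgments, for illustration only.
runs = {"q1": ["d3", "d1", "d7"], "q2": ["d2", "d9", "d4"]}
judgments = {"q1": {"d1", "d5"}, "q2": {"d4"}}
summary = evaluate(runs, judgments, k=3)
```

Because the functions only need ranked IDs and relevance judgments, the same code can score the TF-IDF path and the dense path without knowing how either produced its ranking.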
Contribution
- Built reusable preprocessing, lexical baseline, embedding, vector-search, and ranking components.
- Separated retrieval quality from response fluency.
- Kept multilingual support as an evaluated workflow requirement rather than a decorative feature claim.
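One way to keep such components reusable is to put every retriever behind the same small interface, so evaluation code never depends on how a ranking was produced. The sketch below is a hypothetical shape for that idea: the `Retriever` protocol, the `KeywordOverlapRetriever` stand-in, and `top_ids` are illustrative names, not the project's actual classes.

```python
from typing import Protocol


class Retriever(Protocol):
    """Anything that returns (doc_id, score) pairs, best first."""

    def search(self, query: str, k: int) -> list[tuple[str, float]]:
        ...


class KeywordOverlapRetriever:
    """Toy lexical stand-in: scores a document by its count of shared tokens."""

    def __init__(self, docs: dict[str, str]) -> None:
        self._tokens = {
            doc_id: set(text.lower().split()) for doc_id, text in docs.items()
        }

    def search(self, query: str, k: int) -> list[tuple[str, float]]:
        query_tokens = set(query.lower().split())
        scored = [
            (doc_id, float(len(query_tokens & tokens)))
            for doc_id, tokens in self._tokens.items()
        ]
        # Highest score first; ties are broken by document ID for determinism.
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        return scored[:k]


def top_ids(retriever: Retriever, query: str, k: int) -> list[str]:
    """Evaluation-facing helper: works with any object matching Retriever."""
    return [doc_id for doc_id, _ in retriever.search(query, k)]


# Hypothetical documents, for illustration only.
docs = {
    "a": "vector search with embeddings",
    "b": "keyword search baseline",
    "c": "cooking recipes",
}
ranked = top_ids(KeywordOverlapRetriever(docs), "keyword baseline", k=2)
```

An embedding-based retriever would implement the same `search` signature, which lets the lexical and dense paths be swapped inside one evaluation loop.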
Lessons
- Classical baselines remain useful because they are transparent and cheap.
- Dense retrieval should earn its complexity with query-level evidence.
- Retrieval systems need reviewed examples and evaluation, not only architecture diagrams.
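Query-level evidence can be as simple as a per-query table of where each system first surfaces a relevant document, plus a win/loss/tie count. The following is a minimal sketch under assumptions: `first_hit_score`, `compare_per_query`, and `tally` are hypothetical helpers, and the runs and judgments are invented toy data that say nothing about how the real systems compare.

```python
def first_hit_score(ranked_ids, relevant_ids):
    """Reciprocal rank of the first relevant result, 0.0 if none is found."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def compare_per_query(baseline_runs, candidate_runs, judgments):
    """One row per judged query: (query_id, baseline, candidate, delta)."""
    rows = []
    for query_id in sorted(judgments):
        relevant = judgments[query_id]
        base = first_hit_score(baseline_runs.get(query_id, []), relevant)
        cand = first_hit_score(candidate_runs.get(query_id, []), relevant)
        rows.append((query_id, base, cand, cand - base))
    return rows


def tally(rows, tolerance=1e-9):
    """Count queries where the candidate is better, worse, or equal."""
    deltas = [row[3] for row in rows]
    wins = sum(1 for delta in deltas if delta > tolerance)
    losses = sum(1 for delta in deltas if delta < -tolerance)
    return {"wins": wins, "losses": losses, "ties": len(deltas) - wins - losses}


# Invented toy data: "baseline" stands for a lexical run, "candidate" for a dense run.
judgments = {"q1": {"d1"}, "q2": {"d2"}, "q3": {"d3"}}
baseline_runs = {"q1": ["d1", "d9"], "q2": ["d8", "d2"], "q3": ["d7", "d6"]}
candidate_runs = {"q1": ["d9", "d1"], "q2": ["d2", "d8"], "q3": ["d6", "d7"]}

rows = compare_per_query(baseline_runs, candidate_runs, judgments)
counts = tally(rows)
```

Reading the rows rather than only an averaged score shows which queries improve and which regress, which is where reviewed examples become useful.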
Limitations
- No comparative retrieval improvement is claimed publicly.
- The public scope focuses on method architecture and evaluation design.
- The architecture description is kept separate from unverified outcome claims.
Stack
- TF-IDF
- Sentence Transformers
- Vector Search
- Ranking
- Information Retrieval