NLP · Information Retrieval

Semantic Search and Information Retrieval Engine

A modular information-retrieval case study comparing transparent lexical baselines with context-aware semantic search, including a multilingual workflow.

Individual modular retrieval research build

Clean: Text preprocessing
Compare: TF-IDF baseline
Encode: Dense embeddings
Rank: Evaluate relevance

Classical-to-semantic retrieval map

The visual explains the comparison path without inventing benchmark improvements.

Scope

Role and problem

My role: Built the reusable preprocessing, retrieval, ranking, and evaluation workflow.

Keyword matching is inspectable and fast, but it misses relevant documents when the same meaning is expressed in different words or in another language. The system explores how classical TF-IDF baselines and transformer embeddings can be compared honestly rather than treated as competing slogans.

Architecture

System flow

Raw text and documents

Cleaning and tokenisation

TF-IDF baseline

Sentence-transformer embeddings

Vector search

Ranking logic

Retrieval evaluation
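
The first stage of the flow, cleaning and tokenisation, can be sketched as below. This is a minimal illustration using only the Python standard library, not the project's actual code: the function names `clean` and `tokenise` and the tiny English `STOP_WORDS` set are assumptions made for the example, and a multilingual build would likely need a stop-word list per language or none at all.

```python
import re
import unicodedata

# Hypothetical stop-word list for illustration only (English, deliberately tiny).
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "to", "in"})

# \w+ matches runs of Unicode letters, digits, and underscores.
_TOKEN_RE = re.compile(r"\w+")


def clean(text):
    """Normalise Unicode forms, fold case, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = text.casefold()  # stronger than lower(): e.g. German sharp s becomes "ss"
    return " ".join(text.split())


def tokenise(text, stop_words=STOP_WORDS):
    """Split cleaned text into word tokens and drop stop words."""
    return [tok for tok in _TOKEN_RE.findall(clean(text)) if tok not in stop_words]


if __name__ == "__main__":
    print(tokenise("The  Quick\tbrown FOX"))
```

Using `casefold` and NFKC normalisation rather than plain lowercasing is one way to keep the same cleaning step usable across scripts, which matters if the lexical and dense paths are to be compared on identical input.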

Evidence

Measured signals

TF-IDF

Transparent lexical baseline

Maintains an interpretable keyword-oriented comparison path.
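
To make "interpretable" concrete, here is a small TF-IDF sketch written from scratch so that every weight can be inspected. It is an assumption-laden illustration rather than the project's implementation: the smoothed IDF formula, the function names (`build_tfidf`, `weigh`, `cosine`, `search`, `explain`), and the toy documents are all chosen for the example.

```python
import math
from collections import Counter


def weigh(tokens, idf):
    """Weight raw term counts by IDF; terms unseen at fit time are ignored."""
    counts = Counter(tokens)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def build_tfidf(docs_tokens):
    """Return (idf, vectors); each vector maps term -> tf * idf weight."""
    n_docs = len(docs_tokens)
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))
    # Smoothed IDF (an assumption of this sketch): always positive and finite.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return idf, [weigh(tokens, idf) for tokens in docs_tokens]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query_tokens, idf, vectors, k=3):
    """Return up to k (doc_index, score) pairs with a non-zero score, best first."""
    q = weigh(query_tokens, idf)
    scored = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    scored = [(s, i) for s, i in scored if s > 0.0]
    scored.sort(key=lambda p: (-p[0], p[1]))
    return [(i, s) for s, i in scored[:k]]


def explain(query_tokens, idf, doc_vector):
    """Shared terms with their unnormalised contribution to the dot product."""
    q = weigh(query_tokens, idf)
    return {t: q[t] * doc_vector[t] for t in q if t in doc_vector}


if __name__ == "__main__":
    docs = ["cheap flights to paris", "hotel booking in paris", "train timetable berlin"]
    idf, vectors = build_tfidf([d.split() for d in docs])
    for doc_index, score in search("paris flights".split(), idf, vectors):
        print(doc_index, round(score, 3), explain("paris flights".split(), idf, vectors[doc_index]))
```

The `explain` helper is the point of the exercise: for any result, the matching terms and their weights can be listed directly, which is the kind of inspection a dense model does not offer out of the box.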

Dense

Semantic retrieval path

Uses embedding-based similarity for context-aware matching.
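
Embedding-based similarity usually comes down to cosine similarity between a query vector and each document vector. The sketch below shows only that ranking step, under stated assumptions: the three-dimensional vectors are hand-made placeholders standing in for sentence-transformer output, and `normalise` and `top_k` are illustrative names. A real build would obtain the vectors from an encoder model and would probably use a vector index instead of a linear scan.

```python
import math


def normalise(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query_vec, doc_vecs, k=3):
    """Rank documents by cosine similarity to the query embedding (brute force)."""
    q = normalise(query_vec)
    scored = []
    for doc_id, vec in doc_vecs.items():
        d = normalise(vec)
        # On unit vectors the dot product equals cosine similarity.
        scored.append((sum(a * b for a, b in zip(q, d)), doc_id))
    scored.sort(key=lambda p: (-p[0], p[1]))
    return [(doc_id, score) for score, doc_id in scored[:k]]


if __name__ == "__main__":
    # Placeholder vectors, not real embeddings.
    doc_vecs = {"a": [1.0, 0.0, 0.0], "b": [0.9, 0.1, 0.0], "c": [0.0, 1.0, 0.0]}
    print(top_k([1.0, 0.0, 0.0], doc_vecs, k=2))
```

Keeping the ranking function independent of how vectors are produced means the same code can score TF-IDF-derived and model-derived vectors, which helps keep the comparison between the two paths like for like.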

Top-k

Retrieval evaluation

Evaluates ranked results across representative queries.
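
Evaluating ranked results across queries typically relies on a few standard top-k metrics. The following sketch computes precision@k, recall@k, and mean reciprocal rank from ranked document ids and a set of relevant ids per query. The choice of these three metrics, the function names, and the data layout (`runs` and `judgments` dictionaries keyed by query id) are assumptions for illustration; the source does not state which metrics the project uses.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant result, or 0.0 if none is found."""
    for position, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / position
    return 0.0


def evaluate(runs, judgments, k=3):
    """Average each metric over every query that has relevance judgments."""
    totals = {"precision": 0.0, "recall": 0.0, "mrr": 0.0}
    for qid, relevant in judgments.items():
        ranked = runs.get(qid, [])  # a query with no results scores zero
        totals["precision"] += precision_at_k(ranked, relevant, k)
        totals["recall"] += recall_at_k(ranked, relevant, k)
        totals["mrr"] += reciprocal_rank(ranked[:k], relevant)
    n = len(judgments)
    return {name: (total / n if n else 0.0) for name, total in totals.items()}
```

Because `evaluate` takes only ranked ids, the same function can score the TF-IDF path and the dense path against one set of judgments, so any difference between them comes from the retrieval method rather than from the scoring code.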

Public scope: Covers architecture and evaluation design, without claiming unpublished comparative performance.

Contribution

Built reusable preprocessing, lexical baseline, embedding, vector-search, and ranking components.
Separated retrieval quality from response fluency.
Kept multilingual support as an evaluated workflow requirement rather than a decorative feature claim.

Lessons

Classical baselines remain useful because they are transparent and cheap.
Dense retrieval should earn its complexity with query-level evidence.
Retrieval systems need reviewed examples and evaluation, not only architecture diagrams.
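
One way to gather query-level evidence is to compare two systems query by query instead of looking only at averages. The sketch below groups query ids by which system scored higher on some per-query metric. The function name `compare_per_query`, the tolerance, and the scores in the usage example are hypothetical placeholders, not measurements from this project.

```python
def compare_per_query(baseline, candidate, tolerance=1e-9):
    """Group query ids by which system scored higher on a per-query metric.

    Both arguments map query id -> score; only queries present in both are compared.
    """
    outcome = {"candidate_better": [], "baseline_better": [], "tied": []}
    for qid in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[qid] - baseline[qid]
        if delta > tolerance:
            outcome["candidate_better"].append(qid)
        elif delta < -tolerance:
            outcome["baseline_better"].append(qid)
        else:
            outcome["tied"].append(qid)
    return outcome


if __name__ == "__main__":
    # Made-up placeholder scores purely to show the call shape.
    tfidf_scores = {"q1": 1.0, "q2": 0.5, "q3": 0.0}
    dense_scores = {"q1": 1.0, "q2": 0.25, "q3": 0.5}
    print(compare_per_query(tfidf_scores, dense_scores))
```

Listing the queries in each group gives concrete examples to review by hand, which is where a paraphrase or cross-language query that the lexical baseline misses would show up.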

Limitations

No comparative retrieval improvement is claimed publicly.
The public scope focuses on method architecture and evaluation design.
Architecture remains separate from unverified outcome claims.

Stack

TF-IDF
Sentence Transformers
Vector Search
Ranking
Information Retrieval