NLP · Information Retrieval

Semantic Search and Information Retrieval Engine

A modular information-retrieval case study that compares transparent lexical baselines with context-aware semantic search and multilingual retrieval.

Individual research build: modular retrieval

01 · Clean · Text preprocessing
02 · Compare · TF-IDF baseline
03 · Encode · Dense embeddings
04 · Rank · Evaluate relevance

Classical-to-semantic retrieval map

The visual shows the comparison path only; it does not claim benchmark improvements.

Scope

Role and problem

My role: Built the reusable preprocessing, retrieval, ranking, and evaluation workflow.

Keyword matching is inspectable and fast, but it misses relevant documents when the same meaning is phrased differently or written in another language. The system explores how classical TF-IDF baselines and transformer embeddings can be compared honestly rather than treated as competing slogans.

Architecture

System flow

01 · Raw text and documents
02 · Cleaning and tokenisation
03 · TF-IDF baseline
04 · Sentence-transformer embeddings
05 · Vector search
06 · Ranking logic
07 · Retrieval evaluation
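The cleaning and tokenisation step in the flow above could look roughly like the sketch below. It is an illustration, not the project's actual code: the function names `clean_text` and `tokenise`, the tiny `STOP_WORDS` set, and the choice of NFKC normalisation are all assumptions made for the example.

```python
import re
import unicodedata

# Hypothetical stop-word list for illustration only; a multilingual build
# would need a fuller list per language, or none at all.
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}


def clean_text(text: str) -> str:
    """Normalise Unicode, fold case, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = text.casefold()
    return re.sub(r"\s+", " ", text).strip()


def tokenise(text: str, drop_stop_words: bool = True) -> list[str]:
    """Split cleaned text into word tokens (Unicode letters and digits)."""
    tokens = re.findall(r"\w+", clean_text(text))
    if drop_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


print(tokenise("The Retrieval of DOCUMENTS!"))  # ['retrieval', 'documents']
```

Keeping cleaning separate from tokenisation means the lexical baseline and the embedding path can share the first function while making different choices about stop words.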

Evidence

Measured signals

TF-IDF

Transparent lexical baseline

Maintains an interpretable keyword-oriented comparison path.
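To make the lexical path concrete, here is a minimal pure-Python sketch of TF-IDF ranking with cosine similarity. The function names, the smoothed IDF variant, and the toy documents are assumptions for illustration; the original build may well use a library vectoriser instead.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF dict per tokenised document, plus the IDF table."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumed variant): common terms keep a small positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({t: (c / len(doc)) * idf[t] for t, c in counts.items()})
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_tfidf(query_tokens, docs, k=3):
    """Rank documents against the query; returns (doc_index, score) pairs."""
    vectors, idf = tfidf_vectors(docs)
    # Query terms that never occur in the corpus carry no weight.
    q_counts = Counter(t for t in query_tokens if t in idf)
    total = sum(q_counts.values())
    q_vec = {t: (c / total) * idf[t] for t, c in q_counts.items()}
    scored = [(i, cosine(q_vec, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


# Toy tokenised documents, invented for the example.
docs = [
    ["dense", "embeddings", "capture", "meaning"],
    ["keyword", "matching", "is", "fast"],
    ["tf", "idf", "weights", "keyword", "terms"],
]
for doc_index, score in rank_tfidf(["keyword", "matching"], docs):
    print(doc_index, round(score, 3))
```

Because every score is a sum over shared terms, each ranking decision can be traced back to the specific words that produced it, which is what keeps this path inspectable.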

Dense

Semantic retrieval path

Uses embedding-based similarity for context-aware matching.
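Embedding-based similarity usually reduces to cosine similarity between a query vector and each document vector. The sketch below shows an exact, brute-force top-k search in plain Python. The three-dimensional vectors are made-up stand-ins: in a real pipeline they would come from a sentence-transformer model's encoder, and the names `l2_normalise` and `top_k` are hypothetical.

```python
import math


def l2_normalise(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query_vec, doc_vecs, k=3):
    """Return the k (doc_id, cosine) pairs most similar to the query."""
    q = l2_normalise(query_vec)
    scored = []
    for doc_id, vec in doc_vecs.items():
        d = l2_normalise(vec)
        scored.append((doc_id, sum(a * b for a, b in zip(q, d))))
    # Highest similarity first; ties broken by id for a stable order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


# Toy vectors standing in for real embeddings (hypothetical values).
doc_vecs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.9, 0.2],
    "doc_c": [0.7, 0.3, 0.1],
}
query_vec = [1.0, 0.2, 0.0]

for doc_id, score in top_k(query_vec, doc_vecs, k=2):
    print(doc_id, round(score, 3))
```

Exact search like this is fine for a small corpus; at larger scale an approximate nearest-neighbour index would typically replace the loop, at the cost of some of the transparency the lexical baseline offers.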

Top-k

Retrieval evaluation

Evaluates ranked results across representative queries.
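The source does not say which metrics were used, so the sketch below assumes three common ones for ranked results: precision@k, recall@k, and reciprocal rank. The document ids and relevance judgements are invented purely to exercise the functions and are not measurements from this project.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for doc_id in ranked[:k] if doc_id in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc_id in ranked[:k] if doc_id in relevant) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant result, or 0.0 if none is found."""
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


# Hypothetical ranking and judgements for a single query.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

print(precision_at_k(ranked, relevant, 3))
print(recall_at_k(ranked, relevant, 3))
print(reciprocal_rank(ranked, relevant))
```

Averaging these values over a set of representative queries gives one number per metric and per retrieval path, which keeps the lexical and dense comparisons on the same footing.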

Public scope: Covers architecture and evaluation design only; no unpublished comparative performance is claimed.

Contribution

  • Built reusable preprocessing, lexical baseline, embedding, vector-search, and ranking components.
  • Separated retrieval quality from response fluency.
  • Kept multilingual support as an evaluated workflow requirement rather than a decorative feature claim.

Lessons

  • Classical baselines remain useful because they are transparent and cheap.
  • Dense retrieval should earn its complexity with query-level evidence.
  • Retrieval systems need reviewed examples and evaluation, not only architecture diagrams.
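One way to gather the query-level evidence mentioned above is to score both retrieval paths on the same judged queries and look at the per-query difference rather than a single average. This is a sketch under assumptions: `compare_per_query` is a hypothetical helper, reciprocal rank is just one possible metric, and the runs and judgements are made-up ids, not results from this project.

```python
def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant result, or 0.0 if none is found."""
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


def compare_per_query(lexical_runs, dense_runs, judgements):
    """Return one row per judged query with both scores and their difference."""
    rows = []
    for query_id in sorted(judgements):
        relevant = judgements[query_id]
        lexical = reciprocal_rank(lexical_runs.get(query_id, []), relevant)
        dense = reciprocal_rank(dense_runs.get(query_id, []), relevant)
        rows.append({
            "query": query_id,
            "lexical": lexical,
            "dense": dense,
            "delta": dense - lexical,  # positive means the dense path ranked better
        })
    return rows


# Hypothetical data to exercise the helper; not project measurements.
judgements = {"q1": {"d1"}, "q2": {"d5"}}
lexical_runs = {"q1": ["d1", "d2"], "q2": ["d4", "d6"]}
dense_runs = {"q1": ["d2", "d1"], "q2": ["d5", "d4"]}

for row in compare_per_query(lexical_runs, dense_runs, judgements):
    print(row)
```

A per-query table like this shows where each path wins or loses, which a single averaged score would hide.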

Limitations

  • No comparative retrieval improvement is claimed publicly.
  • The public scope focuses on method architecture and evaluation design.
  • Architecture descriptions are kept separate from unverified outcome claims.

Stack

  • TF-IDF
  • Sentence Transformers
  • Vector Search
  • Ranking
  • Information Retrieval