Asymmetric semantic search using document contextual embeddings on long documents

Overview

Asymmetric semantic search is the task of matching short prompts to long texts based on semantic meaning. This repository contains a project done as part of the NLP course at the University of Tartu.

The main idea of the project was to collect a dataset of books, preprocess it, transform the books into embeddings using Sentence-BERT, and then use the embedding of the input prompt to find the books closest to it in meaning.
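The retrieval step can be illustrated with a minimal sketch. It assumes the book and prompt embeddings have already been computed and that cosine similarity is the ranking score; both the scoring choice and the names `cosine_similarity` and `rank_books` are assumptions made for illustration, not taken from the project code.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_books(
    query_embedding: Sequence[float],
    book_embeddings: Dict[str, Sequence[float]],
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Return the top_k (title, score) pairs, most similar first.

    Hypothetical helper: the real project may score or index differently.
    """
    scored = [
        (title, cosine_similarity(query_embedding, embedding))
        for title, embedding in book_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Toy 2-dimensional vectors standing in for real Sentence-BERT embeddings.
    books = {"Book A": [1.0, 0.0], "Book B": [0.0, 1.0], "Book C": [1.0, 1.0]}
    for title, score in rank_books([1.0, 0.0], books, top_k=2):
        print(f"{title}: {score:.3f}")
```

With only a few dozen books, a linear scan like this is usually sufficient; an approximate nearest-neighbour index would only matter for much larger collections.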

The report on the project can be found here.

You can access a demo of the project on Hugging Face Spaces.

Dataset

The dataset contains 78 preprocessed books from the Project Gutenberg library. The dataset can be found here, and the additional metadata used for evaluation here. More details on the dataset can be found in the report.

Code for preprocessing the original book files into a structured dataset can be found here.

Embeddings generation

To generate appropriate embeddings for the task, six pretrained Sentence-BERT models were tested (see the report for more details and evaluation). Code for generating the embeddings and testing them can be found here.
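One common way to embed a document that is longer than a model's input limit is to split it into chunks, embed each chunk, and average the results. The sketch below shows that approach with a pluggable `encode` callable. The chunking by word count, the mean pooling, and the names `split_into_chunks` and `embed_long_document` are all assumptions for illustration; the report describes what the project actually does.

```python
from typing import Callable, List, Sequence

Vector = List[float]
# An encoder maps a list of texts to one vector per text.
Encoder = Callable[[List[str]], Sequence[Sequence[float]]]


def split_into_chunks(text: str, max_words: int = 200) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]


def embed_long_document(text: str, encode: Encoder, max_words: int = 200) -> Vector:
    """Embed each chunk and mean-pool the chunk vectors into one vector.

    Mean pooling is an assumed strategy, not necessarily the project's.
    """
    chunks = split_into_chunks(text, max_words)
    if not chunks:
        raise ValueError("document is empty")
    vectors = [list(vector) for vector in encode(chunks)]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def toy_encode(chunks: List[str]) -> List[Vector]:
    """Stand-in encoder so the sketch runs without any model download."""
    return [[float(len(chunk.split())), 1.0] for chunk in chunks]


if __name__ == "__main__":
    print(embed_long_document("a b c d e", toy_encode, max_words=2))

    # With the sentence-transformers package installed, a real encoder could
    # be plugged in instead. The model name below is only a placeholder and
    # is not claimed to be one of the six models tested in the project:
    #
    #   from sentence_transformers import SentenceTransformer
    #   model = SentenceTransformer("all-MiniLM-L6-v2")
    #   book_vector = embed_long_document(book_text, model.encode)
```

Keeping the encoder as a parameter makes it straightforward to swap between pretrained models and compare them on the same chunks.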

Evaluation

For evaluation, the book descriptions from the metadata were used as queries. The metrics used were top-1, top-5, and top-10 accuracy.
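Top-k accuracy here means the fraction of queries whose correct book appears among the first k retrieved results. A minimal sketch follows, assuming each description has exactly one correct book and that rankings are available as ordered lists of titles; the function name `top_k_accuracy` and the data layout are hypothetical.

```python
from typing import Dict, List


def top_k_accuracy(rankings: Dict[str, List[str]], k: int) -> float:
    """Fraction of queries whose true book is within the first k results.

    rankings maps the true book title of each query (its description)
    to the list of retrieved titles, ordered from best to worst match.
    """
    if not rankings:
        raise ValueError("no rankings to evaluate")
    if k < 1:
        raise ValueError("k must be at least 1")
    hits = sum(
        1 for true_title, ranked in rankings.items() if true_title in ranked[:k]
    )
    return hits / len(rankings)


if __name__ == "__main__":
    # Toy rankings for three queries; real ones would come from the search step.
    toy_rankings = {
        "Book A": ["Book A", "Book B", "Book C"],
        "Book B": ["Book C", "Book B", "Book A"],
        "Book C": ["Book A", "Book B", "Book C"],
    }
    for k in (1, 2, 3):
        print(f"top-{k} accuracy: {top_k_accuracy(toy_rankings, k):.2f}")
```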

