Asymmetric semantic search is the task of matching short prompts to long texts based on semantic meaning. This repository contains a project done as part of the NLP course at the University of Tartu.
The main idea of the project was to collect a dataset of books, preprocess it, transform the books into embeddings using Sentence-BERT, and then use the embedding of the input prompt to find the books that are semantically closest to it.
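The retrieval step described above can be sketched roughly as follows. This is only an illustration, not the project's actual code: the names `cosine`, `search`, `titles`, `book_vectors` and `encode` are hypothetical, it assumes one embedding per book and cosine similarity as the score, and `encode` stands in for whatever Sentence-BERT model is used.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(
    query: str,
    titles: Sequence[str],
    book_vectors: Sequence[Vector],
    encode: Callable[[str], Vector],
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Rank books by similarity between the query embedding and book embeddings.

    `encode` is an assumption: any function mapping text to a vector,
    for example the encode method of a Sentence-BERT model.
    """
    query_vec = encode(query)
    scored = [(title, cosine(query_vec, vec)) for title, vec in zip(titles, book_vectors)]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

Passing the encoder in as a function keeps the ranking logic independent of any particular pretrained model, which makes it easy to swap models when comparing them.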
The project report can be found here.
You can try a demo of the project on Hugging Face Spaces.
The dataset contains 78 preprocessed books from the Project Gutenberg library. The dataset can be found here, and the additional metadata used for evaluation here. More details on the dataset can be found in the report.
Code for preprocessing the original book files into a structured dataset can be found here.
To generate embeddings suited to the task, 6 pretrained Sentence-BERT models were tested (see the report for more details and evaluation). Code for generating the embeddings and testing them can be found here.
For evaluation, the book descriptions from the metadata were used as queries. The metrics used were top-1, top-5 and top-10 accuracy.
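As a minimal sketch of how top-k accuracy is usually computed (not necessarily how the project's evaluation code does it): a query counts as a hit if the correct book appears among the first k results. The function name `top_k_accuracy` and its arguments are hypothetical.

```python
from typing import Sequence


def top_k_accuracy(
    rankings: Sequence[Sequence[str]],
    targets: Sequence[str],
    k: int,
) -> float:
    """Fraction of queries whose target book is among the first k ranked results.

    rankings[i] is the ranked list of book ids returned for query i,
    targets[i] is the id of the book that query i was written for.
    """
    if len(rankings) != len(targets):
        raise ValueError("rankings and targets must have the same length")
    if not rankings:
        return 0.0
    hits = sum(1 for ranked, target in zip(rankings, targets) if target in ranked[:k])
    return hits / len(rankings)
```

Calling the function with k set to 1, 5 and 10 on the same rankings gives the three reported metrics.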