The goal of this project is to understand what makes two questions semantically the same (according to Quora). The labeled data come from Quora via a Kaggle competition. For Quora, a classifier that detects duplicate questions would help improve user experience and reduce website maintenance costs.
Because the best-performing solutions to this problem are complex deep learning models, we instead sought to build a model that takes interpretable features as inputs. Using spaCy's pretrained language model and processing pipeline together with fuzzywuzzy match ratios, we engineered 17 features:
- Question Similarity
- Similarity of Different Words
- Entity Type Match Ratio
- Entity Match Ratio
- Proper Noun Match Ratio
- Noun Match Ratio
- Noun Similarity
- Similarity of Different Nouns
- The same three features (Match Ratio, Similarity, Similarity of Different) for verbs, adjectives, and adverbs, giving nine more features
Similarity refers to the cosine similarity between the aggregated word embeddings of the two documents or subdocuments (for example, only the nouns of each question).
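
As a rough illustration of this definition, the sketch below averages word vectors per question and takes the cosine of the two averages. It is not the project's code: `TOY_VECTORS`, `mean_vector`, `cosine_similarity`, and `question_similarity` are hypothetical names, the 3-dimensional vectors are made up, and whitespace tokenization stands in for spaCy's pipeline. With spaCy itself, `Doc.similarity` computes a cosine over averaged token vectors by default.

```python
import math

# Hypothetical toy embeddings (assumption); real pretrained vectors
# have hundreds of dimensions and come from the language model.
TOY_VECTORS = {
    "how": [0.1, 0.3, 0.0],
    "do": [0.2, 0.1, 0.1],
    "i": [0.0, 0.2, 0.3],
    "learn": [0.7, 0.1, 0.2],
    "study": [0.6, 0.2, 0.2],
    "python": [0.1, 0.9, 0.4],
}


def mean_vector(tokens):
    """Average the vectors of all known tokens; None if no token is known."""
    vectors = [TOY_VECTORS[t] for t in tokens if t in TOY_VECTORS]
    if not vectors:
        return None
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is missing or zero."""
    if a is None or b is None:
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def question_similarity(q1, q2):
    """Cosine similarity of the averaged word vectors of two questions."""
    # Whitespace tokenization is a simplification (assumption).
    return cosine_similarity(
        mean_vector(q1.lower().split()),
        mean_vector(q2.lower().split()),
    )
```

Restricting the token lists to one part of speech (only nouns, for example) before averaging would give the per-part-of-speech variant of the same feature.
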
Entity Type refers to the label assigned by spaCy's named entity recognition. Entities are "real world objects with names", e.g. a person, country, place, amount of money, or date.
Entity refers to the entity instance itself, e.g. Theresa May, Great Britain, $12.12, October 1999.
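
To make the match ratio features more concrete, here is a minimal sketch of how a 0 to 100 ratio over two lists of strings (entity instances, entity types, or nouns) might be computed. The exact fuzzywuzzy function and preprocessing used in the project are not stated, so everything here is an assumption: `match_ratio` is a hypothetical helper, the standard library's `difflib.SequenceMatcher` stands in for fuzzywuzzy, and sorting the items first is an illustrative choice that makes the score independent of order.

```python
from difflib import SequenceMatcher


def match_ratio(items_a, items_b):
    """Return a 0-100 similarity score for two lists of strings.

    Stand-in for a fuzzywuzzy-style ratio (assumption): items are
    lowercased, sorted, and joined so that word order does not matter.
    """
    a = " ".join(sorted(s.lower() for s in items_a))
    b = " ".join(sorted(s.lower() for s in items_b))
    if not a and not b:
        # Assumption: two questions with no items at all count as a full match.
        return 100
    return round(100 * SequenceMatcher(None, a, b).ratio())


# Hypothetical entity lists, as they might come out of named entity recognition.
entities_q1 = ["Theresa May", "Great Britain"]
entities_q2 = ["Great Britain", "Theresa May"]
entity_match = match_ratio(entities_q1, entities_q2)

# The same helper applied to entity type labels instead of instances.
entity_type_match = match_ratio(["PERSON", "GPE"], ["PERSON", "DATE"])
```

How to score a pair in which neither question contains any entity is a design choice; treating it as a full match, as above, is only one option.
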
Mean cross-validation score: 67.17%