colemiller94/quora_question_project

Redundant Question Classification

The goal of this project is to understand what makes two questions semantically the same (according to Quora). The labeled data come from Quora via a Kaggle competition. For Quora, such a classifier would help improve user experience and reduce website maintenance costs.

Features

Because the best-performing solutions to this problem are complex deep learning models, we instead sought a model built on interpretable features. Using spaCy's pretrained language model and processing pipeline together with fuzzywuzzy match ratios, we engineered 17 features:

  • Question Similarity
  • Similarity of Different Words
  • Entity Type Match Ratio
  • Entity Match Ratio
  • Proper Noun Match Ratio
  • Noun Match Ratio
  • Noun Similarity
  • Similarity of Different Nouns
  • The same three features (Match Ratio, Similarity, Similarity of Different Words) for verbs, adjectives, and adverbs
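As a rough illustration of the match-ratio features listed above, the sketch below scores two token lists on a 0-100 scale. It is not the project's code: it uses the standard library's `difflib.SequenceMatcher` as a dependency-free stand-in for a fuzzywuzzy ratio, and the function name `match_ratio`, the sort-and-join normalization, the handling of empty lists, and the example noun lists are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def match_ratio(tokens_a, tokens_b):
    """Return a 0-100 similarity score between two token lists.

    Tokens are lowercased, de-duplicated, sorted, and joined so that word
    order and repetition do not affect the score (an assumption).
    """
    text_a = " ".join(sorted({t.lower() for t in tokens_a}))
    text_b = " ".join(sorted({t.lower() for t in tokens_b}))
    if not text_a and not text_b:
        # Assumption: two questions with no tokens of this type count as a match.
        return 100
    return round(100 * SequenceMatcher(None, text_a, text_b).ratio())


# Hypothetical noun lists, as a part-of-speech tagger might extract them.
nouns_q1 = ["capital", "France"]
nouns_q2 = ["France", "capital"]
nouns_q3 = ["population", "Germany"]

same = match_ratio(nouns_q1, nouns_q2)       # identical sets of nouns
different = match_ratio(nouns_q1, nouns_q3)  # unrelated nouns
print(same, different)
```

Sorting before joining makes the score insensitive to word order, which seems reasonable for comparing sets of nouns or entities; a different normalization would give different scores.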

Similarity refers to the cosine similarity of the aggregate word embeddings, computed per document or subdocument.
Entity Type refers to the label assigned by spaCy's named entity recognition. Entities are "real world objects with names", e.g. person, country, place, money, date.
Entity refers to the entity instance, e.g. Theresa May, Great Britain, $12.12, October 1999.
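To make the Similarity definition concrete, here is a minimal sketch that averages word vectors per question and takes the cosine of the two averages. It also computes the "Similarity of Different Words" variant by keeping only the words that appear in one question and not the other. The 3-dimensional `TOY_VECTORS` table, the use of a mean as the aggregate, and all function names are assumptions for illustration; a real pipeline would take vectors from a pretrained language model.

```python
import math

# Toy 3-dimensional word vectors (made-up values, for illustration only).
TOY_VECTORS = {
    "how": [0.1, 0.3, 0.0],
    "do": [0.2, 0.1, 0.1],
    "can": [0.2, 0.2, 0.1],
    "i": [0.0, 0.1, 0.3],
    "learn": [0.7, 0.2, 0.1],
    "study": [0.6, 0.3, 0.1],
    "python": [0.1, 0.9, 0.4],
}


def mean_vector(words):
    """Average the vectors of all known words; None if no word is known."""
    vectors = [TOY_VECTORS[w] for w in words if w in TOY_VECTORS]
    if not vectors:
        return None
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either is missing or zero."""
    if u is None or v is None:
        return 0.0
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def different_words(words_a, words_b):
    """Words found in only one of the two questions, in original order."""
    set_a, set_b = set(words_a), set(words_b)
    only_a = [w for w in words_a if w not in set_b]
    only_b = [w for w in words_b if w not in set_a]
    return only_a, only_b


q1 = ["how", "do", "i", "learn", "python"]
q2 = ["how", "can", "i", "study", "python"]

question_similarity = cosine_similarity(mean_vector(q1), mean_vector(q2))
only_q1, only_q2 = different_words(q1, q2)
different_word_similarity = cosine_similarity(
    mean_vector(only_q1), mean_vector(only_q2)
)
print(question_similarity, different_word_similarity)
```

The same two functions apply to a subdocument by passing only the words of one part of speech, for example the nouns of each question.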

Feature Explorer

[Figure: feature explorer plot, image label "Duplicate"]

Models

Logistic Regression

Mean cross-validation score: 67.17%
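A mean cross-validation score like the one reported above can be computed along the following lines with scikit-learn. This is a sketch under stated assumptions, not the project's training code: the data are synthetic (generated with `make_classification` to have 17 columns, mirroring the 17 engineered features), and the 5-fold split, the default accuracy metric, and `max_iter=1000` are choices made here because the README does not state them.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the 17 engineered features and the duplicate labels.
X, y = make_classification(n_samples=500, n_features=17, random_state=0)

model = LogisticRegression(max_iter=1000)

# One score per fold; classifiers are scored by accuracy unless told otherwise.
scores = cross_val_score(model, X, y, cv=5)
mean_score = scores.mean()
print(f"Mean cross-validation score: {mean_score:.2%}")
```

Swapping `LogisticRegression` for another scikit-learn compatible classifier leaves the rest of the snippet unchanged, which makes it easy to compare models on the same folds.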

Random Forest

Mean cross-validation score: 73.39%
Feature Importance: [figure: feature importance plot]
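For readers who want to reproduce a ranking like the one in the feature importance plot, a fitted scikit-learn random forest exposes impurity-based importances through its `feature_importances_` attribute. The sketch below uses synthetic data and placeholder column names (`feature_0`, `feature_1`, and so on), both of which are assumptions; the hyperparameters are likewise illustrative, not the project's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the column names below are placeholders.
X, y = make_classification(n_samples=500, n_features=17, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances: one value per feature, normalized to sum to 1.
ranked = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked[:5]:
    print(f"{name}: {importance:.3f}")
```

Impurity-based importances can favor features with many distinct values, so they are best read as a rough ranking and not as exact effect sizes.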

XGBoost

Mean cross-validation score: 73.64%
Feature Importance: [figure: XGBoost feature importance plot]
