This project is an attempt to write a SVM based Question Classifier. A lot of feature extraction is motivated by the Paper Question Classification using Head Words and their Hypernyms
A typical dataset for this problem would look like the following:
- NUM:dist How far is it from Denver to Aspen ?
- LOC:city What county is Modesto , California in ?
- HUM:desc Who was Galileo ?
- DESC:def What is an atom ?
- NUM:date When did Hawaii become a state ?
- NUM:dist How tall is the Sears Building ?
The first two words separated by ":" denote the class or the category of the question. For simplicity and in this initial attempt, we have chosen only the first word as the category i.e. for data : "HUM:desc Who was Galileo ?" HUM is the category of the question. The second part can again be detected using another classifier (hierarchical classifiers).
Complete dataset is available here
We use SVM based linear classifier to build a model to classify a given question to a correct class.
Following features are used to train the model
- WH word type (The wh-word feature is the question wh-word in given questions. For example, the wh-word of question What is the population of China is what. We have taken all known question wh-words, namely what, which, when, where, who, how, why, and unknown
- Head Word extractor after chunking using opennlp parser.
- ShapeExtractor. Word shape in a given question may be useful for question classification. For instance, the question Who is Duke Ellington has a mixed shape (begins a with capital letter and follows by lower case letters) for Duke, which roughly serves as a named entity recognizer.
- Wordnet Hypernyms for head word for better regularization of features.
- N grams of the question text.
- git clone https://github.com/utk4rsh/question-classifier.git ( do git pull if you have cloned in the past)
- mvn install:install-file -Dfile=lib/edu.mit.jwi_2.4.0.jar -DgroupId=edu.mit -DartifactId=jwi -Dversion=2.4.0 -Dpackaging=jar
- mvn clean install
- mvn exec:java -Dexec.mainClass="us.ml.question.classifier.QuestionCategoryEvaluation" -Dexec.cleanupDaemonThreads=false
This should produce the below accuracy.
Cross Validation Results:
Precision | Recall | FScore | Gold | System | Correct | Class |
---|---|---|---|---|---|---|
0.723 | 0.723 | 0.723 | 5452 | 5452 | 3942 | OVERALL |
0.837 | 0.477 | 0.607 | 86 | 49 | 41 | ABBR |
0.597 | 0.834 | 0.696 | 1162 | 1624 | 969 | DESC |
0.578 | 0.577 | 0.577 | 1250 | 1247 | 721 | ENTY |
0.832 | 0.739 | 0.783 | 1223 | 1087 | 904 | HUM |
0.884 | 0.721 | 0.794 | 835 | 681 | 602 | LOC |
0.923 | 0.787 | 0.849 | 896 | 764 | 705 | NUM |
Holdout Set Results:
Precision | Recall | FScore | Gold | System | Correct | Class |
---|---|---|---|---|---|---|
0.726 | 0.726 | 0.726 | 500 | 500 | 363 | OVERALL |
1.000 | 0.778 | 0.875 | 9 | 7 | 7 | ABBR |
0.556 | 0.978 | 0.709 | 138 | 243 | 135 | DESC |
0.667 | 0.426 | 0.519 | 94 | 60 | 40 | ENTY |
0.887 | 0.846 | 0.866 | 65 | 62 | 55 | HUM |
0.981 | 0.630 | 0.767 | 81 | 52 | 51 | LOC |
0.987 | 0.664 | 0.794 | 113 | 76 | 75 | NUM |
- More feature engineering can be done to identify feature apart from the listed ones.
- Trying with different SVM kernels to see if there are any improvements.
- See if this could be achieved using word2vec or Neural networks.(https://github.com/ashishbaghudana/question_classification)