- INPUT
->shape of input vector (20, 100)
->each word is represented as a vector of 100 features
->sentences with more than 20 words are clipped
->and those with fewer than 20 words are padded with zero vectors (a padding/clipping sketch follows this list)
->target : one-hot vectors of dimension = length of vocabulary
- OUTPUT
->softmax output
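
Below is a minimal sketch of the clipping/padding scheme described above, assuming each tweet has already been converted to a list of 100-dimensional word vectors. The names `pad_or_clip`, `MAX_LEN`, and `EMBED_DIM` are illustrative, not taken from the original code.

```python
import numpy as np

MAX_LEN = 20     # maximum number of words per tweet, matching the (20, 100) input shape
EMBED_DIM = 100  # number of features per word vector

def pad_or_clip(word_vectors):
    """Clip a tweet to MAX_LEN word vectors, or pad it with zero vectors up to MAX_LEN."""
    out = np.zeros((MAX_LEN, EMBED_DIM), dtype=np.float32)
    for i, vec in enumerate(word_vectors[:MAX_LEN]):
        out[i] = vec
    return out  # shape (20, 100)

# usage: pad_or_clip([np.random.rand(EMBED_DIM) for _ in range(7)]).shape -> (20, 100)
```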
Approach Taken :-
-
First, we load all the tweet data from the file 'consolidate.csv' and store the original and corrected tweets separately in two lists after tokenizing them. We used the nltk library to tokenize each sentence into a list of words.
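
A sketch of this loading and tokenization step, assuming 'consolidate.csv' holds the original tweet in the first column and the corrected tweet in the second (the column layout and list names here are assumptions):

```python
import csv
import nltk

nltk.download('punkt', quiet=True)  # tokenizer models required by word_tokenize

original_tweets, corrected_tweets = [], []
with open('consolidate.csv', newline='', encoding='utf-8') as f:
    for row in csv.reader(f):
        # assumed layout: column 0 = original tweet, column 1 = corrected tweet
        original_tweets.append(nltk.word_tokenize(row[0]))
        corrected_tweets.append(nltk.word_tokenize(row[1]))
```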
-
The original data is then preprocessed. Each word is converted to lowercase to maintain uniformity in the dataset. The corrected tweets are processed to find all the unique words and their counts. Only a subset of these unique words, chosen by occurrence count, forms our bag of words.
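
A sketch of the preprocessing and vocabulary selection, continuing from the tokenized lists above and using `collections.Counter`; the `VOCAB_SIZE` cutoff is an illustrative assumption, not the value used in the original project:

```python
from collections import Counter

# Lowercase every token to keep the dataset uniform
original_tweets = [[w.lower() for w in tweet] for tweet in original_tweets]
corrected_tweets = [[w.lower() for w in tweet] for tweet in corrected_tweets]

# Count every unique word in the corrected tweets
word_counts = Counter(w for tweet in corrected_tweets for w in tweet)

# Keep only the most frequent words as the bag of words (cutoff is an assumption)
VOCAB_SIZE = 5000
bag_of_words = [w for w, _ in word_counts.most_common(VOCAB_SIZE)]
```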
-
We created one-hot vectors for each word in our bag of words, which constitute our 'expected output' data. Each word in the original tweet dataset is converted to its corresponding vector; Gensim's word2vec was used for this purpose.
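
A sketch of building the one-hot target vectors and embedding the input words with gensim's word2vec. The 100-dimensional vector size follows the input spec above; the remaining hyperparameters and helper names are assumptions (gensim 4 API shown; older gensim versions use `size=` instead of `vector_size=`).

```python
import numpy as np
from gensim.models import Word2Vec

# One-hot target vector for every word in the bag of words
word_to_index = {w: i for i, w in enumerate(bag_of_words)}
one_hot_targets = np.eye(len(bag_of_words), dtype=np.float32)  # row i = one-hot vector of word i

# 100-dimensional embeddings for the words of the original tweets
w2v = Word2Vec(sentences=original_tweets, vector_size=100, window=5, min_count=1)

def embed_word(word):
    """Return the word2vec vector for a word seen in the original tweets."""
    return w2v.wv[word]
```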
-
Now that we had all the required data in the proper format, we randomly selected X (input) and y (expected output) vectors from the dataset. This data was split into training (4050 samples) and validation (50 samples) sets.
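
A sketch of the random selection and 4050/50 split. The shapes of X and y below are placeholders that follow the input/output spec above; the real arrays come from the earlier steps, and the shuffling approach is an assumption.

```python
import numpy as np

# Illustrative placeholder shapes; the real X and y are built from the steps above
N, MAX_LEN, EMBED_DIM, VOCAB_SIZE = 4100, 20, 100, 5000
X = np.zeros((N, MAX_LEN, EMBED_DIM), dtype=np.float32)  # (20, 100) word-vector matrix per tweet
y = np.zeros((N, VOCAB_SIZE), dtype=np.float32)          # one-hot expected output per sample

rng = np.random.default_rng(0)            # fixed seed only for reproducibility of the example
indices = rng.permutation(N)              # random ordering of all samples
X_train, y_train = X[indices[:4050]], y[indices[:4050]]   # 4050 training samples
X_val,   y_val   = X[indices[4050:]], y[indices[4050:]]   # 50 validation samples
```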
-
For our network :-
->model used : encoder-decoder RNN model using Keras
->error function : categorical cross-entropy
->activation : softmax
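
A minimal Keras sketch matching the settings listed above (encoder-decoder style RNN, softmax output, categorical cross-entropy loss). The layer types, the hidden size of 128, the optimizer, and `VOCAB_SIZE` are assumptions, not the original architecture.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, RepeatVector, Dense

MAX_LEN, EMBED_DIM, VOCAB_SIZE = 20, 100, 5000  # input spec from above; VOCAB_SIZE is an assumption

model = Sequential([
    LSTM(128, input_shape=(MAX_LEN, EMBED_DIM)),  # encoder: reads the (20, 100) tweet matrix
    RepeatVector(MAX_LEN),                        # bridge: repeat the encoding for each decoder step
    LSTM(128),                                    # decoder: consumes the repeated encoding
    Dense(VOCAB_SIZE, activation='softmax'),      # softmax output over the bag of words
])
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10, batch_size=32)
```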