Language models capture which sequences of words are more likely to appear in a given domain, they are applicable to many natural language processing tasks. In this project, we consider a type of text classiﬁcation problem. The goal here is to guess whether the sender of the given email is of a lower rank than the recipient (UPSPEAK) or vice versa (DOWNSPEAK) based on the content of the email. The creation of a decently performing classiﬁer for this task would conﬁrm the idea that people indeed write diﬀerently depending on the relative rank of the recipient.
To do this, ﬁrst, language models are trained for UPSPEAK and DOWNSPEAK. Then these models were further enhancedusing smoothing methods. Then, given an email from updown test, the probability of the email assigned by the 2 language models were computed. If the UPSPEAK model assigns the higher probability, the email is classified as UPSPEAK, else it is classified as DOWNSPEAK. The complete implementation of the project can be accessed at the github link.
These models were tested on the test set presented by the Kaggle competition. These models were found to be 59.004 % accurate. The development of these models is explained below.
There are 3 input files, one each for training, validation and testing. The training and validation input files contain a set of emails in the following format:
The UPSEAK label indicates that the email is sent from junior to senior and the lable DOWNSPEAK indicates that the email is sent from senior to junior employee.
The test dataset lacks the UPSPEAK/DOWNSPEAK labels since they are meant for final testing.
Using a standard corpus often requires preprocessing of the data to ﬁt the task at hand. In this section, the datasets are processed to generated corpus for different labels. First, the emails are segregated into UPSPEAK emails and DOWNSPEAK emails as per their lables. Once all emails from same label are aggregated, email bodies are extracted and tokenized into sentences. Then the start(<s>) and stop(<\s>) flags are inserted for each sentence. Set of all these sentences from same label forms the corpus for that label. Thus, training dataset creates the uptrain and downtrain corpus for training the model. The validation dataset creates the up validation and down validation corpus for validating the trained model. Finally, since the test dataset doesn’t contain any labels, all emails are extracted from the test dataset and are processed to generate the updown test corpus for testing the model.
Unigram, Bigram and Trigram models are developed and trained on uptrain and downtrain datasets separately. Thus we generate the unigram, bigram and trigram uptrain models and similar language models for downtrain. To generate these models, first we count the occurrences of each N-grams and then compute their conditional probabilities. While generating the N-grams, the case sensitivity was ignored, thus ‘the’ and ‘The’ are counted as same N-gram.
Random sentences are generated from each of these above trained models. This was done to see which of the N-gram models generate a meaningful sentences. To generated the random sentence, a random word was selected that contained the start flag(<s>) as the current state. For Unigram, the second unigram was selected randomly. Then for each current state, the next highest probable N-gram was selected from the model and appended to the random sentence. The current state is then assigned to the new N-gram thus selected. The process is continued until the stop Flag(). From these random sentences thus generated, we could conclude that Trigram models were better in generating random, yet, meaningful sentences. In general, the higher the order of N-grams, the better the models in generating meaningful sentences.
The models generated were very sparse. Many of the entries for counts of the N-grams were zero. Thus, there was a need for smoothing the models. We started with the simplest of the smoothing model – Laplace Smoothing. Laplace smoothing is used to reduce the problem of data sparsity by adding 1 to all counts thus rendering the probability values non-zero. Since the counts of N-gram were incremented by 1, the total number of tokens should be increased by the number of N-grams to conserve the probability mass to 1. The purpose of applying Laplace smoothing is to assign some probability values to the tokens that are known but unseen in the corpus.
To handle unknown words in the test set, a special token UNK was introduced to keep track of unknown words. Also, since the size of the N-grams were huge, the model was reduced by assigning the less frequent N-grams to UNK token. The N-grams were deemed as less frequent if their count was less than or equal to 2. Any N-gram having a count greater than this threshold value will be considered as the valid N-gram in the corpus.
The language models were evaluated by their perplexity. Perplexity is a measurement of how well a probability distribution or probability model predicts a sample. It can be used to compare different N-gram models. A low perplexity indicates the probability distribution is good at predicting the sample thus indicating a better model.
The perplexity for various language models were computed as follows: Perplexity for the unigram uptrain : 725.08 Perplexity for the bigram uptrain : 8.54 Perplexity for the trigram uptrain : 1.88 Perplexity for the unigram downtrain : 852.9 Perplexity for the bigram downtrain : 11.27 Perplexity for the trigram downtrain : 1.92
Perplexities appears to be consistent with our expectations as trigram complexity value for both uptrain and downtrain training corpus is less than that for unigram and bigram.
For email classification, Laplace smoothing was found to be inefficient in distributing the probability mass across UNK tokens, thus new customized smoothing algorithm was developed for better probability mass distribution. This algorithm is explained below.
Good Turing smoothening algorithm states that “Probability of any unknown word = Probability of the word seen once”
Mathematically, Probability of word with zero frequency= Nc/N
Discounted Probability = C/N
Nc= count of tokens with frequency c
N is the total number of tokens in the corpus.
However, the distribution fails if Nc+1 is zero. Therefore, we have customized the good Turing algorithm for non-zero probability values by redistributing them over the non-zero probability values:
Discounted count C**= C(N-N1)/N
Probability of word with zero frequency= Nc/N
Discounted Probability = C**/N
Each word token in the email will be assigned probabilities based on the trained upstream and downstream models and total probability is then calculated by summing them all. Email will then be classified as UPSPEAK or DOWNSPEAK based on whichever probability value is higher. These models are tested on the emails presented by the Kaggle competition and found to be 59.004 % accurate.