Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. When the same task has to be done for documents without any knowledge of the labels or categories, the problem turns into a clustering problem. In this project we try to cluster news articles using the concepts of word embeddings and document embeddings.
Word embeddings are a type of learned word representation that allows words with similar meaning to have a similar representation.
They are, in fact, a class of techniques where individual words are represented as real-valued vectors in a predefined vector space. The idea becomes clearer with the following example.
The following figure shows the vector representations of some words in a two-dimensional vector space.
The project is divided into the following three stages.
I have chosen two datasets for this project.
Both are news datasets. Although the datasets contain many fields such as TITLE, CONTENT, ID, AUTHOR, DOP, etc., I will mainly focus on the fields from which we get relevant information.
By midway I expect to complete the data preprocessing and the generation of vectors for the articles. If possible, I will also try to apply clustering algorithms to generate some initial results. In the post-midway phase I will apply more clustering algorithms, hope to apply some neural network techniques, improve the preprocessing where possible, and try to get better results.
Here I have picked the content of the news as the primary source of information about each article. I have decided to consider the title/heading of the news later, because the title contains some important words that can be useful during clustering. After collecting the articles, the preprocessing started with separating the individual words in each article, followed by converting each word to lower case. During preprocessing I also removed stopwords, using the standard NLTK stopwords package. A short sketch of this step is given below, followed by a snapshot of what the news articles look like after the preprocessing.
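Below is a minimal sketch of this preprocessing, assuming the raw article texts are collected in a hypothetical list of strings called articles; it lower-cases each article, splits it into words and removes NLTK's standard English stopwords.

    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    nltk.download('punkt')
    nltk.download('stopwords')

    stop_words = set(stopwords.words('english'))

    def preprocess(text):
        # separate the individual words and convert them to lower case
        tokens = word_tokenize(text.lower())
        # drop standard English stopwords; numbers and other tokens are kept
        return [t for t in tokens if t not in stop_words]

    # articles: hypothetical list of raw news article strings
    token_lists = [preprocess(article) for article in articles]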
This preprocessing still has a few problems, for example dates being treated as individual words. The date 11 September 2001 is important for conveying the topic of a news article, but here it is treated as three different words: 11, september and 2001. Word combos like '1 billion' are similarly split apart. These are things that still have to be taken care of. I have also not used any stemming or lemmatizing yet, but I plan to use them later to see if they improve the results.
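One possible (not yet implemented) fix is to merge frequently co-occurring tokens such as '1' and 'billion' into a single token with gensim's Phrases model, and to lemmatize with NLTK's WordNetLemmatizer; the snippet below is only a sketch of that idea, reusing the hypothetical token_lists from above.

    import nltk
    from gensim.models.phrases import Phrases, Phraser
    from nltk.stem import WordNetLemmatizer

    nltk.download('wordnet')

    # learn frequent word combos (e.g. '1_billion') from the tokenized articles
    bigram = Phraser(Phrases(token_lists, min_count=5, threshold=10.0))
    token_lists = [bigram[tokens] for tokens in token_lists]

    # optional lemmatization pass
    lemmatizer = WordNetLemmatizer()
    token_lists = [[lemmatizer.lemmatize(t) for t in tokens] for tokens in token_lists]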
For the creation of the embeddings for the news articles I have used the gensim doc2vec package. Doc2Vec is based on word2vec, but instead of generating embeddings for words it generates embeddings for variable-length documents. Words maintain a logical (grammatical) structure, but documents do not. To solve this problem, another vector (the Paragraph ID) is added to the word2vec model. This is the only difference between word2vec and doc2vec. One thing to note is that the Paragraph ID is a unique document ID.
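In gensim this unique document ID is supplied through the tags of a TaggedDocument. A minimal sketch, again reusing the hypothetical token_lists from the preprocessing step:

    from gensim.models.doc2vec import TaggedDocument

    # each article gets a unique tag, which plays the role of the Paragraph ID
    tagged_docs = [TaggedDocument(words=tokens, tags=[str(i)])
                   for i, tokens in enumerate(token_lists)]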
There are two versions of doc2vec available: the Distributed Memory (DM) model and the Distributed Bag-Of-Words (DBOW) model.
Let us look briefly at both of them.
The Distributed Memory (DM) model is similar to the Continuous Bag-of-Words (CBOW) model in word2vec, which attempts to guess the output (target word) from its neighbouring words (context words), with the addition of a paragraph ID. Let us say we have a single document, say
"I like natural language processing"
and the model is asked to predict the next word for a given word. Then the model will look like the figure below. So here it learns to predict a word based on the words present in the context. It trains the document vector along with the word vectors, with the intuition that, given the vector of a document, it should be good enough to predict the words in that document.
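In gensim the DM variant is selected with dm=1; the sketch below trains it on the tagged documents built above, with hyperparameter values that are only illustrative assumptions.

    from gensim.models.doc2vec import Doc2Vec

    # dm=1 -> Distributed Memory (PV-DM): context words plus the paragraph
    # vector are used to predict the target word
    model_dm = Doc2Vec(documents=tagged_docs, dm=1, vector_size=150,
                       window=5, min_count=2, epochs=40)

    # learned vector of the first document (use .docvecs instead of .dv in gensim < 4.0)
    vec = model_dm.dv['0']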
The Distributed Bag-Of-Words (DBOW) model is similar to the skip-gram model of word2vec, which guesses the context words from a target word. The following figure explains it.
So here it learns to predict the context words based on the document. The only difference between skip-gram and DBOW is that instead of using the target word as the input, DBOW takes the document ID (Paragraph ID) as the input and tries to predict randomly sampled words from the document.
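For comparison, the DBOW variant is selected in gensim with dm=0; the other parameters below are again only assumed for illustration.

    from gensim.models.doc2vec import Doc2Vec

    # dm=0 -> Distributed Bag-Of-Words (PV-DBOW): the paragraph vector alone
    # is used to predict words sampled from the document
    model_dbow = Doc2Vec(documents=tagged_docs, dm=0, vector_size=150,
                         window=5, min_count=2, epochs=40)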
After the creation of the doc vectors for the articles, I applied the K-means clustering algorithm. Here I have chosen a vector length of 2 for simplicity of visualization. After the vectors of the documents were created and plotted in 2D, they looked like the following.
At first glance this plot looks as if all the news articles form a single cluster, but since the cluster is spread out horizontally, articles at its extreme ends have a high chance of belonging to different topics, because their vector representations are quite far apart. So my guess before clustering was that it would be best to split this plot into two clusters. To check this, the elbow method was used: we calculated the inertia score for cluster counts 1 to 10. Inertia is the sum of squared distances of the samples to their closest cluster centre; it is also sometimes called the Sum of Squared Errors (SSE). The following is the expression for inertia.
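    SSE = \sum_{i=1}^{n} \min_{\mu_j \in C} \lVert x_i - \mu_j \rVert^2

where x_i is the vector of the i-th document, C is the set of cluster centres and \mu_j is the centre closest to x_i. The elbow computation can be sketched with scikit-learn's KMeans, whose inertia_ attribute is exactly this quantity; X is assumed to hold one document vector per row.

    import numpy as np
    from sklearn.cluster import KMeans

    # X: one row per article, containing its doc2vec vector
    X = np.array([model_dm.dv[str(i)] for i in range(len(tagged_docs))])

    inertias = []
    for k in range(1, 11):
        km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
        inertias.append(km.inertia_)  # SSE for k clusters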
The output of the elbow method looked like the following.
The elbow method indicated that three clusters is optimal. So after applying the K-means algorithm with three clusters, the output was the following.
But as you can notice, since we are using vectors of length 2, it is very difficult to represent the documents in 2D and still retain their features. So instead of creating 2D vectors directly, I created vectors of dimension 150 and then used a dimensionality reduction algorithm to obtain the 2-dimensional vectors. For the dimensionality reduction I have used the t-SNE algorithm because it gives better results.
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets. It is extensively applied in image processing, NLP, genomic data and speech processing.
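The reduction from 150 dimensions to 2 can be sketched with scikit-learn's TSNE, assuming X holds the 150-dimensional document vectors as rows.

    from sklearn.manifold import TSNE

    # non-linear reduction of the 150-dimensional doc vectors to 2D for plotting
    X_2d = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)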
So after creating the 150-dimensional vectors, we apply the t-SNE algorithm to reduce them to 2 dimensions for visualisation. The following figure depicts the result.
After generating the vectors, applying K-means and obtaining the clusters, it is time to see what the clusters talk about. For this I formed word clouds for all three clusters: I took the 100 most frequently occurring words in each cluster and built a word cloud from them. A sketch of this step is given below, followed by the resulting image.
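A minimal sketch of this step for a single cluster, assuming cluster_tokens is a hypothetical list containing the token lists of the articles assigned to that cluster; it counts word frequencies with collections.Counter and draws the cloud with the wordcloud package.

    from collections import Counter
    from wordcloud import WordCloud

    # cluster_tokens: token lists of the articles assigned to one cluster
    counts = Counter(word for tokens in cluster_tokens for word in tokens)
    top_words = dict(counts.most_common(100))

    # word cloud of the 100 most frequent words in this cluster
    wc = WordCloud(width=800, height=400, background_color='white')
    wc.generate_from_frequencies(top_words)
    wc.to_file('cluster_wordcloud.png')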
From the word clouds we can observe that Cluster 0 does not talk about anything specific; I think it mainly focuses on some international government news, and it has no significantly frequent words. Cluster 1, on the other hand, contains very significant words like 'palestinian', 'israel', 'west bank', 'gaza strip', 'prime minister' and some others. From this we can conclude that Cluster 1 talks mostly about political news in these areas. Cluster 2 contains words like 'government', 'attack', 'iraq', 'police', 'saddam hussein', 'soldiers', 'al qaida', 'baghdad', 'afghanistan', along with somewhat less frequent words like 'american', 'muslim', 'pakistan', 'blast', 'human rights'. So in my view Cluster 2 speaks mostly about terrorism-related news in some specific countries and the issues related to it.
I had also aimed to implement and check the results of applying LDA, but this could not be achieved. In my opinion, more rigorous data preprocessing could have improved the results further.