
News Articles Clustering using Word and Doc Embeddings

Subham Bhattacharjee


Introduction

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. But when the same task is performed on documents without any knowledge of their labels or categories, the problem turns into a clustering problem. In this project we try to cluster news articles using the concepts of word embeddings and doc embeddings.

What are Word Embeddings?

Word embeddings are a type of learned word representation that allows words with similar meanings to have similar representations. Word embeddings are in fact a class of techniques where individual words are represented as real-valued vectors in a predefined vector space. The idea will become clearer with the following example.
The following figure shows the vector representations of some words in a two-dimensional vector space.
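As a toy illustration of the same idea in code: the two-dimensional vectors below are invented for demonstration only (real embeddings are learned from large corpora with models such as word2vec), but they show how words with similar meanings end up close together under cosine similarity.

```python
# Toy word embeddings in a two-dimensional vector space.
# These vectors are invented for illustration; real embeddings are learned from text.
import numpy as np

toy_vectors = {
    "king":   np.array([0.80, 0.60]),
    "queen":  np.array([0.75, 0.70]),
    "apple":  np.array([-0.60, 0.20]),
    "orange": np.array([-0.55, 0.25]),
}

def cosine_similarity(a, b):
    """Cosine similarity: close to 1 for similar words, lower for unrelated ones."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(toy_vectors["king"], toy_vectors["queen"]))   # high
print(cosine_similarity(toy_vectors["king"], toy_vectors["apple"]))   # much lower
```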

Idea and Pathway

The project is divided into the following three stages.

  1. Preprocessing of the articles
  2. Creation of embeddings for the documents
  3. Applying the clustering algorithms

Dataset

There are two datasets which I have chosen for the project:

  1. All the news
  2. Al Jazeera English News

Both of these datasets are news datasets. Though the datasets contain many fields such as TITLE, CONTENT, ID, AUTHOR, DOP, etc., I will mainly focus on the fields from which we get the relevant information.

What to expect by Midway and post midway

By midway I expect to complete the data preprocessing and the generation of vectors for the articles. If possible, I will try to apply the clustering algorithms to generate some results. In the post-midway phase I will apply more clustering algorithms, hope to apply some neural network techniques, try to improve the preprocessing phase if possible, and try to get better results.

Preprocessing of the Articles

Here I have picked the content of the news article as the primary source of information about the news. I have decided to consider the title/heading of the news later, because the title contains some of the important words which can be useful during clustering. After collecting the articles, the preprocessing started with separating the individual words in each article, followed by converting each word to lower case. During preprocessing I also removed stopwords, using the standard NLTK stopwords package. Here is a snapshot of what the news articles look like after the preprocessing.
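A minimal sketch of this preprocessing step is given below, assuming NLTK with its punkt and stopwords resources; the exact tokenisation used in the project may differ slightly.

```python
# Sketch of the preprocessing: tokenise, lower-case and remove stopwords with NLTK.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

STOPWORDS = set(stopwords.words("english"))

def preprocess(article_text):
    """Split an article into lower-cased alphanumeric tokens with stopwords removed."""
    tokens = word_tokenize(article_text)
    tokens = [t.lower() for t in tokens if t.isalnum()]
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("The attacks of 11 September 2001 changed the world."))
# -> ['attacks', '11', 'september', '2001', 'changed', 'world']
```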

There are a few problems with this preprocessing, such as dates being treated as individual words. For example, 11 September 2001 is an important date and is important for conveying the topic of the news, but here it is treated as three different words: 11, september and 2001. There are similar problems with word combinations like '1 billion' being split apart. These are things which still need to be taken care of. I also have not used any stemming or lemmatization yet, but I plan on using them later to see if they improve the results.

Creation of embeddings for the documents

For the creation of the embeddings for the news articles I have used the gensim doc2vec package. Doc2Vec is based on word2vec, but instead of generating embeddings for words it generates embeddings for variable-length documents. Now, words maintain a logical (grammatical) structure, but documents don't have any such logical structure. To solve this problem, another vector (the paragraph ID) needs to be added to the word2vec model. This is the only difference between word2vec and doc2vec. One thing to note is that the paragraph ID is a unique document ID.
Now there are two versions of doc2vec available: the Distributed Memory (DM) model and the Distributed Bag-of-Words (DBOW) model.

We will get a brief idea of both of them below.

Distributed Memory Model

The Distributed Memory (DM) model is similar to the Continuous Bag-of-Words (CBOW) model in word2vec, which attempts to guess the output (target word) from its neighboring words (context words), with the addition of a paragraph ID. Let's say we have a single document, say

"I like natural language processing"

and the model will be predicting a word from the given context words. Then the model will look like the figure below.

So here it learns to predict a word based on the words present in its context. It trains the document vector along with the word vectors, with the intuition that the vector of the document should be good enough to predict the words in the document.

Distributed Bag of Words

The Distributed Bag-of-Words (DBOW) model is similar to the skip-gram model of word2vec, which guesses the context words from a target word. The following figure explains it.

So here it learns to predict the context words based on the document. The only difference between skip-gram and Distributed Bag of Words (DBOW) is that instead of using the target word as the input, DBOW takes the document ID (paragraph ID) as the input and tries to predict randomly sampled words from the document.
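A minimal sketch of training both variants with gensim is given below (the gensim 4.x API is assumed; the two tiny example documents and their tags are placeholders for the real corpus).

```python
# Sketch: training Doc2Vec in its DM and DBOW variants with gensim (4.x API assumed).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each document gets a unique tag, which plays the role of the paragraph ID.
corpus = [
    TaggedDocument(words=["i", "like", "natural", "language", "processing"], tags=["doc_0"]),
    TaggedDocument(words=["stock", "markets", "fell", "sharply", "today"], tags=["doc_1"]),
]

# dm=1 -> Distributed Memory (DM); dm=0 -> Distributed Bag of Words (DBOW).
dm_model = Doc2Vec(corpus, vector_size=150, window=5, min_count=1, dm=1, epochs=40)
dbow_model = Doc2Vec(corpus, vector_size=150, window=5, min_count=1, dm=0, epochs=40)

print(dm_model.dv["doc_0"][:5])   # first few components of a learned document vector
```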

Applying the Clustering Algorithms

After the creation of the doc vectors for the articles, I applied the K-means clustering algorithm. Here I have chosen a vector length of 2 for simplicity of visualization. After the vectors of the documents were created and plotted in 2D, they looked like the following.

At first glance this plot looks like all the news articles are in a single cluster, but since the cluster is spread out horizontally, articles at the extreme ends of the cluster have a high chance of being about different topics, because their vector representations are quite far from each other. So my guess before clustering was that it would be best to have two clusters for this plot. To confirm this, the elbow method was conducted: we calculated the inertia score for cluster counts 1 to 10. Inertia is the sum of squared distances of samples to their closest cluster centre; it is also sometimes called the Sum of Squared Errors (SSE). The following is the expression for inertia.

$$ SSE = \sum_{i=1}^{n}\sum_{j=1}^{k} w^{(i,j)} \left\lVert x^{i} - \mu^{j} \right\rVert_2^2 $$
Here $$\mu^{j}$$ is the centre for cluster j,
and $$ w^{(i,j)} = 1 \text{ if the sample } x^{i} \text{ is in cluster } j,\ 0 \text{ otherwise}. $$
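A minimal sketch of this elbow computation with scikit-learn is shown below; doc_vectors is a random placeholder for the actual document vectors.

```python
# Sketch of the elbow method: compute K-means inertia (SSE) for k = 1..10.
import numpy as np
from sklearn.cluster import KMeans

# Placeholder for the real document vectors (n_documents x vector_size).
doc_vectors = np.random.rand(200, 2)

inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(doc_vectors)
    inertias.append(km.inertia_)   # sum of squared distances to the closest centre

for k, sse in zip(range(1, 11), inertias):
    print(k, sse)
```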

The output of the elbow method looked like the following

The elbow method supported this guess, so after applying the K-means algorithm with two clusters, the output was the following.
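A small sketch of this clustering step, again with placeholder vectors (scikit-learn's KMeans and a matplotlib scatter plot; the real doc2vec vectors would be used in practice):

```python
# Sketch: fit K-means with the chosen number of clusters and plot the 2D result.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

doc_vectors = np.random.rand(200, 2)   # placeholder for the 2D document vectors
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(doc_vectors)

plt.scatter(doc_vectors[:, 0], doc_vectors[:, 1], c=labels, s=10)
plt.title("News articles clustered with K-means")
plt.show()
```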

But as you can notice, since we are restricting ourselves to vectors of length 2, it is very difficult to represent the documents in 2D while still retaining their features. So instead of creating 2-dimensional vectors directly, I created vectors of dimension 150 and then used a dimensionality reduction algorithm to get the length-2 vectors. For the dimensionality reduction I have used the t-SNE algorithm, because it gives better results.

t-SNE

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets. It is extensively applied in image processing, NLP, genomic data and speech processing.
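A minimal sketch of this reduction step with scikit-learn's TSNE, using random 150-dimensional placeholders for the real document vectors:

```python
# Sketch: reduce 150-dimensional document vectors to 2D with t-SNE for visualisation.
import numpy as np
from sklearn.manifold import TSNE

doc_vectors_150d = np.random.rand(200, 150)   # placeholder for the real doc2vec vectors

tsne = TSNE(n_components=2, perplexity=30, random_state=0)
doc_vectors_2d = tsne.fit_transform(doc_vectors_150d)

print(doc_vectors_2d.shape)   # (200, 2): ready for plotting and clustering
```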

A brief overview

Working Principle of t-SNE
