Advait Ramesh Iyer

I find unbiased and generalisable patterns, and effectively communicate insights to non-technical audiences.

Top words and k-means for text clustering


A large text file is grouped into 10 clusters after its top 500 words are identified.

1. Pre-processing

Cleaned the continuous text by splitting it into sentences and removing "\n" characters and headers.
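A minimal sketch of this step, assuming sentences end with ".", "!" or "?" followed by whitespace. The function name split_into_sentences and the sample text are hypothetical, and header removal is left out because the post does not describe what the headers look like.

```python
import re

def split_into_sentences(text):
    """Flatten newlines to spaces, then split on sentence-ending punctuation."""
    flat = text.replace("\n", " ")
    # Split at whitespace that follows '.', '!' or '?' (an assumed sentence rule).
    parts = re.split(r"(?<=[.!?])\s+", flat)
    return [p.strip() for p in parts if p.strip()]

# Hypothetical sample input, not taken from the original data.
sample = "First line of\ntext. Second sentence here!\nAnd a third one?"
sentences = split_into_sentences(sample)
```

A regex split like this will mis-handle abbreviations such as "e.g."; a dedicated sentence tokenizer may be a better fit for real text.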

2. Top-words

Built a dictionary of word counts and sorted it in descending order of count. Identified the top 500 words.
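One way this could look, assuming "top" means most frequent and that words are lowercase alphabetic tokens. The helper top_words and the toy sentences are illustrative, not the code from the post.

```python
import re
from collections import Counter

def top_words(sentences, n=500):
    """Count lowercase word tokens and return the n most frequent words."""
    counts = Counter()
    for s in sentences:
        # Assumed tokenization: runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", s.lower()))
    # most_common sorts by count in descending order.
    return [word for word, _ in counts.most_common(n)]

# Toy input; n is reduced from 500 so the result is easy to inspect.
toy_sentences = ["the cat sat", "the dog sat", "the bird flew"]
vocab = top_words(toy_sentences, n=2)
```

With real text, common function words tend to dominate such a list, so filtering stop words before counting may be worth considering.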

3. Vectorization

Vectorized the words.
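The post does not say how the vectors were built. A plausible reading, sketched here as an assumption, is that each sentence becomes a vector of counts over the top words (a bag-of-words representation). The function vectorize and the sample data are hypothetical.

```python
def vectorize(sentences, vocab):
    """Return one count vector per sentence, with one column per vocab word."""
    index = {word: i for i, word in enumerate(vocab)}
    rows = []
    for s in sentences:
        row = [0] * len(vocab)
        for token in s.lower().split():
            col = index.get(token)
            if col is not None:  # words outside the vocabulary are ignored
                row[col] += 1
        rows.append(row)
    return rows

# Toy vocabulary standing in for the top 500 words.
toy_vocab = ["the", "sat", "cat"]
X = vectorize(["The cat sat", "the dog sat", "a bird flew"], toy_vocab)
```

scikit-learn's CountVectorizer accepts a fixed vocabulary and produces the same kind of matrix in sparse form, which scales better to large files.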

4. K-Means Clustering

Ran k-means with k=10 using scikit-learn.
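A short sketch of the clustering call with scikit-learn's KMeans. The random matrix below is placeholder data standing in for the 500-column sentence vectors, and n_init and random_state are illustrative choices rather than settings from the post.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder data: 200 "sentences" with 500 count features each.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 500)).astype(float)

# k = 10 as in the post; random_state fixes the initialization for repeatability.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)  # one cluster index (0 to 9) per row of X
```

After fitting, kmeans.cluster_centers_ holds one 500-dimensional centroid per cluster, and the largest entries of a centroid point to the words that characterize that cluster.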

Check out the code here.

