K-Means

Coding the algorithm and other resources: https://mubaris.com/posts/kmeans-clustering/

The K-Means algorithm is used to cluster all sorts of data.

It can group together

  1. Books of similar genres or written by the same authors.

  2. Similar movies.

  3. Similar music.

  4. Similar groups of customers.

This clustering can lead to product, movie, music and other types of recommendations.

In the k-means algorithm, 'k' represents the number of clusters you have in your dataset.

How do you find the number of clusters when you have no idea?

The main point of interest is what is known as the elbow: plot the average distance from each point to its centroid for a range of values of k, and look for the point where the decrease in that distance levels off sharply. Finding the elbow is a bit of a judgement call.
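Here is a minimal sketch of the elbow method using scikit-learn; the toy data from `make_blobs` and the range of k values are just for illustration, and inertia (the total squared distance from points to their centroids) is used as a close proxy for the average distance.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data purely for illustration; substitute your own dataset.
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Fit k-means for a range of k values and record the inertia
# (sum of squared distances from each point to its centroid).
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("Inertia")
plt.title("Elbow method")
plt.show()
```

For cleanly separated blobs like these, the bend in the curve tends to land near the true number of clusters; on real data the bend is usually less obvious.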

How Does K-Means Work?

Even before this video, you knew most of what you need to know about how the k-means algorithm works:

  1. You choose k as the number of clusters you believe to be in your dataset or...

  2. You use the elbow method to determine k for your data.

Then this number of clusters is created within your dataset, where each point is assigned to exactly one cluster.

However, to understand what edge cases might occur when grouping points together, it is necessary to understand exactly what the k-means algorithm is doing. Here is one method for computing k-means:

1. Randomly place k centroids amongst your data.

Then, until convergence, repeat the following two steps in a loop:

2. Assign each point to the closest centroid.

3. Move the centroid to the center of the points assigned to it.

At the end of this process, you should have k clusters of points.
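Below is a minimal NumPy sketch of the three steps above; the function name, parameters, and convergence check are illustrative, not a reference implementation.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means: X is an (n_points, n_features) array."""
    rng = np.random.default_rng(seed)
    # 1. Randomly place k centroids amongst the data
    #    (here: k distinct points chosen at random).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assign each point to the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Move each centroid to the center (mean) of its assigned points;
        #    a centroid with no points simply stays where it is.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Converged once the centroids stop moving.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Calling `kmeans(X, 4)` on the blob data above returns a cluster label for each point plus the final centroid positions.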

Starting points and average distance to the centers

In this video, you saw how the starting points of the centroids can actually make a difference in the final results you obtain from the k-means algorithm.

Starting points should be very different from one try to the next, so that each run explores a different possible clustering.

To help ensure you have the "best" set of clusters, the algorithm you saw earlier is performed several times with different starting points. The best set of clusters is then the clustering that produces the smallest average distance from each point to its corresponding centroid.
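scikit-learn's `KMeans` does this automatically: the `n_init` parameter controls how many times the algorithm is run from different starting centroids, and the run with the lowest inertia is kept. A short sketch, using `init="random"` to match the random starting points described here (scikit-learn's default, "k-means++", instead picks smarter starting points):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Run k-means 10 times from different random starting centroids;
# the fitted model keeps the run with the lowest inertia.
model = KMeans(n_clusters=4, init="random", n_init=10, random_state=42).fit(X)
print(model.inertia_)  # score of the best of the 10 runs
```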

Feature scaling

For any machine learning algorithm that uses distances as a part of its optimization, it is important to scale your features.

You saw this earlier in regularized forms of regression like Ridge and Lasso, but it is also true for k-means. In future sections on PCA and ICA, feature scaling will again be important for the successful optimization of your machine learning algorithms.

Though there are a large number of ways that you can go about scaling your features, there are two ways that are most common:

  1. Normalizing or Min-Max Scaling - this type of scaling transforms variable values to between 0 and 1.

  2. Standardizing or Z-Score Scaling - this type of scaling transforms variable values so they have a mean of 0 and standard deviation of 1.

In practice, standardizing is the more common choice. Normalizing is typically used when values have a natural fixed range, such as the color intensities of an image's pixels.
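As a quick illustration of the two approaches, here is a sketch using scikit-learn's scalers on a tiny made-up array; note how differently the two columns are treated before and after scaling.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales: without scaling, the second
# column would dominate any distance-based algorithm like k-means.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Min-max scaling: each column is transformed to the [0, 1] range.
print(MinMaxScaler().fit_transform(X))

# Z-score scaling: each column gets mean 0 and standard deviation 1.
print(StandardScaler().fit_transform(X))
```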

Keep in mind that k-means doesn't work well for all types of data: because it groups points by distance to a centroid, it struggles with clusters that aren't roughly spherical or that differ greatly in size.
