K-Means

Coding the algorithm and other resources: https://mubaris.com/posts/kmeans-clustering/

The K-Means algorithm is used to cluster all sorts of data.

It can group together

  1. Books of similar genres or written by the same authors.

  2. Similar movies.

  3. Similar music.

  4. Similar groups of customers.

This clustering can lead to product, movie, music and other types of recommendations.

In the K-means algorithm, 'k' represents the number of clusters you have in your dataset.

How do you find the number of clusters when you have no idea?

The main point of interest is what is known as the elbow: plot the total distance from each point to its assigned centroid for increasing values of k, and look for the value of k where that distance stops decreasing sharply. Finding the elbow is a bit of a judgement call.
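
A minimal sketch of an elbow plot with scikit-learn, using generated blob data to stand in for your own feature matrix:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data standing in for your own feature matrix X
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Fit k-means for a range of k and record the total squared distance
# from each point to its closest centroid (sklearn calls it inertia_)
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in ks]

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Total squared distance to closest centroid")
plt.title("Elbow plot")
plt.show()
```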

How Does K-Means Work?

You already know most of what you need to know about how the k-means algorithm works:

  1. You choose k as the number of clusters you believe to be in your dataset or...

  2. You use the elbow method to determine k for your data.

Then k clusters are created within your dataset, and each point is assigned to one of them.

However, to understand what edge cases might occur when grouping points together, it is necessary to understand exactly what the k-means algorithm is doing. Here is one method for computing k-means:

1. Randomly place k centroids amongst your data.

Then, within a loop until convergence, perform the following two steps:

2. Assign each point to the closest centroid.

3. Move the centroid to the center of the points assigned to it.

At the end of this process, you should have k clusters of points.
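
A minimal NumPy sketch of those three steps, to make the loop concrete (a toy implementation for illustration; empty clusters and other edge cases are not handled):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Toy k-means: random init, then assign/update until convergence."""
    rng = np.random.default_rng(seed)

    # 1. Randomly place k centroids amongst the data
    #    (here: pick k distinct data points as starting centroids)
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(n_iter):
        # 2. Assign each point to the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # 3. Move each centroid to the center of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  for j in range(k)])

        # Convergence: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return labels, centroids
```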

Starting points and average distance to the centers

The starting points of the centroids can actually make a difference in the final results you obtain from the k-means algorithm.

Starting points should be very different from one run to the next.

To ensure you end up with the "best" set of clusters, the algorithm above is performed a few times with different starting points. The best set of clusters is then the clustering that produces the smallest average distance from each point to its corresponding centroid.
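
scikit-learn's KMeans handles this via its n_init parameter: the algorithm is run that many times from different random starting centroids, and the run with the lowest total distance (inertia) is kept. A quick sketch:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Run k-means 10 times from different random starting centroids;
# the clustering with the smallest inertia is the one that is kept
model = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

print(model.inertia_)          # total squared distance of the winning run
print(model.cluster_centers_)  # centroids of the best clustering
```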

Feature scaling

For any machine learning algorithm that uses distances as a part of its optimization, it is important to scale your features.

You saw this earlier in regularized forms of regression like Ridge and Lasso, but it is also true for k-means. In future sections on PCA and ICA, feature scaling will again be important for the successful optimization of your machine learning algorithms.

Though there are many ways to scale your features, two are the most common:

  1. Normalizing or Min-Max Scaling - this type of scaling transforms variable values so they fall between 0 and 1.

  2. Standardizing or Z-Score Scaling - this type of scaling transforms variable values so they have a mean of 0 and standard deviation of 1.

Standardizing is the usual choice. Min-max normalizing is commonly used for things like scaling the color values of an image.
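
Both are one-liners in scikit-learn and are typically applied before fitting k-means. A sketch, again with toy data standing in for your feature matrix:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Standardizing (z-score): each feature gets mean 0, std dev 1
X_std = StandardScaler().fit_transform(X)

# Normalizing (min-max): each feature is squeezed into [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

# Cluster on the standardized features
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X_std)
```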

Finally, note that k-means doesn't work well for all types of data: because it groups points by distance to a centroid, it tends to carve the data into roughly spherical, similarly sized clusters.
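
scikit-learn's two-moons data is a quick way to see where that fails, and where a density-based method such as DBSCAN does better (a sketch; eps=0.3 is an assumed value that works for this noise level):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# k-means splits the moons with a straight boundary and mixes them up
km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# DBSCAN follows the density of the points and recovers each moon
db_labels = DBSCAN(eps=0.3).fit_predict(X)
```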
