Questions tagged [k-means]
k-means is a clustering algorithm, implemented in popular data science tools. Use this tag for questions related to the k-means clustering algorithm itself, or to its use with the tools that implement it (alongside other tags specific to those tools).
                                	
    
                            
                        
                    
            3,494
            questions
        
        
            463
            votes
        
        
            8
            answers
        
        
            284k
            views
        
    Cluster analysis in R: determine the optimal number of clusters
                How can I choose the best number of clusters for a k-means analysis? After plotting a subset of the data below, how many clusters will be appropriate? How can I perform cluster dendro analysis?
n = 1000
...
            
        
       
    
            234
            votes
        
        
            11
            answers
        
        
            129k
            views
        
    Is it possible to specify your own distance function using scikit-learn K-Means Clustering?
            
        
       
    
            154
            votes
        
        
            20
            answers
        
        
            126k
            views
        
    How do I determine k when using k-means clustering?
                I've been studying k-means clustering, and one thing that's not clear is how you choose the value of k. Is it just a matter of trial and error, or is there more to it?
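One common heuristic, often called the elbow method, is to run k-means for several values of k and look for the point where the within-cluster sum of squares stops dropping sharply. The following is a minimal sketch on one-dimensional data in plain Python. The helper name `kmeans_1d`, its evenly spaced initialization, and the sample values are assumptions made for illustration and do not come from any library.

```python
def kmeans_1d(values, k, max_iter=100):
    """Run a simple k-means on 1-D data; return (centers, inertia)."""
    data = sorted(values)
    # Hypothetical deterministic init: k evenly spaced elements of the sorted data.
    centers = [data[(2 * j + 1) * len(data) // (2 * k)] for j in range(k)]
    for _ in range(max_iter):
        # Assignment step: put each value in the group of its nearest center.
        groups = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda j: (v - centers[j]) ** 2)
            groups[nearest].append(v)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        new_centers = [
            sum(g) / len(g) if g else centers[j] for j, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    # Inertia: sum of squared distances from each value to its nearest center.
    inertia = sum(min((v - c) ** 2 for c in centers) for v in data)
    return centers, inertia


# Made-up sample with three visible groups around 1, 5 and 9.
values = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8, 9.0, 9.1, 8.9]
inertias = {k: kmeans_1d(values, k)[1] for k in range(1, 6)}
for k, inertia in inertias.items():
    print(k, round(inertia, 3))
```

On data like this the inertia falls steeply up to k = 3 and only marginally afterwards, which is the "elbow". On real data the bend is often less clear, so the plot is a guide rather than a rule.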
            
        
       
    
            121
            votes
        
        
            3
            answers
        
        
            186k
            views
        
    Will scikit-learn utilize GPU?
                Reading implementation of scikit-learn in TensorFlow: http://learningtensorflow.com/lesson6/ and scikit-learn: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html I'm ...
            
        
       
    
            60
            votes
        
        
            6
            answers
        
        
            3k
            views
        
    Branchless K-means (or other optimizations)
                Note: I'd appreciate more of a guide to how to approach and come up with these kinds of solutions rather than the solution itself.
I have a very performance-critical function in my system showing up ...
            
        
       
    
            60
            votes
        
        
            18
            answers
        
        
            58k
            views
        
    K-means algorithm variation with equal cluster size
                I'm looking for the fastest algorithm for grouping points on a map into equally sized groups, by distance. The k-means clustering algorithm looks straightforward and promising, but does not produce ...
            
        
       
    
            55
            votes
        
        
            3
            answers
        
        
            75k
            views
        
    Scikit Learn - K-Means - Elbow - criterion
                Today I'm trying to learn something about K-means. I understand the algorithm and I know how it works. Now I'm looking for the right k... I found the elbow criterion as a method to detect the ...
            
        
       
    
            50
            votes
        
        
            7
            answers
        
        
            76k
            views
        
    How to get the samples in each cluster?
                I am using the sklearn.cluster KMeans package. Once I finish the clustering, how can I find out which values were grouped together?
Say I had 100 data points and KMeans gave me 5 cluster....
            
        
       
    
            49
            votes
        
        
            8
            answers
        
        
            91k
            views
        
    Python k-means algorithm
                I am looking for a Python implementation of the k-means algorithm, with examples, to cluster and cache my database of coordinates.
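For orientation, the standard (Lloyd-style) k-means loop alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points. Below is a minimal sketch in plain Python, not a tuned implementation. The function names, the `initial_centroids` parameter, and the sample coordinates are all hypothetical and chosen only for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length coordinate tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def kmeans(points, k, initial_centroids=None, max_iter=100, seed=0):
    """Lloyd-style k-means; returns (centroids, labels)."""
    if initial_centroids is None:
        # Assumption: seed with k input points drawn at random without replacement.
        centroids = random.Random(seed).sample(points, k)
    else:
        centroids = list(initial_centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids will not move
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = mean_point(members)
    return centroids, labels


# Made-up coordinates forming two obvious groups.
coords = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = kmeans(coords, k=2, initial_centroids=[coords[0], coords[3]])
print(labels)
print(centroids)
```

The result depends on the starting centroids, so in practice the loop is usually run from several random starts and the run with the lowest total squared distance is kept.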
            
        
       
    
            48
            votes
        
        
            3
            answers
        
        
            47k
            views
        
    Simple approach to assigning clusters for new data after k-means clustering
                I'm running k-means clustering on a data frame df1, and I'm looking for a simple approach to computing the closest cluster center for each observation in a new data frame df2 (with the same variable ...
            
        
       
    
            46
            votes
        
        
            4
            answers
        
        
            38k
            views
        
    kmeans: Quick-TRANSfer stage steps exceeded maximum
                I am running k-means clustering in R on a dataset with 636,688 rows and 7 columns using the standard stats package: kmeans(dataset, centers = 100, nstart = 25, iter.max = 20). 
I get the following ...
            
        
       
    
            42
            votes
        
        
            7
            answers
        
        
            32k
            views
        
    Kmeans without knowing the number of clusters? [duplicate]
                I am attempting to apply k-means on a set of high-dimensional data points (about 50 dimensions) and was wondering if there are any implementations that find the optimal number of clusters. 
I ...
            
        
       
    
            41
            votes
        
        
            3
            answers
        
        
            34k
            views
        
    How Could One Implement the K-Means++ Algorithm?
                I am having trouble fully understanding the K-Means++ algorithm. I am interested in exactly how the first k centroids are picked, namely the initialization, as the rest is like in the original K-Means ...
            
        
       
    
            40
            votes
        
        
            2
            answers
        
        
            52k
            views
        
    Calculating the percentage of variance measure for k-means?
                On the Wikipedia page, an elbow method is described for determining the number of clusters in k-means. The built-in method of scipy provides an implementation but I am not sure I understand how the ...
            
        
       
    
            37
            votes
        
        
            2
            answers
        
        
            66k
            views
        
    Will pandas dataframe object work with sklearn kmeans clustering?
                dataset is a pandas DataFrame. This is sklearn.cluster.KMeans:
 km = KMeans(n_clusters = n_Clusters)
 km.fit(dataset)
 prediction = km.predict(dataset)
This is how I decide which entity belongs to ...
            
        
       
    
            36
            votes
        
        
            3
            answers
        
        
            36k
            views
        
    What makes the distance measure in k-medoid "better" than k-means?
                I am reading about the difference between k-means clustering and k-medoid clustering.
Supposedly there is an advantage to using the pairwise distance measure in the k-medoid algorithm, instead of the ...
            
        
       
    
            34
            votes
        
        
            1
            answer
        
        
            37k
            views
        
    Cluster one-dimensional data optimally? [closed]
                Does anyone have a paper that explains how the Ckmeans.1d.dp algorithm works?
Or: what is the most optimal way to do k-means clustering in one-dimension?
            
        
       
    
            32
            votes
        
        
            2
            answers
        
        
            42k
            views
        
    Scikit-learn: How to run KMeans on a one-dimensional array?
                I have an array of 13.876(13,876) values between 0 and 1. I would like to apply sklearn.cluster.KMeans to only this vector to find the different clusters in which the values are grouped. However, it ...
            
        
       
    
            31
            votes
        
        
            5
            answers
        
        
            43k
            views
        
    What is the difference between "k means" and "fuzzy c means" objective functions?
                I am trying to see if the performance of both can be compared based on the objective functions they work on?
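For reference, the two objectives are usually written as follows. This is a sketch in common textbook notation, which is an assumption here and not taken from the question: $x_i$ are the $n$ data points, $c_k$ the $K$ cluster centers, $r_{ik}$ the hard assignments, $u_{ik}$ the soft memberships, and $m$ the fuzzifier.

```latex
% k-means: hard assignments, each point belongs to exactly one cluster
\[
J_{\text{k-means}} = \sum_{i=1}^{n} \sum_{k=1}^{K} r_{ik} \, \lVert x_i - c_k \rVert^2,
\qquad r_{ik} \in \{0, 1\}, \quad \sum_{k=1}^{K} r_{ik} = 1
\]

% fuzzy c-means: soft memberships weighted by the fuzzifier m > 1
\[
J_{\text{FCM}} = \sum_{i=1}^{n} \sum_{k=1}^{K} u_{ik}^{m} \, \lVert x_i - c_k \rVert^2,
\qquad u_{ik} \in [0, 1], \quad \sum_{k=1}^{K} u_{ik} = 1, \quad m > 1
\]
```

The two expressions have the same form. The difference is that k-means restricts each membership to 0 or 1, while fuzzy c-means lets it vary continuously and raises it to the power $m$.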
            
        
       
    
            31
            votes
        
        
            3
            answers
        
        
            44k
            views
        
    Understanding "score" returned by scikit-learn KMeans
                I applied clustering on a set of text documents (about 100). I converted them to Tfidf vectors using TfIdfVectorizer and supplied the vectors as input to scikitlearn.cluster.KMeans(n_clusters=2, init='...
            
        
       
    
            30
            votes
        
        
            1
            answer
        
        
            20k
            views
        
    Online k-means clustering
                Is there an online version of the k-Means clustering algorithm?
By online I mean that every data point is processed in serial, one at a time as they enter the system, hence saving computing time when ...
            
        
       
    
            28
            votes
        
        
            5
            answers
        
        
            95k
            views
        
    Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
                I have a data table ("norm") containing numeric - at least to what I can see - normalized values of the following form:
When I am executing
k <- kmeans(norm,center=3)
I am receiving the following ...
            
        
       
    
            27
            votes
        
        
            2
            answers
        
        
            54k
            views
        
    What is the time complexity of k-means?
                I was going through the k-means Wikipedia page. Based on the algorithm, I think the complexity is O(n*k*i) (n = total elements, k = number of clusters, i = number of iterations)
So can someone explain me this ...
            
        
       
    
            27
            votes
        
        
            2
            answers
        
        
            22k
            views
        
    Group n points in k clusters of equal size [duplicate]
                Possible Duplicate:
  K-means algorithm variation with equal cluster size  
EDIT: as casperOne pointed out to me, this question is a duplicate. Anyway, here is a more generalized question that ...
            
        
       
    
            26
            votes
        
        
            1
            answer
        
        
            42k
            views
        
    Clustering text documents using scikit-learn kmeans in Python
                I need to implement scikit-learn's kMeans for clustering text documents. The example code works fine as it is but takes some 20newsgroups data as input. I want to use the same code for clustering a ...
            
        
       
    
            26
            votes
        
        
            6
            answers
        
        
            18k
            views
        
    Fast (< n^2) clustering algorithm
                I have 1 million 5-dimensional points that I need to group into k clusters with k << 1 million. In each cluster, no two points should be too far apart (e.g. they could be bounding spheres with a ...
            
        
       
    
            26
            votes
        
        
            3
            answers
        
        
            36k
            views
        
    Using K-means with cosine similarity - Python
                I am trying to implement the k-means algorithm in Python using cosine distance instead of Euclidean distance as the distance metric.
I understand that using a different distance function can be fatal ...
            
        
       
    
            26
            votes
        
        
            2
            answers
        
        
            4k
            views
        
    Estimation of number of Clusters via gap statistics and prediction strength
                I am trying to translate the R implementations of gap statistics and prediction strength http://edchedch.wordpress.com/2011/03/19/counting-clusters/ into python scripts for the estimation of number of ...
            
        
       
    
            25
            votes
        
        
            3
            answers
        
        
            72k
            views
        
    kmeans scatter plot: plot different colors per cluster
                I am trying to do a scatter plot of a kmeans output which clusters sentences of the same topic together. The problem I am facing is plotting the points that belong to each cluster in a certain color.
...
            
        
       
    
            25
            votes
        
        
            2
            answers
        
        
            25k
            views
        
    K-Means: Lloyd,Forgy,MacQueen,Hartigan-Wong
                I'm working with the K-Means algorithm in R and I want to figure out the differences between the four algorithms (Lloyd, Forgy, MacQueen and Hartigan-Wong) which are available for the function "kmeans" in the ...
            
        
       
    
            24
            votes
        
        
            11
            answers
        
        
            124k
            views
        
    setting an array element with a sequence requested array has an inhomogeneous shape after 1 dimensions The detected shape was (2,)+inhomogeneous part
                import os
import numpy as np
from scipy.signal import *
import csv
import matplotlib.pyplot as plt
from scipy import signal
from brainflow.board_shim import BoardShim, BrainFlowInputParams, LogLevels,...
            
        
       
    
            24
            votes
        
        
            5
            answers
        
        
            33k
            views
        
    Changes of clustering results after each time run in Python scikit-learn
                I have a bunch of sentences and I want to cluster them using scikit-learn spectral clustering. I've run the code and get the results with no problem. But, every time I run it I get different results. ...
            
        
       
    
            23
            votes
        
        
            2
            answers
        
        
            16k
            views
        
    How does pytorch backprop through argmax?
                I'm building Kmeans in pytorch using gradient descent on centroid locations, instead of expectation-maximisation. Loss is the sum of square distances of each point to its nearest centroid.  To ...
            
        
       
    
            22
            votes
        
        
            7
            answers
        
        
            30k
            views
        
    Can k-means clustering do classification?
                I want to know whether the k-means clustering algorithm can do classification.
Say I have done a simple k-means clustering.
Assume I have a lot of data; I use k-means clustering, then get 2 clusters A,...
            
        
       
    
            22
            votes
        
        
            6
            answers
        
        
            25k
            views
        
    scikit-learn: Finding the features that contribute to each KMeans cluster
                Say you have 10 features you are using to create 3 clusters. Is there a way to see the level of contribution each of the features has for each of the clusters?
What I want to be able to say is that ...
            
        
       
    
            21
            votes
        
        
            4
            answers
        
        
            38k
            views
        
    ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive) when using silhouette_score
                I am trying to calculate silhouette score as I find the optimal number of clusters to create, but get an error that says:
ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (...
            
        
       
    
            21
            votes
        
        
            3
            answers
        
        
            19k
            views
        
    How would I implement k-means with TensorFlow?
                The intro tutorial, which uses the built-in gradient descent optimizer, makes a lot of sense. However, k-means isn't just something I can plug into gradient descent. It seems like I'd have to write my ...
            
        
       
    
            21
            votes
        
        
            5
            answers
        
        
            38k
            views
        
    How can I perform K-means clustering on time series data?
                How can I do K-means clustering of time series data?
I understand how this works when the input data is a set of points, but I don't know how to cluster a time series with 1XM, where M is the data ...
            
        
       
    
            20
            votes
        
        
            1
            answer
        
        
            29k
            views
        
    How to add k-means predicted clusters in a column to a dataframe in Python
                I have a question about kmeans clustering in python.
So I did the analysis that way:
from sklearn.cluster import KMeans
km = KMeans(n_clusters=12, random_state=1)
new = data._get_numeric_data()....
            
        
       
    
            19
            votes
        
        
            3
            answers
        
        
            33k
            views
        
    plot a document tfidf 2D graph
                I would like to plot a 2D graph with the x-axis as term and y-axis as TFIDF score (or document id) for my list of sentences. I used scikit-learn's fit_transform() to get the scipy matrix but I do not ...
            
        
       
    
            19
            votes
        
        
            2
            answers
        
        
            38k
            views
        
    Clustering geo location coordinates (lat,long pairs) using KMeans algorithm with Python
                Using the following code to cluster geolocation coordinates results in 3 clusters:
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.vq import kmeans2, whiten
    ...
            
        
       
    
            19
            votes
        
        
            5
            answers
        
        
            25k
            views
        
    How to calculate BIC for k-means clustering in R
                I've been using k-means to cluster my data in R but I'd like to be able to assess the fit vs. model complexity of my clustering using Bayesian Information Criterion (BIC) and AIC. Currently the code I'...
            
        
       
    
            18
            votes
        
        
            3
            answers
        
        
            36k
            views
        
    OpenCV using k-means to posterize an image
                I want to posterize an image with k-means and OpenCV in the C++ interface (cv namespace) and I get weird results. I need it to reduce some noise. This is my code:
#include "cv.h"
#include "...
            
        
       
    
            17
            votes
        
        
            2
            answers
        
        
            38k
            views
        
    KMeans clustering in PySpark
                I have a Spark dataframe 'mydataframe' with many columns. I am trying to run kmeans on only two columns, lat and long (latitude and longitude), using them as simple values. I want to extract 7 ...
            
        
       
    
            17
            votes
        
        
            4
            answers
        
        
            23k
            views
        
    Can I use K-means algorithm on a string?
                I am working on a Python project where I study RNA structure evolution (represented as a string, for example "(((...)))", where the parentheses represent base pairs). The point is that I have an ...
            
        
       
    
            17
            votes
        
        
            2
            answers
        
        
            21k
            views
        
    How to set k-Means clustering labels from highest to lowest with Python?
                I have a dataset of 38 apartments and their electricity consumption in the morning, afternoon and evening. I am trying to cluster this dataset using the k-Means implementation from scikit-learn, ...
            
        
       
    
            17
            votes
        
        
            2
            answers
        
        
            60k
            views
        
    How to identify Cluster labels in kmeans scikit learn
                I am learning python scikit.
The example given here 
displays the top occurring words in each Cluster and not Cluster name.
http://scikit-learn.org/stable/auto_examples/document_clustering.html
I ...
            
        
       
    
            16
            votes
        
        
            1
            answer
        
        
            67k
            views
        
    How to use silhouette score in k-means clustering from sklearn library?
                I'd like to use the silhouette score in my script to automatically compute the number of clusters in k-means clustering from sklearn.
import numpy as np
import pandas as pd
import csv
from sklearn.cluster ...
            
        
       
    
            16
            votes
        
        
            1
            answer
        
        
            36k
            views
        
    initial centroids for scikit-learn kmeans clustering
                If I already have a numpy array that can serve as the initial centroids, how can I properly initialize the kmeans algorithm? I am using the scikit-learn KMeans class.
this post (k-means with selected ...
            
        
       
    
            16
            votes
        
        
            2
            answers
        
        
            3k
            views
        
    How to detect multiple objects with OpenCV in C++?
                I got inspiration from this answer, which is a Python implementation, but I need C++. That answer works very well; the idea is: detectAndCompute to get keypoints, use kmeans to ...