Questions tagged [k-means]
k-means is a clustering algorithm, implemented in popular data science tools. Use this tag for questions related to the k-means clustering algorithm itself, or to its use with the tools that implement it (alongside other tags specific to those tools).
k-means
3,494
questions
463
votes
8
answers
284k
views
Cluster analysis in R: determine the optimal number of clusters
How can I choose the best number of clusters for a k-means analysis? After plotting a subset of the data below, how many clusters will be appropriate? How can I perform cluster dendro analysis?
n = 1000
...
234
votes
11
answers
129k
views
Is it possible to specify your own distance function using scikit-learn K-Means Clustering?
154
votes
20
answers
126k
views
How do I determine k when using k-means clustering?
I've been studying k-means clustering, and one thing that's not clear is how you choose the value of k. Is it just a matter of trial and error, or is there more to it?
121
votes
3
answers
186k
views
Will scikit-learn utilize GPU?
Reading implementation of scikit-learn in TensorFlow: http://learningtensorflow.com/lesson6/ and scikit-learn: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html I'm ...
60
votes
6
answers
3k
views
Branchless K-means (or other optimizations)
Note: I'd appreciate more of a guide to how to approach and come up with these kinds of solutions rather than the solution itself.
I have a very performance-critical function in my system showing up ...
60
votes
18
answers
58k
views
K-means algorithm variation with equal cluster size
I'm looking for the fastest algorithm for grouping points on a map into equally sized groups, by distance. The k-means clustering algorithm looks straightforward and promising, but does not produce ...
55
votes
3
answers
75k
views
Scikit Learn - K-Means - Elbow - criterion
Today I'm trying to learn something about K-means. I understand the algorithm and I know how it works. Now I'm looking for the right k... I found the elbow criterion as a method to detect the ...
50
votes
7
answers
76k
views
How to get the samples in each cluster?
I am using the sklearn.cluster KMeans package. Once I finish the clustering, how can I find out which values were grouped together?
Say I had 100 data points and KMeans gave me 5 clusters ...
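One way to approach this, as a minimal sketch: a fitted scikit-learn KMeans exposes one cluster label per input row in its `labels_` attribute, so the samples can be grouped by label. To keep the sketch runnable without scikit-learn, a hand-written `labels` list stands in for `labels_` (an assumption); `samples` and `group_by_cluster` are hypothetical names.

```python
from collections import defaultdict


def group_by_cluster(samples, labels):
    """Map each cluster label to the list of samples assigned to it."""
    clusters = defaultdict(list)
    for sample, label in zip(samples, labels):
        clusters[label].append(sample)
    return dict(clusters)


# Hypothetical data: `labels` stands in for the `labels_` array of a fitted KMeans.
samples = [(1.0, 1.1), (0.9, 1.0), (5.0, 5.2), (5.1, 4.9), (9.0, 0.1)]
labels = [0, 0, 1, 1, 2]

clusters = group_by_cluster(samples, labels)
for label, members in sorted(clusters.items()):
    print(label, members)
```

The same grouping works for any number of clusters, since it only relies on the labels being aligned with the rows that were clustered.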
49
votes
8
answers
91k
views
Python k-means algorithm
I am looking for a Python implementation of the k-means algorithm with examples to cluster and cache my database of coordinates.
48
votes
3
answers
47k
views
Simple approach to assigning clusters for new data after k-means clustering
I'm running k-means clustering on a data frame df1, and I'm looking for a simple approach to computing the closest cluster center for each observation in a new data frame df2 (with the same variable ...
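The underlying idea is language-independent and can be sketched in a few lines: each new observation goes to the center with the smallest squared Euclidean distance. The sketch below is in Python rather than R, and `centers` and `new_points` are hypothetical values standing in for the fitted centers and the rows of the new data frame; the returned indices are zero-based.

```python
def nearest_center(point, centers):
    """Return the index of the center closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda j: sum((x - y) ** 2 for x, y in zip(point, centers[j])),
    )


# Hypothetical fitted centers and new observations with the same variables.
centers = [(0.0, 0.0), (10.0, 10.0)]
new_points = [(1.0, 2.0), (9.0, 8.5), (4.0, 4.0)]

assignments = [nearest_center(p, centers) for p in new_points]
print(assignments)  # [0, 1, 0]
```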
46
votes
4
answers
38k
views
kmeans: Quick-TRANSfer stage steps exceeded maximum
I am running k-means clustering in R on a dataset with 636,688 rows and 7 columns using the standard stats package: kmeans(dataset, centers = 100, nstart = 25, iter.max = 20).
I get the following ...
42
votes
7
answers
32k
views
Kmeans without knowing the number of clusters? [duplicate]
I am attempting to apply k-means on a set of high-dimensional data points (about 50 dimensions) and was wondering if there are any implementations that find the optimal number of clusters.
I ...
41
votes
3
answers
34k
views
How Could One Implement the K-Means++ Algorithm?
I am having trouble fully understanding the K-Means++ algorithm. I am interested in exactly how the first k centroids are picked, namely the initialization, as the rest is like in the original K-Means ...
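The seeding step can be sketched as follows, under the usual description of k-means++: the first center is drawn uniformly at random, and each later center is drawn with probability proportional to its squared distance from the nearest center chosen so far. This is a minimal pure-Python illustration, not any particular library's implementation; `kmeans_pp_init` and `squared_distance` are hypothetical names, and the sketch assumes the data contains at least k distinct points.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng=None):
    """Pick k initial centers from `points` using k-means++ style seeding."""
    rng = rng or random.Random()
    # First center: uniform random choice.
    centers = [rng.choice(points)]
    # d2[i] is the squared distance from points[i] to its nearest chosen center.
    d2 = [squared_distance(p, centers[0]) for p in points]
    while len(centers) < k:
        # Draw the next center with probability proportional to d2.
        # Already chosen points have weight 0 and cannot be drawn again.
        idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
        centers.append(points[idx])
        # Refresh the nearest-center distances against the new center only.
        d2 = [min(old, squared_distance(p, points[idx])) for old, p in zip(d2, points)]
    return centers


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (20.0, 0.0)]
centers = kmeans_pp_init(points, 3, random.Random(0))
print(centers)
```

After this seeding, the ordinary k-means iterations run unchanged, starting from `centers`.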
40
votes
2
answers
52k
views
Calculating the percentage of variance measure for k-means?
On the Wikipedia page, an elbow method is described for determining the number of clusters in k-means. The built-in method of scipy provides an implementation but I am not sure I understand how the ...
37
votes
2
answers
66k
views
Will pandas dataframe object work with sklearn kmeans clustering?
dataset is pandas dataframe. This is sklearn.cluster.KMeans
km = KMeans(n_clusters = n_Clusters)
km.fit(dataset)
prediction = km.predict(dataset)
This is how I decide which entity belongs to ...
36
votes
3
answers
36k
views
What makes the distance measure in k-medoid "better" than k-means?
I am reading about the difference between k-means clustering and k-medoid clustering.
Supposedly there is an advantage to using the pairwise distance measure in the k-medoid algorithm, instead of the ...
34
votes
1
answer
37k
views
Cluster one-dimensional data optimally? [closed]
Does anyone have a paper that explains how the Ckmeans.1d.dp algorithm works?
Or: what is the most optimal way to do k-means clustering in one-dimension?
32
votes
2
answers
42k
views
Scikit-learn: How to run KMeans on a one-dimensional array?
I have an array of 13.876 (13,876) values between 0 and 1. I would like to apply sklearn.cluster.KMeans to only this vector to find the different clusters in which the values are grouped. However, it ...
31
votes
5
answers
43k
views
What is the difference between "k means" and "fuzzy c means" objective functions?
I am trying to see whether the performance of the two can be compared based on the objective functions they work on.
31
votes
3
answers
44k
views
Understanding "score" returned by scikit-learn KMeans
I applied clustering on a set of text documents (about 100). I converted them to Tfidf vectors using TfidfVectorizer and supplied the vectors as input to sklearn.cluster.KMeans(n_clusters=2, init='...
30
votes
1
answer
20k
views
Online k-means clustering
Is there an online version of the k-means clustering algorithm?
By online I mean that every data point is processed in serial, one at a time as they enter the system, hence saving computing time when ...
28
votes
5
answers
95k
views
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
I have a data table ("norm") containing numeric (at least as far as I can see) normalized values of the following form:
When I execute
k <- kmeans(norm,center=3)
I receive the following ...
27
votes
2
answers
54k
views
What is the time complexity of k-means?
I was going through the k-means Wikipedia page. Based on the algorithm, I think the complexity is O(n*k*i) (n = total elements, k = number of clusters, i = number of iterations).
So can someone explain me this ...
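A sketch of where an n*k*i count comes from, assuming plain Lloyd iterations run for a fixed number of iterations i: every iteration compares each of the n points against each of the k centers, so the number of distance evaluations is n*k*i (each evaluation itself costs time proportional to the dimension). The function `lloyd` and the sample data are hypothetical, and the counter exists only to make the count visible.

```python
def lloyd(points, centers, iterations):
    """Run a fixed number of Lloyd iterations.

    Returns the final centers and the number of distance evaluations performed.
    """
    distance_evals = 0
    for _ in range(iterations):
        # Assignment step: n * k distance evaluations per iteration.
        groups = [[] for _ in centers]
        for p in points:
            best, best_d = 0, float("inf")
            for j, c in enumerate(centers):
                d = sum((x - y) ** 2 for x, y in zip(p, c))
                distance_evals += 1
                if d < best_d:
                    best, best_d = j, d
            groups[best].append(p)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers, distance_evals


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, evals = lloyd(points, [(0.0, 0.0), (10.0, 10.0)], iterations=3)
print(centers)  # [(0.0, 0.5), (10.0, 10.5)]
print(evals)    # 4 points * 2 centers * 3 iterations = 24
```

Note that this counts work for a given i; how many iterations are needed before convergence is a separate question that the count above does not answer.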
27
votes
2
answers
22k
views
Group n points in k clusters of equal size [duplicate]
Possible Duplicate:
K-means algorithm variation with equal cluster size
EDIT: as casperOne pointed out to me, this question is a duplicate. Anyway, here is a more generalized question that ...
26
votes
1
answer
42k
views
Clustering text documents using scikit-learn kmeans in Python
I need to implement scikit-learn's kMeans for clustering text documents. The example code works fine as it is but takes some 20newsgroups data as input. I want to use the same code for clustering a ...
26
votes
6
answers
18k
views
Fast (< n^2) clustering algorithm
I have 1 million 5-dimensional points that I need to group into k clusters with k << 1 million. In each cluster, no two points should be too far apart (e.g. they could be bounding spheres with a ...
26
votes
3
answers
36k
views
Using K-means with cosine similarity - Python
I am trying to implement the k-means algorithm in Python using cosine distance instead of Euclidean distance as the distance metric.
I understand that using a different distance function can be fatal ...
26
votes
2
answers
4k
views
Estimation of number of Clusters via gap statistics and prediction strength
I am trying to translate the R implementations of gap statistics and prediction strength http://edchedch.wordpress.com/2011/03/19/counting-clusters/ into python scripts for the estimation of number of ...
25
votes
3
answers
72k
views
kmeans scatter plot: plot different colors per cluster
I am trying to do a scatter plot of a kmeans output which clusters sentences of the same topic together. The problem I am facing is plotting the points that belong to each cluster in a certain color.
...
25
votes
2
answers
25k
views
K-Means: Lloyd,Forgy,MacQueen,Hartigan-Wong
I'm working with the K-Means Algorithm in R and I want to figure out the differences of the 4 Algorithms Lloyd,Forgy,MacQueen and Hartigan-Wong which are available for the function "kmeans" in the ...
24
votes
11
answers
124k
views
setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.
import os
import numpy as np
from scipy.signal import *
import csv
import matplotlib.pyplot as plt
from scipy import signal
from brainflow.board_shim import BoardShim, BrainFlowInputParams, LogLevels,...
24
votes
5
answers
33k
views
Changes of clustering results after each time run in Python scikit-learn
I have a bunch of sentences and I want to cluster them using scikit-learn spectral clustering. I've run the code and get the results with no problem. But, every time I run it I get different results. ...
23
votes
2
answers
16k
views
How does pytorch backprop through argmax?
I'm building Kmeans in pytorch using gradient descent on centroid locations, instead of expectation-maximisation. Loss is the sum of square distances of each point to its nearest centroid. To ...
22
votes
7
answers
30k
views
Can k-means clustering do classification?
I want to know whether the k-means clustering algorithm can do classification.
Suppose I have done a simple k-means clustering.
Assume I have a lot of data, I use k-means clustering, then get 2 clusters A,...
22
votes
6
answers
25k
views
scikit-learn: Finding the features that contribute to each KMeans cluster
Say you have 10 features you are using to create 3 clusters. Is there a way to see the level of contribution each of the features has for each of the clusters?
What I want to be able to say is that ...
21
votes
4
answers
38k
views
ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive) when using silhouette_score
I am trying to calculate silhouette score as I find the optimal number of clusters to create, but get an error that says:
ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (...
21
votes
3
answers
19k
views
How would I implement k-means with TensorFlow?
The intro tutorial, which uses the built-in gradient descent optimizer, makes a lot of sense. However, k-means isn't just something I can plug into gradient descent. It seems like I'd have to write my ...
21
votes
5
answers
38k
views
How can I perform K-means clustering on time series data?
How can I do K-means clustering of time series data?
I understand how this works when the input data is a set of points, but I don't know how to cluster a time series with 1XM, where M is the data ...
20
votes
1
answer
29k
views
How to add k-means predicted clusters in a column to a dataframe in Python
I have a question about kmeans clustering in python.
So I did the analysis that way:
from sklearn.cluster import KMeans
km = KMeans(n_clusters=12, random_state=1)
new = data._get_numeric_data()....
19
votes
3
answers
33k
views
plot a document tfidf 2D graph
I would like to plot a 2D graph with the x-axis as term and y-axis as TFIDF score (or document id) for my list of sentences. I used scikit-learn's fit_transform() to get the scipy matrix but I do not ...
19
votes
2
answers
38k
views
Clustering geo location coordinates (lat,long pairs) using KMeans algorithm with Python
Using the following code to cluster geolocation coordinates results in 3 clusters:
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans2, whiten
...
19
votes
5
answers
25k
views
How to calculate BIC for k-means clustering in R
I've been using k-means to cluster my data in R but I'd like to be able to assess the fit vs. model complexity of my clustering using Bayesian Information Criterion (BIC) and AIC. Currently the code I'...
18
votes
3
answers
36k
views
OpenCV using k-means to posterize an image
I want to posterize an image with k-means and OpenCV using the C++ interface (cv namespace) and I get weird results. I need it to reduce some noise. This is my code:
#include "cv.h"
#include "...
17
votes
2
answers
38k
views
KMeans clustering in PySpark
I have a spark dataframe 'mydataframe' with many columns. I am trying to run kmeans on only two columns: lat and long (latitude & longitude), using them as simple values. I want to extract 7 ...
17
votes
4
answers
23k
views
Can I use K-means algorithm on a string?
I am working on a python project where I study RNA structure evolution (represented as a string for example: "(((...)))" where the parenthesis represent basepairs). The point being is that I have an ...
17
votes
2
answers
21k
views
How to set k-Means clustering labels from highest to lowest with Python?
I have a dataset of 38 apartments and their electricity consumption in the morning, afternoon and evening. I am trying to cluster this dataset using the k-Means implementation from scikit-learn, ...
17
votes
2
answers
60k
views
How to identify Cluster labels in kmeans scikit learn
I am learning python scikit.
The example given here
displays the top occurring words in each Cluster and not Cluster name.
http://scikit-learn.org/stable/auto_examples/document_clustering.html
I ...
16
votes
1
answer
67k
views
How to use silhouette score in k-means clustering from sklearn library?
I'd like to use silhouette score in my script, to automatically compute number of clusters in k-means clustering from sklearn.
import numpy as np
import pandas as pd
import csv
from sklearn.cluster ...
16
votes
1
answer
36k
views
initial centroids for scikit-learn kmeans clustering
If I already have a numpy array that can serve as the initial centroids, how can I properly initialize the kmeans algorithm? I am using the scikit-learn KMeans class.
this post (k-means with selected ...
16
votes
2
answers
3k
views
How to detect multiple objects with OpenCV in C++?
I got inspiration from this answer, which is a Python implementation, but I need C++. That answer works very well. The idea is: use detectAndCompute to get keypoints, use kmeans to ...