Highest scored 'cluster-analysis' questions

463 votes

8 answers

284k views

Cluster analysis in R: determine the optimal number of clusters

How can I choose the best number of clusters for a k-means analysis? After plotting a subset of the data below, how many clusters would be appropriate? How can I perform cluster dendro analysis? n = 1000 ...

user2153893

4,667

asked Mar 13, 2013 at 2:39

234 votes

11 answers

129k views

Is it possible to specify your own distance function using scikit-learn K-Means Clustering?

bmasc

2,470

asked Apr 3, 2011 at 12:39

199 votes

20 answers

235k views

Difference between classification and clustering in data mining? [closed]

Can someone explain what the difference is between classification and clustering in data mining? If you can, please give examples of both to understand the main idea.

Kristaps

2,031

asked Feb 21, 2011 at 10:39

154 votes

20 answers

126k views

How do I determine k when using k-means clustering?

I've been studying k-means clustering, and one thing that's not clear is how you choose the value of k. Is it just a matter of trial and error, or is there more to it?

Jason Baker

195k

asked Nov 24, 2009 at 22:58

120 votes

8 answers

46k views

What is an intuitive explanation of the Expectation Maximization technique? [closed]

Expectation Maximization (EM) is a kind of probabilistic method to classify data. Please correct me if I am wrong and it is not a classifier. What is an intuitive explanation of this EM technique? ...

London guy

27.7k

asked Aug 4, 2012 at 10:56

115 votes

7 answers

92k views

1D Number Array Clustering

So let's say I have an array like this: [1,1,2,3,10,11,13,67,71] Is there a convenient way to partition the array into something like this? [[1,1,2,3],[10,11,13],[67,71]] I looked through similar ...

E.H.

3,351

asked Jul 16, 2012 at 22:25
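
For the example array in this question, one simple approach is to sort the values and start a new group whenever the gap to the previous value exceeds a threshold. The sketch below is only an illustration, not the accepted answer: the function name `split_by_gap` and the threshold `max_gap=5` are assumptions chosen so that the sample input yields the grouping shown in the question.

```python
def split_by_gap(values, max_gap):
    """Group 1D values: a new group starts when the gap to the
    previous (sorted) value is larger than max_gap."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= max_gap:
            # Close enough to the last value of the current group.
            groups[-1].append(v)
        else:
            # First value, or the gap is too large: open a new group.
            groups.append([v])
    return groups


data = [1, 1, 2, 3, 10, 11, 13, 67, 71]
print(split_by_gap(data, max_gap=5))
# [[1, 1, 2, 3], [10, 11, 13], [67, 71]]
```

The threshold has to be picked by hand here; a different `max_gap` gives a different partition, so this is a heuristic rather than an optimal 1D clustering.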

99 votes

7 answers

69k views

Unsupervised clustering with unknown number of clusters

I have a large set of vectors in 3 dimensions. I need to cluster these based on Euclidean distance such that all the vectors in any particular cluster have a Euclidean distance between each other less ...

London guy

27.7k

asked Apr 13, 2012 at 6:54

60 votes

18 answers

58k views

K-means algorithm variation with equal cluster size

I'm looking for the fastest algorithm for grouping points on a map into equally sized groups, by distance. The k-means clustering algorithm looks straightforward and promising, but does not produce ...

pixelistik

7,740

asked Mar 27, 2011 at 21:27

55 votes

3 answers

75k views

Scikit Learn - K-Means - Elbow - criterion

Today I'm trying to learn something about K-means. I have understood the algorithm and I know how it works. Now I'm looking for the right k... I found the elbow criterion as a method to detect the ...

Linda

2,395

asked Oct 5, 2013 at 12:19

55 votes

2 answers

32k views

plotting results of hierarchical clustering on top of a matrix of data

How can I plot a dendrogram right on top of a matrix of values, reordered appropriately to reflect the clustering, in Python? An example is the following figure: This is Figure 6 from: A panel of ...

user248237

asked Jun 6, 2010 at 2:50

50 votes

7 answers

76k views

How to get the samples in each cluster?

I am using the sklearn.cluster KMeans package. Once I finish the clustering, if I need to know which values were grouped together, how can I do it? Say I had 100 data points and KMeans gave me 5 clusters. ...

user77005

1,819

asked Mar 24, 2016 at 7:56

50 votes

9 answers

39k views

scikit-learn: Predicting new points with DBSCAN

I am using DBSCAN to cluster some data using Scikit-Learn (Python 2.7): from sklearn.cluster import DBSCAN dbscan = DBSCAN(random_state=0) dbscan.fit(X) However, I found that there was no built-in ...

slaw

6,727

asked Jan 7, 2015 at 15:27

49 votes

8 answers

91k views

Python k-means algorithm

I am looking for a Python implementation of the k-means algorithm with examples to cluster and cache my database of coordinates.

Eeyore

2,126

asked Oct 9, 2009 at 19:16

47 votes

5 answers

43k views

Plot dendrogram using sklearn.AgglomerativeClustering

I'm trying to build a dendrogram using the children_ attribute provided by AgglomerativeClustering, but so far I'm out of luck. I can't use scipy.cluster since agglomerative clustering provided in ...

Shukhrat Khannanov

471

asked Mar 18, 2015 at 16:07

46 votes

4 answers

38k views

kmeans: Quick-TRANSfer stage steps exceeded maximum

I am running k-means clustering in R on a dataset with 636,688 rows and 7 columns using the standard stats package: kmeans(dataset, centers = 100, nstart = 25, iter.max = 20). I get the following ...

Anna Dunietz

845

asked Jan 27, 2014 at 13:55

42 votes

3 answers

40k views

How would one use Kernel Density Estimation as a 1D clustering method in scikit learn?

I need to cluster a simple univariate data set into a preset number of clusters. Technically it would be closer to binning or sorting the data since it is only 1D, but my boss is calling it clustering,...

Alex Kinman

2,497

asked Jan 29, 2016 at 21:35

42 votes

3 answers

28k views

Grid search for hyperparameter evaluation of clustering in scikit-learn

I'm clustering a sample of about 100 records (unlabelled) and trying to use grid_search to evaluate the clustering algorithm with various hyperparameters. I'm scoring using silhouette_score which ...

Jamie Bull

13.2k

asked Jan 5, 2016 at 11:49

41 votes

3 answers

34k views

How Could One Implement the K-Means++ Algorithm?

I am having trouble fully understanding the K-Means++ algorithm. I am interested in exactly how the first k centroids are picked, namely the initialization, as the rest is like in the original K-Means ...

Anton Andreev

2,092

asked Mar 28, 2011 at 23:45
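
As a rough illustration of the initialization step this question asks about: k-means++ picks the first centroid uniformly at random, then picks each further centroid from the data points with probability proportional to the squared distance to the nearest centroid chosen so far. The sketch below uses only the Python standard library; the names `kmeans_pp_init` and `squared_distance` and the `seed` parameter are hypothetical and do not come from the question or from any library.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_init(points, k, seed=None):
    """Return k initial centroids chosen by k-means++ seeding."""
    rng = random.Random(seed)
    # Step 1: the first centroid is a uniformly random data point.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Step 2: weight each point by its squared distance to the
        # nearest centroid picked so far (D(x)^2 weighting).
        weights = [min(squared_distance(p, c) for c in centers)
                   for p in points]
        if sum(weights) == 0:
            # Every point coincides with a centroid: fall back to uniform.
            centers.append(rng.choice(points))
        else:
            # Step 3: sample the next centroid proportionally to the weights.
            centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(kmeans_pp_init(points, k=2, seed=0))
```

After seeding, the usual k-means iterations (assign points, recompute means) run unchanged; only the choice of starting centroids differs.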

41 votes

6 answers

74k views

Choosing eps and minpts for DBSCAN (R)?

I've been searching for an answer for this question for quite a while, so I'm hoping someone can help me. I'm using dbscan from the fpc library in R. For example, I am looking at the USArrests data ...

Belinda Chiera

447

asked Oct 15, 2012 at 10:12

40 votes

2 answers

52k views

Calculating the percentage of variance measure for k-means?

On the Wikipedia page, an elbow method is described for determining the number of clusters in k-means. The built-in method of scipy provides an implementation but I am not sure I understand how the ...

Legend

115k

asked Jul 11, 2011 at 4:55

38 votes

5 answers

28k views

scikit-learn DBSCAN memory usage

UPDATED: In the end, the solution I opted to use for clustering my large dataset was one suggested by Anony-Mousse below. That is, using ELKI's DBSCAN implementation to do my clustering rather than ...

JamesT

417

asked May 5, 2013 at 5:04

37 votes

2 answers

66k views

Will pandas dataframe object work with sklearn kmeans clustering?

dataset is pandas dataframe. This is sklearn.cluster.KMeans km = KMeans(n_clusters = n_Clusters) km.fit(dataset) prediction = km.predict(dataset) This is how I decide which entity belongs to ...

Dark Knight

869

asked Jan 19, 2015 at 2:17

37 votes

4 answers

30k views

Text clustering with Levenshtein distances

I have a set (2k - 4k) of small strings (3-6 characters) and I want to cluster them. Since I use strings, previous answers on How does clustering (especially String clustering) work?, informed me that ...

Alexandros

2,180

asked Feb 2, 2014 at 14:38

37 votes

5 answers

38k views

sklearn agglomerative clustering linkage matrix

I'm trying to draw a complete-link scipy.cluster.hierarchy.dendrogram, and I found that scipy.cluster.hierarchy.linkage is slower than sklearn.AgglomerativeClustering. However, sklearn....

Presian Abarov

373

asked Nov 10, 2014 at 19:33

36 votes

4 answers

33k views

How does clustering (especially String clustering) work?

I heard about clustering to group similar data. I want to know how it works in the specific case of strings. I have a table with more than 100,000 different words. I want to identify the same word ...

Renato Dinhani

35.8k

asked Nov 19, 2011 at 18:48

36 votes

3 answers

36k views

What makes the distance measure in k-medoid "better" than k-means?

I am reading about the difference between k-means clustering and k-medoid clustering. Supposedly there is an advantage to using the pairwise distance measure in the k-medoid algorithm, instead of the ...

tumultous_rooster

12.3k

asked Feb 7, 2014 at 5:08

36 votes

2 answers

28k views

Extracting clusters from seaborn clustermap

I am using the seaborn clustermap to create clusters and visually it works great (this example produces very similar results). However I am having trouble figuring out how to programmatically extract ...

sedavidw

11.4k

asked Jan 13, 2015 at 14:48

35 votes

6 answers

36k views

How to group latitude/longitude points that are 'close' to each other?

I have a database of user-submitted latitude/longitude points and am trying to group 'close' points together. 'Close' is relative, but for now it seems to be ~500 feet. At first it seemed I could just ...

Tim Lytle

17.5k

asked Dec 3, 2010 at 19:28

34 votes

17 answers

6k views

Clustering Algorithm for Paper Boys

I need help selecting or creating a clustering algorithm according to certain criteria. Imagine you are managing newspaper delivery persons. You have a set of street addresses, each of which is ...

carrier

32.6k

asked Feb 18, 2009 at 21:25

34 votes

3 answers

32k views

Spectral Clustering a graph in python

I'd like to cluster a graph in python using spectral clustering. Spectral clustering is a more general technique which can be applied not only to graphs, but also images, or any sort of data, ...

Alex Lenail

13.7k

asked Sep 16, 2017 at 21:38

34 votes

1 answer

37k views

Cluster one-dimensional data optimally? [closed]

Does anyone have a paper that explains how the Ckmeans.1d.dp algorithm works? Or: what is the most optimal way to do k-means clustering in one-dimension?

Laciel

367

asked Oct 23, 2011 at 22:12

34 votes

2 answers

24k views

Reordering matrix elements to reflect column and row clustering in naive python [duplicate]

I'm looking for a way to perform clustering separately on matrix rows and then on its columns, reorder the data in the matrix to reflect the clustering, and put it all together. The clustering ...

Boris Gorelik

30.8k

asked Mar 16, 2010 at 15:39

33 votes

5 answers

58k views

DBSCAN for clustering of geographic location data

I have a dataframe with latitude and longitude pairs. Here is what my dataframe looks like. order_lat order_long 0 19.111841 72.910729 1 19.111342 72.908387 2 19.111342 72.908387 3 19....

Neil

8,057

asked Jan 3, 2016 at 17:09

33 votes

5 answers

20k views

Scikit Learn GridSearchCV without cross validation (unsupervised learning)

Is it possible to use GridSearchCV without cross validation? I am trying to optimize the number of clusters in KMeans clustering via grid search, and thus I don't need or want cross validation. The ...

DataMan

3,295

asked Jun 19, 2017 at 17:15

33 votes

6 answers

19k views

Which machine learning library to use [closed]

I am looking for a library that, ideally, has the following features: implements hierarchical clustering of multidimensional data (ideally on a similarity or distance matrix); implements support vector ...

Björn Pollex

76k

asked May 26, 2010 at 17:32

33 votes

4 answers

32k views

Clustering Algorithm for Mapping Application

I'm looking into clustering points on a map (latitude/longitude). Are there any recommendations as to a suitable algorithm that is fast and scalable? More specifically, I have a series of latitude/...

Codebeef

43.7k

asked Sep 16, 2008 at 15:59

32 votes

7 answers

24k views

Python Implementation of OPTICS (Clustering) Algorithm

I'm looking for a decent implementation of the OPTICS algorithm in Python. I will use it to form density-based clusters of points ((x,y) pairs). I'm looking for something that takes in (x,y) pairs ...

Murat Derya Özen

2,154

asked Apr 1, 2011 at 15:43

31 votes

5 answers

43k views

What is the difference between "k means" and "fuzzy c means" objective functions?

I am trying to see if the performance of both can be compared based on the objective functions they work on?

n0ob

1,275

asked Feb 27, 2010 at 1:37

31 votes

14 answers

13k views

How can I find the center of a cluster of data points?

Let's say I plotted the position of a helicopter every day for the past year and came up with the following map: Any human looking at this would be able to tell me that this helicopter is based out ...

Ryan

15k

asked Jun 14, 2013 at 16:03

31 votes

2 answers

28k views

python scikit-learn clustering with missing data

I want to cluster data with missing columns. Doing it manually I would calculate the distance in case of a missing column simply without this column. With scikit-learn, missing data is not possible. ...

Michael Hecht

2,153

asked Feb 24, 2016 at 19:39

30 votes

1 answer

20k views

Online k-means clustering

Is there an online version of the k-Means clustering algorithm? By online I mean that every data point is processed in serial, one at a time as they enter the system, hence saving computing time when ...

Theodor

5,606

asked Sep 13, 2010 at 7:33

30 votes

1 answer

49k views

differences in heatmap/clustering defaults in R (heatplot versus heatmap.2)?

I'm comparing two ways of creating heatmaps with dendrograms in R, one with made4's heatplot and one with gplots of heatmap.2. The appropriate results depend on the analysis but I'm trying to ...

user248237

asked Jul 29, 2013 at 13:02

28 votes

5 answers

95k views

Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)

I have a data table ("norm") containing numeric (at least as far as I can see) normalized values of the following form: When I execute k <- kmeans(norm,center=3) I receive the following ...

Jonathan Rhein

1,685

asked Apr 7, 2016 at 7:40

28 votes

1 answer

16k views

How to compute cluster assignments from linkage/distance matrices

If you have this hierarchical clustering call in scipy in Python: from scipy.cluster.hierarchy import linkage # dist_matrix is long form distance matrix linkage_matrix = linkage(squareform(...

user248237

asked Apr 11, 2013 at 14:41

27 votes

3 answers

38k views

Clustering values by their proximity in python (machine learning?) [duplicate]

I have an algorithm that is running on a set of objects. This algorithm produces a score value that dictates the differences between the elements in the set. The sorted output is something like this: ...

PCoelho

7,920

asked Aug 21, 2013 at 17:31

27 votes

2 answers

22k views

Group n points in k clusters of equal size [duplicate]

Possible Duplicate: K-means algorithm variation with equal cluster size EDIT: As casperOne pointed out to me, this question is a duplicate. Anyway, here is a more generalized question that ...

Pierre-David Belanger

1,024

asked Jan 9, 2012 at 23:30

27 votes

1 answer

2k views

Clustering (fkmeans) with Mahout using Clojure

I am trying to write a short script to cluster my data via clojure (calling Mahout classes though). I have my input data in this format (which is an output from a php script) format: (tag) (image) (...

Jeffrey04

6,238

asked Aug 25, 2011 at 7:36

26 votes

1 answer

42k views

Clustering text documents using scikit-learn kmeans in Python

I need to implement scikit-learn's kMeans for clustering text documents. The example code works fine as it is but takes some 20newsgroups data as input. I want to use the same code for clustering a ...

Nabila Shahid

419

asked Jan 11, 2015 at 17:20

26 votes

3 answers

26k views

Understanding concept of Gaussian Mixture Models

I'm trying to understand GMM by reading the sources available online. I have achieved clustering using K-Means and was seeing how GMM would compare to K-means. Here is what I have understood, please ...

StuckInPhDNoMore

2,589

asked Sep 24, 2014 at 14:33

26 votes

6 answers

18k views

Fast (< n^2) clustering algorithm

I have 1 million 5-dimensional points that I need to group into k clusters with k << 1 million. In each cluster, no two points should be too far apart (e.g. they could be bounding spheres with a ...

John Hawksley

261

asked Dec 9, 2010 at 23:11
