Questions tagged [cluster-analysis]
Cluster analysis is the process of grouping "similar" objects into groups known as "clusters", along with the analysis of these results.
6,238
questions
463
votes
8
answers
284k
views
Cluster analysis in R: determine the optimal number of clusters
How can I choose the best number of clusters for a k-means analysis? After plotting a subset of the data below, how many clusters will be appropriate? How can I perform cluster dendro analysis?
n = 1000
...
234
votes
11
answers
129k
views
Is it possible to specify your own distance function using scikit-learn K-Means Clustering?
Is it possible to specify your own distance function using scikit-learn K-Means Clustering?
199
votes
20
answers
235k
views
Difference between classification and clustering in data mining? [closed]
Can someone explain what the difference is between classification and clustering in data mining?
If you can, please give examples of both to understand the main idea.
154
votes
20
answers
126k
views
How do I determine k when using k-means clustering?
I've been studying about k-means clustering, and one thing that's not clear is how you choose the value of k. Is it just a matter of trial and error, or is there more to it?
120
votes
8
answers
46k
views
What is an intuitive explanation of the Expectation Maximization technique? [closed]
Expectation Maximization (EM) is a kind of probabilistic method to classify data. Please correct me if I am wrong and it is not a classifier.
What is an intuitive explanation of this EM technique? ...
115
votes
7
answers
92k
views
1D Number Array Clustering
So let's say I have an array like this:
[1,1,2,3,10,11,13,67,71]
Is there a convenient way to partition the array into something like this?
[[1,1,2,3],[10,11,13],[67,71]]
I looked through similar ...
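One simple way to get the partition shown above is to sort the values and start a new group wherever the gap to the previous value is larger than some threshold. The sketch below assumes that approach; the function name `split_by_gaps` and the `max_gap` parameter are hypothetical, and the threshold has to be chosen for the data at hand.

```python
def split_by_gaps(values, max_gap):
    """Group 1D values, starting a new group when a gap exceeds max_gap."""
    ordered = sorted(values)
    if not ordered:
        return []
    groups = [[ordered[0]]]
    # Walk consecutive pairs; a large jump closes the current group.
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > max_gap:
            groups.append([cur])
        else:
            groups[-1].append(cur)
    return groups


data = [1, 1, 2, 3, 10, 11, 13, 67, 71]
print(split_by_gaps(data, max_gap=5))
# [[1, 1, 2, 3], [10, 11, 13], [67, 71]]
```

This is a threshold heuristic rather than k-means; it does not let you fix the number of groups in advance.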
99
votes
7
answers
69k
views
Unsupervised clustering with unknown number of clusters
I have a large set of vectors in 3 dimensions. I need to cluster these based on Euclidean distance such that all the vectors in any particular cluster have a Euclidean distance between each other less ...
60
votes
18
answers
58k
views
K-means algorithm variation with equal cluster size
I'm looking for the fastest algorithm for grouping points on a map into equally sized groups, by distance. The k-means clustering algorithm looks straightforward and promising, but does not produce ...
55
votes
3
answers
75k
views
Scikit Learn - K-Means - Elbow - criterion
Today I'm trying to learn something about K-means. I have understood the algorithm and I know how it works. Now I'm looking for the right k... I found the elbow criterion as a method to detect the ...
55
votes
2
answers
32k
views
plotting results of hierarchical clustering on top of a matrix of data
How can I plot a dendrogram right on top of a matrix of values, reordered appropriately to reflect the clustering, in Python? An example is the following figure:
This is Figure 6 from: A panel of ...
50
votes
7
answers
76k
views
How to get the samples in each cluster?
I am using the sklearn.cluster KMeans package. Once I finish the clustering, how can I find out which values were grouped together?
Say I had 100 data points and KMeans gave me 5 cluster....
50
votes
9
answers
39k
views
scikit-learn: Predicting new points with DBSCAN
I am using DBSCAN to cluster some data using Scikit-Learn (Python 2.7):
from sklearn.cluster import DBSCAN
dbscan = DBSCAN(random_state=0)
dbscan.fit(X)
However, I found that there was no built-in ...
49
votes
8
answers
91k
views
Python k-means algorithm
I am looking for a Python implementation of the k-means algorithm with examples to cluster and cache my database of coordinates.
47
votes
5
answers
43k
views
Plot dendrogram using sklearn.AgglomerativeClustering
I'm trying to build a dendrogram using the children_ attribute provided by AgglomerativeClustering, but so far I'm out of luck. I can't use scipy.cluster since agglomerative clustering provided in ...
46
votes
4
answers
38k
views
kmeans: Quick-TRANSfer stage steps exceeded maximum
I am running k-means clustering in R on a dataset with 636,688 rows and 7 columns using the standard stats package: kmeans(dataset, centers = 100, nstart = 25, iter.max = 20).
I get the following ...
42
votes
3
answers
40k
views
How would one use Kernel Density Estimation as a 1D clustering method in scikit learn?
I need to cluster a simple univariate data set into a preset number of clusters. Technically it would be closer to binning or sorting the data since it is only 1D, but my boss is calling it clustering,...
42
votes
3
answers
28k
views
Grid search for hyperparameter evaluation of clustering in scikit-learn
I'm clustering a sample of about 100 records (unlabelled) and trying to use grid_search to evaluate the clustering algorithm with various hyperparameters. I'm scoring using silhouette_score which ...
41
votes
3
answers
34k
views
How Could One Implement the K-Means++ Algorithm?
I am having trouble fully understanding the K-Means++ algorithm. I am interested exactly how the first k centroids are picked, namely the initialization as the rest is like in the original K-Means ...
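As a rough illustration of the initialization step this question asks about: k-means++ picks the first centroid uniformly at random, then picks each further centroid with probability proportional to the squared distance from each point to its nearest already-chosen centroid. The sketch below assumes points are tuples of numbers; the function name `kmeans_pp_init` and the `rng` parameter are hypothetical, not part of any library.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_init(points, k, rng=None):
    """Pick k initial centroids from points using k-means++ seeding."""
    rng = rng or random.Random()
    # First centroid: uniform over all points.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Weight of each point = squared distance to its nearest centroid.
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        if sum(weights) == 0:
            # Every point coincides with a centroid; fall back to uniform.
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


points = [(0.0, 0.0), (0.0, 0.1), (10.0, 10.0), (10.0, 10.1)]
print(kmeans_pp_init(points, k=2, rng=random.Random(0)))
```

After seeding, the usual k-means assignment and update steps run unchanged.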
41
votes
6
answers
74k
views
Choosing eps and minpts for DBSCAN (R)?
I've been searching for an answer for this question for quite a while, so I'm hoping someone can help me. I'm using dbscan from the fpc library in R. For example, I am looking at the USArrests data ...
40
votes
2
answers
52k
views
Calculating the percentage of variance measure for k-means?
On the Wikipedia page, an elbow method is described for determining the number of clusters in k-means. The built-in method of scipy provides an implementation but I am not sure I understand how the ...
38
votes
5
answers
28k
views
scikit-learn DBSCAN memory usage
UPDATED: In the end, the solution I opted to use for clustering my large dataset was one suggested by Anony-Mousse below. That is, using ELKI's DBSCAN implementation to do my clustering rather than ...
37
votes
2
answers
66k
views
Will pandas dataframe object work with sklearn kmeans clustering?
dataset is a pandas DataFrame. This is sklearn.cluster.KMeans
km = KMeans(n_clusters = n_Clusters)
km.fit(dataset)
prediction = km.predict(dataset)
This is how I decide which entity belongs to ...
37
votes
4
answers
30k
views
Text clustering with Levenshtein distances
I have a set (2k - 4k) of small strings (3-6 characters) and I want to cluster them. Since I use strings, previous answers on How does clustering (especially String clustering) work?, informed me that ...
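For short strings like these, a pairwise Levenshtein distance matrix is a common starting point, since it can be handed to any clustering method that accepts precomputed distances. The sketch below is a plain dynamic-programming implementation; the function names and the sample words are illustrative assumptions, not taken from the question.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def distance_matrix(words):
    """Symmetric matrix of pairwise edit distances."""
    return [[levenshtein(u, v) for v in words] for u in words]


words = ["abc", "abd", "xyz"]  # hypothetical sample strings
for row in distance_matrix(words):
    print(row)
```

With a few thousand strings of 3 to 6 characters, the full matrix has a few million entries, which is usually still manageable in memory.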
37
votes
5
answers
38k
views
sklearn agglomerative clustering linkage matrix
I'm trying to draw a complete-link scipy.cluster.hierarchy.dendrogram, and I found that scipy.cluster.hierarchy.linkage is slower than sklearn.AgglomerativeClustering.
However, sklearn....
36
votes
4
answers
33k
views
How does clustering (especially String clustering) work?
I heard about clustering to group similar data. I want to know how it works in the specific case for String.
I have a table with more than 100,000 different words.
I want to identify the same word ...
36
votes
3
answers
36k
views
What makes the distance measure in k-medoid "better" than k-means?
I am reading about the difference between k-means clustering and k-medoid clustering.
Supposedly there is an advantage to using the pairwise distance measure in the k-medoid algorithm, instead of the ...
36
votes
2
answers
28k
views
Extracting clusters from seaborn clustermap
I am using the seaborn clustermap to create clusters and visually it works great (this example produces very similar results).
However I am having trouble figuring out how to programmatically extract ...
35
votes
6
answers
36k
views
How to group latitude/longitude points that are 'close' to each other?
I have a database of user-submitted latitude/longitude points and am trying to group 'close' points together. 'Close' is relative, but for now it seems to be ~500 feet.
At first it seemed I could just ...
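Whatever grouping method is used, it needs a distance between two latitude/longitude points. A minimal sketch using the haversine formula follows; the Earth radius constant, the helper names, and the sample coordinates are assumptions for illustration, and the 500-foot default simply mirrors the figure mentioned above.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres (assumed constant)
FEET_PER_METRE = 3.28084


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def is_close(p, q, max_feet=500.0):
    """True if two (lat, lon) points are within max_feet of each other."""
    return haversine_m(p[0], p[1], q[0], q[1]) * FEET_PER_METRE <= max_feet


print(is_close((40.0, -75.0), (40.001, -75.0)))  # about 365 feet apart
print(is_close((40.0, -75.0), (40.01, -75.0)))   # about 3,650 feet apart
```

A predicate like `is_close` can then drive a density-based or connected-components grouping.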
34
votes
17
answers
6k
views
Clustering Algorithm for Paper Boys
I need help selecting or creating a clustering algorithm according to certain criteria.
Imagine you are managing newspaper delivery persons.
You have a set of street addresses, each of which is ...
34
votes
3
answers
32k
views
Spectral Clustering a graph in python
I'd like to cluster a graph in python using spectral clustering.
Spectral clustering is a more general technique which can be applied not only to graphs, but also images, or any sort of data, ...
34
votes
1
answer
37k
views
Cluster one-dimensional data optimally? [closed]
Does anyone have a paper that explains how the Ckmeans.1d.dp algorithm works?
Or: what is the most optimal way to do k-means clustering in one-dimension?
34
votes
2
answers
24k
views
Reordering matrix elements to reflect column and row clustering in naive python [duplicate]
I'm looking for a way to perform clustering separately on matrix rows and then on its columns, reorder the data in the matrix to reflect the clustering, and put it all together. The clustering ...
33
votes
5
answers
58k
views
DBSCAN for clustering of geographic location data
I have a dataframe with latitude and longitude pairs.
Here is what my dataframe looks like.
order_lat order_long
0 19.111841 72.910729
1 19.111342 72.908387
2 19.111342 72.908387
3 19....
33
votes
5
answers
20k
views
Scikit Learn GridSearchCV without cross validation (unsupervised learning)
Is it possible to use GridSearchCV without cross validation? I am trying to optimize the number of clusters in KMeans clustering via grid search, and thus I don't need or want cross validation.
The ...
33
votes
6
answers
19k
views
Which machine learning library to use [closed]
I am looking for a library that, ideally, has the following features:
implements hierarchical clustering of multidimensional data (ideally on similarity or distance matrix)
implements support vector ...
33
votes
4
answers
32k
views
Clustering Algorithm for Mapping Application
I'm looking into clustering points on a map (latitude/longitude). Are there any recommendations as to a suitable algorithm that is fast and scalable?
More specifically, I have a series of latitude/...
32
votes
7
answers
24k
views
Python Implementation of OPTICS (Clustering) Algorithm
I'm looking for a decent implementation of the OPTICS algorithm in Python. I will use it to form density-based clusters of points ((x,y) pairs).
I'm looking for something that takes in (x,y) pairs ...
31
votes
5
answers
43k
views
What is the difference between "k means" and "fuzzy c means" objective functions?
I am trying to see if the performance of both can be compared based on the objective functions they work on?
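For reference, the two objectives are usually written as below. The notation is an assumption made here: $x_i$ are the $n$ data points, $c_j$ the $k$ cluster centres, $r_{ij}$ hard assignments, $u_{ij}$ fuzzy memberships, and $m > 1$ the fuzzifier.

```latex
% k-means: each point belongs to exactly one cluster
J_{\mathrm{KM}} = \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij}\, \lVert x_i - c_j \rVert^2,
\qquad r_{ij} \in \{0, 1\}, \quad \sum_{j=1}^{k} r_{ij} = 1

% fuzzy c-means: each point has a graded membership in every cluster
J_{\mathrm{FCM}} = \sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij}^{\,m}\, \lVert x_i - c_j \rVert^2,
\qquad u_{ij} \in [0, 1], \quad \sum_{j=1}^{k} u_{ij} = 1, \quad m > 1
```

The only structural differences are the relaxation of the memberships from $\{0,1\}$ to $[0,1]$ and the exponent $m$; as $m \to 1$ the fuzzy memberships approach hard assignments.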
31
votes
14
answers
13k
views
How can I find the center of a cluster of data points?
Let's say I plotted the position of a helicopter every day for the past year and came up with the following map:
Any human looking at this would be able to tell me that this helicopter is based out ...
31
votes
2
answers
28k
views
python scikit-learn clustering with missing data
I want to cluster data with missing columns. Doing it manually I would calculate the distance in case of a missing column simply without this column.
With scikit-learn, missing data is not possible. ...
30
votes
1
answer
20k
views
Online k-means clustering
Is there an online version of the k-Means clustering algorithm?
By online I mean that every data point is processed in serial, one at a time as they enter the system, hence saving computing time when ...
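One commonly described online variant is sequential k-means: each incoming point is assigned to its nearest centroid, and that centroid is moved toward the point by a step of 1/count, which keeps it equal to the running mean of the points assigned so far. The sketch below assumes centroids are already initialized (for example from the first k points, each with a count of 1); the function name is hypothetical.

```python
def online_kmeans_update(centroids, counts, x):
    """Assign x to its nearest centroid and update that centroid in place.

    centroids: list of lists of floats; counts: points seen per centroid.
    Returns the index of the centroid that absorbed x.
    """
    # Nearest centroid by squared Euclidean distance.
    j = min(range(len(centroids)),
            key=lambda i: sum((c - a) ** 2 for c, a in zip(centroids[i], x)))
    counts[j] += 1
    step = 1.0 / counts[j]
    # Running-mean update: move the centroid a fraction of the way toward x.
    centroids[j] = [c + step * (a - c) for c, a in zip(centroids[j], x)]
    return j


centroids = [[0.0, 0.0], [10.0, 10.0]]
counts = [1, 1]
for point in [[2.0, 0.0], [9.0, 11.0], [1.0, 1.0]]:
    online_kmeans_update(centroids, counts, point)
print(centroids, counts)
```

Each update costs O(k·d) time and no past points need to be stored, at the price of results that depend on the arrival order.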
30
votes
1
answer
49k
views
differences in heatmap/clustering defaults in R (heatplot versus heatmap.2)?
I'm comparing two ways of creating heatmaps with dendrograms in R, one with made4's heatplot and one with gplots's heatmap.2. The appropriate results depend on the analysis but I'm trying to ...
28
votes
5
answers
95k
views
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
I have a data table ("norm") containing numeric (at least as far as I can see) normalized values of the following form:
When I am executing
k <- kmeans(norm,center=3)
I am receiving the following ...
28
votes
1
answer
16k
views
How to compute cluster assignments from linkage/distance matrices
If you have this hierarchical clustering call in scipy in Python:
from scipy.cluster.hierarchy import linkage
# dist_matrix is long form distance matrix
linkage_matrix = linkage(squareform(...
27
votes
3
answers
38k
views
Clustering values by their proximity in python (machine learning?) [duplicate]
I have an algorithm that is running on a set of objects. This algorithm produces a score value that dictates the differences between the elements in the set.
The sorted output is something like this:
...
27
votes
2
answers
22k
views
Group n points in k clusters of equal size [duplicate]
Possible Duplicate:
K-means algorithm variation with equal cluster size
EDIT: as casperOne pointed out to me, this question is a duplicate. Anyway, here is a more generalized question that ...
27
votes
1
answer
2k
views
Clustering (fkmeans) with Mahout using Clojure
I am trying to write a short script to cluster my data via Clojure (calling Mahout classes though). I have my input data in this format (which is an output from a PHP script)
format: (tag) (image) (...
26
votes
1
answer
42k
views
Clustering text documents using scikit-learn kmeans in Python
I need to implement scikit-learn's kMeans for clustering text documents. The example code works fine as it is but takes some 20newsgroups data as input. I want to use the same code for clustering a ...
26
votes
3
answers
26k
views
Understanding concept of Gaussian Mixture Models
I'm trying to understand GMM by reading the sources available online. I have achieved clustering using K-Means and was seeing how GMM would compare to K-means.
Here is what I have understood, please ...
26
votes
6
answers
18k
views
Fast (< n^2) clustering algorithm
I have 1 million 5-dimensional points that I need to group into k clusters with k << 1 million. In each cluster, no two points should be too far apart (e.g. they could be bounding spheres with a ...