Questions tagged [cluster-analysis]
Cluster analysis is the process of grouping "similar" objects into groups known as "clusters", along with the analysis of these results.
                                	
    
                            
                        
                    
    6,238 questions
        
        
    463 votes | 8 answers | 284k views
        
    Cluster analysis in R: determine the optimal number of clusters
                How can I choose the best number of clusters for a k-means analysis? After plotting a subset of the data below, how many clusters would be appropriate? How can I perform a cluster dendrogram analysis?
n = 1000
...
            
        
       
    
    234 votes | 11 answers | 129k views
        
    Is it possible to specify your own distance function using scikit-learn K-Means Clustering?
                Is it possible to specify your own distance function using scikit-learn K-Means Clustering?
            
        
       
    
    199 votes | 20 answers | 235k views
        
    Difference between classification and clustering in data mining? [closed]
                Can someone explain what the difference is between classification and clustering in data mining?
If you can, please give examples of both to understand the main idea.
            
        
       
    
    154 votes | 20 answers | 126k views
        
    How do I determine k when using k-means clustering?
                I've been studying k-means clustering, and one thing that's not clear is how you choose the value of k. Is it just a matter of trial and error, or is there more to it?
            
        
       
    
    120 votes | 8 answers | 46k views
        
    What is an intuitive explanation of the Expectation Maximization technique? [closed]
                Expectation Maximization (EM) is a kind of probabilistic method to classify data. Please correct me if I am wrong and it is not a classifier.
What is an intuitive explanation of this EM technique? ...
            
        
       
    
    115 votes | 7 answers | 92k views
        
    1D Number Array Clustering
                So let's say I have an array like this:
[1,1,2,3,10,11,13,67,71]
Is there a convenient way to partition the array into something like this?
[[1,1,2,3],[10,11,13],[67,71]]
I looked through similar ...
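One possible approach to the 1D case above, as a minimal sketch rather than a definitive answer: sort the values and start a new group whenever the gap to the previous value exceeds a threshold. The function name `split_by_gap` and the threshold `max_gap=5` are assumptions made for illustration; the question does not specify either.

```python
def split_by_gap(values, max_gap):
    """Group 1D values; a gap larger than max_gap starts a new group."""
    groups = []
    for v in sorted(values):
        # Extend the current group if v is close enough to its last element.
        if groups and v - groups[-1][-1] <= max_gap:
            groups[-1].append(v)
        else:
            groups.append([v])
    return groups


data = [1, 1, 2, 3, 10, 11, 13, 67, 71]
# max_gap=5 is a hypothetical threshold, not taken from the question.
print(split_by_gap(data, max_gap=5))  # [[1, 1, 2, 3], [10, 11, 13], [67, 71]]
```

The threshold has to be chosen for the data at hand; a different value would merge or split the groups differently.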
            
        
       
    
    99 votes | 7 answers | 69k views
        
    Unsupervised clustering with unknown number of clusters
                I have a large set of vectors in 3 dimensions. I need to cluster these based on Euclidean distance such that all the vectors in any particular cluster have a Euclidean distance between each other less ...
            
        
       
    
    60 votes | 18 answers | 58k views
        
    K-means algorithm variation with equal cluster size
                I'm looking for the fastest algorithm for grouping points on a map into equally sized groups, by distance. The k-means clustering algorithm looks straightforward and promising, but does not produce ...
            
        
       
    
    55 votes | 3 answers | 75k views
        
    Scikit Learn - K-Means - Elbow - criterion
                Today I'm trying to learn something about K-means. I understand the algorithm and know how it works. Now I'm looking for the right k... I found the elbow criterion as a method to detect the ...
            
        
       
    
    55 votes | 2 answers | 32k views
        
    plotting results of hierarchical clustering on top of a matrix of data
                How can I plot a dendrogram right on top of a matrix of values, reordered appropriately to reflect the clustering, in Python?  An example is the following figure:
This is Figure 6 from: A panel of ...
            
        
       
    
    50 votes | 7 answers | 76k views
        
    How to get the samples in each cluster?
                I am using the sklearn.cluster KMeans package. Once I finish the clustering, if I need to know which values were grouped together, how can I do it?
Say I had 100 data points and KMeans gave me 5 clusters. ...
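As a hedged sketch of the grouping step: a fitted scikit-learn `KMeans` exposes one cluster label per sample in its `labels_` attribute, so the samples in each cluster can be collected by grouping sample indices on that label. The `labels` list below is made-up stand-in data, and `group_by_label` is a hypothetical helper, not part of scikit-learn.

```python
from collections import defaultdict


def group_by_label(labels):
    """Map each cluster label to the list of sample indices carrying it."""
    groups = defaultdict(list)
    for index, label in enumerate(labels):
        groups[label].append(index)
    return dict(groups)


# Stand-in for km.labels_ after fitting; these values are invented.
labels = [0, 2, 1, 0, 2, 2]
print(group_by_label(labels))  # {0: [0, 3], 2: [1, 4, 5], 1: [2]}
```

The returned indices can then be used to look up the original rows of the data.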
            
        
       
    
    50 votes | 9 answers | 39k views
        
    scikit-learn: Predicting new points with DBSCAN
                I am using DBSCAN to cluster some data using Scikit-Learn (Python 2.7):
from sklearn.cluster import DBSCAN
dbscan = DBSCAN(random_state=0)
dbscan.fit(X)
However, I found that there was no built-in ...
            
        
       
    
    49 votes | 8 answers | 91k views
        
    Python k-means algorithm
                I am looking for a Python implementation of the k-means algorithm with examples to cluster and cache my database of coordinates.
            
        
       
    
    47 votes | 5 answers | 43k views
        
    Plot dendrogram using sklearn.AgglomerativeClustering
                I'm trying to build a dendrogram using the children_ attribute provided by AgglomerativeClustering, but so far I'm out of luck. I can't use scipy.cluster since agglomerative clustering provided in ...
            
        
       
    
    46 votes | 4 answers | 38k views
        
    kmeans: Quick-TRANSfer stage steps exceeded maximum
                I am running k-means clustering in R on a dataset with 636,688 rows and 7 columns using the standard stats package: kmeans(dataset, centers = 100, nstart = 25, iter.max = 20). 
I get the following ...
            
        
       
    
    42 votes | 3 answers | 40k views
        
    How would one use Kernel Density Estimation as a 1D clustering method in scikit learn?
                I need to cluster a simple univariate data set into a preset number of clusters. Technically it would be closer to binning or sorting the data since it is only 1D, but my boss is calling it clustering,...
            
        
       
    
    42 votes | 3 answers | 28k views
        
    Grid search for hyperparameter evaluation of clustering in scikit-learn
                I'm clustering a sample of about 100 records (unlabelled) and trying to use grid_search to evaluate the clustering algorithm with various hyperparameters. I'm scoring using silhouette_score which ...
            
        
       
    
    41 votes | 3 answers | 34k views
        
    How Could One Implement the K-Means++ Algorithm?
                I am having trouble fully understanding the K-Means++ algorithm.  I am interested exactly how the first k centroids are picked, namely the initialization as the rest is like in the original K-Means ...
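To make the initialization step concrete, here is a minimal sketch of k-means++ seeding as it is usually described: the first center is drawn uniformly at random, and each later center is drawn with probability proportional to the squared distance from a point to its nearest already-chosen center. The names `kmeans_pp_init` and `squared_distance`, the `seed` parameter, and the sample points are all assumptions for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_init(points, k, seed=None):
    """Pick k initial centers from points using k-means++ style seeding."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]  # first center: uniform at random
    while len(centers) < k:
        # Weight each point by its squared distance to the nearest chosen center.
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        if sum(weights) == 0:
            # All points coincide with existing centers; fall back to uniform.
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


# Invented sample data for the sketch.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (9.0, 0.1)]
print(kmeans_pp_init(points, k=3, seed=0))
```

After seeding, the usual k-means iterations (assign points, recompute centers) proceed unchanged.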
            
        
       
    
    41 votes | 6 answers | 74k views
        
    Choosing eps and minpts for DBSCAN (R)?
                I've been searching for an answer for this question for quite a while, so I'm hoping someone can help me.  I'm using dbscan from the fpc library in R.  For example, I am looking at the USArrests data ...
            
        
       
    
    40 votes | 2 answers | 52k views
        
    Calculating the percentage of variance measure for k-means?
                On the Wikipedia page, an elbow method is described for determining the number of clusters in k-means. The built-in method of scipy provides an implementation but I am not sure I understand how the ...
            
        
       
    
    38 votes | 5 answers | 28k views
        
    scikit-learn DBSCAN memory usage
                UPDATED: In the end, the solution I opted to use for clustering my large dataset was one suggested by Anony-Mousse below. That is, using ELKI's DBSCAN implementation to do my clustering rather than ...
            
        
       
    
    37 votes | 2 answers | 66k views
        
    Will pandas dataframe object work with sklearn kmeans clustering?
                dataset is a pandas DataFrame. This is sklearn.cluster.KMeans:
 km = KMeans(n_clusters = n_Clusters)
 km.fit(dataset)
 prediction = km.predict(dataset)
This is how I decide which entity belongs to ...
            
        
       
    
    37 votes | 4 answers | 30k views
        
    Text clustering with Levenshtein distances
                I have a set (2k - 4k) of small strings (3-6 characters) and I want to cluster them. Since I use strings, previous answers on How does clustering (especially String clustering) work?, informed me that ...
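For readers unfamiliar with the distance mentioned above, the following is a minimal sketch of the standard dynamic-programming Levenshtein distance, followed by a pairwise distance matrix over a few made-up words. The function name `levenshtein` and the sample `words` are assumptions; a matrix like this could then be handed to a clustering method that accepts precomputed distances.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca -> cb (or match)
            ))
        previous = current
    return previous[-1]


# Invented sample strings for the sketch.
words = ["cat", "cart", "dog"]
matrix = [[levenshtein(x, y) for y in words] for x in words]
print(matrix)  # [[0, 1, 3], [1, 0, 4], [3, 4, 0]]
```

For a few thousand short strings, the full pairwise matrix is small enough to compute directly this way.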
            
        
       
    
    37 votes | 5 answers | 38k views
        
    sklearn agglomerative clustering linkage matrix
                I'm trying to draw a complete-link scipy.cluster.hierarchy.dendrogram, and I found that scipy.cluster.hierarchy.linkage is slower than sklearn.AgglomerativeClustering.
However, sklearn....
            
        
       
    
    36 votes | 4 answers | 33k views
        
    How does clustering (especially String clustering) work?
                I heard about clustering to group similar data. I want to know how it works in the specific case for String.
I have a table with more than 100,000 different words.
I want to identify the same word ...
            
        
       
    
    36 votes | 3 answers | 36k views
        
    What makes the distance measure in k-medoid "better" than k-means?
                I am reading about the difference between k-means clustering and k-medoid clustering.
Supposedly there is an advantage to using the pairwise distance measure in the k-medoid algorithm, instead of the ...
            
        
       
    
    36 votes | 2 answers | 28k views
        
    Extracting clusters from seaborn clustermap
                I am using the seaborn clustermap to create clusters and visually it works great (this example produces very similar results).
However I am having trouble figuring out how to programmatically extract ...
            
        
       
    
    35 votes | 6 answers | 36k views
        
    How to group latitude/longitude points that are 'close' to each other?
                I have a database of user-submitted latitude/longitude points and am trying to group 'close' points together. 'Close' is relative, but for now it seems to be ~500 feet.
At first it seemed I could just ...
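Whatever grouping method is used, it needs a distance between latitude/longitude pairs. A minimal sketch using the haversine great-circle formula follows; the mean Earth radius of 6,371 km is an approximation, and the names `haversine_m`, `is_close`, and `THRESHOLD_M` are assumptions introduced here. The threshold converts the question's ~500 feet to metres.

```python
import math

EARTH_RADIUS_M = 6371000.0   # mean Earth radius in metres (approximation)
THRESHOLD_M = 500 * 0.3048   # ~500 feet expressed in metres (152.4 m)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


def is_close(p, q, threshold_m=THRESHOLD_M):
    """True if two (lat, lon) pairs are within threshold_m of each other."""
    return haversine_m(p[0], p[1], q[0], q[1]) <= threshold_m


# Invented coordinates: 0.001 degrees of latitude is roughly 111 m.
print(is_close((40.0, -75.0), (40.001, -75.0)))  # True
print(is_close((40.0, -75.0), (40.01, -75.0)))   # False
```

A predicate like this can drive a simple neighbor-based grouping, or the distance function can be passed to a density-based method.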
            
        
       
    
    34 votes | 17 answers | 6k views
        
    Clustering Algorithm for Paper Boys
                I need help selecting or creating a clustering algorithm according to certain criteria.
Imagine you are managing newspaper delivery persons.
You have a set of street addresses, each of which is ...
            
        
       
    
    34 votes | 3 answers | 32k views
        
    Spectral Clustering a graph in python
                I'd like to cluster a graph in python using spectral clustering. 
Spectral clustering is a more general technique which can be applied not only to graphs, but also images, or any sort of data, ...
            
        
       
    
    34 votes | 1 answer | 37k views
        
    Cluster one-dimensional data optimally? [closed]
                Does anyone have a paper that explains how the Ckmeans.1d.dp algorithm works?
Or: what is the optimal way to do k-means clustering in one dimension?
            
        
       
    
    34 votes | 2 answers | 24k views
        
    Reordering matrix elements to reflect column and row clustering in naive Python [duplicate]
                I'm looking for a way to perform clustering separately on matrix rows and then on its columns, reordering the data in the matrix to reflect the clustering and putting it all together. The clustering ...
            
        
       
    
    33 votes | 5 answers | 58k views
        
    DBSCAN for clustering of geographic location data
                I have a dataframe with latitude and longitude pairs.
Here is what my dataframe looks like.
    order_lat  order_long
0   19.111841   72.910729
1   19.111342   72.908387
2   19.111342   72.908387
3   19....
            
        
       
    
    33 votes | 5 answers | 20k views
        
    Scikit Learn GridSearchCV without cross validation (unsupervised learning)
                Is it possible to use GridSearchCV without cross validation? I am trying to optimize the number of clusters in KMeans clustering via grid search, and thus I don't need or want cross validation. 
The ...
            
        
       
    
    33 votes | 6 answers | 19k views
        
    Which machine learning library to use [closed]
                I am looking for a library that, ideally, has the following features:
implements hierarchical clustering of multidimensional data (ideally on a similarity or distance matrix)
implements support vector ...
            
        
       
    
    33 votes | 4 answers | 32k views
        
    Clustering Algorithm for Mapping Application
                I'm looking into clustering points on a map (latitude/longitude). Are there any recommendations as to a suitable algorithm that is fast and scalable?
More specifically, I have a series of latitude/...
            
        
       
    
    32 votes | 7 answers | 24k views
        
    Python Implementation of OPTICS (Clustering) Algorithm
                I'm looking for a decent implementation of the OPTICS algorithm in Python. I will use it to form density-based clusters of points ((x,y) pairs).
I'm looking for something that takes in (x,y) pairs ...
            
        
       
    
    31 votes | 5 answers | 43k views
        
    What is the difference between "k means" and "fuzzy c means" objective functions?
                I am trying to see whether the performance of both can be compared based on the objective functions they work on.
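For reference, the two objectives in their usual textbook form. The notation is assumed here, since the question gives none: $x_i$ are the $n$ data points, $c_j$ the $k$ cluster centers, $C_j$ the set of points assigned to cluster $j$, $u_{ij} \in [0, 1]$ the degree of membership of $x_i$ in cluster $j$, and $m > 1$ the fuzzifier.

```latex
J_{\text{k-means}} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - c_j \rVert^2
\qquad
J_{\text{FCM}} = \sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij}^{m} \, \lVert x_i - c_j \rVert^2,
\quad \text{subject to } \sum_{j=1}^{k} u_{ij} = 1 \text{ for all } i
```

If every membership is restricted to $u_{ij} \in \{0, 1\}$, the fuzzy objective reduces to the k-means one, which is one way to see how the two are related.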
            
        
       
    
    31 votes | 14 answers | 13k views
        
    How can I find the center of a cluster of data points?
                Let's say I plotted the position of a helicopter every day for the past year and came up with the following map:
Any human looking at this would be able to tell me that this helicopter is based out ...
            
        
       
    
    31 votes | 2 answers | 28k views
        
    python scikit-learn clustering with missing data
                I want to cluster data with missing columns. Doing it manually I would calculate the distance in case of a missing column simply without this column.
With scikit-learn, missing data is not possible. ...
            
        
       
    
    30 votes | 1 answer | 20k views
        
    Online k-means clustering
                Is there an online version of the k-means clustering algorithm?
By online I mean that every data point is processed in serial, one at a time as they enter the system, hence saving computing time when ...
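One common reading of "online" here is the sequential update often attributed to MacQueen: each arriving point is assigned to its nearest center, and that center moves toward the point by a step of 1/count. The sketch below is a minimal illustration under that assumption; the class name `OnlineKMeans` is hypothetical, and the initial centers are assumed to be actual data points, each counted as one observation.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


class OnlineKMeans:
    """Sequential k-means: centers are updated one point at a time."""

    def __init__(self, initial_centers):
        self.centers = [list(c) for c in initial_centers]
        # Assumption: each initial center counts as one observed point.
        self.counts = [1] * len(self.centers)

    def update(self, point):
        """Assign point to its nearest center, move that center, return its index."""
        j = min(range(len(self.centers)),
                key=lambda i: squared_distance(point, self.centers[i]))
        self.counts[j] += 1
        rate = 1.0 / self.counts[j]  # running-mean step size
        self.centers[j] = [c + rate * (x - c)
                           for c, x in zip(self.centers[j], point)]
        return j


# Invented data stream for the sketch.
model = OnlineKMeans([(0.0, 0.0), (10.0, 10.0)])
for p in [(2.0, 0.0), (10.0, 12.0)]:
    model.update(p)
print(model.centers)  # [[1.0, 0.0], [10.0, 11.0]]
```

With this step size each center stays equal to the running mean of the points assigned to it so far, so no past data needs to be stored.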
            
        
       
    
    30 votes | 1 answer | 49k views
        
    differences in heatmap/clustering defaults in R (heatplot versus heatmap.2)?
                I'm comparing two ways of creating heatmaps with dendrograms in R, one with made4's heatplot and one with gplots' heatmap.2. The appropriate results depend on the analysis but I'm trying to ...
            
        
       
    
    28 votes | 5 answers | 95k views
        
    Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
                I have a data table ("norm") containing numeric - at least to what I can see - normalized values of the following form:
When I am executing
k <- kmeans(norm,center=3)
I am receiving the following ...
            
        
       
    
    28 votes | 1 answer | 16k views
        
    How to compute cluster assignments from linkage/distance matrices
                If you have this hierarchical clustering call in scipy in Python:
from scipy.cluster.hierarchy import linkage
# dist_matrix is long form distance matrix
linkage_matrix = linkage(squareform(...
            
        
       
    
    27 votes | 3 answers | 38k views
        
    Clustering values by their proximity in python (machine learning?) [duplicate]
                I have an algorithm that is running on a set of objects. This algorithm produces a score value that dictates the differences between the elements in the set.
The sorted output is something like this:
...
            
        
       
    
    27 votes | 2 answers | 22k views
        
    Group n points in k clusters of equal size [duplicate]
                Possible Duplicate:
  K-means algorithm variation with equal cluster size  
EDIT: As casperOne pointed out to me, this question is a duplicate. Anyway, here is a more generalized question that ...
            
        
       
    
    27 votes | 1 answer | 2k views
        
    Clustering (fkmeans) with Mahout using Clojure
                I am trying to write a short script to cluster my data via Clojure (calling Mahout classes, though). I have my input data in this format (which is the output of a PHP script):
format: (tag) (image) (...
            
        
       
    
    26 votes | 1 answer | 42k views
        
    Clustering text documents using scikit-learn kmeans in Python
                I need to implement scikit-learn's kMeans for clustering text documents. The example code works fine as it is but takes some 20newsgroups data as input. I want to use the same code for clustering a ...
            
        
       
    
    26 votes | 3 answers | 26k views
        
    Understanding concept of Gaussian Mixture Models
                I'm trying to understand GMM by reading the sources available online. I have achieved clustering using K-Means and was seeing how GMM would compare to K-means.
Here is what I have understood, please ...
            
        
       
    
    26 votes | 6 answers | 18k views
        
    Fast (< n^2) clustering algorithm
                I have 1 million 5-dimensional points that I need to group into k clusters with k << 1 million. In each cluster, no two points should be too far apart (e.g. they could be bounding spheres with a ...
            
        
       
     
         
         
         
         
         
         
         
        