Clustering Techniques

Clustering

Cluster Analysis("data segmentation") is an exploratory method for identifying homogenous groups ("clusters") of records

Similar records should belong to the same cluster
Dissimilar records should belong to different clusters
In Clustering there are two types of Clusters they are:
- Hierarchical Clustering
- Non-Hierarchical Clustering

Hierarchical Clustering Alogorithm:

Hierarchical methods-agglomeratives: Begin with n clusters; sequentially merge similar clusters until 1 cluster is left. Useful when goal is to arrange the clusters into a natural hierarchy. Requires specifying distance measure

Clustering

Cluster Analysis(“data segmentation”) is an exploratory method for identifying homogeneous groups
(“clusters”) of records

Similar records should belongs to same cluster
Dissimilar records should belongs to different clusters
In clustering there are two types of clusters they are:
- Hierarchical clustering
- Non-Hierarchical clustering

Hierarchical clustering Algorithm:

Hierarchical methods - Agglomerative

Begin with n clusters;sequentially merge similar clusters until 1 cluster is left.
Useful when the goal is to arrange the clusters into natural hierarchy.
Requires specifying distance measures

Dendrogram:

Tree like diagram that summarise the clustering process
Similar records join by links
Record location determined by similarity to other records

Start with n clusters

( 1 record in each cluster)

Step 1: Two closest records merged into one cluster
At every step, pairs of records/clusters with smallest distance are merged.

- Two records are merged,
- or single record added to the existing cluster,
- or two existing clusters are combined
- requires a definition of distance.

Pairwise distance between records:

dij = distance between records i and j
Distance requirements : Non-negative(dij>0) dii =0
Symmetry(dij = dji)
Triangle inequality (dij+djk>=djk)

Distance for binary data

Binary Euclidean Distance : (b+c) / (a+b+c+d)
Simple matching coefficient : (a+d) / (a+b+c+d)
Jaquard’s coefficient : d / (b+c+d)

for>2 categories,distance=0 only if both items have same category.Otherwise=1

Distances for mixed(Numerical+categorical)Data

Simple:Standardized numerical variables to [0,1],the use euclidean distance for all
Gower’s general dissimilarity coefficient

Distances between clusters: ;’Single linkage’(‘nearest neighbour’)

Distance between 2 clusters = minimum distance between members of the two clusters

Distances between clusters: ;’complete linkage’(‘farthest neighbour’)

Distance between 2 clusters = Greatest distance between members of the two clusters

Distances between clusters: ; ‘Average linkage’

Distance between 2 clusters = Average of all distances between members of the two clusters

Distances between clusters: ; ‘Centroid linkage’

Distance between 2 clusters = Distance between centroids(centers)

Hierarchical
The good

Finds “natural” grouping - no need to specify number of clusters
Dendrogram: transparency of process, good for presentation

The Bad

Require computation and storage of n/n distance matrix
Algorithm makes only one pass through the data.records that are incorrectly allocated early on cannot be reallocated subsequently
Low stability : recording data or dropping a few records can leads to different solution
single complete linkage robust to distance metrics as long as the relative ordering is kept .
Average linkage is Not.
Most distances sensitive to outliers.

Non-Hierarchical methods:
K-means clustering

pre-specified number of clusters ,assign records to each of the clusters.
Requires specifying # clusters
Computationally cheap
Predetermined number(k) of non-overlapping clusters
Clusters are homogenous yet dissimilar to other clusters
Need measures of within-cluster similarity(homogeneity) and between-cluster similarity
No hierarchy(no dendrogram)!End-product is final cluster memberships
Useful for large datasets

Algorithm minimizes within-cluster variance(heterogeneity)

1.For a user-specified value of k,partition dataset into k initial clusters
2.For each record,assign it to cluster with closest centroid
3.Re-calculate centroids for the “losing” and “receiving” clusters.Can be done

After reassignment of each record,or
After one complete pass through all records(cheaper)

4.Repeat step 2-3 until no more assignment is necessary

Initial partition into k clusters
Initial partitions can be obtained by either

User-specified initial partitions,or
User-specified initial centroids,or
Random partitions

Stability : run algorithm with different initial partitions
Selecting K

Re-run algorithm for different values of k
Trade off: simplicity (interpretation) vs.adequacy ( within-cluster homogeneity)
Plot within-cluster variability as a function of k

K-Means
The Good

Computationally fast for large datasets
Useful when certain k needed

The Bad

Can take long to terminate
Final solution not guaranteed to be “globally optimal”
Different initial partitions can lead to different solutions
Must re-run the algorithm for different values of K No dendrogram

About the Author

Ram Varun

Research Engineer(Data Science), Alumni of GWU (George Washington University) having 6.5+ years of experience in Data Science and Machine Learning, Specialist in Machine Learning and AI Facilitated and collaborated on many industry wide data science and machine learning projects

Post Comments