Hierarchical clustering


Hierarchical clustering is a method of cluster analysis that seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:

  • *Agglomerative*: a "bottom-up" approach where each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
  • *Divisive*: a "top-down" approach where all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

The results of hierarchical clustering are usually presented in a dendrogram.

Overview

In the field of data analysis, hierarchical clustering is a powerful tool for identifying the natural groupings or structures within a dataset. Unlike k-means clustering, it does not require the analyst to specify the number of clusters in advance, which makes it particularly useful for exploratory data analysis.

Algorithm

The agglomerative and divisive variants of the algorithm can be described as follows:

Agglomerative Clustering

  1. Start by treating each data point as a single cluster.
  2. Find the closest (most similar) pair of clusters and merge them into a single cluster.
  3. Compute distances (similarities) between the new cluster and each of the old clusters.
  4. Repeat steps 2 and 3 until all items are merged into a single cluster of size n, where n is the number of observations (see the sketch below).
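
This merge loop is implemented in standard scientific-Python libraries. The following is a minimal sketch assuming SciPy's scipy.cluster.hierarchy module; the six sample points are made-up values for illustration only:

  import numpy as np
  from scipy.cluster.hierarchy import linkage, fcluster

  # Toy data: six 2-D points in three visually separated groups
  # (hypothetical values, for illustration only).
  X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0],
                [5.2, 4.8], [9.0, 1.0], [9.1, 1.2]])

  # Agglomerative clustering with single linkage: at each step the
  # closest pair of clusters is merged (steps 2-4 above), using the
  # Euclidean distance described in the next section.
  Z = linkage(X, method="single", metric="euclidean")

  # Each row of Z records one merge: the two cluster indices joined,
  # the distance at which they merged, and the size of the new cluster.
  print(Z)

  # Cutting the hierarchy yields a flat assignment, e.g. three clusters.
  labels = fcluster(Z, t=3, criterion="maxclust")
  print(labels)  # e.g. [1 1 2 2 3 3]

Passing Z to scipy.cluster.hierarchy.dendrogram draws the dendrogram mentioned in the introduction.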

Divisive Clustering

  1. Start with all observations in a single cluster.
  2. Find the cluster to split and how to split it.
  3. Perform the split to create two new clusters.
  4. Repeat steps 2 and 3 until each observation is in its own cluster, or until a desired number of clusters is reached (see the sketch below).
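
Library support for divisive clustering is thinner, but a common concrete instance is bisecting k-means, in which the cluster chosen for splitting is the largest one and each split is a two-means partition. The sketch below is one illustrative implementation of that scheme, assuming scikit-learn's KMeans for the splitting step; it stops at a requested number of clusters rather than running all the way down to singletons:

  import numpy as np
  from sklearn.cluster import KMeans

  def divisive(X, n_clusters):
      # Step 1: all observations start in a single cluster.
      clusters = [np.arange(len(X))]
      while len(clusters) < n_clusters:
          # Step 2: choose the cluster to split -- here, the largest.
          i = max(range(len(clusters)), key=lambda j: len(clusters[j]))
          idx = clusters.pop(i)
          # Step 3: split it into two new clusters with 2-means.
          halves = KMeans(n_clusters=2, n_init=10).fit_predict(X[idx])
          clusters.append(idx[halves == 0])
          clusters.append(idx[halves == 1])
          # Step 4: repeat until the stopping criterion is met.
      return clusters

  X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0],
                [5.2, 4.8], [9.0, 1.0], [9.1, 1.2]])
  for members in divisive(X, 3):
      print(members)  # indices of the observations in each cluster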

Distance Measures

The choice of distance measure is a critical step in clustering: it defines how the similarity of two elements is calculated and influences the shape of the resulting clusters. The most common distance measures used in hierarchical clustering are listed below; a short worked comparison follows the list.

  • Euclidean distance: the standard straight-line distance between two points.
  • Manhattan distance: the sum of the absolute differences of the Cartesian coordinates, also known as city-block distance.
  • Cosine similarity: the cosine of the angle between two vectors.
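
To make the differences concrete, the following sketch compares the three measures on an arbitrary pair of vectors using SciPy's scipy.spatial.distance module. Note that SciPy exposes cosine as a distance, i.e. one minus the cosine similarity:

  from scipy.spatial.distance import euclidean, cityblock, cosine

  u = [0.0, 3.0, 4.0]
  v = [3.0, 0.0, 4.0]

  print(euclidean(u, v))   # straight-line: sqrt(9 + 9 + 0) ~= 4.243
  print(cityblock(u, v))   # Manhattan / city block: 3 + 3 + 0 = 6.0
  print(cosine(u, v))      # 1 - cosine similarity = 1 - 16/25 = 0.36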

Applications

Hierarchical clustering is widely used in fields such as bioinformatics (for example, grouping genes by similar expression profiles), information retrieval and document clustering, market segmentation, and image analysis.

Advantages and Disadvantages

Advantages

  • Does not require the number of clusters to be specified in advance.
  • Easy to implement, and the resulting dendrogram exposes hierarchical relationships among the observations.

Disadvantages

  • Can be computationally expensive: standard agglomerative implementations require O(n²) memory and O(n³) time in the general case, which limits their use on large datasets.
  • The results can be sensitive to the choice of distance measure and linkage criterion.

See Also

  • Cluster analysis
  • K-means clustering
  • Dendrogram

