DBSCAN Clustering in Machine Learning

Clustering is an unsupervised machine learning technique that divides data into meaningful sub-groups, called clusters. The subgroups are chosen such that intra-cluster differences are minimized and inter-cluster differences are maximized. The very definition of a ‘cluster’ depends on the application. There are myriad clustering algorithms, and they can generally be classified into four categories: partitioning-based, hierarchy-based, density-based and grid-based.

DBSCAN is one of the best-known clustering algorithms. Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is widely used in data mining and machine learning. Given a set of points (think of a 2-dimensional space), DBSCAN groups together points that are close to each other based on a distance measure (usually Euclidean distance) and a minimum number of points, and it marks points that lie in low-density regions as outliers. DBSCAN can find non-linearly separable clusters, as the short example below illustrates.
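
To make that last point concrete, here is a minimal sketch that runs scikit-learn's DBSCAN on the two-moons toy dataset, whose crescent-shaped clusters a centroid-based method such as K-means cannot separate. The eps and min_samples values shown are illustrative, not tuned.

# Minimal sketch: DBSCAN on non-linearly separable (moon-shaped) clusters.
# The eps / min_samples values are illustrative, not tuned.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# -1 marks noise; every other label is a discovered cluster.
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise points:", np.sum(labels == -1))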

DBSCAN Algorithm (Pseudocode):

DBSCAN(DB, distFunc, eps, minPts) {
    C = 0                                                  /* Cluster counter */
    for each point P in database DB {
        if label(P) ≠ undefined then continue              /* Previously processed in inner loop */
        Neighbors N = RangeQuery(DB, distFunc, P, eps)     /* Find neighbors */
        if |N| < minPts then {                             /* Density check */
            label(P) = Noise                               /* Label as Noise */
            continue
        }
        C = C + 1                                          /* Next cluster label */
        label(P) = C                                       /* Label initial point */
        Seed set S = N \ {P}                               /* Neighbors to expand */
        for each point Q in S {                            /* Process every seed point */
            if label(Q) = Noise then label(Q) = C          /* Change Noise to border point */
            if label(Q) ≠ undefined then continue          /* Previously processed */
            label(Q) = C                                   /* Label neighbor */
            Neighbors N = RangeQuery(DB, distFunc, Q, eps) /* Find neighbors */
            if |N| ≥ minPts then {                         /* Density check */
                S = S ∪ N                                  /* Add new neighbors to seed set */
            }
        }
    }
}

/* credit: wikipedia */
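
The pseudocode translates almost line-for-line into Python. The function below is a minimal, unoptimized sketch of that translation (brute-force neighbor search, NumPy array input assumed); it is meant for reading alongside the pseudocode, not as a replacement for scikit-learn's implementation.

# Sketch: direct Python translation of the pseudocode above.
# Brute-force neighbor search; labels: None = unvisited, -1 = noise, 0..k = cluster id.
import numpy as np

def range_query(X, p_idx, eps):
    # Indices of all points within eps of point p_idx (Euclidean distance).
    dists = np.linalg.norm(X - X[p_idx], axis=1)
    return list(np.where(dists <= eps)[0])

def dbscan(X, eps, min_pts):
    labels = [None] * len(X)       # None = undefined, -1 = noise
    c = -1                         # cluster counter
    for p in range(len(X)):
        if labels[p] is not None:  # previously processed
            continue
        neighbors = range_query(X, p, eps)
        if len(neighbors) < min_pts:
            labels[p] = -1         # label as noise
            continue
        c += 1                     # next cluster label
        labels[p] = c              # label initial point
        seeds = [q for q in neighbors if q != p]
        i = 0
        while i < len(seeds):      # process every seed point (the seed list may grow)
            q = seeds[i]
            i += 1
            if labels[q] == -1:
                labels[q] = c      # change noise to border point
            if labels[q] is not None:
                continue           # previously processed
            labels[q] = c          # label neighbor
            q_neighbors = range_query(X, q, eps)
            if len(q_neighbors) >= min_pts:
                seeds.extend(q_neighbors)  # add new neighbors to the seed list
    return labels

Called as labels = dbscan(X, eps=0.3, min_pts=10), it returns one label per row of X, with -1 marking noise.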

A working Python example using scikit-learn is shown below. The sample data is generated with make_blobs and standardized so the snippet runs end-to-end.

# credit: geeksforgeeks (adapted: data generation added, imports and indentation fixed)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Generate sample data in X and standardize it so eps is on a sensible scale
X, _ = make_blobs(n_samples=750, centers=[[1, 1], [-1, -1], [1, -1]],
                  cluster_std=0.4, random_state=0)
X = StandardScaler().fit_transform(X)

# Fit DBSCAN and mark which points are core samples
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print(labels)

# Plot result: one color per cluster, black reserved for noise.
unique_labels = set(labels)
colors = ['y', 'b', 'g', 'r']
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = 'k'
    class_member_mask = (labels == k)

    # Core points of cluster k
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=6)

    # Border (non-core) points of cluster k
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=6)

plt.title('number of clusters: %d' % n_clusters_)
plt.show()

DBSCAN overcomes many of the problems of K-means. The algorithm identifies dense regions by grouping together data points that are close to each other according to a distance measure, and it expands these dense areas recursively, which lets it find arbitrarily shaped clusters. The two main parameters of DBSCAN are ‘ε’ and ‘minPoints’: ‘ε’ defines the radius of the neighborhood around a point, and ‘minPoints’ defines the minimum number of points that must fall within that neighborhood. Because DBSCAN has an explicit notion of noise, it works well even on noisy datasets, and the algorithm can be highly parallelized, which is an added advantage. In practice, choosing ‘ε’ is usually the harder of the two settings; a common heuristic is sketched below.
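
One widely used heuristic for picking ‘ε’ (a sketch only; the dataset and minPoints value here are illustrative) is the k-distance plot: sort every point's distance to its minPoints-th nearest neighbor and choose ‘ε’ near the "elbow" of the resulting curve.

# Sketch of the k-distance heuristic for choosing eps (illustrative only).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=750, centers=3, cluster_std=0.4, random_state=0)

min_points = 10
nn = NearestNeighbors(n_neighbors=min_points).fit(X)
distances, _ = nn.kneighbors(X)           # distances to the nearest neighbors of each point
k_distances = np.sort(distances[:, -1])   # distance to the min_points-th neighbor, sorted

plt.plot(k_distances)
plt.xlabel('points sorted by k-distance')
plt.ylabel('distance to %d-th nearest neighbor' % min_points)
plt.title('k-distance plot for choosing eps')
plt.show()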