Nov 23, 2014

Density-based clustering in spatial data (2)

This is the second of a series of posts on cluster-algorithms and ideas in data analysis (and related fields).

Ordering points to identify the clustering structure (OPTICS) is a data clustering algorithm presented by Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel and Jörg Sander in 1999 [1]. It builds on the ideas of DBSCAN, which I described in a previous post. But let's cut the intro and dive right into it...

Let $(X,d)$ be a finite discrete metric space, i.e.~let $X$ be a data set on which we have a notion of similarity expressed in terms of a suitable distance function $d$. Given a positive integer $m \in \mathbb{N}$ we can define a notion of density on points of our data set, which in turn can be used to define a perturbed metric. Given a starting point $x_0 \in X$ the OPTICS-algorithm iteratively defines a linear ordering on $X$ in terms of a filtration $X_0 \subset X_1 \subset \ldots \subset X_n$ of $X$, where $X_{n+1}$ is obtained from $X_n$ by appending the closest element (with respect to the perturbed metric) in its complement.
The original OPTICS-algorithm also takes an additional parameter $\varepsilon > 0$. However this is only needed to reduce the time complexity and is ignored in our discussion for now -- to be more precise, we implicitly set $\varepsilon = \infty$.


Definitions

Let $(X,d)$ be a finite discrete metric space and $m \in \mathbb{N}$ a positive integer. We define the (co-)density $\delta_m(x)$ of a point $x$ by \[ \delta_m(x) := d(x, \mathfrak{nn}_m(x) ), \] where $\mathfrak{nn}_m(x) $ is an $m$-nearest-neighbor of $x$. Loosely speaking: the lower the value $\delta_m(x)$ the closer the neighbors of $x$ are distributed around $x$. Note that with $\varepsilon = \delta_m(x)$ the point $x$ is a core point in the sense of [2] -- in a previous post about DBSCAN we called such a point $\varepsilon$-dense. That is why in the literature $\delta_m(x)$ is referred to as core-distance. I however like to think of it, and hence refer to it, as a notion of co-density. Further note that for any $y \in X$ we have (as long as $d$ satisfies the triangle inequality) \[ \delta_m(y) \leq d(x,y) + \delta_m(x). \] Given the co-density $\delta_m(.)$ we can define the reachability distance $r_x(y)$ of $y$ from $x$ by \[ r_x(y) := \text{max} \big\{ \delta_m(x), d(x,y) \big\}. \] Note that $r_x(y)$ is not symmetric, since the density of $x$ and $y$ may differ.

Sketch of the algorithm

Choose a starting point $x_0 \in X$. Then we can iteratively define a filtration $X_0 \subset \ldots \subset X_n$ of the data set $X$ by \[ X_0 := \{ x_0 \} \ \ \text{and} \ \ X_{k+1} := X_k \cup \{ x_{k+1} \}, \] where $x_{k+1}$ minimizes $r_{X_k}(.)$ over $X \setminus X_k$. In parallel we define a sequence $(r_n)$ of distances by \[ r_0 = 0 \ \ \text{and} \ \ r_{k+1} := r_{X_k}(x_{k+1}). \] Note that a small distance $r_k$ may be understood as $x_k$ being close to a rather dense region. Therefore the filtration tries to stay in dense regions for as long as possible, before it passes a less dense region. The cluster structure can now be extracted by analyzing the reachability-plot, that is a $2$-dimensional plot, with the ordered $x_k$ on the $x$-axis and the associated distances $r_k$ on the $y$-axis. By the above considerations it should be clear that clusters show up as valleys in the reachability plot. The deeper the valley, the denser the cluster.

Example

Consider the data points $X\subset \mathbb{R}^2$ in the euclidian plane given in the following figure:


Below you see the reachability plot corresponding the OPTICS-algorithm applied to the above data set with $m=4$. The starting point is chosen among the group of points on the right.



References

[1] M. Ankerst, M.M. Breunig, H.-P. Kriegel, OPTICS: Ordering Points To Identify the Clustering Structure, ACM SIGMOD international conference on Management of data (1999), pp 49--60.
[2] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (1996), pp. 226--231.

1 comment :

  1. Dear Mirko,

    this is very interesting again. :-)

    I have one question or thought about the reachability and the filtration. If I were to implement that algorithm, my intuition would tell me "let's try to define the core-distance/co-density with respect to X_k instead of X". Because this way the algorithm would prefer to extend the filtration on those points which are dense in X_k (instead of dense in X). I think it would be interesting to see what difference that makes.

    To me that would have been a more natural choice at first, but the longer I think about it I am already reconsidering. My variation of the algorithm would probably not have the ability (or 'tendency') to drift to more dense regions if it started in a sparse region. That would seem to be the main difference. Ok, not a real question after all..

    But very interesting post again, I enjoy reading your blog. :-)

    ReplyDelete