Feb 22, 2015

Mapper Python Implementation

A few posts ago I wrote about the Mapper construction by Carlsson--Mémoli--Singh and want to follow up on that a little.

I wrote a straightforward implementation of the construction in Python. It can be found here.
Below you find an example of how to use it.

Jan 23, 2015

K-Means, coverings, and Voronoi diagrams

This is the fourth of a series of posts on cluster-algorithms and ideas in data analysis.

The $k$-Means algorithm computes a Voronoi partition of the data set such that each landmark is given by the centroid of the corresponding cell. Let me quickly quote Wikipedia on the history of the algorithm before I explain what it is about: The term "$k$-means" was first used by James MacQueen in 1967, though the idea goes back to Hugo Steinhaus in 1957. The standard algorithm was first proposed by Stuart Lloyd in 1957.
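Lloyd's algorithm alternates between the two steps implicit in the description above: assign every point to its nearest landmark (yielding the Voronoi partition) and then move every landmark to the centroid of its cell. A minimal sketch in plain Python might look as follows; the function and parameter names are my own choices for illustration, not taken from any particular library:

```python
import random

def k_means(points, k, iters=20, seed=0):
    """Sketch of Lloyd's algorithm: alternate between assigning points
    to their nearest centroid (a Voronoi cell) and replacing each
    centroid by the mean of its cell."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    centroids = random.Random(seed).sample(points, k)
    cells = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: Voronoi partition with respect to the centroids.
        cells = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            cells[nearest].append(p)
        # Update step: each centroid becomes the centroid (mean) of its cell.
        for i, cell in enumerate(cells):
            if cell:
                centroids[i] = tuple(sum(c) / len(cell) for c in zip(*cell))
    return centroids, cells
```

On two well-separated blobs this converges to one centroid per blob after a handful of iterations; a fixed number of iterations stands in for the usual convergence test.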

Dec 8, 2014

Mapper - A discrete generalization of the Reeb graph

This is the third of a series of posts on cluster-algorithms and ideas in data analysis.

Mapper is a construction that uses a given cluster-algorithm to associate a simplicial complex to a reference map on a given data set. It was introduced by Carlsson--Mémoli--Singh in [1] and lies at the core of the topological data analysis startup Ayasdi. A good reference, which I personally enjoyed reading, is [2]. Mapper is closely related to the Reeb graph of a real-valued function on a topological space. Just like the Reeb graph, it is able to capture certain topological and geometrical features of the underlying space.
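To make the construction concrete, here is a small sketch in Python (all names are mine, not notation from [1] or [2]): cover the range of a real-valued reference map by overlapping intervals, apply a cluster-algorithm to each preimage (here simply the connected components of an $\varepsilon$-neighbourhood graph, as a stand-in for whatever cluster-algorithm one plugs in), and connect two clusters whenever they share a data point. This yields the 1-skeleton of the Mapper complex, i.e. the Reeb-graph-like object:

```python
import itertools

def eps_components(points, dist, eps):
    """Stand-in cluster-algorithm: connected components of the graph in
    which two points are adjacent iff their distance is at most eps."""
    points = list(points)
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        stack = [unvisited.pop()]
        comp = set(stack)
        while stack:
            i = stack.pop()
            near = {j for j in unvisited if dist(points[i], points[j]) <= eps}
            unvisited -= near
            comp |= near
            stack.extend(near)
        clusters.append([points[p] for p in comp])
    return clusters

def mapper_graph(data, filter_fn, dist, n_intervals=4, overlap=0.25, eps=1.0):
    """Sketch of the Mapper 1-skeleton: overlapping interval cover of the
    filter range, clustering of each preimage, one edge per shared point.
    Data points are assumed hashable and pairwise distinct."""
    values = {x: filter_fn(x) for x in data}
    lo, hi = min(values.values()), max(values.values())
    base = (hi - lo) / n_intervals
    length = base * (1 + overlap)          # stretch intervals to overlap
    nodes = []                             # each node is one cluster
    for i in range(n_intervals):
        a = lo + i * base
        preimage = [x for x in data if a <= values[x] <= a + length]
        for cluster in eps_components(preimage, dist, eps):
            nodes.append(frozenset(cluster))
    # Connect clusters from overlapping intervals that share a data point.
    edges = [(i, j) for i, j in itertools.combinations(range(len(nodes)), 2)
             if nodes[i] & nodes[j]]
    return nodes, edges
```

On a densely sampled line segment with the $x$-coordinate as filter, this produces one cluster per interval and a path graph, as one would expect from the Reeb graph of a height function on an interval.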

Nov 23, 2014

Density-based clustering in spatial data (2)

This is the second of a series of posts on cluster-algorithms and ideas in data analysis (and related fields).

Ordering points to identify the clustering structure (OPTICS) is a data clustering algorithm presented by Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel and Jörg Sander in 1999 [1]. It builds on the ideas of DBSCAN, which I described in a previous post. But let's cut the intro and dive right into it...

Let $(X,d)$ be a finite discrete metric space, i.e. let $X$ be a data set on which we have a notion of similarity expressed in terms of a suitable distance function $d$. Given a positive integer $m \in \mathbb{N}$ we can define a notion of density on points of our data set, which in turn can be used to define a perturbed metric. Given a starting point $x_0 \in X$ the OPTICS-algorithm iteratively defines a linear ordering on $X$ in terms of a filtration $X_0 \subset X_1 \subset \ldots \subset X_N = X$, where $X_{n+1}$ is obtained from $X_n$ by appending the closest element (with respect to the perturbed metric) in its complement.
The original OPTICS-algorithm also takes an additional parameter $\varepsilon > 0$. However, this is only needed to reduce the time complexity and is ignored in our discussion for now -- to be more precise, we implicitly set $\varepsilon = \infty$.
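A compact sketch of this simplified variant (with $\varepsilon = \infty$) in Python, where $m$ plays the role of the usual MinPts parameter and all function names are my own: the density of a point is measured by its core-distance (the distance to its $m$-th nearest neighbour), the perturbed distance from a processed point to another point is the maximum of the core-distance and the actual distance, and the ordering greedily appends the closest remaining point:

```python
import math

def optics_order(points, d, m):
    """Simplified OPTICS (eps = infinity): build the linear ordering by
    repeatedly appending the unprocessed point of smallest
    reachability-distance, starting from points[0] as x_0."""
    n = len(points)

    def core_dist(i):
        # Distance to the m-th nearest *other* point: the density proxy.
        return sorted(d(points[i], points[j]) for j in range(n) if j != i)[m - 1]

    cd = [core_dist(i) for i in range(n)]
    reach = [math.inf] * n       # best perturbed distance seen so far
    order = []
    unprocessed = set(range(n))
    current = 0                  # the starting point x_0
    while unprocessed:
        unprocessed.discard(current)
        order.append(current)
        # Update reachability-distances from the newly processed point.
        for j in unprocessed:
            r = max(cd[current], d(points[current], points[j]))
            reach[j] = min(reach[j], r)
        if unprocessed:
            current = min(unprocessed, key=lambda j: reach[j])
    return order, reach
```

On data with two well-separated groups the ordering walks through one group completely before jumping to the other, and the jump shows up as a spike in the reachability values -- which is exactly the "reachability plot" picture behind OPTICS.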

Nov 17, 2014

Density-based clustering in spatial data (1)

This is the first of a series of posts on cluster-algorithms and ideas in data analysis (and related fields).

Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996 [1]. It uses notions of connectivity and density of points in the data set to compute the connected components of dense enough regions of the data set. But let's cut the intro and dive right into it...
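The two ingredients named above, density (at least `min_pts` neighbours within distance `eps`) and connectivity, fit in a short Python sketch; the names are illustrative, not taken from [1]:

```python
def dbscan(points, d, eps, min_pts):
    """Sketch of DBSCAN: core points have at least min_pts neighbours
    within distance eps; clusters are the connected components of the
    dense region; everything unreachable from a core point is noise."""
    n = len(points)
    # Precompute eps-neighbourhoods (each point counts as its own neighbour).
    neighbors = [[j for j in range(n) if d(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n   # None = unvisited, -1 = noise, otherwise cluster id
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1           # noise, possibly relabelled as border later
            continue
        labels[i] = cluster          # i is a core point: start a new cluster
        stack = list(neighbors[i])
        while stack:
            j = stack.pop()
            if labels[j] == -1:      # previously marked noise: border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                stack.extend(neighbors[j])   # j is core: grow the component
        cluster += 1
    return labels
```

Run on two tight groups plus one stray point, this labels each group as its own cluster and the stray point as noise (`-1`), which is the behaviour the "connected components of dense enough regions" description promises.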