Introduction to Topological Data Analysis

TODO:

[x] Make Goals

[x] Make Outline

[x] Make abstract

[ ] Make Deck text

[ ] Make Deck drawings

[ ] Make deck pdf

Goals

  1. Understand the general “zen” of TDA

  2. Understand concrete example(s)

  3. Understand a few applications

  4. Validation (time permitting)

Abstract

Topological data analysis (TDA) is an ecosystem of (unsupervised) learning algorithms based on the ansatz: “Data has shape, shape has meaning.” After a high level overview, we present the primary workhorse of TDA (Mapper) and two applications to healthcare. Time permitting, I will outline a couple validation strategies.

Outline:

  1. Part I: Zen of TDA

    1. Data -> Space

    2. Space -> low dimensional embedding

    3. Space -> structured features ((persistent) homology)

      1. Qualitative/computable features, e.g. number of connected components

  2. Part II: Examples

    1. DBSCAN

    2. Mapper

    3. Example of Mapper on a point cloud from a circle

      1. circle has two points connected by two qualitatively distinct paths

  3. Part III: Applications

    1. Breast Cancer

    2. Malaria

Slides

Loose draft of the contents of my slide deck.

Slide 1: Title Slide

Title: Introduction to TDA: “Data has shape, shape has meaning”

Author: Jeremy Mann, PhD

Slide 2: What is TDA?

  • TDA is the application of homotopy theory to data science, primarily through an ecosystem of unsupervised learning algorithms

  • Homotopy theory = (De)Construction and analysis of “spaces”
    • “connectivity”: when two things can be connected via a continuous path

    • “locality”: computable via divide-and-conquer

  • Examples of spaces: sets, graphs, circles, torii, spheres, solutions to f(x)=0, invertible matrices, ….

  • Examples of analysis: number of connected components.

Slide 3: Step I

  • We want something of the form Data –> Insight

  • Idea: break this process into two steps:

    1. Data –> Space

    2. Space –> Insight

      • Space –> Structured data

      • Space –> Low dimensional embedding

Slide 3: Step I

  • Idea: given input of some kind (a physical theory, a protein, a point cloud of structured data, an image, …) “naturally” construct a shape whose properties reflect something interesting about the input.

    • Family tree, connectivity ~ kinship

    • Gauge theory, connectivity ~ physical indistinguishability

  • Point: All of these notions of connectivity (and more!) are specified once you construct a space!

Slide 5: DBSCAN

  • DSCAN is a classical clustering algorithm

  • Primary hyperparameter is a scale parameter \(\epsilon > 0\) determining when two points are close/similar

  • Number of clusters is not explicitly specified ahead of time.

  • Breaks into two steps

Slide 6: DBSCAN

  1. DBSCAN constructs a space (graph) from a point cloud \(\{x_i\}\) and a

    • Points: there is a point for every example

    • Edges: there is an edge between \(x_i\) and \(x_j\) whenever they are less than \(\epsilon\) units away from each other.

  2. Two data points are declared to be in the same cluster if they can be connected by a path in this graph.

    • number of clusters = number of connected components of this space.

Slide 6: DCSCAN

Explicit example

Slide 7: Mapper

  • Mapper modifies this construction in by integrating two pieces: - Allows for/leverages intersections of subpopulations - Integrates the data of a “lens” function :

    • particularly interesting feature(s)

    • highest component generated by PCA

    • loss function of a learning algorithm

    • points ~ subpopulations

    • edges ~ intersections of subpopulations

    • color ~ (average/median) value of original function

  • Mapper requires additional choice: (overlapping) cover of range of function

Slide 8: Covering the range

  • Essentially hyperparameter means

  • For every element of the cover, we have a subset of the data

  • For every intersectin in the data, we have a subset of the data.
    • These subsets sit inside of two larger subsets

Slide 8: Mapper Constructs a Space

#. Mapper constructs a space (graph) - First step: stratify data by the “lens” function.

  • KEY POINT: strata should be overlapping!

  • Perform DBSCAN

  • Say when two points are not being connected

Slide 9: Mapper Constructs a Space

Slide : Embed the Graph - There are many ways to embed graphs in low dimensions - These + mapper gives low dimensional visualizations of the data - Can remember the original lens function

  • The colors should vary continuously as you move along the edges of the graph

Slide 9

  • Mapper breast cancer application

  • Interpretation of picture

Slide 10

  • Mapper malaria example

Drawings:

  1. point cloud approximation of \(S^1 \coprod *\) clustered by DBSCAN - draw the length scale \(\epsilon\) on the side

  2. picture of an overlapping cover.

3. picture of point cloud broken into independent parts and clustered accordingly 4.