Introduction to Topological Data Analysis¶

TODO:¶

[x] Make Goals

[x] Make Outline

[x] Make abstract

[ ] Make Deck text

[ ] Make Deck drawings

[ ] Make deck pdf

Goals¶

Understand the general “zen” of TDA
Understand concrete example(s)
Understand a few applications
Validation (time permitting)

Abstract¶

Topological data analysis (TDA) is an ecosystem of (unsupervised) learning algorithms based on the ansatz: “Data has shape, shape has meaning.” After a high level overview, we present the primary workhorse of TDA (Mapper) and two applications to healthcare. Time permitting, I will outline a couple validation strategies.

Outline:¶

Part I: Zen of TDA
1. Data -> Space
2. Space -> low dimensional embedding
3. Space -> structured features ((persistent) homology)
  1. Qualitative/computable features, e.g. number of connected components
Part II: Examples
1. DBSCAN
2. Mapper
3. Example of Mapper on a point cloud from a circle
  1. circle has two points connected by two qualitatively distinct paths
Part III: Applications
1. Breast Cancer
2. Malaria

Slides¶

Loose draft of the contents of my slide deck.

Slide 1: Title Slide¶

Title: Introduction to TDA: “Data has shape, shape has meaning”

Author: Jeremy Mann, PhD

Slide 2: What is TDA?¶

TDA is the application of homotopy theory to data science, primarily through an ecosystem of unsupervised learning algorithms
Homotopy theory = (De)Construction and analysis of “spaces”
- “connectivity”: when two things can be connected via a continuous path
- “locality”: computable via divide-and-conquer
Examples of spaces: sets, graphs, circles, torii, spheres, solutions to f(x)=0, invertible matrices, ….
Examples of analysis: number of connected components.

Slide 3: Step I¶

We want something of the form Data –> Insight
Idea: break this process into two steps:
1. Data –> Space
2. Space –> Insight
  - Space –> Structured data
  - Space –> Low dimensional embedding

Slide 3: Step I¶

Idea: given input of some kind (a physical theory, a protein, a point cloud of structured data, an image, …) “naturally” construct a shape whose properties reflect something interesting about the input.
- Family tree, connectivity ~ kinship
- Gauge theory, connectivity ~ physical indistinguishability
Point: All of these notions of connectivity (and more!) are specified once you construct a space!

Slide 5: DBSCAN¶

DSCAN is a classical clustering algorithm
Primary hyperparameter is a scale parameter \(\epsilon > 0\) determining when two points are close/similar
Number of clusters is not explicitly specified ahead of time.
Breaks into two steps

Slide 6: DBSCAN¶

DBSCAN constructs a space (graph) from a point cloud \(\{x_i\}\) and a
- Points: there is a point for every example
- Edges: there is an edge between \(x_i\) and \(x_j\) whenever they are less than \(\epsilon\) units away from each other.
Two data points are declared to be in the same cluster if they can be connected by a path in this graph.
- number of clusters = number of connected components of this space.

Slide 6: DCSCAN¶

Explicit example

Slide 7: Mapper¶

Mapper modifies this construction in by integrating two pieces: - Allows for/leverages intersections of subpopulations - Integrates the data of a “lens” function :
- particularly interesting feature(s)
- highest component generated by PCA
- loss function of a learning algorithm
- points ~ subpopulations
- edges ~ intersections of subpopulations
- color ~ (average/median) value of original function
Mapper requires additional choice: (overlapping) cover of range of function

Slide 8: Covering the range¶

Essentially hyperparameter means
For every element of the cover, we have a subset of the data
For every intersectin in the data, we have a subset of the data.
- These subsets sit inside of two larger subsets

Slide 8: Mapper Constructs a Space¶

#. Mapper constructs a space (graph) - First step: stratify data by the “lens” function.

KEY POINT: strata should be overlapping!

Perform DBSCAN
Say when two points are not being connected

Slide 9: Mapper Constructs a Space¶

Slide : Embed the Graph - There are many ways to embed graphs in low dimensions - These + mapper gives low dimensional visualizations of the data - Can remember the original lens function

The colors should vary continuously as you move along the edges of the graph

Slide 9¶

Mapper breast cancer application
Interpretation of picture

Slide 10¶

Mapper malaria example

Drawings:¶

point cloud approximation of \(S^1 \coprod *\) clustered by DBSCAN - draw the length scale \(\epsilon\) on the side
picture of an overlapping cover.

3. picture of point cloud broken into independent parts and clustered accordingly 4.

Introduction to Topological Data Analysis¶

TODO:¶

Goals¶

Abstract¶

Outline:¶

Slides¶

Slide 1: Title Slide¶

Slide 2: What is TDA?¶

Slide 3: Step I¶

Slide 3: Step I¶

Slide 5: DBSCAN¶

Slide 6: DBSCAN¶

Slide 6: DCSCAN¶

Slide 7: Mapper¶

Slide 8: Covering the range¶

Slide 8: Mapper Constructs a Space¶

Slide 9: Mapper Constructs a Space¶

Slide 9¶

Slide 10¶

Drawings:¶

Jeremy Mann's Writing

Navigation

Related Topics