Introduction to Topological Data Analysis¶
TODO:¶
[x] Make Goals
[x] Make Outline
[x] Make abstract
[ ] Make Deck text
[ ] Make Deck drawings
[ ] Make deck pdf
Goals¶
Understand the general “zen” of TDA
Understand concrete example(s)
Understand a few applications
Validation (time permitting)
Abstract¶
Topological data analysis (TDA) is an ecosystem of (unsupervised) learning algorithms based on the ansatz: “Data has shape, shape has meaning.” After a high level overview, we present the primary workhorse of TDA (Mapper) and two applications to healthcare. Time permitting, I will outline a couple validation strategies.
Outline:¶
Part I: Zen of TDA
Data -> Space
Space -> low dimensional embedding
Space -> structured features ((persistent) homology)
Qualitative/computable features, e.g. number of connected components
Part II: Examples
DBSCAN
Mapper
Example of Mapper on a point cloud from a circle
circle has two points connected by two qualitatively distinct paths
Part III: Applications
Breast Cancer
Malaria
Slides¶
Loose draft of the contents of my slide deck.
Slide 1: Title Slide¶
Title: Introduction to TDA: “Data has shape, shape has meaning”
Author: Jeremy Mann, PhD
Slide 2: What is TDA?¶
TDA is the application of homotopy theory to data science, primarily through an ecosystem of unsupervised learning algorithms
- Homotopy theory = (De)Construction and analysis of “spaces”
“connectivity”: when two things can be connected via a continuous path
“locality”: computable via divide-and-conquer
Examples of spaces: sets, graphs, circles, torii, spheres, solutions to f(x)=0, invertible matrices, ….
Examples of analysis: number of connected components.
Slide 3: Step I¶
We want something of the form Data –> Insight
Idea: break this process into two steps:
Data –> Space
Space –> Insight
Space –> Structured data
Space –> Low dimensional embedding
Slide 3: Step I¶
Idea: given input of some kind (a physical theory, a protein, a point cloud of structured data, an image, …) “naturally” construct a shape whose properties reflect something interesting about the input.
Family tree, connectivity ~ kinship
Gauge theory, connectivity ~ physical indistinguishability
Point: All of these notions of connectivity (and more!) are specified once you construct a space!
Slide 5: DBSCAN¶
DSCAN is a classical clustering algorithm
Primary hyperparameter is a scale parameter \(\epsilon > 0\) determining when two points are close/similar
Number of clusters is not explicitly specified ahead of time.
Breaks into two steps
Slide 6: DBSCAN¶
DBSCAN constructs a space (graph) from a point cloud \(\{x_i\}\) and a
Points: there is a point for every example
Edges: there is an edge between \(x_i\) and \(x_j\) whenever they are less than \(\epsilon\) units away from each other.
Two data points are declared to be in the same cluster if they can be connected by a path in this graph.
number of clusters = number of connected components of this space.
Slide 6: DCSCAN¶
Explicit example
Slide 7: Mapper¶
Mapper modifies this construction in by integrating two pieces: - Allows for/leverages intersections of subpopulations - Integrates the data of a “lens” function :
particularly interesting feature(s)
highest component generated by PCA
loss function of a learning algorithm
points ~ subpopulations
edges ~ intersections of subpopulations
color ~ (average/median) value of original function
Mapper requires additional choice: (overlapping) cover of range of function
Slide 8: Covering the range¶
Essentially hyperparameter means
For every element of the cover, we have a subset of the data
- For every intersectin in the data, we have a subset of the data.
These subsets sit inside of two larger subsets
Slide 8: Mapper Constructs a Space¶
#. Mapper constructs a space (graph) - First step: stratify data by the “lens” function.
KEY POINT: strata should be overlapping!
Perform DBSCAN
Say when two points are not being connected
Slide 9: Mapper Constructs a Space¶
Slide : Embed the Graph - There are many ways to embed graphs in low dimensions - These + mapper gives low dimensional visualizations of the data - Can remember the original lens function
The colors should vary continuously as you move along the edges of the graph
Slide 9¶
Mapper breast cancer application
Interpretation of picture
Slide 10¶
Mapper malaria example
Drawings:¶
point cloud approximation of \(S^1 \coprod *\) clustered by DBSCAN - draw the length scale \(\epsilon\) on the side
picture of an overlapping cover.
3. picture of point cloud broken into independent parts and clustered accordingly 4.