Introduction to Topological Data Analysis ========================================= TODO: ----- [x] Make Goals [x] Make Outline [x] Make abstract [ ] Make Deck text [ ] Make Deck drawings [ ] Make deck pdf Goals ------ #. Understand the general "zen" of TDA #. Understand concrete example(s) #. Understand a few applications #. Validation (time permitting) Abstract -------- Topological data analysis (TDA) is an ecosystem of (unsupervised) learning algorithms based on the ansatz: "Data has shape, shape has meaning." After a high level overview, we present the primary workhorse of TDA (Mapper) and two applications to healthcare. Time permitting, I will outline a couple validation strategies. Outline: -------- #. Part I: Zen of TDA #. Data -> Space #. Space -> low dimensional embedding #. Space -> structured features ((persistent) homology) #. Qualitative/computable features, e.g. number of connected components #. Part II: Examples #. DBSCAN #. Mapper #. Example of Mapper on a point cloud from a circle #. circle has two points connected by two qualitatively distinct paths #. Part III: Applications #. Breast Cancer #. Malaria Slides ------- Loose draft of the contents of my slide deck. Slide 1: Title Slide ~~~~~~~~~~~~~~~~~~~~ Title: Introduction to TDA: "Data has shape, shape has meaning" Author: Jeremy Mann, PhD Slide 2: What is TDA? ~~~~~~~~~~~~~~~~~~~~~~~~~~ - TDA is the application of homotopy theory to data science, primarily through an ecosystem of unsupervised learning algorithms - Homotopy theory = (De)Construction and analysis of "spaces" - "connectivity": when two things can be connected via a continuous path - "locality": computable via divide-and-conquer - Examples of spaces: sets, graphs, circles, torii, spheres, solutions to f(x)=0, invertible matrices, .... - Examples of analysis: number of connected components. Slide 3: Step I ~~~~~~~~~~~~~~~ - We want something of the form Data --> Insight - Idea: break this process into two steps: #. Data --> Space #. Space --> Insight - Space --> Structured data - Space --> Low dimensional embedding - Slide 3: Step I ~~~~~~~~~~~~~~~ - Idea: given input of some kind (a physical theory, a protein, a point cloud of structured data, an image, ...) "naturally" construct a shape whose properties reflect something interesting about the input. - Family tree, connectivity ~ kinship - Gauge theory, connectivity ~ physical indistinguishability - Point: All of these notions of connectivity (and more!) are specified once you construct a space! Slide 5: DBSCAN ~~~~~~~~~~~~~~~ - DSCAN is a classical clustering algorithm - Primary hyperparameter is a scale parameter :math:`\epsilon > 0` determining when two points are close/similar - Number of clusters is not explicitly specified ahead of time. - Breaks into two steps Slide 6: DBSCAN ~~~~~~~~~~~~~~~ #. DBSCAN constructs a space (graph) from a point cloud :math:`\{x_i\}` and a - Points: there is a point for every example - Edges: there is an edge between :math:`x_i` and :math:`x_j` whenever they are less than :math:`\epsilon` units away from each other. #. Two data points are declared to be in the same cluster if they can be connected by a path in this graph. - number of clusters = number of connected components of this space. Slide 6: DCSCAN ~~~~~~~~~~~~~~~ Explicit example Slide 7: Mapper ~~~~~~~~~~~~~~~ - Mapper modifies this construction in by integrating two pieces: - Allows for/leverages intersections of subpopulations - Integrates the data of a "lens" function : - particularly interesting feature(s) - highest component generated by PCA - loss function of a learning algorithm - points ~ subpopulations - edges ~ intersections of subpopulations - color ~ (average/median) value of original function - Mapper requires additional choice: (overlapping) cover of range of function Slide 8: Covering the range ~~~~~~~~~~~~~~~~~~~~~~~~~~~ - Essentially hyperparameter means - For every element of the cover, we have a subset of the data - For every intersectin in the data, we have a subset of the data. - These subsets sit inside of two larger subsets Slide 8: Mapper Constructs a Space ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #. Mapper constructs a space (graph) - First step: stratify data by the "lens" function. - KEY POINT: strata should be overlapping! - Perform DBSCAN - Say when two points are not being connected Slide 9: Mapper Constructs a Space ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Slide : Embed the Graph - There are many ways to embed graphs in low dimensions - These + mapper gives low dimensional visualizations of the data - Can remember the original lens function - The colors should vary continuously as you move along the edges of the graph Slide 9 ~~~~~~~ - Mapper breast cancer application - Interpretation of picture Slide 10 ~~~~~~~~ - Mapper malaria example Drawings: ~~~~~~~~~ 1. point cloud approximation of :math:`S^1 \coprod *` clustered by DBSCAN - draw the length scale :math:`\epsilon` on the side 2. picture of an overlapping cover. 3. picture of point cloud broken into independent parts and clustered accordingly 4.