Introduction to Topological Data Analysis
=========================================

TODO:
-----

[x] Make Goals

[x] Make Outline

[x] Make abstract

[ ] Make Deck text

[ ] Make Deck drawings

[ ] Make deck pdf

Goals
------

#. Understand the general "zen" of TDA
#. Understand concrete example(s)
#. Understand a few applications 
#. Validation (time permitting)


Abstract
--------

Topological data analysis (TDA) is an ecosystem of (unsupervised) learning 
algorithms based on the ansatz: "Data has shape, shape has meaning." After a high
level overview, we present the primary workhorse of TDA (Mapper) and two 
applications to healthcare. Time permitting, I will outline a couple validation 
strategies.

Outline:
--------

#. Part I: Zen of TDA

   #. Data -> Space
   #. Space -> low dimensional embedding
   #. Space -> structured features ((persistent) homology)

      #. Qualitative/computable features, e.g. number of connected components

#. Part II: Examples

   #. DBSCAN
   #. Mapper
   #. Example of Mapper on a point cloud from a circle

      #. circle has two points connected by two qualitatively distinct paths

#. Part III: Applications

   #. Breast Cancer
   #. Malaria

Slides
-------

Loose draft of the contents of my slide deck.


Slide 1: Title Slide
~~~~~~~~~~~~~~~~~~~~

Title: Introduction to TDA: "Data has shape, shape has meaning"

Author: Jeremy Mann, PhD

Slide 2: What is TDA?
~~~~~~~~~~~~~~~~~~~~~~~~~~

- TDA is the application of homotopy theory to data science, primarily through 
  an ecosystem of unsupervised learning algorithms
- Homotopy theory = (De)Construction and analysis of "spaces"
   - "connectivity": when two things can be connected via a continuous path
   - "locality": computable via divide-and-conquer
- Examples of spaces: sets, graphs, circles, torii, spheres, solutions to f(x)=0,
  invertible matrices, ....
- Examples of analysis: number of connected components.

Slide 3: Step I
~~~~~~~~~~~~~~~

- We want something of the form Data --> Insight
- Idea: break this process into two steps:

  #. Data --> Space
  #. Space --> Insight 

       - Space --> Structured data
       - Space --> Low dimensional embedding
- 
Slide 3: Step I
~~~~~~~~~~~~~~~

- Idea: given input of some kind (a physical theory, a protein, a point cloud
  of structured data, an image, ...) "naturally" construct a shape whose properties
  reflect something interesting about the input.
   - Family tree, connectivity ~ kinship
   - Gauge theory, connectivity ~ physical indistinguishability
- Point: All of these notions of connectivity (and more!) are specified once 
  you construct a space!

Slide 5: DBSCAN
~~~~~~~~~~~~~~~

- DSCAN is a classical clustering algorithm 
- Primary hyperparameter is a scale parameter :math:`\epsilon > 0` determining 
  when two points are close/similar 
- Number of clusters is not explicitly specified ahead of time. 
- Breaks into two steps 

Slide 6: DBSCAN
~~~~~~~~~~~~~~~

#. DBSCAN constructs a space (graph) from a point cloud :math:`\{x_i\}` and a 

   - Points: there is a point for every example
   - Edges: there is an edge between :math:`x_i` and :math:`x_j` whenever they are 
     less than :math:`\epsilon` units away from each other.
#. Two data points are declared to be in the same cluster if they can be connected
   by a path in this graph.

   - number of clusters = number of connected components of this space.

Slide 6: DCSCAN
~~~~~~~~~~~~~~~

Explicit example


Slide 7: Mapper
~~~~~~~~~~~~~~~

- Mapper modifies this construction in by integrating two pieces:
  - Allows for/leverages intersections of subpopulations
  - Integrates the data of a "lens" function :
     - particularly interesting feature(s)
     - highest component generated by PCA
     - loss function of a learning  algorithm
  - points ~ subpopulations 
  - edges ~ intersections of subpopulations
  - color ~ (average/median) value of original function
- Mapper requires additional choice: (overlapping) cover of range of function 

Slide 8: Covering the range
~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Essentially hyperparameter means
- For every element of the cover, we have a subset of the data
- For every intersectin in the data, we have a subset of the data.
    - These subsets sit inside of two larger subsets

Slide 8: Mapper Constructs a Space
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

#. Mapper constructs a space (graph)
- First step: stratify data by the "lens" function.
  - KEY POINT: strata should be overlapping!
- Perform DBSCAN 
- Say when two points are not being connected

Slide 9: Mapper Constructs a Space
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Slide : Embed the Graph
- There are many ways to embed graphs in low dimensions 
- These + mapper gives low dimensional visualizations of the data
- Can remember the original lens function 
     - The colors should vary continuously as you move along the edges of the 
       graph
Slide 9
~~~~~~~

- Mapper breast cancer application  
- Interpretation of picture 

Slide 10
~~~~~~~~

- Mapper malaria example

Drawings:
~~~~~~~~~

1. point cloud approximation of :math:`S^1 \coprod *` clustered by DBSCAN
   - draw the length scale :math:`\epsilon` on the side
2. picture of an overlapping cover.
3. picture of point cloud broken into independent parts and clustered accordingly
4.