Automation of Clustering
Algorithms, Measures and Challenges
Agenda
1. Introduction
2. Pipeline
3. Algorithms & Measures
4. Challenges & Future
1. Introduction
Concise definition of unsupervised learning: data compression and/or representation
Clustering
  High-variance input data -> Algorithm -> Discretized output data
  Input data -> Choose cluster representative -> Compressed data

  Input data        Compressed data
  id  x1  x2        id   Cluster id
  1                 1    1
  2                 2    3
  ...               ...
  n                 n    2

Dimensionality reduction
  High-variance input data -> Algorithm -> Alternate dimensional representation
  Input data -> Encode and subset -> Compressed data

  Input data        Compressed data
  id  x1  x2        id   P1
  1                 1
  2                 2
  ...               ...
  n                 n
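The two compression routes above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the method used in the talk: the clustering route is shown with Lloyd's k-means (one centroid-based choice among many), the dimensionality-reduction route with a single principal component of 2-D data, and the six sample rows, the function names `kmeans` and `first_principal_component`, and the evenly spaced initialization are all hypothetical.

```python
import math


def kmeans(points, k, iters=10):
    """Lloyd's k-means on tuples of floats; returns (labels, centroids)."""
    n = len(points)
    # Assumption: naive deterministic start from evenly spaced rows.
    centroids = [points[i * n // k] for i in range(k)]
    labels = [0] * n
    for _ in range(iters):
        # Assignment: index of the nearest centroid (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update: each centroid becomes the mean of the rows assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


def first_principal_component(points):
    """Project 2-D rows onto their leading principal axis (the P1 column)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Orientation of the leading eigenvector of the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    ux, uy = math.cos(theta), math.sin(theta)
    return [(p[0] - mx) * ux + (p[1] - my) * uy for p in points]


# Hypothetical input table with columns x1, x2 (row order plays the role of id).
points = [(1.0, 1.1), (0.9, 1.0), (1.1, 0.9), (5.0, 5.2), (5.1, 4.9), (4.9, 5.0)]

# Route 1: discretize, then keep one representative (the centroid) per cluster.
labels, centroids = kmeans(points, k=2)          # cluster ids are 0-based here
representatives = [centroids[lab] for lab in labels]

# Route 2: re-express the rows on a new axis and keep only that axis.
p1 = first_principal_component(points)
```

In the first route each row is stored as a cluster id plus a shared representative; in the second each row keeps a single real-valued coordinate instead of two. Both discard information, which is the sense in which the slides treat them as compression.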
2. Pipeline
3. Algorithms & Measures
Linear or non-linear representation learning
  Autoencoder
  PCA / SVD / NMF (special cases of autoencoder)

Clustering (discretization)
  Distance based, model based, spectral
  Centroid based, distribution based, density based, connectivity based
  Linkage
  Other considerations

Measurement
  Internal, external
  Loss / variance / distance / model based
  Other measures: Rand, Dice, ...
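To make the measurement step concrete, here is a small sketch of one internal and two external measures. It assumes that "Rand" and "Dice" on the slide refer to the usual pair-counting versions, which compare two labelings by asking, for every pair of items, whether both labelings put the pair in the same cluster. The within-cluster sum of squares stands in for the variance-based internal measures. All function names are illustrative.

```python
from itertools import combinations


def pair_counts(labels_a, labels_b):
    """Count item pairs by whether each labeling puts them in the same cluster."""
    both = only_a = only_b = neither = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            both += 1
        elif same_a:
            only_a += 1
        elif same_b:
            only_b += 1
        else:
            neither += 1
    return both, only_a, only_b, neither


def rand_index(labels_a, labels_b):
    """External measure: fraction of pairs on which the two labelings agree."""
    both, only_a, only_b, neither = pair_counts(labels_a, labels_b)
    total = both + only_a + only_b + neither
    return (both + neither) / total if total else 1.0


def dice_index(labels_a, labels_b):
    """External measure: Dice coefficient over co-clustered pairs."""
    both, only_a, only_b, _ = pair_counts(labels_a, labels_b)
    denom = 2 * both + only_a + only_b
    return 2 * both / denom if denom else 1.0


def within_cluster_ss(points, labels):
    """Internal measure: squared distance of each row to its cluster mean."""
    total = 0.0
    for c in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == c]
        mean = [sum(col) / len(members) for col in zip(*members)]
        total += sum((a - m) ** 2 for p in members for a, m in zip(p, mean))
    return total


# Hypothetical reference labeling and predicted labeling for six items.
truth = [0, 0, 0, 1, 1, 1]
pred = [1, 1, 0, 0, 0, 0]
ri = rand_index(truth, pred)   # 10 agreeing pairs out of 15
di = dice_index(truth, pred)   # 2*4 / (2*4 + 2 + 3)
```

Both external measures are invariant to renaming the cluster ids, but they need a reference labeling, which is exactly what is missing in a fully unsupervised setting. The internal measure needs only the data and the labels.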
4. Challenges & Future
Challenges
  Combinatorially large number of discretizations possible, even for a small number of clusters
  Ensembling is also a combinatorially large problem because ground truths are absent
  Energy-based models fail to learn anything larger than trivial problems

Future
  Strides towards a Boltzmann machine
  Merging information theory and clustering ideas
  A mathematically sound learning theory
  PAC learning for theory of compression
  Efficient learning algorithms

Timeline: Already happening, Near future, Few years, Few decades, Many decades
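The challenge about a combinatorially large number of discretizations can be made concrete by counting. The number of ways to split n items into exactly k non-empty, unlabeled clusters is the Stirling number of the second kind, S(n, k), which satisfies S(n, k) = k S(n-1, k) + S(n-1, k-1). The sketch below is an illustration added here, not part of the slides, and the function name `stirling2` is arbitrary.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n, k):
    """Number of ways to partition n items into k non-empty clusters."""
    if n == k:
        return 1          # includes the empty case n == k == 0
    if k == 0 or k > n:
        return 0
    # The n-th item either joins one of the k existing clusters
    # or forms a new cluster on its own.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


small = stirling2(10, 3)    # 9330
larger = stirling2(20, 4)   # 45232115901
```

Ten items and three clusters already allow 9330 distinct clusterings, and twenty items with four clusters allow more than 45 billion, so exhaustive search over discretizations is out of reach well before data sets reach a realistic size.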