Automation of Clustering

Algorithms, Measures and Challenges

Agenda

1 Introduction

2 Pipeline

3 Algorithms & Measures

4 Challenges & Future

1 Introduction

Concise definition of unsupervised learning: data compression and/or representation

Clustering

High variance input data -> Algorithm -> Discretized output data

Dimensionality Reduction

High variance input data -> Algorithm -> Alternate dimensional representation

Clustering as compression (choose a cluster representative)

Input data          Compressed data
id   x1   x2        id   Cluster id
1                   1    1
2                   2    3
...                 ...
n                   n    2

Dimensionality reduction as compression (encode and subset)

Input data          Compressed data
id   x1   x2        id   P1
1                   1
2                   2
...                 ...
n                   n
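The clustering half of the diagram (each input row replaced by the id of a cluster representative) can be sketched in a few lines. The slides do not name an algorithm at this point, so k-means (Lloyd's iteration) is used here purely as an assumed illustration; the function names, the toy points, and the choice of initial representatives are all hypothetical.

```python
from math import dist


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
        for p in points
    ]


def update(points, labels, centroids):
    """Move each centroid to the mean of its members (keep it if empty)."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if members:
            mean = tuple(sum(axis) / len(members) for axis in zip(*members))
            new_centroids.append(mean)
        else:
            new_centroids.append(old)
    return new_centroids


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update until the labels stop changing."""
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Toy input table: six rows with columns (x1, x2).
points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.8), (9.0, 1.0), (9.1, 1.2)]

# Assumption: one initial representative is hand-picked per visible group.
labels, centroids = kmeans(points, [points[0], points[2], points[4]])

# Compressed table: (id, cluster id), mirroring the slide's output table.
compressed = list(enumerate(labels, start=1))
```

The compressed table keeps one small integer per row plus one representative per cluster, which is the sense in which clustering acts as compression here.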

2 Pipeline

3 Algorithms & Measures

Representation learning (linear or non-linear)

Autoencoder

PCA / SVD / NMF (special cases of autoencoder)
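As a rough illustration of linear representation learning, the sketch below computes a single principal component (the P1 column of the compressed table) for a two-column input. It assumes power iteration on the sample covariance matrix as the solver, which is one simple option and not something the slides specify; the helper names and toy data are hypothetical, and in practice a numerical library would normally do this step.

```python
from math import sqrt


def center(rows):
    """Subtract the column means so every column has mean zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def leading_eigenvector(matrix, iterations=200):
    """Power iteration: repeatedly multiply and renormalise a start vector."""
    d = len(matrix)
    v = [1.0 / sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


# Toy input table with columns (x1, x2) that vary almost together.
rows = [[2.0, 1.9], [0.0, 0.1], [-2.0, -2.1], [1.0, 1.1], [-1.0, -1.0]]

centered = center(rows)
direction = leading_eigenvector(covariance(centered))

# Encode: project every row onto the leading direction to get P1.
p1 = [sum(x * w for x, w in zip(row, direction)) for row in centered]
```

Keeping only `p1` halves the number of stored columns while retaining most of the variance in this toy data, because the two inputs are nearly collinear.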

Clustering (Discretization)

Distance based, Model based, Spectral

Centroid based, Distribution based, Density based, Connectivity based

Linkage

Other considerations

Measurement

Internal, External

Loss / variance / distance / model based

Rand

Dice

...

Other measures
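To make the internal/external distinction concrete, here is a minimal sketch of three measures. The Rand index is computed from pairs of points; the Dice coefficient is applied to the sets of co-clustered pairs, which is one common pair-counting reading and an assumption here; the within-cluster sum of squares stands in for the variance-based internal measures. Function names and data are hypothetical.

```python
from itertools import combinations
from math import dist


def co_clustered_pairs(labels):
    """Set of index pairs (i, j), i < j, that share a cluster label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def rand_index(truth, predicted):
    """External: fraction of pairs on which the two labelings agree."""
    n = len(truth)
    if n != len(predicted) or n < 2:
        raise ValueError("need two labelings of the same length >= 2")
    total = n * (n - 1) // 2
    same_t, same_p = co_clustered_pairs(truth), co_clustered_pairs(predicted)
    both_same = len(same_t & same_p)
    both_diff = total - len(same_t | same_p)
    return (both_same + both_diff) / total


def dice_pairs(truth, predicted):
    """External: Dice overlap 2|A & B| / (|A| + |B|) of co-clustered pairs."""
    a, b = co_clustered_pairs(truth), co_clustered_pairs(predicted)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def within_cluster_ss(points, labels):
    """Internal: sum of squared distances to each cluster's centroid."""
    total = 0.0
    for cluster in set(labels):
        members = [p for p, label in zip(points, labels) if label == cluster]
        centroid = tuple(sum(axis) / len(members) for axis in zip(*members))
        total += sum(dist(p, centroid) ** 2 for p in members)
    return total


points = [(0.0,), (2.0,), (10.0,), (12.0,)]
truth = [0, 0, 1, 1]
renamed = [1, 1, 0, 0]   # same partition, different label names
merged = [0, 0, 0, 1]    # one point placed in the wrong cluster

scores = {
    "rand_renamed": rand_index(truth, renamed),
    "rand_merged": rand_index(truth, merged),
    "dice_merged": dice_pairs(truth, merged),
    "wcss_truth": within_cluster_ss(points, truth),
    "wcss_merged": within_cluster_ss(points, merged),
}
```

The external measures need a reference labeling (`truth`), while the internal one needs only the points and the proposed labels, which is why internal measures are the ones available when no ground truth exists.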

4 Challenges & Future

Challenges

Combinatorially large number of discretizations possible even for a small number of clusters

Ensembling is also a combinatorially large problem because ground truths are absent

Energy based models fail to learn anything larger than trivial problems

Future

Strides towards a Boltzmann machine

Merging information theory and clustering ideas

A mathematically sound learning theory

…

PAC learning for theory of compression

Efficient learning algorithms

Timeline: Already happening, Near future, Few years, Few decades, Many decades
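The challenge about the combinatorially large number of discretizations can be made concrete by counting partitions. The number of ways to split n labelled points into exactly k non-empty clusters is the Stirling number of the second kind, S(n, k), which satisfies S(n, k) = k S(n-1, k) + S(n-1, k-1). The sketch below is a minimal illustration of that recurrence; the function name is hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n, k):
    """Ways to partition n labelled points into exactly k non-empty clusters."""
    if n == k:
        return 1          # every point alone (also covers n == k == 0)
    if k == 0 or k > n:
        return 0          # no clusters for n > 0 points, or too many clusters
    # Point n either joins one of the k existing clusters or starts its own.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


# Growth of the search space for a fixed, small number of clusters (k = 3).
for n in (10, 25, 50, 100):
    print(n, stirling2(n, 3))
```

Even for 10 points and 3 clusters there are already 9330 distinct partitions, and the count grows roughly like 3^n / 3!, so exhaustive search over discretizations is not feasible beyond toy sizes.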