Chat Classification using Natural Language Processing

Overview

Misbehavior and rumor detection from e-mail

Exploration of methods for effective rumor and misbehavior identification

Text processing options

(Optional) Feature extraction options

Information extraction – rumor or misbehavior

SOLUTION DESIGN │ Components of rumor/misbehavior identification engine

Hi Tema,

The CEO wants us to scale up.

…

Let us meet to discuss in detail.

Hi <Name>,

I’d prefer we join forces

Hi <Name>,

Perfick! Lets do this… Lets double team them.

Components

Preprocessing

Standard

Out-of-the-box

Feature Engineering

Standard

Out-of-the-box

Identification

Standard

Out-of-the-box

RESULTS │ Spelling correction and exploration

…

The python bit the man

The snake bit the man

No correction required.

One correction should be done.

…

Training Corpus

Semantic similarity:

{“python”: “snake”, “language”, “reptile”}

{“correctino”: “correction”, “correct”, “right”}

Updated Lexicon

Metaphone match:

{“python”: “pylon”}

{“correctino”: “correct”}

{“correctino”: “correction”}

{“arriev”: “arrive”}

Fuzzy match:

{“python”: “pylon”, “photon”}

{“correctino”: “correction”, “correct”}

Similarity Measures

Deep Dive

Semantic Similarity

Fuzzy Match

Metaphone Match

…

The python bit the man

The snake bit the man

… (Corpus)

python ~ snake
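The corpus panel above pairs "python" with "snake" because both words appear in the same contexts. A minimal sketch of that idea follows, assuming plain co-occurrence counts with cosine similarity as a simple stand-in for the word2vec model named elsewhere in this deck. The function names and the window size are illustrative choices, not part of the original engine.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=2):
    """Count, for every word, the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# The two corpus sentences from the slide.
corpus = ["The python bit the man", "The snake bit the man"]
vectors = context_vectors(corpus)

sim_python_snake = cosine(vectors["python"], vectors["snake"])
sim_python_man = cosine(vectors["python"], vectors["man"])
print(sim_python_snake, sim_python_man)
```

On a two-sentence corpus every word shares most of its context, so only the relative ordering of the scores is meaningful; a trained embedding model on a large corpus would separate unrelated words far more sharply.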

…

correction

correct

correctino

… (Vocabulary)

correctino ~ correction
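The fuzzy match above can be sketched with the standard library's difflib, which scores string pairs by the share of characters they have in common. This is an assumption about how such a lookup could be done, not the deck's actual implementation; the small lexicon and the 0.7 cutoff are illustrative.

```python
import difflib

# Illustrative vocabulary; a real lexicon would come from the training corpus.
lexicon = ["correction", "correct", "arrive", "pylon", "photon", "python", "snake"]


def fuzzy_candidates(word, vocabulary, limit=3, cutoff=0.7):
    """Return up to `limit` vocabulary entries whose similarity ratio to `word`
    is at least `cutoff`, best match first."""
    return difflib.get_close_matches(word, vocabulary, n=limit, cutoff=cutoff)


for misspelled in ("correctino", "arriev"):
    print(misspelled, "->", fuzzy_candidates(misspelled, lexicon))
```

Raising the cutoff returns fewer, safer candidates; lowering it admits more distant words, which is why a fuzzy matcher is usually combined with a second signal such as context or phonetics.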

…

correction: C623

correct: C623

correctino : C623

… (Vocabulary)

correctino ~ correction ~ correct
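The four-character keys listed above (C623) have the form produced by the classic Soundex algorithm, so the sketch below implements Soundex; a Metaphone encoder would produce different keys. The function name is illustrative, and this is a minimal version of the standard rules rather than the code used in the original engine.

```python
# Soundex digit for each consonant group; vowels, h, w and y carry no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the 4-character Soundex key: first letter plus three digits."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        # Emit a digit only when it differs from the previous letter's digit.
        if code and code != prev:
            key += code
        # h and w do not separate two letters that share a digit; vowels do.
        if ch not in "hw":
            prev = code
    return (key + "000")[:4]


for word in ("correction", "correct", "correctino", "arriev", "arrive"):
    print(word, soundex(word))
```

Words that share a key are treated as phonetic matches, which explains why "correctino" maps to both "correction" and "correct" here.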

ENHANCEMENTS │ Supervised spelling correction and re-training models

Current Implementation

Enhancement

Basic text processing

Tokenization

Phrasing

Feature extraction for NLP

word2vec (spell correct)

Fuzzy match (spell correct)

Regex-based identification

Summary

Direct word matching

Pattern matching
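Direct word matching and pattern matching can be sketched with the standard re module. The watch words and patterns below are hypothetical, chosen only to echo the sample messages earlier in the deck; the real lists used by the engine are not given in this document.

```python
import re

# Hypothetical watch list for direct word matching.
WATCH_WORDS = {"rumor", "rumour"}

# Hypothetical phrase patterns for regex-based matching.
WATCH_PATTERNS = [
    re.compile(r"\bjoin\s+forces\b", re.IGNORECASE),
    re.compile(r"\bdouble[\s-]+team\b", re.IGNORECASE),
]


def flag_message(text):
    """Return the watch words and pattern hits found in a message."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    hits = sorted(tokens & WATCH_WORDS)
    for pattern in WATCH_PATTERNS:
        hits.extend(match.group(0).lower() for match in pattern.finditer(text))
    return hits


for message in ("Lets double team them.",
                "I'd prefer we join forces",
                "The CEO wants us to scale up."):
    print(message, "->", flag_message(message))
```

An empty result means nothing on the lists was found, not that the message is harmless; exact matching of this kind misses misspelled terms, which is the gap the spelling-correction step is meant to close.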

Extended text processing

Tokenization

Lemmatization

Phrasing
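Tokenization and phrasing from the list above can be sketched as follows, assuming "phrasing" means merging word pairs that occur together often into a single token. Lemmatization is left out because it needs a dictionary or an external library. The function names, the sample messages, and the min_count value are illustrative.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())


def learn_phrases(token_lists, min_count=2):
    """Collect adjacent word pairs that occur at least `min_count` times."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(zip(tokens, tokens[1:]))
    return {pair for pair, n in counts.items() if n >= min_count}


def apply_phrases(tokens, phrases):
    """Merge each learned pair into one token joined by an underscore."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical messages used only to show the mechanics.
messages = ["Lets double team them", "We should double team the vendor"]
token_lists = [tokenize(m) for m in messages]
phrases = learn_phrases(token_lists)
print([apply_phrases(tokens, phrases) for tokens in token_lists])
```

Treating a frequent pair as one token lets later steps, such as word matching or embeddings, see "double_team" as a unit rather than two unrelated words.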

Feature extraction for NLP

word2vec (spell correct)

Fuzzy match (spell correct)

Text processing

Soundex (spell correct)

Supervised spell correct*

Note*: Supervised spell correct requires a lot of manually tagged data.

Feature extraction for NLP

word2vec

tfidf
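For the tfidf feature listed above, a minimal sketch of the weighting is shown below. It assumes the plain textbook form, term frequency times log(N / document frequency); library implementations commonly add smoothing and normalization, so their numbers will differ. The function name is illustrative.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = ["the python bit the man", "the snake bit the man"]
weights = tfidf(docs)
print(weights[0])
```

Terms that occur in every document receive a weight of zero, so only the distinguishing words ("python", "snake") carry signal into a downstream classifier.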

Supervised machine learning for rumor/misbehavior detection
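The deck does not say which supervised model is intended, so the sketch below uses a small multinomial Naive Bayes classifier purely as an example of the approach. The labels and the four training messages are invented for illustration; a usable model would need a large manually tagged set.

```python
import math
from collections import Counter, defaultdict


def train_nb(examples):
    """Collect class counts, per-class word counts and the vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        tokens = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, text):
    """Return the label with the highest log-probability (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in text.lower().split():
            if token in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labeled messages, for illustration only.
training = [
    ("i heard the ceo is leaving", "rumor"),
    ("they say layoffs are coming", "rumor"),
    ("let us meet to discuss in detail", "other"),
    ("please review the attached report", "other"),
]
model = train_nb(training)
print(predict_nb(model, "i heard layoffs are coming"))
print(predict_nb(model, "let us meet"))
```

In practice the token counts would be replaced by the tfidf or word2vec features listed above, and the spelling-correction step would run first so that misspelled words map onto known vocabulary.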

Threshold    Word count
0            45000
0.5          16000
0.7          6000

THANK YOU