Chat Classification using Natural Language Processing
Overview
Misbehavior and rumor detection from e-mail
Exploration of methods for effective rumor and misbehavior identification
Text processing options
(Optional) Feature extraction options
Information extraction for rumor or misbehavior
SOLUTION DESIGN: Components of the rumor/misbehavior identification engine
Hi Tema,
The CEO wants us to scale up.
Let us meet to discuss in detail.
Hi <Name>,
I’d prefer we join forces
Hi <Name>,
Perfick! Lets do this… Lets double team them.
Components
Preprocessing: Standard, Out-of-the-box
Feature Engineering: Standard, Out-of-the-box
Identification: Standard, Out-of-the-box
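The three components can be read as stages of a single pipeline. The following is a minimal sketch under stated assumptions: the function names, the tokenization rule, and the watch list are hypothetical stand-ins chosen for illustration, not the engine described in the deck.

```python
import re
from collections import Counter

# Hypothetical watch list; the deck does not give the real lexicon.
WATCH_TERMS = {"join forces", "double team"}

def preprocess(text):
    """Preprocessing: lowercase the message and split it into tokens."""
    text = text.lower()
    # Keep the <name> placeholder as a single token.
    return re.findall(r"<name>|[a-z']+", text)

def extract_features(tokens):
    """Feature engineering: counts of single words and adjacent word pairs."""
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)

def identify(features):
    """Identification: return the watch terms that occur in the message."""
    return sorted(term for term in WATCH_TERMS if features[term] > 0)

message = "Hi <Name>, I'd prefer we join forces"
print(identify(extract_features(preprocess(message))))
```

Each stage takes the previous stage's output, so a standard or out-of-the-box variant of one stage could be swapped in without touching the others.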
RESULTS: Spelling correction and exploration
The python bit the man
The snake bit the man
No correction required.
One correction should be done.
Training Corpus
Semantic similarity:
{“python”: [“snake”, “language”, “reptile”]}
{“correctino”: [“correction”, “correct”, “right”]}
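Semantic similarity treats two words as related when they appear in similar contexts in the training corpus. The sketch below illustrates the idea with plain co-occurrence counts and cosine similarity; it is a simplified stand-in for word2vec embeddings, and the toy corpus, the window size, and the function names are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

# Toy corpus; the third sentence is invented for contrast.
corpus = [
    "the python bit the man",
    "the snake bit the man",
    "the man wrote the correction",
]

def context_vectors(sentences, window=2):
    """Map each word to counts of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

vectors = context_vectors(corpus)
print(cosine(vectors["python"], vectors["snake"]))       # shared contexts
print(cosine(vectors["python"], vectors["correction"]))  # fewer shared contexts
```

Because "python" and "snake" occur in identical contexts here, their similarity is higher than that of "python" and "correction".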
Updated Lexicon
Metaphone match:
{“python”: “pylon”}
{“correctino”: “correct”}
{“correctino”: “correction”}
{“arriev”: “arrive”}
Fuzzy match:
{“python”: [“pylon”, “photon”]}
{“correctino”: [“correction”, “correct”]}
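Fuzzy matching ranks vocabulary entries by how closely their characters match the input word. A minimal sketch using `difflib.get_close_matches` from the Python standard library follows; the vocabulary and the 0.6 cutoff are assumptions, and the deck does not say which fuzzy-matching method or library was actually used.

```python
from difflib import get_close_matches

# Hypothetical vocabulary for illustration.
VOCABULARY = ["correction", "correct", "pylon", "photon", "arrive", "snake"]

def fuzzy_candidates(word, vocabulary=VOCABULARY, limit=3, cutoff=0.6):
    """Return up to `limit` vocabulary entries whose similarity ratio
    to `word` is at least `cutoff`, best match first."""
    return get_close_matches(word, vocabulary, n=limit, cutoff=cutoff)

for word in ("python", "correctino"):
    print(word, fuzzy_candidates(word))
```

Note that a correctly spelled word such as "python" still receives candidates, which is why fuzzy matching alone cannot decide whether a correction is required.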
Similarity Measures
Deep Dive
Semantic Similarity
The python bit the man
The snake bit the man
… (Corpus)
python ~ snake
Fuzzy Match
correction
correct
correctino
… (Vocabulary)
correctino ~ correction
Metaphone Match
correction: C623
correct: C623
correctino: C623
… (Vocabulary)
correctino ~ correction ~ correct
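A phonetic match groups words whose phonetic codes are equal. The C623 codes shown have the letter-plus-three-digits shape of Soundex, so the sketch below assumes classic American Soundex rather than Metaphone (which produces differently shaped keys); the function name is hypothetical.

```python
def soundex(word):
    """Classic American Soundex: first letter plus three digits."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    previous = codes.get(word[0], "")
    for ch in word[1:]:
        digit = codes.get(ch, "")
        # Emit a digit only when it differs from the previous code.
        if digit and digit != previous:
            result += digit
        # 'h' and 'w' do not separate repeated codes; vowels do.
        if ch not in "hw":
            previous = digit
    # Pad with zeros and keep exactly four characters.
    return (result + "000")[:4]

for word in ("correction", "correct", "correctino"):
    print(word, soundex(word))
```

All three words map to the same code, so the misspelling "correctino" lands in the same bucket as both "correction" and "correct"; a second measure is still needed to choose between them.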
ENHANCEMENTS: Supervised spelling correction and re-training models
Current Implementation
Enhancement
Basic text processing
Feature extraction for NLP
Regex based identification
Tokenization
Phrasing
word2vec (spell correct)
Fuzzy match (spell correct)
Summary
Direct word matching
Pattern matching
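Regex-based identification with direct word matching and pattern matching could look roughly like the sketch below. The term list, the pattern, and the function names are hypothetical; the deck does not list the actual expressions used.

```python
import re

# Hypothetical lexicon and patterns for illustration only.
DIRECT_TERMS = ["rumor", "join forces"]
PATTERNS = {
    "double_team": re.compile(r"\bdouble[\s-]?team(?:ing|ed)?\b", re.IGNORECASE),
}

def direct_matches(text):
    """Direct word matching: whole-word, case-insensitive lookup of fixed terms."""
    return [term for term in DIRECT_TERMS
            if re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE)]

def pattern_matches(text):
    """Pattern matching: named regular expressions that tolerate variants."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(text)]

print(direct_matches("I'd prefer we join forces"))
print(pattern_matches("Lets double team them."))
```

Direct matching only finds exact terms, while the pattern also covers variants such as "double-teaming"; neither handles misspellings, which is where the spell-correction steps come in.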
Extended text processing
Tokenization
Lemmatization
Phrasing
Feature extraction for NLP
word2vec (spell correct)
Fuzzy match (spell correct)
Text processing
Soundex (spell correct)
Supervised spell correct*
Note*: Supervised spell correct requires a lot of manually tagged data
Feature extraction for NLP
word2vec
tfidf
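As an illustration of the tfidf feature, the sketch below computes unsmoothed TF-IDF weights in plain Python. The formula variant (term frequency normalized by document length, idf = log(N / df)), the whitespace tokenization, and the sample documents are assumptions; libraries typically apply smoothed variants, so their numbers would differ.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return one {term: weight} dict per document.

    tf is the term count divided by document length;
    idf is log(N / df), so a term present in every document gets weight 0.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = ["the python bit the man", "the snake bit the man", "let us meet to discuss"]
rows = tfidf(docs)
print(rows[0])
```

Words shared by several messages ("the", "bit", "man") receive lower weights than words specific to one message ("python"), which is what makes the resulting vectors usable as input to a supervised classifier.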
Supervised machine learning for rumor/misbehavior detection
Threshold    Word count
0            45000
0.5          16000
0.7          6000
THANK YOU