NLP Project

Comparing Natural Language Processing Models

A study of several NLP models as a prototype for an NLP pipeline covering text classification, topic modeling, and metadata generation

Summary

The goal of this project was to test various NLP models as a prototype of a production data pipeline. The purpose of the production data pipeline is to process unstructured text-based data, sorting said data into topics and generating and/or extracting metadata. The ultimate goal is to have data processed in real time as unstructured data is uploaded to a NoSQL database with extracted metadata added to the database object.

Report Navigation

Problem Statement

Goals & Dataset

Text Classification

Topic Modeling

NER

Outcomes & Way Ahead

Problem Statement

A client needs an automated way to sort unstructured, sometimes foreign-language, text data into topic groups and to extract relevant metadata.

Hypothesis: NLP models can group unstructured text data and extract relevant metadata with minimal human-in-the-loop interaction.

NLP models can, with a high degree of accuracy, group unstructured text data into topic groups. Named Entity Recognition (NER) can be used to extract relevant metadata from unstructured text data.

Null Hypothesis: NLP models are not mature enough or cannot meaningfully group unstructured text data and extract relevant metadata.

NLP models are not mature enough to group unstructured text data into topic groups without significant human-in-the-loop interaction. NER cannot reliably extract relevant metadata from unstructured data.

Goals

Train NLP models for use in a production data pipeline for unstructured scientific and technical data

Automate Pre-processing
Text Classification
Topic Modeling
NER for metadata tagging

Determine the feasibility of LDA for topic modeling in a production pipeline
Use NER to extract metadata

Dataset

Consists of 2225 documents from the BBC news website corresponding to stories in five topical areas from 2004-2005.
Natural Classes: 5 (business, entertainment, politics, sport, tech)
Data set came from an academic research project that I can’t even begin to comprehend

D. Greene and P. Cunningham. "Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering", Proc. ICML 2006.

Series of text documents in folders grouped by topic
Required very little wrangling (read in text, strip 1st line as title, split filepath to get topic, store in dataframe)
Data Set stats (after trimming some outliers) are listed below

Generic dataset statistics

count: 2200.000000
mean: 198.202273
std: 86.017350
min: 46.000000
25%: 132.000000
50%: 178.000000
75%: 251.000000
max: 499.000000

Count of Articles by Topic

business: 510
sport: 505
politics: 413
tech: 392
entertainment: 380
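The wrangling steps described for this dataset (read in text, strip the first line as the title, split the filepath to get the topic, store in a dataframe) can be sketched roughly as below. This is a minimal sketch and not the project's actual loader: the function names `load_records` and `to_dataframe`, the `<topic>/<file>.txt` folder layout, and the `latin-1` encoding are all assumptions made for illustration.

```python
from pathlib import Path


def load_records(root):
    """Read every <root>/<topic>/<file>.txt into a list of dicts."""
    records = []
    for path in sorted(Path(root).glob("*/*.txt")):
        # latin-1 is an assumption; it decodes any byte sequence without raising.
        lines = path.read_text(encoding="latin-1").splitlines()
        if not lines:
            continue  # skip empty files
        records.append({
            "topic": path.parent.name,  # the folder name is the topic label
            "title": lines[0].strip(),  # the first line is treated as the title
            "text": " ".join(ln.strip() for ln in lines[1:] if ln.strip()),
        })
    return records


def to_dataframe(records):
    """Wrap the records in a pandas DataFrame (hypothetical helper)."""
    import pandas as pd  # imported lazily so load_records stays stdlib-only

    return pd.DataFrame(records, columns=["topic", "title", "text"])
```

Calling `to_dataframe(load_records("path/to/dataset"))` would then give one row per article, with the path being a placeholder for wherever the topic folders live.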

Text Classification: MNB & SVM

Text classification works...too well?

This data was likely too good/clean, leading to a potentially overfit model
Need LOTS more data to see if the model is truly overfit

Pre-processing options:

Manual, TfidfVectorizer, PorterStemmer, Lemmatizer, others.

MNB achieved 94.8% - 96.8% accuracy
SVM achieved 97.5% accuracy
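For reference, a TF-IDF + MNB and a TF-IDF + SVM classifier can each be built with `make_pipeline` in a few lines. This is only a sketch under stated assumptions: the four toy sentences stand in for the BBC articles, `LinearSVC` is assumed as the SVM variant, and the settings are sklearn defaults rather than whatever produced the accuracies above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in corpus (assumption); the real run used the BBC articles.
texts = [
    "the striker scored a late goal in the cup match",
    "the coach praised the team after the match",
    "shares fell as the bank reported weak profits",
    "the company raised profits and shares climbed",
]
labels = ["sport", "sport", "business", "business"]


def build_models():
    """Return two untrained pipelines: TF-IDF + MNB and TF-IDF + linear SVM."""
    return {
        "mnb": make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB()),
        "svm": make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC()),
    }


models = build_models()
for model in models.values():
    model.fit(texts, labels)  # vectorizing and training happen in one call

new_doc = ["the team scored in the match"]
predictions = {name: model.predict(new_doc)[0] for name, model in models.items()}
```

On real data, scoring with a held-out split or `cross_val_score` from `sklearn.model_selection` is one way to probe the "too good?" question, since it measures accuracy on documents the model never saw during training.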

Figure 1

Best MNB Results

Figure 2

Best SVM Results

Topic Modeling: LDA

Can you be a library/model purist? (hint: No, you can't)

You have to leverage the library/model that works in the moment
nltk, gensim, spaCy, sklearn, etc…

What is LDA good for anyway?

People with a ton of time and patience...
Or for determining topics from bulk unstructured data
Or for clustering classified documents to gain new insights
Or...Or...Or...

Can LDA make predictions?

Why yes, yes it can...Just not as easily as text classification
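One way to get a "prediction" out of LDA is to ask a fitted model for the topic mixture of an unseen document and take the heaviest topic. The sketch below assumes sklearn's `LatentDirichletAllocation` (the project may well have used gensim instead), a toy corpus, and a hypothetical helper named `topic_mixture`.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in corpus (assumption).
docs = [
    "the team won the match with a late goal",
    "the coach picked a new team for the cup match",
    "the bank reported profits and shares rose",
    "shares in the company fell after weak profits",
]

# LDA works on raw term counts rather than TF-IDF weights.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)


def topic_mixture(text):
    """Return the topic distribution of one unseen document."""
    return lda.transform(vectorizer.transform([text]))[0]


mixture = topic_mixture("the team scored a goal")
best_topic = int(mixture.argmax())  # index of the dominant topic
```

Unlike a classifier, the result is a topic index, not a named label. Someone still has to look at each topic's top words and decide what to call it, which is part of why this is "not as easy" as text classification.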

Check out the links below for interactive LDA pages, modeling 5, 10, and 20 topic groups

It becomes obvious that 5 topics is the sweet spot. This is interesting because the dataset already had the articles broken into 5 broad topic groups. With further data cleansing to remove superfluous stop words, sub-categories might emerge that would allow finer-grained grouping and insight
5-Topic LDA Page
10-Topic LDA Page
20-Topic LDA Page

Named Entity Recognition

Not much to say here
I used spaCy, but I'm open to Stanford's NER stuff as well
NER just works. Simple as that. For metadata extraction - it just works

CAVEAT - this is a manual process
The data extracted can be added to NoSQL database objects as metadata tags, aiding in indexing/searching/classifying the data

The pre-trained spaCy model is pretty fantastic
I really need to learn how to train a custom NER model for highly technical and/or foreign language data
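As a rough illustration of the metadata extraction described above, the sketch below groups the entities found by an NER pipeline into a JSON-ready dictionary keyed by entity label. The function names `extract_metadata` and `to_json` and the `{label: [texts]}` output shape are assumptions for illustration, not the format shown in the figures. The `nlp` argument is expected to be a loaded spaCy pipeline.

```python
import json
from collections import defaultdict


def extract_metadata(text, nlp):
    """Group the entities found by `nlp` into {label: [unique entity texts]}."""
    doc = nlp(text)
    grouped = defaultdict(list)
    for ent in doc.ents:
        # Keep first-seen order and drop repeats of the same entity text.
        if ent.text not in grouped[ent.label_]:
            grouped[ent.label_].append(ent.text)
    return dict(grouped)


def to_json(metadata):
    """Serialize the metadata so it can be attached to a database object."""
    return json.dumps(metadata, indent=2, sort_keys=True)


# Typical use, assuming spaCy and its small English model are installed:
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   print(to_json(extract_metadata("Apple opened an office in London.", nlp)))
```

Keeping the pipeline as a parameter means a custom-trained or foreign-language model could be swapped in later without touching the extraction code.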

Figure 3

NER Example

Figure 4

Example of NER data extracted to JSON

Outcomes

Some key thoughts and takeaways

MNB vs SVM vs ?? ... How good is too good (model overfit)?
Need more (and more diverse) data for further testing

LDA looks like it could be useful with a large enough dataset and good analytic insight...and patience
NER is a beautiful thing for metadata extraction

Need to customize it for specific client data

Pipeline and make_pipeline are...beautiful
Holy crap I'm glad I learned about pickle

Pickling trained models!?!? Yes please and thank you!
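For anyone who hasn't met pickle yet, saving and reloading a trained model looks roughly like the sketch below. The helper names `save_model` and `load_model` are hypothetical, and the same two functions work for any picklable Python object, including a fitted sklearn pipeline.

```python
import pickle
from pathlib import Path


def save_model(model, path):
    """Serialize a trained model (or any picklable object) to disk."""
    with Path(path).open("wb") as fh:
        pickle.dump(model, fh)


def load_model(path):
    """Load a previously pickled model from disk."""
    # Only unpickle files you created or trust: loading can run arbitrary code.
    with Path(path).open("rb") as fh:
        return pickle.load(fh)
```

One practical caveat: a pickle is tied to the code that created it, so a model saved under one library version is not guaranteed to load cleanly under another.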

The Way Ahead

Read/write to MongoDB vs Pandas Dataframe
Customized stop words list to remove things like "say" and "bbc"
Incorporate foreign language handling

Machine translation (AWS Translate, Google Cloud Translation, etc.); Tesseract for OCR
NLP in foreign languages

Customize NER for tailored results
Incorporate sentiment analysis for social media data
Web application for data interaction with the following functions:

Django / Flask ??
Visualize existing data
Upload new data/documents for automated processing
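The customized stop words item in the list above could start from sklearn's built-in English list and add corpus-specific terms. This is a sketch assuming sklearn's `CountVectorizer`; the names `EXTRA_STOP_WORDS` and `stop_words` are illustrative, and "says" and "said" are extra guesses beyond the "say" and "bbc" mentioned in the list.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, CountVectorizer

# "say" and "bbc" come from the list above; "says" and "said" are assumed extras.
EXTRA_STOP_WORDS = {"say", "says", "said", "bbc"}

# The vectorizer expects a list, so the merged set is converted explicitly.
stop_words = sorted(ENGLISH_STOP_WORDS.union(EXTRA_STOP_WORDS))

vectorizer = CountVectorizer(stop_words=stop_words)
vectorizer.fit([
    "The BBC said the minister will say more",
    "Analysts say shares rose",
])
vocabulary = set(vectorizer.vocabulary_)  # terms that survived filtering
```

Stop words match exact tokens only, so without stemming or lemmatization each inflected form ("says", "said") has to be listed separately.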