Natural Language Processing

Comparing Algorithms

Multinomial Naive Bayes, Support Vector Machine, Latent Dirichlet Allocation, Named Entity Recognition

Comparing Natural Language Processing Models

A study of various NLP models as a prototype for a text classification, topic modeling, and metadata generation pipeline

Summary

The goal of this project was to test various NLP models as a prototype of a production data pipeline. That pipeline will process unstructured text data, sort it into topics, and generate and/or extract metadata. The ultimate goal is to process data in real time as it is uploaded to a NoSQL database, adding the extracted metadata to each database object.

Report Navigation

Problem Statement
Goals & Dataset
Text Classification
Topic Modeling
NER
Outcomes & Way Ahead

Problem Statement

A client needs an automated way to group unstructured, sometimes foreign-language, text data into topic groups and to extract relevant metadata.

Hypothesis: NLP models can group unstructured text data and extract relevant metadata with minimal human-in-the-loop interaction.

NLP models can, with a high degree of accuracy, group unstructured text data into topic groups. Named Entity Recognition (NER) can be used to extract relevant metadata from unstructured text data.

Null Hypothesis: NLP models are not mature enough to, or simply cannot, meaningfully group unstructured text data and extract relevant metadata.

NLP models are not mature enough to group unstructured text data into topic groups without significant human-in-the-loop interaction. NER cannot reliably extract relevant metadata from unstructured data.

Goals

  • Train NLP models for use in a production data pipeline for unstructured scientific and technical data
    • Automate Pre-processing
    • Text Classification
    • Topic Modeling
    • NER for metadata tagging
  • Determine feasibility of LDA for topic modeling in a production pipeline
  • Use NER to extract metadata

Dataset

  • Consists of 2,225 documents from the BBC News website, corresponding to stories in five topical areas from 2004-2005. Natural classes: 5 (business, entertainment, politics, sport, tech)
  • Data set came from an academic research project that I can’t even begin to comprehend
    • D. Greene and P. Cunningham. "Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering", Proc. ICML 2006.
  • Series of text documents in folders grouped by topic
  • Required very little wrangling (read in text, strip 1st line as title, split filepath to get topic, store in dataframe)
  • Dataset statistics (after trimming some outliers) are listed below

Generic dataset statistics

  • count: 2200.000000
  • mean: 198.202273
  • std: 86.017350
  • min: 46.000000
  • 25%: 132.000000
  • 50%: 178.000000
  • 75%: 251.000000
  • max: 499.000000

Count of Articles by Topic

  • business: 510
  • sport: 505
  • politics: 413
  • tech: 392
  • entertainment: 380

Text Classification: MNB & SVM

Figure 1: Best MNB results

Figure 2: Best SVM results
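
For readers who want to see what an MNB vs. SVM comparison looks like in code, here is a minimal sketch using scikit-learn's `make_pipeline`. The TF-IDF features, the `LinearSVC` variant of SVM, and the toy sentences are my assumptions for illustration; they are not the configuration behind Figures 1 and 2.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for the BBC articles (illustrative data, not from the report).
train_texts = [
    "the team won the football match",
    "the striker scored a late goal",
    "the coach praised the players after the game",
    "the new phone has a faster processor",
    "a software update fixes the security bug",
    "the company launched a new laptop",
]
train_labels = ["sport", "sport", "sport", "tech", "tech", "tech"]


def build_models():
    """Return one vectorizer + classifier pipeline per algorithm."""
    return {
        "mnb": make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB()),
        "svm": make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC()),
    }


models = build_models()
for model in models.values():
    model.fit(train_texts, train_labels)  # raw text in, fitted pipeline out

new_docs = ["the football team scored a goal", "a faster laptop with new software"]
predictions = {name: model.predict(new_docs).tolist() for name, model in models.items()}
print(predictions)
```

On the real dataset, scoring each pipeline with `sklearn.model_selection.cross_val_score` rather than a single train/test split is one way to check whether a very high accuracy is genuine or a sign of overfitting.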

Topic Modeling: LDA

Named Entity Recognition

Figure 3: NER example

Figure 4: Example of NER data extracted to JSON
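
To make the "NER to JSON" step concrete, here is a minimal sketch that groups recognized entities by label and serializes them as metadata. The JSON layout, the `entities_to_metadata` helper, and the sample entities are hypothetical; the report does not show the actual schema or name the NER library used.

```python
import json
from collections import defaultdict


def entities_to_metadata(doc_id, entities):
    """Group (text, label) pairs by label, de-duplicated, in order first seen."""
    grouped = defaultdict(list)
    for text, label in entities:
        if text not in grouped[label]:
            grouped[label].append(text)
    return {"doc_id": doc_id, "entities": dict(grouped)}


# Hypothetical entities for one article. With spaCy, for example, they could
# come from: [(ent.text, ent.label_) for ent in nlp(article_text).ents]
entities = [("BBC", "ORG"), ("London", "GPE"), ("BBC", "ORG"), ("2005", "DATE")]

metadata = entities_to_metadata("example-001", entities)
print(json.dumps(metadata, indent=2))
```

A dictionary in this shape can be attached to the stored document as-is, which fits the goal of adding extracted metadata to a NoSQL database object.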

Outcomes

  • Some key thoughts and takeaways
    • MNB vs. SVM vs. ??? How good is too good (model overfitting)?
    • Need more (and more diverse) data for further testing
  • LDA looks like it could be useful with a large enough dataset and good analytic insight...and patience
  • NER is a beautiful thing for metadata extraction
    • Need to customize it for specific client data
  • Pipeline and make_pipeline are...beautiful
  • Holy crap I'm glad I learned about pickle
    • Pickling trained models!?!? Yes please and thank you!
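
For anyone who has not pickled a model before, here is a minimal sketch of saving and reloading a trained scikit-learn pipeline. The toy training data and the `model.pkl` file name are placeholders for illustration.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data (illustrative only).
texts = [
    "the team won the match",
    "the striker scored a goal",
    "the new phone has a fast processor",
    "a software update fixes the bug",
]
labels = ["sport", "sport", "tech", "tech"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.pkl"  # placeholder file name
    with path.open("wb") as f:
        pickle.dump(model, f)       # vectorizer and classifier saved together
    with path.open("rb") as f:
        restored = pickle.load(f)   # ready to predict, no retraining needed

print(restored.predict(["the goal won the match"]))
```

Two caveats: only unpickle files you trust, because loading a pickle can execute arbitrary code, and load the model with the same scikit-learn version that saved it.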

The Way Ahead

  • Read from and write to MongoDB instead of a pandas DataFrame
  • Customized stop words list to remove things like "say" and "bbc"
  • Incorporate foreign language handling
    • Machine translation (Amazon Translate, Google Cloud Translation, etc.), plus OCR (e.g., Tesseract) for scanned text
    • NLP in foreign languages
  • Customize NER for tailored results
  • Incorporate sentiment analysis for social media data
  • Web application for data interaction with the following functions:
    • Django / Flask ??
    • Visualize existing data
    • Upload new data/documents for automated processing