Multinomial Naive Bayes, Support Vector Machine, Latent Dirichlet Allocation, Named Entity Recognition
The goal of this project was to test various NLP models as a prototype for a production data pipeline. The production pipeline will process unstructured text data, sort it into topics, and generate or extract metadata. Ultimately, data should be processed in real time as it is uploaded to a NoSQL database, with the extracted metadata added to the corresponding database object.
A client needs an automated way to group unstructured text data, some of it in foreign languages, into topics and to extract relevant metadata from it.
The working hypothesis was that NLP models can group unstructured text data into topics with a high degree of accuracy, and that Named Entity Recognition (NER) can be used to extract relevant metadata from the same data.
In testing, the NLP models proved not mature enough to group unstructured text data into topics without significant human-in-the-loop interaction, and NER could not reliably extract relevant metadata from unstructured data.
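To make the topic-grouping step concrete, here is a minimal pure-Python sketch of Multinomial Naive Bayes text classification with Laplace smoothing. It is an illustration only, not the project's code: the whitespace tokenizer, the function names (`train_mnb`, `predict_topic`), the toy training sentences, and the topic labels are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (an illustrative simplification)."""
    return text.lower().split()


def train_mnb(docs, labels, alpha=1.0):
    """Fit Multinomial Naive Bayes parameters with additive (Laplace) smoothing."""
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)

    model = {}
    n_docs = len(docs)
    for label, n_label_docs in class_doc_counts.items():
        total_words = sum(word_counts[label].values())
        denom = total_words + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_label_docs / n_docs),
            "log_likelihood": {
                word: math.log((word_counts[label][word] + alpha) / denom)
                for word in vocab
            },
        }
    return model


def predict_topic(model, text):
    """Return the highest-scoring topic; out-of-vocabulary words are ignored."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for token in tokenize(text):
            score += params["log_likelihood"].get(token, 0.0)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy training data, for illustration only.
TRAIN_DOCS = [
    "stock market shares fall",
    "bank raises interest rates",
    "team wins football match",
    "player scores winning goal",
]
TRAIN_LABELS = ["finance", "finance", "sports", "sports"]

MODEL = train_mnb(TRAIN_DOCS, TRAIN_LABELS)
print(predict_topic(MODEL, "interest rates and the stock market"))
```

A model like this only assigns topics it has seen labeled examples for, which is one reason supervised approaches such as MNB and SVM tend to need human-labeled training data before they are useful on new text.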
Best MNB Results
Best SVM Results
NER Example
Example of NER data extracted to JSON
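As a rough illustration of how extracted entities might be attached to a database object as JSON metadata, the sketch below groups (text, label) pairs by entity label. The entity list stands in for the output of an NER model; the document ID, the field names (`doc_id`, `entities`), the sample entities, and the function `build_metadata` are hypothetical and not taken from the project.

```python
import json


def build_metadata(doc_id, entities):
    """Group (text, label) entity pairs by label, dropping duplicates but keeping order."""
    grouped = {}
    for text, label in entities:
        values = grouped.setdefault(label, [])
        if text not in values:
            values.append(text)
    return {"doc_id": doc_id, "entities": grouped}


# Hypothetical NER output, for illustration only.
SAMPLE_ENTITIES = [
    ("Acme Corp", "ORG"),
    ("Berlin", "GPE"),
    ("Acme Corp", "ORG"),
    ("Jane Doe", "PERSON"),
]

RECORD = build_metadata("doc-001", SAMPLE_ENTITIES)
print(json.dumps(RECORD, indent=2))
```

Because the result is a plain dictionary of strings and lists, it can be serialized with `json.dumps` and stored alongside the source text in a document-oriented NoSQL store.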