Applying Natural Language Processing (NLP) to unstructured data extraction

In our previous blog, “Fitting a square peg in a round hole – managing unstructured data,” we pointed out the need for a non-traditional approach to managing unstructured data (mainly text-based data). In this blog, we look at Natural Language Processing (NLP) as a viable “non-traditional” technique for extracting data from large volumes of unstructured text.

As we look across existing commercial offerings in the market today, we see solutions that allow companies to index and search unstructured text efficiently across the enterprise to distill relevant information. The capabilities range from simple keyword and wildcard searches to more complex proximity and regular-expression searches, all aiming to help users quickly locate needed information in the unstructured data lake. Often, however, the search results do not return targeted data at the right granularity; users still have to comb manually through the results for the most relevant data values based on the context of the documents.
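
To make the granularity problem concrete, here is a minimal sketch using Python’s built-in re module on a made-up contract snippet (the text and dollar amounts are illustrative only):

```python
import re

# A made-up snippet of contract text, for illustration only.
text = (
    "The Borrower shall maintain a minimum liquidity of $5,000,000. "
    "A late fee of $250 applies after the due date. "
    "The facility amount is $12,500,000."
)

# A regular-expression search happily finds every dollar amount...
amounts = re.findall(r"\$[\d,]+", text)
print(amounts)  # ['$5,000,000', '$250', '$12,500,000']

# ...but it cannot say which amount is the facility amount and which is
# a late fee; a reader still has to inspect the surrounding context.
```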

For more targeted data extraction needs, there are software products that let users extract text via anchors within the documents, based on an existing document structure or template. This solution is fairly effective and reusable as long as the documents are consistent in structure. However, as we know, document structures are rarely static; they change from time to time, and any alteration to the original document template requires reconfiguring the text parser. With this solution, companies may find themselves trapped in a define-change-redefine-change cycle, which is an expensive maintenance proposition. Natural Language Processing (NLP) can help solve this challenging problem.
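
As a simplified illustration (not a depiction of any specific vendor’s product), consider a hypothetical anchor-based extractor that reads whatever follows a fixed label. It works until the template’s wording changes:

```python
from typing import Optional

def extract_by_anchor(document: str, anchor: str) -> Optional[str]:
    """Return the text that follows the anchor on its line, if found."""
    for line in document.splitlines():
        if anchor in line:
            return line.split(anchor, 1)[1].strip()
    return None

# Works while the template holds...
print(extract_by_anchor("Facility Amount: $12,500,000", "Facility Amount:"))
# ...and silently returns None once the label is reworded in a new template.
print(extract_by_anchor("Total Facility Commitment $12,500,000", "Facility Amount:"))
```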

NLP has been an academic discipline for decades, but only recently has it been commercially applied as a text-mining solution for unstructured data. As the name implies, NLP is a computational technique that processes natural human language, such as the text found in business documents, as language rather than as a flat sequence of characters. Based on the grammatical relationships among words and the semantics of the words in the text, the solution can infer the relevance of data in its context without any dependency on document templates. This decouples the data extraction process from the document structure: changes to a document’s layout or to the placement of the desired data values within the text do not impair NLP’s ability to extract them.
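
As a concrete sketch of what “processing text as language” looks like, the open-source spaCy library (one popular NLP toolkit; the engine discussed in this post is not named) exposes each word’s grammatical role and head word, which is the kind of structure an NLP engine can reason over instead of fixed positions:

```python
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("The facility amount shall not exceed $12,500,000.")

# Each token carries its grammatical role (dep_) and its head word, so a
# dollar value can be tied to the words that govern it, no matter where
# the sentence appears in the document.
for token in doc:
    print(f"{token.text:>12}  {token.dep_:<10}  head={token.head.text}")
```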

Extraction relies on a set of domain-specific concepts and related keywords that serve as an initial “training” corpus for an NLP engine; Sagence has built and implemented such an engine for several clients. As more documents are processed, the corpus can be augmented and adjusted to further improve extraction accuracy. In an initial assessment of the extraction results from more than 100 financial documents, we found that the NLP engine correctly identified the data values (i.e., numeric dollar values) 100% of the time for all keyword matches, exceeding the approximately 90% accuracy benchmark for manual data key-in. Compared to a human subject matter expert, the NLP engine fell short when classifying the extracted values into the appropriate data domain/subject (80% vs. 90% classification accuracy). So, at least for the moment, we should expect some level of human intervention to ensure extraction quality. Nonetheless, the solution dramatically improved the speed of data extraction, reducing a four-hour manual extraction task to a two-minute automated process. This efficiency gain could easily translate into millions of dollars in corporate savings.
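
For illustration only, a starter corpus of concepts and keywords could be wired into an extraction pass roughly as follows, here using spaCy’s PhraseMatcher; the concept names and phrases are hypothetical, not the actual corpus described above:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")

# Hypothetical starter corpus of domain concepts and related keywords;
# in practice this is curated with subject matter experts and grown as
# more documents are processed.
corpus = {
    "FACILITY_AMOUNT": ["facility amount", "total commitment"],
    "LATE_FEE": ["late fee", "penalty charge"],
}

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
for concept, phrases in corpus.items():
    matcher.add(concept, [nlp.make_doc(p) for p in phrases])

doc = nlp("The facility amount is $12,500,000; a late fee of $250 applies.")

for match_id, start, end in matcher(doc):
    concept = nlp.vocab.strings[match_id]
    # Pair each concept match with the MONEY entities in the same sentence.
    money = [e.text for e in doc[start].sent.ents if e.label_ == "MONEY"]
    print(concept, doc[start:end].text, "->", money)
```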

We at Sagence believe that NLP technology is a viable, flexible, and effective solution for unstructured data extraction. As such, we have developed a holistic framework based on NLP techniques for managing the entire lifecycle of unstructured data. With our solution, you can focus on analyzing the data and reduce the costs and inefficiencies inherent in manual data collection and data entry.

Contributed by Connie Koo and Alex Wu.