As the applications we build are increasingly driven by text, doing data ingestion, management, loading, and preprocessing in a robust, organized, parallel, and memory-safe way can get tricky. This talk walks through the highs (a custom billion-word corpus!), the lows (segfaults, 400 errors, pesky mp3s), and the new Python libraries we built to ingest and preprocess text for machine learning.
While applications like Siri, Cortana, and Alexa may still seem like novelties, language-aware applications are rapidly becoming the new norm. Under the hood, these applications take in text data as input, parse it into composite parts, compute upon those composites, and then recombine them to deliver a meaningful and tailored end result. The best applications use language models trained on domain-specific corpora (collections of related documents containing natural language) that reduce ambiguity and prediction space to make results more intelligible. Here's the catch: these corpora are huge, generally consisting of at least hundreds of gigabytes of data inside of thousands of documents, and often more!
In this talk, we'll explore many of the challenges we experienced along the way and introduce two Python packages that make this work a bit easier: Baleen and Minke. Baleen is a package for ingesting formal natural language data from the discourse of professional and amateur writers, like bloggers and news outlets, in a categorized fashion. Minke extends Baleen with a library that performs parallel data loading, preprocessing, normalization, and keyphrase extraction to support machine learning on a large-scale custom corpus.