PyData SF 2016
NLP and Machine Learning without training data.
A major part of Big Data collected in most industries is in the form of unstructured text. Some examples are log files in IT sector, analysts reports in the finance sector, patents, laboratory notes and papers, etc. Some of the challenges of gaining insights from unstructred text is converting it into structured information and generating training sets for machine learning. Typically training sets for supervised learning are generated through the process of human annotation. In case of text this involves reading several thousands to million lines of texts by subject matter experts. This is very expensive and may not always be available, hence it is important to solve the problem of generating training sets before attempting to build machine learning models. Our approach is to combine rule based techniques with small amounts of SME time to by pass time consuming manual creation of training data. Once we have a good set of rules mimicking the training data we will use them to create knowledgebases out of the structured data. This knowledgebase can be further queried to gain insight on the domain. I have applied this technique to several domains, such as data from drug labels and medical journals, log data generated through customer interaction, generation of market research reports, etc. I will talk about the results in some of these domains and the advantage of using this approach.