This talk focuses on Natural Language Processing (NLP) in a Python-based Spark environment using PySpark. Examples are drawn from a Citing Sentences project underway within Elsevier Labs (http://labs.elsevier.com/). The goal of this project is to build and analyze citation networks to understand the diffusion and flow of ideas through the scientific research landscape. Much as people follow what is said about them on a social network, scientists want to understand how others are ‘talking’ about their papers. Are other authors supporting the work? Disagreeing with it? Referring to it as a discovery?
The development of our input datasets is out of scope for this talk, partly because the framework for citing sentence extraction is built in Scala on Spark rather than in PySpark. However, the formats of our citing-sentence DataFrames will be documented, and sample data will be provided, so that others can explore and reproduce our analyses.
The presentation will cover:
- Reformatting, manipulating, and combining DataFrames to meet specific analysis needs
- Preparing data for use with NLP tools and techniques
- Using PySpark, Spark SQL, Spark ML, and other Spark libraries within Python code to perform NLP
- Moving Spark DataFrames in and out of pandas for additional analysis and visualization
- Performing additional natural language analysis in NLTK within the PySpark environment
- Generating export formats suitable for other tools, such as for visualization with Gephi
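As a small illustration of the last item, the sketch below collapses citing sentences into a weighted edge list and writes it as CSV with `Source`, `Target`, `Type`, and `Weight` columns, which Gephi's spreadsheet importer recognizes for edge tables. The record layout, paper IDs, and function names are hypothetical and are not the project's actual schema; plain Python tuples stand in for rows that would normally come from a Spark DataFrame.

```python
import csv
import io
from collections import Counter

# Hypothetical citing-sentence records: (citing_paper_id, cited_paper_id, sentence).
# These are illustrative only and do not reflect the project's real data format.
citing_sentences = [
    ("paperA", "paperB", "Our results support the findings of [1]."),
    ("paperA", "paperB", "This effect was first discovered in [1]."),
    ("paperC", "paperB", "We disagree with the model proposed in [2]."),
    ("paperC", "paperA", "We follow the approach of [3]."),
]


def build_edge_rows(records):
    """Collapse citing sentences into one weighted edge per (citing, cited) pair."""
    counts = Counter((citing, cited) for citing, cited, _ in records)
    return [
        {"Source": source, "Target": target, "Type": "Directed", "Weight": weight}
        for (source, target), weight in sorted(counts.items())
    ]


def write_gephi_edges(rows, handle):
    """Write edge rows as CSV using the column names Gephi expects for edges."""
    writer = csv.DictWriter(handle, fieldnames=["Source", "Target", "Type", "Weight"])
    writer.writeheader()
    writer.writerows(rows)


edge_rows = build_edge_rows(citing_sentences)

# Write to an in-memory buffer; a real export would open a file instead.
buffer = io.StringIO()
write_gephi_edges(edge_rows, buffer)
edges_csv = buffer.getvalue()
print(edges_csv)
```

Counting sentences per pair is only one possible edge weight; a label derived from the NLP step (for example, supporting versus disagreeing) could be added as an extra column in the same way.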
The following materials will be provided so that audience members can return to the topic and continue learning after the event:
- A notebook compatible with Databricks Community Edition, containing Spark ML, Spark SQL, PySpark, and NLTK code
- A sample data file of citing sentences from Elsevier's CC BY-licensed articles