SociaLite: Python-integrated Query Language for Data Analysis

Summary

SociaLite is a Python-integrated query language for data analysis. It makes scientific data analysis simple, yet achieves fast performance through its compiler optimizations. SociaLite supports relational tables and operations as well as Python integration, which makes it easy to implement a variety of analysis algorithms, including the BLAST algorithm and genome assembly in bioinformatics.

Description

SociaLite is a Python-integrated query language for distributed data analysis. It makes scientific data analysis simple, yet achieves fast performance through its compiler optimizations. SociaLite programs often run more than three orders of magnitude faster than equivalent Hadoop programs, and come close to optimized C programs. For example, the PageRank algorithm can be implemented in just two lines of SociaLite queries, and the result runs nearly as fast as an optimized parallel C implementation.
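For readers unfamiliar with PageRank, the computation that such a query expresses can be sketched in plain Python. This is not SociaLite syntax and not the implementation from the talk; the function name `pagerank`, the damping factor of 0.85, and the fixed iteration count are illustrative assumptions, and the sketch assumes every node has at least one outgoing edge.

```python
def pagerank(edges, num_nodes, damping=0.85, iterations=50):
    """Iterative PageRank over a list of (source, target) edges.

    Assumption: every node has at least one outgoing edge, so no
    special handling of dangling nodes is needed.
    """
    # Count outgoing edges per node.
    out_degree = [0] * num_nodes
    for src, _ in edges:
        out_degree[src] += 1

    # Start from a uniform distribution.
    rank = [1.0 / num_nodes] * num_nodes
    for _ in range(iterations):
        # Every node receives the "teleport" share first.
        new_rank = [(1.0 - damping) / num_nodes] * num_nodes
        # Each edge passes a share of the source's rank to the target.
        for src, dst in edges:
            new_rank[dst] += damping * rank[src] / out_degree[src]
        rank = new_rank
    return rank


# A three-node cycle: by symmetry, all nodes end up with equal rank.
edges = [(0, 1), (1, 2), (2, 0)]
ranks = pagerank(edges, num_nodes=3)
```

The inner loop over edges is essentially a join between the rank table and the edge table followed by a sum aggregation, which is why the algorithm fits a relational query language so compactly.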

SociaLite supports well-known high-level concepts to make data analysis easy for non-expert programmers. It provides relational tables for storing data, and relational operations such as join, selection, and projection for processing the data. Moreover, SociaLite queries are fully integrated with Python, so both SociaLite and Python code can be used to implement data analysis logic. The integration works in both directions: embedding lets you use SociaLite queries directly in Python code, and extending lets you call Python functions from SociaLite queries.
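As a reminder of what the three relational operations mean, here is a minimal sketch in plain Python over a small table of tuples. It does not use SociaLite or its query syntax; the table name `edge` and its contents are made up for illustration.

```python
from collections import defaultdict

# A hypothetical relational table edge(source, target).
edge = [(1, 2), (1, 3), (2, 3), (3, 4)]

# Selection: keep only the rows whose source is 1.
selected = [(s, t) for (s, t) in edge if s == 1]

# Projection: keep only the target column, without duplicates.
projected = sorted({t for (_, t) in edge})

# Join: pair edge(a, b) with edge(b, c) to find two-hop paths (a, c).
# An index on the source column avoids scanning the table per row.
by_source = defaultdict(list)
for s, t in edge:
    by_source[s].append(t)

two_hop = sorted({(a, c) for (a, b) in edge for c in by_source.get(b, [])})
```

A query language expresses each of these in a single declarative rule and leaves decisions such as building the index to the compiler.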

The Python integration makes it easy to implement various analysis algorithms in SociaLite and Python. For example, the BLAST algorithm in bioinformatics can be implemented in just a few lines of SociaLite queries and Python code. A genome assembly algorithm (generating a De Bruijn graph and applying an Eulerian cycle algorithm) can also be implemented simply. In the talk, I will demonstrate these algorithms in SociaLite, as well as more general algorithms such as K-means clustering and logistic regression.
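The genome assembly idea mentioned above can be sketched in plain Python. This is a toy illustration, not the SociaLite implementation shown in the talk: it assumes error-free reads, deduplicates k-mers, and walks an Eulerian path (rather than a cycle) with Hierholzer's algorithm, since a linear sequence yields a path. The function names and the example reads are hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a De Bruijn graph: each distinct k-mer becomes an edge
    from its (k-1)-length prefix to its (k-1)-length suffix."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    adj = {node: list(targets) for node, targets in graph.items()}
    in_degree = defaultdict(int)
    for targets in adj.values():
        for target in targets:
            in_degree[target] += 1
    # Start at the node with one more outgoing than incoming edge,
    # falling back to an arbitrary node if the graph is a cycle.
    start = next(iter(adj))
    for node in adj:
        if len(adj[node]) - in_degree[node] == 1:
            start = node
            break
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if adj.get(node):
            stack.append(adj[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: emit the node
    return path[::-1]


def assemble(reads, k):
    """Spell the sequence along the Eulerian path."""
    path = eulerian_path(de_bruijn_graph(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])


# Overlapping, error-free reads of one short made-up sequence.
reads = ["ATGGCGT", "GGCGTGC", "CGTGCA"]
genome = assemble(reads, k=4)
```

In this example every (k-1)-mer occurs once, so the path is unique; real assemblers must additionally deal with repeats, sequencing errors, and read coverage.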

SociaLite queries are compiled to highly optimized parallel and distributed code; the compiler applies optimizations such as pipelined evaluation and prioritization. The runtime system also improves performance; for example, its customized memory allocator reduces both memory allocation time and memory footprint. In short, SociaLite makes high-performance data analysis easy through its high-level abstractions and its compiler and runtime optimizations.
