The PyData ecosystem is growing rapidly, with existing tools maturing and exciting new tools appearing on a regular basis. This talk will examine the crowded PyData ecosystem and bring some clarity to which Python data tool is the right one to reach for on any given analysis. It will focus on use cases for pure Python, toolz, NumPy, Pandas, Blaze, xray, bcolz, Dask, and Spark.
The PyData ecosystem can be a bit confusing for those new to Python, or even for experienced programmers moving to Python for its excellent data analysis capabilities. How do you know which tool to reach for on any given project? Which tools work best for your data of size FooBar in data store FizzBuzz?
This talk will explore the Python data toolchain from bottom to top, with a focus on which tools work best based on both data locality and analysis velocity. Think of your data pipeline and storage as a city, and your data tools as a shed full of bikes. Which bike works best for which trip? When should you use pure Python (the fixie) to perform your analysis? How do Pandas (the geared commuter) and Blaze (the tandem) work together? Where does Spark (the fat tire bike) fit into all of this?
This talk seeks to use questionable bike analogies to provide a less-questionable look at the crowded PyData ecosystem and bring some clarity to which Python data tool is the right one to reach for on any given analysis. It will touch on pure Python, toolz, NumPy, Pandas, Blaze, xray, bcolz, Dask, and Spark, with a focus on the use cases for each one.
Finally, we’ll talk about which library you should use to paint the bikeshed.
Materials available here: https://github.com/wrobstory/pydataseattle2015