SciPy 2013

URL:
http://conference.scipy.org/scipy2013/
Description:

SciPy 2013, the twelfth annual Scientific Computing with Python conference, was held June 24th-29th 2013 in Austin, Texas. SciPy is a community dedicated to the advancement of scientific computing through open source Python software for mathematics, science, and engineering. The annual SciPy Conference allows participants from academic, commercial, and governmental organizations to showcase their latest projects, learn from skilled users and developers, and collaborate on code development.

Date:
June 24, 2013
Number of videos:
139
Symbolic Computing with SymPy, SciPy2013 Tutorial, Part 1 of 6
SciPy 2013
Aaron Meurer, Mateusz Paprocki, Ondrej Certik
Added: June 30, 2013
Language: English

SymPy is a pure Python library for symbolic mathematics. It aims to become a full-featured computer algebra system (CAS) while keeping the code as simple as possible in order to be comprehensible and easily extensible. SymPy is written entirely in Python and does not require any external libraries.
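
A short, generic taste of the library described above (plain SymPy usage, not taken from the tutorial materials):

```python
# Standard SymPy: symbolic limits, derivatives, and integrals.
from sympy import symbols, sin, diff, integrate, limit

x = symbols('x')

print(limit(sin(x) / x, x, 0))      # 1
print(diff(x**2 * sin(x), x))       # x**2*cos(x) + 2*x*sin(x)
print(integrate(x**2, (x, 0, 1)))   # 1/3
```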

Accessing the Virtual Observatory from Python; SciPy 2013 Presentation
SciPy 2013
Recorded: July 2, 2013
Language: English

Authors: Plante, Raymond, NCSA/UofIL; Fitzpatrick, Mike, NOAO; Graham, Matthew, Caltech; Tody, Doug, NRAO

Track: Astronomy and Astrophysics

One of the goals of the Virtual Astronomical Observatory (VAO) project is to enable developers to integrate access to astronomical archives and services into applications through standardized interfaces. As part of this effort, we have developed two packages for accessing the Virtual Observatory through Python. The first tool, VAOpy, is a package built on AstroPy which enables discovery of data archive services through the VAO registry service as well as the searching of the archives for individual datasets such as images, spectra, and source catalogs. The purpose of this module is to provide the developer with an easy-to-use interface that reflects knowledge of the standards upon which services are based. The second tool, VOClient, supports the same low-level API provided by VAOpy but adds higher-level capabilities and "book-keeping" that make it easier to develop sophisticated applications. This includes support for searching multiple archives, finding data for a list of sources, and collaborating with other desktop tools. Long running tasks are supported through asynchronous access to the underlying services and data caching. VOClient can send retrieved data records and datasets to other applications on the desktop through the VO standard protocol known as SAMP.

A comprehensive look at representing physical quantities in Python
SciPy 2013
Trevor Bekolay
Recorded: July 2, 2013
Language: English

This talk covers why tracking physical quantities is an essential capability for any programming language heavily used in science, and proposes a possible unification of the existing packages that enable the majority of use cases.

Advances in delivery and access tools for coastal ocean model data; SciPy 2013 Presentation
SciPy 2013
Recorded: July 2, 2013
Language: English

Authors: Signell, Richard, US Geological Survey

Track: Meteorology, Climatology, Atmospheric and Oceanic Science

Coastal ocean modelers are producers and consumers of vast and varied data, and spend significant effort on tasks that could be eliminated by better tools. In the last several years, standardization led by the US Integrated Ocean Observing System Program to use OPeNDAP for delivery of gridded data (e.g. model fields, remote sensing) and OGC Sensor Observation Services (SOS) for delivery of in situ data (e.g. time series sensors, profilers, ADCPs, drifters, gliders) has resulted in significant advancements, making it easier to deliver, find, access and analyze data. For distributing model results, the Unidata THREDDS Data Server and PyDAP deliver aggregated data via OPeNDAP and other web services with low impact on providers. For accessing data, NetCDF4-Python and PyDAP both allow efficient access to OPeNDAP data sources, but do not take advantage of common data models for structured and unstructured grids enabled by community-developed CF and UGRID conventions. This is starting to change with CF-data model based projects like the UK Met Office Iris project. Examples of accessing and visualizing both curvilinear and unstructured grid model output in Python will be presented, including both the IPython Notebook and ArcGIS 10.1.
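
A minimal sketch (not from the talk) of reading a model field over OPeNDAP with netCDF4-python; the URL and variable name are placeholders for a real THREDDS/OPeNDAP endpoint and its CF-named variable:

```python
from netCDF4 import Dataset

url = "http://example.org/thredds/dodsC/coastal_model/best"  # hypothetical endpoint
nc = Dataset(url)                 # remote dataset is opened lazily
temp = nc.variables["temp"]       # assumed 4-D variable: (time, level, y, x)
surface = temp[-1, 0, :, :]       # only this slice crosses the network
print(surface.shape)
nc.close()
```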

Analyzing IBM Watson experiments with IPython Notebook; SciPy 2013 Presentation
SciPy 2013
Recorded: July 2, 2013
Language: English

Authors: Bittner, Torsten, IBM

Track: General

IBM's Emerging Technologies team was tasked with migrating the IBM Watson system that won the Jeopardy!-like game to a domain-independent codebase. This task started as a software engineering exercise and later became an information engineering exercise as we worked to optimize the system's question-answering ability for new domains. In this new paradigm the team would observe and measure a system behavior, such as its accuracy in generating candidate answers to a particular type of question, and then hypothesize what (software) change to the system would improve the behavior and how it would impact the original measurement. The team would then implement the change, re-run the system against a test dataset, and analyze the gigabyte-sized test results to evaluate the difference in system behavior. By conducting many series of these experimental iterations, the team was able to significantly improve IBM Watson's question-answering performance.

Our initial attempts at information engineering used Java and the D3 JavaScript library to extract, analyze and visualize metrics of the system's behavior. Wiki pages were used to document the many experiments and their configurations. However, this arrangement proved overly cumbersome for handling the large numbers of experiments we ran, and our need to share experimental details, visualizations and results with other teams. Furthermore, we also needed to enable a broader skill set of people -- beyond expert Java programmers -- to conduct analyses, create visualizations, and share findings.

This talk describes how we used the IPython notebook environment and the rich set of Python data science libraries (e.g. Pandas, NumPy/SciPy) to perform reproducible science, which resulted in improvements to IBM Watson's accuracy.
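
An illustrative sketch only (the column names and CSV layout are hypothetical, not IBM's actual results format) of the kind of run-to-run comparison described above, using pandas:

```python
import pandas as pd

baseline = pd.read_csv("baseline_run.csv")     # per-question results of the old system
candidate = pd.read_csv("candidate_run.csv")   # results after the code change

merged = baseline.merge(candidate, on="question_id", suffixes=("_base", "_cand"))
delta = merged["correct_cand"].mean() - merged["correct_base"].mean()
print(f"Accuracy change: {delta:+.3%}")

# questions that the change broke, for manual inspection in the notebook
regressions = merged[(merged["correct_base"] == 1) & (merged["correct_cand"] == 0)]
print(regressions["question_id"].head())
```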

An Open Source System for De-identification and Use of Medical Images; SciPy 2013 Presentation
SciPy 2013
Recorded: July 2, 2013
Language: English

An Open Source System for De-identification and Use of Medical Images for Research

Authors: Miller, Jeffrey, Center for Biomedical Informatics, The Children's Hospital of Philadelphia

Track: Medical Imaging

Medical images captured from X-ray, MRI, CT, and ultrasound modalities represent a wealth of data for clinical researchers. Direct access to imaging studies establishes a greater opportunity for research purposes than a text-only system. However, imaging data can be difficult to work with outside of clinical systems and can contain Protected Health Information (PHI) in diverse and unexpected locations, presenting a barrier for multi-institutional, collaborative research. While there are existing integration solutions, such as the Clinical Trials Processor, they do not provide for manual curation of images to screen for relevancy and PHI, a crucial step for using images within a research application. To address these issues, we developed a system for the end-to-end provisioning of de-identified image studies. This includes a Django app for users to review and record metadata for each study, a pipeline for anonymizing and provisioning images to a production image archive, and finally an application for viewing images in the browser as part of a research application. We take advantage of the Python Ruffus pipeline framework and the PyDICOM library to orchestrate the work of moving, anonymizing, and annotating millions of files in a repeatable and auditable manner. This workflow has been used to integrate images into AudGenDB (http://audgendb.chop.edu), a publicly available hearing impairment research database. The AudGenDB image integration enables researchers to visualize and assess images in direct context with clinical and genetic variables for research subjects. The source code is available under a BSD license at http://github.com/cbmi/dicom-pipeline.
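
A minimal de-identification sketch using pydicom (not the AudGenDB pipeline itself); the tag list, file paths, and replacement values are illustrative only:

```python
import pydicom

PHI_TAGS = ["PatientName", "PatientBirthDate", "PatientAddress", "OtherPatientIDs"]

def deidentify(in_path, out_path, research_id):
    ds = pydicom.dcmread(in_path)
    ds.PatientID = research_id          # swap the identifier for a study-specific ID
    for keyword in PHI_TAGS:
        if hasattr(ds, keyword):
            setattr(ds, keyword, "")    # blank out direct identifiers
    ds.remove_private_tags()            # private tags often hide PHI
    ds.save_as(out_path)

deidentify("incoming/IM0001.dcm", "deidentified/IM0001.dcm", "SUBJ-0001")
```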

A Portrait of One Scientist as a Graduate Student
SciPy 2013
Paul Ivanov
Recorded: July 2, 2013
Language: English

This talk focuses on specific tools and techniques that are invaluable for doing research in a reproducible manner.

A Rapidly-Adaptable Imaging & Measurement Platform for Cancer Research; SciPy 2013 Presentation
SciPy 2013
Recorded: July 2, 2013
Language: English

A Rapidly-Adaptable Analytical Imaging & Measurement Standardization Platform for Cancer Diagnostics Research

Authors: Garsha, Karl, Ventana Medical Systems Inc.; Ventura, Franklin, Ventana Medical Systems Inc.;

Track: Medical Imaging

The focus of personalized medicine is to develop rationally-designed therapeutics targeting specific molecular mechanisms of diseases such as cancer. For targeted therapeutics to be of value in complex disease states, such as cancer, patient-specific mechanism(s) of disease must be identified by physicians such that the appropriate targeted therapeutic(s) may be identified and administered. The ability to evaluate phenotype and genotype for multiplexed biomarkers at the cellular level, in the context of preserved tissue, provides important information for advancing the science of personalized medicine.

Classical cancer diagnostic methods are based on direct inspection of prepared slides. In the classical approach, measurement and measurement standardization are limited by the constraints of human perception, established tradition and training. Our research seeks to empower physicians with new tools that diminish these existing limitations. Through Python, we bring together sophisticated nano-reporter technology, advanced microscopies, computational analysis and databasing technologies to establish feasibility of analytical tissue assay technology. Advancement of this technology is hoped eventually to enable powerful new opportunities for treatment of cancer.

Our work is greatly accelerated through the collective efforts of the Python community. The ability to leverage and combine rich scientific Open Source projects including SciPy, VTK, ITK, PIL, wxPython, Matplotlib, µManager, and OMERO is central to enabling this ambitious effort. Python allows us the synergy of a sophisticated high-level language interfaced with rich natively-compiled libraries. This capability allows us to maintain the remarkable level of plasticity necessary to adapt to fast-moving and diverse research problems, and the scalability to visualize large and complex n-dimensional datasets. Rich GUI capabilities allow us to rapidly put powerful tools in the hands of medical researchers.

Challenges include mechanisms to pass high-level data structures between native-compiled libraries, combining widgets from different GUI toolkits, memory limits, and the complexity of building self-contained installers/uninstallers for deployment to collaborator sites.

Automating Quantitative Confocal Microscopy Analysis; SciPy 2013 Presentation
SciPy 2013
Recorded: July 2, 2013
Language: English

Authors: Fenner, Mark; Fenner, Barbara, King's College, Wilkes-Barre, PA

Track: Medical Imaging

Confocal microscopy is a qualitative analytical tool used to visualize the associations between cellular processes and anatomical structures. Quantitative analysis of confocal images uses domain expertise, in the form of background correction, and statistical calculations to give semi-quantitative comparisons among experimental conditions. Extended automation of quantitative confocal methods will (1) reduce the time consuming effort of manual background correction and (2) give a fully quantitative method to associate cellular process with structure.

The purpose of this project is: (1) to develop automated methods to quantitatively assess colocalization of multiple fluorescent labels within confocal images and (2) to apply these methods to assess colocalization of trkB.t1 and BDNF to three types of organelles: endosomes, lysosomes, and transport organelles. Computing quantitative colocalization values requires image correction for background noise. We perform background correction in three ways: (1) manual, (2) automated heuristic analysis of the label intensity histograms, and (3) application of a regression model developed from a subset of manually corrected images. Using the corrected images, we compute a set of domain specific correlations: Pearson's and Mander's coefficients, the 'colocalization coefficients' (M1, M2, m1, and m2), and the 'overlap coefficients' (k1 and k2).

The project is implemented, end-to-end, in Python. Pure Python is used for managing file access, input parameters, and initial processing of the repository of 933 images. NumPy is used to apply manual background correction, compute the automated background corrections (reducing false positive results and manual labor), and to calculate the domain specific coefficients. We visualize the raw intensity values and computed coefficient values with Tufte-style panel plots created in matplotlib. A longer term goal of this work is to explore plausible extensions of dual-label coefficients to triple-label coefficients.
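
A back-of-the-envelope sketch (not the authors' code) of two of the measures named above, computed on background-corrected channel images R and G:

```python
import numpy as np

def pearson(r, g):
    return np.corrcoef(r.ravel(), g.ravel())[0, 1]

def manders(r, g):
    r = r.astype(float)
    g = g.astype(float)
    m1 = r[g > 0].sum() / r.sum()    # fraction of R intensity overlapping G
    m2 = g[r > 0].sum() / g.sum()    # fraction of G intensity overlapping R
    return m1, m2

rng = np.random.default_rng(0)
red = rng.poisson(5.0, size=(256, 256))
green = rng.poisson(2.0, size=(256, 256)) + (red > 6)  # partially colocalized channel
print(pearson(red, green), manders(red, green))
```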

Breaking the diffraction limit with python and scipy; SciPy 2013 Presentation
SciPy 2013
Recorded: July 2, 2013
Language: English

Authors: Baddeley, David, Nanobiology Institute, Yale University

Track: General

Textbook physics tells us that the resolution of a microscope is limited to half the wavelength of the radiation used. This means that structures smaller than ~250 nm cannot be resolved in an optical microscope, and that electron microscopy has been required to study cellular nanostructures. Recent advances based on imaging stochastically switching fluorescent probes have allowed the diffraction limit to be circumvented and optical imaging to be performed with a resolution of 10-20 nm. These new methods, known as PALM (Photo-Activated Localisation Microscopy), STORM (STochastic Optical Reconstruction Microscopy), and a number of related acronyms, are computationally intensive and involve detailed control of the microscope hardware.

I will present a comprehensive package for PALM/STORM microscope control and image analysis written in python and scipy. The package is modular, and comes complete with a facility for distributed data analysis. In addition to the specialised localisation microscopy components, there are many aspects of the project which are likely to be interesting to the broader microscopy and image processing community. These include a generic microscope control package, an extensible 3D image viewer supporting many basic image processing tasks, a 3D deconvolution software (Richardson-Lucy and ICTM), as well as PSF simulation and pupil phase extraction code.
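
For illustration only: the computational core of localisation microscopy is fitting a model PSF to each single-molecule spot. This standalone sketch uses scipy.optimize and is not taken from the package described in the talk:

```python
import numpy as np
from scipy.optimize import curve_fit

def gauss2d(coords, amp, x0, y0, sigma, offset):
    x, y = coords
    g = amp * np.exp(-((x - x0) ** 2 + (y - y0) ** 2) / (2 * sigma ** 2)) + offset
    return g.ravel()

y, x = np.mgrid[0:15, 0:15]                       # a 15x15 pixel region of interest
truth = (200.0, 7.3, 6.8, 1.5, 10.0)              # amplitude, x0, y0, sigma, background
roi = np.random.poisson(gauss2d((x, y), *truth).reshape(15, 15))

p0 = (roi.max(), 7.0, 7.0, 2.0, float(roi.min()))
popt, _ = curve_fit(gauss2d, (x, y), roi.ravel().astype(float), p0=p0)
print("fitted centre: (%.2f, %.2f) pixels" % (popt[1], popt[2]))
```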

My knowledge of python has grown alongside the project, and in addition to giving an overview of the package, I will discuss some of the design choices and mistakes I've made along the way.

Bringing astronomical tools down to earth; SciPy 2013 Presentation
SciPy 2013
Recorded: July 2, 2013
Language: English

Authors: Droettboom, Michael, STScI; Dencheva, Nadia, STScI; Aldcroft, Tom, Harvard-Smithsonian Center for Astrophysics

Track: General

In the process of developing the core tools in astropy, some modules have been developed that have wider applicability than just astronomy. This talk will describe these tools and the approach astropy has taken in developing them. They include general tools for handling units, and quantities with units, with capabilities not found in other unit packages, such as equivalency mappings. We have also developed a generic system for defining models and interfacing models with generic fitting algorithms in an easily extensible way. This system underlies our approach for mapping array coordinates to general world coordinate systems. Finally, a powerful table interface has been developed that handles many different data formats (currently focused mostly on astronomical varieties, but extensible to other fields as well).
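
A small illustration of the units/quantities support described above, using the public astropy.units API (the values here are arbitrary):

```python
from astropy import units as u

d = 3.0 * u.km + 250.0 * u.m
print(d.to(u.m))                                          # 3250.0 m

# an equivalency maps wavelength to frequency, which a plain unit converter cannot do
wavelength = 500 * u.nm
print(wavelength.to(u.THz, equivalencies=u.spectral()))   # ~599.58 THz
```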

Climate Observations from ACIS in pandas; SciPy 2013 Presentation
SciPy 2013
Recorded: July 2, 2013
Language: English

Authors: Noon, William, Northeast Regional Climate Center

Track: Meteorology, Climatology, Atmospheric and Oceanic Science

The Applied Climate Information System (ACIS) has been developed by the Regional Climate Centers (RCCs) and has been providing relevant climate data and products for over a decade. Last year (2012) we released version 2 of our data access protocol and made the system open to general use. http://data.rcc-acis.org

ACIS aggregates weather observations reported at over 20,000 stations in North America over the last 100 years. These observations are collected from a number of sources and updated multiple times a day. At any point in time, the system will select the best available data and merge them into a coherent record. The daily/hourly observations are available as well as climate products summarized over various time intervals. http://www.rcc-acis.org

The ACIS Web Services use standard web requests and formats to define the requested data product and return the results. This data can be further refined by the user in their preferred analysis environment.

This talk introduces a pandas data loader for the ACIS Web Services.
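
A hedged sketch, not the loader presented in the talk: pull daily maximum temperature for one station from the ACIS StnData web service into a pandas DataFrame. The station id is illustrative, and the exact request/response layout should be checked against the ACIS documentation:

```python
import requests
import pandas as pd

params = {"sid": "304174",         # hypothetical ACIS station identifier
          "sdate": "2012-01-01",
          "edate": "2012-12-31",
          "elems": "maxt"}
resp = requests.post("http://data.rcc-acis.org/StnData", json=params)
records = resp.json()["data"]      # e.g. [["2012-01-01", "43"], ...]

df = pd.DataFrame(records, columns=["date", "maxt"])
df["date"] = pd.to_datetime(df["date"])
df["maxt"] = pd.to_numeric(df["maxt"], errors="coerce")   # "M" flags missing data
print(df.groupby(df["date"].dt.month)["maxt"].mean())
```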

Data Agnosticism: Feature Engineering Without Domain Expertise; SciPy 2013 Presentation
SciPy 2013
Recorded: July 2, 2013
Language: English

Authors: Kridler, Nicholas, Accretive Health

Track: General

Bits are bits. Whether you are searching for whales in audio clips or trying to predict hospitalization rates based on insurance claims, the process is the same: clean the data, generate features, build a model, and iterate. Better features lead to a better model, but without domain expertise it is often difficult to extract those features. NumPy/SciPy, Matplotlib, Pandas, and scikit-learn provide an excellent framework for data analysis and feature discovery. This is evidenced by high performing models in the Heritage Health Prize and the Marinexplore Right Whale Detection challenge. In both competitions, the largest performance gains came from identifying better features. This required being able to repeatedly visualize and characterize model successes and failures. Python provides this capability as well as the ability to rapidly implement and test new features. This talk will discuss how Python was used to develop competitive predictive models based on derived features discovered through data analysis.
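
A generic illustration of the iterate-on-features loop described above, with synthetic data rather than either competition's dataset: a candidate feature is kept only if cross-validation says it helps.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=500) > 0.5).astype(int)

baseline = X[:, :3]
candidate = np.column_stack([baseline, X[:, 1] ** 2])   # engineered feature

clf = GradientBoostingClassifier(random_state=0)
for name, data in [("baseline", baseline), ("with x1^2", candidate)]:
    auc = cross_val_score(clf, data, y, cv=5, scoring="roc_auc").mean()
    print(f"{name:>10}: AUC = {auc:.3f}")
```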

DMTCP: Bringing Checkpoint-Restart to Python; SciPy 2013 Presentation
SciPy 2013
Recorded: July 2, 2013
Language: English

Authors: Arya, Kapil, Northeastern University; Cooperman, Gene, Northeastern University

Track: General

DMTCP[1] is a mature user-space checkpoint-restart package. One can think of checkpoint-restart as a generalization of pickling. Instead of saving an object to a file, one saves the entire Python session to a file. Checkpointing Python visualization software is as easy as checkpointing a VNC session with Python running inside.

A DMTCP plugin can be built in the form of a Python module. This Python module provides functions by which a Python session can checkpoint itself to disk. The same ideas extend to IPython.

Two classical uses of this feature are a saveWorkspace function (including visualization and the distributed processes of IPython). In addition, at least three novel uses of DMTCP for helping debug Python are demonstrated.

FReD[2] --- a Fast Reversible Debugger that works closely with the Python pdb debugger, as well as other Python debuggers.

Reverse Expression Watchpoint --- A bug occurred in the past. It is associated with the point in time when a certain expression changed. Bring the user back to a pdb session at the step before the bug occurred.

Fast/Slow Computation[3] --- Cython provides both traditional interpreted functions and compiled C functions. Interpreted functions are slow, but correct. Compiled functions are fast, but users sometimes define them incorrectly, whereupon the compiled function silently returns a wrong answer. The idea of fast/slow computation is to run the compiled version on one core, with checkpoints at frequent intervals, and to copy a checkpoint to another core. The second core re-runs the computation over that interval, but in interpreted mode.

[1] DMTCP: Transparent Checkpointing for Cluster Computations and the Desktop. Ansel, Arya, Cooperman. IPDPS-2009. http://dmtcp.sourceforge.net/
[2] FReD: Automated Debugging via Binary Search through a Process Lifetime. http://arxiv.org/abs/1212.5204
[3] Distributed Speculative Parallelization using Checkpoint Restart. Ghoshal et al. Procedia Computer Science, 2011.

Dynamics with SymPy Mechanics; SciPy 2013 Presentation
SciPy 2013
Recorded: July 2, 2013
Language: English

Authors: Moore, Jason, University of California at Davis

Track: General

The SymPy Mechanics package was created to automate the derivation of the equations of motion for rigid body dynamics problems. It has been developed primarily through several Google Summer of Code grants over three years and is capable of deriving Newton's Second Law for non-trivial multi-body systems using a variety of methods: from Newton-Euler, to Lagrange, to Kane. The software provides essential classes based around the concepts of a three dimensional vector in a reference frame which ease the setup and bookkeeping of the tedious kinematics including both kinematic and motion constraints. There are also classes for the automated formulation of the equations of motion based on the bodies and forces in a system. It also includes automated linearization of the resulting non-linear models. The software can be used to solve basic physics problems or very complicated many-body and many-constraint systems all with symbolic results. I will go over the basic software design, demonstrate its use through the API along with several classic physics problems and some not-so-trivial three dimensional multi-body problems.
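
A hedged, minimal example of the workflow described above: Kane's method for a planar pendulum with sympy.physics.mechanics. Argument conventions follow recent SymPy releases and have changed across versions:

```python
from sympy import symbols
from sympy.physics.mechanics import (dynamicsymbols, ReferenceFrame, Point,
                                     Particle, KanesMethod)

q, u = dynamicsymbols('q u')              # angle and generalized speed
qd = dynamicsymbols('q', 1)
m, l, g = symbols('m l g')

N = ReferenceFrame('N')                   # inertial frame
A = N.orientnew('A', 'Axis', [q, N.z])    # frame fixed to the pendulum rod

O = Point('O')                            # pivot
O.set_vel(N, 0)
P = O.locatenew('P', l * A.x)             # bob location
P.v2pt_theory(O, N, A)

bob = Particle('bob', P, m)
kane = KanesMethod(N, q_ind=[q], u_ind=[u], kd_eqs=[qd - u])
fr, frstar = kane.kanes_equations([bob], loads=[(P, -m * g * N.y)])
print(fr + frstar)                        # the pendulum's equation of motion
```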

Emacs + org-mode + python in reproducible research; SciPy 2013 Presentation
SciPy 2013
John Kitchin
Recorded: July 2, 2013
Language: English

We discuss the use of Emacs + org-mode + python in enabling reproducible research.

Estimating and Visualizing the Inertia of the Human Body with Python; SciPy 2013 Presentation
SciPy 2013
Recorded: July 2, 2013
Language: English

Authors: Moore, Jason, University of California at Davis; Dembia, Christopher, Stanford

Track: Medical Imaging

The Yeadon human body segment inertia model is a widely used method in the biomechanics field that allows scientists to get quick and reliable estimates of the mass, center of mass location, and inertia of any human body. The model is formulated around a collection of stadium solids that are defined by a series of width, perimeter, and circumference measurements. This talk will detail a Python software package that implements the method and exposes a basic API for its use within other code bases. The package also includes a text-based user interface and a graphical user interface, both of which will be demonstrated. The GUI is implemented with MayaVi and allows the user to manipulate the joint angles of the human and instantaneously get inertia estimates for various poses. Researchers that readily need body segment and human inertial parameters for dynamical model development or other uses should find this package useful for quick interactive results. We will demonstrate the three methods of using the package, cover the software design, show how the software can be integrated into other packages, and demonstrate a non-trivial example of computing the inertial properties of a human seated on a bicycle.

Experiences in Python for Medical Image Analysis; SciPy 2013 Presentation
SciPy 2013
Recorded: July 2, 2013
Language: English

Authors: Warner, Joshua, Mayo Clinic Department of Biomedical Engineering

Track: Medical Imaging

Upon entering graduate school and selecting radiology informatics as my topic of study, I conducted a broad survey of open source options for scientific work. There were three main criteria: (1) robust numerical and scientific capability, (2) a strong user community with continuing updates and long-term support, and (3) ease of use for students transitioning from other languages. Among several strong options that satisfied criterion (1), Python with NumPy and SciPy was the clear winner due to the latter two criteria.

My work focuses on supervised segmentation of soft-tissue abdominal MRI images, extracting novel image features from these segmented regions of interest, and applying machine learning techniques to evaluate features for predictive ability. This presentation will provide an overview of the key computational tasks required for this work, and outline the challenges facing a medical image researcher using Python. Most notably, medical image volumes are rarely isotropic, yet often algorithms for 3-D NumPy arrays inherently assume isotropic sampling. Thus, generalizing or extending various algorithms to handle anisotropic rectangular sampled data is necessary. Our improvements to one such algorithm were recently contributed back to the community, and are presently incorporated in the random walker segmentation algorithm in Scikit-Image.
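
A hedged example of the anisotropy-aware usage referred to above: scikit-image's random_walker accepts a spacing argument so voxels need not be isotropic. The volume and seed placement here are synthetic:

```python
import numpy as np
from skimage.segmentation import random_walker

volume = np.random.normal(size=(30, 64, 64))
volume[10:20, 20:40, 20:40] += 3.0            # a bright synthetic "lesion"

seeds = np.zeros(volume.shape, dtype=np.uint8)
seeds[15, 30, 30] = 1                          # foreground seed
seeds[2, 5, 5] = 2                             # background seed

# slices 3 mm apart, 1 mm x 1 mm in-plane resolution
labels = random_walker(volume, seeds, beta=130, spacing=(3.0, 1.0, 1.0))
print(labels.shape, np.unique(labels))
```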

Another significant challenge is visualization of algorithm output for large volumetric datasets. An extensible tool we call volview was developed, allowing fast visualization of an entire volume and an arbitrary number of colored, alpha-blended overlays, combining the abilities of NumPy, Pyglet, and PygArrayImage. This improved speed and quality of algorithm development, and facilitated review of our results by clinicians.

Exploring Collaborative HPC Visualization Workflows using VisIt and Python; SciPy 2013 Presentation
SciPy 2013
Recorded: July 2, 2013
Language: English

Authors: Krishnan, Harinarayan, Lawrence Berkeley National Laboratory; Harrison, Cyrus, Lawrence Livermore National Laboratory

Track: Reproducible Science

As High Performance Computing (HPC) environments expand to address the larger computational needs of massive simulations and specialized data analysis and visualization routines, the complexity of these environments brings many challenges for scientists hoping to capture and publish their work in a reproducible manner. Collaboration using HPC resources is a particularly difficult aspect of the research process to capture. This is also the case for HPC visualization, even though there has been an explosion of technologies and tools for sharing in other contexts.

Practitioners aiming for reproducibility would benefit from collaboration tools in this space that support the ability to automatically capture multi-user collaborative interactions. For this work, we modified VisIt, an open source scientific visualization platform, to provide an environment aimed at addressing these shortcomings. The talk will focus on two exploratory features added to VisIt:

1) We enhanced VisIt's infrastructure to expose a JSON API to clients over WebSockets. The new JSON API enables VisIt clients on web-based and mobile platforms. This API also enables multi-user collaborative visualization sessions. These collaborative visualization sessions can record annotated user interactions to Python scripts that can be replayed to reproduce the session in the future, thus capturing not only the end product but the step-by-step process used to create the visualization.

2) We have also added support for new Python & R programmable pipelines which allow users to easily execute their analysis scripts within VisIt's parallel infrastructure. The goal of this new functionality is to provide users familiar with Python and R with an easier path to embed their analysis within VisIt.

To showcase how these new features enable reproducible science, we will present a workflow that demonstrates a Climate Science use case.

High Performance Reproducible Computing; SciPy 2013 Presentation
SciPy 2013
Recorded: July 2, 2013
Language: English

Authors: Zhang, Zhang, Intel Corporation; Rosenquist, Todd, Intel Corporation; Moffat, Kent, Intel Corporation

Track: General

The call for reproducible computational results in scientific research areas has increasingly resonated in recent years. Given that a lot of research work uses mathematical tools and relies on modern high performance computers for numerical computation, obtaining reproducible floating-point computation results becomes fundamentally important in ensuring that research work is reproducible.

It is well understood that, generally, operations involving IEEE floating-point numbers are not associative. For example, (a+b)+c may not equal a+(b+c). Different orders of operations may lead to different results. But exploiting parallelism in modern performance-oriented computer systems has typically implied out-of-order execution. This poses a great challenge to researchers who need exactly the same numerical results from run to run, and across different systems.
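
A two-line reminder of the issue described above: with IEEE doubles the result depends on evaluation order, so a parallel reduction that reorders the sum can change the answer.

```python
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)   # 1.0
print(a + (b + c))   # 0.0
```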

This talk describes how to use tools such as Intel® Math Kernel Library (Intel® MKL) and Intel® compilers to build numerical reproducibility into Python based tools. Intel® MKL includes a feature called Conditional Numerical Reproducibility that allows users to get reproducible floating-point results when calling functions from the library. Intel® compilers provide broader solutions to ensure the compiler-generated code produces reproducible results. We demonstrate that scientific computing with Python can be numerically reproducible without losing much of the performance offered by modern computers. Our discussion focuses on providing different levels of controls to obtain reproducibility on the same system, across multiple generations of Intel architectures, and across Intel architectures and Intel-compatible architectures. Performance impact of each level of controls is discussed in detail. Our conclusion is that, there is usually a certain degree of trade-off between reproducibility and performance. The approach we take gives the end users many choices of balancing the requirement of reproducible results with the speed of computing.

This talk uses NumPy/SciPy as an example, but the principles and the methodologies presented apply to any Python tools for scientific computing.

Iris & Cartopy: Python packages for Atmospheric and Oceanographic science; SciPy 2013 Presentation
SciPy 2013
Recorded: July 2, 2013
Language: English

Iris & Cartopy: Open source Python packages for Atmospheric and Oceanographic science

Authors: Elson, Philip, UK Met Office;

Track: Meteorology, Climatology, Atmospheric and Oceanic Science

As the capabilities of Python packages valuable to the Atmospheric and Oceanographic Sciences (AOS) such as matplotlib, scipy and numpy have developed, so the UK Met Office's use of Python has expanded. The open source scientific Python stack is strategically important to the Met Office as it strives to meet the increasing need to collaborate freely and openly in academic and commercial partnerships. Python's easy-to-develop, dynamically typed syntax is ideally suited for data assimilation and model post-processing type tasks, and in recent years the Met Office has sustained funding for a team of software engineers to simplify, develop and improve its scientific capabilities by contributing to the open source AOS community.

The focus of much of this effort has been on a new open source Python package, Iris, which implements a generalised n-dimensional gridded data model to isolate analysis and visualisation code from file format specifics. The Iris data model is a result of close collaboration with the CF Data Model community and currently has read/write support for a variety of file formats including NetCDF and GRIB. In order to deliver a component of the core visualisation functionality, a new mapping library called Cartopy has also been developed on top of matplotlib. Cartopy exposes an intuitive interface for the transformation and visualisation of geospatial vector and raster data.
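
A minimal Cartopy sketch (not taken from the talk): coastlines on a rotated-pole projection, with a point supplied in ordinary latitude/longitude. The pole coordinates and the plotted location are arbitrary:

```python
import matplotlib.pyplot as plt
import cartopy.crs as ccrs

ax = plt.axes(projection=ccrs.RotatedPole(pole_longitude=177.5, pole_latitude=37.5))
ax.coastlines(resolution="110m")
ax.gridlines()

# transform= declares the coordinate system the data are given in
ax.plot(-1.5, 52.5, "ro", transform=ccrs.PlateCarree())
plt.savefig("rotated_pole.png")
```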

This talk will outline some of the Met Office's involvement in the open source community, including demonstrations of Iris and Cartopy; highlights of recent matplotlib contributions; and an outline of future developments.

lmonade: a platform for development and distribution of scientific software; SciPy 2013 Presentation
SciPy 2013
Recorded: July 2, 2013
Language: English

Authors: Erocal, Burcin, TU Kaiserslautern

Track: Reproducible Science

Most results in experimental mathematics are accompanied by software implementations which often push the boundaries of what can be computed in terms of mathematical theory and efficiency. Since new algorithms are built on existing ones, just as theorems are derived from existing results, it would be natural to expect that the code produced for one project will be useful later on, to both the same researcher and others.

While theorems blissfully stay intact over time, software deteriorates and ages. Implementations need to be updated with respect to changes in underlying libraries and hardware architectures. Even if up to date, software developed for a specific application area often needs to be adapted to new situations. Like proofs can be reused by taking some components intact and modifying certain parts, software needs similar adaptations to be reusable.

It is natural that researchers cannot commit any more time than absolutely necessary for distributing and maintaining their software. The lmonade project aims to provide infrastructure and tools to foster code sharing and openness in scientific software development by simplifying the tasks of distributing software with its dependencies, ensuring that it can be built on different platforms, and making sure the software is compatible across new releases of its dependencies.

This is achieved through:

(1) a light-weight meta distribution which can be installed by a user without administrative rights. Building on the Gentoo Linux distribution and the Gentoo Prefix project, lmonade creates a uniform environment for software development where the latest versions of scientific libraries can be found easily.

(2) access to a continuous integration infrastructure that automatically detects compatibility problems between new versions of packages and warns authors.

By simplifying code sharing and distribution, especially when complex dependencies are involved, this platform enables researchers to build on existing tools without fear of losing users to baffling installation instructions.

Massive Online Collaborative Research and Modeling using Synapse and Python; SciPy 2013 Presentation
SciPy 2013
Recorded: July 2, 2013
Language: English

Authors: Omberg, Larsson, Sage Bionetworks

Track: Bioinformatics

Synapse is an open source software as a service (SaaS) platform built by Sage Bionetworks to enable collaborative and reproducible science. Having RESTful APIs at its base, Synapse is able to easily link to analytical software such as Python. In this talk I will present the Python bindings to this platform and, more specifically, how it fostered a collaborative environment for over 140 individual researchers spread across 25 institutions in The Cancer Genome Atlas (TCGA) consortium. Synapse enables tracking of the provenance of data from individual genome sequencing centers, through processing and quality control, all the way to results generated from models of cancer genomics. Synapse is designed as an information commons, allowing any user not only to access data but also to contribute results and models. This allows the TCGA collaboration to accelerate discovery by using contributed partial results as starting points for downstream analyses. One sub-project that has emerged from the collaboration is an online machine learning competition to predict the expected survival time of cancer patients given molecular phenotype. All submitted models are immediately open sourced, allowing derivative models to be built. These collaborative competitions provide an alternative approach to performing computational science which tools like Python and Synapse can greatly accelerate.

Matplotlib: past, present and future; SciPy 2013 Presentation
SciPy 2013
Recorded: July 2, 2013
Language: English

Authors: Michael Droettboom

Track: Reproducible Science

This talk will be a general "state of the project address" for matplotlib, the popular plotting library in the scientific Python stack. It will provide an update about new features added to matplotlib over the course of the last year, outline some ongoing planned work, and describe some challenges to move into the future. The new features include a web browser backend, "sketch" style, and numerous other bugfixes and improvements. Also discussed will be the challenges and lessons learned moving to Python 3. Our new "MEP" (matplotlib enhancement proposal) method will be introduced, and the ongoing MEPs will be discussed, such as moving to properties, updating the docstrings, etc. Some of the more pie-in-the-sky plans (such as styling and serializing) will be discussed. It is hoped that this overview will be useful for those who use matplotlib, but don't necessarily follow its mailing list in detail, and also serve as a call to arms for assistance for the project.

Matrix Expressions and BLAS/LAPACK; SciPy 2013 Presentation
SciPy 2013
Recorded: July 2, 2013
Language: English

Authors: Rocklin, Matthew, University of Chicago Computer Science

Track: General

Numeric linear algebra is important and ubiquitous. The BLAS/LAPACK libraries include high performance implementations of dense linear algebra algorithms in a variety of mathematical situations. They are underused because (1) the interface is challenging for scientific users, and (2) the number of routines is huge, pressuring users to select general routines rather than finding the one that best fits their situation.

I demonstrate a small DSL for Matrix Algebra [1] embedded in the SymPy project [2]. I use logic programming to infer attributes about larger matrix expressions [3]. I describe the BLAS and LAPACK libraries programmatically [4] and use strategic programming [5] to automatically build directed acyclic graphs of BLAS/LAPACK operations to compute complex expressions [6]. From these I generate readable Fortran code [7]. I then use f2py to bring this back into Python. The result is a clean mathematical interface that efficiently generates mathematically informed numeric code. I compare these results against other popular numeric packages like NumPy and Theano.

Philosophically, I'll plug the following ideas:

Multiple clean intermediate representations: aside from a runnable Python function, this project also generates perfectly readable Fortran 90 code and a directed acyclic graph. I'll briefly show that the availability of the DAG representation opens up the possibility of static scheduling.

Declarative programming: all of the math in this project is defined separately from the algorithms, increasing opportunities for independent development. I'll probably talk about separating "what" code from "how" code, and I may evangelize a bit about small, modular and generally applicable projects.

NeuroTrends: Large-scale automated analysis of the neuroimaging literature; SciPy 2013 Presentation
SciPy 2013
Recorded: July 2, 2013
Language: English

Authors: Carp, Joshua, University of Michigan

Track: Medical Imaging

How do researchers design and analyze experiments? How should they? And how likely are their results to be reproducible? To investigate these questions, we developed NeuroTrends, a platform for large-scale analysis of research methods in the neuroimaging literature. NeuroTrends identifies relevant research reports using the PubMed API, downloads and parses full-text HTML and PDF documents, and extracts hundreds of methodological details from this unstructured text.

In the present study, NeuroTrends was evaluated using a corpus of over 16,000 journal articles. Automatically extracted methodological meta-data were validated against a hand-coded database. Overall, methodological details were extracted accurately, with a mean d-prime value of 3.53 (range: 1.12 to 6.18). Results revealed both variability and stability in methodological practices over time, with some methods increasing in prevalence, some decreasing, and others remaining consistent. Results also showed that design and analysis pipelines were highly variable across studies and have grown more variable over time.

In sum, the present study confirms the feasibility of accurately extracting methodological meta-data from unstructured text. We also contend that variability in research methods across time and from study to study poses a challenge to reproducibility in the neuroimaging literature--and likely in many other fields as well. Future directions include improving the accuracy and coverage of the NeuroTrends platform, integrating with additional databases, and extending to research domains beyond neuroimaging.

Oil spill modeling and uncertainty forecasts with Python; SciPy 2013 Presentation
SciPy 2013
Recorded: July 2, 2013
Language: English

Authors: Hou, Xianlong, University of Texas at Austin; Hodges, Ben, University of Texas at Austin

Track: Meteorology, Climatology, Atmospheric and Oceanic Science

A new method is presented to provide automatic sequencing of multiple hydrodynamic models and automated analysis of model forecast uncertainty. The Hydrodynamic and Oil Spill model Python (hyospy) wrapper was developed to run a hydrodynamic model, link with the oil spill model, and visualize results. Hyospy completes the following steps automatically: (1) downloads wind and tide data (nowcast, forecast and historical); (2) converts data to hydrodynamic model input; (3) initializes a sequence of hydrodynamic models starting at pre-defined intervals on a multi-processor workstation, and (4) provides visualization on Google Earth. Each model starts from the latest observed data, so that the multiple models provide a range of forecast hydrodynamics with different initial and boundary conditions reflecting different forecast horizons. As a simple testbed for integration and visualization strategies, a Runge-Kutta 4th order (RK4) particle transport method is used for spill transport. The model forecast uncertainty is estimated by the difference between forecasts in the sequenced model runs. The hyospy integrative system shows that challenges in operational oil spill modeling can be met by leveraging existing models and web-visualization methods to provide tools for emergency managers.
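
A generic fourth-order Runge-Kutta particle-advection step (illustrative only, not hyospy code); velocity(pos, t) stands in for interpolated model currents:

```python
import numpy as np

def rk4_step(pos, t, dt, velocity):
    """Advance an (n, 2) array of particle positions by one time step."""
    k1 = velocity(pos, t)
    k2 = velocity(pos + 0.5 * dt * k1, t + 0.5 * dt)
    k3 = velocity(pos + 0.5 * dt * k2, t + 0.5 * dt)
    k4 = velocity(pos + dt * k3, t + dt)
    return pos + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

# an idealized rotating current as a stand-in for hydrodynamic model output
velocity = lambda p, t: np.column_stack([-p[:, 1], p[:, 0]])
particles = np.random.default_rng(1).uniform(-1, 1, size=(100, 2))
for step in range(100):
    particles = rk4_step(particles, step * 0.01, 0.01, velocity)
print(particles.mean(axis=0))
```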

open('/dev/real_world') - Raspberry Pi Sensor and Actuator Control; SciPy 2013 Presentation
SciPy 2013
Recorded: July 2, 2013
Language: English

Authors: Minardi, Jack, Enthought Inc. Code: https://github.com/jminardi/scipy2013 Slides: http://bit.ly/12ciLOq

Track: General

I will walk the audience through all the steps necessary for acquiring sensor data using a Raspberry Pi and streaming it over a network to be plotted live. I will also explain how to control an actuator like a DC motor.

The topics I will cover: basic circuits (V = I * R), GPIO pins, pulse-width modulation, analog-to-digital converters, serial communication (SPI), basic data streaming using zmq (sketched below), basic real-time plots using Chaco, and all the Python needed for the above topics.

By the end of the talk I will demonstrate sensor and actuator control using everything that was covered.
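
An unofficial sketch of the zmq streaming idea listed above: the Pi publishes sensor readings and any machine on the network can subscribe and plot them. The port number and the fake ADC read are placeholders:

```python
import json
import random
import time

import zmq

ctx = zmq.Context()
pub = ctx.socket(zmq.PUB)
pub.bind("tcp://*:5556")                  # run this end on the Raspberry Pi

for _ in range(100):
    reading = {"t": time.time(), "adc": random.random()}   # stand-in for an ADC sample
    pub.send_string(json.dumps(reading))
    time.sleep(0.1)
```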

Opening Up Astronomy with Python and AstroML; SciPy 2013 Presentation
SciPy 2013
Recorded: July 2, 2013
Language: English

Authors: Vanderplas, Jake, University of Washington; Ivezic, Zeljko, University of Washington; Connolly, Andrew, University of Washington

Track: General

As astronomical data sets grow in size and complexity, automated machine learning and data mining methods are becoming an increasingly fundamental component of research in the field. The astroML project (http://astroML.github.com), first released in fall 2012, provides a common repository for practical examples of the data mining and machine learning tools used and developed by astronomical researchers, written in python. The astroML module offers a host of general data analysis and machine learning routines, loaders for openly-available astronomical datasets, and fast implementations of specific computational methods often used in astronomy and astrophysics. The associated website features hundreds of examples of these routines in action, using real datasets. In this talk I'll go over some of the highlights of the astroML code and examples, and discuss how we've used astroML as an aid for student research, hands-on graduate astronomy curriculum, and the sharing of research tools and results.

Parallel Volume Rendering in yt: User Driven & User Developed; SciPy 2013 Presentation
SciPy 2013
Recorded: July 2, 2013
Language: English

Authors: Skillman, Samuel, University of Colorado at Boulder; Turk, Matthew, Columbia University

Track: General

We will describe the development, design, and deployment of the volume rendering framework within yt, an open-source python library for computational astrophysics. In order to accommodate increasingly large datasets, we have developed a parallel kd-tree construction written using Python, Numpy, and Cython. We couple this parallel kd-tree with two additional levels of parallelism exposed through image plane decomposition with mpi4py and individual brick traversal with OpenMP threads for a total of 3 levels of parallelism. This framework is capable of handling some of the world's largest adaptive mesh refinement simulations as well as some of the largest uniform grid data (up to 4096³ at the time of this submission). This development has been driven by the need for both inspecting and presenting our own scientific work, with designs constructed by our community of users. Finally, we will close by examining case studies which have benefited from the user-developed nature of our volume renderer, as well as discuss future improvements to both user interface and parallel capability.

Python Tools for Coding and Feature Learning; SciPy 2013 Presentation
SciPy 2013
Recorded: July 2, 2013
Language: English

Authors: Johnson, Leif, University of Texas at Austin

Track: Machine Learning

Sparse coding and feature learning have become popular areas of research in machine learning and neuroscience in the past few years, and for good reason: sparse codes can be applied to real-world data to obtain "explanations" that make sense to people, and the features used in these codes can be learned automatically from unsupervised datasets. In addition, sparse coding is a good model for the sorts of data processing that happens in some areas of the brain that process sensory data (Olshausen & Field 1996, Smith & Lewicki 2006), hinting that sparsity or redundancy reduction (Barlow 1961) is a good way of representing raw, real-world signals.

In this talk I will summarize several algorithms for sparse coding (k-means [MacQueen 1967], matching pursuit [Mallat & Zhang 1994], lasso regression [Tibshirani 1996], sparse neural networks [Lee Ekanadham & Ng 2008, Vincent & Bengio 2010]) and describe associated algorithms for learning dictionaries of features to use in the encoding process. The talk will include pointers to several nice Python tools for performing these tasks, including standard scipy function minimization, scikit-learn, SPAMS, MORB, and my own packages for building neural networks. Many of these techniques converge to the same or qualitatively similar solutions, so I will briefly mention some recent results that indicate the encoding can be more important than the specific features that are used (Coates & Ng, 2011).
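
A generic scikit-learn illustration of the ideas above (not the speaker's code): learn a small dictionary from random stand-in "patches" and sparsely encode new ones with orthogonal matching pursuit.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning, sparse_encode

rng = np.random.default_rng(0)
patches = rng.normal(size=(2000, 64))             # stand-in for 8x8 image patches
patches -= patches.mean(axis=1, keepdims=True)    # remove the DC component

dico = MiniBatchDictionaryLearning(n_components=32, alpha=1.0, random_state=0)
D = dico.fit(patches).components_                 # learned feature dictionary

codes = sparse_encode(patches[:5], D, algorithm="omp", n_nonzero_coefs=5)
print(codes.shape, (codes != 0).sum(axis=1))      # at most 5 active atoms per patch
```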

Reproducible Documents with PythonTeX; SciPy 2013 Presentation
SciPy 2013
Recorded: July 2, 2013
Language: English

Authors: Poore, Geoffrey, Union University

Track: Reproducible Science

Writing a scientific document can be slow and error-prone. When a figure or calculation needs to be modified, the code that created it must be located, edited, and re-executed. When data changes or analysis is tweaked, everything that depends on it must be updated. PythonTeX is a LaTeX package that addresses these issues by allowing Python code to be included within LaTeX documents. Python code may be entered adjacent to the figure or calculation it produces. Built-in utilities may be used to track dependencies.

PythonTeX maximizes performance and efficiency. All code output is cached, so that documents can be compiled without executing code. Code is only re-executed when user-specified criteria are met, such as exit status or modified dependencies. In many cases, dependencies can be detected and tracked automatically. Slow code may be isolated in user-defined sessions, which automatically run in parallel. Errors and warnings are synchronized with the document so that they have meaningful line numbers.

Since PythonTeX documents mix LaTeX and Python code, they are less portable than plain LaTeX documents. PythonTeX includes a conversion utility that creates a new copy of a document in which all Python code is replaced by its output. The result is suitable for journal submission or conversion to other formats such as HTML.

While PythonTeX is primarily intended for Python, its design is largely language-independent. Users may easily add support for additional languages.

Scikit-Fuzzy: A New SciPy Toolkit for Fuzzy Logic; SciPy 2013 Presentation
SciPy 2013
Recorded: July 2, 2013
Language: English

Authors: Warner, Joshua, Mayo Clinic Department of Biomedical Engineering; Ottesen, Hal H., Adjunct Professor

Track: General

Scikit-fuzzy is a robust set of foundational tools for problems involving fuzzy logic and fuzzy systems. This area has been a challenge for the scientific Python community, largely because the common first exposure to this topic is through the MATLAB® Fuzzy Logic Toolbox™. This talk officially introduces a general set of original fuzzy logic algorithms to the scientific Python community which predate the commercial toolbox, were released under the 3-clause BSD license, and were translated to Python by an author who never used the MathWorks® Fuzzy Logic Toolbox™.

The current capabilities of scikit-fuzzy include: fuzzy membership function generation; fuzzy set operations; lambda-cuts; fuzzy mathematics including Zadeh's extension principle, the vertex method, and the DSW method; fuzzy implication given an IF THEN system of fuzzy rules (via Mamdani [min] or Larsen [product] implication); various defuzzification algorithms; fuzzy c-means clustering; and Fuzzy Inference Ruled by Else-action (FIRE) denoising of 1d or 2d signals.

The goals of scikit-fuzzy are to provide the community with a robust toolkit of independently developed and implemented fuzzy logic algorithms, filling a void in the capabilities of scientific and numerical Python, and to increase the attractiveness of scientific Python as a valid alternative to closed-source options. Scikit-fuzzy is structured similarly to scikit-learn and scikit-image, current source code is available on GitHub, and pull requests are welcome.
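
A small taste of the toolkit described above (membership functions plus centroid defuzzification); the temperature framing and rule weights are made up for illustration:

```python
import numpy as np
import skfuzzy as fuzz

temp = np.arange(0.0, 40.1, 0.1)
cool = fuzz.trimf(temp, [0, 0, 20])         # triangular membership functions
warm = fuzz.trimf(temp, [15, 40, 40])

# aggregate two (arbitrarily weighted) rule outputs and recover a crisp value
aggregated = np.fmax(0.3 * cool, 0.7 * warm)
crisp = fuzz.defuzz(temp, aggregated, 'centroid')
print(crisp)
```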

SciPy 2013 John Hunter Excellence in Plotting Contest
SciPy 2013
Recorded: July 2, 2013
Language: English

Presentation of finalists for excellence in plotting using Matplotlib.

SciPy 2013 Lightning Talks, Thu June 27
SciPy 2013
Recorded: July 2, 2013
Language: English

A series of lightning talks.

SciPy 2013 Lightning Talks, Wed June 26
SciPy 2013
Recorded: July 2, 2013
Language: English

Lightning Talks for Wed June 26:

SymPy Gamma and SymPy Live: Python and Mathematics Online; SciPy 2013 Presentation
SciPy 2013
Recorded: July 2, 2013
Language: English

Authors: Li, David, SymPy

Track: General

SymPy (sympy.org) is a Python library for symbolic mathematics and a computer algebra system. The project also develops two web applications that allow users to experiment with mathematics online, SymPy Gamma and SymPy Live. SymPy Gamma, modeled after Wolfram|Alpha, lets users enter mathematical expressions and see a variety of related computations and visualizations. Meanwhile, SymPy Live provides an online Python shell with features, such as LaTeX equation rendering, designed to aid the manipulation of mathematics with Python.

This talk will examine the implementation and development of the web applications as well as general experiences contributing to the SymPy project. In particular, the design of Gamma and the implementation of its server will be examined, as well as its features that help users explore mathematics using the SymPy library. For instance, by entering a trigonometric expression, users will receive alternate forms of the input, a plot, a series expansion, and other pertinent information. Furthermore, the purposes and development of SymPy Live will be examined, including the execution of code on Google App Engine and the development of the mobile site and other features during the 2011 Google Code-In contest and afterwards. One such feature is in SymPy's Sphinx documentation, which leverages SymPy Live to let users easily execute and see the results of any code examples in the documentation, and then use the shell to continue exploring the capabilities of this library.

The DyND Library; SciPy 2013 Presentation
SciPy 2013
Recorded: July 2, 2013
Language: English

Authors: Wiebe, Mark, Continuum Analytics

Track: General

The DyND library is a component of Blaze, providing an in-memory data structure which is dynamic, multidimensional, and array-oriented. It generalizes the data model of NumPy and Python's buffer protocol by dynamically composing data types and associated metadata like dimension sizes and strides. This adds flexibility to the system without requiring significant performance compromises. For example, variable-sized dimensions can be used to create ragged arrays, operated on just like fixed-size dimensions. DyND supports the NumPy set of data types, together with a growing list of additions like variable-sized strings, pointers, and a categorical type. Expression data types allow for both reading and writing from one type stored as another type under the hood. A date, with all its associated functions and properties, may be stored as a string or a struct. Expressions are evaluated in a lazy fashion, so multiple element-wise operations are fused similar to the numexpr library. DyND is written as a pure C++ library, with bindings for Python as a separate component. Its usage syntax is quite similar in Python and C++, making it easier for programmers to switch between the languages with code in the same high level style, only delving into the lower level details when necessary.

The Open Science Framework: Improving, by Opening, Science; SciPy 2013 Presentation
SciPy 2013
Recorded: July 2, 2013
Language: English

Authors: Spies, Jeffrey, Center for Open Science; Nosek, Brian, Center for Open Science

Track: Reproducible Science

The Center for Open Science (COS) is using Python to develop the Open Science Framework (OSF)--an infrastructure for conducting science transparently and openly with a focus on incentives and workflows. The goal of the infrastructure is to help reduce the gap between scientific practices and scientific values. The vehicle for this framework is a website (http://openscienceframework.org) and set of accompanying tools that provide scientists with a shared infrastructure that makes it easy to collaborate as well as document, organize, and search the entire lifespan of a research project. The Reproducibility Project--another COS-supported initiative--is a large-scale, collaborative study examining the rate of reproducibility in the psychological sciences. The OSF is being used to host and pre-register replication materials and replication hypotheses.

This talk will review the reasons why the major focus of the OSF is on incentives and workflow, demonstrate current features of the OSF, discuss how projects like the Reproducibility Project are using the OSF, and discuss why the COS believes that Python and the Python community will lead the open science (r)evolution.

Using IPython Notebook with IPython Cluster; SciPy 2013 Presentation
SciPy 2013
Recorded: July 2, 2013Language: English

Using IPython Notebook with IPython Cluster for Reproducibility and Portability of Atomistic Simulations

Authors: Trautt, Zachary, Materials Measurement Science Division, National Institute of Standards and Technology

Track: Reproducible Science

The information presented in a typical journal article is rarely sufficient to reproduce all atomistic simulations reported within. A typical study requires the distribution of parallel preprocessing, production, and post processing tasks. This is typically accomplished with scripting and a queuing system and is not typically captured in a publication or supporting information. A traditional workflow tool can capture this. However, a traditional workflow tool has a steep learning curve and many are not capable of distributing parallel tasks. We present the use of IPython Notebook and IPython Cluster as a tool for reproducible and portable atomistic simulations. IPython Notebook is used to define and annotate functions that implement simulation tasks. IPython Cluster is used to execute and distribute tasks, including external parallel tasks. This combination is an improvement for a number of reasons. First, the IPython notebook documents all steps of all simulations and can easily be included as supplementary information with a journal submission. Second, the IPython Cluster executes computational tasks with minimal effort (a single map command) and therefore does not distract from the science. Third, the IPython Cluster abstracts computational resources, such that organization-specific computational details (cluster name, batch submission details, etc.) are not defined in the notebook. Therefore, if a third party attempts to reproduce simulation results, the notebook can be used without modification if all dependencies are met. Furthermore, the initial researcher may observe a reduction of their time effort because of the efficiency gains in using a single map command over traditional scripting for the distribution of tasks.
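The "single map command" pattern mentioned above looks roughly like the following sketch; the simulation function and its inputs are illustrative stand-ins for the external tasks a real notebook would launch, and in newer IPython releases the same API lives in the separate ipyparallel package.

    from IPython.parallel import Client

    def run_simulation(temperature):
        # stand-in for a parallel preprocessing, production, or post-processing task
        return temperature ** 2

    rc = Client()                    # connect to the running IPython cluster
    view = rc.load_balanced_view()   # abstracts the organization-specific resources
    results = view.map(run_simulation, [300, 400, 500], block=True)
    print(results)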

Using Python to drive the General NOAA Operational Modeling Environment; SciPy 2013 Presentation
SciPy 2013
Recorded: July 2, 2013Language: English

Authors: Barker, Christopher H. NOAA Emergency Response Division.

Track: Meteorology, Climatology, Atmospheric and Oceanic Science

The General NOAA Operational Modeling Environment (GNOME) is a general purpose modeling tool originally designed for operational oil spill modeling. It was developed by NOAA's Emergency Response Division primarily to provide oil spill transport forecasts to the Federal On-Scene Coordinator. In the years since its original development, the model has been extended to support other drifting objects, and has been used for modeling a wide variety of cases, including marine debris, larval transport, and chemicals in water. It played a key role in the Deepwater Horizon oil spill in 2010, and is being used to forecast the drift of debris from the 2011 Japanese tsunami. In addition, the model is distributed freely to the general public, and is widely used in education and oil spill response planning.

The first version of the program has proven to be powerful, flexible, and easy to use. However, the program is written in C++, with the computational components and the desktop graphical interface code tightly integrated. As we move forward with development, we require a system that allows a new web-based user interface, easier extension of the model, easier scripting for automation, use of the core algorithms in other models, and easier testing. To achieve these goals, we are re-writing the model as a system of components, tied together with Python. Each component can be written in Python, or any language Python can call (primarily C++), and tested either individually or as part of the system with Python. We have written the new model driver in Python, and are wrapping the existing C++ components using Cython. In this paper, the model architecture is presented, with a discussion of the strengths and pitfalls of the approach.

Using Sumatra to Manage Numerical Simulations
SciPy 2013
Daniel Wheeler
Recorded: July 2, 2013Language: English

Sumatra is a lightweight system for recording the history and provenance data for numerical simulations. ... The speaker will provide an introduction to Sumatra as well as demonstrate some typical usage patterns and discuss achievable future goals.

XDress - Type, But Verify; SciPy 2013 Presentation
SciPy 2013
Recorded: July 2, 2013Language: English

Authors: Scopatz, Anthony, The University of Chicago & NumFOCUS, Inc.

Track: General

XDress is an automatic wrapper generator for C/C++ written in pure Python. Currently, xdress may generate Python bindings (via Cython) for C++ classes & functions and in-memory wrappers for C++ standard library containers (sets, vectors, maps). In the future, other tools and bindings will be supported.

The main enabling feature of xdress is a dynamic type system that was designed with the purpose of API generation in mind. This type system provides a canonical abstraction of various kinds of types: base types (int, str, float, non-templated classes), refined types (even or odd ints, strings containing the letter 'a'), and dependent types (templates such as arrays, maps, sets, vectors). This canonical form is itself hashable, being comprised only of strings, ints, and tuples.
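As a purely hypothetical illustration of what a hashable canonical form built only from strings, ints, and tuples can look like (this is not xdress's actual internal encoding, just the flavour of the idea):

    # Hypothetical nested-tuple type descriptions; NOT xdress's real internal format.
    base = 'int32'                                     # a base type
    refined = ('int32', 'even')                        # a refined type: even ints
    dependent = ('map', 'str', ('vector', 'float64'))  # a dependent (templated) type

    # Because each form is built only from strings, ints, and tuples, it is hashable
    # and can key a cache of generated wrappers.
    wrapper_cache = {dependent: 'generated_map_str_vector_double.pyx'}
    print(hash(dependent), wrapper_cache[dependent])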

On top of this type system, xdress provides a tool for auto-generating classes which are views into template instantiations of C++ standard library maps and sets. Additionally, this tool also creates custom numpy dtypes for any C++ type, class or struct. This allows the user to have numpy array views into C++ vectors.

Furthermore, xdress also has a tool which inspects a C++ code base and automatically generates Cython wrappers for all user-specified classes and functions. This significantly eases the burden of supporting mixed language projects.

The above code generators, however, are just the beginning. The xdress type system is flexible and powerful enough to engender a suite of other tools which take advantage of less obvious features. For example, an automatic verification & validation utility could take advantage of refinement type predicate functions to interdict parameter constraints into the API right under the user's nose!

This talk will focus on xdress's type system and its use cases.

A Gentle Introduction To Machine Learning; SciPy 2013 Presentation
SciPy 2013
Recorded: July 1, 2013Language: English

Authors: Kastner, Kyle, Southwest Research Institute

Track: Machine Learning

This talk will be an introduction to the root concepts of machine learning, starting with simple statistics, then working into parameter estimation, regression, model estimation, and basic classification. These are the underpinnings of many techniques in machine learning, though it is often difficult to find a clear and concise explanation of these basic methods.

Parameter estimation will cover Gaussian parameter estimation of the following types: known variance, unknown mean; known mean, unknown variance; and unknown mean, unknown variance.

Regression will cover linear regression, linear regression using alternate basis functions, Bayesian linear regression, and Bayesian linear regression with model selection.

Classification will extend the topic of regression, exploring k-means clustering, linear discriminants, logistic regression, and support vector machines, with some discussion of relevance vector machines for "soft" decision making.

Starting from simple statistics and working upward, I hope to provide a clear grounding of how basic machine learning works mathematically. Understanding the math behind parameter estimation, regression, and classification will help individuals gain an understanding of the more complicated methods in machine learning. This should help demystify some of the modern approaches to machine learning, leading to better technique selection in real-world applications.
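As a taste of the estimation and regression steps described above, here is a minimal NumPy sketch; the data are synthetic and the numbers purely illustrative.

    import numpy as np

    rng = np.random.RandomState(0)
    samples = rng.normal(loc=2.0, scale=1.5, size=1000)

    mu_hat = samples.mean()          # maximum-likelihood estimate of the mean
    var_hat = samples.var()          # maximum-likelihood estimate of the variance

    # Ordinary linear regression via least squares on a [1, t] design matrix
    t = np.linspace(0, 1, 50)
    y = 1.0 + 3.0 * t + rng.normal(scale=0.1, size=t.size)
    design = np.column_stack([np.ones_like(t), t])
    coef, _, _, _ = np.linalg.lstsq(design, y)
    print(mu_hat, var_hat, coef)     # coef is approximately [1.0, 3.0]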

An efficient workflow for reproducible science; SciPy 2013 Presentation
SciPy 2013
Recorded: July 1, 2013Language: English

Authors: Bekolay, Trevor, University of Waterloo

Track: Reproducible Science

Every scientist should be able to regenerate the figures in a paper. However, all too often the correct version of a script goes missing, or the original raw data is filtered by hand and the filtering process undocumented, or the student who has the data or code has switched labs.

In this talk, I will describe a workflow for a complete end-to-end analysis pipeline, going from raw data to analysis to plotting, using existing tools to make each step of the pipeline reproducible, documented, and efficient, while requiring few sacrifices in terms of a scientist's time and effort.

The key insight is to decouple each analysis step and each plotting step, in order to do several analyses or plots in parallel. Each step can be cached if it is costly, with the code that produces the cached data serving as the documentation for how it is produced.

I will discuss a way to organize code in order to make analyzing and plotting large data sets efficient, parallelizable, and cacheable. Once completed, source code can be uploaded to a hosting service like Github or Bitbucket, and data can be uploaded to a data store like Amazon S3 or figshare. The end result is that readers can completely regenerate the figures in your paper at no or nearly no cost to you.
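A minimal sketch of one such decoupled, cached analysis step follows; the file names and the analysis itself are hypothetical, and the point is that the function producing the cached file doubles as its own documentation.

    import os
    import numpy as np

    def analyze(raw_path, cache_path='analysis_step1.npy'):
        if os.path.exists(cache_path):
            return np.load(cache_path)      # reuse the costly, cached result
        raw = np.loadtxt(raw_path)          # the expensive step in a real pipeline
        result = raw.mean(axis=0)
        np.save(cache_path, result)
        return result

    def plot(result, figure_path='figure1.png'):
        import matplotlib.pyplot as plt
        plt.plot(result)
        plt.savefig(figure_path)            # regenerates the published figure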

Astropy, growing a community-based software system for astronomy
SciPy 2013
Erik Tollerud , Michael Droettboom , Thomas Robitaille
Recorded: July 1, 2013Language: English

Astropy is a community-based software project to coordinate the development of libraries and applications for astronomy. We will report on progress that has been made with astropy since the last scipy conference.

Astropy, growing a community-based software system for astronomy; SciPy 2013 Presentation
SciPy 2013
Recorded: July 1, 2013Language: English

Authors: Droettboom, Michael, STScI; Robitaille, Thomas, Max Planck Institute; Tollerud, Erik, Yale University

Track: Astronomy and Astrophysics

Astropy is a community-based software project to coordinate the development of libraries and applications for astronomy. We will report on progress that has been made with astropy since the last scipy conference. The past year has seen much growth in the number and quality of the core libraries in astropy and a public release. We will highlight the new capabilities available, and outline the development plans for the upcoming year. Finally we discuss the strategies for advertising its capabilities and growing the documentation and tutorials available for users and developers.

Best-practice variant calling pipeline for automated sequencing analysis; SciPy 2013 Presentation
SciPy 2013
Recorded: July 1, 2013Language: English

Best-practice variant calling pipeline for fully automated high throughput sequencing analysis

Authors: Chapman, Brad; Kirchner, Rory; Hofmann, Oliver; Hide, Winston

Track: Bioinformatics

bcbio-nextgen is an automated, scalable pipeline for detecting genomic variants from large-scale next-generation sequencing data. It organizes multiple best-practice tools for alignment, post-processing and variant calling into a single, easily configurable pipeline. Users specify inputs and parameters in a configuration file and the pipeline handles all aspects of software and data management. Large-scale analyses run in parallel on compute clusters using IPython and on cloud systems using StarCluster. The goal is to create a validated and community-maintained pipeline for automated variant calling, allowing researchers to focus on answering biological questions.

Our talk will describe the practical challenges we face in scaling the system to handle large whole-genome data for thousands of samples. We will also discuss current work to develop a variant reference panel and associated grading scheme that ensures reproducibility in a research world with rapidly changing algorithms and tools. Finally, we detail plans for integration with STORMseq, a user-friendly Amazon front end designed to make the pipeline available to non-technical users.

The presentation will show how bringing together multiple open-source communities provides infrastructure that bridges technical gaps and moves analysis work to higher-level challenges.

Combining C++ and Python in the LSST Software Stack
SciPy 2013
Jim Bosch
Recorded: July 1, 2013Language: English

The software system for the Large Synoptic Survey Telescope is completely open-source, and at every stage we've focused on making it usable not just with LSST, but with generic astronomical image data.

Combining C++ and Python in the LSST Software Stack; SciPy 2013 Presentation
SciPy 2013
Recorded: July 1, 2013Language: English

Authors: Bosch, Jim, Princeton University

Track: Astronomy and Astrophysics

The Large Synoptic Survey Telescope is an 8.4-meter survey telescope that will image the entire visible sky twice a week with a 3.2 Gigapixel camera, expected to come online early in the next decade. That means a lot of data: approximately 30 TB each night, and over 60 PB at the end of the 10-year survey, all of which will be made available to the public. The software system for LSST is completely open-source, and at every stage we've focused on making it usable not just with LSST, but with generic astronomical image data (in fact, it has been used to reduce data from several other telescopes already). We're building the software system for LSST using a combination of C++ and Python, making use of third-party software such as NumPy, Swig, and Eigen, along with a lot of custom code (much of which may be of broader use). In this talk I'll go over some of the advantages and disadvantages of the C++/Python combination, and some of the tricks and tools we've developed (and trials and tribulations we've encountered) in making them play well together in the context of astronomical data analysis. While LSST is still years away, and our software pipeline is still in many ways a prototype, in many respects it is already at the cutting edge of astronomical data analysis, and the lessons we have already learned will be of value not just to astronomers, but to scientists in other "big data" fields and general-purpose scientific software developers as well.

Complex Experiment Configuration ... using Robot Operating System (ROS); SciPy 2013 Presentation
SciPy 2013
Recorded: July 1, 2013Language: English

Complex Experiment Configuration, Control, Automation, and Analysis using Robot Operating System (ROS)

Authors: Stowers, John, TU Wien; Straw, Andrew, Research Institute of Molecular Pathology

Track: Reproducible Science

The Robot Operating System (ROS), and its Python bindings, are well known and used in the engineering and robotics communities for the many high level tools and algorithms they provide. Less appreciated are the lower levels of the ROS stack; libraries for inter-process-communication, parameter and configuration management, and distributed process launching and control.

In the Straw laboratory we use ROS to automate the operation of, and experiments using, virtual reality systems for fixed and freely flying Drosophila. This includes real-time 10-camera tracking (100Hz), 5 projector panoramic virtual reality (120Hz), and real-time visual stimulus generation and control (80Hz). Operation of this system requires the launching of over 30 processes on 4 computers, and the associated configuration of each in a known state. In addition, the progress of the experiment must be monitored over its entire 12 hour duration.

In this talk we will describe how ROS makes this complex system manageable and reproducible by implicitly recording the state of the system at all times, and by automating the pre-configuration and launching of the multiple processes which control the experiment. We will also describe how we tag all experimental data with unique identifiers to facilitate live monitoring, post-experiment analysis, and long-term archiving in case later forensics are required.

This talk will show that ROS is a very powerful tool that should be considered not only for engineering and robotics applications, but also by any scientist for robustly and reproducibly managing complex scientific experiments.
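As a flavour of the lower-level rospy machinery referred to above, a minimal node publishing experiment status might look like the sketch below; the node and topic names are invented for illustration, and a matching subscriber would simply register a callback on the same topic.

    import rospy
    from std_msgs.msg import String

    def main():
        rospy.init_node('experiment_monitor')          # register with the ROS master
        pub = rospy.Publisher('experiment_status', String)
        rate = rospy.Rate(1)                            # publish once per second
        while not rospy.is_shutdown():
            pub.publish(String(data='running'))
            rate.sleep()

    if __name__ == '__main__':
        main()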

Detection and characterization of interactions of genetic risk factors
SciPy 2013
Lin Wang , Patricia Francis-Lyon , Shashank Belvadi
Recorded: July 1, 2013Language: English

Much attention has been focused on the application of machine learning approaches to the detection of gene interactions. Our method is based upon training a supervised learning algorithm to detect disease, and then quantifying the effect on prediction accuracy when alleles of two or more genes are perturbed to the unmutated state in patterns that reveal and characterize gene interactions.

Exploring disease genetics from thousands of individual genomes with Gemini; SciPy 2013 Presentation
SciPy 2013
Recorded: July 1, 2013Language: English

Authors: Quinlan, Aaron, University of Virginia; Paila, Uma, University of Virginia; Chapman, Brad, Harvard School of Public Health; Kirchner, Rory,

Track: Bioinformatics

The throughput of DNA sequencing has increased by five orders of magnitude in the last decade and geneticists can now sequence a complete human genome in 24 hours for less than $5000. This tremendous increase in efficiency has led to large-scale studies of the relationship between inherited genetic variation and human disease. While collecting genetic variation from the genomes of thousands of humans is now possible, unraveling the genetic basis of disease remains a tremendous analytical challenge. Interpretation is especially difficult since many genetic variants associated with human disease lie outside the genomic regions that encode genes. To address this challenge, we have developed GEMINI, a flexible Python analysis framework for exploring human genetic variation. By leveraging NumPy, SQLite, and several powerful Python packages in the genomics domain, GEMINI integrates genome-scale genetic variation from thousands of individuals with a wealth of genome annotations that are crucial for disease interpretation.

GEMINI provides a powerful analysis framework allowing researchers to conduct otherwise complicated analyses with an easy to use analysis interface. It provides methods for ad hoc data exploration, a programming interface for custom analyses, and both command line and graphical tools for common analysis tasks. We demonstrate GEMINI's utility for exploring variation for personal genomes and family based genetic studies. Thanks to advances such as IPython.parallel, we further illustrate the framework's ability to scale to studies involving thousands of human samples.

Ginga: an open-source astronomical image viewer and toolkit
SciPy 2013
Eric Jeschke
Recorded: July 1, 2013Language: English

Ginga is a new astronomical image viewer written in python. It uses and inter-operates with several key scientific python packages: numpy, pyfits, and scipy.

Ginga: an open-source astronomical image viewer and toolkit; SciPy 2013 Presentation
SciPy 2013
Recorded: July 1, 2013Language: English

Authors: Jeschke, Eric, Subaru Telescope, National Astronomical Observatory of Japan

Track: Astronomy and Astrophysics

Ginga is a new astronomical image viewer written in python. It uses and inter-operates with several key scientific python packages: numpy, pyfits, and scipy. A key differentiator for this image viewer, compared to older-generation FITS viewers, is that all the key components are written as python classes, allowing for the first time a powerful FITS image display widget to be directly embedded in, and tightly coupled with, python code.

We call Ginga a toolkit for programming FITS viewers because it includes a choice of base classes for programming custom viewers for two different modern widget sets: Gtk and Qt, available on the three common desktop platforms. In addition, a reference viewer is included with the source code based on a plugin architecture in which the viewer can be extended with plugins scripted in python. The code is released under a BSD license similar to other major python packages and is available on github.

Ginga has been introduced only recently as a tool to the astronomical community, but since SciPy has a developer focus this talk concentrates on programming with the Ginga toolkit. We cover two cases: using the bare image widget to build custom viewers, and writing plugins for the existing full-featured Ginga viewer. The talk may be of interest to anyone developing code in Python who needs to display scientific image (CCD or CMOS) data, and to astronomers interested in Python-based quick-look and analysis tools.

GIS Panel Discussion
SciPy 2013
Andrew Wilson , Sergio Rey , Shaun Walbridge
Recorded: July 1, 2013Language: English

Authors: Panel participants: Sergio Rey (Arizona State U), Shaun Walbridge (ESRI), Andrew Wilson (TWDB)

Track: GIS - Geospatial Data Analysis

GIS Panel Discussion; SciPy 2013 Presentation
SciPy 2013
Recorded: July 1, 2013Language: English

Authors: Panel participants: Sergio Rey (Arizona State U), Shaun Walbridge (ESRI), Andrew Wilson (TWDB)

Track: GIS - Geospatial Data Analysis

GraphTerm: A notebook-like graphical terminal interface; SciPy 2013 Presentation
SciPy 2013
Recorded: July 1, 2013Language: English

GraphTerm: A notebook-like graphical terminal interface for collaboration and inline data visualization

Authors: Ramalingam Saravanan, Texas A&M University

Track: Reproducible Science

The notebook interface, which blends text and graphics, has been in use for a number of years in commercial mathematical software and is now finding more widespread usage in scientific Python with the availability of browser-based front-ends like the Sage and IPython notebooks. This talk will describe a new open-source Python project, GraphTerm, that takes a slightly different approach to blending text and graphics to create a notebook-like interface. Rather than operating at the application level, it works at the unix shell level by extending the command line interface to incorporate elements of the graphical user interface. The xterm terminal escape sequences are augmented to allow any program to interactively display inline graphics (or other HTML content) simply by writing to standard output.

GraphTerm is designed to be a drop-in replacement for the standard unix terminal, with additional features for multiplexing sessions and easy deployment in the cloud. The interface aims to be tablet-friendly, with features like clickable/tappable directory listings for navigating folders, etc. The user can switch, as needed, between the standard line-at-a-time shell mode and the notebook mode, where multiple lines of code are entered in cells, allowing for in-place editing and re-execution. Multiple users can share terminal sessions for collaborative computing.

GraphTerm is implemented in Python, using the Tornado web framework for the server component and HTML+Javascript for the browser client. The presentation will discuss the architecture of GraphTerm as well as provide specific usage examples, such as inline visualization of meteorological data using matplotlib, and collaborative presentations using Landslide, a python-based slideshow tool.

Hyperopt: A Python library for optimizing machine learning algorithms; SciPy 2013
SciPy 2013
Recorded: July 1, 2013Language: English

Hyperopt: A Python library for optimizing the hyperparameters of machine learning algorithms

Authors: Bergstra, James, University of Waterloo; Yamins, Dan, Massachusetts Institute of Technology; Cox, David D., Harvard University

Track: Machine Learning

Most machine learning algorithms have hyperparameters that have a great impact on end-to-end system performance, and adjusting hyperparameters to optimize end-to-end performance can be a daunting task. Hyperparameters come in many varieties--continuous-valued ones with and without bounds, discrete ones that are either ordered or not, and conditional ones that do not even always apply (e.g., the parameters of an optional pre-processing stage)--so conventional continuous and combinatorial optimization algorithms either do not directly apply, or else operate without leveraging structure in the search space. Typically, the optimization of hyperparameters is carried out before-hand by domain experts on unrelated problems, or manually for the problem at hand with the assistance of grid search. However, even random search has been shown to be competitive [1].

Better hyperparameter optimization algorithms (HOAs) are needed for two reasons:

HOAs formalize the practice of model evaluation, so that benchmarking experiments can be reproduced by different people.

Learning algorithm designers can deliver flexible fully-configurable implementations (of e.g. Deep Learning algorithms) to non-experts, so long as they also provide a corresponding HOA.

Hyperopt provides serial and parallelizable HOAs via a Python library [2, 3]. Fundamental to its design is a protocol for communication between (a) the description of a hyperparameter search space, (b) a hyperparameter evaluation function (machine learning system), and (c) a hyperparameter search algorithm. This protocol makes it possible to make generic HOAs (such as the bundled "TPE" algorithm) work for a range of specific search problems. Specific machine learning algorithms (or algorithm families) are implemented as hyperopt search spaces in related projects: Deep Belief Networks [4], convolutional vision architectures [5], and scikit-learn classifiers [6]. My presentation will explain what problem hyperopt solves, how to use it, and how it can deliver accurate models from data alone, without operator intervention.
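A minimal, self-contained hyperopt example of the protocol described above; the quadratic objective is a toy stand-in for a real machine learning system's validation error.

    from hyperopt import fmin, tpe, hp

    def objective(x):
        # stand-in for training a model and returning its validation error
        return (x - 2.0) ** 2

    space = hp.uniform('x', -10, 10)   # (a) the hyperparameter search space

    # (c) the search algorithm (TPE) minimizing (b) the evaluation function
    best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=100)
    print(best)                        # e.g. {'x': 1.99...}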

Import without a filesystem, SciPy 2013 Presentation
SciPy 2013
Recorded: July 1, 2013Language: English

Presenter: Pat Marion

Authors: Pat Marion, Kitware; Aron Ahmadia; Bradley M. Froehle, University of California, Berkeley

Track: General

Scientific Python is growing in popularity among HPC and supercomputing communities, but suffers from a seemingly simple and fundamental problem: importing modules from a shared network filesystem at extreme scale will cripple the performance of a parallel Python program.

At SciPy '12, the presentation titled "Solving the import problem: Scalable Dynamic Loading Network File Systems" analyzed the issue and proposed several remedies, but concluded there was more work to be done. Now, this talk introduces a new technique that leverages the linker to embed C-extension modules, and uses Python freeze to embed pure python modules. The result is a program that imports the Python standard library and scientific Python modules such as NumPy without accessing the filesystem. It achieves near-instant, and always-constant, import time even at full machine scale on today's largest supercomputers. The same technique is also relevant to Python app developers on mobile and embedded systems where filesystem access and dynamic loading inflate app startup time.

This talk will discuss the concepts involved using a simple hello-world demonstration, and overview a real-world example where Python was used to compute at full machine scale on Argonne's Intrepid BlueGene/P supercomputer.

Infer.py: Probabilistic Programming and Bayesian Inference from Python; SciPy 2013 Presentation
SciPy 2013
Recorded: July 1, 2013Language: English

Authors: Zinkov, Rob

Track: Machine Learning

Infer.py is a wrapper around Microsoft Research's Infer.NET inference engine. Infer.py allows you to represent complex graphical models in terms of short pieces of code. In this talk, I will show how many popular machine learning algorithms can be modeled as short probabilistic programs and then simply trained. I will then show how to introspect on the models which were learned and debug these programs when they don't produce desired results.

IPython-powered Slideshow Reveal-ed; SciPy 2013 Presentation
SciPy 2013
Recorded: July 1, 2013Language: English

Authors: Avila, Damian, OQUANTA;

Track: Reproducible Science

In recent years, Python has gained a lot of interest in the scientific community because several useful tools, well suited to scientific computing research, have been developed [1]. IPython [2], a comprehensive environment for interactive and exploratory computing, has arisen as a must-have application in the daily scientific workflow because it provides not only enhanced interactive Python shells (terminal and Qt-based) but also an interactive browser-based notebook with rich media support [3]. The oral presentation of research results to the public (specialized and non-specialized) is one of the final steps in the scientific research workflow, and recently the IPython notebook has begun to be used for these oral communications at several conferences. Although we can present our talks with the IPython notebook or as derived static HTML through the nbviewer service [4], there is no native IPython presentation tool aimed at easily presenting our results. So, in this paper, we describe a new IPython-Reveal.js-powered slideshow, designed specifically to be rendered directly from the IPython notebook, and powered with several features to address the most common tasks performed during the oral presentation and dissemination of our scientific work, such as:

- Main slides (horizontal)
- Nested slides (vertical)
- Fragment views
- Transitions
- Themes
- Speaker notes
- Export to PDF

To conclude, we have developed a better visualization tool for the IPython notebook, suited for the final step of our scientific research workflow, providing us with an enhanced experience in the oral presentation and communication of our results [5 - 6 - 7].

Julia and Python: a dynamic duo for scientific computing; SciPy 2013 Presentation
SciPy 2013
Recorded: July 1, 2013Language: English

Authors: Bezanson, Jeff, MIT; Karpinski, Stefan, MIT

Track: General

Julia is a recent addition to the collection of tools a scientist has available for tackling computational problems. It combines the simple programming model of a dynamic language like Python with the performance of a compiled language, while exposing expressive high-level features such as a sophisticated type system, dynamic multiple dispatch, Lisp-style macros and metaprogramming.

Julia can natively make zero-overhead calls to C and Fortran libraries without wrappers or data copying. Moreover, Julia can now call Python as well [3], with automatic bidirectional type conversion, bidirectional callbacks, and copy-free sharing of lists, dictionaries, and NumPy arrays. This is as simple as:

    julia> using PyCall
    julia> @pyimport scipy.optimize as so
    julia> so.newton(x -> cos(x) - x, 1)
    0.7390851332151607

Conversely, Python code can dynamically load the Julia runtime library and execute arbitrary Julia code. We have exploited this possibility to run Julia within the IPython environment [4]:

    In [1]: %load_ext juliamagic
    In [2]: jfib = %julia fib(n) = n < 2 ? n : fib(n-1) + fib(n-2)
    Out[2]: <PyCall.jlwrap fib>
    In [3]: jfib(20)
    Out[3]: 6765

In this talk we'll give an introduction to the Julia language and demonstrate how you can use Julia where it makes sense for you, while continuing to use your favorite scientific libraries and existing Python and C code.

LarvaMap - A python powered larval transport modeling system; SciPy 2013 Presentation
SciPy 2013
Recorded: July 1, 2013Language: English

Authors: Wilcox, Kyle, Applied Science Associates (ASA); Crosby, Alex, Applied Science Associates (ASA)

Track: GIS - Geospatial Data Analysis

LarvaMap is an open-access larval transport modeling tool. The idea behind LarvaMap is to make it easy for researchers everywhere to use sophisticated larval transport models to explore and test hypotheses about the early life of marine organisms.

LarvaMap integrates four components: an ocean circulation model, a larval behavior library, a python Lagrangian particle model, and a web-system for running the transport models.

An open-source particle transport model was written in python to support LarvaMap. The model utilizes a parallel multi-process architecture. Remote data are cached to a local file in small chunks when a process requires data, and the local data are shared between all of the active processes as the model runs. The caching approach improves performance and reduces the load on data servers by limiting the frequency and total number of web requests as well as the size of the data being moved over the internet.

Model outputs include particle trajectories in common formats (i.e. netCDF-CF and ESRI Shapefile), a web accessible geojson representation of the particle centroid trajectory, and a stochastic GeoTIFF representation of the probabilities associated with a collection of modeling runs. The common interoperable data formats allow a variety of tools to be used for in-depth analysis of the model results.

lpEdit: An editor to facilitate reproducible analysis; SciPy 2013 Presentation
SciPy 2013
Recorded: July 1, 2013Language: English

lpEdit: An editor to facilitate reproducible analysis via literate programming

Authors: Richards, Adam, Duke University, CNRS France; Kosinski Andrzej, Duke University; Bonneaud, Camille,

Track: Reproducible Science

There is evidence to suggest that a surprising proportion of published experiments in science are difficult if not impossible to reproduce. The concepts of data sharing, leaving an audit trail and extensive documentation are essential to reproducible research, whether it is in the laboratory or as part of an analysis. In this work, we introduce a tool for documentation that aims to make analyses more reproducible in the general scientific community.

The application, lpEdit, is a cross-platform editor, written with PyQt4, that enables a broad range of scientists to carry out the analytic component of their work in a reproducible manner---through the use of literate programming. Literate programming mixes code and prose to produce a final report that reads like an article or book. A major target audience of lpEdit is researchers getting started with statistics or programming, so the hurdles associated with setting up a proper pipeline are kept to a minimum and the learning burden is reduced through the use of templates and documentation. The documentation for lpEdit is centered around learning by example, and accordingly we use several increasingly involved examples to demonstrate the software's capabilities.

Because it is commonly used, we begin with an example of Sweave in lpEdit; then, in the same way that R may be embedded into LaTeX, we go on to show how Python can also be used. Next, we demonstrate how both R and Python code may be embedded into reStructuredText (reST). Finally, we walk through a more complete example, where we perform a functional analysis of high-throughput sequencing data, using the transcriptome of the butterfly species Pieris brassicae. There is substantial flexibility that is made available through the use of LaTeX and reST, which facilitates reproducibility through the creation of reports, presentations and web pages.

metaseq: a Python framework for integrating sequencing analyses; SciPy 2013 Presentation
SciPy 2013
Recorded: July 1, 2013Language: English

metaseq: a Python framework for integrating high-throughput sequencing analyses

Authors: Dale, Ryan, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health

Track: Bioinformatics

metaseq is a Python package that ties together a growing ecosystem of bioinformatics Python tools and file formats, focusing on flexibility and interactive exploration of high-throughput sequencing data (e.g., ChIP-seq, RNA-seq, and RIP-seq).

This talk will use a worked example to illustrate some practical bioinformatics applications of metaseq's features. For example, its filetype adapters provide random-access, uniform support for commonly-used formats (BAM, bigBed/bigWig, and, via tabix, any tab-delimited format). Combined with multiprocessing and a rebinning routine compiled by Cython, this allows relatively rapid population of NumPy arrays of binned signal over thousands of genes (or other features of interest).

metaseq's "mini-browser" framework connects these arrays -- or any other plot that considers genomic intervals, such as scatterplots of control vs treatment RNA-seq signal -- via callbacks to interactive creation of matplotlib figures that show the local genomic signal and gene models. Alternatively, callbacks can upload data and display them in the UCSC genome browser for further visualization alongside the wealth of publicly available data.

MIST: Micro-Simulation Tool to Support Disease Modeling
SciPy 2013
Jacob Barhak
Recorded: July 1, 2013Language: English

MIST stands for Micro-Simulation Tool. It is a modeling and simulation framework that supports computational Chronic Disease Modeling activities.

MIST: Micro-Simulation Tool to Support Disease Modeling; SciPy 2013 Presentation
SciPy 2013
Recorded: July 1, 2013Language: English

Authors: Jacob Barhak

Track: Bioinformatics

MIST stands for Micro-Simulation Tool. It is a modeling and simulation framework that supports computational Chronic Disease Modeling activities. It is a fork of IEST (Indirect Estimation and Simulation Tool), a GPL modeling framework.

MIST removes complexity associated with the estimation engine, with parameter definitions, and with rule restrictions. This significantly simplifies the system and allows its development along the micro-simulation path to proceed less encumbered.

The incentive to split MIST was to adapt the code to use newer compiler technology to speed up simulations. There is unwarranted skepticism in the medical disease modeling community towards using interpreters for simulations due to performance issues. The use of advanced compiler technology with Python may remedy this misconception and provide optimized Python-based simulations. MIST is a first step in this direction.

MIST takes care of a few documented and known issues. It also moves to use new scientific Python stacks such as Anaconda and PythonXY as its platform. This improves its accessibility to less sophisticated users that can now benefit from easier installation.

The Reference Model for disease progression intends to use MIST as its main platform. Yet MIST is equipped with a Micro-simulation compiler designed to accommodate Monte Carlo simulations for other purposes.

Modeling the Earth with Fatiando a Terra; SciPy 2013 Presentation
SciPy 2013
Recorded: July 1, 2013Language: English

Authors: Uieda, Leonardo, Observatorio Nacional; Oliveira Jr, Vanderlei C., Observatorio Nacional; Barbosa, V

Track: General

Solid Earth geophysics is the science of using physical observations of the Earth to infer its inner structure. Generally, this is done with a variety of numerical modeling techniques and inverse problems. The development of new algorithms usually involves copying and pasting of code, which leads to errors and poor code reuse. Added to this is a modeling pipeline composed of various tools that don't communicate with each other (Fortran/C for computations, large complicated I/O files, Matlab/VTK for visualization, etc).

Fatiando a Terra is a Python library that aims to unify the modeling pipeline inside of the Python language. This allows users to replace the traditional shell scripting with more versatile and powerful Python scripting. Together with the new IPython notebook, Fatiando a Terra can integrate all stages of the geophysical modeling process, like data pre-processing, inversion, statistical analysis, and visualization. However, the library can also be used for quickly developing stand-alone programs that can be integrated into existing pipelines. Plus, because functions inside Fatiando a Terra use a common data and mesh format, existing algorithms can be combined and new ideas can build upon existing functionality. This flexibility facilitates reproducible computations, prototyping of new algorithms, and interactive teaching exercises.

Although the project has so far focused on potential field methods (gravity and magnetics), some numerical tools for other geophysical methods have been developed as well. The library already contains: fast implementations of forward modeling algorithms (using Numpy and Cython), generic inverse problem solvers, unified geometry classes (prism meshes, polygons, etc), functions to automate repetitive plotting tasks with Matplotlib (automatic griding, simple GUIs, picking, projections, etc) and Mayavi (automatic conversion of geometry classes to VTK, drawing continents, etc). In the future, we plan to continuously implement classic and state-of-the-art algorithms as well as sample problems to help teach geophysics.

Multidimensional Data Exploration with Glue; SciPy 2013 Presentation
SciPy 2013
Recorded: July 1, 2013Language: English

Authors: Beaumont, Christopher, U. Hawaii; Robitaille, Thomas, MPIA; Borkin, Michelle, Harvard; Goodman, Alyssa

Track: General

Modern research projects incorporate data from several sources, and new insights are increasingly driven by the ability to interpret data in the context of other data. Glue (http://glueviz.org) is a graphical environment built on top of the standard Python science stack to visualize relationships within and between data sets. With Glue, users can load and visualize multiple related data sets simultaneously. Users specify the logical connections that exist between data, and Glue transparently uses this information as needed to enable visualization across files. This functionality makes it trivial, for example, to interactively overplot catalogs on top of images.

The central philosophy behind Glue is that the structure of research data is highly customized and problem-specific. Glue aims to accommodate and to simplify the "data munging" process, so that researchers can more naturally explore what their data has to say. The result is a cleaner scientific workflow, and more rapid interaction with data.
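A hedged sketch of handing related data to Glue from a script, assuming the qglue convenience function described in the Glue documentation; the image and catalog here are synthetic, and the logical link between them would then be defined interactively.

    import numpy as np
    from glue import qglue

    image = np.random.random((128, 128))
    catalog = {'x': np.random.uniform(0, 128, 50),
               'y': np.random.uniform(0, 128, 50)}

    # Opens the Glue application with both data sets loaded; the connection
    # between catalog (x, y) and image pixel coordinates is specified in the UI.
    qglue(image=image, catalog=catalog)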

Mystic: a framework for predictive science; SciPy 2013 Presentation
SciPy 2013
Recorded: July 1, 2013Language: English

Authors: Michael McKerns @ California Institute of Technology, Houman Owhadi @ California Institute of Technology

Track: Machine Learning

We have built a robust framework (mystic) that lowers the barrier to solving complex problems in predictive science. Mystic is built to rigorously solve high-dimensional non-convex optimization problems with highly nonlinear complex constraints. Mystic is capable of solving global optimization problems with thousands of parameters and thousands of constraints, and makes it almost trivial to leverage high-performance parallel computing. Mystic's unique ability to apply highly complex and statistical constraints can be used to find optimal probability distributions, calculate risk, uncertainty, sensitivity, and probability of failure in real-world inverse problems. Mystic is easy to use, open source, and pure python.

By providing a simple interface to a lot of underlying complexity, mystic enables a non-specialist user unprecedented access to optimizer configurability. Typically, both termination conditions and initial conditions are hard-coded into an optimization algorithm -- however, in mystic, conditionals are both dynamic and dynamically configurable, and thus enable tuning of the optimizer to solve a much broader range of problems. Mystic provides box constraints and penalty functions, as well as an advanced toolkit that can directly utilize all available information as constraints. With the ability to scale up to thousands of parameters, mystic can solve optimization problems that are orders of magnitude larger and of greater complexity than conventional solvers are capable of. In mystic, it's easy to create new algorithms to couple optimizers or launch multiple optimizers in parallel, thus allowing highly efficient local search algorithms to provide fast global optimization.

Calculations of uncertainty, risk, probability of failure, certification, and experiment design are formulated as global optimizations -- and are used to directly provide optimal scenarios for success or failure. Mystic has been used in calculations of materials failure under hypervelocity impact, elasto-plastic failure in structures under seismic ground acceleration, structure prediction in nanomaterials, and risk in financial portfolios.

Optimizing Geographic Processing and Analysis for Big Data; SciPy 2013 Presentation
SciPy 2013
Recorded: July 1, 2013Language: English

Authors: Brittain, Carissa, GeoDecisions; Gleason, Jason, GeoDecisions

Track: GIS - Geospatial Data Analysis

Performance becomes more and more important as the size of datasets increases. Many factors, some outside a developer's control, can seriously impact performance, sometimes to the point that a processing script or database becomes unusable. The example discussed here is an arcpy geoprocessing script that required more than 26 hrs to process and load 24 hrs of wind velocity data from across the United States. Changing the script to apply basic optimization strategies reduced that processing time to under an hour. Benchmark tests and database inspection while applying each strategy showed the results of each change and allowed for calculating each change's impact on the final performance. Understanding and applying even basic optimization methods can have a large return on effort when working with large datasets and can have a significant impact on processing time.

OS deduplication with SIDUS (single-instance distributing universal system); SciPy 2013 Presentation
SciPy 2013
Recorded: July 1, 2013Language: English

Authors: Quemener, Emmanuel, Centre Blaise Pascal (Lyon, France); Corvellec, Marianne, McGill University (Montreal)

Track: Reproducible Science

Developing scientific programs to be run on multiple platforms takes caution. Python is typically great as a glue language (COTS approach, for 'Components Off the Shelf'). But massive integration requires a technical platform which may be difficult to even deploy. It may be tempting to stick to the same environment for both development and operation. But environments on HPC nodes are very different from those on workstations. Even if Python comes with 'batteries included', it relies on external (C or Fortran) libraries, especially via SciPy. So you want to be careful when running your Python code on a cluster after developing it on your workstation. In the end, how do you compare two scientific results from the same program run on two different machines? In the variability, how do you tell the part due to the hardware from the part due to the software? As a scientist, you typically port your Python code from your workstation to cluster nodes. You want to have a uniform software base, so that discrepancies between runs can be attributed to hardware differences, or to the actual code, if edited. SIDUS (single-instance distributing universal system) is your solution for extreme deduplication of an operating system (OS). SIDUS offers scientists a framework for conducting reproducible experiments. Two nodes booting on the same SIDUS base run the exact same system. This way, genuinely relevant tests can be carried out. We recently used Python to evaluate performance for a cluster-distributed file system. Unexpectedly, early results showed a lack of reproducibility over time as well as across the different nodes. Using SIDUS, it was possible to rule out the OS as the source of these discrepancies. We could identify that they were due to C-states (CPU power-saving modes), which are responsible for large fluctuations in global performance (losses of up to 50%).

Processing biggish data on commodity hardware: simple Python patterns; SciPy 2013 Presentation
SciPy 2013
Recorded: July 1, 2013Language: English

Authors: Author: Gael Varoquaux Institution: INRIA, Parietal team

Track: Machine Learning

While big data spans many terabytes and requires distributed computing, most mere mortals deal with gigabytes. In this talk I will discuss our experience in efficiently applying machine learning to hundreds of gigabytes on commodity hardware. In particular, I will discuss patterns implemented in two Python libraries, joblib and scikit-learn, dissecting why they help address big data and how to implement them efficiently with simple tools.

In particular, I will cover:

- On the fly data reduction
- On-line algorithms and out-of-core computing
- Parallel computing patterns: performance outside of a framework
- Caching of common operations, with efficient hashing of arbitrary Python objects and a robust datastore relying on Posix disk semantics

The talk will illustrate the high-level concepts introduced with detailed technical discussions on Python implementations, based both on examples using scikit-learn and joblib and on an analysis of how these libraries work. The goal here is less to sell the libraries themselves than to share the insights gained in using and developing them.
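A minimal sketch of the caching and parallel-computing patterns listed above, using joblib; the transform and the cache directory are illustrative, not taken from the talk.

    import numpy as np
    from joblib import Memory, Parallel, delayed

    memory = Memory('/tmp/joblib_cache', verbose=0)

    @memory.cache                      # results are hashed and stored on disk
    def costly_transform(seed):
        rng = np.random.RandomState(seed)
        data = rng.normal(size=(10000, 100))   # stand-in for loading a large file
        return data.mean(axis=0)

    # Run several inputs in parallel; repeated calls are served from the cache.
    results = Parallel(n_jobs=2)(delayed(costly_transform)(s) for s in range(4))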

PyOP2: a Framework for Performance-Portable Unstructured Mesh-based Simulations, SciPy 2013
SciPy 2013
Recorded: July 1, 2013Language: English

Authors: Rathgeber, Florian, Imperial College London, UK; Markall, Graham R., Imperial College London, UK; Mi

Track: General

We present PyOP2, a high-level domain-specific language embedded in Python for mesh-based simulation codes. Through a simple interface, numerical kernels are efficiently scheduled and executed over unstructured meshes in parallel. Without any code changes required, an application can run on a range of hardware platforms, while implementation details of the parallel execution are abstracted from the programmer. Performance portability is achieved by generating optimized low-level OpenMP, MPI, CUDA or OpenCL code for multi-core CPUs or GPUs at runtime and just-in-time compiling the generated code.

PyOP2 is suitable as an intermediate representation for scientific computations, which we demonstrate with a finite-element tool chain using the domain-specific Unified Form Language UFL and the form compiler FFC from the FEniCS project. Finite-element methods are widely used to approximately solve partial differential equations on unstructured domains. The local assembly operation executes the same kernel for every entity of the mesh and is therefore a natural fit for the PyOP2 computation model. We show how these kernels are generated automatically from the weak form of an equation given in UFL. Global assembly and linear solves are passed through to platform-specific linear algebra backends integrated into PyOP2 through a modular interface. Using this tool chain, scientists can drive finite-element computations from an input notation very close to the mathematical model and transparently benefit from performance-portable parallel execution on their hardware architecture of choice without requiring specialist knowledge in numerical analysis or parallel programming.

Python and the SKA
SciPy 2013
Ludwig Schwardt , Simon Ratcliffe
Recorded: July 1, 2013Language: English

We will discuss some of the challenges specific to the radio astronomy environment and how we believe Python can contribute, particularly when it comes to the trade off between development time and performance.

Python and the SKA; SciPy 2013 Presentation
SciPy 2013
Recorded: July 1, 2013Language: English

Authors: Simon Ratcliffe SKA South Africa, Ludwig Schwardt SKA South Africa

Track: Astronomy and Astrophysics

The Square Kilometer Array will be one of the prime scientific data generators in the next few decades.

Construction is scheduled to commence in late 2016 and last for the best part of a decade. Current estimates put data volume generation near 1 Exabyte per day with 2-3 ExaFLOPs of processing required to handle this data.

As a host country, South Africa is constructing a large precursor telescope known as MeerKAT. Once complete, this will be the most sensitive telescope of its kind in the world, until dwarfed by the SKA.

We make extensive use of Python from the entire Monitor and Control system through to data handling and processing.

This talk looks at our current usage of Python, and our desire to see the entire high performance processing chain being able to call itself Pythonic.

We will discuss some of the challenges specific to the radio astronomy environment and how we believe Python can contribute, particularly when it comes to the trade off between development time and performance.

Pythran: Enabling Static Optimization of Scientific Python Programs; SciPy 2013 Presentation
SciPy 2013
Recorded: July 1, 2013Language: English

Authors: Guelton, Serge, ENS ; Brunet, Pierrick, Télécom Bretagne ; Raynaud, Alan, Télécom Bretagne; Adrien Merlini, Télécom Bretagne; Mehdi Amini, SILKAN

Track: General

Pythran is a young open source static compiler that turns Python modules into native ones. Based on the fact that scientific modules do not rely much on the dynamic features of the language, it trades them for powerful, possibly interprocedural, optimizations such as:

- automatic detection of pure functions;
- temporary allocation removal;
- constant folding;
- numpy ufunc fusion and parallelisation;
- explicit parallelism through OpenMP annotations;
- false variable polymorphism pruning;
- AVX/SSE vector instruction generation.

In addition to these compilation steps, Pythran provides a C++ runtime that leverages the C++ STL for generic containers and the Numeric Template Toolbox (nt2) for NumPy support. It takes advantage of modern C++11 features such as variadic templates, type inference, move semantics and perfect forwarding, as well as classical ones such as expression templates.

The input code remains compatible with the Python interpreter, and the output code is generally as efficient as the annotated Cython equivalent, if not more so, without the loss of backward compatibility. NumPy expressions run as fast as if compiled with numexpr, without changes to the original code.
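A small sketch of what a Pythran-annotated module can look like (the function and file name are illustrative): the export comment declares the accepted types, the module remains ordinary Python runnable by the interpreter, and it is compiled to a native module with the pythran command.

    # distances.py -- compile with:  pythran distances.py
    #pythran export periodic_dist(float64[], float64[], float)
    import numpy as np

    def periodic_dist(x, y, box):
        # element-wise distance on a periodic domain of size `box`
        d = np.abs(x - y)
        return np.minimum(d, box - d)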

Roadmap to a Sentience Stack; SciPy 2013 Presentation
SciPy 2013
Recorded: July 1, 2013Language: English

Authors: Eric Neuman

Track: Machine Learning

Race cars don't look like cheetahs, so why do attempts at machine sentience try to look like brains? An exploration of the unique challenges and immediate options along one possible path to machine sentience.

The "Do Anything Machine" is the first component in the theoretical Sentience Stack, an open source stack of software that when put together can be configured to learn to be a sentient mind. This approach is inspired by the LAMP stack, the collection of disparate decoupled open-source components (Linux, Apache, MySQL and PHP) that were once commonly used together to make it easy to create websites. LAMP reduced the barriers preventing everyone from building great dynamic websites. It also made it possible for individual components to be swapped out or optimized for a given project allowing the needs of individual projects to push the boundaries as needed. All of these things helped to enable the explosion of growth that created the internet as we know it, and enable it to continue improving.

Although currently in its very earliest stages, the Sentience Stack project will rely heavily on Python's extensive meta-programming capabilities and deep integration into the open source community.

Scientific Computing and the Materials Genome Initiative; SciPy 2013 Presentation
SciPy 2013
Recorded: July 1, 2013Language: English

Authors: Reid, Andrew, National Institute of Standards and Technology

Track: General

It is a commonplace notion that, as computers continue to become more powerful and more widely available, the communities surrounding various computational tools and techniques gain the ability to tackle larger and more interesting problems.

The US government's Materials Genome Initiative for Global Competitiveness (MGI), announced in June of 2011, has the goal of reducing the time to discover, develop, manufacture, and deploy advanced materials by a factor of two, while reducing the associated costs. Among the approaches foreseen in the initiative, there are two that are of particular interest to computational science. These are, firstly, more sophisticated computational models of materials systems, and secondly, data management tools that will better organize materials data, making data more easily discoverable, providing validation and provenance information, and simplifying the incorporation of data into new models.

The scientific python community has long had its eye not only on the computational solution of scientific and engineering problems, but also on the related issues surrounding the management of large volumes of data, version control for codes, and management of the computational scientific workflow, including reproducibility. This community is well positioned to address MGI-related issues. This talk will describe how MGI goals are being translated into more specific computational problems at NIST and other institutions, and will describe some of the challenges and issues that we have already seen in working towards its goals. The role of the Python language in general, and scientific Python tools in particular, will be highlighted. In addition, the talk will describe areas of overlap and opportunities for contributions between the MGI and the scientific python community.

SciPy 2013 Conference Welcome and Introduction
SciPy 2013
Recorded: July 1, 2013Language: English

The annual SciPy Conference allows participants from academic, commercial, and governmental organizations to showcase their latest Scientific Python projects, learn from skilled users and developers, and collaborate on code development.

The conference consists of two days of tutorials followed by two days of presentations, and concludes with two days of developer sprints on projects of interest to the attendees.

SciPy 2013 Keynote: IPython: the method behind the madness
SciPy 2013
Recorded: July 1, 2013Language: English

Presenter: Fernando Perez

IPython began its life as a personal "afternoon hack", but almost 12 years later it has become a large and complex project, where we try to think in a comprehensive and coherent way about many related problems in scientific computing. Despite all the moving parts in IPython, there are actually very few key ideas that drive our vision, and I will discuss how we seek to turn this vision into concrete technical constructs. We focus on making the computer a tool for insight and communication, and we will see how every piece of the IPython architecture is driven by these ideas.

I will also look at IPython in the context of the broader SciPy ecosystem: both how the project's user and developer community has evolved over time, and how it maintains an ongoing dialogue with the rest of this ecosystem. We have learned some important lessons along the way that I hope to share, as well as considering the challenges that lie ahead.

SciPy 2013 Keynote: The New Scientific Publishers
SciPy 2013
Recorded: July 1, 2013Language: English

Presenter: William Schroeder, Kitware

Track: Keynotes

Scientific societies such as the Royal Society were formed in the 17th century with the goals of sharing information and ensuring reproducibility. Very quickly scientific letters and publications were assembled into collected transactions and eventually journals. For hundreds of years publishers served admirably as disseminators of scientific knowledge. Publications, and the associated peer review process, became central to the scientific process, greatly impacting how science is practiced, knowledge disseminated, and careers made. However, as software becomes increasingly important to the practice of science, and data becomes larger and more complex, the conventional scientific journal is no longer an adequate vehicle to communicate scientific findings and ensure reproducibility. So who are the new scientific publishers filling these needs, and what roles will they play in the future of science?

In this presentation we'll discuss the central mandate of reproducibility, and the role of Open Science, in particular Open Access, Open Source and Open Data, and how emerging communities and organizations are filling the needs of the scientific community. We'll also discuss the challenges of curating the avalanche of scientific knowledge, whether it be software, data or publications, and how these communities and organizations can work together to support scientific progress and ensure continued technological innovation.

SciPy 2013 Keynote: Trends in Machine Learning and the SciPy community
SciPy 2013
Recorded: July 1, 2013Language: English

Presenter: Olivier Grisel

Track: Keynotes

This talk will give an overview of recent trends in Machine Learning, namely Deep Learning, Probabilistic Programming and Distributed Computing for Machine Learning, and will demonstrate how the SciPy community at large is building innovative tools to follow those trends and sometimes even lead them.

SCI-WMS: A Python Based Web Map Service For Met-Ocean Data; SciPy 2013 Presentation
SciPy 2013
Recorded: July 1, 2013Language: English

SCI-WMS: A Python Based Web Map Service For Met-Ocean Data Accessible Over OpenDAP Or As NetCDF

Authors: Crosby, Alexander

Track: GIS - Geospatial Data Analysis

SCI-WMS is a Python based web map service (WMS) designed as a web service for visualization of local or remote data such that they can be overlaid in georeferenced mapping environments like web maps or geographic information systems (GIS). The service follows the Open Geospatial Consortium (OGC) WMS specifications and is focused on the visualization of gridded data and unstructured meshes commonly stored in NetCDF files or available from distributed servers over the OpenDAP protocol. WMS servers are commonly used to visualize large archives of numerically modeled and observed data, and SCI-WMS is currently used in several U.S. Integrated Ocean Observing System (IOOS) projects around the country including the IOOS Super-regional Modeling Testbed and regional data portals. SCI-WMS was originally developed to fill a need for standards based visualization and data access tools to examine differences between unstructured mesh ocean models like FVCOM and ADCIRC, and the available visualization styles attempt to preserve as much of the complex topology as possible in the unstructured meshes. Support for regularly gridded datasets expanded the applicability of SCI-WMS for use with more commonly available ocean and meteorological model output as well as satellite derived observations.

Skdata: Data sets and algorithm evaluation protocols in Python; SciPy 2013 Presentation
SciPy 2013
Recorded: July 1, 2013Language: English

Authors: Bergstra, James, University of Waterloo; Pinto, Nicolas, Massachusetts Institute of Technology; Cox, David D., Harvard University

Track: Machine Learning

Machine learning benchmark data sets come in all shapes and sizes, yet classification algorithm implementations often insist on operating on sanitized input, such as (x, y) pairs with vector-valued input x and integer class label y. Researchers and practitioners are well aware of how much work (and even sometimes judgement) is required to get from the URL of a new data set to an ndarray fit for e.g. pandas or sklearn. The skdata library [1] handles that work for a growing number of benchmark data sets, so that one-off in-house scripts for downloading and parsing data sets can be replaced with library code that is reliable, community-tested, and documented.

Skdata consists primarily of independent submodules that deal with individual data sets. Each [new-style] submodule has three important sub-sub-module files:

a 'dataset' file with the nitty-gritty details of how to download, extract, and parse a particular data set;

a 'view' file with any standard evaluation protocols from relevant literature; and

a 'main' file with CLI entry points for e.g. downloading and visualizing the data set.

Various skdata utilities help to manage the data sets themselves, which are stored in the user's "~/.skdata" directory.

The evaluation protocols represent the logic that turns parsed (but potentially idiosyncratic) data into one or more standardized learning tasks. The basic approach has been developed over years of combined experience by the authors, and used extensively in recent work (e.g. [2]). The presentation will cover the design of data set submodules, and the basic interactions between a learning algorithm and an evaluation protocol.
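
To make the layout described above concrete, here is a minimal, hypothetical sketch of what a new-style submodule could look like. The class, function, and method names are illustrative only and are not taken from the skdata codebase.

```python
# Hypothetical skeleton of a new-style data set submodule (all names illustrative).

# dataset.py -- nitty-gritty download/extract/parse logic
class ToyDataset(object):
    def fetch(self):
        """Download and unpack the raw archive into ~/.skdata/toy/ (details omitted)."""

    def meta(self):
        """Return one dict per example, with the data set's idiosyncratic fields."""
        return [{"pixels": [0.0] * 4, "label": i % 2} for i in range(100)]

# view.py -- a standard evaluation protocol built on top of the parsed data
class ClassificationProtocol(object):
    def __init__(self, dataset):
        self.dataset = dataset

    def protocol(self, algo):
        """Drive a learning algorithm object through a fixed train/test task."""
        meta = self.dataset.meta()
        train, test = meta[:80], meta[80:]
        model = algo.best_model(train)      # algo supplies fit/score callbacks
        return algo.loss(model, test)

# main.py -- CLI entry points, e.g. for downloading or visualizing the data
def main_download():
    ToyDataset().fetch()
```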

Streamed Clustering of Lightning Mapping Data in Python Using sklearn; SciPy 2013 Presentation
SciPy 2013
Recorded: July 1, 2013Language: English

Authors: Bruning, Eric C., Texas Tech University

Track: GIS - Geospatial Data Analysis

Lightning mapping at radio frequencies (here with VHF Lightning Mapping Array data) is typically performed by a time-of-arrival source retrieval method. Thereafter, it is common to cluster the located sources into flash-level entities (often comprised of 10^2 - 10^3 sources) using space and time separation thresholds. A previously-used clustering algorithm was a one-off implementation in Fortran, and was designed without reference to the machine learning literature. This study replaces the previous algorithm, which had been wrapped into the Python-based lmatools workflow, with the general-purpose DBSCAN implementation in Python's sklearn package. The legacy code included substantial, file format-specific, I/O boilerplate. The new code clarifies the boundary between algorithm and I/O, and promotes clean integration with the rest of the lmatools infrastructure, aiding maintainability.

A chunked, streamed processing method was developed to account for continuous data rates that may exceed 10^5 four-coordinate (space and time) source vectors per minute. The chunking method exploits known physical limits to lightning flash duration, allowing the N^2 implementation of DBSCAN in sklearn to achieve real-time processing rates within available memory. The streaming technique is expected to be useful in future work as a flexible building block for end-to-end real-time and post-processing scripts and interactive analysis tools.
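
As a rough illustration of the core clustering step, the sketch below clusters one chunk of (x, y, z, t) sources with scikit-learn's DBSCAN after normalizing space and time by separation thresholds, so a single eps applies. The threshold values and synthetic data are ours; the chunk-boundary and flash-duration logic from the talk is not reproduced.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_chunk(sources, d_xyz=3.0e3, d_t=0.15, eps=1.0, min_samples=10):
    """Cluster one time-sorted chunk of (x, y, z, t) VHF sources into flashes.

    Space and time are scaled by illustrative separation thresholds (3 km, 0.15 s)
    so that DBSCAN's single eps threshold applies. Returns one label per source;
    -1 marks noise.
    """
    scaled = np.column_stack([sources[:, :3] / d_xyz, sources[:, 3:] / d_t])
    return DBSCAN(eps=eps, min_samples=min_samples).fit(scaled).labels_

# Synthetic example: 1000 sources spread over 60 seconds
rng = np.random.default_rng(0)
sources = np.column_stack([rng.normal(0.0, 5e3, (1000, 3)), rng.uniform(0.0, 60.0, 1000)])
labels = cluster_chunk(sources)
print(len(set(labels)) - (1 if -1 in labels else 0), "flash-level clusters found")
```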

The algorithm is expected to find immediate use in our analysis of data from the NSF-sponsored Deep Convective Clouds and Chemistry campaign. The open nature of the underlying clustering libraries promotes code reuse by other research groups. Accounts of source-to-flash clustering in the literature are complemented by the availability of this open, objective reference implementation for clustering of lightning mapping datasets.

Stuff to do with your genomic intervals; SciPy 2013 Presentation
SciPy 2013
Recorded: July 1, 2013Language: English

Authors: Pedersen, Brent; University of Colorado

Track: Bioinformatics

After traditional bioinformatic analyses, we are often left with a set of genomic regions; for example: ChIP-Seq peaks, transcription-factor binding sites, differentially methylated regions, or sites of loss-of-heterozygosity. This talk will go over the difficulties commonly encountered at this stage of an investigation and cover some additional analyses, using python libraries, that can help to provide insight into the function of a set of intervals. Some of the libraries covered will be pybedtools, cruzdb, pandas, and shuffler. The focus will be on annotation, exploratory data analysis and calculation of simple enrichment metrics with those tools. The format will be a walk-through (in the IPython notebook) of a set of these analyses that utilizes ENCODE and other publicly available data to annotate an example dataset.
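
For readers unfamiliar with these libraries, a small sketch of the kind of interval annotation the talk covers, using pybedtools; the BED file names are placeholders, and the enrichment summary is deliberately simplistic.

```python
import pybedtools

# Placeholder BED files: regions of interest and an annotation track
peaks = pybedtools.BedTool("chip_peaks.bed")
genes = pybedtools.BedTool("genes.bed")

# Peaks that overlap at least one gene (reported once each, like bedtools -u)
overlapping = peaks.intersect(genes, u=True)

# Crude enrichment-style summary: fraction of peaks that hit a gene
print(overlapping.count() / float(peaks.count()))
```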

SunPy - Python for Solar Physicists
SciPy 2013
Stuart Mumford
Recorded: July 1, 2013Language: English

SunPy is a project designed to provide a free, open and easy-to-use Python alternative to IDL and SolarSoft. SunPy provides unified, coordinate-aware data objects for many common solar data types and integrates into these plotting and analysis tools.

SunPy - Python for Solar Physicists; SciPy 2013 Presentation
SciPy 2013
Recorded: July 1, 2013Language: English

Authors: Mumford, Stuart, University of Sheffield / SunPy

Track: Astronomy and Astrophysics

Modern solar physicists have, at their disposal, an abundance of space and ground based instruments providing a large amount of data to analyse the complex Sun every day. The NASA Solar Dynamics Observatory satellite, for example, collects around 1.2 TB of data every 24 hours which requires extensive reconstruction before it is ready for scientific use. Currently most data processing and analysis for all solar data is done using IDL and the 'SolarSoft' library. SunPy is a project designed to provide a free, open and easy-to-use Python alternative to IDL and SolarSoft.

SunPy provides unified, coordinate-aware data objects for many common solar data types and integrates into these plotting and analysis tools. Providing this base will give the global solar physics community the opportunity to use Python for future data processing and analysis routines. The astronomy and astrophysics community, through the implementation and adoption of AstroPy and pyRAF, have already demonstrated that Python is well suited for the analysis and processing of space science data.
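
As a flavour of the unified, coordinate-aware data objects mentioned above, a short sketch using SunPy's Map interface and its bundled sample data; note that this follows the interface of more recent SunPy releases and differs in detail from the 0.x versions that were current at the time of the talk.

```python
import sunpy.map
import sunpy.data.sample  # fetches a small sample AIA image on first use

# A Map bundles the image array with its observation and coordinate metadata
aia = sunpy.map.Map(sunpy.data.sample.AIA_171_IMAGE)
print(aia.date, aia.wavelength)

# Quick-look plot with the correct colormap and coordinate frame
aia.peek()
```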

In this presentation, we give key examples of SunPy's structure and scope, as well as the major improvements that have taken place to provide a stable base for future expansion. We discuss recent improvements to file I/O and visualisation, as well as improvements to the structure and interface of the map objects.

We discuss the many challenges which SunPy faces if it is to achieve its goal of becoming a key package for solar physics. The SunPy developers hope to increase the visibility and uptake of SunPy, and encourage people to contribute to the project, while maintaining a high quality code base, which is facilitated by the use of a social version control system (git and GitHub).

The advantages of a scientific IDE; SciPy 2013 Presentation
SciPy 2013
Recorded: July 1, 2013Language: English

Authors: Cordoba, Carlos, The Spyder Project

Track: Reproducible Science

The enormous progress made by the IPython project during the last two years has made many of us -- the Python scientific community -- think that we are quite close to providing an application that can rival the two big M's of the scientific computing world: Matlab and Mathematica.

However, after following the project on GitHub and its mailing list for almost the same time, and especially after reading its roadmap for the next two years, we at Spyder believe that its real focus is different from that aim. IPython developers are working hard to build several powerful and flexible interfaces to evaluate and document code, but they seem to have some trouble going from a console application to a GUI one (e.g. see GitHub issues 1747, 2203, 2522, 2974 and 2985).

We believe Spyder can really help to solve these issues by integrating IPython in a richer and more intuitive, yet powerful, environment. After working with the aforementioned M's, most people expect not only a good evaluation interface but also easy access to rich text documentation, a specialized editor and a namespace browser, tied to good debugging facilities. Spyder already has all these features and, right now, also the best integration with the IPython Qt frontend.

This shows that Spyder can be the perfect complement to IPython, providing what it is missing and aiming to reach a wider audience (not just researchers and graduate students). As the current Spyder maintainer, I would like to attend SciPy to show the community more concretely what our added value to the scientific Python ecosystem is. We would also like to get in closer contact with the community and receive direct feedback to define which features we should add or improve in the next releases.

The Production of a Multi-Resolution Global Index for GIS; SciPy 2013 Presentation
SciPy 2013
Recorded: July 1, 2013Language: English

The Production of a Multi-Resolution Global Index for Geographic Information Systems

Authors: MacManus, Kytt, Columbia University CIESIN

Track: GIS - Geospatial Data Analysis

In order to efficiently access geographic information at the pixel level, at a global scale, it is useful to develop an indexing system with nested location information. Considering a 1 sq. km image resolution, the number of global pixels covering land exceeds 200 million. This talk will summarize the steps taken to produce a global multi-resolution raster indexing system using the Geospatial Data Abstraction Library (GDAL) 1.9, and NumPy. The implications of presenting this data to a user community reliant on Microsoft Office technologies will also be discussed.
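
The talk does not spell out its indexing scheme, but the idea of a nested, multi-resolution index can be illustrated with NumPy alone: each pixel's coarse cell ID prefixes its offset within that cell at a finer resolution. The resolutions and encoding below are purely illustrative and are not the authors' implementation.

```python
import numpy as np

def nested_index(lat, lon, coarse_deg=1.0, fine_per_coarse=120):
    """Illustrative nested index: a ~1-degree coarse cell ID combined with the
    offset of the ~30 arc-second (roughly 1 km) pixel inside that cell."""
    lat = np.asarray(lat, dtype=float)
    lon = np.asarray(lon, dtype=float)
    # Coarse grid: 180 x 360 one-degree cells, row 0 at the north pole
    row_c = np.floor((90.0 - lat) / coarse_deg).astype(np.int64)
    col_c = np.floor((lon + 180.0) / coarse_deg).astype(np.int64)
    coarse_id = row_c * int(360 / coarse_deg) + col_c
    # Fine offset within the coarse cell
    fine_deg = coarse_deg / fine_per_coarse
    row_f = np.floor(((90.0 - lat) % coarse_deg) / fine_deg).astype(np.int64)
    col_f = np.floor(((lon + 180.0) % coarse_deg) / fine_deg).astype(np.int64)
    return coarse_id * fine_per_coarse ** 2 + row_f * fine_per_coarse + col_f

print(nested_index([40.7, -33.9], [-74.0, 18.4]))
```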

Using Python for Structured Prediction; SciPy 2013 Presentation
SciPy 2013
Recorded: July 1, 2013Language: English

Authors: Zinkov, Rob

Track: Machine Learning

Many machine learning problems involve datasets with complex dependencies between the variables we are trying to predict, and even between the data points themselves. Unfortunately most machine learning libraries are unable to model these dependencies and make use of them. In this talk, I will introduce two libraries, pyCRFsuite and pyStruct, and show how they can be used to solve machine learning problems where modeling the relations between data points is crucial for getting reasonable accuracy. I will cover how these libraries can be used for classifying webpages as spam, named entity extraction, and sentiment analysis.
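
A small sketch of the pyCRFsuite side of such a workflow, labelling the tokens of a sentence with a linear-chain CRF; the features, labels, and model file name are toy stand-ins, not material from the talk.

```python
import pycrfsuite

# Toy sequence: per-token feature lists and gold labels (stand-ins only)
xseq = [["word=barack", "is_title"],
        ["word=obama", "is_title"],
        ["word=spoke"]]
yseq = ["B-PER", "I-PER", "O"]

trainer = pycrfsuite.Trainer(verbose=False)
trainer.append(xseq, yseq)                 # in practice, append many sequences
trainer.set_params({"c1": 1.0, "c2": 1e-3, "max_iterations": 50})
trainer.train("toy_ner.crfsuite")          # writes the model to disk

tagger = pycrfsuite.Tagger()
tagger.open("toy_ner.crfsuite")
print(tagger.tag(xseq))                    # predicted label sequence
```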

vIPer, a new tool to work with IPython notebooks; SciPy 2013 Presentations
SciPy 2013
Recorded: July 1, 2013Language: English

Authors: Avila, Damian, OQUANTA;

Track: Reproducible Science

In recent years, Python has gained a lot of interest in the scientific community because several useful tools, well suited to scientific computing research, have been developed. IPython [1], a comprehensive environment for interactive and exploratory computing, has arisen as a must-have application in the daily scientific workflow because it provides not only enhanced interactive Python shells (terminal and Qt-based) but also an interactive browser-based notebook with rich media support [2]. Although the IPython notebook can be run in any of the most widely used web browsers, none of them is particularly well suited to the daily scientific workflow, nor offers features for easily presenting results. So, in this paper, we describe vIPer [3], a PyQt-based web browser designed specifically to host an IPython notebook and equipped with multiple features to address the most common tasks performed by scientific researchers in the publication and spreading of their results, such as:

split notebook views; exporting notebooks to several formats, including PDF documents and HTML archives; video and voice recording; a slide-show view (for oral presentations); and shortcuts for the most common tasks. To conclude, we have developed a better visualization tool for the IPython notebook, suited not only to the daily interactive workflow but also to an enhanced presentation of results in multiple formats.

Why you should write buggy software with as few features as possible; SciPy 2013 Presentation
SciPy 2013
Recorded: July 1, 2013Language: English

Authors: Granger, Brian, Cal Poly San Luis Obispo

Track: General

Everyone knows that in software, features are good and bugs are bad. If this is the case, then it must follow that the best software will have the most features and the fewest bugs. In this talk I will try to convince you that this is a horrible way of thinking about software, especially in the context of open source projects. To accomplish this goal, I will use the development of the IPython Notebook as a case study in software engineering. I will describe what we learned in our numerous missteps, what we did right and how the eventual success of the IPython Notebook radically changed how I view software development. This will clarify why feature and complexity creep need to be actively guarded against and how a well defined scope can help in that battle. I will propose an informal framework for evaluating new feature requests and discuss the social/community aspects of saying no to new features within a project. Finally, I will try to convince you that bugs are a sign of quality software and a healthy community. If I am successful, you will want to go home and write buggy software with as few features as possible.

Writing Reproducible Papers with Dexy; SciPy 2013 Presentation
SciPy 2013
Recorded: July 1, 2013Language: English

Authors: Nelson, Ana

Track: Reproducible Science

Scientists are frequently admonished to create reproducible papers, but one reason so few do is the lack of good tools. Imagine you want to create a reproducible paper that any other researcher in your field can easily grab, look at, and run in order to verify your results. What would you use? A tool intended for generating API documentation? A literate programming tool? Maybe some hand coded scripts binding together a rickety project that nobody but you can run? The truth is, the tools available to scientists to create clean reproducible papers are too limited, not general purpose enough, and not portable enough for sharing with other scientists.

This talk will be a demonstration of generating a paper using Dexy, a tool that has been written from scratch for making it as painless as possible to write truly reproducible technical documents. The whole project workflow will be automated including gathering and cleaning input data, running analysis scripts, producing plots, and compiling output documents to PDF, HTML and ePub output formats. The example project will include software and scripts in Python as well as other programming languages. We will automate running the code, applying syntax highlighting, generating plots and other files as side effects from running the code, and incorporating all these elements into the various documents we wish to create.

Anatomy of Matplotlib, SciPy2013 Tutorial, Part 1 of 3
SciPy 2013
Recorded: June 27, 2013Language: English

Presenter: Benjamin Root

Description

This tutorial will be the introduction to matplotlib, intended for users who want to become familiar with Python's predominant scientific plotting package. First, the plotting functions that are available will be introduced so users will know what kinds of graphs can be made. We will then cover the fundamental concepts and terminology, starting from the figure object down to the artists. In an organized and logical fashion, the components of a matplotlib figure are introduced, such as the axes, axis, tickers, and labels. We will explain what an Artist is for, as well as explain the purpose behind Collections. Finally, we will give an overview of the major toolkits available to use, particularly AxesGrid, mplot3d and basemap.
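
As a taste of the figure anatomy the tutorial walks through, here is a minimal sketch touching the Figure, Axes, an Artist, the ticker machinery, labels, and the legend; this is generic matplotlib usage, not the tutorial's own material.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator

fig, ax = plt.subplots()                        # Figure and Axes containers
x = np.linspace(0, 2 * np.pi, 200)
line, = ax.plot(x, np.sin(x), label="sin(x)")   # a Line2D Artist
line.set_linewidth(2)                           # manipulate an Artist property

ax.set_xlabel("x")                              # axis labels
ax.set_ylabel("amplitude")
ax.set_title("Axes title")
fig.suptitle("Figure suptitle")

ax.xaxis.set_major_locator(MultipleLocator(np.pi / 2))  # ticker control
ax.grid(True)                                   # axis gridlines
ax.legend()
plt.show()
```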

Outline

Outline:

Introduction

Purpose of matplotlib Online Documentation Examples Page Gallery Page FAQs API documentation Mailing Lists Github Repository Bug Reports & Feature Requests What is this "backend" thing I keep hearing about?

Plotting Functions

Graphs (plot, scatter, bar, stem, etc.) Images (imshow, pcolor, pcolormesh, contour[f], etc.) Lesser Knowns: (pie, acorr, hexbin, etc.) Brand New: streamplot() What goes in a Figure?

Axes Axis ticks (and ticklines and ticklabels) (both major & minor) axis labels axes title figure suptitle axis spines colorbars (and the oddities thereof) axis scale axis gridlines legend (Throughout the aforementioned section, I will be guiding audience members through the creation and manipulation of each of these components to produce a fully customized graph)

Introducing matplotlibrc

Hands-On: Have users try making some changes to the settings and see how a resulting figure changes What is an Artist?

Hands-On: Have audience members create some and see if they can get them displayed What is a Collection?

Hands-On: Have audience members create some, manipulate the properties and display them Properties:

color (and edgecolor, linecolor, facecolor, etc...) linewidth and edgewidth and markeredgewidth (and the oddity that happens in errorbar()) linestyle zorder visible What are toolkits?

axes_grid1 mplot3d basemap

Required Packages

NumPy

Matplotlib (version 1.2.1 or later is preferred, but earlier versions should still be sufficient for most of the tutorial)

ipython v0.13

Documentation

https://dl.dropbox.com/u/7325604/AnatomyOfMatplotlib.ipynb

Anatomy of Matplotlib, SciPy2013 Tutorial, Part 2 of 3
SciPy 2013
Recorded: June 27, 2013Language: English

Presenter: Benjamin Root

Description

This tutorial will be the introduction to matplotlib, intended for users who want to become familiar with Python's predominant scientific plotting package. First, the plotting functions that are available will be introduced so users will know what kinds of graphs can be made. We will then cover the fundamental concepts and terminology, starting from the figure object down to the artists. In an organized and logical fashion, the components of a matplotlib figure are introduced, such as the axes, axis, tickers, and labels. We will explain what an Artist is for, as well as explain the purpose behind Collections. Finally, we will give an overview of the major toolkits available to use, particularly AxesGrid, mplot3d and basemap.

Outline

Outline:

Introduction

Purpose of matplotlib Online Documentation Examples Page Gallery Page FAQs API documentation Mailing Lists Github Repository Bug Reports & Feature Requests What is this "backend" thing I keep hearing about?

Plotting Functions

Graphs (plot, scatter, bar, stem, etc.) Images (imshow, pcolor, pcolormesh, contour[f], etc.) Lesser Knowns: (pie, acorr, hexbin, etc.) Brand New: streamplot() What goes in a Figure?

Axes Axis ticks (and ticklines and ticklabels) (both major & minor) axis labels axes title figure suptitle axis spines colorbars (and the oddities thereof) axis scale axis gridlines legend (Throughout the aforementioned section, I will be guiding audience members through the creation and manipulation of each of these components to produce a fully customized graph)

Introducing matplotlibrc

Hands-On: Have users try making some changes to the settings and see how a resulting figure changes What is an Artist?

Hands-On: Have audience members create some and see if they can get them displayed What is a Collection?

Hands-On: Have audience members create some, manipulate the properties and display them Properties:

color (and edgecolor, linecolor, facecolor, etc...) linewidth and edgewidth and markeredgewidth (and the oddity that happens in errorbar()) linestyle zorder visible What are toolkits?

axes_grid1 mplot3d basemap

Required Packages

NumPy

Matplotlib (version 1.2.1 or later is preferred, but earlier versions should still be sufficient for most of the tutorial)

ipython v0.13

Documentation

https://dl.dropbox.com/u/7325604/AnatomyOfMatplotlib.ipynb

Anatomy of Matplotlib, SciPy2013 Tutorial, Part 3 of 3
SciPy 2013
Recorded: June 27, 2013Language: English

Presenter: Benjamin Root

Description

This tutorial will be the introduction to matplotlib, intended for users who want to become familiar with Python's predominant scientific plotting package. First, the plotting functions that are available will be introduced so users will know what kinds of graphs can be made. We will then cover the fundamental concepts and terminology, starting from the figure object down to the artists. In an organized and logical fashion, the components of a matplotlib figure are introduced, such as the axes, axis, tickers, and labels. We will explain what an Artist is for, as well as explain the purpose behind Collections. Finally, we will give an overview of the major toolkits available to use, particularly AxesGrid, mplot3d and basemap.

Outline

Outline:

Introduction

Purpose of matplotlib Online Documentation Examples Page Gallery Page FAQs API documentation Mailing Lists Github Repository Bug Reports & Feature Requests What is this "backend" thing I keep hearing about?

Plotting Functions

Graphs (plot, scatter, bar, stem, etc.) Images (imshow, pcolor, pcolormesh, contour[f], etc.) Lesser Knowns: (pie, acorr, hexbin, etc.) Brand New: streamplot() What goes in a Figure?

Axes Axis ticks (and ticklines and ticklabels) (both major & minor) axis labels axes title figure suptitle axis spines colorbars (and the oddities thereof) axis scale axis gridlines legend (Throughout the aforementioned section, I will be guiding audience members through the creation and manipulation of each of these components to produce a fully customized graph)

Introducing matplotlibrc

Hands-On: Have users try making some changes to the settings and see how a resulting figure changes What is an Artist?

Hands-On: Have audience members create some and see if they can get them displayed What is a Collection?

Hands-On: Have audience members create some, manipulate the properties and display them Properties:

color (and edgecolor, linecolor, facecolor, etc...) linewidth and edgewidth and markeredgewidth (and the oddity that happens in errorbar()) linestyle zorder visible What are toolkits?

axes_grid1 mplot3d basemap

Required Packages

NumPy

Matplotlib (version 1.2.1 or later is preferred, but earlier versions should still be sufficient for most of the tutorial)

ipython v0.13

Documentation

https://dl.dropbox.com/u/7325604/AnatomyOfMatplotlib.ipynb

Cython: Speed up Python and NumPy, Pythonize C, C++, and Fortran, SciPy2013 Tutorial, Part 1 of 4
SciPy 2013
Recorded: June 27, 2013Language: English

Presenter: Kurt Smith

Description

Cython is a flexible and multi-faceted tool that brings down the barrier between Python and other languages. With Cython, you can add type information to your Python code to yield dramatic performance improvements. Cython also allows you to wrap C, C++ and Fortran libraries to work with Python and NumPy. It is used extensively in research environments and in end-user applications.

This hands-on tutorial will cover Cython from the ground up, and will include the newest Cython features, including typed memoryviews.
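
To give a flavour of the "add type information" step, here is a small sketch in Cython's pure-Python mode, which runs unmodified under the plain interpreter and can also be compiled with Cython for speed. The tutorial itself works largely in .pyx files with cdef declarations and typed memoryviews; this example is ours, not course material.

```python
import cython

@cython.locals(n=cython.int, i=cython.int, acc=cython.double)
def harmonic(n):
    """Sum 1/i for i = 1..n.

    Uncompiled, the decorator is a no-op; compiled with Cython, the typed
    locals let the loop run as plain C arithmetic.
    """
    acc = 0.0
    for i in range(1, n + 1):
        acc += 1.0 / i
    return acc

if __name__ == "__main__":
    print(harmonic(1000000))
```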

Target audience:

Developers, researchers, scientists, and engineers who use Python and NumPy and who routinely hit bottlenecks and need improved performance.

C / C++ / Fortran users who would like their existing code to work with Python.

Expected level of knowledge:

Intermediate and / or regular user of Python and NumPy. Have used Python's decorators, exceptions, and classes. Knowledge of NumPy arrays, array views, fancy indexing, and NumPy dtypes. Have programmed in at least one of C, C++, or Fortran.

Some familiarity with the Python or NumPy C-API a plus. Familiarity with memoryviews and buffers a plus. Familiarity with OpenMP a plus. Array-based inter-language programming between Python and C, C++, or Fortran a plus.

Required Packages

All necessary packages are available with an academic / full EPD installation, Anaconda, easy_install, or pip.

Users must have Cython v 0.16 or better for the course.

The tutorial material (slides, exercises & demos) will be available for download and on USB drives.

Documentation

Basic slide content is based on Enthought's Cython training slides. These slides will be reworked significantly for this tutorial. In particular, the NumPy buffer declarations will be taken out and replaced with the typed memoryview content listed in the outline. Other content (an IPython notebook with the start of the capstone project) is available as well:

http://public.enthought.com/~ksmith/scipy2013_cython/

Cython: Speed up Python and NumPy, Pythonize C, C++, and Fortran, SciPy2013 Tutorial, Part 2 of 4
SciPy 2013
Recorded: June 27, 2013Language: English

Presenter: Kurt Smith

Description

Cython is a flexible and multi-faceted tool that brings down the barrier between Python and other languages. With Cython, you can add type information to your Python code to yield dramatic performance improvements. Cython also allows you to wrap C, C++ and Fortran libraries to work with Python and NumPy. It is used extensively in research environments and in end-user applications.

This hands-on tutorial will cover Cython from the ground up, and will include the newest Cython features, including typed memoryviews.

Target audience:

Developers, researchers, scientists, and engineers who use Python and NumPy and who routinely hit bottlenecks and need improved performance.

C / C++ / Fortran users who would like their existing code to work with Python.

Expected level of knowledge:

Intermediate and / or regular user of Python and NumPy. Have used Python's decorators, exceptions, and classes. Knowledge of NumPy arrays, array views, fancy indexing, and NumPy dtypes. Have programmed in at least one of C, C++, or Fortran.

Some familiarity with the Python or NumPy C-API a plus. Familiarity with memoryviews and buffers a plus. Familiarity with OpenMP a plus. Array-based inter-language programming between Python and C, C++, or Fortran a plus.

Required Packages

All necessary packages are available with an academic / full EPD installation, Anaconda, easy_install, or pip.

Users must have Cython v 0.16 or better for the course.

The tutorial material (slides, exercises & demos) will be available for download and on USB drives.

Documentation

Basic slide content is based on Enthought's Cython training slides. These slides will be reworked significantly for this tutorial. In particular, the NumPy buffer declarations will be taken out and replaced with the typed memoryview content listed in the outline. Other content (an IPython notebook with the start of the capstone project) is available as well:

http://public.enthought.com/~ksmith/scipy2013_cython/

Cython: Speed up Python and NumPy, Pythonize C, C++, and Fortran, SciPy2013 Tutorial, Part 3 of 4
SciPy 2013
Recorded: June 27, 2013Language: English

Presenter: Kurt Smith

Description

Cython is a flexible and multi-faceted tool that brings down the barrier between Python and other languages. With Cython, you can add type information to your Python code to yield dramatic performance improvements. Cython also allows you to wrap C, C++ and Fortran libraries to work with Python and NumPy. It is used extensively in research environments and in end-user applications.

This hands-on tutorial will cover Cython from the ground up, and will include the newest Cython features, including typed memoryviews.

Target audience:

Developers, researchers, scientists, and engineers who use Python and NumPy and who routinely hit bottlenecks and need improved performance.

C / C++ / Fortran users who would like their existing code to work with Python.

Expected level of knowledge:

Intermediate and / or regular user of Python and NumPy. Have used Python's decorators, exceptions, and classes. Knowledge of NumPy arrays, array views, fancy indexing, and NumPy dtypes. Have programmed in at least one of C, C++, or Fortran.

Some familiarity with the Python or NumPy C-API a plus. Familiarity with memoryviews and buffers a plus. Familiarity with OpenMP a plus. Array-based inter-language programming between Python and C, C++, or Fortran a plus.

Required Packages

All necessary packages are available with an academic / full EPD installation, Anaconda, easy_install, or pip.

Users must have Cython v 0.16 or better for the course.

The tutorial material (slides, exercises & demos) will be available for download and on USB drives.

Documentation

Basic slide content is based on Enthought's Cython training slides. These slides will be reworked significantly for this tutorial. In particular, the NumPy buffer declarations will be taken out and replaced with the typed memoryview content listed in the outline. Other content (an IPython notebook with the start of the capstone project) is available as well:

http://public.enthought.com/~ksmith/scipy2013_cython/

Cython: Speed up Python and NumPy, Pythonize C, C++, and Fortran, SciPy2013 Tutorial, Part 4 of 4
SciPy 2013
Recorded: June 27, 2013Language: English

Presenter: Kurt Smith

Description

Cython is a flexible and multi-faceted tool that brings down the barrier between Python and other languages. With Cython, you can add type information to your Python code to yield dramatic performance improvements. Cython also allows you to wrap C, C++ and Fortran libraries to work with Python and NumPy. It is used extensively in research environments and in end-user applications.

This hands-on tutorial will cover Cython from the ground up, and will include the newest Cython features, including typed memoryviews.

Target audience:

Developers, researchers, scientists, and engineers who use Python and NumPy and who routinely hit bottlenecks and need improved performance.

C / C++ / Fortran users who would like their existing code to work with Python.

Expected level of knowledge:

Intermediate and / or regular user of Python and NumPy. Have used Python's decorators, exceptions, and classes. Knowledge of NumPy arrays, array views, fancy indexing, and NumPy dtypes. Have programmed in at least one of C, C++, or Fortran.

Some familiarity with the Python or NumPy C-API a plus. Familiarity with memoryviews and buffers a plus. Familiarity with OpenMP a plus. Array-based inter-language programming between Python and C, C++, or Fortran a plus.

Required Packages

All necessary packages are available with an academic / full EPD installation, Anaconda, easy_install, or pip.

Users must have Cython v 0.16 or better for the course.

The tutorial material (slides, exercises & demos) will be available for download and on USB drives.

Documentation

Basic slide content is based on Enthought's Cython training slides. These slides will be reworked significantly for this tutorial. In particular, the NumPy buffer declarations will be taken out and replaced with the typed memoryview content listed in the outline. Other content (an IPython notebook with the start of the capstone project) is available as well:

http://public.enthought.com/~ksmith/scipy2013_cython/

Data Processing with Python, SciPy2013 Tutorial, Part 1 of 3
SciPy 2013
Recorded: June 27, 2013Language: English

Presenters: Ben Zaitlen, Clayton Davis

Description

This tutorial is a crash course in data processing and analysis with Python. We will explore a wide variety of domains and data types (text, time-series, log files, etc.) and demonstrate how Python and a number of accompanying modules can be used for effective scientific expression. Starting with NumPy and Pandas, we will begin with loading, managing, cleaning and exploring real-world data right off the instrument. Next, we will return to NumPy and continue on with SciKit-Learn, focusing on a common dimensionality-reduction technique: PCA.

In the second half of the course, we will introduce Python for Big Data Analysis and introduce two common distributed solutions: IPython Parallel and MapReduce. We will develop several routines commonly used for simultaneous calculations and analysis. Using Disco -- a Python MapReduce framework -- we will introduce the concept of MapReduce and build up several scripts which can process a variety of public data sets. Additionally, users will also learn how to launch and manage their own clusters leveraging AWS and StarCluster.
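
A compressed sketch of the first half of that workflow (load, clean, resample, then reduce dimensionality); the data here are synthetic, and the pandas calls use the current resample API, which differs slightly from the 2013-era syntax.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic "instrument" readings: hourly values for three channels, with gaps
idx = pd.date_range("2013-06-24", periods=24 * 30, freq="h")
df = pd.DataFrame(np.random.randn(len(idx), 3), index=idx, columns=["a", "b", "c"])
df.iloc[::40] = np.nan                          # simulate missing readings

# Clean the gaps, then downsample the hourly series to daily means
daily = df.interpolate().resample("D").mean()

# Dimensionality reduction on the cleaned table
pca = PCA(n_components=2).fit(daily.values)
print(pca.explained_variance_ratio_)
```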

Outline

Setup/Install Check (15) NumPy/Pandas (30) Series Dataframe Missing Data Resampling Plotting PCA (15) NumPy Sci-Kit Learn Parallel-Coordinates MapReduce (30) Intro Disco Hadoop Count Words EC2 and Starcluster (15) IPython Parallel (30) Bitly Links Example (30) Wiki Log Analysis (30)

45 minutes extra for questions, pitfalls, and break

Each student will have access to a 3 node EC2 cluster where they will modify and execute examples. Each cluster will have Anaconda, IPython Notebook, Disco, and Hadoop preconfigured

Required Packages

All examples in this tutorial will use real data. Attendees are expected to have some familiarity with statistical methods and familiarity with common NumPy routines. Users should come with the latest version of Anaconda pre-installed on their laptop and a working SSH client.

Documentation

Preliminary work can be found at: https://github.com/ContinuumIO/tutorials

Data Processing with Python, SciPy2013 Tutorial, Part 2 of 3
SciPy 2013
Recorded: June 27, 2013Language: English

Presenters: Ben Zaitlen, Clayton Davis

Description

This tutorial is a crash course in data processing and analysis with Python. We will explore a wide variety of domains and data types (text, time-series, log files, etc.) and demonstrate how Python and a number of accompanying modules can be used for effective scientific expression. Starting with NumPy and Pandas, we will begin with loading, managing, cleaning and exploring real-world data right off the instrument. Next, we will return to NumPy and continue on with SciKit-Learn, focusing on a common dimensionality-reduction technique: PCA.

In the second half of the course, we will introduce Python for Big Data Analysis and introduce two common distributed solutions: IPython Parallel and MapReduce. We will develop several routines commonly used for simultaneous calculations and analysis. Using Disco -- a Python MapReduce framework -- we will introduce the concept of MapReduce and build up several scripts which can process a variety of public data sets. Additionally, users will also learn how to launch and manage their own clusters leveraging AWS and StarCluster.

Outline

Setup/Install Check (15) NumPy/Pandas (30) Series Dataframe Missing Data Resampling Plotting PCA (15) NumPy Sci-Kit Learn Parallel-Coordinates MapReduce (30) Intro Disco Hadoop Count Words EC2 and Starcluster (15) IPython Parallel (30) Bitly Links Example (30) Wiki Log Analysis (30)

45 minutes extra for questions, pitfalls, and break

Each student will have access to a 3 node EC2 cluster where they will modify and execute examples. Each cluster will have Anaconda, IPython Notebook, Disco, and Hadoop preconfigured

Required Packages

All examples in this tutorial will use real data. Attendees are expected to have some familiarity with statistical methods and familiarity with common NumPy routines. Users should come with the latest version of Anaconda pre-installed on their laptop and a working SSH client.

Documentation

Preliminary work can be found at: https://github.com/ContinuumIO/tutorials

Data Processing with Python, SciPy2013 Tutorial, Part 3 of 3
SciPy 2013
Recorded: June 27, 2013Language: English

Presenters: Ben Zaitlen, Clayton Davis

Description

This tutorial is a crash course in data processing and analysis with Python. We will explore a wide variety of domains and data types (text, time-series, log files, etc.) and demonstrate how Python and a number of accompanying modules can be used for effective scientific expression. Starting with NumPy and Pandas, we will begin with loading, managing, cleaning and exploring real-world data right off the instrument. Next, we will return to NumPy and continue on with SciKit-Learn, focusing on a common dimensionality-reduction technique: PCA.

In the second half of the course, we will introduce Python for Big Data Analysis and introduce two common distributed solutions: IPython Parallel and MapReduce. We will develop several routines commonly used for simultaneous calculations and analysis. Using Disco -- a Python MapReduce framework -- we will introduce the concept of MapReduce and build up several scripts which can process a variety of public data sets. Additionally, users will also learn how to launch and manage their own clusters leveraging AWS and StarCluster.

Outline

Setup/Install Check (15) NumPy/Pandas (30) Series Dataframe Missing Data Resampling Plotting PCA (15) NumPy Sci-Kit Learn Parallel-Coordinates MapReduce (30) Intro Disco Hadoop Count Words EC2 and Starcluster (15) IPython Parallel (30) Bitly Links Example (30) Wiki Log Analysis (30)

45 minutes extra for questions, pitfalls, and break

Each student will have access to a 3 node EC2 cluster where they will modify and execute examples. Each cluster will have Anaconda, IPython Notebook, Disco, and Hadoop preconfigured

Required Packages

All examples in this tutorial will use real data. Attendees are expected to have some familiarity with statistical methods and familiarity with common NumPy routines. Users should come with the latest version of Anaconda pre-installed on their laptop and a working SSH client.

Documentation

Preliminary work can be found at: https://github.com/ContinuumIO/tutorials

Diving into NumPy Code, SciPy2013 Tutorial, Part 1 of 4
SciPy 2013
Recorded: June 27, 2013Language: English

Presenters: David Cournapeau, Stefan Van der Walt

Description

Do you want to contribute to NumPy but find the codebase daunting? Do you want to extend NumPy (e.g. adding support for decimal, or arbitrary precision)? Are you curious to understand how NumPy works at all? Then this tutorial is for you.

The goal of this tutorial is to dive into the NumPy codebase, in particular the core C implementation. You will learn how to build NumPy from sources, how some of the core concepts such as data types and ufuncs are implemented at the C level, and how they are hooked up to the Python runtime. You will also learn how to add a new ufunc and a new data type.

During the tutorial, we will also have a look at various (unix-oriented) tools that can help track bugs or follow a particular NumPy expression from its Python representation to its low-level implementation.

While a working knowledge of C and Python is required, we do not assume a preliminary knowledge of the NumPy codebase. An understanding of Python C extensions is a plus, but not required either.

Outline

The tutorial will be divided in 3 main sections:

Introduction: Why extending numpy in C ? (and perhaps more importantly, when you should not) being ready to develop on NumPy: building from sources, and building with different flags (optimisation and debug) Source code organisation: description of the numpy source tree and high-level description of what belongs where: core vs the rest, core.multiarray, core.ufunc, scalar arrays and support libraries (npysort, npymath)

The main data structures around ndarray:

the arrayobject and data type descriptor, and how they relate to each other; exercise: add a simple array method to the array object; dealing with arbitrary array memory layout with iterators. Adding a new dtype: anatomy of the dtype, from a + a to a core C loop; simple example to wrap a software implementation of quadruple precision (revised version of IEEE 754 software). The current set of planned hands-on tasks/exercises:

building from sources with debug symbols; adding an array method to compute a simple statistic (e.g. kurtosis); adding a new type to handle quadruple precision.

Required Packages

You will need a working C compiler (gcc on Unix/OS X, Visual Studio 2008 on Windows), and to be familiar with how to use it on your platform; git; if possible, gdb and cgdb on Unix; if possible, valgrind and kcachegrind for supported platforms (Linux). A Vagrant VM is available here: https://s3.amazonaws.com/scipy-2013/divingintonumpy/numpy-tuto.box (use Vagrant 1.2.1, as 1.2.2 has a serious bug for sharing files).

Diving into NumPy Code, SciPy2013 Tutorial, Part 2 of 4
SciPy 2013
Recorded: June 27, 2013Language: English

Presenters: David Cournapeau, Stefan Van der Walt

Description

Do you want to contribute to NumPy but find the codebase daunting? Do you want to extend NumPy (e.g. adding support for decimal, or arbitrary precision)? Are you curious to understand how NumPy works at all? Then this tutorial is for you.

The goal of this tutorial is to dive into the NumPy codebase, in particular the core C implementation. You will learn how to build NumPy from sources, how some of the core concepts such as data types and ufuncs are implemented at the C level, and how they are hooked up to the Python runtime. You will also learn how to add a new ufunc and a new data type.

During the tutorial, we will also have a look at various (unix-oriented) tools that can help track bugs or follow a particular NumPy expression from its Python representation to its low-level implementation.

While a working knowledge of C and Python is required, we do not assume a preliminary knowledge of the NumPy codebase. An understanding of Python C extensions is a plus, but not required either.

Outline

The tutorial will be divided in 3 main sections:

Introduction: Why extending numpy in C ? (and perhaps more importantly, when you should not) being ready to develop on NumPy: building from sources, and building with different flags (optimisation and debug) Source code organisation: description of the numpy source tree and high-level description of what belongs where: core vs the rest, core.multiarray, core.ufunc, scalar arrays and support libraries (npysort, npymath)

The main data structures around ndarray:

the arrayobject and data type descriptor, and how they relate to each other; exercise: add a simple array method to the array object; dealing with arbitrary array memory layout with iterators. Adding a new dtype: anatomy of the dtype, from a + a to a core C loop; simple example to wrap a software implementation of quadruple precision (revised version of IEEE 754 software). The current set of planned hands-on tasks/exercises:

building from sources with debug symbols; adding an array method to compute a simple statistic (e.g. kurtosis); adding a new type to handle quadruple precision.

Required Packages

You will need a working C compiler (gcc on Unix/OS X, Visual Studio 2008 on Windows), and to be familiar with how to use it on your platform; git; if possible, gdb and cgdb on Unix; if possible, valgrind and kcachegrind for supported platforms (Linux). A Vagrant VM is available here: https://s3.amazonaws.com/scipy-2013/divingintonumpy/numpy-tuto.box (use Vagrant 1.2.1, as 1.2.2 has a serious bug for sharing files).

Diving into NumPy Code, SciPy2013 Tutorial, Part 3 of 4
SciPy 2013
Recorded: June 27, 2013Language: English

Presenters: David Cournapeau, Stefan Van der Walt

Description

Do you want to contribute to NumPy but find the codebase daunting? Do you want to extend NumPy (e.g. adding support for decimal, or arbitrary precision)? Are you curious to understand how NumPy works at all? Then this tutorial is for you.

The goal of this tutorial is to dive into the NumPy codebase, in particular the core C implementation. You will learn how to build NumPy from sources, how some of the core concepts such as data types and ufuncs are implemented at the C level, and how they are hooked up to the Python runtime. You will also learn how to add a new ufunc and a new data type.

During the tutorial, we will also have a look at various (unix-oriented) tools that can help track bugs or follow a particular NumPy expression from its Python representation to its low-level implementation.

While a working knowledge of C and Python is required, we do not assume a preliminary knowledge of the NumPy codebase. An understanding of Python C extensions is a plus, but not required either.

Outline

The tutorial will be divided in 3 main sections:

Introduction: Why extending numpy in C ? (and perhaps more importantly, when you should not) being ready to develop on NumPy: building from sources, and building with different flags (optimisation and debug) Source code organisation: description of the numpy source tree and high-level description of what belongs where: core vs the rest, core.multiarray, core.ufunc, scalar arrays and support libraries (npysort, npymath)

The main data structures around ndarray:

the arrayobject and data type descriptor, and how they relate to each other; exercise: add a simple array method to the array object; dealing with arbitrary array memory layout with iterators. Adding a new dtype: anatomy of the dtype, from a + a to a core C loop; simple example to wrap a software implementation of quadruple precision (revised version of IEEE 754 software). The current set of planned hands-on tasks/exercises:

building from sources with debug symbols; adding an array method to compute a simple statistic (e.g. kurtosis); adding a new type to handle quadruple precision.

Required Packages

You will need a working C compiler (gcc on Unix/OS X, Visual Studio 2008 on Windows), and to be familiar with how to use it on your platform; git; if possible, gdb and cgdb on Unix; if possible, valgrind and kcachegrind for supported platforms (Linux). A Vagrant VM is available here: https://s3.amazonaws.com/scipy-2013/divingintonumpy/numpy-tuto.box (use Vagrant 1.2.1, as 1.2.2 has a serious bug for sharing files).

Diving into NumPy Code, SciPy2013 Tutorial, Part 4 of 4
SciPy 2013
Recorded: June 27, 2013Language: English

Presenters: David Cournapeau, Stefan Van der Walt

Description

Do you want to contribute to NumPy but find the codebase daunting? Do you want to extend NumPy (e.g. adding support for decimal, or arbitrary precision)? Are you curious to understand how NumPy works at all? Then this tutorial is for you.

The goal of this tutorial is to dive into the NumPy codebase, in particular the core C implementation. You will learn how to build NumPy from sources, how some of the core concepts such as data types and ufuncs are implemented at the C level, and how they are hooked up to the Python runtime. You will also learn how to add a new ufunc and a new data type.

During the tutorial, we will also have a look at various (unix-oriented) tools that can help track bugs or follow a particular NumPy expression from its Python representation to its low-level implementation.

While a working knowledge of C and Python is required, we do not assume a preliminary knowledge of the NumPy codebase. An understanding of Python C extensions is a plus, but not required either.

Outline

The tutorial will be divided in 3 main sections:

Introduction: Why extending numpy in C ? (and perhaps more importantly, when you should not) being ready to develop on NumPy: building from sources, and building with different flags (optimisation and debug) Source code organisation: description of the numpy source tree and high-level description of what belongs where: core vs the rest, core.multiarray, core.ufunc, scalar arrays and support libraries (npysort, npymath)

The main data structures around ndarray:

the arrayobject and data type descriptor, and how they relate to each other; exercise: add a simple array method to the array object; dealing with arbitrary array memory layout with iterators. Adding a new dtype: anatomy of the dtype, from a + a to a core C loop; simple example to wrap a software implementation of quadruple precision (revised version of IEEE 754 software). The current set of planned hands-on tasks/exercises:

building from sources with debug symbols; adding an array method to compute a simple statistic (e.g. kurtosis); adding a new type to handle quadruple precision.

Required Packages

You will need a working C compiler (gcc on Unix/OS X, Visual Studio 2008 on Windows), and to be familiar with how to use it on your platform; git; if possible, gdb and cgdb on Unix; if possible, valgrind and kcachegrind for supported platforms (Linux). A Vagrant VM is available here: https://s3.amazonaws.com/scipy-2013/divingintonumpy/numpy-tuto.box (use Vagrant 1.2.1, as 1.2.2 has a serious bug for sharing files).

Intro to scikit-learn (II), SciPy2013 Tutorial, Part 1 of 2
SciPy 2013
Recorded: June 27, 2013Language: English

Presenters: Gaël Varoquaux, Jake Vanderplas, Olivier Grisel

Description

Machine Learning is the branch of computer science concerned with the development of algorithms which can learn from previously-seen data in order to make predictions about future data, and has become an important part of research in many scientific fields. This set of tutorials will introduce the basics of machine learning, and how these learning tasks can be accomplished using Scikit-Learn, a machine learning library written in Python and built on NumPy, SciPy, and Matplotlib. By the end of the tutorials, participants will be poised to take advantage of Scikit-learn's wide variety of machine learning algorithms to explore their own data sets. The tutorial will comprise two sessions, Session I in the morning (intermediate track), and Session II in the afternoon (advanced track). Participants are free to attend either one or both, but to get the most out of the material, we encourage those attending in the afternoon to attend in the morning as well.

Session II will build upon Session I, and assume familiarity with the concepts covered there. The goals of Session II are to introduce more involved algorithms and techniques which are vital for successfully applying machine learning in practice. It will cover cross-validation and hyperparameter optimization, unsupervised algorithms, pipelines, and go into depth on a few extremely powerful learning algorithms available in Scikit-learn: Support Vector Machines, Random Forests, and Sparse Models. We will finish with an extended exercise applying scikit-learn to a real-world problem.
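
A condensed illustration of several of those Session II themes (pipelines, hyperparameter search with cross-validation, SVMs) on scikit-learn's bundled digits data; the imports follow the current scikit-learn module layout, which differs from the 0.13-era layout listed in the requirements.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

digits = load_digits()

# Unsupervised reduction feeding a supervised learner, tuned by grid search
pipe = Pipeline([("reduce", PCA(n_components=30)), ("svm", SVC())])
param_grid = {"svm__C": [0.1, 1, 10], "svm__gamma": [1e-3, 1e-2]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(digits.data, digits.target)
print(search.best_params_, round(search.best_score_, 3))
```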

Outline

Tutorial 2 (advanced track)

0:00 - 0:30 -- Model validation and testing
- Bias, Variance, Over-fitting, Under-fitting
- Using validation curves & learning to improve your model
- Exercise: Tuning a random forest for the digits data

0:30 - 1:30 -- In depth with a few learners
- SVMs and kernels
- Trees and forests
- Sparse and non-sparse linear models

1:30 - 2:00 -- Unsupervised Learning
- Example of Dimensionality Reduction: hand-written digits
- Example of Clustering: Olivetti Faces

2:00 - 2:15 -- Pipelining learners
- Examples of unsupervised data reduction followed by supervised learning

2:15 - 2:30 -- Break (possibly in the middle of the previous section)

2:30 - 3:00 -- Learning on big data
- Online learning: MiniBatchKMeans
- Stochastic Gradient Descent for linear models
- Data-reducing transforms: random projections

3:00 - 4:00 -- Parallel Machine Learning with IPython
- IPython.parallel, a short primer
- Parallel Model Assessment and Selection
- Running a cluster on the EC2 cloud using StarCluster
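
As a taste of the cross-validation and random-forest material outlined above, here is a minimal illustrative sketch (not taken from the tutorial notebooks). Note that scikit-learn's module layout has changed since the 0.13 release used at the conference; the snippet uses the modern sklearn.model_selection names, whereas the 2013 equivalents lived in sklearn.cross_validation and sklearn.grid_search.

    from sklearn.datasets import load_digits
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, cross_val_score

    X, y = load_digits(return_X_y=True)

    # 5-fold cross-validated accuracy of a default random forest.
    scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
    print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

    # Tune a couple of hyperparameters on the digits data, as in the exercise.
    grid = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={"n_estimators": [50, 100], "max_depth": [None, 10]},
        cv=3,
    )
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)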

https://github.com/jakevdp/sklearn_scipy2013

Required Packages

This tutorial will use Python 2.6 / 2.7, and require recent versions of numpy (version 1.5+), scipy (version 0.10+), matplotlib (version 1.1+), scikit-learn (version 0.13.1+), and IPython (version 0.13.1+) with notebook support. The final requirement is particularly important: participants should be able to run IPython notebook and create & manipulate notebooks in their web browser. The easiest way to install these requirements is to use a packaged distribution: we recommend Anaconda CE, a free package provided by Continuum Analytics: http://continuum.io/downloads.html or the Enthought Python Distribution: http://www.enthought.com/products/epd_free.php

Intro to scikit-learn (II), SciPy2013 Tutorial, Part 2 of 2
SciPy 2013
Recorded: June 27, 2013Language: English

Presenters: Gaël Varoquaux, Jake Vanderplas, Olivier Grisel

Description

Machine Learning is the branch of computer science concerned with the development of algorithms which can learn from previously-seen data in order to make predictions about future data, and has become an important part of research in many scientific fields. This set of tutorials will introduce the basics of machine learning, and how these learning tasks can be accomplished using Scikit-Learn, a machine learning library written in Python and built on NumPy, SciPy, and Matplotlib. By the end of the tutorials, participants will be poised to take advantage of Scikit-learn's wide variety of machine learning algorithms to explore their own data sets. The tutorial will comprise two sessions, Session I in the morning (intermediate track), and Session II in the afternoon (advanced track). Participants are free to attend either one or both, but to get the most out of the material, we encourage those attending in the afternoon to attend in the morning as well.

Session II will build upon Session I, and assume familiarity with the concepts covered there. The goals of Session II are to introduce more involved algorithms and techniques which are vital for successfully applying machine learning in practice. It will cover cross-validation and hyperparameter optimization, unsupervised algorithms, pipelines, and go into depth on a few extremely powerful learning algorithms available in Scikit-learn: Support Vector Machines, Random Forests, and Sparse Models. We will finish with an extended exercise applying scikit-learn to a real-world problem.

Outline

Tutorial 2 (advanced track)

0:00 - 0:30 -- Model validation and testing
- Bias, Variance, Over-fitting, Under-fitting
- Using validation curves & learning to improve your model
- Exercise: Tuning a random forest for the digits data

0:30 - 1:30 -- In depth with a few learners
- SVMs and kernels
- Trees and forests
- Sparse and non-sparse linear models

1:30 - 2:00 -- Unsupervised Learning
- Example of Dimensionality Reduction: hand-written digits
- Example of Clustering: Olivetti Faces

2:00 - 2:15 -- Pipelining learners
- Examples of unsupervised data reduction followed by supervised learning

2:15 - 2:30 -- Break (possibly in the middle of the previous section)

2:30 - 3:00 -- Learning on big data
- Online learning: MiniBatchKMeans
- Stochastic Gradient Descent for linear models
- Data-reducing transforms: random projections

3:00 - 4:00 -- Parallel Machine Learning with IPython
- IPython.parallel, a short primer
- Parallel Model Assessment and Selection
- Running a cluster on the EC2 cloud using StarCluster

https://github.com/jakevdp/sklearn_scipy2013

Required Packages

This tutorial will use Python 2.6 / 2.7, and require recent versions of numpy (version 1.5+), scipy (version 0.10+), matplotlib (version 1.1+), scikit-learn (version 0.13.1+), and IPython (version 0.13.1+) with notebook support. The final requirement is particularly important: participants should be able to run IPython notebook and create & manipulate notebooks in their web browser. The easiest way to install these requirements is to use a packaged distribution: we recommend Anaconda CE, a free package provided by Continuum Analytics: http://continuum.io/downloads.html or the Enthought Python Distribution: http://www.enthought.com/products/epd_free.php

Intro to scikit-learn (I), SciPy2013 Tutorial, Part 1 of 3
SciPy 2013
Recorded: June 27, 2013Language: English

Presenters: Gaël Varoquaux, Jake Vanderplas, Olivier Grisel

Description

Machine Learning is the branch of computer science concerned with the development of algorithms which can learn from previously-seen data in order to make predictions about future data, and has become an important part of research in many scientific fields. This set of tutorials will introduce the basics of machine learning, and how these learning tasks can be accomplished using Scikit-Learn, a machine learning library written in Python and built on NumPy, SciPy, and Matplotlib. By the end of the tutorials, participants will be poised to take advantage of Scikit-learn's wide variety of machine learning algorithms to explore their own data sets. The tutorial will comprise two sessions, Session I in the morning (intermediate track), and Session II in the afternoon (advanced track). Participants are free to attend either one or both, but to get the most out of the material, we encourage those attending in the afternoon to attend in the morning as well.

Session I will assume participants already have a basic knowledge of using numpy and matplotlib for manipulating and visualizing data. It will require no prior knowledge of machine learning or scikit-learn. The goals of Session I are to introduce participants to the basic concepts of machine learning, to give a hands-on introduction to using Scikit-learn for machine learning in Python, and give participants experience with several practical examples and applications of applying supervised learning to a variety of data. It will cover basic classification and regression problems, regularization of learning models, basic cross-validation, and some examples from text mining and image processing, all using the tools available in scikit-learn.

Outline

Tutorial 1 (intermediate track)

0:00 - 0:15 -- Setup and Introduction

0:15 - 0:30 -- Quick review of data visualization with matplotlib and numpy

0:30 - 1:00 -- Representation of data in machine learning
- Downloading data within scikit-learn
- Categorical & Image data
- Exercise: vectorization of text documents

1:00 - 2:00 -- Basic principles of Machine Learning & the scikit-learn interface
- Supervised Learning: Classification & Regression
- Unsupervised Learning: Clustering & Dimensionality Reduction
- Example of PCA for data visualization
- Flow chart: how do I choose what to do with my data set?
- Exercise: Interactive Demo on linearly separable data
- Regularization: what it is and why it is necessary

2:00 - 2:15 -- Break (possibly in the middle of the previous section)

2:15 - 3:00 -- Supervised Learning
- Example of Classification: hand-written digits
- Cross-validation: measuring prediction accuracy
- Example of Regression: Boston house prices

3:00 - 4:15 -- Applications
- Examples from text mining
- Examples from image processing
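
For a preview of the basic fit/predict/score workflow and the PCA visualization step listed above, here is a small illustrative sketch (again written against current scikit-learn module names rather than the 0.13-era ones).

    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # A basic supervised-learning workflow: fit, predict, score.
    clf = LogisticRegression(max_iter=5000)
    clf.fit(X_train, y_train)
    print("test accuracy:", clf.score(X_test, y_test))

    # PCA projection to 2 components, as used for data visualization.
    X_2d = PCA(n_components=2).fit_transform(X)
    print(X_2d.shape)   # (1797, 2)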

https://github.com/jakevdp/sklearn_scipy2013

Required Packages

This tutorial will use Python 2.6 / 2.7, and require recent versions of numpy (version 1.5+), scipy (version 0.10+), matplotlib (version 1.1+), scikit-learn (version 0.13.1+), and IPython (version 0.13.1+) with notebook support. The final requirement is particularly important: participants should be able to run IPython notebook and create & manipulate notebooks in their web browser. The easiest way to install these requirements is to use a packaged distribution: we recommend Anaconda CE, a free package provided by Continuum Analytics: http://continuum.io/downloads.html or the Enthought Python Distribution: http://www.enthought.com/products/epd_free.php

Intro to scikit-learn (I), SciPy2013 Tutorial, Part 2 of 3
SciPy 2013
Recorded: June 27, 2013Language: English

Presenters: Gaël Varoquaux, Jake Vanderplas, Olivier Grisel

Description

Machine Learning is the branch of computer science concerned with the development of algorithms which can learn from previously-seen data in order to make predictions about future data, and has become an important part of research in many scientific fields. This set of tutorials will introduce the basics of machine learning, and how these learning tasks can be accomplished using Scikit-Learn, a machine learning library written in Python and built on NumPy, SciPy, and Matplotlib. By the end of the tutorials, participants will be poised to take advantage of Scikit-learn's wide variety of machine learning algorithms to explore their own data sets. The tutorial will comprise two sessions, Session I in the morning (intermediate track), and Session II in the afternoon (advanced track). Participants are free to attend either one or both, but to get the most out of the material, we encourage those attending in the afternoon to attend in the morning as well.

Session I will assume participants already have a basic knowledge of using numpy and matplotlib for manipulating and visualizing data. It will require no prior knowledge of machine learning or scikit-learn. The goals of Session I are to introduce participants to the basic concepts of machine learning, to give a hands-on introduction to using Scikit-learn for machine learning in Python, and give participants experience with several practical examples and applications of applying supervised learning to a variety of data. It will cover basic classification and regression problems, regularization of learning models, basic cross-validation, and some examples from text mining and image processing, all using the tools available in scikit-learn.

Outline

Tutorial 1 (intermediate track)

0:00 - 0:15 -- Setup and Introduction

0:15 - 0:30 -- Quick review of data visualization with matplotlib and numpy

0:30 - 1:00 -- Representation of data in machine learning
- Downloading data within scikit-learn
- Categorical & Image data
- Exercise: vectorization of text documents

1:00 - 2:00 -- Basic principles of Machine Learning & the scikit-learn interface
- Supervised Learning: Classification & Regression
- Unsupervised Learning: Clustering & Dimensionality Reduction
- Example of PCA for data visualization
- Flow chart: how do I choose what to do with my data set?
- Exercise: Interactive Demo on linearly separable data
- Regularization: what it is and why it is necessary

2:00 - 2:15 -- Break (possibly in the middle of the previous section)

2:15 - 3:00 -- Supervised Learning
- Example of Classification: hand-written digits
- Cross-validation: measuring prediction accuracy
- Example of Regression: Boston house prices

3:00 - 4:15 -- Applications
- Examples from text mining
- Examples from image processing

https://github.com/jakevdp/sklearn_scipy2013

Required Packages

This tutorial will use Python 2.6 / 2.7, and require recent versions of numpy (version 1.5+), scipy (version 0.10+), matplotlib (version 1.1+), scikit-learn (version 0.13.1+), and IPython (version 0.13.1+) with notebook support. The final requirement is particularly important: participants should be able to run IPython notebook and create & manipulate notebooks in their web browser. The easiest way to install these requirements is to use a packaged distribution: we recommend Anaconda CE, a free package provided by Continuum Analytics: http://continuum.io/downloads.html or the Enthought Python Distribution: http://www.enthought.com/products/epd_free.php

Intro to scikit-learn (I), SciPy2013 Tutorial, Part 3 of 3
SciPy 2013
Recorded: June 27, 2013Language: English

Presenters: Gaël Varoquaux, Jake Vanderplas, Olivier Grisel

Description

Machine Learning is the branch of computer science concerned with the development of algorithms which can learn from previously-seen data in order to make predictions about future data, and has become an important part of research in many scientific fields. This set of tutorials will introduce the basics of machine learning, and how these learning tasks can be accomplished using Scikit-Learn, a machine learning library written in Python and built on NumPy, SciPy, and Matplotlib. By the end of the tutorials, participants will be poised to take advantage of Scikit-learn's wide variety of machine learning algorithms to explore their own data sets. The tutorial will comprise two sessions, Session I in the morning (intermediate track), and Session II in the afternoon (advanced track). Participants are free to attend either one or both, but to get the most out of the material, we encourage those attending in the afternoon to attend in the morning as well.

Session I will assume participants already have a basic knowledge of using numpy and matplotlib for manipulating and visualizing data. It will require no prior knowledge of machine learning or scikit-learn. The goals of Session I are to introduce participants to the basic concepts of machine learning, to give a hands-on introduction to using Scikit-learn for machine learning in Python, and give participants experience with several practical examples and applications of applying supervised learning to a variety of data. It will cover basic classification and regression problems, regularization of learning models, basic cross-validation, and some examples from text mining and image processing, all using the tools available in scikit-learn.

Outline

Tutorial 1 (intermediate track)

0:00 - 0:15 -- Setup and Introduction

0:15 - 0:30 -- Quick review of data visualization with matplotlib and numpy

0:30 - 1:00 -- Representation of data in machine learning
- Downloading data within scikit-learn
- Categorical & Image data
- Exercise: vectorization of text documents

1:00 - 2:00 -- Basic principles of Machine Learning & the scikit-learn interface
- Supervised Learning: Classification & Regression
- Unsupervised Learning: Clustering & Dimensionality Reduction
- Example of PCA for data visualization
- Flow chart: how do I choose what to do with my data set?
- Exercise: Interactive Demo on linearly separable data
- Regularization: what it is and why it is necessary

2:00 - 2:15 -- Break (possibly in the middle of the previous section)

2:15 - 3:00 -- Supervised Learning
- Example of Classification: hand-written digits
- Cross-validation: measuring prediction accuracy
- Example of Regression: Boston house prices

3:00 - 4:15 -- Applications
- Examples from text mining
- Examples from image processing

https://github.com/jakevdp/sklearn_scipy2013

Required Packages

This tutorial will use Python 2.6 / 2.7, and require recent versions of numpy (version 1.5+), scipy (version 0.10+), matplotlib (version 1.1+), scikit-learn (version 0.13.1+), and IPython (version 0.13.1+) with notebook support. The final requirement is particularly important: participants should be able to run IPython notebook and create & manipulate notebooks in their web browser. The easiest way to install these requirements is to use a packaged distribution: we recommend Anaconda CE, a free package provided by Continuum Analytics: http://continuum.io/downloads.html or the Enthought Python Distribution: http://www.enthought.com/products/epd_free.php

IPython in Depth, SciPy2013 Tutorial, Part 1 of 3
SciPy 2013
Recorded: June 27, 2013Language: English

Presenters: Fernando Perez, Brian Granger

Description

IPython provides tools for interactive and parallel computing that are widely used in scientific computing, but can benefit any Python developer.

We will show how to use IPython in different ways, as: an interactive shell, an embedded shell, a graphical console, a network-aware VM in GUIs, a web-based notebook with code, graphics and rich HTML, and a high-level framework for parallel computing.

Outline

IPython started in 2001 simply as a better interactive Python shell. Over the last decade it has grown into a powerful set of interlocking tools that maximize developer productivity in Python while working interactively.

Today, IPython consists of a kernel that executes the user code and provides many features for introspection and namespace manipulation, and tools to control this kernel either in-process or out-of-process thanks to a well-specified communications protocol implemented over ZeroMQ. This architecture allows the core features to be accessed via a variety of clients, each providing unique functionality tuned to a specific use case:

An interactive, terminal-based shell with capabilities beyond the default Python interactive interpreter (this is the classic application opened by the ipython command that most users are familiar with).

A Qt console that provides the look and feel of a terminal, but adds support for inline figures, graphical calltips, a persistent session that can survive crashes of the kernel process, and more.

A web-based notebook that can execute code and also contain rich text and figures, mathematical equations and arbitrary HTML. This notebook presents a document-like view with cells where code is executed but that can be edited in-place, reordered, mixed with explanatory text and figures, etc. The notebook provides an interactive experience that combines live code and results with literate documentation and the rich media that modern browsers can display.

A high-performance, low-latency system for parallel computing that supports the control of a cluster of IPython engines communicating over ZeroMQ, with optimizations that minimize unnecessary copying of large objects (especially numpy arrays). These engines can be controlled interactively while developing and doing exploratory work, or can run in batch mode either on a local machine or in a large cluster/supercomputing environment via a batch scheduler.

In this hands-on, in-depth tutorial, we will briefly describe IPython's architecture and will then show how to use the above tools for a highly productive workflow in Python.
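
As a small illustration (not from the tutorial repository) of two of the modes described above, the snippet below embeds an IPython shell inside an ordinary script and notes a few of the line magics available in any IPython session.

    # Run this file with plain `python`; execution pauses at embed() and drops
    # you into a full IPython shell with access to the local namespace.
    import numpy as np
    from IPython import embed

    data = np.random.randn(1000)
    embed()   # try `data.mean()`, tab completion, `data?`, `%timeit data.sum()`

    # Inside any IPython session (terminal, Qt console, or notebook) you can
    # also use magics such as:
    #   %timeit data.sum()
    #   %history -n
    #   %run my_script.py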

Required Packages

Enthought Canopy, Anaconda, or Linux packages for IPython, NumPy, Matplotlib, and SymPy. See http://ipython.org/install.html for further installation details.

IPython version 0.13.1 or higher will be required.

Documentation

A GitHub repo with our tutorial materials:

https://github.com/ipython/ipython-in-depth

IPython in Depth, SciPy2013 Tutorial, Part 2 of 3
SciPy 2013
Recorded: June 27, 2013Language: English

Presenters: Fernando Perez, Brian Granger

Description

IPython provides tools for interactive and parallel computing that are widely used in scientific computing, but can benefit any Python developer.

We will show how to use IPython in different ways, as: an interactive shell, an embedded shell, a graphical console, a network-aware VM in GUIs, a web-based notebook with code, graphics and rich HTML, and a high-level framework for parallel computing.

Outline

IPython started in 2001 simply as a better interactive Python shell. Over the last decade it has grown into a powerful set of interlocking tools that maximize developer productivity in Python while working interactively.

Today, IPython consists of a kernel that executes the user code and provides many features for introspection and namespace manipulation, and tools to control this kernel either in-process or out-of-process thanks to a well-specified communications protocol implemented over ZeroMQ. This architecture allows the core features to be accessed via a variety of clients, each providing unique functionality tuned to a specific use case:

An interactive, terminal-based shell with capabilities beyond the default Python interactive interpreter (this is the classic application opened by the ipython command that most users are familiar with).

A Qt console that provides the look and feel of a terminal, but adds support for inline figures, graphical calltips, a persistent session that can survive crashes of the kernel process, and more.

A web-based notebook that can execute code and also contain rich text and figures, mathematical equations and arbitrary HTML. This notebook presents a document-like view with cells where code is executed but that can be edited in-place, reordered, mixed with explanatory text and figures, etc. The notebook provides an interactive experience that combines live code and results with literate documentation and the rich media that modern browsers can display.

A high-performance, low-latency system for parallel computing that supports the control of a cluster of IPython engines communicating over ZeroMQ, with optimizations that minimize unnecessary copying of large objects (especially numpy arrays). These engines can be controlled interactively while developing and doing exploratory work, or can run in batch mode either on a local machine or in a large cluster/supercomputing environment via a batch scheduler.

In this hands-on, in-depth tutorial, we will briefly describe IPython's architecture and will then show how to use the above tools for a highly productive workflow in Python.

Required Packages

Enthought Canopy, Anaconda, or Linux packages for IPython, NumPy, Matplotlib, and SymPy. See http://ipython.org/install.html for further installation details.

IPython version 0.13.1 or higher will be required.

Documentation

A GitHub repo with our tutorial materials:

https://github.com/ipython/ipython-in-depth

IPython in Depth, SciPy2013 Tutorial, Part 3 of 3
SciPy 2013
Recorded: June 27, 2013Language: English

Presenters: Fernando Perez, Brian Granger

Description

IPython provides tools for interactive and parallel computing that are widely used in scientific computing, but can benefit any Python developer.

We will show how to use IPython in different ways, as: an interactive shell, an embedded shell, a graphical console, a network-aware VM in GUIs, a web-based notebook with code, graphics and rich HTML, and a high-level framework for parallel computing.

Outline

IPython started in 2001 simply as a better interactive Python shell. Over the last decade it has grown into a powerful set of interlocking tools that maximize developer productivity in Python while working interactively.

Today, IPython consists of a kernel that executes the user code and provides many features for introspection and namespace manipulation, and tools to control this kernel either in-process or out-of-process thanks to a well-specified communications protocol implemented over ZeroMQ. This architecture allows the core features to be accessed via a variety of clients, each providing unique functionality tuned to a specific use case:

An interactive, terminal-based shell with capabilities beyond the default Python interactive interpreter (this is the classic application opened by the ipython command that most users are familiar with).

A Qt console that provides the look and feel of a terminal, but adds support for inline figures, graphical calltips, a persistent session that can survive crashes of the kernel process, and more.

A web-based notebook that can execute code and also contain rich text and figures, mathematical equations and arbitrary HTML. This notebook presents a document-like view with cells where code is executed but that can be edited in-place, reordered, mixed with explanatory text and figures, etc. The notebook provides an interactive experience that combines live code and results with literate documentation and the rich media that modern browsers can display.

A high-performance, low-latency system for parallel computing that supports the control of a cluster of IPython engines communicating over ZeroMQ, with optimizations that minimize unnecessary copying of large objects (especially numpy arrays). These engines can be controlled interactively while developing and doing exploratory work, or can run in batch mode either on a local machine or in a large cluster/supercomputing environment via a batch scheduler.

In this hands-on, in-depth tutorial, we will briefly describe IPython's architecture and will then show how to use the above tools for a highly productive workflow in Python.

Required Packages

Enthought Canopy, Anaconda, or Linux packages for IPython, NumPy, Matplotlib, and SymPy. See http://ipython.org/install.html for further installation details.

IPython version 0.13.1 or higher will be required.

Documentation

A GitHub repo with our tutorial materials:

https://github.com/ipython/ipython-in-depth

NumPy and IPython, SciPy2013 Tutorial, Part 1 of 2
SciPy 2013
Recorded: June 27, 2013Language: English

Presenter: Valentin Haenel

Description

This tutorial is a hands-on introduction to the two most basic building-blocks of the scientific Python stack: the enhanced interactive interpreter IPython and the fast numerical container Numpy. Amongst other things you will learn how to structure an interactive workflow for scientific computing and how to create and manipulate numerical data efficiently. You should have some basic familiarity with Python (variables, loops, functions) and basic command-line usage (executing commands, using history).

Outline

IPython (1 hour)

- Using the IPython notebook
- Help system, magic functions, aliases and history

NumPy (3 hours)

- Basic arrays, dtypes and numerical operations
- Indexing, slicing, reshaping and broadcasting
- Copies, views and fancy indexing

The tutorial will feature short bursts of small exercises every 5-10 minutes.
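
The NumPy hours cover material of roughly the following flavour; this is an illustrative sketch only, touching broadcasting, views versus copies, and fancy indexing.

    import numpy as np

    a = np.arange(12).reshape(3, 4)          # basic array creation and reshaping
    print(a.dtype, a.shape)

    # Broadcasting: the 1-D row is stretched across all three rows of `a`.
    row = np.array([10, 20, 30, 40])
    print(a + row)

    # Slicing returns a *view*: modifying it modifies the original array.
    view = a[:, :2]
    view[0, 0] = 99
    print(a[0, 0])            # 99

    # Fancy indexing returns a *copy*.
    copy = a[[0, 2], :]
    copy[0, 0] = -1
    print(a[0, 0])            # still 99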

Required Packages

An install of Anaconda should be enough

- NumPy (version 1.6 or higher)
- IPython (version 0.13 or higher)
- Matplotlib (version 1.2.1 or higher)

Documentation

I have converted a large part of the NumPy chapter from the Python Scientific Lecture Notes to IPython notebook format using the sphinx2ipynb converter from the nbconvert project. All materials are collected in my scipy2013-tutorial-numpy-ipython GitHub repository at https://github.com/esc/scipy2013-tutorial-numpy-ipython or http://git.io/bocNDg.

NumPy and IPython, SciPy2013 Tutorial, Part 2 of 2
SciPy 2013
Recorded: June 27, 2013Language: English

Description

This tutorial is a hands-on introduction to the two most basic building-blocks of the scientific Python stack: the enhanced interactive interpreter IPython and the fast numerical container Numpy. Amongst other things you will learn how to structure an interactive workflow for scientific computing and how to create and manipulate numerical data efficiently. You should have some basic familiarity with Python (variables, loops, functions) and basic command-line usage (executing commands, using history).

Outline

IPython (1 hour)

- Using the IPython notebook
- Help system, magic functions, aliases and history

NumPy (3 hours)

- Basic arrays, dtypes and numerical operations
- Indexing, slicing, reshaping and broadcasting
- Copies, views and fancy indexing

The tutorial will feature short bursts of small exercises every 5-10 minutes.

Required Packages

An install of Anaconda should be enough

- NumPy (version 1.6 or higher)
- IPython (version 0.13 or higher)
- Matplotlib (version 1.2.1 or higher)

Documentation

I have converted a large part of the NumPy chapter from the Python Scientific Lecture Notes to IPython notebook format using the sphinx2ipynb converter from the nbconvert project. All materials are collected in my scipy2013-tutorial-numpy-ipython GitHub repository at https://github.com/esc/scipy2013-tutorial-numpy-ipython or http://git.io/bocNDg.

Statistical Data Analysis in Python, SciPy2013 Tutorial, Part 1 of 4
SciPy 2013
Recorded: June 27, 2013Language: English

Presenter: Christopher Fonnesbeck

Description

This tutorial will introduce the use of Python for statistical data analysis, using data stored as Pandas DataFrame objects. Much of the work involved in analyzing data resides in importing, cleaning and transforming data in preparation for analysis. Therefore, the first half of the course is comprised of a 2-part overview of basic and intermediate Pandas usage that will show how to effectively manipulate datasets in memory. This includes tasks like indexing, alignment, join/merge methods, date/time types, and handling of missing data. Next, we will cover plotting and visualization using Pandas and Matplotlib, focusing on creating effective visual representations of your data, while avoiding common pitfalls. Finally, participants will be introduced to methods for statistical data modeling using some of the advanced functions in Numpy, Scipy and Pandas. This will include fitting your data to probability distributions, estimating relationships among variables using linear and non-linear models, and a brief introduction to Bayesian methods. Each section of the tutorial will involve hands-on manipulation and analysis of sample datasets, to be provided to attendees in advance.

The target audience for the tutorial includes all new Python users, though we recommend that users also attend the NumPy and IPython session in the introductory track.

Tutorial GitHub repo: https://github.com/fonnesbeck/statistical-analysis-python-tutorial

Outline

Introduction to Pandas (45 min)

- Importing data
- Series and DataFrame objects
- Indexing, data selection and subsetting
- Hierarchical indexing
- Reading and writing files
- Date/time types
- String conversion
- Missing data
- Data summarization

Data Wrangling with Pandas (45 min)

- Indexing, selection and subsetting
- Reshaping DataFrame objects
- Pivoting
- Alignment
- Data aggregation and GroupBy operations
- Merging and joining DataFrame objects

Plotting and Visualization (45 min)

- Time series plots
- Grouped plots
- Scatterplots
- Histograms
- Visualization pro tips

Statistical Data Modeling (45 min)

- Fitting data to probability distributions
- Linear models
- Spline models
- Time series analysis
- Bayesian models
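
To give a flavour of the wrangling and modeling steps listed above, here is a small illustrative pandas/SciPy sketch using made-up data (the tutorial itself distributes its own sample datasets).

    import numpy as np
    import pandas as pd
    from scipy import stats

    # A tiny made-up dataset: measurements by group, with a missing value.
    df = pd.DataFrame({
        "group": ["a", "a", "b", "b", "b"],
        "value": [1.0, 2.5, np.nan, 3.0, 4.5],
    })

    # Missing data handling and summarization.
    print(df["value"].isnull().sum())
    print(df.dropna().describe())

    # GroupBy aggregation.
    print(df.groupby("group")["value"].mean())

    # Fitting data to a probability distribution with scipy.stats.
    sample = np.random.randn(500) * 2 + 10
    mu, sigma = stats.norm.fit(sample)
    print(mu, sigma)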

Required Packages

- Python 2.7 or higher (including Python 3)
- pandas 0.11.1 or higher, and its dependencies
- NumPy 1.6.1 or higher
- matplotlib 1.0.0 or higher
- pytz
- IPython 0.12 or higher
- pyzmq
- tornado

Statistical Data Analysis in Python, SciPy2013 Tutorial, Part 2 of 4
SciPy 2013
Recorded: June 27, 2013Language: English

Presenter: Christopher Fonnesbeck

Description

This tutorial will introduce the use of Python for statistical data analysis, using data stored as Pandas DataFrame objects. Much of the work involved in analyzing data resides in importing, cleaning and transforming data in preparation for analysis. Therefore, the first half of the course is comprised of a 2-part overview of basic and intermediate Pandas usage that will show how to effectively manipulate datasets in memory. This includes tasks like indexing, alignment, join/merge methods, date/time types, and handling of missing data. Next, we will cover plotting and visualization using Pandas and Matplotlib, focusing on creating effective visual representations of your data, while avoiding common pitfalls. Finally, participants will be introduced to methods for statistical data modeling using some of the advanced functions in Numpy, Scipy and Pandas. This will include fitting your data to probability distributions, estimating relationships among variables using linear and non-linear models, and a brief introduction to Bayesian methods. Each section of the tutorial will involve hands-on manipulation and analysis of sample datasets, to be provided to attendees in advance.

The target audience for the tutorial includes all new Python users, though we recommend that users also attend the NumPy and IPython session in the introductory track.

Tutorial GitHub repo: https://github.com/fonnesbeck/statistical-analysis-python-tutorial

Outline

Introduction to Pandas (45 min)

- Importing data
- Series and DataFrame objects
- Indexing, data selection and subsetting
- Hierarchical indexing
- Reading and writing files
- Date/time types
- String conversion
- Missing data
- Data summarization

Data Wrangling with Pandas (45 min)

- Indexing, selection and subsetting
- Reshaping DataFrame objects
- Pivoting
- Alignment
- Data aggregation and GroupBy operations
- Merging and joining DataFrame objects

Plotting and Visualization (45 min)

- Time series plots
- Grouped plots
- Scatterplots
- Histograms
- Visualization pro tips

Statistical Data Modeling (45 min)

- Fitting data to probability distributions
- Linear models
- Spline models
- Time series analysis
- Bayesian models

Required Packages

- Python 2.7 or higher (including Python 3)
- pandas 0.11.1 or higher, and its dependencies
- NumPy 1.6.1 or higher
- matplotlib 1.0.0 or higher
- pytz
- IPython 0.12 or higher
- pyzmq
- tornado

Statistical Data Analysis in Python, SciPy2013 Tutorial, Part 3 of 4
SciPy 2013
Recorded: June 27, 2013Language: English

Presenter: Christopher Fonnesbeck

Description

This tutorial will introduce the use of Python for statistical data analysis, using data stored as Pandas DataFrame objects. Much of the work involved in analyzing data resides in importing, cleaning and transforming data in preparation for analysis. Therefore, the first half of the course is comprised of a 2-part overview of basic and intermediate Pandas usage that will show how to effectively manipulate datasets in memory. This includes tasks like indexing, alignment, join/merge methods, date/time types, and handling of missing data. Next, we will cover plotting and visualization using Pandas and Matplotlib, focusing on creating effective visual representations of your data, while avoiding common pitfalls. Finally, participants will be introduced to methods for statistical data modeling using some of the advanced functions in Numpy, Scipy and Pandas. This will include fitting your data to probability distributions, estimating relationships among variables using linear and non-linear models, and a brief introduction to Bayesian methods. Each section of the tutorial will involve hands-on manipulation and analysis of sample datasets, to be provided to attendees in advance.

The target audience for the tutorial includes all new Python users, though we recommend that users also attend the NumPy and IPython session in the introductory track.

Tutorial GitHub repo: https://github.com/fonnesbeck/statistical-analysis-python-tutorial

Outline

Introduction to Pandas (45 min)

- Importing data
- Series and DataFrame objects
- Indexing, data selection and subsetting
- Hierarchical indexing
- Reading and writing files
- Date/time types
- String conversion
- Missing data
- Data summarization

Data Wrangling with Pandas (45 min)

- Indexing, selection and subsetting
- Reshaping DataFrame objects
- Pivoting
- Alignment
- Data aggregation and GroupBy operations
- Merging and joining DataFrame objects

Plotting and Visualization (45 min)

- Time series plots
- Grouped plots
- Scatterplots
- Histograms
- Visualization pro tips

Statistical Data Modeling (45 min)

- Fitting data to probability distributions
- Linear models
- Spline models
- Time series analysis
- Bayesian models

Required Packages

- Python 2.7 or higher (including Python 3)
- pandas 0.11.1 or higher, and its dependencies
- NumPy 1.6.1 or higher
- matplotlib 1.0.0 or higher
- pytz
- IPython 0.12 or higher
- pyzmq
- tornado

Statistical Data Analysis in Python, SciPy2013 Tutorial, Part 4 of 4
SciPy 2013
Recorded: June 27, 2013Language: English

Presenter: Christopher Fonnesbeck

Description

This tutorial will introduce the use of Python for statistical data analysis, using data stored as Pandas DataFrame objects. Much of the work involved in analyzing data resides in importing, cleaning and transforming data in preparation for analysis. Therefore, the first half of the course is comprised of a 2-part overview of basic and intermediate Pandas usage that will show how to effectively manipulate datasets in memory. This includes tasks like indexing, alignment, join/merge methods, date/time types, and handling of missing data. Next, we will cover plotting and visualization using Pandas and Matplotlib, focusing on creating effective visual representations of your data, while avoiding common pitfalls. Finally, participants will be introduced to methods for statistical data modeling using some of the advanced functions in Numpy, Scipy and Pandas. This will include fitting your data to probability distributions, estimating relationships among variables using linear and non-linear models, and a brief introduction to Bayesian methods. Each section of the tutorial will involve hands-on manipulation and analysis of sample datasets, to be provided to attendees in advance.

The target audience for the tutorial includes all new Python users, though we recommend that users also attend the NumPy and IPython session in the introductory track.

Tutorial GitHub repo: https://github.com/fonnesbeck/statistical-analysis-python-tutorial

Outline

Introduction to Pandas (45 min)

- Importing data
- Series and DataFrame objects
- Indexing, data selection and subsetting
- Hierarchical indexing
- Reading and writing files
- Date/time types
- String conversion
- Missing data
- Data summarization

Data Wrangling with Pandas (45 min)

- Indexing, selection and subsetting
- Reshaping DataFrame objects
- Pivoting
- Alignment
- Data aggregation and GroupBy operations
- Merging and joining DataFrame objects

Plotting and Visualization (45 min)

- Time series plots
- Grouped plots
- Scatterplots
- Histograms
- Visualization pro tips

Statistical Data Modeling (45 min)

- Fitting data to probability distributions
- Linear models
- Spline models
- Time series analysis
- Bayesian models

Required Packages

- Python 2.7 or higher (including Python 3)
- pandas 0.11.1 or higher, and its dependencies
- NumPy 1.6.1 or higher
- matplotlib 1.0.0 or higher
- pytz
- IPython 0.12 or higher
- pyzmq
- tornado

Symbolic Computing with SymPy, SciPy2013 Tutorial, Part 3 of 6
SciPy 2013
Recorded: June 27, 2013Language: English

Ondrej Certik, Mateusz Paprocki, Aaron Meurer

Description

SymPy is a pure Python library for symbolic mathematics. It aims to become a full-featured computer algebra system (CAS) while keeping the code as simple as possible in order to be comprehensible and easily extensible. SymPy is written entirely in Python and does not require any external libraries.

In this tutorial we will introduce attendees to SymPy. We will start by showing how to install and configure this Python module. Then we will proceed to the basics of constructing and manipulating mathematical expressions in SymPy. We will also discuss the most common issues and differences from other computer algebra systems, and how to deal with them. In the last part of this tutorial we will show how to solve simple, yet illustrative, mathematical problems with SymPy.

This knowledge should be enough for attendees to start using SymPy for solving mathematical problems and hacking SymPy's internals (though hacking core modules may require additional expertise).

We expect attendees of this tutorial to have basic knowledge of Python and mathematics. However, any more advanced topics will be explained during the presentation.

Outline

- installing, configuring and running SymPy
- basics of expressions in SymPy
- traversal and manipulation of expressions
- common issues and differences from other CAS
- setting up and using printers
- querying expression properties
- not only symbolics: numerical computing (mpmath)
- mathematical problem solving with SymPy

Required Packages

Python 2.x or 3.x, SymPy (most recent version)
Optional packages: IPython, matplotlib, NetworkX, GMPY, numpy, scipy

Documentation

http://mattpap.github.com/scipy-2011-tutorial/html/index.html
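
For a quick feel for the basics this tutorial covers, here is a minimal illustrative SymPy snippet (not taken from the linked materials): constructing expressions, manipulating them, a common gotcha with Python numbers, and a little calculus and numerics.

    import sympy as sp

    x, y = sp.symbols("x y")

    expr = (x + y) ** 3
    print(sp.expand(expr))             # x**3 + 3*x**2*y + 3*x*y**2 + y**3
    print(sp.factor(sp.expand(expr)))  # back to (x + y)**3

    # Common gotcha: use sp.Rational (or sympify) to keep exact fractions.
    print(sp.Rational(1, 3) + sp.Rational(1, 6))   # 1/2

    # Calculus and numerical evaluation (backed by mpmath).
    print(sp.diff(sp.sin(x) * sp.exp(x), x))
    print(sp.integrate(sp.exp(-x ** 2), (x, -sp.oo, sp.oo)))   # sqrt(pi)
    print(sp.pi.evalf(30))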

Symbolic Computing with SymPy, SciPy2013 Tutorial, Part 4 of 6
SciPy 2013
Recorded: June 27, 2013Language: English

Ondrej Certik, Mateusz Paprocki, Aaron Meurer

Description

SymPy is a pure Python library for symbolic mathematics. It aims to become a full-featured computer algebra system (CAS) while keeping the code as simple as possible in order to be comprehensible and easily extensible. SymPy is written entirely in Python and does not require any external libraries.

In this tutorial we will introduce attendees to SymPy. We will start by showing how to install and configure this Python module. Then we will proceed to the basics of constructing and manipulating mathematical expressions in SymPy. We will also discuss the most common issues and differences from other computer algebra systems, and how to deal with them. In the last part of this tutorial we will show how to solve simple, yet illustrative, mathematical problems with SymPy.

This knowledge should be enough for attendees to start using SymPy for solving mathematical problems and hacking SymPy's internals (though hacking core modules may require additional expertise).

We expect attendees of this tutorial to have basic knowledge of Python and mathematics. However, any more advanced topics will be explained during the presentation.

Outline

- installing, configuring and running SymPy
- basics of expressions in SymPy
- traversal and manipulation of expressions
- common issues and differences from other CAS
- setting up and using printers
- querying expression properties
- not only symbolics: numerical computing (mpmath)
- mathematical problem solving with SymPy

Required Packages

Python 2.x or 3.x, SymPy (most recent version)
Optional packages: IPython, matplotlib, NetworkX, GMPY, numpy, scipy

Documentation

http://mattpap.github.com/scipy-2011-tutorial/html/index.html

Symbolic Computing with SymPy, SciPy2013 Tutorial, Part 5 of 6
SciPy 2013
Recorded: June 27, 2013Language: English

Ondrej Certik, Mateusz Paprocki, Aaron Meurer

Description

SymPy is a pure Python library for symbolic mathematics. It aims to become a full-featured computer algebra system (CAS) while keeping the code as simple as possible in order to be comprehensible and easily extensible. SymPy is written entirely in Python and does not require any external libraries.

In this tutorial we will introduce attendees to SymPy. We will start by showing how to install and configure this Python module. Then we will proceed to the basics of constructing and manipulating mathematical expressions in SymPy. We will also discuss the most common issues and differences from other computer algebra systems, and how to deal with them. In the last part of this tutorial we will show how to solve simple, yet illustrative, mathematical problems with SymPy.

This knowledge should be enough for attendees to start using SymPy for solving mathematical problems and hacking SymPy's internals (though hacking core modules may require additional expertise).

We expect attendees of this tutorial to have basic knowledge of Python and mathematics. However, any more advanced topics will be explained during the presentation.

Outline

- installing, configuring and running SymPy
- basics of expressions in SymPy
- traversal and manipulation of expressions
- common issues and differences from other CAS
- setting up and using printers
- querying expression properties
- not only symbolics: numerical computing (mpmath)
- mathematical problem solving with SymPy

Required Packages

Python 2.x or 3.x, SymPy (most recent version)
Optional packages: IPython, matplotlib, NetworkX, GMPY, numpy, scipy

Documentation

http://mattpap.github.com/scipy-2011-tutorial/html/index.html

Symbolic Computing with SymPy, SciPy2013 Tutorial, Part 6 of 6
SciPy 2013
Recorded: June 27, 2013Language: English

Ondrej Certik, Mateusz Paprocki, Aaron Meurer

Description

SymPy is a library for symbolic mathematics (i.e., a computer algebra system) in Python. In this tutorial, we will introduce SymPy from scratch. We will show the basics of SymPy, such as how to define expressions, manipulate them, common gotchas, how to simplify them, and calculus operations. We will then show some advanced examples of how SymPy can be used to solve real-life problems in physics.

Required Packages

Python 2.x or 3.x, SymPy (most recent version)
Optional packages: IPython, matplotlib, NetworkX, GMPY, numpy, scipy

Documentation

http://certik.github.io/scipy-2013-tutorial/html/index.html

Using Geospatial Data with Python, SciPy2013 Tutorial, Part 1 of 6
SciPy 2013
Recorded: June 27, 2013Language: English

Presenter: Kelsey Jordahl

Description

Geographically referenced data is important in many scientific fields, and working with spatial data has become widespread in other domains as well (e.g. Google Maps, geolocated tweets, 4square checkins). Python has become an increasingly important language for working with geospatial data. In this tutorial, students will get experience in working with common geospatial formats in open source python libraries.

Python bindings are available for (nearly) all the standard libraries for working with geospatial data (proprietary and open source). Some of these libraries (including PROJ.4 and GDAL) will be discussed and used in this tutorial, along with more "pythonic" packages for accessing them, such as Shapely. Using spatially-aware databases will be discussed, with examples and an exercise using PostGIS, an extension to PostgreSQL. Python scripting extensions to Geographic Information Systems (GIS) packages such as QGIS and ArcView will be briefly discussed.

This tutorial should be accessible to anyone who has a basic understanding of NumPy and matplotlib. Prior familiarity with SQL database queries and the python DB API will be helpful for the PostGIS section.

Outline

1 Map projections /10 min/

1.1 pyproj /10 min + 15 min exercise/

1.2 basemap /10 min + 10 min exercise/

1.3 cartopy /5 min/

2 Geographical data

2.1 Data formats /20 min intro/

2.2 GDAL/OGR /10 min + 10 min exercise/

2.3 Shapely /15 min + 30 min exercise/

2.4 PostGIS /30 min + 30 min exercise/

2.4.1 Connecting to a PostGIS database with psycopg2

2.4.2 Converting latitude and longitude fields to geographical points

2.4.3 Setting and converting coordinate systems

2.4.4 Aggregation and geographic calculations with queries

2.4.5 GEOMETRY and GEOGRAPHY data types

3 Plugins for GIS software

3.1 QGIS /15 min/

3.2 ArcGIS /10 min/

4 Conclusion /10 min/
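
To illustrate the flavour of the Shapely and pyproj portions of the outline, here is a tiny sketch (not from the tutorial materials; note that the pyproj call uses the modern Transformer API, which differs from the Proj-based API that was current in 2013).

    from shapely.geometry import Point, Polygon
    from pyproj import Transformer

    # Shapely: geometric objects, predicates and operations in the plane.
    square = Polygon([(0, 0), (0, 2), (2, 2), (2, 0)])
    pt = Point(1, 1)
    print(square.contains(pt))          # True
    print(pt.buffer(0.5).area)          # ~0.785 (polygonal approximation of a circle)

    # pyproj: re-project a lon/lat coordinate (WGS84) to Web Mercator.
    transformer = Transformer.from_crs("EPSG:4326", "EPSG:3857", always_xy=True)
    x, y = transformer.transform(-97.74, 30.27)   # Austin, TX
    print(x, y)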

Required Packages

Required packages: pyproj, gdal, shapely, psycopg2
Optional packages: PostGIS, QGIS, cartopy

Using Geospatial Data with Python, SciPy2013 Tutorial, Part 2 of 6
SciPy 2013
Recorded: June 27, 2013Language: English

Presenter: Kelsey Jordahl

Description

Geographically referenced data is important in many scientific fields, and working with spatial data has become widespread in other domains as well (e.g. Google Maps, geolocated tweets, 4square checkins). Python has become an increasingly important language for working with geospatial data. In this tutorial, students will get experience in working with common geospatial formats in open source python libraries.

Python bindings are available for (nearly) all the standard libraries for working with geospatial data (proprietary and open source). Some of these libraries (including PROJ.4 and GDAL) will be discussed and used in this tutorial, along with more "pythonic" packages for accessing them, such as Shapely. Using spatially-aware databases will be discussed, with examples and an exercise using PostGIS, an extension to PostgreSQL. Python scripting extensions to Geographic Information Systems (GIS) packages such as QGIS and ArcView will be briefly discussed.

This tutorial should be accessible to anyone who has a basic understanding of NumPy and matplotlib. Prior familiarity with SQL database queries and the python DB API will be helpful for the PostGIS section.

Outline

1 Map projections /10 min/

1.1 pyproj /10 min + 15 min exercise/

1.2 basemap /10 min + 10 min exercise/

1.3 cartopy /5 min/

2 Geographical data

2.1 Data formats /20 min intro/

2.2 GDAL/OGR /10 min + 10 min exercise/

2.3 Shapely /15 min + 30 min exercise/

2.4 PostGIS /30 min + 30 min exercise/

2.4.1 Connecting to a PostGIS database with psycopg2

2.4.2 Converting latitude and longitude fields to geographical points

2.4.3 Setting and converting coordinate systems

2.4.4 Aggregation and geographic calculations with queries

2.4.5 GEOMETRY and GEOGRAPHY data types

3 Plugins for GIS software

3.1 QGIS /15 min/

3.2 ArcGIS /10 min/

4 Conclusion /10 min/

Required Packages

Required packages: pyproj, gdal, shapely, psycopg2
Optional packages: PostGIS, QGIS, cartopy

Using Geospatial Data with Python, SciPy2013 Tutorial, Part 3 of 6
SciPy 2013
Recorded: June 27, 2013Language: English

Presenter: Kelsey Jordahl

Description

Geographically referenced data is important in many scientific fields, and working with spatial data has become widespread in other domains as well (e.g. Google Maps, geolocated tweets, 4square checkins). Python has become an increasingly important language for working with geospatial data. In this tutorial, students will get experience in working with common geospatial formats in open source python libraries.

Python bindings are available for (nearly) all the standard libraries for working with geospatial data (proprietary and open source). Some of these libraries (including PROJ.4 and GDAL) will be discussed and used in this tutorial, along with more "pythonic" packages for accessing them, such as Shapely. Using spatially-aware databases will be discussed, with examples and an exercise using PostGIS, an extension to PostgreSQL. Python scripting extensions to Geographic Information Systems (GIS) packages such as QGIS and ArcView will be briefly discussed.

This tutorial should be accessible to anyone who has a basic understanding of NumPy and matplotlib. Prior familiarity with SQL database queries and the python DB API will be helpful for the PostGIS section.

Outline

1 Map projections /10 min/

1.1 pyproj /10 min + 15 min exercise/

1.2 basemap /10 min + 10 min exercise/

1.3 cartopy /5 min/

2 Geographical data

2.1 Data formats /20 min intro/

2.2 GDAL/OGR /10 min + 10 min exercise/

2.3 Shapely /15 min + 30 min exercise/

2.4 PostGIS /30 min + 30 min exercise/

2.4.1 Connecting to a PostGIS database with psycopg2

2.4.2 Converting latitude and longitude fields to geographical points

2.4.3 Setting and converting coordinate systems

2.4.4 Aggregation and geographic calculations with queries

2.4.5 GEOMETRY and GEOGRAPHY data types

3 Plugins for GIS software

3.1 QGIS /15 min/

3.2 ArcGIS /10 min/

4 Conclusion /10 min/

Required Packages

Required packages: pyproj, gdal, shapely, psycopg2
Optional packages: PostGIS, QGIS, cartopy

Using Geospatial Data with Python, SciPy2013 Tutorial, Part 4 of 6
SciPy 2013
Recorded: June 27, 2013Language: English

Presenter: Kelsey Jordahl

Description

Geographically referenced data is important in many scientific fields, and working with spatial data has become widespread in other domains as well (e.g. Google Maps, geolocated tweets, 4square checkins). Python has become an increasingly important language for working with geospatial data. In this tutorial, students will get experience in working with common geospatial formats in open source python libraries.

Python bindings are available for (nearly) all the standard libraries for working with geospatial data (proprietary and open source). Some of these libraries (including PROJ.4 and GDAL) will be discussed and used in this tutorial, along with more "pythonic" packages for accessing them, such as Shapely. Using spatially-aware databases will be discussed, with examples and an exercise using PostGIS, an extension to PostgreSQL. Python scripting extensions to Geographic Information Systems (GIS) packages such as QGIS and ArcView will be briefly discussed.

This tutorial should be accessible to anyone who has a basic understanding of NumPy and matplotlib. Prior familiarity with SQL database queries and the python DB API will be helpful for the PostGIS section.

Outline

1 Map projections /10 min/

1.1 pyproj /10 min + 15 min exercise/

1.2 basemap /10 min + 10 min exercise/

1.3 cartopy /5 min/

2 Geographical data

2.1 Data formats /20 min intro/

2.2 GDAL/OGR /10 min + 10 min exercise/

2.3 Shapely /15 min + 30 min exercise/

2.4 PostGIS /30 min + 30 min exercise/

2.4.1 Connecting to a PostGIS database with psycopg2

2.4.2 Converting latitude and longitude fields to geographical points

2.4.3 Setting and converting coordinate systems

2.4.4 Aggregation and geographic calculations with queries

2.4.5 GEOMETRY and GEOGRAPHY data types

3 Plugins for GIS software

3.1 QGIS /15 min/

3.2 ArcGIS /10 min/

4 Conclusion /10 min/

Required Packages

Required packages: pyproj, gdal, shapely, psycopg2
Optional packages: PostGIS, QGIS, cartopy

Using Geospatial Data with Python, SciPy2013 Tutorial, Part 5 of 6
SciPy 2013
Recorded: June 27, 2013Language: English

Presenter: Kelsey Jordahl

Description

Geographically referenced data is important in many scientific fields, and working with spatial data has become widespread in other domains as well (e.g. Google Maps, geolocated tweets, 4square checkins). Python has become an increasingly important language for working with geospatial data. In this tutorial, students will get experience in working with common geospatial formats in open source python libraries.

Python bindings are available for (nearly) all the standard libraries for working with geospatial data (proprietary and open source). Some of these libraries (including PROJ.4 and GDAL) will be discussed and used in this tutorial, along with more "pythonic" packages for accessing them, such as Shapely. Using spatially-aware databases will be discussed, with examples and an exercise using PostGIS, an extension to PostgreSQL. Python scripting extensions to Geographic Information Systems (GIS) packages such as QGIS and ArcView will be briefly discussed.

This tutorial should be accessible to anyone who has a basic understanding of NumPy and matplotlib. Prior familiarity with SQL database queries and the python DB API will be helpful for the PostGIS section.

Outline

1 Map projections /10 min/

1.1 pyproj /10 min + 15 min exercise/

1.2 basemap /10 min + 10 min exercise/

1.3 cartopy /5 min/

2 Geographical data

2.1 Data formats /20 min intro/

2.2 GDAL/OGR /10 min + 10 min exercise/

2.3 Shapely /15 min + 30 min exercise/

2.4 PostGIS /30 min + 30 min exercise/

2.4.1 Connecting to a PostGIS database with psycopg2

2.4.2 Converting latitude and longitude fields to geographical points

2.4.3 Setting and converting coordinate systems

2.4.4 Aggregation and geographic calculations with queries

2.4.5 GEOMETRY and GEOGRAPHY data types

3 Plugins for GIS software

3.1 QGIS /15 min/

3.2 ArcGIS /10 min/

4 Conclusion /10 min/

Required Packages

Required packages: pyproj, gdal, shapely, psycopg2
Optional packages: PostGIS, QGIS, cartopy

Using Geospatial Data with Python, SciPy2013 Tutorial, Part 6 of 6
SciPy 2013
Recorded: June 27, 2013Language: English

Presenter: Kelsey Jordahl

Description

Geographically referenced data is important in many scientific fields, and working with spatial data has become widespread in other domains as well (e.g. Google Maps, geolocated tweets, 4square checkins). Python has become an increasingly important language for working with geospatial data. In this tutorial, students will get experience in working with common geospatial formats in open source python libraries.

Python bindings are available for (nearly) all the standard libraries for working with geospatial data (proprietary and open source). Some of these libraries (including PROJ.4 and GDAL) will be discussed and used in this tutorial, along with more "pythonic" packages for accessing them, such as Shapely. Using spatially-aware databases will be discussed, with examples and an exercise using PostGIS, an extension to PostgreSQL. Python scripting extensions to Geographic Information Systems (GIS) packages such as QGIS and ArcView will be briefly discussed.

This tutorial should be accessible to anyone with a basic understanding of NumPy and matplotlib. Prior familiarity with SQL queries and the Python DB API will be helpful for the PostGIS section.

Outline

1 map projections /10 min/

1.1 pyproj /10 min + 15 min exercise/

1.2 basemap /10 min + 10 min exercise/

1.3 cartopy /5 min/

2 geographical data

2.1 data formats /20 min intro/

2.2 GDAL/OGR /10 min + 10 min exercise/

2.3 Shapely /15 min + 30 min exercise/

2.4 PostGIS

/30 min + 30 min exercise/

2.4.1 Connecting to a PostGIS database with psycopg2

2.4.2 Converting latitude and longitude fields to geographical points

2.4.3 Setting and converting coordinate systems

2.4.4 Aggregation and geographic calculations with queries

2.4.5 GEOMETRY and GEOGRAPHY data types

3 plugins for GIS software

3.1 QGIS /15 min/

3.2 ArcGIS /10 min/

4 Conclusion /10 min/

Required Packages

Required packages: pyproj, gdal, shapely, psycopg2. Optional packages: PostGIS, QGIS, cartopy.

Version Control and Unit Testing for Scientific Software, SciPy2013 Tutorial, Part 2 of 3
SciPy 2013
Recorded: June 27, 2013
Language: English

Presenters: Matt Davis, Katy Huff

Description

Writing software can be a frustrating process, but developers have come up with ways to make it less stressful and error-prone. Version control saves the history of your project and makes it easier for multiple people to participate in development. Unit testing and testing frameworks help ensure the correctness of your code and help you find errors by quickly executing and testing your entire code base. These tools can save you time and stress, and they are valuable to anyone writing software of any description.

This collaborative, hands-on tutorial will cover version control with Git plus writing and running unit tests in Python (and IPython!) using the nose testing framework. Attendees should be comfortable with the basics of Python and the command line, but no experience with scientific Python is necessary.

Outline

The tutorial will be split into two two-hour lessons: the first covers Git/GitHub and the second covers unit testing. Throughout the tutorial, students work in pairs. Our teaching style is to give students frequent, short exercises: ideally, instructors talk for no more than a few minutes before stopping so that students can try something on their own machines. The instructors then give an explanation and an example before moving on to the next item.

Lesson 1: git/GitHub

Students will work in pairs: in each pair, one student will create a GitHub repository and give the other student commit access to it. Students will then take turns making modifications to learn various pieces of Git functionality.

Lesson 2: Unit Testing

Continuing to work in pairs, students will use test-driven development to construct a small scientific program in the IPython Notebook and then move their work into .py files to see how nose works from the command line.
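
For readers who want a preview of the test-first style used in this lesson, here is a minimal sketch of a nose test file. The function and test names are hypothetical and not taken from the tutorial materials; in a test-driven workflow the tests would be written first and the implementation filled in until nosetests passes.

    # Hypothetical test file (e.g. test_stats.py) written in a test-first style.
    from nose.tools import assert_equal, assert_raises

    def mean(values):
        """Return the arithmetic mean of a non-empty sequence of numbers."""
        if not values:
            raise ValueError("mean() requires at least one value")
        return sum(values) / float(len(values))

    def test_mean_of_integers():
        assert_equal(mean([1, 2, 3, 4]), 2.5)

    def test_mean_of_single_value():
        assert_equal(mean([7]), 7.0)

    def test_mean_of_empty_list_raises():
        assert_raises(ValueError, mean, [])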

Required Packages

In addition to standard Python, this tutorial requires the nose testing framework, the IPython Notebook, and the command-line interface to Git. Installing Git varies by platform: Windows users should install Git Bash (a.k.a. msysgit), Mac users should install the Mac OS X Command Line Tools, and Linux users should use their distribution's package manager. The Anaconda CE Python installer includes nose and the IPython Notebook.

Documentation

Material will be adapted from existing Software Carpentry lessons, especially:
https://github.com/swcarpentry/boot-camps/tree/master/version-control/git/git-and-github
https://github.com/swcarpentry/boot-camps/tree/master/python/sw_engineering

Other links:
http://software-carpentry.org/ - The Software Carpentry organization, with links to many lessons and past boot camps.
https://github.com/swcarpentry/boot-camps - Standard Software Carpentry boot camp curriculum.
https://github.com/thehackerwithin/PyTrieste/wiki - Early Software Carpentry curriculum from a two-week boot camp at the International Centre for Theoretical Physics.
https://code.google.com/p/hacker-within/w/list - Lesson notes for a three-hour lightning lesson at the American Nuclear Society Conference 2011.
http://software-carpentry.org/blog/2012/02/trieste-italy-workshop-week-1.html - A blog post about the first week of the two-week ICTP boot camp.
http://software-carpentry.org/blog/2012/04/lessons-learned-at-the-university-of-chicago.html - A blog post about a two-day boot camp at the University of Chicago.
http://software-carpentry.org/blog/2011/11/knowledge-of-the-second-kind.html - A blog post about what The Hacker Within did before it was absorbed into Software Carpentry (it is no longer really its own entity).

Version Control and Unit Testing for Scientific Software, SciPy2013 Tutorial, Part 3 of 3
SciPy 2013
Recorded: June 27, 2013
Language: English

Presenters: Matt Davis, Katy Huff

Description

Writing software can be a frustrating process, but developers have come up with ways to make it less stressful and error-prone. Version control saves the history of your project and makes it easier for multiple people to participate in development. Unit testing and testing frameworks help ensure the correctness of your code and help you find errors by quickly executing and testing your entire code base. These tools can save you time and stress, and they are valuable to anyone writing software of any description.

This collaborative, hands-on tutorial will cover version control with Git plus writing and running unit tests in Python (and IPython!) using the nose testing framework. Attendees should be comfortable with the basics of Python and the command line, but no experience with scientific Python is necessary.

Outline

The tutorial will be split into two two-hour lessons: the first covers Git/GitHub and the second covers unit testing. Throughout the tutorial, students work in pairs. Our teaching style is to give students frequent, short exercises: ideally, instructors talk for no more than a few minutes before stopping so that students can try something on their own machines. The instructors then give an explanation and an example before moving on to the next item.

Lesson 1: git/GitHub

Students will work in pairs: in each pair, one student will create a GitHub repository and give the other student commit access to it. Students will then take turns making modifications to learn various pieces of Git functionality.

Lesson 2: Unit Testing

Continuing to work in pairs, students will use test-driven development to construct a small scientific program in the IPython Notebook and then move their work into .py files to see how nose works from the command line.
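
As a rough sketch of the final step in this lesson, the code below shows what a test module extracted from the notebook might look like; running nosetests in the containing directory discovers and runs it. The file, function, and test names are hypothetical assumptions, not taken from the tutorial materials.

    # Hypothetical module test_growth.py: nose discovers functions named test_*
    # in files named test_*; run "nosetests" from the command line to execute them.
    from nose.tools import assert_almost_equal

    def logistic_step(x, r=3.5):
        """One step of the logistic map; stands in for code extracted from the notebook."""
        return r * x * (1.0 - x)

    def setup():
        # nose calls a module-level setup() once before the tests in this module.
        print("setting up test_growth")

    def test_fixed_point_at_zero():
        assert_almost_equal(logistic_step(0.0), 0.0)

    def test_single_step_value():
        assert_almost_equal(logistic_step(0.5, r=2.0), 0.5)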

Required Packages

In addition to standard Python, this tutorial requires the nose testing framework, the IPython Notebook, and the command-line interface to Git. Installing Git varies by platform: Windows users should install Git Bash (a.k.a. msysgit), Mac users should install the Mac OS X Command Line Tools, and Linux users should use their distribution's package manager. The Anaconda CE Python installer includes nose and the IPython Notebook.

Documentation

Material will be adapted from existing Software Carpentry lessons, especially:
https://github.com/swcarpentry/boot-camps/tree/master/version-control/git/git-and-github
https://github.com/swcarpentry/boot-camps/tree/master/python/sw_engineering

Other links:
http://software-carpentry.org/ - The Software Carpentry organization, with links to many lessons and past boot camps.
https://github.com/swcarpentry/boot-camps - Standard Software Carpentry boot camp curriculum.
https://github.com/thehackerwithin/PyTrieste/wiki - Early Software Carpentry curriculum from a two-week boot camp at the International Centre for Theoretical Physics.
https://code.google.com/p/hacker-within/w/list - Lesson notes for a three-hour lightning lesson at the American Nuclear Society Conference 2011.
http://software-carpentry.org/blog/2012/02/trieste-italy-workshop-week-1.html - A blog post about the first week of the two-week ICTP boot camp.
http://software-carpentry.org/blog/2012/04/lessons-learned-at-the-university-of-chicago.html - A blog post about a two-day boot camp at the University of Chicago.
http://software-carpentry.org/blog/2011/11/knowledge-of-the-second-kind.html - A blog post about what The Hacker Within did before it was absorbed into Software Carpentry (it is no longer really its own entity).

Symbolic Computing with SymPy, SciPy2013 Tutorial, Part 2 of 6
SciPy 2013
Recorded: June 24, 2013
Language: English

Presenters: Ondrej Certik, Mateusz Paprocki, Aaron Meurer

Description

SymPy is a pure Python library for symbolic mathematics. It aims to become a full-featured computer algebra system (CAS) while keeping the code as simple as possible in order to be comprehensible and easily extensible. SymPy is written entirely in Python and does not require any external libraries.

In this tutorial we will introduce attendees to SymPy. We will start by showing how to install and configure this Python module, then proceed to the basics of constructing and manipulating mathematical expressions in SymPy. We will also discuss the most common issues and differences from other computer algebra systems, and how to deal with them. In the last part of the tutorial we will show how to solve simple, yet illustrative, mathematical problems with SymPy.
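
For a sense of what constructing and manipulating expressions looks like in practice, here is a minimal sketch using core SymPy functions; the particular expressions are illustrative and not taken from the tutorial materials.

    # Minimal sketch: constructing, manipulating, and solving expressions in SymPy.
    from sympy import symbols, expand, factor, diff, integrate, solve, sin

    x, y = symbols('x y')

    expr = (x + y)**2
    print(expand(expr))          # x**2 + 2*x*y + y**2
    print(factor(x**2 - y**2))   # (x - y)*(x + y)

    print(diff(sin(x)*x, x))     # x*cos(x) + sin(x)
    print(integrate(2*x, x))     # x**2

    print(solve(x**2 - 4, x))    # [-2, 2]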

This knowledge should be enough for attendees to start using SymPy for solving mathematical problems and hacking SymPy's internals (though hacking core modules may require additional expertise).

We expect attendees of this tutorial to have basic knowledge of Python and mathematics; any more advanced topics will be explained during the presentation.

Outline

installing, configuring, and running SymPy
basics of expressions in SymPy
traversal and manipulation of expressions
common issues and differences from other CASs
setting up and using printers
querying expression properties
not only symbolics: numerical computing (mpmath)
mathematical problem solving with SymPy

Required Packages

Python 2.x or 3.x, SymPy (most recent version). Optional packages: IPython, matplotlib, NetworkX, GMPY, numpy, scipy.

Documentation

http://mattpap.github.com/scipy-2011-tutorial/html/index.html