GET /api/v2/video/400
HTTP 200 OK
Vary: Accept
Content-Type: text/html; charset=utf-8
Allow: GET, PUT, PATCH, HEAD, OPTIONS
{ "category": "PyCon US 2011", "language": "English", "slug": "pycon-2011--large-scale-data-conditioning--amp--p", "speakers": [ "Eric Gaumer" ], "tags": [ "pycon", "pycon2011", "pypes", "stackless" ], "id": 400, "state": 1, "title": "Large Scale Data Conditioning & Processing with Stackless Python and Pypes", "summary": "", "description": "Large Scale Data Conditioning & Processing with Stackless Python and Pypes\n\nPresented by Eric Gaumer\n\nPypes is a component oriented framework for designing dataflow applications.\nIt uses Stackless Python to model components as computational entities that\noperate by sending and receiving messages. Components are designed to process\nstreams of data modeled as a series of messages which are exchanged\nasynchronously. Data streams are initiated over an asynchronous REST\ninterface.\n\nAbstract\n\nThere's been some recent momentum around data flow programming with a number\nof new frameworks having been released. This newfound interest is due in large\npart to the increasing amount of data being produced and consumed by\napplications. MapReduce has become a general topic of discussion for analytics\nover large data sets but it's increasingly evident that it's not a panacea.\n\nSimple batch processing tools like MapReduce and Hadoop are just not powerful\nenough in any one of the dimensions of the big data space that really matters.\nOne particular area where MapReduce falls short is in near real-time search.\nIt used to be common to run batch processing jobs on a nightly basis which\nwould index the day's events, making them searchable.\n\nGiven today's social dynamics, people have come to expect instant access to\ndata as opposed to a daily digest. Batch oriented semantics are being\nsuperseded by event driven architectures that act on live, real-time streams\nof data. This shift in paradigm has sparked new interest in dataflow concepts.\n\nDataflow frameworks promote the data to become the main concept behind any\nprogram. 
It becomes a matter of \"data-flow\" over \"control-flow\" where\nprocesses are just the way data is created, manipulated and destroyed. This\nconcept is well represented in the Unix operating system which pipes data\nbetween small single-purpose tools to produce more sophisticated applications.\n\nPypes is a dataflow framework that leverages Stackless Python to model\nprocesses as black box operations that communicate by sending and receiving\nmessages. These processes are naturally component oriented allowing them to be\nconnected in different ways to form new applications. Components are\ninherently stateless making parallel processing relatively simple. Because a\ncomponent is an abstraction of a Stackless tasklet (true coroutines),\nexpensive setups such as loading machine learning models are done once during\ninitialization and can then be used throughout the life of the component. This\nis in contrast to MapReduce frameworks that typically incur this overhead each\ntime the map function is called or try to manage some sort of global state.\n\nOne aspect that differentiates Pypes from other dataflow frameworks is its\n\"push\" model. Unlike generator based solutions which pull data through the\nsystem, Pypes provides a RESTful interface that allows data to be pushed in.\nThis allows Pypes to sit more naturally as an event driven middleware component\nin the context of a larger architecture. A data push model also simplifies\nscalability since an entire cluster of nodes can be set up behind a load\nbalancer which will then automatically partition the incoming data stream.\nGenerator based \"pull models\" cannot easily partition data without somehow\ncoordinating access to the data which involves global state.\n\nPypes was designed to be a highly scalable, event driven, dataflow scheduling\nand execution environment. Writing your own components is simple and Pypes\nprovides Paste templates for creating new projects. 
Components are packaged as\nPython eggs and discovered automatically. They can be wired together using a\nvisual editor that runs in any HTML5 compliant browser. Pypes supports\nDirected Acyclic Graphs and data streams are modeled as a series of JSON\n(dict) packets which support meta-data at both the packet level and the field\nlevel.\n\nPypes also leverages the Python multiprocessing module to scale up. Data\narriving through the REST interface on any given node will be distributed\nacross parallel instances of the graph running on different cores/CPUs. Data\nsubmission is completely asynchronous.\n\nThis talk will provide a gentle introduction to the Pypes architecture and\ndesign.\n\nOutline:\n\n * Brief intro to Stackless Python (benefits it provides) \n * Control-Flow vs Data-Flow \n * Preemptive vs Cooperative Scheduling \n * The Topological Scheduler \n * The REST API (Submitting Data - Asynchronous Web Service) \n * Packet API: A unified data model with meta-data support \n * Writing Custom Components - Paste templates and pluggable eggs \n * Scale up - multiprocessing support \n * Scale out - cloud friendly \n * Questions \n\n", "quality_notes": "", "copyright_text": "Creative Commons Attribution-NonCommercial-ShareAlike 3.0", "embed": "", "thumbnail_url": "", "duration": null, "video_ogv_length": 154817317, "video_ogv_url": null, "video_ogv_download_only": false, "video_mp4_length": null, "video_mp4_url": "", "video_mp4_download_only": false, "video_webm_length": null, "video_webm_url": null, "video_webm_download_only": false, "video_flv_length": null, "video_flv_url": null, "video_flv_download_only": false, "source_url": "", "whiteboard": "", "recorded": "2011-03-11", "added": "2012-02-23T04:20:00", "updated": "2014-04-08T20:28:27.995" }