Lessons learned while setting up a computational advertising platform on AWS with emphasis on experimental data analysis and scaling.
@ Kiwi PyCon 2013 - Sunday, 08 Sep 2013 - Track 2
Overview - hope this will be useful, but caveat emptor - not a how-to, that's well covered elsewhere - problem - recovering value from large web logs - user targeting
Is this Big Data? - When should you think about Hadoop? - AWS servers available with 244 GB of memory - Twitter WTF paper, Microsoft cluster utilisation paper
Logging, Storing, and Munging - Looked at EMR, but (1) it's hard to log and (2) there are versioning issues - For on-demand use, CM (Cloudera Manager) is good - For automated use, a combination of CDH, Whirr, and boto - Backing up HBase and HDFS to S3
Processing the data - Hadoop as solving distributed IO - Pig + UDFs - Hadoop streaming
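As an illustration of the Hadoop streaming item, here is a minimal sketch of a mapper and reducer written as plain Python functions over lines of text. The log layout (tab-separated, user id in the first field) and the function names are assumptions made for the example, not details from the talk. Hadoop streaming sorts the mapper output by key before it reaches the reducer, which is why `itertools.groupby` is enough here.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'user_id<TAB>1' for every non-empty log line.

    Assumes (hypothetically) a tab-separated log with the user id first.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0]:
            yield "%s\t1" % fields[0]


def reduce_lines(lines):
    """Reducer: sum the counts for each key; input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (key, sum(int(value) for _, value in group))


# When run as a script with "map" or "reduce" as the only argument,
# read stdin and write stdout, as a streaming job expects.
if __name__ == "__main__" and sys.argv[1:] in (["map"], ["reduce"]):
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

The same functions can be exercised locally without a cluster by passing the mapper output through `sorted()` before the reducer, which mimics the shuffle step on a small sample.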
Learning on the data - difficult data - latest machine learning algorithms, not just existing MapReduce algorithms (Mahout) - frameworks are starting to appear - GraphLab, or the Berkeley Spark ecosystem - want to experiment on smaller data to reduce iteration time
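One way to get a smaller data set for fast iteration is to draw a uniform random sample of log lines in a single pass. The sketch below uses reservoir sampling (Algorithm R); this is an assumed technique chosen for illustration, not necessarily how the sampling was done for the talk, and the function name is hypothetical.

```python
import random


def reservoir_sample(lines, k, seed=0):
    """Return k lines chosen uniformly at random from an iterable.

    Works in one pass and keeps only k lines in memory, so the input
    can be a file object over a log that does not fit in RAM.
    """
    rng = random.Random(seed)
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k lines.
            sample.append(line)
        else:
            # Keep the new line with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = line
    return sample
```

Fixing the seed makes the sample reproducible, which helps when comparing algorithm variants on the same subset.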
Prototype Learning Algorithm - loading text files into numpy arrays when memory constrained - JIT Python compilation - scikit-learn - logistic regression - spectral clustering and the FEAST algorithm - nearest neighbors (output to Gephi) - read/write binary formats
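For the memory-constrained loading and binary format items, a minimal sketch follows. It assumes a purely numeric, comma-separated text file with a known number of columns; the helper name `load_text_matrix` and the file names are hypothetical. The idea is to count rows first, preallocate the array once, and fill it line by line, so that no intermediate list of all rows is ever built. The result is then written in NumPy's `.npy` binary format and reopened memory-mapped.

```python
import os
import tempfile

import numpy as np


def load_text_matrix(path, n_cols, dtype=np.float32, delimiter=","):
    """Load a numeric text file into a preallocated 2-D array."""
    # Pass 1: count non-blank rows so the array can be allocated once.
    with open(path) as f:
        n_rows = sum(1 for line in f if line.strip())
    out = np.empty((n_rows, n_cols), dtype=dtype)
    # Pass 2: parse one line at a time straight into the array.
    with open(path) as f:
        row = 0
        for line in f:
            if not line.strip():
                continue
            out[row] = [float(x) for x in line.split(delimiter)]
            row += 1
    return out


# Small demonstration on a temporary file.
tmp_dir = tempfile.mkdtemp()
text_path = os.path.join(tmp_dir, "features.csv")
with open(text_path, "w") as f:
    f.write("1.5,2.0\n3.25,4.0\n")

features = load_text_matrix(text_path, n_cols=2)

# Write once in binary form, then reopen memory-mapped (read-only)
# so later runs skip text parsing and page data in on demand.
bin_path = os.path.join(tmp_dir, "features.npy")
np.save(bin_path, features)
reloaded = np.load(bin_path, mmap_mode="r")
```

Choosing `float32` over the default `float64` halves the memory footprint, at the cost of precision that may or may not matter for a given model.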
Implementation at scale - shoehorn into MapReduce - Port successful algorithms to GraphLab, C++ and MPI, or Boost Graph Library etc. - MIT StarCluster .. - Numba, Blaze, Theano, KDT - Anaconda
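To make "shoehorn into map-reduce" concrete, here is a sketch of batch gradient descent for logistic regression in which each partition computes a partial gradient (the map step) and the partial gradients are summed (the reduce step). It is pure Python run in a single process and only illustrates the decomposition; the function names, learning rate, and iteration count are assumptions, and the talk does not say this is the algorithm that was ported.

```python
import math
from functools import reduce


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def map_gradient(partition, w):
    """Map step: gradient of the log loss over one partition of (x, y) pairs."""
    grad = [0.0] * len(w)
    for x, y in partition:
        err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj
    return grad


def reduce_gradients(g1, g2):
    """Reduce step: element-wise sum of two partial gradients."""
    return [a + b for a, b in zip(g1, g2)]


def train(partitions, n_features, lr=0.5, n_iters=200):
    """Each iteration corresponds to one map-reduce job over all partitions."""
    w = [0.0] * n_features
    n = sum(len(p) for p in partitions)
    for _ in range(n_iters):
        # On a cluster, these calls would run in parallel, one per partition.
        partials = [map_gradient(p, w) for p in partitions]
        total = reduce(reduce_gradients, partials)
        w = [wi - lr * gi / n for wi, gi in zip(w, total)]
    return w
```

Because every iteration is a full pass over the data, iterative algorithms like this pay the job start-up cost many times over, which is one reason to consider the other options in the list above.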