Apache Spark has become a standard tool for big data processing. It can handle massive datasets, often much faster than Apache Hadoop MapReduce, especially for iterative algorithms such as those used in common machine learning tasks.
Spark is also relatively easy to get started with and to use for exploratory data analysis. It offers interactive Scala, Python and R shells in which a user can quickly try out different ways of manipulating their data, avoiding the slow write-compile-submit-wait-inspect loop of other frameworks. Spark also provides a high-level DataFrame API and scalable machine learning libraries, making it a compelling tool for data scientists.
I am an engineer at ASI, a London-based data science consultancy and the developer of the SherlockML data science platform. This talk comes directly from our experience helping our consultants and SherlockML users to use Spark effectively as part of an integrated data science workflow.