
Easy Spark: Exploiting large datasets for multi-class classification


Apache Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. At first glance, getting started with programming in the Hadoop ecosystem seems cumbersome and not very user-friendly for a data scientist or a machine learning specialist. In this talk I will briefly introduce Apache Spark and its programming paradigm. I will show how to easily run distributed training of common multi-class classifiers (naïve Bayes, random forest, logistic regression) without installing a single virtual machine, VirtualBox instance, or Docker container. I will also share my experience of managing long-term software projects that are based on Hadoop technology for data storage, extraction, and transformation.

