Clustering is an increasingly important task for many data scientists. This talk will explore the challenge of hierarchically clustering text data for summarisation purposes. We'll take a look at some great solutions now available to Python users, including the relevant scikit-learn modules and Elasticsearch (with the carrot2 plugin), and compare visualisations from both approaches.
- Background: methods for clustering text data and the challenge of data summarisation
- Hierarchical clustering: agglomerative vs divisive
- The sklearn.cluster and sklearn.metrics modules
- Elasticsearch + carrot2 plugin
- Performance comparisons, and an assessment of scalability and ease of use
- Static visualisation using Matplotlib, interactive visualisation using FoamTree
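
To give a flavour of the agglomerative (bottom-up) approach mentioned in the outline, here is a minimal sketch in plain Python. It is not the scikit-learn or carrot2 implementation. The toy documents, the `agglomerate` helper, the whitespace tokenisation and the choice of average linkage over cosine distance are all assumptions made purely for illustration.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """Cosine distance between two sparse term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty document as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def agglomerate(docs, n_clusters):
    """Hypothetical helper: merge the closest clusters until n_clusters remain."""
    # Naive bag-of-words vectors; a real pipeline would likely use TF-IDF.
    vectors = [Counter(doc.lower().split()) for doc in docs]
    clusters = [[i] for i in range(len(docs))]  # start with one cluster per document

    def linkage(c1, c2):
        # Average linkage: mean pairwise distance between the two clusters.
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        a, b = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return clusters


# Toy corpus (made up for this example).
docs = [
    "python clustering of text data",
    "hierarchical clustering of text documents",
    "elasticsearch search plugin",
    "elasticsearch plugin for search results",
]
clusters = agglomerate(docs, 2)
print(clusters)
```

A divisive method would work in the opposite direction, starting from a single cluster holding every document and splitting it repeatedly. The brute-force pair search above is only practical for tiny inputs, which is one reason the scalability comparison in the outline matters.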