
Large Scale Graph Mining with Spark: What I learned from mapping >15 million websites


As the web grows ever larger and more content-rich, graph analysis may be one of the most powerful tools for unlocking insights within the mythical big data. That's totally not fluff, because WIRED wrote about it (

This talk draws on ongoing research into large-scale graph mining, aimed at finding insights into how different websites interact with each other (sometimes in surprising ways!). Spark GraphFrames was integral to exploring the enormous Common Crawl dataset, and the size of the data really pushed the tool to its limits. Along the way, I learned a great deal about optimizations for representing and computing graphs.

We'll talk about:

  • Why graphs are so fascinating and the types of problems they can help solve.
  • How Spark GraphFrames work under the hood.
  • How to find clusters of interest in your graph.
  • Tips that may help you in your journey (hint: you're only as good as your data structure).

And much more! A GitHub repo with all the code will also be shared.
