Live-Hack: Analyzing 7 years of Buzzwords at Scale

Scale
06/07/2016 - 16:30 to 17:10
Kesselhaus
long talk (40 min)
Intermediate

Session abstract: 

We're coming together for the 7th edition of Berlin Buzzwords, and over the years a lot has changed in the big data technology ecosystem. Once-hot buzzwords have vanished and new ones have arisen.

In the past you would probably have written a MapReduce job in Java to crawl the web and analyze it at massive scale. With tools like Spark and Flink at hand, this has become much simpler.

I want to do a live coding session in which I show that it is now possible to write a scalable web crawler and analytics tool that scrapes the Berlin Buzzwords websites of the past 6 years and reveals some interesting insights into the big data trends of that period. While I will run the tool on the very limited data set of the historical Berlin Buzzwords websites, I want to highlight that it would, in principle, scale to crawling millions of websites and analyzing petabytes of data.
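
As a rough illustration of the kind of analysis described above (not the code shown in the talk), the following minimal Python sketch counts buzzword mentions per year with a map step and a reduce step. The buzzword list, the function names, and the inline sample pages are all hypothetical stand-ins for illustration; a real run would fetch the archived conference pages instead. In a Spark job the same two steps would roughly correspond to a `flatMap` over the pages followed by `reduceByKey`, which is what would let the approach be distributed across a cluster.

```python
import re
from collections import Counter
from functools import reduce

# Hypothetical buzzword list, chosen only for illustration.
BUZZWORDS = ("hadoop", "mapreduce", "spark", "flink")

TAG_RE = re.compile(r"<[^>]+>")   # crude HTML tag stripper, good enough for a sketch
WORD_RE = re.compile(r"[a-z]+")


def count_buzzwords(page):
    """Map step: turn one (year, html) record into counts keyed by (year, buzzword)."""
    year, html = page
    text = TAG_RE.sub(" ", html).lower()
    return Counter((year, word) for word in WORD_RE.findall(text) if word in BUZZWORDS)


def merge(left, right):
    """Reduce step: combine two partial counts (associative, so it can run in parallel)."""
    return left + right


def analyze(pages):
    """Apply the map step to every page and fold the partial counts together."""
    return reduce(merge, map(count_buzzwords, pages), Counter())


# Made-up sample input standing in for crawled pages (assumption, not real site content).
SAMPLE_PAGES = [
    (2010, "<h1>Hadoop</h1><p>Writing MapReduce jobs with Hadoop</p>"),
    (2016, "<p>Streaming with Flink and Spark; Spark on YARN</p>"),
]

if __name__ == "__main__":
    for (year, word), n in sorted(analyze(SAMPLE_PAGES).items()):
        print(year, word, n)
```

Because the reduce step only merges independent partial counts, the per-page work can be split across as many workers as there are pages, which is the property the abstract relies on when it says the approach would scale in principle.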
