Every week we introduce new speakers who will be on stage at #bbuzz 2015. Thanks to our program committee we can present part of our new eclectic program. Presentations range from beginner-friendly introductions to hot data analysis topics to in-depth technical presentations about scalable architectures. The conference presents more than 50 talks by international speakers, organised under the three tags "search", "store" and "scale".
Apache Flink deep-dive
Stephan Ewen
Scale – 40min
Stephan Ewen is a final-year PhD student at TU Berlin. He has been the primary architect and developer of the Stratosphere platform since its inception, and has worked at IBM Almaden Research Center (USA), Microsoft Research (USA) as well as at the IBM Development Laboratory Böblingen (Germany).
Apache Flink is a distributed engine for batch and streaming data analysis. It offers familiar programming APIs based on parallel collections that represent data streams, along with transformations and window definitions on these collections. Flink supports these APIs with a robust execution backend, implementing its own memory manager and custom data processing algorithms inside the JVM. Stephan's talk gives an overview of Flink's APIs, the most important features of Flink's runtime and the benefits for users, as well as a roadmap of the project.
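To make the window idea concrete, here is a minimal plain-Python sketch of a tumbling window over a stream of events. This is deliberately not Flink's API; it only illustrates the concept of chopping a stream into fixed-size windows and applying an aggregation per window.

```python
from itertools import islice

def tumbling_windows(events, size):
    """Split a stream of events into consecutive non-overlapping windows."""
    it = iter(events)
    while True:
        window = list(islice(it, size))
        if not window:
            break
        yield window

# A toy stream of numeric events, aggregated per window of three elements.
clicks = [1, 3, 2, 5, 4, 6]
sums = [sum(w) for w in tumbling_windows(clicks, 3)]
print(sums)  # [6, 15]
```

In Flink, the same shape of computation would be expressed declaratively on a distributed data stream, with the runtime handling parallelism and state.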
Signatures, patterns and trends: Timeseries data mining at Etsy
Andrew Clegg
Search – 40min
Andrew joined Etsy in 2014 and has been involved in the redesign of their Kale platform for anomaly detection and pattern matching ever since. Prior to Etsy he spent almost 15 years designing machine learning workflows and building search and analytics services in academia, startups and enterprises, across an ever-growing list of research areas including biomedical informatics, computational linguistics, social analytics, and educational gaming.
Kale is an open-source software suite for pattern mining and anomaly detection in operational data streams. In his talk Andrew will briefly cover the main challenges that traditional statistical methods face in this environment, and introduce some pragmatic alternatives that scale well and are easy to implement (and automate) on Elasticsearch and similar platforms.
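One example of the kind of pragmatic, easy-to-automate alternative to classical mean/standard-deviation thresholds is a robust outlier test based on the median and the median absolute deviation (MAD). The sketch below is illustrative and not taken from Kale itself:

```python
import statistics

def mad_anomalies(series, threshold=3.5):
    """Flag points whose modified z-score (based on median and MAD,
    which are robust to the outliers themselves) exceeds a threshold."""
    med = statistics.median(series)
    mad = statistics.median(abs(x - med) for x in series)
    if mad == 0:
        return []  # no spread at all; nothing to flag
    # 0.6745 makes the score comparable to a standard z-score.
    return [x for x in series if abs(0.6745 * (x - med) / mad) > threshold]

data = [10, 11, 9, 10, 12, 10, 95, 11]
print(mad_anomalies(data))  # [95]
```

Unlike the mean and standard deviation, the median and MAD are barely perturbed by the anomaly being hunted, which is what makes this kind of test practical on noisy operational metrics.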
Change Data Capture: The Magic Wand We Forgot
Martin Kleppmann
Scale – 40min
Martin is a software engineer and entrepreneur, specialising in the data infrastructure of internet companies. His previous startup, Rapportive, was acquired by LinkedIn in 2012. He is a committer on Apache Samza and author of the upcoming O'Reilly book "Designing Data-Intensive Applications".
Change Data Capture (CDC) is an old idea: let applications subscribe to a stream of everything that is written to a database – a feed of data changes. You can use that feed to update search indexes, invalidate caches, create snapshots, generate recommendations, or copy data into another database. In this talk, Martin will explain why change data capture is so useful, and how it prevents race conditions and other ugly problems. Then he'll go into the practical details of implementing CDC with PostgreSQL and Apache Kafka, and discuss the approaches you can use to do the same with various other databases.
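The core pattern can be sketched in a few lines: every write to the store is also published as a change event, and independent consumers (a search index, a cache invalidator) are driven from that same feed. This toy in-memory version stands in for what PostgreSQL's replication log and Kafka would provide in a real deployment; all names here are illustrative, not from any library.

```python
class ChangeFeedDB:
    """Toy key-value store that emits every write as a change event."""

    def __init__(self):
        self.data = {}
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def put(self, key, value):
        self.data[key] = value
        # In a real CDC setup this fan-out would go through a durable
        # log such as Kafka rather than in-process callbacks.
        for cb in self.subscribers:
            cb(("put", key, value))

# Two downstream consumers fed from the same change stream:
index = {}            # stand-in for a search index to keep up to date
cache = {"a": "old"}  # stand-in for a cache to invalidate

db = ChangeFeedDB()
db.subscribe(lambda ev: index.__setitem__(ev[1], ev[2]))
db.subscribe(lambda ev: cache.pop(ev[1], None))

db.put("a", "new value")
print(index)  # {'a': 'new value'}
print(cache)  # {}
```

Because every consumer sees the same ordered feed of changes, the index and cache cannot drift apart from the database the way they can when each is updated by separate, racing application writes.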