Session abstract:
Data Scientists use one set of tools, predominantly Python or R, with lots of file-based data. Engineers deploy production systems using different programming languages and primarily online databases. This is a common pattern, and it leads to silos between the two groups.
In this talk, Eric will share what he’s learned in creating a project structure that will feel at home for both Data Scientists and Engineers using Apache Zeppelin and Docker. The project structure he’ll share is heavily influenced by the Cookiecutter Data Science project (https://drivendata.github.io/cookiecutter-data-science/), which describes itself as “A logical, reasonably standardized, but flexible project structure for doing and sharing data science work”, but it also embraces the richness of the 20+ programming languages and data stores that Apache Zeppelin connects to. All code will be available on GitHub for you to play with.
He’ll demonstrate using Zeppelin to expose web analytics rollups, to do ETL processing, and to enrich datasets with NLP.
This talk will serve as a great intro to Apache Zeppelin and, if you are already using Jupyter, will encourage you to take a look at this competitor! If you are already using Zeppelin, you’ll be interested in how to use it for more than just the core task of interactive data analytics: it is also a great environment for rapidly prototyping the backend of many intensive data processing projects.