Session abstract:
The DataFrame API is an awesome interface for data manipulation in Spark, but when a job's complexity outgrows what Spark itself can handle, you need to resort to "violence". In this talk I will describe a project that became too complex to execute with the DataFrame API and had to be rewritten as custom code applied with the mapPartitions function. We will cover tips and tricks for reducing lineage complexity, share our process for analyzing pain points, and go into the details of mapPartitions to show how to keep Spark's distributed processing and reliability while executing custom code.
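
As a rough illustration of the mapPartitions idea mentioned above (not the code from the project itself), the sketch below shows the general shape of a function that receives an iterator over one partition's rows and yields transformed rows. The name `process_partition`, the `amount` field, and the running-total logic are hypothetical and exist only for this example; the partition is simulated with a plain Python iterator so the snippet runs without Spark.

```python
from typing import Dict, Iterable, Iterator


def process_partition(rows: Iterable[Dict[str, float]]) -> Iterator[Dict[str, float]]:
    """Hypothetical per-partition logic: called once per partition, not once per row."""
    # Any one-time setup for the partition (e.g. loading a lookup table) would go here.
    running_total = 0.0
    for row in rows:
        # Stateful custom logic; note the total is per partition, not global.
        running_total += row["amount"]
        yield {**row, "running_total": running_total}


# Local check: a plain iterator stands in for a single partition.
partition = iter([{"amount": 1.0}, {"amount": 2.5}])
result = list(process_partition(partition))

# With PySpark available, the same function could be applied roughly as
# (assumption, not executed here):
#   df.rdd.map(lambda r: r.asDict()).mapPartitions(process_partition)
```

Because the function is a generator, rows are streamed through one at a time instead of materializing the whole partition in memory, which is usually the safer default for large partitions.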