Taming the language border in data analytics and science with Apache Arrow

Store

06/18/2019 - 12:20 to 13:00

Maschinenhaus

long talk (40 min)

Intermediate

Session abstract:

When building data products, whether by processing huge amounts of data or by applying machine learning, many different ecosystems meet, and large volumes of data have to be passed between them. The divide is not only between systems written in Java and the machine learning model in Python that they need to hand the data to. Once you integrate with the existing business infrastructure, you also need to cater for legacy systems, and you need to bring the large volumes of data to the user via UIs.

Switching between these ecosystems is often expensive. You either settle for a simple but slow adapter or spend a lot of engineering effort building an efficient interchange. Apache Arrow started three years ago as an Apache top-level project to eliminate much of this overhead. In this talk, we would like to show how to move large amounts of data from Java to Python and then on to JavaScript without leaving the comfort zone of each ecosystem. An important aspect is that everyone should be able to use their usual tools without deep knowledge of the other ecosystems, while data is still exchanged at high speed. This should enable developers to drop hand-written conversion code from their data pipelines and focus on the actual business requirements.