Miles or Kilometers? Why Your Data Schema Should Include Units

06/17/2019 - 11:50 to 12:10

Frannz Salon

short talk (20 min)

Intermediate

Session abstract:

You wouldn’t dream of developing a modern data processing system without defining schemas for your data to properly annotate whether data elements are dates, strings, integers or reals. But annotating values as numeric leaves a lot unsaid. Is my throughput in queries per second or queries per minute? Is this airline flight distance in miles or kilometers? Is that fluid volume in cubic centimeters or liters?

Practically every numeric value we manipulate in our software has an implied unit. We learn how to check our unit analysis in high school science, but our development tools provide no support for keeping track of units or enforcing their correctness! In modern distributed systems operating at scale, errors caused by simple unit mistakes can cost you valuable engineering time and generate weeks of corrupted data.

It doesn’t have to be this way. A unit is really a kind of data type. What if our compilers could check units along with all the other data types we use every day? Languages with modern type systems, such as Scala, Haskell and Rust, open up new possibilities for representing units as compiler-checkable types, using constructs like type classes and dependent types.
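As a rough illustration of the idea that a unit can be treated as a data type, here is a minimal sketch in Rust. It is not taken from the talk and is not the API of any particular units library; the names `Miles`, `Kilometers`, `Length` and `total_km` are hypothetical, chosen only for this example. The unit is carried as a phantom type parameter, so it costs nothing at run time but is visible to the compiler.

```rust
use std::marker::PhantomData;
use std::ops::Add;

// Hypothetical unit markers: zero-sized types that exist only for the type checker.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Miles;
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Kilometers;

// A length whose unit `U` is part of its type (illustrative, not a library type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Length<U> {
    value: f64,
    unit: PhantomData<U>,
}

impl<U> Length<U> {
    pub fn new(value: f64) -> Self {
        Length { value, unit: PhantomData }
    }

    pub fn value(&self) -> f64 {
        self.value
    }
}

// Addition is defined only between two lengths that share the same unit.
impl<U> Add for Length<U> {
    type Output = Length<U>;

    fn add(self, rhs: Self) -> Self::Output {
        Length::new(self.value + rhs.value)
    }
}

// Changing units requires an explicit conversion (1 mile = 1.609344 km).
impl From<Length<Miles>> for Length<Kilometers> {
    fn from(miles: Length<Miles>) -> Self {
        Length::new(miles.value * 1.609344)
    }
}

// Mixed inputs must be converted before they can be combined.
pub fn total_km(a: Length<Miles>, b: Length<Kilometers>) -> Length<Kilometers> {
    Length::<Kilometers>::from(a) + b
}
```

With these definitions, adding a `Length<Miles>` directly to a `Length<Kilometers>` is rejected at compile time as a type mismatch, while `total_km` compiles because the conversion is spelled out. A real unit library would also need to handle derived units such as miles per hour, which is where richer type-level machinery comes in.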

In this talk, we will explain the challenges around supporting units as static data types, and approaches to solving them with modern type systems. We will discuss a range of compelling use cases, from compiler-verified schemas to distributed data science pipelines to type-safe internationalization. Concepts will be illustrated with concrete code, using a Scala library for compile-time unit analysis. The audience will discover what our software systems could look like in a world with first-class unit types.