Session abstract:
Hive tables are an integral part of the big data ecosystem, but the simple directory-based design that made them ubiquitous is increasingly problematic. Netflix uses tables backed by S3, which, like other object stores, doesn't fit this directory-based model: listings are much slower, renames are not atomic, and results are eventually consistent. Even tables in HDFS are problematic at scale, and reliable query behavior requires readers to acquire locks and wait.
I will present an overview of Apache Iceberg, a new open source project that defines a table layout addressing the challenges of current Hive tables, with properties specifically designed for cloud object stores such as S3. Iceberg joined the Apache Incubator last year. It specifies a portable table format and standardizes many important features, including:
- All reads use snapshot isolation without locking.
- No directory listings are required for query planning.
- Files can be added, removed, or replaced atomically.
- Full schema evolution supports changes in the table over time.
- Partitioning evolution enables changes to the physical layout without breaking existing queries.
- Data files are stored as Avro, ORC, or Parquet.
- Support for Spark, Hive, and Presto.
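
As a rough illustration of how the first few properties can fit together, the following toy model treats each table version as an immutable snapshot that lists its data files explicitly, and commits by atomically swapping a pointer to the current snapshot. It is only a sketch of the general idea, not Iceberg's API or implementation: ToyTable, Snapshot, and commit are hypothetical names, and an in-process lock stands in for the atomic swap of table metadata.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """An immutable table version: the complete set of data files it contains."""
    snapshot_id: int
    files: frozenset


class ToyTable:
    """Hypothetical in-memory table; not the Iceberg API."""

    def __init__(self):
        # The lock only models an atomic pointer swap between writers.
        # Readers never take it.
        self._swap_lock = threading.Lock()
        self._current = Snapshot(0, frozenset())

    def current_snapshot(self):
        # Planning reads the file list from the snapshot, so no directory
        # listing is needed.
        return self._current

    def commit(self, base, added=(), removed=()):
        """Publish a new snapshot derived from `base`.

        Returns False if another writer committed first, in which case the
        caller should re-read the current snapshot and retry.
        """
        new_files = (base.files - frozenset(removed)) | frozenset(added)
        new = Snapshot(base.snapshot_id + 1, new_files)
        with self._swap_lock:
            if self._current is not base:
                return False
            self._current = new
            return True


table = ToyTable()
table.commit(table.current_snapshot(), added=["a.parquet", "b.parquet"])

# A reader pins the snapshot it planned against.
reader_view = table.current_snapshot()

# A writer replaces both files in a single atomic commit.
table.commit(
    table.current_snapshot(),
    added=["c.parquet"],
    removed=["a.parquet", "b.parquet"],
)

print(sorted(reader_view.files))               # ['a.parquet', 'b.parquet']
print(sorted(table.current_snapshot().files))  # ['c.parquet']
```

The reader that pinned the earlier snapshot keeps seeing a consistent set of files without holding a lock, while new readers see only the replacement file.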