E-Commerce search at scale on Apache Lucene (tm)

Search
06/17/2019 - 15:20 to 16:00
Frannz Salon
long talk (40 min)
Intermediate

Session abstract: 

After many years running its own in-house C++ search engine, Amazon is exploring moving its customer-facing e-commerce product search, which serves millions of customers worldwide each day, to Apache Lucene (tm). Solr, Elasticsearch and other Lucene derivatives have been used widely for many years at Amazon, but until now the .com product search has been powered by a proprietary in-house engine. We'll discuss why we decided to adopt open source for this vital technology and dive deep into the technical challenges we faced in replicating our legacy engine's behavior, pointing out novel uses of Lucene along the way. Highlights will include:

  • Our open-source contributions: concurrent deletions and updates, custom term frequencies, improvements to taxonomy faceting.
  • Index replication on S3: a near-real-time segment-based index replication strategy backed by cheap cloud storage that provides excellent scalability and durability.
  • Merging: fully merging all index segments may be harmful! We'll discuss why our query latencies increase when the index is fully merged.
  • Ranking: stringent performance requirements together with complex machine-learned ranking models make for an uneasy marriage. We'll explain how index sorting with early termination, combined with multiphase ranking, makes it possible to have both.
  • Offers/families: how we model offers for a single product, and families with multiple products, using index-time joins.
  • Scoring: custom term frequencies derived from various machine-learned signals feed into an extensive library of custom scoring functions, including compiled expressions whose performance is competitive with C++ functions.
  • Garbage collection: we had 8-second stop-the-world pauses and had to dive deep to understand and fix them. We're using Lucene 7.5.0 on JDK 11 with the deprecated concurrent garbage collector, CMS. We tried FSTPostingsFormat, then had to revert it. We also reduced the RAM used by our complex hierarchical index configuration.
  • Query caching: we tried enabling Lucene's query cache, and it didn't help, but we saw nice gains from indexing commonly occurring sub-queries and optimizing the corresponding clauses at search time.
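
To make the Ranking highlight above more concrete, here is a minimal sketch of index sorting with early termination followed by a second ranking phase. It is written in plain Java rather than against Lucene's API, and it is not the implementation described in the talk: the Doc class, its fields (staticRank, matches, modelScore), and the sample data are all hypothetical, chosen only to show the idea.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class EarlyTerminationSketch {

    /** Hypothetical document; these fields are illustrative, not Lucene types. */
    static final class Doc {
        final int id;
        final double staticRank;  // cheap signal known at index time
        final boolean matches;    // stands in for "matches the query"
        final double modelScore;  // stands in for an expensive ranking model

        Doc(int id, double staticRank, boolean matches, double modelScore) {
            this.id = id;
            this.staticRank = staticRank;
            this.matches = matches;
            this.modelScore = modelScore;
        }
    }

    /** Index time: store documents ordered by static rank, best first. */
    static List<Doc> sortIndex(List<Doc> docs) {
        List<Doc> sorted = new ArrayList<>(docs);
        sorted.sort(Comparator.comparingDouble((Doc d) -> d.staticRank).reversed());
        return sorted;
    }

    /** Phase 1: scan in index order and stop once enough matches are collected. */
    static List<Doc> collectCandidates(List<Doc> sortedIndex, int maxCandidates) {
        List<Doc> candidates = new ArrayList<>();
        for (Doc d : sortedIndex) {
            if (!d.matches) {
                continue;
            }
            candidates.add(d);
            if (candidates.size() >= maxCandidates) {
                break; // early termination: later docs have a lower static rank
            }
        }
        return candidates;
    }

    /** Phase 2: apply the expensive score to the small candidate set only. */
    static List<Doc> rerank(List<Doc> candidates, int topK) {
        List<Doc> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble((Doc d) -> d.modelScore).reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(topK, sorted.size())));
    }

    static List<Integer> ids(List<Doc> docs) {
        List<Integer> out = new ArrayList<>();
        for (Doc d : docs) {
            out.add(d.id);
        }
        return out;
    }

    /** Made-up sample data for the demonstration. */
    static List<Doc> sampleDocs() {
        return Arrays.asList(
            new Doc(4, 0.6, true, 0.50),
            new Doc(1, 0.9, true, 0.20),
            new Doc(5, 0.5, true, 0.99),
            new Doc(2, 0.8, false, 0.90),
            new Doc(3, 0.7, true, 0.80));
    }

    public static void main(String[] args) {
        List<Doc> index = sortIndex(sampleDocs());
        List<Doc> candidates = collectCandidates(index, 3);
        System.out.println(ids(candidates));            // prints [1, 3, 4]
        System.out.println(ids(rerank(candidates, 2))); // prints [3, 4]
    }
}
```

In this toy data, document 5 has the highest model score but is never visited, because the scan stops after three matches. That is the trade-off early termination accepts: the static sort key has to be a good enough proxy for relevance that the truncated candidate set still contains the documents the later ranking phase would want.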
