Session abstract:
Imagine you are a multinational company with customers in countries where the business language is not necessarily English. How would you build a centralized search system that caters to the needs of all your users? Apache Lucene, Elasticsearch, and Apache Solr all provide language-specific text analyzers, but which one should you use, and when? We found ourselves in exactly this situation.
We are an email-security company with around 300 billion emails archived, amounting to petabytes of indexed data. We initially used Lucene's whitespace analyzer to serve searches across languages, but this approach, though simple, showed serious limitations once we began serving languages such as German. I will explain how we overcame these problems by first identifying the language of the content with our own language-detection model, which in turn guided the selection of an analyzer for each email.
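To make that routing step concrete, here is a minimal sketch in Java against Lucene 8.x-style APIs. The detectLanguage method is a hypothetical stand-in for our proprietary language-detection model, and the language-to-analyzer map is only an example selection, not the production configuration:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

import java.util.Map;

public final class AnalyzerRouter {

    // Map detected ISO 639-1 language codes to Lucene's language-specific analyzers.
    private static final Map<String, Analyzer> ANALYZERS = Map.of(
            "en", new EnglishAnalyzer(),
            "de", new GermanAnalyzer(),
            "fr", new FrenchAnalyzer()
    );

    // Fall back to a language-neutral analyzer when detection is inconclusive.
    private static final Analyzer FALLBACK = new StandardAnalyzer();

    // Pick the analyzer for a piece of email text based on its detected language.
    public static Analyzer forText(String text) {
        String lang = detectLanguage(text);
        return ANALYZERS.getOrDefault(lang, FALLBACK);
    }

    // Hypothetical placeholder: plug in your own language-detection model here.
    private static String detectLanguage(String text) {
        throw new UnsupportedOperationException("language detector not wired in");
    }
}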
This talk will walk you through how to build multilingual search systems and explore the different possible approaches. It will also discuss the problems you may run into when language-based analyzers are used and the ways to improve search results in those cases.
In particular, the talk will focus on query-log analysis as an effective way to improve multilingual search: it provides the feedback needed to fine-tune the analyzers used for stemming and lemmatization, increasing not only the recall but also the precision (relevance) of the search results.
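As one hedged sketch of how such feedback can be applied, Lucene's StemmerOverrideFilter can be placed ahead of the regular stemmer and seeded with exception entries mined from query logs, so that terms the stemmer mishandles are pinned to a known-good form. The override entries below are hypothetical examples, and the chain assumes Lucene 8.x-style APIs:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.de.GermanStemFilter;
import org.apache.lucene.analysis.miscellaneous.StemmerOverrideFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

import java.io.IOException;

public final class TunedGermanAnalyzer extends Analyzer {

    private final StemmerOverrideFilter.StemmerOverrideMap overrides;

    public TunedGermanAnalyzer() throws IOException {
        // Exceptions derived from query-log analysis; entries are hypothetical.
        StemmerOverrideFilter.Builder builder = new StemmerOverrideFilter.Builder(true);
        builder.add("kinder", "kind");
        builder.add("rechnungen", "rechnung");
        this.overrides = builder.build();
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream stream = new LowerCaseFilter(source);
        // Apply overrides first: matched tokens are marked as keywords,
        // so the downstream stemmer leaves them untouched.
        stream = new StemmerOverrideFilter(stream, overrides);
        stream = new GermanStemFilter(stream);
        return new TokenStreamComponents(source, stream);
    }
}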