Building an enterprise Natural Language Search Engine with ElasticSearch and Facebook’s DrQA

06/17/2019 - 15:20 to 16:00

Kesselhaus

long talk (40 min)

Intermediate

Session abstract:

Modern search engines leverage Natural Language Processing (NLP) and Machine Learning (ML) to improve the relevance of results. In this presentation, we focus on the specific field of ‘Enterprise Search’, whose primary goal is to make domain-specific company data and documents readily accessible to employees, improving their productivity and promoting collaboration. Any large organization produces vast amounts of documentation about its specific systems, technologies and processes. The question, then, is: “How can we speed up search-driven activities and enhance the user experience for the enterprise?”

The Facebook AI Research team has developed and open-sourced DrQA (Chen et al., 2017), a tool that answers questions by reading Wikipedia articles. DrQA is based on a two-stage Q&A pipeline: (i) a Retriever, which retrieves the top-k relevant documents (pages), followed by (ii) a Reader, which determines the most relevant answer span among the retrieved documents (pages). We applied it to an enterprise use case: searching over machine manuals used by factory operators.
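The two-stage idea can be sketched in a few lines of Python. This is a toy illustration only, not DrQA's code: simple term overlap stands in for the Retriever's ranking, and a "best matching sentence" heuristic stands in for the neural Reader. The function names and the sample manuals are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(question, docs, k=2):
    """Stage (i), toy Retriever: rank documents by term overlap with the question."""
    q_terms = Counter(tokenize(question))
    scores = {}
    for doc_id, text in docs.items():
        d_terms = Counter(tokenize(text))
        scores[doc_id] = sum(min(q_terms[t], d_terms[t]) for t in q_terms)
    # Highest score first; ties are broken by document id for determinism.
    ranked = sorted(scores, key=lambda d: (-scores[d], d))
    return [d for d in ranked[:k] if scores[d] > 0]


def read(question, doc_ids, docs):
    """Stage (ii), toy Reader: return the sentence sharing the most terms with the question."""
    q_terms = set(tokenize(question))
    best = (0, None, None)  # (overlap, doc_id, sentence)
    for doc_id in doc_ids:
        for sentence in re.split(r"(?<=[.!?])\s+", docs[doc_id]):
            overlap = len(q_terms & set(tokenize(sentence)))
            if overlap > best[0]:
                best = (overlap, doc_id, sentence.strip())
    return best[1], best[2]


def answer(question, docs, k=2):
    """Run the Retriever, then the Reader, over the retrieved documents only."""
    return read(question, retrieve(question, docs, k), docs)


# Hypothetical machine-manual snippets, invented for this sketch.
manuals = {
    "press_manual": "The hydraulic press must be inspected weekly. "
                    "Reset the press by holding the green button for five seconds.",
    "oven_manual": "The oven reaches 200 degrees in ten minutes. "
                   "Clean the oven filter monthly.",
}

print(answer("How do I reset the press?", manuals))
```

The point of the split is that the cheap Retriever narrows the corpus to k candidates, so the expensive Reader only has to examine a handful of documents per question.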

We present an architecture that integrates ElasticSearch into the DrQA pipeline. This integration has been contributed upstream and is now available from the official DrQA GitHub repository (https://github.com/facebookresearch/DrQA). The end result is a very scalable search engine that can be deployed on any document repository in your enterprise containing Microsoft Office documents, presentations, emails, PDF documents, etc. Simply point it to your ElasticSearch index and it will be able to provide ‘very precise’ answers based on your documents, thanks to the pre-trained Deep Learning Q&A model of DrQA. We discuss the lessons learned while building such an engine and its limitations, e.g. the scenarios where it excels at identifying precise answers and how it performs compared to a non-ML approach or a typical keyword-based search.
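To make the retrieval step concrete, the sketch below builds the kind of search request an ElasticSearch-backed Retriever could send, and turns the response into passages for a Reader. It is an illustration under stated assumptions, not the code contributed to DrQA: the helper names and the `content` field are hypothetical, and the response shown is hand-written in ElasticSearch's standard `hits.hits` shape rather than real output.

```python
def build_retriever_query(question, k=5, text_field="content"):
    """Build a search body asking for the top-k documents matching the question.

    Uses a full-text `match` query; `text_field` is an assumed field name
    and must be adapted to the mapping of the actual index.
    """
    return {"size": k, "query": {"match": {text_field: question}}}


def hits_to_passages(response, text_field="content"):
    """Extract (doc_id, text, score) tuples from a search response."""
    return [
        (hit["_id"], hit["_source"].get(text_field, ""), hit["_score"])
        for hit in response["hits"]["hits"]
    ]


body = build_retriever_query("How do I reset the press?", k=3)

# Hand-written example response (not real output), limited to the keys used above.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "manual-17", "_score": 4.2,
             "_source": {"content": "Reset the press by holding the green button."}},
            {"_id": "manual-03", "_score": 1.1,
             "_source": {"content": "The press must be inspected weekly."}},
        ]
    }
}

for doc_id, text, score in hits_to_passages(sample_response):
    print(doc_id, score, text)
```

In a real deployment the body would be sent to the index's `_search` endpoint, and the returned passages would be handed to the Reader to extract the answer span.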