Meet MS MARCO: A Dataset for the AI Research Community


Ask Google or Cortana a simple question such as “How much is 10 times 20?” and you will get the answer: 200. But ask something a tad more difficult and you’re likely to be directed to a website that you will have to scroll through to find your answer. Microsoft wants to change all that. Meet MS MARCO.

The tech giant has released a set of 100,000 questions and answers that artificial intelligence (AI) researchers can use to create systems that read and answer questions as precisely as a human.

MS MARCO, or Microsoft MAchine Reading COmprehension, is a collection of material that can be used to teach artificial intelligence systems to recognize questions and put together answers. The end goal is a system that can produce its own answers to unique questions it has not seen before.

Researchers can train systems by providing them with realistic questions and answers. This in turn can help the systems deal with the complex questions that people regularly ask, including questions that have no clear-cut answer or have multiple possible answers.

By making a human-written dataset of this nature openly available, Microsoft hopes that MS MARCO can drive advances in AI research, helping AI read and understand language just as a human would.

So the next time you have a complex question, rather than reading your way through an entire website, you could ask Google, Cortana, or Siri, and it would scan through the results and provide a complete answer. In case you’re wondering, the 100,000 questions and answers are based on questions that actual people asked either the Bing search engine or the Cortana virtual assistant. The answers in MS MARCO were drawn from around 200,000 documents or websites and then summarized by a human.

MS MARCO is officially available to businesses and researchers. The datasets are available for download here, but they are for non-commercial use.



