Reading Comprehension with Commonsense Reasoning Dataset (ReCoRD) is a large-scale reading comprehension dataset which requires commonsense reasoning. ReCoRD consists of queries automatically generated from CNN/Daily Mail news articles; the answer to each query is a text span from a summarizing passage of the corresponding news article. The goal of ReCoRD is to evaluate a machine's ability to perform commonsense reasoning in reading comprehension. ReCoRD is pronounced as [ˈrɛkərd].
ReCoRD contains 120,000+ queries from 70,000+ news articles. Each query has been validated by crowdworkers. Unlike existing reading comprehension datasets, ReCoRD contains a large portion of queries requiring commonsense reasoning, making it a challenging benchmark for bridging the gap between human and machine commonsense reading comprehension.
ReCoRD paper (Zhang et al. '18)

Browse examples in ReCoRD in a friendly way: Browse ReCoRD

We've built a few resources to help you get started with the dataset.
Download a copy of the dataset in JSON format:
Read the following Readme to get familiar with the data structure.
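If you want to poke at the data directly, here is a minimal Python loading sketch. The file name, field names, and inclusive entity spans are assumptions based on the structure described in the Readme, so treat it as illustrative rather than authoritative.

```python
import json

# Minimal loading sketch (assumed layout, per the Readme): each entry carries a
# passage (text plus entity character spans) and a list of cloze-style queries
# whose "@placeholder" must be filled by one of the passage entities.
with open("dev.json") as f:          # path is illustrative
    dataset = json.load(f)

for example in dataset["data"]:
    passage_text = example["passage"]["text"]
    # Candidate answers: entity mentions marked by character offsets in the
    # passage (spans assumed inclusive here).
    candidates = {passage_text[e["start"]:e["end"] + 1]
                  for e in example["passage"]["entities"]}
    for qa in example["qas"]:
        query = qa["query"]                       # contains "@placeholder"
        gold = [a["text"] for a in qa["answers"]]
        # A model's job: pick the candidate that best fills the placeholder.
```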
To evaluate your models, we have also made available the evaluation script we will use for official evaluation, along with a sample prediction file that the script will take as input. To run the evaluation, use `python evaluate.py <path_to_dev> <path_to_predictions>`.
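For reference, here is a hedged sketch of producing a prediction file. The id-to-answer-string JSON format is an assumption modeled on SQuAD-style evaluation scripts; defer to the sample prediction file above for the exact schema.

```python
import json

# Assumed prediction format: a JSON object mapping each query id to a single
# predicted answer string. The id and answer below are purely illustrative.
predictions = {"example-query-id": "Joe Biden"}

with open("predictions.json", "w") as f:
    json.dump(predictions, f)

# Then score against the dev set:
#   python evaluate.py dev.json predictions.json
```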
Once you have built a model that works to your expectations on the dev set, you can submit it to get official scores on the dev set and a hidden test set. To preserve the integrity of test results, we do not release the test set to the public. Instead, we require you to submit your model so that we can run it on the test set for you. Here's a tutorial walking you through official evaluation of your model: Submission Tutorial
ReCoRD contains passages from two news domains, CNN and Daily Mail. We make them public under the following licenses:
Ask us questions at our Google group or at zsheng2@jhu.edu.
We thank the SQuAD team for allowing us to use their code and templates for generating this website.
Rank | Date | Model | Institution | EM | F1
---|---|---|---|---|---
– | – | Human Performance (Zhang et al. '18) | Johns Hopkins University | 91.31 | 91.69
1 | Mar 26, 2020 | LUKE (single model) | Studio Ousia & NAIST & RIKEN AIP | 90.64 | 91.21
2 | Jul 20, 2019 | XLNet + MTL + Verifier (ensemble) | PingAn Smart Health & SJTU | 83.09 | 83.74
3 | Jul 20, 2019 | XLNet + MTL + Verifier (single model) | PingAn Smart Health & SJTU | 81.46 | 82.66
3 | Jul 09, 2019 | CSRLM (single model) | Anonymous | 81.78 | 82.58
4 | Jul 24, 2019 | SKG-NET (single model) | Anonymous | 79.48 | 80.04
5 | Jan 11, 2019 | KT-NET (single model) | Baidu NLP | 71.60 | 73.62
5 | May 16, 2019 | SKG-BERT (single model) | Anonymous | 72.24 | 72.78
6 | Nov 29, 2018 | DCReader+BERT (single model) | Anonymous | 69.49 | 71.14
7 | Oct 08, 2020 | GraphBert (single model) | Anonymous | 60.80 | 62.99
8 | Oct 07, 2020 | GraphBert-WordNet (single model) | Anonymous | 59.86 | 61.89
9 | Oct 08, 2020 | GraphBert-NELL (single model) | Anonymous | 59.41 | 61.51
10 | Nov 16, 2018 | BERT-Base (single model; modification of the Google AI implementation, https://arxiv.org/pdf/1810.04805.pdf) | JHU | 54.04 | 56.07
11 | Oct 25, 2018 | DocumentQA w/ ELMo (single model; modification of the AllenNLP implementation, https://arxiv.org/pdf/1710.10723.pdf) | JHU | 45.44 | 46.65
12 | Oct 25, 2018 | SAN (single model; https://arxiv.org/pdf/1712.03556.pdf) | Microsoft Business Applications Research Group | 39.77 | 40.72
13 | Oct 25, 2018 | DocumentQA (single model; modification of the AllenNLP implementation, https://arxiv.org/pdf/1710.10723.pdf) | JHU | 38.52 | 39.76
14 | Oct 25, 2018 | ASReader (single model; modification of the IBM Watson implementation, https://arxiv.org/pdf/1603.01547.pdf) | JHU | 29.80 | 30.35
15 | Oct 25, 2018 | Random Guess | JHU | 18.55 | 19.12
16 | Oct 25, 2018 | Language Models (single model; modification of the Google Brain implementation, https://arxiv.org/pdf/1806.02847.pdf) | JHU | 17.57 | 18.15