LLM Evaluation Datasets

by Prashant Malick on August 3, 2023

The following evaluation suite consists of 34 benchmarks drawn from a variety of sources and grouped into six overarching areas of competency expected of robust foundation models. These areas were chosen after a thorough review of the current LLM literature and of open-source evaluation frameworks such as the EleutherAI Eval Harness and Stanford CRFM's HELM. Unless noted otherwise, each benchmark is scored by prompting the model with a number of solved examples (few-shot) and measuring its accuracy.
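To make that setup concrete, below is a minimal sketch of a few-shot, multiple-choice evaluation loop. It is illustrative only: build_fewshot_prompt, model_loglikelihood, and the item fields are assumptions made for this sketch, not the API of any of the harnesses mentioned above.

```python
# Minimal sketch of a few-shot, multiple-choice evaluation loop.
# model_loglikelihood(prompt, continuation) is a hypothetical stand-in for
# whatever scoring API your model exposes; it is not part of any library
# referenced in this post.

def build_fewshot_prompt(examples, question):
    """Concatenate k solved examples ahead of the test question."""
    shots = "\n\n".join(f"Question: {q}\nAnswer: {a}" for q, a in examples)
    return f"{shots}\n\nQuestion: {question}\nAnswer:"

def evaluate_multiple_choice(dataset, fewshot_examples, model_loglikelihood):
    """Score every answer choice and count how often the best-scoring one is correct."""
    correct = 0
    for item in dataset:  # item: {"question": str, "choices": [str], "label": int}
        prompt = build_fewshot_prompt(fewshot_examples, item["question"])
        scores = [model_loglikelihood(prompt, " " + choice) for choice in item["choices"]]
        if scores.index(max(scores)) == item["label"]:
            correct += 1
    return correct / len(dataset)
```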

World Knowledge

1. Jeopardy Jeopardy consists of 2,117 Jeopardy questions separated into 5 categories: Literature, American History, World History, Word Origins, and Science. The model is expected to give the exact correct response to the question. It was custom-curated by MosaicML from a larger Jeopardy set available on Hugging Face.

  • Year released: 2022
  • Number of few shot examples: 10
  • Random baseline accuracy: 0%

2. MMLU MMLU consists of 14,042 four-choice multiple choice questions distributed across 57 categories. The questions are in the style of academic standardized tests; the model is given the question and the choices and is expected to choose among A, B, C, and D as its output (a scoring sketch follows the list below). The subjects range from jurisprudence to math to morality.

  • Year released: 2019
  • Number of few shot examples: 10
  • Random baseline accuracy: 25%
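As a concrete illustration of the A/B/C/D format, the sketch below formats one MMLU-style question and picks the letter the model scores highest. The prompt wording and the model_loglikelihood scoring function are assumptions carried over from the sketch in the introduction, not the official MMLU template.

```python
# Illustrative MMLU-style prompt and letter-choice scoring. The exact prompt
# wording here is an assumption, not an official template.

LETTERS = ["A", "B", "C", "D"]

def format_mmlu_question(question, choices):
    lines = [question]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices)]
    lines.append("Answer:")
    return "\n".join(lines)

def predict_letter(prompt, model_loglikelihood):
    # model_loglikelihood is the same hypothetical scoring function as above
    scores = {letter: model_loglikelihood(prompt, " " + letter) for letter in LETTERS}
    return max(scores, key=scores.get)
```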

3. BIG-bench: wikidata BIG-bench wikidata consists of 20,321 questions regarding factual information pulled from Wikipedia. Questions range from the native language of celebrities to the country that different regions belong to. Models are given a sentence such as “The country of citizenship of Barack Obama is” and are expected to complete the sentence with, e.g., “the United States.”

  • Year released: 2022
  • Number of few shot examples: 10
  • Random baseline accuracy: ~0%

4. ARC easy ARC easy consists of 2,376 easy four-choice multiple choice science questions drawn from grade 3-9 science exams. The questions rely on world knowledge related to basic science.

  • Year released: 2019
  • Number of few shot examples: 10
  • Random baseline accuracy: 25%

5. ARC challenge ARC challenge consists of 1,172 hard four-choice multiple choice science questions drawn from grade 3-9 science exams. The questions rely on scientific world knowledge and some procedural reasoning.

  • Year released: 2019
  • Number of few shot examples: 10
  • Random baseline accuracy: 25%

6. BIG-bench misconceptions BIG-bench misconceptions consists of 219 true-or-false questions regarding common misconceptions about a variety of topics including urban legends, stereotypes, basic science, and law.

  • Year released: 2022
  • Number of few shot examples: 10
  • Random baseline accuracy: 50%

Commonsense Reasoning

Commonsense reasoning tests the models’ ability to perform basic reasoning tasks that require commonsense knowledge of objects, their properties, and their behavior.

7. BIG-bench: Strategy QA BIG-bench strategy QA consists of 2,289 very eclectic yes/no questions on a wide range of commonsense subjects, e.g., “Can fish get tonsillitis?”

  • Year released: 2022
  • Number of few shot examples: 10

8. BIG-bench: Strange Stories BIG-bench strange stories consists of 174 short stories, each followed by a two-choice multiple-choice question in which the model is asked to make commonsense inferences about the characters in the stories, how they might feel, and why they act in certain ways.

  • Year released: 2022
  • Number of few shot examples: 10
  • Random baseline accuracy: 50%

9. BIG-bench: Novel Concepts BIG-bench novel concepts consists of 32 find-the-common-concept problems in which the model is given 3 words and has to choose from among 4 possible concepts that they all have in common.

  • Year released: 2022
  • Number of few shot examples: 10
  • Random baseline accuracy: 25%

10. COPA COPA consists of 100 cause/effect multiple choice questions in which the model is prompted with a premise and the model must choose correctly between two possible causes/effects of the premise.

  • Year released: 2011
  • Number of few shot examples: 0
  • Random baseline accuracy: 50%

11. PIQA PIQA consists of 1,838 commonsense physical intuition 2-choice multiple choice questions.

  • Year released: 2019
  • Number of few shot examples: 10
  • Random baseline accuracy: 50%

12. OpenBook QA OpenBook QA consists of 500 four-choice multiple choice questions that rely on basic physical and scientific intuition about common objects and entities.

  • Year released: 2018
  • Number of few shot examples: 0
  • Random baseline accuracy: 25%

Language Understanding

Language understanding tasks evaluate the model’s ability to understand the structure and properties of languages, including determining which types of sentences are more likely than others, performing translations between languages, and identifying languages.

13. LAMBADA LAMBADA consists of 5,153 passages taken from books. The model is expected to read the first N-1 words of each passage and predict the final word (a scoring sketch follows the list below).

  • Year released: 2016
  • Number of few shot examples: 0
  • Random baseline accuracy: 0%
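Scoring is a simple exact-word check: the model reads the passage minus its final word and must produce that word. Below is a sketch; model_generate is a hypothetical greedy-generation function, not a specific library API.

```python
# Sketch of LAMBADA-style scoring: the model reads the passage minus its final
# word and must produce that word. model_generate is a hypothetical greedy
# generation function, not a specific library API.

def lambada_accuracy(passages, model_generate):
    correct = 0
    for passage in passages:
        *context, target = passage.split()
        prediction = model_generate(" ".join(context)).strip().split()
        if prediction and prediction[0] == target:
            correct += 1
    return correct / len(passages)
```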

14. HellaSwag HellaSwag consists of 10,042 multiple choice scenarios in which the model is prompted with a scenario and must choose the most likely conclusion to the scenario from four possible options.

  • Year released: 2019
  • Number of few shot examples: 10
  • Random baseline accuracy: 25%

15. Winograd Schema Challenge The Winograd Schema Challenge consists of 273 scenarios in which the model must use semantics to correctly resolve the anaphora in a sentence. Two possible beginnings to a sentence are presented along with a single ending. Each beginning resolves the anaphora in a different way, and only one resolution is semantically valid; the model must choose the option that produces the valid resolution.

  • Year released: 2012
  • Number of few shot examples: 0
  • Random baseline accuracy: 50%

16. Winogrande Winogrande consists of 1,267 scenarios in which two possible beginnings of a sentence are presented along with a single ending. Both combinations are syntactically valid, but only one is semantically valid, and the model must choose the one that is semantically valid.

  • Year released: 2019
  • Number of few shot examples: 0
  • Random baseline accuracy: 50%

17. BIG-bench language identification BIG-bench language identification consists of 10,000 four-choice multiple choice questions in which a sentence in a language other than English is presented and the model must identify the language of the sentence from among four options.

  • Year released: 2022
  • Number of few shot examples: 10
  • Random baseline accuracy: 25%

18. BIG-bench conceptual combinations BIG-bench conceptual combinations consists of 103 four-choice multiple choice questions in which the model is presented with a made-up word and its definition, along with a multiple choice question about the meaning of a sentence using that made-up word. The model is expected to select the correct answer among the choices presented.

  • Year released: 2022
  • Number of few shot examples: 10
  • Random baseline accuracy: 25%

19. BIG-bench conlang translation BIG-bench conlang translation consists of 164 problems in which the model is given translations of simple sentences between English and a constructed (fake) language. The model is then tested on its ability to translate a more complex sentence in the constructed language into English.

  • Year released: 2022
  • Number of few shot examples: 0
  • Random baseline accuracy: 0%

Symbolic Problem Solving

Symbolic problem solving tasks test the model’s ability to solve a diverse range of symbolic tasks including arithmetic, logical reasoning, algorithms, and algebra.

20. BIG-bench elementary math QA BIG-bench elementary math QA consists of 38,160 four-choice multiple choice arithmetic word problems.

  • Year released: 2022
  • Number of few shot examples: 10
  • Random baseline accuracy: 25%

21. BIG-bench dyck languages BIG-bench dyck languages consists of 1,000 complete-the-sequence questions in which a partially completed balanced expression consisting of parentheses and braces is given, and the model must output the exact tokens needed to complete the balanced expression (a reference solver sketch follows the list below).

  • Year released: 2022
  • Number of few shot examples: 10
  • Random baseline accuracy: 0%
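Because the answer to each item is fully determined by the prefix, a reference solver is easy to write. The sketch below assumes space-separated tokens over parentheses and braces, as described above; the exact BIG-bench token format is an assumption here.

```python
# Sketch of a reference solver for a dyck-language item: given a prefix of a
# balanced expression over parentheses and braces, return the closing tokens
# that complete it. Space-separated tokens are an assumption about the format.

OPEN_TO_CLOSE = {"(": ")", "{": "}"}

def completion_tokens(prefix: str) -> str:
    stack = []
    for tok in prefix.split():
        if tok in OPEN_TO_CLOSE:
            stack.append(OPEN_TO_CLOSE[tok])
        else:
            # a closing token must match the most recently opened bracket
            assert stack and stack[-1] == tok, "not a valid partial expression"
            stack.pop()
    return " ".join(reversed(stack))

# completion_tokens("{ ( ( ) (") == ") ) }"
```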

22. BIG-bench cs algorithms BIG-bench cs algorithms consists of 1,320 questions of two types. In the first type the model must determine the length of the longest common subsequence of two strings, and in the second type the model must determine whether an expression consisting of parentheses and braces is balanced (reference implementations of both subtasks follow the list below).

  • Year released: 2022
  • Number of few shot examples: 10
  • Random baseline accuracy: 0%
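Both subtasks have short classical solutions, shown in the sketch below: a standard dynamic program for longest-common-subsequence length and a stack-based balance check.

```python
# Reference implementations of the two cs-algorithms subtasks described above.

def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, via the classic dynamic program."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def is_balanced(expr: str) -> bool:
    """True if every parenthesis and brace is closed in the correct order."""
    close_to_open = {")": "(", "}": "{"}
    stack = []
    for ch in expr:
        if ch in "({":
            stack.append(ch)
        elif ch in close_to_open:
            if not stack or stack.pop() != close_to_open[ch]:
                return False
    return not stack

# lcs_length("ABCBDAB", "BDCABA") == 4 and is_balanced("({}())") is True
```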

23. BIG-bench logical deduction BIG-bench logical deduction consists of 1,500 four-choice multiple choice questions in which the model is given a set of logical constraints describing the relative ordering of some number of objects. The model must then choose, from a list of four statements, the only one that is logically consistent with the constraints.

  • Year released: 2022
  • Number of few shot examples: 10
  • Random baseline accuracy: 25%

24. BIG-bench operators BIG-bench operators consists of 210 questions in which a number of mathematical operators are defined and the model is expected to calculate the result of an expression built from those defined operators. This tests the model’s ability to handle mathematical abstractions and apply them appropriately (a toy example follows the list below).

  • Year released: 2022
  • Number of few shot examples: 10
  • Random baseline accuracy: 0%
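A toy example of the format: a made-up operator is defined in text and the gold answer is obtained by applying the definition literally. The specific operator below is invented for illustration, not taken from the dataset.

```python
# Toy example in the spirit of the operators task. The operator "<>" and its
# definition are invented for illustration; real items define their own.

# Prompt: "Define a <> b to be a * b + a. What is 4 <> 3?"
def custom_op(a: int, b: int) -> int:
    return a * b + a

assert custom_op(4, 3) == 16  # 4*3 + 4, the expected gold answer
```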

25. BIG-bench repeat copy logic BIG-bench repeat copy logic consists of 32 tasks in which the model is instructed to repeat some combination of words some number of times in a particular order, and the model is expected to output the correct result.

  • Year released: 2022
  • Number of few shot examples: 10
  • Random baseline accuracy: 0%

26. Simple arithmetic with spaces Simple arithmetic with spaces was developed by MosaicML. It consists of 1,000 arithmetic problems with up to 3 operations over numbers of up to 3 digits, with spaces between all numbers and operators. The model is expected to calculate the correct result of the expression using the appropriate order of operations (a generation sketch follows the list below).

  • Year released: 2023
  • Number of few shot examples: 10
  • Random baseline accuracy: 0%
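The sketch below generates problems in the same spirit: up to 3 operations over numbers of up to 3 digits, with the gold answer computed under standard operator precedence. The operator set and exact formatting of the MosaicML dataset are assumptions here.

```python
import random

# Sketch of generating simple-arithmetic problems like those described above.
# The operator set and formatting are assumptions, not the exact MosaicML data.

def make_problem(rng: random.Random, n_ops: int = 3, spaced: bool = True):
    tokens = [str(rng.randint(0, 999))]
    for _ in range(n_ops):
        tokens += [rng.choice(["+", "-", "*"]), str(rng.randint(0, 999))]
    expr = (" " if spaced else "").join(tokens)
    # eval is safe here because the expression contains only digits and + - *
    answer = eval(expr)
    return expr, answer

# e.g. a generated item might look like "12 + 345 * 6 - 78", whose gold answer
# under standard precedence is 2004.
```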

27. Simple arithmetic without spaces Simple arithmetic without spaces was also developed by MosaicML. It consists of 1,000 arithmetic problems with up to 3 operations over numbers of up to 3 digits, with no spacing between any of the numbers and operators. The model is expected to calculate the correct result of the expression using the appropriate order of operations.

  • Year released: 2023
  • Number of few shot examples: 10
  • Random baseline accuracy: 0%

28. Math QA Math QA consists of 2,983 four-choice multiple choice math word problems. The questions require basic reasoning, language comprehension, and arithmetic/algebraic skills.

  • Year released: 2021
  • Number of few shot examples: 10
  • Random baseline accuracy: 25%

29. LogiQA LogiQA consists of 651 four-choice multiple choice logical word problems. The questions involve making logical deductions based on mathematical and symbolic descriptions of problems.

  • Year released: 2020
  • Number of few shot examples: 10
  • Random baseline accuracy: 25%

Reading Comprehension

Reading comprehension benchmarks test a model’s ability to answer questions based on the information in a passage of text. This area includes a number of multiple-choice benchmarks as well as some benchmarks that require the model to output the exact correct answer.

30. BIG-bench: Understanding fables Understanding fables consists of 189 short stories, each followed by a four-choice multiple choice question in which the model must select the correct moral for the story.

  • Year released: 2022
  • Number of few shot examples: 10
  • Random baseline accuracy: 25%

31. PubMed QA Labeled PubMed QA Labeled consists of 1,000 hand-labeled medical documents, each followed by a related question to which the model must respond yes/no/maybe.

  • Year released: 2019
  • Number of few shot examples: 10
  • Random baseline accuracy: ~0%

32. SQuAD SQuAD consists of 10,570 short documents, each followed by a related question. The documents range from short news clippings about sports events, to blurbs explaining concepts in physics, to documents about US history. We expect the model to output the exact correct answer (an exact-match scoring sketch follows the list below).

  • Year released: 2016
  • Number of few shot examples: 10
  • Random baseline accuracy: ~0%
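Exact-match scoring for SQuAD-style answers is usually done after light normalization. The sketch below follows the commonly used recipe of lowercasing, stripping punctuation and the articles a/an/the, and collapsing whitespace before comparing.

```python
import re
import string

# Exact-match scoring in the style of the standard SQuAD evaluation script:
# prediction and gold answers are normalized before comparison.

def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold_answers: list) -> bool:
    return any(normalize(prediction) == normalize(gold) for gold in gold_answers)

# exact_match("The United States", ["United States"]) -> True
```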

33. BoolQ BoolQ consists of 3,270 short passages on a diverse range of subjects, each followed by a yes/no question. The model is expected to answer in a multiple-choice format.

  • Year released: 2019
  • Number of few shot examples: 10
  • Random baseline accuracy: ~50%

Programming

Programming tasks evaluate the model's ability to understand code, write functionally correct code given a specification, simulate code, and document code. Right now this category includes only HumanEval; later versions will include more.

34. HumanEval code generation HumanEval consists of 164 Python programming challenges in which the model is presented with the method signature and docstring for a Python function and is expected to complete the function. We then test the resultant code’s functional correctness on a number of test input/output pairs (a sketch of this check follows the list below).

  • Year released: 2022
  • Number of few shot examples: 0
  • Random baseline accuracy: 0%
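Below is a minimal sketch of the functional-correctness check for a single HumanEval-style problem. The field names (prompt, test, entry_point) follow the released HumanEval JSONL format; the real evaluation runs each program in a sandboxed subprocess with a timeout, which this sketch omits, so it should not be used on untrusted model output as-is.

```python
# Minimal sketch of pass/fail checking for a single HumanEval-style completion.
# Field names follow the released HumanEval JSONL; sandboxing and timeouts,
# which the real harness uses, are deliberately omitted here.

def passes_tests(problem: dict, completion: str) -> bool:
    program = (
        problem["prompt"]             # function signature + docstring
        + completion                  # model-generated function body
        + "\n" + problem["test"]      # defines check(candidate) with assert-based tests
        + f"\ncheck({problem['entry_point']})"
    )
    try:
        exec(program, {})             # any failed assert or error means the completion fails
        return True
    except Exception:
        return False
```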
