MIT: GPT-4 Did Not Score 90th Percentile on Bar Exam
The paper critically examines OpenAI’s claim that GPT-4 performs at the 90th percentile on the Uniform Bar Exam (UBE) and reports four main findings:
- **Skewed Percentile Estimates**: GPT-4’s UBE score lands near the 90th percentile only when compared against February Illinois Bar Exam takers, a pool skewed toward lower scores because most February test-takers are repeaters who previously failed the July exam, which inflates GPT-4’s relative standing (the sketch after this list illustrates the cohort effect).
- **July Data Comparison**: Using July data suggests GPT-4 would be at the 68th percentile, with below-average essay performance.
- **First-Time Test Takers**: GPT-4 would rank at the 62nd percentile against first-time test takers, with a 42nd percentile rank on essays.
- **Passers-Only Comparison**: Among those who passed, GPT-4’s overall rank drops to the 48th percentile, and it falls to the 15th percentile on essays.
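To make the cohort effect concrete, here is a minimal sketch of how a single scaled score maps to very different percentile ranks depending on the comparison distribution. The two score distributions below are made-up normal approximations, not NCBE data; only the percentile-rank calculation itself is the point.

```python
import numpy as np

def percentile_rank(score: float, cohort_scores: np.ndarray) -> float:
    """Percentage of cohort scores at or below the given score."""
    return 100.0 * np.mean(cohort_scores <= score)

rng = np.random.default_rng(0)
gpt4_score = 298  # reported UBE score

# Stand-in distributions (NOT real exam data): the February pool, heavy with
# repeat takers, scores lower on average than the July pool of mostly
# first-time takers, so the same score earns a higher percentile against it.
february_pool = rng.normal(loc=260, scale=25, size=10_000)
july_pool = rng.normal(loc=280, scale=22, size=10_000)

print(f"vs. February pool: {percentile_rank(gpt4_score, february_pool):.0f}th percentile")
print(f"vs. July pool:     {percentile_rank(gpt4_score, july_pool):.0f}th percentile")
```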
The paper also questions the validity of GPT-4’s reported UBE score of 298: it replicates the multiple-choice (MBE) portion of the score but identifies methodological problems in how the essay components were graded. In the replication, adjusting the temperature setting had no significant effect on performance, while prompt engineering had some effect.
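As a rough illustration of that kind of temperature check, a sketch like the one below samples the same multiple-choice item at several temperature settings and tallies how stable the model’s answer is. The question text, sample counts, and setup are placeholders, not the paper’s actual evaluation harness; the API calls assume the official OpenAI Python SDK (v1+).

```python
from collections import Counter
from openai import OpenAI  # official OpenAI Python SDK, v1+

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical MBE-style item; the actual replication used real exam questions.
QUESTION = (
    "A sample multiple-choice bar-exam question would go here.\n"
    "Answer with a single letter: A, B, C, or D."
)

def answer_tally(temperature: float, n_samples: int = 5) -> Counter:
    """Sample the model several times at one temperature and tally its answers."""
    answers = Counter()
    for _ in range(n_samples):
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": QUESTION}],
            temperature=temperature,
            max_tokens=1,
        )
        answers[resp.choices[0].message.content.strip()] += 1
    return answers

# Compare answer stability across temperature settings.
for temp in (0.0, 0.5, 1.0):
    print(temp, answer_tally(temp))
```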
The paper also discusses theoretical and practical challenges in measuring AI capabilities and doubts the UBE’s usefulness as a proxy for legal competence, citing the exam’s generalist content, the dissimilarity of its tasks to the work lawyers actually do, and the weak incentive for human test-takers to score beyond the passing threshold.
Despite these concerns, if one accepts the UBE as a valid proxy, GPT-4 appears less competent than assumed, particularly in tasks akin to practicing law. The findings suggest that GPT-4’s performance might lead to over-reliance by lawyers, risking misapplication of the law and professional malpractice.
The paper calls for greater transparency in AI capability reporting to ensure safe AI development and proper public understanding of AI capabilities. It also recommends that legal education incorporate more instruction on technology and AI to better prepare future lawyers.