Document Type


Publication Date



In Copyright


We live in an age of information. But whether information counts as data depends on the questions we put to it. The same bit of information can constitute important data for some questions, but be irrelevant to others. And even when relevant, the same bit of data can speak to one aspect of our question while having little to say about another. Knowing what counts as data, and what it is data of, makes or breaks a data-driven approach. Yet that need for clarity sometimes gets ignored or assumed away. In this essay, I examine what counts as data in legal corpus linguistics, a method of interpretation that uses large datasets of actual language use to give empirical heft to claims about how “ordinary people” would use or understand legal terminology—claims that pervade legal interpretation. Unlike corpus linguistics in the field of linguistics, however, legal corpus linguistic analysis tends not to articulate or examine just what its datasets can reveal. Practitioners are thus liable to make large claims on the basis of materials that don’t support them—materials that provide information, but do not constitute data that answers the questions legal corpus linguistics poses. This essay undertakes a more careful parsing of what the corpora preferred by legal corpus linguistics can, and cannot, reveal. Although I conclude that legal corpus linguistics currently faces a mismatch between information and aspiration, I also suggest areas of legal work where it can be of real use.

Publication Title

Brooklyn Law Review

First Page


Last Page


