Shelley and the Bad Corpus

“We got off easy with the Data-Sitters Club. We never had to grapple with the corpus question. We started by gathering all the books in all the Baby-Sitters Club series. That was our corpus. It was just all of them.”

“Most computational text analysis projects don’t have our good luck. It’s usually unrealistic to simply “get everything”. What even is “everything”? Do you have a comprehensive list of “everything” ever published that matches your parameters? If you think you do, how much do you trust that list? How was it compiled? What kinds of things might it be missing?”

Commentator’s Note: Yes! I’m facing this challenge with my corpora.

“I worked on what came to be called the Young Readers Database of Literature, or YRDL.”

“My collaborator on this project, Nichole Nomura (now a graduate from the Stanford Literary Lab), is extremely thoughtful about corpora, and near-synonyms for “corpora”. That’s how we ended up with the “D” in YRDL: “database” can mean just about anything queryable.”

“It’s a way to insist on structured questions as a necessary step in corpus-creation for specific research projects out of YRDL.”

“I have these kinds of conversations a lot with my students: What’s a good corpus? What’s a bad corpus? How do you know if the corpus you’ve collected is adequate – let alone good – for making the claims you want to make?”

“Heather Froehlich pointed us towards Prof. Shelley Staples from the University of Arizona as a possible person who could help us think through these kinds of questions.”

“Shelley thought for a moment. “The thing about corpora,” she said, “is that there isn’t a ‘bad corpus’ in the way that jokes can be bad. Usually, ‘bad corpus’ situations are ones where corpora are being used badly. Corpus construction is like making an argument[1], and the choices you make about how representative your texts are impact the kinds of conclusions you can make. That’s true for all corpora.”

“There are some good practices for how to get the best sample you can, and how to think about that in a principled way[2]. But first and foremost, you need to ask yourself, ‘What are my goals with this project? What claims do I want to make?’””

“We talk, repeatedly, about how DH is people, so this is a good place to acknowledge that people are complicated, and feeling blocked and uninspired about a project is not a personal failing but rather a natural outcome of being subject to the vicissitudes of life.”

“Especially at a time when corporate (and university) bullshit about so-called wellness is entirely disconnected from the realities of living under late capitalism, one thing we’re grateful for with the DSC is that we’re able to run the project the way we want, including prioritizing patience and compassion.”

“The way to get at the big things we care about is often through building up an argument using lots of smaller questions and theories.”

“Computational text analysis gives us a set of tools that can help us more easily pay attention to things we’re curious about. It’s like activating a sixth sense as you go about investigating your research question.”

“But it’s not the only sense you should be using, and you shouldn’t be using it at all without the aid of your own reasoning powers, guiding you towards making sensible choices that you can clearly articulate with regard to questions like “What are we comparing?””

“There are no bad corpora, just bad matches between corpora and the questions you’re asking them. And in the end, there’s no magic algorithm that’s better than reading books with your eyeballs.”

“Be thoughtful about your corpora, and your results. Don’t throw a bunch of algorithms at something, then claim sweeping conclusions – regardless of what the numbers say.”

“Whether you want to excel as a writer or a (computational) literary scholar, there’s no shortcut for actually reading and writing. Yourself. Not the computer.”

👤 Quinn Dombrowski and Shelley Staples 🌐 The Data-Sitters Club
