What can corpus linguistics tell us about primary literacy research in England?

April 2023

This post is a transcript of the first episode of our new podcast Research Mobilities. It introduces what corpus linguistics actually is and how we’re using the method in our project. Dr Anna Lindroos Cermakova explains what analysing newspapers and Twitter conversations can tell us about how people understand and talk about literacy in primary schools in England.

You can listen to the episode here or wherever you get your podcasts by looking for Research Mobilities. 

Parinita: Hello everyone. This is the first episode in what we hope will be a series that’s going to explore our project a little bit and different aspects of our project.

Parinita: Our project is called Research Mobilities in Primary Literacy Education. And today, our first episode is going to be focused on corpus linguistics. My name is Parinita Shetty and I don’t come from a corpus linguistics background at all. I come from an education background, children’s literature, podcasts and fan podcasts and public pedagogy and intersectionality. So really interdisciplinary. So I’m probably going to be asking Anna today a bunch of really silly questions. Anna, did you want to introduce yourself first?

Anna: Yes, hello, my name is Anna Lindroos Cermakova. And I actually don’t have a background in education either. So that made me laugh when you introduced yourself. I’m a linguist, corpus linguist. But I also focus on children’s literature. In my previous project, I was looking at gender representation in children’s literature and change over time. So I think we’ve got some things in common.

Parinita: Yeah, and I loved the article that you’d written about your previous project as well. I thought it made it really accessible – the idea of corpus linguistics. So I’m going to share that in the transcript of this episode, in the blog post. But before we get to that, just a little bit about our project: we’re essentially interested in how research about primary literacy in England specifically travels to teachers; so travels to primary teachers and travels among primary teachers. So we’re interested in how primary school teachers are accessing this research, where they’re encountering this research, how they share it – either online or among colleagues and conferences, all these different spaces and settings – and how the research shifts as well, like how the meaning shifts sometimes, as well as who are the people who are really influential in these conversations about primary literacy research. And one of the ways that we’re doing it is by doing a linguistic analysis, I don’t even know if that’s the correct term Anna [laughs]. I’m just confidently saying that. So that takes me to my first question, could you tell us, like essentially in really simple terms, what corpus linguistics actually is?

Anna: Yes, I will. But I first want to sort of a little bit expand on the definition of the project, or how you describe the project.

Parinita: Of course!

Anna: Because we’re not looking at only how the research moves among teachers, and from the researchers to teachers and the other way round; but we’re also looking at how the research is being talked about. And that’s where the linguistics comes in.

So corpus linguistics is, to define it simply, empirical linguistics that uses textual data to identify repeated patterns in language. And it is based on the premise that something that is being repeated frequently has some sort of significance. What kind of significance? That needs to be looked into. But the frequency is like … think of it as a spotlight. If a word or phrase has a higher frequency, it’s like a spotlight you’re putting on that. And then we’re looking in a more qualitative way. What is happening? Why is it frequent?

So one of the core premises is that corpus linguistics works with computationally readable text. So it’s being done by a machine. The machine counts the frequencies for you, because usually corpus linguistics deals with fairly big sets of data – millions, hundreds of millions, billions of words. But it can also use smaller sets, as is the case in our project.

Parinita: Yeah, so that’s really interesting. So the machine sort of gathers together all this data and, as a researcher, you’re almost drawing attention to which specific bits are important or how to make sense of all this information, if I’m understanding it correctly?

Anna: Yes, exactly. So say, for example, if you’ve got a phrase that is frequent, it can signal a theme that is prominent in that discourse. So, for example, if we consider primary literacy, something that crops up a lot in those discourses is, say, the word “phonics” or “reading”. So these are important topics in primary literacy, and that would also be reflected in the frequency of occurrence in the text.
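The idea of frequency as a spotlight can be illustrated with a minimal Python sketch. This is not the project's actual pipeline: the three sentences are invented toy data, and the function name `word_frequencies` and the simple tokenisation rule are assumptions made only for illustration.

```python
import re
from collections import Counter

def word_frequencies(texts):
    """Count how often each lowercased word occurs across a list of texts."""
    counts = Counter()
    for text in texts:
        # Very simple tokenisation: runs of letters and apostrophes
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts

# Invented toy sentences, not taken from the project's corpus
toy_corpus = [
    "Phonics is central to early reading.",
    "Reading for pleasure matters as much as phonics.",
    "Schools report on reading progress.",
]

freqs = word_frequencies(toy_corpus)

# The most frequent words are where the "spotlight" falls
for word, count in freqs.most_common(3):
    print(word, count)
```

The counting only points to where to look; deciding why a word such as "reading" is frequent still needs the qualitative reading of the texts that Anna describes.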

Parinita: Excellent. I mean, I know this a little bit, but we’re almost at two different ends of how we’re approaching the research questions in the project. So how are you using corpus linguistics in the project? Like what data sets are you looking at?

Anna: So I’m working on a work package with Professor Julia Gillen and we’re collecting different datasets to discover basically those themes I was suggesting. So one of the core data sets is that we’re looking at media discourse. So we collected a corpus of newspaper articles that are focused on primary literacy. And we’re looking into what is being talked about in those newspaper articles. The idea behind this is that this media corpus serves as a proxy of a public discourse about what seems to be important to large audiences of people.

Parinita: And since this project is based in England, it’s English newspapers.

Anna: It’s English newspapers, yes. So we collected both national papers, but also we collected a small collection of regional papers. So we have covered a variety of papers. So it’s not focused on one particular newspaper, say, The Guardian or The Daily Mail, but we’ve got all of them in the corpus. And obviously, all these newspapers would be talking about literacy in slightly different ways, and that is something that we’re looking into as well.

Parinita: And that’s what we’re looking at in the project, sort of public conversation, like how these ideas, this research is being talked about. And another space that is really important, I think, in research, but also in this sort of having these public conversations with people you may not have otherwise encountered is Twitter. And that’s one of the other things that you’re looking at, right?

Anna: Yes, social media definitely is really important. Obviously, on social media, slightly different conversations happen. And Twitter does seem to be important among people focusing on education. So we aimed for Twitter and we collected a corpus of tweets that again focused on primary literacy. So the collection would have been based on keywords. And we’re looking into again, various things – what is being talked about? Who is talking? If somebody tweets something about an interesting piece of research, what happens after that tweet? Who retweets?

We would be using corpus linguistics methods here, but not only. We would also use other methods to sort of, for example, map the networks between the people who are participating in it.

Parinita: Yeah, and that could be a podcast episode for another day, because there’s so many different aspects.

Anna: Exactly.

Parinita: I would be really interested to know if you faced any challenges using this either with the newspapers or with the tweets specifically for this project.

Anna: Um … you got me there. I don’t think we did. I think with Twitter, we faced a bit of a challenge now that Elon Musk has taken over and it has closed up for researchers. But we managed to collect our data before that. But, for example, if we wanted to see what is happening in a year’s time, we can no longer collect another set of tweets, which obviously is not a good thing. And the question is what happens with Twitter as a space in the future? A conversational space. So that was more like outside the scope of what we can influence.

But I think the main challenge we face and that would be, I’m sure, applicable to most of the project is just the time constraints. You just don’t have enough time to do everything you want to. So you know, if we had more time, we would have collected bigger data sets. But that just wasn’t possible. So the datasets we’re using are fairly small. So maybe that was a challenge.

Parinita: What was the most interesting thing that you found while doing this?

Anna: Yes, there were a lot of interesting things. But as I said, my background is not in education. Even though I have been in literacy research for a while, there’s still a lot of things that are new to me. And I guess one of the surprising findings was that a lot of the discourse is focused on reading. Reading seems to be the main key concern in literacy education. And that actually is both in the media corpus and in Twitter. So I thought that was really interesting that writing or speaking, based on the frequency, seem not to be so important as reading.

Parinita: But it also seems to be a very specific kind of reading, right? It’s books and not like multimodal texts or not like online media.

Anna: Yes.

Parinita: Yeah.

Anna: That’s true. And also, the discussions around reading seem to be going in two directions. One is about reading for pleasure. So how much kids read and how important that is. And the other is about when you learn to read, so about the methods you are using to teach children to read. And then in there, we find a big emphasis on phonics as the core instruction of reading. And what’s interesting is that in the media corpus, that seems to be taken for granted – that this is the best approach to teach children to read. While we know from academic research that this is not an unproblematic statement. [laughs] Let’s put it that way. So …

Parinita: Yeah.

Anna: So yes. So I think this was surprising for me, in a way. I probably expected a greater variety of things being discussed, but there seems to be this majority or mainstream discourse focused on reading.

Parinita: And I suppose that’s one of the contributions that this project would be making as well. Like drawing attention to which topics or what kind of research is over-represented, both in what the teachers say – because that’s the side of the project that I’m helping with, which is talking to teachers and trying to understand – but also then what the media and what maybe teachers or other “influencers” on Twitter are saying and what they’re talking about and which topics are really underrepresented or invisible.

Anna: Yes, exactly. So I’m really looking forward to when we get on with the analysis – how these things will align. Are teachers talking about the same things as the media? Does Twitter have similar content? So it will be really interesting to contrast those analysis results and see what topics are dominant in these discourses.

Parinita: Absolutely.

Parinita: Okay, so you’ve told us some of the challenges that you encountered in terms of, you know, the size of the data set, the time period, and obviously, the platform challenges when it comes to Twitter. But I’d be really interested to know what this approach of looking at newspaper articles and looking at tweets and the conversations happening there can tell us about literacy research and how people share it and talk about it and encounter it.

Anna: So one of the challenges, which actually you mentioned slightly earlier, is that one thing that corpus linguistics is not very good at is identifying absences; things that are not in the discourse. So that is important. We need to know what is not being talked about. There is research that we know exists, and we might think that it should be getting to teachers, and it’s not. And it’s not even talked about, so it’s very difficult then to uncover what’s happening with that. So corpus linguistics is not best suited for that. But I think we’ve got other methods within the project that we’re using, that are better suited to uncover absences. So that is one of the challenges.

Parinita: Could you tell us a little bit about what corpus linguistics would be good at in terms of like the possibilities that it holds in telling us things about research and how people are talking about it?

Anna: So what corpus linguistics is good at is analysing discourse. So it’s very good at identifying the mainstream discourse, the thing that is being talked about a lot and how it’s being talked about. And then it’s also good at identifying alternative discourses; so discourses that somehow challenge the main discourse. But they need to be there, so it can’t be an absence. So it’s very good at identifying sort of different nuances in the discourse, you know, how we talk about things. So we are able to sort of show not only that we talk about reading, but also how we talk about reading, how we approach reading.

Anna: I think another thing that corpus linguistics is very good at, and I think I haven’t mentioned that, or stressed that, is comparing things. So, for example, comparison in time. So, to simplify this, looking at newspapers from 10 years ago and now. So corpus linguistics would be very, very good at picking up the differences in how these discourses were happening.

And similarly, also, for example, if we divide the media corpus into different newspapers, we might want to see how different newspapers compare to each other. One thing we’re doing now with Julia is that we actually divided our media corpus, because we collected the media corpus over five years. And obviously one important event hugely influenced life in schools and that was COVID, the pandemic, the closures, school closures. So we divided the discourse into “before pandemic” and “after pandemic”. And we are exploring whether there are any differences in the research themes that are being talked about.
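A comparison of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the analysis Anna and Julia are running: the two tiny sub-corpora are invented, the helper names `tokenize` and `relative_frequency` are hypothetical, and frequencies are normalised per 1,000 words so that sub-corpora of different sizes can be compared.

```python
import re
from collections import Counter

def tokenize(text):
    """Split a text into lowercased word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())

def relative_frequency(texts, word, per=1000):
    """Occurrences of `word` per `per` tokens across all texts."""
    tokens = [tok for text in texts for tok in tokenize(text)]
    if not tokens:
        return 0.0
    return Counter(tokens)[word] / len(tokens) * per

# Invented toy sub-corpora standing in for "before pandemic" and "after pandemic"
before = [
    "Phonics screening results improve.",
    "Reading scores rise in primary schools.",
]
after = [
    "Remote reading lessons during school closures.",
    "Reading catch-up after closures.",
]

for word in ("reading", "closures", "phonics"):
    print(
        word,
        round(relative_frequency(before, word), 1),
        round(relative_frequency(after, word), 1),
    )
```

Normalising by corpus size matters because a raw count of 20 means something different in a corpus of 10,000 words than in one of a million. With real data, a researcher would typically also test whether a difference is statistically meaningful before interpreting it.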

Parinita: Oh, that’s really interesting. You know sort of tracing the impact of like the pandemic and how people are coming to grips with it, but also other things that are going on in the world, you know, in like the world of education and literacy, but also the larger world, and how that impacts teachers and research … and yeah, all these things. So I’m assuming corpus linguistics, if you even sort of looked at the tweets at different times, different people would come into play and different conversations … maybe or maybe not. And that’s what you’d be able to tell.

Anna: Yes, no, that definitely is right. The Twitter corpus we collected is not for such a long period as the newspaper one, reflecting the fact that Twitter is such a dynamic space. But we shouldn’t be forgetting that Twitter is not giving you everything that is out there. It has an algorithm that decides what it gives you. So the Twitter data is another type of proxy. So we get something we receive from Twitter and we need to work with that. And the question is, what else is out there? We definitely – you can’t collect everything. That is not possible. So Twitter is an interesting data set in that way, because you always need to think about it – that there is something … the more than human influence behind it.

Parinita: Yeah, that’s really interesting. And also, like, while you were saying that, and the “what else?” question as well, I think that’s such an important thing to remember throughout different aspects of the project. Like even when you know we’re talking to teachers, we’re talking to specific teachers who heard about us either through Twitter, or through emails, or whatever; who had the time and the capacity. Others, you know, even if they wanted to, couldn’t participate. So all those sorts of things. And like, even with the media, it depends on what the editors and the journalists want to talk about and want to highlight.

So I suppose what I’m getting at is that we can’t ever know the full picture. We just have like, sort of, like you said, the spotlight on different aspects of the conversations and trying to make sense of that. And that can change as well. Like, it’s such a shifting landscape that you know, in another year, in another five years, in another ten years, it might be completely different, and that’s okay. That’s what we’re trying to track as well.

Anna: Yeah, precisely. That’s the thing we just need to keep in mind during the analysis: that the results of the analysis are always influenced by the data you collected and how well you managed to sample your data, how representative it is of what you’re trying to do. So when you talked about the teachers, I’m sure it was quite difficult to get those teachers. Maybe you don’t have them spread around the whole of England, they might be, you know, concentrated in one area of England. And if you went somewhere else, they might be saying something different.

Also, they would be teachers who, I would assume, have a very positive attitude towards research because they are willing to participate, and it’s taking up lots of their time. And you wouldn’t be able to access teachers who might be less enthusiastic about research for whatever reason.

Parinita: Yeah.

Anna: And the same thing with newspapers, exactly. The editors, but also the authors of the articles, they do have their agenda. And that does influence the discourse a lot.

Parinita: Yeah, which is why I think that what you said right in the beginning, it’s sort of really useful as well in terms of the range of newspapers and media that you’ve looked at. So it’s not specific national newspapers that might have a specific leaning or might want to cover specific topics in a specific way; but regional things as well like what’s important at both the local and the national level. It’s just fascinating that it opens up such a different kind of way of looking at and understanding the world too – like the possibilities of corpus linguistics. So thank you so much for taking the time and sharing this.

Anna: Yeah, thank you for inviting me to this podcast. And I hope it illuminated the role of corpus linguistics in this project. And I’m looking forward to future podcasts on the other methods. Because I think one of the fascinating things about this project is that it uses so many methods and approaches, which I think partly compensates for the fact that, you know, the datasets might not always be as representative as we would want them to be. But because we combine so many methods and approaches, it gives a fuller picture. And I think that’s a really big plus in this project.

Parinita: Oh, yes, that’s actually such a good point. Because also, we’re coming from such different backgrounds. Like you said, you, you know, don’t come from a traditional education background. I don’t really either; even though my degrees have been in the School of Education, they’ve been much more interdisciplinary. And the researchers and research assistants on the project all have differing interests and priorities, and all find ways to highlight things that they’re most interested in. And I think that’s really good to sort of provide a more complete and complex picture. It can’t ever be completely complete. So yeah, that’s an excellent point. And thank you so much for being here and for explaining corpus linguistics in a way that even made sense to me [laughs] who had no idea of what it was about. So yeah thanks so much.

Anna: Thank you for having me.
