Jana Lasser: Advancing Reproducibility and Reuse

In this episode of the Politics of Openness Podcast, Katja Mayer interviews Jana Lasser, who is professor for data analysis at the Idea Lab at the University of Graz.

Prof. Dr. Jana Lasser has a PhD in physics and her research focuses on emergent phenomena in complex social systems, employing machine learning, data science, and computational modelling. Her work spans topics such as public health measures during the COVID-19 pandemic, the spread of misinformation on social media, and the societal impacts of recommendation algorithms. Jana is also a strong advocate for scientific integrity and open science practices, emphasising the importance of transparency and reproducibility in research. She shares her experiences and motivations for embracing open science, particularly open data practices, which stem from her early encounters with the challenges of accessing scientific publications.

Jana discusses her journey from physics to computational social science and the epistemic and pragmatic reasons behind her commitment to open science. She elaborates on the concept of reproduction packages, which include comprehensive documentation and code to ensure research is reproducible and accessible. She provides an example of a recent study published in Nature Human Behaviour, highlighting the importance of thorough documentation for future use and collaboration. Jana also addresses the ethical challenges in open data, especially in sensitive research areas like hate speech. She emphasises the need for responsible data management and the role of ethics review in ensuring privacy and ethical considerations. Additionally, Jana discusses the potential of large language models (LLMs) for simulating human behaviour and the importance of openness in both models and training data, not only to enhance scientific transparency and reproducibility, but also to comply with scientific integrity as well as ethical and legal frameworks.

Links: 

Personal Website Jana Lasser

Twitter

IDEA Lab Uni Graz

Network against Abuse of Power in Science

Works mentioned: 

  • Lasser, J., Herderich, A., Garland, J., Aroyehun, S. T., Garcia, D., & Galesic, M. (2023). Collective moderation of hate, toxicity, and extremity in online discussions. arXiv preprint arXiv:2303.00357. https://arxiv.org/abs/2303.00357
  • Lasser, J., Ahne, V., Heiler, G., Klimek, P., Metzler, H., Reisch, T., Sprenger, M., Thurner, S., & Sorger, J. (2020). Complexity, transparency and time pressure: practical insights into science communication in times of crisis. JCOM 19(05), N01. https://doi.org/10.22323/2.19050801

Furthermore: 

PhDComics Open Access explained https://www.youtube.com/watch?v=L5rVH1KGBCY 

Transcript of this episode:

Katja Mayer
Today, we’re joined by Dr. Jana Lasser, currently Professor for Computational Social Sciences and Humanities at RWTH Aachen and Associate Faculty at the Complexity Science Hub Vienna. Welcome, Jana! Starting from May 1st, you will be professor for Data Analysis at the newly founded interdisciplinary research center IDea-Lab at the University of Graz. Congratulations on the new job.

With a PhD in physics, your work dives deep into understanding emergent phenomena in complex social systems, for example, through machine learning, data science, and computational modeling. Your research ranges from investigating public health measures during the COVID pandemic and the dynamics of misinformation spread on social media to the societal impacts of recommendation algorithms. Beyond your academic pursuits, you are an advocate for scientific integrity and open science practices. Furthermore, you are a board member of the Network against Abuse of Power in Science, where you are actively working towards systemic change in the scientific landscape.

https://www.netzwerk-mawi.de/en

So welcome again, Jana. Thank you for joining me today. The introduction was very short, so maybe you would like to give us a bit more insight into your work and your relation to open science, particularly open data practices. And maybe you also want to share with us what open science means for you.

Motivations for Open Science

Jana Lasser

Oh, yeah. Thank you, Katja, for inviting me and for giving me the opportunity to tell you a bit about my research and also my relationship to open science. So, I first came in contact with open science when I was still an undergrad, and somebody—I don’t even remember who—told me about scientific publishing and how crazy it is that we have to pay to access our own articles. I didn’t know anything about scientific careers or how science worked. I just found it very crazy that we would write something as academics and then have to pay to get it back. It’s even more crazy that the taxpayers who pay us to do the work couldn’t even access the work in the end.

PhDComics: Open Access explained https://www.youtube.com/watch?v=L5rVH1KGBCY

That kind of incited my interest, and I joined various meetups. That was when I was still in Göttingen, where I also did my PhD in physics. Following from that, I explored what I call the open science landscape more and more, which led me to topics of reproducibility: making not only our results transparent and reproducible, but also the process that led us to them. As a quantitative empirical researcher, for me that involves data, so making data accessible, and it involves methods, so making code accessible. More recently, it also involves documentation, because even if we publish all the code and all the data, it might be in a form or a format that is just not understandable to anybody. So, that is what I’m thinking most about these days: how can we ensure that it’s not only accessible and reproducible in principle, but also in practice?

Science in public service has to be transparent and accessible

Katja Mayer
How did this motivation to enhance reproducibility in your own research come about? And how did this move towards more open science, and especially towards more reproducibility, influence your own approaches in computational social science, which is where you moved to from physics after some time?

Jana Lasser
Yes, exactly. So, I moved to computational social science. I’m now applying all the analytical and modeling skills that I learned as a physicist to try to understand societies. My primary motivation, I think, was kind of ideology-driven. I am still convinced that as scientists, we serve the public, and to serve the public, we have to be transparent and make our results accessible. I think this is also reflected in the issues we have, particularly in Austria, where trust in science is eroding. Given that background, it’s even more important that we lay open our approaches, how we come to conclusions, and are very transparent about how our findings come about. But that was quickly joined by purely pragmatic experience.

Organising reproducibility

So, I am a coder, and I was told very early in my studies that we have to document our code very thoroughly. Today, we might know what we were thinking with the code we are writing, but half a year from now, only God will know if we don’t document. The same applies to all aspects of research and doing science. I’ve seen time and time again that I am thankful to my past self if I document things thoroughly and create what I now call reproduction packages of my research. This involves putting all the primary data in one place, ensuring all the code works with that primary data, and making sure everything reproduces accurately.

This thorough documentation forces me to provide descriptions that enable me to work with the data in the future, such as when reviews come in or when I have to onboard a new master’s student who wants to do a spin-off project based on my research. Time and time again, it saves me a lot of time and headaches when I keep working and building on my own research, as well as on other people’s research.

Katja Mayer
Could you give us maybe a concrete example of how such a reproduction package has worked out? Maybe a case, something that kind of illustrates it even more?

Jana Lasser
Sure. I think a good example is a study that we published last September in Nature Human Behaviour. It was about society’s conceptualization of what it means to be honest. It was a very data-driven study. We collected millions of tweets from U.S. politicians from Twitter. Then we developed a new measurement approach to measure two distinct conceptualizations of honesty: one is called belief-speaking, the other truth-seeking. Whoever’s interested in the scientific details can read the study.

Lasser, J., Aroyehun, S.T., Carrella, F. et al. From alternative conceptions of honesty to alternative facts in communications by US politicians. Nat Hum Behav 7, 2140–2151 (2023). https://doi.org/10.1038/s41562-023-01691-w
Repository with the reproduction package: https://github.com/JanaLasser/new-ontology-of-truth

We also had some involved statistics to measure our effects. This reproduction package involved giving access to the primary data to the extent it is possible because we cannot publish raw tweets. However, we did publish tweet IDs and all the derivative statistics that we used for every tweet. We also provided the data collection code, the measurement instruments that we developed, all the analysis code, and all the code necessary to produce every figure in the paper, along with a very exhaustive README. If you go through that, you can really reproduce the complete article, every single figure.
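Editorial sketch (not part of the interview): the components listed above can be turned into a simple completeness check for a reproduction package. The folder and file names below are hypothetical and are not taken from the actual repository; the check only confirms that each part exists, not that the code reproduces the figures.

```python
from pathlib import Path
import tempfile

# Hypothetical package layout, assumed for illustration only.
REQUIRED_PARTS = {
    "README.md": "documentation describing how to reproduce every figure",
    "data": "tweet IDs and derived per-tweet statistics (no raw tweets)",
    "code/collection": "data collection code",
    "code/analysis": "analysis code",
    "code/figures": "code that regenerates each figure",
}


def missing_parts(package_root):
    """Return the required components that are absent from the package."""
    root = Path(package_root)
    return [name for name in REQUIRED_PARTS if not (root / name).exists()]


if __name__ == "__main__":
    # Build a deliberately incomplete package in a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "README.md").write_text("How to reproduce the figures\n")
        (root / "data").mkdir()
        (root / "code" / "analysis").mkdir(parents=True)
        for name in missing_parts(root):
            print(f"missing: {name} ({REQUIRED_PARTS[name]})")
```

Running such a check before publishing, or before handing the package to a new student, is one cheap way to notice a forgotten component early.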

This approach has already helped me a lot in the half year since we published the study. I’ve been using it in lectures to teach. I onboarded a new master’s student who took only a few days to get into the code because it was already there and very well documented. I will probably onboard several more master’s students on the project soon. For me, it’s also nice to work with it now that I’ve put in all the work to prepare it in that way.

Katja Mayer
So it’s really good for the sustainability of your own research practices that you can build up a corpus of knowledge that is reusable for yourself, but also for your peers then afterwards, right?

Jana Lasser
Yeah.

Careful openness: Research integrity and Ethics

Katja Mayer
Something that you also mentioned is that your work often includes very sensitive and critical topics. I can imagine the discussion on honesty or the investigation of honesty in social media behavior will reveal that there is also a lot of hate speech and other things. I think your experiences from the COVID pandemic have especially sharpened your approaches towards critical and sensitive issues in data collection and opening up data. Is there any particular experience you would like to share, especially at this intersection of research integrity, ethics, and openness?

Jana Lasser
Yeah, it’s a very good point. For every research project, it’s a new discussion and a new negotiation, balancing aspects of openness with the privacy of the data subjects and our obligation to inform society. Thankfully, for the honesty study I mentioned, it was fairly easy because we were investigating politicians. They are of public interest, so it’s kind of expected that what they say in public will also be discussed in public.

Preprint: Lasser, J., Herderich, A., Garland, J., Aroyehun, S. T., Garcia, D., & Galesic, M. (2023). Collective moderation of hate, toxicity, and extremity in online discussions. arXiv preprint arXiv:2303.00357. https://arxiv.org/abs/2303.00357

A completely different case is a study that we are currently doing about hate speech and people who try to intervene against it. Here, we are very careful with publishing our primary data because it could lead to the identification of the people involved in these efforts, which could have detrimental effects for them. Ethics is a big aspect here, and we might not be able to publish our primary data openly. However, we will still make it available for other researchers if they sign a data protection agreement.

During COVID, we had discussions not just about publishing data, but also about publishing results and in what way. We were concerned whether our results would freak people out too much, to put it bluntly. We decided internally to publish our findings as soon as they were quality-checked and we were sure enough about them. In the end, we cannot decide what society is ready to see and what not; this is up to society. People are perfectly capable of making sense of information and making their own decisions given all the information. So, we decided to publish everything as soon as we could.

Lasser, J., Ahne, V., Heiler, G., Klimek, P., Metzler, H., Reisch, T., Sprenger, M., Thurner, S., & Sorger, J. (2020). Complexity, transparency and time pressure: practical insights into science communication in times of crisis. JCOM 19(05), N01. https://doi.org/10.22323/2.19050801

It’s an ongoing discussion, and I think there is no rule that fits all application cases.

Planning responsible data (re-)use, burden or opportunity?

Katja Mayer
Thinking of the time you were just telling us about, did you experience people reusing your data whom you had not anticipated, or people using your data against specific political views or ideologies, in ways that you could not have foreseen and that maybe troubled you? Did that happen to you?

Jana Lasser
Not really, thankfully. So far, I have not been in a position where somebody used my research for something that I wouldn’t condone. But that could still happen. I guess the best thing we can do is to think about possible cases beforehand and make sure that we have it all covered. It is also becoming more usual to go through ethics review with the type of research we do. This wasn’t the case only two or three years ago, but these days, whenever we work with data that has something to do with humans, it’s standard to go through ethics review. Many people perceive this as a hassle and an administrative burden, but I actually like this chance to sit down and think about these aspects of my research. I think it’s a really important step, and I try to take it as seriously as I can in the limited amount of time that we usually have for these things.

Katja Mayer
Yeah, that’s for sure. The review boards also have very limited time to make decisions on those really big and very important questions. One thing I would like to stay with in this regard is the question of stewardship of your own data. Of course, what you were telling us is that you put a lot of effort into documenting the data to keep it reusable and so on. But what about data reuse that is somehow troubling you? Have you ever thought about being a steward of your own data in the sense that you still feel responsible after it’s out in the open, and that you might follow up with people who reuse your data in a way that you think does not fit the desired outcomes?

Jana Lasser
I mean, in the hypothetical case that happens, I can imagine doing that, definitely. I would need to know about it in the first place, but hopefully they cite my data publication because that’s how I get my data out there these days. Then I would know about it. There’s only so much I can do. I usually publish my data under a Creative Commons license, which is rather permissive. There might be instances where I just cannot really do anything about it, but I might at least talk to the people involved and see if we can find a way that conforms to what I had in mind for my data to be used.

Katja Mayer
Yeah, you know, I asked this question because there’s an ongoing debate on whether it’s sometimes better to close data or not share data because it could be reused in a way that is not intended or not good for society or whatever.

And so I wonder whether openness would actually help to trace those activities much better and then confront those activities with an alternative worldview or alternative interpretations, right? This would make the use of your data much more traceable than it is now, when people are sometimes using scientific insights to make completely contrary arguments. It’s not easy for the scientists themselves to react to those things because they just don’t know about it.

Jana Lasser
For sure, yes. And I mean, by, for example, using a share-alike license, we can kind of ensure that whatever derivative comes from our work will also have to be open so we can at least then keep an eye on it. A similar argument applies here as the one I put forward regarding the scientific findings. Who am I to decide what is a legitimate use of the knowledge? Data, for me, falls under this umbrella of knowledge that I produce. So I put it out there under a license that at least allows for public scrutiny, and then, sure, it’s probably also my responsibility to follow up and track it. But I don’t think that closing the data is the right solution. That is, of course, provided there are no privacy concerns for the data subjects; other than that, I would argue that openness is the only way.

Open Science Communication

Katja Mayer
Working with the media and with different non-scientific publics is something I know you have had several experiences with, especially during COVID. There was a big rush; you had to produce results and feed them to politicians and the media. It seemed like nobody was ever satisfied; everyone was saying, “Faster, faster, we need more, we can’t make decisions. What should we decide? Science should give us something to base the decision on.” What lessons have you learned from your work in the field of public health at that time, interacting with policymakers and the media? It’s another form of openness, right? This kind of science communication was necessary at that time.

Jana Lasser
Yes, definitely. What have I learned? I have learned not to jump into high-pressure, high time-pressure projects head-on without really thinking about whether I have the resources and capacity for it because it is very intense. The particular case that was most impressive for me was the school measures strategy during COVID. We ran a simulation study over Christmas that informed Austria’s school strategy launched in February. I had only half a day of vacation over Christmas. It was a very rewarding activity because I believe that we as scientists should care about the impact our research has. This was a case where our research had a very direct impact, which was very satisfying, and I believe it acted positively.

I also learned that science journalism is really on our side as scientists. Initially, I was careful in interacting with journalists because my first contact with media was related to research policy, not research itself. When I was a PhD student in Germany and a PhD representative, we dealt with several cases of power abuse in the Max Planck Society. As a representative, I received many questions from the media. Nothing bad happened, but it was a different kind of communication. I was representing both the organization and the PhD students, and journalists were interested in the scandal, so I had to be very careful about what I said.

During COVID, I felt that science journalists were very interested, supportive, and wanted to understand. They always gave me things to double-check, and making corrections was never a problem. This experience gave me a lot of confidence and security in working with journalists, and I am very willing and happy to do interviews and try to describe my research in an understandable way. It was a really good experience for me.

Katja Mayer
When you say you try to describe the research in an understandable way, are there any specific formats that you experimented with, like visualizations, where you would say this is another type of openness to present research results and the method in a visual way?

Jana Lasser
Well, “experimented with” might be a bit of a big word. I did a few things, and some of them worked. I have a Twitter profile where I communicate new research findings, mostly to an academic audience, but sometimes they reach a broader audience. I think about how I compose a Twitter thread and sometimes create different visuals than I would for a scientific publication. However, it’s really a matter of how much time I have because I still have to do this on the side.

For the COVID school research project, I worked with a postdoc who specialized in visualizations, Johannes Sorger, and we implemented a website where our simulation results could be explored interactively. I still think it’s a very cool website, but I also think it’s not very accessible in the end.

COVID19 Prevention-Measure Explorer for Schools https://vis.csh.ac.at/covid-schools

It has a lot of information in there, and maybe given another half a year, we could have made it a very, very cool tool for teachers, students, and parents to use. But in the end, we simply lacked the time to make that happen. So, yeah, I do try different things, and some of them have worked out.

Katja Mayer
But I think it’s a very good basis for the new job that you are about to start, right? To have all these experiences. And then maybe, as your institution has the word “lab” in its name, if I remember correctly, I hope this could be a great place to further engage with these strategies or try out new ways of communicating with the public, which is somehow another dimension of reproducibility or accessibility that we should keep in mind when talking about openness, I guess.

Open Large Language Models for Simulating Human Behavior

So now maybe looking more to the things you’re doing right now and an outlook to the future. I heard that you and your team are currently using or trying out LLMs, large language models, for simulating human behavior. What are your experiences there? What is meant when you say open large language models? Maybe you can explain that briefly.

Jana Lasser
Yeah, I’ll start with that. So, everybody probably now knows ChatGPT. That’s what we call a closed large language model. It’s proprietary, owned by OpenAI, and we only get access to a web interface where we can chat with it. The issue with that is we don’t really know what happens in the background. The behavior of OpenAI’s ChatGPT model changes. There are studies that investigated things like the length of the response or the accuracy of the response at different points in time. And it does change, and we don’t know why. That’s an issue if we use these things for research because we can never be sure that our results actually reproduce.

But there are large language models like those from Mistral, a French company that publishes some of its models, and there are also the Llama-type models that initially were leaked. Now, I think they’re actually being published voluntarily, and they can be downloaded. Anybody with the right hardware can run them locally. That means the model is completely under our control. We know when it changes because it won’t change unless we change it. That’s good for research because we can make sure that whatever happens, we know why it happens because we did it.
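Editorial sketch (not part of the interview): one way to make "it only changes if we change it" checkable for a locally stored model is to record a cryptographic fingerprint of the weights file alongside the results. This is an illustrative assumption about workflow, not a description of the team's setup; the file name is a stand-in, and real model files are simply much larger.

```python
import hashlib
import tempfile
from pathlib import Path


def file_fingerprint(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, recorded_fingerprint):
    """Check that the file still matches the fingerprint recorded earlier."""
    return file_fingerprint(path) == recorded_fingerprint


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        weights = Path(tmp) / "model-weights.bin"  # hypothetical file name
        weights.write_bytes(b"pretend these are model weights")
        recorded = file_fingerprint(weights)  # store this with the results
        print(is_unchanged(weights, recorded))
        weights.write_bytes(b"silently updated weights")
        print(is_unchanged(weights, recorded))
```

The demo prints True for the untouched file and False after the file is overwritten. A matching fingerprint shows only that the weights are identical; reproducing outputs also depends on settings such as sampling parameters and random seeds.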

We’re trying to use these models to simulate human behavior or to see whether it’s feasible to simulate human behavior. These models produce very coherent-sounding text, and there are studies showing that they can reproduce human responses to survey questions. There are various ideas around using these large language models to study humans in new ways. For example, in ways that are not accessible to classical research because they might be unethical to do with real humans or deal with demographics that are not accessible to survey questions, such as inmate populations, or to conduct surveys that are excessively long.

These are interesting use cases. Right now, we are focused on validating whether we can reproduce known behavior reliably and exploring the biases in these models. There are lots of biases in the models that they picked up in the training data. We don’t know the training data for open models. There are approaches to train new foundation models with curated datasets, but these are not there yet. So that’s where we are moving right now, and it’s all moving very fast, but it’s a very exciting thing to do.

Katja Mayer
So, here, the openness on the model dimension is already given, but you most of the time don’t have access to the data that was used to train the models, right? Which I guess could also be a problem, especially when you want to focus on groups that might have been very underrepresented.

Jana Lasser
Yeah. Or we might just get caricatures of these groups because other people wrote about their conceptualizations of these underrepresented groups, and that’s what we get in the training data.

Katja Mayer
There are not yet enough open data corpora that we can use as benchmarks for other models to see whether they really work well.

The Next Frontier: Getting Better Training Data

Jana Lasser
Yes, the scale is simply not there yet. To get to the performance that models like GPT-4 have, the corpora needed to train the model, counted in tokens or words, are so vast that with hand-curated datasets, where we know what is in there, we are light years away. These corpora are just so vast that nobody really knows what exactly went in there. They probably also contain lots of copyrighted material. So this is the other battle currently being fought: the whole discussion about copyright and training data for these models, and what can be used and what can’t be used.

For an initiative that approaches this in an ethical way and actually cares about what data goes in there, all these hurdles, such as thinking about copyright and curating the data, are just so time-consuming. But I think this needs to happen now. So I think the next frontier of model development is not getting more and more data but getting better data, where we really know what is in there. Because right now, these LLMs are a bit like magical spells. We can interact with them, we can experiment with them, but sometimes they do things and we simply don’t know why. The explanation is very likely in the training data. But since we can’t interrogate that, it’s a bit magic, and that’s not very scientific. So I really hope that the development goes in that direction.

Katja Mayer
And you will try to keep up with the dynamics of these fields. It’s incredible how fast everything changes. Every week there are new disruptions, revolutions, and whatever else is going on. So it’s actually quite hard to follow all that, I guess.

Jana Lasser
Yeah, for sure. I wouldn’t mind a break.

Outro

Katja Mayer
So thank you, Jana, very much. I think you gave us beautiful insights into your work. What remains is that I wish you a really good start for your new job. It will be exciting for sure. We will follow you and your activities, and I wish you all the best in the sense of openness that you can keep up your wonderful work. Show us maybe new ways of even enhancing the reproduction packaging of our research. Thank you very much, Jana.

Jana Lasser
Thank you, Katja.

