Day 3 Discussion
On day 3, we learned about the importance of trustworthy data and workflows as well as how to present case studies. In the Trust-a-thon session today, we want you to investigate the data and workflow of the Tropical problem and plot the predictions and observations for individual storms to identify potential biases and other issues in the model.
Please have one of your team members reply in the comments to each of these questions after discussing them with the team. If you have not commented on the posts from the previous days, please add your thoughts there as well.
Here is the TAI4ES Tropical notebook for reference.
Discussion prompts
- In the lecture series we presented a framework for thinking about the implications of the decisions we make throughout the AI development process. Consider the framework below and answer the following hypothetical questions about the data and workflow you’re using for the Trust-a-thon:
- Where in the data/data collection process do you think there was room for error and/or bias? What are the potential implications of this error and/or bias for the end user?
- Where in your workflow (as well as the workflow outlined in your assigned notebook) is there room for error and/or bias? What are the potential implications of this error and/or bias for the end user?
- How could you leverage social science and/or user engagement to mitigate these issues?
- We also talked a lot about using case studies as a way to communicate about AI with end users. Outline how you would present a case study of one of your models to your specific end user. Be as specific as possible.
Ans1: Bias can enter at the data collection stage: the method or source of the data may favor some sets of users over others, which can leave us with a flawed model, and the resulting errors may affect different users differently. To reduce bias as much as possible, we first need to know at which steps it can occur; only then can we work on addressing it.
Ans2: For example, in a classification problem, when splitting the data into training and test sets we should draw training data from all of the classes in comparable amounts so that the model is not biased toward any particular class (see the sketch below). These are the kinds of things to keep in mind while working on the problem.
We should also tune the model parameters to the values that best suit our data set. Again, these choices should not bias the model toward particular end users; otherwise we may have to build separate models for different groups of end users, and identifying those groups can be a difficult task.
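A minimal sketch of that kind of class-aware split, assuming scikit-learn and a made-up labeled dataset (the real Trust-a-thon data would replace the random arrays):

```python
# Hypothetical example: a stratified train/test split so that every class
# is represented in the same proportion in both sets (synthetic data here,
# not the actual Tropical dataset).
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 10))        # 1000 samples, 10 features
y = rng.integers(0, 4, size=1000)      # 4 hypothetical intensity classes

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,      # keep class proportions equal across train and test
    random_state=0,  # fixed seed so the split is reproducible
)

# Sanity check: class fractions should match between the two sets
print(np.bincount(y_train) / len(y_train))
print(np.bincount(y_test) / len(y_test))
```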
Ans3: Social science can help us understand what kinds of end users will be using our model. By surveying the different types of end users, we can build a model that avoids this kind of bias. Understanding what our end users actually want is essential, and that understanding can only come from an unbiased survey.
Q. 1: In the lecture series we presented a framework for thinking about the implications of the decisions we make throughout the AI development process. Consider the framework below and answer the following hypothetical questions about the data and workflow you’re using for the Trust-a-thon:
(1) Where in the data/data collection process do you think there was room for error and/or bias? What are the potential implications of this error and/or bias for the end user?
(2) Where in your workflow (as well as the workflow outlined in your assigned notebook) is there room for error and/or bias? What are the potential implications of this error and/or bias for the end user?
(3) How could you leverage social science and/or user engagement to mitigate these issues?
There is room for error and bias at many stages of the data collection process. One of the biggest biases comes from non-ideal data sources – the satellite view is top-down, so we cannot see the heights of clouds or their three-dimensional structure. There could also be biases in coverage, although we do not think this is likely in our case, since the satellite covers a large part of the Earth’s surface and has high temporal coverage.

The developer workflow may also introduce errors, such as bugs in the code, and biases arising from decisions made during model development. For example, how we split the data into train/test sets can affect the model and its performance. Randomness intrinsic to model training can also affect reproducibility; we can mitigate this by setting a random seed.

Interaction with users can help deal with these issues. Users may have important insight into the datasets and into quality assurance/control issues with the data. Additionally, a clearer sense of the users’ needs can inform modeling decisions and may reduce bias in the areas most critical for the user. This can be an iterative process: we can ask users what they want to see for the edge cases, and perhaps show them a few misclassification visualizations so that they can better articulate what they are looking for; this helps modelers go back and focus on those places.
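As a side note on the reproducibility point above, a minimal sketch of seed-setting might look like the following; the specific libraries (NumPy, TensorFlow) are assumptions and may differ from what the Tropical notebook actually uses:

```python
# Illustrative only: fix the random seeds that commonly affect ML training.
import random
import numpy as np
import tensorflow as tf  # assumption; swap for whichever framework is used

SEED = 1234
random.seed(SEED)         # Python's built-in RNG
np.random.seed(SEED)      # NumPy RNG (shuffling, splits, initial noise)
tf.random.set_seed(SEED)  # TensorFlow RNG (weight init, dropout)

# Even with fixed seeds, some GPU kernels are non-deterministic,
# so small run-to-run differences can remain.
```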
Q. 2: We think it would be most illustrative to do a case study of one of the misses. The model struggles to predict high winds accurately and tends to underpredict strong wind speeds, sometimes by as much as half of the true speed. This can have disastrous consequences and is thus of high interest to our forecast user. We would show a scatterplot of actual versus predicted values (as plotted in the Jupyter notebooks), and create a side-by-side colored bar plot of accurate, under-, and over-prediction percentages grouped by class.
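A rough sketch of what those two figures could look like; `y_true` and `y_pred` below are synthetic placeholders standing in for the notebook's observations and predictions:

```python
# Hypothetical case-study figures: (left) predicted vs. observed scatter,
# (right) per-class percentages of accurate / under- / over-predictions.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
y_true = rng.uniform(20, 140, size=500)             # synthetic "observed" winds (kt)
y_pred = y_true * rng.uniform(0.5, 1.1, size=500)   # model that often underpredicts

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 4))

# Left panel: observed vs. predicted with the 1:1 reference line
ax1.scatter(y_true, y_pred, s=8, alpha=0.5)
ax1.plot([20, 140], [20, 140], "k--", label="1:1 line")
ax1.set_xlabel("Observed wind speed (kt)")
ax1.set_ylabel("Predicted wind speed (kt)")
ax1.legend()

# Right panel: grouped bars of accurate / under / over percentages per class
bins = [20, 50, 80, 110, 140]                       # toy intensity classes
labels = ["20-50", "50-80", "80-110", "110-140"]
cls = np.digitize(y_true, bins[1:-1])
tol = 5.0                                           # "accurate" = within 5 kt
width = 0.25
categories = [
    ("accurate", lambda t, p: np.abs(p - t) <= tol),
    ("under",    lambda t, p: p < t - tol),
    ("over",     lambda t, p: p > t + tol),
]
for i, (name, in_category) in enumerate(categories):
    pct = [100 * in_category(y_true[cls == c], y_pred[cls == c]).mean()
           for c in range(len(labels))]
    ax2.bar(np.arange(len(labels)) + i * width, pct, width, label=name)
ax2.set_xticks(np.arange(len(labels)) + width)
ax2.set_xticklabels(labels)
ax2.set_xlabel("Observed intensity class (kt)")
ax2.set_ylabel("Percent of cases")
ax2.legend()
plt.tight_layout()
plt.show()
```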
Great discussion of biases in data sources – this is a big topic of conversation at AI2ES. A lot of the time the best you can do is be aware of the biases and how they might affect your model and your conclusions.
And yes, case studies of misses are definitely important–weather forecasters especially want to see the misses! In general, you probably want to provide a couple of case studies, including a case where the model did very well (a good hit), a miss, and maybe a false alarm, depending on your ultimate goal and what your end users care about.
Team #37
1.1. There is always room for error and/or bias, especially in data collection! This is a widely acknowledged fact (consider major national best-selling books like “Weapons of Math Destruction” by Cathy O’Neil, or the many journalism pieces on racial bias in AI applied to skin tone). These errors and biases have a generally undermining effect on many potential end users.
1.2. See answer #1 🙂 Seriously, though: most error is attributed to a poor choice of model, but training and other more nuanced decisions can also introduce both error and bias. In many ways the potential implications depend on who your end user is. Is it another scientist? The error may impact their research or credibility, or your professional standing. Is it a business person? It can feed decisions with cascading impacts – such as producing AI that doesn’t treat people fairly. If the output is destined for ordinary people, possibly through the media, who will make choices about leaving a storm region or buying a house, then the impacts can run very deep – but only if there is a sense of trust in the result, the process, and the larger scientific community.
1.3. As scientists we are trained and rewarded for finding novel results that lead to publications – for finding something new. So in formulating our questions and research, we are implicitly doing AI for ourselves, with the scientific community as our end user. But if the hope is that our research will allow people to respond to things like severe storms, we need to build that purpose into our questions and our design from the first concept. Have a narrative that accompanies even the most abstract research: if you are looking into a certain type of atmospheric dynamics, imagine a family living in a house in the area of interest. What do these events and scenarios mean for them? Oops, already jumping ahead 🙂
2. Know thy audience.
If you are presenting to weather forecasters, you are talking directly to extremely knowledgeable practitioners who can spot many possible errors and unlikely results, and who will generally be a very discerning audience. Spend very little time explaining the basics; get right to the assumptions, process, and results, ideally using graphs to communicate. Explain hurdles and re-evaluated assumptions to generate confidence, and solicit feedback on anything you might not know to improve future efforts. If you are talking to a business person, spend time explaining the topic and the basics of the process, and use simple charts and key metrics. If you are talking to a media outlet or an individual, combine two stories: the narrative of a use case that illustrates what the research means, and the journey you took to ensure that error and bias issues were addressed in an ethical way. This is especially important, as we saw with COVID when the situation changed and the science communication appeared to be contradicted. Many people in the United States are distrustful of experts and science professionals, and addressing ethics is a key way to educate and build trust in the scientific process.
1.1/1.2. – Great points – this is something we talk about at AI2ES a lot! You can never completely eliminate bias in data, but we can do our best to reduce it and, even more importantly, be aware of what our sources of bias are so that we can keep them at the forefront of our minds when we are interpreting any results (and when we are designing our modeling studies in the first place).
1.3. – I like that you pointed out constructing a narrative! My B.S. was in engineering, and one thing I have personally noticed in earth sciences vs. engineering is that earth sciences, at least in my experience, emphasize this narrative structure much more in scientific communication! For pretty much every presentation I have given or paper I have written, people have asked me “what is the STORY?”, whereas the engineers tend to be a bit more focused on technical details and spend a bit less time connecting the dots and presenting an overarching story (again, this is just my personal experience!).
2. Good points about trust varying with your audience (though I will say that weather forecasters, while they understand weather EXTREMELY well, are not always hugely enthusiastic about AI models 🙂 )
Team 34:
1a) Data error: Speeds are integers – how are they rounded? Might be hard to get measurements for more intense storms – sampling bias. Or there might be more error at high intensity. Could be spatial bias in satellite coverage. Might be a bad idea to merge basins – are storms the same everywhere? Implications: if the predictions are off, it’s hard to make good decisions.
1b) Workflow bias: Choice of train/validation/test splits can be biased. Choice of model – maybe a shallow CNN isn’t the best architecture. How well does the data match the end user’s needs? Choice of performance measures – MSE may not be the best loss function (see the sketch after this post). Same implications as before: these choices can lead to a bad model, maybe in sneaky ways, and could lead to bad judgments about the quality of the model; watch out for confirmation bias.
1c) Mitigating bias & error: Being aware of cognitive biases can be useful in a lot of places. Co-production of knowledge with end users is really essential to get good results. Also, many eyes make shallow bugs.
2) Case studies: we’d pick a case study that covered a real-world decision, and ask whether having this info would change the decision, or make it easier. Especially if it was a difficult decision. (It’s hard to get more specific than that, because we don’t have enough information about our end-users’ needs.)
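As a rough illustration of the loss-function point in 1b: plain MSE treats every sample equally, so a model can score well overall while badly missing the rare high-intensity storms. An intensity-weighted MSE is one hypothetical way to make those misses visible (all numbers below are made up):

```python
# Hypothetical comparison of plain vs. intensity-weighted MSE.
import numpy as np

def mse(y_true, y_pred, weights=None):
    """Mean squared error, optionally weighted per sample."""
    err = (y_pred - y_true) ** 2
    return err.mean() if weights is None else np.average(err, weights=weights)

rng = np.random.default_rng(0)
y_true = rng.uniform(20, 140, size=1000)
# Pretend the model is accurate for weak storms but badly underpredicts
# anything above 100 kt.
y_pred = np.where(y_true > 100, 0.6 * y_true,
                  y_true + rng.normal(0, 3, size=1000))

plain = mse(y_true, y_pred)
# Weight each sample by its observed intensity so misses on major storms
# contribute more to the score.
weighted = mse(y_true, y_pred, weights=y_true / y_true.mean())
print(f"plain MSE: {plain:.1f}, intensity-weighted MSE: {weighted:.1f}")
```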
1a) Good eye w/r/t merging basins! That’s definitely a topic that you will see different opinions on. On one hand, training your model once, on data from all basins, will give you much more training data; but your predictions for each individual basin may not be as good as if you’d trained 2 separate models. Scientifically, some people will argue that the fundamental physics of TCs are the same, regardless of what basin you are in, thus, you should be able to train a single model; other people will counter that with ‘yes, but the background conditions are different in different basins, thus, different factors tend to play different respective roles in predictability for different basins’. Ultimately, this choice will depend a bit on your specific problem and your specific goals, and also your philosophy.
1b) Very glad you brought up train/test splitting. Computer science tends to utilize a fairly naive implementation of train/validation/test split, but for earth science data (which often have spatial and/or temporal autocorrelation), this is often a bad idea! In my own work, I use a modified leave-one-year-out strategy, where I randomly select 2 years from my training sample to hold out for validation, and train on the rest, always keeping entire years together.
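A minimal sketch of that kind of year-based hold-out, assuming a pandas DataFrame with a datetime column named `time` (the function and column names here are illustrative, not from the notebook):

```python
# Hold out whole years for validation so temporally correlated samples
# never end up on both sides of the train/validation boundary.
import numpy as np
import pandas as pd

def split_by_year(df, n_val_years=2, test_years=None, seed=0):
    years = df["time"].dt.year
    candidates = np.array(sorted(set(years)))
    if test_years is not None:
        candidates = candidates[~np.isin(candidates, test_years)]
    rng = np.random.default_rng(seed)
    val_years = rng.choice(candidates, size=n_val_years, replace=False)
    val_mask = years.isin(val_years)
    test_mask = years.isin(test_years) if test_years is not None \
        else np.zeros(len(df), dtype=bool)
    train_mask = ~val_mask & ~test_mask
    return df[train_mask], df[val_mask], df[test_mask]

# Toy example: daily records for 2000-2009, with 2008-2009 reserved for testing
df = pd.DataFrame({"time": pd.date_range("2000-01-01", "2009-12-31", freq="D")})
df["x"] = np.arange(len(df))
train, val, test = split_by_year(df, n_val_years=2, test_years=[2008, 2009])
print(sorted(val["time"].dt.year.unique()))  # two randomly chosen held-out years
```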