Data Mosaic: Crafting AI Datasets for Diversity

One of the most challenging aspects of artificial intelligence is keeping bias out of algorithms. Bias starts with the input data, so developing inclusive datasets is a priority.

By Belinda Jones

Ensuring that AI training datasets are diverse and representative is crucial for building fair and unbiased AI systems. A massive amount of data is available for collection today, so one of the first steps in an unbiased AI project is determining the data sources. The source data needs to reflect the purpose and criteria of the project, and it needs to be free of biases that can distort outcomes. Gathering data from a wide range of relevant sources and applying a variety of data techniques helps ensure that datasets are diverse rather than homogeneous.

View from Above: Selection Process for AI Training Data

Praveen Kumar Tupil, a software engineer at Cisco, has a simple way to describe the deceptively complex process of producing a dataset most likely to yield unbiased AI outcomes. He writes, “Conduct thorough data exploration to understand biases and limitations. Implement techniques such as stratified sampling or data augmentation to address imbalances. Continuously monitor and evaluate model performance across diverse demographic groups to detect and mitigate biases. Collaborate with domain experts and stakeholders to validate results and ensure fairness.” Unfortunately, not everyone lives up to these ideals. Selection bias, for example, occurs when a subset of data is excluded because of a particular attribute. During one AI project for Human Resources, the choice of selection attributes excluded an entire population group, so the sampled data used to train the AI was not representative of reality.
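To make the first of Tupil's steps concrete, the short Python sketch below explores a group imbalance and then draws a stratified sample with pandas. It is a minimal illustration only; the dataframe, the hypothetical "group" column, and the equal-count sampling choice are assumptions, not details from any real project.

    import pandas as pd

    # Illustrative dataset with a hypothetical demographic column "group".
    df = pd.DataFrame({
        "feature": range(12),
        "group": ["A"] * 9 + ["B"] * 3,  # group B is under-represented
    })

    # Data exploration: inspect how the groups are balanced.
    print(df["group"].value_counts(normalize=True))  # A: 0.75, B: 0.25

    # Stratified sampling: draw an equal number of rows from each group,
    # capped at the size of the smallest group, so the training sample no
    # longer over-represents group A.
    n = df["group"].value_counts().min()
    balanced = df.groupby("group").sample(n=n, random_state=0)
    print(balanced["group"].value_counts())  # A: 3, B: 3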

Stepping Towards an Unbiased Dataset

The Massachusetts Institute of Technology (MIT) developed “AI Blindspot” cards. AI Blindspots are oversights in a workflow that lead to harmful unintended consequences. Blindspots can occur at any point in an AI project, including during the planning stage, when representative data is collected for use as AI training data.

The cards identify three considerations. First, explore how the data might be skewed or encode historical biases. Second, consider whether diverse voices are included in the data definition and collection process. Third, ensure the data is representative of the community groups the AI project concerns.

For example, an algorithm developed to detect skin cancer was effective only on light skin tones because a demographically diverse dataset was not collected. Clinical study teams are more aware today that generating an unbiased dataset must be a focus when creating AI projects. The risk of excluding population data applies to every type of AI project in any industry and organization, not just healthcare.

Demographic biases occur when training data over-represents or under-represents particular demographic groups. The result is an AI model that demonstrates biased outcomes towards specific genders, ethnicities, races, or social groups. One way to mitigate AI bias is to pre-process the data using techniques such as oversampling, undersampling, or synthetic data generation.
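As a minimal sketch of the resampling side of that pre-processing, assuming a pandas dataframe with a hypothetical demographic column: oversampling duplicates rows from the under-represented group, while undersampling discards rows from the over-represented one.

    import pandas as pd

    # Illustrative records split across two hypothetical demographic groups.
    df = pd.DataFrame({
        "feature": range(10),
        "group": ["majority"] * 8 + ["minority"] * 2,
    })

    majority = df[df["group"] == "majority"]
    minority = df[df["group"] == "minority"]

    # Oversampling: resample the minority group with replacement until it
    # matches the majority group's size.
    oversampled = pd.concat([
        majority,
        minority.sample(n=len(majority), replace=True, random_state=0),
    ])

    # Undersampling: randomly discard majority rows down to the minority size.
    undersampled = pd.concat([
        majority.sample(n=len(minority), random_state=0),
        minority,
    ])

    print(oversampled["group"].value_counts())   # majority: 8, minority: 8
    print(undersampled["group"].value_counts())  # majority: 2, minority: 2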

Pre-processing refers to “identifying and addressing biases in the data before the model is trained.” One cited example of oversampling is a study by Joy Buolamwini and Timnit Gebru, which found that oversampling dark-skinned people in the training data increased the accuracy of facial recognition algorithms for that group. Data augmentation creates synthetic data points to increase the representation of underrepresented groups.
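One common way to create such synthetic points for tabular data is SMOTE-style interpolation between real samples of the under-represented group. The sketch below, with made-up feature vectors, shows the core idea only, not the full SMOTE algorithm (which interpolates between nearest neighbors).

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical feature vectors for an under-represented group.
    minority = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.4]])

    def interpolate_synthetic(samples, n_new, rng):
        # Generate synthetic points on line segments between random pairs
        # of real samples (the core idea behind SMOTE).
        idx_a = rng.integers(0, len(samples), size=n_new)
        idx_b = rng.integers(0, len(samples), size=n_new)
        t = rng.random((n_new, 1))  # interpolation weights in [0, 1)
        return samples[idx_a] + t * (samples[idx_b] - samples[idx_a])

    synthetic = interpolate_synthetic(minority, n_new=5, rng=rng)
    augmented = np.vstack([minority, synthetic])
    print(augmented.shape)  # (8, 2): 3 real points plus 5 synthetic ones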

Best Practices in Dataset Creation for AI Training

There are recognized best practices for the complex process of generating data that is free of bias yet complete enough for AI training. Gathering data from diverse sources can ensure representation; this may include publicly available datasets, data collected from different regions, communities, or populations, industry data, and data gathered through partnerships with organizations representing various demographics. Auditing the dataset can help ensure it represents the diversity of the target population, and augmenting it can increase diversity through data synthesis, sampling, and data manipulation. Of course, data manipulation must be used with caution and should involve diverse teams evaluating both how the data is manipulated and the potential AI outcomes.
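As a sketch of what such an audit might look like, the snippet below compares a dataset's demographic mix against target-population shares and flags groups that drift outside a tolerance. The group names, target shares, and five-point tolerance are all illustrative assumptions.

    from collections import Counter

    # Hypothetical demographic labels attached to each record in the dataset.
    records = ["A"] * 70 + ["B"] * 20 + ["C"] * 10

    # Assumed shares of each group in the target population.
    target_shares = {"A": 0.50, "B": 0.30, "C": 0.20}

    tolerance = 0.05  # flag any group more than 5 percentage points off target
    counts = Counter(records)
    total = len(records)

    for group, target in target_shares.items():
        actual = counts.get(group, 0) / total
        status = "OK" if abs(actual - target) <= tolerance else "FLAG"
        print(f"{group}: actual {actual:.0%} vs. target {target:.0%} -> {status}")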

Additional strategies include using bias detection tools and algorithms that identify patterns of bias, and putting continuous monitoring and iteration processes in place so that datasets remain diverse and representative over time. Datasets need regular updating to reflect changes in demographics and societal trends.
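Tools such as Fairlearn and IBM's AIF360 package such checks for production use; the minimal sketch below shows one underlying idea, computing per-group accuracy and flagging the gap between the best- and worst-served groups. The labels, predictions, and ten-point alert threshold are illustrative assumptions.

    # Per-group accuracy: one simple bias signal used in continuous monitoring.
    # Illustrative predictions, ground truth, and demographic labels.
    y_true = [1, 0, 1, 1, 0, 1, 0, 0]
    y_pred = [1, 0, 1, 0, 0, 0, 1, 0]
    groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

    accuracy = {}
    for g in sorted(set(groups)):
        pairs = [(t, p) for t, p, gr in zip(y_true, y_pred, groups) if gr == g]
        accuracy[g] = sum(t == p for t, p in pairs) / len(pairs)

    gap = max(accuracy.values()) - min(accuracy.values())
    print(accuracy, f"gap = {gap:.0%}")  # {'A': 0.75, 'B': 0.5} gap = 25%
    if gap > 0.10:  # assumed alert threshold
        print("Accuracy disparity exceeds threshold; investigate for bias.")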

Collaboration with communities of interest is also a recommended best practice. During dataset development, teams can work with community organizations, advocacy groups, and experts from diverse backgrounds to gather insights and perspectives on data collection and representation. Team members need to be culturally sensitive and respectful when collecting data from diverse communities, considering cultural norms, beliefs, and values to ensure appropriate data collection methods. Teams should also continuously evaluate the inclusiveness of their data collection practices, seek feedback from participants and stakeholders, and use that feedback to identify areas for improvement and adjust as needed.

Inclusive data collection is much more than running data collection programs. There are already striking examples of poor AI outcomes due to dataset bias. Recently, Google's Gemini generative AI tool produced pictures depicting the U.S. founders as Black. The data the image generator uses for training clearly needs updating so that the tool can balance the desire to render modern, diverse images against the biased realities of the past.

Building Trust in AI Begins With Data Selection

The historical and sampling biases that can distort AI model outcomes can be mitigated, but experts agree that bias will never be eliminated entirely. After all, efforts to mitigate bias can have consequences that make the problem even more complex. The BSA framework for building trust in AI points out that a data intervention for one group can increase bias against another group. BSA concludes that prioritizing diversity, both in the teams involved in a system's development and in those responsible for its oversight, is vitally important. Assumptions developers may not even be aware of can cloud their ability to look at data with an unbiased perspective.