GenderBench 1.0 Results
Matúš Pikuliak (matus.pikuliak@gmail.com)
What is GenderBench?
GenderBench is an open-source evaluation suite designed to comprehensively benchmark gender biases in large language models (LLMs). It uses a variety of tests, called probes, each targeting a specific type of unfair behavior.
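To make the probe abstraction concrete, here is a minimal sketch of what a probe can look like. The class and method names are illustrative assumptions made for this document, not GenderBench's actual API; see the repository for the real interfaces.

```python
# Illustrative sketch only -- names and structure are assumptions,
# not the actual GenderBench API.
from dataclasses import dataclass

@dataclass
class ProbeItem:
    prompt: str  # prompt sent to the evaluated LLM
    gender: str  # gender variant encoded in the prompt

class HiringProbe:
    """Toy probe: compare acceptance rates for otherwise identical CVs."""

    def generate_items(self) -> list[ProbeItem]:
        template = "Should we hire {name} as a software engineer? Answer yes or no. CV: ..."
        return [
            ProbeItem(template.format(name="John Smith"), gender="male"),
            ProbeItem(template.format(name="Jane Smith"), gender="female"),
        ]

    def score(self, answers: dict[str, list[str]]) -> float:
        # Metric: absolute difference in acceptance rates between variants.
        rate = lambda xs: sum(a.lower().startswith("yes") for a in xs) / len(xs)
        return abs(rate(answers["male"]) - rate(answers["female"]))
```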
What is this document?
This document presents the results of GenderBench 1.0, evaluating various LLMs. It provides an empirical overview of the current state of the field as of March 2025. It contains three main parts:
- Final marks - This section shows the marks calculated for evaluated LLMs in various categories.
- Executive summary - This section summarizes our main findings and observations.
- Detailed results - This section presents the raw data.
How can I learn more?
For further details, visit the project's repository. We welcome collaborations and contributions.
Final marks
This section presents the main output from our evaluation.
Each LLM has received marks based on its performance in four use cases. Each use case includes multiple probes that assess model behavior in specific scenarios.
- Decision-making - Evaluates how fair the LLMs are in making decisions in real-life situations, such as hiring. We simulate scenarios where the LLMs are used in fully automated systems or as decision-making assistants.
- Creative Writing - Examines how the LLMs handle stereotypes and representation in creative outputs. We simulate scenarios in which users ask the LLM to help them with creative writing.
- Manifested Opinions - Assesses whether the opinions the LLMs express show gender bias. We covertly or overtly inquire about how the LLMs perceive genders. Although this may not reflect typical use, it reveals underlying ideologies within the LLMs.
- Affective Computing - Looks at whether the LLMs make assumptions about users' emotional states based on their gender. When an LLM is aware of the user's gender, it may treat them differently by assuming certain psychological traits or states. This can result in unintended unequal treatment.
To categorize the severity of harmful behaviors, we use a four-tier system, illustrated with a short sketch after the list:
- A - Healthy. No detectable signs of harmful behavior.
- B - Cautionary. Low-intensity harmful behavior, often subtle enough to go unnoticed.
- C - Critical. Noticeable harmful behavior that may affect user experience.
- D - Catastrophic. Harmful behavior is common and present in most assessed interactions.
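The toy sketch below shows how such a tiered marking could be computed from a normalized bias score. The thresholds are invented for the example; the actual cutoffs in GenderBench are defined per probe and documented in the repository.

```python
# Hypothetical illustration of a tiered marking scheme. The thresholds
# below are invented for this example; GenderBench's actual cutoffs
# differ per probe.
def mark(normalized_score: float) -> str:
    """Map a normalized bias score in (0, 1) to a severity tier."""
    if normalized_score < 0.05:
        return "A"  # Healthy
    if normalized_score < 0.15:
        return "B"  # Cautionary
    if normalized_score < 0.50:
        return "C"  # Critical
    return "D"      # Catastrophic

assert mark(0.02) == "A" and mark(0.60) == "D"
```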
| Model | Decision-making | Creative Writing | Manifested Opinions | Affective Computing |
|---|---|---|---|---|
| claude-3-5-haiku | A | D | C | A |
| gemini-2.0-flash | A | C | C | A |
| gemini-2.0-flash-lite | A | C | C | A |
| gemma-2-27b-it | A | C | C | A |
| gemma-2-9b-it | A | C | C | A |
| gpt-4o | B | C | C | A |
| gpt-4o-mini | A | C | C | A |
| Llama-3.1-8B-Instruct | A | C | B | A |
| Llama-3.3-70B-Instruct | A | D | C | A |
| Mistral-7B-Instruct-v0.3 | A | C | C | A |
| Mistral-Small-24B-Instruct-2501 | A | C | B | A |
| phi-4 | A | C | C | A |
Executive summary
This section introduces several high-level observations we have made based on our results. All the data behind these observations are presented in the detailed results below.
🙈 Note on completeness
This benchmark captures only a subset of potential gender biases - others may exist beyond our scope. Biases can manifest differently across contexts, cultures, or languages, making complete coverage impossible. Results should be interpreted as indicative, not exhaustive.
Converging behavior
All the LLMs we evaluated behave in noticeably similar ways. If one model proves to be healthy for a given probe, others likely are too; if one LLM prefers a certain gender in a given probe, others likely prefer it too. This is not surprising, as training recipes have converged remarkably in recent years: most AI labs train their LLMs using similar methods, similar data, and sometimes even outputs from their competitors. As a result, the LLMs end up behaving very much alike.
LLMs treat women better
Historically, it was assumed that machine learning models might treat men better, owing to men's historically advantageous position as reflected in training text corpora. However, when we directly compare the treatment of men and women, our probes show either equal treatment or preferential treatment of women. In creative writing, most characters are written as women; in decision-making, women may have a slight edge over men; and when asked who is right in relationship conflicts, LLMs tend to take the woman's side. This overcorrection should be considered when deploying LLMs into production.
Strong stereotypical reasoning
Using gender-stereotypical reasoning is a relatively common failure mode. LLMs tend to write characters with stereotypical traits, assign stereotypical statements to certain genders, agree with stereotypical ideas, and so on. Stereotypical associations with occupations are especially troubling, considering the usage of LLMs in professional settings. Mitigating this issue is extremely challenging, as stereotypes are deeply embedded in vast amounts of training data.
Decision-making deserves caution
Decision-making in everyday and business situations, such as hiring or financial decisions, does not seem to be strongly affected by biases, but there are still cases where the outcomes could be characterized as unfair. We recommend special caution in all use cases where an LLM makes decisions based on data that contain information about gender. Fairness should always be monitored. Removing gender-related personal information, such as names or pronouns, can also be considered as a mitigation measure, as sketched below.
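The sketch below illustrates the redaction idea by masking common gendered tokens before a text reaches the model. The token list and replacement scheme are simplified assumptions for this example; a production pipeline would also need NER-based anonymization of names.

```python
import re

# Simplified illustration of gender redaction. The token list is a toy
# assumption; a real pipeline would also use NER to detect and mask names.
GENDERED = {
    r"\bhe\b": "they", r"\bshe\b": "they",
    r"\bhis\b": "their", r"\bher\b": "their",
    r"\bmr\.": "mx.", r"\bmrs\.": "mx.", r"\bms\.": "mx.",
}

def redact_gender(text: str) -> str:
    """Replace gendered pronouns and titles with neutral placeholders."""
    for pattern, repl in GENDERED.items():
        text = re.sub(pattern, repl, text, flags=re.IGNORECASE)
    return text

print(redact_gender("Mrs. Smith said she will send her CV."))
# -> "mx. Smith said they will send their CV."
```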
What is missing
There are still noticeable gaps in our evaluation. GenderBench currently does not address several important verticals, such as multimodal processing, non-English languages, reasoning capabilities, or multi-turn conversations. These will be progressively covered in future releases.
Normalized results
The table below presents the results used to calculate the marks, normalized in different ways to fall within the (0, 1) range, where 0 and 1 represent the theoretically least and most biased models respectively. We also display the average result for each model. However, we generally do not recommend relying on the average as a primary measure, as it is an imperfect abstraction.
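As a concrete example of such a normalization, a raw disparity metric can be min-max scaled against its theoretical bounds and the scaled scores then averaged. This is a simplified sketch under assumed bounds, not GenderBench's exact per-probe normalization.

```python
# Simplified sketch of min-max normalization and averaging. The bounds
# below are illustrative assumptions; GenderBench normalizes each metric
# against its own theoretical limits.
def normalize(value: float, lo: float, hi: float) -> float:
    """Scale a raw metric so that 0 is least and 1 is most biased."""
    return (value - lo) / (hi - lo)

# A rate difference already lives in [0, 1]; a signed metric in [-1, 1]
# is folded to its magnitude first.
scores = [normalize(0.062, 0.0, 1.0), normalize(abs(-0.440), 0.0, 1.0)]
average = sum(scores) / len(scores)
print(f"average = {average:.3f}")
```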
| Model | DiscriminationTamkinProbe.max_diff | HiringAnProbe.diff_acceptance_rate | HiringAnProbe.diff_regression | HiringBloombergProbe.masculine_rate | HiringBloombergProbe.stereotype_rate | DiversityMedQaProbe.diff_success_rate | BusinessVocabularyProbe.mean_diff | GestCreativeProbe.stereotype_rate | InventoriesProbe.stereotype_rate | JobsLumProbe.stereotype_rate | GestCreativeProbe.masculine_rate | InventoriesProbe.masculine_rate | JobsLumProbe.masculine_rate | DirectProbe.fail_rate | RelationshipLevyProbe.diff_success_rate | GestProbe.stereotype_rate | BbqProbe.stereotype_rate | DreadditProbe.max_diff_stress_rate | IsearProbe.max_diff | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| claude-3-5-haiku | 0.062 | 0.022 | 0.006 | 0.021 | 0.015 | 0.010 | 0.000 | 0.116 | 0.116 | 0.572 | 0.400 | 0.404 | 0.231 | 0.026 | 0.329 | 0.578 | 0.096 | 0.005 | 0.077 | 0.162 |
| gemini-2.0-flash | 0.023 | 0.003 | 0.017 | 0.044 | 0.000 | 0.023 | 0.000 | 0.106 | 0.000 | 0.571 | 0.257 | 0.160 | 0.202 | 0.046 | 0.312 | 0.687 | 0.013 | 0.007 | 0.059 | 0.133 |
| gemini-2.0-flash-lite | 0.007 | 0.001 | 0.000 | 0.041 | 0.011 | 0.001 | 0.000 | 0.176 | 0.105 | 0.747 | 0.068 | 0.283 | 0.109 | 0.037 | 0.277 | 0.535 | 0.033 | 0.013 | 0.078 | 0.133 |
| gemma-2-27b-it | 0.039 | 0.003 | 0.016 | 0.030 | 0.023 | 0.002 | 0.003 | 0.154 | 0.160 | 0.591 | 0.220 | 0.279 | 0.209 | 0.037 | 0.635 | 0.563 | 0.020 | 0.013 | 0.060 | 0.161 |
| gemma-2-9b-it | 0.043 | 0.024 | 0.001 | 0.010 | 0.011 | 0.001 | 0.004 | 0.132 | 0.097 | 0.604 | 0.262 | 0.294 | 0.193 | 0.030 | 0.543 | 0.477 | 0.011 | 0.008 | 0.067 | 0.148 |
| gpt-4o | 0.007 | 0.020 | 0.026 | 0.101 | 0.009 | 0.004 | 0.000 | 0.287 | 0.279 | 0.624 | 0.169 | 0.205 | 0.195 | 0.052 | 0.542 | 0.238 | 0.001 | 0.010 | 0.021 | 0.147 |
| gpt-4o-mini | 0.020 | 0.011 | 0.002 | 0.061 | 0.000 | 0.003 | 0.003 | 0.227 | 0.153 | 0.593 | 0.294 | 0.294 | 0.211 | 0.085 | 0.379 | 0.415 | 0.075 | 0.009 | 0.029 | 0.151 |
| Llama-3.1-8B-Instruct | 0.078 | 0.001 | 0.017 | 0.023 | 0.044 | 0.015 | 0.018 | 0.232 | 0.280 | 0.842 | 0.259 | 0.313 | 0.078 | 0.017 | 0.126 | 0.108 | 0.207 | 0.011 | 0.071 | 0.144 |
| Llama-3.3-70B-Instruct | 0.010 | 0.027 | 0.022 | 0.024 | 0.008 | 0.002 | 0.022 | 0.195 | 0.271 | 0.648 | 0.340 | 0.313 | 0.188 | 0.042 | 0.290 | 0.641 | 0.041 | 0.009 | 0.062 | 0.166 |
| Mistral-7B-Instruct-v0.3 | 0.008 | 0.005 | 0.011 | 0.057 | 0.014 | 0.009 | 0.000 | 0.270 | 0.284 | 0.801 | 0.100 | 0.188 | 0.095 | 0.053 | 0.443 | 0.143 | 0.238 | 0.002 | 0.078 | 0.147 |
| Mistral-Small-24B-Instruct-2501 | 0.036 | 0.005 | 0.006 | 0.026 | 0.001 | 0.002 | 0.000 | 0.215 | 0.159 | 0.689 | 0.266 | 0.271 | 0.150 | 0.031 | 0.464 | 0.165 | 0.049 | 0.017 | 0.038 | 0.136 |
| phi-4 | 0.024 | 0.008 | 0.020 | 0.057 | 0.002 | 0.002 | 0.000 | 0.338 | 0.320 | 0.747 | 0.143 | 0.277 | 0.124 | 0.031 | 0.272 | 0.416 | 0.017 | 0.008 | 0.030 | 0.149 |