Modern AI systems are trained on datasets that exclude more than 99% of the world’s living languages. As large language models increasingly shape research, education, industry, and policy, this imbalance determines whose knowledge is visible and whose is erased.
Fairtrain is envisioned as an open‑source, community‑owned project that crowdsources and curates openly licensed text, audio, and video resources across the world’s 7,000+ living languages, enabling training corpora that reflect a richer diversity of human knowledge. By expanding linguistic representation at scale, we work toward a modeling landscape that is more equitable, reliable, and responsive to the needs of every human community.