15/12/2025
Today we are releasing the Fikira Dataset by Vambo AI, a synthetic reasoning dataset for 10 African languages 🌍
Version 1.0. Open source. Built to be improved by the community.
አማርኛ (Amharic) • ChiShona • Hausa • Igbo • Ikinyarwanda • IsiXhosa • IsiZulu • Kiswahili • Tunisian Arabic • Yorùbá
400+ million speakers represented.
Why synthetic? Quality reasoning datasets for African languages are scarce. Human annotation by native speakers is expensive, time-intensive and difficult to scale across 10 languages.
We had a choice: wait years for perfect data or release something now that researchers can build with. We chose pragmatism.
This version 1.0 is a bootstrapping tool, not a gold standard. The dataset contains LLM-generated reasoning examples. We are transparent about its limitations: synthetic data may not capture authentic cultural reasoning patterns and carries potential biases from source models.
We are releasing this to give researchers something to build on immediately, to establish a foundation the community can validate and improve, and to work toward version 2.0 with human validation and community contributions.
At Vambo AI, we exist to advance language inclusion in artificial intelligence. We believe progress requires both pragmatism and transparency.
Download it. Test it. Help us make it better.
🔗 huggingface.co/datasets/vamboai/fikira
📧 [email protected]
Built in South Africa 🇿🇦. For the world.