Machine Translation for Indigenous Language Preservation

Available Resources

Datasets and models

Translation datasets for Wayuunaiki, Arhuaco, Inga, and Nasa – Parallel corpora pairing each Indigenous language with Spanish.
Translation models for Wayuunaiki, Arhuaco, Inga, and Nasa – The best-performing models for each language.

Books

O'unaa unapümuin sülerru'je tü Maa'kat – Wayuunaiki translation of Journey to the Center of the Earth (Jules Verne).
Chaupi alpa ukuma purii – Inga translation of Journey to the Center of the Earth (Jules Verne).
Jan tsbatsanamamabe jatsëntsak – Kamentsa translation of Journey to the Center of the Earth (Jules Verne).

If you use any of our datasets or models in your work, please cite us as follows:

"Juan Prieto, Cristian Martinez, Melissa Robles, Alberto Moreno, Sara Palacios, and Rubén Manrique. 2024. Translation systems for low-resource Colombian Indigenous languages, a first step towards cultural preservation. In Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024), pages 7–14, Mexico City, Mexico. Association for Computational Linguistics."

Publications

Improving Low-Resource Translation with Dictionary-Guided Fine-Tuning and RL: A Spanish-to-Wayuunaiki Study

Summary: Low-resource machine translation remains a significant challenge for large language models (LLMs), which often lack exposure to these languages during pretraining and have limited parallel data for fine-tuning. We propose a novel approach that enhances translation for low-resource languages by integrating an external dictionary tool and training models end-to-end using reinforcement learning, in addition to supervised fine-tuning. Focusing on the Spanish-Wayuunaiki language pair, we frame translation as a tool-augmented decision-making problem in which the model can selectively consult a bilingual dictionary during generation. Our method combines supervised instruction tuning with Group Relative Policy Optimization (GRPO), enabling the model to learn both when and how to use the tool effectively. BLEU similarity scores are used as rewards to guide this learning process. Preliminary results show that our tool-augmented models achieve an improvement of up to +3.37 BLEU over previous work, and an 18% relative gain compared to a supervised baseline without dictionary access, on the Spanish-Wayuunaiki test set from the AmericasNLP 2025 Shared Task. We also conduct ablation studies to assess the effects of model architecture and training strategy, comparing Qwen2.5-0.5B-Instruct with other models such as LLaMA and a prior NLLB-based system. These findings highlight the promise of combining LLMs with external tools and the role of reinforcement learning in improving translation quality in low-resource language settings.
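The reward signal described in this summary can be illustrated with a small sketch. It is not the authors' implementation: it assumes the common GRPO formulation in which several candidate translations are sampled for the same source sentence and each reward is standardized against the others in that group. The function names (sentence_bleu, group_advantages), the add-one smoothing, and the placeholder token strings are all hypothetical choices made for illustration; a real setup would more likely use an established BLEU implementation.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def ngrams(tokens, n):
    """Return a Counter of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, reference, max_n=4):
    """Add-one smoothed sentence-level BLEU in [0, 1] (illustrative only)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp or not ref:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clipped overlap: each hypothesis n-gram counts at most as often
        # as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        # Add-one smoothing (an assumption) avoids log(0) on short sentences.
        log_precision += math.log((overlap + 1) / (total + 1)) / max_n
    # Brevity penalty for hypotheses shorter than the reference.
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * math.exp(log_precision)


def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same source."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Placeholder tokens stand in for a reference translation and three
# sampled candidates; they are not real Wayuunaiki text.
reference = "a b c d e"
candidates = ["a b c d e", "a b c x y", "x y z"]
rewards = [sentence_bleu(c, reference) for c in candidates]
advantages = group_advantages(rewards)
```

Under these assumptions, candidates that score above the group mean receive positive advantages and are reinforced, while those below the mean are discouraged. This is how a BLEU-based reward could, in principle, teach a model when a dictionary lookup actually improves the output.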

Cite: Manuel Mosquera, Melissa Robles, Johan Rodriguez, and Ruben Manrique. 2025. Improving Low-Resource Translation with Dictionary-Guided Fine-Tuning and RL: A Spanish-to-Wayuunaiki Study. arXiv preprint, https://arxiv.org/abs/2508.19481. Accepted at AAAI 2026.

Translation systems for low-resource Colombian Indigenous languages, a first step towards cultural preservation

Summary: The use of machine learning and Natural Language Processing (NLP) technologies can assist in the preservation and revitalization of indigenous languages, particularly those classified as “low-resource.” Given the increasing digitization of information, the development of translation tools for these languages is of significant importance. These tools not only facilitate better access to digital resources for indigenous communities but also stimulate language preservation efforts and potentially foster more inclusive, equitable societies, as demonstrated by the AmericasNLP workshop since 2021. The focus of this paper is Colombia, a country home to 65 distinct indigenous languages, presenting a vast spectrum of linguistic characteristics. This cultural and linguistic diversity is an inherent pillar of the nation’s identity, and safeguarding it has been increasingly challenging given the dwindling number of native speakers and the communities’ inclination towards oral traditions. Considering this context, scattered initiatives exist to develop translation systems for these languages. However, these endeavors suffer from a lack of consolidated, comparable data. This paper consolidates a dataset of parallel data in four Colombian indigenous languages - Wayuunaiki, Arhuaco, Inga, and Nasa - gathered from existing digital resources. It also presents the creation of baseline models for future translation and comparison, ultimately serving as a catalyst for incorporating more digital resources progressively.

Cite: Juan Prieto, Cristian Martinez, Melissa Robles, Alberto Moreno, Sara Palacios, and Rubén Manrique. 2024. Translation systems for low-resource Colombian Indigenous languages, a first step towards cultural preservation. In Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024), pages 7–14, Mexico City, Mexico. Association for Computational Linguistics.

Preserving Heritage: Developing a Translation Tool for Indigenous Dialects

Summary: The preservation and understanding of indigenous languages are crucial, given their substantial contribution to the cultural and linguistic heritage of communities. Despite their undeniable value, these languages are threatened by extinction due to a dwindling number of native speakers and the predominance of oral traditions over written forms. In this context, this study aims to contribute to the conservation of these languages through the development of a Spanish-indigenous language translator. This research employs neural machine translation technology, investigating three distinct approaches: a translation model based on transformers, fine-tuning with a Finnish translator, and fine-tuning with a multilingual translator. The results obtained from these methodologies are promising, demonstrating competitive viability when compared to the limited existing research in this field of study.

Cite: Melissa Robles, Cristian A. Martínez, Juan C. Prieto, Sara Palacios, and Rubén Manrique. 2024. Preserving Heritage: Developing a Translation Tool for Indigenous Dialects. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining (WSDM '24). Association for Computing Machinery, New York, NY, USA, 1200–1203. https://doi.org/10.1145/3616855.3637828

Universidad de los Andes
© 2025 - Departamento de Ingeniería de Sistemas y Computación