Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion | ACL Rolling Review

Introducing AceGPT: A Culturally-Aware Arabic Language Model

We’re excited to announce AceGPT, a breakthrough in Arabic language AI that addresses the critical “localization issue” faced by current large language models. While existing models like GPT-3.5 and GPT-4 demonstrate impressive capabilities, they often struggle to fully align with Arabic cultural values and norms.

Key Innovations

AceGPT introduces several key innovations to create a truly Arabic-centric language model:

  • Localized Pre-training: Further pre-training on extensive Arabic text data to build strong foundations in Arabic language and cultural context
  • Localized Instructions: Fine-tuning using natural Arabic questions from real-world contexts rather than translated English data
  • Localized Responses: Generating responses directly in Arabic with GPT-4 rather than translating English responses
  • Cultural Alignment: Using reinforcement learning with AI feedback (RLAIF) to align with Arabic cultural values
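
To make the cultural alignment step more concrete, here is a minimal sketch of the pairwise preference loss commonly used to train a reward model in RLHF/RLAIF-style pipelines. It is an illustration under stated assumptions, not AceGPT's actual training code: the function names (`pairwise_preference_loss`, `mean_preference_loss`) are hypothetical, and the scalar rewards stand in for the scores a reward model would assign to a preferred and a rejected response to the same Arabic prompt.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(reward_chosen - reward_rejected)).

    Hypothetical helper for illustration; the loss is small when the
    preferred response scores higher than the rejected one.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of the
    # margin so exp() never receives a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_preference_loss(pairs: list[tuple[float, float]]) -> float:
    """Average the loss over (reward_chosen, reward_rejected) pairs."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    return sum(pairwise_preference_loss(c, r) for c, r in pairs) / len(pairs)
```

In such a setup, the preference labels would come from an AI judge rather than human annotators, and the trained reward model would then score candidate responses during reinforcement learning.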

Performance Highlights

AceGPT achieves state-of-the-art performance among open-source Arabic language models across multiple benchmarks:

  • Instruction Following: Surpasses previous models by 33% on Arabic Vicuna-80 and 30% on Arabic AlpacaEval
  • Cultural Alignment: Strong performance on our new Arabic Cultural and Value Alignment (ACVA) benchmark
  • Knowledge: Superior results on the Arabic MMLU and EXAMs benchmarks
  • Language Understanding: Competitive performance on the ALUE benchmark

Why It Matters

Traditional language models often reflect Western cultural biases, creating challenges for Arabic users. AceGPT represents a significant step toward:

  • Better understanding of Arabic cultural nuances
  • More natural and culturally appropriate responses
  • Stronger alignment with Arabic values and customs
  • Improved practical applications for Arabic-speaking communities

Open Source Commitment

The complete AceGPT framework, including code, data, and models, is available at: https://github.com/FreedomIntelligence/AceGPT

Looking Forward

While AceGPT represents significant progress in culturally-aware AI, we acknowledge current limitations and are committed to:

  • Expanding Arabic vocabulary coverage
  • Enhancing cultural datasets
  • Improving safety alignment
  • Continuing research into cultural adaptation techniques

We believe AceGPT marks an important milestone in creating AI systems that truly understand and respect cultural contexts while serving the specific needs of Arabic-speaking communities.