GitHub - aisingapore/sealion: South-East Asia Large Language Models

South East Asian Languages in One Network

Built for Southeast Asia, by Southeast Asia

South East Asian Languages in One Network (SEA-LION) is a family of open-source Large Language Models (LLMs) that better understands Southeast Asia’s (SEA) diverse contexts, languages, and cultures.

It is an open-source project anchored by the Products Pillar of AI Singapore. Our work on SEA-LION aims to create LLMs that cater to under-represented population groups and low-resource languages in the SEA region. You can read more about our motivations for SEA-LION here.

This site provides information and resources on SEA-LION, including how to access the models, hosting, and how-to guides.

Key Features of SEA-LION

Model Collection	Size (parameters)	Context Length (tokens)	Training Strategy	Available in
SEA-LION v3	9B	8192	CPT¹ of Gemma2	Base, Instruct, GGUF
SEA-LION v3	8B	128K	CPT of Llama 3.1 8B	Base, Instruct, GGUF
SEA-LION v3	70B	128K	CPT of Llama 3.1 70B	Base, Instruct, GGUF
SEA-LION v2	8B	8192	CPT of Llama3	Base, Instruct, GGUF
SEA-LION v1	3B	2048	Pre-training from scratch	Base
SEA-LION v1	7B	2048	Pre-training from scratch	Instruct

¹ Continued Pre-Training
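
As a rough illustration of how the table above can be used to shortlist a model, the sketch below transcribes its rows into a small Python structure and filters them by context length and release format. The `Variant` class and `select` helper are hypothetical names introduced here, not part of any SEA-LION tooling. Two further assumptions: "128K" is taken to mean 128 × 1024 tokens, and rows with an empty first column are taken to belong to the collection named above them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """One row of the model table (hypothetical helper type)."""
    collection: str
    size: str
    context_length: int  # in tokens
    training: str
    formats: tuple


# Rows transcribed from the table above.
# Assumption: "128K" means 128 * 1024 tokens.
CONTEXT_128K = 128 * 1024
ALL_FORMATS = ("Base", "Instruct", "GGUF")

VARIANTS = [
    Variant("SEA-LION v3", "9B", 8192, "CPT of Gemma2", ALL_FORMATS),
    Variant("SEA-LION v3", "8B", CONTEXT_128K, "CPT of Llama 3.1 8B", ALL_FORMATS),
    Variant("SEA-LION v3", "70B", CONTEXT_128K, "CPT of Llama 3.1 70B", ALL_FORMATS),
    Variant("SEA-LION v2", "8B", 8192, "CPT of Llama3", ALL_FORMATS),
    Variant("SEA-LION v1", "3B", 2048, "Pre-training from scratch", ("Base",)),
    Variant("SEA-LION v1", "7B", 2048, "Pre-training from scratch", ("Instruct",)),
]


def select(min_context, fmt):
    """Return variants with at least `min_context` tokens of context
    that are listed as available in format `fmt`."""
    return [
        v for v in VARIANTS
        if v.context_length >= min_context and fmt in v.formats
    ]


if __name__ == "__main__":
    # Example: GGUF builds that can hold a prompt of 32,000 tokens.
    for v in select(min_context=32_000, fmt="GGUF"):
        print(v.collection, v.size, v.context_length)
```

The table is the authority here; if it changes, the transcribed rows need to be updated by hand.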

Performance and Benchmarks

SEA-LION has seen:

In v1, the ability to outperform most models on SEA-HELM (SouthEast Asian Holistic Evaluation of Language Models) at the time of its release
In v2, outperformance on SEA tasks, while retaining credible performance on standard (English) benchmarks
In v2.1, key improvements in conversational abilities across SEA languages, along with more helpful and contextually appropriate responses to user prompts
In v3, outperformance of similar-sized open-source models, and even some larger models, in both general and SEA capabilities

We use a holistic approach to evaluation, including not just traditional Natural Language Processing (NLP) benchmarking tasks (such as sentiment analysis and question answering) but also meticulously handcrafted linguistic and cultural diagnostic tests tailored to Southeast Asia.

Visit our Leaderboard for a more detailed breakdown of:

How SEA-LION compares to other available models across different metrics
What SEA-HELM is and the four key capabilities it evaluates: English performance, proficiency in SEA chat, instruction-following, and linguistic tasks
What each of these globally recognized metrics means under SEA-HELM

Licensing

Transparent and Open Source

We have benefited greatly from the open-source community and believe that efforts to better represent our region will similarly be well served by open-source efforts.

All SEA-LION releases will therefore embrace an open-source ethos under the MIT license as much as possible; however, the exact licensing terms may vary depending on the underlying base model’s restrictions or requirements. For instance, if the model leverages Meta’s Llama3 codebase, it may be bound by the Llama3 License, which places certain restrictions on commercial use. Similarly, the Gemma-based variants may carry different terms. Users should always refer to the Hugging Face model card of each specific SEA-LION model for the most accurate, up-to-date license information.

SEA-LION will also be open and transparent in the following areas throughout this guide:

Pre-Training data
Model training code
Fine-Tuning data
Evaluation benchmarks

Community

We welcome contributions to SEA-LION! Check out the contributing guide to get started.

Some ways to contribute:

Report bugs and issues
Enhance the documentation
Add more model evaluation tasks and metrics
Train versions of the model in more SEA languages

Also check out our collaborations guide for possible ways to further enhance and expand the capabilities of SEA-LION together.

To Cite SEA-LION

If you use SEA-LION in your work, please cite it as:

@misc{sea_lion_2024,
  title={SEA-LION (Southeast Asian Languages In One Network): A Family of Large Language Models for Southeast Asia},
  author={AI Singapore},
  year={2024},
  howpublished={\url{https://github.com/aisingapore/sealion}}
}

Acknowledgements

AI Singapore is a national programme supported by the National Research Foundation, Singapore and hosted by the National University of Singapore. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation, Singapore, or the National University of Singapore.

We are also grateful for the support of the Infocomm Media Development Authority (IMDA) of Singapore.

SEA-LION would not be possible without a growing list of Singapore, regional, and international collaborators. Please see our website for more details.

Contact

If you have questions, comments, or issues, please open a GitHub issue or contact us via this SEA-LION Inquiry Form.
