Towards Open Source Large Language Models for India
The Path to an Accessible Indian Analogue to ChatGPT
Natural Language Processing (NLP) has advanced rapidly in recent years, with Large Language Models (LLMs) leading the way. Yet despite their potential, these models often fall short in catering to region-specific linguistic nuances and requirements. This article outlines my goal of bridging that gap with LLMs tailored for India. I will focus specifically on instruction-tuned LLMs, with the hope that these models can serve as a platform for a range of practical applications.
Roadmap:
1. Creating an Instruction-Tuning Dataset
The journey began with the creation of an instruction-tuning dataset named InstructMix. Designed to enhance the models' understanding of instructions, this dataset (accessible at link) forms the foundational building block.
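The article does not show InstructMix's exact schema, but instruction-tuning datasets commonly use an Alpaca-style layout. Here is a minimal sketch, assuming hypothetical field names (`instruction`, `input`, `output`) rather than the dataset's actual format:

```python
# Hypothetical Alpaca-style record; InstructMix's real schema may differ.
record = {
    "instruction": "Translate the sentence to Hindi.",
    "input": "How are you?",
    "output": "आप कैसे हैं?",
}

def to_prompt(rec):
    """Flatten one record into a single training string."""
    prompt = f"### Instruction:\n{rec['instruction']}\n"
    if rec.get("input"):  # the input field is optional
        prompt += f"### Input:\n{rec['input']}\n"
    prompt += f"### Response:\n{rec['output']}"
    return prompt

print(to_prompt(record))
```

During fine-tuning, each record is flattened into one such string so the model learns to produce the response section given the instruction.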
2. Fine-tuning on the InstructMix Dataset
Building upon the dataset, a small LLM was fine-tuned, resulting in InstructMix-Llama-3B. This model follows instructions and can be used conversationally. Further details can be found here.
3. Expanding InstructMix's Horizons
An iterative approach will be adopted to enrich the InstructMix dataset, incorporating conversational data. This expansion aims to enhance the LLM's ability to engage in dynamic exchanges.
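Conversational data differs from single-turn instructions in that each sample is a sequence of alternating turns. A minimal sketch of how such records might look and be serialized for training (the role names and transcript format here are assumptions, not the project's actual format):

```python
# Hypothetical multi-turn record; the expanded InstructMix format may differ.
conversation = [
    {"role": "user", "content": "What is the capital of India?"},
    {"role": "assistant", "content": "The capital of India is New Delhi."},
    {"role": "user", "content": "And what language is mainly spoken there?"},
]

def render_chat(turns):
    """Serialize turns into a plain-text transcript for fine-tuning."""
    return "\n".join(f"{t['role'].upper()}: {t['content']}" for t in turns)

print(render_chat(conversation))
```

Training on whole transcripts like this, rather than isolated instruction/response pairs, is what lets the model condition on earlier turns in a dynamic exchange.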
4. Enhancing Conversational Proficiency
To elevate the model's conversational ability, a small LLM will be fine-tuned on the updated dataset, reinforcing its capacity for coherent multi-turn dialogue.
5. Multilingual Datasets Generation
Leveraging the InstructMix foundation, multilingual datasets will be crafted, encompassing languages like Hindi, English, and Romanized Hindi (Hinglish). These datasets pave the way for cross-lingual capabilities.
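When mixing Hindi, English, and Romanized Hindi in one corpus, it helps to tag each sample with its language so the blend can be monitored. A small sketch of one possible approach (the tag names and counting helper are illustrative assumptions, not the project's pipeline):

```python
from collections import Counter

# Hypothetical language-tagged samples; the actual dataset format may differ.
samples = [
    {"lang": "en", "text": "What is the weather today?"},
    {"lang": "hi", "text": "आज मौसम कैसा है?"},
    {"lang": "hi-Latn", "text": "Aaj mausam kaisa hai?"},  # Romanized Hindi (Hinglish)
]

def language_mix(rows):
    """Count samples per language tag to monitor corpus balance."""
    return Counter(r["lang"] for r in rows)

print(language_mix(samples))
```

Keeping the per-language counts roughly balanced (or deliberately weighted) is a common design choice when the goal is cross-lingual capability rather than dominance of one language.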
6. Pioneering Multilingual Proof of Concept
A smaller LLM will be trained on the multilingual dataset, serving as a proof of concept for the model's adaptability and efficacy across languages.
7. Scaling to 7 Billion Parameters
The next objective is to train a 7-billion-parameter LLM using diverse multilingual datasets.
8. Aiming for 13 Billion Parameters
Pushing the boundaries further, the project then envisions the development of a 13-billion-parameter LLM.
Side Quests:
1. Guanaco QLoRA of Llama 2
While not directly aligned with the primary objective, training a Guanaco QLoRA based on Llama 2 was a significant learning experience.
2. Empowering Models with Function Invocation Abilities
Adding function invocation data to InstructMix lays the foundation for training models to execute functions when contextually appropriate, akin to ChatGPT's functionality.
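One common way to teach function invocation, in the style of ChatGPT's function calling, is to train the model to emit a structured JSON call that a harness then parses and executes. A minimal sketch, where the sample layout, function name, and parser are all hypothetical assumptions rather than the InstructMix format:

```python
import json

# Hypothetical function-calling training sample; InstructMix's format may differ.
sample = {
    "instruction": "What's 25 degrees Celsius in Fahrenheit?",
    "functions": [{"name": "convert_temperature",
                   "parameters": {"value": "number", "unit": "string"}}],
    # The target response the model should learn to emit: a JSON call.
    "response": json.dumps({"call": "convert_temperature",
                            "arguments": {"value": 25, "unit": "celsius"}}),
}

def parse_call(response_text):
    """Extract the function name and arguments the model chose to invoke."""
    payload = json.loads(response_text)
    return payload["call"], payload["arguments"]

name, args = parse_call(sample["response"])
print(name, args)
```

At inference time, the harness would run the named function with the parsed arguments and feed the result back to the model, which then composes the final user-facing answer.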
3. Conversational Roleplaying LLMs
As an offshoot of the core project, LLMs tailored for conversing with characters or roleplaying will also be developed. See Xilabs' Calypso-3B-Alpha-V2, designed for immersive interactions.
I have framed these milestones as accessible checkpoints on the way to the final destination: deploying a capable multilingual conversational model, similar in function to ChatGPT but with a specific focus on India. These objectives will be revised and updated over time, as we are still in the very early, wild-west days of LLMs.
Let me know what you think :)