Practical solution for enriching LLMs

By Daniel Charters, data scientist at KID Services, part of the KID Group of companies.

Johannesburg, 28 May 2024

Numerous enterprises are pinning their hopes on generative artificial intelligence (GenAI) to enhance operational efficiency and introduce innovative functions.

However, GenAI is still in its infancy, and isn’t the silver bullet many organisations hope for − yet. Currently, it’s a tool, not a solution. It can certainly be deployed to support internal decision-making, but is far from ready to engage with customers to make decisions on behalf of organisations.

Many organisations aspire to build their own large language models (LLMs) and train them on their own databases, but the costs and skills required to do this create stumbling blocks.

Few good LLMs exist today, and those that do come from large technology companies such as Google and Facebook, which have the capacity to train them. If an organisation opts for a black-box model such as ChatGPT, users won't know where an answer comes from or whether it is a ‘hallucination’, so they should exercise caution.

While LLMs benefit from broadly accessible datasets for general tasks, specialised applications, particularly within enterprise settings, demand proprietary data integration. The output of a generative model can only be as good as the data that it is trained on.

Approaches to integrating new data into models include pre-training, which is costly and resource-intensive; domain-specific fine-tuning of existing LLMs; and retrieval-augmented generation (RAG). RAG grounds model outputs in query-relevant data that fits within the model's processing capacity (its context window), which makes it a practical solution for enriching LLMs with new information.

This is an ideal way to increase trust and transparency when using LLMs for enterprise applications.

Understanding RAG

RAG operates by combining the generative power of models like GPT (generative pre-trained transformer) with a retrieval mechanism that fetches relevant information from a large database or corpus of text in real time. This approach allows the model to produce responses that are not only contextually appropriate but also factually accurate and grounded in real-world knowledge.
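The retrieve-then-generate flow described above can be illustrated with a minimal sketch. It assumes a tiny in-memory corpus and a simple word-overlap (cosine) score in place of the vector search a production system would typically use. The corpus contents and the names `CORPUS`, `retrieve` and `build_prompt` are hypothetical, and the final call to an LLM is deliberately left out.

```python
import math
import re
from collections import Counter

# Hypothetical in-memory corpus standing in for an enterprise knowledge base.
CORPUS = [
    "Refunds are processed within 14 days of receiving the returned item.",
    "Our support desk is open on weekdays from 08:00 to 17:00.",
    "Premium customers receive free delivery on all orders.",
]


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, corpus, k=2):
    """Return up to k passages that share at least one word with the query."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(doc))), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]


def build_prompt(query, passages):
    """Place the retrieved passages ahead of the question for the model."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )


question = "How long do refunds take?"
passages = retrieve(question, CORPUS, k=1)
prompt = build_prompt(question, passages)
print(prompt)  # this text would be sent to whichever LLM is in use
```

The design point is that the model itself is untouched: new or corrected information only needs to be added to the corpus, not trained into the model.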

RAG is particularly useful for applications that require up-to-date information or domain-specific knowledge that might not be covered in the training data of the generative model. This includes tasks like question answering, content creation and conversational AI systems where accuracy and relevancy are critical.

RAG offers a significant advancement in increasing trust in LLMs through its ability to enhance accuracy and relevance, and reduce biases in generated content. By dynamically retrieving up-to-date and factually accurate information from extensive databases before generating responses, RAG ensures outputs are not only relevant to the query but also reflect current knowledge and diverse perspectives.

This process helps in mitigating the propagation of biases. Moreover, the transparency inherent in RAG's retrieval mechanism allows users to trace the origin of the information used in generating responses, adding a layer of explainability and fostering trust.
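One way to picture this traceability is to attach a source identifier to every retrieved passage and ask the model to cite passages by number. The sketch below is an assumption about how such a scheme could look, not a description of any particular RAG tool. The `Passage` type, the file names and the `[1]`-style citation format are hypothetical, and the sample answer is a stand-in for real model output.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    source: str  # hypothetical document identifier, such as a file name
    text: str


def build_cited_prompt(question, passages):
    """Number each passage so the model can cite it as [1], [2], and so on."""
    lines = [
        f"[{i}] ({p.source}) {p.text}"
        for i, p in enumerate(passages, start=1)
    ]
    return (
        "Answer using only the numbered passages and cite them like [1].\n"
        + "\n".join(lines)
        + f"\nQuestion: {question}"
    )


def resolve_citations(answer, passages):
    """Map citation markers found in an answer back to source documents."""
    cited = sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer)})
    return [passages[n - 1].source for n in cited if 1 <= n <= len(passages)]


passages = [
    Passage("returns_policy.pdf", "Refunds are processed within 14 days."),
    Passage("delivery_faq.md", "Premium customers receive free delivery."),
]

# Stand-in for a model response; a real answer would come from the LLM.
answer = "Refunds take up to 14 days [1]."
print(resolve_citations(answer, passages))  # ['returns_policy.pdf']
```

Citation markers that do not match a retrieved passage are ignored here; a stricter system might flag them as a possible hallucination instead.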

By leveraging RAG, developers can create GenAI systems that provide more accurate, informative and contextually relevant outputs, significantly enhancing the user experience and the utility of AI applications in various domains.

RAG is one of the simplest ways to build context and trust into GenAI outputs. Some RAG tools offer a drag-and-drop solution with guardrails in place to support LLMOps, resulting in generative AI that organisations can have confidence in.

There are a number of data quality issues to be considered in RAG models:

Data integrity: RAG models depend critically on the quality of their input data. If the data is incorrect or outdated, the models can generate misleading outputs. Ensuring data integrity is fundamental; it involves verifying that data is accurate, complete and updated to reflect the latest information.

Challenges in data management: Managing data for GenAI involves navigating several challenges, such as ensuring the data remains relevant over time and is representative of diverse scenarios. This is complicated by the vast amount of data and its varied formats, requiring sophisticated methods for validation and curation. Effective data management must account for these dynamics, ensuring data is high-quality and aligned with the specific needs of GenAI technologies.

Data governance: Data governance for GenAI transcends traditional boundaries, encompassing not only quality and privacy but also ethical considerations. Establishing a comprehensive governance framework is crucial, defining clear policies around data usage, access and security.

Data availability: The scope and depth of data accessible to GenAI models directly influence their effectiveness. Limited datasets can restrict the models' understanding and creativity, underscoring the importance of building extensive, high-quality data repositories from which the model can draw context. This, in turn, has implications for data gathering and for bias.

Data stewardship: The evolution of data stewardship reflects the complex requirements of GenAI. Data stewards now need a deeper collaboration with technical teams, understanding GenAI's unique demands and ensuring the data ecosystem is robust. Their role is critical in bridging the gap between data management practices and the technological needs of GenAI applications.

Business implications: Prioritising data quality enhances decision-making and fosters innovation, providing a competitive edge. It enables businesses to leverage GenAI for deeper insights and more creative solutions, translating into tangible benefits, such as improved efficiency, customer satisfaction and market position.

Future directions: The continuous advancement of GenAI necessitates equally progressive approaches to data quality management. This includes leveraging AI itself to assist in data curation and validation, as well as developing industry-specific standards for data quality tailored to the needs of GenAI applications.
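As an illustration of the data integrity point in the list above (data that is accurate, complete and up to date), the sketch below screens documents before they are added to a retrieval index. The checks, the one-year freshness threshold and the names `Document`, `quality_issues` and `split_for_indexing` are assumptions made for the example; real criteria would depend on the organisation's own governance policies.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Document:
    source: str         # hypothetical identifier, such as a file name
    text: str
    last_updated: date


def quality_issues(doc, today, max_age_days=365):
    """List the basic completeness and freshness problems found in a document."""
    issues = []
    if not doc.source.strip():
        issues.append("missing source")
    if not doc.text.strip():
        issues.append("empty text")
    if today - doc.last_updated > timedelta(days=max_age_days):
        issues.append("stale")
    return issues


def split_for_indexing(docs, today, max_age_days=365):
    """Separate documents that pass the checks from those needing review."""
    accepted, rejected = [], []
    for doc in docs:
        issues = quality_issues(doc, today, max_age_days)
        if issues:
            rejected.append((doc, issues))
        else:
            accepted.append(doc)
    return accepted, rejected


today = date(2024, 5, 28)
docs = [
    Document("returns_policy.pdf", "Refunds take 14 days.", date(2024, 1, 15)),
    Document("old_faq.md", "Delivery takes a week.", date(2021, 3, 1)),
    Document("", "   ", date(2024, 5, 1)),
]

accepted, rejected = split_for_indexing(docs, today)
for doc, issues in rejected:
    print(repr(doc.source), issues)
```

Rejected documents are returned with their reasons rather than silently dropped, so that a data steward can decide whether to fix, refresh or retire them.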
