Streamline RAG with New Document Preprocessing Features
As organizations increasingly seek to enhance decision-making and drive operational efficiency by making the knowledge in documents accessible via conversational applications, retrieval-augmented generation (RAG) has quickly become the most efficient and scalable approach. As RAG-based application development continues to grow, the solutions for processing and managing the documents that power these applications need to evolve with scalability and efficiency in mind. Until now, document preparation for RAG (e.g., extraction and chunking) relied on developing and deploying custom functions built on Python libraries, which can become hard to manage and scale.
To accelerate generative AI app development, we are now offering SQL functions that make PDFs and other documents AI-ready. Following the announcement of the general availability of Cortex Search, we are excited to announce two new document preprocessing functions: PARSE_DOCUMENT and SPLIT_TEXT_RECURSIVE_CHARACTER.
These functions streamline the preparation of documents, such as PDFs, for AI workloads. AI-ready data is key to delivering value via a RAG application: once documents are parsed and chunked, they can be fed into a RAG engine, improving the overall quality of the AI application.
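As a brief sketch of the two functions (the stage, file and table names here are illustrative, not part of the announcement): PARSE_DOCUMENT extracts text from a staged file, and SPLIT_TEXT_RECURSIVE_CHARACTER breaks that text into overlapping chunks.

```sql
-- Extract layout-aware text from a staged PDF (stage and file names are hypothetical)
SELECT SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
         @docs_stage,
         'product_manual.pdf',
         {'mode': 'LAYOUT'}
       ):content::STRING AS extracted_text;

-- Split previously extracted text into overlapping chunks for retrieval
SELECT SNOWFLAKE.CORTEX.SPLIT_TEXT_RECURSIVE_CHARACTER(
         extracted_text,
         'markdown',   -- format hint for the splitter
         512,          -- target chunk size in characters
         64            -- overlap between consecutive chunks
       ) AS chunks
FROM parsed_docs;     -- hypothetical table holding extracted_text
```

The chunk size and overlap values above are example settings; the right values depend on your documents and your retrieval quality targets.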
Imagine that you want to provide a sales team with a conversational app that uses a large language model (LLM) to answer questions about your company’s product portfolio. Since a pre-trained LLM alone will lack deep expertise in your company’s products, the answers generated are likely to be incorrect and of no value. To provide accurate answers, developers can use a RAG-based architecture, where the LLM retrieves relevant internal knowledge from documents, wikis or FAQs before generating a response. However, for these documents to enhance RAG quality, content must be extracted, split into smaller blocks of content (chunks) such as paragraphs or document sections, and embedded as vectors for semantic retrieval. Once preprocessing is complete, the RAG engine can begin serving answers.
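The extract-and-chunk steps above can be combined into a single preprocessing statement. This is a hedged sketch under assumed names (`@docs_stage`, `doc_chunks`): it parses every file in a stage, splits each document, and persists one row per chunk, ready for embedding and indexing.

```sql
-- Hypothetical end-to-end preprocessing: one row per chunk
CREATE OR REPLACE TABLE doc_chunks AS
SELECT
    relative_path,              -- which source file the chunk came from
    c.value::STRING AS chunk    -- the chunk text itself
FROM directory(@docs_stage),
     LATERAL FLATTEN(
       input => SNOWFLAKE.CORTEX.SPLIT_TEXT_RECURSIVE_CHARACTER(
         SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
           @docs_stage, relative_path, {'mode': 'LAYOUT'}
         ):content::STRING,
         'markdown', 512, 64
       )
     ) c;
```

Keeping `relative_path` alongside each chunk lets the downstream application cite the source document in its answers.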
In other words, your RAG application is only as good as its search capability, search is only as good as the data chunks it indexes, and high-quality text extraction is foundational to all of it.
Deliver the most relevant results
Cortex Search is a fully managed service that includes integrated embedding generation and vector management, making it a critical component of enterprise-grade RAG systems. As a hybrid search solution that combines exact keyword matching with semantic understanding, it enhances retrieval precision, capturing relevant information even when queries are phrased differently.
This hybrid approach enables RAG systems to deliver more accurate and contextually relevant responses, whether a query is narrowly focused on specific terms or explores more abstract concepts. For example, the query “headphones SKU: ABC123” will prioritize results with an exact match on “ABC123” while also returning related results about headphones, electronics, music and more — a single query surfaces both semantically similar content and precise matches for specific terms such as product SKUs or company IDs.
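As a sketch of how the pieces connect (the service, warehouse and table names are hypothetical), a Cortex Search service can be created directly over a table of preprocessed chunks — the service handles embedding generation and vector management — and then queried:

```sql
-- Create a hybrid (keyword + semantic) search service over a chunk table
CREATE OR REPLACE CORTEX SEARCH SERVICE product_docs_search
  ON chunk                          -- column to index for retrieval
  WAREHOUSE = my_wh                 -- hypothetical warehouse
  TARGET_LAG = '1 hour'             -- how fresh the index should stay
  AS (SELECT chunk, relative_path FROM doc_chunks);

-- Preview a hybrid query: exact match on the SKU plus semantic matches
SELECT SNOWFLAKE.CORTEX.SEARCH_PREVIEW(
         'product_docs_search',
         '{"query": "headphones SKU: ABC123", "limit": 5}'
       );
```

In a production application, retrieval would typically run through the Cortex Search REST or Python APIs rather than `SEARCH_PREVIEW`, which is intended for quick validation in SQL.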
This capability is particularly valuable when documents are prepared using layout-aware text extraction and chunking, helping ensure that the content is optimally structured for retrieval. By simplifying document preprocessing through short SQL functions, data engineers can efficiently prepare PDFs and other documents for gen AI without the need to write complex, lengthy Python functions. This streamlined process significantly reduces the time and effort required to make documents AI-ready.
Document preprocessing is foundational for building successful RAG applications, with PARSE_DOCUMENT and SPLIT_TEXT_RECURSIVE_CHARACTER serving as important steps in this process. These new functions significantly reduce the complexity and time required for document preprocessing. This makes it faster and simpler to get documents ready for use in RAG chatbots, helping organizations quickly build and improve their AI-powered solutions all within Snowflake.