Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for extending the capabilities of large language models (LLMs). By combining the creative generation abilities of LLMs with the factual accuracy of retrieval systems, RAG offers a remedy for one of the most persistent challenges facing LLMs: hallucination.
In this tutorial, we will build a complete RAG system using:
- FAISS (Facebook AI Similarity Search) as our vector database
- Sentence Transformers for creating high-quality embeddings
- An open-source LLM from Hugging Face (we'll use a lightweight, CPU-compatible model)
- A custom knowledge base that we will create ourselves
By the end of this tutorial, you will have a working RAG system that answers questions about your documents with improved accuracy and relevance. This approach is useful for building domain-specific assistants, customer-support systems, or any application where LLM responses must be grounded in specific documents.
Let's get started.
Step 1: Setting Up Our Environment
First, we need to install the required libraries. We'll use Google Colab for this tutorial.
# Install required packages
!pip install -q transformers==4.34.0
!pip install -q sentence-transformers==2.2.2
!pip install -q faiss-cpu==1.7.4
!pip install -q accelerate==0.23.0
!pip install -q einops==0.7.0
!pip install -q langchain==0.0.312
!pip install -q langchain_community
!pip install -q pypdf==3.15.1
Let's also check whether a GPU is available, which will speed up model inference:
import torch
# Check if GPU is available
print(f"GPU available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU name: {torch.cuda.get_device_name(0)}")
else:
    print("Running on CPU. We'll use a CPU-compatible model.")
Step 2: Creating Our Knowledge Base
For this tutorial, we'll create a simple knowledge base about AI concepts. In a real-world scenario, you would instead import PDF documents, web pages, or database content (a PDF-loading sketch appears at the end of this step).
import os
import tempfile
# Create a temporary directory for our documents
docs_dir = tempfile.mkdtemp()
print(f"Created temporary directory at {docs_dir}")
# Create sample documents about AI concepts
documents = {
"vector_databases.txt": """
Vector databases are specialized database systems designed to store, manage, and search vector embeddings efficiently.
They are crucial for machine learning applications, particularly those involving natural language processing and image recognition.
Key features of vector databases include:
1. Fast similarity search using algorithms like HNSW, IVF, or exact search
2. Support for various distance metrics (cosine, euclidean, dot product)
3. Scalability for handling billions of vectors
4. Often support for metadata filtering alongside vector search
Popular vector databases include FAISS (Facebook AI Similarity Search), Pinecone, Weaviate, Milvus, and Chroma.
FAISS specifically was developed by Facebook AI Research and is an open-source library for efficient similarity search.
""",
"embeddings.txt": """
Embeddings are dense vector representations of data in a continuous vector space.
They capture semantic meaning and relationships between entities by positioning similar items closer together in the vector space.
Types of embeddings include:
1. Word embeddings (Word2Vec, GloVe)
2. Sentence embeddings (Universal Sentence Encoder, SBERT)
3. Document embeddings
4. Image embeddings
5. Audio embeddings
Embeddings are created through various techniques, including neural networks trained on specific tasks.
Modern embedding models like those from OpenAI, Cohere, or Sentence Transformers can capture nuanced semantic relationships.
The dimensionality of embeddings typically ranges from 100 to 1536 dimensions, with higher dimensions often capturing more information but requiring more storage and computation.
""",
"rag_systems.txt": """
Retrieval-Augmented Generation (RAG) is an AI architecture that combines information retrieval with text generation.
The RAG process typically works as follows:
1. User query is converted into an embedding vector
2. Similar documents or passages are retrieved from a knowledge base using vector similarity
3. Retrieved content is provided as context to the language model
4. The language model generates a response informed by both its parameters and the retrieved information
Benefits of RAG include:
1. Reduced hallucination compared to pure generative approaches
2. Up-to-date information without model retraining
3. Attribution of information sources
4. Lower computation costs than increasing model size
RAG systems can be enhanced through techniques like reranking, query reformulation, and hybrid search approaches.
"""
}
# Write documents to files
for filename, content in documents.items():
    with open(os.path.join(docs_dir, filename), 'w') as f:
        f.write(content)
print(f"Created {len(documents)} documents in {docs_dir}")
Step 3: Loading and Processing Documents
Now let's load these documents and process them for our RAG system:
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Initialize a list to store our documents
all_documents = []
# Load each text file
for filename in documents.keys():
    file_path = os.path.join(docs_dir, filename)
    loader = TextLoader(file_path)
    loaded_docs = loader.load()
    all_documents.extend(loaded_docs)
print(f"Loaded {len(all_documents)} documents")
# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ".", " ", ""]
)
document_chunks = text_splitter.split_documents(all_documents)
print(f"Created {len(document_chunks)} document chunks")
# Let's look at a sample chunk
print("nSample chunk content:")
print(document_chunks[0].page_content)
print(f"Source: {document_chunks[0].metadata}")
Step 4: Creating Embeddings
Now let's convert our document chunks into vector embeddings:
from sentence_transformers import SentenceTransformer
import numpy as np
# Initialize the embedding model
model_name = "sentence-transformers/all-MiniLM-L6-v2" # A good balance of speed and quality
embedding_model = SentenceTransformer(model_name)
print(f"Loaded embedding model: {model_name}")
print(f"Embedding dimension: {embedding_model.get_sentence_embedding_dimension()}")
# Create embeddings for all document chunks
texts = [doc.page_content for doc in document_chunks]
embeddings = embedding_model.encode(texts)
print(f"Created {len(embeddings)} embeddings with shape {embeddings.shape}")
Step 5: Building the FAISS Vector Store
Now we'll build our FAISS index from these embeddings:
import faiss
# Get the dimensionality of our embeddings
dimension = embeddings.shape[1]
# Create a FAISS index - we'll use a simple Flat L2 index for demonstration
# For larger datasets, consider using indexes like IVF or HNSW for better performance
index = faiss.IndexFlatL2(dimension) # L2 is Euclidean distance
# Add our vectors to the index
index.add(embeddings.astype(np.float32)) # FAISS requires float32
print(f"Created FAISS index with {index.ntotal} vectors")
# Create a mapping from index position to document chunk for retrieval
index_to_doc_chunk = {i: doc for i, doc in enumerate(document_chunks)}
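The comment above mentions IVF and HNSW indexes for larger datasets. Here is a minimal sketch of an IVF index, purely as an illustration and not used in the rest of the tutorial; nlist is an assumed tuning parameter and would be much larger for a real corpus.
# Sketch only: an IVF index for larger corpora (our toy corpus does not need it)
nlist = 4  # number of clusters; typically in the hundreds or thousands for real datasets
quantizer = faiss.IndexFlatL2(dimension)
ivf_index = faiss.IndexIVFFlat(quantizer, dimension, nlist)
ivf_index.train(embeddings.astype(np.float32))  # IVF indexes must be trained before adding vectors
ivf_index.add(embeddings.astype(np.float32))
ivf_index.nprobe = 2  # number of clusters to scan at query time (speed/recall trade-off)
print(f"IVF index contains {ivf_index.ntotal} vectors")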
Step 6: Loading the Language Model
Now let's load an open-source language model from Hugging Face. We'll use a smaller model that runs well on a CPU:
from transformers import AutoTokenizer, AutoModelForCausalLM
# We'll use a smaller model that works on CPU
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float32,  # Use float32 for CPU compatibility
    device_map="auto"  # Will use CPU if GPU is not available
)
print(f"Successfully loaded {model_id}")
Step 7: Creating Our RAG Pipeline
Let's create a function that combines retrieval and generation:
def rag_response(query, index, embedding_model, llm_model, llm_tokenizer, index_to_doc_map, top_k=3):
    """
    Generate a response using the RAG pattern.
    Args:
        query: The user's question
        index: FAISS index
        embedding_model: Model to create embeddings
        llm_model: Language model for generation
        llm_tokenizer: Tokenizer for the language model
        index_to_doc_map: Mapping from index positions to document chunks
        top_k: Number of documents to retrieve
    Returns:
        response: The generated response
        sources: The source documents used
    """
    # Step 1: Convert query to embedding
    query_embedding = embedding_model.encode([query])
    query_embedding = query_embedding.astype(np.float32)  # Convert to float32 for FAISS
    # Step 2: Search for similar documents
    distances, indices = index.search(query_embedding, top_k)
    # Step 3: Retrieve the actual document chunks
    retrieved_docs = [index_to_doc_map[idx] for idx in indices[0]]
    # Create context from retrieved documents
    context = "\n\n".join([doc.page_content for doc in retrieved_docs])
    # Step 4: Create prompt for the LLM (TinyLlama chat format)
    prompt = f"""<|system|>
You are a helpful AI assistant. Answer the question based only on the provided context.
If you don't know the answer based on the context, say "I don't have enough information to answer this question."
Context:
{context}
<|user|>
{query}
<|assistant|>"""
    # Step 5: Generate a response from the LLM
    input_ids = llm_tokenizer(prompt, return_tensors="pt").input_ids.to(llm_model.device)
    generation_config = {
        "max_new_tokens": 256,
        "temperature": 0.7,
        "top_p": 0.95,
        "do_sample": True
    }
    # Generate the output
    with torch.no_grad():
        output = llm_model.generate(
            input_ids=input_ids,
            **generation_config
        )
    # Decode the output
    generated_text = llm_tokenizer.decode(output[0], skip_special_tokens=True)
    # Extract the assistant's response (remove the prompt)
    response = generated_text.split("<|assistant|>")[-1].strip()
    # Return both the response and the sources
    sources = [(doc.page_content, doc.metadata) for doc in retrieved_docs]
    return response, sources
Step 8: Testing Our RAG System
Let's test the system with a few questions:
# Define some test questions
test_questions = [
    "What is FAISS and what is it used for?",
    "How do embeddings capture semantic meaning?",
    "What are the benefits of RAG systems?",
    "How does vector search work?"
]
# Test our RAG pipeline
for question in test_questions:
    print(f"\n\n{'='*50}")
    print(f"Question: {question}")
    print(f"{'='*50}\n")
    response, sources = rag_response(
        query=question,
        index=index,
        embedding_model=embedding_model,
        llm_model=model,
        llm_tokenizer=tokenizer,
        index_to_doc_map=index_to_doc_chunk,
        top_k=2  # Retrieve top 2 most relevant chunks
    )
    print(f"Response: {response}\n")
    print("Sources:")
    for i, (content, metadata) in enumerate(sources):
        print(f"\nSource {i+1}:")
        print(f"Metadata: {metadata}")
        print(f"Content snippet: {content[:100]}...")
Output:
Step 9: Evaluating and Improving Our RAG System
Let's implement a simple evaluation function to assess how well our RAG system performs:
def evaluate_rag_response(question, response, retrieved_sources, ground_truth_sources=None):
    """
    Simple evaluation of RAG response quality
    Args:
        question: The query
        response: Generated response
        retrieved_sources: Sources used for generation
        ground_truth_sources: (Optional) Known correct sources
    Returns:
        evaluation metrics
    """
    # Basic metrics
    response_length = len(response.split())
    num_sources = len(retrieved_sources)
    # Simple relevance score - we'd use better methods in production
    source_relevance = []
    for content, _ in retrieved_sources:
        # Count overlapping words between question and source
        q_words = set(question.lower().split())
        s_words = set(content.lower().split())
        overlap = len(q_words.intersection(s_words))
        source_relevance.append(overlap / len(q_words) if q_words else 0)
    avg_relevance = sum(source_relevance) / len(source_relevance) if source_relevance else 0
    return {
        "response_length": response_length,
        "num_sources": num_sources,
        "source_relevance_scores": source_relevance,
        "avg_relevance": avg_relevance
    }
# Evaluate one of our previous responses
question = test_questions[0]
response, sources = rag_response(
    query=question,
    index=index,
    embedding_model=embedding_model,
    llm_model=model,
    llm_tokenizer=tokenizer,
    index_to_doc_map=index_to_doc_chunk,
    top_k=2
)
# Run evaluation
eval_results = evaluate_rag_response(question, response, sources)
print(f"\nEvaluation results for question: '{question}'")
for metric, value in eval_results.items():
    print(f"{metric}: {value}")
Step 10: Advanced RAG Techniques - Query Expansion
Let's implement query expansion to improve retrieval:
# Here's the implementation of the expand_query function:
def expand_query(original_query, llm_model, llm_tokenizer):
    """
    Generate multiple search queries from an original query to improve retrieval
    Args:
        original_query: The user's original question
        llm_model: The language model for generating variations
        llm_tokenizer: Tokenizer for the language model
    Returns:
        List of query variations including the original
    """
    # Create a prompt for query expansion
    prompt = f"""<|system|>
You are a helpful assistant. Generate two alternative versions of the given search query.
The goal is to create variations that might help retrieve relevant information.
Only list the alternative queries, one per line. Do not include any explanations, numbering, or other text.
<|user|>
Generate alternative versions of this search query: "{original_query}"
<|assistant|>"""
    # Generate variations
    input_ids = llm_tokenizer(prompt, return_tensors="pt").input_ids.to(llm_model.device)
    with torch.no_grad():
        output = llm_model.generate(
            input_ids=input_ids,
            max_new_tokens=100,
            temperature=0.7,
            do_sample=True
        )
    # Decode the output
    generated_text = llm_tokenizer.decode(output[0], skip_special_tokens=True)
    # Extract the generated variations
    response_part = generated_text.split("<|assistant|>")[-1].strip()
    # Split the response by lines to get individual variations
    variations = [line.strip() for line in response_part.split('\n') if line.strip()]
    # Ensure we have at least some variations
    if not variations:
        variations = [original_query]
    # Add the original query and return the list with duplicates removed
    all_queries = [original_query] + variations
    return list(dict.fromkeys(all_queries))  # Remove duplicates while preserving order
Step 11: Testing and Applying Our expand_query Function
Let's try out the expand_query function, then use the query variations it produces to retrieve documents and generate a response:
# Example usage of the expand_query function
test_query = "How does FAISS help with vector search?"
# Generate query variations
expanded_queries = expand_query(
    original_query=test_query,
    llm_model=model,
    llm_tokenizer=tokenizer
)
print(f"Original Query: {test_query}")
print("Expanded Queries:")
for i, query in enumerate(expanded_queries):
    print(f"  {i+1}. {query}")
# Enhanced RAG with query expansion
all_retrieved_docs = []
all_scores = {}
# Retrieve documents for each query variation
for query in expanded_queries:
    # Get the query embedding
    query_embedding = embedding_model.encode([query]).astype(np.float32)
    # Search the FAISS index
    distances, indices = index.search(query_embedding, 3)
    # Track document scores across queries (using 1/(1+distance) as the score)
    for idx, dist in zip(indices[0], distances[0]):
        score = 1.0 / (1.0 + dist)
        if idx in all_scores:
            # Take the max score if a document is retrieved by multiple query variations
            all_scores[idx] = max(all_scores[idx], score)
        else:
            all_scores[idx] = score
# Get the top documents based on scores
top_indices = sorted(all_scores.keys(), key=lambda idx: all_scores[idx], reverse=True)[:3]
expanded_retrieved_docs = [index_to_doc_chunk[idx] for idx in top_indices]
print("\nRetrieved documents using query expansion:")
for i, doc in enumerate(expanded_retrieved_docs):
    print(f"\nResult {i+1}:")
    print(f"Source: {doc.metadata['source']}")
    print(f"Content snippet: {doc.page_content[:150]}...")
# Now use these documents with the LLM to generate a response
context = "\n\n".join([doc.page_content for doc in expanded_retrieved_docs])
# Create the prompt for the LLM
prompt = f"""<|system|>
You are a helpful AI assistant. Answer the question based only on the provided context.
If you don't know the answer based on the context, say "I don't have enough information to answer this question."
Context:
{context}
<|user|>
{test_query}
<|assistant|>"""
# Generate the response
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
with torch.no_grad():
    output = model.generate(
        input_ids=input_ids,
        max_new_tokens=256,
        temperature=0.7,
        top_p=0.95,
        do_sample=True
    )
# Extract the response
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
response = generated_text.split("<|assistant|>")[-1].strip()
print("\nFinal RAG Response with Query Expansion:")
print(response)
Output:
FAISS can handle multiple vector types, including text, images, and audio, and can be integrated with popular machine learning frameworks such as TensorFlow, PyTorch, and Sklearn.
Conclusion
In this tutorial, we built a complete RAG system using FAISS as the vector database together with an open-source LLM: we implemented document processing, embedding generation, and vector indexing, and then combined these components with query expansion to improve retrieval quality.
From here, you could also consider:
- Implementing query reranking with a cross-encoder (a sketch follows this list)
- Creating a web interface with Gradio or Streamlit
- Adding metadata filtering
- Experimenting with different embedding models
- Scaling the solution with more efficient FAISS indexes (HNSW, IVF)
- Fine-tuning the LLM on domain-specific data
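As a starting point for the first suggestion, here is a minimal sketch of cross-encoder reranking with the sentence-transformers library. The model name is one common public choice rather than a requirement, and the helper below is a hypothetical addition, not part of the pipeline we built above.
from sentence_transformers import CrossEncoder

# Sketch: rerank retrieved chunks with a cross-encoder (model choice is an assumption)
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, docs, top_k=3):
    """Score (query, chunk) pairs jointly and keep the highest-scoring chunks."""
    pairs = [(query, doc.page_content) for doc in docs]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

# Example: rerank the chunks retrieved earlier with query expansion
reranked_docs = rerank(test_query, expanded_retrieved_docs, top_k=2)
print([doc.metadata["source"] for doc in reranked_docs])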