项目实战 | 使用Neo4j和LangChain集成非结构化和图知识增强QA

1,638次阅读

利用 Neo4j 矢量索引和 GraphCypherQAChain 优化信息合成，以通过 Mistral-7b 生成知情响应。

随着大型语言模型（LLM）和知识图的出现，不断发展的信息检索和知识提取领域发生了显着的变化，特别是在多跳问答的背景下。该项目利用 Neo4j 矢量索引和Neo4j 图数据库的强大功能来实现检索增强生成系统，旨在为用户查询提供精确且上下文丰富的答案。该系统采用矢量相似性搜索来搜索非结构化信息，同时访问图数据库来检索结构化数据，确保响应不仅全面，而且植根于经过验证的知识。这种方法对于解决多跳问题尤其重要，其中单个查询可以分解为多个子问题，并且可能需要来自大量文档的信息才能生成准确的答案。

项目实战 | 使用Neo4j和LangChain集成非结构化和图知识增强QA 知识密集型检索增强生成架构。

在数据既丰富又复杂的时代，上述系统是一个关键工具，确保用户查询得到的答案是广泛的知识和经过验证的准确性的无缝结合，从而弥合了非结构化数据和数据之间的差距。结构化知识图。最后一步，系统将检索到的非结构化和结构化信息传递到新的大型语言模型Mistral-7b中，用于文本生成。这种集成确保生成的响应不仅由模型中封装的大量知识提供信息，而且还通过从矢量和图形数据库检索的特定实时数据进行微调和丰富，从而提供细致入微、准确且符合上下文的信息和相关的用户体验。

Neo4j向量索引

Neo4j 矢量索引已成为检索增强生成 (RAG) 应用领域的关键工具，特别是在处理结构化和非结构化数据方面。LangChain库是构建大型语言模型 (LLM) 应用程序的重要框架，集成了对 Neo4j 矢量索引的全面支持，从而简化了 RAG 应用程序中的数据摄取和查询。这种集成不仅有助于将数据高效引入 Neo4j 矢量索引，而且还能够构建有效的 RAG 应用程序，通过利用结构化和非结构化数据提供实时、准确且与上下文相关的答案。

GraphCypherQAChain

GraphCypherQAChain类在使用自然语言问题查询图数据库（特别是 Neo4j）领域发挥功能作用。它使用大型语言模型从输入问题生成 Cypher 查询，针对 Neo4j 图数据库执行它们，并根据查询结果提供答案。该实用程序方便用户检索特定数据，而无需编写复杂的 Cypher 查询，从而使存储在图数据库中的数据更易于访问且更易于交互。

Mistral 7B

Mistral 7B是最新的大型语言模型，因其在一系列基准测试中的卓越性能而受到认可，展示了处理各种语言任务和查询的熟练程度，如下图所示。在检索增强生成 (RAG) 架构中，Mistral 7B 发挥着关键作用，它根据矢量和图搜索检索到的信息合成和生成文本，确保输出不仅上下文丰富，而且能够根据用户的查询精确定制。它有效地弥合了非结构化数据和结构化知识图之间的差距，提供混合了预先训练的知识和实时、经过验证的数据的答案。

项目实战 | 使用Neo4j和LangChain集成非结构化和图知识增强QA

执行

从安装依赖项开始。

请参阅GitHub 存储库（https://github.com/sauravjoshi23/ai/blob/main/retrieval%20augmented%20generation/integrated-qa-neo4j-langchain.ipynb）以获取完整的 Jupyter 笔记本。

%pip install langchain openai wikipedia tiktoken neo4j python-dotenv transformers
%pip install -U sagemaker

Neo4j向量索引

首先导入必要的库和模块，为与数据集准备、Neo4j 矢量索引的接口以及利用 Mistral 7B 的文本生成功能奠定基础。利用dotenv，它可以安全地加载环境变量，保护 OpenAI API 和 Neo4j 数据库的敏感凭据。

import os
import re
from langchain.vectorstores.neo4j_vector import Neo4jVector
from langchain.document_loaders import WikipediaLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from dotenv import load_dotenv

load_dotenv()
os.environ[“OPENAI_API_KEY”] = os.getenv(‘OPENAI_API_KEY’)
os.environ[“NEO4J_URI”] = os.getenv(‘NEO4J_URI’)
os.environ[“NEO4J_USERNAME”] = os.getenv(‘NEO4J_USERNAME’)
os.environ[“NEO4J_PASSWORD”] = os.getenv(‘NEO4J_PASSWORD’)

在这里，使用 Leonhard Euler 的维基百科页面进行实验。使用 bert-base-uncased 模型来标记文本。WikipediaLoader 加载指定页面的原始内容，然后使用 LangChain 的 RecursiveCharacterTextSplitter 将其分成更小的文本片段。该拆分器确保每个块最大化为 200 个标记，其中重叠 20 个标记，遵守嵌入模型的上下文窗口限制，并确保不会丢失上下文的连续性。

tokenizer = AutoTokenizer.from_pretrained(“bert-base-uncased”)

def bert_len(text):
tokens = tokenizer.encode(text)
return len(tokens)

raw_documents = WikipediaLoader(query=“Leonhard Euler”).load()
text_splitter = RecursiveCharacterTextSplitter(
chunk_size = 200,
chunk_overlap = 20,
length_function = bert_len,
separators=[‘nn’, ‘n’, ‘ ‘, ”],
)

documents = text_splitter.create_documents([raw_documents[0].page_content])

分块文档作为节点实例化到 Neo4j 向量索引中。它使用 Neo4j 图数据库和 OpenAI 嵌入的核心功能来构建该向量索引。

# Instantiate Neo4j vector from documents
neo4j_vector = Neo4jVector.from_documents(
documents,
OpenAIEmbeddings(),
url=os.environ[“NEO4J_URI”],
username=os.environ[“NEO4J_USERNAME”],
password=os.environ[“NEO4J_PASSWORD”]
)

在提取向量索引中的文档后，对示例用户查询执行向量相似度搜索并检索前 2 个最相似的文档。

query = “Who were the siblings of Leonhard Euler?”
vector_results = neo4j_vector.similarity_search(query, k=2)
for i, res in enumerate(vector_results):
print(res.page_content)
if i != len(vector_results)-1:
print()
vector_result = vector_results[0].page_content

项目实战 | 使用Neo4j和LangChain集成非结构化和图知识增强QA

构建知识图谱

受到NaLLM项目的高度启发，使用他们的开源项目从非结构化数据构建知识图。下面是使用 Leonhard Euler 的维基百科文章中的单个文档块构建的知识图。项目实战 | 使用Neo4j和LangChain集成非结构化和图知识增强QA

Leonhard Euler 知识图。

在深入研究该项目后，学到了很多关于使用LLMs构建知识图谱的知识。例如，以下是从非结构化文本中捕获实体和关系的提示：

“””
You are a data scientist working for a company that is building a graph database. Your task is to extract information from data and convert it into a graph database.
Provide a set of Nodes in the form [ENTITY_ID, TYPE, PROPERTIES] and a set of relationships in the form [ENTITY_ID_1, RELATIONSHIP, ENTITY_ID_2, PROPERTIES].
It is important that the ENTITY_ID_1 and ENTITY_ID_2 exists as nodes with a matching ENTITY_ID. If you can’t pair a relationship with a pair of nodes don’t add it.
When you find a node or relationship you want to add try to create a generic TYPE for it that describes the entity you can also think of it as a label.

Example:
Data: Alice lawyer and is 25 years old and Bob is her roommate since 2001. Bob works as a journalist. Alice owns a the webpage www.alice.com and Bob owns the webpage www.bob.com.
Nodes: [“alice”, “Person”, {“age”: 25, “occupation”: “lawyer”, “name”:”Alice”}], [“bob”, “Person”, {“occupation”: “journalist”, “name”: “Bob”}], [“alice.com”, “Webpage”, {“url”: “www.alice.com”}], [“bob.com”, “Webpage”, {“url”: “www.bob.com”}]
Relationships: [“alice”, “roommate”, “bob”, {“start”: 2021}], [“alice”, “owns”, “alice.com”, {}], [“bob”, “owns”, “bob.com”, {}]
“””

有很多功能很有趣，同时可以改进。

Neo4j DB QA chain

Neo4j DB QA 链

接下来，导入必要的库来设置 Neo4j DB QA 链。

from langchain.chat_models import ChatOpenAI
from langchain.chains import GraphCypherQAChain
from langchain.graphs import Neo4jGraph

构建图表后，需要连接到 Neo4jGraph 实例并可视化模式。

graph = Neo4jGraph(
url=os.environ[“NEO4J_URI”], username=os.environ[“NEO4J_USERNAME”], password=os.environ[“NEO4J_PASSWORD”]
)

print(graph.schema)Node properties are the following:
[{‘labels’: ‘Person’, ‘properties’: [{‘property’: ‘name’, ‘type’: ‘STRING’}, {‘property’: ‘nationality’, ‘type’: ‘STRING’}, {‘property’: ‘death_date’, ‘type’: ‘STRING’}, {‘property’: ‘birth_date’, ‘type’: ‘STRING’}]}, {‘labels’: ‘Location’, ‘properties’: [{‘property’: ‘name’, ‘type’: ‘STRING’}]}, {‘labels’: ‘Organization’, ‘properties’: [{‘property’: ‘name’, ‘type’: ‘STRING’}]}, {‘labels’: ‘Publication’, ‘properties’: [{‘property’: ‘name’, ‘type’: ‘STRING’}]}]
Relationship properties are the following:
[]
The relationships are the following:
[‘(:Person)-[:worked_at]->(:Organization)’, ‘(:Person)-[:influenced_by]->(:Person)’, ‘(:Person)-[:born_in]->(:Location)’, ‘(:Person)-[:lived_in]->(:Location)’, ‘(:Person)-[:child_of]->(:Person)’, ‘(:Person)-[:sibling_of]->(:Person)’, ‘(:Person)-[:published]->(:Publication)’]

GraphCycherQAChain 抽象所有细节并输出自然语言问题 (NLQ) 的自然语言响应。然而，它在内部使用 LLM 生成 NLQ 的 Cypher 查询，并从图数据库检索图结果，最后使用这些结果生成最终的自然语言响应，再次使用 LLM。

chain = GraphCypherQAChain.from_llm(
ChatOpenAI(temperature=0), graph=graph, verbose=True
)

graph_result = chain.run(“Who were the siblings of Leonhard Euler?”)

项目实战 | 使用Neo4j和LangChain集成非结构化和图知识增强QA

graph_result‘Thesiblings of Leonhard Euler were Maria Magdalena and Anna Maria.’

Mistral-7b-指令

在 AWS SageMaker 环境中从 Hugging Face 设置 Mistral-7B 终端节点。

import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
role = sagemaker.get_execution_role()
except ValueError:
iam = boto3.client(‘iam’)
role = iam.get_role(RoleName=‘sagemaker_execution_role’)[‘Role’][‘Arn’]

hub = {
‘HF_MODEL_ID’:‘mistralai/Mistral-7B-Instruct-v0.1’,
‘SM_NUM_GPUS’: json.dumps(1)
}

huggingface_model = HuggingFaceModel(
image_uri=get_huggingface_llm_image_uri(“huggingface”,version=“1.1.0”),
env=hub,
role=role,
)

最终响应是通过构造提示来制作的，该提示包括指令、向量索引中的相关数据、图数据库中的相关信息以及用户的查询。然后，该提示会传递给 Mistral-7b 模型，该模型会根据提供的信息生成有意义且准确的响应。

mistral7b_predictor = huggingface_model.deploy(
initial_instance_count=1,
instance_type=“ml.g5.4xlarge”,
container_startup_health_check_timeout=300,
)

query = “Who were the siblings of Leonhard Euler?”
final_prompt = f”””You are a helpful question-answering agent. Your task is to analyze
and synthesize information from two sources: the top result from a similarity search
(unstructured information) and relevant data from a graph database (structured information).
Given the user’s query: {query}, provide a meaningful and efficient answer based
on the insights derived from the following data:

Unstructured information: {vector_result}.
Structured information: {graph_result}.
“””

response = mistral7b_predictor.predict({
“inputs”: final_prompt,
})

print(re.search(r”Answer: (.+)”, response[0][‘generated_text’]).group(1))The siblings of Leonhard Euler were Maria Magdalena and Anna Maria.

要点

Neo4j Vector Index 和 GraphCypherQAChain 与 Mistral-7b 的集成提供了处理复杂数据的强大系统，有效地弥合了大量非结构化数据和复杂图知识之间的差距，通过综合两个数据源的信息，为用户查询提供全面、准确的响应利用 Neo4j 进行矢量相似性搜索和图数据库检索，确保生成的响应不仅由 Mistral-7b 的大量预先训练的知识提供信息，而且还通过来自矢量和图数据库的实时数据进行上下文丰富和验证。最后，目标是在未来的实验中尝试多跳查询，因为最初建立模块化管道对于适应快速发展的人工智能领域是必要的。

概括

该项目强调了 Neo4j Vector Index 和 LangChain 的 GraphCypherQAChain 的有效组合，分别可以浏览非结构化数据和图知识，然后利用 Mistral-7b 生成明智且准确的响应。通过使用 Neo4j 从向量索引和图数据库检索相关信息，系统确保生成的响应不仅上下文丰富，而且锚定在经过验证的实时知识中。该实现展示了检索增强生成的实际应用，其中利用来自不同数据源的综合信息来生成响应，这些响应是预先训练的知识和特定的实时数据的和谐混合，从而提高了预测的准确性和相关性。对用户查询的响应。

References

Neo4j and Large Language Models (LLMs) https://github.com/neo4j/NaLLM/tree/main
Knowledge Graphs & LLMs: Harnessing Large Language Models with Neo4j https://medium.com/neo4j/harnessing-large-language-models-with-neo4j-306ccbdd2867
Knowledge Graphs & LLMs: Fine-Tuning Vs. Retrieval-Augmented Generation https://medium.com/neo4j/knowledge-graphs-llms-fine-tuning-vs-retrieval-augmented-generation-30e875d63a35
Knowledge Graphs & LLMs: Multi-Hop Question Answering https://medium.com/neo4j/knowledge-graphs-llms-multi-hop-question-answering-322113f53f51
LangChain Library Adds Full Support for Neo4j Vector Index https://medium.com/neo4j/langchain-library-adds-full-support-for-neo4j-vector-index-fa94b8eab334
Mistral 7B https://mistral.ai/news/announcing-mistral-7b/
Knowledge Graph Construction Demo from raw text using an LLM https://www.youtube.com/watch?v=Hg4ahTQlBm0