LangChain: A First Look at LangChain

Everything here uses Python 3.10, with conda managing the Python environment.

The code written along the way is in the GitHub repository.

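What is LangChain

LangChain is a framework for building applications on top of large language models (LLMs). It provides a set of tools, components, and interfaces that simplify working with LLMs. As the name suggests, it lets you link multiple components together like the links of a chain.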

Learning from the official site

LangSmith

Productionization: Use LangSmith to inspect, monitor and evaluate your applications, so that you can continuously optimize and deploy with confidence.

Used to inspect, monitor, and evaluate applications, which makes them easier to optimize and deploy.

LangGraph

Development: Build your applications using LangChain’s open-source components and third-party integrations. Use LangGraph to build stateful agents with first-class streaming and human-in-the-loop support.

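LangChain's open-source components and third-party integrations are used to build applications quickly. LangGraph builds stateful agents with first-class streaming and human-in-the-loop support.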

LangGraph Platform

Deployment: Turn your LangGraph applications into production-ready APIs and Assistants with LangGraph Platform.

A deployment platform for LangGraph applications.

Tutorial chapters

Build a Simple LLM Application

Goals

Learn to call a large language model
Learn to add a prompt template
Learn to trace the application with LangSmith

Installation

# Install directly with pip
pip install langchain
# Or install with conda
conda install langchain -c conda-forge

Connecting to LangSmith

LangSmith

After signing up and logging in, you can obtain an API key.

Add it to the environment your code runs in:

import os

# Set the API keys
# I use Gemini here
os.environ["GOOGLE_API_KEY"] = "*********"
# "true" turns on LangSmith tracing
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_PROJECT"] = "PROJECT_NAME"  # project name shown in LangSmith
os.environ["LANGSMITH_API_KEY"] = "*******"

# If you can reach your LLM provider directly, you don't need a proxy
# Set the HTTP and HTTPS proxies
os.environ["http_proxy"] = "http://127.0.0.1:7897"
os.environ["https_proxy"] = "http://127.0.0.1:7897"

# If you use a SOCKS5 proxy (e.g. Clash/V2rayN)
os.environ["ALL_PROXY"] = "socks5://127.0.0.1:7897"
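Hard-coded keys are easy to leak when the file is shared. The following is a minimal sketch, assuming you export the keys in your shell instead and want the script to report early when one is missing. `missing_env_vars` and `REQUIRED_VARS` are hypothetical names for illustration, not part of LangChain or LangSmith.

```python
import os

# The variable names are the ones set in the snippet above.
REQUIRED_VARS = ["GOOGLE_API_KEY", "LANGSMITH_API_KEY"]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required keys are set.")
```

Passing a plain dict as `env` makes the helper easy to check without touching the real environment.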

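Configuring and using the LLM

The official site has the configuration for OpenAI and for other platforms in its model integrations docs.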

from langchain_google_genai import ChatGoogleGenerativeAI

# Initialize the LLM
llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-pro-exp-02-05",
    temperature=0,
    # other parameters...
)

Sending messages with the LLM

from langchain_core.messages import HumanMessage, SystemMessage

messages = [
    SystemMessage("Translate the following from English into Italian"),
    HumanMessage("hi!"),
]

llm.invoke(messages)
# The following calls are equivalent ways of sending "Hello"
llm.invoke("Hello")
llm.invoke([{"role": "user", "content": "Hello"}])
llm.invoke([HumanMessage("Hello")])

Streaming

from langchain_core.messages import HumanMessage, SystemMessage

messages = [
    SystemMessage("Translate the following from English into Italian"),
    HumanMessage("hi!"),
]

for token in llm.stream(messages):
    print(token.content, end="|")

Using a prompt template

Templates let you use placeholders.

from langchain_core.prompts import ChatPromptTemplate

system_template = "Translate the following from English into {language}"

prompt_template = ChatPromptTemplate.from_messages(
    [("system", system_template), ("user", "{text}")]
)

prompt = prompt_template.invoke({"language": "Italian", "text": "hi!"})

prompt

ChatPromptTemplate supports multiple message roles in a single template; the user fills in the placeholders by passing a dict.
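To make the placeholder idea concrete without an API key, here is a plain-Python sketch that mimics the substitution step with `str.format`. It is not how ChatPromptTemplate is implemented; `fill_messages` is a hypothetical helper that only mirrors the idea of filling `{language}` and `{text}` from a dict.

```python
def fill_messages(template_messages, values):
    """Format each (role, template) pair with the given placeholder values."""
    return [(role, template.format(**values)) for role, template in template_messages]


template_messages = [
    ("system", "Translate the following from English into {language}"),
    ("user", "{text}"),
]

messages = fill_messages(template_messages, {"language": "Italian", "text": "hi!"})
for role, content in messages:
    print(f"{role}: {content}")
```

Each role keeps its own template string, and a single dict supplies every placeholder, which is the same calling shape as `prompt_template.invoke({...})` above.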

Build a semantic search engine

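Goals

Learn to load documents and use document loaders (for example, importing a PDF)
Learn to use a text splitter
Learn about embeddings
Learn about vector stores and retrievers

Documents and document loaders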

LangChain implements a Document abstraction, which is intended to represent a unit of text and associated metadata. It has three attributes:

page_content: a string representing the content;

metadata: a dict containing arbitrary metadata;

id: (optional) a string identifier for the document.

The metadata attribute can capture information about the source of the document, its relationship to other documents, and other information. Note that an individual Document object often represents a chunk of a larger document.

In short, a Document pairs a piece of text (page_content) with a metadata dict and an optional id. The metadata records where the text came from and how it relates to other documents, and a single Document usually represents one part of a larger file.

# Example Document objects
from langchain_core.documents import Document

documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"source": "mammal-pets-doc"},
    ),
]
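The three attributes can be pictured as a small dataclass. This is a sketch only: `SimpleDocument` and `filter_by_source` are hypothetical stand-ins, not LangChain's Document class, and the "other-doc" source is made up for the example. It shows one typical use of metadata, selecting chunks by where they came from.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class SimpleDocument:
    """Hypothetical stand-in mirroring page_content, metadata and id."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)
    id: Optional[str] = None


def filter_by_source(docs: List[SimpleDocument], source: str) -> List[SimpleDocument]:
    """Keep only the documents whose metadata names the given source."""
    return [doc for doc in docs if doc.metadata.get("source") == source]


docs = [
    SimpleDocument("Dogs are great companions.", {"source": "mammal-pets-doc"}),
    SimpleDocument("Cats are independent pets.", {"source": "mammal-pets-doc"}),
    SimpleDocument("A chunk from somewhere else.", {"source": "other-doc"}),
]

pets = filter_by_source(docs, "mammal-pets-doc")
print(len(pets))
```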

LangChain implements document loaders that integrate with hundreds of common sources.

PyPDFLoader is used as the example here.

from langchain_community.document_loaders import PyPDFLoader

# The file is linked from GitHub in the official docs, or can be downloaded from my GitHub code
file_path = "./example_data/nke-10k-2023.pdf"
loader = PyPDFLoader(file_path)

docs = loader.load()

print(f"{docs[0].page_content[:200]}\n")
print(docs[0].metadata)

Text splitting

How the official docs explain the purpose of text splitting:

For both information retrieval and downstream question-answering purposes, a page may be too coarse a representation. Our goal in the end will be to retrieve Document objects that answer an input query, and further splitting our PDF will help ensure that the meanings of relevant portions of the document are not “washed out” by surrounding text.

We can use text splitters for this purpose. Here we will use a simple text splitter that partitions based on characters. We will split our documents into chunks of 1000 characters with 200 characters of overlap between chunks. The overlap helps mitigate the possibility of separating a statement from important context related to it. We use the RecursiveCharacterTextSplitter, which will recursively split the document using common separators like new lines until each chunk is the appropriate size. This is the recommended text splitter for generic text use cases.

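In short, a whole page is too coarse a unit for retrieval and question answering, so the PDF is split further to keep the meaning of the relevant passages from being diluted by the surrounding text.

The splitter used here is character-based: chunks of 1000 characters with 200 characters of overlap between adjacent chunks, where the overlap lowers the chance of separating a statement from the context it depends on. RecursiveCharacterTextSplitter splits recursively on common separators such as newlines until each chunk is small enough, and it is the recommended splitter for general text.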

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Chunks of 1000 characters, with 200 characters of overlap between adjacent chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

len(all_splits)
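To see what chunk size and overlap mean in isolation, here is a simplified sliding-window splitter in plain Python. It is a sketch under stated assumptions: unlike RecursiveCharacterTextSplitter it cuts at fixed character offsets and never looks for separators, and `split_with_overlap` is a hypothetical name. The start offset it returns plays a role similar to the start index requested with `add_start_index=True`.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character chunks that overlap their neighbours.

    Returns a list of (start_index, chunk) pairs.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


# 2500 characters of sample text: the digits 0-9 repeated.
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample, chunk_size=1000, chunk_overlap=200)
for start, chunk in chunks:
    print(start, len(chunk))
```

With these settings the window advances 800 characters at a time, so the last 200 characters of one chunk are repeated as the first 200 of the next.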

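Embeddings / vector stores

This refers to vector search: we convert the data into numeric vectors and then store them in a commonly used vector database.

I use Gemini's embeddings here; the official docs list other models.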

from langchain_google_genai import GoogleGenerativeAIEmbeddings


# Convert text into numeric vectors
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
vector_1 = embeddings.embed_query(all_splits[0].page_content)
vector_2 = embeddings.embed_query(all_splits[1].page_content)

assert len(vector_1) == len(vector_2)
print(f"Generated vectors of length {len(vector_1)}\n")
print(vector_1[:10])
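Once texts are vectors, "similar" becomes a number computed from two lists of floats. A minimal sketch using cosine similarity on toy vectors follows; cosine is one common choice, and which metric a given vector store actually uses depends on how it is configured. `cosine_similarity` is written here by hand, not imported from LangChain.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Vectors pointing the same way score close to 1 regardless of length, and unrelated (orthogonal) vectors score 0.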

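I use Milvus; you can set up a remote database on its official site to use for testing.

# Import the vector store
from langchain_milvus import Milvus

# Connect to the Milvus server
vector_store = Milvus(
    embedding_function=embeddings,
    connection_args={
        "uri": "https://********.serverless.gcp-us-west1.cloud.zilliz.com",
        "token": "**********",
    },
    auto_id=True,  # let Milvus generate IDs automatically
)

# Index the chunks so that the queries below have something to search
ids = vector_store.add_documents(documents=all_splits)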

Specifically, a vector store can be queried in the following ways:

Synchronously and asynchronously;
By string query and by vector;
With and without returning similarity scores;
By similarity and maximum marginal relevance (to balance similarity to the query with diversity in the retrieved results).

# Querying the vector store

# Return documents by similarity to a string query
results = vector_store.similarity_search("What was Nike's revenue in 2023?")
print(results[0])

# Asynchronous query (top-level await works in a notebook;
# in a script, run it inside an async function via asyncio.run)
results = await vector_store.asimilarity_search("When was Nike incorporated?")
print(results[0])

# Return scores along with the documents
# Note that providers implement different scores; the score here
# is a distance metric that varies inversely with similarity.
results = vector_store.similarity_search_with_score("What was Nike's revenue in 2023?")
doc, score = results[0]
print(f"Score: {score}\n")
print(doc)

# Return documents by similarity to an embedded query
embedding = embeddings.embed_query("How were Nike's margins impacted in 2023?")

results = vector_store.similarity_search_by_vector(embedding)
print(results[0])
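Maximum marginal relevance, mentioned in the list of query modes, is not exercised by the calls above, so here is a plain-Python sketch of the idea on toy 2-D vectors. It is an illustration of the general technique, not the code any vector store runs. `mmr` and `cosine` are hand-written helpers, and the parameter name `lambda_mult` is chosen for this sketch (1.0 means pure relevance, lower values favour diversity).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return indices of up to k documents chosen by maximal marginal relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Penalise documents that resemble something already picked.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
docs = [
    [1.0, 0.0],    # exact match for the query
    [0.99, 0.1],   # near-duplicate of the first document
    [0.6, 0.8],    # less similar, but different from the first
]

print(mmr(query, docs, k=2, lambda_mult=1.0))  # relevance only
print(mmr(query, docs, k=2, lambda_mult=0.3))  # diversity weighted more heavily
```

With relevance only, the second pick is the near-duplicate; once redundancy is penalised, the more distinct third vector is chosen instead.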

Retrievers

LangChain VectorStore objects do not subclass Runnable. LangChain Retrievers are Runnables, so they implement a standard set of methods (e.g., synchronous and asynchronous invoke and batch operations). Although we can construct retrievers from vector stores, retrievers can interface with non-vector store sources of data, as well (such as external APIs).

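In short, a VectorStore is not a Runnable, while a Retriever is, so a retriever exposes the standard invoke and batch methods in both synchronous and asynchronous forms. Retrievers can be built from vector stores, but they can also sit in front of other data sources such as external APIs. The example below wraps similarity_search with the @chain decorator to get a simple retriever.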

from typing import List

from langchain_core.documents import Document
from langchain_core.runnables import chain


@chain
def retriever(query: str) -> List[Document]:
    return vector_store.similarity_search(query, k=1)


retriever.batch(
    [
        "How many distribution centers does Nike have in the US?",
        "When was Nike incorporated?",
    ],
)
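To show what "a function that gains invoke and batch" looks like without any services running, here is a small plain-Python sketch. `FunctionRunnable` and `keyword_search` are hypothetical stand-ins written for this example; they are not LangChain's Runnable or a real retriever, and the keyword match merely takes the place of a vector search.

```python
from concurrent.futures import ThreadPoolExecutor


class FunctionRunnable:
    """Hypothetical stand-in: wraps a function and exposes invoke() and batch()."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def batch(self, values):
        # Run the calls concurrently; map() returns results in input order.
        with ThreadPoolExecutor() as pool:
            return list(pool.map(self.invoke, values))


# Toy corpus standing in for the vector store.
DOCS = [
    "Dogs are great companions, known for their loyalty and friendliness.",
    "Cats are independent pets that often enjoy their own space.",
]


def keyword_search(query):
    """Return the documents that contain any word of the query."""
    words = query.lower().split()
    return [doc for doc in DOCS if any(word in doc.lower() for word in words)]


toy_retriever = FunctionRunnable(keyword_search)
results = toy_retriever.batch(["dogs", "cats"])
for hits in results:
    print(hits)
```

The point of the shared interface is that callers can swap one retriever for another without changing how they call it.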