1. Introduction

Text summarization is a natural language processing (NLP) technique that condenses large documents or text into short, meaningful summaries. With LangChain, we can use large language models (LLMs) to implement text summarization easily and efficiently.

In this tutorial, we’ll discuss several text summarization techniques in LangChain, their applications, and their implementation, in a way that’s accessible to both beginners and experts.

2. What Is Text Summarization?

Text summarization is the process of generating a summary of a given text or document while retaining its key points. There are two basic types of text summarization:

Extractive summarization is a technique that selects the primary sentences or phrases in a document and creates a summary with them. It does not generate new sentences but simply compiles the most relevant keywords and sentences directly from the document to form a summary.

Unlike extractive summarization, abstractive summarization generates new sentences that paraphrase the original text more clearly and coherently. When used for text summarization, LangChain pairs with LLMs to understand the document’s context and generate human-like sentences.
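For intuition, here’s a minimal, self-contained sketch of the extractive idea (a toy frequency-based sentence scorer, independent of LangChain). Abstractive summarization, by contrast, requires a generative model such as the LLMs used throughout this tutorial:

import re
from collections import Counter

# Toy extractive summarizer: score each sentence by the frequency of its
# words in the whole text, then return the top sentences in original order
def extractive_summary(text, num_sentences=2):
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'\w+', text.lower()))
    top = sorted(
        range(len(sentences)),
        key=lambda i: sum(freq[w] for w in re.findall(r'\w+', sentences[i].lower())),
        reverse=True,
    )[:num_sentences]
    return ' '.join(sentences[i] for i in sorted(top))

print(extractive_summary("NLP is fun. NLP models can summarize text. Cats sleep a lot."))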

Now, let’s get into the details of the LangChain framework, how it works, and its summarization techniques.

3. What Is LangChain?

LangChain is an open-source framework that facilitates the integration of LLMs such as OpenAI GPT, Google Gemini, or Grok into applications. LangChain serves as the integration layer between LLMs and various applications. It provides abstractions for chains, agents, memory, and document processing that make it easier to build complex applications with LLMs.

LangChain provides access to diverse chain types for working with LLMs. Several chain types are defined in LangChain specifically for text summarization; they operate a little differently depending on the length and complexity of the document, as well as the desired output quality.

It simplifies the LLM application lifecycle from development to production and deployment by implementing standard interfaces for large language models and related technologies such as embedding models and vector stores. It also simplifies the integration of AI capabilities into various tasks, including document processing, conversational AI, and text summarization.

3.1. Key Features of LangChain for Text Summarization

Here are three main LangChain features:

  • Easy integration: LangChain integrates easily with different LLMs, such as GPT-4, Gemini, Grok, and other models
  • Customizable: It allows the creation of custom functions and the chaining of different components for easy, scalable processing
  • Multi-step processing: It supports combining extractive and abstractive techniques for text summarization

Now, let’s take a look at the different summarization techniques available in LangChain.

4. Different Text Summarization Techniques Using LangChain

In a world of information overload, time and money have become expensive commodities. Text summarization powered by LLMs offers a remedy by compressing large documents into summaries that convey all the critical information. As an open-source framework designed to work efficiently with LLMs, LangChain ships with several text summarization approaches.

This section reviews several text summarization techniques offered by LangChain, along with their strengths, weaknesses, and ideal use cases.

The first step is to set up our environment by installing the LangChain integration for Google’s Gemini models:

%pip install -U langchain-google-genai

Next, to use the Google Gemini model, we need an API key. We store it in a .env file for security and load it into the environment:

import os
from dotenv import load_dotenv  # requires the python-dotenv package

# Load GOOGLE_API_KEY from a local .env file
load_dotenv()

# Alternatively, set the key directly (avoid hardcoding real keys in code)
# os.environ["GOOGLE_API_KEY"] = "api-key"

The following speech text is used in most of the summarization examples below:

speech = """
People across the country... (full text omitted)
"""

4.1. Basic Text Summarization

Basic summarization is a straightforward approach where the model returns a summary of the text based on a user-defined prompt. Let’s take a look at basic prompt summarization:

# Import libraries and initialize the Gemini model
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.schema import AIMessage, HumanMessage, SystemMessage

# Initialize the model
llm = ChatGoogleGenerativeAI(model="gemini-1.5-pro-latest")

# Define the chat messages
chat_messages = [
    SystemMessage(content="You are an expert assistant with expertise in summarizing speeches"),
    HumanMessage(content=f"Please provide a short and concise summary of the following speech:\nTEXT: {speech}")
]

# Get the number of tokens
print(llm.get_num_tokens(speech))  # Output: 886

# Get the response
response = llm.invoke(chat_messages)
print(response.content)  # Returns the summarized speech

This generates the following output:

In a powerful address to Congress, President Lyndon B. Johnson champions the Voting Rights Act (full text omitted)

Basic prompt summarization assigns a role to the LLM with the SystemMessage, while the HumanMessage carries the input message from the user.

4.2. Prompt Templates Text Summarization

Prompt Templates help to structure and format the input before sending it to the LLM, allowing users to insert values into the predefined text prompt:

from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# Define the generic template
generic_template = (
    "Write a summary of the following speech:\n"
    "Speech: `{speech}`\n"
    "Translate the precise summary to {language}."
)

# Initialize the PromptTemplate
prompt = PromptTemplate(
    input_variables=['speech', 'language'],
    template=generic_template
)

# Format the prompt
complete_prompt = prompt.format(speech=speech, language='hindi')
print(complete_prompt)

# Create and run the chain
llm_chain = LLMChain(llm=llm, prompt=prompt)
summary = llm_chain.run({'speech': speech, 'language': 'hindi'})
print(summary)

Let’s break down the code snippet:

  • the generic_template defines the prompt, where {speech} and {language} are placeholders for the text and the output language
  • PromptTemplate() declares the input variables and the template
  • LLMChain(llm=llm, prompt=prompt) connects the LLM to the formatted prompt
  • llm_chain.run({'speech': speech, 'language': 'hindi'}) substitutes the actual text and language into {speech} and {language}, instructing the model to return the translated summary

Below is the output of the translated text summary in Hindi:

"यह भाषण 'विकसित भारत संकल्प यात्रा' के महत्व पर केंद्रित है।  वक्ता, एक सांसद होने के ना (full text omitted)

4.3. StuffDocumentsChain Text Summarization

The StuffDocumentsChain is the most basic text summarization method in LangChain. As the name suggests, it “stuffs” the entire document into a single prompt and asks the LLM to generate a summary.

Let’s implement it using the speech from the previous section:

from PyPDF2 import PdfReader
from langchain.docstore.document import Document
from langchain.chains.summarize import load_summarize_chain
from langchain.prompts import PromptTemplate

# Function to read PDF
def read_pdf(file_path):
    pdfreader = PdfReader(file_path)
    text = ''
    for page in pdfreader.pages:
        content = page.extract_text()
        if content:
            text += content
    return text

# Uncomment this line to use actual PDF file
# text = read_pdf('speech.pdf')

# For demonstration, let's use the speech text instead
text = speech
docs = [Document(page_content=text)]

template = (
    "Write a concise and short summary of the following speech.\n"
    "Speech: `{text}`"
)

prompt = PromptTemplate(
    input_variables=['text'],
    template=template
)

chain = load_summarize_chain(
    llm,
    chain_type='stuff',
    prompt=prompt,
    verbose=False
)

output_summary = chain.run(docs)
print(output_summary)

And the output:

The speaker, a Member of Parliament, embarked on the 'Viksit Bharat Sankalp Yatra' (full text omitted)

This technique is easy to implement, requires minimal configuration, and retains the entire context of the document. However, it’s not suitable for very large documents: the whole text must fit into the LLM’s context window, and quality degrades as the document grows, so important content can be lost during summarization. It’s best used for documents that fit comfortably into the context window, like articles, emails, and reports.
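Since the stuff chain requires the whole document to fit in the context window, a simple guard can check the token count before choosing it. Here’s a minimal sketch; MAX_STUFF_TOKENS is an illustrative assumption, not a LangChain constant:

# Illustrative token budget; adjust to the actual model's context window
MAX_STUFF_TOKENS = 100_000

def choose_chain_type(llm, text):
    # Pick 'stuff' when the document fits comfortably, else 'map_reduce'
    return "stuff" if llm.get_num_tokens(text) < MAX_STUFF_TOKENS else "map_reduce"

print(choose_chain_type(llm, text))  # our short speech easily qualifies for 'stuff'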

4.4. Summarizing Large Documents Using Map Reduce

For longer documents that exceed the LLM’s context window, the MapReduceDocumentsChain offers an effective solution. This technique applies a divide-and-conquer approach by:

  • Splitting the document into smaller chunks
  • Summarizing each chunk independently (Map)
  • Combining these summaries into a final summary (Reduce)

To implement MapReduce, let’s import the text splitter, which breaks the document into appropriately sized chunks before each one is summarized according to the user prompt:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.summarize import load_summarize_chain
from langchain.prompts import PromptTemplate

# Split text into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=100
)
chunks = text_splitter.create_documents([text])

# Define custom prompts
map_prompt = PromptTemplate(
    input_variables=["text"],
    template="Summarize this content:\n{text}"
)

combine_prompt = PromptTemplate(
    input_variables=["text"],
    template="Combine these summaries into a coherent summary:\n{text}"
)

# Create and run the chain
chain = load_summarize_chain(
    llm,
    chain_type="map_reduce",
    map_prompt=map_prompt,
    combine_prompt=combine_prompt,
    verbose=True
)

summary = chain.run(chunks)
print(summary)

The output is as follows:

The Viksit Bharat Sankalp Yatra assesses the impact of government schemes by directly interacting with beneficiaries (...)

MapReduce is highly parallelizable, which can improve performance for large documents. However, it may lose cross-chunk contextual information, with the final summary missing connections between distant parts of the document.

The map-reduce approach is ideal for long documents like research papers, books, lengthy reports, or any text that exceeds the LLM’s context window. It’s also useful when processing collections of related documents that need to be summarized together.
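As a hypothetical illustration of the latter case, a collection of related documents can be passed to the same map-reduce chain defined above; here, copies of the speech stand in for separate documents:

from langchain.docstore.document import Document

# Stand-in for a collection of related texts (hypothetical example)
related_docs = [Document(page_content=t) for t in [speech, speech]]

# The chain summarizes each document, then combines the results
combined_summary = chain.run(related_docs)
print(combined_summary)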

4.5. Text Summarization Using RefineChain

The RefineChain type takes an iterative approach to summarizing: it processes the document chunks sequentially, and the summary of the chunks seen so far serves as context when summarizing the next chunk.

from langchain.chains.summarize import load_summarize_chain
from langchain.prompts import PromptTemplate

# Custom prompts for the refine approach
initial_prompt = PromptTemplate(
    input_variables=["text"],
    template="Write a summary of this text:\n{text}"
)

refine_prompt = PromptTemplate(
    input_variables=["existing_summary", "text"],
    template=(
        "Here is an existing summary: {existing_summary}\n"
        "Here is some more text: {text}\n"
        "Refine the existing summary with this new information."
    )
)

# Create & run the chain
chain = load_summarize_chain(
    llm,
    chain_type="refine",
    question_prompt=initial_prompt,
    refine_prompt=refine_prompt,
    verbose=True
)

summary = chain.run(chunks)

print(summary)

Generated output:

> Entering new RefineDocumentsChain chain...

> Entering new LLMChain chain...
Prompt after formatting:
Write a summary of this text:
People across the country, involved in government, political, ...

> Finished chain.

> Finished chain.

The refine chain technique retains context across document chunks, often generating a more coherent summary than map-reduce and handling document relations and narrative flow better. RefineChain is best suited to documents with a story-like structure that progressively builds context, such as stories, histories, or reports over time. It comes in handy when summary quality matters more than processing time.

One of its downsides is that it’s slower than MapReduce because it processes chunks sequentially. Earlier segments may also receive more emphasis than later ones, so it can struggle with documents whose most significant information appears at the end.

4.6. Text Summarization Using Custom Chain

When the standard chains fall short, we can create custom summarization chains for particular requirements. In fact, a custom chain is one of the best solutions for structured information extraction or when specific aspects of a document need to be extracted.

from langchain.chains import SequentialChain, LLMChain
from langchain.prompts import PromptTemplate

# Build extraction of important points from texts
extract_chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate(
        input_variables=["text"],
        template="Extract the 5 most important points from this text:\n{text}"
    ),
    output_key="points"  # Set an explicit output key
)

# Create a chain that transforms the points into a summary
summarize_chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate(
        input_variables=["points"],
        template="Create a coherent summary based on these key points:\n{points}"
    ),
    output_key="summary"  # Set an explicit output key
)

# Combine the two chains into a sequential pipeline
custom_summary_chain = SequentialChain(
    chains=[extract_chain, summarize_chain],
    input_variables=["text"],
    output_variables=["summary"],
    verbose=True
)

summary = custom_summary_chain.run(speech)
print(summary)

And the output:

> Entering new SequentialChain chain...

> Finished chain.
The focus is on ensuring government schemes effectively reach their intended beneficiaries...

With a custom chain, we can tailor the pipeline to a specific use case and extract information such as facts, arguments, and recommendations. It allows for multi-stage processing with intermediate steps, making it easy to implement a domain-specific summarization strategy.

However, unlike the other summarization techniques, it requires a more complex setup and careful prompt design. It’s also more expensive due to multiple LLM calls. Custom chains are most applicable when there are specific requirements, like extracting particular types of information, covering different aspects of documents, or summarizing in a specific format.

5. Comparing Summarization Techniques

Choosing a summarization technique depends on the size of the document, the expected output, and the model we intend to use. The table below compares the techniques across other deciding factors:

Summarization Technique | Document Length | Speed                 | Context | Complexity | Cost
Stuff                   | Short to Medium | Fast                  | High    | Low        | Low to High (grows with document size)
Map-Reduce              | Any length      | Fast (parallelizable) | Medium  | Medium     | Medium
Refine                  | Medium to Long  | Slow (sequential)     | High    | Medium     | High
Custom                  | Any length      | Varies                | Varies  | High       | Varies

6. Best Practices

When implementing text summarization with LangChain, consider these best practices:

  • Choose a splitting strategy: Test various chunking methods (by character, token, paragraph, or semantic units) to determine what works best for the document
  • Optimize chunk size: Larger chunks give more context but may exceed token limits, while smaller chunks process faster but may lose context; overlap helps maintain continuity between chunks (see the sketch after this list)
  • Write good prompts: Clear, to-the-point instructions produce better summaries. We can include formatting instructions, output length restrictions, or even a few focus areas in the prompts
  • Choose the right model: Different LLMs have strengths for different use cases. For instance, some models are efficient at producing short summaries, while others maintain higher factual accuracy but are less concise
  • Evaluate the output: This can be done by computing semantic similarity, running sentiment analysis, or using human evaluation
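To illustrate the chunk-size trade-off mentioned above, the following sketch splits the same speech at two illustrative settings (the values are assumptions, not recommendations) and compares the number of chunks produced:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Illustrative settings; tune chunk_size and chunk_overlap for the target model
for size, overlap in [(500, 50), (2000, 100)]:
    splitter = RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=overlap)
    parts = splitter.create_documents([speech])
    print(f"chunk_size={size}, overlap={overlap}: {len(parts)} chunks")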

7. Conclusion

LangChain text summarization offers flexible ways to compress huge amounts of information while retaining the meaning. Whether the task requires summarizing research papers, legal documents, news articles, or meeting transcripts, LangChain lays these approaches out clearly and provides different chain types for drawing meaningful summaries from text data at scale. These techniques can come in handy for researchers, as well as for anyone who deals with a bulk of information every day. By understanding the strengths and weaknesses of each approach, we can choose the most effective solution for our text summarization needs.

In this article, we discussed the stuff, map-reduce, refine, and custom techniques, whose suitability depends on the user’s specific requirements, the document’s features, and overall performance expectations.

