Summarize and query PDFs with AI using Ollama

Local AI
Machine Learning

Vincent Grégoire, PhD, CFA


March 30, 2024

Large language models (LLMs) have changed the way we interact with text data, making it easy to generate, summarize, and query information. In this tutorial, we'll explore how to use LLMs to process and analyze PDF documents with Ollama, an open-source tool that manages and runs local LLMs. Ollama serves the models, while LangChain provides an extensive library of convenience tools for accessing and interacting with them. Together, they let us build an application that summarizes and queries PDFs entirely on your own machine, keeping your documents private. We'll then use Streamlit to build an interactive dashboard that makes the application easier to use.

All the code for this tutorial is available on GitHub, so you can follow along and experiment with the application yourself. If you’re ready to enhance your research process with a powerful, AI-driven tool for summarizing and querying PDF documents, then you’ve come to the right place. Let’s get started!

Video tutorial

This post is also available as a video tutorial on YouTube.


Ollama is a tool to manage and run local LLMs, such as Meta’s Llama2 and Mistral’s Mixtral. I discussed how to use Ollama as a private, local ChatGPT replacement in a previous post.

The first step is to download and install Ollama on your local machine. The process is straightforward: Ollama's download page provides installers for macOS and Windows, as well as instructions for Linux users. Once you've downloaded the installer, follow its instructions to set up Ollama on your machine.

If you’re using a Mac, you can install Ollama using Homebrew by running the following command in your terminal:

brew install ollama

The benefit of using Homebrew is that it simplifies the installation and lets you run Ollama as a background service (for example, with brew services start ollama), so it is always available to manage and serve the models you download.


After installing Ollama, you can install a model from the command line using the pull command:

ollama pull mixtral


Alongside Ollama, our project leverages several key Python libraries to enhance its functionality and ease of use:

  • LangChain is our primary tool for interacting with large language models programmatically, offering a streamlined approach to processing and querying text data.
  • PyPDF is instrumental in handling PDF files, enabling us to read and extract text from documents, which is the first step in our summarization and querying process.
  • langchain_openai and openai are used to access Ollama's OpenAI-compatible API. The added benefit is that this allows a seamless transition to compatible cloud-based LLM providers such as OpenAI or Groq.
  • tiktoken counts tokens; LangChain uses it to check that the text sent to the model stays within token limits.
  • python-dotenv loads environment variables from a .env file, letting us keep API keys and other settings out of the code.
  • Streamlit is used to build our interactive dashboard; it significantly simplifies the development of user-friendly interfaces.
  • Rich (optional) is a library that enhances command-line outputs with rich text and formatting, useful for developers who prefer CLI tools. I won’t be using Rich in this tutorial, but I also have a CLI tool in the repository that uses Rich for formatting the output.

Setting up the python project

Easiest: If you are using Poetry to manage your Python dependencies, you can get the pyproject.toml file from the repository and run poetry install to install the dependencies.

Using pip

To begin, create a dedicated project directory to house all your files and dependencies. Open your terminal or command prompt, navigate to your project directory, then create and activate a Python virtual environment by running:

python -m venv venv
source venv/bin/activate

This ensures that all the dependencies installed are confined to this project, avoiding conflicts with other Python projects. Once your environment is active, install the aforementioned dependencies using pip, Python’s package installer. You can do this by running:

pip install langchain pypdf langchain-openai openai tiktoken python-dotenv streamlit rich
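To check that the installation worked, a small script like the following can look up each package's import name in the active environment. This is a minimal sketch using only the standard library's importlib.util.find_spec; the module list is an assumption based on the pip command above, and note that some import names differ from the pip package names (python-dotenv is imported as dotenv, langchain-openai as langchain_openai).

```python
import importlib.util
from collections.abc import Sequence

# Import names for the packages installed above. Some differ from the pip names:
# langchain-openai -> langchain_openai, python-dotenv -> dotenv
REQUIRED_MODULES = (
    "langchain",
    "pypdf",
    "langchain_openai",
    "openai",
    "tiktoken",
    "dotenv",
    "streamlit",
    "rich",
)


def missing_modules(modules: Sequence[str] = REQUIRED_MODULES) -> list[str]:
    """Return the top-level modules that cannot be found in the active environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("Missing: " + ", ".join(missing) if missing else "All dependencies found.")
```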

With your environment set and dependencies installed, you’re well-prepared to dive into the development of your AI-powered PDF processing app.

Building the PDF processing app

For this app, I will showcase two methods for exploring documents: the stuffing method for document summarization and the map-reduce method for targeted document querying. The code is adapted from the Summarization example in the LangChain documentation.

The stuffing method places the entire content of a PDF into a single prompt that the LLM reads and summarizes in one call. This technique is particularly useful for generating succinct overviews of documents, letting users grasp the core ideas without reading the entire text. It's a straightforward approach that mimics asking a colleague to summarize a report they've read. Because the whole text must fit within the model's context window, it works best for shorter documents.

On the other hand, the map-reduce method takes a more granular approach, dissecting the document into manageable pieces and applying specific queries (mapping) to each segment. This method is akin to conducting a thorough examination of each page of a document to answer a particular question, with the “reduce” phase aggregating these individual insights into a cohesive answer. It’s especially powerful for extracting specific information from documents, enabling precise and targeted queries across the entire text. The map-reduce method is ideal for longer documents or those with complex structures, allowing users to pinpoint and extract the data they need efficiently. It is, however, more computationally intensive than the stuffing method, as it requires processing each segment individually before combining the results.
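The choice between the two methods comes down to whether the text fits in the model's context window. The sketch below makes that decision explicit. It is an illustration, not part of the tutorial's app: the function names are hypothetical, the context-window size is a parameter you must look up for your model, and the four-characters-per-token ratio is only a rough rule of thumb for English text (a tokenizer such as tiktoken gives exact counts).

```python
import math

# Rough rule of thumb (assumption): about 4 characters per token for English text
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Approximate the token count of a string."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def choose_method(pages: list[str], context_window: int, reserve: int = 1024) -> str:
    """Return "stuff" if every page fits in one prompt, else "map_reduce".

    `reserve` keeps room for the prompt instructions and the model's answer.
    """
    total = sum(estimate_tokens(page) for page in pages)
    return "stuff" if total + reserve <= context_window else "map_reduce"


pages = ["x" * 4000] * 3  # three pages of about 1,000 tokens each
print(choose_method(pages, context_window=8192))  # stuff
print(choose_method(pages, context_window=4000))  # map_reduce
```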

Another popular method for querying documents is RAG (Retrieval-Augmented Generation), which involves first retrieving and filtering relevant information from a database or document corpus before generating a response with an LLM. While I won’t be covering RAG in this tutorial, it’s a powerful technique that can significantly enhance the quality of responses generated by LLMs, especially when dealing with very large or diverse datasets.

In the following sections, I’ll cover the implementation details of these methods, guiding you through the process of building a fully functioning PDF processing app. From loading and processing documents to interfacing with LLMs and designing a user-friendly dashboard, I’ll cover all the bases, ensuring you have the knowledge and tools to replicate and customize this solution for your own needs.

Loading PDF Documents with LangChain

The initial step in creating our PDF processing app involves efficiently loading and preparing the PDF documents for further analysis. To facilitate this, we will use LangChain, a comprehensive library designed to streamline the interaction with large language models and various document types, including PDFs, CSV files, and more.

Why Choose LangChain?

LangChain stands out for its flexibility and robustness in handling different document formats, making it an ideal choice for our project. Not only does it support local files like PDFs, which are our primary focus, but it also offers compatibility with multiple other file types and integrates seamlessly with third-party data providers. After loading, the documents are transformed into a structured format that can be easily processed by the other components of our app, such as the AI models responsible for summarization and querying.

Loading PDFs

We begin by importing the PDF document loader provided by LangChain, specifically designed for handling PDF files. For the examples in this tutorial, I’ll be using my paper titled Price revelation from insider trading: Evidence from hacked earnings news (🔓 open access).

With the file path specified, we create an instance of the PDF loader. Calling its load() method parses the PDF and returns a list of documents, one per page, so that each page is individually accessible for detailed analysis.

from langchain_community.document_loaders.pdf import PyPDFLoader

file_path = "hacking_prices.pdf"
loader = PyPDFLoader(file_path)
docs = loader.load()

Summarizing PDF Documents Using the Stuffing Method

After loading our PDF into "documents" (one for each page), our next objective is to use an LLM to summarize them. The process we will use, known as the "stuffing method," involves feeding the entire text of the document into the LLM in a single prompt to generate a concise summary. The strength of this method is that it produces an overview capturing the essence of the document, which is useful for getting quick insights into research papers or reports.

Crafting the Prompt

The first step in this process involves crafting a prompt that will guide the LLM in summarizing the document. For our application, we employ a template that instructs the model to focus exclusively on the content provided, excluding any external opinions or analysis. Here’s the structure of our prompt:

# Prompt
from langchain_core.prompts import PromptTemplate

prompt_template = """Write a long summary of the following document.
Only include information that is part of the document.
Do not include your own opinion or analysis.

Document:
"{document}"
Summary:"""

prompt = PromptTemplate.from_template(prompt_template)

The prompt is defined as a template that specifies the desired output format and the content to be summarized. When invoking the LLM, LangChain will replace the {document} placeholder with the actual text from the PDF document.
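By default, StuffDocumentsChain formats each page as its plain page_content, joins the pages with a blank line ("\n\n"), and passes the resulting string as the {document} variable. The plain-Python sketch below reproduces that step with str.format standing in for PromptTemplate; the page strings are made up, and the template is written out in full for illustration.

```python
PROMPT_TEMPLATE = """Write a long summary of the following document.
Only include information that is part of the document.
Do not include your own opinion or analysis.

Document:
"{document}"
Summary:"""


def stuff_prompt(pages: list[str], separator: str = "\n\n") -> str:
    """Join all pages into one string and insert it into a single prompt."""
    document = separator.join(pages)
    return PROMPT_TEMPLATE.format(document=document)


prompt = stuff_prompt(["Text of page 1.", "Text of page 2."])
print(prompt)
```

Because everything goes into one prompt, the model is called exactly once, which is why this method is fast but bounded by the context window.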

Setting up the LLM chain

To execute our summarization task, we use Ollama through its OpenAI-compatible API, which also makes it easy to switch to GPT-4 or similar models later. Configuring the model involves setting parameters such as the temperature, which controls the randomness of the responses (lower values give more deterministic output), and specifying the model name. In our case, I chose Mixtral for its balance of speed and performance. The base URL is set to point to our local machine, where Ollama is running. Note that even though we are using a local model, the OpenAI client still requires an API key; any placeholder value works, since Ollama ignores it.

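# Define LLM Chain

from langchain_openai import ChatOpenAI
from langchain.chains.llm import LLMChain

llm = ChatOpenAI(
    temperature=0.1,  # low temperature for a factual summary
    model="mixtral",  # the model pulled with `ollama pull mixtral`
    api_key="ollama",  # placeholder: required by the client, ignored by Ollama
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
)
llm_chain = LLMChain(llm=llm, prompt=prompt)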

Invoking the stuff documents chain

With our language model chain configured, we create a "Stuff Documents" chain using LangChain. This chain joins the text of all the pages, inserts it into the {document} variable of our prompt, and sends the result to the LLM in a single call to produce one summary of the whole document.

# Create full chain
from langchain.chains.combine_documents.stuff import StuffDocumentsChain

stuff_chain = StuffDocumentsChain(
    llm_chain=llm_chain, document_variable_name="document"
)

Once set up, we can invoke this chain with our loaded documents to generate the summary:

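result = stuff_chain.invoke(docs)
print(result["output_text"])  # the chain returns a dict; the summary is under "output_text"

The result contains the summarized text under its "output_text" key. With my paper, the summary was: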

This paper examines whether informed trading activities, such as insider trading and institutional trading, can predict post-earnings announcement drift (PEAD). The authors find that informed trading activities are positively related to PEAD, suggesting that these trades contain valuable information about future stock returns. They also find that the relation between informed trading and PEAD is stronger for stocks with higher levels of information asymmetry, such as those with lower institutional ownership or higher bid-ask spreads. The results suggest that informed traders are able to extract private information from earnings announcements and use it to earn abnormal returns.

This summary is not very good: it draws on papers cited in the bibliography and presents their findings as if they were the paper's own. We can refine the summary by keeping only the pages that contain relevant information and omitting those that are primarily references. For this paper, the references are on the last three pages, so we drop them with docs[:-3]:

# Invoke with limited pages
result = stuff_chain.invoke(docs[:-3])

With this, I get a much more accurate summary that focuses on the content of the paper itself, excluding the references:

The paper examines the impact of hacked newswire services on informed trading and stock prices. It finds that hacked firms experience a significant increase in effective spreads, which is driven by an increase in realized spreads rather than price impacts. However, there is no evidence of higher absolute order imbalance or quoted spreads for these firms. The paper also suggests that liquidity providers may be adjusting quotes to manage inventory risk associated with large buy/sell pressure. A placebo test using morning trades shows no significant differences in informed trading measures, further supporting the findings.

Querying PDF documents using the map-reduce approach

After exploring document summarization, we will see how to query PDF documents for specific information. A query can ask for a summary, but it can also target anything else in the document, which is particularly useful when looking for particular data or answers within long documents. Unlike the stuffing method used for summarization, the map-reduce method analyzes the document piece by piece.

For this example, I will use the following user query:

user_query = "What is the data used in this analysis?"

The map phase: document-specific queries

In the map phase, we apply the same query to each page of the document, treating each page as a separate document within a larger set, so that no page is skipped in our search for answers. To implement this, we write a prompt that instructs the LLM to identify the information relevant to the query in the text of each page. If a page is irrelevant, the model is instructed to answer "not relevant".

map_template = """The following is a set of documents
{docs}
Based on this list of documents, please identify the information that is most relevant to the following query:
{user_query}
If the document is not relevant, please write "not relevant".
Helpful Answer:"""
map_prompt = PromptTemplate.from_template(map_template)
# Fill in the query now; {docs} receives each page's text when the chain runs
map_prompt = map_prompt.partial(user_query=user_query)
map_chain = LLMChain(llm=llm, prompt=map_prompt)
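Stripped of LangChain, the map phase is a loop that formats the same prompt once per page and calls the model on each result. In this sketch, llm is any function that takes a prompt string and returns text; fake_llm is a hypothetical stand-in so the example runs without Ollama, and the page texts are made up.

```python
from typing import Callable

MAP_TEMPLATE = """The following is a set of documents
{docs}
Based on this list of documents, please identify the information that is most relevant to the following query:
{user_query}
If the document is not relevant, please write "not relevant".
Helpful Answer:"""


def map_step(pages: list[str], user_query: str, llm: Callable[[str], str]) -> list[str]:
    """Ask the same query about every page: one LLM call per page."""
    return [llm(MAP_TEMPLATE.format(docs=page, user_query=user_query)) for page in pages]


calls: list[str] = []


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for the model that records each prompt it receives."""
    calls.append(prompt)
    return "not relevant" if "References" in prompt else "Uses trade and quote data."


pages = ["We use trade and quote data.", "References\nA. Author (2020)."]
answers = map_step(pages, "What is the data used in this analysis?", fake_llm)
print(answers)  # ['Uses trade and quote data.', 'not relevant']
print(len(calls))  # 2
```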

The reduce phase: aggregating answers

After mapping, we move to the reduce phase, where the outputs from the map phase are consolidated into a final, comprehensive answer. This step involves another carefully designed prompt that guides the model to distill the collected answers into a singular, coherent response to the original query. This process not only synthesizes the information gathered from each page but also ensures the final answer is succinct and directly addresses the query.

reduce_template = """The following is a set of partial answers to a user query:
{docs}
Take these and distill it into a final, consolidated answer to the following query:
{user_query}
Complete Answer:"""
reduce_prompt = PromptTemplate.from_template(reduce_template)
reduce_prompt = reduce_prompt.partial(user_query=user_query)
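The reduce phase, in the same plain-Python terms, joins the map answers into one prompt and makes a single call. The optional drop_irrelevant flag is an assumption of this sketch, not something the LangChain chain above does: discarding "not relevant" answers before the reduce call can shorten the final prompt. echo_llm is a hypothetical stand-in that returns the prompt unchanged so the result can be inspected, and the answers are made up.

```python
from typing import Callable

REDUCE_TEMPLATE = """The following is a set of partial answers to a user query:
{docs}
Take these and distill it into a final, consolidated answer to the following query:
{user_query}
Complete Answer:"""


def reduce_step(
    answers: list[str],
    user_query: str,
    llm: Callable[[str], str],
    drop_irrelevant: bool = False,
) -> str:
    """Combine the per-page answers into one prompt and call the LLM once."""
    if drop_irrelevant:
        # Optional filter (not part of the LangChain chain): skip pages the
        # map step flagged as "not relevant"
        answers = [a for a in answers if a.strip().lower() != "not relevant"]
    prompt = REDUCE_TEMPLATE.format(docs="\n\n".join(answers), user_query=user_query)
    return llm(prompt)


def echo_llm(prompt: str) -> str:
    """Hypothetical stand-in that returns its prompt, to show what the model sees."""
    return prompt


final = reduce_step(
    ["Answer from page 1.", "not relevant", "Answer from page 3."],
    "What is the data used in this analysis?",
    echo_llm,
    drop_irrelevant=True,
)
print(final)
```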

Constructing the full chain

To bring our querying process to life, we construct a full chain that encompasses both the map and reduce phases. This includes separate LLM chains for mapping and reducing, each tailored to its task. The MapReduce Documents chain then binds these components together and manages the flow of information between them. If the combined answers from the map phase exceed a token limit, the ReduceDocumentsChain first collapses them in batches, so that the final prompt still fits within the model's context window.

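from langchain.chains import MapReduceDocumentsChain, ReduceDocumentsChain

reduce_chain = LLMChain(llm=llm, prompt=reduce_prompt)

# Takes a list of documents, combines them into a single string, and passes this to an LLMChain
combine_documents_chain = StuffDocumentsChain(
    llm_chain=reduce_chain, document_variable_name="docs"
)

# Combines and iteratively reduces the mapped documents
reduce_documents_chain = ReduceDocumentsChain(
    # This is the final chain that is called.
    combine_documents_chain=combine_documents_chain,
    # Used to collapse groups of answers if they exceed token_max
    collapse_documents_chain=combine_documents_chain,
    # The maximum number of tokens to group documents into.
    token_max=4000,  # assumed value; adjust to your model's context window
)

# Combining documents by mapping a chain over them, then combining results
map_reduce_chain = MapReduceDocumentsChain(
    llm_chain=map_chain,  # map step: run on each page
    reduce_documents_chain=reduce_documents_chain,  # reduce step
    document_variable_name="docs",  # prompt variable that receives each page
    return_intermediate_steps=False,
)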
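The token budget of the ReduceDocumentsChain (its token_max parameter) matters when the map answers are too long to stuff into one reduce prompt. In that case, the chain packs the answers into groups that fit the budget, collapses each group with an LLM call, and repeats until the remainder fits. The sketch below shows that idea in plain Python; it is a simplified illustration, not LangChain's implementation, and estimate_tokens (four characters per token), max_rounds, and the summarize callback are assumptions.

```python
import math
from typing import Callable


def estimate_tokens(text: str) -> int:
    """Rough estimate (assumption): about 4 characters per token."""
    return math.ceil(len(text) / 4)


def group_by_token_budget(texts: list[str], token_max: int) -> list[list[str]]:
    """Greedily pack consecutive texts into groups of at most token_max tokens."""
    groups: list[list[str]] = []
    current: list[str] = []
    current_tokens = 0
    for text in texts:
        tokens = estimate_tokens(text)
        if tokens > token_max:
            raise ValueError("A single text exceeds token_max")
        if current and current_tokens + tokens > token_max:
            groups.append(current)
            current, current_tokens = [], 0
        current.append(text)
        current_tokens += tokens
    if current:
        groups.append(current)
    return groups


def collapse(
    texts: list[str],
    token_max: int,
    summarize: Callable[[list[str]], str],
    max_rounds: int = 5,
) -> list[str]:
    """Summarize groups of texts until their total fits within token_max."""
    rounds = 0
    while sum(estimate_tokens(t) for t in texts) > token_max:
        if rounds == max_rounds:
            raise RuntimeError("Texts still exceed token_max after max_rounds")
        texts = [summarize(group) for group in group_by_token_budget(texts, token_max)]
        rounds += 1
    return texts


answers = ["x" * 400] * 5  # five answers of about 100 tokens each
print([len(g) for g in group_by_token_budget(answers, token_max=250)])  # [2, 2, 1]
print(len(collapse(answers, token_max=250, summarize=lambda group: "y" * 40)))  # 3
```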

Execution and results

Executing this map-reduce process on a document will be more time-consuming than summarization due to the complexity and number of queries involved. However, the results are worth the wait, providing precise answers to specific questions, which, in our case, included a detailed summary of the data used in a research project. It will also work for longer documents that will not fit within the LLM’s context window and for which the stuffing method would not be suitable.

result = map_reduce_chain.invoke(docs[:-3])

After about 5 minutes, the result was:

The data used in this analysis includes stock prices, trading volume, and order imbalance measures for a sample of U.S. firms that experienced a newswire hack between 2010 and 2014. Specifically, the authors use minute-level trade and quote data from the Trade and Quote (TAQ) database, which is maintained by the New York Stock Exchange. They also use firm-level financial data from Compustat and measures of media coverage from Factiva. The sample includes all U.S. common stocks listed on the NYSE, NASDAQ, or AMEX exchanges during the study period. Not all variables are available for all firms and time periods, resulting in an unbalanced panel. Additionally, the analysis uses information from legal documents of SEC prosecutions, newswire servers, and a set of control variables such as log market capitalization, fraction of shares held by institutional investors, natural logarithm of number of analysts, natural logarithm of newswire news in the quarter leading to the announcement, daily cost of borrowing from Markit, and the inverse of the stock price. The data is used to examine the impact of hackers’ trading on volume, spreads, and order flow measures.

Overall, this is a pretty decent answer to the query, providing a detailed overview of the data used in the analysis. The map-reduce method is a powerful tool for extracting specific information from documents, enabling targeted queries and detailed responses that address user queries effectively.

Creating a UI with Streamlit

Transforming our PDF summarization and querying capabilities into a user-friendly application enhances accessibility and utility, making these powerful functions easier to use and understand. For this purpose, I use Streamlit, a dynamic Python framework that simplifies the creation of interactive web applications. Streamlit’s intuitive design and extensive features allow us to build a sleek UI without the need for complex web development skills. This approach enables the user to interact with our local AI models through a browser interface, providing a seamless experience for summarizing and querying documents. Streamlit apps are easy to deploy and share, but in this case, it would not be possible to share the app with others, as it requires access to the local LLM.

Structuring the Streamlit application

Our Streamlit application serves as the gateway for users to access the summarization and querying functionalities. Its architecture is simple: a single Streamlit script handles user interaction and displays results, while the documents_llm package holds the document-processing logic.

Here is the basic structure of the files for this application:

├── : Streamlit application
├── documents_llm
│   ├──
│   ├── : Document loading and processing
│   ├── : Querying operations
│   ├── st_helpers.py : Streamlit helper functions
│   └── : Summarization operations
├── poetry.lock : Poetry lock file
└── pyproject.toml : Poetry project file

The modules in the documents_llm package contain the logic for loading and processing documents, querying, and summarization. They encapsulate the core functionality of our application in modular components, which keeps the code readable and maintainable.

They provide the following functions that encapsulate what we have done in the previous sections:

from pathlib import Path

from langchain_core.documents import Document

# Signatures only: the bodies wrap the code from the previous sections.


def load_pdf(
    file_path: Path | str, start_page: int = 0, end_page: int = -1
) -> list[Document]: ...


def summarize_document(
    docs: list[Document],
    model_name: str,
    openai_api_key: str,
    base_url: str,
    temperature: float = 0.1,
) -> str: ...


def query_document(
    docs: list[Document],
    user_query: str,
    model_name: str,
    openai_api_key: str,
    base_url: str,
    temperature: float = 0.3,
) -> str: ...
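The app's sidebar numbers pages from 0 and lets the end page be negative, with -1 meaning the last page. Assuming load_pdf treats end_page as inclusive in that way, its page selection could look like the hypothetical helper below; the repository's actual implementation may differ. Note how end_page=-4 reproduces the docs[:-3] slice used earlier to drop the references.

```python
from typing import TypeVar

T = TypeVar("T")


def select_pages(pages: list[T], start_page: int = 0, end_page: int = -1) -> list[T]:
    """Return pages start_page through end_page, both inclusive.

    A negative end_page counts from the end: -1 is the last page.
    """
    if end_page < 0:
        end_page += len(pages)  # e.g., -1 becomes the index of the last page
    if end_page < start_page:
        return []  # empty range
    return pages[start_page : end_page + 1]


pages = [f"page {i}" for i in range(10)]
print(len(select_pages(pages)))  # 10
print(select_pages(pages, 2, 4))  # ['page 2', 'page 3', 'page 4']
print(select_pages(pages, 0, -4) == pages[:-3])  # True
```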

The Streamlit application itself is defined in a script at the root of the repository. Here is what the header of the file looks like:

import os
import time

import streamlit as st
from dotenv import load_dotenv

from documents_llm.st_helpers import run_query

# Load environment variables from a .env file
load_dotenv()

# Load model parameters
MODEL_NAME = os.getenv("MODEL_NAME")
OPENAI_URL = os.getenv("OPENAI_URL")

st.title("🐍 VCF Document Analyzer")
st.write(
    "This is a simple document analyzer that uses LLM models to summarize and answer questions about documents. "
    "You can upload a PDF or text file and the model will summarize the document and answer questions about it."
)

We first import dependencies, load environment variables, and define the title and introductory text for our application. Streamlit's st.* functions create the UI elements, such as titles, text, and file upload buttons, making it easy to design a user-friendly interface.
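load_dotenv reads KEY=value lines from a .env file into the environment, and os.getenv returns None for any variable that is missing. A variation that falls back to defaults lets the app start even without a .env file. The sketch below is hypothetical: load_settings and its default values are assumptions based on the local Ollama setup used in this tutorial.

```python
import os
from collections.abc import Mapping

# Example .env file (values are assumptions for a local Ollama setup):
#   MODEL_NAME=mixtral
#   OPENAI_URL=http://localhost:11434/v1
#   OPENAI_API_KEY=ollama

DEFAULTS = {
    "MODEL_NAME": "mixtral",
    "OPENAI_URL": "http://localhost:11434/v1",
    "OPENAI_API_KEY": "ollama",  # placeholder; Ollama ignores the key
}


def load_settings(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Read each setting from the environment, falling back to its default."""
    return {key: env.get(key, default) for key, default in DEFAULTS.items()}


settings = load_settings({"MODEL_NAME": "llama2"})
print(settings["MODEL_NAME"], settings["OPENAI_URL"])  # llama2 http://localhost:11434/v1
```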

User input

The application’s sidebar is designed to collect user inputs, such as the model name, temperature settings, and the document to be analyzed. This design allows for a customizable experience, where users can adjust parameters according to their needs and upload PDF files directly into the application.

Once a document is uploaded, users can specify the range of pages they wish to include in their analysis, further refining the scope of the summarization or query. This functionality ensures that the application’s output is tailored to the user’s precise requirements.

with st.sidebar:

    model_name = st.text_input("Model name", value=MODEL_NAME)

    temperature = st.slider("Temperature", value=0.1, min_value=0.0, max_value=1.0)

    st.subheader("Upload a PDF file")
    file = st.file_uploader("Upload a PDF file", type=["pdf"])
    if file:
        st.write("File uploaded successfully!")

    st.subheader("Page range")

    st.write(
        "Select page range. Pages are numbered starting at 0. For end page, you can also use negative numbers to count from the end, e.g., -1 is the last page, -2 is the second to last page, etc."
    )
    col1, col2 = st.columns(2)
    with col1:
        start_page = st.number_input("Start page:", value=0, min_value=0)
    with col2:
        end_page = st.number_input("End page:", value=-1)

    st.subheader("Query type")

    query_type = st.radio("Select the query type", ["Summarize", "Query"])

If the user selects the “Query” option, we also need to get the query from the user. We want this in the main body of the page, so it will be displayed outside of the with st.sidebar block.

if query_type == "Query":
    user_query = st.text_area(
        "User query", value="What is the data used in this analysis?"
    )

Helper functions

The core of our application lies in its ability to perform summarization and querying tasks based on the user's inputs. To this end, we have abstracted the logic into helper functions that manage the loading of PDF files, the construction of prompts, and the invocation of the appropriate LangChain processes. The Streamlit-facing helpers, including run_query, are defined in documents_llm/st_helpers.py and imported into the Streamlit application.

The first helper function, save_uploaded_file, handles the file upload process, saving the uploaded PDF file to a temporary location for processing. Streamlit keeps uploaded files in memory, but we need to save them to disk to work with LangChain.

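from pathlib import Path


def save_uploaded_file(
    uploaded_file: "UploadedFile", output_dir: Path = Path("/tmp")
) -> Path:
    # Build the destination path from the uploaded file's original name
    output_path = Path(output_dir) / uploaded_file.name
    output_path.parent.mkdir(parents=True, exist_ok=True)
    # Write the in-memory bytes to disk so LangChain's PDF loader can read them
    with open(output_path, "wb") as f:
        f.write(uploaded_file.getbuffer())
    return output_path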
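Writing to /tmp is a Unix convention, and two uploads with the same file name overwrite each other there. A variation (not the repository's code) uses the standard library's tempfile.mkdtemp, which creates a fresh directory in the platform's temporary folder for each upload. The function name is hypothetical; with Streamlit, uploaded_file.getvalue() returns the uploaded bytes.

```python
import tempfile
from pathlib import Path


def save_bytes_to_temp(name: str, data: bytes) -> Path:
    """Write data to a new temporary directory and return the file's path."""
    output_dir = Path(tempfile.mkdtemp(prefix="pdf_upload_"))
    # Keep only the final path component so a crafted name cannot escape output_dir
    output_path = output_dir / Path(name).name
    output_path.write_bytes(data)
    return output_path


# In the Streamlit helper (sketch):
# file_path = save_bytes_to_temp(uploaded_file.name, uploaded_file.getvalue())
path = save_bytes_to_temp("paper.pdf", b"%PDF-1.4 example")
print(path.name)  # paper.pdf
```

Once the PDF has been loaded, the directory can be removed with shutil.rmtree(path.parent).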

With this function in place, we can now see how the application’s main logic is structured to handle document loading, summarization, and querying based on user inputs:

def run_query(
    uploaded_file: "UploadedFile",
    summarize: bool,
    user_query: str,
    start_page: int,
    end_page: int,
    model_name: str,
    openai_api_key: str,
    openai_url: str,
    temperature: float,
) -> str:
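    # Saves the uploaded file to a temporary location, loads the PDF, and deletes the file
    # (load_pdf, summarize_document and query_document come from the documents_llm package)
    st.write("Saving the uploaded file...")
    file_path = save_uploaded_file(uploaded_file, output_dir=Path("/tmp"))
    st.write("Loading the document...")
    docs = load_pdf(file_path, start_page=start_page, end_page=end_page)
    file_path.unlink()  # the pages are now in memory, so the temporary copy can go

    if summarize:
        st.write("Summarizing the document...")
        return summarize_document(
            docs,
            model_name=model_name,
            openai_api_key=openai_api_key,
            base_url=openai_url,
            temperature=temperature,
        )
    st.write("Querying the document...")
    return query_document(
        docs,
        user_query=user_query,
        model_name=model_name,
        openai_api_key=openai_api_key,
        base_url=openai_url,
        temperature=temperature,
    )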

Note that the st.write statements are used to provide feedback to the user during the processing of the document. Because this function is called inside a with st.status block (see below), the user will see these messages inside the status widget as the document is being processed.

Executing summarization and querying operations

Finally, the rest of the Streamlit application is dedicated to executing the summarization and querying operations based on the user’s inputs. We have one button for running the query and displaying the results or an error message:

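if st.button("Run"):
    result = None
    start = time.time()
    if file is None:
        st.error("Please upload a file.")
    else:
        with st.status("Running...", expanded=True) as status:
            try:
                result = run_query(
                    uploaded_file=file,
                    summarize=query_type == "Summarize",
                    user_query=user_query if query_type == "Query" else "",
                    start_page=start_page,
                    end_page=end_page,
                    model_name=model_name,
                    # Ollama ignores the key; "ollama" is a placeholder default
                    openai_api_key=os.getenv("OPENAI_API_KEY", "ollama"),
                    openai_url=OPENAI_URL,
                    temperature=temperature,
                )
                status.update(label="Done!", state="complete", expanded=False)

            except Exception as e:
                status.update(label="Error", state="error", expanded=False)
                st.error(f"An error occurred: {e}")
                result = ""

        if result:
            with st.container(border=True):
                st.write(result)
            st.info(f"Time taken: {time.time() - start:.2f} seconds", icon="⏱️")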

Once the app is complete, you can launch it by passing the path of the Streamlit script to streamlit run:

streamlit run <path to the Streamlit script>

Final thoughts

By integrating Streamlit into our project, we’ve created an accessible and powerful tool that bridges the gap between complex AI models and end-users seeking to extract valuable insights from PDF documents.

This application demonstrates a practical use of AI for document analysis, but it remains very simple. I hope it inspires you to explore AI-driven document processing further, whether for research, business, or personal use. The combination of LangChain, Ollama, and Streamlit provides a solid foundation for building more sophisticated applications on top of large language models.