PDF-GPT

This article will demonstrate a model for building highly optimized, cost-effective and content-aware chatbots for querying PDF or text files. We’re going to setup a basic API that will allow us to link a PDF file and ask natural language queries related to the information in the provided document.

For the sake of simplicity, we’re building this project specifically for PDF files, but the model and the architecture described can be used for any other types of documents (basically, anything that can be parsed as textual data).

Let’s dive in 🚀

The Plan 🗺️

What are we trying to achieve? And how will we achieve it?

I like to start with a vision for the end-goal and go backwards, breaking down what needs to be done.

Without adieu, here’s what the API should do when completed:

API Endpoints 🕸️

Using the create endpoint, user provides URL to the PDF file to initiate the creation of a vector store for the data in that file. This vector store will then be stored in Pinecone and will be queried when user makes queries.

    /create?document_url={document_url}

If an index already created, user can use the load endpoint to initiate the process of loading the existing vector store from Pinecone

    /load

User can also add more documents or source data by using the update endpoint

  /update?document_url={document_url}

To make a query and get response, we query endpoint:

    /query?q={query text}

Primary Tasks 🎯

First part: generate embeddings for the source data and store in vector database

processing source file and creating vector store

Second part: Handling query made by user (once vector store for source data has been created or loaded)

Setup & Prereqs ✅

If you’d like to have a look at the code or even deploy your own version of this project locally or remotely, here is the link to project’s Github repository which also lists instructions of usage.

Python is the choice of programming language used for the project since a lot of machine learning tools, libraries and frameworks natively support Python.

Tools we would be using primarily:

Tool	Purpose
LlamaIndex	Create index for supplied source data (we later run the query on an index). Creation of such indices handle context storage, prompt limitations and text-splitting under the hood. We’ll see later why this is useful.
Open AI	Provides API for embeddings generation and chat completion
Pinecone	Hosted, cloud-native vector database with simple API
Fast API	Framework to quickly build simple yet robust RESTful API in Python
Uvicorn	An ASGI web server implementation for Python

Process Flow 🌊

The most crucial thing for this API to work is the vector store storing embeddings for our source data (PDF file), and the below diagram displays the basic flow for handling the create and update endpoints.

PDF-GPT create and update endpoint process flow

Let’s break each major step down and study how it works?

Request Initiation: User send a GET or POST request on create or update endpoint and provide URL to the PDF document
Parsing Source Data:
1. Download source document locally
2. Parse textual data out of the document
Embeddings Generation:
1. Use LlamaIndex to create an index for the vector store to be created
2. Using Langchain make the API call to Open AI to generate embeddings for the source textual data
3. Store the vector embeddings on Pinecone
  - if create endpoint being used: delete existing Pinecone index (index name given in params.py), and re-create new index
  - if update endpoint being used: update the existing Pinecone index
4. Store vector embeddings in-memory using LlamaIndex’s SimpleVectorStore
Chatbot Creation: Make the index available for querying

Does it work? 🤔

We’ll test the efficacy of the above modal, using a sample PDF document. You can find the document here: SOP-for-Quality-Improvement.pdf

Please note that this document is an excrept from the original document that was sourced from internet. Original source: SOP for Quality improvement - Quality Improvement Secretriat - MOHFW

Sample of source data:

Test query 1

You’ll notice in the response to the first test query that the response has been generated entirely from the source data and in natural language.

Part of original source document:

portion of original source document

This credibly proves that our program was able to extract a relevant portion of data from all of our source data that related most to the query being asked, and fed it to the chat completion (ChatGPT) model to generate a natural language response 🎉

Test query 2

Test query 3

Let’s take it up a notch and attempt a query which would require processing of a substantial portion of the source document:

Happy Ending 😇

This project demonstrated a frictionless and efficient approach to building highly-optimized API for interacting with and extracting information from PDF documents. The resulting API application provides the experience alike a content-aware chatbot, specifically trained to return responses from provided source data using its cutting-edge natural language processing capabilities.

The illustrated PDF-GPT modal parses the source data, creates and manages indices using LlamaIndex, generates embeddings using Open AI, and stores the embeddings in the Pinecone vector database. We showcased the efficacy of the model by testing it with a sample PDF document and querying relevant information from the source data.

With its optimized architecture and integration with powerful tools like LlamaIndex, Langchain, Open AI, Pinecone, Fast API, and Uvicorn, PDF-GPT offers a reliable and user-friendly modal for building applications which can enable interaction with documents using natural language.

Though we only used PDF documents, the approach implemented by PDF-GPT can be used to build modals for processing any type of textual source, like other document types (word, markdown, etc.), webpages (textual data extracted using web scraping), or data from a database (textual data extraction using database connectors). Not going to let out too much details at the moment but I might already be in the process of building something like this.. more deets coming soon 😉

Until then, I hope you gained some insight from this project, and I wish you best of the luck for your endeavours! Looking forward to what you might build with this! 😄

PDF-GPT: LlamaIndex + Open AI + Pinecone + Fast API