
In today’s AI-driven world, the ability to extract knowledge from documents and interact with it naturally is becoming a core capability across industries. This is particularly true for domains like manufacturing, healthcare, government, and enterprise IT, where user manuals, SOPs, guides, and onboarding documents often contain a rich mix of text and images. These images aren’t decorative; they’re essential to comprehension. They visually reinforce the step-by-step instructions and serve as irreplaceable references.
Yet, traditional Retrieval-Augmented Generation (RAG) systems treat documents as purely textual data. This blog explores a next-gen RAG architecture where image-aware parsing, semantic chunking, and natural language Q&A come together to form a powerful, image-augmented knowledge system.
We’ll walk through the complete pipeline using an AWS Bedrock tutorial PDF as a case study, but the same principles apply to any instructional document, from product manuals to safety guides.
We’ll create an end-to-end pipeline that:
- Extracts both text and images from a PDF
- Preserves their sequence and contextual importance
- Generates image captions using Azure OpenAI’s GPT-4 vision API
- Embeds the content semantically into a vector store (ChromaDB)
- Supports natural language Q&A, returning answers with both text and relevant images
Step 1: Parsing the PDF (Text + Images)
The first step was to parse the PDF and extract its full content. What made this task challenging was the presence of instructional steps illustrated with images. Unlike simple parsers, we needed one that respected the visual narrative: text followed by its relevant images, and vice versa.
The below script (pdf_to_text.py):
- Extracts text, along with bounding boxes, using PyMuPDF
- Extracts and saves images from each page
- Sends each image to Azure OpenAI Vision to generate a description
To maintain the correct sequence and context of images within the extracted content, we use inline image tagging. Each image is annotated using a custom tag format that includes both the image path and its AI-generated description. This format ensures that every image can be easily retrieved and displayed correctly on the user interface during question answering or content rendering. The image path points to where the file is stored, while the description, generated by Azure OpenAI, provides contextual meaning.
Tag format:
[[IMAGE_DATA_START]] Image URL: Directory_Name/page_1_image_1.png, Description: Image Description[[IMAGE_DATA_END]]
This tagging approach is critical to preserving visual alignment with the corresponding text, enabling accurate and seamless user experiences.
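For example, on the retrieval side a tag in this format can be parsed back into its image path and description with a small helper. This is a minimal sketch; the regex and function name are ours and not part of the pipeline scripts:

```python
import re

# Matches the inline tag format defined above; assumes paths contain no commas.
TAG_PATTERN = re.compile(
    r"\[\[IMAGE_DATA_START\]\]\s*Image URL:\s*(?P<path>.+?),\s*"
    r"Description:\s*(?P<description>.*?)\s*\[\[IMAGE_DATA_END\]\]",
    re.DOTALL,
)

def extract_image_tags(text):
    """Return (image_path, description) pairs for every inline image tag in `text`."""
    return [(m.group("path"), m.group("description")) for m in TAG_PATTERN.finditer(text)]
```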
Script: pdf_to_text.py
import fitz
import os
import io
import base64
from mimetypes import guess_type
from PIL import Image
import openai
from openai import AzureOpenAI
# Configure Azure OpenAI
openai.api_type = "azure"
api_base = 'API_BASE'
api_key="API_KEY"
deployment_name = 'gpt-4'
api_version = 'API_VERSION'
client = AzureOpenAI(
    api_key=api_key,
    api_version=api_version,
    base_url=f"{api_base}openai/deployments/{deployment_name}",
)
def local_image_to_data_url(image_path):
    mime_type, _ = guess_type(image_path)
    if mime_type is None:
        mime_type = 'application/octet-stream'
    with open(image_path, "rb") as image_file:
        base64_encoded_data = base64.b64encode(image_file.read()).decode('utf-8')
    return f"data:{mime_type};base64,{base64_encoded_data}"
def generate_image_description(image_url, image_path):
    try:
        prompt = """You are an image analysis expert. You will be provided with various types of images extracted from documents like research papers, technical blogs, and more.
Your task is to generate concise, accurate descriptions of the images without adding any information you are not confident about. Write only a short description.
Important Guidelines:
* Prioritize accuracy: If you are uncertain about any detail, state "None" instead of guessing.
* Avoid hallucinations: Do not add information that is not directly supported by the image.
* Consider context: If the image is a screenshot or contains text, incorporate that information into your description.
"""
        response = client.chat.completions.create(
            model=deployment_name,
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}}
                ]}
            ],
            max_tokens=1000
        )
        return response.model_dump()["choices"][0]["message"]["content"].strip()
    except Exception as e:
        print(f"Error generating description for {image_path}: {e}")
        return "No description available."
def parse_pdf(pdf_path, text_output_file="output.txt", image_output_dir="output_images"):
    pdf_document = fitz.open(pdf_path)
    os.makedirs(image_output_dir, exist_ok=True)
    with open(text_output_file, "w", encoding="utf-8") as text_file:
        for page_num in range(len(pdf_document)):
            page = pdf_document.load_page(page_num)
            elements = []
            blocks = page.get_text("dict")["blocks"]
            for block in blocks:
                if "lines" in block:
                    for line in block["lines"]:
                        for span in line["spans"]:
                            elements.append({"type": "text", "content": span["text"], "bbox": block["bbox"]})
            images = page.get_images(full=True)
            processed_images = set()
            for img_index, img in enumerate(images):
                xref = img[0]
                if xref in processed_images:
                    continue
                processed_images.add(xref)
                base_image = pdf_document.extract_image(xref)
                image_bytes = base_image["image"]
                image = Image.open(io.BytesIO(image_bytes))
                image_filename = f"page_{page_num + 1}_image_{img_index + 1}.png"
                image_path = os.path.join(image_output_dir, image_filename)
                image.convert("RGB").save(image_path)
                image_url = local_image_to_data_url(image_path)
                description = generate_image_description(image_url, image_path)
                img_rects = page.get_image_rects(xref)
                if img_rects:
                    img_rect = img_rects[0]
                    elements.append({"type": "image", "content": f"[[IMAGE_DATA_START]] Image URL: {image_path}, Description: {description} [[IMAGE_DATA_END]]", "bbox": (img_rect.x0, img_rect.y0, img_rect.x1, img_rect.y1)})
            elements.sort(key=lambda e: e["bbox"][1])
            last_line_was_blank = False
            for element in elements:
                if element["type"] == "text":
                    line = element["content"].strip()
                    if line:
                        text_file.write(line + "\n")
                        last_line_was_blank = False
                    elif not last_line_was_blank:
                        text_file.write("\n")
                        last_line_was_blank = True
                elif element["type"] == "image":
                    text_file.write(element["content"] + "\n")
                    last_line_was_blank = False
            if not last_line_was_blank:
                text_file.write("\n")
    print(f"Text and images saved: {text_output_file}, {image_output_dir}")
pdf_path = "AWS_Bedrock_Blog.pdf"
text_output = "AWS_Bedrock_Blog_1.txt"
image_output = "AWS_Bedrock_Blog_images_1"
parse_pdf(pdf_path, text_output, image_output)
Output:
By the end of this step, we have:
- A .txt (AWS_Bedrock_Blog_1.txt) file containing a clean flow of text and inline image tags.
- A folder of extracted and labeled image files (AWS_Bedrock_Blog_images_1).
This setup ensures the preservation of context when moving to the next stages.
Step 2: Embedding and Storing with ChromaDB
Most RAG pipelines split text into fixed-size chunks (e.g., every 500 characters), but this can break meaning, especially around instructional steps paired with visuals. Instead, we used the SemanticChunker from LangChain Experimental. It splits based on semantic shifts using the standard deviation of embedding similarities, ensuring contextually complete chunks.
We used text-embedding-ada-002 from Azure OpenAI to generate embeddings, and then stored everything in ChromaDB.
The below script (semantic_embedding.py):
- Uses Azure OpenAI’s text-embedding-ada-002
- Leverages SemanticChunker to split the document naturally
- Saves vector chunks into ChromaDB with associated metadata
Script: semantic_embedding.py
__import__('pysqlite3')
import sys
sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')
from langchain_openai import AzureOpenAIEmbeddings
from langchain_community.vectorstores.chroma import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_experimental.text_splitter import SemanticChunker
def custom_text_loader(file_path):
    return TextLoader(str(file_path))
embeddings = AzureOpenAIEmbeddings(
    model="text-embedding-ada-002",
    azure_endpoint='AzureOpenAI Endpoint',
    api_key="API_KEY",  # Use environment variable for API key
    openai_api_version='OPENAI_API_VERSION'
)
vector_store = Chroma(
    collection_name="AWS_Bedrock_Blog_semantic_standard_deviation",  # Collection name for ChromaDB
    embedding_function=embeddings,
    persist_directory="./AWS_Bedrock_Blog_1_semantic_standard_deviation",  # Directory name
)
filename = "AWS_Bedrock_Blog_1.txt"  # The text file to be chunked
with open(filename) as f:
    text = f.read()
text_splitter = SemanticChunker(embeddings, breakpoint_threshold_type="standard_deviation")
chunks = text_splitter.create_documents([text])
# Process chunks to extract plain text
if isinstance(chunks[0], dict):  # If chunks are dictionaries
    chunks = [chunk['text'] for chunk in chunks if 'text' in chunk]
elif hasattr(chunks[0], 'page_content'):  # LangChain Document objects
    chunks = [chunk.page_content for chunk in chunks]
# Convert all chunks to strings
chunks = [str(chunk) for chunk in chunks]
# Validate chunks
print("Processed Chunks:", chunks)
print("Length of Processed Chunks:", len(chunks))
print("Are all chunks strings?:", all(isinstance(chunk, str) for chunk in chunks))
# Generate embeddings
embeddings_list = embeddings.embed_documents(chunks)
# Add texts to the vector store
vector_store.add_texts(
    texts=chunks,
    metadatas=[{"source": filename, "chunk_id": f"chunk-{i}"} for i in range(len(chunks))],
    ids=[f"chunk-{i}" for i in range(len(chunks))]
)
print(f"Embeddings stored successfully")
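As an optional sanity check before wiring up the full Q&A flow, you can query the store you just populated directly; the query string here is only an example:

```python
# Quick retrieval test against the freshly populated collection
results = vector_store.similarity_search("How do I request access to a model in Amazon Bedrock?", k=2)
for doc in results:
    print(doc.metadata["chunk_id"], doc.page_content[:200])
```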
Why Use Semantic Chunking Instead of Fixed Sizes?
When you chunk text based purely on character length (e.g., every 500 characters), you risk:
- Cutting off mid-sentence
- Splitting related instructions or visuals
- Losing meaning across boundaries
Instead, SemanticChunker analyzes the embedding vectors as it processes the text and breaks content at natural meaning shifts, using the standard deviation of similarity to decide where one idea ends and another begins.
This results in semantically intact chunks that perform far better during question answering.
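For intuition, here is a simplified illustration of the standard-deviation breakpoint idea. This is not LangChain’s exact implementation; `embed` stands for any embedding function (e.g. `embeddings.embed_documents`), and the multiplier `n_std` is an assumption:

```python
import numpy as np

def semantic_breakpoints(sentences, embed, n_std=3.0):
    """Illustrative splitter: break where the distance between consecutive sentences spikes."""
    vectors = np.array(embed(sentences))
    # Cosine distance between each sentence embedding and the next one
    norms = np.linalg.norm(vectors, axis=1)
    sims = np.sum(vectors[:-1] * vectors[1:], axis=1) / (norms[:-1] * norms[1:])
    distances = 1 - sims
    # Split wherever the distance exceeds mean + n_std * standard deviation
    threshold = distances.mean() + n_std * distances.std()

    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # semantic shift: start a new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```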
Step 3: Ask Questions, Get Answers (With Images!)
Once the embeddings are stored in the vector database, the next step is to ask questions. We tested queries using Gemini for LLM inference (you could also plug in GPT, Claude, etc.).
The query_chromadb.py script (sketched after this list):
- Accepts a user question
- Embeds and searches the most relevant chunks
- Constructs a markdown response, keeping image tags in sequence.
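The full query_chromadb.py is not reproduced here; a minimal sketch of the same flow might look like this. The Gemini model name, prompt wording, and `k` value are assumptions:

```python
__import__('pysqlite3')
import sys
sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')

import google.generativeai as genai
from langchain_openai import AzureOpenAIEmbeddings
from langchain_community.vectorstores.chroma import Chroma

embeddings = AzureOpenAIEmbeddings(
    model="text-embedding-ada-002",
    azure_endpoint='AzureOpenAI Endpoint',
    api_key="API_KEY",
    openai_api_version='OPENAI_API_VERSION'
)

# Reopen the persisted collection created in Step 2
vector_store = Chroma(
    collection_name="AWS_Bedrock_Blog_semantic_standard_deviation",
    embedding_function=embeddings,
    persist_directory="./AWS_Bedrock_Blog_1_semantic_standard_deviation",
)

genai.configure(api_key="GEMINI_API_KEY")
llm = genai.GenerativeModel("gemini-1.5-flash")  # any Gemini chat model works here

def answer_question(question, k=4):
    # Embed the question and pull the k most similar chunks
    docs = vector_store.similarity_search(question, k=k)
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = (
        "Answer the question using only the context below. "
        "Keep every [[IMAGE_DATA_START]]...[[IMAGE_DATA_END]] tag that is relevant "
        "to the answer, in its original position.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm.generate_content(prompt).text

print(answer_question("How to access the model?"))
```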
Testing Results:
1. How to access the model?
Amazon Bedrock users must request access to models for text, chat, and image generation before they can use them. To gain access to the models you need for your Amazon Bedrock projects, follow these steps:
- Once you have logged in using your AWS credentials, On the left side navigation panel, locate the “Model access” link, or visit the “Edit model access” page as shown below. Then you need to select the checkbox next to the model you want to add access to. For Anthropic models, you must also request access when you click the Request Access button. Models are not available as a default setting in Amazon Bedrock. [[IMAGE_DATA_START]] Image URL: AWS_Bedrock_Blog_images_1/page_2_image_1.png [[IMAGE_DATA_END]]
- Select Confirm to add access to any third-party models through Amazon Marketplace. Note: Your use of Amazon Bedrock and its models is subject to the seller’s pricing terms, EULA and the Amazon Bedrock service terms.
- To complete the process, click the “Save Changes” button located in the lower right corner of the page, as shown in the below image. Please note that it may take several minutes to save changes to the Model access page. Models for which access is granted will appear as “Available” on the Model access page under “Access status.”
[[IMAGE_DATA_START]] Image URL: AWS_Bedrock_Blog_images_1/page_3_image_1.png [[IMAGE_DATA_END]]
In our blog, we will utilize the Jurassic-2 Ultra model to test the chat model and text generation, and for image generation, we will use the Stable Diffusion model. For this, you need to request access to the Jurassic-2 and Stable Diffusion model as shown below. [[IMAGE_DATA_START]] Image URL: AWS_Bedrock_Blog_images_1/page_3_image_2.png [[IMAGE_DATA_END]]
2. How to use the text generation model using Playground?
– Start by navigating to the ‘Text’ section on the left-hand panel of the dashboard. Within this section, you’ll find a variety of available models. [[IMAGE_DATA_START]] Image URL: AWS_Bedrock_Blog_images_1/page_5_image_2.png [[IMAGE_DATA_END]]
– Choose the model that best suits your specific requirements and objectives from the options provided. Here, we have selected the “Jurassic-2 Ultra” model of AI21 labs.
– Once we’ve selected the model, we need to enter the prompt, which will act as the input or question to the model. We can also set parameters from the right tab as per our requirements. Then initiate the text generation process by clicking the ‘Run’ button. Below is an instance of a text generation sample:
[[IMAGE_DATA_START]] Image URL: AWS_Bedrock_Blog_images_1/page_6_image_1.png [[IMAGE_DATA_END]]
3. How to use the image generation model using Playground?
– To use the Image generation service, you need to select the “Image” option from the left panel of the dashboard.
– Now that we’ve entered the Image generation playground, it’s time to provide the prompt. You can input the prompt that corresponds to the image you’d like to generate. Here’s an example of a text-to-image generation task:
[[IMAGE_DATA_START]] Image URL: AWS_Bedrock_Blog_images_1/page_6_image_2.png [[IMAGE_DATA_END]]
4. How can I use the text generation model using the API?
– To use the text generation model using the API, you need to follow the below steps:
– Locate the service you want to access and click the “View API Request” button, which will reveal the request body. The body contains essential code and parameters for programmatic interaction, as shown in the image. [[IMAGE_DATA_START]] Image URL: AWS_Bedrock_Blog_images_1/page_7_image_1.png [[IMAGE_DATA_END]]
[[IMAGE_DATA_START]] Image URL: AWS_Bedrock_Blog_images_1/page_7_image_2.png [[IMAGE_DATA_END]]
– For text generation, copy the provided code and add it into the Python script as described below:
```python
import boto3
import json

prompt_data = """
Write a one-liner 90s-style B-movie horror/comedy pitch about a giant man-eating Python, with a hilarious and surprising twist.
"""

bedrock = boto3.client(
    service_name="bedrock-runtime",
    region_name="YOUR_REGION",
    aws_access_key_id="YOUR_AWS_ACCESS_KEY",
    aws_secret_access_key="YOUR_AWS_SECRET_ACCESS_KEY"
)

payload = {
    "prompt": prompt_data,
    "maxTokens": 512,
    "temperature": 0.8,
    "topP": 0.8,
}

body = json.dumps(payload)
model_id = "ai21.j2-ultra-v1"  # You can set different model ids

response = bedrock.invoke_model(
    body=body,
    modelId=model_id,
    accept="*/*",
    contentType="application/json",
)

response_body = json.loads(response.get("body").read())
response_text = response_body.get("completions")[0].get("data").get("text")
print(response_text)
```
Below is the text generated from the above prompt:
[[IMAGE_DATA_START]] Image URL: AWS_Bedrock_Blog_images_1/page_8_image_1.png [[IMAGE_DATA_END]]
5. What are the input and output for image generation?
– To use the Image generation service, you need to select the “Image” option from the left panel of the dashboard.
– Now that we’ve entered the Image generation playground, it’s time to provide the prompt. You can input the prompt that corresponds to the image you’d like to generate. Here’s an example of a text-to-image generation task:
[[IMAGE_DATA_START]] Image URL: AWS_Bedrock_Blog_images_1/page_6_image_2.png [[IMAGE_DATA_END]]
…
Below is the generated image
[[IMAGE_DATA_START]] Image URL: AWS_Bedrock_Blog_images_1/page_11_image_1.png [[IMAGE_DATA_END]]
[[IMAGE_DATA_START]] Image URL: AWS_Bedrock_Blog_images_1/page_11_image_2.png [[IMAGE_DATA_END]]
To display the images in the user interface, you can extract the image paths from the Image URL fields in the answer.
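One lightweight way to do this is to rewrite the tags into standard markdown image syntax before rendering. This is a sketch; the helper name is ours, and the description part is treated as optional because retrieved chunks may carry the tag without it:

```python
import re

# Matches inline image tags with or without a Description field
IMAGE_TAG = re.compile(
    r"\[\[IMAGE_DATA_START\]\]\s*Image URL:\s*(?P<path>[^,\]]+?)"
    r"(?:,\s*Description:\s*(?P<desc>.*?))?\s*\[\[IMAGE_DATA_END\]\]",
    re.DOTALL,
)

def answer_to_markdown(answer: str) -> str:
    """Replace inline image tags with markdown image syntax for UI rendering."""
    return IMAGE_TAG.sub(
        lambda m: f"![{(m.group('desc') or 'figure').strip()}]({m.group('path').strip()})",
        answer,
    )
```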
For testing, we used an AWS Bedrock tutorial PDF. While the content was technical, the results were clear: the system answered context-rich questions with full visual guidance.
But here’s the best part: it works with any PDF, not just this one. Manuals, research papers, onboarding playbooks, you name it.
Conclusion:
Traditional RAG workflows are powerful but limited when applied to real-world documents where “seeing is understanding.” In domains like technical support, manufacturing, compliance, or onboarding, visuals aren’t just helpful, they’re essential. Unfortunately, most RAG systems fail to account for this visual context, treating documents as plain text and losing critical meaning along the way.
In this blog, we demonstrated how to go beyond text and build visually aware, semantically rich retrieval systems from PDFs. By preserving the order and relationship between text and images through inline tagging and AI-generated descriptions, we enable a more accurate, intuitive, and complete question-answering experience.
This architecture ensures:
- Precise step-by-step answers
- Visual continuity throughout the document
- Minimal hallucination risk, as every image description is grounded in real, parsed content
We’re excited about the potential of this approach to reshape how organizations interact with complex, image-heavy documents, turning static manuals into dynamic, interactive knowledge agents.
If you’re exploring how to extract more value from your documentation, we’d love to collaborate.