Introduction

In today’s digital age, extracting meaningful insights from PDFs is a common task. Whether it’s for academic research, business analysis, or everyday information retrieval, we often rely on advanced models to perform these tasks efficiently. This blog compares four popular models—Phi-3, GPT-3.5, Gemini 1.5, and Claude-2.1—in handling various types of data within PDFs, including text, tables, and graphs.

Understanding which model excels in different scenarios can save time and improve accuracy. For instance, some models might be better at interpreting and summarizing textual content, while others might have an edge in extracting and organizing data from tables or analyzing complex graphs. By examining their performance, we can choose the right tool for our specific needs.

Our comparison will focus on practical aspects and real-world applications. We’ll look at how each model handles different data types and assess their strengths and weaknesses. Whether you’re a student, a professional, or simply curious about AI capabilities, this blog will provide valuable insights to help you make informed decisions about using these powerful tools.

In this blog, we have used a publicly available annual report to perform Q&A on it. You can download the report using the Google Drive link below:
https://drive.google.com/file/d/1H6Ll7Fx8fTWOfZunDGG0ZLCK5ihRGSap/view?usp=sharing

Let’s move on to some demo questions we tested with all of these models.

Testing questions

We have tested the Phi-3-Mini-4K-Instruct, GPT-3.5 Turbo, Claude-2.1, and Gemini 1.5 models on various questions about text, table, and graphical data on Google Colab. Below are the test questions, the responses from each model, and our observations.
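In each notebook the flow follows the same retrieval-style pattern: the PDF text is split into chunks, the chunk most relevant to the question is fetched, and that chunk is passed to the model as context (the model answers below refer to “the provided context” and to fetched chunks). The sketch below illustrates that pattern in plain Python with a naive word-overlap scorer; the helper names and the scoring method are simplified assumptions, not the exact Colab code:

```python
# Simplified PDF Q&A retrieval flow: chunk the text, pick the chunk that
# best matches the question, and build a context-grounded prompt for the
# model. A real notebook would use embeddings instead of word overlap.

def split_into_chunks(text: str, max_words: int = 300) -> list[str]:
    """Split raw PDF text into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def top_chunk(chunks: list[str], question: str) -> str:
    """Pick the chunk sharing the most words with the question (naive scoring)."""
    q_words = set(question.lower().split())
    return max(chunks, key=lambda c: len(q_words & set(c.lower().split())))

def build_prompt(context: str, question: str) -> str:
    """Assemble the prompt sent to whichever model is being tested."""
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

report_text = (
    "TOP LINES OF TRADE convenience stores 16.4% automotive service 15.6% "
    "forward-looking statements use words such as anticipate assume believe"
)
question = "What percentage of the lines of trade is for convenience stores"
chunks = split_into_chunks(report_text, max_words=10)
prompt = build_prompt(top_chunk(chunks, question), question)
print(prompt)
```

The same prompt string can then be handed to any of the four models, which is what makes the side-by-side comparison below straightforward.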

Question: What percentage is contained by Convenience Stores in top Line Trade?
Expected Response: 16.4%

Phi-3 response: 16.4% 

GPT-3.5 Turbo response: Convenience Stores account for 16.4% of the annual base rent in the top lines of trade as of December 31, 2023.

Claude-2.1 response: According to the first table titled “Top Lines of Trade”, Convenience stores represent 16.4% of the annualized base rent as of December 31, 2023.

Gemini 1.5 response: Convenience Stores make up 16.4% of the top lines of trade. This information is found in the section titled “TOP LINES OF TRADE” which shows the percentage of annualized base rent as of December 31, 2023.  The data indicates that Convenience Stores are the largest category in the top lines of trade. The next closest category is Automotive Service, which represents 15.6% of the total.

Observation:
We asked this question about the graphical data and observed that all four models provided the correct answer. Phi-3 simply stated the exact figure, while GPT-3.5 Turbo, Claude-2.1, and Gemini 1.5 provided more detailed answers.

Question: Which company’s name comes between 2.7% and 3.1% in Top Tenants?
Expected Response: GPM Investments (C-stores), Flynn Restaurant Group (Taco Bell/Arby’s)

Phi-3 response: The company’s name that comes between 2.7% to 3.1% in the Top Tenants is JILL FUSSELL. However, the percentages provided in the context do not directly correspond to the positions of the tenants in the Top Tenants list. The percentages given (2.7%, 2.5%, 2.3%, 4.4%, 4.2%, 3.8%, 3.5%, 3.1%, 3.0%, 2.8%, 2.7%) are related to the distribution of NNN’s annual base rent by tenant industry classification, tenant names, and geographic locations, not the ranking of tenants. Therefore, based on the provided context, we cannot accurately determine the position of JILL FUSSELL in the Top Tenants list.

GPT-3.5 Turbo response: The companies that fall between 2.7% to 3.1% in the Top Tenants list are:
– BJ’s Wholesale Club at 2.8%
– AMC Theatres at 2.7%

Claude-2.1 response: The company with the name that comes between 2.7% and 3.1% in Top Tenants is LA Fitness at 3.1%

Gemini 1.5 response: I’m sorry, but the provided text does not contain information about “Top Tenants” or a list of companies with percentages. Therefore, I cannot answer your question.

Observation:
This question is also based on the graphical data. We observed that Phi-3, GPT-3.5 Turbo, and Claude-2.1 provided wrong answers, while Gemini 1.5 was unable to find the question’s context at all.

Question: In the “Summarizes the Property Portfolio as of December 31” table, what is the total annualized base rent for the year 2021?
Expected Response: $ 713,169,000

Phi-3 response: The total annualized base rent for the year 2021 is $713,169,000.

GPT-3.5 Turbo response: In the “Summarizes the Property Portfolio as of December 31” table, the total annualized base rent for the year 2021 was $713,169,000.

Claude-2.1 response: According to the table that summarizes the Property Portfolio, the total annualized base rent for the year 2021 is $713,169,000. Specifically, it says:
Total annualized base rent $818,749,000 $771,984,000 $713,169,000
So the total annualized base rent for 2021 is $713,169,000.

Gemini 1.5 response: The total annualized base rent for the year 2021 was **$713,169,000**.  This information is found in the first table within the “Property Analysis” section.  The table provides data for 2023, 2022, and 2021. The rightmost column under “Total annualized base rent” shows the figures for each respective year.

Observation:
We asked this question about the tabular data and observed that all four models provided the correct answer. Gemini 1.5, GPT-3.5 Turbo, and Claude-2.1 gave detailed answers with more description, while Phi-3 provided only the short, exact answer.

Question: What percentage of annual base rent in the state of Texas is in the “Diversification of the Property Portfolio” table?
Expected Response: 16.8%

Phi-3 response: 16.8%

GPT-3.5 Turbo response: In the “Diversification of the Property Portfolio” table, Texas accounts for 16.8% of the annual base rent as of December 31, 2023.

Claude-2.1 response: Based on the table that summarizes the diversification of the Property Portfolio by state, Texas accounts for 16.8% of the annual base rent as of December 31, 2023.

Gemini 1.5 response: The provided table shows that Texas represents 16.8% of annual base rent. This figure is based on annualized base rent for all leases in place as of December 31, 2023. The data comes from a table summarizing the diversification of a property portfolio by state. No further information about the specific properties or their lines of trade in Texas is available. Therefore, a more specific breakdown of this percentage by industry within Texas cannot be provided.

Observation:
We asked this question about the tabular data and observed that all four models provided correct answers. The main difference between them lies in the level of detail and length of their answers.

Question: Which words are used by NNN when making forward-looking statements?
Expected Response: words “anticipate,” “assume,” “believe,” “estimate,” “expect,” “intend,” or similar expressions

Phi-3 response: The words used by NNN when making forward-looking statements are “anticipate,” “assume,” “believe,” “estimate,” “expect,” “intend,” and similar expressions.

GPT-3.5 Turbo response: When NNN makes forward-looking statements, they use words like “anticipate,” “assume,” “believe,” “estimate,” “expect,” and “intend.”

Claude-2.1 response: Based on the context provided, the passage states that when NNN uses any of the words “anticipate,” “assume,” “believe,” “estimate,” “expect,” “intend,” or similar expressions, NNN is making forward-looking statements.

The exact quote is:
“Also, when NNN uses any of the words “anticipate,” “assume,” “believe,” “estimate,” “expect,” “intend,” or similar expressions, NNN is making forward-looking statements.”

So the words NNN uses when making forward-looking statements are:
– anticipate
– assume  
– believe
– estimate
– expect
– intend

Gemini 1.5 response: NNN uses words like “anticipate,” “assume,” “believe,” “estimate,” “expect,” “intend,” or similar expressions when making forward-looking statements. These statements are based on current expectations and reasonable assumptions. However, actual results could differ materially from those projected in the forward-looking statements. Forward-looking statements are only valid as of the date they are issued. NNN does not assume any obligation to update them to reflect changed assumptions or unanticipated events.

Observation:
We asked this question about plain text data and observed that Phi-3 and GPT-3.5 Turbo provided correct, concise answers, while Claude-2.1 and Gemini 1.5 provided correct, more detailed answers.

Individual Model Comparison

Phi-3-Mini-4K-Instruct

Phi-3-Mini-4K-Instruct is an open-source model available for download on Hugging Face. You can access it using the link below:
https://huggingface.co/microsoft/Phi-3-mini-4k-instruct

We tested this model on Google Colab with a T4 GPU and 12 GB of RAM, processing a maximum of around 8,000 words at a time. The model is also supported on CPU.

Strengths:
  • Phi-3 handles table data type Q&A well.
  • It also answers text questions clearly and accurately.
Weaknesses:
  • Phi-3 has difficulty with graphical data, often producing inaccurate results despite fetching the correct chunks.
  • It takes approximately 100 seconds to answer one question, noticeably longer than the other models.
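Because our Colab setup could only process around 8,000 words at a time with Phi-3, the retrieved context has to stay within a word budget before it is sent to the model. A small guard like the following, written here as a hypothetical helper rather than a copy of the Colab code, trims the retrieved chunks to fit:

```python
def fit_context(chunks: list[str], max_words: int = 8000) -> str:
    """Concatenate retrieved chunks in order, stopping before the word
    budget is exceeded, so the prompt stays within the model's limit."""
    kept, used = [], 0
    for chunk in chunks:
        n = len(chunk.split())
        if used + n > max_words:
            break
        kept.append(chunk)
        used += n
    return "\n\n".join(kept)

# Example with a tiny budget: only the first chunk fits.
context = fit_context(["one two three", "four five six", "seven"], max_words=4)
print(context)
```

Dropping whole chunks (rather than cutting mid-chunk) keeps each retained passage intact, which matters for table rows that lose their meaning when truncated.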
GPT-3.5 Turbo

OpenAI’s GPT-3.5 Turbo is a powerful language model designed to follow instructions well. It is accessible via API. You can check its pricing structure using the link below:
https://openai.com/api/pricing/

Strengths:
  • GPT-3.5 Turbo provides accurate answers for both tabular and textual questions, though its answers are not overly detailed.
  • It completes tasks quickly, taking an average of 6 seconds to answer one question.
Weaknesses:
  • The GPT-3.5 Turbo model is proprietary and not open-source.
  • It does not provide accurate answers for questions involving graphical data.
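Calling GPT-3.5 Turbo from a notebook is a short API request. The sketch below shows the general shape using the official `openai` Python SDK; the system prompt wording and the `RUN_LIVE` opt-in flag are illustrative assumptions, and a valid API key is required for the live call:

```python
import os

def build_messages(context: str, question: str) -> list[dict]:
    """Build the chat payload for a context-grounded PDF question."""
    return [
        {"role": "system", "content": "Answer strictly from the provided PDF context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]

# The live request is opt-in so the sketch runs without credentials.
if os.environ.get("RUN_LIVE"):
    from openai import OpenAI  # pip install openai

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=build_messages(
            "Texas accounts for 16.8% of annual base rent as of December 31, 2023.",
            "What percentage of annual base rent is in Texas?",
        ),
    )
    print(response.choices[0].message.content)
```

The Claude-2.1 and Gemini 1.5 calls in the later sections follow the same pattern through their own SDKs.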
Claude-2.1

Anthropic’s Claude-2.1 is a reliable language model that is very good at following instructions. It is accessible via API. You can check its pricing structure using the link below:
https://www.anthropic.com/api

Strengths:
  • Claude-2.1 offers detailed answers for both table and text questions, enhancing comprehension.
  • It provides clear, step-by-step explanations for text questions, facilitating understanding.
Weaknesses:
  • Claude-2.1 requires payment for usage, as it is not available for free.
  • It struggles to effectively handle graphical Q&A, impacting its versatility.
Gemini 1.5

Gemini 1.5, developed by Google, is a sophisticated language model known for following instructions well. It is accessible via API; you can find its pricing structure at the link below:
https://ai.google.dev/pricing

Strengths:
  • Gemini 1.5 excels at answering questions about both table and text data, enhancing its versatility.
  • It responds remarkably quickly, taking an average of 5 seconds to answer one question.
  • It includes the relevant reference page content in its responses, ensuring transparency and verifiability of the information provided.
Weaknesses:
  • Gemini 1.5 requires payment for usage, as it is not available for free.
  • It might encounter limitations with large PDF files for Q&A, potentially requiring an increased quota limit.
  • It has trouble with Q&A involving graphical data, which affects its flexibility.

Observation

While performing Q&A on the above-mentioned PDF using various models, we observed the following behaviour for each model on different data types:

  • Phi-3-Mini-4K-Instruct: Provides accurate answers for text and table data Q&A, but responds slowly and struggles with graphical data.
  • GPT-3.5 Turbo: Fast and accurate, but not open-source and weak at graphical data Q&A.
  • Claude-2.1: Offers clear, detailed explanations for text and table Q&A, but isn’t open-source and performs poorly with graphical data.
  • Gemini 1.5: Handles text and table data Q&A well and responds quickly, but performs poorly with graphical data, isn’t open-source, and may require a higher quota for large PDFs.

Google Colab link for each model

We used the scripts below to perform the Q&A tasks. You can download each script using the links provided.

Phi-3-Mini-4K-Instruct: 
Colab file for the Phi-3-Mini-4K-Instruct

GPT-3.5 Turbo:
Colab file for the GPT-3.5 Turbo

Claude-2.1:
Colab file for the Claude-2.1

Gemini 1.5:
Colab file for the Gemini 1.5

Conclusion

When choosing between the Phi-3, GPT-3.5 Turbo, Gemini 1.5, and Claude-2.1 models for PDF analysis, it’s essential to consider their strengths and weaknesses for each task. Depending on the data type, one model may suit your specific needs better than another. Therefore, selecting the appropriate model depends on your objectives for PDF analysis.
