October 9, 2024

Introduction

In application development, ensuring that data exchanged between systems is properly structured is critical—especially when using formats like JSON, which is widely adopted for data communication. However, getting models to produce valid JSON outputs consistently can be a challenging task. Developers often need to invest significant time in crafting complex prompts to ensure the model includes all the necessary keys and values while avoiding errors like invalid enums or missing fields.

To address these challenges, OpenAI has introduced Structured Outputs—a new feature designed to guarantee that models generate responses that strictly follow the provided JSON Schema. This means developers can now be confident that every response will conform to the correct format, eliminating the need for tedious prompt engineering or worrying about formatting errors.

In this blog, we’ll explore how the new Structured Outputs feature can simplify the process of parsing and extracting data from legal documents. Now, let’s dive in and see how to implement Structured Outputs using Python!

Implementation steps

Step 1:

Install the necessary packages using the command below:

!pip install openai==1.47.1 pydantic==2.9.2 tika==2.6.0

We also need to install the JDK, a dependency of the Tika library. To do this, run the command below in the terminal:

sudo apt-get install -y default-jdk
Step 2:

Next, we need to import the necessary packages and initialize the OpenAI client object to make the API call. To do this, add the lines of code below to your script:

from pydantic import BaseModel
from typing import Optional
from openai import OpenAI
from tika import parser

client = OpenAI(api_key="YOUR_OPENAI_KEY")

Make sure to replace “YOUR_OPENAI_KEY” with your actual OpenAI API key.
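As a safer alternative to hardcoding the key (a common pattern, not required by this tutorial), you can read it from an environment variable. The helper name `load_api_key` below is ours, for illustration:

```python
import os

def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key from the environment, or None if it is not set."""
    return os.environ.get(env_var)

# client = OpenAI(api_key=load_api_key())
```

This keeps the key out of your source code and version control.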

Step 3:

We need to define a data structure for the JSON Schema that the model will follow when generating structured outputs. To do this, we’ll create a class called ‘legal_data’, which ensures the model’s responses align with the schema, making it easier to extract and manage important legal information.

To define the schema, use the code below:

class legal_data(BaseModel):
    case_number: str
    case_title: str
    court_name: str
    judge_name: str
    jurisdiction: str
    claimant_name: str
    defendant_name: str
    other_involved_parties: Optional[str]
    filing_date: str
    hearing_date: str
    order_date: str
    case_type: str
    summary_of_legal_issue: str
    compensation_amount: Optional[str]
    referenced_documents: Optional[str]
    summary_of_ruling: str
    final_decision: str
    future_obligations: Optional[str]
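Because the schema is a plain Pydantic model, you can sanity-check it locally before spending any API calls. A minimal sketch — the `Ruling` model and its values below are made up for illustration, mirroring a few fields of the schema above:

```python
from typing import Optional
from pydantic import BaseModel, ValidationError

class Ruling(BaseModel):  # a cut-down stand-in for the full schema
    case_number: str
    judge_name: str
    compensation_amount: Optional[str]  # may be None, but must be present

# Well-formed data validates and round-trips to a dict
sample = Ruling(case_number="3:23-bk-08817-DPC",
                judge_name="Daniel P. Collins",
                compensation_amount=None)
print(sample.model_dump())

# A missing required field is rejected up front
try:
    Ruling(case_number="3:23-bk-08817-DPC", compensation_amount=None)
except ValidationError:
    print("judge_name is required")
```

This catches schema mistakes (typos, wrong types) in your own code before they ever reach the model.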
Step 4:

Next, we need to extract the text from the PDF file of the legal document using Apache Tika. We have used legal case documents from the website below:

https://www.govinfo.gov/app/collection/uscourts

You can access any section, such as “Appellate” or “District”. After selecting the court and the year, you’ll be presented with various case options. Choose any case, click the “PDF” button, and download it once it opens. We used the legal case PDF “USCOURTS-azb-3_23-bk-08817-0” for our testing; you can download it from the URL below:

https://www.govinfo.gov/content/pkg/USCOURTS-azb-3_23-bk-08817/pdf/USCOURTS-azb-3_23-bk-08817-0.pdf

Place the downloaded legal case PDF in the folder where your script resides, then use the lines of code below to fetch the textual content from it:

pdf_path = "USCOURTS-azb-3_23-bk-08817-0.pdf"
parsed_pdf = parser.from_file(pdf_path)
text_content = (parsed_pdf.get('content') or '').replace("\n\n", "")
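One caveat worth guarding against: Tika reports `content` as `None` when no text can be extracted (for example, a scanned PDF without a text layer), which would crash a bare `.replace()` call. A small guard, sketched here with plain dicts standing in for Tika's parse result:

```python
def clean_content(parsed_pdf):
    """Safely pull the text out of a Tika parse-result dict."""
    content = parsed_pdf.get("content") or ""  # 'content' may be None
    return content.replace("\n\n", "")

# Stand-in parse results for illustration:
print(clean_content({"content": "Order of the Court.\n\nSigned."}))  # Order of the Court.Signed.
print(clean_content({"content": None}))  # extraction failed -> empty string
```

An empty result is then your cue to fall back to OCR or skip the file, rather than sending nothing to the model.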
Step 5:

Now, we will initiate a call to the “gpt-4o-mini” model, providing it with prompt instructions and the extracted text from the legal document. To ensure the model generates the response in a specific format, we pass an additional argument, “response_format”, which specifies the class that defines the schema for the structured output. This helps maintain consistency and organization in the response.

The call extracts the fields defined in the “legal_data” class and stores them as an instance of that class in the “extracted_legal_data” variable. This allows structured access to the extracted information, enabling efficient data handling and analysis.

completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are an expert in analyzing and extracting information from legal case documents. You have been provided with the legal case data; analyze it and provide a response in the provided structured format."},
        {"role": "user", "content": f"Legal case study data:\n{text_content}"}
    ],
    response_format=legal_data
)
extracted_legal_data = completion.choices[0].message.parsed
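With Structured Outputs, the message can also carry a refusal instead of parsed data (the SDK exposes this as `message.refusal`), so it is worth checking before touching `parsed`. A minimal sketch — the helper name and the stub classes in the usage note are ours, not part of the SDK:

```python
def parsed_or_none(message):
    """Return the parsed model instance, or None if the model refused."""
    if getattr(message, "refusal", None):
        print(f"Model refused: {message.refusal}")
        return None
    return message.parsed

# Usage with the completion from above:
# extracted_legal_data = parsed_or_none(completion.choices[0].message)
```

This way a refusal surfaces as an explicit `None` you can handle, instead of an unexpected attribute error downstream.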
Step 6:

At last, you can use the lines of code below to display the extracted information:

for field, value in extracted_legal_data.model_dump().items():
    print(f"{field}: {value}")

Below is the full code:

# Import necessary packages
from pydantic import BaseModel
from typing import Optional
from openai import OpenAI
from tika import parser

# Initialize the OpenAI client
client = OpenAI(api_key="YOUR_OPENAI_KEY")

# Define the class that describes the structured output format
class legal_data(BaseModel):
    case_number: str
    case_title: str
    court_name: str
    judge_name: str
    jurisdiction: str
    claimant_name: str
    defendant_name: str
    other_involved_parties: Optional[str]
    filing_date: str
    hearing_date: str
    order_date: str
    case_type: str
    summary_of_legal_issue: str
    compensation_amount: Optional[str]
    referenced_documents: Optional[str]
    summary_of_ruling: str
    final_decision: str
    future_obligations: Optional[str]

# Extract text from the case study PDF using the Apache Tika parser
pdf_path = "USCOURTS-azb-3_23-bk-08817-0.pdf"
parsed_pdf = parser.from_file(pdf_path)
text_content = (parsed_pdf.get('content') or '').replace("\n\n", "")

# Call the `gpt-4o-mini` model with the prompt instructions, the extracted
# legal case text, and the expected output format
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are an expert in analyzing and extracting information from legal case documents. You have been provided with the legal case data; analyze it and provide a response in the provided structured format."},
        {"role": "user", "content": f"Legal case study data:\n{text_content}"}
    ],
    response_format=legal_data
)
extracted_legal_data = completion.choices[0].message.parsed

# Display output
for field, value in extracted_legal_data.model_dump().items():
    print(f"{field}: {value}")

Output

case_number: 3:23-bk-08817-DPC
case_title: In re Paul F. Seiferth
court_name: United States Bankruptcy Court District of Arizona
judge_name: Daniel P. Collins
jurisdiction: Bankruptcy
claimant_name: Paul F. Seiferth
defendant_name: Lawrence Warfield (Trustee)
other_involved_parties: None
filing_date: 2023-12-07
hearing_date: None
order_date: 2024-08-05
case_type: Chapter 7 Bankruptcy
summary_of_legal_issue: Whether Debtor's Motion for Relief from Judgment should be granted and whether the doctrine of claim preclusion bars Debtor from amending his exemption claims.
compensation_amount: None
referenced_documents: None
summary_of_ruling: The Court granted Debtor's Motion for Relief from Judgment under Rule 60(b), finding that there was confusion and misunderstanding regarding the Order issued on Trustee's First Objection. The Court also found that the claim preclusion doctrine does not apply to Debtor's Vehicle Exemption claims.
final_decision: Court vacated its prior Order, allowing Debtor's claimed RV Vehicle Exemption, and denied Trustee's Second Objection.
future_obligations: None

Conclusion

In conclusion, Structured Outputs provides a powerful way to ensure that models generate accurate, well-structured responses. By using this feature, we can simplify the process of extracting important information from legal documents, reducing errors and saving time. Implementing it in Python is straightforward, making it an excellent tool for developers. Embracing Structured Outputs can significantly enhance the efficiency of your applications.
