Introduction
In application development, ensuring that data exchanged between systems is properly structured is critical—especially when using formats like JSON, which is widely adopted for data communication. However, getting models to produce valid JSON outputs consistently can be a challenging task. Developers often need to invest significant time in crafting complex prompts to ensure the model includes all the necessary keys and values while avoiding errors like invalid enums or missing fields.
To address these challenges, OpenAI has introduced Structured Outputs—a new feature designed to guarantee that models generate responses that strictly follow the provided JSON Schema. This means developers can now be confident that every response will conform to the correct format, eliminating the need for tedious prompt engineering or worrying about formatting errors.
In this blog, we’ll explore how the new Structured Outputs feature can simplify the process of parsing and extracting data from legal documents. Now, let’s dive in and see how to implement Structured Outputs using Python!
Implementation steps
Step 1:
Install the necessary packages using the command below:
!pip install openai==1.47.1 pydantic==2.9.2 tika==2.6.0
We also need to install the JDK as a dependency for the Tika library. To do this, please run the command below in the terminal:
sudo apt-get install -y default-jdk
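The tika Python package works by launching a local Apache Tika server, which is why Java is required. Before moving on, you can confirm the JDK is available by running the command below in the terminal:
java -version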
Step 2:
Next, we need to import the necessary packages and initialize the OpenAI client object to make the API call. To do this, add the following lines of code to your script:
from pydantic import BaseModel
from typing import Optional
from openai import OpenAI
from tika import parser
client = OpenAI(api_key="YOUR_OPENAI_KEY")
Make sure to replace the "YOUR_OPENAI_KEY" placeholder with your actual OpenAI API key.
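Hardcoding the key is fine for a quick experiment, but for anything beyond that it is safer to load it from an environment variable. Below is a minimal alternative, assuming the key has been exported as the OPENAI_API_KEY environment variable (which the SDK also reads by default when no api_key argument is passed):
import os
from openai import OpenAI

# Read the key from the OPENAI_API_KEY environment variable instead of hardcoding it
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])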
Step 3:
We need to define a data structure for the JSON Schema that the model will follow when generating structured outputs. To do this, we’ll create a class called ‘legal_data’, which ensures the model’s responses align with the schema, making it easier to extract and manage important legal information.
To define the schema, use the code below:
class legal_data(BaseModel):
    case_number: str
    case_title: str
    court_name: str
    judge_name: str
    jurisdiction: str
    claimant_name: str
    defendant_name: str
    other_involved_parties: Optional[str]
    filing_date: str
    hearing_date: str
    order_date: str
    case_type: str
    summary_of_legal_issue: str
    compensation_amount: Optional[str]
    referenced_documents: Optional[str]
    summary_of_ruling: str
    final_decision: str
    future_obligations: Optional[str]
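If you are curious about the exact JSON Schema that will be derived from this class, Pydantic can print it directly. This is only a sanity check and is not required for the rest of the steps:
import json

# Inspect the JSON Schema generated from the legal_data model
print(json.dumps(legal_data.model_json_schema(), indent=2))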
Step 4:
Next, we need to extract the text from the PDF file of the legal document using Apache Tika. We have used legal case documents from the website below:
https://www.govinfo.gov/app/collection/uscourts
You can access any section, such as "Appellate" or "District". After selecting the court and the year, you'll be presented with various case options. Choose any case, click the "PDF" button, and download it once it opens. We have used the legal case study PDF "USCOURTS-azb-3_23-bk-08817-0" for our testing purposes; you can download it using the URL below:
https://www.govinfo.gov/content/pkg/USCOURTS-azb-3_23-bk-08817/pdf/USCOURTS-azb-3_23-bk-08817-0.pdf
Place the downloaded legal case study PDF in the same folder as your script, then use the lines of code below to fetch the textual content from the case study PDF:
pdf_path = "USCOURTS-azb-3_23-bk-08817-0.pdf"
parsed_pdf = parser.from_file(pdf_path)
text_content = (parsed_pdf.get('content') or '').replace("\n\n", "")
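Tika returns None for the content when parsing fails, so it is worth confirming that text was actually extracted before spending tokens on the API call. A quick check might look like this:
if not text_content:
    raise ValueError(f"No text could be extracted from {pdf_path}")

# Preview the first few hundred characters to confirm the extraction looks reasonable
print(text_content[:500])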
Step 5:
Now, we will initiate a call to the “gpt-4o-mini” model, providing it with prompt instructions and the extracted text from the legal documents. To ensure the model generates the response in a specific format, we need to provide an additional argument, “response_format”, which specifies the class object that defines the schema for the structured output. This will help maintain consistency and organization in the response.
The model will extract the fields defined in the "legal_data" class and return them as an instance of that class, stored in the "extracted_legal_data" variable. This allows for structured access to the extracted information, enabling efficient data handling and analysis.
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are an expert in analyzing and extracting information from legal case documents. You have been provided with the legal case data; analyze it and provide a response in the provided structured format."},
        {"role": "user", "content": f"Legal case study data:\n{text_content}"}
    ],
    response_format=legal_data
)
extracted_legal_data = completion.choices[0].message.parsed
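Note that Structured Outputs can also return a refusal instead of parsed data (for example, if the model declines the request), in which case "parsed" will be None. If you want to handle that explicitly, the assignment above can be wrapped in a check along these lines:
message = completion.choices[0].message
if message.refusal:
    # The model declined to answer; surface the refusal text instead of parsing
    raise RuntimeError(f"Model refused the request: {message.refusal}")
extracted_legal_data = message.parsed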
Step 6:
Finally, you can use the lines of code below to display the extracted information:
for field, value in extracted_legal_data.__dict__.items():
    print(f"{field}: {value}")
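If you would rather see the result as formatted JSON instead of line-by-line fields, Pydantic's model_dump_json method can produce that in a single call:
# Serialize the parsed model to pretty-printed JSON
print(extracted_legal_data.model_dump_json(indent=2))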
Below is the full code:
# Import necessary packages
from pydantic import BaseModel
from typing import Optional
from openai import OpenAI
from tika import parser

# Initialize the OpenAI client
client = OpenAI(api_key="YOUR_OPENAI_KEY")

# Define the class that describes the structured output format
class legal_data(BaseModel):
    case_number: str
    case_title: str
    court_name: str
    judge_name: str
    jurisdiction: str
    claimant_name: str
    defendant_name: str
    other_involved_parties: Optional[str]
    filing_date: str
    hearing_date: str
    order_date: str
    case_type: str
    summary_of_legal_issue: str
    compensation_amount: Optional[str]
    referenced_documents: Optional[str]
    summary_of_ruling: str
    final_decision: str
    future_obligations: Optional[str]

# Extract text from the case study PDF using the Apache Tika parser
pdf_path = "USCOURTS-azb-3_23-bk-08817-0.pdf"
parsed_pdf = parser.from_file(pdf_path)
text_content = (parsed_pdf.get('content') or '').replace("\n\n", "")

# Call the `gpt-4o-mini` model with the prompt instructions, the extracted legal case data, and the expected output format
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are an expert in analyzing and extracting information from legal case documents. You have been provided with the legal case data; analyze it and provide a response in the provided structured format."},
        {"role": "user", "content": f"Legal case study data:\n{text_content}"}
    ],
    response_format=legal_data
)
extracted_legal_data = completion.choices[0].message.parsed

# Display the extracted output
for field, value in extracted_legal_data.__dict__.items():
    print(f"{field}: {value}")
Output
For the link https://www.govinfo.gov/content/pkg/USCOURTS-azb-3_23-bk-08817/pdf/USCOURTS-azb-3_23-bk-08817-0.pdf, the extracted response is as follows:
case_number: 3:23-bk-08817-DPC
case_title: In re Paul F. Seiferth
court_name: United States Bankruptcy Court District of Arizona
judge_name: Daniel P. Collins
jurisdiction: Bankruptcy
claimant_name: Paul F. Seiferth
defendant_name: Lawrence Warfield (Trustee)
other_involved_parties: None
filing_date: 2023-12-07
hearing_date: None
order_date: 2024-08-05
case_type: Chapter 7 Bankruptcy
summary_of_legal_issue: Whether Debtor's Motion for Relief from Judgment should be granted and whether the doctrine of claim preclusion bars Debtor from amending his exemption claims.
compensation_amount: None
referenced_documents: None
summary_of_ruling: The Court granted Debtor's Motion for Relief from Judgment under Rule 60(b), finding that there was confusion and misunderstanding regarding the Order issued on Trustee's First Objection. The Court also found that the claim preclusion doctrine does not apply to Debtor's Vehicle Exemption claims.
final_decision: Court vacated its prior Order, allowing Debtor's claimed RV Vehicle Exemption, and denied Trustee's Second Objection.
future_obligations: None
Conclusion
In conclusion, Structured Outputs provide a powerful way to ensure that models generate accurate and well-structured responses. By using this feature, we can simplify the process of extracting important information from legal documents, reducing errors and saving time. Implementing it with Python is straightforward, making it an excellent tool for developers. Embracing Structured Outputs can significantly enhance the efficiency of your applications.