![](https://www.pragnakalp.com/wp-content/uploads/2024/12/Open-Source-Embedding-Models-Which-One-Performs-Best.jpg)
In the modern era of machine learning and artificial intelligence, understanding and processing text is a critical task. Whether it’s for chatbots, search engines, or recommendation systems, the ability of machines to “read” and “understand” language underpins many applications we use every day. However, machines don’t naturally understand human language—they need a way to translate words, sentences, or even entire documents into something they can work with: numbers. This is where embedding models come in.
Embedding models are widely used in tasks like searching for similar documents, answering questions, or even recommending products. In this blog post, we explore what embedding models are and how they are used, and we compare some of the best open-source options available. By the end, you’ll have a clearer understanding of which model might be the best fit for your specific needs and why embeddings are such a vital part of modern machine learning.
What Are Embedding Models?
An embedding model is a tool that converts data, like words or sentences, into numeric representations (vectors) called embeddings. These embeddings are arranged so that distance in the vector space reflects the semantic meaning of the text. For example, the words “cat” and “dog” might be closer together in this numeric space because they have similar meanings, while “cat” and “car” would be farther apart. This allows machines to recognize patterns, similarities, and relationships in the data. Some common applications of embedding models are listed below.
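To make the idea of "closer together" concrete, here is a minimal sketch that compares vectors with cosine similarity, a common way to measure closeness between embeddings. The three-dimensional vectors are hand-made for illustration and are not the output of any model discussed in this post.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors (assumption): real models produce hundreds of
# dimensions learned from data, not three values chosen by hand.
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

cat_dog = cosine_similarity(toy_embeddings["cat"], toy_embeddings["dog"])
cat_car = cosine_similarity(toy_embeddings["cat"], toy_embeddings["car"])
print(f"cat vs dog: {cat_dog:.3f}")
print(f"cat vs car: {cat_car:.3f}")
```

With these toy values, "cat" scores higher against "dog" than against "car", which is the behaviour a trained embedding model is expected to show for semantically related words.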
1. Semantic Search
- Embedding models improve search and retrieval by matching questions with relevant answers, even when phrased differently.
- Example: In an FAQ system, the query “How do I reset my password?” retrieves the answer for “Password recovery instructions” thanks to embedding-based matching.
2. Search Engines
- Embedding models enable search engines to find relevant results by understanding the meaning of queries.
- Example: Google matches “cheap flights to Paris” with results that may use “affordable tickets to Paris” because the embeddings capture the semantic similarity.
3. Recommendation Systems
- Platforms recommend items by representing users and products as vectors and suggesting items that are closest in the embedding space.
- Example: Netflix recommends movies like Inception to a user who enjoyed Interstellar based on embeddings of user preferences and movie genres.
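The recommendation idea above can be sketched in a few lines: build a user vector from the items the user liked, then rank the remaining items by similarity to it. Everything here is hypothetical, including the item vectors, the placeholder titles, and the `recommend` helper; production systems learn these vectors from data.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def recommend(liked, catalog, k=1):
    """Rank unseen catalog items by similarity to the mean of the liked items."""
    dims = len(next(iter(catalog.values())))
    user_vector = [sum(catalog[t][i] for t in liked) / len(liked) for i in range(dims)]
    candidates = [t for t in catalog if t not in liked]
    ranked = sorted(candidates, key=lambda t: cosine(user_vector, catalog[t]), reverse=True)
    return ranked[:k]

# Invented 3-dimensional item vectors, loosely (sci-fi, romance, comedy).
catalog = {
    "Interstellar": [0.9, 0.2, 0.1],
    "Inception": [0.8, 0.1, 0.2],
    "Romantic Drama A": [0.1, 0.9, 0.2],
    "Comedy B": [0.1, 0.2, 0.9],
}

print(recommend(["Interstellar"], catalog, k=1))  # ['Inception']
```

Averaging the liked items is only one simple way to form a user vector; it is used here because it keeps the example short.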
Testing Different Open-Source Embedding Models
In this section, we will compare several popular open-source embedding models to understand their strengths and performance across various text-based tasks. The models we will evaluate include hkunlp/instructor-xl, BAAI/bge-small-en-v1.5, and sentence-transformers/sentence-t5-xl.
To provide a practical comparison, we tested all these models using five demo questions based on a publicly available annual report, which you can download from the Google Drive link provided below. A side-by-side comparison of the results from all these models follows.
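This post does not show its retrieval code, so the following is only a sketch of how such a comparison could be set up, assuming each model is used to embed document chunks and the question, and the most similar chunks are passed on as context. The `retrieve` function, the sample chunks, and the bag-of-words embedder are all stand-ins invented for this example.

```python
import math
from collections import Counter

def bag_of_words_embed(text, vocabulary):
    """Stand-in embedder: word counts over a fixed vocabulary (NOT a real model)."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0

def retrieve(question, chunks, embed, k=2):
    """Return the k chunks whose embeddings are most similar to the question."""
    q_vec = embed(question)
    scored = [(cosine(q_vec, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]

# Hypothetical chunks; in a real run these would be passages of the report.
chunks = [
    "convenience stores share of annual base rent",
    "cash and cash equivalents at year end",
    "total revenues for the year",
]
vocabulary = sorted({word for chunk in chunks for word in chunk.split()})

def embed(text):
    return bag_of_words_embed(text, vocabulary)

print(retrieve("how much cash does the company have", chunks, embed, k=1))
```

To compare real models, `embed` would be swapped for each model's encoder, for example the `encode` method of a `SentenceTransformer` object from the sentence-transformers library (an assumption about tooling, not something the post confirms), while the questions and chunks stay fixed.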
Question 1: What percentage is contained by Convenience Stores in top Line Trade? | |
Original response: | Convenience stores make up 16.4% of the top lines of trade as of December 31, 2023. This percentage represents their share of the annualized base rent in the property portfolio. |
hkunlp/instructor-xl | According to the information, Convenience Stores make up 16.4% of the total annual base rent in the top lines of trade. |
BAAI/bge-small-en-v1.5 | According to the data, Convenience Stores make up 16.4% of the top lines of trade. |
sentence-transformers/sentence-t5-xl | According to the data, Convenience Stores make up 16.4% of the top line trades in 2023, 16.5% in 2022, and 17.9% in 2021. This information is taken from the first table in the documents, which shows the diversification of the property portfolio based on the top 20 lines of trade. |
Observation: This is a simple textual question where all the models have managed to fetch the correct information from the database. |
Question 2: In 2023, how much cash flow does NNN have? | |
Original response: | According to the information provided, as of December 31, 2023, NNN had $5,155,000 in cash and cash equivalents. Additionally, they had $968,000,000 available for future borrowings under their Credit Facility. |
hkunlp/instructor-xl | According to the provided data, NNN had $5,155,000 in cash and cash equivalents at the end of 2023. This includes $3,966,000 in restricted cash or cash held in escrow. |
BAAI/bge-small-en-v1.5 | According to the document, NNN had $5,155,000 in cash and cash equivalents at the end of 2023. This includes $3,966,000 of restricted cash or cash held in escrow. |
sentence-transformers/sentence-t5-xl | According to the document, NNN had $5,155,000 in cash and cash equivalents at the end of the year 2023. This includes $3,966,000 of restricted cash or cash held in escrow. So, if we consider only the unrestricted cash, NNN had around $1,189,000 ($5,155,000 – $3,966,000). |
Observation: This question also draws on plain text content, and each model fetched the correct context. |
Question 3: What are the criteria for getting ILO? | |
Original response: | To get an Individualised Living Option (ILO), you’ll need to meet certain criteria. Here are the main points:
1. You need support because of your disability. This means the reason for getting an ILO should be related to your disability. For example, you might need help living independently or exploring a new living situation because of your disability.
2. The support should help you pursue your goals. This means the ILO support should help you overcome barriers that prevent you from achieving your goals, such as moving out of your parents’ home for the first time.
3. You don’t need 24-hour support from one or more support workers. You also don’t need frequent overnight support.
4. Restrictive practice isn’t part of your behaviour support plan (if you have one).
These criteria will help us decide if an ILO is right for you and if it meets the National Disability Insurance (NDIS) funding criteria. |
hkunlp/instructor-xl | To get an Individualised Living Option (ILO), there are several criteria that need to be met. According to the guidelines, the support you’re seeking should be related to your disability, and it should help you pursue your goals. Here are the key points:
Firstly, the support must be related to your disability. This means that you need support to live independently because of your disability, or you need to explore and design an ILO because of your disability support needs.
Secondly, the support should help you pursue your goals. This involves considering the disability-specific barriers that stop you from achieving your home and living goals. The support should address your disability support needs and help you achieve your goals, such as moving out of your parents’ home for the first time.
Lastly, the ILO support should be likely to be effective and beneficial for you. This means it’ll do what it’s meant to do and help you have a place you can call home. You should be able to express yourself and make choices about your daily life activities.
Additionally, the ILO support should be tailored to your unique needs and preferences, providing the flexibility you need and a mix of support that works in the long term.
These criteria ensure that the ILO support is focused on addressing your specific needs and helping you achieve your goals while promoting your independence and well-being. |
BAAI/bge-small-en-v1.5 | To get an Individualised Living Option (ILO), there are several criteria to consider. According to the guidelines, an ILO might be suitable for you if:
* You’re 18 or older, or 17 and planning your future living arrangements. * You need support at home for at least 6 hours each day, whether it’s from formal or informal sources. * You’re ready to explore your home and living needs, what you’d like, and your options. * You’re willing to put in the time and effort to create your future home.
On the other hand, an ILO might not be the best choice if:
* You’re under 18, except if you’re 17 and planning your future living arrangements. * You need less than 6 hours of support at home each day and don’t require help with problem-solving at other times. * You need 24-hour support from one or more support workers. * You need frequent nighttime support. * Restrictive practices are included in your behavior support plan (if you have one).
The National Disability Insurance Scheme (NDIS) will also ensure that any ILO supports meet their funding criteria before providing funding.
It’s worth noting that there are other home and living supports available that might be more suitable for your needs if an ILO isn’t the best fit. |
sentence-transformers/sentence-t5-xl | The criteria for getting an Individualised Living Option (ILO) are quite straightforward. According to the guidelines, your ILO application must meet three main conditions:
Firstly, the support you’re seeking must be related to your disability. This means that the reason for needing an ILO must be directly connected to your disability support needs. For instance, if you need support to live independently because of your disability, then an ILO could be suitable.
Secondly, the support you’re requesting must help you pursue your goals. In other words, the ILO must address the disability-specific barriers preventing you from achieving your home and living goals. This could involve creating a home and living goal in your plan that the ILO supports will help you achieve.
Lastly, the ILO must be effective and beneficial for you. This means that the support you’ll receive should enable you to have a safe and comfortable home where you can express yourself and make choices about your daily life activities. Your service proposal should clearly outline how your primary and supplementary supports will be provided, ensuring that the supports match your needs and wants, and that there are no potential risks to you or others.
Additionally, the guidelines highlight that these criteria are assessed on a case-by-case basis, taking into account factors like your individual circumstances, goals, and support needs. So, meeting these conditions does not guarantee approval, but rather serves as a foundation for further evaluation. |
Observation: The answer to this question is stated only indirectly in the text. Here, only “BAAI/bge-small-en-v1.5” retrieved the correct information. “hkunlp/instructor-xl” and “sentence-transformers/sentence-t5-xl” failed to capture the specific context and instead returned ILO-related information that lacks the exact points needed in the response. |
Question 4: What is the difference between the cash flow the NNN has from 2021 to 2023? | |
Original response: | To calculate the difference in cash flow between 2021 and 2023, let’s take a look at the cash flow data provided. From 2021 to 2023, NNN’s cash flow changed significantly. In 2021, NNN started with $171,322 and ended with $5,155. This means they had a decrease in cash flow of $166,167. However, looking at the cash flow details: – In 2021, NNN had a cash inflow of $568,425 through operating activities. – In 2022, they had a cash inflow of $578,355 through operating activities. – In 2023, they had a cash inflow of $612,410 through operating activities. The main reason for this change is the increase in cash inflow from operating activities over the three-year period. If we compare the cash inflows from operating activities alone, we see that NNN had a $43,985 ($578,355 – $534,370) increase in cash inflow from 2022 to 2023, but still, overall, the company experienced a net decrease in cash flow of $1,623 from 2022 to 2023. So, the overall cash flow decreased by $1,623 from 2021 to 2023, even though there was an increase in cash inflow from operating activities. |
hkunlp/instructor-xl | Let me break down the cash flow information for you. According to the documents, here are the cash flows for NNN for the years 2021, 2022, and 2023: For 2021: $171,322 For 2022: $6,778 For 2023: $5,155 The change in cash flow from 2021 to 2022 is $6,778 – $171,322 = -$164,544. The change in cash flow from 2022 to 2023 is $5,155 – $6,778 = -$1,623. So, the overall decrease in cash flow from 2021 to 2023 is $164,544 + $1,623 = $165,167. However, if you want to know the difference between the two specific years (2021 vs 2023), the change would be $171,322 – $5,155 = $166,167. Please let me know if you have any further questions! |
BAAI/bge-small-en-v1.5 | Let me break down the cash flow information from the documents provided. According to the document, here are the cash flow numbers for NNN for the years 2021, 2022, and 2023: For 2021: $171,322 For 2022: $6,778 For 2023: $5,155 Now, let’s calculate the difference in cash flow from 2021 to 2022 and then from 2022 to 2023. From 2021 to 2022: $6,778 (2022) – $171,322 (2021) = -$164,544 (a decrease of $164,544) From 2022 to 2023: $5,155 (2023) – $6,778 (2022) = -$1,623 (a decrease of $1,623) So, the cash flow decreased by $164,544 from 2021 to 2022 and then further decreased by $1,623 from 2022 to 2023. |
sentence-transformers/sentence-t5-xl | Let me help you break down the cash flow information for NNN from 2021 to 2023. According to the provided data, here’s what I found: The table showing NNN’s cash flows for each year is as follows: | Year | Cash Flow | | — | — | | 2023 | ($1,623) | | 2022 | (-164,544) | | 2021 | (-95,914) | Now, let’s calculate the difference in cash flow from 2021 to 2023: To find the difference, we’ll subtract the 2021 cash flow from the 2023 cash flow. However, since both values are negative, we’ll add them instead. So, the difference is: ($1,623) + 95,914 = 94,291. However, this doesn’t make sense because adding two negative numbers should give us a more negative result. Let me rephrase my previous statement. The difference in cash flow from 2021 to 2023 is actually an increase of 94,291 thousand dollars. This means NNN had more cash coming in (or less going out) in 2023 compared to 2021. If you want to know the exact cash flow amounts for any specific year, just let me know! |
Observation: This question is based on the table that presents a side-by-side comparison of cash flow from 2021 to 2023, and it asks the model to summarize that table. Both “hkunlp/instructor-xl” and “BAAI/bge-small-en-v1.5” retrieved the relevant tabular data from the database. The “sentence-transformers/sentence-t5-xl” model extracted incorrect tabular information, resulting in an inaccurate response. |
Question 5: Can you help me to summarize the revenue of the NNN from the year 2021 to 2023? | |
Original response: | I’d be happy to help you summarize the revenue of NNN from 2021 to 2023. According to the provided data, here’s a summary of NNN’s revenue for the mentioned years: In 2021, NNN’s total revenue was $726,407 thousand dollars. In 2022, NNN’s total revenue increased to $773,053 thousand dollars, representing a 6.4% growth over the previous year. In 2023, NNN’s total revenue further increased to $828,111 thousand dollars, showing a 7.1% growth over 2022. Here’s a simple breakdown: * 2021: $726,407 thousand dollars * 2022: $773,053 thousand dollars (+6.4%) * 2023: $828,111 thousand dollars (+7.1%) It seems like NNN’s revenue has been steadily increasing over the past three years. |
hkunlp/instructor-xl | I’d be happy to try and help you with that. However, I don’t have any specific information on the revenue of NNN from 2021 to 2023 in the provided data. But what I can do is provide some general information about NNN’s property portfolio and financial situation. As of December 31, 2023, NNN owned 3,532 properties across 49 states with an aggregate gross leasable area of approximately 35,966,000 square feet. However, I can suggest looking at the portfolio growth table in the provided data, which shows the number of properties owned by NNN each year from 2011 to 2023: | Year | Number of Properties | |——-|——————–| | 2011 | 3,750 | | 2012 | 3,500 | | 2013 | 3,250 | | 2014 | 3,000 | | 2015 | 2,750 | | 2016 | 2,500 | | 2017 | 2,250 | | 2018 | 2,000 | | 2019 | 1,750 | | 2020 | 1,500 | | 2021 | 1,250 | | 2022 | 3,411 | | 2023 | 3,532 | Unfortunately, the provided data does not contain direct information on revenue for the years 2021 to 2023. |
BAAI/bge-small-en-v1.5 | I’d be happy to help you summarize the revenue information from the documents I have access to. Unfortunately, I don’t see any specific revenue figures mentioned in the provided documents. The information mainly talks about expenses, interest, and other financial aspects of NNN. However, let me try to extract some relevant details. From the first document, we can see the “Total Operating Expenses” for each year: * 2021: $300,405 thousand * 2022: $307,959 thousand * 2023: $320,492 thousand We can infer that the company’s expenses increased over the three-year period. But without more information, I’m not able to provide a summary of revenue. If you could provide more context or specify what kind of revenue you’re looking for (operating revenue, net income, etc.), I’ll do my best to help. |
sentence-transformers/sentence-t5-xl | I’d be happy to help you summarize the revenue of NNN from 2021 to 2023. According to the provided data, we have the following information: In 2021, NNN’s total annualized base rent was $713,169,000. In 2022, NNN’s total annualized base rent increased to $771,984,000. In 2023, NNN’s total annualized base rent further increased to $818,749,000. So, the revenue of NNN has been steadily increasing over the three years, with a growth rate of: * From 2021 to 2022: 8.4% * From 2022 to 2023: 6.4% This suggests that NNN’s revenue has been growing consistently over the past few years. |
Observation: This question seeks to summarize the revenue from the statement. However, no model has successfully identified the correct information, instead retrieving various chunks related to revenue that do not address this specific query, resulting in an incorrect response. |
We also tested these models with other questions. Based on our observations, all of them perform best with textual information; with tabular data, “BAAI/bge-small-en-v1.5” performed best among the models tested.
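Because Question 4 involves arithmetic on retrieved values, it helps to check the numbers independently of any model. The short sketch below uses only the year-end cash figures quoted in the Question 4 responses.

```python
# Year-end cash figures as quoted in the Question 4 responses.
cash = {2021: 171_322, 2022: 6_778, 2023: 5_155}

change_2021_2022 = cash[2022] - cash[2021]
change_2022_2023 = cash[2023] - cash[2022]
change_2021_2023 = cash[2023] - cash[2021]

print(change_2021_2022)  # -164544
print(change_2022_2023)  # -1623
print(change_2021_2023)  # -166167

# The two yearly changes must add up to the overall change.
assert change_2021_2022 + change_2022_2023 == change_2021_2023
```

The overall decrease is 166,167, so the intermediate sum of 165,167 in the hkunlp/instructor-xl response appears to be an arithmetic slip in the generated answer rather than a retrieval error. This is a reminder that correct retrieval does not guarantee a correct final answer.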
Comparison of Embedding Models: Performance, Time, and Memory Requirements
Apart from the embedding models tested above, we also experimented with additional embedding models. Below is an analysis of their performance on the same annual report, including the average time taken by each embedding model to complete the task and their memory requirements.
Note: All tests were conducted on Google Colab using a T4 GPU with approximately 15 GB of RAM.
Model Name | Performance | Average time for retrieval (Using T4 GPU) | Memory requirement (To load the model + Create Vector database (for this Annual report document) + Perform retrieval - Using T4 GPU) |
---|---|---|---|
hkunlp/instructor-xl | This model has worked well with both textual and tabular information. However, we observed that it may retrieve incorrect chunks of data when the dataset contains sections with similar or overlapping information. | 0.05 seconds | 9.5 GB |
nomic-ai/nomic-embed-text-v1.5 | The model performs at an average level with both textual and tabular information. We observed that it occasionally makes mistakes, even when the information is explicitly mentioned and presented in a straightforward format. | 0.02 seconds | 5.0 GB |
sentence-transformers/sentence-t5-xl | The model generally retrieved the correct context for both textual and tabular data. However, it made mistakes when the wording of the query did not exactly match the wording in the database, and it sometimes failed to extract explicitly mentioned information. | 0.05 seconds | 7.2 GB |
sentence-transformers/all-mpnet-base-v2 | The model performed similarly to the above model with textual data, but it made more mistakes when handling tabular data. These errors occurred more frequently, indicating a potential weakness in processing and accurately extracting information from tables. | 0.03 seconds | 1.5 GB |
Snowflake/snowflake-arctic-embed-l-v2.0 | The model performed well with textual data but encountered challenges in specific situations. It made mistakes when the information was directly mentioned, even for simple queries, and struggled when the document contained multiple similar pieces of data. Additionally, while handling tabular data, the model occasionally failed to extract the correct information. | 0.03 seconds | 4.4 GB |
BAAI/bge-small-en-v1.5 | The model performed exceptionally well overall, accurately extracting information from both text and table data. It correctly handled all text-based queries, with only one mistake occurring during the table data extraction. This model demonstrated strong performance with minimal errors. | 0.01 seconds | 1.0 GB |
Snowflake/snowflake-arctic-embed-m-v2.0 | This model performed similarly to our best working model, accurately extracting both text and table data. However, it may make mistakes when the relevant information is stated only indirectly in the text. | 0.02 seconds | 4.8 GB |
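The post does not describe how the average retrieval times in the table were collected. One simple way to gather comparable numbers is to time repeated retrieval calls and divide by the number of calls, as sketched below. The `average_latency` helper and the `dummy_retrieve` placeholder are hypothetical; memory usage is not measured here.

```python
import time

def average_latency(retrieve_fn, queries, repeats=3):
    """Mean wall-clock seconds per call of retrieve_fn over all queries."""
    start = time.perf_counter()
    for _ in range(repeats):
        for query in queries:
            retrieve_fn(query)
    elapsed = time.perf_counter() - start
    return elapsed / (repeats * len(queries))

def dummy_retrieve(query):
    # Placeholder for a real embed-and-search call against a vector database.
    return sorted(query.split())

queries = ["convenience stores share", "cash at end of 2023"]
print(f"{average_latency(dummy_retrieve, queries):.6f} s per query")
```

Repeating each query several times smooths out one-off delays such as model warm-up, which would otherwise skew a single measurement.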
Conclusion
Embedding models play a crucial role in enabling machines to understand and process human language. Through our exploration of various models, we’ve seen that each embedding model offers its own strengths and weaknesses, and choosing the right one depends on your specific use case, whether it’s for handling textual data, working with tables, or optimizing for speed and accuracy. As machine learning continues to evolve, embedding models will remain a foundational tool in helping systems “understand” and work with human language, making them indispensable in modern AI applications.