In 2023, we blogged about OpenSearch Service vector database capabilities. Since then, OpenSearch and Amazon OpenSearch Service have continued to evolve, bringing better performance, lower cost, and more flexible cost, latency, and accuracy tradeoffs. We’ve improved OpenSearch Service’s hybrid lexical and semantic search methods, using both dense vectors and sparse vectors. We’ve simplified connecting to and managing large language models (LLMs) hosted in other environments. And we’ve added native chunking and streamlined searching for chunked documents.

Where 2023 saw the explosion of LLMs for generative AI and LLM-generated vector embeddings for semantic search, 2024 was a year of consolidation and maturation. Applications relying on Retrieval Augmented Generation (RAG) started to move from proof of concept (POC) to production, with all of the attendant concerns about hallucinations, inappropriate content, and cost. Builders of search applications began to move their semantic search workloads to production, seeking improved relevance to drive their businesses.

As we enter 2025, OpenSearch Service support for OpenSearch 2.17 brings these improvements to the service. In this post, we walk through 2024’s innovations with an eye to how you can adopt new features to lower your cost, reduce your latency, and improve the accuracy of your search results and generated text.

Using OpenSearch Service as a vector database

Amazon OpenSearch Service as a vector database provides you with the core capabilities to store vector embeddings from LLMs and use both vector and lexical information to retrieve documents, based on their lexical similarity as well as their proximity in vector space. OpenSearch Service continues to support three vector engines: Facebook AI Similarity Search (FAISS), Non-Metric Space Library (NMSLIB), and Lucene. The service supports exact nearest-neighbor matching and approximate nearest-neighbor (ANN) matching. For ANN, the service provides both Hierarchical Navigable Small World (HNSW) and Inverted File (IVF) algorithms for storage and retrieval. The service further supports a wealth of distance metrics, including Euclidean (L2) distance, cosine similarity, Manhattan distance, and more.
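
For illustration, here is a minimal sketch of a vector index definition; the index name, field name, dimension, and HNSW parameters are placeholders you would adapt to your embedding model. It creates a knn_vector field using the FAISS engine with HNSW and Euclidean (L2) distance; changing the engine, method name, or space_type selects among the options described above.

    PUT /products
    {
      "settings": { "index.knn": true },
      "mappings": {
        "properties": {
          "product_embedding": {
            "type": "knn_vector",
            "dimension": 768,
            "method": {
              "name": "hnsw",
              "engine": "faiss",
              "space_type": "l2",
              "parameters": { "ef_construction": 128, "m": 16 }
            }
          }
        }
      }
    }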

The move to hybrid search

The job of a search engine is to take as input a searcher’s intent, captured as words, locations, numeric ranges, dates (and, with multimodal search, rich media such as images, videos, and audio), and return a set of results from its collection of indexed documents that meet the searcher’s need. For some queries, such as “plumbing fittings for CPVC pipes,” the words in a product’s description and the words that a searcher uses are sufficient to bring back the right results, using the standard Term Frequency-Inverse Document Frequency (TF/IDF) similarity metric. These queries are characterized by a high level of specificity in the searcher’s intent, which matches well to the words they use and to the product’s name and description. When the searcher’s intent is more abstract, such as “a cozy place to curl up by the fire,” the words are less likely to provide a good match.

To best serve their users across the range of queries, builders have largely started to take a hybrid search approach, using both lexical and semantic retrieval with combined ranking. OpenSearch provides hybrid search that can blend lexical queries, k-Nearest Neighbor (k-NN) queries, and neural queries using OpenSearch’s neural search plugin. Builders can implement three levels of hybrid search: lexical filtering alongside vector retrieval, manual combination of lexical and vector scores, and out-of-the-box score normalization and blending.
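
As a sketch of the third level (pipeline, index, field, and model names here are placeholders), out-of-the-box score normalization and blending is configured through a search pipeline with the normalization-processor, and the hybrid query clause then combines a lexical match with a neural query in a single request:

    PUT /_search/pipeline/hybrid-pipeline
    {
      "phase_results_processors": [
        {
          "normalization-processor": {
            "normalization": { "technique": "min_max" },
            "combination": {
              "technique": "arithmetic_mean",
              "parameters": { "weights": [0.3, 0.7] }
            }
          }
        }
      ]
    }

    GET /products/_search?search_pipeline=hybrid-pipeline
    {
      "query": {
        "hybrid": {
          "queries": [
            { "match": { "description": "a cozy place to curl up by the fire" } },
            {
              "neural": {
                "product_embedding": {
                  "query_text": "a cozy place to curl up by the fire",
                  "model_id": "<your-embedding-model-id>",
                  "k": 50
                }
              }
            }
          ]
        }
      }
    }

The weights in this sketch give the semantic clause more influence than the lexical clause; tuning the weights against your own relevance judgments is a typical part of moving hybrid search to production.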

In 2024, OpenSearch improved its hybrid search capability with conditional scoring logic, improved constructs, removal of repetitive and unnecessary calculations, and optimized data structures, yielding as much as a fourfold latency improvement. OpenSearch also added support for parallelization of the query processing for hybrid search, which can deliver up to 25% improvement in latency. OpenSearch released post-filtering for hybrid queries, which can help further dial in search results. 2024 also saw the release of OpenSearch Service’s support for aggregations for hybrid queries.

Sparse vector search is a different way of combining lexical and semantic information. Sparse encoding models map the corpus onto a fixed vocabulary of around 32,000 tokens that is the same as, or closely aligned with, the source terms. Most of each vector’s weights are zero or near-zero; the remaining weights provide a weighted set of tokens that captures the meaning of the text. Queries are translated to the same reduced token set, with generalization provided by the sparse models. In 2024, OpenSearch introduced two-phase processing for sparse vectors that improves latency for query processing.
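
As a sketch, assuming you have already registered a sparse encoding model and indexed sparse embeddings into a rank_features field (the pipeline, index, field, and model names are placeholders), a neural_sparse query routed through a search pipeline with the two-phase request processor looks like this:

    PUT /_search/pipeline/two-phase-pipeline
    {
      "request_processors": [
        { "neural_sparse_two_phase_processor": { "enabled": true } }
      ]
    }

    GET /products/_search?search_pipeline=two-phase-pipeline
    {
      "query": {
        "neural_sparse": {
          "description_sparse": {
            "query_text": "plumbing fittings for CPVC pipes",
            "model_id": "<your-sparse-encoding-model-id>"
          }
        }
      }
    }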

Focus on accuracy

One of builders’ primary concerns in moving their workloads to production has been balancing retrieval accuracy (and, by extension, the accuracy of generated text) with the cost and latency of the solution. Over the course of 2024, OpenSearch and OpenSearch Service brought new capabilities for trading off between cost, latency, and accuracy. One area of innovation for the service was a set of k-NN vector quantization methods that reduce the amount of RAM consumed by vector embeddings. Beyond these new methods, OpenSearch has long supported product quantization for the FAISS engine. Product quantization uses training to build centroids for vector clusters on reduced-dimension sub-vectors, and matches queries against those centroids. We’ve blogged about the latency and cost benefits of product quantization.
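
Product quantization requires a training step. As a sketch, with hypothetical index, field, and model names, you train a model with the k-NN Train API on a sample of representative vectors you have already indexed, then reference that model from the vector field mapping:

    POST /_plugins/_knn/models/product-pq-model/_train
    {
      "training_index": "training-data",
      "training_field": "train_embedding",
      "dimension": 768,
      "method": {
        "name": "ivf",
        "engine": "faiss",
        "space_type": "l2",
        "parameters": {
          "nlist": 1024,
          "encoder": {
            "name": "pq",
            "parameters": { "m": 16, "code_size": 8 }
          }
        }
      }
    }

    PUT /products-pq
    {
      "settings": { "index.knn": true },
      "mappings": {
        "properties": {
          "product_embedding": {
            "type": "knn_vector",
            "model_id": "product-pq-model"
          }
        }
      }
    }

In this sketch, m controls how many sub-vectors each vector is split into (the dimension must be divisible by m), and code_size controls the number of bits used to encode each sub-vector.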

You use a chunking strategy to divide long documents into smaller, retrievable pieces. The insight behind chunking is that a large piece of text contains many pools of meaning, captured in sentences, paragraphs, tables, and figures. You choose chunks that are units of meaning, within pools of related words. In 2024, OpenSearch made this pattern work with a straightforward k-NN query, alleviating the need for custom processing logic. You can now represent a long document as multiple vectors in a nested field; when you run a k-NN query, OpenSearch scores and returns each document once, based on its best-matching chunk. Previously, you had to implement custom logic in your application to query documents represented as vector chunks. With this feature, you can run plain k-NN queries, making it seamless to build vector search applications over long documents.
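
A minimal sketch, using hypothetical index and field names and a toy three-dimensional vector in place of a real query embedding: the mapping stores each document’s chunks as nested knn_vector values, and a nested k-NN query returns each matching parent document once.

    PUT /long-docs
    {
      "settings": { "index.knn": true },
      "mappings": {
        "properties": {
          "chunks": {
            "type": "nested",
            "properties": {
              "chunk_text": { "type": "text" },
              "chunk_embedding": {
                "type": "knn_vector",
                "dimension": 3,
                "method": { "name": "hnsw", "engine": "faiss", "space_type": "l2" }
              }
            }
          }
        }
      }
    }

    GET /long-docs/_search
    {
      "query": {
        "nested": {
          "path": "chunks",
          "query": {
            "knn": {
              "chunks.chunk_embedding": {
                "vector": [0.1, 0.4, 0.9],
                "k": 10
              }
            }
          }
        }
      }
    }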

Similarity search is designed around finding the k nearest vectors, representing the top-k most similar documents. In 2024, OpenSearch updated its k-NN query interface to support filtering k-NN results by distance or vector score, alongside the existing top-k support. This is ideal for use cases in which your goal is to retrieve all the results that are sufficiently similar (for example, with a score of 0.95 or higher), minimizing the possibility of missing highly relevant results simply because they fall outside a fixed top-k cutoff.
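
For example (index and field names are placeholders, and a toy vector stands in for a real query embedding), replacing k with min_score returns every document whose vector similarity score meets the threshold:

    GET /products/_search
    {
      "query": {
        "knn": {
          "product_embedding": {
            "vector": [0.1, 0.4, 0.9],
            "min_score": 0.95
          }
        }
      }
    }

You can use max_distance instead of min_score when it’s more natural to think in terms of raw distance rather than a normalized score.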

Reducing cost for production workloads

In 2024, OpenSearch introduced and extended scalar and binary quantization methods that reduce the number of bits used to store each vector. OpenSearch already supported product quantization for vectors. When you use these scalar and binary quantization methods, OpenSearch reduces the storage for vectors in the k-NN index from 32-bit floating-point numbers down to as little as 1 bit per dimension. For scalar quantization, OpenSearch supports half precision (also called fp16) and quarter precision with 8-bit integers, for two times and four times compression, respectively.

For binary quantization, OpenSearch supports 1-bit, 2-bit, and 4-bit representations, for 32, 16, and 8 times compression, respectively. These quantization methods are lossy and therefore reduce accuracy. In our testing, we’ve seen minimal impact on accuracy (as little as 2% on some standardized data sets) with up to 32 times reduction in RAM consumed.
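
As a representative sketch with placeholder names, scalar quantization at half precision is configured through the FAISS sq encoder in the vector field’s method definition:

    PUT /products-fp16
    {
      "settings": { "index.knn": true },
      "mappings": {
        "properties": {
          "product_embedding": {
            "type": "knn_vector",
            "dimension": 768,
            "method": {
              "name": "hnsw",
              "engine": "faiss",
              "space_type": "l2",
              "parameters": {
                "encoder": {
                  "name": "sq",
                  "parameters": { "type": "fp16" }
                }
              }
            }
          }
        }
      }
    }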

In-memory handling of dense vectors drives cost in proportion to the number of vectors, the vector dimensions, and the parameters you set for indexing. In 2024, OpenSearch extended vector handling to include disk-based vector search. With disk-based search, OpenSearch keeps a reduced bit-count vector in memory to generate match candidates, and retrieves the full-precision vectors from disk for final scoring and ranking. The default compression level of 32 times cuts RAM requirements by the same factor, with an attendant reduction in the cost of the solution.
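
A minimal sketch, with placeholder index and field names: setting mode to on_disk on the vector field enables disk-based search with the default 32 times compression, and compression_level can be set explicitly to choose a different tradeoff.

    PUT /products-disk
    {
      "settings": { "index.knn": true },
      "mappings": {
        "properties": {
          "product_embedding": {
            "type": "knn_vector",
            "dimension": 768,
            "mode": "on_disk",
            "compression_level": "32x"
          }
        }
      }
    }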

In 2024, OpenSearch introduced support for JDK 21, so you can run OpenSearch clusters on the latest Java version. OpenSearch further enhanced performance by adding support for Single Instruction, Multiple Data (SIMD) instruction sets for exact search queries; previous versions already supported SIMD for ANN search queries. The integration of SIMD for exact search requires no additional configuration steps, making it a seamless performance improvement. You can expect a significant reduction in query latencies and a more efficient, responsive search experience, with approximately 1.5 times faster performance than non-SIMD implementations.

Increasing innovation velocity

In November 2023, OpenSearch 2.9 was released on Amazon OpenSearch Service. The release included high-level vector database interfaces such as neural search, hybrid search, and AI connectors. For instance, with neural search, users can run semantic queries with text input instead of vectors. Using AI connectors to services such as Amazon SageMaker, Amazon Bedrock, and OpenAI, neural search encodes text into vectors using the customer’s preferred model and transparently rewrites text-based queries into k-NN queries. Effectively, neural search alleviates the need for customers to develop and manage the custom middleware that applications using the k-NN APIs directly would otherwise require.
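
As a sketch, with placeholder pipeline, index, field, and model names: an ingest pipeline with the text_embedding processor encodes a text field into vectors at indexing time, and a neural query encodes the query text with the same model at search time.

    PUT /_ingest/pipeline/embed-description
    {
      "processors": [
        {
          "text_embedding": {
            "model_id": "<your-embedding-model-id>",
            "field_map": { "description": "product_embedding" }
          }
        }
      ]
    }

    GET /products/_search
    {
      "query": {
        "neural": {
          "product_embedding": {
            "query_text": "a cozy place to curl up by the fire",
            "model_id": "<your-embedding-model-id>",
            "k": 10
          }
        }
      }
    }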

With the subsequent 2.11 and 2.13 releases, OpenSearch added high-level interfaces for multimodal and conversational search, respectively. With multimodal search, customers can run semantic queries using a combination of text and image inputs to find images. As illustrated in this OpenSearch blog post, multimodal search enables new search paradigms. An ecommerce customer, for instance, could use a photo of a shirt and describe alterations such as “with desert colors” to shop for clothes fashioned to their tastes. Facilitated by a connector to Amazon Titan Multimodal Embeddings G1 on Amazon Bedrock, vector generation and query rewriting are handled by OpenSearch.
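
As a sketch (index, field, and model names are placeholders), a multimodal neural query passes both query text and a base64-encoded image against an embedding field populated by a multimodal embedding model:

    GET /apparel/_search
    {
      "query": {
        "neural": {
          "item_multimodal_embedding": {
            "query_text": "with desert colors",
            "query_image": "<base64-encoded image of the shirt>",
            "model_id": "<your-multimodal-embedding-model-id>",
            "k": 10
          }
        }
      }
    }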

Conversational search enables yet another search paradigm, letting users discover information through chat. Conversational searches run RAG pipelines, which use connectors to generative LLMs such as Anthropic’s Claude 3.5 Sonnet in Amazon Bedrock, OpenAI ChatGPT, or DeepSeek R1 to generate conversational responses. A conversational memory module provides the LLM with persistent memory by retaining conversation history.
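
A sketch of the flow, with placeholder names and parameters: a search pipeline with the retrieval_augmented_generation response processor sends the retrieved documents and conversation history to the LLM, and the query supplies the question and a memory ID through the generative_qa_parameters extension.

    PUT /_search/pipeline/rag-pipeline
    {
      "response_processors": [
        {
          "retrieval_augmented_generation": {
            "model_id": "<your-llm-connector-model-id>",
            "context_field_list": ["description"],
            "system_prompt": "You are a helpful assistant."
          }
        }
      ]
    }

    GET /products/_search?search_pipeline=rag-pipeline
    {
      "query": {
        "neural": {
          "product_embedding": {
            "query_text": "Which fittings work with CPVC pipe?",
            "model_id": "<your-embedding-model-id>",
            "k": 5
          }
        }
      },
      "ext": {
        "generative_qa_parameters": {
          "llm_question": "Which fittings work with CPVC pipe?",
          "memory_id": "<your-conversation-memory-id>",
          "context_size": 5
        }
      }
    }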

With OpenSearch 2.17, support for search AI use cases was expanded through AI-native pipelines. With ML inference processors for search requests, search responses, and ingestion, customers can enrich data flows on OpenSearch with any machine learning (ML) model or AI service. Previously, enrichments were limited to a few model types, such as text embedding models to support neural search. Without those limitations on model type, the full breadth of search AI use cases can be powered by the OpenSearch search and ingest pipeline APIs.
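
As a rough sketch only (the pipeline, index, and field names are placeholders, and the input and output mappings depend entirely on your model’s interface), an ml_inference response processor in a search pipeline calls a registered model on each returned document and writes the model output into a new field:

    PUT /_search/pipeline/enrich-results
    {
      "response_processors": [
        {
          "ml_inference": {
            "model_id": "<your-ml-model-id>",
            "input_map": [ { "inputs": "description" } ],
            "output_map": [ { "predicted_label": "response" } ]
          }
        }
      ]
    }

    GET /products/_search?search_pipeline=enrich-results
    {
      "query": { "match": { "description": "CPVC fittings" } }
    }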

Conclusion

OpenSearch continues to explore and enhance its features to build scalable, cost-effective, and low-latency semantic search and vector database solutions. The OpenSearch Service neural search plugin, connector framework, and high-level APIs reduce complexity for builders, making the OpenSearch Service vector database more approachable and powerful. 2024’s improvements span text-based exact searches, semantic search, and hybrid search. These performance enhancements, feature innovations, and integrations provide a robust foundation for creating AI-driven solutions that deliver better performance and more accurate results. Try out these new features with the latest version of OpenSearch.


About the Author

Jon Handler is Director of Solutions Architecture for Search Services at Amazon Web Services, based in Palo Alto, CA. Jon works closely with OpenSearch and Amazon OpenSearch Service, providing help and guidance to a broad range of customers who have generative AI, search, and log analytics workloads for OpenSearch. Prior to joining AWS, Jon’s career as a software developer included four years of coding a large-scale ecommerce search engine. Jon holds a Bachelor of Arts from the University of Pennsylvania, and a Master of Science and a Ph.D. in Computer Science and Artificial Intelligence from Northwestern University.