Using Vector Databases for Embeddings Search¶
This notebook takes you through a simple flow to download some data, embed it, and then index and search it using a selection of vector databases. This is a common requirement for customers who want to store and search our embeddings with their own data in a secure environment to support production use cases such as chatbots, topic modelling and more.
What is a Vector Database¶
A vector database is a database made to store, manage and search embedding vectors. The use of embeddings to encode unstructured data (text, audio, video and more) as vectors for consumption by machine-learning models has exploded in recent years, due to the increasing effectiveness of AI in solving use cases involving natural language, image recognition and other unstructured forms of data. Vector databases have emerged as an effective solution for enterprises to deliver and scale these use cases.
Why use a Vector Database¶
Vector databases enable enterprises to take many of the embeddings use cases we've shared in this repo (question and answering, chatbot and recommendation services, for example), and make use of them in a secure, scalable environment. Many of our customers use embeddings to solve their problems at small scale, but performance and security hold them back from going into production - we see vector databases as a key component in solving that, and in this guide we'll walk through the basics of embedding text data, storing it in a vector database and using it for semantic search.
Demo Flow¶
The demo flow is:
- Setup: Import packages and set any required variables
- Load data: Load a dataset and embed it using OpenAI embeddings
- Pinecone
- Setup: Here we'll set up the Python client for Pinecone. For more details go here
- Index Data: We'll create an index with namespaces for titles and content
- Search Data: We'll test out both namespaces with search queries to confirm it works
- Weaviate
- Setup: Here we'll set up the Python client for Weaviate. For more details go here
- Index Data: We'll create an index with title search vectors in it
- Search Data: We'll run a few searches to confirm it works
- Milvus
- Setup: Here we'll set up the Python client for Milvus. For more details go here
- Index Data: We'll create a collection and index it for both titles and content
- Search Data: We'll test out both collections with search queries to confirm it works
- Qdrant
- Setup: Here we'll set up the Python client for Qdrant. For more details go here
- Index Data: We'll create a collection with vectors for titles and content
- Search Data: We'll run a few searches to confirm it works
- Redis
- Setup: Set up the Redis-Py client. For more details go here
- Index Data: Create the search index for vector search and hybrid search (vector + full-text search) on all available fields.
- Search Data: Run a few example queries with various goals in mind.
Once you've run through this notebook you should have a basic understanding of how to set up and use vector databases, and can move on to more complex use cases making use of our embeddings.
Setup¶
Import the required libraries and set the embedding model that we'd like to use.
# We'll need to install the clients for all vector databases
!pip install pinecone-client
!pip install weaviate-client
!pip install pymilvus
!pip install qdrant-client
!pip install redis
#Install wget to pull zip file
!pip install wget
import openai
import tiktoken
from typing import List, Iterator
import pandas as pd
import numpy as np
import os
import wget
from ast import literal_eval
# Redis client library for Python
import redis
# Pinecone's client library for Python
import pinecone
# Weaviate's client library for Python
import weaviate
# Milvus's client library for Python
import pymilvus
# Qdrant's client library for Python
import qdrant_client
# I've set this to our new embeddings model; this can be changed to the embedding model of your choice
EMBEDDING_MODEL = "text-embedding-ada-002"
# Ignore unclosed SSL socket warnings - optional in case you get these errors
import warnings
warnings.filterwarnings(action="ignore", message="unclosed", category=ResourceWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)
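As a quick optional sanity check, here's a minimal sketch of how we'll call the Embeddings API with this model throughout the notebook; it assumes OPENAI_API_KEY is set in your environment.

# Minimal sketch (assumes OPENAI_API_KEY is exported in your environment)
openai.api_key = os.getenv("OPENAI_API_KEY")
sample_embedding = openai.Embedding.create(input="Hello world", model=EMBEDDING_MODEL)["data"][0]["embedding"]
print(len(sample_embedding))  # text-embedding-ada-002 returns 1536-dimensional vectors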
Load data¶
In this section we'll load embedded data that we've prepared ahead of time.
embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'
# The file is ~700 MB so this will take some time
wget.download(embeddings_url)
'vector_database_wikipedia_articles_embedded.zip'
import zipfile
with zipfile.ZipFile("vector_database_wikipedia_articles_embedded.zip","r") as zip_ref:
zip_ref.extractall("../data")
article_df = pd.read_csv('../data/vector_database_wikipedia_articles_embedded.csv')
article_df.head()
 | id | url | title | text | title_vector | content_vector | vector_id |
---|---|---|---|---|---|---|---|
0 | 1 | https://simple.wikipedia.org/wiki/April | April | April is the fourth month of the year in the J... | [0.001009464613161981, -0.020700545981526375, ... | [-0.011253940872848034, -0.013491976074874401,... | 0 |
1 | 2 | https://simple.wikipedia.org/wiki/August | August | August (Aug.) is the eighth month of the year ... | [0.0009286514250561595, 0.000820168002974242, ... | [0.0003609954728744924, 0.007262262050062418, ... | 1 |
2 | 6 | https://simple.wikipedia.org/wiki/Art | Art | Art is a creative activity that expresses imag... | [0.003393713850528002, 0.0061537534929811954, ... | [-0.004959689453244209, 0.015772193670272827, ... | 2 |
3 | 8 | https://simple.wikipedia.org/wiki/A | A | A or a is the first letter of the English alph... | [0.0153952119871974, -0.013759135268628597, 0.... | [0.024894846603274345, -0.022186409682035446, ... | 3 |
4 | 9 | https://simple.wikipedia.org/wiki/Air | Air | Air refers to the Earth's atmosphere. Air is a... | [0.02224554680287838, -0.02044147066771984, -0... | [0.021524671465158463, 0.018522677943110466, -... | 4 |
# Read vectors from strings back into a list
article_df['title_vector'] = article_df.title_vector.apply(literal_eval)
article_df['content_vector'] = article_df.content_vector.apply(literal_eval)
# Set vector_id to be a string
article_df['vector_id'] = article_df['vector_id'].apply(str)
article_df.info(show_counts=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   id              25000 non-null  int64
 1   url             25000 non-null  object
 2   title           25000 non-null  object
 3   text            25000 non-null  object
 4   title_vector    25000 non-null  object
 5   content_vector  25000 non-null  object
 6   vector_id       25000 non-null  object
dtypes: int64(1), object(6)
memory usage: 1.3+ MB
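Since each vector database below needs to know the dimensionality of our vectors when creating an index or collection, it can be handy to read it straight from the data - a small sketch:

# The vector dimension is reused when creating indexes/collections below
vector_dim = len(article_df['content_vector'][0])
print(f"Vector dimension: {vector_dim}")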
Pinecone¶
We'll index these embedded documents in a vector database and search them. The first option we'll look at is Pinecone, a managed vector database which offers a cloud-native option.
Before you proceed with this step you'll need to navigate to Pinecone, sign up and then save your API key as an environment variable named PINECONE_API_KEY.
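If you haven't exported the key in your shell, one option is to set it from Python before initialising the client - a sketch with a placeholder value:

# Optional: set the key from Python if it isn't already exported in your shell
# (placeholder value - do not commit real keys to source control)
os.environ.setdefault("PINECONE_API_KEY", "your-pinecone-api-key")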
For this section we will:
- Create an index with multiple namespaces for article titles and content
- Store our data in the index with separate searchable "namespaces" for article titles and content
- Fire some similarity search queries to verify our setup is working
api_key = os.getenv("PINECONE_API_KEY")
pinecone.init(api_key=api_key)
Create Index¶
First we will need to create an index, which we'll call wikipedia-articles. Once we have an index, we can create multiple namespaces, which can make a single index searchable for various use cases. For more details, consult the Pinecone documentation.
If you want to batch insert to your index in parallel to increase insertion speed then there is a great guide in the Pinecone documentation on batch inserts in parallel.
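For illustration only, here is a rough sketch of what parallel upserts could look like using Python's concurrent.futures; the Pinecone guide linked above covers this properly, and this sketch assumes the index and df_batcher objects defined in the cells below.

# Hypothetical sketch of parallel upserts (assumes `index` and `df_batcher` from the cells below)
from concurrent.futures import ThreadPoolExecutor

def upsert_batch(batch_df):
    index.upsert(vectors=zip(batch_df.vector_id, batch_df.content_vector), namespace='content')

with ThreadPoolExecutor(max_workers=4) as executor:
    executor.map(upsert_batch, df_batcher(article_df))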
# Models a simple batch generator that makes chunks out of an input DataFrame
class BatchGenerator:

    def __init__(self, batch_size: int = 10) -> None:
        self.batch_size = batch_size

    # Makes chunks out of an input DataFrame
    def to_batches(self, df: pd.DataFrame) -> Iterator[pd.DataFrame]:
        splits = self.splits_num(df.shape[0])
        if splits <= 1:
            yield df
        else:
            for chunk in np.array_split(df, splits):
                yield chunk

    # Determines how many chunks DataFrame contains
    def splits_num(self, elements: int) -> int:
        return round(elements / self.batch_size)

    __call__ = to_batches
df_batcher = BatchGenerator(300)
# Pick a name for the new index
index_name = 'wikipedia-articles'
# Check whether the index with the same name already exists - if so, delete it
if index_name in pinecone.list_indexes():
    pinecone.delete_index(index_name)
# Creates new index
pinecone.create_index(name=index_name, dimension=len(article_df['content_vector'][0]))
index = pinecone.Index(index_name=index_name)
# Confirm our index was created
pinecone.list_indexes()
# Upsert content vectors in content namespace - this can take a few minutes
print("Uploading vectors to content namespace..")
for batch_df in df_batcher(article_df):
    index.upsert(vectors=zip(batch_df.vector_id, batch_df.content_vector), namespace='content')
# Upsert title vectors in title namespace - this can also take a few minutes
print("Uploading vectors to title namespace..")
for batch_df in df_batcher(article_df):
    index.upsert(vectors=zip(batch_df.vector_id, batch_df.title_vector), namespace='title')
# Check index size for each namespace to confirm all of our docs have loaded
index.describe_index_stats()
Search data¶
Now we'll enter some dummy searches and check we get decent results back
# First we'll create dictionaries mapping vector IDs to their outputs so we can retrieve the text for our search results
titles_mapped = dict(zip(article_df.vector_id,article_df.title))
content_mapped = dict(zip(article_df.vector_id,article_df.text))
def query_article(query, namespace, top_k=5):
    '''Queries an article using its title in the specified
    namespace and prints results.'''

    # Create vector embeddings based on the title column
    embedded_query = openai.Embedding.create(
        input=query,
        model=EMBEDDING_MODEL,
    )["data"][0]['embedding']

    # Query namespace passed as parameter using title vector
    query_result = index.query(embedded_query,
                               namespace=namespace,
                               top_k=top_k)

    # Print query results
    print(f'\nMost similar results to {query} in "{namespace}" namespace:\n')
    if not query_result.matches:
        print('no query result')

    matches = query_result.matches
    ids = [res.id for res in matches]
    scores = [res.score for res in matches]
    df = pd.DataFrame({'id': ids,
                       'score': scores,
                       'title': [titles_mapped[_id] for _id in ids],
                       'content': [content_mapped[_id] for _id in ids],
                       })

    counter = 0
    for k, v in df.iterrows():
        counter += 1
        print(f'{v.title} (score = {v.score})')

    print('\n')

    return df
query_output = query_article('modern art in Europe','title')
content_query_output = query_article("Famous battles in Scottish history",'content')
Weaviate¶
Another vector database option we'll explore is Weaviate, which offers both a managed, SaaS option, as well as a self-hosted open source option. As we've already looked at a cloud vector database, we'll try the self-hosted option here.
For this we will:
- Set up a local deployment of Weaviate
- Create indices in Weaviate
- Store our data there
- Fire some similarity search queries
- Try a real use case
Bring your own vectors approach¶
In this cookbook, we provide the data with already generated vectors. This is a good approach for scenarios where your data is already vectorized.
Automated vectorization with OpenAI module¶
For scenarios where your data is not vectorized yet, you can delegate the vectorization task to Weaviate, which calls OpenAI on your behalf. Weaviate offers a built-in module, text2vec-openai, which takes care of vectorization for you:
- at import time
- for any CRUD operations
- for semantic search
Check out the Getting Started with Weaviate and OpenAI module cookbook to learn step by step how to import and vectorize data in one step.
Setup¶
To run Weaviate locally, you'll need Docker. Following the instructions contained in the Weaviate documentation here, we created an example docker-compose.yml file in this repo saved at ./weaviate/docker-compose.yml.
After starting Docker, you can start Weaviate locally by navigating to the examples/vector_databases/weaviate/ directory and running docker-compose up -d.
SaaS¶
Alternatively you can use Weaviate Cloud Service (WCS) to create a free Weaviate cluster.
- create a free account and/or login to WCS
- create a Weaviate Cluster with the following settings:
  - Sandbox: Sandbox Free
  - Weaviate Version: Use default (latest)
  - OIDC Authentication: Disabled
- your instance should be ready in a minute or two
- make a note of the Cluster Id. The link will take you to the full path of your cluster (you will need it later to connect to it). It should be something like: https://your-project-name-suffix.weaviate.network
# Option #1 - Self-hosted - Weaviate Open Source
client = weaviate.Client(
url="http://localhost:8080",
additional_headers={
"X-OpenAI-Api-Key": os.getenv("OPENAI_API_KEY")
}
)
# Option #2 - SaaS - (Weaviate Cloud Service)
client = weaviate.Client(
url="https://your-wcs-instance-name.weaviate.network",
additional_headers={
"X-OpenAI-Api-Key": os.getenv("OPENAI_API_KEY")
}
)
client.is_ready()
# Clear up the schema, so that we can recreate it
client.schema.delete_all()
client.schema.get()
# Define the Schema object to use `text-embedding-ada-002` on `title` and `content`, but skip it for `url`
article_schema = {
"class": "Article",
"description": "A collection of articles",
"vectorizer": "text2vec-openai",
"moduleConfig": {
"text2vec-openai": {
"model": "ada",
"modelVersion": "002",
"type": "text"
}
},
"properties": [{
"name": "title",
"description": "Title of the article",
"dataType": ["string"]
},
{
"name": "content",
"description": "Contents of the article",
"dataType": ["text"],
"moduleConfig": { "text2vec-openai": { "skip": True } }
}]
}
# add the Article schema
client.schema.create_class(article_schema)
# get the schema to make sure it worked
client.schema.get()
### Step 1 - configure Weaviate Batch, which optimizes CRUD operations in bulk
# - starting batch size of 100
# - dynamically increase/decrease based on performance
# - add timeout retries if something goes wrong
client.batch.configure(
batch_size=100,
dynamic=True,
timeout_retries=3,
)
### Step 2 - import data
print("Uploading data with vectors to Article schema..")
counter = 0

with client.batch as batch:
    for k, v in article_df.iterrows():
        # print update message every 100 objects
        if (counter % 100 == 0):
            print(f"Import {counter} / {len(article_df)} ")

        properties = {
            "title": v["title"],
            "content": v["text"]
        }

        vector = v["title_vector"]

        batch.add_data_object(properties, "Article", None, vector)
        counter = counter + 1

print(f"Importing ({len(article_df)}) Articles complete")
# Test that all data has loaded – get object count
result = (
client.query.aggregate("Article")
.with_fields("meta { count }")
.do()
)
print("Object count: ", result["data"]["Aggregate"]["Article"])
# Test one article has worked by checking one object
test_article = (
client.query
.get("Article", ["title", "content", "_additional {id}"])
.with_limit(1)
.do()
)["data"]["Get"]["Article"][0]
print(test_article["_additional"]["id"])
print(test_article["title"])
print(test_article["content"])
Search data¶
As above, we'll fire some queries at our new Index and get back results based on the closeness to our existing vectors
def query_weaviate(query, collection_name, top_k=20):

    # Creates embedding vector from user query
    embedded_query = openai.Embedding.create(
        input=query,
        model=EMBEDDING_MODEL,
    )["data"][0]['embedding']

    near_vector = {"vector": embedded_query}

    # Queries input schema with vectorised user query
    query_result = (
        client.query
        .get(collection_name, ["title", "content", "_additional {certainty distance}"])
        .with_near_vector(near_vector)
        .with_limit(top_k)
        .do()
    )

    return query_result
query_result = query_weaviate("modern art in Europe", "Article")
counter = 0
for article in query_result["data"]["Get"]["Article"]:
    counter += 1
    print(f"{counter}. {article['title']} (Certainty: {round(article['_additional']['certainty'],3)}) (Distance: {round(article['_additional']['distance'],3)})")
query_result = query_weaviate("Famous battles in Scottish history", "Article")
counter = 0
for article in query_result["data"]["Get"]["Article"]:
    counter += 1
    print(f"{counter}. {article['title']} (Score: {round(article['_additional']['certainty'],3)})")
Let Weaviate handle vector embeddings¶
Weaviate has a built-in module for OpenAI, which takes care of the steps required to generate a vector embedding for your queries and any CRUD operations.
This allows you to run a vector query with the with_near_text filter, which uses your OPENAI_API_KEY.
def near_text_weaviate(query, collection_name):

    nearText = {
        "concepts": [query],
        "distance": 0.7,
    }

    properties = [
        "title", "content",
        "_additional {certainty distance}"
    ]

    query_result = (
        client.query
        .get(collection_name, properties)
        .with_near_text(nearText)
        .with_limit(20)
        .do()
    )["data"]["Get"][collection_name]

    print(f"Objects returned: {len(query_result)}")

    return query_result
query_result = near_text_weaviate("modern art in Europe","Article")
counter = 0
for article in query_result:
    counter += 1
    print(f"{counter}. {article['title']} (Certainty: {round(article['_additional']['certainty'],3)}) (Distance: {round(article['_additional']['distance'],3)})")
query_result = near_text_weaviate("Famous battles in Scottish history","Article")
counter = 0
for article in query_result:
    counter += 1
    print(f"{counter}. {article['title']} (Certainty: {round(article['_additional']['certainty'],3)}) (Distance: {round(article['_additional']['distance'],3)})")
Milvus¶
The next vector database we will take a look at is Milvus, which also offers a SaaS option like the previous two, as well as self-hosted options using either helm or docker-compose. Sticking to the idea of open source, we will show our self-hosted example here.
In this example we will:
- Set up a local docker-compose based deployment
- Create the title and content collections
- Store our data
- Test out our system with real world searches
Setup¶
There are many ways to run Milvus (take a look here), but for now we will stick to a simple standalone Milvus instance with docker-compose.
A simple docker-compose file can be found at ./milvus/docker-compose.yaml and can be run with docker-compose up from within that directory, or with docker-compose -f path/to/file up from elsewhere.
from pymilvus import connections
connections.connect(host='localhost', port=19530) # Local instance defaults to port 19530
Index data¶
In Milvus data is stored in the form of collections, with each collection being able to store the vectors and any attributes that come with them.
In this case we'll create a collection called articles which contains the url, title, text and content vector for each article.
In addition to this we will also create an index on the content embedding. Milvus allows for the use of many SOTA indexing methods, but in this case, we are going to use HNSW.
from pymilvus import utility, Collection, FieldSchema, CollectionSchema, DataType
# Remove the collection if it already exists.
if utility.has_collection('articles'):
    utility.drop_collection('articles')
fields = [
FieldSchema(name='id', dtype=DataType.INT64),
FieldSchema(name='url', dtype=DataType.VARCHAR, max_length=1000), # Strings have to specify a max length [1, 65535]
FieldSchema(name='title', dtype=DataType.VARCHAR, max_length=1000),
FieldSchema(name='text', dtype=DataType.VARCHAR, max_length=1000),
FieldSchema(name='content_vector', dtype=DataType.FLOAT_VECTOR, dim=len(article_df['content_vector'][0])),
FieldSchema(name='vector_id', dtype=DataType.INT64, is_primary=True, auto_id=False),
]
col_schema = CollectionSchema(fields)
col = Collection('articles', col_schema)
# Using a basic HNSW index for this example
index = {
'index_type': 'HNSW',
'metric_type': 'L2',
'params': {
'M': 8,
'efConstruction': 64
},
}
col.create_index('content_vector', index)
col.load()
# A basic batching function, similar to the BatchGenerator used in the Pinecone section
def to_batches(df: pd.DataFrame, batch_size: int) -> Iterator[pd.DataFrame]:
    splits = df.shape[0] / batch_size
    if splits <= 1:
        yield df
    else:
        for chunk in np.array_split(df, splits):
            yield chunk
# Since we are storing the text within Milvus we need to clip any that are over our set limit.
# We can also set the limit to be higher, but that slows down the search requests as more info
# needs to be sent back.
def shorten_text(text):
    if len(text) >= 996:
        return text[:996] + '...'
    else:
        return text

for batch in to_batches(article_df, 1000):
    batch = batch.drop(columns=['title_vector'])
    batch['text'] = batch.text.apply(shorten_text)
    # Since vector_id was converted to a string for compatibility with other vector DBs,
    # we swap it back to its original integer form here.
    batch['vector_id'] = batch.vector_id.apply(int)
    col.insert(batch)
col.flush()
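As a quick check before searching, you can confirm the inserts landed - a small sketch using the collection's num_entities property from pymilvus:

# Sanity check: number of entities persisted in the collection after flush
print(f"Entities in 'articles' collection: {col.num_entities}")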
Search¶
Once the data is inserted into Milvus we can perform searches. For this example the search function takes the query text and top_k, the number of closest matches to return.
def query_article(query, top_k=5):
    # Generate the embedding with openai
    embedded_query = openai.Embedding.create(
        input=query,
        model=EMBEDDING_MODEL,
    )["data"][0]['embedding']

    # Using some basic params for HNSW
    search_param = {
        'ef': max(64, top_k)
    }

    # Perform the search.
    res = col.search([embedded_query], 'content_vector', search_param, output_fields=['title', 'url'], limit=top_k)

    ret = []
    for hit in res[0]:
        # Get the id, distance, and title for the results
        ret.append({'vector_id': hit.id, 'distance': hit.score, 'title': hit.entity.get('title'), 'url': hit.entity.get('url')})
    return ret
for x in query_article('fastest plane ever made', 3):
    print(x.items())
Qdrant¶
The next vector database we'll consider is Qdrant. This is a high-performance vector search database written in Rust. It offers both on-premises and cloud versions, but for the purposes of this example we're going to use the local deployment mode.
Setting everything up will require:
- Spinning up a local instance of Qdrant
- Configuring the collection and storing the data in it
- Trying it out with some queries
Setup¶
For the local deployment, we are going to use Docker, following the Qdrant documentation: https://qdrant.tech/documentation/quick_start/. Qdrant requires just a single container, but an example docker-compose.yaml file is available at ./qdrant/docker-compose.yaml in this repo.
You can start a Qdrant instance locally by navigating to this directory and running docker-compose up -d.
qdrant = qdrant_client.QdrantClient(host='localhost', prefer_grpc=True)
qdrant.get_collections()
Index data¶
Qdrant stores data in collections where each object is described by at least one vector and may contain additional metadata called a payload. Our collection will be called Articles and each object will be described by both title and content vectors.
We'll be using the official qdrant-client package, which has all the utility methods already built in.
from qdrant_client.http import models as rest
vector_size = len(article_df['content_vector'][0])
qdrant.recreate_collection(
collection_name='Articles',
vectors_config={
'title': rest.VectorParams(
distance=rest.Distance.COSINE,
size=vector_size,
),
'content': rest.VectorParams(
distance=rest.Distance.COSINE,
size=vector_size,
),
}
)
qdrant.upsert(
collection_name='Articles',
points=[
rest.PointStruct(
id=k,
vector={
'title': v['title_vector'],
'content': v['content_vector'],
},
payload=v.to_dict(),
)
for k, v in article_df.iterrows()
],
)
# Check the collection size to make sure all the points have been stored
qdrant.count(collection_name='Articles')
Search Data¶
Once the data is in Qdrant we can start querying the collection for the closest vectors. We may provide an additional parameter vector_name to switch from title-based to content-based search.
def query_qdrant(query, collection_name, vector_name='title', top_k=20):

    # Creates embedding vector from user query
    embedded_query = openai.Embedding.create(
        input=query,
        model=EMBEDDING_MODEL,
    )['data'][0]['embedding']

    query_results = qdrant.search(
        collection_name=collection_name,
        query_vector=(
            vector_name, embedded_query
        ),
        limit=top_k,
    )

    return query_results
query_results = query_qdrant('modern art in Europe', 'Articles')
for i, article in enumerate(query_results):
    print(f'{i + 1}. {article.payload["title"]} (Score: {round(article.score, 3)})')
# This time we'll query using content vector
query_results = query_qdrant('Famous battles in Scottish history', 'Articles', 'content')
for i, article in enumerate(query_results):
    print(f'{i + 1}. {article.payload["title"]} (Score: {round(article.score, 3)})')
Redis¶
The last vector database covered in this tutorial is Redis. You most likely already know Redis. What you might not be aware of is the RediSearch module. Enterprises have been using Redis with the RediSearch module for years across all major cloud providers, Redis Cloud, and on premises. Recently, the Redis team added vector storage and search capability to this module in addition to the features RediSearch already had.
Given the large ecosystem around Redis, there are most likely client libraries in the language you need. You can use any standard Redis client library to run RediSearch commands, but it's easiest to use a library that wraps the RediSearch API. Below are a few examples, but you can find more client libraries here.
Project | Language | License | Author | Stars |
---|---|---|---|---|
jedis | Java | MIT | Redis | |
redis-py | Python | MIT | Redis | |
node-redis | Node.js | MIT | Redis | |
nredisstack | .NET | MIT | Redis | |
redisearch-go | Go | BSD | Redis | |
redisearch-api-rs | Rust | BSD | Redis |
In the below cells, we will walk you through using Redis as a vector database. Since many of you are likely already used to the Redis API, this should be familiar to most.
Setup¶
There are many ways to deploy Redis with RediSearch. The easiest way to get started is to use Docker, but there are many other options for deployment. For other deployment options, see the redis directory in this repo.
For this tutorial, we will use Redis Stack on Docker.
Start a version of Redis with RediSearch (Redis Stack) by running the following docker command
$ cd redis
$ docker compose up -d
This also includes the RedisInsight GUI for managing your Redis database which you can view at http://localhost:8001 once you start the docker container.
You're all set up and ready to go! Next, we import and create our client for communicating with the Redis database we just created.
import redis
from redis.commands.search.indexDefinition import (
IndexDefinition,
IndexType
)
from redis.commands.search.query import Query
from redis.commands.search.field import (
TextField,
VectorField
)
REDIS_HOST = "localhost"
REDIS_PORT = 6379
REDIS_PASSWORD = "" # default for passwordless Redis
# Connect to Redis
redis_client = redis.Redis(
host=REDIS_HOST,
port=REDIS_PORT,
password=REDIS_PASSWORD
)
redis_client.ping()
True
Creating a Search Index¶
The below cells will show how to specify and create a search index in Redis. We will
- Set some constants for defining our index like the distance metric and the index name
- Define the index schema with RediSearch fields
- Create the index
# Constants
VECTOR_DIM = len(article_df['title_vector'][0]) # length of the vectors
VECTOR_NUMBER = len(article_df) # initial number of vectors
INDEX_NAME = "embeddings-index" # name of the search index
PREFIX = "doc" # prefix for the document keys
DISTANCE_METRIC = "COSINE" # distance metric for the vectors (ex. COSINE, IP, L2)
# Define RediSearch fields for each of the columns in the dataset
title = TextField(name="title")
url = TextField(name="url")
text = TextField(name="text")
title_embedding = VectorField("title_vector",
"FLAT", {
"TYPE": "FLOAT32",
"DIM": VECTOR_DIM,
"DISTANCE_METRIC": DISTANCE_METRIC,
"INITIAL_CAP": VECTOR_NUMBER,
}
)
text_embedding = VectorField("content_vector",
"FLAT", {
"TYPE": "FLOAT32",
"DIM": VECTOR_DIM,
"DISTANCE_METRIC": DISTANCE_METRIC,
"INITIAL_CAP": VECTOR_NUMBER,
}
)
fields = [title, url, text, title_embedding, text_embedding]
# Check if index exists
try:
    redis_client.ft(INDEX_NAME).info()
    print("Index already exists")
except:
    # Create RediSearch Index
    redis_client.ft(INDEX_NAME).create_index(
        fields=fields,
        definition=IndexDefinition(prefix=[PREFIX], index_type=IndexType.HASH)
    )
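The FLAT index type used above performs exact (brute-force) vector search. If your dataset grows much larger, RediSearch also supports an approximate HNSW index; below is a hedged sketch of an equivalent field definition you could swap in (the M and EF_CONSTRUCTION values are illustrative, not tuned).

# Sketch of an approximate HNSW vector field (alternative to FLAT; parameter values are illustrative)
title_embedding_hnsw = VectorField("title_vector",
    "HNSW", {
        "TYPE": "FLOAT32",
        "DIM": VECTOR_DIM,
        "DISTANCE_METRIC": DISTANCE_METRIC,
        "INITIAL_CAP": VECTOR_NUMBER,
        "M": 16,
        "EF_CONSTRUCTION": 200,
    }
)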
Load Documents into the Index¶
Now that we have a search index, we can load documents into it. We will use the same documents we used in the previous examples. In Redis, either the Hash or JSON (if using RedisJSON in addition to RediSearch) data types can be used to store documents. We will use the HASH data type in this example. The below cells will show how to load documents into the index.
def index_documents(client: redis.Redis, prefix: str, documents: pd.DataFrame):
    records = documents.to_dict("records")
    for doc in records:
        key = f"{prefix}:{str(doc['id'])}"

        # create byte vectors for title and content
        title_embedding = np.array(doc["title_vector"], dtype=np.float32).tobytes()
        content_embedding = np.array(doc["content_vector"], dtype=np.float32).tobytes()

        # replace list of floats with byte vectors
        doc["title_vector"] = title_embedding
        doc["content_vector"] = content_embedding

        client.hset(key, mapping=doc)
index_documents(redis_client, PREFIX, article_df)
print(f"Loaded {redis_client.info()['db0']['keys']} documents in Redis search index with name: {INDEX_NAME}")
Loaded 25000 documents in Redis search index with name: embeddings-index
Running Search Queries¶
Now that we have a search index and documents loaded into it, we can run search queries. Below we will provide a function that will run a search query and return the results. Using this function we run a few queries that will show how you can utilize Redis as a vector database. Each example will demonstrate specific features to keep in mind when developing your search application with Redis.
- Return Fields: You can specify which fields you want to return in the search results. This is useful if you only want to return a subset of the fields in your documents and doesn't require a separate call to retrieve documents. In the below example, we will only return the title field in the search results.
- Hybrid Search: You can combine vector search with any of the other RediSearch fields for hybrid search such as full text search, tag, geo, and numeric. In the below example, we will combine vector search with full text search.
def search_redis(
    redis_client: redis.Redis,
    user_query: str,
    index_name: str = "embeddings-index",
    vector_field: str = "title_vector",
    return_fields: list = ["title", "url", "text", "vector_score"],
    hybrid_fields = "*",
    k: int = 20,
) -> List[dict]:

    # Creates embedding vector from user query
    embedded_query = openai.Embedding.create(input=user_query,
                                             model=EMBEDDING_MODEL,
                                             )["data"][0]['embedding']

    # Prepare the Query
    base_query = f'{hybrid_fields}=>[KNN {k} @{vector_field} $vector AS vector_score]'
    query = (
        Query(base_query)
        .return_fields(*return_fields)
        .sort_by("vector_score")
        .paging(0, k)
        .dialect(2)
    )
    params_dict = {"vector": np.array(embedded_query).astype(dtype=np.float32).tobytes()}

    # perform vector search
    results = redis_client.ft(index_name).search(query, params_dict)
    for i, article in enumerate(results.docs):
        score = 1 - float(article.vector_score)
        print(f"{i}. {article.title} (Score: {round(score, 3)})")
    return results.docs
# For using OpenAI to generate query embedding
openai.api_key = os.getenv("OPENAI_API_KEY", "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")
results = search_redis(redis_client, 'modern art in Europe', k=10)
0. Museum of Modern Art (Score: 0.875)
1. Western Europe (Score: 0.867)
2. Renaissance art (Score: 0.864)
3. Pop art (Score: 0.86)
4. Northern Europe (Score: 0.855)
5. Hellenistic art (Score: 0.853)
6. Modernist literature (Score: 0.847)
7. Art film (Score: 0.843)
8. Central Europe (Score: 0.843)
9. European (Score: 0.841)
results = search_redis(redis_client, 'Famous battles in Scottish history', vector_field='content_vector', k=10)
0. Battle of Bannockburn (Score: 0.869)
1. Wars of Scottish Independence (Score: 0.861)
2. 1651 (Score: 0.853)
3. First War of Scottish Independence (Score: 0.85)
4. Robert I of Scotland (Score: 0.846)
5. 841 (Score: 0.844)
6. 1716 (Score: 0.844)
7. 1314 (Score: 0.837)
8. 1263 (Score: 0.836)
9. William Wallace (Score: 0.835)
Hybrid Queries with Redis¶
The previous examples showed how to run vector search queries with RediSearch. In this section, we will show how to combine vector search with other RediSearch fields for hybrid search. In the below example, we will combine vector search with full text search.
def create_hybrid_field(field_name: str, value: str) -> str:
    return f'@{field_name}:"{value}"'
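To make the mechanics concrete, here's a small illustrative sketch of what the helper produces and how it slots into the KNN query string built by search_redis (values are just examples):

# create_hybrid_field("title", "Scottish") returns the RediSearch filter: @title:"Scottish"
# Combined with the KNN clause built in search_redis, the full query string becomes:
#   @title:"Scottish"=>[KNN 5 @title_vector $vector AS vector_score]
print(create_hybrid_field("title", "Scottish"))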
# search the content vector for articles about famous battles in Scottish history and only include results with Scottish in the title
results = search_redis(redis_client,
"Famous battles in Scottish history",
vector_field="title_vector",
k=5,
hybrid_fields=create_hybrid_field("title", "Scottish")
)
0. First War of Scottish Independence (Score: 0.892)
1. Wars of Scottish Independence (Score: 0.889)
2. Second War of Scottish Independence (Score: 0.879)
3. List of Scottish monarchs (Score: 0.873)
4. Scottish Borders (Score: 0.863)
# run a hybrid query for articles about Art in the title vector and only include results with the phrase "Leonardo da Vinci" in the text
results = search_redis(redis_client,
"Art",
vector_field="title_vector",
k=5,
hybrid_fields=create_hybrid_field("text", "Leonardo da Vinci")
)
# find specific mention of Leonardo da Vinci in the text that our full-text-search query returned
mention = [sentence for sentence in results[0].text.split("\n") if "Leonardo da Vinci" in sentence][0]
mention
0. Art (Score: 1.0)
1. Paint (Score: 0.896)
2. Renaissance art (Score: 0.88)
3. Painting (Score: 0.874)
4. Renaissance (Score: 0.846)
'In Europe, after the Middle Ages, there was a "Renaissance" which means "rebirth". People rediscovered science and artists were allowed to paint subjects other than religious subjects. People like Michelangelo and Leonardo da Vinci still painted religious pictures, but they also now could paint mythological pictures too. These artists also invented perspective where things in the distance look smaller in the picture. This was new because in the Middle Ages people would paint all the figures close up and just overlapping each other. These artists used nudity regularly in their art.'
For more examples of using Redis as a vector database, see the README and examples within the vector_databases/redis directory of this repository.
Thanks for following along! You're now equipped to set up your own vector databases and use embeddings to do all kinds of cool things. Enjoy! For more complex use cases, please continue to work through other cookbook examples in this repo.