Creating a tiny vector store from scratch

Minyang Chen
6 min read · Mar 7, 2024


Sometimes I find that constructing a small version of something from scratch is a helpful way to learn new concepts. In this article, I will walk through my process of creating a petite vector database for educational purposes.

A vector database, also known as a vector store, is a type of database that stores vectors: fixed-length lists of numbers. Vector stores simplify storing and retrieving embedding vectors and allow for efficient similarity searches among them. Below, I share how I built a miniature vector store from scratch to deepen my understanding of the concept.

Vector Embedding

Vector embedding is an approach that lets us convert unstructured data into numeric values without losing its semantics.

For example, in the RGB color system, a color is represented by three numbers:

[red, green, blue] ⇒ [5, 205, 0]

The same concept applies to text:

[unstructured text] ⇒ [x1, x2, …, xn]

However, converting complex data such as unstructured text takes a bit more work. The data is mapped into a vector space that preserves these complex relationships by a vectorizer (a.k.a. an embedding model).

Semantic Search

Vector embeddings numerically capture the semantic meaning of objects in relation to other objects. Similar objects naturally group together in the vector space, which allows us to measure how close to each other, and therefore how similar, they are.

See diagram below:

In the example diagram, fruit names naturally fall into one group, and the same happens for water and fish. Similarity search then lets us find and retrieve similar objects using their mathematical representations.

Following the example above, when searching for apples, we can find similar fruits because they sit closer to apple in the vector space. This concept of semantic search is based on contextual meaning; it allows for a more human-like search experience by retrieving results that match the user's intent.

Vector Similarity

In short, once data has a numerical representation, we can apply mathematical operations to it, such as calculating the distance between two vectors. There are a few ways to measure the similarity between two vectors.

The most common one today is cosine similarity, which calculates the cosine of the angle between two vectors; other measures such as Euclidean distance and the dot product are also used. See the diagram below:

Another important keyword is "fixed-length": the query vector and the stored vectors must have the same length, otherwise the similarity search will fail.
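For instance, NumPy refuses to take the dot product of two 1-D arrays with different lengths, so a mismatched query raises an error like this:

import numpy as np

np.dot(np.zeros(3), np.zeros(4))
# ValueError: shapes (3,) and (4,) not aligned: 3 (dim 0) != 4 (dim 0)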

For the tiny vector store, I implemented only the cosine similarity function, which follows the formula below.
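cos(θ) = (A · B) / (‖A‖ × ‖B‖)

where A · B is the dot product of the two vectors and ‖A‖ is the Euclidean norm (length) of A.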

What does the result tell us?

If cos(θ) = 1, the two vectors are most similar; if cos(θ) = -1, the two vectors point in opposite directions. This is because the cosine measures how much two vectors overlap in direction: the highest value means most similar, the lowest means opposite. Now that we have the basics clear, let's put the implementation together.
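A quick sanity check with NumPy illustrates these boundary values:

import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
print(cosine(a, a))                           # 1.0, identical direction
print(cosine(a, -a))                          # -1.0, opposite direction
print(cosine(a, np.array([-3.0, 0.0, 1.0])))  # 0.0, orthogonal (unrelated)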

Tiny Vector Store Implementation

This is a simple implementation of a vector store that allows us to store data, save it to disk, and perform cosine similarity search. First, we need in-memory storage for the data, the index, the vocabulary, and the size of the vector array.

self._vector_data = {}          # stores the vector dataset
self._vector_index = {}         # indexing structure for retrieval
self._vocabulary = set()        # unique set of vocabulary words
self._word_to_index = {}        # maps each word to an index id
self._max_tokens = max_tokens   # max size of the array
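For context, here is a minimal sketch of how these fields might sit in the constructor; the name, persist_data, and data_storage_path parameters are assumptions based on how the class is used in the tests later:

def __init__(self, name: str = "default", max_tokens: int = 768,
             persist_data: bool = False, data_storage_path: str = None):
    # Assumed constructor shape; only max_tokens and the in-memory
    # holders appear in the original snippet.
    self._name = name
    self._persist_data = persist_data
    self._data_storage_path = data_storage_path
    self._vector_data = {}
    self._vector_index = {}
    self._vocabulary = set()
    self._word_to_index = {}
    self._max_tokens = max_tokens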

Next, we need functions to add and retrieve vectors:

def add_vector(self, vector_id, vector):
    logger.debug(f"add_vector id: {vector_id} | vector: {vector}")
    self._vector_data[vector_id] = vector
    self._update_index(vector_id, vector)

def get_vector(self, vector_id):
    vector = self._vector_data.get(vector_id)
    logger.debug(f"get_vector id: {vector_id} \n vector: {vector}")
    return vector
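For illustration, adding a vector and reading it back could look like this (store here is an assumed TinyVectorStore instance):

import numpy as np

store.add_vector("doc-1", np.array([1.0, 0.0, 2.0]))
vector = store.get_vector("doc-1")  # -> array([1., 0., 2.])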

And a helper to compute cosine similarity:

def _cosine_similarity(self, vector1, vector2):
    # cosine = dot(v1, v2) / (|v1| * |v2|)
    return np.dot(vector1, vector2) / (
        np.linalg.norm(vector1) * np.linalg.norm(vector2)
    )

We also need to keep the index up to date; it stores the pairwise similarity between each new vector and every existing vector in the store:

def _update_index(self, vector_id, vector):
    logger.debug(f"_update_index id: {vector_id} \n vector: {vector}")
    for existing_id, existing_vector in self._vector_data.items():
        similarity = self._cosine_similarity(vector, existing_vector)
        if existing_id not in self._vector_index:
            self._vector_index[existing_id] = {}
            logger.warning(
                f"_update_index id: index not found [{existing_id}]")
        logger.info(
            f"_update_index id: apply update on existing index: similarity[{similarity}]"
        )
        self._vector_index[existing_id][vector_id] = similarity

To convert text into a numeric representation, we need an encoding function. This one uses a simple bag-of-words scheme: every vocabulary word owns a fixed slot in the vector, and each slot counts how often that word occurs in the sentence:


def text_to_vector(self, sentence: str):
    self._update_vocabulary_and_index([sentence])
    tokens = sentence.lower().split()
    if self._max_tokens < len(self._word_to_index):
        raise ValueError(
            f"max tokens can not be smaller than words index size: {len(self._word_to_index)}")
    # static vector size
    vector = np.zeros(self._max_tokens)
    for token in tokens:
        vector[self._word_to_index[token]] += 1
    return vector
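The helper _update_vocabulary_and_index is referenced above but not shown; here is a minimal sketch of what it could look like, assuming it simply grows the vocabulary and assigns each new word the next free slot:

def _update_vocabulary_and_index(self, sentences: list):
    # Hypothetical sketch: add unseen words to the vocabulary and
    # give each one the next available index slot.
    for sentence in sentences:
        for token in sentence.lower().split():
            if token not in self._vocabulary:
                self._vocabulary.add(token)
                self._word_to_index[token] = len(self._word_to_index)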

And the search function:

def similarity_search(self, query: str, num_results: int = 3):
    logger.debug(f"similarity_search query:{query}\n")
    query_vector = self.text_to_vector(query)
    similar_sentences = self.find_similar_vectors(
        query_vector, num_results=num_results
    )
    result = []
    for sentence, similarity in similar_sentences:
        result.append([sentence, f"{similarity:.4f}"])
    logger.debug(f"similar sentences found: {result}")
    return result
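similarity_search relies on find_similar_vectors, which is not shown above; here is a minimal sketch, assuming vector ids are the original sentences and the search brute-forces a cosine score against every stored vector:

def find_similar_vectors(self, query_vector, num_results: int = 3):
    # Hypothetical sketch: score the query against each stored vector,
    # then return the top matches as (id, similarity) pairs.
    scores = [
        (vector_id, self._cosine_similarity(query_vector, vector))
        for vector_id, vector in self._vector_data.items()
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:num_results]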

Next, we also want to persist the data to disk and load it back, so we add save and load functionality.

def save_to_disk(self):
    self._init_persistance(self._data_storage_path, self._cache_path)
    save_path = self._data_storage_path + "/" + self._name
    if not os.path.exists(save_path):
        os.mkdir(save_path)
    datafile = save_path + "/" + self._name
    joblib.dump(self._vector_data, datafile + ".data.gz", compress=("gzip", 3))
    joblib.dump(self._vector_index, datafile + ".index.gz", compress=("gzip", 3))
    joblib.dump(self._vocabulary, datafile + ".vocabulary.gz", compress=("gzip", 3))
    joblib.dump(self._word_to_index, datafile + ".word_to_index.gz", compress=("gzip", 3))
    with open(datafile + ".hash", "w", newline="") as file:
        file.write(str(joblib.hash(self)))
    print(f"TinyVectorstore data files {datafile} saved to disk!")

@classmethod
def load_from_disk(cls, name: str = "default", data_storage_path: str = None, max_tokens: int = 768):
    if data_storage_path is None:
        datafile = str(cls._DATA_STORAGE_PATH) + "/" + name
    else:
        datafile = str(data_storage_path) + "/" + name + "/" + name
    print("loading directory ", datafile)
    if os.path.exists(datafile + ".data.gz"):
        instance = cls(max_tokens=max_tokens)
        instance._vector_data = joblib.load(datafile + ".data.gz")
        instance._vector_index = joblib.load(datafile + ".index.gz")
        instance._vocabulary = joblib.load(datafile + ".vocabulary.gz")
        instance._word_to_index = joblib.load(datafile + ".word_to_index.gz")
        print("TinyVectorstore data file loaded!")
        return instance
    else:
        print("Missing TinyVectorstore data file!")
        return cls()
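A save/load round trip then looks something like this (the storage path is illustrative):

store.save_to_disk()
restored = TinyVectorStore.load_from_disk(
    name="short_term_logs", data_storage_path="./data", max_tokens=768)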

Also, long text needs to be split into chunks of the same length, either by word count or by character length.

@staticmethod
def sentence_word_splitter(num_of_words: int, sentence: str) -> list:
    pieces = sentence.split()
    return [" ".join(pieces[i:i + num_of_words])
            for i in range(0, len(pieces), num_of_words)]

@staticmethod
def chunk_text_to_fixed_length(text: str, length: int):
    text = text.strip()
    return [text[i:length + i] for i in range(0, len(text), length)]

def chunk_sentences_to_fixed_length(self, sentences: list, max_length: int = 768):
    fixed_size_sentences = []
    for sentence in sentences:
        chunks = TinyVectorStore.chunk_text_to_fixed_length(
            text=sentence, length=max_length)
        fixed_size_sentences = fixed_size_sentences + chunks
    return fixed_size_sentences

def chunk_sentences_to_max_tokens(self, sentences: list, max_tokens: int = 768):
    fixed_size_sentences = []
    for sentence in sentences:
        tokens = sentence.lower().split()
        if len(tokens) > max_tokens:
            chunks = TinyVectorStore.sentence_word_splitter(
                num_of_words=max_tokens, sentence=sentence)
            fixed_size_sentences = fixed_size_sentences + chunks
        else:
            fixed_size_sentences.append(sentence)
    return fixed_size_sentences
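For example, sentence_word_splitter breaks a sentence into fixed-size word chunks:

chunks = TinyVectorStore.sentence_word_splitter(
    num_of_words=5, sentence="one two three four five six seven")
# -> ["one two three four five", "six seven"]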

Alright, time to run some tests:

def test_tiny_vector_usage():
    max_tokens_count = 768
    tinyvc = tinyvstore.TinyVectorStore(name="short_term_logs",
                                        max_tokens=max_tokens_count,
                                        persist_data=True,
                                        data_storage_path=None)

    print("\n==BUILD CORPUS==\n")
    corpus = [
        "I love eating mango, mango is my favorite fruit. so, buy lots of mango today because we need to eat lots of healthy food.",
        "mango is my favorite fruit",
        "mango, apple, oranges are fruits",
        "mango, mango, mango, mango fruit",
        "fruits are good for health",
    ]
    corpus = tinyvc.chunk_sentences_to_max_tokens(
        sentences=corpus, max_tokens=max_tokens_count)
    corpus_vectors = tinyvc.encode(corpus)
    tinyvc.insert_vectors(corpus_vectors)
    print("\n==Similarity Search==\n")
    query = "mango fruit"
    result = tinyvc.similarity_search(query)
    print(result[0])
    print("\n==Persistence to Disk==\n")
    tinyvc.save_to_disk()


def test_tiny_vector_load_persist_data():
    max_tokens_count = 768
    tinyvc = tinyvstore.TinyVectorStore.get_or_create_vectorstore(
        name="short_term_logs",
        storage_path=None,
        max_tokens=max_tokens_count)
    query = "mango fruit"
    result = tinyvc.similarity_search(query)
    print(result)


if __name__ == "__main__":
    test_tiny_vector_usage()
    test_tiny_vector_load_persist_data()

The search returned the most similar text from the sample corpus stored earlier, a close match for the "mango fruit" query.

Github Repo: https://github.com/PavAI-Research/tiny-vector-store.git

I hope you find this interesting and useful!

Have a nice day!
