Creating Vector Embeddings for Textual Data Using BERT From Scratch

Introduction

In today’s AI world, you need to store contextual data in a way your agent can retrieve and understand.

Just as a computer works with 0s and 1s, language models do not read text the way humans do. Instead, they convert text into numerical representations and predict patterns based on probabilities learned during training.

Each piece of text is represented as a list of numbers, and in a contextual model like BERT the numbers assigned to a word depend on the words around it.

The input provided to the model acts as context, which plays a significant role in predicting the next token. We store this context in a numerical form understandable by the model, called vector embeddings.

In this article, we’ll see how to convert textual data into vector embeddings.

Real-World Applications of Embeddings

Semantic search
RAG systems
Recommendation systems
Chatbots
Clustering
Duplicate detection

What are vector embeddings?

The input text is first split into smaller units called tokens (words or pieces of words), and each token is mapped to an integer ID from the model's vocabulary.

The model then converts these token IDs into dense numerical vectors that capture their semantic meaning. These dense vectors are called vector embeddings.

In short, token IDs are integers that identify text chunks in the model's vocabulary. Embeddings are dense numerical vectors generated by the model that capture semantic meaning.

Word	Token ID	Embedding
cat	4937	[0.12, -0.91, …]
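
To make the difference between a token ID and an embedding concrete, here is a toy sketch in plain Python. The vocabulary, the IDs, and the 3-value vectors are all made up for illustration (real BERT vectors have 768 values), and a static lookup like this ignores context, which BERT does not.

```python
# Toy illustration only: the vocabulary, IDs, and vectors are invented.
vocab = {"[UNK]": 0, "cat": 1, "dog": 2, "sat": 3}

# One row per token ID; 3 values per row to keep it readable.
embedding_table = [
    [0.00, 0.00, 0.00],   # [UNK]
    [0.12, -0.91, 0.33],  # cat
    [0.10, -0.85, 0.40],  # dog
    [-0.70, 0.20, 0.05],  # sat
]


def text_to_ids(text):
    """Map whitespace-separated words to token IDs, falling back to [UNK]."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]


def ids_to_vectors(token_ids):
    """Look up the embedding row for each token ID."""
    return [embedding_table[token_id] for token_id in token_ids]


token_ids = text_to_ids("Cat sat")
vectors = ids_to_vectors(token_ids)
print(token_ids)  # [1, 3]
print(vectors)    # [[0.12, -0.91, 0.33], [-0.7, 0.2, 0.05]]
```

The token ID is just an index; the vector stored at that index is what carries the meaning.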

Before textual data can be stored in a vector database, we perform:

Preprocessing, to filter out irrelevant words and punctuation,
Tokenization with the BERT tokenizer, to convert text to tokens,
Embedding with the BERT model, to convert tokens to vectors.

The whole process looks like this:

Text → Preprocessing → Tokenization → BERT → Embeddings → Vector DB
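
The last stage of this pipeline, the vector DB, is not covered in the rest of the article, so here is a rough idea of what it does with the embeddings. This is a minimal sketch in plain Python: TinyVectorStore is a hypothetical in-memory stand-in rather than a real database, and the 3-value vectors are invented.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class TinyVectorStore:
    """Hypothetical in-memory stand-in for a vector database."""

    def __init__(self):
        self.items = []  # list of (text, vector) pairs

    def add(self, text, vector):
        self.items.append((text, vector))

    def search(self, query_vector, top_k=1):
        """Return the top_k stored texts most similar to the query vector."""
        scored = [
            (cosine_similarity(query_vector, vector), text)
            for text, vector in self.items
        ]
        scored.sort(reverse=True)
        return [text for _, text in scored[:top_k]]


store = TinyVectorStore()
store.add("cats and kittens", [0.9, 0.1, 0.0])
store.add("stock market news", [0.0, 0.2, 0.9])

result = store.search([0.8, 0.2, 0.1], top_k=1)
print(result)  # ['cats and kittens']
```

A real vector database does the same thing at scale, usually with an approximate nearest-neighbour index instead of comparing against every stored vector.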

Let’s jump into code.

Unlike traditional NLP pipelines, transformer models like BERT usually require minimal preprocessing because the tokenizer already handles much of the text normalization internally. We are performing preprocessing here only to understand the concepts.

Preprocessing involves:

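Removing stop words
Removing punctuation
Case standardisation
Lemmatisation: converting words to their base form, e.g. “running” becomes “run”. Note that NLTK's WordNetLemmatizer treats every word as a noun unless a part-of-speech tag is passed, so lemmatize("running") returns "running" while lemmatize("running", pos="v") returns "run".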

We’ll use the following from the nltk library:

stopwords for filtering stop words,
word_tokenize for tokenizing text,
WordNetLemmatizer for lemmatising.

Install the nltk library

pip install nltk

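Import the relevant modules and download the NLTK data they rely on (needed once per environment)

import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Data used below; newer NLTK releases look for punkt_tab, older ones for punkt
for resource in ("stopwords", "wordnet", "punkt", "punkt_tab"):
    nltk.download(resource)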

Text to be embedded:

Dataset = ["This text is example for generating vector embeddings for text.",
           "Basics of GenAI engineer."]

Get the stopwords, and initialize the lemmatizer.

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

We preprocess each data point in the dataset: convert the text to lower case, tokenise it, remove stop words and punctuation, and lemmatise the remaining tokens.

def preprocess_text(data):
    tokens = word_tokenize(data.lower())
    tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words and token not in string.punctuation]
    return ' '.join(tokens)

preprocessed_text = [preprocess_text(data) for data in Dataset]

print(preprocessed_text)
## prints: ['text example generating vector embeddings text', 'basic genai engineer']

Now, we have filtered out the noise and done the preprocessing. Next, we have to generate the embeddings using the BERT model.

About BERT

BERT (Bidirectional Encoder Representations from Transformers) uses the Transformer architecture to generate contextual embeddings. It is pre-trained on a large corpus of textual data and can be fine-tuned for different purposes such as text classification.

Since our dataset has been preprocessed and the text has been standardised, we now convert each datapoint into tokens using BertTokenizer, and then pass those tokens into BertModel to generate embeddings.

We also need the PyTorch library, torch.

The transformers library provides the BERT tokenizer and model, and PyTorch is the deep learning framework that runs the model underneath. PyTorch's core object is the tensor, a multidimensional numerical array used to store and process data.

Install the transformers and torch libraries.

pip install transformers torch

Import relevant modules

from transformers import BertTokenizer
from transformers import BertModel
import torch

Create the tokenizer and the BERT model from the pretrained bert-base-uncased checkpoint.

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

Now, we generate embeddings:

First we tokenize the dataset we preprocessed in the previous step using the BertTokenizer, which returns PyTorch tensors because we pass return_tensors="pt".

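The sentences in a batch usually have different lengths, but a tensor must be rectangular. We therefore pass padding=True so that shorter sequences are padded with [PAD] tokens up to the length of the longest one. truncation=True can be added as well to cut sequences that exceed the model's maximum length (512 tokens for bert-base-uncased), but on its own it does not make shorter sequences equal in length.

If the data points are not of the same length and padding is not enabled, the tokenizer throws the following error: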
```
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`input_ids` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
```
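
To see what padding does to a batch, here is a hand-written sketch in plain Python. It illustrates the idea behind padding=True and is not the tokenizer's actual implementation; pad_batch is a hypothetical helper, and the IDs are borrowed from the output shown later in this article (0 being the [PAD] ID).

```python
PAD_ID = 0  # ID that appears at the padded positions in the output further down


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Two token-ID sequences of different lengths
batch = [[101, 3793, 2742, 102], [101, 3937, 102]]
input_ids, attention_mask = pad_batch(batch)

print(input_ids)       # [[101, 3793, 2742, 102], [101, 3937, 102, 0]]
print(attention_mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

After padding, every row has the same length, so the batch can be stored as one rectangular tensor, and the attention mask tells the model which positions to ignore.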
Then we pass these tensors into the model, which runs them through its neural network layers and returns the embeddings.

During training, PyTorch records intermediate results so that it can compute gradients and update the weights. We are only running inference here, so we don’t need gradients. Wrapping the call in torch.no_grad() turns this tracking off, which saves memory and speeds things up.
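
The effect of torch.no_grad() can be checked on a much smaller model. The sketch below uses a single torch.nn.Linear layer as a stand-in for BERT (an assumption made only to keep the example fast); the behaviour of the context manager is the same.

```python
import torch

layer = torch.nn.Linear(4, 2)  # small stand-in for a model; not BERT
x = torch.randn(3, 4)

# Normal forward pass: autograd records the operation for a later backward pass
y_tracked = layer(x)

# Inference-only forward pass: same computation, nothing is recorded
with torch.no_grad():
    y_untracked = layer(x)

print(y_tracked.requires_grad)    # True
print(y_untracked.requires_grad)  # False
print(torch.allclose(y_tracked, y_untracked))  # True
```

The values are identical in both cases; the only difference is whether PyTorch keeps the bookkeeping needed for gradients.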

The function below puts these steps together.

def generate_embeddings(preprocessed_dataset):
    # Tokenize the whole batch; padding makes every row the same length
    inputs = tokenizer(preprocessed_dataset, padding=True, return_tensors="pt")
    for ids in inputs["input_ids"]:
        print("input tokens:\n", tokenizer.convert_ids_to_tokens(ids), "\n")
    print("inputs: \n", inputs, "\n")

    # Inference only, so gradient tracking is switched off
    with torch.no_grad():
        outputs = model(**inputs)

    print("shape: ", outputs.last_hidden_state.shape, "\n")
    print("last hidden state: \n", outputs.last_hidden_state, "\n")
    return inputs, outputs

inputs, outputs = generate_embeddings(preprocessed_text)

Outputs:

The tokenizer output, inputs, has 3 fields:

input_ids: the vocabulary ID of each token in the sequence, e.g. 101 for [CLS], 102 for [SEP] and 0 for [PAD].
token_type_ids: which text segment each token belongs to. Here every data point contains a single segment, so all values are 0. When a data point holds a pair of texts, the tokens of the second text get 1.
attention_mask: 1 for a real token, 0 for a padding token ([PAD]).
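
Since our dataset never produces a 1 in token_type_ids, here is a small sketch of how a pair of texts would be laid out. It assumes the usual BERT convention of [CLS] A [SEP] B [SEP]; build_pair_inputs is a hypothetical helper written for this article, and the word lists are invented.

```python
def build_pair_inputs(tokens_a, tokens_b):
    """Lay out two token lists as [CLS] A [SEP] B [SEP] with segment IDs."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], text A and the first [SEP];
    # segment 1 covers text B and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids


tokens, token_type_ids = build_pair_inputs(
    ["what", "is", "bert"],
    ["a", "language", "model"],
)
print(tokens)
# ['[CLS]', 'what', 'is', 'bert', '[SEP]', 'a', 'language', 'model', '[SEP]']
print(token_type_ids)
# [0, 0, 0, 0, 0, 1, 1, 1, 1]
```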

input tokens:
 ['[CLS]', 'text', 'example', 'generating', 'vector', 'em', '##bed', '##ding', '##s', 'text', '[SEP]']

input tokens:
 ['[CLS]', 'basic', 'gen', '##ai', 'engineer', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']

inputs:
 {'input_ids': tensor([[  101,  3793,  2742, 11717,  9207,  7861,  8270,  4667,  2015,  3793,
           102],
        [  101,  3937,  8991,  4886,  3992,   102,     0,     0,     0,     0,
             0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]])}
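
The pieces starting with ## in the token lists above come from sub-word tokenization: a word that is not in the vocabulary as a whole is split into smaller pieces that are. The sketch below shows the greedy longest-match idea behind this in plain Python. It is a simplified illustration with a tiny made-up vocabulary, not the tokenizer's real code.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Tiny invented vocabulary; the real one has about 30,000 entries
toy_vocab = {"em", "##bed", "##ding", "##s", "vector", "gen", "##ai"}

print(wordpiece("embeddings", toy_vocab))  # ['em', '##bed', '##ding', '##s']
print(wordpiece("genai", toy_vocab))       # ['gen', '##ai']
print(wordpiece("vector", toy_vocab))      # ['vector']
```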

Notice that the first token in every sequence is [CLS]. Its final hidden state is commonly used as a sentence-level representation for classification tasks. BERT's pre-training uses it for next-sentence prediction, so it becomes a strong summary of the whole sequence mainly after fine-tuning.

Many semantic search systems instead use mean pooling across all tokens because [CLS] embeddings alone are not always optimal.

outputs.last_hidden_state contains one embedding for every token of every sentence, so its shape has 3 dimensions.

Shape:

batch size: 2, the number of sentences.
sequence length: 11, the number of tokens per sentence after padding.
hidden size: 768, the number of features computed for each token.

shape:  torch.Size([2, 11, 768])  # (batch size, seq length, hidden states)
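
To get a feel for these three dimensions, the sketch below indexes into a tensor of the same shape. It uses random numbers as a stand-in for the real model output, so only the shapes are meaningful, not the values.

```python
import torch

# Random stand-in with the same shape as outputs.last_hidden_state
hidden = torch.randn(2, 11, 768)

batch_size, seq_len, hidden_size = hidden.shape

first_sentence = hidden[0]     # all 11 token vectors of sentence 0
one_token = hidden[1, 3]       # vector of the token at position 3 in sentence 1
cls_vectors = hidden[:, 0, :]  # position 0 ([CLS]) of every sentence

print(first_sentence.shape)  # torch.Size([11, 768])
print(one_token.shape)       # torch.Size([768])
print(cls_vectors.shape)     # torch.Size([2, 768])
```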

last hidden state :
 tensor([[[-5.8100e-01, -3.2979e-01, -4.7724e-01,  ..., -9.0522e-01,
          -3.2825e-01,  7.4738e-01],
         [-6.2218e-04,  1.7839e-01, -8.3888e-02,  ..., -2.1917e-01,
           1.1380e-01,  3.1209e-01],
         [-6.5069e-01,  3.9529e-02, -2.3673e-01,  ..., -9.8339e-01,
          -2.6105e-01, -3.0963e-03],
         ...,
         [-6.9573e-01, -1.2655e-01, -2.8824e-01,  ..., -8.6103e-01,
          -9.8163e-01,  2.0502e-01],
         [-5.0348e-01,  1.9143e-01,  5.4120e-01,  ..., -8.7059e-01,
          -2.9835e-01,  8.0437e-02],
         [ 8.4091e-01, -1.1874e-01, -6.0505e-01,  ...,  2.3056e-01,
          -6.3892e-01, -2.1199e-01]],

        [[-6.8193e-01,  7.5097e-02, -1.2865e-01,  ..., -2.7600e-01,
           5.3131e-01,  2.4605e-01],
         [ 1.0676e-01,  5.1052e-01,  1.1428e-01,  ..., -3.8723e-01,
           1.6984e-01, -7.4429e-01],
         [-8.1379e-01, -1.2888e+00,  3.9460e-01,  ..., -4.8187e-01,
           1.5182e-01,  6.2495e-01],
         ...,
         [-2.1021e-02,  1.1814e-01,  9.9882e-02,  ..., -1.8074e-01,
          -1.6051e-01,  1.3720e-01],
         [-1.9825e-01, -2.6886e-02,  5.5269e-02,  ..., -5.3649e-02,
          -5.2529e-02,  1.9476e-01],
         [-1.5301e-01,  4.7827e-02,  1.1477e-01,  ..., -7.1446e-02,
          -1.0219e-01,  1.6900e-01]]])

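Since the 1st token of each sequence is [CLS], its embedding can be treated as the embedding of the whole sequence.

# outputs is the model output produced in generate_embeddings
word_embeddings = outputs.last_hidden_state
# Take position 0 ([CLS]) of every sequence in the batch
CLS_token_embedding = word_embeddings[:, 0, :]
print("CLS Token:\n", CLS_token_embedding)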

CLS Token:
 tensor([[-0.5810, -0.3298, -0.4772,  ..., -0.9052, -0.3283,  0.7474], # for 1st seq
        [-0.6819,  0.0751, -0.1287,  ..., -0.2760,  0.5313,  0.2460]]) # for 2nd seq

Alternatively, we can use mean pooling instead of the [CLS] token embedding, since the latter is often not the best choice for similarity tasks.

In real-world applications, people usually:

mean-pool token embeddings
use sentence transformers
normalize embeddings
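
Normalizing means rescaling each embedding to length 1, so that a dot product between two embeddings equals their cosine similarity. Here is a minimal sketch with torch.nn.functional.normalize on two made-up 2-value vectors, chosen so the result is easy to check by hand.

```python
import torch
import torch.nn.functional as F

# Invented vectors: [3, 4] has length 5, [1, 0] has length 1
embeddings = torch.tensor([[3.0, 4.0], [1.0, 0.0]])

# L2-normalize each row so that it has length 1
normalized = F.normalize(embeddings, p=2, dim=1)
print(normalized)  # first row becomes [0.6, 0.8], second stays [1.0, 0.0]

# For unit-length rows, the dot product is the cosine similarity
similarity = normalized @ normalized.T
print(similarity[0, 1].item())  # approximately 0.6
```

Many vector databases expect normalized vectors, or normalize them internally, when cosine similarity is the chosen metric.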

Raw Text
   ↓
Tokenizer
   ↓
Token IDs
   ↓
BERT
   ↓
Token Embeddings
   ↓
Mean Pooling
   ↓
Sentence Embedding

Converting Token Embeddings into Sentence Embeddings

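Instead of storing embeddings for every token individually, many applications compute a single sentence embedding by averaging token embeddings across the sequence. The simplest version is a plain mean over the sequence dimension. Note that this also averages in the [PAD] positions of shorter sentences, so for padded batches it is more accurate to weight the tokens by the attention mask.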

sentence_embeddings = outputs.last_hidden_state.mean(dim=1)
print("sentence_embeddings: \n", sentence_embeddings, "\n")

sentence_embeddings:
 tensor([[-0.2917, -0.0854, -0.1267,  ..., -0.5426, -0.5290,  0.1703],
        [-0.1471, -0.0200, -0.0204,  ..., -0.1764, -0.0367, -0.0447]])
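
Because the second sentence in our batch contains five [PAD] tokens, a plain mean mixes their vectors into the result. A common alternative is to use the attention mask so that only real tokens are averaged. The sketch below shows the idea on a tiny invented tensor; masked_mean_pool is a hypothetical helper name, not part of transformers.

```python
import torch


def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sentence, ignoring positions where mask is 0."""
    # (batch, seq) -> (batch, seq, 1) so the mask broadcasts over features
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)  # guards against division by zero
    return summed / counts


# One sentence, three positions, two features; the last position is padding
hidden = torch.tensor([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])

plain = hidden.mean(dim=1)
pooled = masked_mean_pool(hidden, mask)

print(plain)   # tensor([[34.6667, 34.6667]])  padding leaks into the mean
print(pooled)  # tensor([[2., 2.]])            only the two real tokens count
```

With the real model, the same function would be called with outputs.last_hidden_state and the attention_mask returned by the tokenizer.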

In production systems, developers often use libraries like SentenceTransformers because raw BERT embeddings are not optimised for semantic similarity tasks.

We can use models such as all-MiniLM-L6-v2, bge-small-en, e5-small.

Summary

In this article we learned:

how text becomes tokens,
how BERT processes tokens,
how embeddings are generated,
and how sentence embeddings are extracted.

🚀 That’s all — you have generated the vector embeddings for a text sequence using the BERT model!

P.S. I’m new to exploring AI and its internal workings, exploring the concepts of how we can incorporate it in our workflow to make it more efficient. Feedback is very much appreciated!

Thanks,

Saksham.