Creating Vector Embeddings for Textual Data Using BERT From Scratch

Introduction

In today’s AI world, you need to store contextual data in a way your agent can retrieve and understand.

Just as a computer only understands 0s and 1s, language models do not understand text directly the way humans do. Instead, they convert text into numerical representations and predict patterns based on probabilities learned during training.

Each dimension of such a representation captures a different aspect of meaning, and the values depend on the surrounding context.

The input provided to the model acts as context, which plays a significant role in predicting the next token. We store this context in a numerical form understandable by the model, called vector embeddings.

In this article, we’ll see how to convert textual data into vector embeddings.

Real-World Applications of Embeddings

What are vector embeddings?

The data or context provided is preprocessed and split into smaller units called tokens, each of which is mapped to a numerical ID understood by the model.

These tokens are then converted into dense numerical vectors that capture their semantic meaning. These dense vectors are called vector embeddings.

In short, tokens are numerical representations of text chunks understood by the model. Embeddings are dense numerical vectors generated by the model that capture semantic meaning.

Word    Token ID    Embedding
cat     4937        [0.12, -0.91, …]
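To make the word → token ID → embedding chain concrete, here is a minimal sketch using a toy vocabulary and a toy embedding table. The IDs and vector values are made up for illustration and are not BERT's real ones. BERT's own embeddings are also passed through Transformer layers afterwards, which is what makes them contextual; this sketch only shows the lookup step.

```python
# Toy vocabulary and embedding table. All IDs and values here are
# hypothetical and chosen only to illustrate the two lookup steps.
vocab = {"[UNK]": 0, "cat": 1, "dog": 2, "sat": 3}

embedding_table = [
    [0.00, 0.00, 0.00],    # [UNK]
    [0.12, -0.91, 0.33],   # cat
    [0.10, -0.85, 0.40],   # dog
    [-0.54, 0.22, 0.07],   # sat
]


def tokens_to_ids(words):
    """Map each word to its ID, falling back to [UNK] for unknown words."""
    return [vocab.get(word, vocab["[UNK]"]) for word in words]


def ids_to_vectors(ids):
    """Look up the dense vector stored for each token ID."""
    return [embedding_table[i] for i in ids]


token_ids = tokens_to_ids(["cat", "sat", "unicorn"])
vectors = ids_to_vectors(token_ids)

print(token_ids)
print(vectors[0])
```

Words missing from the vocabulary fall back to [UNK] here; BERT avoids most of these fallbacks by splitting unknown words into smaller known pieces.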

For textual data to be stored in a vector database, we preprocess it, tokenize it, and generate embeddings with BERT. The whole process looks like this:

Text → Preprocessing → Tokenization → BERT → Embeddings → Vector DB

Let’s jump into code.

Unlike traditional NLP pipelines, transformer models like BERT usually require minimal preprocessing because the tokenizer already handles much of the text normalization internally. We are performing preprocessing here only to understand the concepts.

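Preprocessing involves converting the text to lower case, tokenizing it, removing stopwords and punctuation, and lemmatizing the remaining tokens.

We'll use the nltk library's:

stopwords for filtering stopwords,

word_tokenize for tokenizing text,

WordNetLemmatizer for lemmatizing.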

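pip install nltk
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time download of the NLTK data used below.
# Recent NLTK releases need "punkt_tab" for word_tokenize; older ones use "punkt".
for resource in ("stopwords", "wordnet", "punkt", "punkt_tab"):
    nltk.download(resource)

Dataset = ["This text is example for generating vector embeddings for text.",
           "Basics of GenAI engineer."]
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()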

We need to preprocess each data point in the dataset: convert the text to lower case, tokenize it, remove stopwords and punctuation, and lemmatize the remaining tokens.

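def preprocess_text(data):
    # Lowercase the text, then split it into word tokens
    tokens = word_tokenize(data.lower())
    # Drop stopwords and punctuation, then lemmatize what is left
    tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words and token not in string.punctuation]
    return ' '.join(tokens)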

preprocessed_text = [preprocess_text(data) for data in Dataset]

print(preprocessed_text)
## prints: ['text example generating vector embeddings text', 'basic genai engineer']

Now, we have filtered out the noise and done the preprocessing. Next, we have to generate the embeddings using the BERT model.

About BERT

BERT (Bidirectional Encoder Representations from Transformers) uses the Transformer architecture to generate contextual embeddings. It is pre-trained on a large corpus of textual data and can be fine-tuned for different purposes such as text classification.

Since our dataset has been preprocessed and the text has been standardised, we now convert each datapoint into tokens using BertTokenizer, and then pass those tokens into BertModel to generate embeddings.

We also import the PyTorch library, torch. BERT itself comes from the transformers library; PyTorch is the deep learning framework it runs on. PyTorch's core object is the tensor: a multidimensional numerical structure used in deep learning frameworks to store and process data.

pip install transformers torch
from transformers import BertTokenizer
from transformers import BertModel
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()  # inference mode (disables dropout)

def generate_embeddings(preprocessed_dataset):
    # Tokenize the whole batch; shorter sequences are padded to the longest one
    inputs = tokenizer(preprocessed_dataset, padding=True, return_tensors="pt")
    for ids in inputs["input_ids"]:
        print("input tokens:\n", tokenizer.convert_ids_to_tokens(ids), "\n")
    print("inputs: \n", inputs, "\n")

    # Gradients are not needed when we only want the embeddings
    with torch.no_grad():
        outputs = model(**inputs)
    print("shape: ", outputs.last_hidden_state.shape, "\n")
    print("last hidden state: \n", outputs.last_hidden_state, "\n")
    return inputs, outputs

inputs, outputs = generate_embeddings(preprocessed_text)

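Outputs:

We have 3 fields in inputs: input_ids (the vocabulary ID of each token), token_type_ids (the segment each token belongs to, all 0 here because every input is a single sentence), and attention_mask (1 for real tokens, 0 for [PAD] tokens).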

input tokens:
 ['[CLS]', 'text', 'example', 'generating', 'vector', 'em', '##bed', '##ding', '##s', 'text', '[SEP]']

input tokens:
 ['[CLS]', 'basic', 'gen', '##ai', 'engineer', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
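The ## pieces in the tokens above come from BERT's WordPiece tokenizer, which splits a word it does not know as a whole into known subword pieces using a greedy longest-match-first search. The sketch below illustrates that idea on a tiny hypothetical vocabulary; it is not the real BertTokenizer implementation, and toy_vocab contains only the pieces needed for this example.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word is unknown
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical mini vocabulary, just large enough for the words below
toy_vocab = {"em", "##bed", "##ding", "##s", "text", "gen", "##ai"}

print(wordpiece_tokenize("embeddings", toy_vocab))
print(wordpiece_tokenize("genai", toy_vocab))
print(wordpiece_tokenize("text", toy_vocab))
```

With this toy vocabulary the splits match the ones BERT printed above for embeddings and genai, while a word that is in the vocabulary as a whole, such as text, stays in one piece.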

inputs:
 {'input_ids': tensor([[  101,  3793,  2742, 11717,  9207,  7861,  8270,  4667,  2015,  3793,
           102],
        [  101,  3937,  8991,  4886,  3992,   102,     0,     0,     0,     0,
             0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]])}
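To see where the zeros in input_ids and attention_mask come from, here is a minimal pure-Python sketch of what padding=True does conceptually. The helper pad_batch is a hypothetical name written for this illustration, not part of the transformers API; the token IDs are the ones printed above, and 0 is the ID of [PAD] in that output.

```python
PAD_ID = 0  # ID of [PAD] in the tokenizer output above


def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest one and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


batch = [
    [101, 3793, 2742, 11717, 9207, 7861, 8270, 4667, 2015, 3793, 102],
    [101, 3937, 8991, 4886, 3992, 102],
]

input_ids, attention_mask = pad_batch(batch)
print(input_ids[1])
print(attention_mask[1])
```

The second sequence has 6 real tokens, so it receives 5 padding IDs and a mask of six 1s followed by five 0s, the same pattern shown in the tokenizer output.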

If you notice, the first token in every sequence is [CLS]. [CLS] is commonly used as a sentence-level representation for classification tasks and is trained to act as a summary representation of the entire sequence.

Many semantic search systems instead use mean pooling across all tokens because [CLS] embeddings alone are not always optimal.

outputs.last_hidden_state contains an embedding for every token in every sentence, so outputs.last_hidden_state.shape has 3 dimensions.

Shape:

  1. batch size: 2, the number of sentences.
  2. sequence length: 11, the number of tokens per sentence after padding.
  3. hidden size: 768, the number of features computed for each token.

shape:  torch.Size([2, 11, 768])  # (batch size, sequence length, hidden size)

last hidden state :
 tensor([[[-5.8100e-01, -3.2979e-01, -4.7724e-01,  ..., -9.0522e-01,
          -3.2825e-01,  7.4738e-01],
         [-6.2218e-04,  1.7839e-01, -8.3888e-02,  ..., -2.1917e-01,
           1.1380e-01,  3.1209e-01],
         [-6.5069e-01,  3.9529e-02, -2.3673e-01,  ..., -9.8339e-01,
          -2.6105e-01, -3.0963e-03],
         ...,
         [-6.9573e-01, -1.2655e-01, -2.8824e-01,  ..., -8.6103e-01,
          -9.8163e-01,  2.0502e-01],
         [-5.0348e-01,  1.9143e-01,  5.4120e-01,  ..., -8.7059e-01,
          -2.9835e-01,  8.0437e-02],
         [ 8.4091e-01, -1.1874e-01, -6.0505e-01,  ...,  2.3056e-01,
          -6.3892e-01, -2.1199e-01]],

        [[-6.8193e-01,  7.5097e-02, -1.2865e-01,  ..., -2.7600e-01,
           5.3131e-01,  2.4605e-01],
         [ 1.0676e-01,  5.1052e-01,  1.1428e-01,  ..., -3.8723e-01,
           1.6984e-01, -7.4429e-01],
         [-8.1379e-01, -1.2888e+00,  3.9460e-01,  ..., -4.8187e-01,
           1.5182e-01,  6.2495e-01],
         ...,
         [-2.1021e-02,  1.1814e-01,  9.9882e-02,  ..., -1.8074e-01,
          -1.6051e-01,  1.3720e-01],
         [-1.9825e-01, -2.6886e-02,  5.5269e-02,  ..., -5.3649e-02,
          -5.2529e-02,  1.9476e-01],
         [-1.5301e-01,  4.7827e-02,  1.1477e-01,  ..., -7.1446e-02,
          -1.0219e-01,  1.6900e-01]]])

Since the first token of every sequence is [CLS], you can take its embedding as a representation of the whole sequence.

word_embeddings = outputs.last_hidden_state
CLS_token_embedding = word_embeddings[:, 0, :]  # position 0 of every sequence is [CLS]
print("CLS Token:\n", CLS_token_embedding)
CLS Token:
 tensor([[-0.5810, -0.3298, -0.4772,  ..., -0.9052, -0.3283,  0.7474], # for 1st seq
        [-0.6819,  0.0751, -0.1287,  ..., -0.2760,  0.5313,  0.2460]]) # for 2nd seq

Alternatively, we can also use mean pooling instead of [CLS] token embedding, since in a lot of cases the latter might not be optimal.

In real-world applications, people usually follow this pipeline:

Raw Text → Tokenizer → Token IDs → BERT → Token Embeddings → Mean Pooling → Sentence Embedding

Converting Token Embeddings into Sentence Embeddings

Instead of storing embeddings for every token individually, many applications compute a single sentence embedding by averaging token embeddings across the sequence.

sentence_embeddings = outputs.last_hidden_state.mean(dim=1)
print("sentence_embeddings: \n", sentence_embeddings, "\n")
sentence_embeddings:
 tensor([[-0.2917, -0.0854, -0.1267,  ..., -0.5426, -0.5290,  0.1703],
        [-0.1471, -0.0200, -0.0204,  ..., -0.1764, -0.0367, -0.0447]])
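One caveat with a plain mean over the sequence dimension: when a batch is padded, the [PAD] positions are averaged in as well, which affects the shorter sentences. A common alternative is to weight the average by the attention mask. The sketch below shows that idea on a tiny synthetic batch with made-up numbers; masked_mean_pool is a hypothetical helper name, not a transformers function.

```python
import torch


def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings over real tokens only, ignoring padding."""
    # (batch, seq_len) -> (batch, seq_len, 1) so it broadcasts over the hidden size
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)  # padded positions add 0
    counts = mask.sum(dim=1).clamp(min=1e-9)        # real tokens per sequence
    return summed / counts


# Synthetic batch: 2 sequences, 3 positions, hidden size 2 (made-up values).
# The last position of the second sequence plays the role of padding.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
                       [[2.0, 2.0], [4.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 1],
                     [1, 1, 0]])

pooled = masked_mean_pool(hidden, mask)
plain = hidden.mean(dim=1)

print("masked mean:\n", pooled)
print("plain mean:\n", plain)
```

For the article's batch, and assuming the tokenizer output inputs and the model output outputs are both available, the call would be masked_mean_pool(outputs.last_hidden_state, inputs["attention_mask"]). The first sentence has no padding, so its result is the same as the plain mean; only the second sentence changes.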

In production systems, developers often use libraries like SentenceTransformers because raw BERT embeddings are not optimised for semantic similarity tasks.

We can use models such as all-MiniLM-L6-v2, bge-small-en, e5-small.
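Whichever model produces the sentence embeddings, semantic similarity between two of them is commonly measured with cosine similarity, and a vector database essentially returns the stored vectors closest to a query vector. The sketch below is a minimal pure-Python illustration with made-up 3-dimensional vectors; most_similar and the doc_a, doc_b, doc_c keys are hypothetical names, and a real vector database uses far more efficient indexing.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector; treat as 0
    return dot / (norm_a * norm_b)


def most_similar(query, stored, top_k=1):
    """Return the top_k (score, key) pairs, highest similarity first."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in stored.items()]
    scored.sort(reverse=True)
    return scored[:top_k]


# Made-up embeddings standing in for stored sentence embeddings
stored = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.7, 0.7, 0.0],
}
query = [0.9, 0.1, 0.0]

results = most_similar(query, stored, top_k=2)
for score, key in results:
    print(key, round(score, 3))
```

Here the query points mostly along the first axis, so doc_a ranks first and doc_c second. With real sentence embeddings the vectors would have hundreds of dimensions, but the comparison works the same way.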

Summary

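In this article we learned how to preprocess text, tokenize it with BertTokenizer, generate token embeddings with BertModel, and turn them into a sentence embedding using either the [CLS] token or mean pooling.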

🚀 That’s all — you have generated the vector embeddings for a text sequence using the BERT model!

P.S. I’m new to exploring AI and its internal workings, exploring the concepts of how we can incorporate it in our workflow to make it more efficient. Feedback is very much appreciated!

Thanks,

Saksham.

