src.core.word2vec package#

Submodules#

src.core.word2vec.dataset module#

class src.core.word2vec.dataset.Word2VecDataset(config_dict)[source]#

Bases: object

Word2Vec Dataset

Parameters:

config_dict (dict) – Config Params Dictionary

get_vocab()[source]#

Generates the vocabulary from the preprocessed text

make_pairs()[source]#

Creates left and right contexts and labels using the Huffman binary tree

Returns:

Left context, right context, left label, right label

Return type:

tuple (list, list, list, list)

preprocess()[source]#

Preprocesses the extracted data

src.core.word2vec.dataset.create_dataloader(left_cxt, right_cxt, left_lbl, right_lbl, val_split=0.2, batch_size=32, seed=2024)[source]#

Creates train and validation DataLoaders for the left and right context/label pairs

Parameters:
  • left_cxt (list) – Left context

  • right_cxt (list) – Right context

  • left_lbl (list) – Left label

  • right_lbl (list) – Right label

  • val_split (float) – validation split, defaults to 0.2

  • batch_size (int) – Batch size, defaults to 32

  • seed (int, optional) – Seed, defaults to 2024

Returns:

Train and validation DataLoaders for the left and right pairs

Return type:

tuple (torch.utils.data.DataLoader, torch.utils.data.DataLoader, torch.utils.data.DataLoader, torch.utils.data.DataLoader)
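The role of val_split and seed can be illustrated without PyTorch. The following is a minimal sketch of a seeded train/validation split only, not the actual create_dataloader implementation; the helper name split_pairs and the sample data are hypothetical, and wrapping the results in torch.utils.data.DataLoader is left out.

```python
import random


def split_pairs(cxt, lbl, val_split=0.2, seed=2024):
    """Shuffle (context, label) pairs with a fixed seed, then cut off a validation slice."""
    assert len(cxt) == len(lbl), "each context needs exactly one label"
    indices = list(range(len(cxt)))
    random.Random(seed).shuffle(indices)  # local RNG, global random state untouched
    n_val = int(len(indices) * val_split)
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    train = [(cxt[i], lbl[i]) for i in train_idx]
    val = [(cxt[i], lbl[i]) for i in val_idx]
    return train, val


left_cxt = [[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]]  # made-up context word ids
left_lbl = [10, 11, 12, 13, 14]                      # made-up label ids
train, val = split_pairs(left_cxt, left_lbl)
```

Because the shuffle uses its own seeded generator, calling split_pairs twice with the same seed yields the same split, which is presumably the purpose of the seed parameter above.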

src.core.word2vec.huffman module#

class src.core.word2vec.huffman.HuffmanBTree(vocab_freq_dict)[source]#

Bases: object

Creates a Huffman binary tree used to perform hierarchical softmax

Parameters:

vocab_freq_dict (dict) – Vocabulary Frequency Dictionary

construct_tree()[source]#

Constructs Huffman Binary Tree

generate_huffman_code(tree, code, path)[source]#

Generates Huffman code for vocabulary

Parameters:
  • tree (Node) – Node object that initialized HuffmanTree

  • code (list) – Binary code for each vocab

  • path (list) – Path for each vocab with respect to Node Id

separate_left_right_path()[source]#

Separates Left and Right paths

class src.core.word2vec.huffman.Node(word_idx, freq, left=None, right=None)[source]#

Bases: object

A class to initialize Node in a Huffman tree

Parameters:
  • word_idx (int) – Word Id

  • freq (int) – Frequency of node

  • left (list, optional) – Left nodes, defaults to None

  • right (list, optional) – Right nodes, defaults to None
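For orientation, Huffman tree construction and code generation can be sketched with the standard library alone. This is an illustrative sketch, not the module's code: the Node class below is a stand-in with the same constructor arguments, build_huffman_tree and huffman_codes are hypothetical names, and numbering inner nodes after the word ids is an assumption.

```python
import heapq
import itertools


class Node:
    """Hypothetical stand-in for a Huffman tree node."""

    def __init__(self, word_idx, freq, left=None, right=None):
        self.word_idx = word_idx  # leaf: word id; inner node: its own id
        self.freq = freq
        self.left = left
        self.right = right


def build_huffman_tree(vocab_freq):
    """Merge the two least frequent nodes until a single root remains."""
    tie = itertools.count()  # tie-breaker so heapq never compares Node objects
    heap = [(freq, next(tie), Node(idx, freq)) for idx, freq in vocab_freq.items()]
    heapq.heapify(heap)
    next_id = len(vocab_freq)  # assumption: inner-node ids continue after word ids
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        parent = Node(next_id, f1 + f2, left, right)
        next_id += 1
        heapq.heappush(heap, (parent.freq, next(tie), parent))
    return heap[0][2]


def huffman_codes(node, code=(), path=(), out=None):
    """Return {word_idx: (binary code, ids of the inner nodes visited)}."""
    out = {} if out is None else out
    if node.left is None and node.right is None:
        out[node.word_idx] = (list(code), list(path))
        return out
    huffman_codes(node.left, code + (0,), path + (node.word_idx,), out)
    huffman_codes(node.right, code + (1,), path + (node.word_idx,), out)
    return out


vocab_freq = {0: 5, 1: 2, 2: 1, 3: 1}  # word id -> count (made-up numbers)
codes = huffman_codes(build_huffman_tree(vocab_freq))
```

Frequent words end up close to the root and therefore get short codes, which keeps the number of binary decisions per training example small.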

src.core.word2vec.model module#

class src.core.word2vec.model.Word2VecModel(config_dict)[source]#

Bases: Module

Word2Vec Model

Parameters:

config_dict (dict) – Config Params Dictionary

compute_cxt_embed(cxt)[source]#

Computes context embedding vector

Parameters:

cxt (torch.Tensor (batch_size, context_len)) – Context vector

Returns:

Context embedding

Return type:

torch.Tensor (batch_size, embed_dim)
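The shape change from (batch_size, context_len) to (batch_size, embed_dim) is consistent with pooling the embeddings of the context words. A minimal framework-free sketch follows, assuming mean pooling (the actual method may sum or pool differently); mean_context_embedding and the toy embedding table are hypothetical.

```python
def mean_context_embedding(cxt, embedding):
    """Map (batch_size, context_len) word ids to (batch_size, embed_dim) vectors."""
    out = []
    for context in cxt:
        vectors = [embedding[word_id] for word_id in context]
        # Average each embedding dimension over the context words.
        out.append([sum(dim) / len(vectors) for dim in zip(*vectors)])
    return out


embedding = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]  # toy table: vocab 3, embed_dim 2
cxt = [[0, 1], [1, 2]]                            # batch_size 2, context_len 2
ctx_embed = mean_context_embedding(cxt, embedding)
```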

forward(l_cxt, r_cxt, l_lbl, r_lbl)[source]#

Forward propagation

Parameters:
  • l_cxt (torch.Tensor (batch_size,)) – Left context

  • r_cxt (torch.Tensor (batch_size,)) – Right context

  • l_lbl (torch.Tensor (batch_size,)) – Left label

  • r_lbl (torch.Tensor (batch_size,)) – Right label

Returns:

Loss

Return type:

torch.float32
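As background for the loss returned here, hierarchical softmax scores a word as a product of binary decisions along its Huffman path. The sketch below shows that general technique for a single word in plain Python; it is not the model's forward method. The names hs_loss and node_vecs, the toy numbers, and the convention that bit 0 means "left" are all assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hs_loss(hidden, code, path, node_vecs):
    """Negative log-likelihood of one word given its Huffman code and path."""
    loss = 0.0
    for bit, node_id in zip(code, path):
        score = sum(h * v for h, v in zip(hidden, node_vecs[node_id]))
        # Assumed convention: bit 0 = go left with probability sigmoid(score),
        # bit 1 = go right with probability sigmoid(-score) = 1 - sigmoid(score).
        prob = sigmoid(score) if bit == 0 else sigmoid(-score)
        loss -= math.log(prob)
    return loss


# Toy tree with 3 words and 2 inner nodes (ids 3 and 4); all numbers are made up.
node_vecs = {3: [0.5, -0.2], 4: [-0.1, 0.3]}
codes = {0: ([1], [4]), 1: ([0, 0], [4, 3]), 2: ([0, 1], [4, 3])}
hidden = [1.0, 2.0]  # stands in for a context embedding
probs = {w: math.exp(-hs_loss(hidden, c, p, node_vecs)) for w, (c, p) in codes.items()}
```

The word probabilities sum to 1 by construction, so no normalization over the full vocabulary is needed; each word costs only as many dot products as its code is long.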

class src.core.word2vec.model.Word2VecTrainer(model, optimizer, config_dict)[source]#

Bases: Module

Word2Vec Trainer

Parameters:
  • model (torch.nn.Module) – Word2Vec model

  • optimizer (torch.optim.Optimizer) – Optimizer

  • config_dict (dict) – Config Params Dictionary

fit(train_loader, val_loader)[source]#

Fits the model on the dataset. Runs the training and validation steps for the given number of epochs and saves the best model according to the evaluation metric

Parameters:
  • train_loader (torch.utils.data.DataLoader) – Train Data loader

  • val_loader (torch.utils.data.DataLoader) – Validation Data Loader

Returns:

Training History

Return type:

dict

train_one_epoch(data_loader, epoch)[source]#

Train step

Parameters:
  • data_loader (torch.utils.data.DataLoader) – Train Data Loader

  • epoch (int) – Epoch number

Returns:

Train Loss

Return type:

torch.float32

val_one_epoch(data_loader)[source]#

Validation step

Parameters:

data_loader (torch.utils.data.DataLoader) – Validation Data Loader

Returns:

Validation Loss

Return type:

torch.float32

src.core.word2vec.word2vec module#

class src.core.word2vec.word2vec.Word2Vec(config_dict)[source]#

Bases: object

A class to run Word2Vec data preprocessing, training and inference

Parameters:

config_dict (dict) – Config Params Dictionary

get_embeddings(sentence)[source]#

Outputs Word embeddings

Parameters:

sentence (str) – Input sentence

Returns:

Word embeddings

Return type:

torch.Tensor (seq_len, embed_dim)

run()[source]#

Runs Word2Vec Training and saves output

save_output()[source]#

Saves Training and Inference results

Module contents#