src.core.bert package#

Submodules#

src.core.bert.bert module#

class src.core.bert.bert.BERT(config_dict)[source]#

Bases: object

A class to run BERT data preprocessing, training and inference

Parameters:

config_dict (dict) – Config Params Dictionary

load_pretrain_weights(pretrain_model, finetune_model)[source]#

Copies pretrained weights into the finetune BERT model object

Parameters:
  • pretrain_model (torch.nn.Module) – Pretrain BERT model

  • finetune_model (torch.nn.Module) – Finetune BERT model

Returns:

Finetune BERT model with Pretrained weights

Return type:

torch.nn.Module
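
A minimal sketch of how such a copy can be done via state_dict, keeping only entries whose names and shapes match in both models; this is an assumption about the mechanism, not the class's exact code:

    def copy_shared_weights(pretrain_model, finetune_model):
        # Keep pretrained entries whose names and shapes also exist
        # in the finetune model (e.g. the shared encoder stack).
        pretrain_state = pretrain_model.state_dict()
        finetune_state = finetune_model.state_dict()
        shared = {
            name: tensor
            for name, tensor in pretrain_state.items()
            if name in finetune_state and tensor.shape == finetune_state[name].shape
        }
        finetune_state.update(shared)
        finetune_model.load_state_dict(finetune_state)
        return finetune_model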

run()[source]#

Runs BERT pretrain and finetune stages and saves output
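
Typical usage of the pipeline class; the config keys shown are illustrative placeholders, since the real schema lives in the project's config file:

    from src.core.bert.bert import BERT

    # Illustrative config only; the actual keys come from the
    # project's config file.
    config_dict = {"seq_len": 128, "batch_size": 32, "num_epochs": 10}

    pipeline = BERT(config_dict)
    pipeline.run()  # pretrain -> finetune -> inference -> save_output()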

run_finetune()[source]#

Finetuning stage of BERT

Returns:

BERT Finetune Trainer and Training History

Return type:

tuple (torch.nn.Module, dict)

run_infer_finetune()[source]#

Runs inference using Finetuned BERT

Returns:

True and predicted start and end ids

Return type:

tuple (numpy.ndarray [num_samples,], numpy.ndarray [num_samples,], numpy.ndarray [num_samples,], numpy.ndarray [num_samples,])

run_pretrain()[source]#

Pretraining stage of BERT

Returns:

BERT Pretrain Trainer and Training History

Return type:

tuple (torch.nn.Module, dict)

save_output()[source]#

Saves Training and Inference results

src.core.bert.dataset_finetune module#

class src.core.bert.dataset_finetune.PreprocessBERTFinetune(config_dict, wordpiece, word2id)[source]#

Bases: object

A class to preprocess BERT Finetuning Data

Parameters:
  • config_dict (dict) – Config Params Dictionary

  • wordpiece (src.preprocess.WordPiece) – WordPiece class

  • word2id (dict) – Words to Ids mapping

batched_ids2tokens(tokens)[source]#

Converts sentences of ids to tokens

Parameters:

tokens (numpy.ndarray) – Tokens Array, 2D array (num_samples, seq_len)

Returns:

List of decoded sentences

Return type:

list
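
Decoding typically has to undo WordPiece's "##" continuation convention as well as drop special tokens; a hedged sketch of one sentence's decode (helper names are illustrative):

    def ids2sentence(row, id2word):
        # Merge "##" continuation pieces back onto the previous word
        # and drop special tokens; id2word inverts the word2id mapping.
        words = []
        for tok in (id2word[i] for i in row):
            if tok.startswith("##") and words:
                words[-1] += tok[2:]
            elif tok not in ("[PAD]", "[CLS]", "[SEP]"):
                words.append(tok)
        return " ".join(words)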

extract_data()[source]#

Extracts data from SQuAD v1 JSON file

Returns:

Finetuning Data (Topic, Context, Question, Answer Start ID, Num words in an answer)

Return type:

pandas.DataFrame
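
The SQuAD v1 JSON layout (articles, each with paragraphs, each with question-answer pairs) is standard; a sketch of the walk that produces those columns, with the file path and column names as assumptions:

    import json
    import pandas as pd

    def extract_squad_v1(path="data/squad_v1.json"):  # path is illustrative
        with open(path) as f:
            articles = json.load(f)["data"]
        rows = []
        for article in articles:
            for para in article["paragraphs"]:
                for qa in para["qas"]:
                    answer = qa["answers"][0]
                    rows.append({
                        "topic": article["title"],
                        "context": para["context"],
                        "question": qa["question"],
                        "answer_start": answer["answer_start"],
                        "num_answer_words": len(answer["text"].split()),
                    })
        return pd.DataFrame(rows)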

get_data()[source]#

Converts extracted data into tokens with start and end ids of answers, along with the topic of each sample

Returns:

Finetuning Data (tokens, start ids, end ids, topics)

Return type:

tuple (numpy.ndarray [num_samples, seq_len], numpy.ndarray [num_samples,], numpy.ndarray [num_samples,], numpy.ndarray [num_samples,])

preprocess_text(text)[source]#

Preprocesses text

Parameters:

text (str) – Raw Input string

Returns:

Preprocessed string

Return type:

str

src.core.bert.dataset_finetune.create_data_loader_finetune(tokens, start_ids, end_ids, topics, val_split=0.2, test_split=0.2, batch_size=32, seed=2024)[source]#

Creates PyTorch DataLoaders for Finetuning data

Parameters:
  • tokens (torch.Tensor) – Input tokens

  • start_ids (torch.Tensor) – Start ids of Prediction

  • end_ids (torch.Tensor) – End ids of Prediction

  • topics (torch.Tensor) – Topic type of data samples

  • val_split (float, optional) – Validation split, defaults to 0.2

  • test_split (float, optional) – Test split, defaults to 0.2

  • batch_size (int, optional) – Batch size, defaults to 32

  • seed (int, optional) – Seed, defaults to 2024

Returns:

Train, Val and Test dataloaders

Return type:

tuple (torch.utils.data.DataLoader, torch.utils.data.DataLoader, torch.utils.data.DataLoader)
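
A usage sketch with dummy shapes; in practice the four arrays come from PreprocessBERTFinetune.get_data() and are converted to tensors first:

    import torch
    from src.core.bert.dataset_finetune import create_data_loader_finetune

    # Dummy data for illustration only.
    num_samples, seq_len, vocab = 1000, 128, 30000
    tokens = torch.randint(0, vocab, (num_samples, seq_len))
    start_ids = torch.randint(0, seq_len, (num_samples,))
    end_ids = torch.randint(0, seq_len, (num_samples,))
    topics = torch.randint(0, 10, (num_samples,))

    train_loader, val_loader, test_loader = create_data_loader_finetune(
        tokens, start_ids, end_ids, topics,
        val_split=0.2, test_split=0.2, batch_size=32, seed=2024,
    )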

src.core.bert.dataset_pretrain module#

class src.core.bert.dataset_pretrain.BERTPretrainDataset(text_tokens, nsp_labels, config_dict, word2id)[source]#

Bases: Dataset

A class to generate a BERT NSP training-formatted dataset from preprocessed strings

Parameters:
  • text_tokens (numpy.ndarray (num_samples, seq_len)) – Preprocessed text tokens

  • nsp_labels (numpy.ndarray (num_samples,)) – NSP labels

  • config_dict (dict) – Config Params Dictionary

  • word2id (dict) – Words to Ids mapping
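
Passing word2id suggests the masking needed for the masked-token loss is applied on the fly. A sketch of the standard BERT recipe (mask 15% of positions; of those, 80% become [MASK], 10% a random token, 10% stay unchanged); all names are illustrative, not this class's internals:

    import torch

    def mask_tokens(tokens, mask_id, vocab_size, mask_prob=0.15):
        # Returns (masked inputs, original targets, boolean mask).
        inputs, targets = tokens.clone(), tokens.clone()
        mask = torch.rand(tokens.shape) < mask_prob
        decide = torch.rand(tokens.shape)
        inputs[mask & (decide < 0.8)] = mask_id              # 80% -> [MASK]
        random_ids = torch.randint(vocab_size, tokens.shape)
        swap = mask & (decide >= 0.8) & (decide < 0.9)       # 10% -> random token
        inputs[swap] = random_ids[swap]
        # The remaining 10% of masked positions keep the original token.
        return inputs, targets, mask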

class src.core.bert.dataset_pretrain.PreprocessBERTPretrain(config_dict)[source]#

Bases: object

A class to preprocess BERT Pretraining data

Parameters:

config_dict (dict) – Config Params Dictionary

batched_ids2tokens(tokens)[source]#

Converts sentences of ids to tokens

Parameters:

tokens (numpy.ndarray) – Tokens Array, 2D array (num_samples, seq_len)

Returns:

List of decoded sentences

Return type:

list

extract_data()[source]#

Extracts data from Wiki en csv file

Returns:

List of raw strings

Return type:

list

get_data()[source]#

Converts extracted data into tokens and Next Sentence Prediction labels

Returns:

Pretraining Data (tokens, nsp labels)

Return type:

tuple (numpy.ndarray [num_samples, seq_len], numpy.ndarray [num_samples,])
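
The standard NSP construction pairs each sentence with its true successor half the time (label 1) and with a random sentence otherwise (label 0); a sketch of that recipe, which this method presumably follows in some form:

    import random

    def make_nsp_pairs(sentences, seed=2024):
        rng = random.Random(seed)
        pairs, labels = [], []
        for i in range(len(sentences) - 1):
            if rng.random() < 0.5:
                pairs.append((sentences[i], sentences[i + 1]))  # true next sentence
                labels.append(1)
            else:
                pairs.append((sentences[i], rng.choice(sentences)))  # random sentence
                labels.append(0)
        return pairs, labels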

get_vocab(text_ls)[source]#

Generates Vocabulary

Parameters:

text_ls (list) – List of preprocessed strings

Returns:

Corpus generated by WordPiece

Return type:

list

preprocess_text(text_ls)[source]#

Preprocesses list of strings

Parameters:

text_ls (list) – List of Raw strings

Returns:

List of preprocessed strings

Return type:

list

src.core.bert.dataset_pretrain.create_dataloader_pretrain(X, y, config_dict, word2id, val_split=0.2, test_split=0.2, batch_size=32, seed=2024)[source]#

Creates PyTorch DataLoader for Pretraining data

Parameters:
  • X (numpy.ndarray (num_samples, seq_len)) – Masked Input text tokens

  • y (numpy.ndarray (num_samples,)) – NSP labels

  • config_dict (dict) – Config Params Dictionary

  • word2id (dict) – Words to Ids mapping

  • val_split (float, optional) – Validation split, defaults to 0.2

  • test_split (float, optional) – Test split, defaults to 0.2

  • batch_size (int, optional) – Batch size, defaults to 32

  • seed (int, optional) – Seed, defaults to 2024

Returns:

Train, Val and Test dataloaders

Return type:

tuple (torch.utils.data.DataLoader, torch.utils.data.DataLoader, torch.utils.data.DataLoader)
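
A usage sketch mirroring the finetune loader above; X and y normally come from PreprocessBERTPretrain.get_data(), and the config and vocabulary shown are placeholders:

    import numpy as np
    from src.core.bert.dataset_pretrain import create_dataloader_pretrain

    num_samples, seq_len = 1000, 128
    X = np.random.randint(0, 30000, size=(num_samples, seq_len))
    y = np.random.randint(0, 2, size=(num_samples,))
    config_dict = {"seq_len": seq_len}          # illustrative config
    word2id = {"[PAD]": 0, "[MASK]": 1}         # illustrative vocabulary

    train_loader, val_loader, test_loader = create_dataloader_pretrain(
        X, y, config_dict, word2id, batch_size=32, seed=2024,
    )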

src.core.bert.finetune module#

class src.core.bert.finetune.BERTFinetuneTrainer(model, optimizer, config_dict)[source]#

Bases: Module

BERT Finetune Model trainer

Parameters:
  • model (torch.nn.Module) – BERT Finetune model

  • optimizer (torch.optim.Optimizer) – Optimizer

  • config_dict (dict) – Config Params Dictionary

calc_loss(start_ids_prob, end_ids_prob, start_ids, end_ids)[source]#

NLL loss for start and end ids predictions

Parameters:
  • start_ids_prob (torch.Tensor (batch_size, num_vocab)) – Predicted probabilities of start ids

  • end_ids_prob (torch.Tensor (batch_size, num_vocab)) – Predicted probabilities of end ids

  • start_ids (torch.Tensor (batch_size,)) – True start ids

  • end_ids (torch.Tensor (batch_size,)) – True end ids

Returns:

NLL Loss of start, end ids

Return type:

tuple (torch.float32, torch.float32)
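
With log-probabilities and integer target positions, each component reduces to a standard NLL term; a sketch assuming the model emits log-probabilities (which nll_loss requires):

    import torch.nn.functional as F

    def span_nll_loss(start_ids_prob, end_ids_prob, start_ids, end_ids):
        # Each head is scored against its integer target position.
        start_loss = F.nll_loss(start_ids_prob, start_ids)
        end_loss = F.nll_loss(end_ids_prob, end_ids)
        return start_loss, end_loss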

fit(train_loader, val_loader)[source]#

Fits the model on the dataset. Runs training and validation steps for the given number of epochs and saves the best model based on the evaluation metric

Parameters:
  • train_loader (torch.utils.data.DataLoader) – Train Data loader

  • val_loader (torch.utils.data.DataLoader) – Validation Data Loader

Returns:

Training History

Return type:

dict

predict(data_loader)[source]#

Runs inference on Input Data

Parameters:

data_loader (torch.utils.data.DataLoader) – Infer Data loader

Returns:

Labels and predictions (start id labels, end id labels, encoded inputs)

Return type:

tuple (numpy.ndarray [num_samples,], numpy.ndarray [num_samples,], numpy.ndarray [seq_len, num_samples])

train_one_epoch(data_loader, epoch)[source]#

Train step

Parameters:
  • data_loader (torch.utils.data.DataLoader) – Train Data Loader

  • epoch (int) – Epoch number

Returns:

Train Losses (Train Loss, Train Start id loss, Train End id loss)

Return type:

tuple (torch.float32, torch.float32, torch.float32)

val_one_epoch(data_loader)[source]#

Validation step

Parameters:

data_loader (torch.utils.data.DataLoader) – Validation Data Loader

Returns:

Validation Losses (Validation Loss, Validation Start id loss, Validation End id loss)

Return type:

tuple (torch.float32, torch.float32, torch.float32)

src.core.bert.model module#

class src.core.bert.model.BERTFinetuneModel(config_dict)[source]#

Bases: Module

BERT Finetune Model

Parameters:

config_dict (dict) – Config Params Dictionary

forward(tokens)[source]#

Forward propagation

Parameters:

tokens (torch.Tensor (num_samples, seq_len)) – Input tokens

Returns:

Encoded Inputs, Start and End ids probs

Return type:

tuple (torch.Tensor [num_samples, seq_len, d_model], torch.Tensor [num_samples,], torch.Tensor [num_samples,])
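
A common way to realize this interface is a linear head over the encoder output emitting two logits per position, split into start and end distributions; a sketch under that assumption, not necessarily this class's exact architecture:

    import torch
    import torch.nn as nn

    class SpanHead(nn.Module):
        """Maps encoder output (batch, seq_len, d_model) to start/end
        log-probabilities over sequence positions."""

        def __init__(self, d_model):
            super().__init__()
            self.qa_outputs = nn.Linear(d_model, 2)

        def forward(self, encoded):
            logits = self.qa_outputs(encoded)                  # (batch, seq_len, 2)
            start_logits, end_logits = logits.unbind(dim=-1)   # (batch, seq_len) each
            return (
                encoded,
                torch.log_softmax(start_logits, dim=-1),
                torch.log_softmax(end_logits, dim=-1),
            )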

class src.core.bert.model.BERTPretrainModel(config_dict)[source]#

Bases: Module

BERT Pretrain Model

Parameters:

config_dict (dict) – Config Params Dictionary

forward(tokens)[source]#

Forward propagation

Parameters:

tokens (torch.Tensor (num_samples, seq_len)) – Input tokens

Returns:

Predicted Tokens, NSP output

Return type:

tuple (torch.Tensor [num_samples, seq_len, num_vocab], torch.Tensor [num_samples,])
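
The two outputs typically come from separate heads on the shared encoder: a per-position vocabulary projection for masked-token prediction and a classifier on the [CLS] position for NSP; a sketch under that assumption:

    import torch
    import torch.nn as nn

    class PretrainHeads(nn.Module):
        def __init__(self, d_model, num_vocab):
            super().__init__()
            self.mlm_head = nn.Linear(d_model, num_vocab)
            self.nsp_head = nn.Linear(d_model, 1)

        def forward(self, encoded):
            tokens_pred = self.mlm_head(encoded)   # (batch, seq_len, num_vocab)
            cls_state = encoded[:, 0]              # [CLS] position
            nsp_pred = torch.sigmoid(self.nsp_head(cls_state)).squeeze(-1)  # (batch,)
            return tokens_pred, nsp_pred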

src.core.bert.pretrain module#

class src.core.bert.pretrain.BERTPretrainTrainer(model, optimizer, config_dict)[source]#

Bases: Module

BERT Pretrain Model trainer

Parameters:
  • model (torch.nn.Module) – BERT Pretrain model

  • optimizer (torch.optim.Optimizer) – Optimizer

  • config_dict (dict) – Config Params Dictionary

calc_loss(tokens_pred, nsp_pred, tokens_true, nsp_labels, tokens_mask)[source]#

Calculates Training Loss components

Parameters:
  • tokens_pred (torch.Tensor (batch_size, seq_len, num_vocab)) – Predicted Tokens

  • nsp_pred (torch.Tensor (batch_size,)) – Predicted NSP Label

  • tokens_true (torch.Tensor (batch_size, seq_len)) – True tokens

  • nsp_labels (torch.Tensor (batch_size,)) – NSP labels

  • tokens_mask (torch.Tensor (batch_size, seq_len)) – Tokens mask

Returns:

Masked word prediction Cross Entropy loss, NSP classification loss

Return type:

tuple (torch.float32, torch.float32)
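
A hedged sketch of the two components: cross entropy over masked positions only, plus a binary NSP loss (assuming the model's NSP output is a probability; the class's exact reduction may differ):

    import torch.nn.functional as F

    def pretrain_losses(tokens_pred, nsp_pred, tokens_true, nsp_labels, tokens_mask):
        masked = tokens_mask.bool()
        # MLM term: score only the masked positions.
        mlm_loss = F.cross_entropy(tokens_pred[masked], tokens_true[masked])
        # NSP term: binary classification on the [CLS] output.
        nsp_loss = F.binary_cross_entropy(nsp_pred, nsp_labels.float())
        return mlm_loss, nsp_loss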

fit(train_loader, val_loader)[source]#

Fits the model on the dataset. Runs training and validation steps for the given number of epochs and saves the best model based on the evaluation metric

Parameters:
  • train_loader (torch.utils.data.DataLoader) – Train Data loader

  • val_loader (torch.utils.data.DataLoader) – Validation Data Loader

Returns:

Training History

Return type:

dict

predict(data_loader)[source]#

Runs inference on input data

Parameters:

data_loader (torch.utils.data.DataLoader) – Infer Data loader

Returns:

Labels and predictions (True tokens, NSP Labels, Predicted tokens, NSP Predictions)

Return type:

tuple (numpy.ndarray [num_samples, seq_len], numpy.ndarray [num_samples,], numpy.ndarray [num_samples, seq_len], numpy.ndarray [num_samples,])

train_one_epoch(data_loader, epoch)[source]#

Train step

Parameters:
  • data_loader (torch.utils.data.DataLoader) – Train Data Loader

  • epoch (int) – Epoch number

Returns:

Train Losses (Train Loss, Train masked tokens loss, Train NSP loss)

Return type:

tuple (torch.float32, torch.float32, torch.float32)

val_one_epoch(data_loader)[source]#

Validation step

Parameters:

data_loader (torch.utils.data.DataLoader) – Validation Data Loader

Returns:

Validation Losses (Validation Loss, Validation masked tokens loss, Validation NSP loss)

Return type:

tuple (torch.float32, torch.float32, torch.float32)

Module contents#