src.core.bert package#
Submodules#
src.core.bert.bert module#
- class src.core.bert.bert.BERT(config_dict)[source]#
Bases:
object
A class to run BERT data preprocessing, training, and inference; a usage sketch follows this class entry
- Parameters:
config_dict (dict) – Config Params Dictionary
- load_pretrain_weights(pretrain_model, finetune_model)[source]#
Copies pretrained weights into the finetune BERT model object
- Parameters:
pretrain_model (torch.nn.Module) – Pretrain BERT model
finetune_model (torch.nn.Module) – Finetune BERT model
- Returns:
Finetune BERT model with Pretrained weights
- Return type:
torch.nn.Module
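The copying logic is internal to the class; below is a minimal sketch of the usual approach, assuming it amounts to transferring parameters whose names and shapes match between the two state dicts (`copy_matching_weights` is a hypothetical stand-in, not the project's implementation):

```python
import torch

def copy_matching_weights(pretrain_model, finetune_model):
    # Hypothetical sketch: copy parameters whose names and shapes match,
    # leaving finetune-only heads (e.g. a QA span classifier) untouched.
    pretrain_state = pretrain_model.state_dict()
    finetune_state = finetune_model.state_dict()
    for name, tensor in pretrain_state.items():
        if name in finetune_state and finetune_state[name].shape == tensor.shape:
            finetune_state[name] = tensor.clone()
    finetune_model.load_state_dict(finetune_state)
    return finetune_model
```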
- run_finetune()[source]#
Finetuning stage of BERT
- Returns:
BERT Finetune Trainer and Training History
- Return type:
tuple (torch.nn.Module, dict)
- run_infer_finetune()[source]#
Runs inference using Finetuned BERT
- Returns:
True and predicted start and end ids
- Return type:
tuple (numpy.ndarray [num_samples,], numpy.ndarray [num_samples,], numpy.ndarray [num_samples,], numpy.ndarray [num_samples,])
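A usage sketch for the class; `config_dict` is a placeholder, since the real key schema comes from the project's config files:

```python
from src.core.bert.bert import BERT

config_dict = {}  # placeholder; real keys come from the project's config files
bert = BERT(config_dict)

# Finetuning returns the trainer and its training history.
trainer, history = bert.run_finetune()

# Inference returns true and predicted start/end ids for each sample.
true_start, true_end, pred_start, pred_end = bert.run_infer_finetune()
```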
src.core.bert.dataset_finetune module#
- class src.core.bert.dataset_finetune.PreprocessBERTFinetune(config_dict, wordpiece, word2id)[source]#
Bases:
object
A class to preprocess BERT Finetuning Data; a usage sketch follows this class entry
- Parameters:
config_dict (dict) – Config Params Dictionary
wordpiece (src.preprocess.WordPiece) – WordPiece class
word2id (dict) – Words to Ids mapping
- batched_ids2tokens(tokens)[source]#
Converts sentences of ids to tokens
- Parameters:
tokens (numpy.ndarray) – Tokens Array, 2D array (num_samples, seq_len)
- Returns:
List of decoded sentences
- Return type:
list
- extract_data()[source]#
Extracts data from the SQuAD v1 JSON file
- Returns:
Finetuning Data (Topic, Context, Question, Answer Start ID, Num words in an answer)
- Return type:
pandas.DataFrame
- get_data()[source]#
Converts extracted data into tokens with start and end ids of answers, along with the topic of each sample
- Returns:
Finetuning Data (tokens, start ids, end ids, topics)
- Return type:
tuple (numpy.ndarray [num_samples, seq_len], numpy.ndarray [num_samples,], numpy.ndarray [num_samples,], numpy.ndarray [num_samples,])
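A usage sketch for the preprocessing pipeline; `config_dict`, `wordpiece`, and `word2id` are assumed to come from the project's config and preprocessing stages:

```python
from src.core.bert.dataset_finetune import PreprocessBERTFinetune

# wordpiece is a src.preprocess.WordPiece instance, word2id its vocabulary map.
preproc = PreprocessBERTFinetune(config_dict, wordpiece, word2id)

df = preproc.extract_data()  # topic, context, question, answer start, answer length
tokens, start_ids, end_ids, topics = preproc.get_data()

# Decode a few rows of token ids back into readable sentences.
sentences = preproc.batched_ids2tokens(tokens[:8])
```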
- src.core.bert.dataset_finetune.create_data_loader_finetune(tokens, start_ids, end_ids, topics, val_split=0.2, test_split=0.2, batch_size=32, seed=2024)[source]#
Creates PyTorch DataLoaders for Finetuning data
- Parameters:
tokens (torch.Tensor) – Input tokens
start_ids (torch.Tensor) – Start ids of Prediction
end_ids (torch.Tensor) – End ids of Prediction
topics (torch.Tensor) – Topic type of data samples
val_split (float, optional) – Validation split, defaults to 0.2
test_split (float, optional) – Test split, defaults to 0.2
batch_size (int, optional) – Batch size, defaults to 32
seed (int, optional) – Seed, defaults to 2024
- Returns:
Train, Val and Test dataloaders
- Return type:
tuple (torch.utils.data.DataLoader, torch.utils.data.DataLoader, torch.utils.data.DataLoader)
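Continuing the sketch above, the arrays returned by get_data() can be converted to tensors and split into loaders (a 60/20/20 split under the defaults):

```python
import torch
from src.core.bert.dataset_finetune import create_data_loader_finetune

train_loader, val_loader, test_loader = create_data_loader_finetune(
    torch.as_tensor(tokens),
    torch.as_tensor(start_ids),
    torch.as_tensor(end_ids),
    torch.as_tensor(topics),
    val_split=0.2, test_split=0.2, batch_size=32, seed=2024,
)
```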
src.core.bert.dataset_pretrain module#
- class src.core.bert.dataset_pretrain.BERTPretrainDataset(text_tokens, nsp_labels, config_dict, word2id)[source]#
Bases:
Dataset
A class to generate a dataset in BERT NSP training format from preprocessed strings
- Parameters:
text_tokens (numpy.ndarray (num_samples, seq_len)) – Preprocessed text tokens
nsp_labels (numpy.ndarray (num_samples,)) – NSP labels
config_dict (dict) – Config Params Dictionary
word2id (dict) – Words to Ids mapping
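Since the class derives from torch.utils.data.Dataset, it can also be wrapped directly in a standard DataLoader; a sketch, assuming the arrays come from PreprocessBERTPretrain below:

```python
from torch.utils.data import DataLoader
from src.core.bert.dataset_pretrain import BERTPretrainDataset

# text_tokens: (num_samples, seq_len); nsp_labels: (num_samples,)
dataset = BERTPretrainDataset(text_tokens, nsp_labels, config_dict, word2id)
loader = DataLoader(dataset, batch_size=32, shuffle=True)
```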
- class src.core.bert.dataset_pretrain.PreprocessBERTPretrain(config_dict)[source]#
Bases:
object
A class to preprocess BERT Pretraining data; a usage sketch follows this class entry
- Parameters:
config_dict (dict) – Config Params Dictionary
- extract_data()[source]#
Extracts data from the Wiki-en CSV file
- Returns:
List of raw strings
- Return type:
list
- get_data()[source]#
Converts extracted data into tokens and Next Sentence Prediction labels
- Returns:
Pretraining Data (tokens, nsp labels)
- Return type:
tuple (numpy.ndarray [num_samples, seq_len], numpy.ndarray [num_samples,])
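A usage sketch; `config_dict` is again a placeholder for the project's config:

```python
from src.core.bert.dataset_pretrain import PreprocessBERTPretrain

preproc = PreprocessBERTPretrain(config_dict)

raw_texts = preproc.extract_data()       # list of raw strings
tokens, nsp_labels = preproc.get_data()  # (num_samples, seq_len), (num_samples,)
```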
- src.core.bert.dataset_pretrain.create_dataloader_pretrain(X, y, config_dict, word2id, val_split=0.2, test_split=0.2, batch_size=32, seed=2024)[source]#
Creates PyTorch DataLoader for Pretraining data
- Parameters:
X (numpy.ndarray (num_samples, seq_len)) – Masked Input text tokens
y (numpy.ndarray (num_samples,)) – NSP labels
config_dict (dict) – Config Params Dictionary
word2id (dict) – Words to Ids mapping
val_split (float, optional) – Validation split, defaults to 0.2
test_split (float, optional) – Test split, defaults to 0.2
batch_size (int, optional) – Batch size, defaults to 32
seed (int, optional) – Seed, defaults to 2024
- Returns:
Train, Val and Test dataloaders
- Return type:
tuple (torch.utils.data.DataLoader, torch.utils.data.DataLoader, torch.utils.data.DataLoader)
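Continuing the sketch above, the arrays from get_data() feed straight into the loader factory:

```python
from src.core.bert.dataset_pretrain import create_dataloader_pretrain

train_loader, val_loader, test_loader = create_dataloader_pretrain(
    tokens, nsp_labels, config_dict, word2id,
    val_split=0.2, test_split=0.2, batch_size=32, seed=2024,
)
```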
src.core.bert.finetune module#
- class src.core.bert.finetune.BERTFinetuneTrainer(model, optimizer, config_dict)[source]#
Bases:
Module
BERT Finetune Model trainer
- Parameters:
model (torch.nn.Module) – BERT Finetune model
optimizer (torch.optim) – Optimizer
config_dict (dict) – Config Params Dictionary
- calc_loss(start_ids_prob, end_ids_prob, start_ids, end_ids)[source]#
NLL loss for start and end id predictions
- Parameters:
start_ids_prob (torch.Tensor (batch_size, num_vocab)) – Predicted probabilities of start ids
end_ids_prob (torch.Tensor (batch_size, num_vocab)) – Predicted probabilities of end ids
start_ids (torch.Tensor (batch_size,)) – True start ids
end_ids (torch.Tensor (batch_size,)) – True end ids
- Returns:
NLL Loss of start, end ids
- Return type:
tuple (torch.float32, torch.float32)
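A minimal worked sketch of the two NLL terms on random tensors; it assumes the predicted tensors hold log-probabilities (what torch.nn.functional.nll_loss expects), and the shapes are placeholders:

```python
import torch
import torch.nn.functional as F

# Placeholder shapes; the signature above calls the class dimension num_vocab.
batch_size, num_classes = 4, 128
start_ids_prob = F.log_softmax(torch.randn(batch_size, num_classes), dim=-1)
end_ids_prob = F.log_softmax(torch.randn(batch_size, num_classes), dim=-1)
start_ids = torch.randint(0, num_classes, (batch_size,))
end_ids = torch.randint(0, num_classes, (batch_size,))

# One NLL term per prediction head, matching the returned tuple above.
start_loss = F.nll_loss(start_ids_prob, start_ids)
end_loss = F.nll_loss(end_ids_prob, end_ids)
```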
- fit(train_loader, val_loader)[source]#
Fits the model on the dataset. Runs training and validation steps for the given number of epochs and saves the best model based on the evaluation metric
- Parameters:
train_loader (torch.utils.data.DataLoader) – Train Data loader
val_loader (torch.utils.data.DataLoader) – Validation Data Loader
- Returns:
Training History
- Return type:
dict
- predict(data_loader)[source]#
Runs inference on Input Data
- Parameters:
data_loader (torch.utils.data.DataLoader) – Infer Data loader
- Returns:
Labels and predictions (start id labels, end id labels, encoded inputs)
- Return type:
tuple (numpy.ndarray [num_samples,], numpy.ndarray [num_samples,], numpy.ndarray [seq_len, num_samples])
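A usage sketch for the trainer, reusing the loaders built earlier; the optimizer choice and learning rate are placeholders, not the project's settings:

```python
import torch
from src.core.bert.model import BERTFinetuneModel
from src.core.bert.finetune import BERTFinetuneTrainer

model = BERTFinetuneModel(config_dict)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # placeholder choice
trainer = BERTFinetuneTrainer(model, optimizer, config_dict)

history = trainer.fit(train_loader, val_loader)
start_labels, end_labels, encoded_inputs = trainer.predict(test_loader)
```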
src.core.bert.model module#
- class src.core.bert.model.BERTFinetuneModel(config_dict)[source]#
Bases:
Module
BERT Finetune Model
- Parameters:
config_dict (dict) – Config Params Dictionary
src.core.bert.pretrain module#
- class src.core.bert.pretrain.BERTPretrainTrainer(model, optimizer, config_dict)[source]#
Bases:
Module
BERT Pretrain Model trainer
- Parameters:
model (torch.nn.Module) – BERT Pretrain model
optimizer (torch.optim) – Optimizer
config_dict (dict) – Config Params Dictionary
- calc_loss(tokens_pred, nsp_pred, tokens_true, nsp_labels, tokens_mask)[source]#
Calculates Training Loss components
- Parameters:
tokens_pred (torch.Tensor (batch_size, seq_len, num_vocab)) – Predicted Tokens
nsp_pred (torch.Tensor (batch_size,)) – Predicted NSP Label
tokens_true (torch.Tensor (batch_size, seq_len)) – True tokens
nsp_labels (torch.Tensor (batch_size,)) – NSP labels
tokens_mask (torch.Tensor (batch_size, seq_len)) – Tokens mask
- Returns:
Masked word prediction Cross Entropy loss, NSP classification loss
- Return type:
tuple (torch.float32, torch.float32)
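A minimal worked sketch of the two loss components on random tensors; it assumes the mask is 1 at masked positions and that NSP is scored with binary cross entropy on a per-sample probability. Shapes follow the signature above; values are placeholders:

```python
import torch
import torch.nn.functional as F

batch_size, seq_len, num_vocab = 4, 64, 1000  # placeholder shapes
tokens_pred = torch.randn(batch_size, seq_len, num_vocab)  # treated as logits
tokens_true = torch.randint(0, num_vocab, (batch_size, seq_len))
tokens_mask = (torch.rand(batch_size, seq_len) < 0.15).float()  # 1 = masked
nsp_pred = torch.rand(batch_size)
nsp_labels = torch.randint(0, 2, (batch_size,)).float()

# Cross entropy at every position, then averaged over masked positions only.
ce = F.cross_entropy(
    tokens_pred.view(-1, num_vocab), tokens_true.view(-1), reduction="none"
).view(batch_size, seq_len)
mlm_loss = (ce * tokens_mask).sum() / tokens_mask.sum().clamp(min=1.0)

nsp_loss = F.binary_cross_entropy(nsp_pred, nsp_labels)
```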
- fit(train_loader, val_loader)[source]#
Fits the model on the dataset. Runs training and validation steps for the given number of epochs and saves the best model based on the evaluation metric
- Parameters:
train_loader (torch.utils.data.DataLoader) – Train Data loader
val_loader (torch.utils.data.DataLoader) – Validation Data Loader
- Returns:
Training History
- Return type:
dict
- predict(data_loader)[source]#
Runs inference on input data
- Parameters:
data_loader (torch.utils.data.DataLoader) – Infer Data loader
- Returns:
Labels and predictions (true tokens, NSP labels, predicted tokens, NSP predictions)
- Return type:
tuple (numpy.ndarray [num_samples, seq_len], numpy.ndarray [num_samples,], numpy.ndarray [num_samples, seq_len], numpy.ndarray [num_samples,])
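A usage sketch mirroring the finetune trainer; `model` stands in for the project's pretraining model and the optimizer settings are placeholders:

```python
import torch
from src.core.bert.pretrain import BERTPretrainTrainer

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # placeholder choice
trainer = BERTPretrainTrainer(model, optimizer, config_dict)

history = trainer.fit(train_loader, val_loader)
tokens_true, nsp_true, tokens_pred, nsp_preds = trainer.predict(test_loader)
```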