src.core.gpt package
Submodules
src.core.gpt.dataset module
- class src.core.gpt.dataset.PreprocessGPT(config_dict)
Bases:
object
A class to preprocess GPT Data
- Parameters:
config_dict (dict) – Config Params Dictionary
- batched_ids2tokens(tokens)
Converts sequences of token IDs back into decoded sentences
- Parameters:
tokens (numpy.ndarray) – Tokens Array, 2D array (num_samples, seq_len)
- Returns:
List of decoded sentences
- Return type:
list
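As a rough illustration of what decoding a 2D ID array involves, the sketch below maps each row back to a string. The `id2token` dictionary and the `decode_batch` helper are hypothetical and not part of this package; the actual method presumably relies on the tokenizer built during preprocessing.

```python
import numpy as np

def decode_batch(ids, id2token):
    """Decode a (num_samples, seq_len) array of IDs into a list of sentences."""
    return [" ".join(id2token[int(i)] for i in row) for row in ids]

# Hypothetical vocabulary, for illustration only
id2token = {0: "<pad>", 1: "the", 2: "cat", 3: "sat"}
ids = np.array([[1, 2, 3], [1, 3, 0]])
sentences = decode_batch(ids, id2token)
```

The result is one string per row, which matches the documented `list` return type.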
- extract_data()
Extracts raw text from the novels' .txt files
- Returns:
List of raw strings
- Return type:
list
- get_data()
Converts extracted data into tokens
- Returns:
Text tokens
- Return type:
numpy.ndarray (num_samples, seq_len)
- get_test_data()
Converts extracted test data into tokens
- Returns:
Text tokens
- Return type:
numpy.ndarray (num_samples, seq_len + num_pred_tokens)
- src.core.gpt.dataset.create_dataloader(X, data='train', val_split=0.2, batch_size=32, seed=2024)
Creates the train and validation DataLoaders, or a single test DataLoader
- Parameters:
X (torch.Tensor (num_samples, seq_len+1)) – Input tokens
data (str, optional) – Type of data, defaults to “train”
val_split (float, optional) – validation split, defaults to 0.2
batch_size (int, optional) – Batch size, defaults to 32
seed (int, optional) – Seed, defaults to 2024
- Returns:
Train and validation dataloaders, or a single test dataloader
- Return type:
tuple (torch.utils.data.DataLoader, torch.utils.data.DataLoader) / torch.utils.data.DataLoader
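The following is a minimal sketch of how such a function could be built, not the package's actual implementation. It assumes that each row of `X` holds `seq_len + 1` tokens, that the inputs are the first `seq_len` tokens and the targets are the same window shifted by one, and that any `data` value other than "train" yields a single loader. The name `make_loaders` is hypothetical.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

def make_loaders(X, data="train", val_split=0.2, batch_size=32, seed=2024):
    # Assumption: inputs are tokens [0, seq_len), targets are tokens [1, seq_len]
    dataset = TensorDataset(X[:, :-1], X[:, 1:])
    if data != "train":
        # Single loader, kept in order so outputs line up with the inputs
        return DataLoader(dataset, batch_size=batch_size, shuffle=False)

    num_val = int(len(dataset) * val_split)
    generator = torch.Generator().manual_seed(seed)  # reproducible split and shuffling
    train_ds, val_ds = random_split(
        dataset, [len(dataset) - num_val, num_val], generator=generator
    )
    train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True, generator=generator)
    val_loader = DataLoader(val_ds, batch_size=batch_size, shuffle=False)
    return train_loader, val_loader

# Toy data: 50 samples of seq_len + 1 = 9 tokens
X = torch.randint(0, 100, (50, 9))
train_loader, val_loader = make_loaders(X, batch_size=8)
xb, yb = next(iter(train_loader))
```

Seeding the generator keeps the train/validation split stable across runs, which is the likely purpose of the `seed` parameter.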
src.core.gpt.gpt module
src.core.gpt.model module
- class src.core.gpt.model.DecoderLayer(config_dict)
Bases:
Module
GPT Decoder layer
- Parameters:
config_dict (dict) – Config Params Dictionary
- class src.core.gpt.model.GPTModel(config_dict)
Bases:
Module
GPT Architecture
- Parameters:
config_dict (dict) – Config Params Dictionary
- class src.core.gpt.model.GPTTrainer(model, optimizer, config_dict)
Bases:
Module
GPT Trainer
- Parameters:
model (torch.nn.Module) – GPT model
optimizer (torch.optim.Optimizer) – Optimizer
config_dict (dict) – Config Params Dictionary
- calc_loss(y_pred, y_true)
Cross-entropy loss for predicted tokens
- Parameters:
y_pred (torch.Tensor (batch_size, seq_len, num_vocab)) – Predicted tokens
y_true (torch.Tensor (batch_size, seq_len)) – True tokens
- Returns:
Cross-entropy loss
- Return type:
torch.float32
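A token-level cross-entropy over these shapes can be sketched as below. This is an illustration rather than the trainer's actual code: it assumes `y_pred` holds unnormalized scores (logits) and that no padding tokens are masked out. The helper name `token_cross_entropy` is hypothetical.

```python
import torch
import torch.nn.functional as F

def token_cross_entropy(y_pred, y_true):
    # y_pred: (batch_size, seq_len, num_vocab) scores, assumed to be logits
    # y_true: (batch_size, seq_len) integer token IDs
    num_vocab = y_pred.shape[-1]
    # Flatten batch and sequence dimensions so every position is one classification
    return F.cross_entropy(y_pred.reshape(-1, num_vocab), y_true.reshape(-1))

y_pred = torch.zeros(2, 5, 10)            # uniform scores over 10 vocabulary entries
y_true = torch.randint(0, 10, (2, 5))
loss = token_cross_entropy(y_pred, y_true)
```

With uniform scores the loss equals ln(10), roughly 2.3026, whatever the targets are, which makes this a handy sanity check.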
- fit(train_loader, val_loader)
Fits the model on the dataset. Runs training and validation steps for the given number of epochs and saves the best model according to the evaluation metric
- Parameters:
train_loader (torch.utils.data.DataLoader) – Train Data loader
val_loader (torch.utils.data.DataLoader) – Validation Data loader
- Returns:
Training History
- Return type:
dict
- generate(data_loader)
Runs inference to generate new text
- Parameters:
data_loader (torch.utils.data.DataLoader) – Infer Data loader
- Returns:
True tokens, Generated tokens
- Return type:
tuple (numpy.ndarray [num_samples, seq_len + num_pred_tokens], numpy.ndarray [num_samples, seq_len + num_pred_tokens])
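The output shape `seq_len + num_pred_tokens` suggests autoregressive decoding, in which each predicted token is appended to the context before the next step. The sketch below shows greedy decoding under that assumption. The actual method may use sampling or another strategy, and `greedy_generate` and `ToyModel` are hypothetical stand-ins, not part of this package.

```python
import torch

@torch.no_grad()
def greedy_generate(model, prompt, num_pred_tokens):
    # prompt: (batch_size, seq_len) token IDs
    # model(x) is assumed to return (batch_size, T, num_vocab) scores
    seq_len = prompt.shape[1]
    tokens = prompt
    for _ in range(num_pred_tokens):
        scores = model(tokens[:, -seq_len:])             # fixed-size context window
        next_token = scores[:, -1, :].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_token], dim=1)  # append the prediction
    return tokens

class ToyModel(torch.nn.Module):
    """Stand-in for a GPT model: embedding followed by a linear head."""
    def __init__(self, num_vocab=20, dim=8):
        super().__init__()
        self.emb = torch.nn.Embedding(num_vocab, dim)
        self.out = torch.nn.Linear(dim, num_vocab)

    def forward(self, x):
        return self.out(self.emb(x))

torch.manual_seed(0)
model = ToyModel()
prompt = torch.randint(0, 20, (3, 6))
generated = greedy_generate(model, prompt, num_pred_tokens=4)
```

Each output row starts with the original prompt and ends with the newly generated tokens.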
- predict(data_loader)
Runs inference to predict a shifted sentence
- Parameters:
data_loader (torch.utils.data.DataLoader) – Infer Data loader
- Returns:
True tokens, Predicted tokens
- Return type:
tuple (numpy.ndarray [num_samples, seq_len], numpy.ndarray [num_samples, seq_len, num_vocab])
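Because the second array keeps the vocabulary dimension, it has to be reduced to token IDs before it can be compared with the true tokens. A small sketch with random stand-in arrays (not real model output) shows one way to do that, assuming a higher score means a more likely token:

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 10, size=(4, 6))   # (num_samples, seq_len)
y_pred = rng.random((4, 6, 10))             # (num_samples, seq_len, num_vocab)

pred_ids = y_pred.argmax(axis=-1)           # most likely token at each position
accuracy = float((pred_ids == y_true).mean())
```

The resulting `pred_ids` array has the same 2D shape that `batched_ids2tokens` expects.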