src package#

Subpackages#

Submodules#

src.configs module#

All the Config Parameters that are used in the library with their corresponding Data Types

src.configs.configDictDType = {'alpha': <class 'float'>, 'batch_size': <class 'int'>, 'bleu_n': <class 'int'>, 'captions_file': <class 'str'>, 'clf_dim': <class 'list'>, 'context': <class 'int'>, 'd_ff': <class 'int'>, 'd_model': <class 'int'>, 'decoder_y_dim': <class 'int'>, 'device': <class 'str'>, 'dropout': <class 'float'>, 'embed_dim': <class 'int'>, 'encoder_h_dim': <class 'list'>, 'encoder_x_dim': <class 'list'>, 'epochs': <class 'int'>, 'eval_metric': <class 'str'>, 'explore_folder': <class 'bool'>, 'h_dim': <class 'list'>, 'idf_mode': <class 'str'>, 'image_backbone': <class 'str'>, 'image_dim': <class 'list'>, 'image_folder': <class 'str'>, 'input_file': <class 'str'>, 'input_folder': <class 'str'>, 'lr': <class 'float'>, 'mask': <class 'float'>, 'next': <class 'float'>, 'ngram': <class 'int'>, 'num_classes': <class 'int'>, 'num_extra_tokens': <class 'int'>, 'num_heads': <class 'int'>, 'num_layers': <class 'int'>, 'num_samples': <class 'int'>, 'num_sents_per_doc': <class 'int'>, 'num_src_vocab': <class 'int'>, 'num_tgt_vocab': <class 'int'>, 'num_topics': <class 'int'>, 'num_vocab': <class 'int'>, 'operations': <class 'list'>, 'output_folder': <class 'str'>, 'output_label': <class 'bool'>, 'predict_tokens': <class 'int'>, 'prediction': <class 'float'>, 'pretrain_weights': <class 'str'>, 'random': <class 'float'>, 'random_lines': <class 'bool'>, 'randomize': <class 'bool'>, 'rouge_n_n': <class 'int'>, 'rouge_s_n': <class 'int'>, 'seed': <class 'int'>, 'seq_len': <class 'int'>, 'test_corpus': <class 'list'>, 'test_file': <class 'str'>, 'test_folder': <class 'str'>, 'test_samples': <class 'int'>, 'test_size': <class 'float'>, 'test_split': <class 'float'>, 'tf_mode': <class 'str'>, 'train_corpus': <class 'list'>, 'train_samples': <class 'int'>, 'val_split': <class 'float'>, 'visualize': <class 'bool'>, 'x_dim': <class 'list'>, 'x_max': <class 'int'>}#

Config Parameters segregated for each algorithm, along with its Parent Parameters.

Format:

    'Algo': {
        ...
    }
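
As a rough illustration, a config dictionary passed to the classes documented below might look like the sketch that follows. The values are placeholders chosen for illustration; only the key names and their expected dtypes come from configDictDType above.

    # Hypothetical config dictionary; values are placeholders.
    # Key names and dtypes follow configDictDType.
    config_dict = {
        "seed": 42,                   # int
        "device": "cpu",              # str
        "epochs": 10,                 # int
        "batch_size": 32,             # int
        "lr": 1e-3,                   # float
        "dropout": 0.1,               # float
        "output_folder": "outputs/",  # str (placeholder path)
        "h_dim": [128, 64],           # list (placeholder values)
    }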

src.main module#

src.metrics module#

class src.metrics.ClassificationMetrics(config_dict)[source]#

Bases: object

Metrics for Classification Task.

Parameters:

config_dict (dict) – Config Params Dictionary

get_metrics(references, predictions, target_names)[source]#

Function that returns Metrics using References, Predictions and Class Labels

Parameters:
  • references (numpy.ndarray) – References, 1D array (num_samples,)

  • predictions (numpy.ndarray) – Predictions, 2D array (num_samples, num_classes) with Probabilities

  • target_names (list) – Class Labels

Returns:

Metrics Dictionary

Return type:

dict
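
A minimal usage sketch for ClassificationMetrics, assuming a two-class task; the config contents, label values, and class names are placeholders rather than part of the documented API.

    import numpy as np
    from src.metrics import ClassificationMetrics

    config_dict = {"num_classes": 2}            # placeholder config values
    references = np.array([0, 1, 1, 0])         # (num_samples,)
    predictions = np.array([[0.9, 0.1],         # (num_samples, num_classes) probabilities
                            [0.2, 0.8],
                            [0.3, 0.7],
                            [0.6, 0.4]])

    clf_metrics = ClassificationMetrics(config_dict)
    metrics = clf_metrics.get_metrics(references, predictions, ["neg", "pos"])
    # metrics is a dict of classification metrics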

class src.metrics.TextGenerationMetrics(config_dict)[source]#

Bases: object

Metrics for Text Generation Task.

Parameters:

config_dict (dict) – Config Params Dictionary

bleu_score(references, predictions, n=4)[source]#

BLEU Score

Parameters:
  • references (numpy.ndarray) – References, 2D array (num_samples, seq_len)

  • predictions (numpy.ndarray) – Predictions, 3D array (num_samples, seq_len, num_vocab) with Probabilities

  • n (int, optional) – Maximum N-gram order, defaults to 4

Returns:

BLEU score

Return type:

float

cider_score(references, predictions)[source]#
get_metrics(references, predictions)[source]#

Function that returns Metrics using References and Predictions

Parameters:
  • references (numpy.ndarray) – References, 2D array (num_samples, seq_len)

  • predictions (numpy.ndarray) – Predictions, 3D array (num_samples, seq_len, num_vocab) with Probabilities

Returns:

Metrics Dictionary

Return type:

dict
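
A minimal usage sketch for TextGenerationMetrics, assuming the config supplies the documented n-gram settings (bleu_n, rouge_n_n, rouge_s_n); the random arrays only illustrate the expected shapes.

    import numpy as np
    from src.metrics import TextGenerationMetrics

    num_samples, seq_len, num_vocab = 4, 10, 50
    config_dict = {"bleu_n": 4, "rouge_n_n": 2, "rouge_s_n": 2}  # placeholder values

    references = np.random.randint(0, num_vocab, size=(num_samples, seq_len))
    predictions = np.random.rand(num_samples, seq_len, num_vocab)
    predictions /= predictions.sum(axis=-1, keepdims=True)       # per-token probabilities

    gen_metrics = TextGenerationMetrics(config_dict)
    metrics = gen_metrics.get_metrics(references, predictions)   # dict of scores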

meteor_score(references, predictions)[source]#
perplexity_score(predictions)[source]#

Perplexity Score

Parameters:

predictions (numpy.ndarray) – Predictions, 3D array (num_samples, seq_len, num_vocab) with Probabilities

Returns:

Perplexity Score

Return type:

float

rouge_l_score(references, predictions)[source]#

ROUGE L Score

Parameters:
  • references (numpy.ndarray) – References, 2D array (num_samples, seq_len)

  • predictions (numpy.ndarray) – Predictions, 3D array (num_samples, seq_len, num_vocab) with Probabilities

Returns:

ROUGE L score

Return type:

float

rouge_n_score(references, predictions, n=4)[source]#

ROUGE N Score

Parameters:
  • references (numpy.ndarray) – References, 2D array (num_samples, seq_len)

  • predictions (numpy.ndarray) – Predictions, 3D array (num_samples, seq_len, num_vocab) with Probabilities

  • n (int, optional) – Maximum N-gram order, defaults to 4

Returns:

ROUGE N score

Return type:

float

rouge_s_score(references, predictions, n=4)[source]#

ROUGE S Score

Parameters:
  • references (numpy.ndarray) – References, 2D array (num_samples, seq_len)

  • predictions (numpy.ndarray) – Predictions, 3D array (num_samples, seq_len, num_vocab) with Probabilities

  • n (int, optional) – Maximum N-gram order, defaults to 4

Returns:

ROUGE S score

Return type:

float

src.plot_utils module#

src.plot_utils.pca_emission_matrix(em_matrix_df, output_folder)[source]#

TSNE of Emission Matrix. Used in HMM

Parameters:
  • em_matrix_df (pandas.DataFrame) – DataFrame of Emission Matrix

  • output_folder (str) – Path to the folder where the Scatter plot is saved as an HTML file

src.plot_utils.plot_conf_matrix(y_true, y_pred, classes, output_folder)[source]#

Confusion Matrix of True Labels vs Prediction Labels. Used in GRU, RNN

Parameters:
  • y_true (list) – True Labels

  • y_pred (list) – Prediction labels

  • classes (list) – List of classes

  • output_folder (str) – Path to the folder where the Confusion Matrix png file is saved
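
A usage sketch, assuming integer labels indexed into a list of class names; the labels, class names, and output path are placeholders.

    from src.plot_utils import plot_conf_matrix

    y_true = [0, 1, 1, 0, 1]   # hypothetical true labels
    y_pred = [0, 1, 0, 0, 1]   # hypothetical predicted labels
    plot_conf_matrix(y_true, y_pred, ["neg", "pos"], "outputs/")
    # saves the confusion-matrix png under outputs/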

src.plot_utils.plot_embed(embeds, vocab, output_folder, fname='Word Embeddings TSNE')[source]#

3D TSNE of Word Embeddings from Embedding Matrix Layer

Parameters:
  • embeds (numpy.ndarray) – Embeddings Array (num_samples, embed_dim)

  • vocab (list) – Vocabulary

  • output_folder (str) – Path to the folder where the Scatter plot is saved as an HTML file

  • fname (str, optional) – Filename, defaults to “Word Embeddings TSNE”
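
A usage sketch with a random embedding matrix; only the shapes follow the documentation, and the vocabulary, embedding dimension, and output path are placeholders.

    import numpy as np
    from src.plot_utils import plot_embed

    vocab = [f"tok_{i}" for i in range(100)]   # hypothetical vocabulary
    embeds = np.random.rand(len(vocab), 50)    # (num_samples, embed_dim)
    plot_embed(embeds, vocab, "outputs/")      # writes the 3D TSNE scatter plot as an HTML file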

src.plot_utils.plot_hist_dataset(data, output_folder)[source]#

KDE plot of Sentence Length and Histogram of POS tags for each token. Used in HMM

Parameters:
  • data (tuple of (numpy.ndarray [num_samples, seq_len], numpy.ndarray [num_samples, seq_len], numpy.ndarray [num_samples, ], numpy.ndarray [num_samples, ])) – Tuple of Train X, Test X, Train y, Test y

  • output_folder (str) – Path to the folder where the Data Analysis png file is saved

src.plot_utils.plot_history(history, output_folder, name='History')[source]#

Training History plot of the Loss and Metrics tracked during Training

Parameters:
  • history (dict) – History Dictionary

  • output_folder (str) – Path to the folder where the History png file is saved

  • name (str, optional) – Filename, defaults to “History”
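
A usage sketch; the exact keys tracked in the history dictionary depend on the training loop, so the ones below are assumptions for illustration.

    from src.plot_utils import plot_history

    history = {
        "train_loss": [1.21, 0.94, 0.73],   # hypothetical per-epoch values
        "val_loss": [1.30, 1.02, 0.85],
    }
    plot_history(history, "outputs/", name="History")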

src.plot_utils.plot_ngram_pie_chart(vocab_df, n, output_folder, k=20)[source]#

Pie Chart of the Top K most frequent Ngrams. Used in NGRAM

Parameters:
  • vocab_df (pandas.DataFrame) – DataFrame of Ngrams and their Frequency in Corpus

  • n (int) – Number of terms in a Vocab (N of Ngram)

  • output_folder (str) – Path to the folder where the Pie Chart png file is saved

  • k (int, optional) – Number of Ngrams to plot, defaults to 20
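
A usage sketch with a small bigram table; the DataFrame column names are assumptions, since only "a DataFrame of Ngrams and their Frequency" is documented here.

    import pandas as pd
    from src.plot_utils import plot_ngram_pie_chart

    vocab_df = pd.DataFrame({
        "ngram": ["the cat", "cat sat", "sat on", "on the", "the mat"],  # assumed column names
        "frequency": [12, 7, 5, 4, 3],
    })
    plot_ngram_pie_chart(vocab_df, n=2, output_folder="outputs/", k=5)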

src.plot_utils.plot_pca_pairplot(X, y, output_folder, num_pcs=6, name='TFIDF PCA Pairplot')[source]#

Pairplot of Features Colored by Labels. Used in TFIDF

Parameters:
  • X (numpy.ndarray) – Feature 2D array (num_samples, num_features)

  • y (numpy.ndarray) – Labels array, (num_samples, )

  • output_folder (str) – Path to the folder where the pairplot png file is saved

  • num_pcs (int, optional) – Number of Features to Plot, defaults to 6

  • name (str, optional) – Filename, defaults to “TFIDF PCA Pairplot”

src.plot_utils.plot_topk_cooccur_matrix(cooccur_mat, vocab, output_folder, k=20)[source]#

Co-occurrence Matrix of Tokens in the GloVe Model. Used in GLOVE

Parameters:
  • cooccur_mat (numpy.ndarray) – Co-occurrence Matrix

  • vocab (list) – List of Vocabulary

  • output_folder (str) – Path to the folder where the Co-occurrence matrix png file is saved

  • k (int, optional) – Number of Vocab to plot, defaults to 20

src.plot_utils.plot_topk_freq(vocab_freq, output_folder, k=10)[source]#

Histogram of the Top K most frequent Vocabulary terms in the corpus. Used in NGRAM, BOW

Parameters:
  • vocab_freq (dict) – Vocabulary Frequency in the Corpus

  • output_folder (str) – Path to the folder where the Histogram png file is saved

  • k (int, optional) – Number of Vocab to plot, defaults to 10
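
A usage sketch; the token counts are placeholders. The same dictionary shape is also accepted by plot_wordcloud below.

    from src.plot_utils import plot_topk_freq

    vocab_freq = {"the": 120, "on": 60, "cat": 30, "sat": 25, "mat": 18}  # hypothetical counts
    plot_topk_freq(vocab_freq, "outputs/", k=3)   # histogram of the 3 most frequent tokens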

src.plot_utils.plot_transition_matrix(trans_matrix_df, output_folder)[source]#

Heatmap of the Transition Matrix. Used in HMM

Parameters:
  • trans_matrix_df (pandas.DataFrame) – DataFrame of the Transition Matrix

  • output_folder (str) – Path to the folder where the Heatmap png file is saved

src.plot_utils.plot_wordcloud(vocab_freq, output_folder)[source]#

Generating Word Cloud Plot. Used in NGRAM, BOW

Parameters:
  • vocab_freq (dict) – Vocabulary Frequency in the Corpus

  • output_folder (str) – Path to the folder where the Wordcloud png file is saved

src.plot_utils.viz_metrics(metric_dict, output_folder)[source]#

Visualizing Confusion Matrix and Classification Report Metrics. Used in HMM

Parameters:
  • metric_dict (dict) – Metrics Dictionary with conf_matrix and clf_report as keys

  • output_folder (str) – Path to the folder where the Metrics png file is saved

src.utils module#

class src.utils.ValidateConfig(config_dict, algo)[source]#

Bases: object

Validating Config File

Parameters:
  • config_dict (dict) – Config Params Dictionary

  • algo (str) – Name of the Algorithm
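
A usage sketch; the config path and the algorithm name "rnn" are placeholders, since the accepted algo strings are not listed on this page.

    from src.utils import ValidateConfig, load_config

    config_dict = load_config("configs/rnn.yaml")    # placeholder path
    validator = ValidateConfig(config_dict, "rnn")   # placeholder algo name
    validator.run_verify()                           # verifies required keys and value dtypes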

check_float(key, val)[source]#

To check whether the given float-valued key has a valid value or not

Parameters:
  • key (str) – Param Key

  • val (float) – Param value

check_int(key, val)[source]#

To check whether the given int-valued key has a valid value or not

Parameters:
  • key (str) – Param Key

  • val (int) – Param value

check_list(key, val)[source]#

To check whether the given list-valued key has a valid value or not

Parameters:
  • key (str) – Param Key

  • val (list) – Param value

check_paths(key, val)[source]#

To check whether the given filepath-valued key has a valid value or not

Parameters:
  • key (str) – Param Key

  • val (str) – Param value

check_string(key, val)[source]#

To check whether the given string-valued key has a valid value or not

Parameters:
  • key (str) – Param Key

  • val (str) – Param value

compare_dtype(key, val)[source]#

To check whether the value of the given key has a valid dtype or not

Parameters:
  • key (str) – Param Key

  • val (float/int/str/list) – Param value

run_verify()[source]#

Config Params Keys and Values Verification

verify_main_keys(keys)[source]#

Verifying whether Config has all the required keys or not

Parameters:

keys (list) – Parent Config Parameters

verify_values()[source]#

Verifying the Datatypes of all the Parameters in Config

src.utils.get_logger(log_folder)[source]#

Initializing Log File

Parameters:

log_folder (str) – Path to folder where Log file is added

src.utils.load_config(config_path)[source]#

Loading YAML Config file as a Dictionary

Parameters:

config_path (str) – Path to Config File

Returns:

Config Params Dictionary

Return type:

dict

src.utils.set_seed(seed)[source]#

Setting seed across Libraries to reproduce results

Parameters:

seed (int) – Seed value
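
A typical setup sketch combining these helpers; the paths are placeholders, and since the return value of get_logger is not documented here, it is not used.

    from src.utils import get_logger, load_config, set_seed

    config_dict = load_config("configs/config.yaml")   # YAML file -> dict
    set_seed(config_dict["seed"])                      # 'seed' is a documented int param
    get_logger("logs/")                                # initializes the log file in logs/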

Module contents#