src.preprocess package#
Submodules#
src.preprocess.eng2tel module#
- class src.preprocess.eng2tel.PreprocessSeq2Seq(config_dict)[source]#
Bases: object
Loading data and generating source and target data for Seq2Seq model training.
- Parameters:
config_dict (dict) – Config Params Dictionary
- batched_ids2tokens(tokens, type='src')[source]#
Converting sentences of ids to tokens
- Parameters:
tokens (numpy.ndarray) – Tokens Array, 2D array (num_samples, seq_len)
type (str, optional) – {‘src’, ‘tgt’} Type of tokens, defaults to “src”
- Returns:
List of decoded sentences
- Return type:
list
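A minimal sketch of what this decoding might look like; the id2word map and pad id below are assumptions for illustration, since the real mappings are built by get_vocab():

    import numpy as np

    # Hypothetical id-to-token map; the real class builds one per side in get_vocab().
    src_id2word = {0: "<pad>", 1: "i", 2: "like", 3: "tea"}

    def batched_ids2tokens(tokens, id2word, pad_id=0):
        # tokens: (num_samples, seq_len) array of vocabulary ids
        sentences = []
        for row in tokens:
            sentences.append(" ".join(id2word[i] for i in row if i != pad_id))
        return sentences

    print(batched_ids2tokens(np.array([[1, 2, 3, 0]]), src_id2word))
    # ['i like tea']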
- extract_data()[source]#
Extracting data from a CSV file
- Returns:
Train and Test DataFrames with Source and Target sentences
- Return type:
tuple (pandas.DataFrame, pandas.DataFrame)
- get_data(df)[source]#
Generating Source and Target Arrays of Tokens
- Parameters:
df (pandas.DataFrame) – DataFrame with Source and Target sentences
- Returns:
Source and Target Arrays
- Return type:
tuple (numpy.ndarray [num_samples, seq_len], numpy.ndarray [num_samples, seq_len])
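A sketch of the id-encoding this step performs, assuming a simple word-level word2id map and zero padding (both hypothetical; the real vocabulary comes from get_vocab()):

    import numpy as np

    word2id = {"<pad>": 0, "<unk>": 1, "i": 2, "like": 3, "tea": 4}  # hypothetical

    def sentences_to_ids(sentences, word2id, seq_len=4):
        # Map words to ids, truncate to seq_len, pad the rest with <pad> (0).
        out = np.zeros((len(sentences), seq_len), dtype=np.int64)
        for r, sent in enumerate(sentences):
            ids = [word2id.get(w, word2id["<unk>"]) for w in sent.split()][:seq_len]
            out[r, :len(ids)] = ids
        return out

    print(sentences_to_ids(["i like tea"], word2id))
    # [[2 3 4 0]]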
- get_vocab(df)[source]#
Generates Vocabulary
- Parameters:
df (pandas.DataFrame) – DataFrame with Source and Target sentences
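The docs do not show the implementation; a plausible frequency-based vocabulary builder over one column might look like this (the column name, special tokens, and min_freq cutoff are assumptions):

    from collections import Counter
    import pandas as pd

    def build_vocab(df, col="src", min_freq=1):
        # Count words across all sentences and assign ids by frequency.
        counts = Counter(w for sent in df[col] for w in sent.split())
        vocab = {"<pad>": 0, "<unk>": 1}
        for word, freq in counts.most_common():
            if freq >= min_freq:
                vocab[word] = len(vocab)
        return vocab

    df = pd.DataFrame({"src": ["i like tea", "i like coffee"]})
    print(build_vocab(df))
    # {'<pad>': 0, '<unk>': 1, 'i': 2, 'like': 3, 'tea': 4, 'coffee': 5}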
src.preprocess.flickr module#
- class src.preprocess.flickr.PreprocessFlickr(config_dict)[source]#
Bases: object
Preprocessing Flickr Dataset
- Parameters:
config_dict (dict) – Config Params Dictionary
- batched_ids2tokens(tokens)[source]#
Converting sentences of ids to tokens
- Parameters:
tokens (numpy.ndarray) – Tokens Array, 2D array (num_samples, seq_len)
- Returns:
List of decoded sentences
- Return type:
list
- extract_data()[source]#
Extracting image and caption data
- Returns:
Train and Test DataFrames
- Return type:
tuple (pandas.DataFrame, pandas.DataFrame)
- get_data()[source]#
Preprocessing the training data
- Returns:
Image paths, Train Tokens, and (Train, Test) Transforms
- Return type:
(list, numpy.ndarray [num_samples, seq_len], (albumentations.Compose, albumentations.Compose))
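The returned pair is built with albumentations; a representative (train, test) pair of the kind returned here, with augmentations chosen purely for illustration (the actual ones come from config_dict):

    import albumentations as A

    # Illustrative only: the class derives its transforms from config_dict.
    train_tfms = A.Compose([
        A.Resize(224, 224),
        A.HorizontalFlip(p=0.5),   # augmentation applied only at train time
        A.Normalize(),
    ])
    test_tfms = A.Compose([
        A.Resize(224, 224),
        A.Normalize(),
    ])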
- get_test_data()[source]#
Generating test data
- Returns:
Image paths, Test Tokens, and Test Transforms
- Return type:
(list, numpy.ndarray [num_samples, seq_len], albumentations.Compose)
- get_vocab(train_df)[source]#
Generates Vocabulary
- Parameters:
train_df (pandas.DataFrame) – DataFrame with Training Captions
src.preprocess.imdb_reviews module#
- class src.preprocess.imdb_reviews.PreprocessIMDB(root_path, explore_folder, num_samples, operations, randomize)[source]#
Bases: object
Loading reviews and generating labels for the IMDB dataset
- Parameters:
root_path (str) – Root folder that either contains one folder per class (each with a txt file per sample) or contains txt files directly
explore_folder (bool) – Whether root_path contains class folders (True) or txt files directly (False)
num_samples (int) – How many samples to select from each folder
operations (list) – List of preprocessing operations; any combination of {‘lcase’, ‘remalpha’, ‘stopwords’, ‘stemming’} (see the sketch below)
randomize (bool) – Whether to select num_samples at random instead of taking the first num_samples
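A plausible reading of the four operations, sketched with nltk; the helper name is hypothetical, and the class's actual implementation may differ:

    import re
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    def apply_operations(text, operations):  # hypothetical helper
        if "lcase" in operations:
            text = text.lower()
        if "remalpha" in operations:
            text = re.sub(r"[^a-zA-Z\s]", " ", text)  # drop non-alphabetic chars
        words = text.split()
        if "stopwords" in operations:
            sw = set(stopwords.words("english"))  # requires nltk.download("stopwords")
            words = [w for w in words if w not in sw]
        if "stemming" in operations:
            stemmer = PorterStemmer()
            words = [stemmer.stem(w) for w in words]
        return " ".join(words)

    print(apply_operations("This movie was GREAT!!", ["lcase", "remalpha", "stopwords", "stemming"]))
    # movi great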
- extract_data(root_path, explore_folder, num_samples, randomize)[source]#
Extracting data from txt files
- Parameters:
root_path (str) – Root folder that either contains one folder per class (each with a txt file per sample) or contains txt files directly
explore_folder (bool) – Whether root_path contains class folders (True) or txt files directly (False)
num_samples (int) – How many samples to select from each folder
randomize (bool) – Whether to select num_samples at random instead of taking the first num_samples
- extract_data_folder(fold_path, num_samples, randomize)[source]#
Extracting txt data from a single folder
- Parameters:
fold_path (str) – Path to Folder
num_samples (int) – How many samples to select from each folder
randomize (bool) – Whether to select num_samples at random instead of taking the first num_samples
- Returns:
List of sentences from the folder
- Return type:
list
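A sketch of what reading one folder might involve under the stated num_samples/randomize semantics; file discovery and encoding details are assumptions:

    import os
    import random

    def extract_data_folder(fold_path, num_samples, randomize):
        # Pick num_samples txt files, either at random or the first ones.
        files = sorted(f for f in os.listdir(fold_path) if f.endswith(".txt"))
        if randomize:
            files = random.sample(files, min(num_samples, len(files)))
        else:
            files = files[:num_samples]
        sentences = []
        for fname in files:
            with open(os.path.join(fold_path, fname), encoding="utf-8") as fp:
                sentences.append(fp.read().strip())
        return sentences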
src.preprocess.pos module#
- class src.preprocess.pos.PreprocessPOS(config_dict)[source]#
Bases: object
Generating POS data from corpora available in nltk (‘treebank’, ‘brown’, ‘conll’)
- Parameters:
config_dict (dict) – Config Params Dictionary
- extract_data()[source]#
Extracting data from the available nltk corpus
- Returns:
List of sentences along with POS labels
- Return type:
list
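For reference, this is how a tagged corpus is read from nltk, using treebank as the example (the corpus actually used is set via config_dict):

    import nltk
    from nltk.corpus import treebank

    nltk.download("treebank")  # one-time corpus download
    corpus = treebank.tagged_sents()
    print(corpus[0][:3])
    # [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ',')]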
- get_data(corpus)[source]#
Generating Tokens and POS labels from corpus
- Parameters:
corpus (list) – List of nltk sentences with POS Labels
- Returns:
Tokens and POS Labels
- Return type:
tuple (numpy.ndarray [num_samples, seq_len], numpy.ndarray [num_samples, seq_len, num_pos])
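The trailing num_pos dimension suggests one-hot POS labels; a sketch of that encoding (an assumption, not confirmed by the docs):

    import numpy as np

    def one_hot_pos(label_ids, num_pos):
        # label_ids: (num_samples, seq_len) integer POS ids -> one-hot last axis
        out = np.zeros((*label_ids.shape, num_pos), dtype=np.float32)
        rows, cols = np.indices(label_ids.shape)
        out[rows, cols, label_ids] = 1.0
        return out

    labels = np.array([[0, 2, 1]])
    print(one_hot_pos(labels, num_pos=3).shape)
    # (1, 3, 3)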
src.preprocess.utils module#
- class src.preprocess.utils.BytePairEncoding(config_dict)[source]#
Bases: object
Byte Pair Encoding Algorithm to convert a corpus to tokens
- Parameters:
config_dict (dict) – Config Params Dictionary
- build_vocab(words)[source]#
Generates the vocabulary after updating words by merging characters
- Parameters:
words (list) – List of words
- Returns:
List of updated words
- Return type:
list
- fit(text_ls)[source]#
Fits BPE on a list of sentences and transforms them into words
- Parameters:
text_ls (list) – List of sentences
- Returns:
List of words
- Return type:
list
- get_stats(words)[source]#
Creates a dictionary with pairs of consecutive characters as keys and their counts in the corpus as values
- Parameters:
words (list) – List of words from the corpus
- Returns:
Dictionary with pairs of characters and frequency
- Return type:
dict
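This is the classic BPE pair-counting step; a minimal sketch, assuming words are space-separated symbol strings as produced by preprocess():

    from collections import Counter

    def get_stats(words):
        # Count each adjacent symbol pair across the corpus.
        pairs = Counter()
        for word in words:
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        return pairs

    print(get_stats(["l o w </w>", "l o w e r </w>"]).most_common(2))
    # [(('l', 'o'), 2), (('o', 'w'), 2)]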
- merge_chars(word, vocab)[source]#
Merging characters in a word if their concatenation is present in the vocabulary
- Parameters:
word (str) – Word
vocab (list) – Vocabulary
- Returns:
new word with merged characters
- Return type:
str
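A greedy sketch of this merge step, again assuming the space-separated word representation (the class's actual merge order may differ):

    def merge_chars(word, vocab):
        # Merge adjacent symbols whenever their concatenation is in the vocabulary.
        symbols = word.split()
        i = 0
        while i < len(symbols) - 1:
            merged = symbols[i] + symbols[i + 1]
            if merged in vocab:
                symbols[i:i + 2] = [merged]
            else:
                i += 1
        return " ".join(symbols)

    print(merge_chars("l o w </w>", ["lo", "low"]))
    # low </w>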
- preprocess(text_ls, data='train')[source]#
Creating words from a list of sentences. Words are created by adding a space between each character and appending </w> at the end.
- Parameters:
text_ls (list) – List of sentences
data (str, optional) – {‘train’, ‘test’} Type of data, defaults to “train”
- Returns:
List of words from all the sentences in one list
- Return type:
list
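A sketch of this word construction, taken directly from the description above:

    def preprocess(text_ls):
        # "low" -> "l o w </w>"; words from every sentence go into one flat list.
        words = []
        for sentence in text_ls:
            for word in sentence.split():
                words.append(" ".join(word) + " </w>")
        return words

    print(preprocess(["low lower"]))
    # ['l o w </w>', 'l o w e r </w>']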
- class src.preprocess.utils.WordPiece(config_dict)[source]#
Bases: object
WordPiece Tokenization Algorithm to tokenize a corpus and generate Vocabulary
- Parameters:
config_dict (dict) – Config Params Dictionary
- build_vocab(corpus)[source]#
Generates the vocabulary after updating words by merging characters
- Parameters:
corpus (list) – List of words
- Returns:
Updated corpus as a list of words
- Return type:
list
- combine(pair)[source]#
Combines a pair of characters based on their location in a word by removing ##
- Parameters:
pair (tuple) – Pair of characters
- Returns:
Combination of characters
- Return type:
str
- fit(text_ls)[source]#
Fits WordPiece on a list of sentences and transforms the words
- Parameters:
text_ls (list) – List of sentences
- Returns:
List of words
- Return type:
list
- get_likelihood(pair, pair_freq)[source]#
Calculates likelihood of two characters being consecutive in a corpus
- Parameters:
pair (tuple) – Pair of characters
pair_freq (dict) – Dictionary with pairs of characters and frequency
- Returns:
Likelihood (can be a value in [0, 1])
- Return type:
float
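The usual WordPiece score is count(ab) / (count(a) * count(b)); a sketch with an extra char_freq argument, which is an assumption here since the documented signature derives everything from pair_freq:

    def get_likelihood(pair, pair_freq, char_freq):
        # char_freq (hypothetical extra argument): per-symbol counts in the corpus.
        a, b = pair
        return pair_freq[pair] / (char_freq[a] * char_freq[b])

    pair_freq = {("h", "##u"): 4}
    char_freq = {"h": 5, "##u": 4}
    print(get_likelihood(("h", "##u"), pair_freq, char_freq))
    # 0.2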
- get_stats(corpus)[source]#
Creates a dictionary with pairs of consecutive characters as keys and their counts in the corpus as values
- Parameters:
corpus (list) – List of words
- Returns:
Dictionary with pairs of characters and frequency
- Return type:
dict
- merge_chars(word, vocab)[source]#
Merging characters in a word if their concatenation is present in the vocabulary
- Parameters:
word (str) – Word
vocab (list) – Vocabulary
- Returns:
new word with merged characters
- Return type:
str
- preprocess(text_ls, data='train')[source]#
Creating words from a list of sentences. Words are created by prefixing each character (other than the first) with ##.
- Parameters:
text_ls (list) – List of sentences
data (str, optional) – {‘train’, ‘test’} Type of data, defaults to “train”
- Returns:
List of words from all the sentences in one list. Each word is a list of characters
- Return type:
list
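A sketch of this construction, taken directly from the description above:

    def preprocess(text_ls):
        # "low" -> ["l", "##o", "##w"]; each word becomes a list of characters.
        words = []
        for sentence in text_ls:
            for word in sentence.split():
                words.append([word[0]] + ["##" + ch for ch in word[1:]])
        return words

    print(preprocess(["low lower"]))
    # [['l', '##o', '##w'], ['l', '##o', '##w', '##e', '##r']]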