src.preprocess package#

Submodules#

src.preprocess.eng2tel module#

class src.preprocess.eng2tel.PreprocessSeq2Seq(config_dict)[source]#

Bases: object

Loads data and generates source and target data for Seq2Seq model training.

Parameters:

config_dict (dict) – Config Params Dictionary

batched_ids2tokens(tokens, type='src')[source]#

Converts a batch of token-id sequences into decoded sentences.

Parameters:
  • tokens (numpy.ndarray) – Tokens Array, 2D array (num_samples, seq_len)

  • type (str, optional) – {‘src’, ‘tgt’} Type of tokens, defaults to “src”

Returns:

List of decoded sentences

Return type:

list
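
A minimal sketch of the decoding step, assuming a hypothetical id2word mapping and a padding id of 0 (illustrative stand-ins for the vocabulary built by get_vocab):

    import numpy as np

    # Hypothetical vocabulary; the real mapping comes from get_vocab()
    id2word = {0: "<pad>", 1: "hello", 2: "world"}

    def decode_batch(tokens):
        """Decode a (num_samples, seq_len) id array into sentences."""
        sentences = []
        for row in tokens:
            words = [id2word[int(i)] for i in row if i != 0]  # drop padding
            sentences.append(" ".join(words))
        return sentences

    print(decode_batch(np.array([[1, 2, 0]])))  # ['hello world']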

extract_data()[source]#

Extracting Data from CSV file

Returns:

Train and Test DataFrames with Source and Target sentences

Return type:

tuple, (pandas.DataFrame, pandas.DataFrame)

get_data(df)[source]#

Generates source and target token arrays.

Parameters:

df (pandas.DataFrame) – DataFrame with Source and Target sentences

Returns:

Source and Target Arrays

Return type:

tuple (numpy.ndarray [num_samples, seq_len], numpy.ndarray [num_samples, seq_len])
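
A minimal sketch of how sentences become a fixed-length id array; word2id and seq_len below are illustrative assumptions, since the real vocabulary is built by get_vocab:

    import numpy as np

    word2id = {"<pad>": 0, "<unk>": 1, "hello": 2, "world": 3}  # hypothetical
    seq_len = 5

    def encode(sentences):
        out = np.zeros((len(sentences), seq_len), dtype=np.int64)
        for i, sent in enumerate(sentences):
            ids = [word2id.get(w, word2id["<unk>"]) for w in sent.split()][:seq_len]
            out[i, :len(ids)] = ids  # remaining positions stay 0 (<pad>)
        return out

    print(encode(["hello world"]))  # [[2 3 0 0 0]]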

get_vocab(df)[source]#

Generates Vocabulary

Parameters:

df (pandas.DataFrame) – DataFrame with Source and Target sentences

preprocess_src(text)[source]#

Preprocessing Source Sentence

Parameters:

text (str) – Sentence

Returns:

Preprocessed sentence

Return type:

str

preprocess_tgt(text)[source]#

Preprocessing Target Sentence

Parameters:

text (str) – Sentence

Returns:

Preprocessed sentence

Return type:

str
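
A hedged sketch of typical source/target cleaning for a translation pair; the exact rules live in config_dict, so the lowercasing, punctuation stripping, and <sos>/<eos> markers below are assumptions:

    import re

    def clean_src(text):
        # Lowercase and keep only letters and spaces (assumed English source)
        return re.sub(r"[^a-z\s]", "", text.lower()).strip()

    def clean_tgt(text):
        # Targets often get start/end markers for teacher forcing (assumption)
        return "<sos> " + text.strip() + " <eos>"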

src.preprocess.flickr module#

class src.preprocess.flickr.PreprocessFlickr(config_dict)[source]#

Bases: object

Preprocessing Flickr Dataset

Parameters:

config_dict (dict) – Config Params Dictionary

batched_ids2captions(tokens)[source]#

Converts a batch of token-id sequences into decoded sentences.

Parameters:

tokens (numpy.ndarray) – Tokens Array, 2D array (num_samples, seq_len)

Returns:

List of decoded sentences

Return type:

list

extract_data()[source]#

Extracting Image and Caption Data

Returns:

Train and Test DataFrames

Return type:

tuple (pandas.DataFrame, pandas.DataFrame)

get_data()[source]#

Runs preprocessing and generates the training data.

Returns:

Image paths, train tokens, and (train, test) transforms

Return type:

(list, numpy.ndarray [num_samples, seq_len], (albumentations.Compose, albumentations.Compose))

get_test_data()[source]#

Generating Test Data

Returns:

Image paths, test tokens, and test transforms

Return type:

(list, numpy.ndarray [num_samples, seq_len], albumentations.Compose)

get_vocab(train_df)[source]#

Generates Vocabulary

Parameters:

train_df (pandas.DataFrame) – DataFrame with Training Captions

image_transforms(data_type)[source]#

Creating Albumentations Transforms for train or test data

Parameters:

data_type (str) – {‘train’, ‘test’}. Type of Data

Returns:

Transforms

Return type:

albumentations.Compose
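
A minimal sketch of train/test pipelines; the actual augmentations and image size come from config_dict, so the values below are illustrative:

    import albumentations as A

    def make_transforms(data_type):
        if data_type == "train":
            return A.Compose([
                A.Resize(224, 224),
                A.HorizontalFlip(p=0.5),  # light augmentation for training
                A.Normalize(),            # ImageNet mean/std by default
            ])
        return A.Compose([A.Resize(224, 224), A.Normalize()])  # deterministic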

word_tokens(df)[source]#

Converts sentences to tokens.

Parameters:

df (pandas.DataFrame) – Captions DataFrame

Returns:

Tokens array (num_samples, seq_len)

Return type:

numpy.ndarray

src.preprocess.imdb_reviews module#

class src.preprocess.imdb_reviews.PreprocessIMDB(root_path, explore_folder, num_samples, operations, randomize)[source]#

Bases: object

Loads and generates reviews and labels for the IMDB dataset.

Parameters:
  • root_path (str) – Root folder that either contains one subfolder per class (with a txt file per sample) or contains the txt files directly

  • explore_folder (bool) – Whether root_path contains class subfolders (True) or txt files directly (False)

  • num_samples (int) – How many samples to select from each folder

  • operations (list) – List of preprocessing operations; any combination of {‘lcase’, ‘remalpha’, ‘stopwords’, ‘stemming’}

  • randomize (bool) – Whether to select num_samples at random (True) or take the first num_samples (False)

extract_data(root_path, explore_folder, num_samples, randomize)[source]#

Extracting data from txt files

Parameters:
  • root_path (str) – Root folder that either contains one subfolder per class (with a txt file per sample) or contains the txt files directly

  • explore_folder (bool) – Whether root_path contains class subfolders (True) or txt files directly (False)

  • num_samples (int) – How many samples to select from each folder

  • randomize (bool) – Whether to select num_samples at random (True) or take the first num_samples (False)

extract_data_folder(fold_path, num_samples, randomize)[source]#

Extracting txt data from each folder

Parameters:
  • fold_path (str) – Path to Folder

  • num_samples (int) – How many samples to select from each folder

  • randomize (bool) – Whether to select num_samples at random (True) or take the first num_samples (False)

Returns:

List of sentences from the folder

Return type:

list
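
A sketch of per-folder extraction under the behavior described above; the file layout and encoding are assumptions:

    import os
    import random

    def extract_from_folder(fold_path, num_samples, randomize):
        files = sorted(f for f in os.listdir(fold_path) if f.endswith(".txt"))
        if randomize:
            files = random.sample(files, min(num_samples, len(files)))
        else:
            files = files[:num_samples]  # take the first num_samples
        sentences = []
        for name in files:
            with open(os.path.join(fold_path, name), encoding="utf-8") as fh:
                sentences.append(fh.read())
        return sentences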

run()[source]#

Preprocessing list of sentences

src.preprocess.pos module#

class src.preprocess.pos.PreprocessPOS(config_dict)[source]#

Bases: object

Generates POS data from corpora available in nltk (‘treebank’, ‘brown’, ‘conll’).

Parameters:

config_dict (dict) – Config Params Dictionary

extract_data()[source]#

Extracting data from available corpus in nltk

Returns:

List of sentences along with POS labels

Return type:

list

get_data(corpus)[source]#

Generating Tokens and POS labels from corpus

Parameters:

corpus (list) – List of nltk sentences with POS Labels

Returns:

Tokens and POS Labels

Return type:

tuple (numpy.ndarray [num_samples, seq_len], numpy.ndarray [num_samples, seq_len, num_pos])
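
A minimal sketch of pulling tokens and POS tags from an nltk corpus; padding and the one-hot encoding to (num_samples, seq_len, num_pos) are left out:

    import nltk
    from nltk.corpus import treebank

    nltk.download("treebank", quiet=True)
    corpus = treebank.tagged_sents()      # list of [(word, tag), ...] sentences
    tokens = [[w for w, _ in sent] for sent in corpus]
    labels = [[t for _, t in sent] for sent in corpus]
    print(tokens[0][:3], labels[0][:3])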

get_vocab(corpus)[source]#

Generates Vocabulary

Parameters:

corpus (list) – List of nltk sentences with POS Labels

preprocess_corpus(corpus)[source]#

Preprocessing Sentences

Parameters:

corpus (list) – List of nltk sentences with POS Labels

Returns:

List of sentences, Labels

Return type:

tuple (list, list)

src.preprocess.utils module#

class src.preprocess.utils.BytePairEncoding(config_dict)[source]#

Bases: object

Byte Pair Encoding Algorithm to convert a corpus to tokens

Parameters:

config_dict (dict) – Config Params Dictionary

build_vocab(words)[source]#

Generates the vocabulary after updating words by merging characters.

Parameters:

words (list) – List of words

Returns:

List of updated words

Return type:

list

fit(text_ls)[source]#

Fits BPE on a list of sentences and transforms them into words.

Parameters:

text_ls (list) – List of sentences

Returns:

List of words

Return type:

list

get_stats(words)[source]#

Creates a dictionary mapping each pair of consecutive characters to its count in the corpus.

Parameters:

words (list) – List of words from the corpus

Returns:

Dictionary with pairs of characters and frequency

Return type:

dict

merge_chars(word, vocab)[source]#

Merges characters in a word if their concatenation is present in the vocabulary.

Parameters:
  • word (str) – Word

  • vocab (list) – Vocabulary

Returns:

New word with merged characters

Return type:

str

preprocess(text_ls, data='train')[source]#

Creates words from a list of sentences; each word is formed by separating its characters with spaces and appending </w> at the end.

Parameters:
  • text_ls (list) – List of sentences

  • data (str, optional) – {‘train’, ‘test’} Type of data, defaults to “train”

Returns:

List of words from all the sentences in one list

Return type:

list
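
A tiny sketch of the word form preprocess produces, with spaces between characters and a trailing </w> marker:

    def to_bpe_word(word):
        return " ".join(word) + " </w>"

    print(to_bpe_word("low"))  # 'l o w </w>'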

run_merge(words)[source]#

Updates the vocabulary until the desired vocabulary size is reached.

Parameters:

words (list) – List of words

Returns:

List of updated final words

Return type:

list

transform(text_ls)[source]#

Transforms list of sentences into words

Parameters:

text_ls (list) – List of sentences

Returns:

List of words

Return type:

list
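
A compact sketch of the classic BPE merge loop, in the spirit of get_stats and run_merge above: count adjacent symbol pairs, merge the most frequent pair, repeat until the vocabulary budget is spent. The regex guard keeps merges from crossing symbol boundaries:

    import re
    from collections import Counter

    def get_pair_counts(words):
        pairs = Counter()
        for word in words:
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        return pairs

    def merge_pair(words, pair):
        bigram = re.escape(" ".join(pair))
        pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
        return [pattern.sub("".join(pair), w) for w in words]

    words = ["l o w </w>", "l o w e r </w>", "l o w e s t </w>"]
    for _ in range(2):  # two merge steps for illustration
        best = get_pair_counts(words).most_common(1)[0][0]
        words = merge_pair(words, best)
    print(words)  # ['low </w>', 'low e r </w>', 'low e s t </w>']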

class src.preprocess.utils.WordPiece(config_dict)[source]#

Bases: object

WordPiece Tokenization Algorithm to tokenize a corpus and generate Vocabulary

Parameters:

config_dict (dict) – Config Params Dictionary

build_vocab(corpus)[source]#

Generates the vocabulary after updating words by merging characters.

Parameters:

corpus (list) – List of words

Returns:

Updated list of words

Return type:

list

combine(pair)[source]#

Combines a pair of characters, removing the ## prefix according to their position in the word.

Parameters:

pair (tuple) – Pair of characters

Returns:

Combination of characters

Return type:

str

fit(text_ls)[source]#

Fits WordPiece on a list of sentences and transforms the words.

Parameters:

text_ls (list) – List of sentences

Returns:

List of words

Return type:

list

get_likelihood(pair, pair_freq)[source]#

Calculates likelihood of two characters being consecutive in a corpus

Parameters:
  • pair (tuple) – Pair of characters

  • pair_freq (dict) – Dictionary with pairs of characters and frequency

Returns:

Likelihood score, a value in [0, 1]

Return type:

float
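
A sketch assuming the standard WordPiece score: pair frequency normalized by the frequencies of its parts, so strongly associated pairs win over merely common ones. The counts below are hypothetical:

    def pair_score(pair_freq, unit_freq, pair):
        return pair_freq[pair] / (unit_freq[pair[0]] * unit_freq[pair[1]])

    pair_freq = {("h", "##e"): 20}   # hypothetical pair count
    unit_freq = {"h": 25, "##e": 40}  # hypothetical unit counts
    print(pair_score(pair_freq, unit_freq, ("h", "##e")))  # 0.02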

get_stats(corpus)[source]#

Creates a dictionary mapping each pair of consecutive characters to its count in the corpus.

Parameters:

corpus (list) – List of words

Returns:

Dictionary with pairs of characters and frequency

Return type:

dict

merge_chars(word, vocab)[source]#

Merges characters in a word if their concatenation is present in the vocabulary.

Parameters:
  • word (str) – Word

  • vocab (list) – Vocabulary

Returns:

New word with merged characters

Return type:

str

preprocess(text_ls, data='train')[source]#

Creates words from a list of sentences; each word is formed by prefixing every character other than the first with ##.

Parameters:
  • text_ls (list) – List of sentences

  • data (str, optional) – {‘train’, ‘test’} Type of data, defaults to “train”

Returns:

List of words from all the sentences in one list. Each word is a list of characters

Return type:

list
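
A tiny sketch of the word form preprocess produces, with ## prefixes on every non-initial character:

    def to_wordpiece(word):
        return [word[0]] + ["##" + c for c in word[1:]]

    print(to_wordpiece("word"))  # ['w', '##o', '##r', '##d']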

run_merge(corpus)[source]#

Updates the vocabulary until the desired vocabulary size is reached.

Parameters:

corpus (list) – List of words

Returns:

Final updated list of words

Return type:

list

transform(text_ls)[source]#

Transforms list of sentences into words

Parameters:

text_ls (list) – List of sentences

Returns:

List of words

Return type:

list

src.preprocess.utils.preprocess_text(text, operations=None)[source]#

Preprocesses Text

Parameters:
  • text (str) – String to preprocess

  • operations (list, optional) – List of operations from {‘lcase’, ‘remalpha’, ‘stopwords’, ‘stemming’}, defaults to None

Returns:

Preprocessed text

Return type:

str
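
A hedged sketch of the four operations using nltk; the real function's exact order and tokenization may differ:

    import re

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    nltk.download("stopwords", quiet=True)

    def preprocess_text_sketch(text, operations=None):
        operations = operations or []
        if "lcase" in operations:
            text = text.lower()
        if "remalpha" in operations:
            text = re.sub(r"[^a-zA-Z\s]", " ", text)  # keep letters and spaces
        words = text.split()
        if "stopwords" in operations:
            stops = set(stopwords.words("english"))
            words = [w for w in words if w not in stops]
        if "stemming" in operations:
            stemmer = PorterStemmer()
            words = [stemmer.stem(w) for w in words]
        return " ".join(words)

    print(preprocess_text_sketch(
        "The cats are running!",
        ["lcase", "remalpha", "stopwords", "stemming"],
    ))  # 'cat run'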

Module contents#