src.preprocess package#

Submodules#

src.preprocess.eng2tel module#

class src.preprocess.eng2tel.PreprocessSeq2Seq(config_dict)[source]#

Bases: object

Loads data and generates source and target data for Seq2Seq model training.

Parameters:

config_dict (dict) – Config Params Dictionary

batched_ids2tokens(tokens, type='src')[source]#

Converts a batch of token-id sequences into decoded sentences.

Parameters:
  • tokens (numpy.ndarray) – Tokens Array, 2D array (num_samples, seq_len)

  • type (str, optional) – {‘src’, ‘tgt’} Type of tokens, defaults to “src”

Returns:

List of decoded sentences

Return type:

list
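
A minimal sketch of the decoding step, assuming a hypothetical id2word mapping and a padding id of 0 (illustrative stand-ins for the vocabulary built by get_vocab):

    import numpy as np

    # Hypothetical vocabulary; the real mapping comes from get_vocab()
    id2word = {0: "<pad>", 1: "hello", 2: "world"}

    def decode_batch(tokens):
        """Decode a (num_samples, seq_len) id array into sentences."""
        sentences = []
        for row in tokens:
            words = [id2word[int(i)] for i in row if i != 0]  # drop padding
            sentences.append(" ".join(words))
        return sentences

    print(decode_batch(np.array([[1, 2, 0]])))  # ['hello world']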

extract_data()[source]#

Extracting Data from CSV file

Returns:

Train and Test DataFrames with Source and Target sentences

Return type:

tuple, (pandas.DataFrame, pandas.DataFrame)

get_data(df)[source]#

Generates source and target token arrays.

Parameters:

df (pandas.DataFrame) – DataFrame with Source and Target sentences

Returns:

Source and Target Arrays

Return type:

tuple (numpy.ndarray [num_samples, seq_len], numpy.ndarray [num_samples, seq_len])
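
A minimal sketch of how sentences become a fixed-length id array; word2id and seq_len below are illustrative assumptions, since the real vocabulary is built by get_vocab:

    import numpy as np

    word2id = {"<pad>": 0, "<unk>": 1, "hello": 2, "world": 3}  # hypothetical
    seq_len = 5

    def encode(sentences):
        out = np.zeros((len(sentences), seq_len), dtype=np.int64)
        for i, sent in enumerate(sentences):
            ids = [word2id.get(w, word2id["<unk>"]) for w in sent.split()][:seq_len]
            out[i, :len(ids)] = ids  # remaining positions stay 0 (<pad>)
        return out

    print(encode(["hello world"]))  # [[2 3 0 0 0]]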

get_vocab(df)[source]#

Generates Vocabulary

Parameters:

df (pandas.DataFrame) – DataFrame with Source and Target sentences

preprocess_src(text)[source]#

Preprocessing Source Sentence

Parameters:

text (str) – Sentence

Returns:

Preprocessed sentence

Return type:

str

preprocess_tgt(text)[source]#

Preprocessing Target Sentence

Parameters:

text (str) – Sentence

Returns:

Preprocessed sentence

Return type:

str
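
A hedged sketch of typical source/target cleaning for a translation pair; the exact rules live in config_dict, so the lowercasing, punctuation stripping, and <sos>/<eos> markers below are assumptions:

    import re

    def clean_src(text):
        # Lowercase and keep only letters and spaces (assumed English source)
        return re.sub(r"[^a-z\s]", "", text.lower()).strip()

    def clean_tgt(text):
        # Targets often get start/end markers for teacher forcing (assumption)
        return "<sos> " + text.strip() + " <eos>"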

src.preprocess.flickr module#

class src.preprocess.flickr.PreprocessFlickr(config_dict)[source]#

Bases: object

Preprocessing Flickr Dataset

Parameters:

config_dict (dict) – Config Params Dictionary

batched_ids2captions(tokens)[source]#

Converts a batch of token-id sequences into decoded sentences.

Parameters:

tokens (numpy.ndarray) – Tokens Array, 2D array (num_samples, seq_len)

Returns:

List of decoded sentences

Return type:

list

extract_data()[source]#

Extracting Image and Caption Data

Returns:

Train and Test DataFrames

Return type:

tuple (pandas.DataFrame, pandas.DataFrame)

get_data()[source]#

Runs preprocessing and generates the training data.

Returns:

Image paths, train tokens, and (train, test) transforms

Return type:

(list, numpy.ndarray [num_samples, seq_len], (albumentations.Compose, albumentations.Compose))

get_test_data()[source]#

Generating Test Data

Returns:

Image paths, test tokens, and test transforms

Return type:

(list, numpy.ndarray [num_samples, seq_len], albumentations.Compose)

get_vocab(train_df)[source]#

Generates Vocabulary

Parameters:

train_df (pandas.DataFrame) – DataFrame with Training Captions

image_transforms(data_type)[source]#

Creating Albumentations Transforms for train or test data

Parameters:

data_type (str) – {‘train’, ‘test’}. Type of Data

Returns:

Transforms

Return type:

albumentations.Compose
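
A minimal sketch of train/test pipelines; the actual augmentations and image size come from config_dict, so the values below are illustrative:

    import albumentations as A

    def make_transforms(data_type):
        if data_type == "train":
            return A.Compose([
                A.Resize(224, 224),
                A.HorizontalFlip(p=0.5),  # light augmentation for training
                A.Normalize(),            # ImageNet mean/std by default
            ])
        return A.Compose([A.Resize(224, 224), A.Normalize()])  # deterministic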

word_tokens(df)[source]#

Converts sentences to tokens.

Parameters:

df (pandas.DataFrame) – Captions DataFrame

Returns:

Tokens array (num_samples, seq_len)

Return type:

numpy.ndarray

src.preprocess.imdb_reviews module#

class src.preprocess.imdb_reviews.PreprocessIMDB(root_path, explore_folder, num_samples, operations, randomize)[source]#

Bases: object

Loads and generates reviews and labels for the IMDB dataset.

Parameters:
  • root_path (str) – Root folder that either contains one subfolder per class (with a txt file per sample) or contains the txt files directly

  • explore_folder (bool) – Whether root_path contains class subfolders (True) or txt files directly (False)

  • num_samples (int) – How many samples to select from each folder

  • operations (list) – List of preprocessing operations; any combination of {‘lcase’, ‘remalpha’, ‘stopwords’, ‘stemming’}

  • randomize (bool) – Whether to select num_samples at random (True) or take the first num_samples (False)

extract_data(root_path, explore_folder, num_samples, randomize)[source]#

Extracting data from txt files

Parameters:
  • root_path (str) – Root folder that either contains one subfolder per class (with a txt file per sample) or contains the txt files directly

  • explore_folder (bool) – Whether root_path contains class subfolders (True) or txt files directly (False)

  • num_samples (int) – How many samples to select from each folder

  • randomize (bool) – Whether to select num_samples at random (True) or take the first num_samples (False)

extract_data_folder(fold_path, num_samples, randomize)[source]#

Extracting txt data from each folder

Parameters:
  • fold_path (str) – Path to Folder

  • num_samples (int) – How many samples to select from each folder

  • randomize (bool) – Whether to select num_samples at random (True) or take the first num_samples (False)

Returns:

List of sentences from the folder

Return type:

list
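
A sketch of per-folder extraction under the behavior described above; the file layout and encoding are assumptions:

    import os
    import random

    def extract_from_folder(fold_path, num_samples, randomize):
        files = sorted(f for f in os.listdir(fold_path) if f.endswith(".txt"))
        if randomize:
            files = random.sample(files, min(num_samples, len(files)))
        else:
            files = files[:num_samples]  # take the first num_samples
        sentences = []
        for name in files:
            with open(os.path.join(fold_path, name), encoding="utf-8") as fh:
                sentences.append(fh.read())
        return sentences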

run()[source]#

Preprocessing list of sentences

src.preprocess.pos module#

class src.preprocess.pos.PreprocessPOS(config_dict)[source]#

Bases: object

Generates POS data from corpora available in nltk (‘treebank’, ‘brown’, ‘conll’).

Parameters:

config_dict (dict) – Config Params Dictionary

extract_data()[source]#

Extracting data from available corpus in nltk

Returns:

List of sentences along with POS labels

Return type:

list

get_data(corpus)[source]#

Generating Tokens and POS labels from corpus

Parameters:

corpus (list) – List of nltk sentences with POS Labels

Returns:

Tokens and POS Labels

Return type:

tuple (numpy.ndarray [num_samples, seq_len], numpy.ndarray [num_samples, seq_len, num_pos])
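
A minimal sketch of pulling tokens and POS tags from an nltk corpus; padding and the one-hot encoding to (num_samples, seq_len, num_pos) are left out:

    import nltk
    from nltk.corpus import treebank

    nltk.download("treebank", quiet=True)
    corpus = treebank.tagged_sents()      # list of [(word, tag), ...] sentences
    tokens = [[w for w, _ in sent] for sent in corpus]
    labels = [[t for _, t in sent] for sent in corpus]
    print(tokens[0][:3], labels[0][:3])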

get_vocab(corpus)[source]#

Generates Vocabulary

Parameters:

corpus (list) – List of nltk sentences with POS Labels

preprocess_corpus(corpus)[source]#

Preprocessing Sentences

Parameters:

corpus (list) – List of nltk sentences with POS Labels

Returns:

List of sentences, Labels

Return type:

tuple (list, list)

src.preprocess.utils module#

class src.preprocess.utils.BytePairEncoding(config_dict)[source]#

Bases: object

Byte Pair Encoding Algorithm to convert a corpus to tokens

Parameters:

config_dict (dict) – Config Params Dictionary

build_vocab(words)[source]#

Generates the vocabulary after updating words by merging characters.

Parameters:

words (list) – List of words

Returns:

List of updated words

Return type:

list

fit(text_ls)[source]#

Fits BPE on a list of sentences and transforms them into words.

Parameters:

text_ls (list) – List of sentences

Returns:

List of words

Return type:

list

get_stats(words)[source]#

Creates a dictionary mapping each pair of consecutive characters to its count in the corpus.

Parameters:

words (list) – List of words from the corpus

Returns:

Dictionary with pairs of characters and frequency

Return type:

dict

merge_chars(word, vocab)[source]#

Merges characters in a word if their concatenation is present in the vocabulary.

Parameters:
  • word (str) – Word

  • vocab (list) – Vocabulary

Returns:

New word with merged characters

Return type:

str

preprocess(text_ls, data='train')[source]#

Creates words from a list of sentences; each word is formed by separating its characters with spaces and appending </w> at the end.

Parameters:
  • text_ls (list) – List of sentences

  • data (str, optional) – {‘train’, ‘test’} Type of data, defaults to “train”

Returns:

List of words from all the sentences in one list

Return type:

list
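
A tiny sketch of the word form preprocess produces, with spaces between characters and a trailing </w> marker:

    def to_bpe_word(word):
        return " ".join(word) + " </w>"

    print(to_bpe_word("low"))  # 'l o w </w>'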

run_merge(words)[source]#

Updates the vocabulary until the desired vocabulary size is reached.

Parameters:

words (list) – List of words

Returns:

List of updated final words

Return type:

list

transform(text_ls)[source]#

Transforms list of sentences into words

Parameters:

text_ls (list) – List of sentences

Returns:

List of words

Return type:

list
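
A compact sketch of the classic BPE merge loop, in the spirit of get_stats and run_merge above: count adjacent symbol pairs, merge the most frequent pair, repeat until the vocabulary budget is spent. The regex guard keeps merges from crossing symbol boundaries:

    import re
    from collections import Counter

    def get_pair_counts(words):
        pairs = Counter()
        for word in words:
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        return pairs

    def merge_pair(words, pair):
        bigram = re.escape(" ".join(pair))
        pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
        return [pattern.sub("".join(pair), w) for w in words]

    words = ["l o w </w>", "l o w e r </w>", "l o w e s t </w>"]
    for _ in range(2):  # two merge steps for illustration
        best = get_pair_counts(words).most_common(1)[0][0]
        words = merge_pair(words, best)
    print(words)  # ['low </w>', 'low e r </w>', 'low e s t </w>']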

class src.preprocess.utils.WordPiece(config_dict)[source]#

Bases: object

WordPiece Tokenization Algorithm to tokenize a corpus and generate Vocabulary

Parameters:

config_dict (dict) – Config Params Dictionary

build_vocab(corpus)[source]#

Generates the vocabulary after updating words by merging characters.

Parameters:

corpus (list) – List of words

Returns:

Updated list of words

Return type:

list

combine(pair)[source]#

Combines a pair of characters, removing the ## prefix according to their position in the word.

Parameters:

pair (tuple) – Pair of characters

Returns:

Combination of characters

Return type:

str

fit(text_ls)[source]#

Fits WordPiece on a list of sentences and transforms the words.

Parameters:

text_ls (list) – List of sentences

Returns:

List of words

Return type:

list

get_likelihood(pair, pair_freq)[source]#

Calculates likelihood of two characters being consecutive in a corpus

Parameters:
  • pair (tuple) – Pair of characters

  • pair_freq (dict) – Dictionary with pairs of characters and frequency

Returns:

Likelihood score, a value in [0, 1]

Return type:

float
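
A sketch assuming the standard WordPiece score: pair frequency normalized by the frequencies of its parts, so strongly associated pairs win over merely common ones. The counts below are hypothetical:

    def pair_score(pair_freq, unit_freq, pair):
        return pair_freq[pair] / (unit_freq[pair[0]] * unit_freq[pair[1]])

    pair_freq = {("h", "##e"): 20}   # hypothetical pair count
    unit_freq = {"h": 25, "##e": 40}  # hypothetical unit counts
    print(pair_score(pair_freq, unit_freq, ("h", "##e")))  # 0.02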

get_stats(corpus)[source]#

Creates a dictionary mapping each pair of consecutive characters to its count in the corpus.

Parameters:

corpus (list) – List of words

Returns:

Dictionary with pairs of characters and frequency

Return type:

dict

merge_chars(word, vocab)[source]#

Merges characters in a word if their concatenation is present in the vocabulary.

Parameters:
  • word (str) – Word

  • vocab (list) – Vocabulary

Returns:

New word with merged characters

Return type:

str

preprocess(text_ls, data='train')[source]#

Creates words from a list of sentences; each word is formed by prefixing every character other than the first with ##.

Parameters:
  • text_ls (list) – List of sentences

  • data (str, optional) – {‘train’, ‘test’} Type of data, defaults to “train”

Returns:

List of words from all the sentences in one list. Each word is a list of characters

Return type:

list
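
A tiny sketch of the word form preprocess produces, with ## prefixes on every non-initial character:

    def to_wordpiece(word):
        return [word[0]] + ["##" + c for c in word[1:]]

    print(to_wordpiece("word"))  # ['w', '##o', '##r', '##d']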

run_merge(corpus)[source]#

Updates the vocabulary until the desired vocabulary size is reached.

Parameters:

corpus (list) – List of words

Returns:

Final updated list of words

Return type:

list

transform(text_ls)[source]#

Transforms list of sentences into words

Parameters:

text_ls (list) – List of sentences

Returns:

List of words

Return type:

list

src.preprocess.utils.preprocess_text(text, operations=None)[source]#

Preprocesses Text

Parameters:
  • text (str) – String to preprocess

  • operations (list, optional) – List of operations from {‘lcase’, ‘remalpha’, ‘stopwords’, ‘stemming’}, defaults to None

Returns:

Preprocessed text

Return type:

str
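
A hedged sketch of the four operations using nltk; the real function's exact order and tokenization may differ:

    import re

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    nltk.download("stopwords", quiet=True)

    def preprocess_text_sketch(text, operations=None):
        operations = operations or []
        if "lcase" in operations:
            text = text.lower()
        if "remalpha" in operations:
            text = re.sub(r"[^a-zA-Z\s]", " ", text)  # keep letters and spaces
        words = text.split()
        if "stopwords" in operations:
            stops = set(stopwords.words("english"))
            words = [w for w in words if w not in stops]
        if "stemming" in operations:
            stemmer = PorterStemmer()
            words = [stemmer.stem(w) for w in words]
        return " ".join(words)

    print(preprocess_text_sketch(
        "The cats are running!",
        ["lcase", "remalpha", "stopwords", "stemming"],
    ))  # 'cat run'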

Module contents#