Documentation#

Docs

Installation#

Install using pip#

pip install ScratchNLP

Install Manually for development#

Clone the repo

gh repo clone shanmukh05/scratch_nlp
cd scratch_nlp

Install dependencies

pip install -r requirements.txt

Features#

  • Algorithms

    • Bag of Words

    • Ngram

    • TF-IDF

    • Hidden Markov Model

    • Word2Vec

    • GloVe

    • RNN (Many to One)

    • LSTM (One to Many)

    • GRU (Many to Many Synced)

    • Seq2Seq + Attention (Many to Many)

    • Transformer

    • BERT

    • GPT-2

  • Tokenization

    • BytePair Encoding

    • WordPiece Tokenizer

  • Metrics

    • BLEU

    • ROUGE (-N, -L, -S)

    • Perplexity

    • METEOR

    • CIDEr

  • Datasets

    • IMDB Reviews Dataset

    • Flickr Dataset

    • NLTK POS Datasets (treebank, brown, conll2000)

    • SQuAD QA Dataset

    • Genius Lyrics Dataset

    • LAMBADA Dataset

    • Wiki en Dataset

    • English to Telugu Translation Dataset

  • Tasks

    • Sentiment Classification

    • POS Tagging

    • Image Captioning

    • Machine Translation

    • Question Answering

    • Text Generation

Implementation Details#

| Algorithm | Task | Tokenization | Output | Dataset |
|---|---|---|---|---|
| BOW | Text Representation | Preprocessed words | Text Label, Vector npy files; Top K Vocab Frequency Histogram png; Vocab frequency csv; Wordcloud png | IMDB Reviews |
| Ngram | Text Representation | Preprocessed words | Text Label, Vector npy files; Top K Vocab Frequency Histogram png; Top K ngrams Piechart png; Vocab frequency csv; Wordcloud png | IMDB Reviews |
| TF-IDF | Text Representation | Preprocessed words | Text Label, Vector npy files; TF PCA Pairplot png; TF-IDF PCA Pairplot png; IDF csv | IMDB Reviews |
| HMM | POS Tagging | Preprocessed words | Data Analysis png (sentence length, POS tags count); Emission Matrix TSNE html; Emission Matrix csv; Test Predictions confusion matrix, classification report png; Transition Matrix csv, png | NLTK Treebank |
| Word2Vec | Text Representation | Preprocessed words | Best Model pt; Training History json; Word Embeddings TSNE html | IMDB Reviews |
| GloVe | Text Representation | Preprocessed words | Best Model pt; Training History json; Word Embeddings TSNE html; Top K Co-occurrence Matrix png | IMDB Reviews |
| RNN | Sentiment Classification | Preprocessed words | Best Model pt; Training History json; Word Embeddings TSNE html; Confusion Matrix png; Training History png | IMDB Reviews |
| LSTM | Image Captioning | Preprocessed words | Best Model pt; Training History json; Word Embeddings TSNE html; Training History png | Flickr 8k |
| GRU | POS Tagging | Preprocessed words | Best Model pt; Training History json; Word Embeddings TSNE html; Confusion Matrix png; Test predictions csv; Training History png | NLTK Treebank, Brown, Conll2000 |
| Seq2Seq + Attention | Machine Translation | Tokenization | Best Model pt; Training History json; Source, Target Word Embeddings TSNE html; Test predictions csv; Training History png | English to Telugu Translation |
| Transformer | Lyrics Generation | BytePairEncoding | Best Model pt; Training History json; Token Embeddings TSNE html; Test predictions csv; Training History png | Genius Lyrics |
| BERT | NSP Pretraining, QA Finetuning | WordPiece | Best Model pt (pretrain, finetune); Training History json (pretrain, finetune); Token Embeddings TSNE html; Finetune Test predictions csv; Training History png (pretrain, finetune) | Wiki en, SQuAD v1 |
| GPT-2 | Sentence Completion | BytePairEncoding | Best Model pt; Training History json; Token Embeddings TSNE html; Test predictions csv; Training History png | LAMBADA |

Examples#

Run Train and Inference directly through import

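import yaml
from scratch_nlp.src.core.gpt import gpt

# Path to the YAML config for the chosen algorithm
config_path = "<config_path>"

with open(config_path, "r") as stream:
    config_dict = yaml.safe_load(stream)

# Use a separate name so the imported gpt module is not shadowed
model = gpt.GPT(config_dict)
model.run()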

Run through CLI

cd src
python main.py --config_path '<config_path>' --algo '<algo name>' --log_folder '<output folder>'
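
For readers who want to see how the three flags in the command above fit together, here is a minimal sketch of an argument parser that accepts them. It is an illustration only and is not taken from the project's main.py: the function name build_parser, the choice to make every flag required, and the sample values ("config.yaml", "gpt", "logs") are all assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the three flags shown in the CLI command.

    Hypothetical helper: the real main.py may define its flags differently.
    """
    parser = argparse.ArgumentParser(
        description="Run one algorithm from a YAML config (illustrative sketch)."
    )
    # Assumption: all three flags are mandatory.
    parser.add_argument("--config_path", required=True, help="Path to the YAML config file")
    parser.add_argument("--algo", required=True, help="Name of the algorithm to run")
    parser.add_argument("--log_folder", required=True, help="Folder where outputs are written")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch
# can be run anywhere; the values below are placeholders.
args = build_parser().parse_args(
    ["--config_path", "config.yaml", "--algo", "gpt", "--log_folder", "logs"]
)
print(args.config_path, args.algo, args.log_folder)
```

Passing an explicit list to parse_args keeps the example independent of how the script is launched; a real entry point would normally call parse_args() with no arguments so that it reads the command line.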

Contributing#

Contributions are always welcome!

See CONTRIBUTING.md for ways to get started.

Acknowledgements#

I referred to many online resources while creating this project and am listing them all in RESOURCES.md. Thanks to everyone who created those blogs, code, and datasets.

Thanks to the CS224N course, which gave me the motivation to start this project.

About Me#

I am Shanmukha Sainath, working as an AI Engineer at KLA Corporation. I completed my Bachelor's in the Department of Electronics and Electrical Communication Engineering at IIT Kharagpur, with a Minor in Computer Science Engineering and a Micro in Artificial Intelligence and Applications.

Connect with me#


Lessons Learned#

Most of what is in this project was new to me. These are my major learnings from creating it:

  • NLP Algorithms

  • Research paper Implementation

  • Designing Project structure

  • Documentation

  • GitHub pages

  • PIP packaging

License#

MIT License

Feedback#

If you have any feedback, please reach out to me at venkatashanmukhasainathg@gmail.com