Documentation#
Installation#
Install using pip#
pip install ScratchNLP
Install Manually for development#
Clone the repo
gh repo clone shanmukh05/scratch_nlp
Install dependencies
pip install -r requirements.txt
Features#
Algorithms
Bag of Words
Ngram
TF-IDF
Hidden Markov Model
Word2Vec
GloVe
RNN (Many to One)
LSTM (One to Many)
GRU (Many to Many Synced)
Seq2Seq + Attention (Many to Many)
Transformer
BERT
GPT-2
Tokenization
BytePair Encoding
WordPiece Tokenizer
Metrics
BLEU
ROUGE (-N, -L, -S)
Perplexity
METEOR
CIDER
Datasets
IMDB Reviews Dataset
Flickr Dataset
NLTK POS Datasets (treebank, brown, conll2000)
SQuAD QA Dataset
Genius Lyrics Dataset
LAMBADA Dataset
Wiki en Dataset
English to Telugu Translation Dataset
Tasks
Sentiment Classification
POS Tagging
Image Captioning
Machine Translation
Question Answering
Text Generation
Implementation Details#
Algorithm | Task | Tokenization | Output | Dataset |
---|---|---|---|---|
BOW | Text Representation | Preprocessed words | | IMDB Reviews |
Ngram | Text Representation | Preprocessed words | | IMDB Reviews |
TF-IDF | Text Representation | Preprocessed words | | IMDB Reviews |
HMM | POS Tagging | Preprocessed words | | NLTK Treebank |
Word2Vec | Text Representation | Preprocessed words | | IMDB Reviews |
GloVe | Text Representation | Preprocessed words | | IMDB Reviews |
RNN | Sentiment Classification | Preprocessed words | | IMDB Reviews |
LSTM | Image Captioning | Preprocessed words | | Flickr 8k |
GRU | POS Tagging | Preprocessed words | | NLTK Treebank, Brown, Conll2000 |
Seq2Seq + Attention | Machine Translation | Tokenization | | English to Telugu Translation |
Transformer | Lyrics Generation | BytePairEncoding | | Genius Lyrics |
BERT | NSP Pretraining, QA Finetuning | WordPiece | | Wiki en, SQuAD v1 |
GPT-2 | Sentence Completion | BytePairEncoding | | LAMBADA |
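To make the first row of the table concrete, here is a minimal sketch of a Bag of Words representation built from preprocessed words. It is an illustration only and not the package's implementation: the function names `build_vocab` and `bag_of_words`, the whitespace tokenization, and the two sample sentences are all assumptions made for this example.

```python
from collections import Counter


def build_vocab(docs):
    """Map each distinct word to a column index, in first-seen order."""
    vocab = {}
    for doc in docs:
        for word in doc.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def bag_of_words(doc, vocab):
    """Return a count vector for one document; unknown words are ignored."""
    counts = Counter(doc.lower().split())
    vector = [0] * len(vocab)
    for word, count in counts.items():
        if word in vocab:
            vector[vocab[word]] = count
    return vector


# Hypothetical sample reviews, standing in for preprocessed IMDB text.
docs = ["the movie was good", "the movie was bad bad"]
vocab = build_vocab(docs)
vectors = [bag_of_words(doc, vocab) for doc in docs]
```

Each document becomes a fixed-length vector of word counts, so word order is discarded and only frequencies remain.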
Examples#
Run training and inference directly through import
import yaml
from scratch_nlp.src.core.gpt import gpt
# config_path is the path to your YAML config file
with open(config_path, "r") as stream:
    config_dict = yaml.safe_load(stream)
model = gpt.GPT(config_dict)
model.run()
Run through CLI
cd src
python main.py --config_path '<config_path>' --algo '<algo name>' --log_folder '<output folder>'
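The three flags in the command above could be parsed along the following lines. This is a hedged sketch using the standard `argparse` module, not the actual contents of `main.py`: the `build_parser` helper, the help strings, and the choice to make every flag required are assumptions, and the argument values passed at the end are placeholders.

```python
import argparse


def build_parser():
    """Build a parser for the three flags shown in the CLI command."""
    parser = argparse.ArgumentParser(description="Run a ScratchNLP algorithm")
    # Whether the real CLI marks these as required is an assumption.
    parser.add_argument("--config_path", required=True,
                        help="Path to the YAML config file")
    parser.add_argument("--algo", required=True,
                        help="Name of the algorithm to run")
    parser.add_argument("--log_folder", required=True,
                        help="Folder where outputs are written")
    return parser


# Placeholder values, used only to show the parsed result.
args = build_parser().parse_args(
    ["--config_path", "config.yaml", "--algo", "gpt", "--log_folder", "logs"]
)
```

After parsing, the values are available as `args.config_path`, `args.algo`, and `args.log_folder`.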
Contributing#
Contributions are always welcome!
See CONTRIBUTING.md for ways to get started.
Acknowledgements#
I have referred to many online resources to create this project. I’m adding all the resources to RESOURCES.md. Thanks to everyone who created those blogs, code, and datasets.
Thanks to the CS224N course, which gave me the motivation to start this project.
About Me#
I am Shanmukha Sainath, working as an AI Engineer at KLA Corporation. I completed my Bachelor's degree in the Department of Electronics and Electrical Communication Engineering, with a Minor in Computer Science Engineering and a Micro in Artificial Intelligence and Applications, at IIT Kharagpur.
Connect with me#
Lessons Learned#
Most of what is in this project was new to me. These are my major learnings from creating it:
NLP Algorithms
Research paper Implementation
Designing Project structure
Documentation
GitHub pages
PIP packaging
License#
Feedback#
If you have any feedback, please reach out to me at venkatashanmukhasainathg@gmail.com