Documentation
=============

:doc:`Docs `

Installation
============

Install using pip
-----------------

.. code:: bash

   pip install ScratchNLP

Install Manually for development
--------------------------------

Clone the repo

.. code:: bash

   gh repo clone shanmukh05/scratch_nlp

Install dependencies

.. code:: bash

   pip install -r requirements.txt

Features
========

-  Algorithms

   -  Bag of Words
   -  Ngram
   -  TF-IDF
   -  Hidden Markov Model
   -  Word2Vec
   -  GloVe
   -  RNN (Many to One)
   -  LSTM (One to Many)
   -  GRU (Many to Many Synced)
   -  Seq2Seq + Attention (Many to Many)
   -  Transformer
   -  BERT
   -  GPT-2

-  Tokenization

   -  BytePair Encoding
   -  WordPiece Tokenizer

-  Metrics

   -  BLEU
   -  ROUGE (-N, -L, -S)
   -  Perplexity
   -  METEOR
   -  CIDEr

-  Datasets

   -  IMDB Reviews Dataset
   -  Flickr Dataset
   -  NLTK POS Datasets (treebank, brown, conll2000)
   -  SQuAD QA Dataset
   -  Genius Lyrics Dataset
   -  LAMBADA Dataset
   -  Wiki en dataset
   -  English to Telugu Translation Dataset

-  Tasks

   -  Sentiment Classification
   -  POS Tagging
   -  Image Captioning
   -  Machine Translation
   -  Question Answering
   -  Text Generation

Implementation Details
----------------------


.. list-table::
   :header-rows: 1

   * - Algorithm
     - Task
     - Tokenization
     - Output
     - Dataset
   * - BOW
     - Text Representation
     - Preprocessed words
     - - Text Label, Vector npy files
       - Top K Vocab Frequency Histogram png
       - Vocab frequency csv
       - Wordcloud png
     - IMDB Reviews
   * - Ngram
     - Text Representation
     - Preprocessed words
     - - Text Label, Vector npy files
       - Top K Vocab Frequency Histogram png
       - Top K ngrams Piechart png
       - Vocab frequency csv
       - Wordcloud png
     - IMDB Reviews
   * - TF-IDF
     - Text Representation
     - Preprocessed words
     - - Text Label, Vector npy files
       - TF PCA Pairplot png
       - TF-IDF PCA Pairplot png
       - IDF csv
     - IMDB Reviews
   * - HMM
     - POS Tagging
     - Preprocessed words
     - - Data Analysis png (sent len, POS tags count)
       - Emission Matrix TSNE html
       - Emission matrix csv
       - Test Predictions conf matrix, clf report png
       - Transition Matrix csv, png
     - NLTK Treebank
   * - Word2Vec
     - Text Representation
     - Preprocessed words
     - - Best Model pt
       - Training History json
       - Word Embeddings TSNE html
     - IMDB Reviews
   * - GloVe
     - Text Representation
     - Preprocessed words
     - - Best Model pt
       - Training History json
       - Word Embeddings TSNE html
       - Top K Co-occurrence Matrix png
     - IMDB Reviews
   * - RNN
     - Sentiment Classification
     - Preprocessed words
     - - Best Model pt
       - Training History json
       - Word Embeddings TSNE html
       - Confusion Matrix png
       - Training History png
     - IMDB Reviews
   * - LSTM
     - Image Captioning
     - Preprocessed words
     - - Best Model pt
       - Training History json
       - Word Embeddings TSNE html
       - Training History png
     - Flickr 8k
   * - GRU
     - POS Tagging
     - Preprocessed words
     - - Best Model pt
       - Training History json
       - Word Embeddings TSNE html
       - Confusion Matrix png
       - Test predictions csv
       - Training History png
     - NLTK Treebank, Brown, Conll2000
   * - Seq2Seq + Attention
     - Machine Translation
     - Tokenization
     - - Best Model pt
       - Training History json
       - Source, Target Word Embeddings TSNE html
       - Test predictions csv
       - Training History png
     - English to Telugu Translation
   * - Transformer
     - Lyrics Generation
     - BytePairEncoding
     - - Best Model pt
       - Training History json
       - Token Embeddings TSNE html
       - Test predictions csv
       - Training History png
     - Genius Lyrics
   * - BERT
     - NSP Pretraining, QA Finetuning
     - WordPiece
     - - Best Model pt (pretrain, finetune)
       - Training History json (pretrain, finetune)
       - Token Embeddings TSNE html
       - Finetune Test predictions csv
       - Training History png (pretrain, finetune)
     - Wiki en, SQuAD v1
   * - GPT-2
     - Sentence Completion
     - BytePairEncoding
     - - Best Model pt
       - Training History json
       - Token Embeddings TSNE html
       - Test predictions csv
       - Training History png
     - LAMBADA

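
To make the BOW row of the table more concrete, here is a minimal sketch of how a bag-of-words text representation and a vocabulary frequency count can be computed. It is an illustration only and not the package's implementation: the function names ``build_vocab`` and ``bow_vector``, the ``top_k`` parameter, the whitespace tokenization, and the toy sentences are all assumptions made for this example.

```python
from collections import Counter


def build_vocab(docs, top_k):
    """Return the top_k most frequent words plus the full frequency table."""
    # Hypothetical preprocessing: lowercase and split on whitespace.
    counts = Counter(word for doc in docs for word in doc.lower().split())
    # Sort by descending frequency, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_k]], counts


def bow_vector(doc, vocab):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocab]


# Toy corpus, invented for illustration.
docs = ["the movie was great", "the plot was weak", "great cast great plot"]
vocab, freqs = build_vocab(docs, top_k=4)
vectors = [bow_vector(doc, vocab) for doc in docs]

print(vocab)    # vocabulary words, most frequent first
print(vectors)  # one count vector per document
```

Each document becomes a fixed-length count vector over the chosen vocabulary, and the frequency table corresponds in spirit to the vocab frequency output listed in the table.
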
Examples
========

Run Train and Inference directly through import

.. code:: python

   import yaml

   from scratch_nlp.src.core.gpt import gpt

   config_path = ""  # set to the path of your YAML config file
   with open(config_path, "r") as stream:
       config_dict = yaml.safe_load(stream)

   # Build the model from the config, then run training and inference.
   model = gpt.GPT(config_dict)
   model.run()

Run through CLI

.. code:: bash

   cd src
   python main.py --config_path '' --algo '' --log_folder ''

Contributing
============

Contributions are always welcome!

See `CONTRIBUTING.md `__ for ways to get started.

Acknowledgements
================

I referred to many online resources while creating this project, and I am adding all of them to `RESOURCES.md `__. Thanks to everyone who created those blogs, code, and datasets.

Thanks to the `CS224N `__ course, which gave me the motivation to start this project.

About Me
========

I am Shanmukha Sainath, working as an AI Engineer at KLA Corporation. I completed my Bachelor's in the Department of Electronics and Electrical Communication Engineering at IIT Kharagpur, with a Minor in Computer Science Engineering and a Micro in Artificial Intelligence and Applications.

Connect with me
---------------

.. figure:: https://raw.githubusercontent.com/shanmukh05/scratch_nlp/main/assets/connect.png
   :alt: Logo
   :width: 200px
   :target: https://linktr.ee/shanmukh05

Lessons Learned
===============

Most of the things in this project were new to me. These are my major learnings from creating it:

-  NLP Algorithms
-  Research paper Implementation
-  Designing Project structure
-  Documentation
-  GitHub pages
-  PIP packaging

License
=======

|MIT License|

Feedback
========

If you have any feedback, please reach out to me at venkatashanmukhasainathg@gmail.com

.. |MIT License| image:: https://img.shields.io/badge/License-MIT-green.svg
   :target: https://choosealicense.com/licenses/mit/