dod-advana/gamechanger-ml

GC - Machine Learning

Table of Contents

  1. Directory
  2. Development Rules
  3. Train Models
  4. ML API
  5. Helpful Flags For API
  6. FAQ
  7. Pull Requests

Directory

├── gamechangerml
│   ├── api
│   │   ├── README.md
│   │   ├── __init__.py
│   │   ├── docker-compose.override.yml
│   │   ├── docker-compose.yml
│   │   ├── fastapi
│   │   ├── getInitModels.py
│   │   ├── kube
│   │   ├── logs
│   │   ├── tests
│   │   └── utils
│   ├── configs
│   ├── corpus
│   ├── data
│   │   ├── features
│   │   │   ├── abbcounts.json
│   │   │   ├── abbreviations.csv
│   │   │   ├── abbreviations.json
│   │   │   ├── agencies.csv
│   │   │   ├── classifier_entities.csv
│   │   │   ├── combined_entities.csv
│   │   │   ├── corpus_doctypes.csv
│   │   │   ├── enwiki_vocab_min200.txt
│   │   │   ├── generated_files
│   │   │   │   ├── __init__.py
│   │   │   │   ├── common_orgs.csv
│   │   │   │   ├── corpus_meta.csv
│   │   │   │   └── prod_test_data.csv
│   │   │   ├── popular_documents.csv
│   │   │   ├── topics_wiki.csv
│   │   │   └── word-freq-corpus-20201101.txt
│   │   ├── ltr
│   │   ├── nltk_data
│   │   ├── test_data
│   │   ├── training
│   │   │   └── sent_transformer
│   │   ├── user_data
│   │   │   ├── gold_standard.csv
│   │   │   ├── matamo_feedback
│   │   │   │   ├── Feedback.csv
│   │   │   │   └── matamo_feedback.csv
│   │   │   └── search_history
│   │   │       └── SearchPdfMapping.csv
│   │   └── validation
│   │       ├── domain
│   │       │   ├── query_expansion
│   │       │   ├── question_answer
│   │       │   └── sent_transformer
│   │       └── original
│   │           ├── msmarco_1k
│   │           ├── multinli_1.0
│   │           └── squad2.0
│   ├── mlflow
│   ├── models
│   │   ├── ltr
│   │   ├── msmarco_index
│   │   ├── qexp_20211001
│   │   ├── sent_index_20211108
│   │   ├── topic_models
│   │   └── transformers
│   │       ├── bert-base-cased-squad2
│   │       ├── distilbart-mnli-12-3
│   │       ├── msmarco-distilbert-base-v2
│   ├── scripts
│   ├── src
│   │   ├── featurization
│   │   │   ├── abbreviation.py
│   │   │   ├── abbreviations_utils.py
│   │   │   ├── extract_improvement
│   │   │   ├── generated_fts.py
│   │   │   ├── keywords
│   │   │   ├── make_meta.py
│   │   │   ├── rank_features
│   │   │   ├── ref_list.py
│   │   │   ├── ref_utils.py
│   │   │   ├── responsibilities.py
│   │   │   ├── summary.py
│   │   │   ├── table.py
│   │   │   ├── term_extract
│   │   │   ├── test_hf_ner.py
│   │   │   ├── tests
│   │   │   ├── topic_modeling.py
│   │   │   └── word_sim.py
│   │   ├── model_testing
│   │   ├── search
│   │   │   ├── QA
│   │   │   ├── embed_reader
│   │   │   ├── query_expansion
│   │   │   ├── ranking
│   │   │   ├── semantic
│   │   │   └── sent_transformer
│   │   ├── text_classif
│   │   ├── text_handling
│   │   └── utilities
│   ├── stresstest
│   ├── train

Development Rules

  • Everything in gamechangerml/src should be independent of code outside that directory; it should not need to import from dataPipeline, common, etc.

Configs

  • Config files go in gamechangerml/configs. When you add a new class, import it in gamechangerml/configs/__init__.py.
  • File paths in gamechangerml/configs/* should be relative to gamechangerml and used only for local testing. Feel free to change them on your local machine, but do not commit system-specific paths to the repository.
  • A config class (i.e., from gamechangerml/configs/*) should not be required as an input parameter to a function. However, a config class attribute can be used to provide a parameter to a function: call foo(path=Config.path) rather than foo(Config).
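
The last rule can be sketched as follows. This is a minimal illustration, not code from the repository: ExampleConfig, its model_dir attribute, and index_path are hypothetical names chosen for the example.

```python
from pathlib import PurePosixPath


class ExampleConfig:
    """Hypothetical config class, standing in for one in gamechangerml/configs."""

    model_dir = "gamechangerml/models"  # assumed value, relative to the repository root


def index_path(model_dir: str, index_name: str) -> str:
    """Return the path of a named index under model_dir.

    Args:
        model_dir (str): Directory that holds model artifacts.
        index_name (str): Name of the index subdirectory.

    Returns:
        str: The joined path, using forward slashes.
    """
    return str(PurePosixPath(model_dir) / index_name)


# Preferred: pass the attribute, so index_path works without any config class.
preferred = index_path(model_dir=ExampleConfig.model_dir, index_name="sent_index")

# Discouraged: a signature like index_path(config) would tie the function
# to the config class and make it harder to test or reuse.
```

Because index_path only sees plain values, it can be called from tests or scripts with any directory, without importing a config class.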

What Can Be Stored On GitHub?

  • Models and large files should NOT be stored on GitHub.
  • Data should NOT be stored on GitHub. There is a script in the gamechangerml/scripts folder to download a corpus from S3.

Use Best Practices

  • Code should be modular, broken down into the smallest logical pieces, and placed in the most logical subfolder.
  • All classes, functions, etc. should have clear, concise, and consistent docstrings.
    • Function docstrings should include:

      • A short description
      • Any important remarks
      • Parameter types, defaults, and descriptions
      • Return types and descriptions

      Example:

      def say(words, loud=False):
        """Make the animal say words.
      
        Args:
          words (str): Words for the animal to say.
          loud (bool): True to make the animal say the words loudly, False to 
            make the animal say the words in a normal tone. Default is False.
      
        Returns:
          None
        """
  • Include a maximum of 1 class per file.
  • Include README.md files that explain what the code does, why it exists, and how it is used.

Getting Started

To use gamechangerml as a Python module:

  • Run pip install . from the repository root.
  • You should now be able to import gamechangerml from any Python session in that environment.

Train Models

  1. Set up your environment and make any changes to configs:
     • source ./gamechangerml/setup_env.sh DEV
  2. Ensure your AWS environment is set up (you have a default profile).
  3. Get dependencies:
     • source ./gamechangerml/scripts/download_dependencies.sh
  4. For query expansion:
     • python -m gamechangerml.train.scripts.run_train_models --flag {MODEL_NAME_SUFFIX} --saveremote {True or False} --model_dest {FILE_PATH_MODEL_OUTPUT} --corpus {CORPUS_DIR}
  5. For sentence embeddings:
     • python -m gamechangerml.train.scripts.create_embeddings -c {CORPUS LOCATION} --gpu True --em msmarco-distilbert-base-v2

ML API

  1. Set up your environment and make any changes to configs:
     • source ./gamechangerml/setup_env.sh DEV
  2. Ensure your AWS environment is set up (you have a default profile).
  3. Dependencies will be downloaded and extracted automatically.
  4. cd gamechangerml/api
  5. docker-compose build
  6. docker-compose up
  7. Visit localhost:5000/docs

Helpful Flags For API

  • export CONTAINER_RELOAD=True to reload the container on code changes during development
  • export DOWNLOAD_DEP=True to get models and other dependencies from S3
  • export MODEL_LOAD=False to skip loading models on API start (for development only)
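
These flags are set as strings in the shell, so code that reads them has to turn text like "True" or "False" into a boolean. The sketch below shows one common way to do that. It is not taken from the API's source: the env_flag helper, the set of accepted spellings, and the defaults passed in are all assumptions made for illustration.

```python
import os


def env_flag(name: str, default: bool) -> bool:
    """Read an environment variable such as MODEL_LOAD as a boolean.

    Args:
        name (str): Name of the environment variable.
        default (bool): Value to use when the variable is not set.

    Returns:
        bool: True if the variable is set to "1", "true", or "yes"
            (case-insensitive), False for any other set value.
    """
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes"}


# Example: mimic `export MODEL_LOAD=False` before reading the flag.
os.environ["MODEL_LOAD"] = "False"
load_models = env_flag("MODEL_LOAD", default=True)  # assumed default
```

Note that a plain bool("False") would be True in Python, which is why the comparison against known spellings is needed.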

FAQ

  • I get an error with Redis on API start.
    • export ENV_TYPE=DEV
  • Do I need to train models to use the API?
    • No, you can use the pretrained models within the dependencies.
  • The API is crashing when trying to load the models.
    • Your machine likely does not have enough resources (RAM or CPU) to load all models. Try removing some models from the models folder.
  • Do I need a machine with a GPU?
    • No, but a GPU will make training and inference faster.
  • What if I can't download the dependencies because I am external?
    • We are working on making models publicly available. In the meantime, you can download pretrained transformers from HuggingFace and place them in the models/transformers directory, which enables some of the API's functionality. Even without any models, some functionality, such as text extraction, is still available.
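
For external users, fetching a pretrained transformer into models/transformers might look like the sketch below. It assumes the third-party huggingface_hub package is installed and uses its snapshot_download function. The helper names and the default models_root are hypothetical, and the repository ID in the usage comment is an assumption: check the HuggingFace Hub for the model that matches the directory name you need.

```python
from pathlib import Path


def transformer_target_dir(models_root: str, model_name: str) -> Path:
    """Return the directory a transformer should be placed in."""
    return Path(models_root) / "transformers" / model_name


def download_transformer(repo_id: str, models_root: str = "gamechangerml/models") -> Path:
    """Download a HuggingFace model snapshot into models/transformers.

    Args:
        repo_id (str): HuggingFace Hub repository ID, e.g. "org/model-name".
        models_root (str): Path to the models folder (assumed default).

    Returns:
        Path: Directory containing the downloaded model files.
    """
    # Imported here so the path helper above works without the package.
    from huggingface_hub import snapshot_download

    # Use the part after the slash as the local directory name.
    target = transformer_target_dir(models_root, repo_id.split("/")[-1])
    target.mkdir(parents=True, exist_ok=True)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target


# Usage (requires network access; the repo ID is an assumed example):
# download_transformer("deepset/bert-base-cased-squad2")
```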

Pull Requests

Please provide:

  1. Description - the purpose of the change and the features it adds (e.g., bugfix, added upload capability to model, model improvement).
  2. Reviewer Test - how to test the change manually, and whether it is deployed on a dev/test server (if applicable). For example: hit the POST endpoint /search with payload {"query": "military"}.
  3. Unit/Integration tests - a screenshot or copied output of the unit tests from GC_ML_TESTS_119, plus any other applicable tests or metrics.
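
A reviewer test like the one in item 2 can be made copy-pasteable with a short script. The sketch below uses only the Python standard library. It assumes the API is running locally on port 5000, as in the ML API section, and treats /search and its payload as the illustrative example given above rather than a documented endpoint; the function names are hypothetical.

```python
import json
import urllib.request


def build_search_request(base_url: str, query: str) -> urllib.request.Request:
    """Build a POST request to the /search endpoint with a JSON payload.

    Args:
        base_url (str): Base URL of the API, e.g. "http://localhost:5000".
        query (str): Search text to send in the payload.

    Returns:
        urllib.request.Request: The prepared request (not yet sent).
    """
    payload = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/search",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> str:
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Usage (requires the API to be running):
# print(send(build_search_request("http://localhost:5000", "military")))
```

Including a snippet like this in the pull request lets the reviewer reproduce the manual test without guessing at headers or payload shape.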