├── gamechangerml
│   ├── api
│   │   ├── README.md
│   │   ├── __init__.py
│   │   ├── docker-compose.override.yml
│   │   ├── docker-compose.yml
│   │   ├── fastapi
│   │   ├── getInitModels.py
│   │   ├── kube
│   │   ├── logs
│   │   ├── tests
│   │   └── utils
│   ├── configs
│   ├── corpus
│   ├── data
│   │   ├── features
│   │   │   ├── abbcounts.json
│   │   │   ├── abbreviations.csv
│   │   │   ├── abbreviations.json
│   │   │   ├── agencies.csv
│   │   │   ├── classifier_entities.csv
│   │   │   ├── combined_entities.csv
│   │   │   ├── corpus_doctypes.csv
│   │   │   ├── enwiki_vocab_min200.txt
│   │   │   ├── generated_files
│   │   │   │   ├── __init__.py
│   │   │   │   ├── common_orgs.csv
│   │   │   │   ├── corpus_meta.csv
│   │   │   │   └── prod_test_data.csv
│   │   │   ├── popular_documents.csv
│   │   │   ├── topics_wiki.csv
│   │   │   └── word-freq-corpus-20201101.txt
│   │   ├── ltr
│   │   ├── nltk_data
│   │   ├── test_data
│   │   ├── training
│   │   │   └── sent_transformer
│   │   ├── user_data
│   │   │   ├── gold_standard.csv
│   │   │   ├── matamo_feedback
│   │   │   │   ├── Feedback.csv
│   │   │   │   └── matamo_feedback.csv
│   │   │   └── search_history
│   │   │       └── SearchPdfMapping.csv
│   │   └── validation
│   │       ├── domain
│   │       │   ├── query_expansion
│   │       │   ├── question_answer
│   │       │   └── sent_transformer
│   │       └── original
│   │           ├── msmarco_1k
│   │           ├── multinli_1.0
│   │           └── squad2.0
│   ├── mlflow
│   ├── models
│   │   ├── ltr
│   │   ├── msmarco_index
│   │   ├── qexp_20211001
│   │   ├── sent_index_20211108
│   │   ├── topic_models
│   │   └── transformers
│   │       ├── bert-base-cased-squad2
│   │       ├── distilbart-mnli-12-3
│   │       └── msmarco-distilbert-base-v2
│   ├── scripts
│   ├── src
│   │   ├── featurization
│   │   │   ├── abbreviation.py
│   │   │   ├── abbreviations_utils.py
│   │   │   ├── extract_improvement
│   │   │   ├── generated_fts.py
│   │   │   ├── keywords
│   │   │   ├── make_meta.py
│   │   │   ├── rank_features
│   │   │   ├── ref_list.py
│   │   │   ├── ref_utils.py
│   │   │   ├── responsibilities.py
│   │   │   ├── summary.py
│   │   │   ├── table.py
│   │   │   ├── term_extract
│   │   │   ├── test_hf_ner.py
│   │   │   ├── tests
│   │   │   ├── topic_modeling.py
│   │   │   └── word_sim.py
│   │   ├── model_testing
│   │   ├── search
│   │   │   ├── QA
│   │   │   ├── embed_reader
│   │   │   ├── query_expansion
│   │   │   ├── ranking
│   │   │   ├── semantic
│   │   │   └── sent_transformer
│   │   ├── text_classif
│   │   ├── text_handling
│   │   └── utilities
│   ├── stresstest
│   └── train
- Everything in `gamechangerml/src` should be independent of things outside of that structure (should not need to import from dataPipeline, common, etc).
- Config files go in `gamechangerml/configs`. When you add a new class, import it in `gamechangerml/configs/__init__.py`.
- File paths in `gamechangerml/configs/*` should be relative to `gamechangerml` and only used for local testing purposes. Feel free to change them on your local machine, but do not commit system-specific paths to the repository.
- A config class (i.e., from `gamechangerml/configs/*`) should not be required as an input parameter to a function. However, a config class attribute can be used to provide parameters to a function (`foo(path=Config.path)`, rather than `foo(Config)`).
- Models and large files should NOT be stored on GitHub.
- Data should NOT be stored on GitHub. There is a script in the `gamechangerml/scripts` folder to download a corpus from S3.
- Code should be modular, broken down into smallest logical pieces, and placed in the most logical subfolder.
- All classes, functions, etc. should have clear, concise, and consistent docstrings.
- Function docstrings should include:
  - A short description
  - Any important remarks
  - Parameter types, defaults, and descriptions
  - Return types and descriptions

  Example:

      def say(words, loud=False):
          """Make the animal say words.

          Args:
              words (str): Words for the animal to say.
              loud (bool): True to make the animal say the words loudly,
                  False to make the animal say the words in a normal tone.
                  Default is False.

          Returns:
              None
          """
- Include a maximum of 1 class per file.
- Include README.md files that contain what, why, and how code is used.
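
As a minimal sketch of the config-attribute and docstring rules above: `ExampleConfig`, `MODEL_DIR`, and `resolve_model_path` are hypothetical names made up for illustration and are not part of the repository. The function takes a plain value as its parameter, and the caller supplies that value from a config class attribute.

```python
from pathlib import Path


class ExampleConfig:
    """Hypothetical config class, standing in for one in gamechangerml/configs."""

    # Assumed value for illustration; kept relative, per the path rule above.
    MODEL_DIR = "models/transformers"


def resolve_model_path(model_dir, model_name="msmarco-distilbert-base-v2"):
    """Build the path to a model directory.

    Args:
        model_dir (str): Directory that holds the models.
        model_name (str): Name of the model subdirectory. Default is
            "msmarco-distilbert-base-v2".

    Returns:
        pathlib.Path: Path to the requested model directory.
    """
    return Path(model_dir) / model_name


# Preferred: pass the attribute, so the function does not depend on the config class.
model_path = resolve_model_path(model_dir=ExampleConfig.MODEL_DIR)

# Discouraged: resolve_model_path(ExampleConfig) would tie the function to the config.
```

Keeping the signature free of config classes means the function can be tested or reused with any path, without importing anything from `gamechangerml/configs`.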
- Run `pip install .`. You should now be able to import gamechangerml anywhere Python is available.
- Set up your environment, and make any changes to configs:
  `source ./gamechangerml/setup_env.sh DEV`
- Ensure your AWS environment is set up (you have a default profile).
- Get dependencies:
  `source ./gamechangerml/scripts/download_dependencies.sh`
- For query expansion:
  `python -m gamechangerml.train.scripts.run_train_models --flag {MODEL_NAME_SUFFIX} --saveremote {True or False} --model_dest {FILE_PATH_MODEL_OUTPUT} --corpus {CORPUS_DIR}`
- For sentence embeddings:
  `python -m gamechangerml.train.scripts.create_embeddings -c {CORPUS LOCATION} --gpu True --em msmarco-distilbert-base-v2`
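
If you prefer to launch training from a script, the query expansion command above could be assembled roughly as follows. This is only a sketch: `build_train_command` is a hypothetical helper, and the suffix and directory values passed to it are placeholders, not values from the repository. The module path and flags are the ones shown in the command above.

```python
import shlex
import sys


def build_train_command(flag, corpus, model_dest, save_remote=False):
    """Assemble the query expansion training command as an argument list.

    Args:
        flag (str): Model name suffix, passed to --flag.
        corpus (str): Corpus directory, passed to --corpus.
        model_dest (str): Output path for the model, passed to --model_dest.
        save_remote (bool): Value passed to --saveremote. Default is False.

    Returns:
        list of str: Arguments suitable for subprocess.run.
    """
    return [
        sys.executable, "-m", "gamechangerml.train.scripts.run_train_models",
        "--flag", flag,
        "--saveremote", str(save_remote),
        "--model_dest", model_dest,
        "--corpus", corpus,
    ]


# Placeholder values, for illustration only.
cmd = build_train_command(
    flag="example_suffix",
    corpus="path/to/corpus",
    model_dest="path/to/model_output",
)
print(shlex.join(cmd))

# To actually start training (requires the package and its dependencies):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing an argument list rather than a single shell string avoids quoting problems when paths contain spaces.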
- Set up your environment and make any changes to configs:
  `source ./gamechangerml/setup_env.sh DEV`
- Ensure your AWS environment is set up (you have a default profile).
- Dependencies will be automatically downloaded and extracted.
- `cd gamechangerml/api`
- `docker-compose build`
- `docker-compose up`
- Visit `localhost:5000/docs`
- `export CONTAINER_RELOAD=True` to reload the container on code changes for development
- `export DOWNLOAD_DEP=True` to get models and other dependencies from S3
- `export MODEL_LOAD=False` to skip loading models on API start (only for development needs)
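
The three variables above are True/False switches. How the API itself parses them is not shown here, so the following is only a sketch of one common way to read such flags in Python. `env_flag`, the accepted spellings, and the default values are assumptions, not the project's implementation.

```python
import os


def env_flag(name, default, environ=None):
    """Read a True/False style environment variable.

    Args:
        name (str): Name of the environment variable.
        default (bool): Value to use when the variable is not set.
        environ (dict): Mapping to read from. Default is os.environ.

    Returns:
        bool: True if the variable is set to "true", "1", or "yes"
            (case-insensitive), False for any other value, or the
            default if the variable is unset.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"true", "1", "yes"}


# Assumed defaults, for illustration only.
container_reload = env_flag("CONTAINER_RELOAD", default=False)
download_dep = env_flag("DOWNLOAD_DEP", default=False)
model_load = env_flag("MODEL_LOAD", default=True)
```

Note that a naive `bool(os.environ["MODEL_LOAD"])` would treat the string `"False"` as true, which is why the value is compared against explicit spellings.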
- I get an error with Redis on API start.
  - `export ENV_TYPE=DEV`
- Do I need to train models to use the API?
  - No, you can use the pretrained models included in the dependencies.
- The API is crashing when trying to load the models.
  - Your machine likely does not have enough resources (RAM or CPU) to load all models. Try removing some models from the models folder.
- Do I need a machine with a GPU?
  - No, but it will make training and inference faster.
- What if I can't download the dependencies since I am external?
  - We are working on making models publicly available. In the meantime, you can download pretrained transformers from HuggingFace and place them in the `models/transformers` directory, which enables some functionality of the API. Even without any models, some functionality, such as text extraction, is still available.
Please provide:
- Description - what the purpose is and which features or changes are included, e.g. bugfix, added upload capability to model, model improvement
- Reviewer Test - how to test it manually and whether it is on a dev/test server (if applicable), e.g. hit the POST endpoint `/search` with payload `{"query": "military"}`
- Unit/Integration tests - screenshot or copy of the unit test output from GC_ML_TESTS_119, plus any other applicable tests or metrics
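
A reviewer test like the `/search` example above can be written down as a small script so it is easy to reproduce. This sketch uses only the Python standard library; `build_search_request` is a hypothetical helper, and the base URL assumes the API is running locally on port 5000 as in the `localhost:5000/docs` step. The endpoint and payload are the illustrative ones from the list above, so adjust them to the change under review.

```python
import json
import urllib.request


def build_search_request(query, base_url="http://localhost:5000"):
    """Build a POST request for the example /search endpoint.

    Args:
        query (str): Search text to send in the JSON payload.
        base_url (str): Base URL of the running API. Default is
            "http://localhost:5000".

    Returns:
        urllib.request.Request: Prepared request that has not been sent yet.
    """
    payload = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/search",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("military")

# To send it against a running API and inspect the response:
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(response.status, response.read().decode("utf-8"))
```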