Student: Markus Vogl
Matr. Nr: k1155575
Course: Explainable AI
The submitted IPython notebooks of this course are used as the data set.
- Fetching: All projects are pulled in parallel via the system git, which requires the correct credentials on your system (SSH keys and your ~/.gitconfig).
- Filtering: I use jupyter-nbconvert to extract the source code from the IPython notebook files (the typical hand-in format) in the root of each project.
- Preprocessing: As the model is limited to 512 tokens (not characters; the source code is tokenized), I strip all comments, empty lines and outputs via regex.
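As a minimal illustration of this preprocessing (using the same regexes as the fetch code further below, applied to a made-up snippet), the stripping works like this:
import re
raw = "import numpy as np  # load numpy\n\n\n# compute the mean\nx = np.mean([1, 2, 3])\n"
stripped = re.sub("#.+", "", raw)         # drop comments
stripped = re.sub("\n+", "\n", stripped)  # collapse empty lines
print(stripped)
# -> import numpy as np
# -> x = np.mean([1, 2, 3])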
The model used is Microsoft's CodeBERT, a variation of RoBERTa pre-trained on programming languages. It is built on the Hugging Face transformers library, which itself is based on PyTorch.
The sequences have to be truncated to a length of 512 tokens, as this is simple and effective, even though other approaches work better in some scenarios.
- BERT is currently the state of the art for text encoding / embedding
- BERT is pretrained on normal human text; CodeBERT adapts this pretraining to source code
- I chose CodeBERT over CuBERT because Microsoft seems to be the more reliable source
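As a quick sketch of the 512-token limit (a minimal check, not part of the pipeline itself, assuming the microsoft/codebert-base tokenizer loaded further below), one can tokenize an artificially long notebook and confirm that truncation caps its length:
from transformers import RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
source = "import numpy as np\n" * 400  # artificially long "notebook"
ids_full = tokenizer(source)["input_ids"]                                   # no truncation
ids_cut = tokenizer(source, max_length=512, truncation=True)["input_ids"]   # capped at 512
print(len(ids_full), len(ids_cut))  # e.g. something well above 512, then exactly 512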
The explainability method is taken from the paper Embedding Projector: Interactive Visualization and Interpretation of Embeddings by Smilkov et al. (2016).
Below is a short exploration of the big questions posed in Visual Analytics in Deep Learning: An Interrogative Survey for the Next Frontiers by Hohman et al. (2018).
According to Hohman et al., the Embedding Projector paper already covers:
Question | Criterion | Explanation |
---|---|---|
Why | Interpretability & Explainability | You can interpret and explain embeddings of any kind |
Who | Model Users | Data Scientists |
What | Individual Computational Units | The tool can project down with PCA and t-SNE |
What | Neurons in High-dimensional Space | Embeddings |
What | Aggregated Information | You can aggregate multiple embeddings and compare them |
When | After Training | |
Where | NIPS Conference | |
How | Dimensionality Reduction & Scatter Plots | PCA and t-SNE
How | Instance-based Analysis & Exploration | Exploring similar instances
In addition to the factors stated above, this project adds the following explainability aspects:
Question | Criterion | Explanation |
---|---|---|
Why | Comparing & Selecting Models | This allows you to plug in other encoders such as LSTMs or other BERT variants to compare them easily
Why | Education | It's an easy showcase of git, BERT embedding extraction and visualization
Why | Education | It's meant to compare student exercises and find plagiarism |
Who | Model Developers & Builders | The students in this course can compare their approach to that of the other teams
Who | Model Users | Teachers of any GitHub Classroom can just plug in their data and start
Who | Non-experts | They can just plug in their GitHub Classroom data, run it and get cool visualizations (given they manage the setup)
Where | - | JKU Linz, XAI Course |
How | Interactive Experimentation | You can change parameters like stripping newlines/comments and see how that affects your embeddings
classroom = "jku-icg-classroom"
prefix = "xai_proj_space_2020"
teams = ['xai',
'xai-wyoming',
'backpropagatedstudents',
'aikraken',
'mysterious-potatoes',
'the-explainables',
'aiexplained',
'group0',
'xai-explainable-black-magic',
'viennxai',
'xai_group_a',
'hands-on-xai',
'xai-random-group',
'feel_free_2_join',
'forum_feel_free_to_join',
'explain_it_explainable',
'yet-another-group',
'explanation-is-all-you-need',
'let_me_explain_you',
'dirty-mike-and-the-gang',
'explain-the-unexplainable',
'nothin_but_a_peanut',
'3_and_1-2_ger']
import re, os, itertools, pandas
from multiprocessing import Pool
# also requires the system utilities git, rm and jupyter-nbconvert
EXT = "ipynb"
COLUMNS = ["filename", "team", "exercise", "url"]
# Fetch contents of ipynb files
def fetch(team, strip_comments=True, strip_empty_lines=True):
    path = f"{classroom}/{prefix}/{team}"
    cp = f"{classroom}/{prefix}"
    if not os.path.exists(path):
        os.system(f"git clone [email protected]:{cp}-{team}.git {path}")
    print("⬇️", end="")
    file_content = {}
    files = filter(lambda n: n.endswith(EXT), os.listdir(path))
    for notebook in files:
        full_url = f"https://github.com/{cp}-{team}/{notebook}"
        cmd = f"jupyter-nbconvert {path}/{notebook} --to python --stdout"
        fc = os.popen(cmd).read()
        if strip_comments: fc = re.sub("#.+", "", fc)
        if strip_empty_lines: fc = re.sub("\n+", "\n", fc)
        no_ext = notebook.replace('.' + EXT, '')
        file_content[(no_ext, team, prefix, full_url)] = fc
    print("✅", end="")
    return file_content
def fetch_multithreaded():
    pool = Pool(len(teams))
    dicts = pool.map(fetch, teams)
    items = [fc.items() for fc in dicts]
    items_flat = itertools.chain(*items)
    return dict(items_flat)
file_content = fetch_multithreaded(); print()
prefix = "xai_model_explanation_2020"
file_content.update(fetch_multithreaded())
pandas.DataFrame(file_content.keys(), columns=COLUMNS)
 | filename | team | exercise | url
---|---|---|---|---
0 | solution | xai | xai_proj_space_2020 | https://github.com/jku-icg-classroom/xai_proj_... |
1 | solution | xai-wyoming | xai_proj_space_2020 | https://github.com/jku-icg-classroom/xai_proj_... |
2 | solution | backpropagatedstudents | xai_proj_space_2020 | https://github.com/jku-icg-classroom/xai_proj_... |
3 | solution | aikraken | xai_proj_space_2020 | https://github.com/jku-icg-classroom/xai_proj_... |
4 | solution | mysterious-potatoes | xai_proj_space_2020 | https://github.com/jku-icg-classroom/xai_proj_... |
... | ... | ... | ... | ... |
73 | 6 - LIME_explanations_run | nothin_but_a_peanut | xai_model_explanation_2020 | https://github.com/jku-icg-classroom/xai_model... |
74 | Untitled | nothin_but_a_peanut | xai_model_explanation_2020 | https://github.com/jku-icg-classroom/xai_model... |
75 | imagenet labels to pkl file | nothin_but_a_peanut | xai_model_explanation_2020 | https://github.com/jku-icg-classroom/xai_model... |
76 | 2 - Visualization of Fully Connected Layer Neu... | nothin_but_a_peanut | xai_model_explanation_2020 | https://github.com/jku-icg-classroom/xai_model... |
77 | solution | 3_and_1-2_ger | xai_model_explanation_2020 | https://github.com/jku-icg-classroom/xai_model... |
78 rows × 4 columns
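As mentioned under "Interactive Experimentation" above, the stripping flags can be toggled and the resulting sources compared before re-embedding. A hedged usage sketch, reusing the fetch function defined above with 'xai' as an arbitrary example team:
with_comments = fetch('xai', strip_comments=False)
without_comments = fetch('xai', strip_comments=True)
# compare how much text the comment stripping removes per notebook
for key in with_comments:
    print(key[0], len(with_comments[key]), "->", len(without_comments[key]), "characters")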
import torch
import numpy as np
from transformers import RobertaTokenizer, RobertaConfig, RobertaModel
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")
model.eval() # inference mode: disables dropout for deterministic embeddings
def embed(file_content, tokenizer, model, max_length=512):
    content_list = list(file_content.values())
    # tokenize as pytorch tensors, with padding, truncated to at most 512 tokens
    tokens = tokenizer(content_list, return_tensors="pt", padding=True,
                       max_length=max_length, truncation=True)
    # return the pooled [CLS] embedding of every file as a numpy array (one row per file)
    return model(**tokens)["pooler_output"].detach().numpy()
embedding = embed(file_content, tokenizer, model)
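A quick sanity check (a small sketch, not part of the original pipeline): there should be one row per fetched notebook, and the hidden size of codebert-base is 768.
print(embedding.shape)  # e.g. (78, 768)
assert embedding.shape[0] == len(file_content)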
# save files for visualizer
np.savetxt(classroom+"-embedding.tsv", embedding, delimiter="\t")
names = ["\t".join(fc) for fc in file_content]
# Projector metadata: the first row must contain the column names, without the "# " comment prefix
np.savetxt(classroom+"-names.tsv", names, fmt="%s", header="\t".join(COLUMNS), comments="")
Standalone instances:
- https://projector.tensorflow.org/
- https://justindujardin.github.io/projector/ (Works better for me for some reason)
Code:
Example: two teams are left in the given example; the distance is 0.0, as it is the same file.
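To back such an observation up numerically (a minimal sketch, assuming the embedding array and file_content keys computed above; scikit-learn is an extra dependency), the closest pair of notebooks can be found via pairwise cosine distances:
from sklearn.metrics.pairwise import cosine_distances
dist = cosine_distances(embedding)  # (n_files, n_files) distance matrix
np.fill_diagonal(dist, np.inf)      # ignore self-distances
i, j = np.unravel_index(np.argmin(dist), dist.shape)
keys = list(file_content)
print(keys[i], keys[j], dist[i, j])  # identical files show up with distance 0.0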