
Commit d9a7146

How to use embedding doc (#442)
* Add the 'how to use embedding' tutorial
* Add the chinese version of how to use embedding doc
* update syntax
* add punctuation
* change visualdl to visualDL
* Prove reading
* prove reading
1 parent 26ea0c3 commit d9a7146

File tree

2 files changed: +260 −0 lines changed

Lines changed: 127 additions & 0 deletions

# How to visualize embeddings with VisualDL

Here we show you how to visualize embeddings with VisualDL in PyTorch.
Embeddings are widely used in natural language processing; they represent semantic meaning as high-dimensional vectors.

Embedding visualization helps verify the training algorithm: it compresses the high-dimensional vectors into a 2D/3D space,
and the closer two words are, the more semantic meaning they share.
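
To make the idea concrete, here is a minimal sketch, separate from the tutorial's own script, of what an embedding layer does: it maps integer word indices to dense vectors, and those vectors are the points that the visualizer later projects. The two-word vocabulary here is purely hypothetical.

```
import torch
import torch.nn as nn

torch.manual_seed(1)

# Toy vocabulary, for illustration only.
word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(num_embeddings=2, embedding_dim=5)  # 2 words, 5-dim vectors

lookup = torch.tensor([word_to_ix["hello"]], dtype=torch.long)
print(embeds(lookup))  # a 1 x 5 tensor: the vector that currently represents "hello"
```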

We use the PyTorch [embedding example](http://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html) as the base.

The complete embedding script follows;
you can run it directly in your Python environment.

```
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

CONTEXT_SIZE = 2
EMBEDDING_DIM = 10
# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()
# we should tokenize the input, but we will ignore that for now
# build a list of tuples. Each tuple is ([ word_i-2, word_i-1 ], target word)
trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2])
            for i in range(len(test_sentence) - 2)]
# print the first 3, just so you can see what they look like
print(trigrams[:3])

vocab = set(test_sentence)
word_to_ix = {word: i for i, word in enumerate(vocab)}


class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs


losses = []
loss_function = nn.NLLLoss()
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(10):
    total_loss = torch.Tensor([0])
    for context, target in trigrams:

        # Step 1. Prepare the inputs to be passed to the model (i.e. turn the words
        # into integer indices and wrap them in tensors)
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

        # Step 2. Recall that torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old
        # instance
        model.zero_grad()

        # Step 3. Run the forward pass, getting log probabilities over next
        # words
        log_probs = model(context_idxs)

        # Step 4. Compute your loss function. (Again, Torch wants the target
        # word wrapped in a tensor)
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

        # Step 5. Do the backward pass and update the gradient
        loss.backward()
        optimizer.step()

        # Get the Python number from a 1-element Tensor by calling tensor.item()
        total_loss += loss.item()
    losses.append(total_loss)
print(losses)  # The loss decreased every iteration over the training data!
```

That is all the code needed to generate your first embedding.
Now, let's add a small snippet that stores the embedding in a VisualDL log, so that we can visualize it with VisualDL afterwards.

```
# Import VisualDL
from visualdl import LogWriter
# VisualDL setup
logw = LogWriter("./embedding_log", sync_cycle=10000)
with logw.mode('train') as logger:
    embedding = logger.embedding()

embeddings_list = model.embeddings.weight.data.numpy()  # convert to numpy array

# VisualDL embedding log writer takes two parameters
# The first parameter is embedding list. The type is list[list[float]]
# The second parameter is word_dict. The type is dictionary<string, int>.
embedding.add_embeddings_with_word_dict(embeddings_list, word_to_ix)
```

Insert the code snippet above into your embedding training program;
it saves the embeddings and the word_dict to the `./embedding_log` folder.

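If the visualization ever looks off, a quick consistency check can help before writing the log. This is only a sketch, not part of the original tutorial; it reuses `embeddings_list`, `word_to_ix`, and `EMBEDDING_DIM` from the script above and assumes the writer pairs row `i` of the embedding list with the word whose index is `i`.

```
# Sketch: check that the embedding matrix and the word dictionary line up
# before handing both to add_embeddings_with_word_dict (assumption: row i
# of embeddings_list corresponds to the word mapped to index i).
assert embeddings_list.shape == (len(word_to_ix), EMBEDDING_DIM)
assert all(0 <= ix < len(embeddings_list) for ix in word_to_ix.values())
print("embedding matrix shape:", embeddings_list.shape)
```
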
Now we can start VisualDL with `visualDL --logdir=./embedding_log`,
navigate a browser to `localhost:8080`, and switch to the `High Dimensional` tab.

You can download the tutorial code [here](https://github.com/PaddlePaddle/VisualDL/blob/develop/demo/pytorch/pytorch_word2vec.py).
Lines changed: 133 additions & 0 deletions

# How to visualize embeddings with VisualDL

Here we would like to show you how to visualize embeddings with
VisualDL in PyTorch.

Embeddings are often used in NLP (Natural Language Processing); they represent
semantic meaning with high-dimensional vectors.

Embedding visualization is useful for verifying the training algorithm,
as the visualization reduces the high-dimensional vectors to a 2D/3D space.
The closer two words are, the more semantic meaning they share.
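
That reduction to two or three dimensions is exactly what the embedding visualizer performs for you. As a rough, standalone illustration only (not part of the tutorial's code, and the tool may use a different projection such as t-SNE), this is what a PCA projection of an embedding matrix looks like, using scikit-learn and a random stand-in matrix:

```
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for a trained embedding matrix: 100 words, 10 dimensions.
rng = np.random.RandomState(0)
embedding_matrix = rng.randn(100, 10)

points_2d = PCA(n_components=2).fit_transform(embedding_matrix)
print(points_2d.shape)  # (100, 2): one (x, y) point per word, ready to plot
```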

We use the PyTorch [embedding example](http://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html) as
the base. Here is the whole embedding program; the following block is a working Python script.
Feel free to test it in your Python environment.

```
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

CONTEXT_SIZE = 2
EMBEDDING_DIM = 10
# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()
# we should tokenize the input, but we will ignore that for now
# build a list of tuples. Each tuple is ([ word_i-2, word_i-1 ], target word)
trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2])
            for i in range(len(test_sentence) - 2)]
# print the first 3, just so you can see what they look like
print(trigrams[:3])

vocab = set(test_sentence)
word_to_ix = {word: i for i, word in enumerate(vocab)}


class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs


losses = []
loss_function = nn.NLLLoss()
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(10):
    total_loss = torch.Tensor([0])
    for context, target in trigrams:

        # Step 1. Prepare the inputs to be passed to the model (i.e. turn the words
        # into integer indices and wrap them in tensors)
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

        # Step 2. Recall that torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old
        # instance
        model.zero_grad()

        # Step 3. Run the forward pass, getting log probabilities over next
        # words
        log_probs = model(context_idxs)

        # Step 4. Compute your loss function. (Again, Torch wants the target
        # word wrapped in a tensor)
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

        # Step 5. Do the backward pass and update the gradient
        loss.backward()
        optimizer.step()

        # Get the Python number from a 1-element Tensor by calling tensor.item()
        total_loss += loss.item()
    losses.append(total_loss)
print(losses)  # The loss decreased every iteration over the training data!
```

That's all the code you need to generate your first embedding.

Now, let us just add a little bit of code to store the embedding in a VisualDL log
so we can visualize it later.

```
# Import VisualDL
from visualdl import LogWriter
# VisualDL setup
logw = LogWriter("./embedding_log", sync_cycle=10000)
with logw.mode('train') as logger:
    embedding = logger.embedding()

embeddings_list = model.embeddings.weight.data.numpy()  # convert to numpy array

# VisualDL embedding log writer takes two parameters
# The first parameter is embedding list. The type is list[list[float]]
# The second parameter is word_dict. The type is dictionary<string, int>.
embedding.add_embeddings_with_word_dict(embeddings_list, word_to_ix)
```

Insert the above code snippet into your embedding training program.

This will save the embeddings and the word dictionary to the `./embedding_log` folder.

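Before opening the UI, you can already spot-check the claim that nearby words share meaning by comparing two word vectors with cosine similarity. This is only a sketch that reuses `torch`, `model`, and `word_to_ix` from the script above; the word pair is an arbitrary choice from the sonnet.

```
import torch.nn.functional as F

def word_vector(word):
    ix = torch.tensor([word_to_ix[word]], dtype=torch.long)
    return model.embeddings(ix)  # shape: (1, EMBEDDING_DIM)

# Cosine similarity lies in [-1, 1]; larger values mean more similar directions.
sim = F.cosine_similarity(word_vector("beauty"), word_vector("praise"))
print(sim.item())
```
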
We can now start VisualDL by running `visualDL --logdir=./embedding_log`.
Use your browser to navigate to `localhost:8080` and switch to the `High Dimensional` tab.

You can download the tutorial code [here](https://github.com/PaddlePaddle/VisualDL/blob/develop/demo/pytorch/pytorch_word2vec.py).
