Confusion about the window attention

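I'd like to ask about window attention. My understanding is that each token attends to the window-size tokens preceding it (when enough exist; a token such as BOS has no predecessors, so it sees only itself). Following typical sparse attention implementations, shouldn't the forward pass first select, for each q, the kv entries it needs to attend to? After that step the attention complexity appears to be O(TW), and in that case the windows effectively overlap, don't they?

The non-overlapping window attention you describe actually splits the LLM's input sequence into chunks and computes attention within each chunk separately. This clearly means that the tokens at the start of each chunk see fewer than window-size tokens. Can this really be called window attention in an LLM? Or do you actually mean that kv is recomputed for each different window? I'm not sure whether that would be a more accurate description. I hope you can clarify, and please correct me if my understanding is wrong.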
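To make the distinction concrete, here is a minimal sketch of the two masking schemes as I understand them. It is not taken from the repository's code. The function names are hypothetical, T is the sequence length, W is the window size, and I assume the window counts the token itself together with the W - 1 tokens before it.

```python
# Hypothetical helpers for illustration only; not the repository's implementation.

def sliding_window_mask(T, W):
    # Overlapping (sliding) causal window: query i may attend to key j
    # when i - W < j <= i, i.e. itself plus up to W - 1 preceding tokens.
    return [[(j <= i) and (j > i - W) for j in range(T)] for i in range(T)]

def chunked_mask(T, W):
    # Non-overlapping variant: the sequence is cut into chunks of length W,
    # and query i may attend only to earlier-or-equal keys in its own chunk.
    return [[(j <= i) and (i // W == j // W) for j in range(T)] for i in range(T)]

def visible_counts(mask):
    # Number of keys each query position can attend to.
    return [sum(row) for row in mask]

T, W = 8, 4
sliding_counts = visible_counts(sliding_window_mask(T, W))
chunked_counts = visible_counts(chunked_mask(T, W))

print(sliding_counts)  # [1, 2, 3, 4, 4, 4, 4, 4]
print(chunked_counts)  # [1, 2, 3, 4, 1, 2, 3, 4]
```

With the sliding mask, every position past the first W - 1 sees a full window. With the chunked mask, the count resets to 1 at each chunk boundary, which is the effect I am asking about. In both cases the number of attended (q, k) pairs is at most T * W.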

Confusion about the window attention #18
