
> While reading the code and attempting to implement it independently, I identified several differences in the code and the paper: #42

Open
ShinoMashiru opened this issue Jul 2, 2024 · 0 comments

Comments

@ShinoMashiru

> While reading the code and attempting to implement it independently, I identified several differences between the code and the paper:
>
> - The definitions of the loss function and the anomaly score do not match Equation (9) and Equation (10) in the paper. Correspondingly, the threshold determination in the code diverges from the approach outlined in Section 4.3 of the paper.
> - The code for computing `AttnN_i` and `AttnP_i` does not match Equations (2) and (5) in the paper. The denominator in the paper is `d_model`, while in the code it is `d_model / H`.
> - There seems to be an error in the code when splitting the patches. For the univariate time series in Figure (a), its Patch-wise and In-patch embeddings should be as shown in Figure (b) and Figure (c), respectively.
>
>   The code cannot do this, as shown in Figure (a); the correct result is shown in Figure (b).
> - The code does not seem to sum the representations of the multiple patch sizes according to Equations (7) and (8) in the paper; instead, it sums their KL divergence distances when calculating the loss. As far as we know, these two operations are not equivalent.
> - Equation (3) and Equation (6) in the paper seem to be wrong.
>   - The code does not concat the multiple heads as stated in the paper, but averages them after evaluating their respective KL divergences.
>   - There is no `W_N^O` or `W_P^O` in the code. In fact, this multiplication cannot work at all: `AttnN` (`AttnP`) has a shape of B×H×N×N (B×H×P×P), and it cannot be multiplied by a `W_N^O` (`W_P^O`) of shape `d_model`×`d_model`, concatenated or not.
> - Each attention layer in the encoder has an input shape of (BC)×H×P×P ((BC)×H×N×N) and an output shape of B×H×(NP)×(NP). Because of these inconsistent shapes, the individual attention layers cannot be connected in series, so the code uses a parallel approach and sums the KL divergences of the different attention layers. This is not mentioned at all in the paper.

Hello Dr. ForestsKing! Initially, I also had similar doubts while reading the paper. My initial understanding (allow me to name it method a) was: for the attention-layer calculation in Equation (2) and Equation (5), under normal circumstances the dot product of Q and K should be followed by multiplication with a matrix V of shape N(P) × d/H. The resulting `AttnN_i` (`AttnP_i`) would then have shape N(P) × d/H, the subsequent concat would give `AttnN` (`AttnP`) of shape N(P) × d, and the subsequent Up-sampling could proceed as described in the paper. So initially I thought the problem with the paper was simply that the authors forgot to mention V in QKV.

However, later, when I tried to replicate the code and carefully read Figure 2 of the paper, I found that the shape of its attention output is H×T(NP)×T(NP). Combined with your reply, the code uses exactly the second approach (allow me to name it method b).
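To make the two readings concrete, here is a minimal PyTorch sketch of what I mean. All shapes and weight names are illustrative assumptions, not taken from the repository:

```python
import torch
import torch.nn.functional as F

B, H, N, d_model = 8, 4, 10, 64        # batch, heads, patches, model dim (arbitrary)
d_head = d_model // H

x = torch.randn(B, N, d_model)         # patch-wise tokens (toy input)
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
W_o = torch.randn(d_model, d_model)    # the W^O used only by method a)

def heads(t):
    # split the model dimension into H heads: (B, N, d_model) -> (B, H, N, d_head)
    return t.view(B, N, H, d_head).transpose(1, 2)

Q, K, V = heads(x @ W_q), heads(x @ W_k), heads(x @ W_v)

# the denominator here is sqrt(d_model / H); whether the paper intends
# d_model or d_model / H is exactly one of the points quoted above
scores = Q @ K.transpose(-2, -1) / d_head ** 0.5   # (B, H, N, N)
A = F.softmax(scores, dim=-1)

# method a): standard multi-head attention -- multiply by V, concat the heads,
# then apply W^O; the result is (B, N, d_model)
out_a = (A @ V).transpose(1, 2).reshape(B, N, d_model) @ W_o

# method b): keep the per-head attention maps themselves as the output,
# which is what the released code appears to do; the shape stays (B, H, N, N)
out_b = A

print(out_a.shape)  # torch.Size([8, 10, 64])
print(out_b.shape)  # torch.Size([8, 4, 10, 10])
```

Only method a) produces an output whose last dimension is `d_model`, which is what the concat-plus-`W^O` and Up-sampling steps described in the paper seem to require.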

Therefore, I have the following two conjectures:

1. The author's idea is consistent with method a), but the multiplication by the matrix V was omitted from the attention-layer calculation in the paper. Once this point is added, everything else, including the multiplication by the matrix W^O after the concat and the summation in the Up-sampling, is consistent with the description in the paper. But this is inconsistent with the code and with the description in Figure 2 of the paper.

2. The author's idea is consistent with method b), which matches the code and conforms to the description in Figure 2. But, as you mentioned, under method b) the multiplication by the matrix W^O after the concat described in the paper does not exist, and there is a problem with the Up-sampling operation.

I hope the author can confirm how the inconsistency between the paper and the code came about. Is it as in conjecture 1, where the author's real idea is reflected in the paper, but a deviation occurred while implementing the code? Or as in conjecture 2, where the code reflects the author team's real idea, and the paper's description deviated from it, so the paper expresses it incorrectly?

Which of the two methods is truly effective? Moreover, similar to the Anomaly Transformer, the author's experiments all use the PA (point adjustment) trick. This trick has been questioned by many researchers, who believe that the reported effectiveness mainly comes from PA rather than from the model itself. The inconsistency between the paper and the code seems to have intensified these doubts.
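For context, the PA step in question works roughly as follows; this is a minimal sketch of the commonly used evaluation trick, not the authors' evaluation code:

```python
import numpy as np

def point_adjust(pred, gt):
    """Point adjustment (PA): if any point inside a ground-truth anomaly
    segment is predicted as anomalous, the whole segment is counted as detected."""
    pred = pred.copy()
    in_segment, start = False, 0
    for i in range(len(gt)):
        if gt[i] == 1 and not in_segment:
            in_segment, start = True, i
        if in_segment and (gt[i] == 0 or i == len(gt) - 1):
            end = i if gt[i] == 0 else i + 1
            if pred[start:end].any():        # one hit anywhere in the segment
                pred[start:end] = 1          # ...marks the whole segment as found
            in_segment = False
    return pred

gt   = np.array([0, 1, 1, 1, 0, 0, 1, 1, 0])
pred = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0])   # a single hit in the first segment
print(point_adjust(pred, gt))                   # [0 1 1 1 0 0 0 0 0]
```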

I hope the author can provide an answer to this.
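As a small addendum to the fourth quoted point: a toy check that summing the multi-scale representations (as in Equations (7) and (8)) and summing their KL divergences generally give different values. The distributions below are random stand-ins, purely for illustration:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# two per-patch-size distributions and one reference distribution (toy stand-ins)
p1 = F.softmax(torch.randn(4), dim=-1)
p2 = F.softmax(torch.randn(4), dim=-1)
q  = F.softmax(torch.randn(4), dim=-1)

def kl(p, q):
    return torch.sum(p * (p / q).log())

# "paper" reading: combine the representations first, then take one KL
kl_of_sum = kl((p1 + p2) / 2, q)

# "code" reading: take a KL per patch size, then sum the distances
sum_of_kl = kl(p1, q) + kl(p2, q)

print(kl_of_sum.item(), sum_of_kl.item())   # the two values generally differ
```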
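And on the quoted `W_N^O` point, the shape argument can be checked directly; the dimensions below are arbitrary:

```python
import torch

B, H, N, d_model = 8, 4, 10, 64

# per-head patch-wise attention maps with the shape reported for the code
attn_N = torch.softmax(torch.randn(B, H, N, N), dim=-1)
W_O = torch.randn(d_model, d_model)    # a W_N^O as described in the paper

try:
    attn_N @ W_O                       # inner dims are N=10 vs d_model=64
except RuntimeError as err:
    print("cannot multiply:", err)

# even after concatenating the heads, the trailing dimension is H*N, not
# d_model, so the product with a d_model x d_model matrix is still undefined
concat = attn_N.transpose(1, 2).reshape(B, N, H * N)
print(concat.shape)                    # torch.Size([8, 10, 40])
```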
Originally posted by @ShinoMashiru in #10 (comment)
