
> While reading the code and attempting to implement it independently, I identified several differences in the code and the paper: #42

Open
ShinoMashiru opened this issue Jul 2, 2024 · 0 comments

Comments

@ShinoMashiru

> While reading the code and attempting to implement it independently, I identified several differences between the code and the paper:
>
> - The definitions of the loss function and the anomaly score do not match Equation (9) and Equation (10) in the paper. Correspondingly, the threshold determination in the code diverges from the approach outlined in Section 4.3 of the paper.
> - The code for computing `AttnN_i` and `AttnP_i` does not match Equations (2) and (5) in the paper. The denominator in the paper is `d_model`, while in the code it is `d_model / H`.
> - There seems to be an error in the code when splitting the patches. For the univariate time series in Figure (a), its Patch-wise and In-patch embeddings should be as shown in Figure (b) and Figure (c), respectively.
>
>   The code cannot do this, as shown in Figure (a); the correct result is shown in Figure (b).
> - The code does not seem to sum the representations of the multiple patch sizes according to Equations (7) and (8) in the paper; instead, it sums their KL divergence distances when calculating the loss. As far as we know, these two operations are not equivalent.
> - Equation (3) and Equation (6) in the paper seem to be wrong.
>   - The code does not concat the multiple heads as stated in the paper, but averages them after evaluating their respective KL divergences.
>   - There is no `W_N^O` or `W_P^O` in the code. In fact, this multiplication cannot work at all: `AttnN` (`AttnP`) has a shape of B×H×N×N (B×H×P×P), and it cannot be multiplied by a `W_N^O` (`W_P^O`) of shape `d_model`×`d_model`, concatenated or not.
> - Each attention layer in the encoder has an input shape of (BC)×H×P×P ((BC)×H×N×N) and an output shape of B×H×(NP)×(NP). Because of these inconsistent shapes, the individual attention layers cannot be connected in series, so the code uses a parallel approach and sums the KL divergences of the different attention layers. This is not mentioned at all in the paper.

Hello Dr. ForestsKing! Initially, I also had similar doubts while reading the paper. My initial understanding (allow me to name it method a) was: for the attention-layer calculation in Equation (2) and Equation (5), under normal circumstances the dot product of Q and K should be followed by multiplication with a matrix V of shape N(P) × d/H. The resulting `AttnN_i` (`AttnP_i`) would then have shape N(P) × d/H, the subsequent concat would give `AttnN` (`AttnP`) of shape N(P) × d, and the subsequent Up-sampling could proceed as described in the paper. So initially I thought the problem with the paper was simply that the authors forgot to mention V in QKV.

However, later, when I tried to replicate the code and carefully read Figure 2 of the paper, I found that the shape of its attention output is H×T(NP)×T(NP). Combined with your reply, the code uses exactly the second approach (allow me to name it method b).
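To make the two readings concrete, here is a minimal PyTorch sketch of what I mean. All shapes and weight names are illustrative assumptions, not taken from the repository:

```python
import torch
import torch.nn.functional as F

B, H, N, d_model = 8, 4, 10, 64        # batch, heads, patches, model dim (arbitrary)
d_head = d_model // H

x = torch.randn(B, N, d_model)         # patch-wise tokens (toy input)
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
W_o = torch.randn(d_model, d_model)    # the W^O used only by method a)

def heads(t):
    # split the model dimension into H heads: (B, N, d_model) -> (B, H, N, d_head)
    return t.view(B, N, H, d_head).transpose(1, 2)

Q, K, V = heads(x @ W_q), heads(x @ W_k), heads(x @ W_v)

# the denominator here is sqrt(d_model / H); whether the paper intends
# d_model or d_model / H is exactly one of the points quoted above
scores = Q @ K.transpose(-2, -1) / d_head ** 0.5   # (B, H, N, N)
A = F.softmax(scores, dim=-1)

# method a): standard multi-head attention -- multiply by V, concat the heads,
# then apply W^O; the result is (B, N, d_model)
out_a = (A @ V).transpose(1, 2).reshape(B, N, d_model) @ W_o

# method b): keep the per-head attention maps themselves as the output,
# which is what the released code appears to do; the shape stays (B, H, N, N)
out_b = A

print(out_a.shape)  # torch.Size([8, 10, 64])
print(out_b.shape)  # torch.Size([8, 4, 10, 10])
```

Only method a) produces an output whose last dimension is `d_model`, which is what the concat-plus-`W^O` and Up-sampling steps described in the paper seem to require.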

Therefore, I have the following two conjectures:

1. The author's idea is consistent with method a), but the multiplication by the matrix V was omitted from the attention-layer calculation in the paper. Once this point is added, everything else, including the multiplication by the matrix W^O after the concat and the summation in the Up-sampling, is consistent with the description in the paper. But this is inconsistent with the code and with the description in Figure 2 of the paper.

2. The author's idea is consistent with method b), which matches the code and conforms to the description in Figure 2. But, as you mentioned, under method b) the multiplication by the matrix W^O after the concat described in the paper does not exist, and there is a problem with the Up-sampling operation.

I hope the author can confirm how the inconsistency between the paper and the code came about. Is it as in conjecture 1, where the author's real idea is reflected in the paper, but a deviation occurred while implementing the code? Or as in conjecture 2, where the code reflects the author team's real idea, and the paper's description deviated from it, so the paper expresses it incorrectly?

Which of the two methods is truly effective? Moreover, similar to the Anomaly Transformer, the author's experiments all use the PA (point adjustment) trick. This trick has been questioned by many researchers, who believe that the reported effectiveness mainly comes from PA rather than from the model itself. The inconsistency between the paper and the code seems to have intensified these doubts.
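For context, the PA step in question works roughly as follows; this is a minimal sketch of the commonly used evaluation trick, not the authors' evaluation code:

```python
import numpy as np

def point_adjust(pred, gt):
    """Point adjustment (PA): if any point inside a ground-truth anomaly
    segment is predicted as anomalous, the whole segment is counted as detected."""
    pred = pred.copy()
    in_segment, start = False, 0
    for i in range(len(gt)):
        if gt[i] == 1 and not in_segment:
            in_segment, start = True, i
        if in_segment and (gt[i] == 0 or i == len(gt) - 1):
            end = i if gt[i] == 0 else i + 1
            if pred[start:end].any():        # one hit anywhere in the segment
                pred[start:end] = 1          # ...marks the whole segment as found
            in_segment = False
    return pred

gt   = np.array([0, 1, 1, 1, 0, 0, 1, 1, 0])
pred = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0])   # a single hit in the first segment
print(point_adjust(pred, gt))                   # [0 1 1 1 0 0 0 0 0]
```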

I hope the author can provide an answer to this.
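As a small addendum to the fourth quoted point: a toy check that summing the multi-scale representations (as in Equations (7) and (8)) and summing their KL divergences generally give different values. The distributions below are random stand-ins, purely for illustration:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# two per-patch-size distributions and one reference distribution (toy stand-ins)
p1 = F.softmax(torch.randn(4), dim=-1)
p2 = F.softmax(torch.randn(4), dim=-1)
q  = F.softmax(torch.randn(4), dim=-1)

def kl(p, q):
    return torch.sum(p * (p / q).log())

# "paper" reading: combine the representations first, then take one KL
kl_of_sum = kl((p1 + p2) / 2, q)

# "code" reading: take a KL per patch size, then sum the distances
sum_of_kl = kl(p1, q) + kl(p2, q)

print(kl_of_sum.item(), sum_of_kl.item())   # the two values generally differ
```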
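And on the quoted `W_N^O` point, the shape argument can be checked directly; the dimensions below are arbitrary:

```python
import torch

B, H, N, d_model = 8, 4, 10, 64

# per-head patch-wise attention maps with the shape reported for the code
attn_N = torch.softmax(torch.randn(B, H, N, N), dim=-1)
W_O = torch.randn(d_model, d_model)    # a W_N^O as described in the paper

try:
    attn_N @ W_O                       # inner dims are N=10 vs d_model=64
except RuntimeError as err:
    print("cannot multiply:", err)

# even after concatenating the heads, the trailing dimension is H*N, not
# d_model, so the product with a d_model x d_model matrix is still undefined
concat = attn_N.transpose(1, 2).reshape(B, N, H * N)
print(concat.shape)                    # torch.Size([8, 10, 40])
```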
Originally posted by @ShinoMashiru in #10 (comment)
