From 52d02e8dcafe5f5a06668ef0c7f8344bc6d0f04c Mon Sep 17 00:00:00 2001 From: Leon Silva Date: Fri, 19 Nov 2021 20:59:13 -0300 Subject: [PATCH 1/3] adding weeks 03 to 06 --- docs/_config.yml | 18 + docs/pt/week03/03-1.md | 487 +++++ docs/pt/week03/03-2.md | 476 +++++ docs/pt/week03/03-3.md | 285 +++ docs/pt/week03/03.md | 40 + docs/pt/week03/lecture03.sbv | 3429 ++++++++++++++++++++++++++++++ docs/pt/week03/practicum03.sbv | 1751 ++++++++++++++++ docs/pt/week04/04-1.md | 596 ++++++ docs/pt/week04/04.md | 18 + docs/pt/week04/practicum04.sbv | 1517 ++++++++++++++ docs/pt/week05/05-1.md | 451 ++++ docs/pt/week05/05-2.md | 512 +++++ docs/pt/week05/05-3.md | 490 +++++ docs/pt/week05/05.md | 40 + docs/pt/week05/lecture05.sbv | 3572 ++++++++++++++++++++++++++++++++ docs/pt/week05/practicum05.sbv | 1241 +++++++++++ docs/pt/week06/06-1.md | 285 +++ docs/pt/week06/06-2.md | 586 ++++++ docs/pt/week06/06-3.md | 734 +++++++ docs/pt/week06/06.md | 36 + docs/pt/week06/lecture06.sbv | 3338 +++++++++++++++++++++++++++++ docs/pt/week06/practicum06.sbv | 1742 ++++++++++++++++ 22 files changed, 21644 insertions(+) create mode 100644 docs/pt/week03/03-1.md create mode 100644 docs/pt/week03/03-2.md create mode 100644 docs/pt/week03/03-3.md create mode 100644 docs/pt/week03/03.md create mode 100644 docs/pt/week03/lecture03.sbv create mode 100644 docs/pt/week03/practicum03.sbv create mode 100644 docs/pt/week04/04-1.md create mode 100644 docs/pt/week04/04.md create mode 100644 docs/pt/week04/practicum04.sbv create mode 100644 docs/pt/week05/05-1.md create mode 100644 docs/pt/week05/05-2.md create mode 100644 docs/pt/week05/05-3.md create mode 100644 docs/pt/week05/05.md create mode 100644 docs/pt/week05/lecture05.sbv create mode 100644 docs/pt/week05/practicum05.sbv create mode 100644 docs/pt/week06/06-1.md create mode 100644 docs/pt/week06/06-2.md create mode 100644 docs/pt/week06/06-3.md create mode 100644 docs/pt/week06/06.md create mode 100644 docs/pt/week06/lecture06.sbv create mode 100644 docs/pt/week06/practicum06.sbv diff --git a/docs/_config.yml b/docs/_config.yml index 94f33a605..cf1b3725f 100644 --- a/docs/_config.yml +++ b/docs/_config.yml @@ -757,6 +757,24 @@ pt: - path: pt/week02/02-1.md - path: pt/week02/02-2.md - path: pt/week02/02-3.md + - path: pt/week03/03.md + sections: + - path: pt/week03/03-1.md + - path: pt/week03/03-2.md + - path: pt/week03/03-3.md + - path: pt/week04/04.md + sections: + - path: pt/week04/04-1.md + - path: pt/week05/05.md + sections: + - path: pt/week05/05-1.md + - path: pt/week05/05-2.md + - path: pt/week05/05-3.md + - path: pt/week06/06.md + sections: + - path: pt/week06/06-1.md + - path: pt/week06/06-2.md + - path: pt/week06/06-3.md ################################## Hungarian ################################### hu: diff --git a/docs/pt/week03/03-1.md b/docs/pt/week03/03-1.md new file mode 100644 index 000000000..e887a7733 --- /dev/null +++ b/docs/pt/week03/03-1.md @@ -0,0 +1,487 @@ +--- +lang: pt +lang-ref: ch.03-1 +lecturer: Yann LeCun +title: Visualização da Transformação de Parâmetros de Redes Neurais e Conceitos Fundamentais de Convoluções +authors: Jiuhong Xiao, Trieu Trinh, Elliot Silva, Calliea Pan +date: 10 Feb 2020 +typora-root-url: 03-1 +translation-date: 14 Nov 2021 +translator: Leon Solon +--- + + + + +## [Visualização de redes neurais](https://www.youtube.com/watch?v=FW5gFiJb-ig&t=5s) + + + +Nesta seção, visualizaremos o funcionamento interno de uma rede neural. + + + +
Network
+Fig. 1 Estrutura da rede
+ + + +A Figura 1 mostra a estrutura da rede neural que gostaríamos de visualizar. Normalmente, quando desenhamos a estrutura de uma rede neural, a entrada aparece na parte inferior ou à esquerda e a saída aparece na parte superior ou direita. Na Figura 1, os neurônios de cor rosa representam as entradas e os neurônios azuis representam as saídas. Nesta rede, temos 4 camadas ocultas (em verde), o que significa que temos 6 camadas no total (4 camadas ocultas + 1 camada de entrada + 1 camada de saída). Nesse caso, temos 2 neurônios por camada oculta e, portanto, a dimensão da matriz de peso ($W$) para cada camada é 2 por 2. Isso ocorre porque queremos transformar nosso plano de entrada em outro plano que possamos visualizar. + + + +
Network
+Fig. 2 Visualização do espaço dobrável
+ + + +A transformação de cada camada é como dobrar nosso plano em algumas regiões específicas, conforme mostrado na Figura 2. Esse dobramento é muito abrupto, isso porque todas as transformações são realizadas na camada 2D. No experimento, descobrimos que, se tivermos apenas 2 neurônios em cada camada oculta, a otimização será mais demorada; a otimização é mais fácil se tivermos mais neurônios nas camadas ocultas. Isso nos deixa com uma questão importante a considerar: por que é mais difícil treinar a rede com menos neurônios nas camadas ocultas? Você mesmo deve considerar esta questão e retornaremos a ela após a visualização de $\texttt{ReLU}$. + + + +| Network | Network | +|(a)|(b)| + + + +
Fig. 3 Visualização do operador ReLU
+ + + +Quando percorremos a rede, uma camada oculta de cada vez, vemos que, em cada camada, realizamos alguma transformação afim, seguida pela aplicação da operação ReLU não linear, que elimina quaisquer valores negativos. Nas Figuras 3 (a) e (b), podemos ver a visualização do operador ReLU. O operador ReLU nos ajuda a fazer transformações não lineares. Após várias etapas de realização de uma transformação afim seguida pelo operador ReLU, somos eventualmente capazes de separar linearmente os dados, como pode ser visto na Figura 4. + + + +
Network
+Fig. 4 Visualização de saídas
+ + + +Isso nos fornece algumas dicas sobre por que as camadas ocultas de 2 neurônios são mais difíceis de treinar. Nossa rede de 6 camadas tem um viés em cada camada oculta. Portanto, se uma dessas polarizações mover pontos para fora do quadrante superior direito, a aplicação do operador ReLU eliminará esses pontos para zero. Depois disso, não importa o quanto as camadas posteriores transformem os dados, os valores permanecerão zero. Podemos tornar uma rede neural mais fácil de treinar tornando a rede "mais gorda" - *ou seja,* adicionando mais neurônios em camadas ocultas - ou podemos adicionar mais camadas ocultas, ou uma combinação dos dois métodos. Ao longo deste curso, exploraremos como determinar a melhor arquitetura de rede para um determinado problema, fique atento. + + + +## [Transformações de parâmetro](https://www.youtube.com/watch?v=FW5gFiJb-ig&t=477s) + + + +A transformação de parâmetro geral significa que nosso vetor de parâmetro $w$ é a saída de uma função. Por meio dessa transformação, podemos mapear o espaço de parâmetro original em outro espaço. Na Figura 5, $ w $ é na verdade a saída de $H$ com o parâmetro $u$. $G (x, w)$ é uma rede e $C(y,\bar y)$ é uma função de custo. A fórmula de retropropagação também é adaptada da seguinte forma, + + + +$$ +u \leftarrow u - \eta\frac{\partial H}{\partial u}^\top\frac{\partial C}{\partial w}^\top +$$ + + + +$$ +w \leftarrow w - \eta\frac{\partial H}{\partial u}\frac{\partial H}{\partial u}^\top\frac{\partial C}{\partial w}^\top +$$ + + + +Essas fórmulas são aplicadas em forma de matriz. Observe que as dimensões dos termos devem ser consistentes. As dimensões de $u$,$w$,$\frac{\partial H}{\partial u}^\top$,$\frac{\partial C}{\partial w}^\top$ são $[N_u \times 1]$,$[N_w \times 1]$,$[N_u \times N_w]$,$[N_w \times 1]$, respectivamente. Portanto, a dimensão de nossa fórmula de retropropagação é consistente. + + + +
Network
+Fig. 5 Forma geral das transformações de parâmetros
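A título de ilustração, segue um esboço mínimo em PyTorch (nomes e dimensões hipotéticos, não fazem parte do material original) de uma transformação de parâmetros: os pesos $w$ usados pela rede são a saída de uma função $H(u)$, e o autograd encadeia os Jacobianos automaticamente, exatamente como nas fórmulas acima. Aqui, $H$ replica componentes de $u$, antecipando o compartilhamento de peso descrito na próxima seção.

```python
import torch

# parâmetro "elementar" u (apenas 2 valores livres)
u = torch.randn(2, requires_grad=True)

def H(u):
    # replica cada componente de u em dois pesos: w1 = w2 = u1, w3 = w4 = u2
    return torch.stack([u[0], u[0], u[1], u[1]])

x = torch.randn(4)           # entrada de G
y_alvo = torch.tensor(1.0)   # alvo fictício

w = H(u)                     # w é função de u
y = torch.dot(w, x)          # G(x, w): um modelo linear mínimo
custo = (y - y_alvo) ** 2    # C(y, ybar)

custo.backward()             # a retropropagação atravessa H: dC/du = (dH/du)^T (dC/dw)^T
print(u.grad)                # gradiente em relação a u (os gradientes dos pesos replicados se somam)
```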
+### Uma transformação de parâmetro simples: compartilhamento de peso

Uma transformação de compartilhamento de peso significa que $H(u)$ apenas replica um componente de $u$ em vários componentes de $w$. $H(u)$ funciona como uma divisão em **Y**, copiando $u_1$ para $w_1$ e $w_2$. Isso pode ser expresso como:

$$
w_1 = w_2 = u_1, w_3 = w_4 = u_2
$$

Forçamos os parâmetros compartilhados a serem iguais, então o gradiente em relação aos parâmetros compartilhados será somado na retropropagação. Por exemplo, o gradiente da função de custo $C(y, \bar y)$ em relação a $u_1$ será a soma do gradiente da função de custo $C(y, \bar y)$ em relação a $w_1$ com o gradiente da função de custo $C(y, \bar y)$ em relação a $w_2$.

### Hiper-rede

Uma hiper-rede é uma rede em que os pesos de uma rede são a saída de outra rede. A Figura 6 mostra o gráfico de computação de uma "hiper-rede". Aqui, a função $H$ é uma rede com vetor de parâmetros $u$ e entrada $x$. Como resultado, os pesos de $G(x,w)$ são configurados dinamicamente pela rede $H(x,u)$. Embora seja uma ideia antiga, continua muito poderosa.
Network
+Fig. 6 "Hypernetwork"
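Abaixo, um esboço hipotético em PyTorch da ideia da Fig. 6 (os nomes `HiperRede`, `gerador` e as dimensões são escolhas ilustrativas minhas): os pesos de $G(x,w)$ não são parâmetros livres, mas a saída de outra rede $H(x,u)$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HiperRede(nn.Module):
    """H(x, u): produz os pesos de uma camada linear de G a partir da entrada x."""
    def __init__(self, dim_x=8, dim_saida=4):
        super().__init__()
        self.dim_x, self.dim_saida = dim_x, dim_saida
        self.gerador = nn.Linear(dim_x, dim_saida * dim_x)  # os parâmetros u vivem aqui

    def forward(self, x):
        # pesos de G configurados dinamicamente em função de x
        return self.gerador(x).view(self.dim_saida, self.dim_x)

h = HiperRede()
x = torch.randn(8)
w = h(x)              # pesos gerados por H(x, u)
y = F.linear(x, w)    # G(x, w): camada linear cujos pesos vêm da hiper-rede
print(y.shape)        # torch.Size([4])
```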
+ + + +### Detecção de motivos em dados sequenciais + + + +A transformação de compartilhamento de peso pode ser aplicada à detecção de motivos. A detecção de motivos significa encontrar alguns motivos em dados sequenciais, como palavras-chave em voz ou texto. Uma maneira de conseguir isso, conforme mostrado na Figura 7, é usar uma janela deslizante de dados, que move a função de divisão de peso para detectar um motivo específico (*ou seja* um determinado som no sinal de fala) e as saídas (*ou seja,* uma pontuação) vai para uma função máxima. + + + +
Network
+Fig. 7 Detecção de Motivos para Dados Sequenciais
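Um esboço ilustrativo (tamanhos e nomes são suposições minhas) da janela deslizante da Fig. 7: o mesmo vetor de pesos $w$ é aplicado em cinco posições da sequência (compartilhamento de peso) e as pontuações passam por uma função máximo; ao final, o acumulador de gradientes é zerado, que é o papel do `zero_grad()` mencionado a seguir.

```python
import torch

torch.manual_seed(0)
x = torch.randn(9)                      # sinal 1D (p.ex., uma janela de áudio)
w = torch.randn(5, requires_grad=True)  # "modelo" do motivo, com pesos compartilhados

# 5 janelas deslizantes que usam os MESMOS pesos w
pontuacoes = torch.stack([torch.dot(w, x[i:i + 5]) for i in range(5)])
deteccao = pontuacoes.max()             # função máximo sobre as posições

deteccao.backward()                     # os gradientes das janelas que usam w se acumulam em w.grad
print(pontuacoes, deteccao)

w.grad.zero_()                          # zera o acumulador antes do próximo passo (papel do zero_grad())
```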
+ + + +Neste exemplo, temos 5 dessas funções. Como resultado dessa solução, somamos cinco gradientes e retropropagamos o erro para atualizar o parâmetro $w$. Ao implementar isso no PyTorch, queremos evitar o acúmulo implícito desses gradientes, então precisamos usar `zero_grad()` para inicializar o gradiente. + + + +### Detecção de motivos em imagens + + + +A outra aplicação útil é a detecção de motivos em imagens. Normalmente, passamos nossos "modelos" sobre as imagens para detectar as formas, independentemente da posição e distorção das formas. Um exemplo simples é distinguir entre "C" e "D", como mostra a Figura 8. A diferença entre "C" e "D" é que "C" tem dois pontos finais e "D" tem dois cantos. Assim, podemos projetar "modelos de terminal" e "modelos de canto". Se a forma for semelhante aos "modelos", ele terá saídas limitadas. Então, podemos distinguir as letras dessas saídas, somando-as. Na Figura 8, a rede detecta dois pontos finais e zero cantos, portanto, ativa "C". + + + +
Network
+Fig. 8 Detecção de motivos para imagens
+ + + +Também é importante que o nosso "modelo de correspondência" seja invariante ao deslocamento - quando mudamos a entrada, a saída (*ou seja,* a letra detectada) não deve mudar. Isso pode ser resolvido com a transformação do compartilhamento de peso. Como mostra a Figura 9, quando mudamos a localização de "D", ainda podemos detectar os motivos dos cantos, embora eles estejam deslocados. Ao somarmos os motivos, ele ativará a detecção "D". + + + +
Network
+Fig. 9 Invariância de deslocamento
+ + + +Este método artesanal de usar detectores locais e soma para reconhecimento de dígitos foi usado por muitos anos. Mas isso nos apresenta o seguinte problema: Como podemos criar esses "modelos" automaticamente? Podemos usar redes neurais para aprender esses "modelos"? A seguir, apresentaremos o conceito de **convoluções**, ou seja, a operação que usamos para combinar imagens com "modelos". + + + +## Convolução discreta + + + +### Convolução + + + +A definição matemática precisa de uma convolução no caso unidimensional entre a entrada $x$ e $w$ é: + + + +$$y_i = \sum_j w_j x_{i-j}$$ + + + +Em palavras, a $i$-ésima saída é calculada como o produto escalar entre o $w$ **invertido** e uma janela do mesmo tamanho em $x$. Para calcular a saída completa, inicie a janela no início, avance esta janela um passo de cada vez e repita até que $x$ se esgote. + + + +### Correlação cruzada + + + +Na prática, a convenção adotada em estruturas de aprendizado profundo, como o PyTorch, é um pouco diferente. Na implementação das convoluções no PyTorch, $w$ **não é invertido**: + + + +$$y_i = \sum_j w_j x_{i+j}$$ + + + +Os matemáticos chamam essa formulação de "correlação cruzada". Em nosso contexto, essa diferença é apenas uma diferença na convenção. Praticamente, correlação cruzada e convolução podem ser intercambiáveis ​​se alguém ler os pesos armazenados na memória para frente ou para trás. + + + +Estar ciente dessa diferença é importante, por exemplo, quando se deseja fazer uso de certas propriedades matemáticas de convolução / correlação de textos matemáticos. + + + +### Convolução dimensional superior + + + +Para entradas bidimensionais, como imagens, usamos a versão bidimensional da convolução: + + + +$$y_{ij} = \sum_{kl} w_{kl} x_{i+k, j+l}$$ + + + +Essa definição pode ser facilmente estendida além de duas dimensões para três ou quatro dimensões. Aqui $w$ é chamado de *kernel de convolução* + + + +### Torções regulares que podem ser feitas com o operador convolucional em DCNNs + + + +1. **Striding**: em vez de mudar a janela em $x$ uma entrada por vez, pode-se fazer isso com um passo maior (por exemplo, duas ou três entradas por vez). +Exemplo: suponha que a entrada $x$ seja unidimensional e tenha tamanho 100 e $w$ tenha tamanho 5. O tamanho de saída com uma passada de 1 ou 2 é mostrado na tabela abaixo: + + + +| Stride | 1 | 2 | +| ------------ | -------------------------- | -------------------------- | +| Tamanho da saída: | $\frac{100 - (5-1)}{1}=96$ | $\frac{100 - (5-1)}{2}=48$ | + + + +2. **Preenchimento (Padding)**: Muito frequentemente, ao projetar arquiteturas de Redes Neurais Profundas, queremos que a saída de convolução seja do mesmo tamanho que a entrada. Isso pode ser obtido preenchendo as extremidades da entrada com um número de entradas (normalmente) de zero, geralmente em ambos os lados. O enchimento é feito principalmente por conveniência. Isso às vezes pode afetar o desempenho e resultar em efeitos de borda estranhos, ou seja, ao usar uma não linearidade ReLU, o preenchimento de zero não é irracional. + + + +## Redes Neurais de Convolução Profunda (DCNNs) + + + +Conforme descrito anteriormente, as redes neurais profundas são normalmente organizadas como alternância repetida entre operadores lineares e camadas não lineares pontuais. Em redes neurais convolucionais, o operador linear será o operador de convolução descrito acima. Também existe um terceiro tipo opcional de camada, denominado camada de pool. 
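Aproveitando que o operador de convolução foi definido acima, segue um esboço numérico (hipotético) que confere essas definições no PyTorch: a "convolução" do PyTorch implementa, na verdade, a correlação cruzada, e os tamanhos de saída com *stride* 1 e 2 batem com a tabela (entrada 100, kernel 5 → 96 e 48).

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 100)   # (lote, canais, comprimento)
w = torch.randn(1, 1, 5)     # kernel de tamanho 5

# correlação cruzada (o que F.conv1d calcula): y_i = soma_j w_j x_{i+j}
y1 = F.conv1d(x, w, stride=1)
y2 = F.conv1d(x, w, stride=2)
print(y1.shape[-1], y2.shape[-1])   # 96 e 48, como na tabela acima

# a convolução "matemática" (y_i = soma_j w_j x_{i-j}) equivale a inverter o kernel
y_conv = F.conv1d(x, w.flip(-1))

# preenchimento com zeros para manter a saída do mesmo tamanho da entrada
y_pad = F.conv1d(x, w, padding=2)   # padding = (tamanho do kernel - 1) / 2
print(y_pad.shape[-1])              # 100
```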
+A razão para empilhar várias dessas camadas é que queremos construir uma representação hierárquica dos dados. As CNNs não precisam se limitar a processar imagens; elas também foram aplicadas com sucesso à fala e à linguagem. Tecnicamente, elas podem ser aplicadas a qualquer tipo de dado que venha na forma de arrays, desde que esses arrays satisfaçam certas propriedades.

Por que desejaríamos capturar a representação hierárquica do mundo? Porque o mundo em que vivemos é composicional. Este ponto foi mencionado nas seções anteriores. Essa natureza hierárquica pode ser observada a partir do fato de que pixels locais se reúnem para formar motivos simples, como bordas orientadas. Essas bordas, por sua vez, são montadas para formar características locais, como cantos, junções em T, etc. Essas características, por sua vez, são combinadas em motivos ainda mais abstratos. Podemos continuar construindo sobre essas representações hierárquicas para, eventualmente, formar os objetos que observamos no mundo real.
CNN Features
+Figura 10. Visualização de recurso de rede convolucional treinada em ImageNet de [Zeiler & Fergus 2013]
+ + + +Essa natureza composicional e hierárquica que observamos no mundo natural não é, portanto, apenas o resultado de nossa percepção visual, mas também verdadeira no nível físico. No nível mais baixo de descrição, temos partículas elementares, que se agrupam para formar átomos, átomos juntos formam moléculas. Continuamos a desenvolver esse processo para formar materiais, partes de objetos e, eventualmente, objetos completos no mundo físico. + + + +A natureza composicional do mundo pode ser a resposta à pergunta retórica de Einstein sobre como os humanos entendem o mundo em que vivem: + + + +> A coisa mais incompreensível sobre o universo é que ele é compreensível. + + + +O fato de os humanos entenderem o mundo graças a essa natureza composicional ainda parece uma conspiração para Yann. No entanto, argumenta-se que, sem composicionalidade, será necessário ainda mais magia para os humanos compreenderem o mundo em que vivem. Citando o grande matemático Stuart Geman: + + + +> O mundo é composicional ou Deus existe. + + + +## [Inspirações na Biologia](https://www.youtube.com/watch?v=FW5gFiJb-ig&t=2254s) + + + +Então, por que o Deep Learning deveria estar enraizado na ideia de que nosso mundo é compreensível e tem uma natureza composicional? A pesquisa conduzida por Simon Thorpe ajudou a motivar isso ainda mais. Ele mostrou que a maneira como reconhecemos objetos do cotidiano é extremamente rápida. Seus experimentos envolviam gerar um conjunto de imagens a cada 100 ms e, em seguida, pedir aos usuários que identificassem essas imagens, o que eles foram capazes de fazer com sucesso. Isso demonstrou que leva cerca de 100 ms para os humanos detectarem objetos. Além disso, considere o diagrama abaixo, ilustrando partes do cérebro anotadas com o tempo que leva para os neurônios se propagarem de uma área para a próxima: + + + +
Simon_Thorpe
+ + + +
Figura 11. Modelo de Simon Thorpe de fluxo de informações visuais no cérebro
+ + + +Os sinais passam da retina para o LGN (ajuda com aumento de contraste, controle de porta, etc.), em seguida, para o córtex visual primário V1, V2, V4 e, em seguida, para o córtex inferotemporal (PIT), que é a parte do cérebro onde categorias são definidas. As observações da cirurgia de cérebro aberto mostraram que, se você mostrar um filme a um humano, os neurônios no PIT serão disparados apenas quando detectarem certas imagens - como Jennifer Aniston ou a avó de uma pessoa - e nada mais. Os disparos neurais são invariáveis ​​a coisas como posição, tamanho, iluminação, orientação de sua avó, o que ela está vestindo, etc. + + + +Além disso, os tempos de reação rápidos com os quais os humanos foram capazes de categorizar esses itens - apenas o tempo suficiente para alguns picos passarem - demonstra que é possível fazer isso sem tempo adicional gasto em cálculos recorrentes complexos. Em vez disso, este é um único processo de feed-forward. + + + +Esses insights sugeriram que poderíamos desenvolver uma arquitetura de rede neural que é completamente feed-forward, mas ainda capaz de resolver o problema de reconhecimento, de uma forma que é invariável para transformações irrelevantes de entrada. + + + +Um outro insight do cérebro humano vem de Gallant & Van Essen, cujo modelo do cérebro humano ilustra duas vias distintas: + + + +
Gallant_and_Van_Essen
+ + + +
Figura 12. Modelo de Gallant e Van Essen das vias dorsais e ventrais no cérebro
+ + + +O lado direito mostra a via ventral, que indica o que você está olhando, enquanto o lado esquerdo mostra a via dorsal, que identifica localizações, geometria e movimento. Eles parecem bastante separados no córtex visual humano (e primata) (com algumas interações entre eles, é claro). + + + +### Contribuições de Hubel & Weisel (1962) + + + +
Hubel_and_Weisel
+ + + +
Figura 13. Experimentos de Hubel e Wiesel com estímulos visuais em cérebros de gatos
+ + + +Os experimentos de Hubel e Weisel usaram eletrodos para medir disparos neurais em cérebros de gatos em resposta a estímulos visuais. Eles descobriram que os neurônios na região V1 são sensíveis apenas a certas áreas de um campo visual (chamadas de "campos receptivos") e detectam bordas orientadas nessa área. Por exemplo, eles demonstraram que se você mostrasse ao gato uma barra vertical e começasse a girá-la, em um determinado ângulo o neurônio dispararia. Da mesma forma, conforme a barra se afasta desse ângulo, a ativação do neurônio diminui. Hubel & Weisel denominaram esses neurônios seletivos de ativação de "células simples", por sua capacidade de detectar características locais. + + + +Eles também descobriram que se você mover a barra para fora do campo receptivo, aquele neurônio específico não dispara mais, mas outro neurônio o faz. Existem detectores de características locais que correspondem a todas as áreas do campo visual, daí a ideia de que o cérebro humano processa informações visuais como uma coleção de "circunvoluções". + + + +Outro tipo de neurônio, que eles chamaram de "células complexas", agregam a saída de várias células simples dentro de uma determinada área. Podemos pensar nisso como o cálculo de um agregado das ativações usando uma função como máximo, soma, soma dos quadrados ou qualquer outra função que não dependa da ordem. Essas células complexas detectam bordas e orientações em uma região, independentemente de onde esses estímulos estejam especificamente na região. Em outras palavras, eles são invariáveis ​​ao deslocamento em relação a pequenas variações nas posições da entrada. + + + +### Contribuições de Fukushima (1982) + + + +
Fukushima
+ + + +
Figura 14. Modelo CNN de Fukushima
+ + + +Fukushima foi o primeiro a implementar a ideia de múltiplas camadas de células simples e células complexas com modelos de computador, usando um conjunto de dados de dígitos escritos à mão. Alguns desses detectores de recursos foram feitos à mão ou aprendidos, embora o aprendizado usasse algoritmos de agrupamento não supervisionados, treinados separadamente para cada camada, já que a retropropagação ainda não estava em uso. + + + +Yann LeCun veio alguns anos depois (1989, 1998) e implementou a mesma arquitetura, mas desta vez os treinou em um ambiente supervisionado usando retropropagação. Isso é amplamente considerado como a gênese das redes neurais convolucionais modernas. (Observação: Riesenhuber no MIT em 1999 também redescobriu essa arquitetura, embora ele não tenha usado a retropropagação.) diff --git a/docs/pt/week03/03-2.md b/docs/pt/week03/03-2.md new file mode 100644 index 000000000..8b178c8ba --- /dev/null +++ b/docs/pt/week03/03-2.md @@ -0,0 +1,476 @@ +--- +lang: pt +lang-ref: ch.03-2 +lecturer: Yann LeCun +title: Evolução, Arquiteturas, Detalhes de Implementação e Vantagens das Redes Convolucionais. +authors: Chris Ick, Soham Tamba, Ziyu Lei, Hengyu Tang +date: 10 Feb 2020 +translation-date: 14 Nov 2021 +translator: Leon Solon +--- + + + + +## [Proto-CNNs e evolução para CNNs modernas](https://www.youtube.com/watch?v=FW5gFiJb-ig&t=2949s) + + + + +### Redes neurais protoconvolucionais em pequenos conjuntos de dados + + + +Inspirado pelo trabalho de Fukushima na modelagem do córtex visual, o uso da hierarquia celular simples / complexa combinada com treinamento supervisionado e retropropagação levou ao desenvolvimento da primeira CNN na Universidade de Toronto em '88-'89 pelo Prof. Yann LeCun. Os experimentos usaram um pequeno conjunto de dados de 320 dígitos 'escritos no mouse'. Os desempenhos das seguintes arquiteturas foram comparados: + + + +1. Camada FC única (totalmente conectada) +2. Duas camadas FC +3. Camadas conectadas localmente sem pesos compartilhados +4. Rede restrita com pesos compartilhados e conexões locais +5. Rede restrita com pesos compartilhados e conexões locais 2 (mais mapas de recursos) + + + +As redes mais bem-sucedidas (rede restrita com pesos compartilhados) tiveram a generalização mais forte e formam a base para as CNNs modernas. Enquanto isso, a camada FC simples tende a se ajustar demais. + + + + +### Primeiras ConvNets "reais" na Bell Labs + + + +Depois de se mudar para o Bell Labs, a pesquisa de LeCunn mudou para o uso de códigos postais manuscritos dos Correios dos EUA para treinar uma CNN maior: + + + +* 256 (16$\times$16) camada de entrada +* 12 5$\times$5 kernels com *stride* 2 (com passos de 2 pixels): a próxima camada tem resolução mais baixa +* **Sem** pooling em separado + + + + +### Arquitetura de rede convolucional com pooling + + + +No ano seguinte, algumas mudanças foram feitas: o pooling separado foi introduzido. O agrupamento separado é feito calculando a média dos valores de entrada, adicionando uma polarização e passando para uma função não linear (função tangente hiperbólica). O pool de 2 $ \ vezes $ 2 foi realizado com um passo de 2, reduzindo, portanto, as resoluções pela metade. + + + +
+
+ Fig. 1 Arquitetura ConvNet +
+Um exemplo de uma única camada convolucional seria o seguinte:
1. Pegue uma entrada com tamanho *32$\times$32*
2. A camada de convolução passa um kernel 5$\times$5 com *stride* 1 sobre a imagem, resultando em um mapa de características de tamanho *28$\times$28*
3. Passe o mapa de características por uma função não linear: tamanho *28$\times$28*
4. Passe para a camada de agrupamento, que calcula a média sobre uma janela de 2$\times$2 com *stride* 2 (passo 2): tamanho *14$\times$14*
5. Repita 1-4 para 4 kernels

As combinações simples de convolução/pooling da primeira camada geralmente detectam características simples, como bordas orientadas. Após a primeira camada de convolução/pooling, o objetivo é detectar combinações de características das camadas anteriores. Para isso, as etapas 2 a 4 são repetidas com vários kernels sobre os mapas de características da camada anterior, e os resultados são somados em um novo mapa de características:

1. Um novo kernel 5$\times$5 desliza sobre todos os mapas de características das camadas anteriores, com os resultados somados. (Observação: no experimento do Prof. LeCun em 1989, a conexão não é completa por questões de custo computacional. As configurações modernas geralmente impõem conexões completas): tamanho *10$\times$10*
2. Passe a saída da convolução por uma função não linear: tamanho *10$\times$10*
3. Repita 1-2 para 16 kernels
4. Passe o resultado para a camada de agrupamento, que calcula a média sobre uma janela de 2$\times$2 com passo 2: tamanho *5$\times$5* para cada mapa de características

Para gerar uma saída, aplica-se a última camada de convolução, que parece uma camada totalmente conectada, mas na verdade é convolucional.

1. A camada de convolução final desliza um kernel 5$\times$5 sobre todos os mapas de características, com os resultados somados: tamanho *1$\times$1*
2. Passe por uma função não linear: tamanho *1$\times$1*
3. Gere a saída única para uma categoria.
4. Repita todas as etapas anteriores para cada uma das 10 categorias (em paralelo)

Veja [esta animação](http://cs231n.github.io/convolutional-networks/) no site de Andrej Karpathy sobre como as convoluções mudam a forma dos mapas de características da camada seguinte. O artigo completo pode ser encontrado [aqui](https://papers.nips.cc/paper/293-handwritten-digit-recognition-with-a-back-propagation-network.pdf).

### Equivariância de deslocamento (Shift equivariance)
+
+ Fig. 2 Equivariância de deslocamento (Shift Equivariance) +
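Um esboço rápido (suposições minhas quanto aos tamanhos) que ilustra a equivariância de deslocamento da Fig. 2: deslocar a entrada desloca o mapa de características na mesma medida, e o agrupamento com *stride* 2 reduz esse deslocamento pela metade.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv1d(1, 1, kernel_size=5, bias=False)
pool = nn.AvgPool1d(2, stride=2)

x = torch.zeros(1, 1, 32)
x[0, 0, 10] = 1.0                    # um "motivo" na posição 10
x_desl = torch.roll(x, 2, dims=-1)   # a mesma entrada deslocada de 2

y, y_desl = conv(x), conv(x_desl)
# o mapa de características é o mesmo, apenas deslocado de 2
print(torch.allclose(y[..., :-2], y_desl[..., 2:]))               # True

# após o agrupamento com stride 2, o deslocamento cai de 2 para 1
print(torch.allclose(pool(y)[..., :-1], pool(y_desl)[..., 1:]))   # True
```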
+Conforme demonstrado pela animação nos slides (aqui está outro exemplo), a translação da imagem de entrada resulta na mesma translação dos mapas de características. No entanto, os deslocamentos nos mapas de características são reescalados pelas operações de convolução/agrupamento. Por exemplo, o agrupamento de 2$\times$2 com *stride* 2 (passo de 2) reduz um deslocamento de 1 pixel na camada de entrada para um deslocamento de 0,5 pixel nos mapas de características seguintes. A resolução espacial é, então, trocada por um número maior de tipos de características, *ou seja,* a representação torna-se mais abstrata e menos sensível a deslocamentos e distorções.

### Análise geral da arquitetura

A arquitetura genérica de uma CNN pode ser dividida em vários arquétipos de camadas básicas:

* **Normalização**
  * Ajuste de clareamento (whitening) (opcional)
  * Métodos subtrativos, *por exemplo,* remoção da média, filtragem passa-alta
  * Divisivos: normalização de contraste local, normalização de variância

* **Bancos de filtros**
  * Aumentam a dimensionalidade
  * Projeção em uma base supercompleta
  * Detecção de bordas

* **Não linearidades**
  * Esparsificação
  * Unidade Linear Retificada (ReLU): $\text{ReLU}(x) = \max(x, 0)$

* **Pooling**
  * Agregação a partir de um mapa de características
  * Max Pooling: $\text{MAX}= \text{Max}_i(X_i)$
  * Pooling de Norma Lp: $$\text{L}_p = \left(\sum_{i=1}^n \|X_i\|^p \right)^{\frac{1}{p}}$$
  * Pooling de Probabilidade Logarítmica: $\text{Prob}= \frac{1}{b} \left(\sum_{i=1}^n e^{b X_i} \right)$

## [LeNet5 e reconhecimento de dígitos](https://www.youtube.com/watch?v=FW5gFiJb-ig&t=3830s)

### Implementação da LeNet5 no PyTorch

A LeNet5 consiste nas seguintes camadas (1 sendo a camada superior):

1. Log-softmax
2. Camada totalmente conectada de dimensões 500$\times$10
3. ReLU
4. Camada totalmente conectada de dimensões (4$\times$4$\times$50)$\times$500
5. Agrupamento máximo (max pooling) de dimensões 2$\times$2, *stride* de 2 (passo de 2).
6. ReLU
7. Convolução com 50 canais de saída, kernel 5$\times$5, *stride* de 1.
8. Agrupamento máximo (max pooling) de dimensões 2$\times$2, *stride* de 2.
9. ReLU
10. Convolução com 20 canais de saída, kernel 5$\times$5, *stride* de 1.

A entrada é uma imagem em escala de cinza de 32$\times$32 (1 canal de entrada).

A LeNet5 pode ser implementada em PyTorch com o seguinte código:

```python
import torch.nn as nn
import torch.nn.functional as F

class LeNet5(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 20, 5, 1)
        self.conv2 = nn.Conv2d(20, 50, 5, 1)   # 50 canais de saída, para casar com fc1 (4*4*50)
        self.fc1 = nn.Linear(4*4*50, 500)      # o achatamento 4x4 pressupõe entrada 28x28; com 32x32, os mapas finais seriam 5x5
        self.fc2 = nn.Linear(500, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2, 2)
        x = x.view(-1, 4*4*50)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)

        return F.log_softmax(x, dim=1)
```

Embora `fc1` e `fc2` sejam camadas totalmente conectadas, elas podem ser consideradas camadas convolucionais cujos kernels cobrem toda a entrada. Camadas totalmente conectadas são usadas por questão de eficiência.

O mesmo código pode ser expresso usando `nn.Sequential`, mas essa forma está desatualizada.

## Vantagens das CNNs

Em uma rede totalmente convolucional, não há necessidade de especificar o tamanho da entrada. No entanto, alterar o tamanho da entrada altera o tamanho da saída.

Considere um sistema de reconhecimento de escrita cursiva. Não precisamos quebrar a imagem de entrada em segmentos.
Podemos aplicar a CNN à imagem inteira: os kernels cobrirão todos os locais da imagem e registrarão a mesma saída, independentemente de onde o padrão estiver localizado. Aplicar a CNN sobre uma imagem inteira é muito mais barato do que aplicá-la em vários locais separadamente. Nenhuma segmentação prévia é necessária, o que é um alívio, porque a tarefa de segmentar uma imagem é semelhante à de reconhecer uma imagem.

### Exemplo: MNIST

A LeNet5 é treinada em imagens MNIST de tamanho 32$\times$32 para classificar dígitos individuais no centro da imagem. O aumento de dados (data augmentation) foi aplicado deslocando o dígito, alterando o seu tamanho e inserindo dígitos ao lado. A rede também foi treinada com uma 11ª categoria, que não representa nenhum dos dígitos anteriores. As imagens rotuladas com essa categoria foram geradas produzindo imagens em branco ou colocando dígitos nas laterais, mas não no centro.
+
+ Fig. 3 ConvNet (Rede Convolucional) com janela deslizante +
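Um esboço (arquitetura hipotética, nos moldes da LeNet5 acima) da ideia da Fig. 3: uma rede puramente convolucional aceita entradas maiores sem alteração alguma e produz uma predição por localização.

```python
import torch
import torch.nn as nn

# versão "totalmente convolucional": as camadas finais também são convoluções
rede = nn.Sequential(
    nn.Conv2d(1, 20, 5), nn.ReLU(), nn.MaxPool2d(2, 2),
    nn.Conv2d(20, 50, 5), nn.ReLU(), nn.MaxPool2d(2, 2),
    nn.Conv2d(50, 500, 5), nn.ReLU(),   # faz o papel de fc1
    nn.Conv2d(500, 10, 1),              # faz o papel de fc2 (10 categorias)
)

print(rede(torch.randn(1, 1, 32, 32)).shape)  # torch.Size([1, 10, 1, 1]): uma única predição
print(rede(torch.randn(1, 1, 32, 64)).shape)  # torch.Size([1, 10, 1, 9]): predições em vários locais
```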
+A Fig. 3 demonstra que uma rede LeNet5 treinada em entradas de 32$\times$32 pode ser aplicada a uma imagem de entrada de 32$\times$64 para reconhecer o dígito em vários locais.

## [Problema de vinculação de características](https://www.youtube.com/watch?v=FW5gFiJb-ig&t=4827s)

### Qual é o problema de vinculação de características (Feature Binding)?

Neurocientistas da visão e especialistas em visão computacional enfrentam o problema de definir um objeto como objeto. Um objeto é uma coleção de características, mas como vincular todas as características para formar esse objeto?

### Como solucionar?

Podemos resolver esse problema de vinculação de características usando uma CNN muito simples: apenas duas camadas de convolução com agrupamentos, mais outras duas camadas totalmente conectadas, sem nenhum mecanismo específico para isso, visto que temos não linearidades e dados suficientes para treinar nossa CNN.
+
+ Fig. 4 ConvNet solucionando a vinculação de características (Feature Binding) +
+A animação acima mostra a capacidade da CNN de reconhecer diferentes dígitos movendo um único traço, demonstrando sua capacidade de resolver o problema de vinculação de características, *ou seja,* de reconhecer características de forma hierárquica e composicional.

### Exemplo: comprimento de entrada dinâmico

Podemos construir uma CNN com 2 camadas de convolução com passo 1 e 2 camadas de pooling com passo 2, de modo que o passo global seja 4. Assim, se quisermos obter uma nova saída, precisamos deslocar nossa janela de entrada em 4. Para ser mais explícito, podemos ver a figura abaixo (unidades verdes). Primeiro, temos uma entrada de tamanho 10 e realizamos uma convolução de tamanho 3 para obter 8 unidades. Depois disso, realizamos um pooling de tamanho 2 para obter 4 unidades. Da mesma forma, repetimos a convolução e o agrupamento e, eventualmente, obtemos 1 saída.
+
+ Fig. 5 Arquitetura ConvNet na vinculação de tamanho de entrada variante +
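O esboço abaixo (nomes hipotéticos) confere os números do exemplo da Fig. 5: duas convoluções de tamanho 3 com passo 1 e dois agrupamentos de tamanho 2 com passo 2 resultam em um passo global de 4 — uma entrada de tamanho 10 produz 1 saída e uma de tamanho 14 produz 2.

```python
import torch
import torch.nn as nn

rede = nn.Sequential(
    nn.Conv1d(1, 1, kernel_size=3, stride=1),
    nn.MaxPool1d(2, stride=2),
    nn.Conv1d(1, 1, kernel_size=3, stride=1),
    nn.MaxPool1d(2, stride=2),
)

print(rede(torch.randn(1, 1, 10)).shape)  # torch.Size([1, 1, 1]): 10 -> 8 -> 4 -> 2 -> 1
print(rede(torch.randn(1, 1, 14)).shape)  # torch.Size([1, 1, 2]): 14 -> 12 -> 6 -> 4 -> 2
```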
+ + + +Vamos supor que adicionamos 4 unidades na camada de entrada (unidades rosa acima), de modo que possamos obter mais 4 unidades após a primeira camada de convolução, mais 2 unidades após a primeira camada de agrupamento, mais 2 unidades após a segunda camada de convolução e 1 mais saída. Portanto, o tamanho da janela para gerar uma nova saída é 4 (*stride* de 2 $\times$2). Além disso, esta é uma demonstração de que se aumentarmos o tamanho da entrada, aumentaremos o tamanho de cada camada, comprovando a capacidade das CNNs em lidar com entradas de comprimento dinâmico. + + + + +## No que as CNNs são boas + + + +CNNs são bons para sinais naturais que vêm na forma de matrizes multidimensionais e têm três propriedades principais: +1. **Localidade**: O primeiro é que existe uma forte correlação local entre os valores. Se pegarmos dois pixels próximos de uma imagem natural, é muito provável que esses pixels tenham a mesma cor. À medida que dois pixels se tornam mais distantes, a semelhança entre eles diminui. As correlações locais podem nos ajudar a detectar características locais, que é o que as CNNs estão fazendo. Se alimentarmos a CNN com pixels permutados, ela não terá um bom desempenho no reconhecimento das imagens de entrada, enquanto o FC não será afetado. A correlação local justifica conexões locais. +2. **Estacionaridade**: O segundo caráter é que as características são essenciais e podem aparecer em qualquer lugar da imagem, justificando os pesos compartilhados e pooling. Além disso, os sinais estatísticos são distribuídos uniformemente, o que significa que precisamos repetir a detecção do recurso para cada local na imagem de entrada. +3. **Composicionalidade**: O terceiro caráter é que as imagens naturais são composicionais, ou seja, os recursos compõem uma imagem de forma hierárquica. Isso justifica o uso de múltiplas camadas de neurônios, o que também corresponde intimamente às pesquisas de Hubel e Weisel sobre células simples e complexas. + + + +Além disso, as pessoas fazem bom uso das CNNs em vídeos, imagens, textos e reconhecimento de fala. \ No newline at end of file diff --git a/docs/pt/week03/03-3.md b/docs/pt/week03/03-3.md new file mode 100644 index 000000000..2adb70108 --- /dev/null +++ b/docs/pt/week03/03-3.md @@ -0,0 +1,285 @@ +--- +lang: pt +lang-ref: ch.03-3 +title: Propriedades dos sinais naturais +lecturer: Alfredo Canziani +authors: Ashwin Bhola, Nyutian Long, Linfeng Zhang, and Poornima Haridas +date: 11 Feb 2020 +translation-date: 14 Nov 2021 +translator: Leon Solon +--- + + + + +## [Propriedades dos sinais naturais](https://www.youtube.com/watch?v=kwPWpVverkw&t=26s) + + + +Todos os sinais podem ser considerados vetores. Como exemplo, um sinal de áudio é um sinal 1D $\boldsymbol{x} = [x_1, x_2, \cdots, x_T]$ onde cada valor $x_t$ representa a amplitude da forma de onda no tempo $ t $. Para entender o que alguém está falando, sua cóclea primeiro converte as vibrações da pressão do ar em sinais e, em seguida, seu cérebro usa um modelo de linguagem para converter esse sinal em uma linguagem *ou seja,* ele precisa escolher a expressão mais provável dado o sinal. Para música, o sinal é estereofônico, com 2 ou mais canais para dar a ilusão de que o som vem de várias direções. Embora tenha 2 canais, ainda é um sinal 1D porque o tempo é a única variável ao longo da qual o sinal está mudando. + + + +Uma imagem é um sinal 2D porque a informação é representada espacialmente. Observe que cada ponto pode ser um vetor em si. 
Isso significa que se temos $d$ canais em uma imagem, cada ponto espacial na imagem é um vetor de dimensão $d$. Uma imagem colorida tem planos RGB, o que significa $d = 3$. Para qualquer ponto $x_{i,j}$, isso corresponde à intensidade das cores vermelha, verde e azul, respectivamente. + + + +Podemos até representar a linguagem com a lógica acima. Cada palavra corresponde a um vetor one-hot com um na posição em que ocorre em nosso vocabulário e zeros em todas as outras. Isso significa que cada palavra é um vetor do tamanho do vocabulário. + + + +Os sinais de dados naturais seguem estas propriedades: +1. Estacionariedade: Certos motivos se repetem ao longo de um sinal. Em sinais de áudio, observamos o mesmo tipo de padrões repetidamente em todo o domínio temporal. Em imagens, isso significa que podemos esperar que padrões visuais semelhantes se repitam em toda a dimensionalidade. +2. Localidade: os pontos próximos são mais correlacionados do que os pontos distantes. Para o sinal 1D, isso significa que se observarmos um pico em algum ponto $t_i$, esperamos que os pontos em uma pequena janela em torno de $t_i$ tenham valores semelhantes a $t_i$, mas para um ponto $t_j$ longe de $t_i$, $x_{t_i}$ tem muito menos influência em $x_{t_j}$. Mais formalmente, a convolução entre um sinal e sua contraparte invertida tem um pico quando o sinal está perfeitamente sobreposto à sua versão invertida. Uma convolução entre dois sinais 1D (correlação cruzada) nada mais é do que seu produto escalar, que é uma medida de quão semelhantes ou próximos os dois vetores são. Assim, a informação está contida em porções e partes específicas do sinal. Para imagens, isso significa que a correlação entre dois pontos em uma imagem diminui à medida que afastamos os pontos. Se $x_{0,0}$ pixel for azul, a probabilidade de que o próximo pixel ($x_{1,0}, x_{0,1}$) também seja azul é muito alta, mas conforme você se move para a extremidade oposta da imagem ($x_{- 1, -1}$), o valor deste pixel é independente do valor do pixel em $x_{0,0}$. +3. Composicionalidade: Tudo na natureza é composto de partes que são compostas de sub-partes e assim por diante. Por exemplo, os caracteres formam cadeias de caracteres que formam palavras, que também formam frases. As frases podem ser combinadas para formar documentos. A composicionalidade permite que o mundo seja explicável. + + + +Se nossos dados exibem estacionariedade, localidade e composicionalidade, podemos explorá-los com redes que usam dispersão, compartilhamento de peso e empilhamento de camadas. + + + +## [Explorando propriedades de sinais naturais para construir invariância e equivariância](https://www.youtube.com/watch?v=kwPWpVverkw&t=1074s) + + + +### Localidade $\Rightarrow$ esparcidade + + + +A Fig.1 mostra uma rede de 5 camadas totalmente conectada. Cada seta representa um peso a ser multiplicado pelas entradas. Como podemos ver, essa rede é muito cara em termos computacionais. + + + +

+ Figura 1: Rede totalmente conectada
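Para quantificar o custo mencionado acima (os tamanhos são suposições minhas), o esboço abaixo compara o número de parâmetros de uma camada totalmente conectada com o de uma camada convolucional, que explora localidade (conexões esparsas) e estacionariedade (pesos compartilhados):

```python
import torch.nn as nn

fc = nn.Linear(256, 256)                           # conexões densas: cada saída vê todas as entradas
conv = nn.Conv1d(1, 1, kernel_size=3, padding=1)   # conexões locais + pesos compartilhados

n_fc = sum(p.numel() for p in fc.parameters())
n_conv = sum(p.numel() for p in conv.parameters())
print(n_fc, n_conv)   # 65792 contra 4 (3 pesos do kernel + 1 viés)
```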
+ + + +Se nossos dados exibem localidade, cada neurônio precisa ser conectado a apenas alguns neurônios locais da camada anterior. Assim, algumas conexões podem ser interrompidas, conforme mostrado na Fig.2. A Fig.2 (a) representa uma rede FC. Aproveitando a propriedade de localidade de nossos dados, eliminamos as conexões entre neurônios distantes na Fig.2 (b). Embora os neurônios da camada oculta (verde) na Fig.2 (b) não abranjam toda a entrada, a arquitetura geral será capaz de dar conta de todos os neurônios de entrada. O campo receptivo (RF) é o número de neurônios das camadas anteriores, que cada neurônio de uma determinada camada pode ver ou levou em consideração. Portanto, o RF da camada de saída com a camada oculta é 3, o RF da camada oculta com a camada de entrada é 3, mas o RF da camada de saída com a camada de entrada é 5. + + + +| | | +|Figura 2 (a):Antes de aplicar a esparcidade | Figura 2(b): Após a aplicação da esparcidade | + + + +### Estacionariedade $\Rightarrow$ Compartilhamento de parâmetros + + + +Se nossos dados exibirem estacionariedade, poderíamos usar um pequeno conjunto de parâmetros várias vezes na arquitetura da rede. Por exemplo, em nossa rede esparsa, Fig.3 (a), podemos usar um conjunto de 3 parâmetros compartilhados (amarelo, laranja e vermelho). O número de parâmetros cairá de 9 para 3! A nova arquitetura pode até funcionar melhor porque temos mais dados para treinar esses pesos específicos. +Os pesos após a aplicação de dispersão e compartilhamento de parâmetros são chamados de kernel de convolução. + + + +| | | Figura 3 (a): Antes de Aplicar o Compartilhamento de Parâmetro | Figura 3 (b): Após aplicar o compartilhamento de parâmetro | + + + +A seguir estão algumas vantagens de usar esparsidade e compartilhamento de parâmetros: - +* Compartilhamento de parâmetros + * convergência mais rápida + * melhor generalização + * não restrito ao tamanho de entrada + * independência do kernel $\Rightarrow$ alta paralelização +* Esparsidade de conexão + * quantidade reduzida de computação + + + +A Fig.4 mostra um exemplo de kernels em dados 1D, onde o tamanho do kernel é: 2(número de kernels) * 7(espessura da camada anterior) * 3(número de conexões / pesos únicos). + + + +A escolha do tamanho do kernel é empírica. A convolução 3 * 3 parece ser o tamanho mínimo para dados espaciais. A convolução de tamanho 1 pode ser usada para obter uma camada final que pode ser aplicada a uma imagem de entrada maior. O tamanho do kernel de número par pode diminuir a qualidade dos dados, portanto, sempre temos o tamanho do kernel de números ímpares, geralmente 3 ou 5. + + + +| | | +| Figura 4 (a): Kernels em dados 1D | Figura 4 (b): Dados com Preenchimento com Zeros | + + + +### Preenchimento (Padding) + + + +O preenchimento (padding) geralmente prejudica os resultados finais, mas é conveniente programaticamente. Normalmente usamos preenchimento com zeros (zero-padding): `tamanho = (tamanho do kernel - 1)/2`. + + + +### CNN espacial padrão + + + +Uma CNN espacial padrão tem as seguintes propriedades: + + + +* Múltiplas camadas + * Convolução + * Não linearidade (ReLU e Leaky) + * Pooling + * Normalização em lote (batch normalization) +* Conexão de bypass residual + + + +A normalização em lote e as conexões de bypass residuais são muito úteis para fazer com que a rede treine bem. 
+Partes de um sinal podem se perder se muitas camadas forem empilhadas, portanto, conexões adicionais via bypass residual garantem um caminho de baixo para cima e também um caminho para gradientes vindo de cima para baixo. + + + +Na Fig.5, enquanto a imagem de entrada contém principalmente informações espaciais em duas dimensões (além das informações características, que são a cor de cada pixel), a camada de saída é espessa. No meio do caminho, há uma troca entre as informações espaciais e as informações características e a representação torna-se mais densa. Portanto, à medida que subimos na hierarquia, obtemos uma representação mais densa à medida que perdemos as informações espaciais. + + + +

+ Figura 5: Representações de informações subindo na hierarquia
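Um esboço ilustrativo (arquitetura hipotética) da troca descrita acima: ao subir na hierarquia, a resolução espacial cai enquanto o número de canais (informação de características) cresce.

```python
import torch
import torch.nn as nn

blocos = nn.Sequential(
    nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
    nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
    nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
)

x = torch.randn(1, 3, 32, 32)   # entrada: poucos canais, muita resolução espacial
for bloco in blocos:
    x = bloco(x)
    print(x.shape)
# torch.Size([1, 16, 16, 16]) -> torch.Size([1, 32, 8, 8]) -> torch.Size([1, 64, 4, 4])
```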
+ + + +### [Pooling](https://www.youtube.com/watch?v=kwPWpVverkw&t=2376s) + + + +

+ Figura 6: Ilustração de pooling
+ + + +Um operador específico, $L_p$-norm, é aplicado a diferentes regiões (consulte a Fig.6). Esse operador fornece apenas um valor por região (1 valor para 4 pixels em nosso exemplo). Em seguida, iteramos sobre todos os dados, região por região, realizando etapas com base no passo. Se começarmos com $m * n$ dados com $c$ canais, terminaremos com $\frac{m}{2} * \frac{n}{2}$ dados ainda com $c$ canais (consulte Fig.7). +O agrupamento não é parametrizado; no entanto, podemos escolher diferentes tipos de sondagem, como pooling máximo, pooling médio e assim por diante. O objetivo principal do agrupamento reduz a quantidade de dados para que possamos computar em um período de tempo razoável. + + + +

+ Figura 7: Agrupando resultados
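Esboço mínimo (tamanhos hipotéticos) do agrupamento descrito acima: uma janela 2$\times$2 com passo 2 reduz dados $m \times n$ para $\frac{m}{2} \times \frac{n}{2}$, mantendo os $c$ canais; entre as variantes estão o agrupamento máximo e o de norma $L_p$.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 64, 64)   # c = 16 canais, dados 64 x 64

max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
lp_pool = nn.LPPool2d(norm_type=2, kernel_size=2, stride=2)   # norma L2 por região

print(max_pool(x).shape)   # torch.Size([1, 16, 32, 32])
print(lp_pool(x).shape)    # torch.Size([1, 16, 32, 32])
```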
+ + + +## CNN - Jupyter Notebook + + + +O Jupyter Notebook pode ser encontrado [aqui](https://github.com/Atcold/pytorch-Deep-Learning/blob/master/06-convnet.ipynb). Para executar o notebook, certifique-se de ter o ambiente `pDL` instalado conforme especificado em [`README.md`](https://github.com/Atcold/pytorch-Deep-Learning/blob/master/README.md) . + + + +Neste notebook, treinamos um perceptron multicamadas (rede totalmente conectada - FC) e uma rede neural convolucional (CNN) para a tarefa de classificação no conjunto de dados MNIST. Observe que ambas as redes têm um número igual de parâmetros. (Fig.8) + + + +

+ Figura 8: instâncias do conjunto de dados MNIST original
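Como referência, um esboço mínimo e autocontido (modelo e dados fictícios, apenas para ilustrar) das cinco operações de treinamento listadas a seguir no texto:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# modelo e lote fictícios, apenas para ilustrar as cinco etapas
modelo = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10), nn.LogSoftmax(dim=1))
otimizador = torch.optim.SGD(modelo.parameters(), lr=0.01)

imagens = torch.randn(64, 1, 28, 28)       # lote fictício (já normalizado)
rotulos = torch.randint(0, 10, (64,))

saida = modelo(imagens)                    # 1. alimentar os dados ao modelo
perda = F.nll_loss(saida, rotulos)         # 2. calcular a perda
otimizador.zero_grad()                     # 3. limpar o cache de gradientes acumulados
perda.backward()                           # 4. calcular os gradientes
otimizador.step()                          # 5. executar uma etapa do otimizador
```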
+ + + +Antes do treinamento, normalizamos nossos dados para que a inicialização da rede corresponda à nossa distribuição de dados (muito importante!). Além disso, certifique-se de que as cinco operações/etapas a seguir estejam presentes em seu treinamento: + + + +1. Alimentando dados para o modelo +2. Calculando a perda +3. Limpar o cache de gradientes acumulados com `zero_grad()` +4. Calculando os gradientes +5. Executar uma etapa no método do otimizador + + + +Primeiro, treinamos ambas as redes nos dados MNIST normalizados. A precisão da rede totalmente conectada acabou sendo $87\%$, enquanto a precisão da CNN acabou sendo $95\%$. Dado o mesmo número de parâmetros, a CNN conseguiu treinar muitos mais filtros. Na rede FC, os filtros que tentam obter algumas dependências entre coisas que estão mais distantes com coisas que estão por perto são treinados. Eles estão completamente perdidos. Em vez disso, na rede convolucional, todos esses parâmetros se concentram na relação entre os pixels vizinhos. + + + +Em seguida, realizamos uma permutação aleatória de todos os pixels em todas as imagens de nosso conjunto de dados MNIST. Isso transforma nossa Fig.8 em Fig.9. Em seguida, treinamos ambas as redes neste conjunto de dados modificado. + + + +

+ Figura 9: instâncias do conjunto de dados MNIST permutado
+ + + +O desempenho da rede totalmente conectada quase permaneceu inalterado ($85\%$), mas a precisão da CNN caiu para $83\%$. Isso ocorre porque, após uma permutação aleatória, as imagens não possuem mais as três propriedades de localidade, estacionariedade e composicionalidade, que são exploradas por uma CNN. + diff --git a/docs/pt/week03/03.md b/docs/pt/week03/03.md new file mode 100644 index 000000000..8e652a19d --- /dev/null +++ b/docs/pt/week03/03.md @@ -0,0 +1,40 @@ +--- +lang: pt +lang-ref: ch.03 +title: Semana 3 +translator: Leon Solon +translation-date: 14 Nov 2021 +--- + + + +## Aula parte A + + + +Iniciamos com a visualização de uma rede neural de 6 camadas. A seguir, começamos com o tópico de Convoluções e Redes Neurais Convolucionais (CNN). Revisamos vários tipos de transformações de parâmetros no contexto de CNNs e apresentamos a ideia de um kernel, que é usado para aprender características de maneira hierárquica. Assim, podemos classificar nossos dados de entrada, que é a ideia básica que motiva o uso de CNNs. + + + +## Aula parte B + + + +Damos uma introdução de como as CNNs evoluíram ao longo do tempo. Discutimos em detalhes diferentes arquiteturas de CNN, incluindo uma implementação moderna de LeNet5 para exemplificar a tarefa de reconhecimento de dígitos no conjunto de dados MNIST. Com base em seus princípios de design, expandimos as vantagens das CNNs, o que nos permite explorar as características de composicionalidade, estacionariedade e localidade de imagens naturais. + + + +## Prática + + + +As propriedades dos sinais naturais que são mais relevantes para as CNNs são discutidas em mais detalhes, a saber: Localidade, Estacionaridade e Composicionalidade. Exploramos precisamente como um kernel explora essas características por meio de dispersão, compartilhamento de peso e empilhamento de camadas, além de motivar os conceitos de preenchimento (padding) e pooling. Finalmente, uma comparação de desempenho entre FCN (Redes Convolucionais Completas) e CNN foi feita para diferentes modalidades de dados. \ No newline at end of file diff --git a/docs/pt/week03/lecture03.sbv b/docs/pt/week03/lecture03.sbv new file mode 100644 index 000000000..a6444f462 --- /dev/null +++ b/docs/pt/week03/lecture03.sbv @@ -0,0 +1,3429 @@ +0:00:04.819,0:00:08.319 +In this case, we have a network which has an input on the left-hand side + +0:00:08.959,0:00:14.259 +Usually you have the input on the bottom side or on the left. They are pink in my slides + +0:00:14.260,0:00:17.409 +So if you take notes, make them pink. No, just kidding! + +0:00:18.400,0:00:23.020 +And then we have... How many activations? How many hidden layers do you count there? + +0:00:23.539,0:00:27.789 +Four hidden layers. So overall how many layers does the network have here? + +0:00:28.820,0:00:32.980 +Six, right? Because we have four hidden, plus one input, plus one output layer + +0:00:33.649,0:00:37.568 +So in this case, I have two neurons per layer, right? + +0:00:37.569,0:00:41.739 +So what does it mean? What are the dimensions of the matrices we are using here? + +0:00:43.339,0:00:46.119 +Two by two. So what does that two by two matrix do? + +0:00:48.739,0:00:51.998 +Come on! You have... You know the answer to this question + +0:00:53.359,0:00:57.579 +Rotation, yeah. Then scaling, then shearing and... + +0:00:59.059,0:01:05.469 +reflection. Fantastic, right? 
So we constrain our network to perform all the operations on the plan + +0:01:05.540,0:01:12.380 +We have seen the first time if I allow the hidden layer to be a hundred neurons long we can... + +0:01:12.380,0:01:13.680 +Wow okay! + +0:01:13.680,0:01:15.680 +We can easily... + +0:01:18.079,0:01:20.079 +Ah fantastic. What is it? + +0:01:21.170,0:01:23.170 +We are watching movies now. I see... + +0:01:24.409,0:01:29.889 +See? Fantastic. What is it? Mandalorian is so cool, no? Okay... + +0:01:32.479,0:01:39.428 +Okay, how nice is this lesson. Is it even recorded? Okay, we have no idea + +0:01:40.789,0:01:43.719 +Okay, give me a sec. Okay, so we go here... + +0:01:47.810,0:01:49.810 +Done + +0:01:50.390,0:01:52.070 +Listen + +0:01:52.070,0:01:53.600 +All right + +0:01:53.600,0:01:59.679 +So we started from this network here, right? Which had this intermediate layer and we forced them to be + +0:02:00.289,0:02:05.229 +2-dimensional, right? Such that all the transformations are enforced to be on a plane + +0:02:05.270,0:02:08.319 +So this is what the network does to our plan + +0:02:08.319,0:02:14.269 +It folds it on specific regions, right? And those foldings are very abrupt + +0:02:14.370,0:02:18.499 +This is because all the transformations are performed on the 2d layer, right? + +0:02:18.500,0:02:22.550 +So this training took me really a lot of effort because the + +0:02:23.310,0:02:25.310 +optimization is actually quite hard + +0:02:25.740,0:02:30.769 +Whenever I had a hundred-neuron hidden layer, that was very easy to train + +0:02:30.770,0:02:35.299 +This one really took a lot of effort and you have to tell me why, okay? + +0:02:35.400,0:02:39.469 +If you don't know the answer right now, you'd better know the answer for the midterm + +0:02:40.470,0:02:43.370 +So you can take note of what are the questions for the midterm... + +0:02:43.980,0:02:49.600 +Right, so this is the final output of the network, which is also that 2d layer + +0:02:50.010,0:02:55.489 +to the embedding, so I have no non-linearity on my last layer. And these are the final + +0:02:56.370,0:03:01.850 +classification regions. So let's see what each layer does. This is the first layer, affine transformation + +0:03:01.850,0:03:06.710 +so it looks like it's a 3d rotation, but it's not right? It's just a 2d rotation + +0:03:07.740,0:03:15.600 +reflection, scaling, and shearing. And then what is this part? Ah, what's happened right now? Do you see? + +0:03:17.820,0:03:21.439 +We have like the ReLU part, which is killing all the negative + +0:03:22.800,0:03:27.079 +sides of the network, right? Sorry, all the negative sides of this + +0:03:28.080,0:03:33.499 +space, right? It is the second affine transformation and then here you apply again + +0:03:34.770,0:03:37.460 +the ReLU, you can see all the negative + +0:03:38.220,0:03:41.149 +subspaces have been erased and they've been set to zero + +0:03:41.730,0:03:44.509 +Then we keep going with a third affine transformation + +0:03:45.120,0:03:46.790 +We zoom it... it's zooming a lot... + +0:03:46.790,0:03:54.469 +And then you're gonna have the ReLU layer which is gonna be killing one of those... all three quadrants, right? + +0:03:54.470,0:03:59.240 +Only one quadrant survives every time. And then we go with the fourth affine transformation + +0:03:59.790,0:04:06.200 +where it's elongating a lot because given that we confine all the transformation to be living in this space + +0:04:06.210,0:04:12.439 +it really needs to stretch and use all the power it can, right? 
Again, this is the + +0:04:13.170,0:04:18.589 +second last. Then we have the last affine transformation, which is the final one. And then we reach finally + +0:04:19.320,0:04:20.910 +linearly separable + +0:04:20.910,0:04:26.359 +regions here. Finally, we're gonna see how each affine transformation can be + +0:04:27.240,0:04:31.759 +split in each component. So we have the rotation, we have now squashing, like zooming + +0:04:32.340,0:04:38.539 +Then we have rotation, reflection because the determinant is minus one, and then we have the final bias + +0:04:38.539,0:04:42.769 +You have the positive part of the ReLU (Rectified Linear Unit), again rotation + +0:04:43.650,0:04:47.209 +flipping because we had a negative, a minus one determinant + +0:04:47.849,0:04:49.849 +Zooming, rotation + +0:04:49.889,0:04:54.258 +One more reflection and then the final bias. This was the second affine transformation + +0:04:54.259,0:04:58.609 +Then we have here the positive part again. We have third layer so rotation, reflection + +0:05:00.000,0:05:05.629 +zooming and then we have... this is SVD decomposition, right? You should be aware of that, right? + +0:05:05.629,0:05:09.799 +You should know. And then final is the translation and the third + +0:05:10.229,0:05:15.589 +ReLU, then we had the fourth layer, so rotation, reflection because the determinant was negative + +0:05:16.169,0:05:18.169 +zooming, again the other rotation + +0:05:18.599,0:05:21.769 +Once more... reflection and bias + +0:05:22.379,0:05:24.559 +Finally a ReLU and then we have the last... + +0:05:25.259,0:05:27.259 +the fifth layer. So rotation + +0:05:28.139,0:05:32.059 +zooming, we didn't have reflection because the determinant was +1 + +0:05:32.490,0:05:37.069 +Again, reflection in this case because the determinant was negative and then finally the final bias, right? + +0:05:37.139,0:05:41.478 +And so this was pretty much how this network, which was + +0:05:42.599,0:05:44.599 +just made of + +0:05:44.759,0:05:46.759 +a sequence of layers of + +0:05:47.159,0:05:52.218 +neurons that are only two neurons per layer, is performing the classification task + +0:05:54.990,0:05:58.159 +And all those transformation have been constrained to be + +0:05:58.680,0:06:03.199 +living on the plane. Okay, so this was really hard to train + +0:06:03.419,0:06:05.959 +Can you figure out why it was really hard to train? + +0:06:06.539,0:06:08.539 +What does it happen if my... + +0:06:09.270,0:06:16.219 +if my bias of one of the four layers puts my points away from the top right quadrant? + +0:06:21.060,0:06:25.519 +Exactly, so if you have one of the four biases putting my + +0:06:26.189,0:06:28.549 +initial point away from the top right quadrant + +0:06:29.189,0:06:34.039 +then the ReLUs are going to be completely killing everything, and everything gets collapsed into zero + +0:06:34.560,0:06:38.399 +Okay? And so there you can't do any more of anything, so + +0:06:38.980,0:06:44.129 +this network here was really hard to train. If you just make it a little bit fatter than... + +0:06:44.320,0:06:48.659 +instead of constraining it to be two neurons for each of the hidden layers + +0:06:48.660,0:06:52.230 +then it is much easier to train. Or you can do a combination of the two, right? + +0:06:52.230,0:06:54.300 +So instead of having just a fat network + +0:06:54.300,0:07:01.589 +you can have a network that is less fat, but then you have a few hidden layers, okay? + +0:07:02.770,0:07:06.659 +So the fatness is how many neurons you have per hidden layer, right? 
+ +0:07:07.810,0:07:11.429 +Okay. So the question is how do we determine the structure or the + +0:07:12.730,0:07:15.150 +configuration of our network, right? How do we design the network? + +0:07:15.580,0:07:20.550 +And the answer is going to be, that's what Yann is gonna be teaching across the semester, right? + +0:07:20.550,0:07:27.300 +So keep your attention high because that's what we're gonna be teaching here + +0:07:28.090,0:07:30.840 +That's a good question right? There is no + +0:07:32.410,0:07:34.679 +mathematical rule, there is a lot of experimental + +0:07:35.710,0:07:39.569 +empirical evidence and a lot of people are trying different configurations + +0:07:39.570,0:07:42.000 +We found something that actually works pretty well now. + +0:07:42.100,0:07:46.200 +We're gonna be covering these architectures in the following lessons. Other questions? + +0:07:48.790,0:07:50.790 +Don't be shy + +0:07:51.880,0:07:56.130 +No? Okay, so I guess then we can switch to the second part of the class + +0:07:57.880,0:08:00.630 +Okay, so we're gonna talk about convolutional nets today + +0:08:02.710,0:08:05.879 +Let's dive right in. So I'll start with + +0:08:06.820,0:08:09.500 +something that's relevant to convolutional nets but not just [to them] + +0:08:10.000,0:08:12.500 +which is the idea of transforming the parameters of a neural net + +0:08:12.570,0:08:17.010 +So here we have a diagram that we've seen before except for a small twist + +0:08:17.920,0:08:22.300 +The diagram we're seeing here is that we have a neural net G of X and W + +0:08:22.360,0:08:27.960 +W being the parameters, X being the input that makes a prediction about an output, and that goes into a cost function + +0:08:27.960,0:08:29.500 +We've seen this before + +0:08:29.500,0:08:34.500 +But the twist here is that the weight vector instead of being a + +0:08:35.830,0:08:39.660 +parameter that's being optimized, is actually itself the output of some other function + +0:08:40.599,0:08:43.589 +possibly parameterized. In this case this function is + +0:08:44.320,0:08:50.369 +not a parameterized function, or it's a parameterized function but the only input is another parameter U, okay? + +0:08:50.750,0:08:56.929 +So what we've done here is make the weights of that neural net be the function of some more elementary... + +0:08:57.480,0:08:59.480 +some more elementary parameters U + +0:09:00.420,0:09:02.420 +through a function and + +0:09:02.940,0:09:07.880 +you realize really quickly that backprop just works there, right? If you back propagate gradients + +0:09:09.210,0:09:15.049 +through the G function to get the gradient of whatever objective function we're minimizing with respect to the + +0:09:15.600,0:09:21.290 +weight parameters, you can keep back propagating through the H function here to get the gradients with respect to U + +0:09:22.620,0:09:27.229 +So in the end you're sort of propagating things like this + +0:09:30.600,0:09:42.220 +So when you're updating U, you're multiplying the Jacobian of the objective function with respect to the parameters, and then by the... + +0:09:42.750,0:09:46.760 +Jacobian of the H function with respect to its own parameters, okay? + +0:09:46.760,0:09:50.960 +So you get the product of two Jacobians here, which is just what you get from back propagating + +0:09:50.960,0:09:54.919 +You don't have to do anything in PyTorch for this. 
This will happen automatically as you define the network + +0:09:59.130,0:10:03.080 +And that's kind of the update that occurs + +0:10:03.840,0:10:10.820 +Now, of course, W being a function of U through the function H, the change in W + +0:10:12.390,0:10:16.460 +will be the change in U multiplied by the Jacobian of H transpose + +0:10:18.090,0:10:24.739 +And so this is the kind of thing you get here, the effective change in W that you get without updating W + +0:10:24.740,0:10:30.260 +--you actually are updating U-- is the update in U multiplied by the Jacobian of H + +0:10:30.690,0:10:37.280 +And we had a transpose here. We have the opposite there. This is a square matrix + +0:10:37.860,0:10:41.720 +which is Nw by Nw, which is the number of... the dimension of W squared, okay? + +0:10:42.360,0:10:44.690 +So this matrix here + +0:10:45.780,0:10:47.780 +has as many rows as + +0:10:48.780,0:10:52.369 +W has components and then the number of columns is the number of + +0:10:52.560,0:10:57.470 +components of U. And then this guy, of course, is the other way around so it's an Nu by Nw + +0:10:57.540,0:11:02.669 +So when you make the product, do the product of those two matrices you get an Nw by Nw matrix + +0:11:03.670,0:11:05.670 +And then you multiply this by this + +0:11:06.190,0:11:10.380 +Nw vector and you get an Nw vector which is what you need for updating + +0:11:11.440,0:11:13.089 +the weights + +0:11:13.089,0:11:16.828 +Okay, so that's kind of a general form of transforming the parameter space and there's + +0:11:18.430,0:11:22.979 +many ways you can use this and a particular way of using it is when + +0:11:23.769,0:11:25.389 +H is what's called a... + +0:11:26.709,0:11:30.089 +what we talked about last week, which is a "Y connector" + +0:11:30.089,0:11:35.578 +So imagine the only thing that H does is that it takes one component of U and it copies it multiple times + +0:11:36.029,0:11:40.000 +So that you have the same value, the same weight replicated across the G function + +0:11:40.000,0:11:43.379 +the G function we use the same value multiple times + +0:11:45.639,0:11:47.639 +So this would look like this + +0:11:48.339,0:11:50.339 +So let's imagine U is two dimensional + +0:11:51.279,0:11:54.448 +u1, u2 and then W is four dimensional but + +0:11:55.000,0:11:59.969 +w1 and w2 are equal to u1 and w3, w4 are equal to u2 + +0:12:01.060,0:12:04.400 +So basically you only have two free parameters + +0:12:04.700 +and when you're changing one component of U changing two components of W at the same time + +0:12:08.560,0:12:14.579 +in a very simple manner. And that's called weight sharing, okay? When two weights are forced to be equal + +0:12:14.579,0:12:19.200 +They are actually equal to a more elementary parameter that controls both + +0:12:19.300,0:12:21.419 +That's weight sharing and that's kind of the basis of + +0:12:21.940,0:12:23.940 +a lot of + +0:12:24.670,0:12:26.880 +ideas... 
you know, convolutional nets among others + +0:12:27.730,0:12:31.890 +but you can think of this as a very simple form of H of U + +0:12:33.399,0:12:38.489 +So you don't need to do anything for this in the sense that when you have weight sharing + +0:12:39.100,0:12:45.810 +If you do it explicitly with a module that does kind of a Y connection on the way back, when the gradients are back propagated + +0:12:45.810,0:12:47.800 +the gradients are summed up + +0:12:47.800,0:12:53.099 +so the gradient of some cost function with respect to u1, for example, will be the sum of the gradient so that + +0:12:53.199,0:12:55.559 +cost function with respect to w1 and w2 + +0:12:56.860,0:13:02.219 +And similarly for the gradient with respect to u2 would be the sum of the gradients with respect to w3 and w4, okay? + +0:13:02.709,0:13:06.328 +That's just the effect of backpropagating through the two Y connectors + +0:13:13.310,0:13:19.119 +Okay, here is a slightly more general view of this parameter transformation that some people have called hypernetworks + +0:13:19.970,0:13:23.350 +So a hypernetwork is a network where + +0:13:23.839,0:13:28.299 +the weights of one network are computed as the output of another network + +0:13:28.459,0:13:33.969 +Okay, so you have a network H that looks at the input, it has its own parameters U + +0:13:35.569,0:13:37.929 +And it computes the weights of a second network + +0:13:38.959,0:13:44.199 +Okay? so the advantage of doing this... there are various names for it + +0:13:44.199,0:13:46.508 +The idea is very old, it goes back to the 80s + +0:13:46.880,0:13:52.539 +people using what's called multiplicative interactions, or three-way network, or sigma-pi units and they're basically + +0:13:53.600,0:13:59.050 +this idea --and this is maybe a slightly more general general formulation of it + +0:14:00.949,0:14:02.949 +that you have sort of a dynamically + +0:14:04.069,0:14:06.519 +Your function that's dynamically defined + +0:14:07.310,0:14:09.669 +In G of X and W + +0:14:10.459,0:14:14.318 +Because W is really a complex function of the input and some other parameter + +0:14:16.189,0:14:17.959 +This is particularly + +0:14:17.959,0:14:22.419 +interesting architecture when what you're doing to X is transforming it in some ways + +0:14:23.000,0:14:29.889 +Right? So you can think of W as being the parameters of that transformation, so Y would be a transformed version of X + +0:14:32.569,0:14:37.809 +And the X, I mean the function H basically computes that transformation + +0:14:38.899,0:14:41.739 +Okay? But we'll come back to that in a few weeks + +0:14:42.829,0:14:46.209 +Just wanted to mention this because it's basically a small modification of + +0:14:46.579,0:14:52.869 +of this right? You just have one more wire that goes from X to H, and that's how you get those hypernetworks + +0:14:56.569,0:15:03.129 +Okay, so we're showing the idea that you can have one parameter controlling + +0:15:06.500,0:15:12.549 +multiple effective parameters in another network. And one reason that's useful is + +0:15:13.759,0:15:16.779 +if you want to detect a motif on an input + +0:15:17.300,0:15:20.139 +And you want to detect this motif regardless of where it appears, okay? 
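Returning for a moment to the weight-sharing example above (w1 = w2 = u1, w3 = w4 = u2), here is a minimal sketch, with a made-up scalar cost, of how backprop sums the per-copy gradients: dC/du1 = dC/dw1 + dC/dw2 and dC/du2 = dC/dw3 + dC/dw4, with no extra work needed in PyTorch.

import torch

# u has two free parameters; H replicates each one twice to build
# the four "effective" weights w = (u1, u1, u2, u2).
u = torch.tensor([0.5, -1.0], requires_grad=True)

def H(u):
    return torch.stack([u[0], u[0], u[1], u[1]])   # w1 = w2 = u1, w3 = w4 = u2

x = torch.tensor([1.0, 2.0, 3.0, 4.0])
w = H(u)
w.retain_grad()                       # keep the per-copy gradients for inspection
cost = (w * x).sum() ** 2             # some arbitrary scalar cost C(G(x, w))
cost.backward()

print(w.grad)                         # gradient with respect to each copy
print(u.grad)                         # their sums, two at a time

The accumulation happens automatically because backpropagating through the "Y connector" (the replication done by H) adds up the incoming gradients.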
+ +0:15:20.689,0:15:27.099 +So let's say you have an input, let's say it's a sequence but it could be an image, in this case is a sequence + +0:15:27.100,0:15:28.000 +Sequence of vectors, let's say + +0:15:28.300,0:15:33.279 +And you have a network that takes a collection of three of those vectors, three successive vectors + +0:15:34.010,0:15:36.339 +It's this network G of X and W and + +0:15:37.010,0:15:42.249 +it's trained to detect a particular motif of those three vectors. Maybe this is... I don't know + +0:15:42.889,0:15:44.750 +the power consumption + +0:15:44.750,0:15:51.880 +Electrical power consumption, and sometimes you might want to be able to detect like a blip or a trend or something like that + +0:15:52.519,0:15:54.519 +Or maybe it's, you know... + +0:15:56.120,0:15:58.120 +financial instruments of some kind + +0:15:59.149,0:16:05.289 +Some sort of time series. Maybe it's a speech signal and you want to detect a particular sound that consists in three + +0:16:06.050,0:16:10.899 +vectors that define the sort of audio content of that speech signal + +0:16:12.440,0:16:15.709 +And so you'd like to be able to detect + +0:16:15.709,0:16:20.469 +if it's a speech signal and there's a particular sound you need to detect for doing speech recognition + +0:16:20.470,0:16:22.630 +You might want to detect the sound + +0:16:23.180,0:16:28.690 +The vowel P, right? The sound P wherever it occurs in a sequence + +0:16:28.690,0:16:31.299 +You want some detector that fires when the sound P is... + +0:16:33.589,0:16:41.439 +...is pronounced. And so what we'd like to have is a detector you can slide over and regardless of where this motif occurs + +0:16:42.470,0:16:47.500 +detect it. So what you need to have is some network, some parameterized function that... + +0:16:48.920,0:16:55.029 +You have multiple copies of that function that you can apply to various regions on the input and they all share the same weight + +0:16:55.029,0:16:58.600 +but you'd like to train this entire system end to end + +0:16:58.700,0:17:01.459 +So for example, let's say... + +0:17:01.459,0:17:03.459 +Let's talk about a slightly more sophisticated + +0:17:05.569,0:17:07.688 +thing here where you have... + +0:17:11.059,0:17:13.059 +Let's see... + +0:17:14.839,0:17:17.349 +A keyword that's being being pronounced so + +0:17:18.169,0:17:22.959 +the system listens to sound and wants to detect when a particular keyword, a wakeup + +0:17:24.079,0:17:28.329 +word has been has been pronounced, right? So this is Alexa, right? + +0:17:28.459,0:17:32.709 +And you say "Alexa!" and Alexa wakes up it goes bong, right? + +0:17:35.260,0:17:40.619 +So what you'd like to have is some network that kind of takes a window over the sound and then sort of keeps + +0:17:41.890,0:17:44.189 +in the background sort of detecting + +0:17:44.860,0:17:47.219 +But you'd like to be able to detect + +0:17:47.220,0:17:52.020 +wherever the sound occurs within the frame that is being looked at, or it's been listened to, I should say + +0:17:52.300,0:17:56.639 +So you can have a network like this where you have replicated detectors + +0:17:56.640,0:17:59.520 +They all share the same weight and then the output which is + +0:17:59.520,0:18:03.329 +the score as to whether something has been detected or not, goes to a max function + +0:18:04.090,0:18:07.500 +Okay? And that's the output. 
And the way you train a system like this + +0:18:08.290,0:18:10.290 +you will have a bunch of samples + +0:18:10.780,0:18:14.140 +Audio examples where the keyword + +0:18:14.140,0:18:18.000 +has been pronounced and a bunch of audio samples where the keyword was not pronounced + +0:18:18.100,0:18:20.249 +And then you train a 2 class classifier + +0:18:20.470,0:18:24.689 +Turn on when "Alexa" is somewhere in this frame, turn off when it's not + +0:18:25.059,0:18:30.899 +But nobody tells you where the word "Alexa" occurs within the window that you train the system on, okay? + +0:18:30.900,0:18:35.729 +Because it's really expensive for labelers to look at the audio signal and tell you exactly + +0:18:35.730,0:18:37.570 +This is where the word "Alexa" is being pronounced + +0:18:37.570,0:18:42.720 +The only thing they know is that within this segment of a few seconds, the word has been pronounced somewhere + +0:18:43.450,0:18:48.390 +Okay, so you'd like to apply a network like this that has those replicated detectors? + +0:18:48.390,0:18:53.429 +You don't know exactly where it is, but you run through this max and you want to train the system to... + +0:18:53.950,0:18:59.370 +You want to back propagate gradient to it so that it learns to detect "Alexa", or whatever... + +0:19:00.040,0:19:01.900 +wake up word occurs + +0:19:01.900,0:19:09.540 +And so there what happens is you have those multiple copies --five copies in this example + +0:19:09.580,0:19:11.580 +of this network and they all share the same weight + +0:19:11.710,0:19:16.650 +You can see there's just one weight vector sending its value to five different + +0:19:17.410,0:19:22.559 +instances of the same network and so we back propagate through the + +0:19:23.260,0:19:27.689 +five copies of the network, you get five gradients, so those gradients get added up... + +0:19:29.679,0:19:34.949 +for the parameter. Now, there's this slightly strange way this is implemented in PyTorch and other + +0:19:35.740,0:19:41.760 +Deep Learning frameworks, which is that this accumulation of gradient in a single parameter is done implicitly + +0:19:42.550,0:19:46.659 +And it's one reason why before you do a backprop in PyTorch, you have to zero out the gradient + +0:19:47.840,0:19:49.840 +Because there's sort of implicit + +0:19:50.510,0:19:52.510 +accumulation of gradients when you do back propagation + +0:19:58.640,0:20:02.000 +Okay, so here's another situation where that would be useful + +0:20:02.100,0:20:07.940 +And this is the real motivation behind conditional nets in the first place + +0:20:07.940,0:20:09.940 +Which is the problem of + +0:20:10.850,0:20:15.000 +training a system to recognize the shape independently of the position + +0:20:16.010,0:20:17.960 +of where the shape occurs + +0:20:17.960,0:20:22.059 +and whether there are distortions of that shape in the input + +0:20:22.850,0:20:28.929 +So this is a very simple type of convolutional net that is has been built by hand. It's not been trained + +0:20:28.929,0:20:30.929 +It's been designed by hand + +0:20:31.760,0:20:36.200 +And it's designed explicitly to distinguish C's from D's + +0:20:36.400,0:20:38.830 +Okay, so you can draw a C on the input + +0:20:39.770,0:20:41.770 +image which is very low resolution + +0:20:43.880,0:20:48.459 +And what distinguishes C's from D's is that C's have end points, right? + +0:20:48.460,0:20:54.610 +The stroke kind of ends, and you can imagine designing a detector for that. 
Whereas these have corners + +0:20:55.220,0:20:59.679 +So if you have an endpoint detector or something that detects the end of a segment and + +0:21:00.290,0:21:02.290 +a corner detector + +0:21:02.330,0:21:06.699 +Wherever you have corners detected, it's a D and wherever you have + +0:21:07.700,0:21:09.700 +segments that end, it's a C + +0:21:11.870,0:21:16.989 +So here's an example of a C. You take the first detector, so the little + +0:21:17.750,0:21:19.869 +black and white motif here at the top + +0:21:20.870,0:21:24.640 +is an endpoint detector, okay? It detects the end of a + +0:21:25.610,0:21:28.059 +of a segment and the way this + +0:21:28.760,0:21:33.969 +is represented here is that the black pixels here... + +0:21:35.840,0:21:37.929 +So think of this as some sort of template + +0:21:38.990,0:21:43.089 +Okay, you're going to take this template and you're going to swipe it over the input image + +0:21:44.510,0:21:51.160 +and you're going to compare that template to the little image that is placed underneath, okay? + +0:21:51.980,0:21:56.490 +And if those two match, the way you're going to determine whether they match is that you're going to do a dot product + +0:21:56.490,0:22:03.930 +So you're gonna think of those black and white pixels as value of +1 or -1, say +1 for black and -1 for white + +0:22:05.020,0:22:09.420 +And you're gonna think of those pixels also as being +1 for blacks and -1 for white and + +0:22:10.210,0:22:16.800 +when you compute the dot product of a little window with that template + +0:22:17.400,0:22:22.770 +If they are similar, you're gonna get a large positive value. If they are dissimilar, you're gonna get a... + +0:22:24.010,0:22:27.629 +zero or negative value. Or a smaller value, okay? + +0:22:29.020,0:22:35.489 +So you take that little detector here and you compute the dot product with the first window, second window, third window, etc. + +0:22:35.650,0:22:42.660 +You shift by one pixel every time for every location and you recall the result. And what you what you get is this, right? + +0:22:42.660,0:22:43.660 +So this is... + +0:22:43.660,0:22:51.640 +Here the grayscale is an indication of the matching + +0:22:51.640,0:22:57.959 +which is actually the dot product between the vector formed by those values + +0:22:58.100,0:23:05.070 +And the patch of the corresponding location on the input. So this image here is roughly the same size as that image + +0:23:06.250,0:23:08.250 +minus border effects + +0:23:08.290,0:23:13.469 +And you see there is a... whenever the output is dark there is a match + +0:23:14.380,0:23:16.380 +So you see a match here + +0:23:16.810,0:23:20.249 +because this endpoint detector here matches the + +0:23:20.980,0:23:24.810 +the endpoint. You see sort of a match here at the bottom + +0:23:25.630,0:23:27.930 +And the other kind of values are not as + +0:23:28.750,0:23:32.459 +dark, okay? Not as strong if you want + +0:23:33.250,0:23:38.820 +Now, if you threshold those those values you set the output to +1 if it's above the threshold + +0:23:39.520,0:23:41.520 +Zero if it's below the threshold + +0:23:42.070,0:23:46.499 +You get those maps here, you have to set the threshold appropriately but what you get is that + +0:23:46.500,0:23:50.880 +this little guy here detected a match at the two end points of the C, okay? 
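A minimal sketch of this sliding-template idea with made-up +1/-1 values: taking the dot product of a small template at every location is a cross-correlation (which is what F.conv2d computes), and thresholding the resulting map gives the detection map discussed next. The toy template here only looks for the upper end point of a vertical stroke.

import torch
import torch.nn.functional as F

# Toy +1/-1 image with a vertical stroke, and a 3x3 +1/-1 template meant to
# match the stroke's upper end point (both made up for illustration).
img = -torch.ones(1, 1, 10, 10)
img[0, 0, 2:8, 4] = 1.0                 # stroke in column 4, rows 2..7

template = -torch.ones(1, 1, 3, 3)
template[0, 0, 1, 1] = 1.0              # stroke starts here...
template[0, 0, 2, 1] = 1.0              # ...and continues downward

# Slide the template over the image, taking a dot product at every location.
match = F.conv2d(img, template)         # (1, 1, 8, 8) map of match scores
detections = (match > 8.0).float()      # a perfect 3x3 match scores 9
print(detections.nonzero())             # fires only at the stroke's upper end point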
+ +0:23:52.150,0:23:54.749 +So now if you take this map and you sum it up + +0:23:56.050,0:23:58.050 +Just add all the values + +0:23:58.600,0:24:00.430 +You get a positive number + +0:24:00.430,0:24:03.989 +Pass that through threshold, and that's your C detector. It's not a very good C detector + +0:24:03.990,0:24:07.859 +It's not a very good detector of anything, but for those particular examples of C's + +0:24:08.429,0:24:10.210 +and maybe those D's + +0:24:10.210,0:24:16.980 +It will work, it'll be enough. Now for the D is similar, those other detectors here are meant to detect the corners of the D + +0:24:17.679,0:24:24.538 +So this guy here, this detector, as you swipe it over the input will detect the + +0:24:25.659,0:24:29.189 +upper left corner and that guy will detect the lower right corner + +0:24:29.649,0:24:33.689 +Once you threshold, you will get those two maps where the corners are detected + +0:24:34.509,0:24:37.019 +and then you can sum those up and the + +0:24:37.360,0:24:44.729 +D detector will turn on. Now what you see here is an example of why this is good because that detection now is shift invariant + +0:24:44.730,0:24:49.169 +So if I take the same input D here, and I shift it by a couple pixels + +0:24:50.340,0:24:56.279 +And I run this detector again, it will detect the motifs wherever they appear. The output will be shifted + +0:24:56.379,0:25:01.559 +Okay, so this is called equivariance to shift. So the output of that network + +0:25:02.590,0:25:10.499 +is equivariant to shift, which means that if I shift the input the output gets shifted, but otherwise unchanged. Okay? That's equivariance + +0:25:11.289,0:25:12.909 +Invariance would be + +0:25:12.909,0:25:17.398 +if I shift it, the output will be completely unchanged but here it is modified + +0:25:17.399,0:25:19.739 +It just modified the same way as the input + +0:25:23.950,0:25:31.080 +And so if I just sum up the activities in the feature maps here, it doesn't matter where they occur + +0:25:31.809,0:25:34.199 +My D detector will still activate + +0:25:34.929,0:25:38.998 +if I just compute the sum. So this is sort of a handcrafted + +0:25:39.700,0:25:47.100 +pattern recognizer that uses local feature detectors and then kind of sums up their activity and what you get is an invariant detection + +0:25:47.710,0:25:52.529 +Okay, this is a fairly classical way actually of building certain types of pattern recognition systems + +0:25:53.049,0:25:55.049 +Going back many years + +0:25:57.730,0:26:03.929 +But the trick here, what's important of course, what's interesting would be to learn those templates + +0:26:04.809,0:26:10.258 +Can we view this as just a neural net and we back propagate to it and we learn those templates? + +0:26:11.980,0:26:18.779 +As weights of a neural net? After all we're using them to do that product which is a weighted sum, so basically + +0:26:21.710,0:26:29.059 +This layer here to go from the input to those so-called feature maps that are weighted sums + +0:26:29.520,0:26:33.080 +is a linear operation, okay? And we know how to back propagate through that + +0:26:35.850,0:26:41.750 +We'd have to use a kind of a soft threshold, a ReLU or something like this here because otherwise we can't do backprop + +0:26:43.470,0:26:48.409 +Okay, so this operation here of taking the dot product of a bunch of coefficients + +0:26:49.380,0:26:53.450 +with an input window and then swiping it over, that's a convolution + +0:26:57.810,0:27:03.409 +Okay, so that's the definition of a convolution. 
It's actually the one up there so this is in the one dimensional case + +0:27:05.400,0:27:07.170 +where imagine you have + +0:27:10.530,0:27:16.639 +An input Xj, so X indexed by the j in the index + +0:27:20.070,0:27:22.070 +You take a window + +0:27:23.310,0:27:26.029 +of X at a particular location i + +0:27:27.330,0:27:30.080 +Okay, and then you sum + +0:27:31.890,0:27:40.340 +You do a weighted sum of the window of the X values and you multiply those by the weights wⱼ's + +0:27:41.070,0:27:50.359 +Okay, and the sum presumably runs over a kind of a small window so j here would go from 1 to 5 + +0:27:51.270,0:27:54.259 +Something like that, which is the case in the little example I showed earlier + +0:27:58.020,0:28:00.950 +and that gives you one Yi + +0:28:01.770,0:28:05.510 +Okay, so take the first window of 5 values of X + +0:28:06.630,0:28:13.280 +Compute the weighted sum with the weights, that gives you Y1. Then shift that window by 1, compute the weighted sum of the + +0:28:13.620,0:28:18.320 +dot product of that window by the Y's, that gives you Y2, shift again, etc. + +0:28:23.040,0:28:26.839 +Now, in practice when people implement in things like PyTorch + +0:28:26.840,0:28:31.069 +there is a confusion between two things that mathematicians think are very different + +0:28:31.070,0:28:37.009 +but in fact, they're pretty much the same. It's convolution and cross correlation. So in convolution, the convention is that the... + +0:28:37.979,0:28:44.359 +the index goes backwards in the window when it goes forwards in the weights + +0:28:44.359,0:28:49.519 +In cross correlation, they both go forward. In the end, it's just a convention, it depends on how you lay... + +0:28:51.659,0:28:59.598 +organize the data and your weights. You can interpret this as a convolution if you read the weights backwards, so really doesn't make any difference + +0:29:01.259,0:29:06.949 +But for certain mathematical properties of a convolution if you want everything to be consistent you have to have the... + +0:29:07.440,0:29:10.849 +The j in the W having an opposite sign to the j in the X + +0:29:11.879,0:29:13.879 +So the two dimensional version of this... + +0:29:15.419,0:29:17.419 +If you have an image X + +0:29:17.789,0:29:21.258 +that has two indices --in this case i and j + +0:29:23.339,0:29:25.909 +You do a weighted sum over two indices k and l + +0:29:25.909,0:29:31.368 +And so you have a window a two-dimensional window indexed by k and l and you compute the dot product + +0:29:31.769,0:29:34.008 +of that window over X with the... + +0:29:35.099,0:29:39.679 +the weight, and that gives you one value in Yij which is the output + +0:29:43.349,0:29:51.319 +So the vector W or the matrix W in the 2d version, there is obvious extensions of this to 3d and 4d, etc. + +0:29:52.080,0:29:55.639 +It's called a kernel, it's called a convolutional kernel, okay? + +0:30:00.380,0:30:03.309 +Is it clear? I'm sure this is known for many of you but... + +0:30:10.909,0:30:13.449 +So what we're going to do with this is that + +0:30:14.750,0:30:18.699 +We're going to organize... 
build a network as a succession of + +0:30:20.120,0:30:23.769 +convolutions where in a regular neural net you have + +0:30:25.340,0:30:29.100 +alternation of linear operators and pointwise non-linearity + +0:30:29.250,0:30:34.389 +In convolutional nets, we're going to have an alternation of linear operators that will happen to be convolutions, so multiple convolutions + +0:30:34.940,0:30:40.179 +Then also pointwise non-linearity and there's going to be a third type of operation called pooling... + +0:30:42.620,0:30:44.620 +which is actually optional + +0:30:45.470,0:30:50.409 +Before I go further, I should mention that there are + +0:30:52.220,0:30:56.889 +twists you can make to this convolution. So one twist is what's called a stride + +0:30:57.380,0:31:01.239 +So a stride in a convolution consists in moving the window + +0:31:01.760,0:31:07.509 +from one position to another instead of moving it by just one value + +0:31:07.940,0:31:13.510 +You move it by two or three or four, okay? That's called a stride of a convolution + +0:31:14.149,0:31:17.138 +And so if you have an input of a certain length and... + +0:31:19.700,0:31:26.590 +So let's say you have an input which is kind of a one-dimensional and size 100 hundred + +0:31:27.019,0:31:31.059 +And you have a convolution kernel of size five + +0:31:32.330,0:31:34.330 +Okay, and you convolve + +0:31:34.909,0:31:38.409 +this kernel with the input + +0:31:39.350,0:31:46.120 +And you make sure that the window stays within the input of size 100 + +0:31:46.730,0:31:51.639 +The output you get has 96 outputs, okay? It's got the number of inputs + +0:31:52.519,0:31:56.019 +minus the size of the kernel, which is 5 minus 1 + +0:31:57.110,0:32:00.610 +Okay, so that makes it 4. So you get 100 minus 4, that's 96 + +0:32:02.299,0:32:08.709 +That's the number of windows of size 5 that fit within this big input of size 100 + +0:32:11.760,0:32:13.760 +Now, if I use this stride... + +0:32:13.760,0:32:21.960 +So what I do now is I take my window of 5 where I applied the kernel and I shift not by one pixel but by 2 pixels + +0:32:21.960,0:32:24.710 +Or two values, let's say. They're not necessarily pixels + +0:32:26.310,0:32:31.880 +Okay, the number of outputs I'm gonna get is gonna be divided by two roughly + +0:32:33.570,0:32:36.500 +Okay, instead of 96 I'm gonna have + +0:32:37.080,0:32:42.949 +a little less than 50, 48 or something like that. The number is not exact, you can... + +0:32:44.400,0:32:46.400 +figure it out in your head + +0:32:47.430,0:32:51.470 +Very often when people run convolutions in convolutional nets they actually pad the convolution + +0:32:51.470,0:32:59.089 +So they sometimes like to have the output being the same size as the input, and so they actually displace the input window + +0:32:59.490,0:33:02.479 +past the end of the vector assuming that it's padded with zeros + +0:33:04.230,0:33:06.230 +usually on both sides + +0:33:16.110,0:33:19.849 +Does it have any effect on performance or is it just for convenience? + +0:33:21.480,0:33:25.849 +If it has an effect on performance is bad, okay? 
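As a quick check of the size arithmetic just discussed, here is a small sketch with nn.Conv1d (the numbers follow the lecture, the layer itself is illustrative). Note that PyTorch's convolution layers actually compute a cross-correlation, i.e. the kernel is not flipped, which is the convention point made earlier.

import torch
import torch.nn as nn

x = torch.randn(1, 1, 100)                                   # 1-D signal of length 100

print(nn.Conv1d(1, 1, kernel_size=5)(x).shape)               # length 96 = 100 - (5 - 1)
print(nn.Conv1d(1, 1, kernel_size=5, stride=2)(x).shape)     # length 48 = (100 - 5)//2 + 1
print(nn.Conv1d(1, 1, kernel_size=5, padding=2)(x).shape)    # length 100, zero-padded on both sides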
But it is convenient + +0:33:28.350,0:33:30.350 +That's pretty much the answer + +0:33:32.700,0:33:37.800 +The assumption that's bad is assuming that when you don't have data it's equal to zero + +0:33:38.000,0:33:41.720 +So when your nonlinearities are ReLU, it's not necessarily completely unreasonable + +0:33:43.650,0:33:48.079 +But it sometimes creates funny border effects (boundary effects) + +0:33:51.120,0:33:53.539 +Okay, everything clear so far? + +0:33:54.960,0:33:59.059 +Right. Okay. So what we're going to build is a + +0:34:01.050,0:34:03.050 +neural net composed of those + +0:34:03.690,0:34:08.120 +convolutions that are going to be used as feature detectors, local feature detectors + +0:34:09.090,0:34:13.069 +followed by nonlinearities, and then we're gonna stack multiple layers of those + +0:34:14.190,0:34:18.169 +And the reason for stacking multiple layers is because + +0:34:19.170,0:34:21.090 +We want to build + +0:34:21.090,0:34:25.809 +hierarchical representations of the visual world of the data + +0:34:26.089,0:34:32.258 +It's not... convolutional nets are not necessarily applied to images. They can be applied to speech and other signals + +0:34:32.299,0:34:35.619 +They basically can be applied to any signal that comes to you in the form of an array + +0:34:36.889,0:34:41.738 +And I'll come back to the properties that this array has to verify + +0:34:43.789,0:34:45.789 +So what you want is... + +0:34:46.459,0:34:48.698 +Why do you want to build hierarchical representations? + +0:34:48.699,0:34:54.369 +Because the world is compositional --and I alluded to this I think you the first lecture if remember correctly + +0:34:55.069,0:35:03.519 +It's the fact that pixes assemble to form simple motifs like oriented edges + +0:35:04.430,0:35:10.839 +Oriented edges kind of assemble to form local features like corners and T junctions and... + +0:35:11.539,0:35:14.018 +things like that... gratings, you know, and... + +0:35:14.719,0:35:19.600 +then those assemble to form motifs that are slightly more abstract. + +0:35:19.700,0:35:23.559 +Then those assemble to form parts of objects, and those assemble to form objects + +0:35:23.559,0:35:28.000 +So there is a sort of natural compositional hierarchy in the natural world + +0:35:28.100,0:35:33.129 +And this natural compositional hierarchy in the natural world is not just because of + +0:35:34.369,0:35:38.438 +perception --visual perception-- is true at a physical level, right? + +0:35:41.390,0:35:46.808 +You start at the lowest level of the description + +0:35:47.719,0:35:50.079 +You have elementary particles and they form... + +0:35:50.079,0:35:56.438 +they clump to form less elementary particles, and they clump to form atoms, and they clump to form molecules, and molecules clump to form + +0:35:57.229,0:36:00.399 +materials, and materials parts of objects and + +0:36:01.130,0:36:03.609 +parts of objects into objects, and things like that, right? + +0:36:04.670,0:36:07.599 +Or macromolecules or polymers, bla bla bla + +0:36:08.239,0:36:13.239 +And then you have this natural composition or hierarchy the world is built this way + +0:36:14.719,0:36:19.000 +And it may be why the world is understandable, right? 
+ +0:36:19.100,0:36:22.419 +So there's this famous quote from Einstein that says: + +0:36:23.329,0:36:26.750 +"the most incomprehensible thing about the world is that the world is comprehensible" + +0:36:26.800,0:36:30.069 +And it seems like a conspiracy that we live in a world that we are able to comprehend + +0:36:31.130,0:36:35.019 +But we can comprehend it because the world is compositional and + +0:36:36.970,0:36:38.970 +it happens to be easy to build + +0:36:39.760,0:36:44.370 +brains in a compositional world that actually can interpret compositional world + +0:36:45.580,0:36:47.580 +It still seems like a conspiracy to me + +0:36:49.660,0:36:51.660 +So there's a famous quote from... + +0:36:53.650,0:36:54.970 +from a... + +0:36:54.970,0:37:00.780 +Not that famous, but somewhat famous, from a statistician at Brown called Stuart Geman. + +0:37:01.360,0:37:04.799 +And he says that sounds like a conspiracy, like magic + +0:37:06.070,0:37:08.070 +But you know... + +0:37:08.440,0:37:15.570 +If the world were not compositional we would need some even more magic to be able to understand it + +0:37:17.260,0:37:21.540 +The way he says this is: "the world is compositional or there is a God" + +0:37:25.390,0:37:32.339 +You would need to appeal to superior powers if the world was not compositional to explain how we can understand it + +0:37:35.830,0:37:37.830 +Okay, so this idea of hierarchy + +0:37:38.440,0:37:44.520 +and local feature detection comes from biology. So the whole idea of convolutional nets comes from biology. It's been + +0:37:45.850,0:37:47.850 +so inspired by biology and + +0:37:48.850,0:37:53.399 +what you see here on the right is a diagram by Simon Thorpe who's a + +0:37:54.160,0:37:56.160 +psycho-physicist and + +0:37:56.500,0:38:02.939 +did some relatively famous experiments where he showed that the way we recognize everyday objects + +0:38:03.580,0:38:05.969 +seems to be extremely fast. So if you show... + +0:38:06.640,0:38:10.409 +if you flash the image of an everyday object to a person and + +0:38:11.110,0:38:12.730 +you flash + +0:38:12.730,0:38:16.649 +one of them every 100 milliseconds or so, you realize that the + +0:38:18.070,0:38:23.549 +the time it takes for a person to identify in a long sequence, whether there was a particular object, let's say a tiger + +0:38:25.780,0:38:27.640 +is about 100 milliseconds + +0:38:27.640,0:38:34.769 +So the time it takes for brain to interpret an image and recognize basic objects in them is about 100 milliseconds + +0:38:35.650,0:38:37.740 +A tenth of a second, right? + +0:38:39.490,0:38:42.120 +And that's just about the time it takes for the + +0:38:43.000,0:38:45.000 +nerve signal to propagate from + +0:38:45.700,0:38:47.550 +the retina + +0:38:47.550,0:38:54.090 +where images are formed in the eye to what's called the LGN (lateral geniculate nucleus) + +0:38:54.340,0:38:56.340 +which is a small + +0:38:56.350,0:39:02.640 +piece of the brain that basically does sort of contrast enhancement and gain control, and things like that + +0:39:03.580,0:39:08.789 +And then that signal goes to the back of your brain v1. That's the primary visual cortex area + +0:39:09.490,0:39:15.600 +in humans and then v2, which is very close to v1. 
There's a fold that sort of makes v1 sort of + +0:39:17.380,0:39:20.549 +right in front of v2, and there is lots of wires between them + +0:39:21.580,0:39:28.890 +And then v4, and then the inferior temporal cortex, which is on the side here and that's where object categories are represented + +0:39:28.890,0:39:35.369 +So there are neurons in your inferior temporal cortex that represent generic object categories + +0:39:38.350,0:39:41.370 +And people have done experiments with this where... + +0:39:44.320,0:39:51.150 +epileptic patients are in hospital and have their skull open because they need to locate the... + +0:39:52.570,0:40:00.200 +exact position of the source of their epilepsy seizures + +0:40:02.080,0:40:04.650 +And because they have electrodes on the surface of their brain + +0:40:05.770,0:40:11.000 +you can show the movies and then observe if a particular neuron turns on for particular movies + +0:40:11.100,0:40:14.110 +And you show them a movie with Jennifer Aniston and there is this + +0:40:14.110,0:40:17.900 +neuron that only turns on when Jennifer Aniston is there, okay? + +0:40:18.000,0:40:21.000 +It doesn't turn on for anything else as far as we could tell, okay? + +0:40:21.700,0:40:27.810 +So you seem to have very selective neurons in the inferior temporal cortex that react to a small number of categories + +0:40:30.760,0:40:35.669 +There's a joke, kind of a running joke, in neuroscience of a concept called the grandmother cell + +0:40:35.670,0:40:40.350 +So this is the one neuron in your inferior temporal cortex that turns on when you see your grandmother + +0:40:41.050,0:40:45.120 +regardless of what position what she's wearing, how far, whether it's a photo or not + +0:40:46.510,0:40:50.910 +Nobody really believes in this concept, what people really believe in is distributed representations + +0:40:50.910,0:40:54.449 +So there is no such thing as a cell that just turns on for you grandmother + +0:40:54.970,0:41:00.820 +There are this collection of cells that turn on for various things and they serve to represent general categories + +0:41:01.100,0:41:04.060 +But the important thing is that they are invariant to + +0:41:04.700,0:41:06.700 +position, size... + +0:41:06.920,0:41:11.080 +illumination, all kinds of different things and the real motivation behind + +0:41:11.930,0:41:14.349 +convolutional nets is to build + +0:41:15.140,0:41:18.670 +neural nets that are invariant to irrelevant transformation of the inputs + +0:41:19.510,0:41:27.070 +You can still recognize a C or D or your grandmother regardless of the position and to some extent the orientation, the style, etc. + +0:41:29.150,0:41:36.790 +So this idea that the signal only takes 100 milliseconds to go from the retina to the inferior temporal cortex + +0:41:37.160,0:41:40.330 +Seems to suggest that if you count the delay + +0:41:40.850,0:41:42.850 +to go through every neuron or every + +0:41:43.340,0:41:45.489 +stage in that pathway + +0:41:46.370,0:41:48.880 +There's barely enough time for a few spikes to get through + +0:41:48.880,0:41:55.720 +So there's no time for complex recurrent computation, is basically a feed-forward process. It's very fast + +0:41:56.930,0:41:59.980 +Okay, and we need it to be fast because that's a question of survival for us + +0:41:59.980,0:42:06.159 +There's a lot of... for most animals, you need to be able to recognize really quickly what's going on, particularly... 
+ +0:42:07.850,0:42:12.820 +fast-moving predators or preys for that matter + +0:42:17.570,0:42:20.830 +So that kind of suggests the idea that we can do + +0:42:21.920,0:42:26.230 +perhaps we could come up with some sort of neuronal net architecture that is completely feed-forward and + +0:42:27.110,0:42:29.110 +still can do recognition + +0:42:30.230,0:42:32.230 +The diagram on the right + +0:42:34.430,0:42:39.280 +is from Gallent & Van Essen, so this is a type of sort of abstract + +0:42:39.920,0:42:43.450 +conceptual diagram of the two pathways in the visual cortex + +0:42:43.490,0:42:50.530 +There is the ventral pathway and the dorsal pathway. The ventral pathway is, you know, basically the v1, v2, v4, IT hierarchy + +0:42:50.530,0:42:54.999 +which is sort of from the back of the brain, and goes to the bottom and to the side and + +0:42:55.280,0:42:58.179 +then the dorsal pathway kind of goes + +0:42:59.060,0:43:02.469 +through the top also towards the inferior temporal cortex and + +0:43:04.040,0:43:09.619 +there is this idea somehow that the ventral pathway is there to tell you what you're looking at, right? + +0:43:10.290,0:43:12.499 +The dorsal pathway basically identifies + +0:43:13.200,0:43:15.200 +locations + +0:43:15.390,0:43:17.390 +geometry and motion + +0:43:17.460,0:43:25.040 +Okay? So there is a pathway for what, and another pathway for where, and that seems fairly separate in the + +0:43:25.040,0:43:29.030 +human or primate visual cortex + +0:43:32.610,0:43:34.610 +And of course there are interactions between them + +0:43:39.390,0:43:45.499 +So various people had the idea of kind of using... so where does that idea come from? There is + +0:43:46.080,0:43:48.799 +classic work in neuroscience from the late 50s early 60s + +0:43:49.650,0:43:52.129 +By Hubel & Wiesel, they're on the picture here + +0:43:53.190,0:43:57.440 +They won a Nobel Prize for it, so it's really classic work and what they showed + +0:43:58.290,0:44:01.519 +with cats --basically by poking electrodes into cat brains + +0:44:02.310,0:44:08.480 +is that neurons in the cat brain --in v1-- detect... + +0:44:09.150,0:44:13.789 +are only sensitive to a small area of the visual field and they detect oriented edges + +0:44:14.970,0:44:17.030 +contours in that particular area, okay? + +0:44:17.880,0:44:22.160 +So the area to which a particular neuron is sensitive is called a receptive field + +0:44:23.700,0:44:27.859 +And you take a particular neuron and you show it + +0:44:29.070,0:44:35.719 +kind of an oriented bar that you rotate, and at one point the neuron will fire + +0:44:36.270,0:44:40.640 +for a particular angle, and as you move away from that angle the activation of the neuron kind of + +0:44:42.690,0:44:50.149 +diminishes, okay? So that's called orientation selective neurons, and Hubel & Wiesel called it simple cells + +0:44:51.420,0:44:56.930 +If you move the bar a little bit, you go out of the receptive field, that neuron doesn't fire anymore + +0:44:57.150,0:45:03.049 +it doesn't react to it. This could be another neuron almost exactly identical to it, just a little bit + +0:45:04.830,0:45:09.620 +Away from the first one that does exactly the same function. 
It will react to a slightly different + +0:45:10.380,0:45:12.440 +receptive field but with the same orientation + +0:45:14.700,0:45:18.889 +So you start getting this idea that you have local feature detectors that are positioned + +0:45:20.220,0:45:23.689 +replicated all over the visual field, which is basically this idea of + +0:45:24.960,0:45:26.960 +convolution, okay? + +0:45:27.870,0:45:33.470 +So they are called simple cells. And then another idea that or discovery that + +0:45:35.100,0:45:40.279 +Hubel & Wiesel did is the idea of complex cells. So what a complex cell is is another type of neuron + +0:45:41.100,0:45:45.200 +that integrates the output of multiple simple cells within a certain area + +0:45:46.170,0:45:50.120 +Okay? So they will take different simple cells that all detect + +0:45:51.180,0:45:54.079 +contours at a particular orientation, edges at a particular orientation + +0:45:55.350,0:46:02.240 +And compute an aggregate of all those activations. It will either do a max, or a sum, or + +0:46:02.760,0:46:08.239 +a sum of squares, or square root of sum of squares. Some sort of function that does not depend on the order of the arguments + +0:46:08.820,0:46:11.630 +Okay? Let's say max for the sake of simplicity + +0:46:12.900,0:46:17.839 +So basically a complex cell will turn on if any of the simple cells within its + +0:46:19.740,0:46:22.399 +input group turns on + +0:46:22.680,0:46:29.480 +Okay? So that complex cell will detect an edge at a particular orientation regardless of its position within that little region + +0:46:30.210,0:46:32.210 +So it builds a little bit of + +0:46:32.460,0:46:34.609 +shift invariance of the + +0:46:35.250,0:46:40.159 +representation coming out of the complex cells with respect to small variation of positions of + +0:46:40.890,0:46:42.890 +features in the input + +0:46:46.680,0:46:52.010 +So a gentleman by the name of Kunihiko Fukushima + +0:46:54.420,0:46:56.569 +--No real relationship with the nuclear power plant + +0:46:58.230,0:47:00.230 +In the late 70s early 80s + +0:47:00.330,0:47:07.190 +experimented with computer models that sort of implemented this idea of simple cell / complex cell, and he had the idea of sort of replicating this + +0:47:07.500,0:47:09.500 +with multiple layers, so basically... + +0:47:11.310,0:47:17.810 +The architecture he did was very similar to the one I showed earlier here with this sort of handcrafted + +0:47:18.570,0:47:20.490 +feature detector + +0:47:20.490,0:47:24.559 +Some of those feature detectors in his model were handcrafted but some of them were learned + +0:47:25.230,0:47:30.709 +They were learned by an unsupervised method. He didn't have have backprop, right? Backprop didn't exist + +0:47:30.710,0:47:36.770 +I mean, it existed but it wasn't really popular and people didn't use it + +0:47:38.609,0:47:43.338 +So he trained those filters basically with something that amounts to a + +0:47:44.190,0:47:46.760 +sort of clustering algorithm a little bit... + +0:47:49.830,0:47:53.569 +and separately for each layer. And so he would + +0:47:56.609,0:48:02.389 +train the filters for the first layer, train this with handwritten digits --he also had a dataset of handwritten digits + +0:48:03.390,0:48:06.470 +and then feed this to complex cells that + +0:48:06.470,0:48:10.820 +pool the activity of simple cells together, and then that would + +0:48:11.880,0:48:18.440 +form the input to the next layer, and it would repeat the same running algorithm. 
His model of neuron was very complicated + +0:48:18.440,0:48:19.589 +It was kind of inspired by biology + +0:48:19.589,0:48:27.229 +So it had separate inhibitory neurons, the other neurons only have positive weights and outgoing weights, etc. + +0:48:27.839,0:48:29.839 +He managed to get this thing to kind of work + +0:48:30.510,0:48:33.800 +Not very well, but sort of worked + +0:48:36.420,0:48:39.170 +Then a few years later + +0:48:40.770,0:48:44.509 +I basically kind of got inspired by similar architectures, but + +0:48:45.780,0:48:51.169 +trained them supervised with backprop, okay? So that's the genesis of convolutional nets, if you want + +0:48:51.750,0:48:53.869 +And then independently more or less + +0:48:57.869,0:49:04.969 +Max Riesenhuber and Tony Poggio's lab at MIT kind of rediscovered this architecture also, but also didn't use backprop for some reason + +0:49:06.060,0:49:08.060 +He calls this H-max + +0:49:12.150,0:49:20.039 +So this is sort of early experiments I did with convolutional nets when I was finishing my postdoc in the University of Toronto in 1988 + +0:49:20.040,0:49:22.040 +So that goes back a long time + +0:49:22.840,0:49:26.730 +And I was trying to figure out, does this work better on a small data set? + +0:49:26.730,0:49:27.870 +So if you have a tiny amount of data + +0:49:27.870,0:49:31.109 +you're trying to fully connect to network or linear network with just one layer or + +0:49:31.480,0:49:34.529 +a network with local connections but no shared weights or compare this with + +0:49:35.170,0:49:39.299 +what was not yet called a convolutional net, where you have shared weights and local connections + +0:49:39.400,0:49:42.749 +Which one works best? And it turned out that in terms of + +0:49:43.450,0:49:46.439 +generalization ability, which are the curves on the bottom left + +0:49:49.270,0:49:52.499 +which you see here, the top curve here, is... + +0:49:53.500,0:50:00.330 +basically the baby convolutional net architecture trained with very a simple data set of handwritten digits that were drawn with a mouse, right? + +0:50:00.330,0:50:02.490 +We didn't have any way of collecting images, basically + +0:50:03.640,0:50:05.640 +at that time + +0:50:05.860,0:50:09.240 +And then if you have real connections without shared weights + +0:50:09.240,0:50:12.119 +it works a little worse. And then if you have fully connected + +0:50:14.470,0:50:22.230 +networks it works worse, and if you have a linear network, it not only works worse, but but it also overfits, it over trains + +0:50:23.110,0:50:28.410 +So the test error goes down after a while, and this was trained with 320 + +0:50:29.410,0:50:35.519 +320 training samples, which is really small. Those networks had on the order of + +0:50:36.760,0:50:43.170 +five thousand connections, one thousand parameters. So this is a billion times smaller than what we do today + +0:50:43.990,0:50:45.990 +A million times I would say + +0:50:47.890,0:50:53.730 +And then I finished my postdoc, I went to Bell Labs, and Bell Labs had slightly bigger computers + +0:50:53.730,0:50:57.389 +but what they had was a data set that came from the Postal Service + +0:50:57.390,0:51:00.629 +So they had zip codes for envelopes and we built a + +0:51:00.730,0:51:05.159 +data set out of those zip codes and then trained a slightly bigger a neural net for three weeks + +0:51:06.430,0:51:12.749 +and got really good results. 
So this convolutional net did not have separate + +0:51:13.960,0:51:15.960 +convolution and pooling + +0:51:16.240,0:51:22.769 +It had strided convolution, so convolutions where the window is shifted by more than one pixel. So that's... + +0:51:23.860,0:51:29.739 +What's the result of this? So the result is that the output map when you do a convolution where the stride is + +0:51:30.710,0:51:36.369 +more than one, you get an output whose resolution is smaller than the input and you see an example here + +0:51:36.370,0:51:40.390 +So here the input is 16 by 16 pixels. That's what we could afford + +0:51:41.900,0:51:49.029 +The kernels are 5 by 5, but they are shifted by 2 pixels every time and so the + +0:51:51.950,0:51:56.919 +the output here is smaller because of that + +0:52:11.130,0:52:13.980 +Okay? And then one year later this was the next generation + +0:52:14.830,0:52:16.830 +convolutional net. This one had separate + +0:52:17.680,0:52:19.680 +convolution and pooling so... + +0:52:20.740,0:52:24.389 +Where's the pooling operation? At that time, the pooling operation was just another + +0:52:25.690,0:52:31.829 +neuron except that all the weights of that neuron were equal, okay? So a pooling unit was basically + +0:52:32.680,0:52:36.839 +a unit that computed an average of its inputs + +0:52:37.180,0:52:41.730 +it added a bias, and then passed it to a non-linearity, which in this case was a hyperbolic tangent function + +0:52:42.820,0:52:48.450 +Okay? All the non-linearities in this network were hyperbolic tangents at the time. That's what people were doing + +0:52:53.200,0:52:55.200 +And the pooling operation was + +0:52:56.380,0:52:58.440 +performed by shifting + +0:52:59.680,0:53:01.710 +the window over which you compute the + +0:53:02.770,0:53:09.240 +the aggregate of the output of the previous layer by 2 pixels, okay? So here + +0:53:10.090,0:53:13.470 +you get a 32 by 32 input window + +0:53:14.470,0:53:20.730 +You convolve this with filters that are 5 by 5. I should mention that a convolution kernel sometimes is also called a filter + +0:53:22.540,0:53:25.230 +And so what you get here are + +0:53:27.520,0:53:29.520 +outputs that are + +0:53:30.520,0:53:33.749 +I guess minus 4 so is 28 by 28, okay? + +0:53:34.540,0:53:40.380 +And then there is a pooling which computes an average of + +0:53:41.530,0:53:44.400 +pixels here over a 2 by 2 window and + +0:53:45.310,0:53:47.310 +then shifts that window by 2 + +0:53:48.160,0:53:50.160 +So how many such windows do you have? + +0:53:51.220,0:53:56.279 +Since the image is 28 by 28, you divide by 2, is 14 by 14, okay? So those images + +0:53:57.460,0:54:00.359 +here are 14 by 14 pixels + +0:54:02.050,0:54:05.759 +And they are basically half the resolution as the previous window + +0:54:07.420,0:54:09.420 +because of this stride + +0:54:10.360,0:54:16.470 +Okay? Now it becomes interesting because what you want is, you want the next layer to detect combinations of features from the previous layer + +0:54:17.200,0:54:19.200 +And so... + +0:54:20.200,0:54:22.619 +the way to do this is... you have + +0:54:23.440,0:54:26.579 +different convolution filters apply to each of those feature maps + +0:54:27.730,0:54:29.730 +Okay? + +0:54:29.950,0:54:35.939 +And you sum them up, you sum the results of those four convolutions and you pass the result to a non-linearity and that gives you + +0:54:36.910,0:54:42.239 +one feature map of the next layer. 
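A minimal sketch of the first stage just described, with the sizes from the lecture (the exact placement of the tanh and of the learned gain and bias of the old-style pooling unit is a simplification): a 5x5 convolution takes the 32x32 window to 28x28 maps, and a 2x2 average with stride 2, followed by gain, bias, and tanh, halves the resolution to 14x14.

import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)                    # 32x32 input window

conv1 = nn.Conv2d(1, 6, kernel_size=5)           # 6 feature maps, 5x5 kernels
pool = nn.AvgPool2d(kernel_size=2, stride=2)     # average over 2x2, shift by 2
gain = torch.ones(6, 1, 1, requires_grad=True)   # learned gain and bias of the
bias = torch.zeros(6, 1, 1, requires_grad=True)  # old-style "subsampling" unit

h = torch.tanh(conv1(x))                         # (1, 6, 28, 28): 32 - (5 - 1)
s = torch.tanh(gain * pool(h) + bias)            # (1, 6, 14, 14): halved by the stride
print(h.shape, s.shape)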
So because those filters are 5 by 5 and those + +0:54:43.330,0:54:46.380 +images are 14 by 14, those guys are 10 by 10 + +0:54:47.290,0:54:49.739 +Okay? To not have border effects + +0:54:52.270,0:54:56.999 +So each of these feature maps --of which there are sixteen if I remember correctly + +0:54:59.290,0:55:01.290 +uses a different set of + +0:55:02.860,0:55:04.860 +kernels to... + +0:55:06.340,0:55:09.509 +convolve the previous layers. In fact + +0:55:10.630,0:55:13.799 +the connection pattern between the feature map... + +0:55:14.650,0:55:18.720 +the feature map at this layer and the feature map at the next layer is actually not full + +0:55:18.720,0:55:22.349 +so not every feature map is connected to every feature map. There's a particular scheme of + +0:55:23.680,0:55:25.950 +different combinations of feature map from the previous layer + +0:55:28.030,0:55:33.600 +combining to four feature maps at the next layer. And the reason for doing this is just to save computer time + +0:55:34.000,0:55:40.170 +We just could not afford to connect everything to everything. It would have taken twice the time to run or more + +0:55:41.890,0:55:48.359 +Nowadays we are kind of forced more or less to actually have a complete connection between feature maps in a convolutional net + +0:55:49.210,0:55:52.289 +Because of the way that multiple convolutions are implemented in GPUs + +0:55:53.440,0:55:55.440 +Which is sad + +0:55:56.560,0:55:59.789 +And then the next layer up. So again those maps are 10 by 10 + +0:55:59.790,0:56:02.729 +Those feature maps are 10 by 10 and the next layer up + +0:56:03.970,0:56:06.389 +is produced by pooling and subsampling + +0:56:07.330,0:56:09.330 +by a factor of 2 + +0:56:09.370,0:56:11.370 +and so those are 5 by 5 + +0:56:12.070,0:56:14.880 +Okay? And then again there is a 5 by 5 convolution here + +0:56:14.880,0:56:18.089 +Of course, you can't move the window 5 by 5 over a 5 by 5 image + +0:56:18.090,0:56:21.120 +So it looks like a full connection, but it's actually a convolution + +0:56:22.000,0:56:24.000 +Okay? Keep this in mind + +0:56:24.460,0:56:26.460 +But you basically just sum in only one location + +0:56:27.250,0:56:33.960 +And those feature maps at the top here are really outputs. And so you have one special location + +0:56:33.960,0:56:39.399 +Okay? Because you can only place one 5 by 5 window within a 5 by 5 image + +0:56:40.460,0:56:45.340 +And you have 10 of those feature maps each of which corresponds to a category so you train the system to classify + +0:56:45.560,0:56:47.619 +digits from 0 to 9, you have ten categories + +0:56:59.750,0:57:03.850 +This is a little animation that I borrowed from Andrej Karpathy + +0:57:05.570,0:57:08.439 +He spent the time to build this really nice real animation + +0:57:09.470,0:57:16.780 +which is to represent several convolutions, right? So you have three feature maps here on the input and you have six + +0:57:18.650,0:57:21.100 +convolution kernels and two feature maps on the output + +0:57:21.100,0:57:26.709 +So here the first group of three feature maps are convolved with... + +0:57:28.520,0:57:31.899 +kernels are convolved with the three input feature maps to produce + +0:57:32.450,0:57:37.330 +the first group, the first of the two feature maps, the green one at the top + +0:57:38.390,0:57:40.370 +Okay? + +0:57:40.370,0:57:42.820 +And then... 
+ +0:57:44.180,0:57:49.000 +Okay, so this is the first group of three kernels convolved with the three feature maps + +0:57:49.000,0:57:53.349 +And they produce the green map at the top, and then you switch to the second group of + +0:57:54.740,0:57:58.479 +of convolution kernels. You convolve with the + +0:57:59.180,0:58:04.149 +three input feature maps to produce the map at the bottom. Okay? So that's + +0:58:05.810,0:58:07.810 +an example of + +0:58:10.070,0:58:17.709 +n-feature map on the input, n-feature map on the output, and N times M convolution kernels to get all combinations + +0:58:25.000,0:58:27.000 +Here's another animation which I made a long time ago + +0:58:28.100,0:58:34.419 +That shows convolutional net after it's been trained in action trying to recognize digits + +0:58:35.330,0:58:38.529 +And so what's interesting to look at here is you have + +0:58:39.440,0:58:41.440 +an input here, which is I believe + +0:58:42.080,0:58:44.590 +32 rows by 64 columns + +0:58:45.770,0:58:52.570 +And after doing six convolutions with six convolution kernels passing it through a hyperbolic tangent non-linearity after a bias + +0:58:52.570,0:58:59.229 +you get those feature maps here, each of which kind of activates for a different type of feature. So, for example + +0:58:59.990,0:59:01.990 +the feature map at the top here + +0:59:02.390,0:59:04.690 +turns on when there is some sort of a horizontal edge + +0:59:07.400,0:59:10.090 +This guy here it turns on whenever there is a vertical edge + +0:59:10.940,0:59:15.340 +Okay? And those convolutional kernels have been learned through backprop, the thing has been just been trained + +0:59:15.980,0:59:20.980 +with backprop. Not set by hand. They're set randomly usually + +0:59:21.620,0:59:26.769 +So you see this notion of equivariance here, if I shift the input image the + +0:59:27.500,0:59:31.600 +activations on the feature maps shift, but otherwise stay unchanged + +0:59:32.540,0:59:34.540 +All right? + +0:59:34.940,0:59:36.940 +That's shift equivariance + +0:59:36.950,0:59:38.860 +Okay, and then we go to the pooling operation + +0:59:38.860,0:59:42.519 +So this first feature map here corresponds to a pooled version of + +0:59:42.800,0:59:46.149 +this first one, the second one to the second one, third went to the third one + +0:59:46.250,0:59:51.370 +and the pooling operation here again is an average, then a bias, then a similar non-linearity + +0:59:52.070,0:59:55.029 +And so if this map shifts by + +0:59:56.570,0:59:59.499 +one pixel this map will shift by one half pixel + +1:00:01.370,1:00:02.780 +Okay? + +1:00:02.780,1:00:05.259 +So you still have equavariance, but + +1:00:06.260,1:00:11.830 +shifts are reduced by a factor of two, essentially + +1:00:11.830,1:00:15.850 +and then you have the second stage where each of those maps here is a result of + +1:00:16.160,1:00:23.440 +doing a convolution on each, or a subset of the previous maps with different kernels, summing up the result, passing the result through + +1:00:24.170,1:00:27.070 +a sigmoid, and so you get those kind of abstract features + +1:00:28.730,1:00:32.889 +here that are a little hard to interpret visually, but it's still equivariant to shift + +1:00:33.860,1:00:40.439 +Okay? And then again you do pooling and subsampling. So the pooling also has this stride by a factor of two + +1:00:40.630,1:00:42.580 +So what you get here are + +1:00:42.580,1:00:47.609 +our maps, so that those maps shift by one quarter pixel if the input shifts by one pixel + +1:00:48.730,1:00:55.290 +Okay? 
So we reduce the shift and it becomes... it might become easier and easier for following layers to kind of interpret what the shape is + +1:00:55.290,1:00:57.290 +because you exchange + +1:00:58.540,1:01:00.540 +spatial resolution for + +1:01:01.030,1:01:05.009 +feature type resolution. You increase the number of feature types as you go up the layers + +1:01:06.040,1:01:08.879 +The spatial resolution goes down because of the pooling and subsampling + +1:01:09.730,1:01:14.459 +But the number of feature maps increases and so you make the representation a little more abstract + +1:01:14.460,1:01:19.290 +but less sensitive to shift and distortions. And the next layer + +1:01:20.740,1:01:25.080 +again performs convolutions, but now the size of the convolution kernel is equal to the height of the image + +1:01:25.080,1:01:27.449 +And so what you get is a single band + +1:01:28.359,1:01:32.219 +for this feature map. It basically becomes one dimensional and + +1:01:32.920,1:01:39.750 +so now any vertical shift is basically eliminated, right? It's turned into some variation of activation, but it's not + +1:01:40.840,1:01:42.929 +It's not a shift anymore. It's some sort of + +1:01:44.020,1:01:45.910 +simpler --hopefully + +1:01:45.910,1:01:49.020 +transformation of the input. In fact, you can show it's simpler + +1:01:51.160,1:01:53.580 +It's flatter in some ways + +1:01:56.650,1:02:00.330 +Okay? So that's the sort of generic convolutional net architecture we have + +1:02:01.570,1:02:05.699 +This is a slightly more modern version of it, where you have some form of normalization + +1:02:07.450,1:02:09.450 +Batch norm + +1:02:10.600,1:02:15.179 +Good norm, whatever. A filter bank, those are the multiple convolutions + +1:02:16.660,1:02:18.690 +In signal processing they're called filter banks + +1:02:19.840,1:02:27.149 +Pointwise non-linearity, generally a ReLU, and then some pooling, generally max pooling in the most common + +1:02:28.330,1:02:30.629 +implementations of convolutional nets. You can, of course + +1:02:30.630,1:02:35.880 +imagine other types of pooling. I talked about the average but the more generic version is the LP norm + +1:02:36.640,1:02:38.640 +which is... + +1:02:38.770,1:02:45.530 +take all the inputs through a complex cell, elevate them to some power and then take the... + +1:02:45.530,1:02:47.530 +Sum them up, and then take the... + +1:02:49.860,1:02:51.860 +Elevate that to 1 over the power + +1:02:53.340,1:02:58.489 +Yeah, this should be a sum inside of the P-th root here + +1:03:00.870,1:03:02.870 +Another way to pool and again + +1:03:03.840,1:03:07.759 +a good pooling operation is an operation that is + +1:03:07.920,1:03:11.719 +invariant to a permutation of the input. It gives you the same result + +1:03:12.750,1:03:14.750 +regardless of the order in which you put the input + +1:03:15.780,1:03:22.670 +Here's another example. We talked about this function before: 1 over b log sum of our inputs of e to the bXᵢ + +1:03:25.920,1:03:30.649 +Exponential bX. Again, that's a kind of symmetric aggregation operation that you can use + +1:03:32.400,1:03:35.539 +So that's kind of a stage of a convolutional net, and then you can repeat that + +1:03:36.270,1:03:43.729 +There's sort of various ways of positioning the normalization. Some people put it after the non-linearity before the pooling + +1:03:43.730,1:03:45.730 +You know, it depends + +1:03:46.590,1:03:48.590 +But it's typical + +1:03:53.640,1:03:56.569 +So, how do you do this in PyTorch? 
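Before the PyTorch walkthrough that follows, here is a minimal sketch of the two pooling operations just described, the LP-norm pooling (sum inside the p-th root) and the (1/b)·log-sum-exp aggregation. The window size, the values of p and b, and the use of plain tensor ops instead of a built-in layer are assumptions made only for illustration; PyTorch also ships `nn.LPPool1d`/`nn.LPPool2d` for the LP-norm case.

```python
import torch

def lp_pool(x, p=2.0, window=2):
    # LP-norm pooling over non-overlapping windows of a 1D signal:
    # ( sum_i |x_i|^p )^(1/p) computed inside each window.
    x = x.unfold(dimension=-1, size=window, step=window)   # (..., n_windows, window)
    return x.abs().pow(p).sum(dim=-1).pow(1.0 / p)

def log_sum_exp_pool(x, b=1.0, window=2):
    # Soft aggregation: (1/b) * log( sum_i exp(b * x_i) ) per window.
    x = x.unfold(dimension=-1, size=window, step=window)
    return torch.logsumexp(b * x, dim=-1) / b

signal = torch.randn(1, 8)          # one channel, eight samples
print(lp_pool(signal))              # four pooled values
print(log_sum_exp_pool(signal))     # four pooled values
```

Both operations are symmetric in their inputs, so permuting the values inside a window leaves the result unchanged, which is the property of a good pooling operation mentioned above.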
there's a number of different ways + +1:03:56.570,1:04:02.479 +You can do it by writing it explicitly, writing a class. So this is an example of a convolutional net class + +1:04:04.020,1:04:10.520 +In particular one here where you do convolutions, ReLU and max pooling + +1:04:12.600,1:04:17.900 +Okay, so the constructor here creates convolutional layers which have parameters in them + +1:04:18.810,1:04:24.499 +And this one has what's called fully-connected layers. I hate that. Okay? + +1:04:25.980,1:04:30.919 +So there is this idea somehow that the last layer of a convolutional net + +1:04:32.760,1:04:34.790 +Like this one, is fully connected because + +1:04:37.320,1:04:42.860 +every unit in this layer is connected to every unit in that layer. So that looks like a full connection + +1:04:44.010,1:04:47.060 +But it's actually useful to think of it as a convolution + +1:04:49.200,1:04:51.060 +Okay? + +1:04:51.060,1:04:56.070 +Now, for efficiency reasons, or maybe some others bad reasons they're called + +1:04:57.370,1:05:00.959 +fully-connected layers, and we used the class linear here + +1:05:01.120,1:05:05.459 +But it kind of breaks the whole idea that your network is a convolutional network + +1:05:06.070,1:05:09.209 +So it's much better actually to view them as convolutions + +1:05:09.760,1:05:14.370 +In this case one by one convolution which is sort of a weird concept. Okay. So here we have + +1:05:15.190,1:05:20.46 +four layers, two convolutional layers and two so-called fully-connected layers + +1:05:21.790,1:05:23.440 +And then the way we... + +1:05:23.440,1:05:29.129 +So we need to create them in the constructor, and the way we use them in the forward pass is that + +1:05:30.630,1:05:35.310 +we do a convolution of the input, and then we apply the ReLU, and then we do max pooling and then we + +1:05:35.710,1:05:38.699 +run the second layer, and apply the ReLU, and do max pooling again + +1:05:38.700,1:05:44.280 +And then we reshape the output because it's a fully connected layer. So we want to make this a + +1:05:45.190,1:05:47.879 +vector so that's what the x.view(-1) does + +1:05:48.820,1:05:50.820 +And then apply a + +1:05:51.160,1:05:53.160 +ReLU to it + +1:05:53.260,1:05:55.260 +And... + +1:05:55.510,1:06:00.330 +the second fully-connected layer, and then apply a softmax if we want to do classification + +1:06:00.460,1:06:04.409 +And so this is somewhat similar to the architecture you see at the bottom + +1:06:04.900,1:06:08.370 +The numbers might be different in terms of feature maps and stuff, but... + +1:06:09.160,1:06:11.160 +but the general architecture is + +1:06:12.250,1:06:14.250 +pretty much what we're talking about + +1:06:15.640,1:06:17.640 +Yes? + +1:06:20.530,1:06:22.530 +Say again + +1:06:24.040,1:06:26.100 +You know, whatever gradient descent decides + +1:06:28.630,1:06:30.630 +We can look at them, but + +1:06:31.180,1:06:33.180 +if you train with a lot of + +1:06:33.280,1:06:37.590 +examples of natural images, the kind of filters you will see at the first layer + +1:06:37.840,1:06:44.999 +basically will end up being mostly oriented edge detectors, very much similar to what people, to what neuroscientists + +1:06:45.340,1:06:49.110 +observe in the cortex of + +1:06:49.210,1:06:50.440 +animals + +1:06:50.440,1:06:52.440 +In the visual cortex of animals + +1:06:55.780,1:06:58.469 +They will change when you train the model, that's the whole point yes + +1:07:05.410,1:07:11.160 +Okay, so it's pretty simple. Here's another way of defining those. This is... 
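Before moving on to that second way, here is a hedged sketch of the kind of class-based definition just walked through; the channel counts, the 28×28 input assumption, and the use of `log_softmax` are arbitrary choices for illustration, not the exact numbers from the lecture's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleConvNet(nn.Module):
    """Two conv layers plus two so-called fully-connected layers."""
    def __init__(self, n_classes=10):
        super().__init__()
        # Layers with parameters are created in the constructor.
        self.conv1 = nn.Conv2d(1, 8, kernel_size=3)   # assumed channel counts
        self.conv2 = nn.Conv2d(8, 16, kernel_size=3)
        self.fc1 = nn.Linear(16 * 5 * 5, 64)          # assumes a 28x28 input
        self.fc2 = nn.Linear(64, n_classes)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)    # conv -> ReLU -> max pool
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)    # conv -> ReLU -> max pool
        x = x.view(x.size(0), -1)                     # flatten for the "FC" layers
        x = F.relu(self.fc1(x))
        return F.log_softmax(self.fc2(x), dim=1)      # class scores

out = SimpleConvNet()(torch.randn(4, 1, 28, 28))      # -> shape (4, 10)
```

As noted above, the two `nn.Linear` layers could just as well be written as convolutions (the second one a 1×1 convolution), which is what makes the fully-convolutional tricks discussed later possible.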
I guess it's kind of an + +1:07:12.550,1:07:15.629 +outdated way of doing it, right? Not many people do this anymore + +1:07:17.170,1:07:23.340 +but it's kind of a simple way. Also there is this class in PyTorch called nn.Sequential + +1:07:24.550,1:07:28.469 +It's basically a container and you keep putting modules in it and it just + +1:07:29.080,1:07:36.269 +automatically kind of use them as being kind of connected in sequence, right? And so then you just have to call + +1:07:40.780,1:07:45.269 +forward on it and it will just compute the right thing + +1:07:46.360,1:07:50.370 +In this particular form here, you pass it a bunch of pairs + +1:07:50.370,1:07:55.229 +It's like a dictionary so you can give a name to each of the layers, and you can later access them + +1:08:08.079,1:08:10.079 +It's the same architecture we were talking about earlier + +1:08:18.489,1:08:24.029 +Yeah, I mean the backprop is automatic, right? You get it + +1:08:25.630,1:08:27.630 +by default you just call + +1:08:28.690,1:08:32.040 +backward and it knows how to back propagate through it + +1:08:44.000,1:08:49.180 +Well, the class kind of encapsulates everything into an object where the parameters are + +1:08:49.250,1:08:51.250 +There's a particular way of... + +1:08:52.220,1:08:54.220 +getting the parameters out and + +1:08:55.130,1:08:58.420 +kind of feeding them to an optimizer + +1:08:58.420,1:09:01.330 +And so the optimizer doesn't need to know what your network looks like + +1:09:01.330,1:09:06.910 +It just knows that there is a function and there is a bunch of parameters and it gets a gradient and + +1:09:06.910,1:09:08.910 +it doesn't need to know what your network looks like + +1:09:10.790,1:09:12.879 +Yeah, you'll hear more about this + +1:09:14.840,1:09:16.840 +tomorrow + +1:09:25.610,1:09:33.159 +So here's a very interesting aspect of convolutional nets and it's one of the reasons why they've become so + +1:09:33.830,1:09:37.390 +successful in many applications. It's the fact that + +1:09:39.440,1:09:45.280 +if you view every layer in a convolutional net as a convolution, so there is no full connections, so to speak + +1:09:47.660,1:09:53.320 +you don't need to have a fixed size input. You can vary the size of the input and the network will + +1:09:54.380,1:09:56.380 +vary its size accordingly + +1:09:56.780,1:09:58.780 +because... + +1:09:59.510,1:10:01.510 +when you apply a convolution to an image + +1:10:02.150,1:10:05.800 +you fit it an image of a certain size, you do a convolution with a kernel + +1:10:06.620,1:10:11.979 +you get an image whose size is related to the size of the input + +1:10:12.140,1:10:15.789 +but you can change the size of the input and it just changes the size of the output + +1:10:16.760,1:10:20.320 +And this is true for every convolutional-like like operation, right? + +1:10:20.320,1:10:25.509 +So if your network is composed only of convolutions, then it doesn't matter what the size of the input is + +1:10:26.180,1:10:31.450 +It's going to go through the network and the size of every layer will change according to the size of the input + +1:10:31.580,1:10:34.120 +and the size of the output will also change accordingly + +1:10:34.640,1:10:37.329 +So here is a little example here where + +1:10:38.720,1:10:40.720 +I wanna do + +1:10:41.300,1:10:45.729 +cursive handwriting recognition and it's very hard because I don't know where the letters are + +1:10:45.730,1:10:48.700 +So I can't just have a character recognizer that... 
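Stepping back for a moment to the two PyTorch points above: a small sketch of `nn.Sequential` built from named `(name, module)` pairs via an `OrderedDict`, and of the fact that a stack made only of convolution-like layers accepts inputs of any size. All sizes and channel counts here are made up.

```python
from collections import OrderedDict
import torch
import torch.nn as nn

# Named layers via an OrderedDict of (name, module) pairs.
features = nn.Sequential(OrderedDict([
    ("conv1", nn.Conv2d(1, 8, kernel_size=3)),
    ("relu1", nn.ReLU()),
    ("pool1", nn.MaxPool2d(2)),
    ("conv2", nn.Conv2d(8, 16, kernel_size=3)),
    ("relu2", nn.ReLU()),
    ("pool2", nn.MaxPool2d(2)),
]))

# No layer here fixes the input size, so only the output size changes
# when the input size changes.
print(features(torch.randn(1, 1, 32, 32)).shape)   # torch.Size([1, 16, 6, 6])
print(features(torch.randn(1, 1, 64, 32)).shape)   # torch.Size([1, 16, 14, 6])
print(features.conv1)                               # layers are accessible by name
```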
+ +1:10:49.260,1:10:51.980 +I mean a system that will first cut the + +1:10:52.890,1:10:56.100 +word into letters + +1:10:56.100,1:10:57.72 +because I don't know where the letters are + +1:10:57.720,1:10:59.900 +and then apply the convolutional net to each of the letters + +1:11:00.210,1:11:05.200 +So the best I can do is take the convolutional net and swipe it over the input and then record the output + +1:11:05.850,1:11:11.810 +Okay? And so you would think that to do this you will have to take a convolutional net like this that has a window + +1:11:12.060,1:11:14.389 +large enough to see a single character + +1:11:15.120,1:11:21.050 +and then you take your input image and compute your convolutional net at every location + +1:11:21.660,1:11:27.110 +shifting it by one pixel or two pixels or four pixels or something like this, a small enough number of pixels that + +1:11:27.630,1:11:30.619 +regardless of where the character occurs in the input + +1:11:30.620,1:11:35.000 +you will still get a score on the output whenever it needs to recognize one + +1:11:36.150,1:11:38.989 +But it turns out that will be extremely wasteful + +1:11:40.770,1:11:42.770 +because... + +1:11:43.290,1:11:50.179 +you will be redoing the same computation multiple times. And so the proper way to do this --and this is very important to understand + +1:11:50.880,1:11:56.659 +is that you don't do what I just described where you have a small convolutional net that you apply to every window + +1:11:58.050,1:12:00.050 +What you do is you + +1:12:01.230,1:12:07.939 +take a large input and you apply the convolutions to the input image since it's larger you're gonna get a larger output + +1:12:07.940,1:12:11.270 +you apply the second layer convolution to that, or the pooling, whatever it is + +1:12:11.610,1:12:15.170 +You're gonna get a larger input again, etc. + +1:12:15.170,1:12:16.650 +all the way to the top and + +1:12:16.650,1:12:20.929 +whereas in the original design you were getting only one output now you're going to get multiple outputs because + +1:12:21.570,1:12:23.570 +it's a convolutional layer + +1:12:27.990,1:12:29.990 +This is super important because + +1:12:30.600,1:12:35.780 +this way of applying a convolutional net with a sliding window is + +1:12:36.870,1:12:40.610 +much, much cheaper than recomputing the convolutional net at every location + +1:12:42.510,1:12:44.510 +Okay? + +1:12:45.150,1:12:51.619 +You would not believe how many decades it took to convince people that this was a good thing + +1:12:58.960,1:13:03.390 +So here's an example of how you can use this + +1:13:04.090,1:13:09.180 +This is a conventional net that was trained on individual digits, 32 by 32. It was trained on a MNIST, okay? + +1:13:09.760,1:13:11.760 +32 by 32 input windows + +1:13:12.400,1:13:15.690 +It's LeNet 5, so it's very similar to the architecture + +1:13:15.690,1:13:20.940 +I just showed the code for, okay? It's trained on individual characters to just classify + +1:13:21.970,1:13:26.369 +the character in the center of the image. And the way it was trained was there was a little bit of data + +1:13:26.770,1:13:30.359 +augmentation where the character in the center was kind of shifted a little bit in various locations + +1:13:31.420,1:13:36.629 +changed in size. And then there were two other characters + +1:13:37.420,1:13:39.600 +that were kind of added to the side to confuse it + +1:13:40.480,1:13:45.660 +in many samples. 
And then it was also trained with an 11th category + +1:13:45.660,1:13:50.249 +which was "none of the above" and the way it's trained is either you show it a blank image + +1:13:50.410,1:13:54.149 +or you show it an image where there is no character in the center but there are characters on the side + +1:13:54.940,1:13:59.399 +so that it would detect whenever it's inbetween two characters + +1:14:00.520,1:14:02.520 +and then you do this thing of + +1:14:02.650,1:14:10.970 +computing the convolutional net at every location on the input without actually shifting it but just applying the convolutions to the entire image + +1:14:11.740,1:14:13.740 +And that's what you get + +1:14:13.780,1:14:23.220 +So here the input image is 64 by 32, even though the network was trained on 32 by 32 with those kind of generated examples + +1:14:24.280,1:14:28.049 +And what you see is the activity of some of the layers, not all of them are represented + +1:14:29.410,1:14:32.309 +And what you see at the top here, those kind of funny shapes + +1:14:33.520,1:14:37.560 +You see threes and fives popping up and they basically are an + +1:14:38.830,1:14:41.850 +indication of the winning category for every location, right? + +1:14:42.670,1:14:47.339 +So the eight outputs that you see at the top are + +1:14:48.520,1:14:50.520 +basically the output corresponding to eight different + +1:14:51.250,1:14:56.790 +positions of the 32 by 32 input window on the input, shifted by 4 pixels every time + +1:14:59.530,1:15:05.859 +And what is represented is the winning category within that window and the grayscale indicates the score, okay? + +1:15:07.220,1:15:10.419 +So what you see is that there's two detectors detecting the five + +1:15:11.030,1:15:15.850 +until the three kind of starts overlapping. And then two detectors are detecting the three that kind of moved around + +1:15:18.230,1:15:22.779 +because within a 32 by 32 window + +1:15:23.390,1:15:29.919 +the three appears to the left of that 32 by 32 window, and then to the right of that other 32 by 32 windows shifted by four + +1:15:29.920,1:15:31.920 +and so those two detectors detect + +1:15:32.690,1:15:34.690 +that 3 or that 5 + +1:15:36.140,1:15:39.890 +So then what you do is you take all those scores here at the top and you + +1:15:39.890,1:15:43.809 +do a little bit of post-processing very simple and you figure out if it's a three and a five + +1:15:44.630,1:15:46.630 +What's interesting about this is that + +1:15:47.660,1:15:49.899 +you don't need to do prior segmentation + +1:15:49.900,1:15:51.860 +So something that people had to do + +1:15:51.860,1:15:58.180 +before, in computer vision, was if you wanted to recognize an object you had to separate the object from its background because the recognition system + +1:15:58.490,1:16:00.490 +would get confused by + +1:16:00.800,1:16:07.900 +the background. But here with this convolutional net, it's been trained with overlapping characters and it knows how to tell them apart + +1:16:08.600,1:16:10.809 +And so it's not confused by characters that overlap + +1:16:10.810,1:16:15.729 +I have a whole bunch of those on my web website, by the way, those animations from the early nineties + +1:16:38.450,1:16:41.679 +No, that was the main issue. That's one of the reasons why + +1:16:44.210,1:16:48.040 +computer vision wasn't working very well. It's because the very problem of + +1:16:49.850,1:16:52.539 +figure/background separation, detecting an object + +1:16:53.780,1:16:59.530 +and recognizing it is the same. 
You can't recognize the object until you segment it but you can't segment it until you recognize it + +1:16:59.840,1:17:05.290 +It's the same for cursive handwriting recognition, right? You can't... so here's an example + +1:17:07.460,1:17:09.460 +Do we have pens? + +1:17:10.650,1:17:12.650 +Doesn't look like we have pens right? + +1:17:14.969,1:17:21.859 +Here we go, that's true. I'm sorry... maybe I should use the... + +1:17:24.780,1:17:26.780 +If this works... + +1:17:34.500,1:17:36.510 +Oh, of course... + +1:17:43.409,1:17:45.409 +Okay... + +1:17:52.310,1:17:54.310 +Can you guys read this? + +1:17:55.670,1:18:01.990 +Okay, I mean it's horrible handwriting but it's also because I'm writing on the screen. Okay, now can you read it? + +1:18:08.240,1:18:10.240 +Minimum, yeah + +1:18:11.870,1:18:15.010 +Okay, there's actually no way you can segment the letters out of this right + +1:18:15.010,1:18:17.439 +I mean this is kind of a random number of waves + +1:18:17.900,1:18:23.260 +But just the fact that the two "I"s are identified, then it's basically not ambiguous at least in English + +1:18:24.620,1:18:26.620 +So that's a good example of + +1:18:28.100,1:18:30.340 +the interpretation of individual + +1:18:31.580,1:18:38.169 +objects depending on their context. And what you need is some sort of high-level language model to know what words are possible + +1:18:38.170,1:18:40.170 +If you don't know English or similar + +1:18:40.670,1:18:44.320 +languages that have the same word, there's no way you can you can read this + +1:18:45.500,1:18:48.490 +Spoken language is very similar to this + +1:18:49.700,1:18:53.679 +All of you who have had the experience of learning a foreign language + +1:18:54.470,1:18:56.470 +probably had the experience that + +1:18:57.110,1:19:04.150 +you have a hard time segmenting words from a new language and then recognizing the words because you don't have the vocabulary + +1:19:04.850,1:19:09.550 +Right? So if I speak in French -- si je commence à parler français, vous n'avez aucune idée d'où sont les limites des mots -- +[If I start speaking French, you have no idea where the limits of words are] + +1:19:09.740,1:19:13.749 +Except if you speak French. So I spoke a sentence, it's words + +1:19:13.750,1:19:17.140 +but you can't tell the boundary between the words right because it is basically no + +1:19:17.990,1:19:23.800 +clear seizure between the words unless you know where the words are in advance, right? So that's the problem of segmentation + +1:19:23.900,1:19:28.540 +You can't recognize until you segment, you can't segment until you recognize you have to do both at the same time + +1:19:29.150,1:19:32.379 +Early computer vision systems had a really hard time doing this + +1:19:40.870,1:19:46.739 +So that's why this kind of stuff is big progress because you don't have to do segmentation in advance, it just... + +1:19:47.679,1:19:52.559 +just train your system to be robust to kind of overlapping objects and things like that. Yes, in the back! + +1:19:55.510,1:19:59.489 +Yes, there is a background class. So when you see a blank response + +1:20:00.340,1:20:04.410 +it means the system says "none of the above" basically, right? So it's been trained + +1:20:05.590,1:20:07.590 +to produce "none of the above" + +1:20:07.690,1:20:11.699 +either when the input is blank or when there is one character that's too + +1:20:13.420,1:20:17.190 +outside of the center or when you have two characters + +1:20:17.620,1:20:24.029 +but there's nothing in the center. 
Or when you have two characters that overlap, but there is no central character, right? So it's... + +1:20:24.760,1:20:27.239 +trying to detect boundaries between characters essentially + +1:20:28.420,1:20:30.420 +Here's another example + +1:20:31.390,1:20:38.640 +This is an example that shows that even a very simple convolutional net with just two stages, right? convolution, pooling, convolution + +1:20:38.640,1:20:40.640 +pooling, and then two layers of... + +1:20:42.010,1:20:44.010 +two more layers afterwards + +1:20:44.770,1:20:47.429 +can solve what's called the feature-binding problem + +1:20:48.130,1:20:50.130 +So visual neuroscientists and + +1:20:50.320,1:20:56.190 +computer vision people had the issue --it was kind of a puzzle-- How is it that + +1:20:57.489,1:21:01.289 +we perceive objects as objects? Objects are collections of features + +1:21:01.290,1:21:04.229 +but how do we bind all the features together of an object to form this object? + +1:21:06.460,1:21:09.870 +Is there some kind of magical way of doing this? + +1:21:12.520,1:21:16.589 +And they did... psychologists did experiments like... + +1:21:24.210,1:21:26.210 +draw this and then that + +1:21:28.239,1:21:31.349 +and you perceive the bar as + +1:21:32.469,1:21:39.419 +a single bar because you're used to bars being obstructed by, occluded by other objects + +1:21:39.550,1:21:41.550 +and so you just assume it's an occlusion + +1:21:44.410,1:21:47.579 +And then there are experiments that figure out how much do I have to + +1:21:48.430,1:21:52.109 +shift the two bars to make me perceive them as two separate bars + +1:21:53.980,1:21:56.580 +But in fact, the minute they perfectly line and if you... + +1:21:57.250,1:21:59.080 +if you do this.. + +1:21:59.080,1:22:03.809 +maybe exactly identical to what you see here, but now you perceive them as two different objects + +1:22:06.489,1:22:12.929 +So how is it that we seem to be solving the feature-binding problem? + +1:22:15.880,1:22:21.450 +And what this shows is that you don't need any specific mechanism for it. It just happens + +1:22:22.210,1:22:25.919 +If you have enough nonlinearities and you train with enough data + +1:22:26.440,1:22:33.359 +then, as a side effect, you get a system that solves the feature-binding problem without any particular mechanism for it + +1:22:37.510,1:22:40.260 +So here you have two shapes and you move a single + +1:22:43.060,1:22:50.519 +stroke and it goes from a six and a one, to a three, to a five and a one, to a seven and a three + +1:22:53.140,1:22:55.140 +Etcetera + +1:23:00.020,1:23:07.480 +Right, good question. So the question is: how do you distinguish between the two situations? We have two fives next to each other and + +1:23:08.270,1:23:14.890 +the fact that you have a single five being detected by two different frames, right? Two different framing of that five + +1:23:15.470,1:23:17.470 +Well there is this explicit + +1:23:17.660,1:23:20.050 +training so that when you have two characters that + +1:23:20.690,1:23:25.029 +are touching and none of them is really centered you train the system to say "none of the above", right? + +1:23:25.030,1:23:29.079 +So it's always going to have five blank five + +1:23:30.020,1:23:35.800 +It's always gonna have even like one blank one, and the ones can be very close. It will you'll tell you the difference + +1:23:39.170,1:23:41.289 +Okay, so what are convnets good for? 
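Before that question is answered, here is a hedged sketch of the "sliding window as one big convolution" idea described above: the classifier head is written as a convolution whose kernel covers the whole feature map produced by a 32×32 training window, so feeding a wider image directly yields a map of class scores, one column per 4-pixel shift. The layer sizes and the 11-way output (ten digits plus a "none of the above" class) follow the description only loosely, not the actual LeNet-5 configuration.

```python
import torch
import torch.nn as nn

# Feature extractor with two stride-2 poolings -> overall stride of 4.
trunk = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(8, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
)
# "Fully-connected" head written as a convolution whose kernel covers the
# whole 5x5 feature map that a 32x32 input produces; 11 outputs = ten digits
# plus a "none of the above" background class.
head = nn.Conv2d(16, 11, kernel_size=5)

scores = head(trunk(torch.randn(1, 1, 32, 32)))
print(scores.shape)   # torch.Size([1, 11, 1, 1]) -> a single classification

scores = head(trunk(torch.randn(1, 1, 32, 64)))
print(scores.shape)   # torch.Size([1, 11, 1, 9]) -> one score vector per 4-pixel shift
```

The shared convolutions over the wide image are computed once, which is exactly why this is much cheaper than re-running the network on every 32×32 window.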
+ +1:24:04.970,1:24:07.599 +So what you have to look at is this + +1:24:11.510,1:24:13.510 +Every layer here is a convolution + +1:24:13.610,1:24:15.020 +Okay? + +1:24:15.020,1:24:21.070 +Including the last layer, so it looks like a full connection because every unit in the second layer goes into the output + +1:24:21.070,1:24:24.460 +But in fact, it is a convolution, it just happens to be applied to a single location + +1:24:24.950,1:24:31.300 +So now imagine that this layer at the top here now is bigger, okay? Which is represented here + +1:24:32.840,1:24:34.130 +Okay? + +1:24:34.130,1:24:37.779 +Now the size of the kernel is the size of the image you had here previously + +1:24:37.820,1:24:43.360 +But now it's a convolution that has multiple locations, right? And so what you get is multiple outputs + +1:24:46.430,1:24:55.100 +That's right, that's right. Each of which corresponds to a classification over an input window of size 32 by 32 in the example I showed + +1:24:55.100,1:25:02.710 +And those windows are shifted by 4 pixels. The reason being that the network architecture I showed + +1:25:04.280,1:25:11.739 +here has a convolution with stride one, then pooling with stride two, convolution with stride one, pooling with stride two + +1:25:13.949,1:25:17.178 +And so the overall stride is four, right? + +1:25:18.719,1:25:22.788 +And so to get a new output you need to shift the input window by four + +1:25:24.210,1:25:29.509 +to get one of those because of the two pooling layers with... + +1:25:31.170,1:25:35.480 +Maybe I should be a little more explicit about this. Let me draw a picture, that would be clearer + +1:25:39.929,1:25:43.848 +So you have an input + +1:25:49.110,1:25:53.749 +like this... a convolution, let's say a convolution of size three + +1:25:57.420,1:25:59.420 +Okay? Yeah with stride one + +1:26:01.289,1:26:04.518 +Okay, I'm not gonna draw all of them, then you have + +1:26:05.460,1:26:11.389 +pooling with subsampling of size two, so you pool over 2 and you subsample, the stride is 2, so you shift by two + +1:26:12.389,1:26:14.389 +No overlap + +1:26:18.550,1:26:25.060 +Okay, so here the input is this size --one two, three, four, five, six, seven, eight + +1:26:26.150,1:26:29.049 +because the convolution is of size three you get + +1:26:29.840,1:26:31.840 +an output here of size six and + +1:26:32.030,1:26:39.010 +then when you do pooling with subsampling with stride two, you get three outputs because that divides the output by two, okay? + +1:26:39.880,1:26:41.880 +Let me add another one + +1:26:43.130,1:26:45.130 +Actually two + +1:26:46.790,1:26:48.790 +Okay, so now the output is ten + +1:26:50.030,1:26:51.680 +This guy is eight + +1:26:51.680,1:26:53.680 +This guy is four + +1:26:54.260,1:26:56.409 +I can do convolutions now also + +1:26:57.650,1:26:59.650 +Let's say three + +1:27:01.400,1:27:03.400 +I only get two outputs + +1:27:04.490,1:27:06.490 +Okay? Oops! + +1:27:07.040,1:27:10.820 +Hmm not sure why it doesn't... draw + +1:27:10.820,1:27:13.270 +Doesn't wanna draw anymore, that's interesting + +1:27:17.060,1:27:19.060 +Aha! + +1:27:24.110,1:27:26.380 +It doesn't react to clicks, that's interesting + +1:27:34.460,1:27:39.609 +Okay, not sure what's going on! Oh "xournal" is not responding + +1:27:41.750,1:27:44.320 +All right, I guess it crashed on me + +1:27:46.550,1:27:48.550 +Well, that's annoying + +1:27:53.150,1:27:55.150 +Yeah, definitely crashed + +1:28:02.150,1:28:04.150 +And, of course, it forgot it, so... 
+ +1:28:09.860,1:28:12.760 +Okay, so we have ten, then eight + +1:28:15.230,1:28:20.470 +because of convolution with three, then we have pooling + +1:28:22.520,1:28:24.520 +of size two with + +1:28:26.120,1:28:28.120 +stride two, so we get four + +1:28:30.350,1:28:36.970 +Then we have convolution with three so we get two, okay? And then maybe pooling again + +1:28:38.450,1:28:42.700 +of size two and subsampling two, we get one. Okay, so... + +1:28:44.450,1:28:46.869 +ten input, eight + +1:28:49.370,1:28:53.079 +four, two, and... + +1:28:58.010,1:29:03.339 +then one for the pooling. This is convolution three, you're right + +1:29:06.500,1:29:08.500 +This is two + +1:29:09.140,1:29:11.140 +And those are three + +1:29:12.080,1:29:14.080 +Etcetera. Right. Now, let's assume + +1:29:14.540,1:29:17.860 +I add a few units here + +1:29:18.110,1:29:21.010 +Okay? So that's going to add, let's say + +1:29:21.890,1:29:24.160 +four units here, two units here + +1:29:27.620,1:29:29.620 +Then... + +1:29:41.190,1:29:42.840 +Yeah, this one is + +1:29:42.840,1:29:46.279 +like this and like that so I got four and + +1:29:47.010,1:29:48.960 +I got another one here + +1:29:48.960,1:29:52.460 +Okay? So now I have only one output and by adding four + +1:29:53.640,1:29:55.640 +four inputs here + +1:29:55.830,1:29:58.249 +which is not 14. I got two outputs + +1:29:59.790,1:30:02.090 +Why four? Because I have 2 + +1:30:02.970,1:30:04.830 +stride of 2 + +1:30:04.830,1:30:10.939 +Okay? So the overall subsampling ratio from input to output is 4, it's 2 times 2 + +1:30:13.140,1:30:17.540 +Now this is 12, and this is 6, and this is 4 + +1:30:20.010,1:30:22.010 +So that's a... + +1:30:22.620,1:30:24.620 +demonstration of the fact that + +1:30:24.900,1:30:26.900 +you can increase the size of the input + +1:30:26.900,1:30:32.330 +it will increase the size of every layer, and if you have a layer that has size 1 and it's a convolutional layer + +1:30:32.330,1:30:34.330 +its size is going to be increased + +1:30:42.870,1:30:44.870 +Yes + +1:30:47.250,1:30:52.760 +Change the size of a layer, like, vertically, horizontally? Yeah, so there's gonna be... + +1:30:54.390,1:30:57.950 +So first you have to train for it, if you want the system to have so invariance to size + +1:30:58.230,1:31:03.860 +you have to train it with characters of various sizes. You can do this with data augmentation if your characters are normalized + +1:31:04.740,1:31:06.740 +That's the first thing. Second thing is... + +1:31:08.850,1:31:16.579 +empirically simple convolutional nets are only invariant to size within a factor of... rather small factor, like you can increase the size by + +1:31:17.610,1:31:23.599 +maybe 40 percent or something. I mean change the size about 40 percent plus/minus 20 percent, something like that, right? + +1:31:26.250,1:31:28.250 +Beyond that... + +1:31:28.770,1:31:33.830 +you might have more trouble getting invariance, but people have trained with input... + +1:31:33.980,1:31:38.390 +I mean objects of sizes that vary by a lot. 
So the way to handle this is + +1:31:39.750,1:31:46.430 +if you want to handle variable size, is that if you have an image and you don't know what size the objects are + +1:31:46.950,1:31:50.539 +that are in this image, you apply your convolutional net to that image and + +1:31:51.180,1:31:53.979 +then you take the same image, reduce it by a factor of two + +1:31:54.440,1:31:58.179 +just scale the image by a factor of two, run the same convolutional net on that new image and + +1:31:59.119,1:32:02.949 +then reduce it by a factor of two again, and run the same convolutional net again on that image + +1:32:03.800,1:32:08.110 +Okay? So the first convolutional net will be able to detect small objects within the image + +1:32:08.630,1:32:11.859 +So let's say your network has been trained to detect objects of size... + +1:32:11.860,1:32:16.179 +I don't know, 20 pixels, like faces for example, right? They are 20 pixels + +1:32:16.789,1:32:20.739 +It will detect faces that are roughly 20 pixels within this image and + +1:32:21.320,1:32:24.309 +then when you subsample by a factor of 2 and you apply the same network + +1:32:24.309,1:32:31.209 +it will detect faces that are 20 pixels within the new image, which means there were 40 pixels in the original image + +1:32:32.179,1:32:37.899 +Okay? Which the first network will not see because the face would be bigger than its input window + +1:32:39.170,1:32:41.529 +And then the next network over will detect + +1:32:42.139,1:32:44.409 +faces that are 80 pixels, etc., right? + +1:32:44.659,1:32:49.089 +So then by kind of combining the scores from all of those, and doing something called non-maximum suppression + +1:32:49.090,1:32:51.090 +we can actually do detection and + +1:32:51.230,1:32:57.939 +localization of objects. People use considerably more sophisticated techniques for detection now, and for localization that we'll talk about next week + +1:32:58.429,1:33:00.429 +But that's the basic idea + +1:33:00.920,1:33:02.920 +So let me conclude + +1:33:03.019,1:33:09.429 +What are convnets good for? They're good for signals that come to you in the form of a multi-dimensional array + +1:33:10.190,1:33:12.190 +But that multi-dimensional array has + +1:33:13.190,1:33:17.500 +to have two characteristics at least. The first one is + +1:33:18.469,1:33:23.828 +there is strong local correlations between values. So if you take an image + +1:33:24.949,1:33:32.949 +random image, take two pixels within this image, two pixels that are nearby. Those two pixels are very likely to have very similar colors + +1:33:33.530,1:33:38.199 +Take a picture of this class, for example, two pixels on the wall basically have the same color + +1:33:39.469,1:33:42.069 +Okay? It looks like there is a ton of objects here, but + +1:33:43.280,1:33:49.509 +--animate objects-- but in fact mostly, statistically, neighboring pixels are essentially the same color + +1:33:52.699,1:34:00.129 +As you move the distance from two pixels away and you compute the statistics of how similar pixels are as a function of distance + +1:34:00.650,1:34:02.650 +they're less and less similar + +1:34:03.079,1:34:05.079 +So what does that mean? 
Because + +1:34:06.350,1:34:09.430 +nearby pixels are likely to have similar colors + +1:34:09.560,1:34:14.499 +that means that when you take a patch of pixels, say five by five, or eight by eight or something + +1:34:16.040,1:34:18.040 +The type of patch you're going to observe + +1:34:18.920,1:34:21.159 +is very likely to be kind of a smoothly varying + +1:34:21.830,1:34:23.830 +color or maybe with an edge + +1:34:24.770,1:34:32.080 +But among all the possible combinations of 25 pixels, the ones that you actually observe in natural images is a tiny subset + +1:34:34.130,1:34:38.380 +What that means is that it's advantageous to represent the content of that patch + +1:34:39.440,1:34:46.509 +by a vector with perhaps less than 25 values that represent the content of that patch. Is there an edge, is it uniform? + +1:34:46.690,1:34:48.520 +What color is it? You know things like that, right? + +1:34:48.520,1:34:52.660 +And that's basically what the convolutions in the first layer of a convolutional net are doing + +1:34:53.900,1:34:58.809 +Okay. So if you have local correlations, there is an advantage in detecting local features + +1:34:59.090,1:35:01.659 +That's what we observe in the brain. That's what convolutional nets are doing + +1:35:03.140,1:35:08.140 +This idea of locality. If you feed a convolutional net with permuted pixels + +1:35:09.020,1:35:15.070 +it's not going to be able to do a good job at recognizing your images, even if the permutation is fixed + +1:35:17.030,1:35:19.960 +Right? A fully connected net doesn't care + +1:35:21.410,1:35:23.410 +about permutations + +1:35:25.700,1:35:28.240 +Then the second characteristics is that + +1:35:30.050,1:35:34.869 +features that are important may appear anywhere on the image. So that's what justifies shared weights + +1:35:35.630,1:35:38.499 +Okay? The local correlation justifies local connections + +1:35:39.560,1:35:46.570 +The fact that features can appear anywhere, that the statistics of images or the signal is uniform + +1:35:47.810,1:35:52.030 +means that you need to have repeated feature detectors for every location + +1:35:52.850,1:35:54.850 +And that's where shared weights + +1:35:55.880,1:35:57.880 +come into play + +1:36:01.990,1:36:06.059 +It does justify the pooling because the pooling is if you want invariance to + +1:36:06.760,1:36:11.400 +variations in the location of those characteristic features. And so if the objects you're trying to recognize + +1:36:12.340,1:36:16.619 +don't change their nature by kind of being slightly distorted then you want pooling + +1:36:21.160,1:36:24.360 +So people have used convnets for cancer stuff, image video + +1:36:25.660,1:36:31.019 +text, speech. So speech actually is pretty... speech recognition convnets are used a lot + +1:36:32.260,1:36:34.380 +Time series prediction, you know things like that + +1:36:36.220,1:36:42.030 +And you know biomedical image analysis, so if you want to analyze an MRI, for example + +1:36:42.030,1:36:44.030 +MRI or CT scan is a 3d image + +1:36:44.950,1:36:49.170 +As humans we can't because we don't have a good visualization technology. 
We can't really + +1:36:49.960,1:36:54.960 +apprehend or understand a 3d volume, a 3-dimensional image + +1:36:55.090,1:36:58.709 +But a convnet is fine, feed it a 3d image and it will deal with it + +1:36:59.530,1:37:02.729 +That's a big advantage because you don't have to go through slices to kind of figure out + +1:37:04.000,1:37:06.030 +the object in the image + +1:37:10.390,1:37:15.300 +And then the last thing here at the bottom, I don't know if you guys know where hyperspectral images are + +1:37:15.300,1:37:19.139 +So hyperspectral image is an image where... most natural color images + +1:37:19.140,1:37:22.619 +I mean images that you collect with a normal camera you get three color components + +1:37:23.470,1:37:25.390 +RGB + +1:37:25.390,1:37:28.019 +But we can build cameras with way more + +1:37:28.660,1:37:30.660 +spectral bands than this and + +1:37:31.510,1:37:34.709 +that's particularly the case for satellite imaging where some + +1:37:36.160,1:37:40.920 +cameras have many spectral bands going from infrared to ultraviolet and + +1:37:41.890,1:37:44.610 +that gives you a lot of information about what you see in each pixel + +1:37:45.760,1:37:47.040 +Some tiny animals + +1:37:47.040,1:37:54.930 +that have small brains find it easier to process hyperspectral images of low resolution than high resolution images with just three colors + +1:37:55.750,1:38:00.450 +For example, there's a particular type of shrimp, right? They have those beautiful + +1:38:01.630,1:38:07.499 +eyes and they have like 17 spectral bands or something, but super low resolution and they have a tiny brain to process it + +1:38:09.770,1:38:12.850 +Okay, that's all for today. See you! diff --git a/docs/pt/week03/practicum03.sbv b/docs/pt/week03/practicum03.sbv new file mode 100644 index 000000000..79126d43e --- /dev/null +++ b/docs/pt/week03/practicum03.sbv @@ -0,0 +1,1751 @@ +0:00:00.020,0:00:07.840 +So convolutional neural networks, I guess today I so foundations me, you know, I post nice things on Twitter + +0:00:09.060,0:00:11.060 +Follow me. I'm just kidding + +0:00:11.290,0:00:16.649 +Alright. So again anytime you have no idea what's going on. Just stop me ask questions + +0:00:16.900,0:00:23.070 +Let's make these lessons interactive such that I can try to please you and provide the necessary information + +0:00:23.980,0:00:25.980 +For you to understand what's going on? + +0:00:26.349,0:00:27.970 +alright, so + +0:00:27.970,0:00:31.379 +Convolutional neural networks. How cool is this stuff? 
Very cool + +0:00:32.439,0:00:38.699 +mostly because before having convolutional nets we couldn't do much and we're gonna figure out why now + +0:00:39.850,0:00:43.800 +how why why and how these networks are so powerful and + +0:00:44.379,0:00:48.329 +They are going to be basically making they are making like a very large + +0:00:48.879,0:00:52.859 +Chunk of like the whole networks are used these days + +0:00:53.980,0:00:55.300 +so + +0:00:55.300,0:01:02.369 +More specifically we are gonna get used to repeat several times those three words, which are the key words for understanding + +0:01:02.920,0:01:05.610 +Convolutions, but we are going to be figuring out that soon + +0:01:06.159,0:01:09.059 +so let's get started and figuring out how + +0:01:09.580,0:01:11.470 +these + +0:01:11.470,0:01:13.470 +signals these images and these + +0:01:13.990,0:01:17.729 +different items look like so whenever we talk about + +0:01:18.670,0:01:21.000 +signals we can think about them as + +0:01:21.580,0:01:23.200 +vectors for example + +0:01:23.200,0:01:30.600 +We have there a signal which is representing a monophonic audio signal so given that is only + +0:01:31.180,0:01:38.339 +We have only the temporal dimension going in like the signal happens over one dimension, which is the temporal dimension + +0:01:38.560,0:01:46.079 +This is called 1d signal and can be represented by a singular vector as is shown up up there + +0:01:46.750,0:01:48.619 +each + +0:01:48.619,0:01:52.389 +Value of that vector represents the amplitude of the wave form + +0:01:53.479,0:01:56.589 +for example, if you have just a sign you're going to be just hearing like + +0:01:57.830,0:01:59.830 +Like some sound like that + +0:02:00.560,0:02:05.860 +If you have like different kind of you know, it's not just a sign a sign you're gonna hear + +0:02:06.500,0:02:08.500 +different kind of Timbers or + +0:02:09.200,0:02:11.200 +different kind of + +0:02:11.360,0:02:13.190 +different kind of + +0:02:13.190,0:02:15.190 +flavor of the sound + +0:02:15.440,0:02:18.190 +Moreover you're familiar. How sound works, right? So + +0:02:18.709,0:02:21.518 +Right now I'm just throwing air through my windpipe + +0:02:22.010,0:02:26.830 +where there are like some membranes which is making the air vibrate these the + +0:02:26.930,0:02:33.640 +Vibration propagates through the air there are going to be hitting your ears and the ear canal you have inside some little + +0:02:35.060,0:02:38.410 +you have likely cochlea right and then given about + +0:02:38.989,0:02:45.159 +How much the sound propagates through the cochlea you're going to be detecting the pitch and then by adding different pitch + +0:02:45.830,0:02:49.119 +information you can and also like different kind of + +0:02:50.090,0:02:53.350 +yeah, I guess speech information you're going figure out what is the + +0:02:53.930,0:02:59.170 +Sound I was making over here and then you reconstruct that using your language model you have in your brain + +0:02:59.170,0:03:03.369 +Right and the same thing Yun was mentioning if you start speaking another language + +0:03:04.310,0:03:11.410 +then you won't be able to parse the information because you're using both a speech model like a conversion between + +0:03:12.019,0:03:17.709 +Vibrations and like, you know signal your brain plus the language model in order to make sense + +0:03:18.709,0:03:22.629 +Anyhow, that was a 1d signal. Let's say I'm listening to music so + +0:03:23.570,0:03:25.570 +What kind of signal do I? 
+ +0:03:25.910,0:03:27.910 +have there + +0:03:28.280,0:03:34.449 +So if I listen to music user is going to be a stare of stereophonic, right? So it means you're gonna have how many channels? + +0:03:35.420,0:03:37.420 +Two channels, right? + +0:03:37.519,0:03:38.570 +nevertheless + +0:03:38.570,0:03:41.019 +What type of signal is gonna be this one? + +0:03:41.150,0:03:46.420 +It's still gonna be one this signal although there are two channels so you can think about you know + +0:03:46.640,0:03:54.459 +regardless of how many chanted channels like if you had Dolby Surround you're gonna have what 5.1 so six I guess so, that's the + +0:03:55.050,0:03:56.410 +You know + +0:03:56.410,0:03:58.390 +vectorial the + +0:03:58.390,0:04:02.790 +size of the signal and then the time is the only variable which is + +0:04:03.820,0:04:07.170 +Like moving forever. Okay. So those are 1d signals + +0:04:09.430,0:04:13.109 +All right, so let's have a look let's zoom in a little bit so + +0:04:14.050,0:04:18.420 +We have it. For example on the left hand side. We have something that looks like a sinusoidal + +0:04:19.210,0:04:25.619 +function here nevertheless a little bit after you're gonna have again the same type of + +0:04:27.280,0:04:29.640 +Function appearing again, so this is called + +0:04:30.460,0:04:37.139 +Stationarity you're gonna see over and over and over again the same type of pattern across the temporal + +0:04:37.810,0:04:39.810 +Dimension, okay + +0:04:40.090,0:04:47.369 +So the first property of this signal which is our natural signal because it happens in nature is gonna be we said + +0:04:49.330,0:04:51.330 +Stationarity, okay. That's the first one + +0:04:51.580,0:04:53.580 +Moreover what do you think? + +0:04:54.130,0:04:56.130 +How likely is? + +0:04:56.140,0:05:00.989 +If I have a peak on the left hand side to have a peak also very nearby + +0:05:03.430,0:05:09.510 +So how likely is to have a peak there rather than having a peak there given that you had a peak before or + +0:05:09.610,0:05:11.590 +if I keep going + +0:05:11.590,0:05:18.119 +How likely is you have a peak, you know few seconds later given that you have a peak on the left hand side. So + +0:05:19.960,0:05:24.329 +There should be like some kind of common sense common knowledge perhaps that + +0:05:24.910,0:05:27.390 +If you are close together and if you are + +0:05:28.000,0:05:33.360 +Close to the left hand side is there's gonna be a larger probability that things are gonna be looking + +0:05:33.880,0:05:40.589 +Similar, for example you have like a specific sound will have a very kind of specific shape + +0:05:41.170,0:05:43.770 +But then if you go a little bit further away from that sound + +0:05:44.050,0:05:50.010 +then there's no relation anymore about what happened here given what happened before and so if you + +0:05:50.410,0:05:55.170 +Compute the cross correlation between a signal and itself, do you know what's a cross correlation? 
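For reference, a minimal sketch of the cross-correlation of a signal with itself that is being discussed here; the synthetic "audio" signal and the use of `numpy.correlate` are assumptions, any recorded clip would do. The largest value sits at zero lag and the values fall off as the lag grows, which is the locality property of the signal.

```python
import numpy as np

# A stand-in for an audio clip: a short, windowed noise burst.
rng = np.random.default_rng(0)
x = rng.standard_normal(1000) * np.hanning(1000)

# Cross-correlation of the signal with itself (its autocorrelation).
acf = np.correlate(x, x, mode="full")          # lags from -(N-1) to +(N-1)
lags = np.arange(-len(x) + 1, len(x))

print(lags[np.argmax(acf)])                    # 0 -> the spike is at zero misalignment
print(acf[len(x) - 1], acf[len(x) - 1 + 200])  # the value decays as the lag grows
```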
+ +0:05:57.070,0:06:02.670 +Do know like if you don't know okay how many hands up who doesn't know a cross correlation + +0:06:04.360,0:06:07.680 +Okay fine, so that's gonna be homework for you + +0:06:07.680,0:06:14.489 +If you take one signal just a signal audio signal they perform convolution of that signal with itself + +0:06:14.650,0:06:15.330 +Okay + +0:06:15.330,0:06:19.680 +and so convolution is going to be you have your own signal you take the thing you flip it and then you + +0:06:20.170,0:06:22.170 +pass it across and then you multiply + +0:06:22.390,0:06:25.019 +Whenever you're gonna have them overlaid in the same + +0:06:25.780,0:06:27.780 +Like when there is zero + +0:06:28.450,0:06:33.749 +Misalignment you're gonna have like a spike. And then as you start moving around you're gonna have basically two decaying + +0:06:34.360,0:06:36.930 +sides that represents the fact that + +0:06:37.990,0:06:44.850 +Things have much things in common basically performing a dot product right? So things that have much in common when they are + +0:06:45.370,0:06:47.970 +Very close to one specific location + +0:06:47.970,0:06:55.919 +If you go further away things start, you know averaging out. So here the second property of this natural signal is locality + +0:06:56.500,0:07:04.470 +Information is contained in specific portion and parts of the in this case temporal domain. Okay. So before we had + +0:07:06.940,0:07:08.940 +Stationarity now we have + +0:07:09.640,0:07:11.640 +Locality alright don't + +0:07:12.160,0:07:17.999 +Bless you. All, right. So how about this one right? This is completely unrelated to what happened over there + +0:07:20.110,0:07:24.960 +Okay, so let's look at the nice little kitten what kind of + +0:07:25.780,0:07:27.070 +dimensions + +0:07:27.070,0:07:31.200 +What kind of yeah what dimension has this signal? What was your guess? + +0:07:32.770,0:07:34.829 +It's a 2 dimensional signal why is that + +0:07:39.690,0:07:45.469 +Okay, we have also a three-dimensional signal option here so someone said two dimensions someone said three dimensions + +0:07:47.310,0:07:51.739 +It's two-dimensional why is that sorry noise? Why is two-dimensional + +0:07:54.030,0:07:56.030 +Because the information is + +0:07:58.050,0:08:00.050 +Sorry the information is + +0:08:00.419,0:08:01.740 +especially + +0:08:01.740,0:08:03.740 +Depicted right? So the information + +0:08:03.750,0:08:05.310 +is + +0:08:05.310,0:08:08.450 +Basically encoded in the spatial location of those points + +0:08:08.760,0:08:15.439 +Although each point is a vector for example of three or if it's a hyper spectral image. It can be several planes + +0:08:16.139,0:08:23.029 +Nevertheless you still you still have two directions in which points can move right? The thickness doesn't change + +0:08:24.000,0:08:27.139 +across like in the thicknesses of a given space + +0:08:27.139,0:08:33.408 +Right so given thickness and it doesn't change right so you can have as many, you know planes as you want + +0:08:33.409,0:08:35.409 +but the information is basically + +0:08:35.640,0:08:41.779 +It's a spatial information is spread across the plane. 
So these are two dimensional data you can also + +0:08:50.290,0:08:53.940 +Okay, I see your point so like a wide image or a + +0:08:54.910,0:08:56.350 +grayscale image + +0:08:56.350,0:08:58.350 +It's definitely a 2d + +0:08:58.870,0:09:04.169 +Signal and also it can be represented by using a tensor of two dimensions + +0:09:04.870,0:09:07.739 +A color image has RGB planes + +0:09:08.350,0:09:14.550 +but the thickness is always three doesn't change and the information is still spread across the + +0:09:15.579,0:09:21.839 +Other two dimensions so you can change the size of a color image, but you won't change the thickness of a color image, right? + +0:09:22.870,0:09:28.319 +So we are talking about here. The dimension of the signal is how is the information? + +0:09:29.470,0:09:31.680 +Basically spread around right in the temporal information + +0:09:31.959,0:09:38.789 +If you have Dolby Surround mono mono signal or you have a stereo we still have over time, right? + +0:09:38.790,0:09:41.670 +So it's one dimensional images are 2d + +0:09:42.250,0:09:44.759 +so let's have a look to the little nice kitten and + +0:09:45.519,0:09:47.909 +Let's focus on the on the nose, right? Oh + +0:09:48.579,0:09:50.579 +My god, this is a monster. No + +0:09:50.949,0:09:52.949 +Okay. Nice big + +0:09:53.649,0:09:55.948 +Creature here, right? Okay, so + +0:09:56.740,0:10:03.690 +We observe there and there is some kind of dark region nearby the eye you can observe that kind of seeing a pattern + +0:10:04.329,0:10:09.809 +Appear over there, right? So what is this property of natural signals? I + +0:10:12.699,0:10:18.239 +Told you two properties, this is stationarity. Why is this stationarity? + +0:10:22.029,0:10:29.129 +Right, so the same pattern appears over and over again across the dimensionality in this case the dimension is two dimension. Sorry + +0:10:30.220,0:10:36.600 +Moreover, what is the likelihood that given that the color in the pupil is black? What is the likelihood that? + +0:10:37.149,0:10:42.448 +The pixel on the arrow or like on the tip of the arrow is also black + +0:10:42.449,0:10:47.879 +I would say it's quite likely right because it's very close. How about that point? + +0:10:48.069,0:10:51.899 +Yeah, kind of less likely right if I keep clicking + +0:10:52.480,0:10:59.649 +You know, it's completely it's bright. No, no the other pics in right so is further you go in spacial dimension + +0:11:00.290,0:11:06.879 +The less less likely you're gonna have, you know similar information. And so this is called + +0:11:08.629,0:11:10.629 +Locality which means + +0:11:12.679,0:11:16.269 +There's a higher likelihood for things to have if like + +0:11:16.549,0:11:22.509 +The information is like containers in a specific region as you move around things get much much more + +0:11:24.649,0:11:26.649 +You know independent + +0:11:27.199,0:11:32.529 +Alright, so we have two properties. The third property is gonna be the following. What is this? + +0:11:33.829,0:11:35.829 +Are you hungry? + +0:11:37.579,0:11:41.769 +So you can see here some donuts right no donuts how you called + +0:11:42.649,0:11:44.230 +Bagels, right? All right + +0:11:44.230,0:11:51.009 +So for the you the the one of you which have glasses take your glasses off and now answer my question + +0:11:53.179,0:11:55.179 +Okay + +0:11:59.210,0:12:01.210 +So the third property + +0:12:02.210,0:12:07.059 +It's compositionality right and so compositionality means that the + +0:12:07.880,0:12:10.119 +Word is actually explainable, right? 
+ +0:12:11.060,0:12:13.060 +okay, you enjoy the + +0:12:15.830,0:12:20.199 +The thing okay, you gotta get back to me right? I just try to keep your life + +0:12:26.180,0:12:28.100 +Hello + +0:12:28.100,0:12:33.520 +Okay. So for the one that doesn't have glasses ask the friend who has glasses and try them on. Okay now + +0:12:34.430,0:12:36.430 +Don't do it if it's not good + +0:12:37.010,0:12:43.659 +I'm just kidding. You can squint just queen don't don't don't use other people glasses. Okay? + +0:12:44.990,0:12:46.990 +Question. Yeah + +0:12:50.900,0:12:52.130 +So + +0:12:52.130,0:12:57.489 +Stationerity means you observe the same kind of pattern over and over again your data + +0:12:58.160,0:13:01.090 +Locality means that pattern are just localized + +0:13:01.820,0:13:08.109 +So you have some specific information here some information here information here as you move away from this point + +0:13:08.270,0:13:10.270 +this other value is gonna be + +0:13:10.760,0:13:11.780 +almost + +0:13:11.780,0:13:15.249 +Independent from the value of this point here. So things are correlated + +0:13:15.860,0:13:17.860 +Only within a neighborhood, okay + +0:13:19.910,0:13:27.910 +Okay, everyone has been experimenting now squinting and looking at this nice picture, okay. So this is the third part which is compositionality + +0:13:28.730,0:13:32.289 +Here you can tell how you can actually see something + +0:13:33.080,0:13:35.080 +If you blur it a little bit + +0:13:35.810,0:13:39.250 +because again things are made of small parts and you can actually + +0:13:40.010,0:13:42.429 +You know compose things in this way + +0:13:43.400,0:13:47.829 +anyhow, so these are the three main properties of natural signals, which + +0:13:48.650,0:13:50.650 +allow us to + +0:13:51.260,0:13:55.960 +Can be exploited for making, you know, a design of our architecture, which is more + +0:13:56.600,0:14:00.880 +Actually prone to extract information that has these properties + +0:14:00.880,0:14:05.169 +Okay, so we are just talking now about signals that exhibits those properties + +0:14:07.730,0:14:11.500 +Finally okay. There was the last one which I didn't talk so + +0:14:12.890,0:14:18.159 +We had the last one here. We have an English sentence, right John picked up the apple + +0:14:18.779,0:14:22.818 +whatever and here again, you can represent each word as + +0:14:23.399,0:14:26.988 +One vector, for example each of those items. It can be a + +0:14:27.869,0:14:30.469 +Vector which has a 1 in correspondent + +0:14:31.110,0:14:35.329 +Correspondence to the position of where that word happens to be in a dictionary, okay + +0:14:35.329,0:14:39.709 +so if you have a dictionary of 10,000 words, you can just check whatever is the + +0:14:40.679,0:14:44.899 +The word on this dictionary you just put the page plus the whatever number + +0:14:45.629,0:14:50.599 +Like you just figured that the position of the page in the dictionary. So also language + +0:14:51.899,0:14:56.419 +Has those kind of properties things that are close by have, you know + +0:14:56.420,0:15:01.069 +Some kind of relationship things away are not less unless you know + +0:15:01.470,0:15:05.149 +Correlated and then similar patterns happen over and over again over + +0:15:05.819,0:15:12.558 +Moreover, you can use you know words make sentences to make full essays and to make finally your write-ups for the + +0:15:12.839,0:15:16.008 +Sessions. I'm just kidding. Okay. All right, so + +0:15:17.429,0:15:19.789 +We already seen this one. 
So I'm gonna be going quite fast + +0:15:20.759,0:15:28.279 +there shouldn't be any I think questions because also we have everything written down on the website, right so you can always check the + +0:15:28.860,0:15:30.919 +summaries of the previous lesson on the website + +0:15:32.040,0:15:39.349 +So fully connected layer. So this actually perhaps is a new version of the diagram. This is my X,Y is at the bottom + +0:15:42.089,0:15:49.698 +Low level features. What's the color of the decks? Pink. Okay good. All right, so we have an arrow which represents my + +0:15:51.299,0:15:54.439 +Yeah, fine that's the proper term, but I like to call them + +0:15:55.410,0:16:02.299 +Rotations and then there is some squashing right? squashing means the non-linearity then I have my hidden layer then I have another + +0:16:04.379,0:16:06.379 +Rotation and a final + +0:16:06.779,0:16:12.888 +Squashing. Okay. It's not necessary. Maybe can be a linear, you know final transformation like a linear + +0:16:14.520,0:16:18.059 +Whatever function they're like if you do if you perform a regression task + +0:16:19.750,0:16:21.750 +There you have the equations, right + +0:16:22.060,0:16:24.060 +And those guys can be any of those + +0:16:24.610,0:16:26.260 +nonlinear functions or + +0:16:26.260,0:16:33.239 +Even a linear function right if you perform regression once more and so you can write down these layers where I expand + +0:16:33.240,0:16:39.510 +So this guy here the the bottom guy is actually a vector and I represent the vector G with just one pole there + +0:16:39.510,0:16:42.780 +I just show you all the five items elements of that vector + +0:16:43.030,0:16:45.239 +So you have the X the first layer? + +0:16:45.370,0:16:50.520 +Then you have the first hidden second hidden third hit and the last layer so we have how many layers? + +0:16:53.590,0:16:55.240 +Five okay + +0:16:55.240,0:16:56.950 +And then you can also call them + +0:16:56.950,0:17:03.689 +activation layer 1 layer 2 3 4 whatever and then the matrices are where you store your + +0:17:03.970,0:17:10.380 +Parameters you have those different W's and then in order to get each of those values you already seen the stuff, right? + +0:17:10.380,0:17:17.280 +So I go quite faster you perform just the scalar product. Which means you just do that thing + +0:17:17.860,0:17:23.400 +You get all those weights. I multiply the input for each of those weights and you keep going like that + +0:17:24.490,0:17:28.920 +And then you store those weights in those matrices and so on. So as you can tell + +0:17:30.700,0:17:37.019 +There is a lot of arrows right and regardless of the fact that I spent too many hours doing that drawing + +0:17:38.200,0:17:43.649 +This is also like very computationally expensive because there are so many computations right each arrow + +0:17:44.350,0:17:46.350 +represents a weight which you have to multiply + +0:17:46.960,0:17:49.110 +for like by its own input + +0:17:49.870,0:17:51.870 +so + +0:17:52.090,0:17:53.890 +What can we do now? + +0:17:53.890,0:17:55.150 +so + +0:17:55.150,0:17:57.150 +given that our information is + +0:17:57.700,0:18:04.679 +Has locality. No our data has this locality as a property. What does it mean if I had something here? + +0:18:05.290,0:18:07.290 +Do I care what's happening here? 
+ +0:18:09.460,0:18:12.540 +So some of you are just shaking the hand and the rest of + +0:18:13.000,0:18:17.219 +You are kind of I don't know not responsive and I have to ping you + +0:18:18.140,0:18:18.900 +so + +0:18:18.900,0:18:25.849 +We have locality, right? So things are just in specific regions. You actually care to look about far away + +0:18:27.030,0:18:28.670 +No, okay. Fantastic + +0:18:28.670,0:18:32.119 +So let's simply drop some connections, right? + +0:18:32.130,0:18:38.660 +So here we go from layer L-1 to the layer L by using the first, you know five + +0:18:39.570,0:18:45.950 +Ten and fifteen, right? Plus I have the last one here to from the layer L to L+1 + +0:18:45.950,0:18:48.529 +I have three more right so in total we have + +0:18:50.550,0:18:53.089 +Eighteen weights computations, right + +0:18:53.760,0:18:55.760 +so, how about we + +0:18:56.370,0:19:01.280 +Drop the things that we don't care, right? So like let's say for this neuron, perhaps + +0:19:01.830,0:19:04.850 +Why do we have to care about those guys there on the bottom, right? + +0:19:05.160,0:19:08.389 +So, for example, I can just use those three weights, right? + +0:19:08.390,0:19:12.770 +I just forget about the other two and then again, I just use those three weights + +0:19:12.770,0:19:15.229 +I skip the first and the last and so on + +0:19:16.170,0:19:23.570 +Okay. So right now we have just nine connections now just now nine multiplications and finally three more + +0:19:24.360,0:19:28.010 +so as we go from the left hand side to the right hand side we + +0:19:28.920,0:19:32.149 +Climb the hierarchy and we're gonna have a larger and larger + +0:19:33.960,0:19:34.790 +View right + +0:19:34.790,0:19:40.879 +so although these green bodies here and don't see the whole input is you keep climbing the + +0:19:41.310,0:19:45.109 +Hierarchy you're gonna be able to see the whole span of the input, right? + +0:19:46.590,0:19:48.590 +so in this case, we're going to be + +0:19:49.230,0:19:55.760 +Defining the RF as receptive field. So my receptive field here from the last + +0:19:56.400,0:20:03.769 +Neuron to the intermediate neuron is three. So what is gonna be? This means that the final neuron sees three + +0:20:04.500,0:20:10.820 +Neurons from the previous layer. So what is the receptive field of the hidden layer with respect to the input layer? + +0:20:14.970,0:20:21.199 +The answer was three. Yeah, correct, but what is now their septic field of the output layer with respect to the input layer + +0:20:23.549,0:20:25.549 +Five right. That's fantastic + +0:20:25.679,0:20:30.708 +Okay, sweet. So right now the whole architecture does see the whole input + +0:20:31.229,0:20:33.229 +while each sub part + +0:20:33.239,0:20:39.019 +Like intermediate layers only sees small regions and this is very nice because you will spare + +0:20:39.239,0:20:46.939 +Computations which are unnecessary because on average they have no whatsoever in information. And so we managed to speed up + +0:20:47.669,0:20:50.059 +The computations that you actually can compute + +0:20:51.119,0:20:53.208 +things in a decent amount of time + +0:20:54.809,0:20:58.998 +Clear so we can talk about sparsity only because + +0:21:02.669,0:21:05.238 +We assume that our data shows + +0:21:06.329,0:21:08.249 +locality, right + +0:21:08.249,0:21:12.708 +Question if my data doesn't show locality. Can I use sparsity? + +0:21:16.139,0:21:19.279 +No, okay fantastic, okay. 
All right + +0:21:20.549,0:21:23.898 +more stuff so we also said that this natural signals are + +0:21:24.209,0:21:28.399 +Stationary and so given that they're stationary things appear over and over again + +0:21:28.399,0:21:34.008 +So maybe we don't have to learn again again the same stuff of all over the time right? So + +0:21:34.679,0:21:37.668 +In this case we said oh we drop those two lines, right? + +0:21:38.729,0:21:41.179 +And so how about we use? + +0:21:41.969,0:21:46.999 +The first connection the oblique one from you know going in down + +0:21:47.549,0:21:52.158 +Make it yellow. So all of those are yellows then these are orange + +0:21:52.859,0:21:57.139 +And then the final one are red, right? So how many weights do I have here? + +0:21:59.639,0:22:01.639 +And I had over here + +0:22:03.089,0:22:05.089 +Nine right and before we had + +0:22:06.749,0:22:09.769 +15 right so we drop from 15 to 3 + +0:22:10.529,0:22:14.958 +This is like a huge reduction and how perhaps now it is actually won't work + +0:22:14.969,0:22:16.759 +So we have to fix that in a bit + +0:22:16.759,0:22:22.368 +But anyhow in this way when I train a network, I just had to train three weights the red + +0:22:22.840,0:22:25.980 +sorry, the yellow orange and red and + +0:22:26.889,0:22:30.959 +It's gonna be actually working even better because it just has to learn + +0:22:31.749,0:22:37.079 +You're gonna have more information you have more data for you know training those specific weights + +0:22:41.320,0:22:48.299 +So those are those three colors the yellow orange and red are gonna be called my kernel and so I stored them + +0:22:48.850,0:22:50.850 +Into a vector over here + +0:22:53.200,0:22:58.679 +And so those if you talk about you know convolutional careness those are simply the weight of these + +0:22:59.200,0:22:59.909 +over here + +0:22:59.909,0:23:04.589 +Right the weights that we are using by using sparsity and then using parameter sharing + +0:23:04.869,0:23:09.629 +Parameter sharing means you use the same parameter over over again across the architecture + +0:23:10.330,0:23:15.090 +So there are the following nice properties of using those two combined + +0:23:15.490,0:23:20.699 +So parameter sharing gives us faster convergence because you're gonna have much more information + +0:23:21.399,0:23:23.549 +To use in order to train these weights + +0:23:24.519,0:23:26.139 +You have a better + +0:23:26.139,0:23:32.008 +Generalization because you don't have to learn every time a specific type of thing that happened in different region + +0:23:32.009,0:23:34.079 +You just learn something. That makes sense + +0:23:34.720,0:23:36.720 +You know globally + +0:23:37.570,0:23:44.460 +Then we also have we are not constrained to the input size this is so important ray also Yann said this thing three times yesterday + +0:23:45.700,0:23:48.029 +Why are we not constrained to the input size? + +0:23:54.039,0:24:00.449 +Because we can keep shifting in over right before in these other case if you have more neurons you have to learn new stuff + +0:24:00.450,0:24:06.210 +Right, in this case. 
I can simply add more neurons and I keep using my weight across right that was + +0:24:07.240,0:24:09.809 +Some of the major points Yann, you know + +0:24:10.509,0:24:12.509 +highlighted yesterday + +0:24:12.639,0:24:14.939 +Moreover we have the kernel independence + +0:24:15.999,0:24:18.689 +So for the one of you they are interested in optimization + +0:24:19.659,0:24:21.009 +optimizing like computation + +0:24:21.009,0:24:22.299 +this is so cool because + +0:24:22.299,0:24:29.189 +This kernel and another kernel are completely independent so you can train them you can paralyze is to make things go faster + +0:24:33.580,0:24:38.549 +So finally we have also some connection sparsity property and so here we have a + +0:24:39.070,0:24:41.700 +Reduced amount of computation, which is also very good + +0:24:42.009,0:24:48.659 +So all these properties allowed us to be able to train this network on a lot of data + +0:24:48.659,0:24:55.739 +you still require a lot of data, but without having sparsity locality, so without having sparsity and + +0:24:56.409,0:25:01.859 +Parameter sharing you wouldn't be able to actually finish training this network in a reasonable amount of time + +0:25:03.639,0:25:11.039 +So, let's see, for example now how this works when you have like audio signal which is how many dimensional signal + +0:25:12.279,0:25:17.849 +1 dimensional signal, right? Okay. So for example kernels for 1d data + +0:25:18.490,0:25:24.119 +On the right hand side. You can see again. My my neurons can I'll be using my + +0:25:24.909,0:25:30.359 +Different the first scanner here. And so I'm gonna be storing my kernel there in that vector + +0:25:31.330,0:25:36.059 +For example, I can have a second kernel right. So right now we have two kernels the + +0:25:36.700,0:25:39.749 +Blue purple and pink and the yellow, orange and red + +0:25:41.559,0:25:44.158 +So let's say my output is r2 + +0:25:44.799,0:25:46.829 +So that means that each of those + +0:25:47.980,0:25:50.909 +Bubbles here. Each of those neurons are actually + +0:25:51.639,0:25:57.359 +One and two rightly come out from the from the board, right? So it's each of those are having a thickness of two + +0:25:58.929,0:26:02.819 +And let's say the other guy here are having a thickness of seven, right + +0:26:02.990,0:26:07.010 +They are coming outside from the screen and they are you know, seven euros in this way + +0:26:08.070,0:26:13.640 +so in this case, my kernel are going to be of size 2 * 7 * 3 + +0:26:13.860,0:26:17.719 +So 2 means I have two kernels which are going from 7 + +0:26:18.240,0:26:20.070 +to give me + +0:26:20.070,0:26:22.070 +3 + +0:26:22.950,0:26:24.950 +Outputs + +0:26:28.470,0:26:32.959 +Hold on my bad. So the 2 means you have ℝ² right here + +0:26:33.659,0:26:37.069 +Because you have two corners. So the first kernel will give you the first + +0:26:37.679,0:26:41.298 +The first column here and the second kernel is gonna give you the second column + +0:26:42.179,0:26:44.869 +Then it has to init 7 + +0:26:45.210,0:26:50.630 +Because it needs to match all the thickness of the previous layer and then it has 3 because there are three + +0:26:50.789,0:26:56.778 +Connections right? So maybe I miss I got confused before does it make sense the sizing? 
+ +0:26:58.049,0:26:59.820 +so given that our + +0:26:59.820,0:27:03.710 +273 2 means you had 2 kernels and therefore you have two + +0:27:04.080,0:27:08.000 +Items here like one a one coming out for each of those columns + +0:27:08.640,0:27:15.919 +It has seven because each of these have a thickness of 7 and finally 3 means there are 3 connection connecting to the previous layer + +0:27:17.429,0:27:22.819 +Right so 1d data uses 3d kernels ok + +0:27:23.460,0:27:30.049 +so if I call this my collection of kernel, right, so if those are gonna be stored in a tensor + +0:27:30.049,0:27:32.898 +This tensor will be a three dimensional tensor + +0:27:33.690,0:27:34.919 +so + +0:27:34.919,0:27:37.939 +Question for you, if I'm gonna be playing now with images + +0:27:38.580,0:27:40.580 +What is the size of? + +0:27:40.679,0:27:43.999 +You know full pack of kernels for an image + +0:27:45.809,0:27:47.809 +Convolutional net + +0:27:49.590,0:27:56.209 +Four right. So we're gonna have the number of kernels then it's going to be the number of the thickness + +0:27:56.730,0:28:00.589 +And then you're gonna have connections in height and connection in width + +0:28:01.799,0:28:03.179 +Okay + +0:28:03.179,0:28:09.798 +So if you're gonna be checking the currently convolutional kernels later on in your notebook, actually you should check that + +0:28:09.929,0:28:12.138 +You should find the same kind of dimensions + +0:28:14.159,0:28:16.159 +All right, so + +0:28:18.059,0:28:20.478 +Questions so far, is this so clear?. Yeah + +0:28:50.460,0:28:52.460 +Okay, so good question so + +0:28:52.469,0:28:56.149 +trade-off about, you know sizing of those convolutions + +0:28:56.700,0:28:59.119 +convolutional kernels, right is it correct? Right + +0:28:59.909,0:29:06.409 +Three by three he seems to be like the minimum you can go for if you actually care about spatial information + +0:29:07.499,0:29:13.098 +As Yann pointed out you can also use one by one convolution. Oh, sorry one come one + +0:29:13.769,0:29:15.149 +like a + +0:29:15.149,0:29:20.718 +Convolution with which has only one weight or if you use like in images you have a one by one convolution + +0:29:21.179,0:29:23.179 +Those are used in order to be + +0:29:23.309,0:29:24.570 +having like a + +0:29:24.570,0:29:26.570 +final layer, which is still + +0:29:26.909,0:29:30.528 +Spatial still can be applied to a larger input image + +0:29:31.649,0:29:36.138 +Right now we just use kernels that are three or maybe five + +0:29:36.929,0:29:42.348 +it's kind of empirical so it's not like we don't have like a magic formulas, but + +0:29:43.349,0:29:44.279 +we've been + +0:29:44.279,0:29:50.329 +trying hard in the past ten years to figure out what is you know the best set of hyper parameters and if you check + +0:29:50.969,0:29:55.879 +For each field like for a speech processing visual processing like image processing + +0:29:55.879,0:29:59.718 +You're gonna figure out what is the right compromise for your specific data? + +0:30:01.769,0:30:03.769 +Yeah + +0:30:04.910,0:30:06.910 +Second + +0:30:07.970,0:30:12.279 +Okay, that's a good question why odd numbers why the kernel has an odd number + +0:30:14.390,0:30:16.220 +Of elements + +0:30:16.220,0:30:20.049 +So if you actually have a odd number of elements there would be a central element + +0:30:20.240,0:30:25.270 +Right. 
If you have a even number of elements there, we'll know there won't be a central value + +0:30:25.370,0:30:27.880 +So if you have again odd number + +0:30:27.880,0:30:30.790 +You know that from a specific point you're gonna be considering + +0:30:31.220,0:30:36.789 +Even number of left and even number of right items if it's a even size + +0:30:37.070,0:30:42.399 +Kernel that you actually don't know where the center is and the center is gonna be the average of two + +0:30:43.040,0:30:48.310 +Neighboring samples which actually creates like a low-pass filter effect. So even + +0:30:49.220,0:30:51.910 +kernel sizes are not usually + +0:30:52.580,0:30:56.080 +preferred or not usually used because they imply some kind of + +0:30:57.290,0:30:59.889 +additional lowering of the quality of the data + +0:31:02.000,0:31:08.380 +Okay, so one more thing that we mentioned also yesterday its padding padding is something + +0:31:09.590,0:31:16.629 +that if it has an effect on the final results is getting it worse, but it's very convenient for + +0:31:17.570,0:31:25.450 +programming side so if we've had our so as you can see here when we apply convolution from this layer you're gonna end up with + +0:31:27.680,0:31:31.359 +Okay, how many how many neurons we have here + +0:31:32.720,0:31:34.720 +three and we started from + +0:31:35.480,0:31:39.400 +five, so if we use a convolutional kernel of three + +0:31:40.490,0:31:42.490 +We lose how many neurons? + +0:31:43.310,0:31:50.469 +Two, okay, one per side. If you're gonna be using a convolutional kernel of size five how much you're gonna be losing + +0:31:52.190,0:31:57.639 +Four right and so that's the rule user zero padding you have to add an extra + +0:31:58.160,0:32:02.723 +Neuron here an extra neuron here. So you're gonna do number size of the kernel, right? + +0:32:02.723,0:32:05.800 +Three minus one divided by two and then you add that extra + +0:32:06.560,0:32:12.850 +Whatever number of neurons here, you've set them to zero. Why to zero? because usually you zero mean + +0:32:13.470,0:32:18.720 +Your inputs or your zero each layer output by using some normalization layers + +0:32:19.900,0:32:21.820 +in this case + +0:32:21.820,0:32:25.770 +Yeah, three comes from the size of the kernel and then you have that + +0:32:26.740,0:32:28.630 +Some animation should be playing + +0:32:28.630,0:32:31.289 +Yeah, you have one extra neuron there there then + +0:32:31.289,0:32:37.289 +I have an extra neuron there such that finally you end up with these, you know ghosts neurons there + +0:32:37.330,0:32:41.309 +But now you have the same number of input and the same number of output + +0:32:41.740,0:32:47.280 +And this is so convenient because if we started with I don't know 64 neurons you apply a convolution + +0:32:47.280,0:32:54.179 +You still have 64 neurons and therefore you can use let's say max pooling of two you're going to end up at 32 neurons + +0:32:54.179,0:32:57.809 +Otherwise you gonna have this I don't know if you consider one + +0:32:58.539,0:33:01.019 +We have a odd number right so you don't know what to do + +0:33:04.030,0:33:06.030 +after a bit, right? + +0:33:08.320,0:33:10.320 +Okay, so + +0:33:10.720,0:33:12.720 +Yeah, and you have the same size + +0:33:13.539,0:33:20.158 +All right. So, let's see how much time you have left. You have a bit of time. So, let's see how we use this + +0:33:21.130,0:33:27.270 +Convolutional net work in practice. 
So this is like the theory behind and we have said that we can use convolutions + +0:33:28.000,0:33:33.839 +So this is a convolutional operator. I didn't even define. What's a convolution. We just said that if our data has + +0:33:37.090,0:33:39.929 +Stationarity locality and is actually + +0:33:42.130,0:33:45.689 +Compositional then we can exploit this by using + +0:33:49.240,0:33:51.240 +Weight sharing + +0:33:51.940,0:33:56.730 +Sparsity and then you know by stacking several of this layer. You have a like a hierarchy, right? + +0:33:58.510,0:34:06.059 +So by using this kind of operation this is a convolution I didn't even define it I don't care right now maybe next class + +0:34:07.570,0:34:11.999 +So this is like the theory behind now, we're gonna see a little bit of practical + +0:34:12.429,0:34:15.628 +You know suggestions how we actually use this stuff in practice + +0:34:16.119,0:34:22.229 +So next thing we have like a standard a spatial convolutional net which is operating which kind of data + +0:34:22.840,0:34:24.840 +If it's spatial + +0:34:25.780,0:34:28.229 +It's special because it's my network right special + +0:34:29.260,0:34:32.099 +Not just kidding so special as you know space + +0:34:33.190,0:34:37.139 +So in this case, we have multiple layers, of course we stuck them + +0:34:37.300,0:34:42.419 +We also talked about why it's better to have several layers rather than having a fat layer + +0:34:43.300,0:34:48.149 +We have convolutions. Of course, we have nonlinearities because otherwise + +0:34:55.270,0:34:56.560 +So + +0:34:56.560,0:35:04.439 +ok, next time we're gonna see how a convolution can be implemented with matrices but convolutions are just linear operator with which a lot of + +0:35:04.440,0:35:07.470 +zeros and like replication of the same by the weights + +0:35:07.570,0:35:13.019 +but otherwise if you don't use non-linearity a convolution of a convolution + +0:35:13.020,0:35:16.679 +It's gonna be a convolution. So we have to clean up stuff + +0:35:17.680,0:35:19.510 +that + +0:35:19.510,0:35:25.469 +We have to like put barriers right? in order to avoid collapse of the whole network. We had some pooling operator + +0:35:26.140,0:35:27.280 +which + +0:35:27.280,0:35:33.989 +Geoffrey says that's you know, something already bad. But you know, you're still doing that Hinton right Geoffrey Hinton + +0:35:35.410,0:35:40.950 +Then we've had something that if you don't use it, your network is not gonna be training. So just use it + +0:35:41.560,0:35:44.339 +although we don't know exactly why it works but + +0:35:45.099,0:35:48.659 +I think there is a question on Piazza. I will put a link there + +0:35:49.330,0:35:53.519 +About this batch normalization. Also Yann is going to be covering all the normalization layers + +0:35:54.910,0:36:01.889 +Finally we have something that also is quite recent which is called a receival or bypass connections + +0:36:01.990,0:36:03.990 +Which are basically these? 
+ +0:36:04.240,0:36:05.859 +extra + +0:36:05.859,0:36:07.089 +connections + +0:36:07.089,0:36:09.089 +Which allow me to + +0:36:09.250,0:36:10.320 +Get the network + +0:36:10.320,0:36:13.320 +You know the network decided whether whether to send information + +0:36:13.780,0:36:18.780 +Through this line or actually send it forward if you stack so many many layers one after each other + +0:36:18.910,0:36:24.330 +The signal get lost a little bit after sometime if you add these additional connections + +0:36:24.330,0:36:27.089 +You always have like a path in order to go back + +0:36:27.710,0:36:31.189 +The bottom to the top and also to have gradients coming down from the top to the bottom + +0:36:31.440,0:36:38.599 +so that's actually a very important both the receiver connection and the batch normalization are really really helpful to get this network to + +0:36:39.059,0:36:46.849 +Properly train if you don't use them then it's going to be quite hard to get those networks to really work for the training part + +0:36:48.000,0:36:51.949 +So how does it work we have here an image, for example + +0:36:53.010,0:36:55.939 +Where most of the information is spatial information? + +0:36:55.940,0:36:59.000 +So the information is spread across the two dimensions + +0:36:59.220,0:37:04.520 +Although there is a thickness and I call the thickness as characteristic information + +0:37:04.770,0:37:07.339 +Which means it provides a information? + +0:37:07.890,0:37:11.569 +At that specific point. So what is my characteristic information? + +0:37:12.180,0:37:15.740 + in this image let's say it's a RGB image + +0:37:16.680,0:37:18.680 +It's a color image right? + +0:37:19.230,0:37:27.109 +So we have the most of the information is spread on a spatial information. Like if you have me making funny faces + +0:37:28.109,0:37:30.109 +but then at each point + +0:37:30.300,0:37:33.769 +This is not a grayscale image is a color image, right? + +0:37:33.770,0:37:39.199 +So each point will have an additional information which is my you know specific + +0:37:39.990,0:37:42.439 +Characteristic information. What is it in this case? + +0:37:44.640,0:37:46.910 +It's a vector of three values which represent + +0:37:48.630,0:37:51.530 +RGB are the three letters by the __ as they represent + +0:37:54.780,0:37:57.949 +Okay, overall, what does it represent like + +0:37:59.160,0:38:02.480 +Yes intensity. Just you know, tell me in English without weird + +0:38:03.359,0:38:05.130 +things + +0:38:05.130,0:38:11.480 +The color of the pixel, right? So my specific information. My characteristic information. Yeah. I don't know what you're saying + +0:38:11.480,0:38:18.500 +Sorry, the characteristic information in this case is just a color right so the color is the only information that is specific there + +0:38:18.500,0:38:20.780 +But then otherwise information is spread around + +0:38:21.359,0:38:23.359 +As if we climb climb the hierarchy + +0:38:23.730,0:38:31.189 +You can see now some final vector which has let's say we are doing classification in this case. 
So my + +0:38:31.770,0:38:36.530 +You know the height and width or the thing is going to be one by one so it's just one vector + +0:38:37.080,0:38:43.590 +And then let's say there you have the specific final logit, which is the highest one so which is representing the class + +0:38:43.590,0:38:47.400 +Which is most likely to be the correct one if it's trained well + +0:38:48.220,0:38:51.630 +in the Midway, you have something that is, you know a trade-off between + +0:38:52.330,0:38:59.130 +Spatial information and then these characteristic information. Okay. So basically it's like a conversion between + +0:39:00.070,0:39:01.630 +spatial information + +0:39:01.630,0:39:03.749 +into this characteristic information + +0:39:04.360,0:39:07.049 +Do you see so it basically go from a thing? + +0:39:07.660,0:39:08.740 +input + +0:39:08.740,0:39:13.920 +Data to something. It is very thick, but then has no more information spatial information + +0:39:14.710,0:39:20.760 +and so you can see here with my ninja PowerPoint skills how you can get you know a + +0:39:22.240,0:39:27.030 +Reduction of the ___ thickener like a figure thicker in our presentation + +0:39:27.070,0:39:30.840 +Whereas you actually lose the spatial special one + +0:39:32.440,0:39:39.870 +Okay, so that was oh one more pooling so pooling is simply again for example + +0:39:41.620,0:39:43.600 +It can be performed in this way + +0:39:43.600,0:39:48.660 +So there you have some hand drawing because I didn't want to do you have time to make it in latex? + +0:39:49.270,0:39:52.410 +So you have different regions you apply a specific? + +0:39:53.500,0:39:57.060 +Operator to that specific region, for example, you have the P norm + +0:39:58.150,0:39:59.680 +and then + +0:39:59.680,0:40:02.760 +Yes, the P goes to plus infinity. You have the Max + +0:40:03.730,0:40:09.860 +And then that one is not give you one value right then you perform a stride. + +0:40:09.860,0:40:12.840 +jump to Pixels further and then you again you compute the same thing + +0:40:12.840,0:40:18.150 +you're gonna get another value there and so on until you end up from + +0:40:18.700,0:40:24.900 +Your data which was m by n with c channels you get still c channels + +0:40:24.900,0:40:31.199 +But then in this case you gonna get m/2 and c and n/2. Okay, and this is for images + +0:40:35.029,0:40:41.079 +There are no parameters on the pooling how you can nevertheless choose which kind of pooling, right you can choose max pooling + +0:40:41.390,0:40:44.229 +Average pooling any pooling is wrong. So + +0:40:45.769,0:40:48.879 +Yeah, let's also the problem, okay, so + +0:40:49.999,0:40:55.809 +This was the mean part with the slides. We are gonna see now the notebooks will go a bit slower this time + +0:40:55.809,0:40:58.508 +I noticed that last time I kind of rushed + +0:40:59.900,0:41:02.529 +Are there any questions so far on this part that we cover? + +0:41:04.519,0:41:06.519 +Yeah + +0:41:10.670,0:41:12.469 +So there is like + +0:41:12.469,0:41:17.769 +Geoffrey Hinton is renowned for saying that max pooling is something which is just + +0:41:18.259,0:41:23.319 +Wrong because you just throw away information as you average or you take the max you just throw away things + +0:41:24.380,0:41:29.140 +He's been working on like something called capsule networks, which have you know specific + +0:41:29.660,0:41:33.849 +routing paths that are choosing, you know some + +0:41:34.519,0:41:41.319 +Better strategies in order to avoid like throwing away information. Okay. 
Basically that's the the argument behind yeah + +0:41:45.469,0:41:52.329 +Yes, so the main purpose of using this pooling or the stride is actually to get rid of a lot of data such that you + +0:41:52.329,0:41:54.579 +Can compute things in a reasonable amount of time? + +0:41:54.619,0:42:00.939 +Usually you need a lot of stride or pooling at the first layers at the bottom because otherwise it's absolutely you know + +0:42:01.339,0:42:03.339 +Too computationally expensive + +0:42:03.979,0:42:05.979 +Yeah + +0:42:21.459,0:42:23.459 +So on that sit + +0:42:24.339,0:42:32.068 +Those network architectures are so far driven by you know the state of the art, which is completely an empirical base + +0:42:33.279,0:42:40.109 +we try hard and we actually go to I mean now we actually arrive to some kind of standard so a + +0:42:40.359,0:42:44.399 +Few years back. I was answering like I don't know but right now we actually have + +0:42:45.099,0:42:47.049 +Determined some good configurations + +0:42:47.049,0:42:53.968 +Especially using those receiver connections and the batch normalization. We actually can get to train basically everything + +0:42:54.759,0:42:56.759 +Yeah + +0:43:05.859,0:43:11.038 +So basically you're gonna have your gradient at a specific point coming down as well + +0:43:11.039,0:43:13.679 +And then you have the other gradient coming down down + +0:43:13.839,0:43:18.238 +Then you had a branch right a branching and if you have branch what's happening with the gradient? + +0:43:19.720,0:43:25.439 +That's correct. Yeah, they get added right so you have the two gradients coming from two different branches getting added together + +0:43:26.470,0:43:31.769 +All right. So let's go to the notebook such that we can cover we don't rush too much + +0:43:32.859,0:43:37.139 +So here I just go through the convnet part. So here I train + +0:43:39.519,0:43:41.289 +Initially I + +0:43:41.289,0:43:43.979 +Load the MNIST data set so I show you a few + +0:43:44.680,0:43:45.849 +characters here + +0:43:45.849,0:43:52.828 +Okay, and I train now a multi-layer perceptron like a fully connected Network like a mood, you know + +0:43:53.440,0:44:00.509 +Yeah, fully connected Network and a convolutional neural net which have the same number of parameters. Okay. So these two models will have the same + +0:44:01.150,0:44:05.819 +Dimension in terms of D. If you save them we'll wait the same so + +0:44:07.269,0:44:11.219 +I'm training here this guy here with the fully connected Network + +0:44:12.640,0:44:14.640 +It takes a little bit of time + +0:44:14.829,0:44:21.028 +And he gets some 87% Okay. This is trained on classification of the MNIST digits from Yann + +0:44:21.999,0:44:24.419 +We actually download from his website if you check + +0:44:25.239,0:44:32.189 +Anyhow, I train a convolutional neural net with the same number of parameters what you expect to have a better a worse result + +0:44:32.349,0:44:35.548 +So my multi-layer perceptron gets 87 percent + +0:44:36.190,0:44:38.190 +What do we get with a convolutional net? 
+ +0:44:41.739,0:44:43.739 +Yes, why + +0:44:46.910,0:44:50.950 +Okay, so what is the point here of using sparsity what does it mean + +0:44:52.640,0:44:55.089 +Given that we have the same number of parameters + +0:44:56.690,0:44:58.690 +We manage to train much + +0:44:59.570,0:45:05.440 +more filters right in the second case because in the first case we use filters that are completely trying to get some + +0:45:05.960,0:45:12.549 +dependencies between things that are further away with things that are closed by so they are completely wasted basically they learn 0 + +0:45:12.830,0:45:19.930 +Instead in the convolutional net. I have all these parameters. They're just concentrated for figuring out. What is the relationship within a + +0:45:20.480,0:45:23.799 +Neighboring pixels. All right. So now it takes the pictures I + +0:45:24.740,0:45:26.740 +Shake everything just got scrambled + +0:45:27.410,0:45:33.369 +But I keep the same I scramble the same same way all the images. So I perform a random permutation + +0:45:34.850,0:45:38.710 +Always the same random permutation of all my images or the pixels on my images + +0:45:39.500,0:45:41.090 +What does it happen? + +0:45:41.090,0:45:43.299 +If I train both networks + +0:45:47.990,0:45:50.049 +So here I trained see here + +0:45:50.050,0:45:56.950 +I have my pics images and here I just scrambled with the same scrambling function all the pixels + +0:46:00.200,0:46:04.240 +All my inputs are going to be these images here + +0:46:06.590,0:46:10.870 +The output is going to be still the class of the original so this is a four you + +0:46:11.450,0:46:13.780 +Can see this this is a four. This is a nine + +0:46:14.920,0:46:19.889 +This is a 1 this is a 7 is a 3 in this is a 4 so I keep the same labels + +0:46:19.930,0:46:24.450 +But I scrambled the order of the pixels and I perform the same scrambling every time + +0:46:25.239,0:46:27.239 +What do you expect is performance? + +0:46:31.029,0:46:33.299 +Who's better who's working who's the same? + +0:46:38.619,0:46:46.258 +Perception how does it do with the perception? Does he see any difference? No, okay. So the guy still 83 + +0:46:47.920,0:46:49.920 +Yann's network + +0:46:52.029,0:46:54.029 +What do you guys + +0:47:04.089,0:47:09.988 +Know that's a fully connected. Sorry. I'll change the order. Yeah, see. Okay. There you go + +0:47:12.460,0:47:14.999 +So I can't even show you this thing + +0:47:17.920,0:47:18.730 +All right + +0:47:18.730,0:47:24.659 +So the fully connected guy basically performed the same the differences are just basic based on the initial + +0:47:25.059,0:47:30.899 +The random initialization the convolutional net which was winning by kind of large advance + +0:47:31.509,0:47:33.509 +advantage before actually performs + +0:47:34.059,0:47:38.008 +Kind of each similarly, but I mean worse than much worse than before + +0:47:38.499,0:47:42.449 +Why is the convolutional network now performing worse than my fully connected Network? + +0:47:44.829,0:47:46.829 +Because we fucked up + +0:47:47.739,0:47:55.379 +Okay, and so every time you use a convolutional network, you actually have to think can I use of convolutional network, okay + +0:47:56.440,0:47:59.700 +If it holds now, you have the three properties then yeah + +0:47:59.700,0:48:05.759 +Maybe of course, it should be giving you a better performance if those three properties don't hold + +0:48:06.579,0:48:09.058 +then using convolutional networks is + +0:48:11.499,0:48:17.939 +BS right, which was the bias? No. Okay. Never mind. All right. 
Well, good night diff --git a/docs/pt/week04/04-1.md b/docs/pt/week04/04-1.md new file mode 100644 index 000000000..78ecbe56c --- /dev/null +++ b/docs/pt/week04/04-1.md @@ -0,0 +1,596 @@ +--- +lang: pt +lang-ref: ch.04-1 +lecturer: Alfredo Canziani +title: Álgebra Linear e Convoluções +authors: Yuchi Ge, Anshan He, Shuting Gu e Weiyang Wen +date: 18 Feb 2020 +translation-date: 05 Nov 2021 +translator: Leon Solon +--- + + + +## [Revisão de Álgebra Linear](https://www.youtube.com/watch?v=OrBEon3VlQg&t=68s) + + + +Esta parte é uma recapitulação de Álgebra Linear básica no contexto das redes neurais. Começamos com uma camada oculta simples $\boldsymbol{h}$: + + + +$$ +\boldsymbol{h} = f(\boldsymbol{z}) +$$ + + + +A saída é uma função não linear $f$ aplicada a um vetor $z$. Aqui $z$ é a saída de uma transformação afim (affine transformation) $\boldsymbol{A} \in\mathbb{R^{m\times n}}$ para o vetor de entrada $\boldsymbol{x} \in\mathbb{R^n}$: + + + +$$ +\boldsymbol{z} = \boldsymbol{A} \boldsymbol{x} +$$ + + + +Para simplificar, os viéses (biases) são ignorados. A equação linear pode ser expandida como: + + + +$$ +\boldsymbol{A}\boldsymbol{x} = +\begin{pmatrix} +a_{11} & a_{12} & \cdots & a_{1n}\\ +a_{21} & a_{22} & \cdots & a_{2n} \\ +\vdots & \vdots & \ddots & \vdots \\ +a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix} \begin{pmatrix} +x_1 \\ \vdots \\x_n \end{pmatrix} = +\begin{pmatrix} + \text{---} \; \boldsymbol{a}^{(1)} \; \text{---} \\ + \text{---} \; \boldsymbol{a}^{(2)} \; \text{---} \\ + \vdots \\ + \text{---} \; \boldsymbol{a}^{(m)} \; \text{---} \\ +\end{pmatrix} +\begin{matrix} + \rvert \\ \boldsymbol{x} \\ \rvert +\end{matrix} = +\begin{pmatrix} + {\boldsymbol{a}}^{(1)} \boldsymbol{x} \\ {\boldsymbol{a}}^{(2)} \boldsymbol{x} \\ \vdots \\ {\boldsymbol{a}}^{(m)} \boldsymbol{x} +\end{pmatrix}_{m \times 1} +$$ + + + +onde $\boldsymbol{a}^{(i)}$ é a $i$-ésima linha da matriz $\boldsymbol{A}$. + + + +Para entender o significado dessa transformação, vamos analisar um componente de $\boldsymbol{z}$ como $a^{(1)}\boldsymbol{x}$. Seja $n=2$, então $\boldsymbol{a} = (a_1,a_2)$ e $\boldsymbol{x} = (x_1,x_2)$. + + + +$\boldsymbol{a}$ e $\boldsymbol{x}$ podem ser desenhados como vetores no eixo de coordenadas 2D. Agora, se o ângulo entre $\boldsymbol{a}$ e $\hat{\boldsymbol{\imath}}$ é $\alpha$ e o ângulo entre $\boldsymbol{x}$ e $\hat{\boldsymbol{\imath}}$ é $\xi$, então com fórmulas trigonométricas $a^\top\boldsymbol{x}$ pode ser expandido como: + + + +$$ +\begin {aligned} +\boldsymbol{a}^\top\boldsymbol{x} &= a_1x_1+a_2x_2\\ +&=\lVert \boldsymbol{a} \rVert \cos(\alpha)\lVert \boldsymbol{x} \rVert \cos(\xi) + \lVert \boldsymbol{a} \rVert \sin(\alpha)\lVert \boldsymbol{x} \rVert \sin(\xi)\\ +&=\lVert \boldsymbol{a} \rVert \lVert \boldsymbol{x} \rVert \big(\cos(\alpha)\cos(\xi)+\sin(\alpha)\sin(\xi)\big)\\ +&=\lVert \boldsymbol{a} \rVert \lVert \boldsymbol{x} \rVert \cos(\xi-\alpha) +\end {aligned} +$$ + + + +A saída mede o alinhamento da entrada a uma linha específica da matriz $\boldsymbol{A}$. Isso pode ser entendido observando o ângulo entre os dois vetores, $\xi-\alpha$. Quando $\xi = \alpha$, os dois vetores estão perfeitamente alinhados e o máximo é atingido. Se $\xi - \alpha = \pi$, então $\boldsymbol{a}^\top\boldsymbol{x}$ atinge seu mínimo e os dois vetores estão apontando em direções opostas. Em essência, a transformação linear permite ver a projeção de uma entrada para várias orientações definidas por $A$. Essa intuição também pode ser expandida para dimensões superiores. 
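+
+A título de ilustração, segue um esboço mínimo em PyTorch (ângulos e comprimentos escolhidos arbitrariamente) que verifica numericamente essa identidade em 2-D:
+
+```python
+import torch
+
+# Dois vetores 2-D definidos por comprimento e ângulo (valores arbitrários)
+alpha = torch.tensor(0.3)  # ângulo de a
+xi = torch.tensor(1.1)     # ângulo de x
+a = 2.0 * torch.stack((torch.cos(alpha), torch.sin(alpha)))  # ||a|| = 2
+x = 3.0 * torch.stack((torch.cos(xi), torch.sin(xi)))        # ||x|| = 3
+
+lhs = a @ x                                        # produto escalar aᵀx
+rhs = a.norm() * x.norm() * torch.cos(xi - alpha)  # ||a|| ||x|| cos(ξ − α)
+print(lhs.item(), rhs.item())                      # ambos ≈ 4.18
+```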
+ + + +Outra maneira de entender a transformação linear é entendendo que $\boldsymbol{z}$ também pode ser expandido como: + + + +$$ +\boldsymbol{A}\boldsymbol{x} = +\begin{pmatrix} + \vert & \vert & & \vert \\ + \boldsymbol{a}_1 & \boldsymbol{a}_2 & \cdots & \boldsymbol{a}_n \\ + \vert & \vert & & \vert \\ +\end{pmatrix} +\begin{matrix} + \rvert \\ \boldsymbol{x} \\ \rvert +\end{matrix} = +x_1 \begin{matrix} \rvert \\ \boldsymbol{a}_1 \\ \rvert \end{matrix} + +x_2 \begin{matrix} \rvert \\ \boldsymbol{a}_2 \\ \rvert \end{matrix} + + \cdots + +x_n \begin{matrix} \rvert \\ \boldsymbol{a}_n \\ \rvert \end{matrix} +$$ + + + +A saída é a soma ponderada das colunas da matriz $\boldsymbol{A}$. Portanto, o sinal nada mais é do que uma composição da entrada. + + + +## [Extender Álgebra Linear para convoluções](https://www.youtube.com/watch?v=OrBEon3VlQg&t=1030s) + + + +Agora estendemos a álgebra linear às convoluções, usando o exemplo de análise de dados de áudio. Começamos representando uma camada totalmente conectada como uma forma de multiplicação de matriz: - + + + +$$ +\begin{bmatrix} +w_{11} & w_{12} & w_{13}\\ +w_{21} & w_{22} & w_{23}\\ +w_{31} & w_{32} & w_{33}\\ +w_{41} & w_{42} & w_{43} +\end{bmatrix} +\begin{bmatrix} +x_1\\ +x_2\\ +x_3 +\end{bmatrix} = \begin{bmatrix} +y_1\\ +y_2\\ +y_3\\ +y_4 +\end{bmatrix} +$$ + + + +Neste exemplo, a matriz de peso tem um tamanho de $4 \times 3$, o vetor de entrada tem um tamanho de $3 \times 1$ e o vetor de saída tem um tamanho de $4 \times 1$. + + + +No entanto, para dados de áudio, os dados são muito mais longos (não com 3 amostras). O número de amostras nos dados de áudio é igual à duração do áudio (*por exemplo,* 3 segundos) vezes a taxa de amostragem (*por exemplo,* 22,05 kHz). Conforme mostrado abaixo, o vetor de entrada $\boldsymbol{x}$ será bem longo. Correspondentemente, a matriz de peso se tornará "gorda". + + + +$$ +\begin{bmatrix} +w_{11} & w_{12} & w_{13} & w_{14} & \cdots &w_{1k}& \cdots &w_{1n}\\ +w_{21} & w_{22} & w_{23}& w_{24} & \cdots & w_{2k}&\cdots &w_{2n}\\ +w_{31} & w_{32} & w_{33}& w_{34} & \cdots & w_{3k}&\cdots &w_{3n}\\ +w_{41} & w_{42} & w_{43}& w_{44} & \cdots & w_{4k}&\cdots &w_{4n} +\end{bmatrix} +\begin{bmatrix} +x_1\\ +x_2\\ +x_3\\ +x_4\\ +\vdots\\ +x_k\\ +\vdots\\ +x_n +\end{bmatrix} = \begin{bmatrix} +y_1\\ +y_2\\ +y_3\\ +y_4 +\end{bmatrix} +$$ + + + +A formulação acima será difícil de treinar. Felizmente, existem maneiras de simplificar o mesmo. + + + +### Propriedade: localidade + + + +Devido à localidade (*ou seja,* não nos importamos com pontos de dados distantes) dos dados, $ w_ {1k} $ da matriz de peso acima pode ser preenchido com 0 quando $ k $ é relativamente grande. Portanto, a primeira linha da matriz torna-se um kernel de tamanho 3. Vamos denotar este kernel de tamanho 3 como $\boldsymbol{a}^{(1)} = \begin{bmatrix} a_1^{(1)} & a_2^{(1)} & a_3^{(1)} \end{bmatrix}$. + + + +$$ +\begin{bmatrix} +a_1^{(1)} & a_2^{(1)} & a_3^{(1)} & 0 & \cdots &0& \cdots &0\\ +w_{21} & w_{22} & w_{23}& w_{24} & \cdots & w_{2k}&\cdots &w_{2n}\\ +w_{31} & w_{32} & w_{33}& w_{34} & \cdots & w_{3k}&\cdots &w_{3n}\\ +w_{41} & w_{42} & w_{43}& w_{44} & \cdots & w_{4k}&\cdots &w_{4n} +\end{bmatrix} +\begin{bmatrix} +x_1\\ +x_2\\ +x_3\\ +x_4\\ +\vdots\\ +x_k\\ +\vdots\\ +x_n +\end{bmatrix} = \begin{bmatrix} +y_1\\ +y_2\\ +y_3\\ +y_4 +\end{bmatrix} +$$ + + + +### Propriedade: estacionariedade + + + +Os sinais de dados naturais têm a propriedade de estacionariedade (*ou seja,* certos padrões / motivos se repetirão). 
Isso nos ajuda a reutilizar o kernel $\mathbf{a}^{(1)}$ que definimos anteriormente. Usamos este kernel colocando-o um passo adiante a cada vez (*ou seja,* o passo é 1), resultando no seguinte: + + + +$$ +\begin{bmatrix} +a_1^{(1)} & a_2^{(1)} & a_3^{(1)} & 0 & 0 & 0 & 0&\cdots &0\\ +0 & a_1^{(1)} & a_2^{(1)} & a_3^{(1)} & 0&0&0&\cdots &0\\ +0 & 0 & a_1^{(1)} & a_2^{(1)} & a_3^{(1)} & 0&0&\cdots &0\\ +0 & 0 & 0& a_1^{(1)} & a_2^{(1)} &a_3^{(1)} &0&\cdots &0\\ +0 & 0 & 0& 0 & a_1^{(1)} &a_2^{(1)} &a_3^{(1)} &\cdots &0\\ +\vdots&&\vdots&&\vdots&&\vdots&&\vdots +\end{bmatrix} +\begin{bmatrix} +x_1\\ +x_2\\ +x_3\\ +x_4\\ +\vdots\\ +x_k\\ +\vdots\\ +x_n +\end{bmatrix} +$$ + + + +Tanto a parte superior direita quanto a parte inferior esquerda da matriz são preenchidas com $ 0 $ s graças à localidade, levando à dispersão. A reutilização de um determinado kernel repetidamente é chamada de divisão de peso. + + + +### Múltiplas camadas de matriz Toeplitz + + + +Após essas alterações, o número de parâmetros que resta é 3 (*ou seja,* $a_1,a_2,a_3$). Em comparação com a matriz de peso anterior, que tinha 12 parâmetros (*por exemplo* $w_{11},w_{12},\cdots,w_{43}$), o número atual de parâmetros é muito restritivo e gostaríamos de expandir o mesmo. + + + +A matriz anterior pode ser considerada uma camada (*ou seja,* uma camada convolucional) com o kernel $\boldsymbol{a}^{(1)}$. Então podemos construir múltiplas camadas com diferentes kernels $\boldsymbol{a}^{(2)}$, $\boldsymbol{a}^{(3)}$, etc, aumentando assim os parâmetros. + + + +Cada camada possui uma matriz contendo apenas um kernel que é replicado várias vezes. Este tipo de matriz é denominado matriz de Toeplitz. Em cada matriz de Toeplitz, cada diagonal descendente da esquerda para a direita é constante. As matrizes Toeplitz que usamos aqui também são matrizes esparsas. + + + +Dado o primeiro kernel $\boldsymbol{a}^{(1)}$ e o vetor de entrada $\boldsymbol{x}$, a primeira entrada na saída fornecida por esta camada é, $a_1^{(1)} x_1 + a_2^{(1)} x_2 + a_3^{(1)}x_3$. Portanto, todo o vetor de saída se parece com o seguinte: - + + + +$$ +\begin{bmatrix} +\mathbf{a}^{(1)}x[1:3]\\ +\mathbf{a}^{(1)}x[2:4]\\ +\mathbf{a}^{(1)}x[3:5]\\ +\vdots +\end{bmatrix} +$$ + + + +O mesmo método de multiplicação de matriz pode ser aplicado nas seguintes camadas convolucionais com outros kernels (*por exemplo* $\boldsymbol{a}^{(2)}$ e $\boldsymbol{a}^{(3)}$) para obter similar resultados. + + + +## [Ouvindo as convoluções - Jupyter Notebook](https://www.youtube.com/watch?v=OrBEon3VlQg&t=1709s) + + + +O Jupyter Notebook pode ser encontrado [aqui](https://github.com/Atcold/pytorch-Deep-Learning/blob/master/07-listening_to_kernels.ipynb). + + + +Neste bloco de notas, vamos explorar a Convolução como um 'produto escalar em execução'. + + + +A biblioteca `librosa` nos permite carregar o clipe de áudio $\boldsymbol{x}$ e sua taxa de amostragem. Nesse caso, existem 70641 amostras, a taxa de amostragem é de 22,05 kHz e a duração total do clipe é de 3,2 s. O sinal de áudio importado é ondulado (consulte a Figura 1) e podemos adivinhar como ele soa a partir da amplitude do eixo $ y $. O sinal de áudio $x(t)$ é na verdade o som reproduzido ao desligar o sistema Windows (consulte a Fig 2). + + + +
+Fig. 1: Uma visualização do sinal de áudio.
+
+Fig. 2: Notas do sinal de áudio acima.
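+
+A título de ilustração, um esboço mínimo de como esse carregamento pode ser feito com a biblioteca `librosa` (o nome do arquivo abaixo é hipotético; no notebook usa-se o clipe real):
+
+```python
+import librosa
+
+# Carrega o clipe de áudio e sua taxa de amostragem nativa (sr=None preserva a taxa original)
+x, sr = librosa.load('desligar_windows.wav', sr=None)
+print(x.shape, sr)  # algo como (70641,) e 22050, conforme descrito acima
+```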
+
+Precisamos separar as notas da forma de onda. Se usarmos a transformada de Fourier (FT) sobre o sinal inteiro, todas as notas aparecerão juntas e será difícil descobrir o instante exato e a localização de cada tom. Portanto, é necessária uma FT localizada (também conhecida como espectrograma). Como se observa no espectrograma (consulte a Fig. 3), tons diferentes atingem picos em frequências diferentes (*por exemplo,* o primeiro tom tem pico em 1600). A concatenação dos quatro tons em suas frequências nos dá uma versão do sinal original.
+
+Fig. 3: Sinal de áudio e seu espectrograma.
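+
+Um esboço de como um espectrograma semelhante ao da Fig. 3 pode ser calculado com a `librosa` (parâmetros de janela escolhidos apenas como exemplo, reutilizando `x` e `sr` do esboço anterior):
+
+```python
+import numpy as np
+import librosa
+import librosa.display
+import matplotlib.pyplot as plt
+
+# FT localizada (STFT): janelas curtas preservam a informação temporal de cada tom
+X = librosa.stft(x, n_fft=2048, hop_length=512)
+S_db = librosa.amplitude_to_db(np.abs(X), ref=np.max)  # magnitude em dB
+
+librosa.display.specshow(S_db, sr=sr, hop_length=512, x_axis='time', y_axis='hz')
+plt.colorbar()
+plt.show()
+```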
+
+A convolução do sinal de entrada com todos os tons (todas as teclas do piano, por exemplo) pode ajudar a extrair todas as notas presentes na peça de entrada (*ou seja,* os pontos em que o áudio corresponde aos núcleos específicos). Os espectrogramas do sinal original e do sinal dos tons concatenados são mostrados na Fig. 4, enquanto as frequências do sinal original e dos quatro tons são mostradas na Fig. 5. O gráfico das convoluções dos quatro núcleos com o sinal de entrada (sinal original) é mostrado na Fig. 6. A Fig. 6, junto com os clipes de áudio das convoluções, comprova a eficácia das convoluções na extração das notas.
+
+Fig. 4: Espectrograma do sinal original (esquerda) e Espectrograma da concatenação de tons (direita).
+
+Fig. 5: Primeira nota da melodia.
+
+Fig. 6: Convolução dos quatro núcleos.
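+
+Para ilustrar a ideia de 'produto escalar em execução' por trás da Fig. 6, segue um esboço com `conv1d` do PyTorch (formas e valores hipotéticos; no notebook, os núcleos são os próprios tons):
+
+```python
+import torch
+import torch.nn.functional as F
+
+# Esboço com dados aleatórios: na prática, x é o sinal e os núcleos são os tons (notas)
+x = torch.randn(1, 1, 70641)       # (lote, canais, amostras)
+nucleos = torch.randn(4, 1, 1024)  # 4 núcleos com 1024 amostras cada (tamanho hipotético)
+
+ativacoes = F.conv1d(x, nucleos)   # produto escalar deslizante de cada núcleo sobre x
+print(ativacoes.shape)             # torch.Size([1, 4, 69618]); os picos indicam onde cada nota ocorre
+```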
+ + + +## Dimensionalidade de diferentes conjuntos de dados + + + +A última parte é uma pequena digressão sobre as diferentes representações da dimensionalidade e exemplos para as mesmas. Aqui, consideramos que o conjunto de entrada $X$ é feito de mapeamento de funções dos domínios $\Omega$ para os canais $c$. + + + +### Exemplos + + + +* Dados de áudio: o domínio é 1-D, sinal discreto indexado pelo tempo; o número de canais $ c $ pode variar de 1 (mono), 2 (estéreo), 5 + 1 (Dolby 5.1), *etc.* +* Dados da imagem: o domínio é 2-D (pixels); $ c $ pode variar de 1 (escala de cinza), 3 (cor), 20 (hiperespectral), *etc.* +* Relatividade especial: o domínio é $\mathbb{R^4} \times \mathbb{R^4}$ (espaço-tempo $\times$ quatro-momento); quando $c = 1$ é chamado de Hamiltoniano. + + + +
+Fig. 7: Dimensões diferentes de tipos diferentes de sinais.
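+
+Em PyTorch, os exemplos acima correspondem a tensores no formato "canais primeiro" (tamanhos escolhidos apenas como ilustração):
+
+```python
+import torch
+
+audio_estereo  = torch.randn(2, 70641)     # domínio 1-D (tempo), c = 2 canais
+imagem_rgb     = torch.randn(3, 224, 224)  # domínio 2-D (pixels), c = 3 canais
+hiperespectral = torch.randn(20, 64, 64)   # domínio 2-D, c = 20 canais
+print(audio_estereo.shape, imagem_rgb.shape, hiperespectral.shape)
+```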
diff --git a/docs/pt/week04/04.md b/docs/pt/week04/04.md new file mode 100644 index 000000000..942afab61 --- /dev/null +++ b/docs/pt/week04/04.md @@ -0,0 +1,18 @@ +--- +lang: pt +lang-ref: ch.04 +title: Semana 4 +translation-date: 05 Nov 2021 +translator: Leon Solon +--- + + + +## Prática + + + +Começamos com uma breve revisão de Álgebra Linear e, em seguida, estendemos o tópico para as convoluções usando dados de áudio como exemplo. Conceitos chave como localidade, estacionariedade e matriz de Toeplitz são reiterados. Em seguida, oferecemos uma demonstração ao vivo do desempenho da convolução na análise do tom. Finalmente, há uma pequena digressão sobre a dimensionalidade de diferentes dados. \ No newline at end of file diff --git a/docs/pt/week04/practicum04.sbv b/docs/pt/week04/practicum04.sbv new file mode 100644 index 000000000..5b8ec5afc --- /dev/null +++ b/docs/pt/week04/practicum04.sbv @@ -0,0 +1,1517 @@ +0:00:00.030,0:00:04.730 +então, desde a última vez, ok, bem-vindo de volta, obrigado por estar aqui. + +0:00:04.730,0:00:09.059 +A última vez que Yann usou o tablet, certo? e como você pode usar o tablet e eu + +0:00:09.059,0:00:13.040 +não usa o tablet, certo? Então, eu deveria ser tão legal quanto Yann, pelo menos eu acho. + +0:00:13.040,0:00:18.900 +Mais uma coisa para começar, há uma planilha onde você pode decidir se + +0:00:18.900,0:00:22.890 +você gostaria de entrar no canal do Slack, onde colaboramos para fazer + +0:00:22.890,0:00:28.529 +alguns desenhos para o site, corrigindo algumas notações matemáticas, tendo alguns + +0:00:28.529,0:00:32.640 +tipo de, você sabe, consertar o erro em inglês na gramática inglesa ou + +0:00:32.640,0:00:37.290 +seja o que for, então, se você estiver interessado em ajudar a melhorar o conteúdo de + +0:00:37.290,0:00:42.450 +esta turma, fique à vontade para preencher a planilha, ok? Já somos alguns de + +0:00:42.450,0:00:49.789 +nós no canal do Slack, então quero dizer, se você quiser entrar, de nada. Então + +0:00:49.789,0:00:53.399 +em vez de escrever no quadro branco, porque é impossível ver, eu acho + +0:00:53.399,0:01:00.270 +do lado superior, vamos experimentar um novo brinquedo aqui. Tudo + +0:01:00.270,0:01:03.930 +direito. Primeira vez, então você sabe, estou um pouco + +0:01:03.930,0:01:11.250 +tenso. Da última vez, estraguei um notebook, então tudo bem, tudo bem. Então nós vamos + +0:01:11.250,0:01:15.659 +comece com uma pequena revisão sobre álgebra linear. Espero que não seja + +0:01:15.659,0:01:20.670 +ofender alguém, estou ciente de que você já estudou álgebra linear e está + +0:01:20.670,0:01:25.920 +muito forte nisso, mas, no entanto, gostaria de lhe fornecer minha intuição, meu + +0:01:25.920,0:01:31.320 +perspectiva, ok? é apenas um slide, não muito, então talvez você queira + +0:01:31.320,0:01:36.290 +para tirar papel e caneta ou você pode apenas saber, o que for, acompanhar. + +0:01:36.290,0:01:49.850 +Portanto, esta será uma revisão de álgebra linear. + +0:01:51.170,0:02:06.450 +OK. Estou esperando um pouco? Deixe-me esperar um segundo. Preparar?? Sim? Não? mexa sua + +0:02:06.450,0:02:12.150 +cabeça. Fantástico, certo, estávamos conversando da última vez que tivemos um + +0:02:12.150,0:02:15.270 +rede com a entrada na parte inferior, então tínhamos um afim + +0:02:15.270,0:02:18.510 +transformação, então temos uma camada oculta à direita. Então, vou apenas escrever o + +0:02:18.510,0:02:23.280 +primeira equação. 
Teremos essa minha camada oculta, e como estou escrevendo + +0:02:23.280,0:02:26.970 +com uma caneta, você pode ver alguma coisa? sim? então, como estou escrevendo com uma caneta, estou + +0:02:26.970,0:02:30.360 +vai colocar um sublinhado embaixo da variável para + +0:02:30.360,0:02:35.120 +indicam que é um vetor. OK? é assim que escrevo vetores. Então meu H vai ser um + +0:02:35.120,0:02:42.660 +função não linear f aplicada ao meu z e z vai ser minha entrada linear, + +0:02:42.660,0:02:46.230 +a saída da transformação afim, portanto, neste caso + +0:02:46.230,0:02:54.989 +Vou escrever aqui z será igual à minha matriz A vezes x. Nós + +0:02:54.989,0:03:00.630 +podemos imaginar que não há preconceito neste caso, é genérico o suficiente porque podemos + +0:03:00.630,0:03:06.030 +inclua o viés dentro da matriz e tenha o primeiro item de x igual a + +0:03:06.030,0:03:18.360 +1. Então, se este x aqui pertence a R n e este z aqui pertence a R m, a primeira pergunta: + +0:03:18.360,0:03:24.989 +qual é o tamanho desta matriz? Ok, fantástico, então esta matriz aqui é + +0:03:24.989,0:03:30.360 +vai ser o nosso m vezes n, você tem tantas linhas quanto a dimensão para onde você atira + +0:03:30.360,0:03:34.200 +e você tem tantas colunas quanto a dimensão de onde você está filmando, ok? + +0:03:34.200,0:03:39.930 +Tudo bem, então vamos expandir este, então esta matriz aqui vai ser igual a + +0:03:39.930,0:03:49.549 +o que você tem a_ {1,1} a_ {1,2} assim por diante até o último, qual vai ser? gritar. + +0:03:49.549,0:03:56.269 +Obrigado, 1 ...? sim 1 e então você tem o segundo + +0:03:56.269,0:04:04.249 +vai ter a_ {2,1} a_ {2,2} assim por diante até o último que é a {2, n}, certo? e aí você + +0:04:04.249,0:04:14.840 +continuar descendo até o último qual vai ser? quais são os índices? m 1, + +0:04:14.840,0:04:24.740 +direito? Ok, então você tem a_ {m, 1}, a_ {m, 2} e assim por diante até a_ {m, n}, ok, obrigado. + +0:04:24.740,0:04:37.220 +E então temos aqui o nosso x, certo? então você tem x_1, x_2 e assim por diante até x_n, + +0:04:37.220,0:04:39.490 +direito? + +0:04:41.680,0:04:47.479 +Você está mais responsivo do que no ano passado, ótimo, obrigado. Tudo bem, então nós também podemos + +0:04:47.479,0:04:52.099 +reescrever este de maneiras diferentes. Então, a primeira maneira que vou escrever este + +0:04:52.099,0:05:00.020 +será o seguinte, então terei aqui estes 1, então terei + +0:05:00.020,0:05:12.409 +aqui meu a 2 e então eu tenho o último que vai ser meu an, ok? e então + +0:05:12.409,0:05:15.800 +aqui vou multiplicar isso por um vetor de coluna, certo? então meu vetor de coluna + +0:05:15.800,0:05:23.409 +Eu vou escrever assim. Tudo bem, então qual é o resultado desta operação? + +0:05:23.409,0:05:28.820 +então essas são métricas, você tem um vetor, o resultado será um? vetor. Então + +0:05:28.820,0:05:35.780 +qual vai ser o primeiro item do meu vetor? Eu não uso pontos porque eu não sou + +0:05:35.780,0:05:40.219 +um físico, na verdade eu sou, mas estamos fazendo álgebra linear, então o que devo + +0:05:40.219,0:05:42.400 +escrever? + +0:05:43.810,0:05:47.509 +Tudo bem, isso já foi transposto porque esses são um vetor linha, então eu apenas + +0:05:47.509,0:05:54.979 +escreva um Eu vou dizer apenas certo então eu tenho um 1 x ok então não há + +0:05:54.979,0:06:02.780 +transposição aqui, sem pontos ao redor e assim por diante. O segundo elemento vai ser? um 2, ok, + +0:06:02.780,0:06:12.970 +xe então até o último qual vai ser? ok, não há ponto, mas com certeza. 
+ +0:06:12.970,0:06:17.810 +Como se alguém estivesse chamando aquele produto escalar em vez de produto vetorial, mas isso é + +0:06:17.810,0:06:20.389 +assume que você usa um tipo diferente de notação. + +0:06:20.389,0:06:26.710 +Tudo bem, então esse será o meu, quantos elementos esse vetor tem? + +0:06:26.710,0:06:35.120 +estou bem, então temos z 1, z 2 e assim por diante até zm e este é meu conjunto final, certo? + +0:06:35.120,0:06:39.710 +meu vetor z, ok? Fantástico. Agora vamos nos concentrar um pouco + +0:06:39.710,0:06:48.590 +sobre o significado dessa coisa aqui ok, outras perguntas até agora? + +0:06:48.590,0:06:54.530 +tudo bem, isso é muito trivial até agora, espero, quero dizer, deixe-me saber se não, ok + +0:06:54.530,0:07:00.140 +então vamos analisar um desses caras aqui, então eu gostaria de descobrir o que é + +0:07:00.140,0:07:10.970 +o significado de escrever um T vezes x certo, então meu a T será meu ai genérico, então + +0:07:10.970,0:07:24.680 +vamos supor, neste caso, quando n é igual a 2, certo? Então, o que é um T x? então um T x está indo + +0:07:24.680,0:07:29.780 +ser igual a quê? então deixe-me desenhar aqui algo para que seja mais fácil para + +0:07:29.780,0:07:37.720 +você entender. Então este vai ser meu a, esses vão ser meu alfa e + +0:07:37.720,0:07:48.620 +então aqui você tem como se fosse meu x e isso vai ser aqui meu xi, então + +0:07:48.620,0:07:57.430 +qual é a saída deste produto aqui? Este aqui. + +0:07:58.210,0:08:08.470 +Diga novamente, desculpe. Uma transposição, vamos chamá-la, digamos que é um vetor linha + +0:08:10.030,0:08:27.880 +Você pode ver? não? qual vai ser o resultado desta operação aqui? não por que? + +0:08:29.410,0:08:34.610 +Isso é como um genérico de muitos a's, certo? há muitos, + +0:08:34.610,0:08:39.169 +este é um daqueles m a's, então estou multiplicando um daqueles a's vezes meu x + +0:08:39.169,0:08:43.310 +certo, vamos supor que existam apenas duas dimensões, então qual será a + +0:08:43.310,0:08:50.660 +saída deste produto escalar? alguem pode me dizer? não não não normal + +0:08:50.660,0:09:02.720 +produto escalar. Espere então você tem um aqui, você tem + +0:09:02.720,0:09:08.510 +aqui, essa parte aqui vai ser 1, vai ser 2, certo? então você tem + +0:09:08.510,0:09:18.550 +aqui x 1 e agora você tem x 2, certo? então, como você expressa este produto escalar aqui? + +0:09:19.980,0:09:26.730 +ok, então estou apenas escrevendo, deixe-me saber se está tudo claro, então vou escrever aqui: + +0:09:28.890,0:09:37.780 +a 1 vezes x 1 mais a 2 vezes x 2, certo? esta é a definição. Claro, certo? Então + +0:09:37.780,0:09:44.830 +longe né? não? ok, sim, pergunta sim. Isso é para ser uma transposição. Uma fila + +0:09:44.830,0:09:49.600 +vetor de coluna de vezes, então vamos assumir que a é um vetor de coluna, então eu tenho tempos de linha + +0:09:49.600,0:10:02.230 +coluna. Então, vamos continuar escrevendo essas coisas aqui, então o que é 1? Como posso + +0:10:02.230,0:10:15.220 +computar 1? Repita? ok ok Então, vou escrever aqui que a 1 está indo + +0:10:15.220,0:10:20.770 +sendo o comprimento do vetor a vezes o cosseno alfa, então o que dizer de x 1? alguém + +0:10:20.770,0:10:30.010 +outro. O mesmo certo? Não, espere, o quê? o que é x 1? mesma coisa, certo, diferente + +0:10:30.010,0:10:37.450 +cartas. então alguém fala algo. Você está seguindo, você está completamente confuso, + +0:10:37.450,0:10:43.210 +você não está tendo ideia, é muito fácil? Não tenho ideia do que está acontecendo aqui. 
Isso é + +0:10:43.210,0:10:53.770 +boa direita? até agora ok? qual vai ser o segundo mandato? Este vai ser x + +0:10:53.770,0:11:00.430 +aqui né? vezes cos xi certo, e então você tinha o segundo termo que vai ser + +0:11:00.430,0:11:13.920 +que? magnitude de um ... grito, não consigo ouvir. Ok, seno de alfa e então + +0:11:17.660,0:11:21.290 +ok obrigado ok + +0:11:23.930,0:11:27.870 +tudo bem, vou apenas juntar aqueles dois caras, então você vai + +0:11:27.870,0:11:40.370 +obter igual magnitude de uma magnitude de vezes de x vezes cosseno alfa cosseno Xi mais + +0:11:40.370,0:11:48.210 +seno alfa e seno, cosseno, seno Xi, desculpe. + +0:11:48.210,0:11:54.210 +o que é o material entre parênteses? tudo bem, então é o cosseno do + +0:11:54.210,0:11:57.240 +diferença dos dois ângulos certo? todo mundo sabe trigonometria aqui, certo? + +0:11:57.240,0:12:05.160 +então, coisas do ensino médio, então este será igual a machado vezes o cosseno + +0:12:05.160,0:12:11.000 +de cos xi menos alfa, certo? ou o contrário, alfa menos xi. + +0:12:11.000,0:12:17.040 +Então o que isso quer dizer? Você pode pensar em cada elemento. até agora está claro? eu + +0:12:17.040,0:12:22.680 +não fez nenhuma mágica, sim, sacuda a cabeça assim para sim, isso para não, isso + +0:12:22.680,0:12:27.810 +porque talvez nada esteja funcionando. Ok, então você pode pensar sempre que quiser + +0:12:27.810,0:12:33.600 +multiplique uma matriz por um vetor que basicamente cada saída desta operação + +0:12:33.600,0:12:39.570 +vai medir, então ok espere um pouco, o que é esse cosseno? quanto é cosseno + +0:12:39.570,0:12:45.780 +de zero? 1. Então, isso significa que se esses dois ângulos, se os dois vetores estiverem alinhados, + +0:12:45.780,0:12:50.460 +o que significa que há um ângulo zero entre os dois vetores, você pode ter o + +0:12:50.460,0:12:57.090 +valor máximo desse elemento, certo? sempre que você tiver o menos o + +0:12:57.090,0:13:03.240 +valor mais negativo? quando eles são opostos, certo? então quando eles estão em + +0:13:03.240,0:13:08.310 +oposição de fase, você obterá a magnitude mais negativa, mas se você + +0:13:08.310,0:13:11.730 +aplique apenas digamos um ReLU, você vai cortar todas as coisas negativas que você está apenas + +0:13:11.730,0:13:16.200 +verificando as correspondências positivas, então a rede neural basicamente apenas talvez + +0:13:16.200,0:13:20.520 +vai descobrir apenas as correspondências positivas, certo? e então novamente quando + +0:13:20.520,0:13:23.620 +você multiplica uma matriz por um vetor de coluna que você + +0:13:23.620,0:13:31.330 +estar realizando um produto escalar lamentável em termos de elemento entre cada coluna, cada linha de + +0:13:31.330,0:13:36.220 +a matriz que representa o seu kernel certo? então, sempre que você tiver um + +0:13:36.220,0:13:40.000 +camada seu kernel vai ser toda a linha da matriz e agora você vê o que + +0:13:40.000,0:13:47.050 +é a projeção dessa entrada nessa coluna, quero dizer, na entrada dessa linha + +0:13:47.050,0:13:52.900 +direito? então cada elemento deste produto vai lhe dizer o alinhamento com + +0:13:52.900,0:13:57.580 +qual a entrada é qual é o alinhamento da entrada em relação ao + +0:13:57.580,0:14:04.600 +linha específica da matriz ok? sim? não? isso deve moldar um pouco mais como + +0:14:04.600,0:14:08.290 +intuição, enquanto usamos essas transformações lineares, elas são como + +0:14:08.290,0:14:13.140 +permitindo que você veja a projeção da entrada em diferentes tipos de + +0:14:13.140,0:14:22.300 +orientações digamos assim. Certo? 
você pode tentar saber extrapolar isso em + +0:14:22.300,0:14:26.140 +dimensões altas, eu acho que a intuição pelo menos eu posso dar a você funciona + +0:14:26.140,0:14:30.580 +definitivamente em duas e três dimensões, em dimensões superiores eu meio que acho + +0:14:30.580,0:14:34.209 +funciona de maneira semelhante. próxima lição que vamos assistir, na verdade somos nós + +0:14:34.209,0:14:38.890 +vamos ver como qual é a distribuição das projeções em um + +0:14:38.890,0:14:43.240 +espaço dimensional mais alto é esse tipo que vai ser tão legal, eu acho. tudo bem então + +0:14:43.240,0:14:49.779 +essa foi a primeira parte de eu penso na aula ah bem, tem mais uma + +0:14:49.779,0:14:54.940 +parte, então na verdade aqui este z aqui também podemos escrever de uma maneira diferente, + +0:14:54.940,0:15:00.130 +talvez isso seja talvez seja conhecido talvez não seja conhecido. quando eu vi pela primeira vez + +0:15:00.130,0:15:05.050 +Eu não sabia, então você sabe que é legal às vezes você ver essas coisas uma vez + +0:15:05.050,0:15:10.450 +de novo talvez então vamos voltar aqui é o mesmo z ali e então você pode expressar + +0:15:10.450,0:15:18.820 +este z como sendo igual ao vetor a 1, neste caso a 1 será o primeiro + +0:15:18.820,0:15:22.830 +coluna da matriz a ok e esta vai ser multiplicada pelo escalar + +0:15:22.830,0:15:29.380 +x1 agora você tem a segunda coluna da matriz, então eu tenho um 2 que é multiplicado + +0:15:29.380,0:15:34.870 +pelo segundo elemento do X à direita até o último qual vai ser? + +0:15:34.870,0:15:44.930 +novamente? Não consigo ouvir se é m ou n? m? assim? ou n? você conhece a linguagem de sinais? que + +0:15:44.930,0:15:52.880 +1? n? certo, você conhece a linguagem de sinais? não? você deve aprender que é bom, sabe? + +0:15:52.880,0:15:59.630 +inclusividade. um certo? então a última coluna vezes seu xn, é claro porque x + +0:15:59.630,0:16:04.100 +tem um tamanho de n, existem n itens, certo? e então basicamente quando você também quando você + +0:16:04.100,0:16:07.940 +faça uma transformação linear ou aplique um operador linear + +0:16:07.940,0:16:12.890 +vão pesar basicamente cada coluna da matriz com o coeficiente que + +0:16:12.890,0:16:16.640 +está em um, você sabe, você tem a primeira coluna vezes o primeiro coeficiente do + +0:16:16.640,0:16:22.130 +vetor, segunda coluna e pelo segundo item, mais a terceira coluna vezes o terceiro + +0:16:22.130,0:16:26.420 +item e, portanto, você pode ver que a saída dessa transformação de DN é uma soma ponderada + +0:16:26.420,0:16:32.180 +das colunas da matriz a ok? então este é um tipo diferente de intuição + +0:16:32.180,0:16:36.620 +às vezes você vê isso como se você quisesse expressar seu sinal, seu + +0:16:36.620,0:16:45.290 +dados são uma combinação de diferentes, você sabe, a composição, isso é uma espécie de + +0:16:45.290,0:16:50.600 +composição linear de sua entrada. tudo bem então essa foi a primeira parte, é o + +0:16:50.600,0:16:57.530 +recapitulação sobre a álgebra linear. uma segunda parte vai ser algo ainda mais + +0:16:57.530,0:17:06.790 +legal eu acho. perguntas até agora? não? fácil? muito fácil? você está ficando entediado? + +0:17:06.790,0:17:10.670 +desculpe ok tudo bem então eu vou acelerar eu acho. 
+ +0:17:10.670,0:17:15.170 +tudo bem, então vamos ver como podemos estender o que as coisas que vimos + +0:17:15.170,0:17:19.040 +agora para as convoluções certas, então talvez as convoluções às vezes sejam um pouco + +0:17:19.040,0:17:28.900 +estranho, vamos ver como podemos fazer uma extensão para convoluções + +0:17:31.720,0:17:38.390 +tudo bem. então, digamos que eu comece com a mesma matriz. então vou ter aqui quatro + +0:17:38.390,0:17:53.660 +linhas e, em seguida, três colunas. Certo. então meus dados têm que ser? se eu tenho, se eu tenho isso + +0:17:53.660,0:17:59.360 +matriz, se eu multiplicar isso por uma coluna, meu vetor de coluna deve ser? do tamanho? três, + +0:17:59.360,0:18:04.250 +obrigada. tudo bem, deixe-me desenhar aqui meu vetor de coluna de tamanho três e este + +0:18:04.250,0:18:08.809 +vai te dar uma saída de tamanho quatro ok fantástico + +0:18:08.809,0:18:15.740 +mas então são seus dados, digamos que você vai ouvir um bom áudio, áudio + +0:18:15.740,0:18:21.260 +arquivo, seus dados têm apenas três amostras de comprimento? quanto tempo vão ficar seus dados? Digamos + +0:18:21.260,0:18:24.590 +você ouvindo uma música que dura três minutos + +0:18:24.590,0:18:32.330 +quantas amostras tem três minutos de áudio? sim, eu acho, o que é + +0:18:32.330,0:18:40.070 +vai ser minha taxa de amostragem? digamos vinte e dois, ok. vinte e dois mil quilos + +0:18:40.070,0:18:46.480 +Hertz, certo? 22 quilohertz então quantas amostras de três minutos de música tem? + +0:18:47.799,0:18:58.010 +Repita? Tem certeza que? é monofônico ou estereofônico? estou brincando. OK, então + +0:18:58.010,0:19:02.650 +você vai multiplicar o número de amostras, o número de segundos, certo? a + +0:19:02.650,0:19:08.660 +número de segundos vezes a taxa de quadros, certo? ok, a frequência neste + +0:19:08.660,0:19:12.620 +caso. de qualquer forma, esse sinal vai ser muito, muito longo, certo? vai ser mantido + +0:19:12.620,0:19:16.940 +indo para baixo, então se eu tiver um vetor que é muito, muito longo, eu tenho que usar um + +0:19:16.940,0:19:23.540 +matriz que vai ficar muito, muito gorda, larga, certo? ok fantástico então este top + +0:19:23.540,0:19:27.080 +continua indo nessa direção, tudo bem, então minha pergunta para você vai + +0:19:27.080,0:19:31.540 +ser o que devo colocar neste local aqui? + +0:19:35.570,0:19:48.020 +o que devo colocar aqui? então, nós nos importamos com coisas que estão mais distantes? + +0:19:48.740,0:19:54.000 +Não porque não? porque nossos dados têm a propriedade de + +0:19:54.000,0:20:00.299 +localidade, fantástica. então o que vou fazer o que vou colocar aqui? Uma grande + +0:20:00.299,0:20:06.690 +zero, certo, fantástico, bom trabalho, ok então colocamos um zero aqui e então qual é o outro + +0:20:06.690,0:20:09.809 +propriedade, então deixe-me começar a desenhar essas coisas novamente para que eu possa ter meu + +0:20:09.809,0:20:16.890 +kernel de tamanho três e aqui estou meus dados que serão muito longos + +0:20:16.890,0:20:31.169 +direito? e assim por diante. Não posso desenhar, espere. Eu não consigo ver Tudo bem, então aqui há zero, então vamos dizer o que + +0:20:31.169,0:20:36.960 +é a outra propriedade que meus dados naturais têm? estacionariedade, que + +0:20:36.960,0:20:45.059 +meios? o padrão que você espera encontrar pode ser uma espécie de repetição + +0:20:45.059,0:20:49.140 +e de novo certo? e então se eu tiver aqui meus três valores aqui talvez eu + +0:20:49.140,0:20:53.789 +gostaria de reutilizá-los uma e outra vez, certo? 
e então, se esses três valores permitirem + +0:20:53.789,0:21:00.779 +eu mude a cor talvez para que você possa ver que há a mesma coisa. então eu tenho + +0:21:00.779,0:21:07.230 +três valores aqui e então vou usar esses três mesmos valores em uma etapa + +0:21:07.230,0:21:16.770 +mais, certo? e eu continuo descendo e continuo assim + +0:21:16.770,0:21:22.799 +direito. Então, o que devo colocar aqui no fundo? O que devo colocar aqui? um zero, + +0:21:22.799,0:21:28.529 +direito? por que isso por que isso? devido à localidade dos dados. direito? tão colocando + +0:21:28.529,0:21:36.350 +zeros ao redor é chamado também é chamado de preenchimento, mas neste caso é chamado + +0:21:36.350,0:21:44.850 +esparsidade, certo? então isso é como esparsidade e, em seguida, a replicação dessa coisa + +0:21:44.850,0:21:52.730 +isso é repetidamente chamado de estacionariedade era a propriedade + +0:21:52.730,0:21:57.430 +do sinal, isso é chamado de divisão de peso. sim? + +0:22:04.330,0:22:10.090 +ok fantástico tudo bem então quantos valores nós temos agora? quantos + +0:22:10.090,0:22:13.530 +parâmetros que tenho do lado direito? + +0:22:15.150,0:22:22.750 +bem, então temos três parâmetros. No lado esquerdo, ao invés, nós + +0:22:22.750,0:22:28.900 +teve? Doze, certo? então o lado direito vai ser, o lado direito vai ser + +0:22:28.900,0:22:36.810 +trabalhar em tudo? você tem três parâmetros de um lado do outro lado você tem 12. + +0:22:36.810,0:22:41.860 +OK? isso é bom, usando localidade e qualquer coisa diferente de dispersão + +0:22:41.860,0:22:45.610 +e compartilhamento de parâmetros, mas acabamos com apenas três parâmetros, não é + +0:22:45.610,0:22:51.190 +isso é muito restritivo? como podemos ter vários parâmetros múltiplos? o que é + +0:22:51.190,0:22:56.290 +faltando aqui no quadro geral? existem vários canais, certo? então isso é + +0:22:56.290,0:23:01.570 +apenas uma camada aqui e então você tem essas coisas saindo do + +0:23:01.570,0:23:08.230 +placa aqui, então você tem o primeiro kernel aqui, agora você tem algum + +0:23:08.230,0:23:20.500 +segundo kernel, digamos este, e eu tenho o último aqui, certo? e então você tem + +0:23:20.500,0:23:24.760 +cada plano dessas métricas contendo apenas um kernel que é + +0:23:24.760,0:23:36.190 +replicado várias vezes. Quem sabe o nome desta matriz? então isso vai + +0:23:36.190,0:23:40.650 +ser chamada de matriz Toeplitz + +0:23:43.419,0:23:48.219 +Certo? então qual é a principal característica dessas matrizes Toeplitz? qual é o grande + +0:23:48.219,0:24:01.479 +grande coisa que você não notará? é uma matriz esparsa. ok ok o que vai + +0:24:01.479,0:24:16.149 +estar aqui, esse primeiro item aqui? qual é o conteúdo do primeiro cara? sim? tão + +0:24:16.149,0:24:21.669 +este aqui vai ser a extensão da minha transformação linear que foi, + +0:24:21.669,0:24:26.169 +você sabe, eu tenho um sinal que é maior do que três amostras, portanto, eu tenho que + +0:24:26.169,0:24:32.229 +tornar esta matriz mais gorda, a segunda parte será dada que eu não me importo + +0:24:32.229,0:24:36.129 +que coisas, tipo, coisas que estão aqui embaixo, não me importo com coisas que estão + +0:24:36.129,0:24:40.299 +aqui, se eu olhar para os pontos que estão aqui em cima, vou colocar um grande 0 aqui + +0:24:40.299,0:24:45.070 +para que tudo que está aqui embaixo seja limpo, né? 
e + +0:24:45.070,0:24:49.899 +finalmente vou usar o mesmo kernel repetidamente porque + +0:24:49.899,0:24:55.839 +suponho que meus dados estão estacionários e, portanto, suponho que padrões semelhantes + +0:24:55.839,0:24:58.929 +vão acontecer uma e outra vez, portanto, vou usar este + +0:24:58.929,0:25:03.429 +aquele que está escrito aqui: divisão de peso. + +0:25:03.429,0:25:08.019 +Finalmente, dado que este fornece apenas três parâmetros para + +0:25:08.019,0:25:12.940 +trabalhar com vou usar várias camadas para ter diferentes, sabe, + +0:25:12.940,0:25:17.739 +canais. Portanto, este é um kernel. antes, um kernel era toda a linha de + +0:25:17.739,0:25:21.849 +a matriz, ok? então, quando você tem uma camada totalmente conectada, a única diferença + +0:25:21.849,0:25:25.179 +entre uma camada totalmente conectada e uma convolução é que você tem todo o + +0:25:25.179,0:25:37.019 +linha da matriz. Então, o que vai estar neste primeiro item aqui? qualquer um? + +0:25:38.480,0:25:43.230 +então o kernel verde, vamos chamar o kernel verde de apenas 1, deixe-me realmente + +0:25:43.230,0:25:55.200 +faça com que ela brilhe em verde porque é uma semente verde. Então você tem 1 vezes ... o quê? Está + +0:25:55.200,0:25:58.890 +vai ser do número um ao número três certo? e então o segundo item é + +0:25:58.890,0:26:05.640 +vai ser o mesmo cara aqui um 1 e então você vai ter o x mudado por + +0:26:05.640,0:26:20.039 +um e assim por diante certo? faz sentido? sim e então nós teremos este está indo + +0:26:20.039,0:26:23.970 +para ser a saída verde, então você terá a saída azul uma camada chegando + +0:26:23.970,0:26:26.730 +para fora e então você tem o outro vindo o vermelho. + +0:26:26.730,0:26:32.929 +mesmo uma camada de fora. OK? A experiência com o iPad foi legal? + +0:26:32.929,0:26:40.520 +sim? não? Eu gostei. OK. Outras perguntas? + +0:26:41.029,0:26:49.049 +Repita? o círculo azul este aqui? é um grande zero + +0:26:49.049,0:26:53.480 +essa é a dispersão que o mesmo está aqui. + +0:26:53.899,0:26:58.520 +sim? não? Yeah, yeah. + +0:27:01.530,0:27:16.200 +então aqui eu coloquei muitos zeros aqui dentro então matei todos os + +0:27:16.200,0:27:21.000 +valores que estão longe da pequena parte e depois repito os mesmos três + +0:27:21.000,0:27:24.480 +valores repetidamente porque espero encontrar o mesmo padrão em + +0:27:24.480,0:27:33.360 +diferentes regiões deste, este grande grande sinal que eu tenho. Este aqui? Então eu disse + +0:27:33.360,0:27:36.840 +que neste caso terei apenas três valores, certo? e começamos com + +0:27:36.840,0:27:41.040 +12 valores e acabei com 3, que é realmente muito pouco, então se eu quiser + +0:27:41.040,0:27:44.760 +tem, digamos, 6 valores, então se eu quiser ter seis valores e posso ter meu + +0:27:44.760,0:27:49.530 +segundo 3 em um plano diferente e eu realizo a mesma operação sempre que você + +0:27:49.530,0:27:55.110 +multiplique esta matriz por um vetor e você realiza uma convolução para que ela apenas diga + +0:27:55.110,0:27:58.770 +você que uma convolução é apenas uma multiplicação de matriz com muitos zeros + +0:27:58.770,0:28:09.600 +é isso. Sim, então eles vão ter este aqui, então você tem um segundo, + +0:28:09.600,0:28:14.790 +então você tem um terceiro, então você tem três versões da entrada. Tudo bem. Então, para o + +0:28:14.790,0:28:18.000 +segunda parte da aula, vou mostrar a vocês algumas coisas mais interativas + +0:28:18.000,0:28:27.810 +por favor, participe da segunda parte também, certo? 
então vamos tentar então eu tenho + +0:28:27.810,0:28:37.950 +reformulei a marca Eu mudei a marca do site e agora o + +0:28:37.950,0:28:42.450 +ambiente será chamado de pDL, portanto, o aprendizado profundo PyTorch em vez de + +0:28:42.450,0:28:49.980 +minicurso de aprendizagem profunda, era muito longo. Então, deixe-me começar executando este + +0:28:49.980,0:28:52.460 +tão + +0:28:55.320,0:29:04.960 +Aprendizado profundo do PyTorch para que possamos fazer apenas Conda ativar ativar o PyTorch profundo + +0:29:04.960,0:29:12.940 +aprendizagem (pDL) e a seguir vamos abrir o caderno, o caderno de Júpiter. Tudo bem, então agora você está + +0:29:12.940,0:29:18.520 +vai estar assistindo, repassando a escuta de kernels. Então eu te mostrei um + +0:29:18.520,0:29:22.630 +convolução no papel bem no meu tablet agora você vai ouvir + +0:29:22.630,0:29:25.990 +convolução também pode, de modo que você pode realmente apreciar o que essas convoluções + +0:29:25.990,0:29:35.919 +estão. Aqui dissemos, o novo kernel certo que é chamado pDL PyTorch deep + +0:29:35.919,0:29:42.520 +aprendendo, então você notará o mesmo tipo de procedimento se atualizar + +0:29:42.520,0:29:49.690 +Seu sistema. Tudo bem, então, neste caso, podemos ler o topo aqui, então deixe-me esconder o + +0:29:49.690,0:29:52.200 +topo aqui. + +0:29:52.890,0:29:56.950 +Tudo bem, considerando a suposição de localidade, estacionariedade e + +0:29:56.950,0:30:00.280 +composicionalidade, podemos reduzir a quantidade de computação para uma matriz + +0:30:00.280,0:30:05.169 +multiplicação de vetores usando uma matriz de Toeplitz esparsa porque local porque + +0:30:05.169,0:30:09.850 +esquema estacionário, desta forma, podemos simplesmente acabar redescobrindo o + +0:30:09.850,0:30:14.980 +operador de convolução, certo? além disso, também podemos lembrar que um produto escalar é + +0:30:14.980,0:30:19.150 +uma distância cosseno simplesmente normalizada que nos diz o alinhamento de dois + +0:30:19.150,0:30:21.850 +vetores, mais especificamente, calculamos o + +0:30:21.850,0:30:26.320 +magnitude da projeção ortogonal de dois vetores um sobre o outro e vice + +0:30:26.320,0:30:29.590 +versa. Então, vamos descobrir agora como tudo isso + +0:30:29.590,0:30:34.270 +pode fazer sentido usando nossos ouvidos, certo? então vou importar uma biblioteca que + +0:30:34.270,0:30:39.880 +professor aqui da NYU feito e aqui vou carregar meus dados de áudio e + +0:30:39.880,0:30:43.600 +Eu vou ter isso no meu x, e então minha taxa de amostragem vai ser + +0:30:43.600,0:30:48.570 +na outra variável. Então, aqui vou apenas mostrar que terei cerca de 70.000 + +0:30:48.570,0:30:54.430 +amostras neste caso porque eu tenho uma taxa de amostragem de 22 quilo Hertz e então + +0:30:54.430,0:31:01.720 +meu tempo total será de três segundos, ok, então três segundos vezes 22 você começa + +0:31:01.720,0:31:06.910 +que? então não é 180 que você estava dizendo, era cento e oitenta, era três, certo? + +0:31:06.910,0:31:11.380 +Oh, foram três minutos, você está certo, são três segundos, então você realmente está + +0:31:11.380,0:31:16.540 +corrigir meu mal. Então, são três segundos, então vezes 22 quilo Hertz você tem 70 + +0:31:16.540,0:31:22.390 +cerca de 70.000 amostras. Aqui, vou importar algumas bibliotecas para + +0:31:22.390,0:31:28.180 +mostrarei algo e então mostrarei o primeiro gráfico, então este é + +0:31:28.180,0:31:37.270 +o sinal de áudio que importei agora, como está? ondulado, ok legal. + +0:31:37.270,0:31:50.680 +Você pode me dizer como isso soa? 
Aluno: "aaaaaaaaaaahhhhhh". Esse foi um bom palpite. O palpite era 'aaah'. Sim, você não pode dizer exatamente + +0:31:50.680,0:31:55.450 +qual é o conteúdo, certo? a partir deste diagrama porque a amplitude de, + +0:31:55.450,0:32:01.240 +o eixo y aqui vai mostrar apenas a amplitude. posso + +0:32:01.240,0:32:05.530 +apague a luz? está tudo bem? ou ... tem certeza? ok obrigado, + +0:32:05.530,0:32:16.090 +Eu realmente não gosto dessas luzes. OK. Boa noite. Oh, vê como isso é bom? Certo + +0:32:16.090,0:32:19.930 +legal. Tudo bem, então você não pode dizer nada aqui, certo? + +0:32:19.930,0:32:26.580 +você não pode dizer o que é o que é o som, certo? então como podemos descobrir + +0:32:26.580,0:32:31.870 +qual é o som aqui dentro? então, por exemplo, posso mostrar a você uma transcrição de + +0:32:31.870,0:32:37.660 +o som e, na verdade, deixe-me realmente forçá-los em seu + +0:32:37.660,0:32:44.810 +sua cabeça, certo? então você vai ter ... espere, não funcionou. * Ouve-se um som * + +0:32:44.810,0:32:50.610 +tudo bem, agora nós realmente ouvimos, ok, agora você pode realmente ver * imita o som * + +0:32:50.610,0:32:56.400 +você sabe que pode imaginar um pouco, mas tudo bem e daí + +0:32:56.400,0:33:00.830 +notas que tocamos lá? como posso descobrir quais são as notas que + +0:33:00.830,0:33:05.550 +eles estão dentro? então vou mostrar este, já que é um pouco mais claro + +0:33:05.550,0:33:13.820 +Eu posso ver seus rostos. Quantos de vocês não podem ler isso? Oh, ai ... + +0:33:13.820,0:33:20.960 +Ok, deixe-me ver se posso pedir ajuda. + +0:33:23.620,0:33:26.620 +Talvez alguém possa nos ajudar aqui. + +0:33:29.400,0:33:32.480 +OK. Vamos ver. + +0:33:40.140,0:33:42.140 +Ei, ei Alf! Oh, oi Alf! + +0:33:42.880,0:33:45.420 +Como está indo? Sim, estou bem, obrigado. + +0:33:45.420,0:33:47.040 +Óculos bonitos lá! Oh, obrigado pelos óculos. + +0:33:47.040,0:33:49.040 +Oh, belo suéter! você também! Belo suéter! + +0:33:49.040,0:33:51.040 +Oh, estamos usando o mesmo suéter! + +0:33:51.040,0:33:54.480 +Você pode nos ajudar? Eles não sabem ler o + +0:33:54.480,0:33:57.120 +Oh, a conexão ... Que diabos! + +0:33:57.120,0:34:00.380 +Eles não podem ler a partitura! Você pode nos ajudar, por favor? + +0:34:00.380,0:34:02.380 +Tudo bem! Deixe-me tentar ajudá-lo. + +0:34:02.380,0:34:04.380 +Obrigado! Deixe-me trocar a câmera. + +0:34:04.380,0:34:06.380 +Tudo bem. Por favor faça. + +0:34:06.380,0:34:08.380 +Então, aqui podemos ir como ... + +0:34:08.380,0:34:10.860 +e ouça primeiro como tudo soa. + +0:34:10.860,0:34:14.320 +Então, vai ser assim. + +0:34:14.440,0:34:23.380 +Quão legal é isso? * alunos aplaudem * + +0:34:23.380,0:34:27.990 +Obrigada. Demorou quatro lições para você me aplaudir. Então agora… + +0:34:27.990,0:34:32.360 +Isso é muito legal da sua parte. Vamos continuar. + +0:34:32.360,0:34:36.320 +A ♭, então temos um E ♭, e então um A ♭. + +0:34:36.320,0:34:40.380 +A diferença entre o primeiro A ♭ e o outro em frequências + +0:34:40.380,0:34:46.540 +é que o primeiro A ♭ terá o dobro da frequência do outro. + +0:34:46.540,0:34:51.400 +E em vez disso, no meio, temos o 5º. Vamos descobrir qual é a frequência disso. + +0:34:51.400,0:34:55.320 +E então, vamos para um B ♭, aqui. + +0:34:55.320,0:34:57.680 +No lado esquerdo, em vez disso, temos o acompanhamento, + +0:34:57.680,0:35:01.220 +e então teremos um A ♭ e B ♭ + +0:35:01.220,0:35:05.620 +e então B ♭ e E ♭. + +0:35:05.680,0:35:10.900 +Então, se juntarmos todos, vamos conseguir este. + +0:35:11.300,0:35:14.020 +Tudo bem? Simples, não? Sim! 
Obrigado! + +0:35:14.020,0:35:17.320 +Bye Bye! Tchaaaau! + +0:35:18.820,0:35:23.540 +Ver? Demorou um dia inteiro para se preparar ... + +0:35:23.540,0:35:27.480 +Eu estava tão nervoso antes de vir aqui ... + +0:35:27.480,0:35:32.140 +Eu não sabia se realmente teria funcionado ... Ambos, tablet e este. + +0:35:32.140,0:35:35.060 +Estou tão feliz! Agora posso realmente dormir, mais tarde. + +0:35:35.060,0:35:39.280 +De qualquer forma, isso foi como no primeiro + +0:35:39.290,0:35:43.280 +parte você vai ter a primeira nota, há A ♭ que você tem um B ♭ + +0:35:43.280,0:35:51.550 +A ♭ e B ♭ para que você * recrie o som * e oe a diferença entre o primeiro tom e + +0:35:51.550,0:35:57.440 +é uma oitava, portanto a primeira frequência será o dobro da segunda + +0:35:57.440,0:36:02.870 +frequência. OK? então, sempre que vamos observar a forma de onda, um sinal + +0:36:02.870,0:36:07.730 +tem um mais curto igual a metade do período do outro, certo? + +0:36:07.730,0:36:12.410 +especialmente o A ♭ no topo terá um período que é a metade de + +0:36:12.410,0:36:20.000 +o período do A ♭ no inferior, certo, então você * recria o som * ok, se você for a metade de + +0:36:20.000,0:36:27.290 +este que você obteve * soa * certo, ok ok, então, como realmente tiramos essas notas de + +0:36:27.290,0:36:33.770 +esse espectro, da forma de onda? quem pode me dizer como posso extrair estes + +0:36:33.770,0:36:40.790 +arremessos, essas frequências do outro sinal? qualquer palpite? ok transformada de Fourier + +0:36:40.790,0:36:45.530 +que eu acho que é um bom palpite. O que acontece se eu executar agora um + +0:36:45.530,0:36:50.660 +Transformada de Fourier desse sinal? alguém pode realmente me responder? você não pode aumentar + +0:36:50.660,0:36:54.100 +sua mão porque eu não vejo, apenas grite. + +0:36:55.120,0:36:59.690 +Então, se você basicamente realizar a transformada de Fourier de todo o sinal, você + +0:36:59.690,0:37:06.470 +vai ouvir * faz som * como todas as notas juntas * faz som *? todos juntos, certo, mas então você não pode + +0:37:06.470,0:37:13.260 +descobrir qual pitch está tocando, onde ou quando, neste caso, certo. + +0:37:13.260,0:37:18.210 +Ha! Então, precisamos de uma espécie de transformada de Fourier que é localizada e, portanto, um + +0:37:18.210,0:37:23.190 +transformada de Fourier localizada no tempo ou no espaço, dependendo de qualquer domínio + +0:37:23.190,0:37:27.390 +você está usando seu espectrograma denominado. certo, e assim por diante eu vou ser + +0:37:27.390,0:37:30.000 +imprimindo para você o espectrograma, desculpe. + +0:37:30.000,0:37:34.380 +e estarei imprimindo aqui o espectrograma deste aqui. E então aqui + +0:37:34.380,0:37:39.960 +você pode comparar os dois, certo, na primeira parte aqui deste lado aqui você está + +0:37:39.960,0:37:47.970 +vai ter esse pico aqui em 1600 que é o * faz som *, o tom é realmente mais alto. * faz som * lá vamos nós. Agora + +0:37:47.970,0:37:56.640 +você tem um segundo que é este pico aqui * faz som * e então este * faz som *. Você pode ver esse pico, certo? E você + +0:37:56.640,0:38:04.260 +veja este pico, tudo bem, então esses picos serão as notas reais que toco + +0:38:04.260,0:38:07.560 +com a mão direita, então vamos colocá-los juntos e + +0:38:07.560,0:38:13.140 +Vou ter aqui as frequências. Então eu tenho 1600, 1200 e 800, você pode ver aqui? + +0:38:13.140,0:38:20.550 +Eu tenho 1600, 800, por que um é o dobro do outro? 
porque eles são uma oitava + +0:38:20.550,0:38:27.540 +separados, então se isso é * faz som * isso vai ser * faz som * certo e este é um quinto que também + +0:38:27.540,0:38:32.280 +tem um bom intervalo. Então, deixe-me gerar esses sinais aqui e depois + +0:38:32.280,0:38:36.720 +tem que ser concatená-los todos, então vou jogar os dois. o primeiro + +0:38:36.720,0:38:42.109 +um é na verdade o áudio original, mas + +0:38:42.109,0:38:45.470 +deixe-me tentar de novo, enquanto se eu jogar + +0:38:45.680,0:38:54.860 +o segundo, a concatenação, sim, está um pouco alto, agora não consigo nem + +0:38:54.860,0:39:01.720 +reduza o volume. Oh, eu posso reduzir isso aqui. Demais. Ok, deixe-me ir de novo. Tudo + +0:39:02.020,0:39:05.350 +direito. Então, esta é a concatenação desses + +0:39:05.350,0:39:12.380 +quatro pitches diferentes, então adivinhe o que faremos a seguir? então como posso + +0:39:12.380,0:39:20.420 +extrair todas as notas que posso ouvir em uma peça específica? então vamos dizer + +0:39:20.420,0:39:29.360 +você joga uma partitura completa e eu gostaria de saber qual campo é jogado e a que horas. Do + +0:39:29.360,0:39:36.230 +que? então a resposta foi convolução, apenas para a gravação, então estou pedindo convolução + +0:39:36.230,0:39:43.370 +sobre o que? sem convolução do espectrograma, então você tem convolução de sua entrada + +0:39:43.370,0:39:49.460 +sinalizar com o quê? com algum tipo diferente de pitches, os quais irão + +0:39:49.460,0:39:59.150 +sua vez? digamos que você não veja o espectro, porque digamos que eu só vou + +0:39:59.150,0:40:03.770 +tocar qualquer tipo de música, então eu gostaria de saber todas as notas possíveis que são + +0:40:03.770,0:40:06.220 +aí o que você faria? + +0:40:06.220,0:40:13.430 +você não conhece todos os arremessos, como você tentaria? certo, então em que estão todos os + +0:40:13.430,0:40:20.900 +tons que você pode querer usar, se estiver tocando piano? todas as chaves de + +0:40:20.900,0:40:24.530 +o piano, certo? então, se eu tocar um concerto com o piano, eu quero + +0:40:24.530,0:40:28.010 +tenho um pedaço de áudio para cada uma dessas teclas e vou estar executando + +0:40:28.010,0:40:32.690 +circunvoluções de toda a minha peça com as chaves antigas, certo? e portanto você é + +0:40:32.690,0:40:36.470 +veremos picos que são o alinhamento da similaridade do cosseno + +0:40:36.470,0:40:41.349 +sempre que você obtiver basicamente o áudio correspondente ao seu kernel específico. + +0:40:41.349,0:40:46.989 +então vou fazer isso, mas com esses tons específicos, na verdade, extraio + +0:40:46.989,0:40:52.929 +aqui. Então, aqui vou mostrar primeiro como os dois espectrogramas se parecem + +0:40:52.929,0:40:57.699 +como se o lado esquerdo fosse o espectrograma do meu sinal real X de t + +0:40:57.699,0:41:01.630 +e no lado direito eu tenho apenas o espectrograma desta concatenação de + +0:41:01.630,0:41:10.749 +meus argumentos, então aqui você pode ver claramente que isso * faz som *, mas aqui, em primeiro lugar, o que + +0:41:10.749,0:41:14.429 +são essas barras aqui, essas barras verticais? + +0:41:15.269,0:41:20.589 +você está seguindo, certo? Eu não posso te ver, tenho que realmente responder. O que são esses vermelhos + +0:41:20.589,0:41:24.160 +barras aqui, barras verticais? Agora, o horizontal eu já falei pra vocês, né? + +0:41:24.160,0:41:34.390 +* faz som * e a vertical? o que é? problemas de amostragem, certo, transições. 
Então + +0:41:34.390,0:41:39.099 +sempre que você tem o * faz som *, você na verdade tem uma forma de onda branca, uma forma de onda e depois + +0:41:39.099,0:41:44.019 +o outro, uma forma de onda tem que parar para que não seja mais periódica e sempre + +0:41:44.019,0:41:47.609 +você faz uma transformada de Fourier de um sinal não periódico, você sabe uma porcaria. + +0:41:47.609,0:41:53.589 +É por isso que sempre que você consegue a junção entre eles o * faz o som * o salto + +0:41:53.589,0:41:57.729 +aqui você vai ter este pico porque você pode + +0:41:57.729,0:42:01.749 +pensar no salto é como ter uma frequência muito alta né? Porque + +0:42:01.749,0:42:05.469 +é como um delta, então você realmente consegue todas as frequências, é por isso que você + +0:42:05.469,0:42:12.549 +obtenha todas as frequências aqui. Estrondo. OK? faz sentido até agora? tipo de? tudo bem. + +0:42:12.549,0:42:18.720 +Esta é a versão limpa * faz som * Não consigo nem assinar e o que + +0:42:18.720,0:42:25.800 +lado esquerdo aqui? por que está do lado esquerdo todo vermelho aí embaixo? OK + +0:42:25.800,0:42:31.440 +sim, você sabia. então o lado esquerdo do lado esquerdo do cabo é o que eu mostro a vocês no + +0:42:31.440,0:42:37.320 +lado esquerdo inferior. Ok, então deixe-me terminar esta aula e depois deixo você ir. Então + +0:42:37.320,0:42:42.990 +aqui vou te mostrar primeiro todos os kernels, você pode dizer agora + +0:42:42.990,0:42:48.090 +o vermelho vai ser o primeiro pedaço do meu sinal, o real + +0:42:48.090,0:42:53.280 +um e então você pode ver que o primeiro tom tem a mesma frequência, + +0:42:53.280,0:42:58.460 +você pode ver? Portanto, o * faz som * tem o mesmo + +0:42:58.460,0:43:04.230 +delta t o mesmo intervalo, período, você pode ver? você não pode acenar com a cabeça + +0:43:04.230,0:43:08.700 +cabeça porque de novo eu não vejo você, tem que me responder. Você pode ver ou não? OK, + +0:43:08.700,0:43:13.050 +obrigado, fantástico. E então este é o terceiro, você pode ver que + +0:43:13.050,0:43:17.130 +começa aqui no período e termina aqui, se você subir aqui você está + +0:43:17.130,0:43:20.520 +veremos exatamente que havia dois desses caras, certo, então isso é + +0:43:20.520,0:43:24.619 +como você pode ver isso é como o dobro da frequência do abaixo. + +0:43:24.619,0:43:30.089 +Finalmente, irei realizar a convolução desses quatro kernels com + +0:43:30.089,0:43:36.839 +meu sinal de entrada, e é assim que parecemos, ok, então o primeiro kernel tem um alto + +0:43:36.839,0:43:42.150 +coincidir na primeira parte do placar. Então, entre zero e zero cinco + +0:43:42.150,0:43:46.830 +segundos. O segundo começa logo após o primeiro, então você tem o + +0:43:46.830,0:43:50.820 +terceiro começando em zero três eu acho e então você tem o último + +0:43:50.820,0:43:56.940 +começando do zero seis, certo? Então adivinhe? Eu vou fazer você ouvir + +0:43:56.940,0:44:02.520 +convoluções agora, você está animado? ok, você realmente está respondendo agora, ótimo! + +0:44:02.520,0:44:07.440 +Tudo bem e esses são os resultados. Deixe-me baixar um pouco os volumes + +0:44:07.440,0:44:14.900 +caso contrário, você reclamará, sim, eu não posso diminuir o, + +0:44:16.800,0:44:23.690 +ok, então o primeiro, vamos tentar novamente + +0:44:28.880,0:44:37.110 +* toca som * não é legal? Você escuta as convoluções. 
Ok, então basicamente isso era + +0:44:37.110,0:44:41.280 +quase isso, tenho mais um slide porque senti que houve alguma confusão no último + +0:44:41.280,0:44:45.360 +tempo sobre qual é a diferente dimensionalidade de diferentes tipos de + +0:44:45.360,0:44:50.850 +sinais, então estou realmente recomendando ir e fazer a aula da Joan Bruna que + +0:44:50.850,0:44:56.580 +é matemática para aprendizado profundo e eu roubei uma das pequenas coisas que ele era + +0:44:56.580,0:45:04.320 +ensinando, acabei de colocar um slide aqui para você. Portanto, este slide é o + +0:45:04.320,0:45:13.140 +Segue. Portanto, temos a camada de entrada ou as amostras que fornecemos em + +0:45:13.140,0:45:18.420 +esta rede e então normalmente nossa última vez eu defino isso eu tenho este X encaracolado + +0:45:18.420,0:45:23.010 +que será feito daqueles xi, que são todas as minhas amostras de dados corretas + +0:45:23.010,0:45:29.100 +e geralmente tenho m amostras de dados, então meu i vai de m = 1 para n, ok, então + +0:45:29.100,0:45:34.080 +está claro? em que esta notação está clara? porque é um pouco mais formal, normalmente sou um pouco + +0:45:34.080,0:45:39.030 +menos formal, mas, de alguma forma, alguém estava se sentindo um pouco desconfortável. + +0:45:39.030,0:45:46.470 +acho que este é apenas meus exemplos de entrada, mas também podemos ver este + +0:45:46.470,0:45:52.950 +é este X encaracolado que é minha entrada definida como o conjunto de todas essas funções como xi + +0:45:52.950,0:45:59.850 +que estão mapeando meu Omega capital Omega, que é meu domínio, para um RC que é + +0:45:59.850,0:46:06.150 +serão basicamente meus canais desse exemplo específico e aqui estou + +0:46:06.150,0:46:14.040 +mapearei aqueles Omega minúsculos para esses xi's de ômega, então vamos ver como + +0:46:14.040,0:46:17.550 +estes são diferentes da notação anterior. Então eu vou te dar agora três + +0:46:17.550,0:46:21.300 +exemplos e você deve ser capaz de dizer agora qual é a dimensionalidade e + +0:46:21.300,0:46:24.820 +neste exemplo. Então, o primeiro, digamos, + +0:46:24.820,0:46:29.560 +Eu gostei do que mostrei a você agora, apenas uma parte divertida de você + +0:46:29.560,0:46:34.870 +sinal de áudio, então meu Omega será apenas amostras como a amostra número um + +0:46:34.870,0:46:39.550 +amostra número dois como o índice, certo? então você tem índice um, índice dois, índice + +0:46:39.550,0:46:44.740 +até esses 70.000 seja o que for que acabamos de ver agora, ok? e o último valor é + +0:46:44.740,0:46:49.330 +vai ser o T, T maiúsculo, que é o número de segundos dividido pelo delta T + +0:46:49.330,0:46:53.200 +que seria o 1 sobre a frequência e isso vai ser um + +0:46:53.200,0:46:57.250 +subconjunto de n certo? então este é um número discreto de amostras, porque você tem um + +0:46:57.250,0:47:03.220 +computador, você sempre tem amostras discretas. Portanto, estes são meus dados de entrada, e + +0:47:03.220,0:47:09.310 +então que tal a imagem desta função? então quando eu pergunto o que é + +0:47:09.310,0:47:13.330 +dimensionalidade deste tipo de sinal, você deve responder que é um + +0:47:13.330,0:47:19.360 +sinal unidimensional porque a potência desses n aqui é 1 ok? 
então isso é como + +0:47:19.360,0:47:25.000 +um sinal unidimensional, embora você possa ter o tempo total e o + +0:47:25.000,0:47:28.900 +1 lá estava um intervalo de amostragem, do lado direito você tem o número + +0:47:28.900,0:47:33.340 +de canais pode ser 1 se você tiver um sinal mono ou 2 se tiver um + +0:47:33.340,0:47:38.230 +estereofônico, então você tem mono aí, você tem 2 para estereofônico ou o que é 5 + +0:47:38.230,0:47:44.650 +mais 1? esse é o Dolby como 5.1 não é legal? tudo bem então este ainda é um + +0:47:44.650,0:47:48.430 +sinal dimensional que pode ter vários canais, mas ainda é + +0:47:48.430,0:47:53.020 +sinal unidimensional porque há apenas uma variável em execução lá, ok? é isso + +0:47:53.020,0:47:58.270 +de alguma forma melhor do que da última vez? sim? não? Melhor? obrigada. + +0:47:58.270,0:48:03.070 +vamos agradecer a Joan. Tudo bem, segundo exemplo que tenho aqui, meu Omega vai + +0:48:03.070,0:48:07.630 +ser o produto cartesiano desses dois conjuntos, o primeiro conjunto vai + +0:48:07.630,0:48:13.150 +de 1 em altura, e também esta discreta e a outra vai + +0:48:13.150,0:48:17.020 +indo de 1 para a largura, então estes são os pixels reais, e este + +0:48:17.020,0:48:21.690 +é um sinal bidimensional porque tenho 2 graus de liberdade no meu + +0:48:21.690,0:48:28.690 +domínio. Quais são os canais possíveis que temos? Então, aqui, os canais possíveis que + +0:48:28.690,0:48:32.440 +são muito comuns são os seguintes: Assim, você pode ter uma imagem em tons de cinza e + +0:48:32.440,0:48:36.849 +portanto, você apenas produz um valor escalar ou obtém o + +0:48:36.849,0:48:42.789 +arco-íris ali, a cor e, portanto, você fica como o meu X que é uma função de + +0:48:42.789,0:48:49.299 +as coordenadas Omega 1 Omega 2 em que cada ponto é + +0:48:49.299,0:48:52.690 +representado por um vetor de três componentes que será o R + +0:48:52.690,0:48:59.109 +componente do ponto Omega 1 Omega 2, o componente G do Omega 1 Omega 2, e + +0:48:59.109,0:49:04.299 +o componente azul do Omega 1 Omega 2. Então, novamente, vocês podem pensar nisso como um + +0:49:04.299,0:49:08.980 +ponto de big big data ou você pode pensar nisso como um mapeamento de função + +0:49:08.980,0:49:12.640 +domínio dimensional que é um domínio bidimensional para um domínio tridimensional + +0:49:12.640,0:49:18.490 +domínio dimensional, certo? finalmente os vinte, quem sabe o nome do + +0:49:18.490,0:49:24.400 +imagem de vinte canais? sim, esta é uma imagem hiperespectral. É muito comum + +0:49:24.400,0:49:31.869 +tem 20 bandas. Finalmente, quem pode adivinhar este? + +0:49:31.869,0:49:40.829 +se meu domínio for r4 x r4, o que pode ser? + +0:49:41.099,0:49:50.589 +Não, não, isso discreto né? Este é o r4, então nem é computador. Ha! + +0:49:50.589,0:49:56.740 +quem disse algo ai? Ouvi! Sim, está correto, então este é o espaço-tempo, o que + +0:49:56.740,0:50:00.849 +é o segundo? Sim, qual impulso? Tem um especial + +0:50:00.849,0:50:07.779 +nome. É chamado de quatro momentos porque tem uma informação temporal como + +0:50:07.779,0:50:12.160 +bem, certo? E então qual será a minha possível imagem + +0:50:12.160,0:50:24.790 +da função X? digamos que c é igual a 1. O que é? você sabe? + +0:50:24.790,0:50:29.630 +Então esse poderia ser, por exemplo, o hamiltoniano do sistema, certo? então, é isso + +0:50:29.630,0:50:36.460 +foi como um pouco mais de introdução matemática ou matemática + +0:50:37.000,0:50:42.890 +procedimento, como se diz, você fará uma definição mais precisa. 
De modo a + +0:50:42.890,0:50:48.980 +foi praticamente tudo por hoje, deixa eu acender a luz e vejo você + +0:50:48.980,0:50:54.969 +na próxima segunda-feira. Obrigado por estar comigo. \ No newline at end of file diff --git a/docs/pt/week05/05-1.md b/docs/pt/week05/05-1.md new file mode 100644 index 000000000..92eae3fca --- /dev/null +++ b/docs/pt/week05/05-1.md @@ -0,0 +1,451 @@ +--- +lang: pt +lang-ref: ch.05-1 +title: Técnicas de Otimização I +lecturer: Aaron Defazio +authors: Vaibhav Gupta, Himani Shah, Gowri Addepalli, Lakshmi Addepalli +date: 24 Feb 2020 +translation-date: 06 Nov 2021 +translator: Felipe Schiavon +--- + + + + +## [Gradiente Descendente](https://www.youtube.com/watch?v=--NZb480zlg&t=88s) + + + +Começamos nosso estudo de Métodos de Otimização com o pior e mais básico método (raciocínio a seguir) do lote, o Gradiente Descendente. + + + +**Problema:** + + + +$$ +\min_w f(w) +$$ + + + +**Solução Iterativa:** + + + +$$ +w_{k+1} = w_k - \gamma_k \nabla f(w_k) +$$ + + + +onde, + - $w_{k+1}$ é o valor atualizado depois da $k$-ésima iteração, + - $w_k$ é o valor inicial antes da $k$-ésima iteração, + - $\gamma_k$ é o tamanho do passo, + - $\nabla f(w_k)$ é o gradiente de $f$. + + + +A suposição aqui é que a função $f$ é contínua e diferenciável. Nosso objetivo é encontrar o ponto mais baixo (vale) da função de otimização. No entanto, a direção real para este vale não é conhecida. Só podemos olhar localmente e, portanto, a direção do gradiente negativo é a melhor informação que temos. Dar um pequeno passo nessa direção só pode nos levar mais perto do mínimo. Assim que tivermos dado o pequeno passo, calculamos novamente o novo gradiente e novamente nos movemos um pouco nessa direção, até chegarmos ao vale. Portanto, basicamente tudo o que o gradiente descendente está fazendo é seguir a direção da descida mais acentuada (gradiente negativo). + + + +O parâmetro $\gamma$ na equação de atualização iterativa é chamado de **tamanho do passo**. Geralmente não sabemos o valor do tamanho ideal do passo; então temos que tentar valores diferentes. A prática padrão é tentar vários valores em uma escala logarítmica e, a seguir, usar o melhor valor. Existem alguns cenários diferentes que podem ocorrer. A imagem acima descreve esses cenários para uma função de erro quadrática de uma dimensão (1D). Se a taxa de aprendizado for muito baixa, faremos um progresso constante em direção ao mínimo. No entanto, isso pode levar mais tempo do que o ideal. Geralmente é muito difícil (ou impossível) obter um tamanho de passo que nos leve diretamente ao mínimo. O que desejaríamos idealmente é ter um tamanho de degrau um pouco maior do que o ideal. Na prática, isso dá a convergência mais rápida. No entanto, se usarmos uma taxa de aprendizado muito grande, as iterações se distanciam cada vez mais dos mínimos e obtemos divergência. Na prática, gostaríamos de usar uma taxa de aprendizado um pouco menor do que divergente. + + + +
+
+Figure 1: Tamanhos dos passos para função de erro quadrática de uma dimensão (1D) +
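+
+Para tornar concreta a discussão sobre o tamanho do passo, segue um esboço mínimo (em NumPy, com valores hipotéticos de $\gamma$ e de curvatura) da regra $w_{k+1} = w_k - \gamma \nabla f(w_k)$ aplicada a uma quadrática de uma dimensão; os nomes e os números são apenas ilustrativos, não fazem parte do material original.
+
+```python
+import numpy as np
+
+# Quadrática 1D: f(w) = 0.5 * a * w^2, logo grad f(w) = a * w (aqui a = 2, valor de exemplo)
+a = 2.0
+grad_f = lambda w: a * w
+
+def gradiente_descendente(w0, gamma, passos=20):
+    """Aplica w_{k+1} = w_k - gamma * grad_f(w_k) por um número fixo de passos."""
+    w = w0
+    for _ in range(passos):
+        w = w - gamma * grad_f(w)
+    return w
+
+# Taxas hipotéticas: pequena demais (lenta), próxima do ideal (rápida) e grande demais (diverge).
+for gamma in [0.05, 0.45, 1.2]:
+    print(f"gamma = {gamma:.2f} -> w após 20 passos = {gradiente_descendente(5.0, gamma):.4f}")
+```
+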
+ + + + +## [Gradiente Descendente Estocástico](https://www.youtube.com/watch?v=--NZb480zlg&t=898s) + + + +No Gradiente Descendente Estocástico, substituímos o vetor gradiente real por uma estimativa estocástica do vetor gradiente. Especificamente para uma rede neural, a estimativa estocástica significa o gradiente da perda para um único ponto dos dados (única instância). + + + +Seja $f_i$ a perda da rede para a $i$-ésima instância. + + + +$$ +f_i = l(x_i, y_i, w) +$$ + + + +A função que queremos minimizar é $f$, a perda total de todas as instâncias. + + + +$$ +f = \frac{1}{n}\sum_i^n f_i +$$ + + + + +No SGD, atualizamos os pesos de acordo com o gradiente sobre $f_i$ (em oposição ao gradiente sobre a perda total $f$). + + + +$$ +\begin{aligned} +w_{k+1} &= w_k - \gamma_k \nabla f_i(w_k) & \quad\text{(i escolhido uniformemente ao acaso)} +\end{aligned} +$$ + + + +Se $i$ for escolhido aleatoriamente, então $f_i$ é um estimador com ruído, mas sem viés, de $f$, que é matematicamente escrito como: + + + +$$ +\mathbb{E}[\nabla f_i(w_k)] = \nabla f(w_k) +$$ + + + +Como resultado disso, a $k$-ésima etapa esperada do SGD é a mesma que a $k$-ésima etapa da Gradiente Descendente completo: + + + +$$ +\mathbb{E}[w_{k+1}] = w_k - \gamma_k \mathbb{E}[\nabla f_i(w_k)] = w_k - \gamma_k \nabla f(w_k) +$$ + + + +Portanto, qualquer atualização do SGD é igual à atualização de lote completo em expectativa. No entanto, o SGD não é apenas um gradiente descendente mais rápida com algum ruído. Além de ser mais rápido, o SGD também pode nos dar melhores resultados do que o gradiente descendente completo. O ruído no SGD pode nos ajudar a evitar os mínimos locais superficiais e a encontrar mínimos melhores (mais profundos). Este fenômeno é denominado **recozimento** (**annealing**). + + + +
+
+Figure 2: Recozimento com SGD +
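+
+O esboço abaixo (NumPy, com dados sintéticos hipotéticos) apenas ilustra duas observações feitas acima: o gradiente de uma única instância é um estimador sem viés do gradiente total, e mesmo assim alguns milhares de passos de SGD aproximam a solução.
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+
+# Perda de mínimos quadrados hipotética: f_i(w) = 0.5 * (x_i . w - y_i)^2
+n, d = 500, 4
+X = rng.normal(size=(n, d))
+y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
+
+grad_fi = lambda w, i: (X[i] @ w - y[i]) * X[i]     # gradiente de uma única instância
+grad_f = lambda w: X.T @ (X @ w - y) / n            # gradiente da perda total
+
+w = np.zeros(d)
+# Estimador sem viés: a média dos grad_fi coincide com o gradiente total.
+print(np.allclose(np.mean([grad_fi(w, i) for i in range(n)], axis=0), grad_f(w)))
+
+gamma = 0.01
+for k in range(3000):
+    i = rng.integers(n)                 # i escolhido uniformemente ao acaso
+    w = w - gamma * grad_fi(w, i)       # passo de SGD
+print("norma do gradiente total após o SGD:", np.linalg.norm(grad_f(w)))
+```
+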
+ + + +Em resumo, as vantagens do Gradiente Descendente Estocástico são as seguintes: + + + +1. Há muitas informações redundantes entre as instâncias. O SGD evita muitos desses cálculos redundantes. + 2. Nos estágios iniciais, o ruído é pequeno em comparação com as informações no gradiente. Portanto, uma etapa SGD é *virtualmente tão boa quanto* uma etapa de Gradiente Descendente. + 3. *Recozimento* - O ruído na atualização do SGD pode impedir a convergência para mínimos locais ruins (rasos). + 4. O Gradiente Descendente Estocástico é drasticamente mais barato para calcular (já que você não passa por todos os pontos de dados). + + + + +### Mini-lotes + + + +Em mini-lotes, consideramos a perda em várias instâncias selecionadas aleatoriamente em vez de calculá-la em apenas uma instância. Isso reduz o ruído em cada etapa da atualização do passo. + + + +$$ +w_{k+1} = w_k - \gamma_k \frac{1}{|B_i|} \sum_{j \in B_i}\nabla f_j(w_k) +$$ + + + +Freqüentemente, podemos fazer melhor uso de nosso hardware usando mini-lotes em vez de uma única instância. Por exemplo, as GPUs são mal utilizadas quando usamos o treinamento de instância única. As técnicas de treinamento de rede distribuída dividem um grande mini-lote entre as máquinas de um cluster e, em seguida, agregam os gradientes resultantes. O Facebook treinou recentemente uma rede em dados ImageNet em uma hora, usando treinamento distribuído. + + + +É importante observar que o Gradiente Descendente nunca deve ser usado com lotes de tamanho normal. Caso você queira treinar no tamanho total do lote, use uma técnica de otimização chamada LBFGS. O PyTorch e o SciPy possuem implementações desta técnica. + + + +## [Momento](https://www.youtube.com/watch?v=--NZb480zlg&t=1672s) + + + +No Momento, temos duas iterações ($p$ e $w$) ao invés de apenas uma. As atualizações são as seguintes: + + + +$$ +\begin{aligned} +p_{k+1} &= \hat{\beta_k}p_k + \nabla f_i(w_k) \\ +w_{k+1} &= w_k - \gamma_kp_{k+1} \\ +\end{aligned} +$$ + + + +$p$ é chamado de momento SGD. Em cada etapa de atualização do passo, adicionamos o gradiente estocástico ao antigo valor do momento, após amortecê-lo por um fator $\beta$ (valor entre 0 e 1). $p$ pode ser considerado uma média contínua dos gradientes. Finalmente, movemos $w$ na direção do novo momento $p$. + + + +Forma alternativa: Método Estocástico de Bola Pesada + + + +$$ +\begin{aligned} +w_{k+1} &= w_k - \gamma_k\nabla f_i(w_k) + \beta_k(w_k - w_{k-1}) & 0 \leq \beta < 1 +\end{aligned} +$$ + + + +Esta forma é matematicamente equivalente à forma anterior. Aqui, o próximo passo é uma combinação da direção do passo anterior ($w_k - w_{k-1}$) e o novo gradiente negativo. + + + +### Intuição + + + +O Momento do SGD é semelhante ao conceito de momentum na física. O processo de otimização se assemelha a uma bola pesada rolando colina abaixo. O momento mantém a bola se movendo na mesma direção em que já está se movendo. O gradiente pode ser considerado como uma força que empurra a bola em alguma outra direção. + + + +
+
+Figure 3: Efeito do Momento
+Source: distill.pub
+
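+
+Como verificação numérica (em um problema de mínimos quadrados hipotético, com mini-lotes de tamanho arbitrário), o esboço a seguir mostra que as duas formas do momento apresentadas acima geram as mesmas iterações, a menos de erro de ponto flutuante.
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+
+# Mínimos quadrados hipotético, com mini-lotes de tamanho B.
+n, d, B = 200, 3, 16
+X = rng.normal(size=(n, d))
+y = X @ rng.normal(size=d)
+grad = lambda w, idx: X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
+
+gamma, beta = 0.05, 0.9
+
+w1, p = np.zeros(d), np.zeros(d)          # forma 1: duas sequências (p, w)
+w2, w2_ant = np.zeros(d), np.zeros(d)     # forma 2: "bola pesada" (usa w_k e w_{k-1})
+
+for k in range(100):
+    idx = rng.integers(n, size=B)          # mesmo mini-lote para as duas formas
+    p = beta * p + grad(w1, idx)
+    w1 = w1 - gamma * p
+    w2, w2_ant = w2 - gamma * grad(w2, idx) + beta * (w2 - w2_ant), w2
+
+print("diferença máxima entre as duas formas:", np.abs(w1 - w2).max())
+```
+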
+ + + + +Ao invés de fazer mudanças dramáticas na direção do caminho (como na figura à esquerda), o momento faz mudanças pequenas. O momento amortece as oscilações que são comuns quando usamos apenas SGD. + + + +O parâmetro $\beta$ é chamado de fator de amortecimento. $\beta$ tem que ser maior que zero, porque se for igual a zero, você está apenas fazendo um gradiente descendente comum. Também deve ser menor que 1, caso contrário, tudo explodirá. Valores menores de $\beta$ resultam em mudanças de direção mais rápidas. Para valores maiores, leva mais tempo para fazer curvas. + + + +
+
+Figure 4: Efeito do Beta na Convergência +
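+
+Uma ilustração numérica simples (com valores hipotéticos) do papel de $\beta$: se o gradiente fica constante e depois inverte de sinal, quanto maior o fator de amortecimento, mais passos o momento $p$ leva para mudar de direção.
+
+```python
+import numpy as np
+
+def passos_para_virar(beta, passos_antes=50):
+    """Conta quantos passos p leva para trocar de sinal depois que o gradiente inverte."""
+    p = 0.0
+    for _ in range(passos_antes):   # gradiente constante +1: p tende a 1/(1 - beta)
+        p = beta * p + 1.0
+    k = 0
+    while p > 0:                    # agora o gradiente passa a valer -1
+        p = beta * p - 1.0
+        k += 1
+    return k
+
+for beta in [0.5, 0.9, 0.99]:
+    print(f"beta = {beta}: {passos_para_virar(beta)} passo(s) para o momento mudar de direção")
+```
+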
+ + + + +### Diretrizes práticas + + + +O momento deve quase sempre ser usado com o Gradiente Descendente Estocástico. +$\beta$ = 0,9 ou 0,99 quase sempre funciona bem. + + + +O parâmetro de tamanho do passo geralmente precisa ser reduzido quando o parâmetro de momento é aumentado para manter a convergência. Se $\beta$ mudar de 0,9 para 0,99, a taxa de aprendizagem deve ser reduzida em um fator de 10. + + + + +### Por que o momento funciona? + + + + +#### Aceleração + + + + +A seguir estão as regras de atualização para o Momento de Nesterov. + + + +$$ +p_{k+1} = \hat{\beta_k}p_k + \nabla f_i(w_k) \\ +w_{k+1} = w_k - \gamma_k(\nabla f_i(w_k) +\hat{\beta_k}p_{k+1}) +$$ + + + +Com o Momento de Nesterov, você pode obter uma convergência acelerada se escolher as constantes com cuidado. Mas isso se aplica apenas a problemas convexos e não a redes neurais. + + + +Muitas pessoas dizem que o momento normal também é um método acelerado. Mas, na realidade, ele é acelerado apenas para funções quadráticas. Além disso, a aceleração não funciona bem com SGD, pois SGD tem ruído e a aceleração não funciona bem com ruído. Portanto, embora um pouco de aceleração esteja presente no SGD com Momento, por si só não é uma boa explicação para o alto desempenho da técnica. + + + + +#### Suavização de ruído + + + +Provavelmente, uma razão mais prática e provável de por que o momento funciona é a Suavização de ruído. + + + + +O momento calcula a média dos gradientes. É uma média contínua de gradientes que usamos para cada atualização do passo. + + + +Teoricamente, para que o SGD funcione, devemos obter a média de todas as atualizações dos passos. + + + +$$ +\bar w_k = \frac{1}{K} \sum_{k=1}^K w_k +$$ + + + +A grande vantagem do SGD com momento é que essa média não é mais necessária. O Momento adiciona suavização ao processo de otimização, o que torna cada atualização uma boa aproximação da solução. Com o SGD, você desejaria calcular a média de um monte de atualizações e, em seguida, dar um passo nessa direção. + + + +Tanto a aceleração quanto a suavização de ruído contribuem para um alto desempenho do Momento. + + + +
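+
+Apenas para tornar as fórmulas concretas (isto não é uma demonstração de aceleração em redes neurais), segue um esboço do Momento de Nesterov com gradiente ruidoso em uma quadrática simples, acompanhado da média das iterações $\bar{w}_K$ mencionada acima; todos os valores são hipotéticos.
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+
+# Quadrática simples com gradiente ruidoso (valores hipotéticos).
+h = np.array([10.0, 1.0])
+grad_ruidoso = lambda w: h * w + 0.1 * rng.normal(size=2)
+
+w, p = np.array([1.0, 1.0]), np.zeros(2)
+w_medio = np.zeros(2)
+gamma, beta, K = 0.01, 0.9, 1000
+for k in range(K):
+    g = grad_ruidoso(w)
+    p = beta * p + g                   # p_{k+1} = beta * p_k + grad f_i(w_k)
+    w = w - gamma * (g + beta * p)     # w_{k+1} = w_k - gamma * (grad + beta * p_{k+1})
+    w_medio += w / K                   # média das iterações, como em \bar{w}_K
+
+print("último w:", np.round(w, 3), " | w médio:", np.round(w_medio, 3))
+```
+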
+
+Figure 5: SGD vs. Momento +
+ + + +Com o Gradiente Descendente Estocástico, inicialmente, fazemos um bom progresso em direção à solução, mas quando alcançamos o fundo da "tigela", ficamos rodeando em volta deste piso. Se ajustarmos a taxa de aprendizado, vamos rodear mais devagar. Com o impulso, suavizamos os degraus, para que não haja saltos. diff --git a/docs/pt/week05/05-2.md b/docs/pt/week05/05-2.md new file mode 100644 index 000000000..6c925fb0f --- /dev/null +++ b/docs/pt/week05/05-2.md @@ -0,0 +1,512 @@ +--- +lang: pt +lang-ref: ch.05-2 +title: Técnicas de Otimização II +lecturer: Aaron Defazio +authors: Guido Petri, Haoyue Ping, Chinmay Singhal, Divya Juneja +date: 24 Feb 2020 +translator: Felipe Schiavon +translation-date: 14 Nov 2021 +--- + + + + +## [Métodos Adaptativos](https://www.youtube.com/watch?v=--NZb480zlg&t=2675s) + + + +Momento com SGD é atualmente o método de otimização de última geração para muitos problemas de aprendizagem de máquina. Mas existem outros métodos, geralmente chamados de Métodos Adaptativos, inovados ao longo dos anos que são particularmente úteis para problemas mal condicionados (se o SGD não funcionar). + + + +Na formulação SGD, cada peso na rede é atualizado usando uma equação com a mesma taxa de aprendizado (global $\gamma$). Aqui, para métodos adaptativos, *adaptamos uma taxa de aprendizagem para cada peso individualmente*. Para tanto, são utilizadas as informações que obtemos dos gradientes para cada peso. + + + +As redes que são frequentemente usadas na prática têm estruturas diferentes em diferentes partes delas. Por exemplo, partes iniciais da CNN podem ser camadas de convolução muito rasas em imagens grandes e, posteriormente, na rede, podemos ter convoluções de grande número de canais em imagens pequenas. Ambas as operações são muito diferentes, portanto, uma taxa de aprendizado que funciona bem para o início da rede pode não funcionar bem para as últimas seções da rede. Isso significa que as taxas de aprendizagem adaptativa por camada podem ser úteis. + + + +Os pesos na última parte da rede (4096 na figura 1 abaixo) ditam diretamente a saída e têm um efeito muito forte sobre ela. Portanto, precisamos de taxas de aprendizado menores para eles. Em contraste, pesos anteriores terão efeitos individuais menores na saída, especialmente quando inicializados aleatoriamente. + + + +
+
+Figure 1: VGG16 +
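+
+Antes dos métodos adaptativos propriamente ditos, um esboço mínimo (com um modelo hipotético, apenas para ilustrar) mostra como atribuir manualmente taxas de aprendizado diferentes a partes diferentes da rede usando grupos de parâmetros de `torch.optim`; os métodos das próximas seções automatizam esse tipo de adaptação, peso a peso.
+
+```python
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+# Modelo hipotético, apenas para ilustrar: um tronco convolucional e uma camada final.
+tronco = nn.Sequential(
+    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
+    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
+    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
+)
+cabeca = nn.Linear(32, 10)
+
+# Grupos de parâmetros: taxa menor para a camada final, que afeta a saída diretamente.
+otimizador = torch.optim.SGD(
+    [{"params": tronco.parameters(), "lr": 1e-2},
+     {"params": cabeca.parameters(), "lr": 1e-3}],
+    momentum=0.9,
+)
+
+x = torch.randn(8, 3, 32, 32)
+alvo = torch.randint(0, 10, (8,))
+otimizador.zero_grad()
+perda = F.cross_entropy(cabeca(tronco(x)), alvo)
+perda.backward()
+otimizador.step()
+print("perda:", perda.item())
+```
+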
+ + + + +### RMSprop + + + +A ideia principal da *Propagação da raiz do valor quadrático médio* (*Root Mean Square Propagation*) é que o gradiente é normalizado por sua raiz quadrada média. + + + +Na equação abaixo, elevar ao quadrado o gradiente denota que cada elemento do vetor é elevado ao quadrado individualmente. + + + +$$ +\begin{aligned} +v_{t+1} &= {\alpha}v_t + (1 - \alpha) \nabla f_i(w_t)^2 \\ +w_{t+1} &= w_t - \gamma \frac {\nabla f_i(w_t)}{ \sqrt{v_{t+1}} + \epsilon} +\end{aligned} +$$ + + + +onde $\gamma$ é a taxa de aprendizagem global, $\epsilon$ é um valor próximo a máquina $\epsilon$ (na ordem de $10^{-7}$ ou $10^{-8}$) - na ordem para evitar erros de divisão por zero, e $v_{t+1}$ é a estimativa do segundo momento. + + + +Atualizamos $v$ para estimar essa quantidade ruidosa por meio de uma *média móvel exponencial* (que é uma maneira padrão de manter uma média de uma quantidade que pode mudar com o tempo). Precisamos colocar pesos maiores nos valores mais novos, pois eles fornecem mais informações. Uma maneira de fazer isso é reduzir exponencialmente os valores antigos. Os valores no cálculo de $v$ que são muito antigos são reduzidos a cada etapa por uma constante $\alpha$, que varia entre 0 e 1. Isso amortece os valores antigos até que eles não sejam mais uma parte importante do exponencial média móvel. + + + +O método original mantém uma média móvel exponencial de um segundo momento não central, portanto, não subtraímos a média aqui. O *segundo momento* é usado para normalizar o gradiente em termos de elemento, o que significa que cada elemento do gradiente é dividido pela raiz quadrada da estimativa do segundo momento. Se o valor esperado do gradiente for pequeno, esse processo é semelhante a dividir o gradiente pelo desvio padrão. + + + +Usar um $\epsilon$ pequeno no denominador não diverge porque quando $v$ é muito pequeno, o momento também é muito pequeno. + + + + +### ADAM + + + +ADAM, ou *Estimativa Adaptativa do Momento*, que é RMSprop mais o Momento, é o método mais comumente usado. A atualização do Momento é convertida em uma média móvel exponencial e não precisamos alterar a taxa de aprendizagem quando lidamos com $\beta$. Assim como no RMSprop, pegamos uma média móvel exponencial do gradiente quadrado aqui. + + + +$$ +\begin{aligned} +m_{t+1} &= {\beta}m_t + (1 - \beta) \nabla f_i(w_t) \\ +v_{t+1} &= {\alpha}v_t + (1 - \alpha) \nabla f_i(w_t)^2 \\ +w_{t+1} &= w_t - \gamma \frac {m_{t}}{ \sqrt{v_{t+1}} + \epsilon} +\end{aligned} +$$ + + + +onde $m_{t+1}$ é a média móvel exponencial do momento. + + + +A correção de viés que é usada para manter a média móvel imparcial durante as iterações iniciais não é mostrada aqui. + + + + +### Lado Prático + + + +Ao treinar redes neurais, o SGD geralmente vai na direção errada no início do processo de treinamento, enquanto o RMSprop aprimora a direção certa. No entanto, o RMSprop sofre de ruído da mesma forma que o SGD normal, então ele oscila em torno do ótimo significativamente quando está perto de um minimizador local. Assim como quando adicionamos impulso ao SGD, obtemos o mesmo tipo de melhoria com o ADAM. É uma estimativa boa e não ruidosa da solução, portanto **ADAM é geralmente recomendado em vez de RMSprop**. + + + +
+
+Figure 2: SGD *vs.* RMSprop *vs.* ADAM +
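+
+Como esboço ilustrativo das equações acima (em um problema-brinquedo mal condicionado, com valores hipotéticos), a atualização do ADAM sem a correção de viés pode ser escrita como abaixo; com $\beta = 0$ recupera-se essencialmente o RMSprop.
+
+```python
+import numpy as np
+
+# Problema-brinquedo mal condicionado: f(w) = 0.5 * sum(h_j * w_j^2), curvaturas muito diferentes.
+h = np.array([100.0, 1.0, 0.01])
+grad_f = lambda w: h * w
+
+def adam(w0, gamma=0.01, beta=0.9, alpha=0.999, eps=1e-8, passos=1000):
+    """Esboço das equações acima, sem a correção de viés; com beta = 0 vira RMSprop."""
+    w, m, v = w0.copy(), np.zeros_like(w0), np.zeros_like(w0)
+    for _ in range(passos):
+        g = grad_f(w)
+        m = beta * m + (1 - beta) * g         # média móvel exponencial do gradiente
+        v = alpha * v + (1 - alpha) * g**2    # média móvel exponencial do gradiente ao quadrado
+        w = w - gamma * m / (np.sqrt(v) + eps)
+    return w
+
+print("w final (ADAM):", np.round(adam(np.array([1.0, 1.0, 1.0])), 3))
+```
+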

+ + + +O ADAM é necessário para treinar algumas das redes para usar modelos de linguagem. Para otimizar redes neurais, SGD com momentum ou ADAM é geralmente preferido. No entanto, a teoria do ADAM em artigos é mal compreendida e também tem várias desvantagens: + + + +* Pode ser mostrado em problemas de teste muito simples que o método não converge. +* É conhecido por fornecer erros de generalização. Se a rede neural for treinada para fornecer perda zero nos dados em que você a treinou, ela não fornecerá perda zero em outros pontos de dados que nunca viu antes. É bastante comum, principalmente em problemas de imagem, obtermos erros de generalização piores do que quando se usa SGD. Os fatores podem incluir que ele encontre o mínimo local mais próximo, ou menos ruído no ADAM, ou sua estrutura, por exemplo. +* Com o ADAM, precisamos manter 3 buffers, enquanto o SGD precisa de 2 buffers. Isso realmente não importa, a menos que treinemos um modelo da ordem de vários gigabytes de tamanho; nesse caso, ele pode não caber na memória. +* 2 parâmetros de momentum precisam ser ajustados em vez de 1. + + + + +## [Camadas de Normalização](https://www.youtube.com/watch?v=--NZb480zlg&t=3907s) + + + +Em vez de melhorar os algoritmos de otimização, as *camadas de normalização* melhoram a própria estrutura da rede. Eles são camadas adicionais entre as camadas existentes. O objetivo é melhorar o desempenho de otimização e generalização. + + + +Em redes neurais, normalmente alternamos operações lineares com operações não lineares. As operações não lineares também são conhecidas como funções de ativação, como ReLU. Podemos colocar camadas de normalização antes das camadas lineares ou após as funções de ativação. A prática mais comum é colocá-los entre as camadas lineares e as funções de ativação, como na figura abaixo. + + + +|
|
|
| +| (a) Antes de adicionar a normalização | (b) Depois de adicionar a normalização | (c) Um exemplo em CNNs | + + + +
Figura 3: Posições típicas de camadas de normalização.
+ + + +Na figura 3 (c), a convolução é a camada linear, seguida pela normalização do lote, seguida por ReLU. + + + +Observe que as camadas de normalização afetam os dados que fluem, mas não alteram o poder da rede no sentido de que, com a configuração adequada dos pesos, a rede não normalizada ainda pode dar a mesma saída que uma rede normalizada. + + + + +### Operações de normalização + + + +Esta é a notação genérica para normalização: + + + +$$ +y = \frac{a}{\sigma}(x - \mu) + b +$$ + + + +onde $x$ é o vetor de entrada, $y$ é o vetor de saída, $\mu$ é a estimativa da média de $x$, $\sigma$ é a estimativa do desvio padrão (std) de $x$ , $a$ é o fator de escala que pode ser aprendido e $b$ é o termo de polarização que pode ser aprendido. + + + +Sem os parâmetros aprendíveis $a$ e $b$, a distribuição do vetor de saída $y$ terá média fixa 0 e padrão 1. O fator de escala $a$ e o termo de polarização $b$ mantêm o poder de representação da rede, *ou seja,* os valores de saída ainda podem estar acima de qualquer faixa específica. Observe que $a$ e $b$ não invertem a normalização, porque eles são parâmetros aprendíveis e são muito mais estáveis do que $\mu$ e $\sigma$. + + + +
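+
+Para tornar a operação concreta, segue um esboço (em PyTorch, com dimensões hipotéticas) que aplica $y = \frac{a}{\sigma}(x - \mu) + b$ estimando $\mu$ e $\sigma$ sobre diferentes conjuntos de dimensões de um tensor no formato $(N, C, H, W)$; as camadas reais aprendem $a$ e $b$ e diferem justamente em quais dimensões entram nessa estimativa, como resume a figura a seguir.
+
+```python
+import torch
+
+def normaliza(x, dims, a=1.0, b=0.0, eps=1e-5):
+    """Aplica y = a/sigma * (x - mu) + b, com mu e sigma estimados sobre as dimensões 'dims'."""
+    mu = x.mean(dim=dims, keepdim=True)
+    sigma = x.std(dim=dims, keepdim=True)
+    return a / (sigma + eps) * (x - mu) + b
+
+x = torch.randn(16, 32, 8, 8)               # mini-lote hipotético no formato (N, C, H, W)
+y_lote = normaliza(x, dims=(0, 2, 3))        # estilo "normalização em lote": por canal, sobre o lote
+y_camada = normaliza(x, dims=(1, 2, 3))      # estilo "normalização de camada": por amostra
+print(y_lote.mean().item(), y_camada.std().item())
+```
+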
+Figure 4: Operações de Normalização.
+ + + +Existem várias maneiras de normalizar o vetor de entrada, com base em como selecionar amostras para normalização. A Figura 4 lista 4 abordagens diferentes de normalização, para um minilote de $N$ imagens de altura $H$ e largura $W$, com canais $C$: + + + +- *Normalização em lote (Batch Norm)*: a normalização é aplicada apenas em um canal da entrada. Esta é a primeira proposta e a abordagem mais conhecida. Leia [Como treinar seu ResNet 7: norma de lote](https://myrtle.ai/learn/how-to-train-your-resnet-7-batch-norm/) para obter mais informações. +- *Normalização de camada (Layer Norm)*: a normalização é aplicada dentro de uma imagem em todos os canais. +- *Normalização de instância (Instance Norm)*: a normalização é aplicada apenas sobre uma imagem e um canal. +- *Normalização de grupo (Group Norm)*: a normalização é aplicada sobre uma imagem, mas em vários canais. Por exemplo, os canais 0 a 9 são um grupo, os canais 10 a 19 são outro grupo e assim por diante. Na prática, o tamanho do grupo é quase sempre de 32. Essa é a abordagem recomendada por Aaron Defazio, pois tem um bom desempenho na prática e não conflita com o SGD. + + + +Na prática, a norma de lote e a norma de grupo funcionam bem para problemas de visão computacional, enquanto a norma de camada e a norma de instância são muito usadas para problemas de linguagem. + + + + +### Por que a normalização ajuda? + + + +Embora a normalização funcione bem na prática, as razões por trás de sua eficácia ainda são contestadas. Originalmente, a normalização é proposta para reduzir a "mudança interna da covariável" ("internal covariate shift"), mas alguns estudiosos provaram que estava errada em experimentos. No entanto, a normalização claramente tem uma combinação dos seguintes fatores: + + + +- Redes com camadas de normalização são mais fáceis de otimizar, permitindo o uso de maiores taxas de aprendizado. A normalização tem um efeito de otimização que acelera o treinamento das redes neurais. +- As estimativas de média/padrão são ruidosas devido à aleatoriedade das amostras no lote. Este "ruído" extra resulta em melhor generalização em alguns casos. A normalização tem um efeito de regularização. +- A normalização reduz a sensibilidade à inicialização do peso. + + + +Como resultado, a normalização permite que você seja mais "descuidado" - você pode combinar quase todos os blocos de construção de rede neural e ter uma boa chance de treiná-la sem ter que considerar o quão mal condicionada ela pode estar. + + + + +### Considerações práticas + + + +É importante que a retropropagação seja feita por meio do cálculo da média e do padrão, bem como a aplicação da normalização: o treinamento da rede irá divergir de outra forma. O cálculo da propagação reversa é bastante difícil e sujeito a erros, mas o PyTorch é capaz de calculá-lo automaticamente para nós, o que é muito útil. Duas classes de camada de normalização em PyTorch estão listadas abaixo: + + + +```python +torch.nn.BatchNorm2d(num_features, ...) +torch.nn.GroupNorm(num_groups, num_channels, ...) +``` + + + +A normalização em lote (batch norm) foi o primeiro método desenvolvido e é o mais amplamente conhecido. No entanto, **Aaron Defazio recomenda usar a normalização de grupo (group norm)** ao invés da primeira. Ele é mais estável, teoricamente mais simples e geralmente funciona melhor. O tamanho do grupo 32 é um bom padrão. 
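+Como exemplo de uso (canais, grupos e tamanhos meramente ilustrativos), um bloco convolução → normalização → ReLU na ordem da Figura 3 (c), usando a norma de grupo recomendada acima com grupos de tamanho 32:
+
+```python
+import torch
+import torch.nn as nn
+
+bloco = nn.Sequential(
+    nn.Conv2d(3, 64, kernel_size=3, padding=1),   # camada linear (convolução)
+    nn.GroupNorm(num_groups=2, num_channels=64),  # 2 grupos de 32 canais cada
+    nn.ReLU(),                                    # função de ativação
+)
+
+x = torch.randn(16, 3, 32, 32)  # minilote de 16 imagens RGB 32x32
+y = bloco(x)
+print(y.shape)                  # torch.Size([16, 64, 32, 32])
+```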
+ + + +Observe que para normalização em lote e normalização de instância, a média/padrão usada é fixada após o treinamento, em vez de recalculada toda vez que a rede é avaliada, porque várias amostras de treinamento são necessárias para realizar a normalização. Isso não é necessário para normalização de grupo e normalização de camada, uma vez que sua normalização é sobre apenas uma amostra de treinamento. + + + + +## [A morte da otimização](https://www.youtube.com/watch?v=--NZb480zlg&t=4817s) + + + +Às vezes, podemos invadir um campo sobre o qual nada sabemos e melhorar a forma como eles estão implementando as coisas. Um exemplo é o uso de redes neurais profundas no campo da exames de Ressonância Magnética (MRI) para acelerar a reconstrução de imagens de MRI. + + + +
+Figure 5: Às vezes realmente funciona!
+ + + + +### Reconstrução de Ressonância Magnética + + + +No problema de reconstrução tradicional de exames de ressonância magnética (MRI), os dados brutos são obtidos de uma máquina de MRI e uma imagem é reconstruída a partir dele usando um pipeline/algoritmo simples. As máquinas de ressonância magnética capturam dados em um domínio de Fourier bidimensional, uma linha ou uma coluna por vez (a cada poucos milissegundos). Esta entrada bruta é composta por uma frequência e um canal de fase e o valor representa a magnitude de uma onda senoidal com aquela frequência e fase específicas. Em termos simples, pode ser pensada como uma imagem de valor complexo, possuindo um canal real e outro imaginário. Se aplicarmos uma transformada inversa de Fourier nesta entrada, ou seja, somarmos todas essas ondas senoidais ponderadas por seus valores, podemos obter a imagem anatômica original. + + + +
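+Um esboço ilustrativo, com dados sintéticos, do mapeamento descrito acima: a transformada de Fourier inversa 2D recupera a "imagem" a partir do seu espaço de Fourier completo (usamos `torch.fft`; os tamanhos são arbitrários e a aquisição real de MRI é bem mais complicada do que isto):
+
+```python
+import torch
+
+img = torch.randn(256, 256)           # "imagem" sintética, só para ilustrar
+kspace = torch.fft.fft2(img)          # dados no domínio de Fourier 2D (entrada bruta simplificada)
+recon = torch.fft.ifft2(kspace).real  # reconstrução: um mapeamento linear e muito rápido
+
+print((recon - img).abs().max())      # erro minúsculo: a reconstrução é exata a menos de arredondamento
+```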
+Fig. 6: Reconstrução de ressonância magnética
+ + + +Existe atualmente um mapeamento linear para ir do domínio de Fourier ao domínio da imagem e é muito eficiente, levando literalmente milissegundos, não importa o tamanho da imagem. Mas a questão é: podemos fazer isso ainda mais rápido? + + + + +### Ressonância magnética acelerada + + + +O novo problema que precisa ser resolvido é a ressonância magnética acelerada, onde por aceleração queremos dizer tornar o processo de reconstrução por ressonância magnética muito mais rápido. Queremos operar as máquinas mais rapidamente e ainda ser capazes de produzir imagens de qualidade idêntica. Uma maneira de fazer isso e a maneira mais bem-sucedida até agora tem sido não capturar todas as colunas da varredura de ressonância magnética. Podemos pular algumas colunas aleatoriamente, embora seja útil na prática capturar as colunas do meio, pois elas contêm muitas informações na imagem, mas fora delas apenas capturamos aleatoriamente. O problema é que não podemos mais usar nosso mapeamento linear para reconstruir a imagem. A imagem mais à direita na Figura 7 mostra a saída de um mapeamento linear aplicado ao espaço de Fourier subamostrado. É claro que esse método não nos dá resultados muito úteis e que há espaço para fazer algo um pouco mais inteligente. + + + +
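+Continuando o esboço anterior, uma ilustração (com máscara e proporções hipotéticas) do que acontece quando aplicamos o mesmo mapeamento linear ao espaço de Fourier subamostrado: mantemos as colunas do meio, sorteamos uma fração das demais, zeramos o resto, e a reconstrução "preenchida com zeros" deixa de ser fiel, como na Figura 7:
+
+```python
+import torch
+
+img = torch.randn(256, 256)
+kspace = torch.fft.fft2(img)
+
+# máscara por coluna: mantém as 32 colunas centrais e ~25% das demais, escolhidas ao acaso
+mascara = torch.rand(256) < 0.25
+mascara[112:144] = True
+kspace_sub = kspace * mascara.to(kspace.dtype)  # zera as colunas não amostradas
+
+recon_zeros = torch.fft.ifft2(kspace_sub).real  # "reconstrução" linear ingênua
+
+print(mascara.float().mean())                   # fração de colunas mantidas
+print((recon_zeros - img).abs().mean())         # erro bem maior do que no caso completo
+```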
+Fig. 7: Mapeamento linear no espaço de Fourier subamostrado
+ + + + +### Compressed sensing + + + +Um dos maiores avanços na matemática teórica por muito tempo foi o sensoriamento comprimido. Um artigo de Candes et al. mostrou que, teoricamente, podemos obter uma reconstrução perfeita a partir da subamostra da imagem do domínio de Fourier . Em outras palavras, quando o sinal que estamos tentando reconstruir é esparso ou esparsamente estruturado, então é possível reconstruí-lo perfeitamente a partir de menos medições. Mas existem alguns requisitos práticos para que isso funcione - não precisamos amostrar aleatoriamente, em vez disso, precisamos amostrar incoerentemente - embora, na prática, as pessoas acabem apenas amostrando aleatoriamente. Além disso, leva o mesmo tempo para amostrar uma coluna inteira ou meia coluna, portanto, na prática, também amostramos colunas inteiras. + + + +Outra condição é que precisamos ter *esparsidade* em nossa imagem, onde por esparsidade queremos dizer muitos zeros ou pixels pretos na imagem. A entrada bruta pode ser representada esparsamente se fizermos uma decomposição do comprimento de onda, mas mesmo essa decomposição nos dá uma imagem aproximadamente esparsa e não exatamente esparsa. Portanto, essa abordagem nos dá uma reconstrução muito boa, mas não perfeita, como podemos ver na Figura 8. No entanto, se a entrada fosse muito esparsa no domínio do comprimento de onda, com certeza obteríamos uma imagem perfeita. + + + +
+Figure 8: Sensoriamento comprimido
+ + + +O sensoriamento comprimido é baseado na teoria da otimização. A maneira como podemos obter essa reconstrução é resolvendo um problema de mini-otimização que tem um termo de regularização adicional: + + + +$$ +\hat{x} = \arg\min_x \frac{1}{2} \Vert M (\mathcal{F}(x)) - y \Vert^2 + \lambda TV(x) +$$ + + + +onde $M$ é a função de máscara que zera as entradas não amostradas, $\mathcal{F}$ é a transformação de Fourier, $y$ são os dados observados do domínio de Fourier, $\lambda$ é a força da penalidade de regularização e $V$ é a função de regularização. + + + +O problema de otimização deve ser resolvido para cada etapa de tempo ou cada "fatia" em uma ressonância magnética, que geralmente leva muito mais tempo do que a própria varredura. Isso nos dá outro motivo para encontrar algo melhor. + + + + +### Quem precisa de otimização? + + + +Em vez de resolver o pequeno problema de otimização em cada etapa do tempo, por que não usar uma grande rede neural para produzir a solução necessária diretamente? Nossa esperança é que possamos treinar uma rede neural com complexidade suficiente para que essencialmente resolva o problema de otimização em uma etapa e produza uma saída que seja tão boa quanto a solução obtida ao resolver o problema de otimização em cada etapa de tempo. + + + +$$ +\hat{x} = B(y) +$$ + + + +onde $B$ é o nosso modelo de aprendizado profundo e $y$ são os dados observados do domínio de Fourier. + + + +Há 15 anos, essa abordagem era difícil - mas hoje em dia é muito mais fácil de implementar. A Figura 9 mostra o resultado de uma abordagem de aprendizado profundo para esse problema e podemos ver que a saída é muito melhor do que a abordagem de detecção compactada e é muito semelhante ao exame de imagem real. + + + +
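+Apenas como esboço conceitual (e não a arquitetura real, que é descrita logo abaixo), um modelo $B$ minúsculo e hipotético que recebe a reconstrução preenchida com zeros, separada em canais real e imaginário, e tenta produzir a imagem diretamente:
+
+```python
+import torch
+import torch.nn as nn
+
+# B: uma CNN pequena e hipotética; o modelo real usado neste problema é bem maior (baseado em U-Net)
+B = nn.Sequential(
+    nn.Conv2d(2, 32, kernel_size=3, padding=1), nn.ReLU(),
+    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
+    nn.Conv2d(32, 1, kernel_size=3, padding=1),
+)
+
+# y: dados subamostrados do domínio de Fourier (sintéticos, só para ilustrar as dimensões)
+y = torch.randn(8, 256, 256, dtype=torch.complex64)
+zeros_preenchidos = torch.fft.ifft2(y)  # mapeamento linear ingênuo
+entrada = torch.stack([zeros_preenchidos.real, zeros_preenchidos.imag], dim=1)  # (8, 2, 256, 256)
+
+x_hat = B(entrada)  # (8, 1, 256, 256): estimativa direta da imagem, sem resolver a otimização por fatia
+print(x_hat.shape)
+```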
+Figure 9: Abordagem com aprendizado profundo
+ + + +O modelo usado para gerar essa reconstrução usa um otimizador ADAM, camadas de normalização de norma de grupo e uma rede neural convolucional baseada em U-Net. Essa abordagem está muito próxima de aplicações práticas e esperamos ver esses exames de ressonância magnética acelerados acontecendo na prática clínica em alguns anos. \ No newline at end of file diff --git a/docs/pt/week05/05-3.md b/docs/pt/week05/05-3.md new file mode 100644 index 000000000..029a327af --- /dev/null +++ b/docs/pt/week05/05-3.md @@ -0,0 +1,490 @@ +--- +lang: pt +lang-ref: ch.05-3 +title: Noções básicas sobre convoluções e mecanismo de diferenciação automática +lecturer: Alfredo Canziani +authors: Leyi Zhu, Siqi Wang, Tao Wang, Anqi Zhang +date: 25 Feb 2020 +translator: Felipe Schiavon +translation-date: 14 Nov 2021 +--- + + + + +## [Entendendo a convolução 1D](https://www.youtube.com/watch?v=eEzCZnOFU1w&t=140s) + + + +Nesta parte discutiremos a convolução, uma vez que gostaríamos de explorar a esparsidade, estacionariedade e composicionalidade dos dados. + + + +Ao invés de usar a matriz $A$ discutida na [semana anterior]({{site.baseurl}}/pt/week04/04-1), vamos alterar a largura da matriz para o tamanho do kernel $k$. Portanto, cada linha da matriz é um kernel. Podemos usar os kernels os empilhando e deslocando (veja a Fig. 1). Então podemos ter $m$ camadas de altura $n-k+1$. + + +
+Fig 1: Ilustração de uma Convolução 1D
+ + + +A saída são $m$ (espessura) vetores de tamanho $n-k+1$. + + + +
+Fig 2: Resultado da Convolução 1D
+ + + +Além disso, um único vetor de entrada pode ser visto como um sinal monofônico. + + + +
+Fig 3: Sinal Monofônico
+Agora, a entrada $x$ é o mapeamento
+
+$$
+x:\Omega\rightarrow\mathbb{R}^{c}
+$$
+
+onde $\Omega = \lbrace 1, 2, 3, \cdots \rbrace \subset \mathbb{N}^1$ (uma vez que este é um sinal unidimensional, isto é, tem um domínio unidimensional) e, neste caso, o número de canais $c$ é $1$. Quando $c = 2$, isso se torna um sinal estereofônico.
+
+Para a convolução 1D, podemos apenas calcular o produto escalar, kernel por kernel (consulte a Figura 4).
+Fig 4: Produto escalar camada por camada da convolução 1D
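+Um esboço (com $n$, $k$ e $m$ arbitrários) do produto escalar kernel por kernel descrito acima, comparado com `torch.nn.functional.conv1d` para conferir o tamanho de saída $n-k+1$:
+
+```python
+import torch
+import torch.nn.functional as F
+
+n, k, m = 10, 3, 4           # comprimento do sinal, tamanho do kernel, número de kernels
+x = torch.randn(n)           # sinal monofônico (1 canal)
+kernels = torch.randn(m, k)  # m kernels de tamanho k (um por "linha" da matriz)
+
+# convolução 1D "manual": para cada kernel, produto escalar com cada janela deslocada do sinal
+saida = torch.zeros(m, n - k + 1)
+for i in range(m):
+    for t in range(n - k + 1):
+        saida[i, t] = kernels[i] @ x[t:t + k]
+
+# a mesma conta com o PyTorch (que, como acima, faz correlação cruzada, sem inverter o kernel)
+saida_torch = F.conv1d(x.view(1, 1, n), kernels.view(m, 1, k)).squeeze(0)
+
+print(saida.shape)                                    # torch.Size([4, 8]) = (m, n-k+1)
+print(torch.allclose(saida, saida_torch, atol=1e-6))  # True
+```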
+ + + + +## [Dimensões das larguras dos kernels e saídas no PyTorch](https://www.youtube.com/watch?v=eEzCZnOFU1w&t=1095s) + + + +Dicas: Podemos usar ***ponto de interrogação*** no IPython para obter acesso à documentação das funções. Por exemplo, + + + +```python +Init signature: +nn.Conv1d( + in_channels, # número de canais na imagem de entrada + out_channels, # número de canais produzidos pela convolução + kernel_size, # tamanho do kernel convolvente + stride=1, # stride (passo) da convolução + padding=0, # zero-padding (preenchimento com zero) adicionado nos dois lados da entrada + dilation=1, # espaçamento entre os elementos do kernel + groups=1, # número de conexões bloqueadas da entrada para a saída + bias=True, # se `True`, adiciona um viés "aprendível" na saída + padding_mode='zeros', # modo de preenchimento, aceita valores `zeros` e `circular` +) +``` + + + + +### Convolução 1D + + + +Temos $1$ convolução dimensional indo de $2$ canais (sinal estereofônico) para $16$ canais ($16$ kernels) com tamanho de kernel de $3$ e *stride* (passo) de $1$. Temos então $16$ kernels com espessura $2$ e comprimento $3$. Vamos supor que o sinal de entrada tenha um lote de tamanho $1$ (um sinal), $2$ canais e $64$ amostras. A camada de saída resultante tem $1$ sinal, $16$ canais e o comprimento do sinal é $62$ ($=64-3+1$). Além disso, se gerarmos o tamanho do enviesamento, descobriremos que o tamanho do viés é $16$, já que temos um viés para cada peso. + + + +```python +conv = nn.Conv1d(2, 16, 3) # 2 canais (sinal estéreo), 16 kernels de tamanho 3 +conv.weight.size() # saída: torch.Size([16, 2, 3]) +conv.bias.size() # saída: torch.Size([16]) +x = torch.rand(1, 2, 64) # lote de tamanho 1, 2 canais, 64 amostras +conv(x).size() # saída: torch.Size([1, 16, 62]) +conv = nn.Conv1d(2, 16, 5) # 2 canais, 16 kernels de tamanho 5 +conv(x).size() # saída: torch.Size([1, 16, 60]) + +``` + + + + +### Convolução 2D + + + +Primeiro definimos os dados de entrada como $1$ amostra, $20$ canais (digamos, estamos usando uma imagem hiperespectral) com altura $64$ e largura $128$. A convolução 2D tem $20$ canais de entrada e $16$ kernels com tamanho de $3\times 5$. Após a convolução, os dados de saída têm $1$ amostra, $16$ canais com altura $62$ ($=64-3+1$) e largura $124$ ($=128-5+1$). + + + +```python +x = torch.rand(1, 20, 64, 128) # 1 amostra, 20 canais, altura 64 e largura 128 +conv = nn.Conv2d(20, 16, (3, 5)) # 20 canais, 16 kernels, kernel de tamanho 3 x 5 +conv.weight.size() # saída: torch.Size([16, 20, 3, 5]) +conv(x).size() # saída: torch.Size([1, 16, 62, 124]) +``` + + + +Se quisermos atingir a mesma dimensionalidade, podemos ter preenchimentos. Continuando o código acima, podemos adicionar novos parâmetros à função de convolução: `stride = 1` e` padding = (1, 2) `, o que significa $1$ em $y$ direction ($1$ no topo e $1$ na parte inferior) e $2$ na direção $x$. Então, o sinal de saída tem o mesmo tamanho em comparação com o sinal de entrada. O número de dimensões necessárias para armazenar a coleção de kernels ao realizar a convolução 2D é $4$. + + + +```python +# 20 canais, 16 kernels de tamanho 3 x 5, stride de 1, preenchimento (padding) de 1 e 2 +conv = nn.Conv2d(20, 16, (3, 5), 1, (1, 2)) +conv(x).size() # saída: torch.Size([1, 16, 64, 128]) +``` + + + + +## [Como funciona o gradiente automático?](https://www.youtube.com/watch?v=eEzCZnOFU1w&t=1634s) + + + +Nesta seção, vamos pedir ao torch para verificar todos os cálculos sobre os tensores para que possamos realizar o cálculo das derivadas parciais. 
+- Crie um tensor $2\times2$ $\boldsymbol{x}$ com acumulação de gradientes habilitada;
+- Subtraia $2$ de todos os elementos de $\boldsymbol{x}$ e obtenha $\boldsymbol{y}$; (se imprimirmos `y.grad_fn`, obteremos uma referência a `SubBackward0`, o que significa que `y` é gerado pelo módulo de subtração $\boldsymbol{x}-2$. Também podemos usar `y.grad_fn.next_functions[0][0].variable` para recuperar o tensor original.)
+- Faça mais operações: $\boldsymbol{z} = 3\boldsymbol{y}^2$;
+- Calcule a média de $\boldsymbol{z}$.
+Fig 5: Fluxograma do Exemplo de gradiente automático
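+Um esboço em PyTorch dos passos listados acima, usando os mesmos valores de $\boldsymbol{x}$ da verificação manual feita a seguir:
+
+```python
+import torch
+
+# x: tensor 2x2 com acumulação de gradientes habilitada
+x = torch.tensor([[1., 2.], [3., 4.]], requires_grad=True)
+
+y = x - 2        # y.grad_fn aponta para o módulo de subtração (SubBackward0)
+z = 3 * y ** 2
+a = z.mean()
+
+a.backward()     # retropropagação: calcula da/dx e acumula o resultado em x.grad
+
+print(x.grad)
+# tensor([[-1.5000,  0.0000],
+#         [ 1.5000,  3.0000]])
+```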
+ + + +A retropropagação (backpropagation) é usada para calcular os gradientes. Neste exemplo, o processo de retropropagação pode ser visto como o cálculo do gradiente $\frac{d\boldsymbol{a}}{d\boldsymbol{x}}$. Depois de calcular $\frac{d\boldsymbol{a}}{d\boldsymbol{x}}$ manualmente como uma validação, podemos descobrir que a execução de `a.backward()` nos dá o mesmo valor de *x.grad* como nosso cálculo. + + + +Aqui está o processo de cálculo da retropropagação manualmente: + + + +$$ +\begin{aligned} +a &= \frac{1}{4} (z_1 + z_2 + z_3 + z_4) \\ +z_i &= 3y_i^2 = 3(x_i-2)^2 \\ +\frac{da}{dx_i} &= \frac{1}{4}\times3\times2(x_i-2) = \frac{3}{2}x_i-3 \\ +x &= \begin{pmatrix} 1&2\\3&4\end{pmatrix} \\ +\left(\frac{da}{dx_i}\right)^\top &= \begin{pmatrix} 1.5-3&3-3\\[2mm]4.5-3&6-3\end{pmatrix}=\begin{pmatrix} -1.5&0\\[2mm]1.5&3\end{pmatrix} +\end{aligned} +$$ + + + +Sempre que você usa derivada parcial em PyTorch, obtém a mesma forma dos dados originais. Mas a coisa jacobiana correta deveria ser a transposição. + + + + +### Do básico ao mais louco + + + +Agora temos um vetor $1\times3$ $x$, atribua $y$ ao dobro de $x$ e continue dobrando $y$ até que sua norma seja menor que $1000$. Devido à aleatoriedade que temos para $x$, não podemos saber diretamente o número de iterações quando o procedimento termina. + + + +```python +x = torch.randn(3, requires_grad=True) + +y = x * 2 +i = 0 +while y.data.norm() < 1000: + y = y * 2 + i += 1 +``` + + + +No entanto, podemos inferir isso facilmente conhecendo os gradientes que temos. + + + +```python +gradients = torch.FloatTensor([0.1, 1.0, 0.0001]) +y.backward(gradients) + +print(x.grad) +tensor([1.0240e+02, 1.0240e+03, 1.0240e-01]) +print(i) +9 +``` + + + +Quanto à inferência, podemos usar `requires_grad=True` para rotular que queremos rastrear o acúmulo de gradiente conforme mostrado abaixo. Se omitirmos `requires_grad=True` na declaração de $x$ ou $w$ e chamar`backward ()`em $z$, haverá um erro de execução devido a não termos acumulação de gradiente em $x$ ou $w$. + + + +```python +# Tanto x quanto w que permitem o acúmulo de gradiente +x = torch.arange(1., n + 1, requires_grad=True) +w = torch.ones(n, requires_grad=True) +z = w @ x +z.backward() +print(x.grad, w.grad, sep='\n') +``` + + + +E, podemos usar o comando `with torch.no_grad()` para omitir o acúmulo de gradiente. + + + + +```python +x = torch.arange(1., n + 1) +w = torch.ones(n, requires_grad=True) + +# Todos os tensores do torch não terão gradientes acumulados +with torch.no_grad(): + z = w @ x + +try: + z.backward() # PyTorch vai lançar um erro aqui, pois z não tem acumulador de gradientes +except RuntimeError as e: + print('RuntimeError!!! >:[') + print(e) +``` + + + + +## Mais coisas - gradientes personalizados + + + +Além disso, em vez de operações numéricas básicas, podemos criar nossos próprios módulos / funções, que podem ser plugados no grafo da rede neural. O Jupyter Notebook pode ser encontrado [aqui](https://github.com/Atcold/pytorch-Deep-Learning/blob/master/extra/b-custom_grads.ipynb). + + + +Para fazer isso, precisamos herdar `torch.autograd.Function` e substituir as funções `forward ()` e `backward()`. Por exemplo, se quisermos treinar redes, precisamos obter a passagem pelo *forward* e saber as derivadas parciais da entrada em relação à saída, de forma que possamos usar este módulo em qualquer tipo de ponto do código. 
Então, usando retropropagação (regra da cadeia), podemos conectar a coisa em qualquer lugar na cadeia de operações, desde que conheçamos as derivadas parciais da entrada em relação à saída. + + + +Neste caso, existem três exemplos de ***módulos personalizados*** no *notebook*, os módulos `add`,`split` e `max`. Por exemplo, o módulo de adição personalizado: + + + +```python +# Custom addition module +class MyAdd(torch.autograd.Function): + + @staticmethod + def forward(ctx, x1, x2): + # ctx is a context where we can save + # computations for backward. + ctx.save_for_backward(x1, x2) + return x1 + x2 + + @staticmethod + def backward(ctx, grad_output): + x1, x2 = ctx.saved_tensors + grad_x1 = grad_output * torch.ones_like(x1) + grad_x2 = grad_output * torch.ones_like(x2) + # need to return grads in order + # of inputs to forward (excluding ctx) + return grad_x1, grad_x2 +``` + + + +Se adicionarmos duas coisas e obtivermos uma saída, precisamos sobrescrever a função forward desta forma. E quando descemos para fazer a propagação reversa, os gradientes são copiados em ambos os lados. Portanto, sobrescrevemos a função de retrocesso copiando. + + + +Para `split` e `max`, veja o código de como sobrescrevemos as funções de avanço e retrocesso no *bloco de notas*. Se viermos da mesma coisa e **Dividir**, ao descermos fazendo gradientes, devemos somar / somar. Para `argmax`, ele seleciona o índice da coisa mais alta, então o índice da mais alta deve ser $1$ enquanto os outros devem ser $0$. Lembre-se, de acordo com diferentes módulos personalizados, precisamos sobrescrever sua própria passagem do *forward* e como eles fazem os gradientes na função *backward*. diff --git a/docs/pt/week05/05.md b/docs/pt/week05/05.md new file mode 100644 index 000000000..10ca81a95 --- /dev/null +++ b/docs/pt/week05/05.md @@ -0,0 +1,40 @@ +--- +lang: pt +lang-ref: ch.05 +title: Semana 5 +translation-date: 05 Nov 2021 +translator: Felipe Schiavon +--- + + + +## Aula parte A + + + +Começamos apresentando o Gradiente Descendente. Discutimos a intuição e também falamos sobre como os tamanhos dos passos desempenham um papel importante para se chegar à solução. Em seguida, passamos para Gradiente Descendente Estocástico (SGD) e seu desempenho em comparação com Gradiente Descendente completo (Full Batch GD). Por fim, falamos sobre as atualizações de momento, especificamente as duas regras de atualização, a intuição por trás do momento e seu efeito na convergência. + + + +## Aula parte B + + + +Discutimos métodos adaptativos para SGD, como RMSprop e ADAM. Também falamos sobre camadas de normalização e seus efeitos no processo de treinamento das redes neurais. Finalmente, discutimos um exemplo do mundo real de redes neurais sendo usadas na indústria para tornar os exames de ressonância magnética mais rápidos e eficientes. + + + +## Prática + + + +Revisamos brevemente as multiplicações de matrizes e, em seguida, discutimos as convoluções. O ponto principal é que usamos kernels por empilhamento e deslocamento. Primeiro entendemos a convolução de uma dimensão (1D) manualmente e, em seguida, usamos o PyTorch para aprender a dimensão dos kernels e da largura da saída em exemplos de convoluções de uma (1D) e duas dimensões (2D). Além disso, usamos o PyTorch para aprender sobre como o funciona o gradiente automático e os gradientes customizados. 
+ diff --git a/docs/pt/week05/lecture05.sbv b/docs/pt/week05/lecture05.sbv new file mode 100644 index 000000000..8d3d55657 --- /dev/null +++ b/docs/pt/week05/lecture05.sbv @@ -0,0 +1,3572 @@ +0:00:00.000,0:00:04.410 +All right so as you can see today we don't have Yann. Yann is somewhere else + +0:00:04.410,0:00:09.120 +having fun. Hi Yann. Okay so today's that we have + +0:00:09.120,0:00:13.740 +Aaron DeFazio he's a research scientist at Facebook working mostly on + +0:00:13.740,0:00:16.619 +optimization he's been there for the past three years + +0:00:16.619,0:00:21.900 +and before he was a data scientist at Ambiata and then a student at the + +0:00:21.900,0:00:27.599 +Australian National University so why don't we give a round of applause to the + +0:00:27.599,0:00:37.350 +our speaker today I'll be talking about optimization and if we have time at the + +0:00:37.350,0:00:42.739 +end the death of optimization so these are the topics I will be covering today + +0:00:42.739,0:00:47.879 +now optimization is at the heart of machine learning and some of the things + +0:00:47.879,0:00:52.680 +are going to be talking about today will be used every day in your role + +0:00:52.680,0:00:56.640 +potentially as an applied scientist or even as a research scientist or a data + +0:00:56.640,0:01:01.590 +scientist and I'm gonna focus on the application of these methods + +0:01:01.590,0:01:05.850 +particularly rather than the theory behind them part of the reason for this + +0:01:05.850,0:01:10.260 +is that we don't fully understand all of these methods so for me to come up here + +0:01:10.260,0:01:15.119 +and say this is why it works I would be oversimplifying things but what I can + +0:01:15.119,0:01:22.320 +tell you is how to use them how we know that they work in certain situations and + +0:01:22.320,0:01:28.320 +what the best method may be to use to train your neural network and to + +0:01:28.320,0:01:31.770 +introduce you to the topic of optimization I need to start with the + +0:01:31.770,0:01:36.720 +worst method in the world gradient descent and I'll explain in a minute why + +0:01:36.720,0:01:43.850 +it's the worst method but to begin with we're going to use the most generic + +0:01:43.850,0:01:47.549 +formulation of optimization now the problems you're going to be considering + +0:01:47.549,0:01:51.659 +will have more structure than this but it's very useful useful notationally to + +0:01:51.659,0:01:56.969 +start this way so we talked about a function f now we're trying to prove + +0:01:56.969,0:02:03.930 +properties of our optimizer will assume additional structure on f but in + +0:02:03.930,0:02:07.049 +practice the structure in our neural networks essentially obey no of the + +0:02:07.049,0:02:09.239 +assumptions none of the assumptions people make in + +0:02:09.239,0:02:12.030 +practice I'm just gonna start with the generic F + +0:02:12.030,0:02:17.070 +and we'll assume it's continuous and differentiable even though we're already + +0:02:17.070,0:02:20.490 +getting into the realm of incorrect assumptions since the neural networks + +0:02:20.490,0:02:25.170 +most people are using in practice these days are not differentiable instead you + +0:02:25.170,0:02:29.460 +have a equivalent sub differential which you can essentially plug into all these + +0:02:29.460,0:02:33.570 +formulas and if you cross your fingers there's no theory to support this it + +0:02:33.570,0:02:38.910 +should work so the method of gradient descent is shown here it's an iterative + +0:02:38.910,0:02:44.790 +method so 
you start at a point k equals zero and at each step you update your + +0:02:44.790,0:02:49.410 +point and here we're going to use W to represent our current iterate either it + +0:02:49.410,0:02:54.000 +being the standard nomenclature for the point for your neural network this w + +0:02:54.000,0:03:00.420 +will be some large collection of weights one weight tensor per layer but notation + +0:03:00.420,0:03:03.540 +we we kind of squash the whole thing down to a single vector and you can + +0:03:03.540,0:03:09.000 +imagine just doing that literally by reshaping all your vectors to all your + +0:03:09.000,0:03:13.740 +tensors two vectors and just concatenate them together and this method is + +0:03:13.740,0:03:17.519 +remarkably simple all we do is we follow the direction of the negative gradient + +0:03:17.519,0:03:24.750 +and the rationale for this it's pretty simple so let me give you a diagram and + +0:03:24.750,0:03:28.410 +maybe this will help explain exactly why following the negative gradient + +0:03:28.410,0:03:33.570 +direction is a good idea so we don't know enough about our function to do + +0:03:33.570,0:03:38.760 +better this is a high level idea when we're optimizing a function we look at + +0:03:38.760,0:03:45.060 +the landscape the optimization landscape locally so by optimization landscape I + +0:03:45.060,0:03:49.230 +mean the domain of all possible weights of our network now we don't know what's + +0:03:49.230,0:03:53.459 +going to happen if we use any particular weights on your network we don't know if + +0:03:53.459,0:03:56.930 +it'll be better at the task we're trying to train it to or worse but we do know + +0:03:56.930,0:04:01.530 +locally is the point that are currently ad and the gradient and this gradient + +0:04:01.530,0:04:05.190 +provides some information about a direction which we can travel in that + +0:04:05.190,0:04:09.870 +may improve the performance of our network or in this case reduce the value + +0:04:09.870,0:04:14.340 +of our function were minimizing here in this set up this general setup + +0:04:14.340,0:04:19.380 +minimizing a function is essentially training in your network so minimizing + +0:04:19.380,0:04:23.520 +the loss will give you the best performance on your classification task + +0:04:23.520,0:04:26.550 +or whatever you're trying to do and because we only look at the world + +0:04:26.550,0:04:31.110 +locally here this gradient is basically the best information we have and you can + +0:04:31.110,0:04:36.270 +think of this as descending a valley where you start somewhere horrible some + +0:04:36.270,0:04:39.600 +pinkie part of the landscape the top of a mountain for instance and you travel + +0:04:39.600,0:04:43.590 +down from there and at each point you follow the direction near you that has + +0:04:43.590,0:04:50.040 +the most sorry the steepest descent and in fact the go the method of grading % + +0:04:50.040,0:04:53.820 +is sometimes called the method of steepest descent and this direction will + +0:04:53.820,0:04:57.630 +change as you move in the space now if you move locally by only an + +0:04:57.630,0:05:02.040 +infinitesimal amount assuming this smoothness that I mentioned before which + +0:05:02.040,0:05:04.740 +is actually not true in practice but we'll get to that assuming the + +0:05:04.740,0:05:08.280 +smoothness this small step will only change the gradient a small amount so + +0:05:08.280,0:05:11.820 +the direction you're traveling in is at least a good direction when you take + +0:05:11.820,0:05:18.120 +small steps and we 
essentially just follow this path taking as larger steps + +0:05:18.120,0:05:20.669 +as we can traversing the landscape until we reach + +0:05:20.669,0:05:25.229 +the valley at the bottom which is the minimizer our function now there's a + +0:05:25.229,0:05:30.690 +little bit more we can say for some problem classes and I'm going to use the + +0:05:30.690,0:05:34.950 +most simplistic problem class we can just because it's the only thing that I + +0:05:34.950,0:05:39.210 +can really do any mathematics for on one slide so bear with me + +0:05:39.210,0:05:44.580 +this class is quadratics so for a quadratic optimization problem we + +0:05:44.580,0:05:51.570 +actually know quite a bit just based off the gradient so firstly a gradient cuts + +0:05:51.570,0:05:55.440 +off an entire half of a space and now illustrate this here with this green + +0:05:55.440,0:06:02.130 +line so we're at that point there where the line starts near the Green Line we + +0:06:02.130,0:06:05.789 +know the solution cannot be in the rest of the space and this is not true from + +0:06:05.789,0:06:09.930 +your networks but it's still a genuinely a good guideline that we want to follow + +0:06:09.930,0:06:13.710 +the direction of negative gradient there could be better solutions elsewhere in + +0:06:13.710,0:06:17.910 +the space but finding them is is much harder than just trying to find the best + +0:06:17.910,0:06:21.300 +solution near to where we are so that's what we do we trying to find the best + +0:06:21.300,0:06:24.930 +solution near to where we are you could imagine this being the surface of the + +0:06:24.930,0:06:28.410 +earth where there are many hills and valleys and we can't hope to know + +0:06:28.410,0:06:31.020 +something about a mountain on the other side of the planet but we can certainly + +0:06:31.020,0:06:34.559 +look for the valley directly beneath the mountain where we currently are + +0:06:34.559,0:06:39.089 +in fact you can think of these functions here as being represented with these + +0:06:39.089,0:06:44.369 +topographic maps this is the same as topographic maps you use that you may be + +0:06:44.369,0:06:50.369 +familiar with from from the planet Earth where mountains are shown by these rings + +0:06:50.369,0:06:53.309 +now here the rings are representing descent so this is the bottom of the + +0:06:53.309,0:06:57.839 +valley we're showing here not the top of a hill at the center there so yes our + +0:06:57.839,0:07:02.459 +gradient knocks off a whole half of the possible space now it's very reasonable + +0:07:02.459,0:07:06.059 +then to go in the direction find this negative gradient because it's kind of + +0:07:06.059,0:07:10.199 +orthogonal to this line that cuts off after space and you can see that I've + +0:07:10.199,0:07:21.409 +got the indication of orthogonal you there the little la square so the + +0:07:21.409,0:07:25.319 +properties of gradient to spend a gradient descent depend greatly on the + +0:07:25.319,0:07:28.889 +structure of the problem for these quadratic problems it's actually + +0:07:28.889,0:07:32.549 +relatively simple to characterize what will happen so I'm going to give you a + +0:07:32.549,0:07:35.369 +little bit of an overview here and I'll spend a few minutes on this because it's + +0:07:35.369,0:07:38.339 +quite interesting and I'm hoping that those of you with some background in + +0:07:38.339,0:07:42.629 +linear algebra can follow this derivation but we're going to consider a + +0:07:42.629,0:07:47.309 +quadratic optimization problem now the problem stated in the 
gray box + +0:07:47.309,0:07:53.309 +at the top you can see that this is a quadratic where a is a positive definite + +0:07:53.309,0:07:58.769 +matrix we can handle broader classes of Quadra quadratics and this potentially + +0:07:58.769,0:08:04.649 +but the analysis is most simple in the positive definite case and the grating + +0:08:04.649,0:08:09.539 +of that function is very simple of course as Aw - b and u the solution of + +0:08:09.539,0:08:13.379 +this problem has a closed form in the case of quadratics it's as inverse of a + +0:08:13.379,0:08:20.179 +times B now what we do is we take the steps they're shown in the green box and + +0:08:20.179,0:08:26.519 +we just plug it into the distance from solution. So this || wₖ₊₁ – w*|| + +0:08:26.519,0:08:30.479 +is a distance from solution so we want to see how this changes over time and + +0:08:30.479,0:08:34.050 +the idea is that if we're moving closer to the solution over time the method is + +0:08:34.050,0:08:38.579 +converging so we start with that distance from solution to be plug in the + +0:08:38.579,0:08:44.509 +value of the update now with a little bit of rearranging we can pull + +0:08:45.050,0:08:50.950 +the terms we can group the terms together and we can write B as a inverse + +0:08:50.950,0:09:05.090 +so we can pull or we can pull the W star inside the inside the brackets there and + +0:09:05.090,0:09:11.960 +then we get this expression where it's matrix times the previous distance to + +0:09:11.960,0:09:16.040 +the solution matrix times previous distance solution now we don't know + +0:09:16.040,0:09:20.720 +anything about which directions this quadratic it varies most extremely in + +0:09:20.720,0:09:24.890 +but we can just not bound this very simply by taking the product of the + +0:09:24.890,0:09:28.850 +matrix as norm and the distance to the solution here this norm at the bottom so + +0:09:28.850,0:09:34.070 +that's the bottom line now now when you're considering matrix norms it's + +0:09:34.070,0:09:39.590 +pretty straightforward to see that you're going to have an expression where + +0:09:39.590,0:09:45.710 +the eigen values of this matrix are going to be 1 minus μ γ or 1 minus + +0:09:45.710,0:09:48.950 +L γ now the way I get this is I just look at what are the extreme eigen + +0:09:48.950,0:09:54.050 +values of a which we call them μ and L and by plugging these into the + +0:09:54.050,0:09:56.930 +expression we can see what the extreme eigen values will be of this combined + +0:09:56.930,0:10:03.050 +matrix I minus γ a and you have this absolute value here now you can optimize + +0:10:03.050,0:10:06.320 +this and get an optimal learning rate for the quadratics + +0:10:06.320,0:10:09.920 +but that optimal learning rate is not robust in practice you probably don't + +0:10:09.920,0:10:16.910 +want to use that so a simpler value you can use is 1/L. 
L being the largest + +0:10:16.910,0:10:22.420 +eigen value and this gives you this convergence rate of 1 – μ/L + +0:10:22.420,0:10:29.240 +reduction in distance to solution every step do we have any questions here I + +0:10:29.240,0:10:32.020 +know it's a little dense yes yes it's it's a substitution from in + +0:10:41.120,0:10:46.010 +that gray box do you see the bottom line on the gray box yeah that's that's just + +0:10:46.010,0:10:51.230 +a by definition we can solve the gradient so by taking the gradient to + +0:10:51.230,0:10:53.060 +zero if you see in that second line in the box + +0:10:53.060,0:10:55.720 +taking the gradient to zero this so replaced our gradient with zero and + +0:10:55.720,0:11:01.910 +rearranging you get the closed form solution to the problem here so the + +0:11:01.910,0:11:04.490 +problem with using that closed form solution in practice is we have to + +0:11:04.490,0:11:08.420 +invert a matrix and by using gradient descent we can solve this problem by + +0:11:08.420,0:11:12.920 +only doing matrix multiplications instead I'm not that I would suggest you + +0:11:12.920,0:11:15.560 +actually use this technique to solve the matrix as I mentioned before it's the + +0:11:15.560,0:11:20.750 +worst method in the world and the convergence rate of this method is + +0:11:20.750,0:11:25.100 +controlled by this new overall quantity now these are standard notations so + +0:11:25.100,0:11:27.950 +we're going from linear algebra where you talk about the min and Max eigen + +0:11:27.950,0:11:33.430 +value to the notation typically used in the field of optimization. + +0:11:33.430,0:11:39.380 +μ is smallest eigen value L being largest eigen value and this μ/L is the + +0:11:39.380,0:11:44.570 +inverse of the condition number condition number being L/μ this + +0:11:44.570,0:11:51.140 +gives you a broad characterization of how quickly optimization methods will + +0:11:51.140,0:11:57.440 +work on this problem and this these military terms they don't exist for + +0:11:57.440,0:12:02.870 +neural networks only in the very simplest situations do we have L exists + +0:12:02.870,0:12:06.740 +and we essentially never have μ existing nevertheless we want to talk + +0:12:06.740,0:12:10.520 +about network networks being polar conditioned and well conditioned and + +0:12:10.520,0:12:14.930 +poorly conditioned would typically be some approximation to L is very large + +0:12:14.930,0:12:21.260 +and well conditioned maybe L is very close to one so the step size we can + +0:12:21.260,0:12:27.770 +select in one summer training depends very heavily on these constants so let + +0:12:27.770,0:12:30.800 +me give you a little bit of an intuition for step sizes and this is very + +0:12:30.800,0:12:34.640 +important in practice I myself find a lot of my time is spent treating + +0:12:34.640,0:12:40.310 +learning rates and I'm sure you'll be involved in similar procedure so we have + +0:12:40.310,0:12:45.740 +a couple of situations that can occur if we use a learning rate that's too low + +0:12:45.740,0:12:49.310 +we'll find that we make steady progress towards the solution here we're + +0:12:49.310,0:12:56.480 +minimizing a little 1d quadratic and by steady progress I mean that every + +0:12:56.480,0:13:00.920 +iteration the gradient stays in buffer the same direction and you make similar + +0:13:00.920,0:13:05.420 +progress as you approach the solution this is slower than it is possible so + +0:13:05.420,0:13:09.910 +what you would ideally want to do is go straight to the solution for a quadratic + 
+0:13:09.910,0:13:12.650 +especially a 1d one like this that's going to be pretty straightforward + +0:13:12.650,0:13:16.340 +there's going to be an exact step size that'll get you all the way to solution + +0:13:16.340,0:13:20.810 +but more generally you can't do that and what you typically want to use is + +0:13:20.810,0:13:26.150 +actually a step size a bit above that optimal and this is for a number of + +0:13:26.150,0:13:29.570 +reasons it tends to be quicker in practice we have to be very very careful + +0:13:29.570,0:13:33.800 +because you get divergence and the term divergence means that the iterates will + +0:13:33.800,0:13:37.160 +get further away than from the solution instead of closer this will typically + +0:13:37.160,0:13:42.530 +happen if you use two larger learning rate unfortunately for us we want to use + +0:13:42.530,0:13:45.590 +learning rates as large as possible to get as quick learning as possible so + +0:13:45.590,0:13:50.180 +we're always at the edge of divergence in fact it's very rare that you'll see + +0:13:50.180,0:13:55.400 +that the gradients follow this nice trajectory where they all point the same + +0:13:55.400,0:13:58.670 +direction until you kind of reach the solution what almost always happens in + +0:13:58.670,0:14:02.960 +practice especially with gradient descent invariants is that you observe + +0:14:02.960,0:14:06.770 +this zigzagging behavior now we can't actually see zigzagging in million + +0:14:06.770,0:14:10.940 +dimensional spaces that we train your networks in but it's very evident in + +0:14:10.940,0:14:15.680 +these 2d plots of a quadratic so here I'm showing the level sets you can see + +0:14:15.680,0:14:20.560 +the numbers or the function value indicated there on the level sets and + +0:14:20.560,0:14:27.830 +when we use a learning rate that is good not optimal but good we get pretty close + +0:14:27.830,0:14:31.760 +to that blue dot the solution are for the 10 steps when we use a learning rate + +0:14:31.760,0:14:35.450 +that seems nicer in that it's not oscillating it's well-behaved when we + +0:14:35.450,0:14:38.330 +use such a learning rate we actually end up quite a bit further away from the + +0:14:38.330,0:14:42.830 +solution so it's a fact of life that we have to deal with these learning rates + +0:14:42.830,0:14:50.690 +that are stressfully high it's kind of like a race right you know no one wins a + +0:14:50.690,0:14:55.730 +a race by driving safely so our network training should be very comparable to + +0:14:55.730,0:15:01.940 +that so the core topic we want to talk about is actually it stochastic + +0:15:01.940,0:15:08.600 +optimization and this is the method that we will be using every day for training + +0:15:08.600,0:15:14.660 +neural networks in practice so it's de casting optimization is actually not so + +0:15:14.660,0:15:19.190 +different what we're gonna do is we're going to replace the gradients in our + +0:15:19.190,0:15:25.700 +gradient descent step with a stochastic approximation to the gradient now in a + +0:15:25.700,0:15:29.930 +neural network we can be a bit more precise here by stochastic approximation + +0:15:29.930,0:15:36.310 +what we mean is the gradient of the loss for a single data point single instance + +0:15:36.310,0:15:42.970 +you might want to call it so I've got that in the notation here this function + +0:15:42.970,0:15:49.430 +L is the loss of one day the point here the data point is indexed by AI and we + +0:15:49.430,0:15:52.970 +would write this typically in the optimization literature as the 
function + +0:15:52.970,0:15:57.380 +fᵢ and I'm going to use this notation but you should imagine fᵢ as being the + +0:15:57.380,0:16:02.390 +loss for a single instance I and here I'm using supervised learning setup + +0:16:02.390,0:16:08.330 +where we have data points I labels yᵢ so they points xᵢ labels yᵢ the full + +0:16:08.330,0:16:14.290 +loss for a function is shown at the top there it's a sum of all these fᵢ. Now + +0:16:14.290,0:16:17.600 +let me give you a bit more explanation for what we're doing here we're placing + +0:16:17.600,0:16:24.230 +this through gradient with a stochastic gradient this is a noisy approximation + +0:16:24.230,0:16:30.350 +and this is how it's often explained in the stochastic optimization setup so we + +0:16:30.350,0:16:36.440 +have this function the gradient and in our setup it's expected value is equal + +0:16:36.440,0:16:41.150 +to the full gradient so you can think of a stochastic gradient descent step as + +0:16:41.150,0:16:47.210 +being a full gradient step in expectation now this is not actually the + +0:16:47.210,0:16:50.480 +best way to view it because there's a lot more going on than that it's not + +0:16:50.480,0:16:58.310 +just gradient descent with noise so let me give you a little bit more detail but + +0:16:58.310,0:17:03.050 +first I let anybody ask any questions I have here before I move on yes + +0:17:03.050,0:17:08.420 +mm-hmm yeah I could talk a bit more about that but yes so you're right so + +0:17:08.420,0:17:12.500 +using your entire dataset to calculate a gradient is here what I mean by gradient + +0:17:12.500,0:17:17.720 +descent we also call that full batch gradient descent just to be clear now in + +0:17:17.720,0:17:22.280 +machine learning we virtually always use mini batches so people may use the name + +0:17:22.280,0:17:24.620 +gradient descent or something when they're really talking about stochastic + +0:17:24.620,0:17:29.150 +gradient descent and what you mentioned is absolutely true so there are some + +0:17:29.150,0:17:33.920 +difficulties of training neural networks using very large batch sizes and this is + +0:17:33.920,0:17:37.010 +understood to some degree and I'll actually explain that on the very next + +0:17:37.010,0:17:39.230 +slide so let me let me get to to your point first + +0:17:39.230,0:17:45.679 +so the point the answer to your question is actually the third point here the + +0:17:45.679,0:17:50.780 +noise in stochastic gradient descent induces this phenomena known as + +0:17:50.780,0:17:54.770 +annealing and the diagram directly to the right of it illustrates this + +0:17:54.770,0:18:00.260 +phenomena so your network training landscapes have a bumpy structure to + +0:18:00.260,0:18:05.330 +them where there are lots of small minima that are not good minima that + +0:18:05.330,0:18:09.320 +appear on the path to the good minima so the theory that a lot of people + +0:18:09.320,0:18:13.760 +subscribe to is that SGD in particular the noise induced in the gradient + +0:18:13.760,0:18:18.919 +actually helps the optimizer to jump over these bad minima and the theory is + +0:18:18.919,0:18:22.669 +that these bad minima are quite small in the space and so they're easy to jump + +0:18:22.669,0:18:27.380 +over we're good minima that results in good performance around your own network + +0:18:27.380,0:18:34.070 +are larger and harder to skip so does this answer your question yes so besides + +0:18:34.070,0:18:39.440 +that annealing point of view there's there's actually a few other reasons so + 
+0:18:39.440,0:18:45.559 +we have a lot of redundancy in the information we get from each terms + +0:18:45.559,0:18:51.679 +gradient and using stochastic gradient lets us exploit this redundancy in a lot + +0:18:51.679,0:18:56.870 +of situations the gradient computed on a few hundred examples is almost as good + +0:18:56.870,0:19:01.460 +as a gradient computed on the full data set and often thousands of times cheaper + +0:19:01.460,0:19:05.300 +depending on your problem so it's it's hard to come up with a compelling reason + +0:19:05.300,0:19:09.320 +to use gradient descent given the success of stochastic gradient descent + +0:19:09.320,0:19:13.809 +and this is part of the reason why disgusted gradient said is one of the + +0:19:15.659,0:19:19.859 +best misses we have but gradient descent is one of the worst and in fact early + +0:19:19.859,0:19:23.580 +stages the correlation is remarkable this disgusted gradient can be + +0:19:23.580,0:19:28.499 +correlated up to a coefficient of 0.999 correlation coefficient to the true + +0:19:28.499,0:19:33.869 +gradient at those early steps of optimization so I want to briefly talk + +0:19:33.869,0:19:38.179 +about a something you need to know about I think Yann has already mentioned this + +0:19:38.179,0:19:43.259 +briefly but in practice we don't use individual instances in stochastic + +0:19:43.259,0:19:48.749 +gradient descent how we use mini batches of instances so I'm just using some + +0:19:48.749,0:19:52.649 +notation here but everybody uses different notation for mini batching so + +0:19:52.649,0:19:56.970 +you shouldn't get too attached to the notation but essentially at every step + +0:19:56.970,0:20:03.149 +you have some batch here I'm going to call it B an index with I for step and + +0:20:03.149,0:20:09.299 +you basically use the average of the gradients over this mini batch which is + +0:20:09.299,0:20:13.470 +a subset of your data rather than a single instance or the full full batch + +0:20:13.470,0:20:19.799 +now almost everybody will use this mini batch selected uniformly at random + +0:20:19.799,0:20:23.009 +some people use with replacement sampling and some people use without + +0:20:23.009,0:20:26.669 +with replacement sampling but the differences are not important for this + +0:20:26.669,0:20:31.729 +purposes you can use either and there's a lot of advantages to mini batching so + +0:20:31.729,0:20:35.220 +there's actually some good impelling theoretical reasons to not be any batch + +0:20:35.220,0:20:38.609 +but the practical reasons are overwhelming part of these practical + +0:20:38.609,0:20:43.950 +reasons are computational we make ammonia may utilize our hardware say at + +0:20:43.950,0:20:47.489 +1% efficiency when training some of the network's we use if we try and use + +0:20:47.489,0:20:51.239 +single instances and we get the most efficient utilization of the hardware + +0:20:51.239,0:20:55.979 +with batch sizes often in the hundreds if you're training on the typical + +0:20:55.979,0:20:59.999 +ImageNet data set for in for instance you don't use batch sizes less than + +0:20:59.999,0:21:08.429 +about 64 to get good efficiency maybe can go down to 32 but another important + +0:21:08.429,0:21:13.080 +application is distributed training and this is really becoming a big thing so + +0:21:13.080,0:21:17.309 +as was mentioned before people were recently able to Train ImageNet days + +0:21:17.309,0:21:21.639 +said that normally takes two days to train and not so long ago it took + +0:21:21.639,0:21:25.779 +in a week to train in 
only one hour and the way they did that was using very + +0:21:25.779,0:21:29.889 +large mini batches and along with using large many batches there are some tricks + +0:21:29.889,0:21:34.059 +that you need to use to get it to work it's probably not something that you + +0:21:34.059,0:21:37.149 +would cover an introductory lecture so I encourage you to check out that paper if + +0:21:37.149,0:21:40.409 +you're interested it's ImageNet in one hour + +0:21:40.409,0:21:45.279 +leaves face book authors I can't recall the first author at the moment as a side + +0:21:45.279,0:21:51.459 +note there are some situations where you need to do full batch optimization do + +0:21:51.459,0:21:54.759 +not use gradient descent in that situation I can't emphasize it enough to + +0:21:54.759,0:21:59.950 +not use gradient ascent ever if you have full batch data by far the most + +0:21:59.950,0:22:03.249 +effective method that is kind of plug-and-play you don't to think about + +0:22:03.249,0:22:08.859 +it is known as l-bfgs it's accumulation of 50 years of optimization research and + +0:22:08.859,0:22:12.519 +it works really well torch's implementation is pretty good + +0:22:12.519,0:22:17.379 +but the Scipy implementation causes some filtering code that was written 15 years + +0:22:17.379,0:22:23.440 +ago that is pretty much bulletproof so because they were those so that's a good + +0:22:23.440,0:22:26.619 +question classically you do need to use the full + +0:22:26.619,0:22:28.809 +data set now PyTorch implementation actually + +0:22:28.809,0:22:34.209 +supports using mini battery now this is somewhat of a gray area in that there's + +0:22:34.209,0:22:37.899 +really no theory to support the use of this and it may work well for your + +0:22:37.899,0:22:43.839 +problem or it may not so it could be worth trying I mean you want to use your + +0:22:43.839,0:22:49.929 +whole data set for each gradient evaluation or probably more likely since + +0:22:49.929,0:22:52.359 +it's very rarely you want to do that probably more likely you're solving some + +0:22:52.359,0:22:56.889 +other optimization problem that isn't isn't training in your network but maybe + +0:22:56.889,0:23:01.869 +some ancillary problem related and you need to solve an optimization problem + +0:23:01.869,0:23:06.669 +without this data point structure that doesn't summer isn't a sum of data + +0:23:06.669,0:23:12.239 +points yeah hopefully it was another question yep oh yes the question was + +0:23:12.239,0:23:16.869 +Yann recommended we used mini batches equal to the size of the number of + +0:23:16.869,0:23:20.079 +classes we have in our data set why is that reasonable that was the question + +0:23:20.079,0:23:23.889 +the answer is that we want any vectors to be representative of the full data + +0:23:23.889,0:23:28.329 +set and typically each class is quite distinct from the other classes in its + +0:23:28.329,0:23:33.490 +properties so about using a mini batch that contains on average + +0:23:33.490,0:23:36.850 +one instance from each class in fact we can enforce that explicitly although + +0:23:36.850,0:23:39.820 +it's not necessary by having an approximately equal to that + +0:23:39.820,0:23:44.590 +size we can assume it has the kind of structure of a food gradient so you + +0:23:44.590,0:23:49.870 +capture a lot of the correlations in the data you see with the full gradient and + +0:23:49.870,0:23:54.279 +it's a good guide especially if you're using training on CPU where you're not + +0:23:54.279,0:23:58.690 +constrained too much by hardware 
efficiency here when training on energy + +0:23:58.690,0:24:05.080 +on a CPU batch size is not critical for hardware utilization it's problem + +0:24:05.080,0:24:09.370 +dependent I would always recommend mini batching I don't think it's worth trying + +0:24:09.370,0:24:13.899 +size one as a starting point if you try to eke out small gains maybe that's + +0:24:13.899,0:24:19.779 +worth exploring yes there was another question so in the annealing example so + +0:24:19.779,0:24:24.760 +the question was why is the lost landscape so wobbly and this is this is + +0:24:24.760,0:24:31.600 +actually something that is very a very realistic depiction of actual law slams + +0:24:31.600,0:24:37.630 +codes for neural networks they're incredibly in the sense that they have a + +0:24:37.630,0:24:41.860 +lot of hills and valleys and this is something that is actively researched + +0:24:41.860,0:24:47.140 +now what we can say for instance is that there is a very large number of good + +0:24:47.140,0:24:52.720 +minima and and so hills and valleys we know this because your networks have + +0:24:52.720,0:24:56.590 +this combinatorial aspect to them you can reaper ammeter eyes a neural network + +0:24:56.590,0:25:00.309 +by shifting all the weights around and you can get in your work you'll know if + +0:25:00.309,0:25:04.750 +it outputs exactly the same output for whatever task you're looking at with all + +0:25:04.750,0:25:07.419 +these weights moved around and that correspondence essentially to a + +0:25:07.419,0:25:12.460 +different location in parameter space so given that there's an exponential number + +0:25:12.460,0:25:16.270 +of these possible ways of rearranging the weights to get the same network + +0:25:16.270,0:25:18.940 +you're going to end up with the space that's incredibly spiky exponential + +0:25:18.940,0:25:24.789 +number of these spikes now the reason why these these local minima appear that + +0:25:24.789,0:25:27.580 +is something that is still active research so I'm not sure I can give you + +0:25:27.580,0:25:32.890 +a great answer there but they're definitely observed in practice and what + +0:25:32.890,0:25:39.000 +I can say is they appear to be less of a problem we've very + +0:25:39.090,0:25:42.810 +like close to state-of-the-art networks so these local minima were considered + +0:25:42.810,0:25:47.940 +big problems 15 years ago but so much at the moment people essentially never hit + +0:25:47.940,0:25:52.350 +them in practice when using kind of recommended parameters and things like + +0:25:52.350,0:25:55.980 +that when you use very large batches you can run into these problems it's not + +0:25:55.980,0:25:59.490 +even clear that the the poor performance when using large batches is even + +0:25:59.490,0:26:03.900 +attributable to these larger minima to these local minima so this is yes to + +0:26:03.900,0:26:08.550 +ongoing research yes the problem is you can't really see this local structure + +0:26:08.550,0:26:10.920 +because we're in this million dimensional space it's not a good way to + +0:26:10.920,0:26:15.090 +see it so yeah I don't know if people might have explored that already I'm not + +0:26:15.090,0:26:18.840 +familiar with papers on that but I bet someone has looked at it so you might + +0:26:18.840,0:26:23.520 +want to google that yeah so a lot of the advances in neural network design have + +0:26:23.520,0:26:27.420 +actually been in reducing this bumpiness in a lot of ways so this is part of the + +0:26:27.420,0:26:30.510 +reason why it's not considered a huge problem 
anymore whether it was it was + +0:26:30.510,0:26:35.960 +considered a big problem in the past there's any other questions yes so it's + +0:26:35.960,0:26:41.550 +it is hard to see but there are certain things you can do that we make the the + +0:26:41.550,0:26:46.830 +peaks and valleys smaller certainly and by rescaling some parts the neural + +0:26:46.830,0:26:50.010 +network you can amplify certain directions the curvature in certain + +0:26:50.010,0:26:54.320 +directions can be stretched and squashed the particular innovation residual + +0:26:54.320,0:27:00.000 +connections that were mentioned they're very easy to see that they smooth out + +0:27:00.000,0:27:03.600 +the the loss in fact you can kind of draw two line between two points in the + +0:27:03.600,0:27:06.570 +space and you can see what happens along that line that's really the best way we + +0:27:06.570,0:27:10.170 +have a visualizing million dimensional spaces so I turn him into one dimension + +0:27:10.170,0:27:13.200 +and you can see that it's that it's a much nicer between these two points + +0:27:13.200,0:27:17.370 +whatever two points you choose when using these residual connections I'll be + +0:27:17.370,0:27:21.570 +talking all about dodging or later in the lecture so yeah if hopefully I'll + +0:27:21.570,0:27:24.870 +answer that question without you having to ask it again but we'll see + +0:27:24.870,0:27:31.560 +thanks any other questions yes so l-bfgs excellent method it's it's kind of a + +0:27:31.560,0:27:34.650 +constellation of optimization researchers that we still use SGD a + +0:27:34.650,0:27:40.470 +method invented in the 60s or earlier is still state of the art but there has + +0:27:40.470,0:27:44.880 +been some innovation in fact only a couple years later but there was some + +0:27:44.880,0:27:49.180 +innovation since the invention of sed and one of these innovations is + +0:27:49.180,0:27:54.730 +and I'll talk about another later so momentum it's a trick + +0:27:54.730,0:27:57.520 +that you should pretty much always be using when you're using stochastic + +0:27:57.520,0:28:00.880 +gradient descent it's worth be going into this in a little bit of detail + +0:28:00.880,0:28:04.930 +you'll often be tuning the momentum parameter and your network and it's + +0:28:04.930,0:28:09.340 +useful to understand what it's actually doing when you're tuning up so part of + +0:28:09.340,0:28:15.970 +the problem with momentum it's very misunderstood and this can be explained + +0:28:15.970,0:28:18.760 +by the fact that there's actually three different ways of writing momentum that + +0:28:18.760,0:28:21.790 +look completely different but turn out to be equivalent I'm only going to + +0:28:21.790,0:28:25.120 +present two of these ways because the third way is not as well known but is + +0:28:25.120,0:28:30.070 +actually in my opinion the correct way to view it I don't talk about my + +0:28:30.070,0:28:32.470 +research here so we'll talk about how it's actually implemented in the + +0:28:32.470,0:28:37.390 +packages you'll be using and this first form here is what's actually implemented + +0:28:37.390,0:28:42.040 +in PyTorch and other software that you'll be using here we maintain two variables + +0:28:42.040,0:28:47.650 +now you'll see lots of papers using different notation here P is the + +0:28:47.650,0:28:51.580 +notation used in physics for momentum and it's very common to use that also as + +0:28:51.580,0:28:55.720 +the momentum variable when talking about sed with momentum so I'll be following + 
+0:28:55.720,0:29:01.000 +that convention so instead of having a single iterate we now have to Eretz P + +0:29:01.000,0:29:06.940 +and W and at every step we update both and this is quite a simple update so the + +0:29:06.940,0:29:13.060 +P update involves adding to the old P and instead of adding exactly to the old + +0:29:13.060,0:29:16.720 +P we kind of damp the old P we reduce it by multiplying it by a constant that's + +0:29:16.720,0:29:21.310 +worse than one so reduce the old P and here I'm using β̂ as the constant + +0:29:21.310,0:29:24.880 +there so that would probably be 0.9 in practice a small amount of damping and + +0:29:24.880,0:29:32.650 +we add to that the new gradient so P is kind of this accumulated gradient buffer + +0:29:32.650,0:29:38.170 +you can think of where new gradients come in at full value and past gradients + +0:29:38.170,0:29:42.490 +are reduced at each step by a certain factor usually 0.9 which used to reduce + +0:29:42.490,0:29:47.910 +reduced so the buffer tends to be a some sort of running sum of gradients and + +0:29:47.910,0:29:53.080 +it's basically we just modify this to custer gradient two-step descent step by + +0:29:53.080,0:29:56.440 +using this P instead of the negative gradient instead of the gradient sorry + +0:29:56.440,0:30:00.260 +using P instead of the in the update since the two line formula + +0:30:00.260,0:30:05.790 +it may be better to understand this by the second form that I put below this is + +0:30:05.790,0:30:09.600 +equivalent you've got a map the β with a small transformation so it's not + +0:30:09.600,0:30:12.750 +exactly the same β between the two methods but it's practically the same + +0:30:12.750,0:30:20.300 +for in practice so these are essentially the same up to reap romanization and + +0:30:21.260,0:30:25.530 +this film I think is maybe clearer this form is called the stochastic heavy ball + +0:30:25.530,0:30:31.170 +method and here our update still includes the gradient but we're also + +0:30:31.170,0:30:40.020 +adding on a multiplied copy of the past direction we traveled in now what does + +0:30:40.020,0:30:43.320 +this mean what are we actually doing here so it's actually not too difficult + +0:30:43.320,0:30:49.170 +to visualize and I'm going to kind of use a visualization from a distilled + +0:30:49.170,0:30:52.710 +publication you can see the dress at the bottom there and I disagree with a lot + +0:30:52.710,0:30:55.620 +of what they talked about in that document but I like the visualizations + +0:30:55.620,0:31:02.820 +so let's use had and I'll explain why I disagreed some regards later but it's + +0:31:02.820,0:31:07.440 +quite simple so you can think of momentum as the physical process and I + +0:31:07.440,0:31:10.650 +mention those of you have done introductory physics courses would have + +0:31:10.650,0:31:17.340 +covered this so momentum is the property of something to keep moving in the + +0:31:17.340,0:31:21.330 +direction that's currently moving in all right if you're familiar with Newton's + +0:31:21.330,0:31:24.240 +laws things want to keep going in the direction they're going and this is + +0:31:24.240,0:31:28.860 +momentum and when you do this mapping the physics the gradient is kind of a + +0:31:28.860,0:31:34.020 +force that is pushing you're literate which by this analogy is a heavy ball + +0:31:34.020,0:31:39.860 +it's pushing this heavy ball at each point so rather than making dramatic + +0:31:39.860,0:31:44.030 +changes in the direction we travel at every step which is shown in that left + 
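As a rough illustration of the two equivalent ways of writing momentum described above, here is a minimal NumPy sketch (the function names and the toy quadratic are made up for illustration; in practice you would just pass `momentum=` to PyTorch's built-in `torch.optim.SGD`). With the buffer initialised to zero, these two particular recursions produce the same iterates for the same β and γ; other common writings (for example with a (1 − β) factor inside the buffer update) need the small β/γ remapping mentioned in the lecture.

```python
import numpy as np

def momentum_buffer_form(w, grad, gamma=0.05, beta=0.9, steps=100):
    """PyTorch-style momentum: keep a buffer p of damped past gradients."""
    p = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        p = beta * p + g        # damp the old buffer, add the new gradient at full value
        w = w - gamma * p       # step along the buffer instead of the raw gradient
    return w

def heavy_ball_form(w, grad, gamma=0.05, beta=0.9, steps=100):
    """Stochastic heavy ball: gradient step plus a multiple of the previous step."""
    w_prev = w.copy()
    for _ in range(steps):
        g = grad(w)
        w_next = w - gamma * g + beta * (w - w_prev)
        w_prev, w = w, w_next
    return w

# toy quadratic f(w) = 0.5 * w^T A w, purely for illustration
A = np.diag([1.0, 10.0])
grad = lambda w: A @ w
w0 = np.array([1.0, 1.0])
print(momentum_buffer_form(w0.copy(), grad))
print(heavy_ball_form(w0.copy(), grad))
```

Both runs print the same near-zero iterate, which is one quick way to convince yourself these two writings are the same method.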
+0:31:44.030,0:31:48.480 +diagram instead of making these dramatic changes we're going to make kind of a + +0:31:48.480,0:31:51.480 +bit more modest changes so when we realize we're going in the wrong + +0:31:51.480,0:31:55.740 +direction we kind of do a u-turn instead of putting the hand brake on and + +0:31:55.740,0:31:59.440 +swinging around it turns out in a lot of practical + +0:31:59.440,0:32:01.810 +problems this gives you a big improvement so here you can see you're + +0:32:01.810,0:32:06.280 +getting much closer to the solution by the end of it with much less oscillation + +0:32:06.280,0:32:10.840 +and you can see this oscillation so it's kind of a fact of life if you're using + +0:32:10.840,0:32:14.650 +gradient descent type methods so here we talk about momentum on top of gradient + +0:32:14.650,0:32:18.550 +descent in the visualization you're gonna get this oscillation it's just a + +0:32:18.550,0:32:22.240 +property of gradient descent no way to get rid of it without modifying the + +0:32:22.240,0:32:27.490 +method and we're meant to them to some degree dampens this oscillation I've got + +0:32:27.490,0:32:30.760 +another visualization here which will kind of give you an intuition for how + +0:32:30.760,0:32:34.660 +this β parameter controls things now the Department of these to be greater + +0:32:34.660,0:32:39.280 +than zero if it's equal to zero you distr in gradient descent and it's gotta + +0:32:39.280,0:32:43.330 +be less than one otherwise the Met everything blows up as you start + +0:32:43.330,0:32:45.970 +including past gradients with more and more weight over times it's gotta be + +0:32:45.970,0:32:54.070 +between zero and one and typical values range from you know small 0.25 up to + +0:32:54.070,0:32:59.230 +like 0.99 so in practice you can get pretty close to one and what happens is + +0:32:59.230,0:33:09.130 +the smaller values they result in you're changing direction quicker okay so in + +0:33:09.130,0:33:12.820 +this diagram you can see on the left with the small β you as soon as you + +0:33:12.820,0:33:16.120 +get close to the solution you kind of change direction pretty rapidly and head + +0:33:16.120,0:33:19.900 +towards a solution when you use these larger βs it takes longer for you to + +0:33:19.900,0:33:23.530 +make this dramatic turn you can think of it as a car with a bad turning circle + +0:33:23.530,0:33:26.170 +takes you quite a long time to get around that corner and head towards + +0:33:26.170,0:33:31.180 +solution now this may seem like a bad thing but actually in practice this + +0:33:31.180,0:33:35.110 +significantly dampens the oscillations that you get from gradient descent and + +0:33:35.110,0:33:40.450 +that's the nice property of it now in terms of practice I can give you some + +0:33:40.450,0:33:45.760 +pretty clear guidance here you pretty much always want to use momentum it's + +0:33:45.760,0:33:48.820 +pretty hard to find problems where it's actually not beneficial to some degree + +0:33:48.820,0:33:52.960 +now part of the reason for this is it's just an extra parameter now typically + +0:33:52.960,0:33:55.870 +when you take some method and just add more parameters to it you can usually + +0:33:55.870,0:34:01.000 +find some value of that parameter that makes us slightly better now that is + +0:34:01.000,0:34:04.330 +sometimes the case here but often these improvements from using momentum are + +0:34:04.330,0:34:08.810 +actually quite substantial and using a momentum value of point nine is + +0:34:08.810,0:34:13.610 +really a default 
value used in machine learning quite often and often in some + +0:34:13.610,0:34:19.010 +situations 0.99 may be better so I would recommend trying both values if you have + +0:34:19.010,0:34:24.770 +time otherwise just try point nine but I have to do a warning the way momentum is + +0:34:24.770,0:34:29.300 +stated in this expression if you look at it carefully when we increase the + +0:34:29.300,0:34:36.440 +momentum we kind of increase the step size now it's not the step size of the + +0:34:36.440,0:34:39.380 +current gradient so the current gradient is included in the step with the same + +0:34:39.380,0:34:43.399 +strengths but past gradients become included in the step with a higher + +0:34:43.399,0:34:48.290 +strength when you increase momentum now when you write momentum in other forms + +0:34:48.290,0:34:53.179 +this becomes a lot more obvious so this firm kind of occludes that but what you + +0:34:53.179,0:34:58.820 +should generally do when you change momentum you want to change it so that + +0:34:58.820,0:35:04.310 +you have your step size divided by one minus β is your new step size so if + +0:35:04.310,0:35:07.790 +your old step size was using a certain B do you want to map it to that equation + +0:35:07.790,0:35:11.690 +then map it back to get the the new step size now this may be very modest change + +0:35:11.690,0:35:16.400 +but if you're going from momentum 0.9 to momentum 0.99 you may need to reduce + +0:35:16.400,0:35:20.480 +your learning rate by a factor of 10 approximately so just be wary of that + +0:35:20.480,0:35:22.850 +you can't expect to keep the same learning rate and change the momentum + +0:35:22.850,0:35:27.260 +parameter at wallmart work now I want to go into a bit of detail about why + +0:35:27.260,0:35:31.880 +momentum works is very misunderstood and the explanation you'll see in that + +0:35:31.880,0:35:38.570 +Distilled post is acceleration and this is certainly a contributor to the + +0:35:38.570,0:35:44.380 +performance of momentum now acceleration is a topic yes if you've got a question + +0:35:44.380,0:35:48.170 +the question was is there a big difference between using momentum and + +0:35:48.170,0:35:54.890 +using a mini batch of two and there is so momentum has advantages in for when + +0:35:54.890,0:35:59.150 +using gradient descent as well as stochastic gradient descent so in fact + +0:35:59.150,0:36:03.110 +this acceleration explanation were about to use applies both in the stochastic + +0:36:03.110,0:36:07.520 +and non stochastic case so no matter what batch size you're going to use the + +0:36:07.520,0:36:13.100 +benefits of momentum still are shown now it also has benefits in the stochastic + +0:36:13.100,0:36:17.000 +case as well which I'll cover in a slide or two so the answer is it's quite + +0:36:17.000,0:36:19.579 +distinct from batch size and you shouldn't complete them + +0:36:19.579,0:36:22.459 +learn it like really you should be changing your learning rate when you + +0:36:22.459,0:36:26.239 +change your bat size rather than changing the momentum and for very large + +0:36:26.239,0:36:30.380 +batch sizes there's a clear relationship between learning rate and batch size but + +0:36:30.380,0:36:34.729 +for small batch sizes it's not clear so it's problem dependent any other + +0:36:34.729,0:36:38.599 +questions before I move on on momentum yes yes it's it's just blow up so it's + +0:36:38.599,0:36:42.979 +actually in the in the in the physics interpretation it's conservation of + +0:36:42.979,0:36:48.499 +momentum would be exactly equal 
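A tiny helper for the rescaling rule just mentioned: keep the effective step size γ/(1 − β) constant when you change the momentum parameter. The function name is made up; the arithmetic is simply the rule from the lecture.

```python
def rescale_lr(lr_old, beta_old, beta_new):
    """Keep lr / (1 - beta) constant when changing the momentum parameter."""
    return lr_old * (1.0 - beta_new) / (1.0 - beta_old)

# going from momentum 0.9 to 0.99 shrinks the learning rate by roughly a factor of 10
print(rescale_lr(0.1, beta_old=0.9, beta_new=0.99))   # ~0.01
```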
to one now that's not good because if you're in

0:36:48.499,0:36:51.890
a world with no friction then you drop a heavy ball somewhere it's gonna keep

0:36:51.890,0:36:56.479
moving forever which is not good so we need some damping and this is where

0:36:56.479,0:37:01.069
the physics interpretation breaks down so you do need some damping now you

0:37:01.069,0:37:05.209
can imagine if you use a larger value than one those past gradients get

0:37:05.209,0:37:09.410
amplified every step so in fact the first gradient you evaluate in your

0:37:09.410,0:37:13.940
network is not relevant information content wise later in optimization but

0:37:13.940,0:37:16.910
if β were larger than 1 it would dominate the step that you're

0:37:16.910,0:37:21.170
using does that answer your question yeah ok any other questions about

0:37:21.170,0:37:26.359
momentum before we move on yes for a particular value of β it's

0:37:26.359,0:37:30.859
strictly equivalent it's not very hard you should be able to do it in like

0:37:30.859,0:37:38.359
two lines if you try and do the equivalence yourself no the βs are

0:37:38.359,0:37:40.910
not quite the same but the γ is the same that's why I use the same

0:37:40.910,0:37:45.319
notation for it oh yes so that's what I mentioned yes so when you change β

0:37:45.319,0:37:48.349
you want to rescale your learning rate so that the learning rate divided by one minus

0:37:48.349,0:37:52.369
β stays the same now I'm not sure if it appears in this form it could be a

0:37:52.369,0:37:55.969
mistake but I think I'm okay here I think it's not in this formula but yeah

0:37:55.969,0:37:59.269
definitely when you change β you need to change the learning rate as well

0:37:59.269,0:38:09.300
to keep things balanced yeah oh the other averaging form it's probably

0:38:09.300,0:38:13.830
not worth going over but you can think of it as momentum is basically changing

0:38:13.830,0:38:17.850
the point that you evaluate the gradient at in the standard form you evaluate the

0:38:17.850,0:38:22.230
gradient at this W point in the averaging form you take a running

0:38:22.230,0:38:25.890
average of the points you've been evaluating the gradient at and you

0:38:25.890,0:38:30.630
evaluate the gradient at that point so it's basically instead of averaging gradients you

0:38:30.630,0:38:37.530
average points essentially yes yes so acceleration now this is

0:38:37.530,0:38:43.260
something you can spend a whole career studying and it's somewhat poorly

0:38:43.260,0:38:47.070
understood now if you try and read Nesterov's original work on it now

0:38:47.070,0:38:53.520
Nesterov is kind of the grandfather of modern optimization and practically half

0:38:53.520,0:38:56.460
the methods we use are named after him to some degree which can be confusing

0:38:56.460,0:39:01.740
at times and in the 80s he came up with this formulation he didn't write it in

0:39:01.740,0:39:04.650
this form he wrote it in another form which people realized a while later

0:39:04.650,0:39:09.450
could be written in this form and his analysis is also very opaque and

0:39:09.450,0:39:15.590
originally written in Russian which doesn't help for understanding fortunately

0:39:15.590,0:39:21.180
those nice people at the NSA translated all of the Russian literature back then so

0:39:21.180,0:39:27.330
we have access to them and it's actually a very small modification of
+ +0:39:27.330,0:39:31.890 +the momentum step but I think that small modification belittles what it's + +0:39:31.890,0:39:36.600 +actually doing it's really not the same method at all what I can say is with + +0:39:36.600,0:39:41.400 +Nesterov Swimmer momentum if you very carefully choose these constants you can + +0:39:41.400,0:39:46.050 +get what's known as accelerated convergence now this doesn't apply in + +0:39:46.050,0:39:49.560 +your networks but for convex problems I won't go into details of convexity but + +0:39:49.560,0:39:52.230 +some of you may know what that means it's kind of a simple structure but + +0:39:52.230,0:39:55.740 +convex problems it's a radically improved convergence rate from this + +0:39:55.740,0:39:59.940 +acceleration but only for very carefully chosen constants and you really can't + +0:39:59.940,0:40:03.030 +choose these carefully ahead of time so you've got to do quite a large search + +0:40:03.030,0:40:05.640 +over your parameters your hyper parameters sorry to find the right + +0:40:05.640,0:40:10.710 +constants to get that acceleration what I can say is this actually occurs for + +0:40:10.710,0:40:14.779 +quadratics when using regular momentum and this is confused a lot of people + +0:40:14.779,0:40:18.559 +so you'll see a lot of people say that momentum is an accelerated method it's + +0:40:18.559,0:40:23.449 +excited only for quadratics and even then it's it's a little bit iffy I would + +0:40:23.449,0:40:27.529 +not recommend using it for quadratics use conjugate gradients or some new + +0:40:27.529,0:40:33.499 +methods that have been developed over the last few years and this is + +0:40:33.499,0:40:36.919 +definitely a contributing factor to our momentum works so well in practice and + +0:40:36.919,0:40:42.499 +there's definitely some acceleration going on but this acceleration is hard + +0:40:42.499,0:40:46.669 +to realize when you have stochastic gradients now when you look at what + +0:40:46.669,0:40:51.679 +makes acceleration work noise really kills it and it's it's hard to believe + +0:40:51.679,0:40:55.549 +that it's the main factor contributing to the performance but it's certainly + +0:40:55.549,0:40:59.989 +there and the the still post I mentioned attributes or the performance of + +0:40:59.989,0:41:02.689 +momentum to acceleration but I wouldn't go that quite that far but it's + +0:41:02.689,0:41:08.390 +definitely a contributing factor but probably the practical and provable + +0:41:08.390,0:41:13.669 +reason why acceleration why knows sorry why momentum helps is noise smoothing + +0:41:13.669,0:41:21.619 +and this is very intuitive momentum averages gradients in a sense we keep + +0:41:21.619,0:41:25.099 +this running buffer gradients that we use as a step instead of individual + +0:41:25.099,0:41:30.259 +gradients this is kind of a form of averaging and it turns out that when you + +0:41:30.259,0:41:33.229 +use s to D without momentum to prove anything at all about it + +0:41:33.229,0:41:37.449 +you actually have to work with the average of all the points you visited + +0:41:37.449,0:41:42.380 +you can get really weak bounds on the last point that you ended up at but + +0:41:42.380,0:41:45.349 +really you've got to work with this average of points and this is suboptimal + +0:41:45.349,0:41:48.529 +like we never want to actually take this average in practice it's heavily + +0:41:48.529,0:41:52.099 +weighted with points that we visited a long time ago which may be irrelevant + +0:41:52.099,0:41:55.159 +and in fact this averaging doesn't 
work very well in practice for neural + +0:41:55.159,0:41:59.150 +networks it's really only important for convex problems but nevertheless it's + +0:41:59.150,0:42:03.380 +necessary to analyze regular s2d and one of the remarkable facts about momentum + +0:42:03.380,0:42:09.019 +is actually this averaging is no longer theoretically necessary so essentially + +0:42:09.019,0:42:14.509 +momentum adds smoothing dream optimization that makes it makes us so + +0:42:14.509,0:42:19.459 +the last point you visit is still a good approximation to the solution with SGG + +0:42:19.459,0:42:23.329 +really you want to average a whole bunch of last points you've seen in order to + +0:42:23.329,0:42:26.700 +get a good approximation to the solution now let me illustrate that + +0:42:26.700,0:42:31.190 +here so this is this is a very typical example of what happens when using STD + +0:42:31.190,0:42:36.329 +STD at the beginning you make great progress the gradient is essentially + +0:42:36.329,0:42:39.960 +almost the same as the stochastic gradient so first few steps you make + +0:42:39.960,0:42:44.490 +great progress towards solution but then you end up in this ball now recall here + +0:42:44.490,0:42:47.579 +that's a valley that we're heading down so this ball here is kind of the floor + +0:42:47.579,0:42:53.550 +of the valley and you kind of bounce around in this floor and the most common + +0:42:53.550,0:42:56.579 +solution of this is if you reduce your learning rate you'll bounce around + +0:42:56.579,0:43:01.290 +slower not exactly a great solution but it's one way to handle it but when you + +0:43:01.290,0:43:04.710 +use s to deal with momentum you can kind of smooth out this bouncing around and + +0:43:04.710,0:43:08.160 +you kind of just kind of wheel around now the path is not always going to be + +0:43:08.160,0:43:12.300 +this corkscrew tile path it's actually quite random you could kind of wobble + +0:43:12.300,0:43:15.990 +left and right but when I seeded it with 42 this is what it spread out so that's + +0:43:15.990,0:43:20.790 +what I'm using here you typically get this corkscrew you get this cork scoring + +0:43:20.790,0:43:24.660 +for this set of parameters and yeah I think this is a good explanation so some + +0:43:24.660,0:43:27.960 +combination of acceleration and noise smoothing is why momentum works + +0:43:27.960,0:43:33.180 +oh yes yes so I should say that when we inject noise here the gradient may not + +0:43:33.180,0:43:37.470 +even be the right direction to travel in fact it could be in the opposite + +0:43:37.470,0:43:40.800 +direction from where you want to go and this is why you kind of bounce around in + +0:43:40.800,0:43:46.410 +the valley there so in fact the gray you can see here that the first step with + +0:43:46.410,0:43:49.980 +SUV is practically orthogonal to the level set there that's because it is + +0:43:49.980,0:43:52.770 +such a good step at the beginning but once you get further down it can point + +0:43:52.770,0:44:00.300 +in pretty much any direction vaguely around the solution so yesterday with + +0:44:00.300,0:44:03.540 +momentum is currently state of the art optimization method for a lot of machine + +0:44:03.540,0:44:08.730 +learning problems so you'll probably be using it in your course for a lot of + +0:44:08.730,0:44:12.990 +problems but there has been some other innovations over the years and these are + +0:44:12.990,0:44:16.829 +particularly useful for poorly conditioned problems now as I mentioned + +0:44:16.829,0:44:19.770 +earlier in the lecture some 
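The noise-smoothing effect described above can be seen in a toy simulation. This is only a sketch, not the lecturer's actual demo: the quadratic, the noise scale and the seed of 42 are made up, and the learning rates are rescaled by 1 − β so that the two runs have the same effective step size.

```python
import numpy as np

rng = np.random.default_rng(42)
A = np.diag([1.0, 10.0])                    # a mildly poorly conditioned 2D quadratic

def noisy_grad(w):
    return A @ w + rng.normal(scale=2.0, size=2)   # stochastic gradient = true gradient + noise

def last_iterate(beta, gamma, steps=500):
    w, p = np.array([5.0, 5.0]), np.zeros(2)
    for _ in range(steps):
        p = beta * p + noisy_grad(w)
        w = w - gamma * p
    return w

# same effective step size gamma / (1 - beta) for both runs
print("SGD           :", np.linalg.norm(last_iterate(beta=0.0, gamma=0.02)))
print("SGD + momentum:", np.linalg.norm(last_iterate(beta=0.9, gamma=0.002)))
```

On most seeds the momentum run's final iterate sits noticeably closer to the optimum at zero, which is the "last point is a good approximation to the solution" property mentioned above.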
problems have this kind of well condition + +0:44:19.770,0:44:22.530 +property that we can't really characterize for neural networks but we + +0:44:22.530,0:44:27.450 +can measure it by the test that if s to D works then it's well conditioned + +0:44:27.450,0:44:31.470 +eventually there doesent works and if I must be walking poorly conditioned so we + +0:44:31.470,0:44:34.410 +have other methods we can handle we can use to handle this in some + +0:44:34.410,0:44:39.690 +situations and these generally are called adaptive methods now you need to + +0:44:39.690,0:44:43.500 +be a little bit careful because what are you adapting to people in literature use + +0:44:43.500,0:44:51.780 +this nomenclature for adapting learning rates adapting momentum parameters but + +0:44:51.780,0:44:56.339 +in our our situation we're talk about a specific type of adaptivity roman this + +0:44:56.339,0:45:03.780 +adaptivity is individual learning rates now what I mean by that so in the + +0:45:03.780,0:45:06.869 +simulation I already showed you a stochastic gradient descent + +0:45:06.869,0:45:10.619 +I used a global learning rate by that I mean every single rate in your network + +0:45:10.619,0:45:16.800 +is updated using an equation with the same γ now γ could vary over + +0:45:16.800,0:45:21.720 +time step so you used γ K in the notation but often you use a fixed + +0:45:21.720,0:45:26.310 +camera for quite a long time but for adaptive methods we want to adapt a + +0:45:26.310,0:45:30.240 +learning rate for every weight individually and we want to use + +0:45:30.240,0:45:37.109 +information we get from gradients for each weight to adapt this so this seems + +0:45:37.109,0:45:39.900 +like the obvious thing to do and people have been trying to get this stuff to + +0:45:39.900,0:45:43.200 +work for decades and we're kind of stumbled upon some methods that work and + +0:45:43.200,0:45:48.510 +some that don't but I want to ask for questions here if there's any any + +0:45:48.510,0:45:53.040 +explanation needed so I can say that it's not entirely clear why you need to + +0:45:53.040,0:45:56.880 +do this right if your network is well conditioned you don't need to do this + +0:45:56.880,0:46:01.349 +potentially but often the network's we use in practice have very different + +0:46:01.349,0:46:05.069 +structure in different parts of the network so for instance the early parts + +0:46:05.069,0:46:10.619 +of your convolutional neural network may be very shallow convolutional layers on + +0:46:10.619,0:46:14.849 +large images later in the network you're going to be doing convolutions with + +0:46:14.849,0:46:18.359 +large numbers of channels on small images now these operations are very + +0:46:18.359,0:46:21.150 +different and there's no reason to believe that a learning rate that works + +0:46:21.150,0:46:26.310 +well for one would work well for the other and this is why the adaptive + +0:46:26.310,0:46:28.140 +learning rates can be useful any questions here + +0:46:28.140,0:46:32.250 +yes so unfortunately there's no good definition for neural networks we + +0:46:32.250,0:46:35.790 +couldn't measure it even if there was a good definition so I'm going to use it + +0:46:35.790,0:46:40.109 +in a vague sense that it actually doesn't works and it's poorly + +0:46:40.109,0:46:42.619 +conditioned yes so in the sort of quadratic case if + +0:46:45.830,0:46:51.380 +you recall I have an explicit definition of this condition number L over μ. 
+ +0:46:51.380,0:46:55.910 +L being maximized in value μ being smallest eigen value and yeah the large + +0:46:55.910,0:47:00.140 +of this gap between largest larger and smaller eigen value the worst condition + +0:47:00.140,0:47:03.320 +it is this does not imply if in your network so that μ does not exist in + +0:47:03.320,0:47:07.610 +your networks L still has some information in it but I wouldn't say + +0:47:07.610,0:47:12.800 +it's a determining factor there's just a lot going on so there are some ways that + +0:47:12.800,0:47:15.619 +your looks behave a lot like simple problems but there are other ways where + +0:47:15.619,0:47:23.090 +we just kind of hang wave and say that they like them yeah yeah yes so for this + +0:47:23.090,0:47:25.910 +particular network this is a network that actually isn't too poorly + +0:47:25.910,0:47:30.920 +conditioned already in fact this is a VDD 16 which is practically the best net + +0:47:30.920,0:47:34.490 +method best network when you had a train before the invention of certain + +0:47:34.490,0:47:37.369 +techniques to improve conditioning so this is almost the best of first + +0:47:37.369,0:47:40.910 +condition you can actually get and there are a lot of the structure of this + +0:47:40.910,0:47:45.140 +network is actually defined by this conditioning like we double the number + +0:47:45.140,0:47:48.680 +of channels after certain steps because that seems to result in networks at a + +0:47:48.680,0:47:53.600 +world condition rather than any other reason but it's certainly what you can + +0:47:53.600,0:47:57.170 +say is that weights very light the network have very large effect on the + +0:47:57.170,0:48:02.630 +output that very last layer there with if there are 4096 weights in it that's a + +0:48:02.630,0:48:06.400 +very small number of whites this network has millions of whites I believe those + +0:48:06.400,0:48:10.640 +4096 weights have a very strong effect on the output because they directly + +0:48:10.640,0:48:14.450 +dictate that output and for that reason you generally want to use smaller + +0:48:14.450,0:48:19.190 +learning rates for those whereas yeah weights early in the network some of + +0:48:19.190,0:48:21.770 +them might have a large effect but especially when you've initialized + +0:48:21.770,0:48:25.910 +network of randomly they typically will have a smaller effect of those those + +0:48:25.910,0:48:29.840 +earlier weights and this is very hand wavy and the reason why is because we + +0:48:29.840,0:48:33.859 +really don't understand this well enough for me to give you a precise precise + +0:48:33.859,0:48:41.270 +statement here 120 million weights in this network actually so yeah so that + +0:48:41.270,0:48:47.710 +last layer is like 4096 by 4096 matrix so + +0:48:47.950,0:48:53.510 +yeah okay any other questions yeah yes I would recommend only using them when + +0:48:53.510,0:48:59.120 +your problem doesn't have a structure that decomposes into a large sum of + +0:48:59.120,0:49:04.880 +similar things okay yeah that's a bit of a mouthful but sut works well when you + +0:49:04.880,0:49:09.830 +have an objective that is a sum where each term of the sum is is vaguely + +0:49:09.830,0:49:14.990 +comparable so in machine learning each sub term in this sum is a loss of one + +0:49:14.990,0:49:18.290 +data point and these have very similar structures individual losses that's a + +0:49:18.290,0:49:21.080 +hand-wavy sense that they have very similar structure because of course each + +0:49:21.080,0:49:25.220 +data point could be quite 
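One concrete way to act on the observation above, that late fully connected layers often want smaller learning rates than early convolutional layers, is PyTorch's per-parameter-group learning rates. The split below assumes torchvision's VGG-16 layout (`features` for the convolutions, `classifier` for the fully connected layers); the particular values are only illustrative.

```python
import torch
import torchvision

model = torchvision.models.vgg16()
optimizer = torch.optim.SGD(
    [
        # early convolutional layers: larger learning rate
        {"params": model.features.parameters(), "lr": 1e-2},
        # final fully connected layers (including the big 4096-wide ones): smaller learning rate
        {"params": model.classifier.parameters(), "lr": 1e-3},
    ],
    momentum=0.9,
)
```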
different but when your problem doesn't have a large + +0:49:25.220,0:49:30.440 +sum as the main part of its structure then l-bfgs would be useful that's the + +0:49:30.440,0:49:35.840 +general answer I doubt you make use of it in this course l-bfgs doubt it that + +0:49:35.840,0:49:40.660 +it can be very handy for small networks you can experiment around with it with + +0:49:40.660,0:49:44.720 +the leaner v network or something which I'm sure you probably use in this course + +0:49:44.720,0:49:51.230 +you could experiment with l-bfgs probably and have some success there one + +0:49:51.230,0:49:58.670 +of the kind of founding techniques in modern your network training is rmsprop + +0:49:58.670,0:50:03.680 +and i'm going to talk about this year now at some point kind of the standard + +0:50:03.680,0:50:07.640 +practice in the field of optimization is in research and optimization kind of + +0:50:07.640,0:50:10.640 +diverged with what people were actually doing when training neural networks and + +0:50:10.640,0:50:14.150 +this IMS prop was kind of the fracturing point where we all went off in different + +0:50:14.150,0:50:19.820 +directions and this rmsprop is usually attributed to Geoffrey Hinton slides + +0:50:19.820,0:50:23.380 +which he then attributes to an unpublished paper from someone else + +0:50:23.380,0:50:28.790 +which is really unsatisfying to be citing someone slides in a paper but + +0:50:28.790,0:50:34.400 +anyway it's a method that has some it has no proof behind why it works but + +0:50:34.400,0:50:38.050 +it's similar to methods that you can prove work so that's at least something + +0:50:38.050,0:50:43.520 +and it works pretty well in practice and that's why I look if we use it so I want + +0:50:43.520,0:50:46.310 +to give you that kind of introduction before what I explained what it actually + +0:50:46.310,0:50:51.020 +is and rmsprop stands for root mean squared propagation + +0:50:51.020,0:50:54.579 +this was from the era where everything we do the fuel networks we + +0:50:54.579,0:50:58.690 +called propagation such-and-such like back prop which now we call deep so it + +0:50:58.690,0:51:02.920 +probably be called Armas deep propyl something if it was embedded now and + +0:51:02.920,0:51:08.470 +it's a little bit of a modification so it still to line algorithm but a little + +0:51:08.470,0:51:11.200 +bit different so I'm gonna go over these terms in some detail because it's + +0:51:11.200,0:51:19.450 +important to understand this now we we keep around this V buffer now this is + +0:51:19.450,0:51:22.720 +not a momentum buffer okay so we using different notation here he is doing + +0:51:22.720,0:51:27.069 +something different and I'm going to use some notation that that some people + +0:51:27.069,0:51:30.760 +really hates but I think it's convenient I'm going to write the element wise + +0:51:30.760,0:51:36.040 +square of a vector just by squaring the vector this is not really confusing + +0:51:36.040,0:51:40.390 +notationally in almost all situations but it's a nice way to write it so here + +0:51:40.390,0:51:43.480 +I'm writing the gradient squared I really mean you take every element in + +0:51:43.480,0:51:47.109 +that vector million element vector or whatever it is and square each element + +0:51:47.109,0:51:51.309 +individually so this video update is what's known as an exponential moving + +0:51:51.309,0:51:55.480 +average I do I have a quick show of hands who's familiar with exponential + +0:51:55.480,0:51:59.890 +moving averages I want to know if I need to talk 
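Picking up the L-BFGS remark above: for a small problem you can try it directly through `torch.optim.LBFGS`, which expects a closure that re-evaluates the loss. The tiny least-squares problem below is made up purely for illustration.

```python
import torch

torch.manual_seed(0)
X, y = torch.randn(100, 5), torch.randn(100)        # a tiny least-squares problem
w = torch.zeros(5, requires_grad=True)

optimizer = torch.optim.LBFGS([w], lr=1.0, max_iter=50)

def closure():
    optimizer.zero_grad()
    loss = ((X @ w - y) ** 2).mean()
    loss.backward()
    return loss

optimizer.step(closure)   # L-BFGS calls the closure several times per step
print(closure().item())
```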
about it in some more seems like + +0:51:59.890,0:52:03.270 +it's probably need to explain it in some depth but in expose for a moving average + +0:52:03.270,0:52:08.020 +it's a standard way this has been used for many many decades across many fields + +0:52:08.020,0:52:14.650 +for maintaining an average that are the quantity that may change over time okay + +0:52:14.650,0:52:19.630 +so when a quantity is changing over time we need to put larger weights on newer + +0:52:19.630,0:52:24.210 +values because they provide more information and one way to do that is + +0:52:24.210,0:52:30.700 +down weight old values exponentially and when you do this exponentially you mean + +0:52:30.700,0:52:36.880 +that the weight of an old value from say ten steps ago will have weight alpha to + +0:52:36.880,0:52:41.109 +the ten in your thing so that's where the exponential comes in the output of + +0:52:41.109,0:52:43.900 +the ten now it's that's not really in the notation and in the notation at each + +0:52:43.900,0:52:49.390 +step we just download the pass vector by this alpha constant and as if you can + +0:52:49.390,0:52:53.440 +imagine in your head things in that buffer the V buffer that are very old at + +0:52:53.440,0:52:57.760 +each step they get downloaded by alpha at every step and just as before alpha + +0:52:57.760,0:53:01.359 +here is something between zero and one so we can't use values greater than one + +0:53:01.359,0:53:04.280 +there so this will damp those all values until they no longer + +0:53:04.280,0:53:08.180 +the exponential moving average so this method keeps an exponential moving + +0:53:08.180,0:53:12.860 +average of the second moment I mean non-central second moment so we do not + +0:53:12.860,0:53:18.920 +subtract off the mean here the PyTorch implementation has a switch where you + +0:53:18.920,0:53:22.370 +can tell it to subtract off the mean play with that if you like it'll + +0:53:22.370,0:53:25.460 +probably perform very similarly in practice there's a paper on that I'm + +0:53:25.460,0:53:30.620 +sure but the original method does not subtract off the mean there and we use + +0:53:30.620,0:53:35.000 +this second moment to normalize the gradient and we do this element-wise so + +0:53:35.000,0:53:39.560 +all this notation is element wise every element of the gradient is divided + +0:53:39.560,0:53:43.310 +through by the square root of the second moment estimate and if you think that + +0:53:43.310,0:53:47.090 +this square root is really being the standard deviation even though this is + +0:53:47.090,0:53:50.990 +not a central moment so it's not actually the standard deviation it's + +0:53:50.990,0:53:55.580 +useful to think of it that way and the name you know root means square is kind + +0:53:55.580,0:54:03.590 +of alluding to that division by the root of the mean of the squares and the + +0:54:03.590,0:54:07.820 +important technical detail here you have to add epsilon here for the annoying + +0:54:07.820,0:54:12.950 +problem that when you divide 0 by 0 everything breaks so you occasionally + +0:54:12.950,0:54:16.310 +have zeros in your network there are some situations where it makes a + +0:54:16.310,0:54:20.060 +difference outside of when your gradients zero but you absolutely do + +0:54:20.060,0:54:25.310 +need that epsilon in your method and you'll see this is a recurring theme all + +0:54:25.310,0:54:29.900 +of these no adaptive methods basically you've got to put an epsilon when your + +0:54:29.900,0:54:34.040 +the divide something just to avoiding to avoid dividing by 0 
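Written out, the RMSprop update just described is only a few lines. This is a from-scratch sketch with illustrative constants (in practice you would call `torch.optim.RMSprop`); `g**2`, the square root and the division are all element-wise, matching the notation in the lecture.

```python
import numpy as np

def rmsprop_step(w, g, v, gamma=1e-3, alpha=0.99, eps=1e-7):
    """One RMSprop update: exponential moving average of the squared gradient,
    then divide the step through by its square root, element-wise."""
    v = alpha * v + (1.0 - alpha) * g**2       # EMA of the (non-central) second moment
    w = w - gamma * g / (np.sqrt(v) + eps)     # epsilon avoids dividing by zero
    return w, v

# usage: carry v along between steps, initialised at zero
w, v = np.ones(3), np.zeros(3)
w, v = rmsprop_step(w, g=np.array([0.1, -2.0, 0.0]), v=v)
```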
and typically that + +0:54:34.040,0:54:38.690 +epsilon will be close to your machine Epsilon I don't know if so if you're + +0:54:38.690,0:54:41.750 +familiar with that term but it's something like 10 to a negative 7 + +0:54:41.750,0:54:45.710 +sometimes 10 to the negative 8 something of that order so really only has a small + +0:54:45.710,0:54:49.790 +effect on the value before I talk about why this method works I want to talk + +0:54:49.790,0:54:53.150 +about the the most recent kind of innovation on top of this method and + +0:54:53.150,0:54:57.560 +that is the method that we actually use in practice so rmsprop is sometimes + +0:54:57.560,0:55:03.170 +still use but more often we use a method notice atom an atom means adaptive + +0:55:03.170,0:55:10.790 +moment estimation so Adam is rmsprop with momentum so I spent 20 minutes + +0:55:10.790,0:55:13.760 +telling you I should use momentum so I'm going to say well you should put it on + +0:55:13.760,0:55:18.420 +top of rmsprop as well there's always of doing that at least + +0:55:18.420,0:55:21.569 +half a dozen in this papers for each of them but Adam is the one that caught on + +0:55:21.569,0:55:25.770 +and the way we do have a mention here is we actually convert the momentum update + +0:55:25.770,0:55:32.609 +to an exponential moving average as well now this may seem like a quantity + +0:55:32.609,0:55:37.200 +qualitatively different update like doing momentum by moving average in fact + +0:55:37.200,0:55:40.829 +what we were doing before is essentially equivalent to that you can work out some + +0:55:40.829,0:55:44.490 +constants where you can get a method where you use a moving exponential + +0:55:44.490,0:55:47.760 +moving average momentum that is equivalent to the regular mentum so + +0:55:47.760,0:55:50.460 +don't think of this moving average momentum as being anything different + +0:55:50.460,0:55:54.000 +than your previous momentum but it has a nice property that you don't need to + +0:55:54.000,0:55:57.660 +change the learning rate when you mess with the β here which I think it's a + +0:55:57.660,0:56:03.780 +big improvement so yeah we added momentum of the gradient and just as + +0:56:03.780,0:56:07.980 +before with rmsprop we have this exponential moving average of the + +0:56:07.980,0:56:13.050 +squared gradient on top of that we basically just plug in this moving + +0:56:13.050,0:56:17.010 +average gradient where we had the gradient in the previous update so it's + +0:56:17.010,0:56:20.579 +not too complicated now if you actually read the atom paper you'll see a whole + +0:56:20.579,0:56:23.880 +bunch of additional notation the algorithm is like ten lines long instead + +0:56:23.880,0:56:28.859 +of three and that is because they add something called bias correction this is + +0:56:28.859,0:56:34.260 +actually not necessary but it'll help a little bit so everybody uses it and all + +0:56:34.260,0:56:39.780 +it does is it increases the value of these parameters during the early stages + +0:56:39.780,0:56:43.319 +of optimization and the reason you do that is because you initialize this + +0:56:43.319,0:56:48.150 +momentum buffer at zero typically now imagine your initial initializer at zero + +0:56:48.150,0:56:52.440 +then after the first step we're going to be adding to that a value of 1 minus + +0:56:52.440,0:56:56.700 +β times the gradient now 1 minus β will typically be 0.1 because we + +0:56:56.700,0:57:00.599 +typically use momentum point 9 so when we do that our gradient step is actually + +0:57:00.599,0:57:05.069 +using 
a learning rate 10 times smaller because this momentum buffer has a tenth

0:57:05.069,0:57:08.670
of a gradient in it and that's undesirable so all the bias

0:57:08.670,0:57:13.890
correction does is just multiply the step by 10 in those early iterations and

0:57:13.890,0:57:18.420
the bias correction formula is just basically the correct way to do that to

0:57:18.420,0:57:23.030
result in a step that's unbiased and unbiased here means just the expectation

0:57:23.030,0:57:28.420
of the momentum buffer is the gradient so it's nothing too mysterious

0:57:28.420,0:57:32.960
yeah don't think of it as being like a huge addition although I do think that

0:57:32.960,0:57:37.190
the Adam paper was the first one to use bias correction in a mainstream

0:57:37.190,0:57:40.310
optimization method I don't know if they invented it but it certainly pioneered

0:57:40.310,0:57:44.990
the bias correction so these methods work really well in practice let me just

0:57:44.990,0:57:48.590
give you a quick empirical comparison here now this quadratic I'm using is a

0:57:48.590,0:57:52.220
diagonal quadratic so it's a little bit of cheating to use a method that works well

0:57:52.220,0:57:55.060
on diagonal quadratics on a diagonal quadratic but I'm gonna do that anyway

0:57:55.060,0:58:00.320
and you can see that the direction they travel is quite an improvement over SGD

0:58:00.320,0:58:03.950
so in this simplified problem SGD kind of goes in the wrong direction at the

0:58:03.950,0:58:08.780
beginning whereas rmsprop basically heads in the right direction now the problem

0:58:08.780,0:58:15.140
is rmsprop suffers from noise just as regular SGD without momentum suffers so you

0:58:15.140,0:58:19.490
get this situation where it kind of bounces around the optimum quite significantly

0:58:19.490,0:58:24.710
and just as with SGD with momentum when we add momentum with Adam we get the same

0:58:24.710,0:58:29.210
kind of improvement where we kind of corkscrew or sometimes reverse corkscrew

0:58:29.210,0:58:32.240
around the solution that kind of thing and this gets you to the solution

0:58:32.240,0:58:35.960
quicker and it means that the last point you're currently at is a good estimate

0:58:35.960,0:58:39.370
of the solution not a noisy estimate but it's kind of the best estimate you have

0:58:39.370,0:58:45.350
so I would generally recommend using Adam over rmsprop and it's certainly the case

0:58:45.350,0:58:50.750
that for some problems you just can't use SGD Adam is necessary for training

0:58:50.750,0:58:53.690
some of the neural networks we're using say our language models for our language

0:58:53.690,0:58:57.290
models it's necessary for training the network which I'm going to talk about near

0:58:57.290,0:59:03.580
the end of this presentation and generally if I have to

0:59:07.490,0:59:10.670
recommend something you should use you should try either SGD with momentum

0:59:10.670,0:59:14.690
or Adam as your go-to methods for optimizing your networks so there's some

0:59:14.690,0:59:19.430
practical advice for you personally I hate Adam because I'm an optimization

0:59:19.430,0:59:24.920
researcher and the theory in their paper is wrong this has been shown

0:59:24.920,0:59:29.360
recently so the method in fact does not converge and you can show this on very

0:59:29.360,0:59:32.430
simple test problems so one of the most heavily
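For reference, here is the Adam update with the bias correction just described, written from scratch with the usual default constants; this is a sketch for intuition, and in practice you would call `torch.optim.Adam`. The step counter `t` starts at 1, which is what makes the 1/(1 − β^t) factors boost the early steps.

```python
import numpy as np

def adam_step(w, g, m, v, t, gamma=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: RMSprop plus an exponential-moving-average form of momentum,
    with bias correction for the zero-initialised buffers."""
    m = beta1 * m + (1.0 - beta1) * g          # EMA of the gradient (the momentum buffer)
    v = beta2 * v + (1.0 - beta2) * g**2       # EMA of the squared gradient
    m_hat = m / (1.0 - beta1**t)               # bias correction: at t=1 this multiplies by 10
    v_hat = v / (1.0 - beta2**t)
    w = w - gamma * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# usage on a toy quadratic ||w||^2, carrying m, v and t between steps
w, m, v = np.ones(3), np.zeros(3), np.zeros(3)
for t in range(1, 101):
    g = 2.0 * w
    w, m, v = adam_step(w, g, m, v, t)
```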
modern machine learning actually doesn't work in a lot of + +0:59:35.820,0:59:40.740 +situations this is unsatisfying and it's I'm kind of an ongoing research question + +0:59:40.740,0:59:44.670 +of the best way to fix this I don't think just modifying Adam a little bit + +0:59:44.670,0:59:47.160 +to try and fix it is really the best solution I think it's got some more + +0:59:47.160,0:59:52.620 +fundamental problems but I won't go into any detail for that there is a very + +0:59:52.620,0:59:56.460 +practical problem they need to talk about though Adam is known to sometimes + +0:59:56.460,1:00:01.140 +give worse generalization error I think Yara's talked in detail about + +1:00:01.140,1:00:08.730 +generalization error do I go over that so yeah generalization error is the + +1:00:08.730,1:00:14.100 +error on data that you didn't train your model on basically so your networks are + +1:00:14.100,1:00:17.370 +very heavily parameter over parameterised and if you train them to + +1:00:17.370,1:00:22.200 +give zero loss on the data you trained it on they won't give zero loss on other + +1:00:22.200,1:00:27.240 +data points data that it's never seen before and this generalization error is + +1:00:27.240,1:00:32.310 +that error typically the best thing we can do is minimize the loss and the data + +1:00:32.310,1:00:37.080 +we have but sometimes that's suboptimal and it turns out when you use Adam it's + +1:00:37.080,1:00:40.860 +quite common on particularly on image problems that you get worst + +1:00:40.860,1:00:46.140 +generalization error than when you use STD and people attribute this to a whole + +1:00:46.140,1:00:50.400 +bunch of different things it may be finding those bad local minima that I + +1:00:50.400,1:00:54.180 +mentioned earlier the ones that are smaller it's kind of unfortunate that + +1:00:54.180,1:00:57.840 +the better your optimization method the more likely it is to hit those small + +1:00:57.840,1:01:02.460 +local minima because they're closer to where you currently are and kind of it's + +1:01:02.460,1:01:06.510 +the goal of an optimization method to find you the closest minima in a sense + +1:01:06.510,1:01:10.620 +these local optimization methods we use but there's a whole bunch of other + +1:01:10.620,1:01:16.950 +reasons that you can attribute to it less noise in Adam perhaps it could be + +1:01:16.950,1:01:20.100 +some structure maybe these methods where you rescale + +1:01:20.100,1:01:23.070 +space like this have this fundamental problem where they give worst + +1:01:23.070,1:01:26.430 +generalization we don't really understand this but it's important to + +1:01:26.430,1:01:30.390 +know that this may be a problem or in some cases it's not to say that it will + +1:01:30.390,1:01:33.450 +give horrible performance you'll still get a pretty good neuron that workout at + +1:01:33.450,1:01:37.200 +the end and what I can tell you is the language models that we trained at + +1:01:37.200,1:01:41.890 +Facebook use methods like atom or atom itself and they + +1:01:41.890,1:01:46.960 +much better results than if you use STD and there's a kind of a small thing that + +1:01:46.960,1:01:51.490 +won't affect you at all I would expect but with Adam you have to maintain these + +1:01:51.490,1:01:56.410 +three buffers where's sed you have two buffers of parameters this doesn't + +1:01:56.410,1:01:59.230 +matter except when you're training a model that's like 12 gigabytes and then + +1:01:59.230,1:02:02.790 +it really becomes a problem I don't think you'll encounter that in practice + 
+1:02:02.790,1:02:06.280 +and surely there's a little bit iffy so you gotta trim two parameters instead of + +1:02:06.280,1:02:13.060 +one so yeah that's practical advice use Adam arrest you do but onto something + +1:02:13.060,1:02:18.220 +that is also sup is also kind of a core thing oh sorry have a question yes yes + +1:02:18.220,1:02:22.600 +you absolutely correct but typically I guess the question the question was + +1:02:22.600,1:02:28.000 +weren't using a small epsilon in the denominator result in blow-up certainly + +1:02:28.000,1:02:32.440 +if the numerator was equal to roughly one than dividing through by ten to the + +1:02:32.440,1:02:37.900 +negative seven could be catastrophic and this this is a legitimate question but + +1:02:37.900,1:02:45.250 +typically in order for the V buffer to have very small values the gradient also + +1:02:45.250,1:02:48.340 +has to have had very small values you can see that from the way the + +1:02:48.340,1:02:53.110 +exponential moving averages are updated so in fact it's not a practical problem + +1:02:53.110,1:02:56.860 +when this when this V is incredibly small the momentum is also very small + +1:02:56.860,1:03:01.180 +and when you're dividing small thing by a small thing you don't get blow-up oh + +1:03:01.180,1:03:08.050 +yeah so the question is should I you buy an SUV and atom separately at the same + +1:03:08.050,1:03:11.860 +time and just see which one works better in fact that is pretty much what we do + +1:03:11.860,1:03:14.620 +because we have lots of computers we just have one computer runners you need + +1:03:14.620,1:03:17.890 +one computer one atom and see which one works better although we kind of know + +1:03:17.890,1:03:21.730 +from most problems which one is the better choice for whatever problems + +1:03:21.730,1:03:24.460 +you're working with maybe you can try both it depends how long it's going to + +1:03:24.460,1:03:27.940 +take to train I'm not sure exactly what you're gonna be doing in terms of + +1:03:27.940,1:03:31.150 +practice in this course yeah certainly legitimate way to do it + +1:03:31.150,1:03:35.020 +in fact some people use SGD at the beginning and then switch to atom at the + +1:03:35.020,1:03:39.430 +end that's certainly a good approach it just makes it more complicated and + +1:03:39.430,1:03:44.740 +complexity should be avoided if possible yes this is one of those deep unanswered + +1:03:44.740,1:03:48.400 +questions so the question was should we 1s you deal with lots of different + +1:03:48.400,1:03:51.850 +initializations and see which one gets the best solution won't I help with the + +1:03:51.850,1:03:54.990 +bumpiness this is the case with small neural net + +1:03:54.990,1:03:59.160 +that you will get different solutions depending on your initialization now + +1:03:59.160,1:04:02.369 +there's a remarkable property of the kind of large networks we use at the + +1:04:02.369,1:04:07.349 +moment and the art networks as long as you use similar random initialization in + +1:04:07.349,1:04:11.400 +terms of the variance of initialization you'll end up practically at a similar + +1:04:11.400,1:04:16.380 +quality solutions and this is not well understood so yeah it's it's quite + +1:04:16.380,1:04:19.319 +remarkable that your neural network can train for three hundred epochs and you + +1:04:19.319,1:04:23.550 +end up with solution the test error is like almost exactly the same as what you + +1:04:23.550,1:04:26.220 +got with some completely different initialization we don't understand this + 
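Following the "just try both" advice above, the snippet below runs one step of SGD with momentum and of Adam on the same toy model and prints the optimizer state each one keeps per weight; the exact key names depend on the PyTorch version, so treat the printout as illustrative. It also makes the buffer-count point concrete: SGD with momentum carries one extra tensor per weight, Adam carries two (plus a step counter).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)

for opt in (torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9),
            torch.optim.Adam(model.parameters(), lr=1e-3)):
    opt.zero_grad()
    F.mse_loss(model(x), y).backward()
    opt.step()
    # inspect which buffers this optimizer stores for its parameters
    state_keys = {k for s in opt.state.values() for k in s}
    print(type(opt).__name__, state_keys)
```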
+1:04:26.220,1:04:31.800 +so if you really need to eke out tiny performance gains you may be able to get + +1:04:31.800,1:04:36.150 +a little bit better Network by running multiple and picking the best and it + +1:04:36.150,1:04:39.180 +seems the bigger your network and the harder your problem the less game you + +1:04:39.180,1:04:44.190 +get from doing that yes so the question was we have three buffers for each + +1:04:44.190,1:04:49.470 +weight on the answer answer is yes so essentially yeah we basically in memory + +1:04:49.470,1:04:53.160 +we have a copy of the same size as our weight data so our weight will be a + +1:04:53.160,1:04:55.920 +whole bunch of tensors in memory we have a separate whole bunch of tensors that + +1:04:55.920,1:05:01.849 +our momentum tensors and we have a whole bunch of other tensors that are the the + +1:05:01.849,1:05:09.960 +second moment tensors so yeah so normalization layers so this is kind of + +1:05:09.960,1:05:14.369 +a clever idea why try and salt why try and come up with a better optimization + +1:05:14.369,1:05:20.540 +algorithm where we can just come up with a better network and this is the idea so + +1:05:20.960,1:05:24.960 +modern neural networks typically we modify the network by adding additional + +1:05:24.960,1:05:32.280 +layers in between existing layers and the goal of these layers to improve the + +1:05:32.280,1:05:36.450 +optimization and generalization performance of the network and the way + +1:05:36.450,1:05:39.059 +they do this can happen in a few different ways but let me give you an + +1:05:39.059,1:05:44.430 +example so we would typically take standard kind of combinations so as you + +1:05:44.430,1:05:48.930 +know in modern your networks we typically alternate linear operations + +1:05:48.930,1:05:52.319 +with nonlinear operations and here I call that activation functions we + +1:05:52.319,1:05:56.069 +alternate them linear nonlinear linear nonlinear what we could do is we can + +1:05:56.069,1:06:01.819 +place these normalization layers either between the linear order non-linear or + +1:06:01.819,1:06:11.009 +before so there in this case we are using for instance this is the kind of + +1:06:11.009,1:06:14.369 +structure we have in real networks where we have a convolution recover that + +1:06:14.369,1:06:18.240 +convolutions or linear operations followed by batch normalization this is + +1:06:18.240,1:06:20.789 +a type of normalization which I will detail in a minute + +1:06:20.789,1:06:28.140 +followed by riilu which is currently the most popular activation function and we + +1:06:28.140,1:06:31.230 +place this mobilization between these existing layers and what I want to make + +1:06:31.230,1:06:35.940 +clear is this normalization layers they affect the flow of data through so they + +1:06:35.940,1:06:39.150 +modify the data that's flowing through but they don't change the power of the + +1:06:39.150,1:06:43.380 +network in the sense that that you can set up the weights in the network in + +1:06:43.380,1:06:46.769 +some way that'll still give whatever output you had in an unknown alized + +1:06:46.769,1:06:50.220 +network with a normalized network so normalization layers you're not making + +1:06:50.220,1:06:53.670 +that work more powerful they improve it in other ways normally when we add + +1:06:53.670,1:06:57.660 +things to a neural network the goal is to make it more powerful and yes this + +1:06:57.660,1:07:01.740 +normalization layer can also be after the activation or before the linear or + +1:07:01.740,1:07:05.009 +you 
know because this wraps around we do this in order a lot of them are + +1:07:05.009,1:07:11.400 +equivalent but any questions here this is this bits yes yes so that's certainly + +1:07:11.400,1:07:16.140 +true but we kind of want that we want the real o2 sensor some of the data but + +1:07:16.140,1:07:20.009 +not too much but it's also not quite accurate because normalization layers + +1:07:20.009,1:07:24.989 +can also scale and ship the data and so it won't necessarily be that although + +1:07:24.989,1:07:28.739 +it's certainly at initialization they do not do that scaling in ship so typically + +1:07:28.739,1:07:32.460 +cut off half the data and in fact if you try to do a theoretical analysis of this + +1:07:32.460,1:07:37.470 +it's very convenient that it cuts off half the data so the structure this + +1:07:37.470,1:07:42.239 +normalization layers they all pretty much do the same kind of operation and + +1:07:42.239,1:07:47.640 +how many use kind of generic notation here so you should imagine that X is an + +1:07:47.640,1:07:54.930 +input to the normalization layer and Y is an output and what you do is use do a + +1:07:54.930,1:08:00.119 +whitening or normalization operation where you subtract off some estimate of + +1:08:00.119,1:08:05.190 +the mean of the data and you divide through by some estimate of the standard + +1:08:05.190,1:08:10.259 +deviation and remember before that I mentioned we want to keep the + +1:08:10.259,1:08:12.630 +representational power of the network the same + +1:08:12.630,1:08:17.430 +what we do to ensure that is we multiply by an alpha and we add a sorry in height + +1:08:17.430,1:08:22.050 +multiplied by an hey and we add a B and this is just so that the layer can still + +1:08:22.050,1:08:27.120 +output values over any particular range or if we just always had every layer + +1:08:27.120,1:08:30.840 +output in white and data the network couldn't output like a value million or + +1:08:30.840,1:08:35.370 +something like that it wouldn't it could only do that you know with very in very + +1:08:35.370,1:08:38.520 +rare cases because that would be very heavy on the tail of the normal + +1:08:38.520,1:08:41.850 +distribution so this allows our layers to essentially output things that are + +1:08:41.850,1:08:49.200 +the same range as before and yes so normalization layers have parameters and + +1:08:49.200,1:08:51.900 +in the network is a little bit more complicated in the sensor has more + +1:08:51.900,1:08:56.010 +parameters it's typically a very small number of parameters like rounding error + +1:08:56.010,1:09:04.290 +in your counts of network parameters typically and yeah so the complexity of + +1:09:04.290,1:09:06.840 +this is on being kind of vague about how you compute the mean and standard + +1:09:06.840,1:09:10.170 +deviation the reason I'm doing that is because all the methods compute in a + +1:09:10.170,1:09:18.210 +different way and I'll detail that in a second yes question weighs re lb oh it's + +1:09:18.210,1:09:24.630 +just a shift parameter so the data could have had a nonzero mean and we want it + +1:09:24.630,1:09:28.470 +delayed to be able to produce outputs with a nonzero mean so if we always just + +1:09:28.470,1:09:30.570 +subtract off the mean it couldn't do that + +1:09:30.570,1:09:34.950 +so it just adds back representational power to the layer yes so the question + +1:09:34.950,1:09:40.110 +is don't these a and B parameters reverse the normalization and and in + +1:09:40.110,1:09:44.730 +fact that often is the case that they do something 
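The generic normalization operation described above, y = a · (x − μ)/σ + b with a learnable scale a and shift b, can be written directly. This sketch leaves the choice of which dimensions to average over (`dims`) open, since that choice is exactly what distinguishes the batch/layer/instance/group variants discussed next; all names here are made up for illustration.

```python
import torch

def normalize(x, dims, a, b, eps=1e-5):
    """Whiten x over `dims`, then restore range with a learnable scale and shift."""
    mu = x.mean(dim=dims, keepdim=True)
    sigma = x.std(dim=dims, keepdim=True)
    return a * (x - mu) / (sigma + eps) + b

x = torch.randn(16, 64, 32, 32)                  # (batch, channels, height, width)
a = torch.ones(1, 64, 1, 1)                      # initialised so the layer starts as pure whitening
b = torch.zeros(1, 64, 1, 1)
y = normalize(x, dims=(0, 2, 3), a=a, b=b)       # batch-norm-style statistics, as one example
```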
similar but they move at + +1:09:44.730,1:09:48.750 +different time scales so between the steps or between evaluations your + +1:09:48.750,1:09:52.410 +network the mean and variance can can shift quite substantially based off the + +1:09:52.410,1:09:55.320 +data you're feeding but these a and B parameters are quite stable they move + +1:09:55.320,1:10:01.260 +slowly as you learn them so because they're most stable this has beneficial + +1:10:01.260,1:10:04.530 +properties and I'll describe those a little bit later but I want to talk + +1:10:04.530,1:10:08.610 +about is exactly how you normalize the data and this is where the crucial thing + +1:10:08.610,1:10:11.760 +so the earliest of these methods developed was batch norm and he is this + +1:10:11.760,1:10:16.429 +kind of a bizarre normalization that I I think is a horrible idea + +1:10:16.429,1:10:22.460 +but unfortunately works fantastically well so it normalizes across batches so + +1:10:22.460,1:10:28.370 +we want information about a certain channel recall for a convolutional + +1:10:28.370,1:10:32.000 +neural network which channel is one of these latent images that you have in + +1:10:32.000,1:10:34.610 +your network that part way through the network you have some data it doesn't + +1:10:34.610,1:10:37.070 +really look like an image if you actually look at it but it's it's shaped + +1:10:37.070,1:10:41.000 +like an image anyway and that's a channel so we want to compute an average + +1:10:41.000,1:10:47.239 +over this over this channel but we only have a small amount of data that's + +1:10:47.239,1:10:51.380 +what's in this channel basically height times width if it's a if it's an image + +1:10:51.380,1:10:56.000 +and it turns out that's not enough data to get good estimates of these mean and + +1:10:56.000,1:10:58.969 +variance parameters so what batchman does is it takes a mean and variance + +1:10:58.969,1:11:05.570 +estimate across all the instances in your mini-batch pretty straightforward + +1:11:05.570,1:11:09.890 +and that's what it divides blue by the reason why I don't like this is it is no + +1:11:09.890,1:11:12.830 +longer actually stochastic gradient descent if you using batch normalization + +1:11:12.830,1:11:19.429 +so it breaks all the theory that I work on for a living so I prefer some other + +1:11:19.429,1:11:24.409 +normalization strategies there in fact quite a soon after Bachelor and people + +1:11:24.409,1:11:27.409 +tried normalizing via every other possible combination of things you can + +1:11:27.409,1:11:31.699 +normalize by and it turns out the three that kind of work a layer instance and + +1:11:31.699,1:11:37.370 +group norm and layer norm here in this diagram you averaged across all of the + +1:11:37.370,1:11:43.820 +channels and across height and width now this doesn't work on all problems so I + +1:11:43.820,1:11:47.000 +would only recommend it on a problem where you know it already works and + +1:11:47.000,1:11:49.940 +that's typically a problem where people already using it so look at what the + +1:11:49.940,1:11:53.989 +network's people are using if that's a good idea or not will depend the + +1:11:53.989,1:11:57.140 +instance normalization is something that's used a lot in modern language + +1:11:57.140,1:12:03.380 +models and this you do not average across the batch anymore which is nice I + +1:12:03.380,1:12:07.310 +won't we talk about that much depth I really the one I would rather you rather + +1:12:07.310,1:12:12.440 +you use in practice is group normalization so here we have which + 
+1:12:12.440,1:12:16.219 +across a group of channels and this group is trapped is chosen arbitrarily + +1:12:16.219,1:12:20.090 +and fixed at the beginning so typically we just group things numerically so + +1:12:20.090,1:12:23.580 +channel 0 to 10 would be a group channel you know 10 to + +1:12:23.580,1:12:31.110 +20 making sure you don't overlap of course disjoint groups of channels and + +1:12:31.110,1:12:34.560 +the size of these groups is a parameter that you need to tune although we always + +1:12:34.560,1:12:39.150 +use 32 in practice you could tune that and you just do this because there's not + +1:12:39.150,1:12:42.600 +enough information on a single channel and using all the channels is too much + +1:12:42.600,1:12:46.170 +so you just use something in between it's it's really quite a simple idea and + +1:12:46.170,1:12:50.790 +it turns out this group norm often works better than batch normal a lot of + +1:12:50.790,1:12:55.410 +problems and it does mean that my HUD theory that I work on is still balanced + +1:12:55.410,1:12:57.890 +so I like that so why does normalization help this is a + +1:13:02.190,1:13:06.330 +matter of dispute so in fact in the last few years several papers have come out + +1:13:06.330,1:13:08.790 +on this topic unfortunately the papers did not agree + +1:13:08.790,1:13:13.590 +on why it works they all have completely separate explanations but there's some + +1:13:13.590,1:13:16.260 +things that are definitely going on so we can shape it we can say for sure + +1:13:16.260,1:13:24.120 +that the network appears to be easier to optimize so by that I mean you can use + +1:13:24.120,1:13:28.140 +large learning rates better in a better condition network you can use larger + +1:13:28.140,1:13:31.590 +learning rates and therefore get faster convergence so that does seem to be the + +1:13:31.590,1:13:35.030 +case when you uses normalization layers another factor which is a little bit + +1:13:38.070,1:13:39.989 +disputed but I think is reasonably well-established + +1:13:39.989,1:13:44.489 +you get noise in the data passing through your network when you use + +1:13:44.489,1:13:49.940 +normalization in vaginal and this noise comes from other instances in the bash + +1:13:49.940,1:13:53.969 +because it's random what I like instances are in your batch when you + +1:13:53.969,1:13:57.239 +compute the mean using those other instances that mean is noisy and this + +1:13:57.239,1:14:01.469 +noise is then added or sorry subtracted from your weight so when you do the + +1:14:01.469,1:14:06.050 +normalization operation so this noise is actually potentially helping + +1:14:06.050,1:14:11.790 +generalization performance in your network now there has been a lot of + +1:14:11.790,1:14:15.180 +papers on injecting noise internet works to help generalization so it's not such + +1:14:15.180,1:14:20.370 +a crazy idea that this noise can be helping and in terms of a practical + +1:14:20.370,1:14:24.030 +consideration this normalization makes the weight initialization that you use a + +1:14:24.030,1:14:28.260 +lot less important it used to be kind of a black art to select the initialization + +1:14:28.260,1:14:32.460 +your new your network and the people who really good motive is often it was just + +1:14:32.460,1:14:35.340 +because they're really good at changing their initialization and this is just + +1:14:35.340,1:14:39.540 +less the case now when we use normalization layers and also gives the + +1:14:39.540,1:14:45.930 +benefit if you can kind of tile together layers with impunity so 
again it used to + +1:14:45.930,1:14:49.050 +be the situation that if you just plug together two possible ways in your + +1:14:49.050,1:14:52.740 +network it probably wouldn't work now that we use normalization layers it + +1:14:52.740,1:14:57.900 +probably will work and even if it's a horrible idea and this has spurred a + +1:14:57.900,1:15:02.310 +whole field of automated architecture search where they just randomly calm + +1:15:02.310,1:15:05.940 +build together blocks and it's try thousands of them and see what works and + +1:15:05.940,1:15:09.540 +that really wasn't possible before because that would typically result in a + +1:15:09.540,1:15:14.010 +poorly conditioned Network you couldn't train and with normalization typically + +1:15:14.010,1:15:19.590 +you can train it some practical considerations so the the bachelor on + +1:15:19.590,1:15:23.310 +paper one of the reasons why it wasn't invented earlier is the kind of + +1:15:23.310,1:15:27.480 +non-obvious thing that you have to back propagate through the calculation of the + +1:15:27.480,1:15:32.160 +mean and standard deviation if you don't do this everything blows up now you + +1:15:32.160,1:15:35.190 +might have to do this yourself as it'll be implemented in the implementation + +1:15:35.190,1:15:42.000 +that you use oh yes so I do not have the expertise to answer that I feel like + +1:15:42.000,1:15:45.060 +it's kind of sometimes it's just a patent pet method like people like + +1:15:45.060,1:15:49.710 +layering in suits normally that field more and in fact a good norm if you it's + +1:15:49.710,1:15:53.640 +just the group size covers both so I would be sure that you could probably + +1:15:53.640,1:15:56.640 +get the same performance using group norm with a particular group size chosen + +1:15:56.640,1:16:00.980 +carefully yeah the choice of national does affect + +1:16:00.980,1:16:06.720 +parallelization so the implementation zinc in your computer library or your + +1:16:06.720,1:16:10.380 +CPU library are pretty efficient for each of these but it's complicated when + +1:16:10.380,1:16:14.820 +you are spreading your computation across machines and you kind of have to + +1:16:14.820,1:16:18.630 +synchronize these these these things and batch norm is a bit of a pain there + +1:16:18.630,1:16:23.790 +because it would mean that you need to compute an average across all machines + +1:16:23.790,1:16:27.540 +and aggregator whereas if you're using group norm every instance is on a + +1:16:27.540,1:16:30.450 +different machine you can just completely compute the norm so in all + +1:16:30.450,1:16:34.350 +those other three it's separate normalization for each instance it + +1:16:34.350,1:16:37.560 +doesn't depend on the other instances in the batch so it's nicer when you're + +1:16:37.560,1:16:40.570 +distributing it's when people use batch norm on a cluster + +1:16:40.570,1:16:45.100 +they actually do not sync the statistics across which makes it even less like SGD + +1:16:45.100,1:16:51.250 +and makes me even more annoyed so what was it already + +1:16:51.250,1:16:57.610 +yes yeah Bachelor basically has a lot of momentum not in the optimization sense + +1:16:57.610,1:17:01.300 +but in the sense of people's minds so it's very heavily used for that reason + +1:17:01.300,1:17:05.860 +but I would recommend group norm instead and there's kind of like a technical + +1:17:05.860,1:17:09.760 +data with batch norm you don't want to compute these mean and standard + +1:17:09.760,1:17:14.950 +deviations on batches during evaluation time by 
evaluation time I mean when you + +1:17:14.950,1:17:20.170 +actually run your network on the test data set or we use it in the real world + +1:17:20.170,1:17:24.370 +for some application it's typically in those situations you don't have batches + +1:17:24.370,1:17:29.050 +any more batches or more for training things so you need some substitution in + +1:17:29.050,1:17:33.100 +that case you can compute an exponential moving average as we talked about before + +1:17:33.100,1:17:37.930 +and EMA of these mean and standard deviations you may think to yourself why + +1:17:37.930,1:17:41.260 +don't we use an EMA in the implementation of batch norm the answer + +1:17:41.260,1:17:44.860 +is because it doesn't work we it seems like a very reasonable idea though and + +1:17:44.860,1:17:48.880 +people have explored that and quite a lot of depth but it doesn't work oh yes + +1:17:48.880,1:17:52.900 +this is quite crucial so yet people have tried normalizing things in neural + +1:17:52.900,1:17:55.480 +networks before a batch norm was invented but they always made the + +1:17:55.480,1:17:59.380 +mistake of not back popping through the mean and standard deviation and the + +1:17:59.380,1:18:02.290 +reason why they didn't do that is because the math is really tricky and if + +1:18:02.290,1:18:05.650 +you try to implement it yourself it will probably be wrong now that we have pie + +1:18:05.650,1:18:09.460 +charts which which computes gradients correctly for you in all situations you + +1:18:09.460,1:18:12.850 +could actually do this in practice and there are just a little bit but only a + +1:18:12.850,1:18:16.780 +little bit because it's surprisingly difficult yeah so the question is is + +1:18:16.780,1:18:21.070 +there a difference if we apply normalization before after than + +1:18:21.070,1:18:25.690 +non-linearity and the answer is there will be a small difference in the + +1:18:25.690,1:18:28.930 +performance of your network now I can't tell you which one's better because it + +1:18:28.930,1:18:32.110 +appears in some situation one works a little bit better in other situations + +1:18:32.110,1:18:35.350 +the other one works better what I can tell you is the way I draw it here is + +1:18:35.350,1:18:39.100 +what's used in the PyTorch implementation of ResNet and most + +1:18:39.100,1:18:43.330 +resonant implementations so just there's probably almost as good as you can get I + +1:18:43.330,1:18:49.270 +think that would use the other form if it was better and it's certainly problem + +1:18:49.270,1:18:51.460 +depended this is another one of those things where maybe the + +1:18:51.460,1:18:55.420 +no correct answer how you do it and it's just random which works better I don't + +1:18:55.420,1:19:03.190 +know yes yeah any other questions on this before I move on to the so you need + +1:19:03.190,1:19:06.850 +more data to get accurate estimates of the mean and standard deviation the + +1:19:06.850,1:19:10.570 +question was why is it a good idea to compute it across multiple channels + +1:19:10.570,1:19:13.450 +rather than a single channel and yes it is because you just have more data to + +1:19:13.450,1:19:17.800 +make a better estimates but you want to be careful you don't have too much data + +1:19:17.800,1:19:21.130 +in that because then you don't get the noise and record that the noise is + +1:19:21.130,1:19:25.300 +actually useful so basically the group size in group norm is just adjusting the + +1:19:25.300,1:19:28.870 +amount of noise we have basically the question was how is this related to + 
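As a rough sketch of what these layers do in code (my own illustration, not the lecturer's; the tensor sizes are made up), the generic operation described above, y = a * (x - mean) / std + b with learnable a and b, is what all four PyTorch modules below implement; they differ only in which axes the statistics are computed over. With the group size of 32 mentioned earlier, 64 channels would give num_groups = 2.

```python
import torch
from torch import nn

# Hypothetical sizes, purely for illustration: (batch, channels, height, width)
x = torch.randn(10, 64, 32, 32)

# The generic operation is y = a * (x - mean) / std + b; the layers below differ
# only in which axes the mean and std are estimated over.
batch_norm    = nn.BatchNorm2d(64)                  # statistics shared across the mini-batch
layer_norm    = nn.LayerNorm([64, 32, 32])          # statistics over channels, height, width
instance_norm = nn.InstanceNorm2d(64, affine=True)  # statistics per instance, per channel
group_norm    = nn.GroupNorm(2, 64)                 # 64 channels in groups of 32 -> 2 groups

for norm in (batch_norm, layer_norm, instance_norm, group_norm):
    print(type(norm).__name__, norm(x).shape)       # the shape is unchanged in every case
```

All four leave the activation shape unchanged, which is why they can be dropped between almost any pair of layers.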
+1:19:28.870,1:19:32.950 +group convolutions this was all pioneered before good convolutions were + +1:19:32.950,1:19:38.260 +used it certainly has some interaction with group convolutions if you use them + +1:19:38.260,1:19:41.920 +and so you want to be a little bit careful there I don't know exactly what + +1:19:41.920,1:19:44.800 +the correct thing to do is in those cases but I can tell you they definitely + +1:19:44.800,1:19:48.610 +use normalization in those situations probably Batchelor more more than group + +1:19:48.610,1:19:53.260 +norm because of the momentum I mentioned it's just more popular vaginal yes so + +1:19:53.260,1:19:56.890 +the question is do we ever use our Beck instances from the mini-batch in group + +1:19:56.890,1:20:00.310 +norm or is it always just a single instance we always just use a single + +1:20:00.310,1:20:04.450 +instance because there's so many benefits to that it's so much simpler in + +1:20:04.450,1:20:08.469 +implementation and in theory to do that maybe you can get some improvement from + +1:20:08.469,1:20:11.530 +that in fact I bet you there's a paper that does that somewhere because they've + +1:20:11.530,1:20:15.190 +tried have any combination of this in practice I suspect if it worked well + +1:20:15.190,1:20:19.450 +we'd probably be using it so probably probably doesn't work well under the the + +1:20:19.450,1:20:24.370 +death of optimization I wanted to put something a little bit interesting + +1:20:24.370,1:20:27.610 +because you've all been sitting through kind of a pretty dense lecture so this + +1:20:27.610,1:20:31.870 +is something that I've kind of been working on a little bit I thought you + +1:20:31.870,1:20:36.580 +might find interesting so you might have seen the the xkcd comic here that I've + +1:20:36.580,1:20:42.790 +modified it's not always this way it's kind of point of what it makes so + +1:20:42.790,1:20:46.270 +sometimes we can just barge into a field we know nothing about it and improve on + +1:20:46.270,1:20:50.469 +how they're currently doing it although you have to be a little bit careful so + +1:20:50.469,1:20:53.560 +the problem I want to talk about is one that young I think mentioned briefly in + +1:20:53.560,1:20:58.530 +the first lecture but I want to go into a bit of detail it's MRI reconstruction + +1:20:58.530,1:21:04.639 +now in the MRI reconstruction problem we take a raw data from an MRI machine a + +1:21:04.639,1:21:08.540 +medical imaging machine we take raw data from that machine and we reconstruct an + +1:21:08.540,1:21:12.530 +image and there's some pipeline an algorithm in the middle there that + +1:21:12.530,1:21:17.900 +produces the image and the goal basically here is to replace 30 years of + +1:21:17.900,1:21:21.020 +research into what algorithm they should use their with with neural networks + +1:21:21.020,1:21:27.949 +because that's that's what I'll get paid to do and I'll give you a bit of detail + +1:21:27.949,1:21:31.810 +so these MRI machines capture data in what's known as the Fourier domain I + +1:21:31.810,1:21:34.909 +know a lot of you have done signal processing some of you may have no idea + +1:21:34.909,1:21:42.070 +what this is and you don't need to understand it for this problem oh yeah + +1:21:44.770,1:21:49.639 +yes so you may have seen the the further domain in one dimensional case + +1:21:49.639,1:21:54.710 +so for neural networks sorry for MRI reconstruction we have two dimensional + +1:21:54.710,1:21:58.340 +Fourier domain the thing you need to know is it's a linear mapping to get 
+ +1:21:58.340,1:22:02.389 +from the fluid domain to image domain it's just linear and it's very efficient + +1:22:02.389,1:22:06.350 +to do that mapping it literally takes milliseconds no matter how big your + +1:22:06.350,1:22:09.980 +images on modern computers so linear and easy to convert back and forth between + +1:22:09.980,1:22:15.619 +the two and the MRI machines actually capture either rows or columns of this + +1:22:15.619,1:22:20.540 +Fourier domain as samples they're called sample in the literature so each time + +1:22:20.540,1:22:25.280 +the machine computes a sample which is every few milliseconds it gets a role + +1:22:25.280,1:22:28.940 +column of this image and this is actually technically a complex-valued + +1:22:28.940,1:22:33.380 +image but this does not matter for my discussion of it so you can imagine it's + +1:22:33.380,1:22:38.300 +just a two channel image if you imagine a real and imaginary channel just think + +1:22:38.300,1:22:42.830 +of them as color channels the problem we want to do we want to solve is + +1:22:42.830,1:22:48.800 +accelerating MRI acceleration here is in the sense of faster so we want to run + +1:22:48.800,1:22:53.830 +the machines quicker and produce identical quality images + +1:22:55.400,1:23:00.050 +and one way we can do that in the most successful way so far is by just not + +1:23:00.050,1:23:05.540 +capturing all of the columns we just skip some randomly it's useful in + +1:23:05.540,1:23:09.320 +practice to also capture some of the middle columns it turns out they contain + +1:23:09.320,1:23:14.150 +a lot of the information but outside the middle we just capture randomly and we + +1:23:14.150,1:23:16.699 +can't just use a nice linear operation anymore + +1:23:16.699,1:23:20.270 +that diagram on the right is the output of that linear operation I mentioned + +1:23:20.270,1:23:23.810 +applied to this data so it doesn't give useful Apple they only do something a + +1:23:23.810,1:23:27.100 +little bit more intelligent any questions on this before I move on + +1:23:27.100,1:23:35.030 +it is frequency and phase dimensions so in this particular case I'm actually + +1:23:35.030,1:23:38.510 +sure this diagram one of the dimensions is frequency and one is phase and the + +1:23:38.510,1:23:44.390 +value is the magnitude of a sine wave with that frequency and phase so if you + +1:23:44.390,1:23:48.980 +add together all the sine waves wave them with the frequency oh so with the + +1:23:48.980,1:23:54.620 +weight in this image you get the original image so it's it's a little bit + +1:23:54.620,1:23:58.429 +more complicated because it's in two dimensions and the sine waves you gotta + +1:23:58.429,1:24:02.030 +be little bit careful but it's basically just each pixel is the magnitude of a + +1:24:02.030,1:24:06.230 +sine wave or if you want to compare to a 1d analogy + +1:24:06.230,1:24:11.960 +you'll just have frequencies so the pixel intensity is the strength of that + +1:24:11.960,1:24:16.580 +frequency if you have a musical note say a piano note with a C major as one of + +1:24:16.580,1:24:19.340 +the frequencies that would be one pixel this image would be the C major + +1:24:19.340,1:24:24.140 +frequency and another might be a minor or something like that and the magnitude + +1:24:24.140,1:24:28.370 +of it is just how hard they press the key on the piano so you have frequency + +1:24:28.370,1:24:34.370 +information yes so the video doesn't work there was one of the biggest + +1:24:34.370,1:24:38.750 +breakthroughs in in Threat achill mathematics for 
a long time was the + +1:24:38.750,1:24:41.690 +invention of compressed sensing I'm sure some of you have heard of compressed + +1:24:41.690,1:24:45.710 +sensing a hands of show of hands compressed sensing yeah some of you + +1:24:45.710,1:24:48.980 +especially work in the mathematical sciences would be aware of it + +1:24:48.980,1:24:53.330 +basically there's this phenomenal political paper that showed that we + +1:24:53.330,1:24:57.770 +could actually in theory get a perfect reconstruction from these subsampled + +1:24:57.770,1:25:02.080 +measurements and we had some requirements for this to work the + +1:25:02.080,1:25:06.010 +requirements were that we needed to sample randomly + +1:25:06.010,1:25:10.150 +in fact it's a bit weaker you have to sample incoherently but in practice + +1:25:10.150,1:25:14.710 +everybody samples randomly so it's essentially the same thing now here + +1:25:14.710,1:25:18.910 +we're randomly sampling columns but within the columns we do not randomly + +1:25:18.910,1:25:22.330 +sample the reason being is it's not faster in the machine the machine can + +1:25:22.330,1:25:25.930 +capture one column as quickly as you could capture half a column so we just + +1:25:25.930,1:25:29.350 +kind of capture a whole column so that makes it no longer random so that's one + +1:25:29.350,1:25:33.760 +kind of problem with it the other problem is kind of the the assumptions + +1:25:33.760,1:25:36.850 +of this compressed sensing theory are violated by the kind of images we want + +1:25:36.850,1:25:41.020 +to reconstruct I show you on the right they're an example of compressed sensing + +1:25:41.020,1:25:44.560 +Theory reconstruction this was a big step forward from what they could do + +1:25:44.560,1:25:48.940 +before you would you'll get something that looks like this previously that was + +1:25:48.940,1:25:53.020 +really considered the best in fact some people would when this result came out + +1:25:53.020,1:25:57.430 +swore though this was impossible it's actually not but you need some + +1:25:57.430,1:26:00.550 +assumptions and these assumptions are pretty critical and I mention them there + +1:26:00.550,1:26:05.080 +so you need sparsity of the image now that mi a-- majors not sparse by sparse + +1:26:05.080,1:26:09.370 +I mean it has a lot of zero or black pixels it's clearly not sparse but it + +1:26:09.370,1:26:13.660 +can be represented sparsely or approximately sparsely if you do a + +1:26:13.660,1:26:18.160 +wavelet decomposition now I won't go to the details there's a little bit of + +1:26:18.160,1:26:20.920 +problem though it's only approximately sparse and when you do that wavelet + +1:26:20.920,1:26:24.489 +decomposition that's why this is not a perfect reconstruction if it was very + +1:26:24.489,1:26:28.060 +sparse in the wavelet domain and perfectly that would be in exactly the + +1:26:28.060,1:26:33.160 +same as the left image and this compressed sensing is based off of the + +1:26:33.160,1:26:36.220 +field of optimization it kind of revitalize a lot of the techniques + +1:26:36.220,1:26:39.550 +people have been using for a long time the way you get this reconstruction is + +1:26:39.550,1:26:45.130 +you solve a little mini optimization problem at every step you every image + +1:26:45.130,1:26:47.830 +you want to reconstruct how many other machines so your machine has to solve an + +1:26:47.830,1:26:51.030 +optimization problem for every image every time it solves this little + +1:26:51.030,1:26:57.340 +quadratic problem with this kind of complicated 
regularization term so this + +1:26:57.340,1:27:00.700 +is great for optimization or all these people who had been getting low paid + +1:27:00.700,1:27:03.780 +jobs at universities all of a sudden there of their research was trendy and + +1:27:03.780,1:27:09.370 +corporations needed their help so this is great but we can do better so we + +1:27:09.370,1:27:13.120 +instead of solving this minimization problem at every time step I will use a + +1:27:13.120,1:27:16.960 +neural network so obviously being here arbitrarily to represent the huge in + +1:27:16.960,1:27:24.190 +your network beef a big of course we we hope that we can learn in your network + +1:27:24.190,1:27:28.000 +of such sufficient complexity that it can essentially solve the optimization + +1:27:28.000,1:27:31.240 +problem in one step it just outputs a solution that's as good as the + +1:27:31.240,1:27:35.200 +optimization problem solution now this would have been considered impossible 15 + +1:27:35.200,1:27:39.820 +years ago now we know better so it's actually not very difficult in fact we + +1:27:39.820,1:27:44.980 +can just take an example of we can solve a few of these a few I mean like a few + +1:27:44.980,1:27:48.520 +hundred thousand of these optimization problems take the solution and the input + +1:27:48.520,1:27:53.620 +and we're gonna strain a neural network to map from input to solution that's + +1:27:53.620,1:27:56.830 +actually a little bit suboptimal because we get weakened in some cases we know a + +1:27:56.830,1:28:00.070 +better solution than the solution to the optimization problem we can gather that + +1:28:00.070,1:28:04.780 +by measuring the patient and that's what we actually do in practice so we don't + +1:28:04.780,1:28:07.000 +try and solve the optimization problem we try and get to an even better + +1:28:07.000,1:28:11.260 +solution and this works really well so I'll give you a very simple example of + +1:28:11.260,1:28:14.740 +this so this is what you can do much better than the compressed sensory + +1:28:14.740,1:28:18.580 +reconstruction using a neural network and this network involves the tricks + +1:28:18.580,1:28:23.140 +I've mentioned so it's trained using Adam it uses group norm normalization + +1:28:23.140,1:28:28.690 +layers and convolutional neural networks as you've already been taught and it + +1:28:28.690,1:28:33.970 +uses a technique known as u nets which you may go over later in the course not + +1:28:33.970,1:28:37.390 +sure about that but it's not a very complicated modification of only one it + +1:28:37.390,1:28:40.660 +works as yeah this is the kind of thing you can do and this is this is very + +1:28:40.660,1:28:44.880 +close to practical applications so you'll be seeing these accelerated MRI + +1:28:44.880,1:28:49.750 +scans happening in in clinical practice in only a few years tired this is not + +1:28:49.750,1:28:53.980 +vaporware and yeah that's everything i wanted to talk about you talk about + +1:28:53.980,1:28:58.620 +today optimization and the death of optimization thank you diff --git a/docs/pt/week05/practicum05.sbv b/docs/pt/week05/practicum05.sbv new file mode 100644 index 000000000..72ed0c5f4 --- /dev/null +++ b/docs/pt/week05/practicum05.sbv @@ -0,0 +1,1241 @@ +0:00:00.000,0:00:05.339 +last time we have seen that a matrix can be written basically let me draw here + +0:00:05.339,0:00:12.719 +the matrix so we had similar roles right and then we multiplied usually design by + +0:00:12.719,0:00:18.210 +one one column all right and so whenever we multiply these guys you 
can see these + +0:00:18.210,0:00:23.340 +and as two types two different equivalent types of representation it + +0:00:23.340,0:00:28.980 +can you see right you don't is it legible okay so you can see basically as + +0:00:28.980,0:00:35.430 +the output of this product has been a sequence of like the first row times + +0:00:35.430,0:00:40.469 +this column vector and then again I'm just okay shrinking them this should be + +0:00:40.469,0:00:46.170 +the same size right right because otherwise you can't multiply them so you + +0:00:46.170,0:00:52.170 +have this one and so on right until the last one and this is gonna be my final + +0:00:52.170,0:01:00.960 +vector and we have seen that each of these bodies here what are these I talk + +0:01:00.960,0:01:05.339 +to me please there's a scalar products right but what + +0:01:05.339,0:01:08.820 +do they represent what is it how can we call it what's another name for calling + +0:01:08.820,0:01:13.290 +a scalar product I show you last time a demonstration with some Chi government + +0:01:13.290,0:01:18.119 +trigonometry right what is it so this is all the projection if you + +0:01:18.119,0:01:22.619 +talk about geometry or you can think about this as a nun normalized cosine + +0:01:22.619,0:01:29.310 +value right so this one is going to be my projection basically of one kernel or + +0:01:29.310,0:01:36.030 +my input signal onto the kernel right so these are projections projection alright + +0:01:36.030,0:01:40.619 +and so then there was also a another interpretation of this like there is + +0:01:40.619,0:01:45.390 +another way of seeing this which was what basically we had the first column + +0:01:45.390,0:01:53.579 +of the matrix a multiplied by the first element of the X of these of this vector + +0:01:53.579,0:01:58.260 +right so back element number one then you had a second call + +0:01:58.260,0:02:04.020 +time's the second element of the X vector until you get to the last column + +0:02:04.020,0:02:11.100 +right times the last an element right suppose that this is long N and this is + +0:02:11.100,0:02:16.110 +M times n right so the height again is going to be the dimension towards we + +0:02:16.110,0:02:19.550 +should - and the width of a matrix is dimension where we're coming from + +0:02:19.550,0:02:24.810 +second part was the following so we said instead of using this matrix here + +0:02:24.810,0:02:29.450 +instead since we are doing convolutions because we'd like to exploit sparsity a + +0:02:29.450,0:02:35.400 +stationarity and compositionality of the data we still use the same matrix here + +0:02:35.400,0:02:41.370 +perhaps right we use the same guy here but then those kernels we are going to + +0:02:41.370,0:02:45.510 +be using them over and over again the same current across the whole signal + +0:02:45.510,0:02:51.360 +right so in this case the width of this matrix is no longer be it's no longer n + +0:02:51.360,0:02:56.820 +as it was here is going to be K which is gonna be the kernel size right so here + +0:02:56.820,0:03:03.090 +I'm gonna be drawing my thinner matrix and this one is gonna be K lowercase K + +0:03:03.090,0:03:10.140 +and the height maybe we can still call it n okay all right so let's say here I + +0:03:10.140,0:03:18.230 +have several kernels for example let me have my tsiyon carnal then I may have my + +0:03:18.230,0:03:25.080 +other non green let me change let's put pink so you have this one and + +0:03:25.080,0:03:33.180 +then you may have green one right and so on so how do we use these kernels right + 
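To make the two equivalent readings of the product concrete, here is a small sketch with arbitrary sizes (my own illustration, not code from the session): the row view, where each output entry is the scalar product, i.e. the projection, of the input onto one row, and the column view, where the output is a linear combination of the columns weighted by the entries of x.

```python
import torch

A = torch.randn(3, 4)        # m = 3 rows ("kernels"), n = 4 columns
x = torch.randn(4)           # input vector

# Row view: each output entry is a projection of x onto one row of A.
row_view = torch.stack([A[i] @ x for i in range(A.shape[0])])

# Column view: the output is a linear combination of the columns of A,
# weighted by the corresponding entries of x.
col_view = sum(x[j] * A[:, j] for j in range(A.shape[1]))

print(torch.allclose(A @ x, row_view))  # True
print(torch.allclose(A @ x, col_view))  # True
```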
+0:03:33.180,0:03:38.280 +now so we basically can use these kernels by stacking them and shifted + +0:03:38.280,0:03:43.650 +them a little bit right so we get the first kernel out of here and then you're + +0:03:43.650,0:03:50.519 +gonna get basically you get the first guy here then you shift it shift it + +0:03:50.519,0:03:58.290 +shift it and so on right until you get the whole matrix and we were putting a 0 + +0:03:58.290,0:04:02.100 +here and a 0 here right this is just recap and then you have this one for the + +0:04:02.100,0:04:11.379 +blue color now you do magic here and just do copy copy and I you do paste + +0:04:11.379,0:04:19.370 +and now you can also do color see fantastic magic and we have pink one and + +0:04:19.370,0:04:25.360 +then you have the last one right can I do the same copy yes I can do fantastic + +0:04:25.360,0:04:29.080 +so you cannot do copy and paste on the paper + +0:04:29.080,0:04:38.419 +all right color and the last one light green okay all right so we just + +0:04:38.419,0:04:44.479 +duplicate how many matrices do we have now how many layers no don't count the + +0:04:44.479,0:04:50.600 +number like there are letters on the on the screen and K or M what is it what is + +0:04:50.600,0:05:00.620 +K the side usually you're just guessing you shouldn't be guessing you should + +0:05:00.620,0:05:07.120 +tell me the correct answer I think about this as a job interview I'm training you + +0:05:07.120,0:05:14.990 +so how many maps we have and right so this one here are as many as my M which + +0:05:14.990,0:05:21.470 +is the number of rows of this initial thing over here right all right so what + +0:05:21.470,0:05:30.289 +is instead the width of this little kernel here okay right okay what is the + +0:05:30.289,0:05:41.349 +height of this matrix what is the height of the matrix + +0:05:42.340,0:05:45.480 +you sure try again + +0:05:49.220,0:06:04.310 +I can't hear and minus k plus one okay and the final what is the output of this + +0:06:04.310,0:06:08.660 +thing right so the output is going to be one vector which is gonna be of height + +0:06:08.660,0:06:19.430 +the same right and minus k plus 1 and then it should be correct yeah but then + +0:06:19.430,0:06:27.890 +how many what is the thickness of this final vector M right so this stuff here + +0:06:27.890,0:06:35.600 +and goes as thick as M right so this is where we left last time right but then + +0:06:35.600,0:06:39.770 +someone asked me now then I realized so we have here as many as the different + +0:06:39.770,0:06:45.170 +colors right so for example in this case if I just draw to make sure we + +0:06:45.170,0:06:49.730 +understand what's going on you have the first thing here now you have the second + +0:06:49.730,0:06:55.600 +one here and I have the third one right in this case all right so last time they + +0:06:59.750,0:07:03.650 +asked me if someone asked me at the end of the class so how do we do convolution + +0:07:03.650,0:07:09.760 +when we end up in this situation over here because here we assume that my + +0:07:09.760,0:07:14.990 +corners are just you know whatever K long let's say three long but then they + +0:07:14.990,0:07:21.380 +are just one little vector right and so somebody told me no then what do you do + +0:07:21.380,0:07:24.950 +from here like how do we keep going because now we have a thickness before + +0:07:24.950,0:07:32.510 +we started with a something here this vector which had just n elements right + +0:07:32.510,0:07:35.690 +are you following so far I'm going faster because we 
already seen these + +0:07:35.690,0:07:44.030 +things I'm just reviewing but are you with me until now yes no yes okay + +0:07:44.030,0:07:47.720 +fantastic so let's see how we actually keep going so the thing is + +0:07:47.720,0:07:51.680 +show you right now is actually assuming that we start with that long vector + +0:07:51.680,0:08:01.400 +which was of height what was the height and right but in this case also this one + +0:08:01.400,0:08:13.060 +means that we have something that looks like this and so you have basically here + +0:08:13.060,0:08:20.720 +this is 1 this is also 1 so we only have a monophonic signal for example and this + +0:08:20.720,0:08:26.300 +was n the height right all right so let's assume now we're using a + +0:08:26.300,0:08:33.950 +stereophonic system so what is gonna be my domain here so you know my X can be + +0:08:33.950,0:08:39.740 +thought as a function that goes from the domain to the ℝ^{number of channels} so + +0:08:39.740,0:08:47.840 +what is this guy here yeah x is one dimension and somewhere so what is this + +0:08:47.840,0:08:59.930 +Ω we have seen this slide last slide of Tuesday lesson right second Ω is + +0:08:59.930,0:09:11.720 +not set of real numbers no someone else tries we are using computers it's time + +0:09:11.720,0:09:16.520 +line yes and how many samples you you have one sample number sample number two + +0:09:16.520,0:09:21.710 +or sample number three so you have basically a subset of the natural space + +0:09:21.710,0:09:30.860 +right so this one is going to be something like 0 1 2 so on set which is + +0:09:30.860,0:09:36.410 +gonna be subset of ℕ right so it's not ℝ. ℝ is gonna be if you have time + +0:09:36.410,0:09:45.850 +continuous domain what you see in this case the in the case I just showed you + +0:09:45.850,0:09:55.160 +so far what is seen in this case now number of input channels because this is + +0:09:55.160,0:10:00.740 +going to be my X right this is my input so in this case we show so far in this + +0:10:00.740,0:10:07.220 +case here we were just using one so it means we have a monophonic audio let's + +0:10:07.220,0:10:10.880 +seven now the assumption make the assumption that this guy is that it's + +0:10:10.880,0:10:22.780 +gonna be two such that you're gonna be talking about stereo phonic signal right + +0:10:23.200,0:10:27.380 +okay so let's see how this stuff changes so + +0:10:27.380,0:10:38.450 +in this case my let me think yeah so how do I draw I'm gonna just draw right + +0:10:38.450,0:10:43.400 +little complain if you don't follow are you following so far yes because if + +0:10:43.400,0:10:46.550 +i watch my tablet I don't see you right so you should be complaining if + +0:10:46.550,0:10:50.750 +something doesn't make sense right otherwise becomes boring from waiting + +0:10:50.750,0:10:56.390 +and watching you all the time right yes no yes okay I'm boring okay + +0:10:56.390,0:11:00.080 +thank you all right so we have here this signal + +0:11:00.080,0:11:07.280 +right and then now we have some thickness in this case what is the + +0:11:07.280,0:11:14.660 +thickness of this guy see right so in this case this one is going to be C and + +0:11:14.660,0:11:18.589 +in the case of the stereophonic signal you're gonna just have two channels left + +0:11:18.589,0:11:30.170 +and right and this one keeps going down right all right so our kernels if I'd + +0:11:30.170,0:11:35.030 +like to perform a convolution over this signal right so you have different same + +0:11:35.030,0:11:44.150 +pussy right and so on right if I'd 
like to perform a convolution one big + +0:11:44.150,0:11:47.089 +convolution I'm not talking about two deconvolution right because they are + +0:11:47.089,0:11:52.670 +still using domain which is here number one right so this is actually important + +0:11:52.670,0:11:58.510 +so if I ask you what type of signal this is you're gonna be basically + +0:11:58.510,0:12:02.890 +you have to look at this number over here right so we are talking about one + +0:12:02.890,0:12:12.490 +dimensional signal which is one dimensional domain right 1d domain okay + +0:12:12.490,0:12:17.710 +so we are still using a 1d signal but in this case it has you know you have two + +0:12:17.710,0:12:25.750 +values per point so what kind of kernels are we gonna be using so I'm gonna just + +0:12:25.750,0:12:31.450 +draw it in this case we're gonna be using something similar like this so I'm + +0:12:31.450,0:12:37.990 +gonna be drawing this guy let's say I have K here which is gonna be my width + +0:12:37.990,0:12:42.700 +of the kernel but in this case I'm gonna be also have some thickness in this case + +0:12:42.700,0:12:56.230 +here right so basically you apply this thing here okay and then you can go + +0:12:56.230,0:13:04.060 +second line and third line and so on right so you may still have like here m + +0:13:04.060,0:13:11.590 +kernels but in this case you also have some thickness which has to match the + +0:13:11.590,0:13:17.680 +other thickness right so this thickness here has to match the thickness of the + +0:13:17.680,0:13:23.980 +input size so let me show you how to apply the convolution so you're gonna + +0:13:23.980,0:13:37.980 +get one of these slices here and then you're gonna be applying this over here + +0:13:39.320,0:13:46.190 +okay and then you simply go down this way + +0:13:46.190,0:13:53.870 +alright so whenever you apply these you perform this guy here the inner product + +0:13:53.870,0:14:04.410 +with these over here what you get it's actually a one by one is a scalar so + +0:14:04.410,0:14:09.540 +whenever I use this orange thingy here on the left hand side and I do a dot + +0:14:09.540,0:14:14.190 +product scalar product with this one I just get a scalar so this is actually my + +0:14:14.190,0:14:19.620 +convolution in 1d the convolution in 1d means that it goes down this way and + +0:14:19.620,0:14:27.480 +only in one way that's why it's called 1d but we multiply each element of this + +0:14:27.480,0:14:36.290 +mask times this guy here now a second row and this guy here okay + +0:14:36.290,0:14:41.090 +you saw you multiply all of them you sum all of them and then you get your first + +0:14:41.090,0:14:47.250 +output here okay so whenever I make this multiplication I get my first output + +0:14:47.250,0:14:52.050 +here then I keep sliding this kernel down and then you're gonna get the + +0:14:52.050,0:14:58.380 +second output third out fourth and so on until you go down at the end then what + +0:14:58.380,0:15:03.780 +happens then happens that I'm gonna be picking up different kernel I'm gonna + +0:15:03.780,0:15:07.950 +back it let's say I get the third one okay let's get the second one I get a + +0:15:07.950,0:15:19.050 +second one and I perform the same operation you're gonna get here this one + +0:15:19.050,0:15:23.240 +actually let's actually make it like a matrix + +0:15:26.940,0:15:33.790 +you go down okay until you go with the last one which is gonna be the end right + +0:15:33.790,0:15:45.450 +the empty kernel which is gonna be going down this way you get the last one here + 
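Here is a toy sketch of the operation just described on the board (the sizes are made up, and this is not the lecturer's code): each of the m kernels has the same thickness c as the input, slides along the n samples, and produces one scalar per position by a dot product over all channels and taps, so the stacked output has thickness m and length n - k + 1. The loop version agrees with PyTorch's built-in 1-D convolution.

```python
import torch
import torch.nn.functional as F

c, k, n, m = 2, 3, 8, 4          # channels, kernel size, signal length, number of kernels
x = torch.randn(c, n)            # e.g. a short stereo signal
w = torch.randn(m, c, k)         # m kernels, each of thickness c and width k

out = torch.empty(m, n - k + 1)
for i in range(m):                                   # one output plane per kernel
    for t in range(n - k + 1):                       # slide the kernel down the signal
        out[i, t] = (w[i] * x[:, t:t + k]).sum()     # dot product over channels and taps

ref = F.conv1d(x.unsqueeze(0), w).squeeze(0)         # PyTorch's version (no bias)
print(out.shape)                                     # torch.Size([4, 6]) -> n - k + 1 = 6
print(torch.allclose(out, ref, atol=1e-6))           # True
```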
+0:15:51.680,0:15:58.790 +okay yes no confusing clearing so this was the question I got at the end of the + +0:15:58.790,0:16:10.339 +class yeah Suzy yeah because it's a dot product of all those values between so + +0:16:10.339,0:16:18.259 +basically do the projection of this part of the signal onto this kernel so you'd + +0:16:18.259,0:16:22.879 +like to see what is the contribution like what is the alignment of this part + +0:16:22.879,0:16:27.350 +of the signal on to this specific subspace okay this is how a convolution + +0:16:27.350,0:16:31.850 +works when you have multiple channels so far I'll show you just with single + +0:16:31.850,0:16:35.319 +channel now we have multiple channels okay so oh yeah yeah in one second one + +0:16:54.259,0:16:59.509 +and one one at the top one at the bottom so you actually lose the first row here + +0:16:59.509,0:17:04.850 +and you lose the last row here so at the end in this case the output is going to + +0:17:04.850,0:17:10.490 +be n minus three plus one so you lose two one on top okay in this case you + +0:17:10.490,0:17:15.140 +lose two at the bottom if you actually do a Center at the center the + +0:17:15.140,0:17:20.390 +convolution usually you lose one at the beginning one at the end every time you + +0:17:20.390,0:17:24.409 +perform a convolution you lose the number of the dimension of the kernel + +0:17:24.409,0:17:28.789 +minus one you can try if you put your hand like this you have a kernel of + +0:17:28.789,0:17:34.340 +three you get the first one here and it is matching then you switch one and then + +0:17:34.340,0:17:39.440 +you switch to right so okay with fight let's tell a parent of two right so you + +0:17:39.440,0:17:44.149 +have your signal of five you have your kernel with two you have one two three + +0:17:44.149,0:17:49.070 +and four so we started with five and you end up with four because you use a + +0:17:49.070,0:17:54.500 +kernel size of two if you use a kernel size of three you get one two and three + +0:17:54.500,0:17:57.289 +so you goes to if you use a kernel size of three okay + +0:17:57.289,0:18:01.010 +so you can always try to do this alright so I'm gonna show you now the + +0:18:01.010,0:18:07.040 +dimensions of these kernels and the outputs with PyTorch okay Yes No + +0:18:07.040,0:18:18.500 +all right good okay mister can you see anything + +0:18:18.500,0:18:25.520 +yes right I mean zoom a little bit more okay so now we can go we do + +0:18:25.520,0:18:33.770 +conda activate pDL, pytorch Deep Learning. 
+ +0:18:33.770,0:18:40.520 +So here we can just run ipython if i press ctrl L I clear the screen and + +0:18:40.520,0:18:49.820 +we can do import torch then I can do from torch import nn so now we can see + +0:18:49.820,0:18:54.500 +for example called let's set my convolutional convolutional layer it's + +0:18:54.500,0:18:59.930 +going to be equal to NN conf and then I can keep going until I get + +0:18:59.930,0:19:04.220 +this one let's say yeah let's say I have no idea how to use this function I just + +0:19:04.220,0:19:08.750 +put a question mark I press ENTER and I'm gonna see here now the documentation + +0:19:08.750,0:19:13.460 +okay so in this case you're gonna have the first item is going to be the input + +0:19:13.460,0:19:19.820 +channel then I have the output channels then I have the corner sighs alright so + +0:19:19.820,0:19:24.290 +for example we are going to be putting here input channels we have a stereo + +0:19:24.290,0:19:30.530 +signal so we put two channels the number of corners we said that was M and let's + +0:19:30.530,0:19:36.650 +say we have 16 kernels so this is the number of kernels I'm gonna be using and + +0:19:36.650,0:19:41.810 +then let's have our kernel size of what the same I use here so let's have K or + +0:19:41.810,0:19:47.570 +the kernel size equal 3 okay in so here I'm going to define my first convolution + +0:19:47.570,0:19:52.910 +object so if I print this one comes you're gonna see we have a convolution a + +0:19:52.910,0:19:57.580 +2d combo sorry 1 deconvolution made that okay so we have a 1d convolution + +0:20:02.149,0:20:08.869 +which is going from two channels so a stereophonic to a sixteen channels means + +0:20:08.869,0:20:16.039 +I use sixteen kernels the skirmish size is 3 and then the stride is also 1 ok so + +0:20:16.039,0:20:23.859 +in this case I'm gonna be checking what is gonna be my convolutional weights + +0:20:27.429,0:20:33.379 +what is the size of the weights how many weights do we have how many how + +0:20:33.379,0:20:40.069 +many planes do we have for the weights 16 right so we have 16 weights what is + +0:20:40.069,0:20:53.649 +the length of the the day of the key of D of the kernel okay Oh what is this - + +0:20:54.549,0:21:00.349 +Janis right so I have 16 of these scanners which have thickness - and then + +0:21:00.349,0:21:05.539 +length of 3 ok makes sense right because you're gonna be applying each of these + +0:21:05.539,0:21:11.629 +16 across the whole signal so let's have my signal now you're gonna be is gonna + +0:21:11.629,0:21:20.599 +be equal toage dot R and and and oh sighs I don't know let's say 64 I also + +0:21:20.599,0:21:25.129 +have to say I have a batch of size 1 so I have a virtual site one so I just have + +0:21:25.129,0:21:31.879 +one signal and then this is gonna be 64 how many channels we said this has two + +0:21:31.879,0:21:37.819 +right so I have one signal one example which has two channels and has 64 + +0:21:37.819,0:21:46.689 +samples so this is my X hold on what is the convolutional bias size + +0:21:48.320,0:21:54.380 +a 16 right because you have one bias / plain / / / way ok so what's gonna be in + +0:21:54.380,0:22:07.539 +our my convolution of X the output hello so I'm gonna still have one sample right + +0:22:07.539,0:22:15.919 +how many channels 16 what is gonna be the length of the signal okay that's + +0:22:15.919,0:22:22.700 +good 6 fix it okay fantastic all right so what if I'm gonna be using + +0:22:22.700,0:22:32.240 +a convolution with size of the kernel 5 what do I get now yet to 
shout I can't + +0:22:32.240,0:22:36.320 +hear you 60 okay you're following fantastic okay + +0:22:36.320,0:22:44.059 +so let's try now instead to use a hyper spectral image with a 2d convolution + +0:22:44.059,0:22:49.100 +okay so I'm going to be coding now my convolution here is going to be my in + +0:22:49.100,0:22:55.490 +this case is correct or is going to be a conf come to D again I don't know how to + +0:22:55.490,0:22:59.059 +use it so I put a question mark and then I have here input channel output channel + +0:22:59.059,0:23:05.450 +criticize strident padding okay so I'm going to be putting inputs tried input + +0:23:05.450,0:23:10.429 +channel so it's a hyper spectral image with 20 planes so what's gonna be the + +0:23:10.429,0:23:16.149 +input in this case 20 right because you have you start from 20 spectral bands + +0:23:16.149,0:23:20.419 +then we're gonna be inputting the output number of channels we let's say we're + +0:23:20.419,0:23:25.330 +gonna be using again 16 in this case I'm going to be inputting the kernel size + +0:23:25.330,0:23:33.440 +since I'm planning to use okay let's actually define let's actually define my + +0:23:33.440,0:23:40.120 +signal first so my X is gonna be a torch dot R and and let's say one sample with + +0:23:40.120,0:23:52.820 +20 channels of height for example I guess 6128 well hold on 64 and then with + +0:23:52.820,0:23:58.820 +128 okay so this is gonna be my my input my eople data okay + +0:23:58.820,0:24:04.370 +so my convolution now it can be something like this so I have 20 + +0:24:04.370,0:24:09.110 +channels from input 16 our Mike Ernest I'm gonna be using then I'm gonna be + +0:24:09.110,0:24:15.050 +specifying the kernel size in this case let's use something that is like three + +0:24:15.050,0:24:24.580 +times five okay so what is going to be the output what are the kernel size + +0:24:29.170,0:24:47.630 +anyone yes no what no 20 Janice is the channels of the input data right so you + +0:24:47.630,0:24:51.680 +have how many kernels here 16 right there you go + +0:24:51.680,0:24:56.420 +we have 16 kernels which have 20 channels such that they can lay over the + +0:24:56.420,0:25:03.410 +input 3 by 5 right teeny like a short like yeah short but large ok so what is + +0:25:03.410,0:25:08.140 +gonna be my conv(x).size ? [1, 16, 62, 124]. 
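The shape checks from the interactive session so far, collected into one runnable snippet; the sizes are the ones used above, and the comments show the sizes reported in the session.

```python
import torch
from torch import nn

conv1 = nn.Conv1d(2, 16, 3)                  # stereo input, 16 kernels of size 3
x1 = torch.rand(1, 2, 64)                    # (batch, channels, samples)
print(conv1.weight.size())                   # torch.Size([16, 2, 3])
print(conv1.bias.size())                     # torch.Size([16])  -> one bias per kernel
print(conv1(x1).size())                      # torch.Size([1, 16, 62])
print(nn.Conv1d(2, 16, 5)(x1).size())        # torch.Size([1, 16, 60])

conv2 = nn.Conv2d(20, 16, (3, 5))            # hyperspectral input with 20 bands
x2 = torch.rand(1, 20, 64, 128)              # (batch, channels, height, width)
print(conv2.weight.size())                   # torch.Size([16, 20, 3, 5])
print(conv2(x2).size())                      # torch.Size([1, 16, 62, 124])
```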
Let's say I'd like to + +0:25:16.310,0:25:22.190 +actually add back the I'd like to head the sing dimensionality I can add some + +0:25:22.190,0:25:25.730 +padding right so here there is going to be the stride I'm gonna have a stride of + +0:25:25.730,0:25:29.930 +1 again if you don't remember the the syntax you can just put the question + +0:25:29.930,0:25:35.120 +mark can you figure out and then how much strive should I add now how much + +0:25:35.120,0:25:41.870 +stride in the y-direction sorry yes how much padding should I add in the + +0:25:41.870,0:25:46.490 +y-direction one because it's gonna be one on top one on the bottom but then + +0:25:46.490,0:25:51.890 +then on the x-direction okay you know you're following fantastic and so now if + +0:25:51.890,0:25:57.320 +I just run this one you wanna get the initial size okay so now you have both + +0:25:57.320,0:26:05.500 +1d and 2d the point is that what is the dimension of a convolutional kernel and + +0:26:05.500,0:26:12.470 +symbol for to the dimensional signal again I repeat what is the + +0:26:12.470,0:26:20.049 +dimensionality of the collection of careness use for two-dimensional data + +0:26:20.860,0:26:27.679 +again for right so four is gonna be the number of dimensions that are required + +0:26:27.679,0:26:35.659 +to store the collection of kernels when you perform 2d convolutions the one is + +0:26:35.659,0:26:40.370 +going to be the stride so if you don't know how this works you just put a + +0:26:40.370,0:26:44.000 +question mark and gonna tell you here so stride is gonna be telling you you + +0:26:44.000,0:26:50.929 +stride off you move every time the kernel by one if you are the first one + +0:26:50.929,0:26:55.460 +means you only is the batch size so torch expects you to always use batches + +0:26:55.460,0:27:00.110 +meaning how many signals you're using just one right so that our expectation + +0:27:00.110,0:27:04.549 +if you send an input vector which is going to be input tensor which has + +0:27:04.549,0:27:12.289 +dimension three is gonna be breaking and complain okay so we have still some time + +0:27:12.289,0:27:18.049 +to go in the second part all right second part is going to be so you've + +0:27:18.049,0:27:23.779 +been computing some derivatives right for the first homework right so the + +0:27:23.779,0:27:31.909 +following homework maybe you have to do you have to compute this one okay you're + +0:27:31.909,0:27:35.510 +supposed to be laughing it's a joke okay there you go + +0:27:35.510,0:27:43.340 +fantastic so this is what you can wrote back in the 90s for the computation of + +0:27:43.340,0:27:50.029 +the gradients of the of the lsdm which are gonna be covered I guess in next + +0:27:50.029,0:27:54.950 +next lesson so how somehow so they had to still do these things right it's kind + +0:27:54.950,0:28:00.769 +of crazy nevertheless we can use PyTorch to have automatic computation of these + +0:28:00.769,0:28:06.500 +gradients so we can go and check out how these automatic gradient works + +0:28:06.500,0:28:12.159 +okay all right so all right so we are going to be going + +0:28:23.090,0:28:28.490 +now to the notebook number three which is the yeah + +0:28:28.490,0:28:33.590 +invisible let me see if I can highlight it now it's even worse okay number three + +0:28:33.590,0:28:41.619 +Auto gratitute Oriole okay let me go fullscreen + +0:28:41.619,0:28:53.029 +okay so out of our tutorial was gonna be here here just create my tensor which + +0:28:53.029,0:28:57.499 +has as well these required gradients equal true 
in this case I mean asking + +0:28:57.499,0:29:02.539 +torch please track all the gradient computations did it got the competition + +0:29:02.539,0:29:07.749 +over the tensor such that we can perform computation of partial derivatives okay + +0:29:07.749,0:29:13.279 +in this case I'm gonna have my Y is going to be so X is simply gonna be one + +0:29:13.279,0:29:20.419 +two three four the Y is going to be X subtracted number two okay alright so + +0:29:20.419,0:29:26.869 +now we can notice that there is this grad F n grad f NN FN function here so + +0:29:26.869,0:29:32.059 +let's see what this stuff is we go sit there and see oh this is a sub backward + +0:29:32.059,0:29:37.629 +what is it meaning that the Y has been generated by a module which performs the + +0:29:37.629,0:29:43.669 +subtraction between X and and - right so you have X minus 2 therefore if you + +0:29:43.669,0:29:51.860 +check who generated Y well there's a sub a subtraction module ok so what's gonna + +0:29:51.860,0:30:01.009 +be now the God function of X you're supposed to answer oh okay + +0:30:01.009,0:30:03.580 +why is none because they should have written there + +0:30:07.580,0:30:12.020 +Alfredo generated that right okay all right none is fine as well + +0:30:12.020,0:30:17.000 +okay so let's actually put our nose inside we were here we can actually + +0:30:17.000,0:30:23.770 +access the first element you have the accumulation why is the accumulation I + +0:30:25.090,0:30:29.830 +don't know I forgot but then if you go inside there you're gonna see the + +0:30:29.830,0:30:34.760 +initial vector the initial tensor we are using is the one two three four okay so + +0:30:34.760,0:30:41.390 +inside this computational graph you can also find the original tensor okay all + +0:30:41.390,0:30:46.880 +right so let's now get the Z and inside is gonna be my Y square times three and + +0:30:46.880,0:30:51.620 +then I compute my average a it's gonna be the mean of Z right so if I compute + +0:30:51.620,0:30:56.330 +the square of this thing here and I multiply by three and I take the average + +0:30:56.330,0:31:00.500 +so this is the square part times 3 and then this is the average okay so you can + +0:31:00.500,0:31:06.200 +try if you don't believe me all right so let's see how this thing looks like so + +0:31:06.200,0:31:10.549 +I'm gonna be promoting here all these sequence of computations so we started + +0:31:10.549,0:31:16.669 +by from a two by two matrix what was this guy here to buy - who is this X + +0:31:16.669,0:31:22.399 +okay you're following it cool then we subtracted - right and then we + +0:31:22.399,0:31:27.440 +multiplied by Y twice right that's why you have to ro so you get the same + +0:31:27.440,0:31:31.669 +subtraction that is the whyatt the X minus 2 multiplied by itself then + +0:31:31.669,0:31:36.649 +you have another multiplication what is this okay multiply by three and then you + +0:31:36.649,0:31:42.980 +have the final the mean backward because this Y is green because it's mean no + +0:31:42.980,0:31:51.140 +okay yeah thank you for laughing okay so I compute back prop right + +0:31:51.140,0:31:59.409 +what does backdrop do what does this line do + +0:32:00.360,0:32:08.610 +I want to hear everyone you know already we compute what radians right so black + +0:32:08.610,0:32:11.580 +propagation is how you compute the gradients how do we train your networks + +0:32:11.580,0:32:20.730 +with gradients ain't right or whatever Aaron said yesterday back + +0:32:20.730,0:32:27.000 +propagation is that is used for 
computing the gradient completely + +0:32:27.000,0:32:29.970 +different things okay please keep them separate don't merge + +0:32:29.970,0:32:34.559 +them everyone after a bit that don't they don't see me those two things keep + +0:32:34.559,0:32:43.740 +colliding into one mushy thought don't it's painful okay she'll compute the + +0:32:43.740,0:32:51.659 +gradients right so guess what we are computing some gradients now okay so we + +0:32:51.659,0:33:02.580 +go on your page it's going to be what what was a it was the average right so + +0:33:02.580,0:33:10.529 +this is 1/4 right the summation of all those zᵢ + +0:33:10.529,0:33:17.460 +what so I goes from 1 to 4 okay so what is that I said I is going + +0:33:17.460,0:33:27.539 +to be equal to 3yᵢ² right yeah no questions no okay all right and then + +0:33:27.539,0:33:36.840 +this one is was equal to 3(x-2)² right so a what does it belong + +0:33:36.840,0:33:38.899 +to where does a belong to what is the ℝ + +0:33:44.279,0:33:51.200 +right so it's a scaler okay all right so now we can compute ∂a/∂x. + +0:33:51.200,0:33:58.110 +So how much is this stuff you're gonna have 1/4 comes out forum here and + +0:33:58.110,0:34:03.090 +then you have you know let's have this one with respect to the xᵢ element + +0:34:03.090,0:34:09.179 +okay so we're gonna have this one zᵢ inside is that, I have the 3yᵢ², + +0:34:09.179,0:34:15.899 +and it's gonna be 3(xᵢ- 2)². Right so these three comes + +0:34:15.899,0:34:22.080 +out here the two comes down as well and then you multiply by (xᵢ – 2). + +0:34:22.080,0:34:33.260 +So far should be correct okay fantastic all right so my X was this element here + +0:34:33.589,0:34:38.190 +actually let me compute as well this one so this one goes away this one becomes + +0:34:38.190,0:34:47.690 +true this is 1.5 times xᵢ – 3. Right - 2 - 3 + +0:34:55.159,0:35:06.780 +ok mathematics okay okay thank you all right. So what's gonna be ∂a/∂x ? 
+ +0:35:06.780,0:35:11.339 +I'm actually writing the transpose directly here so for the first element + +0:35:11.339,0:35:18.859 +you have one you have one times 1.5 so 1.5 minus 3 you get 1 minus 1.5 right + +0:35:18.859,0:35:23.670 +second one is going to be 3 minus 3 you get 0 Ryan this is 3 minus 3 + +0:35:23.670,0:35:27.420 +maybe I should write everything right so you're actually following so you have + +0:35:27.420,0:35:37.589 +1.5 minus 3 now you have 3 minus 3 below you have 4 point 5 minus 3 and then the + +0:35:37.589,0:35:47.160 +last one is going to be 6 minus 3 which is going to be equal to minus 1 point 5 + +0:35:47.160,0:35:59.789 +0 1 point 5 and then 3 right you agree ok let me just write this on here + +0:35:59.789,0:36:06.149 +okay just remember so we have you be computed the backpropagation here I'm + +0:36:06.149,0:36:14.609 +gonna just bring it to the gradients and then the right it's the same stuff we + +0:36:14.609,0:36:27.630 +got here right such that I don't have to transpose it here whenever you perform + +0:36:27.630,0:36:33.209 +the partial derivative in PyTorch you get the same the same shape is the input + +0:36:33.209,0:36:37.469 +dimension so if you have a weight whatever dimension then when you compute + +0:36:37.469,0:36:41.069 +the partial you still have the same dimension they don't swap they don't + +0:36:41.069,0:36:44.789 +turn okay they just use this for practicality at the correct version I + +0:36:44.789,0:36:49.919 +mean the the gradient should be the transpose of that thing sorry did + +0:36:49.919,0:36:54.479 +Jacobian which is the transpose of the gradient right if it's a vector but this + +0:36:54.479,0:37:08.130 +is a tensor so whatever we just used the same same shape thing no so this one + +0:37:08.130,0:37:13.639 +should be a flipping I believe maybe I'm wrong but I don't think all right so + +0:37:13.639,0:37:19.919 +this is like basic these basic PyTorch now you can do crazy stuff because we + +0:37:19.919,0:37:23.609 +like crazy right I mean I do I think if you like me you + +0:37:23.609,0:37:29.669 +like crazy right okay so here I just create my + +0:37:29.669,0:37:34.259 +vector X which is going to be a three dimensional well a one-dimensional + +0:37:34.259,0:37:43.769 +tensor of three items I'm going to be multiplying X by two then I call this + +0:37:43.769,0:37:49.859 +one Y then I start my counter to zero and then until the norm of the Y is long + +0:37:49.859,0:37:56.699 +thousand below thousand I keep doubling Y okay and so you can get like a dynamic + +0:37:56.699,0:38:01.529 +graph right the graph is base is conditional to the actual random + +0:38:01.529,0:38:04.979 +initialization which you can't even tell because I didn't even use a seed so + +0:38:04.979,0:38:08.999 +everyone that is running this stuff is gonna get different numbers so these are + +0:38:08.999,0:38:11.910 +the final values of the why can you tell me + +0:38:11.910,0:38:23.549 +how many iterations we run so the mean of this stuff is actually lower than a + +0:38:23.549,0:38:27.630 +thousand yeah but then I'm asking whether you know how many times this + +0:38:27.630,0:38:41.119 +loop went through no good why it's random Rises you know it's bad question + +0:38:41.119,0:38:45.539 +about bad questions next time I have a something for you okay so I'm gonna be + +0:38:45.539,0:38:51.569 +printing this one now I'm telling you the grabbed are 2048 right + +0:38:51.569,0:38:55.589 +just check the central one for the moment right this is the actual gradient + 
+0:38:55.589,0:39:04.739 +so can you tell me now how many times the loop went on so someone said 11 how + +0:39:04.739,0:39:14.420 +many ends up for 11 okay for people just roast their hands what about the others + +0:39:14.809,0:39:17.809 +21 okay any other guys 11 10 + +0:39:25.529,0:39:30.749 +okay we have actually someone that has the right solution and this loop went on + +0:39:30.749,0:39:35.759 +for 10 times why is that because you have the first multiplication by 2 here + +0:39:35.759,0:39:40.589 +and then loop goes on over and over and multiplies by 2 right so the final + +0:39:40.589,0:39:45.239 +number is gonna be the least number of iterations in the loop plus the + +0:39:45.239,0:39:50.779 +additional like addition and multiplication outside right yes no + +0:39:50.779,0:39:56.670 +you're sleeping maybe okay I told you not to eat before class otherwise you + +0:39:56.670,0:40:05.009 +get groggy okay so inference this is cool so here I'm gonna be just having + +0:40:05.009,0:40:09.420 +both my X & Y we are gonna just do linear regression right linear or + +0:40:09.420,0:40:17.670 +whatever think the add operator is just the scalar product okay so both the X + +0:40:17.670,0:40:21.589 +and W has have the requires gradient equal to true + +0:40:21.589,0:40:27.119 +being this means we are going to be keeping track of the the gradients and + +0:40:27.119,0:40:31.290 +the computational graph so if I execute this one you're gonna get the partial + +0:40:31.290,0:40:37.710 +derivatives of the inner product with respect to the Z with respect to the + +0:40:37.710,0:40:43.920 +input is gonna be the weights right so in the range is the input right and the + +0:40:43.920,0:40:47.160 +ones are the weights so partial derivative with respect to the input is + +0:40:47.160,0:40:50.070 +gonna be the weights partial with respect to the weights are gonna be the + +0:40:50.070,0:40:56.670 +input right yes no yes okay now I just you know usually it's this one is the + +0:40:56.670,0:41:00.359 +case I just have required gradients for my parameters because I'm gonna be using + +0:41:00.359,0:41:06.030 +the gradients for updating later on the the parameters of the mother is so in + +0:41:06.030,0:41:12.300 +this case you get none let's have in this case instead what I usually do + +0:41:12.300,0:41:17.250 +wanna do inference when I do inference I tell torch a torch stop tracking any + +0:41:17.250,0:41:22.950 +kind of operation so I say torch no God please so this one regardless of whether + +0:41:22.950,0:41:28.859 +your input always have the required grass true or false whatever when I say + +0:41:28.859,0:41:35.060 +torch no brats you do not have any computation a graph taken care of right + +0:41:35.060,0:41:41.130 +therefore if I try to run back propagation on a tensor which was + +0:41:41.130,0:41:46.320 +generated from like doesn't have actually you know graph because this one + +0:41:46.320,0:41:50.940 +doesn't have a graph you're gonna get an error okay so if I run this one you get + +0:41:50.940,0:41:55.410 +an error and you have a very angry face here because it's an error and then it + +0:41:55.410,0:42:00.720 +takes your element 0 of tensor does not require grads and does not have a god + +0:42:00.720,0:42:07.650 +function right so II which was the yeah whatever they reside here actually then + +0:42:07.650,0:42:11.400 +you couldn't run back problems that because there is no graph attached to + +0:42:11.400,0:42:19.710 +that ok questions this is so powerful you cannot do it this time 
with tensor + +0:42:19.710,0:42:26.790 +you okay tensor flow is like whatever yeah more stuff here actually more stuff + +0:42:26.790,0:42:30.600 +coming right now [Applause] + +0:42:30.600,0:42:36.340 +so we go back here we have inside the extra folder he has some nice cute + +0:42:36.340,0:42:40.450 +things I wanted to cover both of them just that we go just for the second I + +0:42:40.450,0:42:47.290 +think sorry the second one is gonna be the following so in this case we are + +0:42:47.290,0:42:52.750 +going to be generating our own specific modules so I like let's say I'd like to + +0:42:52.750,0:42:58.030 +define my own function which is super special amazing function I can decide if + +0:42:58.030,0:43:02.560 +I want to use it for you know training Nets I need to get the forward pass and + +0:43:02.560,0:43:06.220 +also have to know what is the partial derivative of the input respect to the + +0:43:06.220,0:43:10.930 +output such that I can use this module in any kind of you know point in my + +0:43:10.930,0:43:15.670 +inner code such that you know by using back prop you know chain rule you just + +0:43:15.670,0:43:20.320 +plug the thing. Yann went on several times as long as you know partial + +0:43:20.320,0:43:23.410 +derivative of the output with respect to the input you can plug these things + +0:43:23.410,0:43:31.690 +anywhere in your chain of operations so in this case we define my addition which + +0:43:31.690,0:43:35.620 +is performing the addition of the two inputs in this case but then when you + +0:43:35.620,0:43:41.130 +perform the back propagation if you have an addition what is the back propagation + +0:43:41.130,0:43:47.020 +so if you have a addition of the two things you get an output when you send + +0:43:47.020,0:43:53.320 +down the gradients what does it happen with the with the gradient it gets you + +0:43:53.320,0:43:57.160 +know copied over both sides right and that's why you get both of them are + +0:43:57.160,0:44:01.390 +copies or the same thing and they are sent through one side of the other you + +0:44:01.390,0:44:05.170 +can execute this stuff you're gonna see here you get the same gradient both ways + +0:44:05.170,0:44:09.460 +in this case I have a split so I come from the same thing and then I split and + +0:44:09.460,0:44:13.180 +I have those two things doing something else if I go down with the gradient what + +0:44:13.180,0:44:20.080 +do I do you add them right and that's why we have here the add install you can + +0:44:20.080,0:44:23.680 +execute this one you're going to see here that we had these two initial + +0:44:23.680,0:44:27.910 +gradients here and then when you went up or sorry when you went down the two + +0:44:27.910,0:44:30.790 +things the two gradients sum together and they are here okay + +0:44:30.790,0:44:36.190 +so again if you use pre-made things in PyTorch. 
They are correct this one you + +0:44:36.190,0:44:41.080 +can mess around you can put any kind of different in + +0:44:41.080,0:44:47.950 +for a function and backward function I think we ran out of time other questions + +0:44:47.950,0:44:58.800 +before we actually leave no all right so I see on Monday and stay warm diff --git a/docs/pt/week06/06-1.md b/docs/pt/week06/06-1.md new file mode 100644 index 000000000..cb6439af0 --- /dev/null +++ b/docs/pt/week06/06-1.md @@ -0,0 +1,285 @@ +--- +lang: pt +lang-ref: ch.06-1 +lecturer: Yann LeCun +title: Aplicações de Redes Convolucionais +authors: Shiqing Li, Chenqin Yang, Yakun Wang, Jimin Tan +date: 2 Mar 2020 +translator: Bernardo Lago +translation-date: 14 Nov 2021 +--- + + + + +## [Reconhecimento de código postal](https://www.youtube.com/watch?v=ycbMGyCPzvE&t=43s) + + + +Na aula anterior, demonstramos que uma rede convolucional pode reconhecer dígitos, no entanto, a questão permanece, como o modelo escolhe cada dígito e evita perturbação nos dígitos vizinhos. A próxima etapa é detectar objetos não sobrepostos e usar a abordagem geral de Supressão Não Máxima (NMS). Agora, dada a suposição de que a entrada é uma série de dígitos não sobrepostos, a estratégia é treinar várias redes convolucionais e usando o voto da maioria ou escolhendo os dígitos correspondentes à pontuação mais alta gerada pela rede convolucional. + + + +### Reconhecimento com CNN + + + +Aqui apresentamos a tarefa de reconhecer 5 CEPs não sobrepostos. O sistema não recebeu instruções sobre como separar cada dígito, mas sabe que deve prever 5 dígitos. O sistema (Figura 1) consiste em 4 redes convolucionais de tamanhos diferentes, cada uma produzindo um conjunto de saídas. A saída é representada em matrizes. As quatro matrizes de saída são de modelos com largura de kernel diferente na última camada. Em cada saída, há 10 linhas, representando 10 categorias de 0 a 9. O quadrado branco maior representa uma pontuação mais alta nessa categoria. Nestes quatro blocos de saída, os tamanhos horizontais das últimas camadas do kernel são 5, 4, 3 e 2, respectivamente. O tamanho do kernel decide a largura da janela de visualização do modelo na entrada, portanto, cada modelo está prevendo dígitos com base em tamanhos de janela diferentes. O modelo, então, obtém uma votação majoritária e seleciona a categoria que corresponde à pontuação mais alta naquela janela. Para extrair informações úteis, deve-se ter em mente que nem todas as combinações de caracteres são possíveis, portanto, a correção de erros com base nas restrições de entrada é útil para garantir que as saídas sejam códigos postais verdadeiros. + + + +
+
+ Figura 1: Múltiplos classificadores no reconhecimento do CEP +
+ + + +Agora, para impor a ordem dos personagens. O truque é utilizar um algoritmo de caminho mais curto. Uma vez que recebemos faixas de caracteres possíveis e o número total de dígitos a prever, podemos abordar esse problema calculando o custo mínimo de produção de dígitos e transições entre os dígitos. O caminho deve ser contínuo da célula inferior esquerda para a célula superior direita no gráfico, e o caminho é restrito para conter apenas movimentos da esquerda para a direita e de baixo para cima. Observe que se o mesmo número for repetido um ao lado do outro, o algoritmo deve ser capaz de distinguir que há números repetidos em vez de prever um único dígito. + + + +## [Detecção de faces](https://www.youtube.com/watch?v=ycbMGyCPzvE&t=1241s) + + + +As redes neurais convolucionais têm um bom desempenho em tarefas de detecção e a detecção de faces não é exceção. Para realizar a detecção de faces, coletamos um conjunto de dados de imagens com faces e sem faces, no qual treinamos uma rede convolucional com um tamanho de janela de 30 $\times$ 30 pixels e pedimos à rede para dizer se há um rosto ou não. Uma vez treinado, aplicamos o modelo a uma nova imagem e se houver faces dentro de uma janela de 30 $\times$ 30 pixels, a rede convolucional iluminará a saída nos locais correspondentes. No entanto, existem dois problemas. + + + +- **Falsos positivos**: Existem muitas variações diferentes de objetos não-face que podem aparecer em um patch de uma imagem. Durante o estágio de treinamento, o modelo pode não ver todos eles (*ou seja*, um conjunto totalmente representativo de remendos não faciais). Portanto, o modelo pode apresentar muitos falsos positivos no momento do teste. Por exemplo, se a rede não foi treinada em imagens contendo mãos, ela pode detectar rostos com base em tons de pele e classificar incorretamente manchas de imagens contendo mãos como rostos, dando origem a falsos positivos. + + + +- **Tamanho de rosto diferente:** Nem todos os rostos têm 30 $\times$ 30 pixels, portanto, rostos de tamanhos diferentes podem não ser detectados. Uma maneira de lidar com esse problema é gerar versões em várias escalas da mesma imagem. O detector original detectará rostos em torno de 30 $\times$ 30 pixels. Se aplicar uma escala na imagem do fator $\sqrt 2$, o modelo detectará faces que eram menores na imagem original, pois o que era 30 $\times$ 30 agora é 20 $\times$ 20 pixels aproximadamente. Para detectar rostos maiores, podemos reduzir o tamanho da imagem. Esse processo é barato, pois metade das despesas vem do processamento da imagem original sem escala. A soma das despesas de todas as outras redes combinadas é quase a mesma do processamento da imagem original sem escala. O tamanho da rede é o quadrado do tamanho da imagem de um lado, então, se você reduzir a imagem em $\sqrt 2$, a rede que você precisa para executar é menor em um fator de 2. Portanto, o custo geral é $1+1/2+1/4+1/8+1/16…$, que é 2. Executar um modelo em várias escalas apenas duplica o custo computacional. + + + +### Um sistema de detecção de faces em várias escalas + + + +
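
Para ilustrar a ideia acima de uma matriz de pontuações (10 categorias por posição horizontal), segue um esboço hipotético em PyTorch: um único classificador convolucional, aplicado a uma imagem mais larga do que a janela de treinamento, devolve uma pontuação por classe em cada posição. A arquitetura e as dimensões são apenas ilustrativas, não as do sistema original.

```python
import torch
import torch.nn as nn

# Classificador convolucional hipotético: aplicado a uma entrada mais larga,
# devolve uma grade de pontuações (10 classes x várias posições horizontais),
# como as matrizes de saída descritas acima.
modelo = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=5), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 10, kernel_size=12),   # camada final também convolucional
)

imagem = torch.randn(1, 1, 28, 96)       # 1 imagem de altura 28, mais larga que 28x28
pontuacoes = modelo(imagem)              # forma: (1, 10, 1, 35)
por_posicao = pontuacoes.squeeze(0).squeeze(1)   # (10, 35): classes x posições
print(por_posicao.argmax(dim=0))         # classe de maior pontuação em cada posição
```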
+
+ Figura 2: Sistema de detecção de faces +
+ + + +Os mapas mostrados na (Figura 3) indicam a pontuação dos detectores de face. Este detector facial reconhece rostos com tamanho de 20 $\times$ 20 pixels. Em escala fina (Escala 3), há muitas pontuações altas, mas não são muito definitivas. Quando o fator de escala aumenta (Escala 6), vemos mais regiões brancas agrupadas. Essas regiões brancas representam rostos detectados. Em seguida, aplicamos a supressão não máxima para obter a localização final do rosto. + + + +
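
Um esboço mínimo da pirâmide multiescala descrita acima, assumindo um fator de $\sqrt 2$ entre escalas; o detector em si seria qualquer rede que recebe uma imagem e devolve um mapa de pontuações (aqui omitido).

```python
import torch
import torch.nn.functional as F

# Gera versões da imagem reduzidas por potências de sqrt(2). Como a área cai pela
# metade a cada nível, o custo total da pirâmide é ~1 + 1/2 + 1/4 + ... ,
# aproximadamente 2x o custo de processar só a imagem original, como discutido acima.
def piramide(imagem, n_escalas=5, fator=2 ** 0.5):
    h, w = imagem.shape[-2:]
    niveis = []
    for k in range(n_escalas):
        s = fator ** (-k)
        niveis.append(F.interpolate(imagem, size=(round(h * s), round(w * s)),
                                    mode='bilinear', align_corners=False))
    return niveis

img = torch.randn(1, 3, 256, 256)
for nivel, im in enumerate(piramide(img)):
    print(nivel, tuple(im.shape[-2:]))   # (256, 256), (181, 181), (128, 128), ...
```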
+
+ Figura 3: Pontuações do detector facial para vários fatores de escala +
+ + + +### Supressão não máxima + + + +Para cada região de alta pontuação, provavelmente há um rosto por baixo. Se forem detectados mais rostos muito próximos do primeiro, significa que apenas um deve ser considerado correto e os demais estão errados. Com a supressão não máxima, pegamos a pontuação mais alta das caixas delimitadoras sobrepostas e removemos as outras. O resultado será uma única caixa delimitadora no local ideal. + + + +### Mineração negativa + + + +Na última seção, discutimos como o modelo pode ser executado em um grande número de falsos positivos no momento do teste, pois há muitas maneiras de objetos não-face parecerem semelhantes a um rosto. Nenhum conjunto de treinamento incluirá todos os possíveis objetos não-rostos que se parecem com rostos. Podemos mitigar esse problema por meio da mineração negativa. Na mineração negativa, criamos um conjunto de dados negativos de patches não faciais que o modelo detectou (erroneamente) como faces. Os dados são coletados executando o modelo em entradas que são conhecidas por não conterem faces. Em seguida, treinamos novamente o detector usando o conjunto de dados negativo. Podemos repetir esse processo para aumentar a robustez do nosso modelo contra falsos positivos. + + + +## Segmentação semântica + + + +A segmentação semântica é a tarefa de atribuir uma categoria a cada pixel em uma imagem de entrada. + + + +### [CNN para Visão de Robôs Adaptável de Longo Alcance](https://www.youtube.com/watch?v=ycbMGyCPzvE&t=1669s) + + + +Neste projeto, o objetivo era rotular regiões a partir de imagens de entrada para que um robô pudesse distinguir entre estradas e obstáculos. Na figura, as regiões verdes são áreas nas quais o robô pode dirigir e as regiões vermelhas são obstáculos como grama alta. Para treinar a rede para essa tarefa, pegamos um patch da imagem e rotulamos manualmente como atravessável ou não (verde ou vermelho). Em seguida, treinamos a rede convolucional nos patches, pedindo-lhe para prever a cor do patch. Uma vez que o sistema esteja suficientemente treinado, ele é aplicado em toda a imagem, rotulando todas as regiões da imagem como verdes ou vermelhas. + + + +
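
Como mencionado acima, a localização final do rosto é obtida com supressão não máxima, detalhada na subseção a seguir. Um esboço mínimo do procedimento, assumindo caixas no formato $(x_1, y_1, x_2, y_2)$ e um limiar de IoU escolhido arbitrariamente:

```python
import torch

# Supressão não máxima: mantém a caixa de maior pontuação e descarta as caixas
# que se sobrepõem demais a ela, repetindo o processo com as restantes.
def nms(caixas, pontuacoes, limiar_iou=0.5):
    ordem = pontuacoes.argsort(descending=True)
    mantidas = []
    while ordem.numel() > 0:
        i = ordem[0]
        mantidas.append(i.item())
        if ordem.numel() == 1:
            break
        resto = caixas[ordem[1:]]
        # interseção entre a caixa de maior pontuação e as demais
        x1 = torch.maximum(caixas[i, 0], resto[:, 0])
        y1 = torch.maximum(caixas[i, 1], resto[:, 1])
        x2 = torch.minimum(caixas[i, 2], resto[:, 2])
        y2 = torch.minimum(caixas[i, 3], resto[:, 3])
        inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
        area_i = (caixas[i, 2] - caixas[i, 0]) * (caixas[i, 3] - caixas[i, 1])
        area_r = (resto[:, 2] - resto[:, 0]) * (resto[:, 3] - resto[:, 1])
        iou = inter / (area_i + area_r - inter)
        ordem = ordem[1:][iou <= limiar_iou]
    return torch.tensor(mantidas)

caixas = torch.tensor([[10., 10., 40., 40.], [12., 12., 42., 42.], [100., 100., 130., 130.]])
pontos = torch.tensor([0.9, 0.8, 0.7])
print(nms(caixas, pontos))   # tensor([0, 2]): a segunda caixa é suprimida
```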
+
+ Figura 4: CNN para Visão do Robô Adaptável de Longo Alcance (programa DARPA LAGR 2005-2008) +
+ + + +Havia cinco categorias de previsão: 1) superverde, 2) verde, 3) roxo: linha do pé do obstáculo, 4) obstáculo vermelho 5) super vermelho: definitivamente um obstáculo. + + + +**Rótulos estéreo** (Figura 4, Coluna 2) + As imagens são capturadas pelas 4 câmeras do robô, que são agrupadas em 2 pares de visão estéreo. Usando as distâncias conhecidas entre as câmeras do par estéreo, as posições de cada pixel no espaço 3D são estimadas medindo as distâncias relativas entre os pixels que aparecem em ambas as câmeras em um par estéreo. Este é o mesmo processo que nosso cérebro usa para estimar a distância dos objetos que vemos. Usando as informações de posição estimada, um plano é ajustado ao solo e os pixels são rotulados como verdes se estiverem próximos do solo e vermelhos se estiverem acima dele. + + + +* **Limitações e motivação para ConvNet**: A visão estéreo funciona apenas até 10 metros e dirigir um robô requer visão de longo alcance. Um ConvNet, entretanto, é capaz de detectar objetos em distâncias muito maiores, se treinado corretamente. + + + +
+
+ Figura 5: Pirâmide invariante de escala de imagens normalizadas por distância +
+ + + +* **Servido como entradas do modelo**: o pré-processamento importante inclui a construção de uma pirâmide invariante de escala de imagens normalizadas por distância (Figura 5). É semelhante ao que fizemos anteriormente nesta aula, quando tentamos detectar faces de escalas múltiplas. + + + +**Saídas do modelo** (Figura 4, Coluna 3) + + + +O modelo gera um rótulo para cada pixel na imagem **até o horizonte**. Estas são as saídas do classificador de uma rede convolucional multi-escala. + + + +* **Como o modelo se torna adaptativo**: Os robôs têm acesso contínuo às etiquetas estéreo, permitindo que a rede seja treinada novamente, adaptando-se ao novo ambiente em que se encontra. Observe que apenas a última camada da rede seria refeita -treinado. As camadas anteriores são treinadas em laboratório e fixas. + + + +**Performance do sistema** + + + +Ao tentar chegar a uma coordenada GPS do outro lado de uma barreira, o robô "avistou" a barreira de longe e planejou uma rota para evitá-la. Isso é graças à CNN detectando objetos até 50-100m de distância. + + + +**Limitação** + + + +Na década de 2000, os recursos de computação eram restritos. O robô foi capaz de processar cerca de 1 quadro por segundo, o que significa que ele não seria capaz de detectar uma pessoa que andasse em seu caminho por um segundo inteiro antes de ser capaz de reagir. A solução para essa limitação é um modelo de **Odometria visual de baixo custo**. Não é baseado em redes neurais, tem uma visão de ~2,5m, mas reage rapidamente. + + + +### Análise e rotulagem de cenas + + + +Nesta tarefa, o modelo gera uma categoria de objeto (edifícios, carros, céu, etc.) para cada pixel. A arquitetura também é multi-escala (Figura 6). + + + +
+
+ Figura 6: CNN em várias escalas para análise de cena +
+ + + +Observe que se projetarmos de volta uma saída da CNN na entrada, ela corresponderá a uma janela de entrada de tamanho $46\times46$ na imagem original na parte inferior da Pirâmide Laplaciana. Isso significa que estamos **usando o contexto de $46\times46$ pixels para decidir a categoria do pixel central**. + + + +No entanto, às vezes, esse tamanho de contexto não é suficiente para determinar a categoria de objetos maiores. + + + +**A abordagem multiescala permite uma visão mais ampla, fornecendo imagens extras redimensionadas como entradas.** As etapas são as seguintes: +1. Pegue a mesma imagem, reduza-a pelo fator de 2 e pelo fator de 4, separadamente. +2. Essas duas imagens redimensionadas extras são alimentadas **a mesma ConvNet** (mesmos pesos, mesmos kernels) e obtemos outros dois conjuntos de recursos de nível 2. +3. **Aumente a amostra** desses recursos para que tenham o mesmo tamanho que os Recursos de Nível 2 da imagem original. +4. **Empilhe** os três conjuntos de recursos (amostrados) e os envie a um classificador. + + + +Agora, o maior tamanho efetivo de conteúdo, que é da imagem redimensionada de 1/4, é $184\times 184\, (46\times 4=184)$. + + + +**Desempenho**: sem pós-processamento e execução quadro a quadro, o modelo funciona muito rápido, mesmo em hardware padrão. Tem um tamanho bastante pequeno de dados de treinamento (2k ~ 3k), mas os resultados ainda são recordes. + diff --git a/docs/pt/week06/06-2.md b/docs/pt/week06/06-2.md new file mode 100644 index 000000000..885769e86 --- /dev/null +++ b/docs/pt/week06/06-2.md @@ -0,0 +1,586 @@ +--- +lang: pt +lang-ref: ch.06-2 +lecturer: Yann LeCun +title: RNNs, GRUs, LSTMs, Modelos de Atenção, Seq2Seq e Redes com Memória +authors: Jiayao Liu, Jialing Xu, Zhengyang Bian, Christina Dominguez +date: 2 March 2020 +translator: Bernardo Lago +translation-date: 14 Nov 2021 +--- + + + +## [Arquitetura de Aprendizagem Profunda](https://www.youtube.com/watch?v=ycbMGyCPzvE&t=2620s) + + + +Na aprendizagem profunda, existem diferentes módulos para realizar diferentes funções. A especialização em aprendizagem profunda envolve o projeto de arquiteturas para concluir tarefas específicas. Semelhante a escrever programas com algoritmos para dar instruções a um computador nos dias anteriores, o aprendizado profundo reduz uma função complexa em um gráfico de módulos funcionais (possivelmente dinâmicos), cujas funções são finalizadas pelo aprendizado. + + + +Como com o que vimos com redes convolucionais, a arquitetura de rede é importante. + + + +## Redes Neurais Recorrentes + + + +Em uma Rede Neural Convolucional, o gráfico ou as interconexões entre os módulos não podem ter laços. Existe pelo menos uma ordem parcial entre os módulos, de modo que as entradas estão disponíveis quando calculamos as saídas. + + + +Conforme mostrado na Figura 1, existem loops nas Redes Neurais Recorrentes. + + + +
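
Um esboço, com dimensões hipotéticas, do compartilhamento de pesos entre escalas enumerado nas etapas logo abaixo: a mesma rede é aplicada à imagem nas escalas 1, 1/2 e 1/4, e os mapas de características são reamostrados e empilhados antes do classificador.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A MESMA rede (mesmos pesos) extrai características em três escalas; os mapas
# são reamostrados para o tamanho original e concatenados, ampliando o contexto
# efetivo visto pelo classificador de cada pixel.
extrator = nn.Sequential(nn.Conv2d(3, 16, 5, padding=2), nn.ReLU(),
                         nn.Conv2d(16, 32, 5, padding=2), nn.ReLU())

def caracteristicas_multiescala(img):
    h, w = img.shape[-2:]
    mapas = []
    for s in (1.0, 0.5, 0.25):
        x = F.interpolate(img, size=(int(h * s), int(w * s)),
                          mode='bilinear', align_corners=False)
        f = extrator(x)
        mapas.append(F.interpolate(f, size=(h, w), mode='bilinear', align_corners=False))
    return torch.cat(mapas, dim=1)              # (N, 3 * 32, H, W)

img = torch.randn(1, 3, 128, 128)
print(caracteristicas_multiescala(img).shape)   # torch.Size([1, 96, 128, 128])
```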
+
+Figura 1. Rede Neural Recorrente com loops +
+ + + +- $x(t)$: entrada que varia ao longo do tempo + - $\text{Enc}(x(t))$: codificador que gera uma representação de entrada + - $h(t)$: uma representação da entrada + - $w$: parâmetros treináveis + - $z(t-1)$: estado oculto anterior, que é a saída da etapa de tempo anterior + - $z(t)$: estado oculto atual + - $g$: função que pode ser uma rede neural complicada; uma das entradas é $z(t-1)$ que é a saída da etapa de tempo anterior + - $\text{Dec}(z(t))$: decodificador que gera uma saída + + + + +## Redes Neurais Recorrentes: desenrolando os loops + + + +Desenrole o loop no tempo. A entrada é uma sequência $x_1, x_2, \cdots, x_T$. + + + +
+ " +
+Figura 2. Redes recorrentes com loop desenrolado +
+ + + +Na Figura 2, a entrada é $x_1, x_2, x_3$. + + + +No tempo t = 0, a entrada $x(0)$ é passada para o codificador e ele gera a representação $h(x(0)) = \text{Enc}(x(0))$ e então a passa para G para gerar o estado oculto $z(0) = G(h_0, z', w)$. Em $t = 0$, $z'$ em $G$ pode ser inicializado como $0$ ou inicializado aleatoriamente. $z(0)$ é passado para o decodificador para gerar uma saída e também para a próxima etapa de tempo. + + + +Como não há loops nesta rede, podemos implementar a retropropagação. + + + +A Figura 2 mostra uma rede regular com uma característica particular: cada bloco compartilha os mesmos pesos. Três codificadores, decodificadores e funções G têm os mesmos pesos, respectivamente, em diferentes intervalos de tempo. + + + +BPTT: Retropropagação através do tempo (Backpropagation through time). Infelizmente, o BPTT não funciona tão bem na forma mais simples de RNN. + + + +Problemas com RNNs: + + + +1. Perda da informação do Gradiente (Dissipação do Gradiente) + - Em uma longa sequência, os gradientes são multiplicados pela matriz de peso (transposição) a cada passo de tempo. Se houver valores pequenos na matriz de peso, a norma dos gradientes fica cada vez menor exponencialmente. +2. Explosão de gradientes + - Se tivermos uma matriz de peso grande e a não linearidade na camada recorrente não for saturada, os gradientes explodirão. Os pesos irão divergir na etapa de atualização. Podemos ter que usar uma pequena taxa de aprendizado para que o gradiente descendente funcione. + + + +Uma razão para usar RNNs é a vantagem de lembrar informações do passado. No entanto, ele pode falhar ao memorizar as informações há muito tempo em um RNN simples sem truques. + + + +Um exemplo que tem problema de perda da informação do gradiente: + + + +A entrada são os caracteres de um programa em C. O sistema dirá se é um programa sintaticamente correto. Um programa sintaticamente correto deve ter um número válido de chaves e parênteses. Portanto, a rede deve lembrar quantos parênteses e colchetes devem ser verificados e se todos eles foram fechados. A rede precisa armazenar essas informações em estados ocultos, como um contador. No entanto, devido ao desaparecimento de gradientes, ele deixará de preservar essas informações em um programa longo. + + + +## Truques em RNN + + + +- gradientes de recorte: (evite a explosão de gradientes) + Esmague os gradientes quando eles ficarem muito grandes. +- Inicialização (começar no estádio certo evita explodir / desaparecer) + Inicialize as matrizes de peso para preservar a norma até certo ponto. Por exemplo, a inicialização ortogonal inicializa a matriz de peso como uma matriz ortogonal aleatória. + + + +## Módulos Multiplicativos + + + +Em módulos multiplicativos, ao invés de apenas computar uma soma ponderada de entradas, calculamos produtos de entradas e, em seguida, calculamos a soma ponderada disso. + + + +Suponha que $x \in {R}^{n\times1}$, $W \in {R}^{m \times n}$, $U \in {R}^{m \times n \times d}$ e $z \in {R}^{d\times1}$. Aqui U é um tensor. 
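
Um esboço mínimo, em PyTorch, da recorrência desenrolada da Figura 2: aqui Enc, G e Dec são transformações afins simples seguidas de tanh (no texto, G pode ser uma rede bem mais complicada) e são reutilizadas, com os mesmos pesos, em todos os passos de tempo. As dimensões são apenas ilustrativas.

```python
import torch
import torch.nn as nn

# Os mesmos módulos (Enc, G, Dec) são aplicados em cada passo de tempo;
# o estado z é realimentado de um passo para o seguinte.
d_x, d_h, d_z, d_y = 4, 8, 8, 3
Enc = nn.Linear(d_x, d_h)
G   = nn.Linear(d_h + d_z, d_z)
Dec = nn.Linear(d_z, d_y)

x = torch.randn(3, d_x)            # sequência x(0), x(1), x(2)
z = torch.zeros(d_z)               # estado inicial z'
saidas = []
for t in range(x.shape[0]):
    h = torch.tanh(Enc(x[t]))      # representação da entrada
    z = torch.tanh(G(torch.cat([h, z])))
    saidas.append(Dec(z))          # uma saída por passo de tempo
print(torch.stack(saidas).shape)   # torch.Size([3, 3])
```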
+ + + +$$ +w_{ij} = u_{ij}^\top z = +\begin{pmatrix} +u_{ij1} & u_{ij2} & \cdots &u_{ijd}\\ +\end{pmatrix} +\begin{pmatrix} +z_1\\ +z_2\\ +\vdots\\ +z_d\\ +\end{pmatrix} = \sum_ku_{ijk}z_k +$$ + + + +$$ +s = +\begin{pmatrix} +s_1\\ +s_2\\ +\vdots\\ +s_m\\ +\end{pmatrix} = Wx = \begin{pmatrix} +w_{11} & w_{12} & \cdots &w_{1n}\\ +w_{21} & w_{22} & \cdots &w_{2n}\\ +\vdots\\ +w_{m1} & w_{m2} & \cdots &w_{mn} +\end{pmatrix} +\begin{pmatrix} +x_1\\ +x_2\\ +\vdots\\ +x_n\\ +\end{pmatrix} +$$ + + + +onde $s_i = w_{i}^\top x = \sum_j w_{ij}x_j$. + + + +A saída do sistema é uma soma ponderada clássica de entradas e pesos. Os próprios pesos também são somas ponderadas de pesos e entradas. + + + +Arquitetura de hiper-rede: os pesos são calculados por outra rede. + + + + +## Atenção (Attention) + + + +$x_1$ e $x_2$ são vetores, $w_1$ e $w_2$ são escalares após softmax onde $w_1 + w_2 = 1$, e $w_1$ e $w_2$ estão entre 0 e 1. + + + +$w_1x_1 + w_2x_2$ é uma soma ponderada de $x_1$ e $x_2$ ponderada pelos coeficientes $w_1$ e $w_2$. + + + +Alterando o tamanho relativo de $w_1$ e $w_2$, podemos mudar a saída de $w_1x_1 + w_2x_2$ para $x_1$ ou $x_2$ ou algumas combinações lineares de $x_1$ e $x_2$. + + + +As entradas podem ter vários vetores $x$ (mais de $x_1$ e $x_2$). O sistema escolherá uma combinação apropriada, cuja escolha é determinada por outra variável z. Um mecanismo de atenção permite que a rede neural concentre sua atenção em determinadas entradas e ignore as outras. + + + +A atenção é cada vez mais importante em sistemas de PNL que usam arquiteturas de transformador ou outros tipos de atenção. + + + +Os pesos são independentes dos dados porque z é independente dos dados. + + + + +## [Gated Recurrent Units (GRU)](https://www.youtube.com/watch?v=ycbMGyCPzvE&t=3549s) + + + +Como mencionado acima, RNN sofre de dissipação e explosão de gradientes e não consegue se lembrar dos estados por muito tempo. GRU, [Cho, 2014](https://arxiv.org/abs/1406.1078), é uma aplicação de módulos multiplicativos que tenta resolver esses problemas. É um exemplo de rede recorrente com memória (outra é LSTM). A estrutura de uma unidade GRU é mostrada abaixo: + + + +
+
+Figura 3. Gated Recurrent Unit +
+ + + +$$ +\begin{array}{l} +z_t = \sigma_g(W_zx_t + U_zh_{t-1} + b_z)\\ +r_t = \sigma_g(W_rx_t + U_rh_{t-1} + b_r)\\ +h_t = z_t\odot h_{t-1} + (1- z_t)\odot\phi_h(W_hx_t + U_h(r_t\odot h_{t-1}) + b_h) +\end{array} +$$ + + + +onde $\odot$ denota multiplicação elemento a elemento (produto Hadamard), $ x_t $ é o vetor de entrada, $h_t$é o vetor de saída, $z_t$ é o vetor de porta de atualização, $r_t$ é o vetor de porta de reset, $\phi_h$ é um tanh hiperbólico e $W$, $U$, $b$ são parâmetros que podem ser aprendidos. + + + +Para ser específico, $z_t$ é um vetor de passagem que determina quanto das informações do passado deve ser repassado para o futuro. Ele aplica uma função sigmóide à soma de duas camadas lineares e um viés sobre a entrada $x_t$ e o estado anterior $h_{t-1}$. $z_t$ contém coeficientes entre 0 e 1 como resultado da aplicação de sigmóide. O estado de saída final $ h_t $ é uma combinação convexa de $h_{t-1}$ e $\phi_h(W_hx_t + U_h(r_t\odot h_{t-1}) + b_h)$ via $z_t$. Se o coeficiente for 1, a saída da unidade atual é apenas uma cópia do estado anterior e ignora a entrada (que é o comportamento padrão). Se for menor que um, leva em consideração algumas novas informações da entrada. + + + +A porta de reinicialização $r_t$ é usada para decidir quanto das informações anteriores deve ser esquecido. No novo conteúdo de memória $\phi_h(W_hx_t + U_h(r_t\odot h_{t-1}) + b_h)$, se o coeficiente em $r_t$ for 0, então ele não armazena nenhuma das informações do passado. Se ao mesmo tempo $z_t$ for 0, então o sistema será completamente reiniciado, já que $h_t$ só olharia para a entrada. + + + + +## LSTM (Long Short-Term Memory) + + + +GRU é na verdade uma versão simplificada do LSTM que saiu muito antes, [Hochreiter, Schmidhuber, 1997](https://www.bioinf.jku.at/publications/older/2604.pdf). Ao construir células de memória para preservar informações anteriores, os LSTMs também visam resolver problemas de perda de memória de longo prazo em RNNs. A estrutura dos LSTMs é mostrada abaixo: + + + +
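
Na prática, não é preciso implementar as portas manualmente: o módulo `nn.GRU` do PyTorch implementa essencialmente as equações mostradas logo abaixo (com pequenas diferenças de convenção). Um exemplo mínimo de uso, com dimensões arbitrárias:

```python
import torch
import torch.nn as nn

# GRU pronta do PyTorch: recebe um lote de sequências e devolve a saída em cada
# passo de tempo e o estado oculto final.
gru = nn.GRU(input_size=10, hidden_size=20, num_layers=1, batch_first=True)
x = torch.randn(5, 7, 10)           # lote de 5 sequências, 7 passos, 10 atributos
saida, h_final = gru(x)
print(saida.shape, h_final.shape)   # torch.Size([5, 7, 20]) torch.Size([1, 5, 20])
```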
+
+Figura 4. LSTM +
+ + + +$$ +\begin{array}{l} +f_t = \sigma_g(W_fx_t + U_fh_{t-1} + b_f)\\ +i_t = \sigma_g(W_ix_t + U_ih_{t-1} + b_i)\\ +o_t = \sigma_o(W_ox_t + U_oh_{t-1} + b_o)\\ +c_t = f_t\odot c_{t-1} + i_t\odot \tanh(W_cx_t + U_ch_{t-1} + b_c)\\ +h_t = o_t \odot\tanh(c_t) +\end{array} +$$ + + + +onde $\odot$ denota multiplicação elemento a elemento, $x_t\in\mathbb{R}^a$ é um vetor de entrada para a unidade LSTM, $f_t\in\mathbb{R}^h$ é o vetor de ativação do portal de esquecimento , $i_t\in\mathbb{R}^h$ é o vetor de ativação da porta de entrada / atualização, $o_t\in\mathbb{R}^h$ é o vetor de ativação da porta de saída, $h_t\in\mathbb{R}^h$ é o vetor de estado oculto (também conhecido como saída), $c_t\in\mathbb{R}^h$ é o vetor de estado da célula. + + + +Uma unidade LSTM usa um estado de célula $c_t$ para transmitir as informações através da unidade. Ele regula como as informações são preservadas ou removidas do estado da célula por meio de estruturas chamadas de portas. A porta de esquecimento $f_t$ decide quanta informação queremos manter do estado da célula anterior $c_{t-1}$ olhando para a entrada atual e o estado anterior oculto, e produz um número entre 0 e 1 como o coeficiente de $ c_ {t-1} $. $ \ tanh (W_cx_t + U_ch_ {t-1} + b_c) $ calcula um novo candidato para atualizar o estado da célula e, como a porta de esquecimento, a porta de entrada $ i_t $ decide quanto da atualização a ser aplicada. Finalmente, a saída $ h_t $ será baseada no estado da célula $ c_t $, mas será colocada em um $ \ tanh $ e então filtrada pela porta de saída $ o_t $. + + + +Embora os LSTMs sejam amplamente usados na PNL, sua popularidade está diminuindo. Por exemplo, o reconhecimento de voz está se movendo em direção ao uso de CNN temporal, e a PNL está se movendo em direção ao uso de transformadores. + + + + +## Modelo Sequência para Sequência (Seq2Seq) + + + +A abordagem proposta por [Sutskever NIPS 2014](https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf) é o primeiro sistema de tradução automática neural a ter comparação desempenho às abordagens clássicas. Ele usa uma arquitetura do tipo codificador-decodificador em que o codificador e o decodificador são LSTMs de várias camadas. + + + +
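
Assim como no caso da GRU, o módulo `nn.LSTM` já implementa essas portas; além da saída, ele devolve o par $(h_T, c_T)$, o estado oculto e o estado da célula do último passo, correspondendo a $h_t$ e $c_t$ nas equações a seguir. Dimensões apenas ilustrativas.

```python
import torch
import torch.nn as nn

# LSTM pronta do PyTorch: além da saída em cada passo, devolve o estado oculto
# e o estado da célula do último passo de tempo.
lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
x = torch.randn(5, 7, 10)
saida, (h_T, c_T) = lstm(x)
print(saida.shape, h_T.shape, c_T.shape)
# torch.Size([5, 7, 20]) torch.Size([1, 5, 20]) torch.Size([1, 5, 20])
```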
+
+Figura 5. Seq2Seq +
+ + + +Cada célula na figura é um LSTM. Para o codificador (a parte à esquerda), o número de intervalos de tempo é igual ao comprimento da frase a ser traduzida. Em cada etapa, há uma pilha de LSTMs (quatro camadas no papel) onde o estado oculto do LSTM anterior é alimentado para o próximo. A última camada da última etapa de tempo produz um vetor que representa o significado de toda a frase, que é então alimentado em outro LSTM de várias camadas (o decodificador), que produz palavras no idioma de destino. No decodificador, o texto é gerado de forma sequencial. Cada etapa produz uma palavra, que é alimentada como uma entrada para a próxima etapa de tempo. + + + +Essa arquitetura não é satisfatória de duas maneiras: primeiro, todo o significado da frase deve ser comprimido no estado oculto entre o codificador e o decodificador. Em segundo lugar, os LSTMs na verdade não preservam informações por mais de cerca de 20 palavras. A correção para esses problemas é chamada de Bi-LSTM, que executa dois LSTMs em direções opostas. Em um Bi-LSTM, o significado é codificado em dois vetores, um gerado pela execução do LSTM da esquerda para a direita e outro da direita para a esquerda. Isso permite dobrar o comprimento da frase sem perder muitas informações. + + + + +## Seq2seq com Atenção (Attention) + + + +O sucesso da abordagem acima teve vida curta. Outro artigo de [Bahdanau, Cho, Bengio](https://arxiv.org/abs/1409.0473) sugeriu que, em vez de ter uma rede gigantesca que comprime o significado de toda a frase em um vetor, faria mais sentido se em a cada passo, nós apenas focamos a atenção nos locais relevantes no idioma original com significado equivalente, ou seja, o mecanismo de atenção. + + + +
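
Um esboço bastante simplificado do esquema codificador-decodificador da Figura 5 (uma única camada de LSTM em vez das quatro do artigo; vocabulário e símbolo inicial `<sos>` são hipotéticos): o último estado do codificador inicializa o decodificador, que gera um símbolo por passo e o realimenta como próxima entrada.

```python
import torch
import torch.nn as nn

# O "significado" da frase de entrada fica comprimido no estado final (h, c)
# do codificador, que serve de estado inicial para o decodificador.
V, E, H = 100, 32, 64                      # vocabulário, embedding, estado oculto
emb_src, emb_tgt = nn.Embedding(V, E), nn.Embedding(V, E)
codificador = nn.LSTM(E, H, batch_first=True)
decodificador = nn.LSTM(E, H, batch_first=True)
projecao = nn.Linear(H, V)

frase = torch.randint(0, V, (1, 9))        # frase de entrada (índices de palavras)
_, estado = codificador(emb_src(frase))    # estado final = representação da frase

token = torch.tensor([[1]])                # índice do símbolo inicial <sos> (suposição)
saida = []
for _ in range(6):                         # gera 6 palavras, só para ilustrar
    o, estado = decodificador(emb_tgt(token), estado)
    token = projecao(o[:, -1]).argmax(dim=-1, keepdim=True)
    saida.append(token.item())
print(saida)
```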
+
+Figura 6. Seq2seq com Atenção +
+ + + +Em Atenção, para produzir a palavra atual em cada etapa de tempo, primeiro precisamos decidir em quais representações ocultas de palavras na frase de entrada nos concentrar. Essencialmente, uma rede aprenderá a pontuar quão bem cada entrada codificada corresponde à saída atual do decodificador. Essas pontuações são normalizadas por um softmax, então os coeficientes são usados para calcular uma soma ponderada dos estados ocultos no codificador em diferentes etapas de tempo. Ao ajustar os pesos, o sistema pode ajustar a área de entradas para focar. A mágica desse mecanismo é que a rede usada para calcular os coeficientes pode ser treinada por meio de retropropagação. Não há necessidade de construí-los manualmente! + + + +Os mecanismos de atenção transformaram completamente a tradução automática feita por redes neurais. Posteriormente, o Google publicou um artigo [*Attention Is All You Need*](https://arxiv.org/abs/1706.03762) e apresentou o transformer, em que cada camada e grupo de neurônios está implementando a atenção. + + + + +## [Redes com Memória](https://www.youtube.com/watch?v=ycbMGyCPzvE&t=4575s) + + + +Redes de memória derivam do trabalho no Facebook iniciado por [Antoine Bordes](https://arxiv.org/abs/1410.3916) em 2014 e [Sainbayar Sukhbaatar](https://arxiv.org/abs/1503.08895) em 2015. + + + +A ideia de uma rede com memória é que existem duas partes importantes em seu cérebro: uma é o **córtex**, que é onde você tem memória de longo prazo. Há um grupo separado de neurônios chamado **hipocampo**, que envia fios para quase todos os cantos do córtex. Acredita-se que o hipocampo seja usado para memória de curto prazo, lembrando coisas por um período de tempo relativamente curto. A teoria prevalente é que, quando você dorme, muitas informações são transferidas do hipocampo para o córtex para serem solidificadas na memória de longo prazo, já que o hipocampo tem capacidade limitada. + + + +Para uma rede com memória, há uma entrada para a rede, $ x $ (pense nisso como um endereço da memória), e compare este $ x $ com os vetores $k_1, k_2, k_3, \cdots$ ("chaves") por meio de um produto escalar. Coloque-os em um softmax, o que você obtém é uma matriz de números que somam um. E há um conjunto de outros vetores $v_1, v_2, v_3, \cdots$ ("valores"). Multiplique esses vetores pelos escalonadores de softmax e some esses vetores (observe a semelhança com o mecanismo de atenção) para obter o resultado. + + + +
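
Um esboço do cálculo de atenção descrito acima, com dimensões apenas ilustrativas: pontuamos cada estado oculto do codificador contra o estado atual do decodificador com um produto escalar, normalizamos com um softmax e formamos a soma ponderada (o vetor de contexto).

```python
import torch
import torch.nn.functional as F

# Atenção por produto escalar: os coeficientes somam 1 e dizem em quais
# posições da entrada o decodificador deve "prestar atenção".
T, d = 9, 64
estados_codificador = torch.randn(T, d)    # um vetor por palavra de entrada
estado_decodificador = torch.randn(d)      # estado atual do decodificador

pontuacoes = estados_codificador @ estado_decodificador   # forma (T,)
pesos = F.softmax(pontuacoes, dim=0)                      # coeficientes que somam 1
contexto = pesos @ estados_codificador                    # soma ponderada, forma (d,)
print(pesos.sum(), contexto.shape)
```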
+
+Figura 7. Redes com Memória +
+ + + +Se uma das chaves (*por exemplo* $ k_i $) corresponder exatamente a $ x $, então o coeficiente associado a esta chave será muito próximo de um. Portanto, a saída do sistema será essencialmente $ v_i $. + + + +Esta é a **memória associativa endereçável**. A memória associativa é que, se sua entrada corresponder a uma chave, você obtém *aquele* valor. E esta é apenas uma versão soft diferenciável dele, que permite retropropagar e alterar os vetores por meio de gradiente descendente. + + + +O que os autores fizeram foi contar uma história a um sistema, dando-lhe uma sequência de frases. As sentenças são codificadas em vetores, passando-as por uma rede neural que não foi pré-treinada. As frases são devolvidas à memória deste tipo. Quando você faz uma pergunta ao sistema, você codifica a pergunta e a coloca como a entrada de uma rede neural, a rede neural produz um $ x $ para a memória, e a memória retorna um valor. + + + +Este valor, junto com o estado anterior da rede, é usado para acessar novamente a memória. E você treina toda essa rede para produzir uma resposta à sua pergunta. Após um treinamento extensivo, esse modelo realmente aprende a armazenar histórias e responder a perguntas. + + + +$$ +\alpha_i = k_i^\top x \\ +c = \text{softmax}(\alpha) \\ +s = \sum_i c_i v_i +$$ + + + +Na rede de memória, há uma rede neural que recebe uma entrada e, em seguida, produz um endereço para a memória, retorna o valor para a rede, continua e, por fim, produz uma saída. É muito parecido com um computador, pois há uma CPU e uma memória externa para ler e escrever. + + + +
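
Um esboço mínimo da memória associativa endereçável descrita acima (as mesmas equações que aparecem mais adiante nesta seção), com chaves e valores aleatórios apenas para ilustrar:

```python
import torch
import torch.nn.functional as F

# Comparamos a consulta x com as chaves via produto escalar, aplicamos softmax
# e somamos os valores ponderados; se x é quase igual a uma das chaves,
# o coeficiente correspondente domina e a saída se aproxima do valor associado.
n, d = 5, 16
chaves  = torch.randn(n, d)                    # k_1 ... k_n
valores = torch.randn(n, d)                    # v_1 ... v_n
x = chaves[2] + 0.01 * torch.randn(d)          # consulta quase igual à chave de índice 2

alfa = chaves @ x                              # alpha_i = k_i^T x
c = F.softmax(alfa, dim=0)                     # coeficientes que somam 1
s = c @ valores                                # saída aproximadamente igual a valores[2]
print(c)
```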
+ +
+ + + +Figura 8. Comparação entre a rede com memória e um computador (Foto Khan Acadamy) +
+ + + +Existem pessoas que imaginam que você pode realmente construir **computadores diferenciáveis** a partir disso. Um exemplo é a [Máquina de Turing Neural](https://arxiv.org/abs/1410.5401) da DeepMind, que se tornou pública três dias depois que o artigo do Facebook foi publicado no arXiv. + + + +A ideia é comparar entradas para chaves, gerar coeficientes e produzir valores - que é basicamente o que é um transformador. Um transformador é basicamente uma rede neural em que cada grupo de neurônios é uma dessas redes. + diff --git a/docs/pt/week06/06-3.md b/docs/pt/week06/06-3.md new file mode 100644 index 000000000..03426aadb --- /dev/null +++ b/docs/pt/week06/06-3.md @@ -0,0 +1,734 @@ +--- +lang: pt +lang-ref: ch.06-3 +title: Propriedades dos Sinais Naturais +lecturer: Alfredo Canziani +authors: Zhengyuan Ding, Biao Huang, Lin Jiang, Nhung Le +date: 3 Mar 2020 +translator: Bernardo Lago +translation-date: 14 Nov 2021 +--- + + + + +## [Visão geral](https://www.youtube.com/watch?v=8cAffg2jaT0&t=21s) + + + +RNN é um tipo de arquitetura que podemos usar para lidar com sequências de dados. O que é uma sequência? Com a lição da CNN, aprendemos que um sinal pode ser 1D, 2D ou 3D, dependendo do domínio. O domínio é definido pelo que você está mapeando e para o que está mapeando. Manipular dados sequenciais é basicamente lidar com dados 1D, uma vez que o domínio é o eixo temporal. No entanto, você também pode usar RNN para lidar com dados 2D, onde você tem duas direções. + + + +### Rede Neural "Comum" * vs. * Redes Neurais Recorrentes + + + +A Figura 1 é um diagrama de rede neural comum (vanilla) com três camadas. "Vanilla" é um termo americano que significa simples, comum. O círculo cor-de-rosa é o vetor de entrada x, no centro está a camada oculta em verde e a camada azul final é a saída. Usando um exemplo da eletrônica digital à direita, isso é como uma lógica combinatória, onde a saída de corrente depende apenas da entrada de corrente. + + + +
+
+ Figura 1: Arquitetura "Vanilla" +
+ + + +Em contraste com uma rede neural comum, em redes neurais recorrentes (RNN) a saída atual depende não apenas da entrada atual, mas também do estado do sistema, mostrado na Figura 2. Isso é como uma lógica sequencial na eletrônica digital, onde a saída também depende de um "flip-flop" (uma unidade de memória básica em eletrônica digital). Portanto, a principal diferença aqui é que a saída de uma rede neural comum depende apenas da entrada atual, enquanto a de RNN depende também do estado do sistema. + + + +
+
+ Figura 2: Arquitetura RNN +
+ + + +
+
+ Figura 3: Arquitetura de uma Rede Neural básica +
+ + + +O diagrama de Yann adiciona essas formas entre os neurônios para representar o mapeamento entre um tensor e outro (um vetor para outro). Por exemplo, na Figura 3, o vetor de entrada x será mapeado por meio desse item adicional para as representações ocultas h. Este item é na verdade uma transformação afim, ou seja, rotação mais distorção. Em seguida, por meio de outra transformação, passamos da camada oculta para a saída final. Da mesma forma, no diagrama RNN, você pode ter os mesmos itens adicionais entre os neurônios. + + + +
+
+ Figura 4: Arquitetura RNN de Yann +
+ + + +### Quatro tipos de arquiteturas RNN e exemplos + + + +O primeiro caso é vetor para sequência. A entrada é uma bolha e então haverá evoluções do estado interno do sistema anotadas como essas bolhas verdes. Conforme o estado do sistema evolui, em cada etapa de tempo haverá uma saída específica. + + + +
+
+ Figura 5: Vec para Seq +
+ + + +Um exemplo desse tipo de arquitetura é ter a entrada como uma imagem, enquanto a saída será uma sequência de palavras representando as descrições em inglês da imagem de entrada. Para explicar usando a Figura 6, cada bolha azul aqui pode ser um índice em um dicionário de palavras em inglês. Por exemplo, se o resultado for a frase "Este é um ônibus escolar amarelo". Primeiro, você obtém o índice da palavra "Isto" e, em seguida, obtém o índice da palavra "é" e assim por diante. Alguns dos resultados desta rede são mostrados a seguir. Por exemplo, na primeira coluna a descrição da última imagem é "Uma manada de elefantes caminhando por um campo de grama seca.", Que é muito bem refinada. Então, na segunda coluna, a primeira imagem mostra "Dois cachorros brincando na grama.", Enquanto na verdade são três cachorros. Na última coluna estão os exemplos mais errados, como "Um ônibus escolar amarelo estacionado em um estacionamento". Em geral, esses resultados mostram que essa rede pode falhar drasticamente e funcionar bem às vezes. É o caso de um vetor de entrada, que é a representação de uma imagem, para uma sequência de símbolos, que são, por exemplo, caracteres ou palavras que constituem as frases em inglês. Este tipo de arquitetura é denominado rede autoregressiva. Uma rede autoregressiva é uma rede que fornece uma saída, dado que você alimenta como entrada a saída anterior. + + + +
+
+ Figura 6: vec2seq Exemplo: Imagem para Texto +
+ + + +O segundo tipo é a sequência para um vetor final. Essa rede continua alimentando uma sequência de símbolos e somente no final dá uma saída final. Uma aplicação disso pode ser usar a rede para interpretar Python. Por exemplo, a entrada são essas linhas do programa Python. + + + +
+
+ Figura 7: Seq para Vec +
+ + + +
+
+ Figura 8: Linhas de entrada de códigos Python +
+ + + +Então, a rede será capaz de produzir a solução correta deste programa. Outro programa mais complicado como este: +
+
+ Figura 9: Linhas de entrada de códigos Python em um caso mais completo +
+ + + +Então, a saída deve ser 12184. Esses dois exemplos mostram que você pode treinar uma rede neural para fazer esse tipo de operação. Precisamos apenas alimentar uma sequência de símbolos e fazer com que a saída final seja um valor específico. + + + +O terceiro é seqüência para vetor para seqüência, mostrado na Figura 10. Essa arquitetura costumava ser a forma padrão de realizar a tradução de idiomas. Você começa com uma sequência de símbolos mostrados aqui em rosa. Então, tudo se condensa neste h final, que representa um conceito. Por exemplo, podemos ter uma frase como entrada e comprimi-la temporariamente em um vetor, que representa o significado e a mensagem a ser enviada. Então, depois de obter esse significado em qualquer representação, a rede o desenrola de volta para uma linguagem diferente. Por exemplo, "Hoje estou muito feliz" em uma sequência de palavras em inglês pode ser traduzido para o italiano ou chinês. Em geral, a rede obtém algum tipo de codificação como entradas e as transforma em uma representação compactada. Finalmente, ele realiza a decodificação dada a mesma versão compactada. Recentemente, vimos redes como Transformers, que abordaremos na próxima lição, superar esse método em tarefas de tradução de idiomas. Este tipo de arquitetura era o estado da arte há cerca de dois anos (2018). + + + +
+
+ Figura 10: Seq para Vec para Seq +
+ + + +Se você fizer um PCA sobre o espaço latente, terá as palavras agrupadas por semântica como mostrado neste gráfico. + + + +
+
+ Figura 11: Palavras agrupadas por semântica após PCA +
+ + + +Se aumentarmos o zoom, veremos que no mesmo local estão todos os meses, como janeiro e novembro. +
+
+ Figura 12: Ampliação de grupos de palavras +
+ + + +Se você focar em uma região diferente, obterá frases como "alguns dias atrás" "nos próximos meses" etc. +
+
+ Figura 13: Grupos de palavras em outra região +
+ + + +A partir desses exemplos, vemos que diferentes locais terão alguns significados comuns específicos. + + + +A Figura 14 mostra como, com o treinamento, esse tipo de rede irá captar alguns recursos semânticos. Por exemplo, neste caso, você pode ver que há um vetor conectando homem a mulher e outro entre rei e rainha, o que significa que mulher menos homem será igual a rainha menos rei. Você obterá a mesma distância neste espaço de embeddings aplicado a casos como masculino-feminino. Outro exemplo será caminhar para caminhar e nadar para nadar. Você sempre pode aplicar esse tipo de transformação linear específica, indo de uma palavra para outra ou de um país para a capital. + + + +
+
+ Figura 14: recursos semânticos escolhidos durante o treinamento +
+ + + +O quarto e último caso é seqüência a seqüência. Nessa rede, conforme você começa a alimentar a entrada, a rede começa a gerar saídas. Um exemplo desse tipo de arquitetura é o T9. Se você se lembra de usar um telefone Nokia, receberá sugestões de texto enquanto digita. Outro exemplo é a fala com legendas. Um exemplo legal é este escritor RNN. Quando você começa a digitar "os anéis de Saturno brilharam enquanto", isso sugere o seguinte "dois homens se entreolharam". Esta rede foi treinada em alguns romances de ficção científica para que você simplesmente digite algo e deixe que ela faça sugestões para ajudá-lo a escrever um livro. Mais um exemplo é mostrado na Figura 16. Você insere o prompt superior e, em seguida, esta rede tentará completar o resto. + + + +
+
+ Figura 15: Seq a Seq +
+ + + +
+
+ Figura 16: Modelo de preenchimento automático de texto do modelo Seq para Seq +
+ + + +## [Retropropagação no tempo](https://www.youtube.com/watch?v=8cAffg2jaT0&t=855s) + + + +### Arquitetura do modelo + + + +Para treinar um RNN, a retropropagação através do tempo (BPTT) deve ser usada. A arquitetura do modelo do RNN é fornecida na figura abaixo. O design da esquerda usa a representação do loop, enquanto a figura da direita desdobra o loop em uma linha ao longo do tempo. + + + +
+
+ Figura 17: Retropropagação ao longo do tempo +
+ + + +As representações ocultas são indicadas como + + + +$$ +\begin{aligned} +\begin{cases} +h[t]&= g(W_{h}\begin{bmatrix} +x[t] \\ +h[t-1] +\end{bmatrix} ++b_h) \\ +h[0]&\dot=\ \boldsymbol{0},\ W_h\dot=\left[ W_{hx} W_{hh}\right] \\ +\hat{y}[t]&= g(W_yh[t]+b_y) +\end{cases} +\end{aligned} +$$ + + + +A primeira equação indica uma função não linear aplicada em uma rotação de uma versão da pilha de entrada onde a configuração anterior da camada oculta é anexada. No início, $ h [0] $ é definido como 0. Para simplificar a equação, $ W_h $ pode ser escrito como duas matrizes separadas, $ \ left [W_ {hx} \ W_ {hh} \ right] $, portanto, às vezes a transformação pode ser declarada como + + + +$$ +W_ {hx} \ cdot x [t] + W_ {hh} \ cdot h [t-1] +$$ + + + +que corresponde à representação da pilha da entrada. + + + +$ y [t] $ é calculado na rotação final e então podemos usar a regra da cadeia para retropropagar o erro para a etapa de tempo anterior. + + + +### "Loteamento" na Modelagem de Linguagem + + + +Ao lidar com uma sequência de símbolos, podemos agrupar o texto em diferentes tamanhos. Por exemplo, ao lidar com as sequências mostradas na figura a seguir, a ificação em lote pode ser aplicada primeiro, onde o domínio do tempo é preservado verticalmente. Nesse caso, o tamanho do lote é definido como 4. + + + +
+
+ Figura 18: "Loteamento" (Batch-Ification) +
+ + + +Se o período $T$ da retropropagação baseada no tempo (BPTT) for definido como 3, a primeira entrada $x[1:T]$ e a saída $y[1:T]$ para RNN é determinada como + + + +$$ +\begin{aligned} +x[1:T] &= \begin{bmatrix} +a & g & m & s \\ +b & h & n & t \\ +c & i & o & u \\ +\end{bmatrix} \\ +y[1:T] &= \begin{bmatrix} +b & h & n & t \\ +c & i & o & u \\ +d & j & p & v +\end{bmatrix} +\end{aligned} +$$ + + + +Ao realizar RNN no primeiro lote, em primeiro lugar, alimentamos $x[1] = [a\ g\ m\ s]$ em RNN e forçamos a saída a ser $y[1] = [b\ h\ n\ t]$. A representação oculta $h[1]$ será enviada para a próxima etapa de tempo para ajudar o RNN a prever $y[2]$ a partir de $x[2]$. Depois de enviar $h[T-1]$ para o conjunto final de $x[T]$ e $y[T]$, cortamos o processo de propagação de gradiente para $h[T]$ e $h[0]$ então que os gradientes não se propagam infinitamente (.detach () no Pytorch). Todo o processo é mostrado na figura abaixo. + + + +
+
+ Figura 19: "Loteamento" (Batch-Ification) +
+ + + +## Dissipação e Explosão de Gradiente + + + +### Problema + + + +
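
Um esboço em PyTorch da reorganização mostrada acima, com tamanho de lote 4 e $T = 3$, representando os símbolos $a, b, c, \dots$ pelos inteiros $0, 1, 2, \dots$:

```python
import torch

# A sequência é fatiada em 4 colunas (uma por elemento do lote) e depois em
# janelas de comprimento T; o alvo y é a entrada x deslocada de um passo.
sequencia = torch.arange(24)                   # a=0, b=1, c=2, ..., x=23
tamanho_lote, T = 4, 3
n = sequencia.numel() // tamanho_lote
dados = sequencia[:n * tamanho_lote].view(tamanho_lote, n).t()   # (n, tamanho_lote)

x = dados[0:T]          # [[a g m s], [b h n t], [c i o u]]
y = dados[1:T + 1]      # [[b h n t], [c i o u], [d j p v]]
print(x)
print(y)
# No treino, o estado h[T-1] é passado para a janela seguinte com .detach(),
# para que os gradientes não se propaguem indefinidamente.
```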
+
+ Figura 20: Problema de dissipação +
+ + + +A figura acima é uma arquitetura RNN típica. Para realizar a rotação pelas etapas anteriores no RNN, usamos matrizes, que podem ser consideradas como setas horizontais no modelo acima. Uma vez que as matrizes podem alterar o tamanho das saídas, se o determinante que selecionamos for maior que 1, o gradiente se inflará com o tempo e causará a explosão do gradiente. Relativamente falando, se o autovalor que selecionamos for pequeno em 0, o processo de propagação reduzirá os gradientes e levará ao desaparecimento do gradiente (problema da dissipação de gradiente). + + + +Em RNNs típicos, os gradientes serão propagados por todas as setas possíveis, o que fornece aos gradientes uma grande chance de desaparecer ou explodir. Por exemplo, o gradiente no tempo 1 é grande, o que é indicado pela cor brilhante. Quando ele passa por uma rotação, o gradiente encolhe muito e no tempo 3, ele morre. + + + +### Solução + + + +Um ideal para evitar que gradientes explodam ou desapareçam é pular conexões. Para cumprir isso, multiplique as redes podem ser usadas. + + + +
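
Além das soluções estruturais discutidas a seguir, na prática a explosão de gradientes também costuma ser contida com o recorte de gradientes (*gradient clipping*), citado como truque na parte anterior destas notas. Um esboço mínimo, com um RNN e uma perda fictícia apenas para gerar gradientes:

```python
import torch
import torch.nn as nn

# Depois do backward e antes do passo do otimizador, a norma total dos
# gradientes é limitada a um valor máximo, evitando atualizações explosivas.
rnn = nn.RNN(input_size=10, hidden_size=20, batch_first=True)
otimizador = torch.optim.SGD(rnn.parameters(), lr=1e-2)

x = torch.randn(5, 50, 10)              # sequências longas agravam o problema
saida, _ = rnn(x)
perda = saida.pow(2).mean()             # perda fictícia, só para gerar gradientes
perda.backward()
torch.nn.utils.clip_grad_norm_(rnn.parameters(), max_norm=1.0)
otimizador.step()
```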
+
+ Figura 21: Pular conexão +
+ + + +No caso acima, dividimos a rede original em 4 redes. Pegue a primeira rede, por exemplo. Ele obtém um valor da entrada no tempo 1 e envia a saída para o primeiro estado intermediário na camada oculta. O estado tem 3 outras redes onde $ \ circ $ s permite que os gradientes passem enquanto $ - $ s bloqueia a propagação. Essa técnica é chamada de rede recorrente com portas. + + + +O LSTM é um RNN fechado predominante e é apresentado em detalhes nas seções a seguir. + + + +## [Long Short-Term Memory](https://www.youtube.com/watch?v=8cAffg2jaT0&t=1838s) + + + +### Arquitetura do Modelo + + + +Abaixo estão as equações que expressam um LSTM. A porta de entrada é destacada por caixas amarelas, que será uma transformação afim. Essa transformação de entrada multiplicará $ c [t] $, que é nossa porta candidata. + + + +
+
+ Figura 22: Arquitetura LSTM +
+ + + +Não se esqueça de que o gate está multiplicando o valor anterior da memória da célula $ c [t-1] $. O valor total da célula $ c [t] $ é não se esqueça da porta mais a porta de entrada. A representação oculta final é a multiplicação elemento a elemento entre a porta de saída $ o [t] $ e a versão tangente hiperbólica da célula $ c [t] $, de forma que as coisas sejam limitadas. Finalmente, a porta candidata $ \ tilde {c} [t] $ é simplesmente uma rede recorrente. Portanto, temos $ o [t] $ para modular a saída, $ f [t] $ para modular a porta não se esqueça e $ i [t] $ para modular a porta de entrada. Todas essas interações entre memória e portas são interações multiplicativas. $ i [t] $, $ f [t] $ e $ o [t] $ são todos sigmóides, indo de zero a um. Portanto, ao multiplicar por zero, você tem uma porta fechada. Ao multiplicar por um, você tem um portão aberto. + + + +Como desligamos a saída? Digamos que temos uma representação interna roxa $ th $ e colocamos um zero na porta de saída. Então, a saída será zero multiplicado por alguma coisa, e obteremos um zero. Se colocarmos um na porta de saída, obteremos o mesmo valor da representação roxa. +
+
+ Figura 23: Arquitetura LSTM - Saída Ligada +
+ + + +
+
+ Figura 24: Arquitetura LSTM - Saída Desligada +
+ + + +Da mesma forma, podemos controlar a memória. Por exemplo, podemos redefini-la fazendo com que $f[t]$ e $i[t]$ sejam zero. Após a multiplicação e a soma, temos um zero na memória. Alternativamente, podemos manter a memória: mesmo zerando a representação interna $th$, basta manter um em $f[t]$. Assim, a soma recebe $c[t-1]$ e o valor continua sendo enviado adiante. Finalmente, podemos escrever na memória: colocamos um na porta de entrada (a multiplicação fica roxa) e, em seguida, definimos um zero na porta de não-esquecimento, para que ela realmente esqueça. + + + +
+
+ Figura 25: Visualização da célula de memória +
+ + + +
+
+ Figura 26: Arquitetura LSTM - Redefinir memória +
+ + + +
+
+ Figura 27: Arquitetura LSTM - Manter memória +
+ + + +
+
+ Figura 28: Arquitetura LSTM - Memória de Gravação +
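+
+Como verificação rápida das situações ilustradas nas Figuras 23 a 28, o trecho abaixo (hipotético, com as portas fixadas à mão em 0 ou 1) mostra o efeito de cada configuração sobre a memória $c[t]$ e a saída $h[t]$:
+
+```python
+import torch
+
+c_prev = torch.tensor([2.0])        # memória anterior c[t-1]
+c_til  = torch.tensor([0.5])        # candidata c̃[t]
+
+def passo(i, f, o):
+    c_t = f * c_prev + i * c_til    # atualização da memória
+    h_t = o * torch.tanh(c_t)       # saída
+    return c_t, h_t
+
+print(passo(i=1.0, f=1.0, o=0.0))   # saída desligada: h[t] = 0          (Figura 24)
+print(passo(i=0.0, f=0.0, o=1.0))   # memória redefinida: c[t] = 0       (Figura 26)
+print(passo(i=0.0, f=1.0, o=1.0))   # memória mantida: c[t] = c[t-1]     (Figura 27)
+print(passo(i=1.0, f=0.0, o=1.0))   # escrita na memória: c[t] = c̃[t]   (Figura 28)
+```
+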
+ + + +## Exemplos de Notebook + + + +### Classificação de sequências (sequences) + + + +O objetivo é classificar as sequências. Os elementos e os alvos são representados localmente (vetores de entrada com apenas um bit diferente de zero). A sequência **começa** com um `B`, **termina** com um `E` (o “símbolo de gatilho”) e, no restante, consiste em símbolos escolhidos aleatoriamente do conjunto `{a, b, c, d}`, exceto por dois elementos nas posições $t_1$ e $t_2$, que são `X` ou `Y`. Para o caso `DifficultyLevel.HARD`, o comprimento da sequência é escolhido aleatoriamente entre 100 e 110, $t_1$ é escolhido aleatoriamente entre 10 e 20, e $t_2$ é escolhido aleatoriamente entre 50 e 60. Existem 4 classes de sequência, `Q`, `R`, `S` e `U`, que dependem da ordem temporal de `X` e `Y`. As regras são: `X, X -> Q`; `X, Y -> R`; `Y, X -> S`; `Y, Y -> U`. + + + +1). Exploração do conjunto de dados + + + +O tipo de retorno do gerador de dados é uma tupla de comprimento 2. O primeiro item da tupla é o lote de sequências, com forma $(32, 9, 8)$. Esses são os dados que serão alimentados na rede. Existem oito símbolos diferentes em cada linha (`X`, `Y`, `a`, `b`, `c`, `d`, `B`, `E`). Cada linha é um vetor one-hot. Uma sequência de linhas representa uma sequência de símbolos. A primeira linha, totalmente nula, é o preenchimento (padding). Usamos preenchimento quando o comprimento da sequência é menor que o comprimento máximo no lote. O segundo item da tupla é o lote correspondente de rótulos de classe, com forma $(32, 4)$, uma vez que temos 4 classes (`Q`, `R`, `S` e `U`). A primeira sequência é: `BbXcXcbE`. Então, seu rótulo de classe decodificado é $[1, 0, 0, 0]$, correspondendo a `Q`. + + + +
+
+ Figura 29: Exemplo de vetor de entrada +
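+
+Apenas para ilustrar a inspeção descrita acima, o esboço abaixo monta um lote fictício com as mesmas formas, $(32, 9, 8)$ e $(32, 4)$, e decodifica a primeira sequência (a ordem dos símbolos e os nomes são suposições ilustrativas; no notebook, o lote viria do gerador de dados):
+
+```python
+import torch
+
+simbolos = ['a', 'b', 'c', 'd', 'X', 'Y', 'B', 'E']   # ordem meramente ilustrativa
+classes  = ['Q', 'R', 'S', 'U']
+
+x = torch.zeros(32, 9, 8)                             # lote de sequências one-hot
+y = torch.zeros(32, 4)                                # lote de rótulos one-hot
+for i, s in enumerate('BbXcXcbE', start=1):           # a primeira linha fica nula (preenchimento)
+    x[0, i, simbolos.index(s)] = 1
+y[0, 0] = 1                                           # classe `Q`
+
+texto = ''.join(
+    simbolos[int(l.argmax())] if l.sum() > 0 else '_' # '_' marca o preenchimento
+    for l in x[0]
+)
+print(texto)                          # '_BbXcXcbE'
+print(classes[int(y[0].argmax())])    # 'Q'
+```
+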
+ + + +2). Definição do modelo e treinamento + + + +Vamos criar uma rede recorrente simples, um LSTM, e treinar por 10 épocas. No laço de treinamento, devemos sempre seguir cinco etapas (um esboço de código é mostrado logo após a Figura 30): + + + +
+* Execute o passe para frente (forward) do modelo
+* Calcule a perda
+* Zere o cache de gradientes
+* Retropropague (backpropagate) para calcular a derivada parcial da perda em relação aos parâmetros
+* Dê um passo na direção oposta à do gradiente
+ + + +
+
+ Figura 30: RNN Simples *vs.* LSTM - 10 épocas +
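+
+As cinco etapas listadas acima correspondem, aproximadamente, ao laço de treinamento abaixo (esboço com um modelo e dados fictícios, apenas para mostrar a estrutura do laço; no notebook, `model` seria o RNN ou o LSTM aplicado às sequências):
+
+```python
+import torch
+import torch.nn as nn
+
+# Configuração mínima e fictícia, somente para exercitar as cinco etapas.
+model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
+criterion = nn.CrossEntropyLoss()
+optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
+dados = [(torch.randn(32, 8), torch.randint(0, 4, (32,))) for _ in range(5)]
+
+for epoca in range(10):
+    for x, y in dados:
+        out = model(x)               # 1) passe para frente (forward) do modelo
+        loss = criterion(out, y)     # 2) cálculo da perda
+        optimizer.zero_grad()        # 3) zerar o cache de gradientes
+        loss.backward()              # 4) retropropagar: derivadas parciais da perda em relação aos parâmetros
+        optimizer.step()             # 5) passo na direção oposta à do gradiente
+```
+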
+ + + +Com um nível de dificuldade fácil, RNN obtém 50% de precisão enquanto LSTM obtém 100% após 10 épocas. Mas LSTM tem quatro vezes mais pesos do que RNN e tem duas camadas ocultas, portanto, não é uma comparação justa. Após 100 épocas, o RNN também obtém 100% de precisão, levando mais tempo para treinar do que o LSTM. + + + +
+
+Figure 31: RNN Simples *vs.* LSTM - 100 Épocas +
+ + + +Se aumentarmos a dificuldade da parte de treinamento (usando sequências mais longas), veremos o RNN falhar enquanto o LSTM continua a funcionar. + + + +
+
+Figure 32: Visualização do valor do estado oculto +
+ + + +A visualização acima está desenhando o valor do estado oculto ao longo do tempo no LSTM. Enviaremos as entradas por meio de uma tangente hiperbólica, de forma que se a entrada estiver abaixo de $-2.5$, ela será mapeada para $-1$, e se estiver acima de $2,5$, será mapeada para $1$. Portanto, neste caso, podemos ver a camada oculta específica escolhida em `X` (quinta linha na imagem) e então ela se tornou vermelha até que obtivemos o outro` X`. Assim, a quinta unidade oculta da célula é acionada observando o `X` e fica quieta após ver o outro` X`. Isso nos permite reconhecer a classe de sequência. + + + +### Eco de sinal + + + +Ecoar o sinal n etapas é um exemplo de tarefa muitos-para-muitos sincronizada. Por exemplo, a 1ª sequência de entrada é `"1 1 0 0 1 0 1 1 0 0 0 0 0 0 0 0 1 1 1 1 ..."`, e a 1ª sequência de destino é `"0 0 0 1 1 0 0 1 0 1 1 0 0 0 0 0 0 0 0 1 ..."`. Nesse caso, a saída ocorre três etapas depois. Portanto, precisamos de uma memória de trabalho de curta duração para manter as informações. Já no modelo de linguagem, diz algo que ainda não foi dito. + + + +Antes de enviarmos toda a sequência para a rede e forçarmos o destino final a ser algo, precisamos cortar a sequência longa em pequenos pedaços. Ao alimentar um novo pedaço, precisamos acompanhar o estado oculto e enviá-lo como entrada para o estado interno ao adicionar o próximo novo pedaço. No LSTM, você pode manter a memória por muito tempo, desde que tenha capacidade suficiente. No RNN, depois de atingir um determinado comprimento, começa a esquecer o que aconteceu no passado. + diff --git a/docs/pt/week06/06.md b/docs/pt/week06/06.md new file mode 100644 index 000000000..6834deefc --- /dev/null +++ b/docs/pt/week06/06.md @@ -0,0 +1,36 @@ +--- +lang: pt +lang-ref: ch.06 +title: Semana 6 +translator: Bernardo Lago +--- + + + +## Aula parte A + +Discutimos três aplicações de redes neurais convolucionais. Começamos com o reconhecimento de dígitos e a aplicação para um reconhecimento de código postal (CEP) de 5 dígitos. Na detecção de objetos, falamos sobre como usar a arquitetura multi-escala em uma configuração de detecção de faces. Por último, vimos como ConvNets são usados em tarefas de segmentação semântica com exemplos concretos em um sistema de visão robótica e segmentação de objetos em um ambiente urbano. + + + +## Aula parte B + +Examinamos redes neurais recorrentes, seus problemas e técnicas comuns para mitigar esses problemas. Em seguida, revisamos uma variedade de módulos desenvolvidos para resolver os problemas do modelo RNN, incluindo Atenção (Attention), GRUs (Gated Recurrent Unit), LSTMs (Long Short-Term Memory) e Seq2Seq. + + + + +## Prática + +Discutimos a arquitetura dos modelos de RNN básica (vanilla) e LSTM e comparamos o desempenho entre os dois. O LSTM herda as vantagens do RNN, ao mesmo tempo em que melhora os pontos fracos do RNN ao incluir uma 'célula de memória' para armazenar informações na memória por longos períodos de tempo. Os modelos LSTM superam significativamente os modelos RNN. 
\ No newline at end of file diff --git a/docs/pt/week06/lecture06.sbv b/docs/pt/week06/lecture06.sbv new file mode 100644 index 000000000..1eab7f0b2 --- /dev/null +++ b/docs/pt/week06/lecture06.sbv @@ -0,0 +1,3338 @@ +0:00:04.960,0:00:08.970 +So I want to do two things, talk about + +0:00:11.019,0:00:14.909 +Talk a little bit about like some ways to use Convolutional Nets in various ways + +0:00:16.119,0:00:18.539 +Which I haven't gone through last time + +0:00:19.630,0:00:21.630 +and + +0:00:22.689,0:00:24.689 +And I'll also + +0:00:26.619,0:00:29.518 +Talk about different types of architectures that + +0:00:30.820,0:00:33.389 +Some of which are very recently designed + +0:00:34.059,0:00:35.710 +that people have been + +0:00:35.710,0:00:40.320 +Kind of playing with for quite a while. So let's see + +0:00:43.660,0:00:47.489 +So last time when we talked about Convolutional Nets we stopped that the + +0:00:47.890,0:00:54.000 +idea that we can use Convolutional Nets with kind of a sliding we do over large images and it consists in just + +0:00:54.550,0:00:56.550 +applying the convolution on large images + +0:00:57.070,0:01:01.559 +which is a very general image, a very general method, so we're gonna + +0:01:03.610,0:01:06.900 +See a few more things on how to use convolutional Nets and + +0:01:07.659,0:01:08.580 +to some extent + +0:01:08.580,0:01:09.520 +I'm going to + +0:01:09.520,0:01:16.020 +Rely on a bit of sort of historical papers and things like this to explain kind of simple forms of all of those ideas + +0:01:17.409,0:01:21.269 +so as I said last time + +0:01:21.850,0:01:27.720 +I had this example where there's multiple characters on an image and you can, you have a convolutional net that + +0:01:28.360,0:01:32.819 +whose output is also a convolution like everyday air is a convolution so you can interpret the output as + +0:01:33.250,0:01:40.739 +basically giving you a score for every category and for every window on the input and the the framing of the window depends on + +0:01:41.860,0:01:47.879 +Like the the windows that the system observes when your back project for my particular output + +0:01:49.000,0:01:54.479 +Kind of steps by the amount of subsampling the total amount of sub something you have in a network + +0:01:54.640,0:01:59.849 +So if you have two layers that subsample by a factor of two, you have two pooling layers, for example + +0:01:59.850,0:02:02.219 +That's a factor of two the overall + +0:02:02.920,0:02:07.199 +subsampling ratio is 4 and what that means is that every output is + +0:02:07.509,0:02:14.288 +Gonna basically look at a window on the input and successive outputs is going to look at the windows that are separated by four pixels + +0:02:14.630,0:02:17.350 +Okay, it's just a product of all the subsampling layers + +0:02:20.480,0:02:21.500 +So + +0:02:21.500,0:02:24.610 +this this is nice, but then you're gonna have to make sense of + +0:02:25.220,0:02:30.190 +All the stuff that's on the input. How do you pick out objects objects that + +0:02:31.310,0:02:33.020 +overlap each other + +0:02:33.020,0:02:38.949 +Etc. 
And one thing you can do for this is called "Non maximum suppression" + +0:02:41.180,0:02:43.480 +Which is what people use in sort of object detection + +0:02:44.750,0:02:47.350 +so basically what that consists in is that if you have + +0:02:49.160,0:02:53.139 +Outputs that kind of are more or less at the same place and + +0:02:53.989,0:02:58.749 +or also like overlapping places and one of them tells you I see a + +0:02:58.910,0:03:02.199 +Bear and the other one tells you I see a horse one of them wins + +0:03:02.780,0:03:07.330 +Okay, it's probably one that's wrong. And you can't have a bear on a horse at the same time at the same place + +0:03:07.330,0:03:10.119 +So you do what's called? No, maximum suppression you can + +0:03:10.700,0:03:11.959 +Look at which + +0:03:11.959,0:03:15.429 +which of those has the highest score and you kind of pick that one or you see if + +0:03:15.500,0:03:19.660 +any neighbors also recognize that as a bear or a horse and you kind of make a + +0:03:20.360,0:03:24.999 +vote if you want, a local vote, okay, and I'm gonna go to the details of this because + +0:03:25.760,0:03:28.719 +Just just kind of rough ideas. Well, this is + +0:03:29.930,0:03:34.269 +already implemented in code that you can download and also it's kind of the topic of a + +0:03:35.030,0:03:37.509 +full-fledged computer vision course + +0:03:38.239,0:03:42.939 +So here we just allude to kind of how we use deep learning for this kind of application + +0:03:46.970,0:03:48.970 +Let's see, so here's + +0:03:50.480,0:03:55.750 +Again going back to history a little bit some ideas of how you use + +0:03:57.049,0:03:59.739 +neural nets to or convolutional nets in this case to + +0:04:00.500,0:04:04.690 +Recognize strings of characters which is kind of the same program as recognizing multiple objects, really + +0:04:05.450,0:04:12.130 +So if you have, you have an image that contains the image at the top... "two, three two, zero, six" + +0:04:12.130,0:04:15.639 +It's a zip code and the characters touch so you don't know how to separate them in advance + +0:04:15.979,0:04:22.629 +So you just apply a convolutional net to the entire string but you don't know in advance what width the characters will take and so + +0:04:24.500,0:04:30.739 +what you see here are four different sets of outputs and those four different sets of outputs of + +0:04:31.170,0:04:33.170 +the convolutional net + +0:04:33.300,0:04:36.830 +Each of which has ten rows and the ten words corresponds to each of the ten categories + +0:04:38.220,0:04:43.489 +so if you look at the top for example the top, the top block + +0:04:44.220,0:04:46.940 +the white squares represent high-scoring categories + +0:04:46.940,0:04:53.450 +So what you see on the left is that the number two is being recognized. 
So the window that is looked at by the + +0:04:54.120,0:04:59.690 +Output units that are on the first column is on the, on the left side of the image and it, and it detects a two + +0:05:00.330,0:05:03.499 +Because the you know their order 0 1 2 3 4 etc + +0:05:03.810,0:05:07.160 +So you see a white square that corresponds to the detection of a 2 + +0:05:07.770,0:05:09.920 +and then as the window is + +0:05:11.400,0:05:13.400 +shifted over the, over the input + +0:05:14.310,0:05:19.549 +Is a 3 or low scoring 3 that is seen then the 2 again there's three character + +0:05:19.550,0:05:24.980 +It's three detectors that see this 2 and then nothing then the 0 and then the 6 + +0:05:26.670,0:05:28.670 +Now this first + +0:05:29.580,0:05:32.419 +System looks at a fairly narrow window and + +0:05:35.940,0:05:40.190 +Or maybe it's a wide window no, I think it's a wide window so it looks at a pretty wide window and + +0:05:41.040,0:05:42.450 +it + +0:05:42.450,0:05:44.450 +when it looks at the, the + +0:05:45.240,0:05:50.030 +The two, the two that's on the left for example, it actually sees a piece of the three with it, with it + +0:05:50.030,0:05:55.459 +So it's kind of in the window the different sets of outputs here correspond to different size + +0:05:55.830,0:06:01.009 +Of the kernel of the last layer. So the second row the second block + +0:06:01.890,0:06:05.689 +The the size of the kernel is four in the horizontal dimension + +0:06:07.590,0:06:11.869 +The next one is 3 and the next one is 2. what this allows the system to do is look at + +0:06:13.380,0:06:19.010 +Regions of various width on the input without being kind of too confused by the characters that are on the side if you want + +0:06:19.500,0:06:20.630 +so for example + +0:06:20.630,0:06:28.189 +the, the, the second to the zero is very high-scoring on the, on the, the + +0:06:29.370,0:06:36.109 +Second third and fourth map but not very high-scoring on the top map. Similarly, the three is kind of high-scoring on the + +0:06:37.020,0:06:38.400 +second third and fourth map + +0:06:38.400,0:06:41.850 +but not on the first map because the three kind of overlaps with the two and so + +0:06:42.009,0:06:45.059 +It wants to really look at in our window to be able to recognize it + +0:06:45.639,0:06:47.639 +Okay. Yes + +0:06:51.400,0:06:55.380 +So it's the size of the white square that indicates the score basically, okay + +0:06:57.759,0:07:02.038 +So look at you know, this this column here you have a high-scoring zero + +0:07:03.009,0:07:06.179 +Here because it's the first the first row correspond to the category zero + +0:07:06.430,0:07:10.079 +but it's not so high-scoring from the top, the top one because that + +0:07:10.539,0:07:15.419 +output unit looks at a pretty wide input and it gets confused by the stuff that's on the side + +0:07:16.479,0:07:17.910 +Okay, so you have something like this + +0:07:17.910,0:07:23.579 +so now you have to make sense out of it and extract the best interpretation of that, of that sequence and + +0:07:24.760,0:07:31.349 +It's true for zip code, but it's true for just about every piece of text. Not every combination of characters is possible + +0:07:31.599,0:07:36.149 +so when you read English text there is, you know, an English dictionary English grammar and + +0:07:36.699,0:07:40.919 +Not every combination of character is possible so you can have a language model that + +0:07:41.470,0:07:42.610 +attempts to + +0:07:42.610,0:07:48.720 +Tell you what is the most likely sequence of characters. 
So we're looking at here given that this is English or whatever language + +0:07:49.510,0:07:54.929 +Or given that this is a zip code not every zip code are possible. So this --- possibility for error correction + +0:07:56.949,0:08:00.719 +So how do we take that into account? I'll come to this in a second but + +0:08:03.460,0:08:06.930 +But here what we need to do is kind of you know + +0:08:08.169,0:08:10.169 +Come up with a consistent interpretation + +0:08:10.389,0:08:15.809 +That you know, there's obviously a three there's obviously a two, a three,a zero somewhere + +0:08:16.630,0:08:19.439 +Another two etc. How to return this + +0:08:20.110,0:08:22.710 +array of scores into, into a consistent + +0:08:23.470,0:08:25.470 +interpretation + +0:08:28.610,0:08:31.759 +Is the width of the, the horizontal width of the, + +0:08:33.180,0:08:35.180 +the kernel of the last layer + +0:08:35.400,0:08:36.750 +Okay + +0:08:36.750,0:08:44.090 +Which means when you backprop---, back project on the input the, the viewing window on the input that influences that particular unit + +0:08:44.550,0:08:48.409 +has various size depending on which unit you look at. Yes + +0:08:52.500,0:08:54.500 +The width of the block yeah + +0:08:56.640,0:08:58.070 +It's a, it corresponds + +0:08:58.070,0:08:58.890 +it's how wide the + +0:08:58.890,0:09:05.090 +Input image is divided by 4 because the substantive issue is 4 so you get one of one column of those for every four pixel + +0:09:05.340,0:09:11.660 +so remember we had this, this way of using a neural net, convolutional net which is that you, you basically make every + +0:09:12.240,0:09:17.270 +Convolution larger and you view the last layer as a convolution as well. And now what you get is multiple + +0:09:17.790,0:09:23.119 +Outputs. Okay. So what I'm representing here on the slide you just saw + +0:09:23.760,0:09:30.470 +is the, is this 2d array on the output which corresponds where, where the, the row corresponds to categories + +0:09:31.320,0:09:35.030 +Okay, and each column corresponds to a different location on the input + +0:09:39.180,0:09:41.750 +And I showed you those examples here so + +0:09:42.300,0:09:50.029 +Here, this is a different representation here where the, the character that is displayed just before the title bar is you know + +0:09:50.030,0:09:56.119 +Indicates the winning category, so I'm not displaying the scores of every category. I'm just, just, just displaying the winning category here + +0:09:57.180,0:09:58.260 +but each + +0:09:58.260,0:10:04.640 +Output looks at a 32 by 32 window and the next output by looks at a 32 by 32 window shifted by 4 pixels + +0:10:04.650,0:10:06.650 +Ok, etc. 
+ +0:10:08.340,0:10:14.809 +So how do you turn this you know sequence of characters into the fact that it is either 3 5 or 5 3 + +0:10:29.880,0:10:33.979 +Ok, so here the reason why we have four of those is so that is because the last player + +0:10:34.800,0:10:36.270 +this different + +0:10:36.270,0:10:42.889 +Is different last layers, if you want this four different last layers each of which is trained to recognize the ten categories + +0:10:43.710,0:10:50.839 +And those last layers have different kernel width so they essentially look at different width of Windows on the input + +0:10:53.670,0:10:59.510 +So you want some that look at wide windows so they can they can recognize kind of large characters and some that look at, look + +0:10:59.510,0:11:02.119 +At narrow windows so they can recognize narrow characters without being + +0:11:03.210,0:11:05.210 +perturbed by the the neighboring characters + +0:11:09.150,0:11:14.329 +So if you know a priori that there are five five characters here because it's a zip code + +0:11:16.529,0:11:18.529 +You can do you can use a trick and + +0:11:20.010,0:11:22.010 +There is sort of few specific tricks that + +0:11:23.130,0:11:27.140 +I can explain but I'm going to explain sort of the general trick if you want. I + +0:11:27.959,0:11:30.619 +Didn't want to talk about this actually at least not now + +0:11:31.709,0:11:37.729 +Okay here so here's a general trick the general trick is or the you know, kind of a somewhat specific trick + +0:11:38.370,0:11:40.609 +Oops, I don't know way it keeps changing slide + +0:11:43.890,0:11:50.809 +You say I have I know I have five characters in this word, is there a + +0:11:57.990,0:12:01.760 +So that's one of those arrays that produces scores so for each category + +0:12:03.060,0:12:07.279 +Let's say I have four categories here and each location + +0:12:11.339,0:12:18.049 +There's a score, okay and let's say I know that I want five characters out + +0:12:20.250,0:12:27.469 +I'm gonna draw them vertically one two, three, four five because it's a zip code + +0:12:29.579,0:12:34.279 +So the question I'm going to ask now is what is the best character I can put in this and + +0:12:35.220,0:12:37.220 +In this slot in the first slot + +0:12:38.699,0:12:43.188 +And the way I'm going to do this is I'm gonna draw an array + +0:12:48.569,0:12:50.569 +And on this array + +0:12:54.120,0:13:01.429 +I'm going to say what's the score here for, at every intersection in the array? + +0:13:07.860,0:13:11.659 +It's gonna be, what is the, what is the score of putting + +0:13:12.269,0:13:17.899 +A particular character here at that location given the score that I have at the output of my neural net + +0:13:19.560,0:13:21.560 +Okay, so let's say that + +0:13:24.480,0:13:28.159 +So what I'm gonna have to decide is since I have fewer characters + +0:13:29.550,0:13:32.539 +On the on the output to the system five + +0:13:33.329,0:13:39.919 +Then I have viewing windows and scores produced by the by the system. 
I'm gonna have to figure out which one I drop + +0:13:40.949,0:13:42.949 +okay, and + +0:13:43.860,0:13:47.689 +What I can do is build this, build this array + +0:13:55.530,0:13:57.530 +And + +0:14:01.220,0:14:09.010 +What I need to do is go from here to here by finding a path through this through this array + +0:14:15.740,0:14:17.859 +In such a way that I have exactly five + +0:14:20.420,0:14:24.640 +Steps if you want, so each step corresponds to to a character and + +0:14:25.790,0:14:31.630 +the overall score of a particular string is the overall is the sum of all the scores that + +0:14:33.050,0:14:37.060 +Are along this path in other words if I get + +0:14:39.560,0:14:41.560 +Three + +0:14:41.930,0:14:47.890 +Instances here, three locations where I have a high score for this particular category, which is category one. Okay let's call it 0 + +0:14:48.440,0:14:50.440 +So 1 2 3 + +0:14:51.140,0:14:54.129 +I'm gonna say this is the same guy and it's a 1 + +0:14:55.460,0:14:57.460 +and here if I have + +0:14:58.160,0:15:03.160 +Two guys. I have high score for 3, I'm gonna say those are the 3 and here + +0:15:03.160,0:15:08.800 +I have only one guy that has high score for 2. So that's a 2 etc + +0:15:11.930,0:15:13.370 +So + +0:15:13.370,0:15:15.880 +This path here has to be sort of continuous + +0:15:16.580,0:15:23.080 +I can't jump from one position to another because that would be kind of breaking the order of the characters. Okay? + +0:15:24.650,0:15:31.809 +And I need to find a path that goes through high-scoring cells if you want that correspond to + +0:15:33.500,0:15:36.489 +High scoring categories along this path and it's a way of + +0:15:37.190,0:15:39.190 +saying you know if I have + +0:15:39.950,0:15:43.150 +if those three cells here or + +0:15:44.000,0:15:47.530 +Give me the same character. It's only one character. I'm just going to output + +0:15:48.440,0:15:50.799 +One here that corresponds to this + +0:15:51.380,0:15:57.189 +Ok, those three guys have high score. I stay on the one, on the one and then I transition + +0:15:57.770,0:16:02.379 +To the second character. So now I'm going to fill out this slot and this guy has high score for three + +0:16:02.750,0:16:06.880 +So I'm going to put three here and this guy has a high score for two + +0:16:07.400,0:16:08.930 +as two + +0:16:08.930,0:16:10.930 +Etc + +0:16:14.370,0:16:19.669 +The principle to find this this path is a shortest path algorithm + +0:16:19.670,0:16:25.190 +You can think of this as a graph where I can go from the lower left cell to the upper right cell + +0:16:25.560,0:16:27.560 +By either going to the left + +0:16:28.410,0:16:32.269 +or going up and to the left and + +0:16:35.220,0:16:38.660 +For each of those transitions there is a there's a cost and for each of the + +0:16:39.060,0:16:45.169 +For putting a character at that location, there is also a cost or a score if you want + +0:16:47.460,0:16:49.460 +So the overall + +0:16:50.700,0:16:57.049 +Score of the one at the bottom would be the combined score of the three locations that detect that one and + +0:16:59.130,0:17:01.340 +Because it's more all three of them are + +0:17:02.730,0:17:04.730 +contributing evidence to the fact that there is a 1 + +0:17:06.720,0:17:08.959 +When you constrain the path to have 5 steps + +0:17:10.530,0:17:14.930 +Ok, it has to go from the bottom left to the top right and + +0:17:15.930,0:17:18.169 +It has 5 steps, so it has to go through 5 steps + +0:17:18.750,0:17:24.290 +There's no choice. 
That's that's how you force the system to kind of give you 5 characters basically, right? + +0:17:24.810,0:17:28.909 +And because the path can only go from left to right and from top to bottom + +0:17:30.330,0:17:33.680 +It has to give you the characters in the order in which they appear in the image + +0:17:34.350,0:17:41.240 +So it's a way of imposing the order of the character and imposing that there are fives, there are five characters in the string. Yes + +0:17:42.840,0:17:48.170 +Yes, okay in the back, yes, right. Yes + +0:17:52.050,0:17:55.129 +Well, so if we have just the string of one you have to have + +0:17:55.680,0:18:02.539 +Trained the system in advance so that when it's in between two ones or two characters, whatever they are, it says nothing + +0:18:02.540,0:18:04.540 +it says none of the above + +0:18:04.740,0:18:06.740 +Otherwise you can tell, right + +0:18:07.140,0:18:11.359 +Yeah, a system like this needs to be able to tell you this is none of the above. It's not a character + +0:18:11.360,0:18:16.160 +It's a piece of it or I'm in the middle of two characters or I have two characters on the side + +0:18:16.160,0:18:17.550 +But nothing in the middle + +0:18:17.550,0:18:19.550 +Yeah, absolutely + +0:18:24.300,0:18:26.300 +It's a form of non maximum suppression + +0:18:26.300,0:18:31.099 +so you can think of this as kind of a smart form of non maximum suppression where you say like for every location you can only + +0:18:31.100,0:18:31.950 +have one + +0:18:31.950,0:18:33.950 +character + +0:18:33.990,0:18:40.370 +And the order in which you produce the five characters must correspond to the order in which they appear on the image + +0:18:41.640,0:18:47.420 +What you don't know is how to warp one into the other. Okay. So how to kind of you know, how many + +0:18:48.210,0:18:53.780 +detectors are gonna see the number two. It may be three of them and we're gonna decide they're all the same + +0:19:00.059,0:19:02.748 +So the thing is for all of you who + +0:19:03.629,0:19:06.469 +are on computer science, which is not everyone + +0:19:07.590,0:19:12.379 +The the way you compute this path is just a shortest path algorithm. You do this with dynamic programming + +0:19:13.499,0:19:15.090 +Okay + +0:19:15.090,0:19:21.350 +so find the shortest path to go from bottom left to top right by going through by only going to + +0:19:22.080,0:19:25.610 +only taking transition to the right or diagonally and + +0:19:26.369,0:19:28.369 +by minimizing the + +0:19:28.830,0:19:31.069 +cost so if you think each of those + +0:19:31.710,0:19:38.659 +Is is filled by a cost or maximizing the score if you think that scores there are probabilities, for example + +0:19:38.789,0:19:41.479 +And it's just a shortest path algorithm in a graph + +0:19:54.840,0:19:56.840 +This kind of method by the way was + +0:19:57.090,0:20:04.730 +So many early methods of speech recognition kind of work this way, not with neural nets though. We sort of hand extracted features from + +0:20:05.909,0:20:13.189 +but it would basically match the sequence of vectors extracted from a speech signal to a template of a word and then you + +0:20:13.409,0:20:17.809 +know try to see how you warp the time to match the the + +0:20:19.259,0:20:24.559 +The word to be recognized to to the templates and you had a template for every word over fixed size + +0:20:25.679,0:20:32.569 +This was called DTW, dynamic time working. 
There's more sophisticated version of it called hidden markov models, but it's very similar + +0:20:33.600,0:20:35.600 +People still do this to some extent + +0:20:43.000,0:20:44.940 +Okay + +0:20:44.940,0:20:49.880 +So detection, so if you want to apply commercial net for detection + +0:20:50.820,0:20:55.380 +it works amazingly well, and it's surprisingly simple, but you + +0:20:56.020,0:20:57.210 +You know what you need to do + +0:20:57.210,0:20:59.210 +You basically need to let's say you wanna do face detection + +0:20:59.440,0:21:05.130 +Which is a very easy problem one of the first problems that computer vision started solving really well for kind of recognition + +0:21:05.500,0:21:07.500 +you collect a data set of + +0:21:08.260,0:21:11.249 +images with faces and images without faces and + +0:21:12.160,0:21:13.900 +you train a + +0:21:13.900,0:21:19.379 +convolutional net with input window in something like 20 by 20 or 30 by 30 pixels? + +0:21:19.870,0:21:21.959 +To tell you whether there is a face in it or not + +0:21:22.570,0:21:28.620 +Okay. Now you take this convolutional net, you apply it on an image and if there is a face that happens to be roughly + +0:21:29.230,0:21:31.230 +30 by 30 pixels the + +0:21:31.809,0:21:35.699 +the content will will light up at the corresponding output and + +0:21:36.460,0:21:38.460 +Not light up when there is no face + +0:21:39.130,0:21:41.999 +now there is two problems with this, the first problem is + +0:21:42.940,0:21:47.370 +there is many many ways a patch of an image can be a non face and + +0:21:48.130,0:21:53.489 +During your training, you probably haven't seen all of them. You haven't seen even a representative set of them + +0:21:53.950,0:21:56.250 +So your system is gonna have lots of false positives + +0:21:58.390,0:22:04.709 +That's the first problem. Second problem is in the picture not all faces are 30 by 30 pixels. So how do you handle + +0:22:05.380,0:22:10.229 +Size variation so one way to handle size variation, which is very simple + +0:22:10.230,0:22:14.010 +but it's mostly unnecessary in modern versions, well + +0:22:14.860,0:22:16.860 + at least it's not completely necessary + +0:22:16.929,0:22:22.499 +Is you do a multiscale approach. So you take your image you run your detector on it. It fires whenever it wants + +0:22:23.440,0:22:27.299 +And you will detect faces are small then you reduce the image by + +0:22:27.850,0:22:30.179 +Some scale in this case, in this case here + +0:22:30.179,0:22:31.419 +I take a square root of two + +0:22:31.419,0:22:36.599 +You apply the convolutional net again on that smaller image and now it's going to be able to detect faces that are + +0:22:38.350,0:22:45.750 +That were larger in the original image because now what was 30 by 30 pixel is now about 20 by 20 pixels, roughly + +0:22:47.169,0:22:48.850 +Okay + +0:22:48.850,0:22:53.309 +But there may be bigger faces there. So you scale the image again by a factor of square root of 2 + +0:22:53.309,0:22:57.769 +So now the images the size of the original one and you run the convolutional net again + +0:22:57.770,0:23:01.070 +And now it's going to detect faces that were 60 by 60 pixels + +0:23:02.190,0:23:06.109 +In the original image, but are now 30 by 30 because you reduce the size by half + +0:23:07.800,0:23:10.369 +You might think that this is expensive but it's not. 
Tthe + +0:23:11.220,0:23:15.439 +expense is, half of the expense is the final scale + +0:23:16.080,0:23:18.379 +the sum of the expense of the other networks are + +0:23:19.590,0:23:21.859 +Combined is about the same as the final scale + +0:23:26.070,0:23:29.720 +It's because the size of the network is you know + +0:23:29.720,0:23:33.019 +Kind of the square of the the size of the image on one side + +0:23:33.020,0:23:38.570 +And so you scale down the image by square root of 2 the network you have to run is smaller by a factor of 2 + +0:23:40.140,0:23:45.619 +Okay, so the overall cost of this is 1 plus 1/2 plus 1/4 plus 1/8 plus 1/16 etc + +0:23:45.990,0:23:51.290 +Which is 2 you waste a factor of 2 by doing multi scale, which is very small. Ok + +0:23:51.290,0:23:53.290 +you can afford a factor of 2 so + +0:23:54.570,0:23:59.600 +This is a completely ancient face detection system from the early 90s and + +0:24:00.480,0:24:02.600 +the maps that you see here are all kind of + +0:24:03.540,0:24:05.540 +maps that indicate kind of + +0:24:06.120,0:24:13.160 +Scores of face detectors, the face detector here I think is 20 by 20 pixels. So it's very low res and + +0:24:13.890,0:24:19.070 +It's a big mess at the fine scales. You see kind of high-scoring areas, but it's not really very definite + +0:24:19.710,0:24:21.710 +But you see more + +0:24:22.530,0:24:24.150 +More definite + +0:24:24.150,0:24:26.720 +Things down here. So here you see + +0:24:27.780,0:24:33.290 +A white blob here white blob here white blob here same here. You see white blob here, White blob here and + +0:24:34.020,0:24:35.670 +Those are faces + +0:24:35.670,0:24:41.060 +and so that's now how you, you need to do maximum suppression to get those + +0:24:41.580,0:24:46.489 +little red squares that are kind of the winning categories if you want the winning locations where you have a face + +0:24:50.940,0:24:52.470 +So + +0:24:52.470,0:24:57.559 +Known as sumo suppression in this case means I have a high-scoring white white blob here + +0:24:57.560,0:25:01.340 +That means there is probably the face underneath which is roughly 20 by 20 + +0:25:01.370,0:25:06.180 +It is another face in a window of 20 by 20. That means one of those two is wrong + +0:25:06.250,0:25:10.260 +so I'm just gonna take the highest-scoring one within the window of 20 by 20 and + +0:25:10.600,0:25:15.239 +Suppress all the others and you'll suppress the others at that location at that scale + +0:25:15.240,0:25:22.410 +I mean that nearby location at that scale but also at other scales. 
Okay, so you you pick the highest-scoring + +0:25:23.680,0:25:25.680 +blob if you want + +0:25:26.560,0:25:28.560 +For every location every scale + +0:25:28.720,0:25:34.439 +And whenever you pick one you you suppress the other ones that could be conflicting with it either + +0:25:34.780,0:25:37.259 +because they are a different scale at the same place or + +0:25:37.960,0:25:39.960 +At the same scale, but you know nearby + +0:25:44.350,0:25:46.350 +Okay, so that's the + +0:25:46.660,0:25:53.670 +that's the first problem and the second problem is the fact that as I said, there's many ways to be different from your face and + +0:25:54.730,0:25:59.820 +Most likely your training set doesn't have all the non-faces, things that look like faces + +0:26:00.790,0:26:05.249 +So the way people deal with this is that they do what's called negative mining + +0:26:05.950,0:26:07.390 +so + +0:26:07.390,0:26:09.390 +You go through a large collection of images + +0:26:09.460,0:26:14.850 +when you know for a fact that there is no face and you run your detector and you keep all the + +0:26:16.720,0:26:19.139 +Patches where you detector fires + +0:26:21.190,0:26:26.580 +You verify that there is no faces in them and if there is no face you add them to your negative set + +0:26:27.610,0:26:31.830 +Okay, then you retrain your detector. And then you use your retrained detector to do the same + +0:26:31.990,0:26:35.580 +Go again through a large dataset of images where there you know + +0:26:35.580,0:26:40.710 +There is no face and whenever your detector fires add that as a negative sample + +0:26:41.410,0:26:43.410 +you do this four or five times and + +0:26:43.840,0:26:50.129 +In the end you have a very robust face detector that does not fall victim to negative samples + +0:26:53.080,0:26:56.669 +These are all things that look like faces in natural images are not faces + +0:27:03.049,0:27:05.049 +This works really well + +0:27:10.380,0:27:17.209 +This is over 15 years old work this is my grandparents marriage, their wedding + +0:27:18.480,0:27:20.480 +their wedding + +0:27:22.410,0:27:24.410 +Okay + +0:27:24.500,0:27:29.569 +So here's a another interesting use of convolutional nets and this is for + +0:27:30.299,0:27:34.908 +Semantic segmentation what's called semantic segmentation, I alluded to this in the first the first lecture + +0:27:36.390,0:27:44.239 +so what is semantic segmentation is the problem of assigning a category to every pixel in an image and + +0:27:46.020,0:27:49.280 +Every pixel will be labeled with a category of the object it belongs to + +0:27:50.250,0:27:55.429 +So imagine this would be very useful if you want to say drive a robot in nature. So this is a + +0:27:56.039,0:28:00.769 +Robotics project that I worked on, my students and I worked on a long time ago + +0:28:01.770,0:28:07.520 +And what you like is to label the image so that regions that the robot can drive on + +0:28:08.820,0:28:10.820 +are indicated and + +0:28:10.860,0:28:15.199 +Areas that are obstacles also indicated so the robot doesn't drive there. Okay + +0:28:15.200,0:28:22.939 +So here the green areas are things that the robot can drive on and the red areas are obstacles like tall grass in that case + +0:28:28.049,0:28:34.729 +So the way you you train a convolutional net to do to do this kind of semantic segmentation is very similar to what I just + +0:28:35.520,0:28:38.659 +Described you you take a patch from the image + +0:28:39.360,0:28:41.360 +In this case. 
I think the patches were + +0:28:42.419,0:28:44.719 +20 by 40 or something like that, they are actually small + +0:28:46.080,0:28:51.860 +For which, you know what the central pixel is whether it's traversable or not, whether it's green or red? + +0:28:52.470,0:28:56.390 +okay, either is being manually labeled or the label has been obtained in some way and + +0:28:57.570,0:29:00.110 +You run a conv net on this patch and you train it, you know + +0:29:00.110,0:29:02.479 +tell me if it's if he's green or red tell me if it's + +0:29:03.000,0:29:05.000 +Drivable area or not + +0:29:05.970,0:29:09.439 +And once the system is trained you apply it on the entire image and it you know + +0:29:09.440,0:29:14.540 +It puts green or red depending on where it is. in this particular case actually, there were five categories + +0:29:14.830,0:29:18.990 +There's the super green green purple, which is a foot of an object + +0:29:19.809,0:29:24.269 +Red, which is an obstacle that you know threw off and super red, which is like a definite obstacle + +0:29:25.600,0:29:30.179 +Over here. We're only showing three three colors now in this particular + +0:29:31.809,0:29:37.319 +Project the the labels were actually collected automatically you didn't have to manually + +0:29:39.160,0:29:44.160 +Label the images and the patches what we do would be to run the robot around and then + +0:29:44.890,0:29:49.379 +through stereo vision figure out if a pixel is a + +0:29:51.130,0:29:53.669 +Correspond to an object that sticks out of the ground or is on the ground + +0:29:55.540,0:29:59.309 +So the the middle column here it says stereo labels these are + +0:30:00.309,0:30:05.789 +Labels, so the color green or red is computed from stereo vision from basically 3d reconstruction + +0:30:06.549,0:30:08.639 +okay, so for, you have two cameras and + +0:30:09.309,0:30:15.659 +The two cameras can estimate the distance of every pixel by basically comparing patches. It's relatively expensive, but it kind of works + +0:30:15.730,0:30:17.819 +It's not completely reliable, but it sort of works + +0:30:18.820,0:30:21.689 +So now for every pixel you have a depth the distance from the camera + +0:30:22.360,0:30:25.890 +Which means you know the position of that pixel in 3d which means you know + +0:30:25.890,0:30:30.030 +If it sticks out out of the ground or if it's on the ground because you can fit a plane to the ground + +0:30:30.880,0:30:33.900 +okay, so the green pixels are the ones that are basically + +0:30:34.450,0:30:37.980 +You know near the ground and the red ones are the ones that are up + +0:30:39.280,0:30:42.479 +so now you have labels you can try and accomplish on that to + +0:30:43.330,0:30:44.919 +predict those labels + +0:30:44.919,0:30:49.529 +Then you will tell me why would you want to train a convolutional net on that to do this if you can do this from stereo? + +0:30:50.260,0:30:53.760 +And the answer is stereo only works up to ten meters, roughly + +0:30:54.669,0:30:59.789 +Past ten meters you can't really using binocular vision and stereo vision, you can't really estimate the distance very well + +0:30:59.790,0:31:04.799 +And so that only works out to about ten meters and driving a robot by only looking + +0:31:05.200,0:31:07.770 +ten meters ahead of you is not a good idea + +0:31:08.950,0:31:13.230 +It's like driving a car in the fog right? 
It's gonna it's not very efficient + +0:31:14.380,0:31:21.089 +So what you used to accomplished on that for is to label every pixel in the image up to the horizon + +0:31:21.790,0:31:23.790 +essentially + +0:31:24.130,0:31:30.239 +Okay, so the cool thing about about this system is that as I said the labels were collected automatically but also + +0:31:32.080,0:31:33.730 +The robot + +0:31:33.730,0:31:38.849 +Adapted itself as it run because he collects stereo labels constantly + +0:31:39.340,0:31:43.350 +It can constantly retrain its neural net to adapt to the environment + +0:31:43.360,0:31:49.199 +it's in. In this particular instance of this robot, it would only will only retrain the last layer + +0:31:49.540,0:31:53.879 +So the N minus 1 layers of the ConvNet were fixed, were trained in the in the lab + +0:31:53.880,0:32:01.499 +And then the last layer was kind of adapted as the robot run, it allowed the robot to deal with environments + +0:32:01.500,0:32:02.680 +He'd never seen before + +0:32:02.680,0:32:04.120 +essentially + +0:32:04.120,0:32:06.120 +You still have long-range vision? + +0:32:10.000,0:32:17.520 +The input to the the conv network basically multiscale views of sort of bands of the image around the horizon + +0:32:18.700,0:32:20.700 +no need to go into details + +0:32:21.940,0:32:25.710 +Is a very small neural net by today's standard but that's what we could afford I + +0:32:27.070,0:32:29.970 +Have a video. I'm not sure it's gonna work, but I'll try + +0:32:31.990,0:32:33.990 +Yeah, it works + +0:32:41.360,0:32:45.010 +So I should tell you a little bit about the castor character he characters here so + +0:32:47.630,0:32:49.630 +Huh + +0:32:51.860,0:32:53.860 +You don't want the audio + +0:32:55.370,0:32:59.020 +So Pierre Semanet and Raia Hadsell were two students + +0:32:59.600,0:33:02.560 +working with me on this project two PhD students + +0:33:03.170,0:33:08.200 +Pierre Sermanet is at Google Brain. He works on robotics and Raia Hadsell is the sales director of Robotics at DeepMind + +0:33:09.050,0:33:11.050 +Marco Scoffier is NVIDIA + +0:33:11.150,0:33:15.249 +Matt Grimes is a DeepMind, Jan Ben is at Mobile Eye which is now Intel + +0:33:15.920,0:33:17.920 +Ayse Erkan is at + +0:33:18.260,0:33:20.260 +Twitter and + +0:33:20.540,0:33:22.540 +Urs Muller is still working with us, he is + +0:33:22.910,0:33:29.139 +Actually head of a big group that works on autonomous driving at Nvidia and he is collaborating with us + +0:33:30.800,0:33:32.800 +Actually + +0:33:33.020,0:33:38.020 +Our further works on this project, so this is a robot + +0:33:39.290,0:33:44.440 +And it can drive it about you know, sort of fast walking speed + +0:33:46.310,0:33:48.999 +And it's supposed to drive itself in sort of nature + +0:33:50.720,0:33:55.930 +So it's got this mass with four eyes, there are two stereo pairs to two stereo camera pairs and + +0:33:57.020,0:34:02.320 +It has three computers in the belly. So it's completely autonomous. It doesn't talk to the network or anything + +0:34:03.200,0:34:05.200 +And those those three computers + +0:34:07.580,0:34:10.120 +I'm on the left. 
That's when I had a pony tail + +0:34:13.640,0:34:19.659 +Okay, so here the the system is the the neural net is crippled so the we didn't turn on the neural Nets + +0:34:19.659,0:34:22.029 +It's only using stereo vision and now it's using the neural net + +0:34:22.130,0:34:26.529 +so it's it's pretty far away from this barrier, but it sees it and so it directly goes to + +0:34:27.169,0:34:31.599 +The side it wants to go to a goal, a GPS coordinate. That's behind it. Same here + +0:34:31.600,0:34:33.429 +He wants to go to a GPS coordinate behind it + +0:34:33.429,0:34:37.689 +And it sees right away that there is this wall of people that he can't go through + +0:34:38.360,0:34:43.539 +The guy on the right here is Marcos, He is holding the transmitter,he is not driving the robot but is holding the kill switch + +0:34:48.849,0:34:50.849 +And so + +0:34:51.039,0:34:54.689 +You know, that's what the the the convolutional net looks like + +0:34:55.659,0:34:57.659 +really small by today's standards + +0:35:00.430,0:35:02.430 +And + +0:35:03.700,0:35:05.700 +And it produces for every + +0:35:06.400,0:35:08.400 +every location every patch on the input + +0:35:08.829,0:35:13.859 +The second last layer is a 100 dimensional vector that goes into a classifier that classifies into five categories + +0:35:14.650,0:35:16.650 +so once the system classifies + +0:35:16.779,0:35:20.189 +Each of those five categories in the image you can you can warp the image + +0:35:20.349,0:35:25.979 +Into a map that's centered on the robot and you can you can do planning in this map to figure out like how to avoid + +0:35:25.980,0:35:31.379 +Obstacles and stuff like that. So this is what this thing does. It's a particular map called a hyperbolic map, but + +0:35:33.999,0:35:36.239 +It's not important for now + +0:35:38.380,0:35:40.380 +Now that + +0:35:40.509,0:35:42.509 +because this was you know + +0:35:42.970,0:35:49.199 +2007 the computers were slowly there were no GPUs so we could run this we could run this neural net only at about one frame per + +0:35:49.200,0:35:50.859 +second + +0:35:50.859,0:35:54.268 +As you can see here the at the bottom it updates about one frame per second + +0:35:54.269,0:35:54.640 +and + +0:35:54.640,0:35:59.609 +So if you have someone kind of walking in front of the robot the robot won't see it for a second and will you know? + +0:35:59.680,0:36:01.329 +Run over him + +0:36:01.329,0:36:07.079 +So that's why we have a second vision system here at the top. This one is stereo. 
It doesn't use a neural net + +0:36:09.039,0:36:13.949 +Odometry I think we don't care this is the controller which is also learned, but we don't care and + +0:36:15.730,0:36:21.989 +This is the the system here again, it's vision is crippled they can only see up to two point two and a half meters + +0:36:21.989,0:36:23.989 +So it's very short + +0:36:24.099,0:36:26.099 +But it kind of does a decent job + +0:36:26.529,0:36:28.529 +and + +0:36:28.930,0:36:34.109 +This is to test this sort of fast reacting vision systems or here pierre-simon a is jumping in front of it and + +0:36:34.420,0:36:40.950 +the robot stops right away so that now that's the full system with long-range vision and + +0:36:41.950,0:36:43.950 +annoying grad students + +0:36:49.370,0:36:52.150 +Right, so it's kind of giving up + +0:37:03.970,0:37:06.149 +Okay, oops + +0:37:09.400,0:37:11.049 +Okay, so + +0:37:11.049,0:37:12.690 +That's called semantic segmentation + +0:37:12.690,0:37:18.329 +But the real form of semantic segmentation is the one in which you you give an object category for every location + +0:37:18.729,0:37:21.599 +So that's the kind of problem here we're talking about where + +0:37:22.569,0:37:25.949 +every pixel is either building or sky or + +0:37:26.769,0:37:28.769 +Street or a car or something like this? + +0:37:29.799,0:37:37.409 +And around 2010 a couple datasets started appearing with a few thousand images where you could train vision systems to do this + +0:37:39.940,0:37:42.059 +And so the technique here is + +0:37:42.849,0:37:44.849 +essentially identical to the one I + +0:37:45.309,0:37:47.309 +Described it's also multi scale + +0:37:48.130,0:37:52.920 +So you basically have an input image you have a convolutional net + +0:37:53.259,0:37:57.959 +that has a set of outputs that you know, one for each category + +0:37:58.539,0:38:01.258 +Of objects for which you have label, which in this case is 33 + +0:38:02.680,0:38:05.879 +When you back project one output of the convolutional net onto the input + +0:38:06.219,0:38:11.249 +It corresponds to an input window of 46 by 46 windows. So it's using a context of 46 + +0:38:12.309,0:38:16.889 +by 46 pixels to make the decision about a single pixel at least that's the the + +0:38:17.589,0:38:19.589 +neural net at the back, at the bottom + +0:38:19.900,0:38:24.569 +But it has out 46 but 46 is not enough if you want to decide what a gray pixel is + +0:38:24.569,0:38:27.359 +Is it the shirt of the person is it the street? Is it the + +0:38:28.119,0:38:31.679 +Cloud or kind of pixel on the mountain. 
You have to look at a wider + +0:38:32.650,0:38:34.650 +context to be able to make that decision so + +0:38:35.529,0:38:39.179 +We use again this kind of multiscale approach where the same image is + +0:38:39.759,0:38:45.478 +Reduced by a factor of 2 and a factor of 4 and you run those two extra images to the same convolutional + +0:38:45.479,0:38:47.789 +net same weight same kernel same everything + +0:38:48.940,0:38:54.089 +Except the the last feature map you upscale them so that they have the same size as the original one + +0:38:54.089,0:38:58.859 +And now you take those combined feature Maps and send them to a couple layers of a classifier + +0:38:59.410,0:39:01.410 +So now the classifier to make its decision + +0:39:01.749,0:39:07.738 +Has four 46 by 46 windows on images that have been rescaled and so the effective + +0:39:08.289,0:39:12.718 +size of the context now is is 184 by 184 window because + +0:39:13.269,0:39:15.269 +the the core scale + +0:39:15.610,0:39:17.910 +Network basically looks at more this entire + +0:39:19.870,0:39:21.870 +Image + +0:39:24.310,0:39:30.299 +Then you can clean it up in various way I'm not gonna go to details for this but it works quite well + +0:39:33.970,0:39:36.330 +So this is the result + +0:39:37.870,0:39:40.140 +The guy who did this in my lab is Clément Farabet + +0:39:40.170,0:39:46.319 +He's a VP at Nvidia now in charge of all of machine learning infrastructure and the autonomous driving + +0:39:47.080,0:39:49.080 +Not surprisingly + +0:39:51.100,0:39:57.959 +And and so that system, you know, this is this is Washington Square Park by the way, so this is the NYU campus + +0:39:59.440,0:40:02.429 +It's not perfect far from that from that. You know it + +0:40:03.220,0:40:06.300 +Identified some areas of the street as sand + +0:40:07.330,0:40:09.160 +or desert and + +0:40:09.160,0:40:12.479 +There's no beach. 
I'm aware of in Washington Square Park + +0:40:13.750,0:40:15.750 +and + +0:40:16.480,0:40:17.320 +But you know + +0:40:17.320,0:40:22.469 +At the time this was the kind of system of this kind at the the number of training samples for this was very small + +0:40:22.470,0:40:24.400 +so it was kind of + +0:40:24.400,0:40:27.299 +It was about 2,000 or 3,000 images something like that + +0:40:31.630,0:40:34.410 +You run you take a you take a full resolution image + +0:40:36.220,0:40:42.689 +You run it to the first n minus 2 layers of your ConvNet that gives you your future Maps + +0:40:42.970,0:40:45.570 +Then you reduce the image by a factor of two run it again + +0:40:45.570,0:40:50.009 +You get a bunch of feature maps that are smaller then running again by reducing by a factor of four + +0:40:50.320,0:40:51.900 +You get smaller feature maps + +0:40:51.900,0:40:52.420 +now + +0:40:52.420,0:40:57.420 +You take the small feature map and you rescale it you up sample it so it's the same size as the first one same + +0:40:57.420,0:41:00.089 +for the second one, you stack all those feature maps together + +0:41:00.880,0:41:07.199 +Okay, and that you feed to two layers for a classifier for every patch + +0:41:07.980,0:41:12.240 +Yeah, the paper was rejected from CVPR 2012 even though the results were + +0:41:13.090,0:41:14.710 +record-breaking and + +0:41:14.710,0:41:17.520 +It was faster than the best competing + +0:41:18.400,0:41:20.400 +method by a factor of 50 + +0:41:20.950,0:41:25.920 +Even running on standard hardware, but we also had implementation on special hardware that was incredibly fast + +0:41:26.980,0:41:28.130 +and + +0:41:28.130,0:41:34.600 +people didn't know what the convolutional net was at the time and so the reviewers basically could not fathom that + +0:41:35.660,0:41:37.359 +The method they'd never heard of could work + +0:41:37.359,0:41:40.899 +So well. There is way more to say about convolutional nets + +0:41:40.900,0:41:44.770 +But I encourage you to take a computer vision course for to hear about this + +0:41:45.950,0:41:49.540 +Yeah, this is okay this data set this particular dataset that we used + +0:41:51.590,0:41:57.969 +Is a collection of images street images that was collected mostly by Antonio Torralba at MIT and + +0:42:02.690,0:42:04.130 +He had a + +0:42:04.130,0:42:08.530 +sort of a tool for kind of labeling so you could you know, you could sort of + +0:42:09.140,0:42:12.100 +draw the contour over the object and then label of the object and + +0:42:12.650,0:42:18.129 +So if it would kind of, you know fill up the object most of the segmentations were done by his mother + +0:42:20.030,0:42:22.030 +Who's in Spain + +0:42:22.310,0:42:24.310 +she had a lot of time to + +0:42:25.220,0:42:27.220 +Spend doing this + +0:42:27.380,0:42:29.300 +Huh? + +0:42:29.300,0:42:34.869 +His mother yeah labeled that stuff. Yeah. 
This was in the late late 2000 + +0:42:37.190,0:42:41.530 +Okay, now let's talk about a bunch of different architectures, right so + +0:42:43.400,0:42:45.520 +You know as I mentioned before + +0:42:45.950,0:42:51.159 +the idea of deep learning is that you have this catalog of modules that you can assemble in sort of different graphs and + +0:42:52.040,0:42:54.879 +and together to do different functions and + +0:42:56.210,0:42:58.210 +and a lot of the + +0:42:58.430,0:43:03.280 +Expertise in deep learning is to design those architectures to do something in particular + +0:43:03.619,0:43:06.909 +It's a little bit like, you know in the early days of computer science + +0:43:08.180,0:43:11.740 +Coming up with an algorithm to write a program was kind of a new concept + +0:43:12.830,0:43:14.830 +you know reducing a + +0:43:15.560,0:43:19.209 +Problem to kind of a set of instructions that could be run on a computer + +0:43:19.210,0:43:21.580 +It was kind of something new and here it's the same problem + +0:43:21.830,0:43:26.109 +you have to sort of imagine how to reduce a complex function into sort of a + +0:43:27.500,0:43:29.560 +graph possibly dynamic graph of + +0:43:29.720,0:43:35.830 +Functional modules that you don't need to know completely the function of but that you're going to whose function is gonna be finalized by learning + +0:43:36.109,0:43:38.199 +But the architecture is super important, of course + +0:43:38.920,0:43:43.359 +As we saw with convolutional Nets. the first important category is recurrent net. So + +0:43:44.180,0:43:47.379 +We've we've seen when we talked about the backpropagation + +0:43:48.140,0:43:50.140 +There's a big + +0:43:50.510,0:43:58.029 +Condition of the condition was that the graph of the interconnection of the module could not have loops. Okay. It had to be a + +0:43:59.299,0:44:04.059 +graph for which there is sort of at least a partial order of the module so that you can compute the + +0:44:04.819,0:44:09.489 +The the modules in such a way that when you compute the output of a module all of its inputs are available + +0:44:11.240,0:44:13.299 +But recurrent net is one in which you have loops + +0:44:14.480,0:44:15.490 +How do you deal with this? + +0:44:15.490,0:44:18.459 +So here is an example of a recurrent net architecture + +0:44:18.920,0:44:25.210 +Where you have an input which varies over time X(t) that goes through the first neural net. Let's call it an encoder + +0:44:25.789,0:44:29.349 +That produces a representation of the of the input + +0:44:29.349,0:44:32.679 +Let's call it H(t) and it goes into a recurrent layer + +0:44:32.680,0:44:38.409 +This recurrent layer is a function G that depends on trainable parameters W this trainable parameters also for the encoder + +0:44:38.410,0:44:40.410 +but I didn't mention it and + +0:44:41.150,0:44:42.680 +that + +0:44:42.680,0:44:46.480 +Recurrent layer takes into account H(t), which is the representation of the input + +0:44:46.480,0:44:49.539 +but it also takes into account Z(t-1), which is the + +0:44:50.150,0:44:55.509 +Sort of a hidden state, which is its output at a previous time step its own output at a previous time step + +0:44:56.299,0:44:59.709 +Okay, this G function can be a very complicated neural net inside + +0:45:00.950,0:45:06.519 +convolutional net whatever could be as complicated as you want. 
But what's important is that one of its inputs is + +0:45:08.869,0:45:10.869 +Its output at a previous time step + +0:45:11.630,0:45:13.160 +Okay + +0:45:13.160,0:45:15.049 +Z(t-1) + +0:45:15.049,0:45:21.788 +So that's why this delay indicates here. The input of G at time t is actually Z(t-1) + +0:45:21.789,0:45:24.459 +Which is the output its output at a previous time step + +0:45:27.230,0:45:32.349 +Ok, then the output of that recurrent module goes into a decoder which basically produces an output + +0:45:32.450,0:45:35.710 +Ok, so it turns a hidden representation Z into an output + +0:45:39.859,0:45:41.979 +So, how do you deal with this, you unroll the loop + +0:45:44.230,0:45:47.439 +So this is basically the same diagram, but I've unrolled it in time + +0:45:49.160,0:45:56.170 +Okay, so at time at times 0 I have X(0) that goes through the encoder produces H of 0 and then I apply + +0:45:56.170,0:46:00.129 +The G function I start with a Z arbitrary Z, maybe 0 or something + +0:46:01.160,0:46:05.980 +And I apply the function and I get Z(0) and that goes into the decoder produces an output + +0:46:06.650,0:46:08.270 +Okay + +0:46:08.270,0:46:09.740 +and then + +0:46:09.740,0:46:16.479 +Now that has Z(0) at time step 1. I can use the Z(0) as the previous output for the time step. Ok + +0:46:17.570,0:46:22.570 +Now the output is X(1) and time 1. I run through the encoder I run through the recurrent layer + +0:46:22.570,0:46:24.570 +Which is now no longer recurrent + +0:46:24.890,0:46:28.510 +And run through the decoder and then the next time step, etc + +0:46:29.810,0:46:34.269 +Ok, this network that's involved in time doesn't have any loops anymore + +0:46:37.130,0:46:39.040 +Which means I can run backpropagation through it + +0:46:39.040,0:46:44.259 +So if I have an objective function that says the last output should be that particular one + +0:46:45.020,0:46:48.609 +Or maybe the trajectory should be a particular one of the outputs. I + +0:46:49.730,0:46:51.760 +Can just back propagate gradient through this thing + +0:46:52.940,0:46:55.510 +It's a regular network with one + +0:46:56.900,0:46:59.980 +Particular characteristic, which is that every block + +0:47:01.609,0:47:03.609 +Shares the same weights + +0:47:04.040,0:47:07.509 +Okay, so the three instances of the encoder + +0:47:08.150,0:47:11.379 +They are the same encoder at three different time steps + +0:47:11.380,0:47:16.869 +So they have the same weights the same G functions has the same weights, the three decoders have the same weights. 
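A minimal sketch of the unrolled recurrent net just described, with made-up sizes: one encoder, one recurrent function G and one decoder, reused with the same weights at every time step. Because the unrolled graph has no loops, calling .backward() on a loss over the outputs is exactly backprop through time.

import torch
import torch.nn as nn

class UnrolledRNN(nn.Module):
    def __init__(self, d_in, d_hid, d_out):
        super().__init__()
        self.enc = nn.Linear(d_in, d_hid)        # x(t) -> h(t)
        self.G   = nn.Linear(2 * d_hid, d_hid)   # (h(t), z(t-1)) -> z(t)
        self.dec = nn.Linear(d_hid, d_out)       # z(t) -> output

    def forward(self, xs):                       # xs: (T, d_in), one row per time step
        z = torch.zeros(self.G.out_features)     # arbitrary initial state, e.g. zeros
        outs = []
        for x in xs:                             # the same enc/G/dec weights at every step
            h = torch.tanh(self.enc(x))
            z = torch.tanh(self.G(torch.cat([h, z])))
            outs.append(self.dec(z))
        return torch.stack(outs)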
Yes

0:47:20.990,0:47:23.260
It can be variable, you know, I have to decide in advance

0:47:25.160,0:47:27.399
But it depends on the length of your input sequence

0:47:28.579,0:47:30.109
basically

0:47:30.109,0:47:33.159
Right and you know, you can run it for as long as you want

0:47:33.890,0:47:38.290
You know, it's the same weights over and over so you can just, you know, repeat the operation

0:47:40.130,0:47:46.390
Okay this technique of unrolling and then backpropagating through time is called, surprisingly,

0:47:47.060,0:47:49.060
BPTT, backprop through time

0:47:50.000,0:47:52.000
It's pretty obvious

0:47:53.470,0:47:55.470
That's all there is to it

0:47:56.710,0:48:01.439
Unfortunately, they don't work very well, at least not in their naive form

0:48:03.910,0:48:06.000
So in the naive form

0:48:07.360,0:48:11.519
So a simple form of recurrent net is one in which the encoder is linear

0:48:11.770,0:48:16.560
The G function is linear with hyperbolic tangent or sigmoid or perhaps ReLU

0:48:17.410,0:48:22.680
And the decoder also is linear, something like this, maybe with a ReLU or something like that, right, so it could be very simple

0:48:23.530,0:48:24.820
and

0:48:24.820,0:48:27.539
You get a number of problems with this and one problem is

0:48:29.290,0:48:32.969
The so-called vanishing gradient problem or exploding gradient problem

0:48:34.060,0:48:38.640
And it comes from the fact that if you have a long sequence, let's say I don't know 50 time steps

0:48:40.060,0:48:44.400
Every time you backpropagate gradients

0:48:45.700,0:48:52.710
The gradients get multiplied by the weight matrix of the G function. Okay, at every time step

0:48:54.010,0:48:58.560
the gradients get multiplied by the weight matrix, now imagine the weight matrix has

0:48:59.110,0:49:00.820
small values in it

0:49:00.820,0:49:07.049
Which means that every time you take your gradient you multiply it by the transpose of this matrix to get the gradient at the previous

0:49:07.050,0:49:08.290
time step

0:49:08.290,0:49:10.529
You get a shorter vector, you get a smaller vector

0:49:11.200,0:49:14.520
And as you keep rolling, the vector gets shorter and shorter exponentially

0:49:14.980,0:49:18.449
That's called the vanishing gradient problem: by the time you get to the 50th

0:49:19.210,0:49:23.100
time step, which is really the first time step, you don't get any gradient

0:49:28.660,0:49:32.970
Conversely if the weight matrix is really large and the non-linearity in your

0:49:33.760,0:49:36.120
recurrent layer is not saturating

0:49:36.670,0:49:41.130
your gradients can explode: if the weight matrix is large, every time you multiply the

0:49:41.650,0:49:43.650
gradient by the transpose of the matrix

0:49:43.660,0:49:46.920
the vector gets larger and it explodes, which means

0:49:47.290,0:49:51.810
your weights are going to diverge when you do a gradient step, or you're gonna have to use a tiny learning rate for it to

0:49:51.810,0:49:53.810
work

0:49:54.490,0:49:56.290
So

0:49:56.290,0:49:58.529
You have to use a lot of tricks to make those things work

0:49:59.860,0:50:04.620
Here's another problem. The reason why you would want to use a recurrent net. Why would you want to use a recurrent net?
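Before answering that, a tiny numeric illustration of the vanishing/exploding effect just described, assuming a toy 10-dimensional state: backpropagating through 50 steps multiplies the gradient by the transpose of the recurrent weight matrix 50 times, so small weights shrink it exponentially and large weights blow it up.

import torch

torch.manual_seed(0)
g = torch.randn(10)                      # gradient arriving at the last time step
for scale, name in [(0.5, 'vanishing'), (1.5, 'exploding')]:
    W = scale * torch.eye(10)            # stand-in for the recurrent weight matrix
    v = g.clone()
    for _ in range(50):                  # 50 time steps of backprop
        v = W.t() @ v
    print(name, v.norm().item())         # roughly 0.5**50 vs 1.5**50 times the original norm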
+ +0:50:05.690,0:50:12.639 +The purported advantage of recurrent net is that they can remember remember things from far away in the past + +0:50:13.850,0:50:15.850 +Okay + +0:50:16.970,0:50:24.639 +If for example you imagine that the the X's are our characters that you enter one by one + +0:50:25.940,0:50:31.300 +The characters come from I don't know a C program or something like that, right? + +0:50:34.070,0:50:35.300 +And + +0:50:35.300,0:50:37.870 +What your system is supposed to tell you at the end, you know? + +0:50:37.870,0:50:42.699 +it reads a few hundred characters corresponding to the source code of a function and at the end is + +0:50:43.730,0:50:49.090 +you want to train your system so that it produces one if it's a syntactically correct program and + +0:50:49.910,0:50:51.910 +Minus one if it's not okay + +0:50:52.430,0:50:54.320 +hypothetical problem + +0:50:54.320,0:50:57.489 +Recurrent Nets won't do it. Okay, at least not with our tricks + +0:50:59.180,0:51:02.500 +Now there is a thing here which is the issue which is that + +0:51:03.860,0:51:07.599 +Among other things this program has to have balanced braces and parentheses + +0:51:09.110,0:51:10.280 +So + +0:51:10.280,0:51:13.540 +It has to have a way of remembering how many open parentheses + +0:51:13.540,0:51:20.350 +there are so that it can check that you're closing them all or how many open braces there are so so all of them get + +0:51:21.620,0:51:24.939 +Get closed right so it has to store eventually, you know + +0:51:27.380,0:51:29.410 +Essentially within its hidden state Z + +0:51:29.410,0:51:32.139 +it has to store like how many braces and and + +0:51:32.630,0:51:37.240 +Parentheses were open if it wants to be able to tell at the end that all of them have been closed + +0:51:38.620,0:51:41.040 +So it has to have some sort of counter inside right + +0:51:43.180,0:51:45.080 +Yes + +0:51:45.080,0:51:47.840 +It's going to be a topic tomorrow + +0:51:51.050,0:51:56.469 +Now if the program is very long that means, you know Z has to kind of preserve information for a long time and + +0:51:57.230,0:52:02.679 +Recurrent net, you know give you the hope that maybe a system like this can do this, but because of a vanishing gradient problem + +0:52:02.810,0:52:05.259 +They actually don't at least not simple + +0:52:07.280,0:52:09.280 +Recurrent Nets + +0:52:09.440,0:52:11.440 +Of the type. I just described + +0:52:12.080,0:52:14.080 +So you have to use a bunch of tricks + +0:52:14.200,0:52:18.460 +Those are tricks from you know Yoshua Bengio's lab, but there is a bunch of them that were published by various people + +0:52:19.700,0:52:22.090 +Like Thomas Mikolov and various other people + +0:52:24.050,0:52:27.789 +So to avoid exploding gradients you can clip the gradients just you know, make it you know + +0:52:27.790,0:52:30.279 +If the gradients get too large, you just kind of squash them down + +0:52:30.950,0:52:32.950 +Just normalize them + +0:52:35.180,0:52:41.800 +Weak integration momentum I'm not gonna mention that. a good initialization so you want to initialize the weight matrices so that + +0:52:42.380,0:52:44.380 +They preserves the norm more or less + +0:52:44.660,0:52:49.180 +this is actually a whole bunch of papers on this on orthogonal neural nets and invertible + +0:52:49.700,0:52:51.700 +recurrent Nets + +0:52:54.770,0:52:56.770 +But the big trick is + +0:52:57.470,0:53:04.630 +LSTM and GRUs. Okay. 
So what is that? Before I talk about that I'm gonna talk about multiplicative modules

0:53:06.410,0:53:08.470
So what are multiplicative modules

0:53:09.500,0:53:11.000
They're basically

0:53:11.000,0:53:14.709
Modules in which you can multiply things with each other

0:53:14.710,0:53:20.590
So instead of just computing a weighted sum of inputs you compute products of inputs and then a weighted sum of that

0:53:20.600,0:53:23.110
Okay, so you have an example of this on the top left

0:53:23.720,0:53:25.040
on the top

0:53:25.040,0:53:29.080
so the output of a system here is just a weighted sum of

0:53:30.080,0:53:32.080
weights and inputs

0:53:32.240,0:53:37.810
Okay, classic, but the weights actually themselves are weighted sums of weights and inputs

0:53:38.780,0:53:43.149
okay, so Wij here, which is the ij'th term in the weight matrix of

0:53:43.820,0:53:46.479
The module we're considering, is actually itself

0:53:47.270,0:53:49.270
a weighted sum of

0:53:50.060,0:53:53.439
the third-order tensor Uijk

0:53:54.410,0:53:56.560
weighted by variables Zk.

0:53:58.220,0:54:02.080
Okay, so basically what you get is that Wij is kind of a weighted sum of

0:54:04.160,0:54:06.160
matrices

0:54:06.800,0:54:08.800
Uk

0:54:09.020,0:54:13.419
weighted by a coefficient Zk, and the Zk can change, they are input variables the same way

0:54:13.460,0:54:17.230
So in effect, it's like having a neural net

0:54:18.260,0:54:22.600
with weight matrix W whose weight matrix is computed itself by another neural net

0:54:24.710,0:54:30.740
There is a general form of this where you don't just multiply matrices, but you have a neural net that is some complex function

0:54:31.650,0:54:33.650
turns X into S

0:54:34.859,0:54:40.819
Some generic function. Ok, you know, a ConvNet, whatever, and the weights of those neural nets

0:54:41.910,0:54:44.839
are not variables that you learn directly but they are the output of

0:54:44.970,0:54:48.800
another neural net that takes maybe another input into account or maybe the same input

0:54:49.830,0:54:55.069
Some people call those architectures hyper networks. Ok. They are networks whose weights are computed by another network

0:54:56.160,0:54:59.270
But here's just a simple form of it, which is kind of a bilinear form

0:54:59.970,0:55:01.740
or quadratic

0:55:01.740,0:55:03.180
form

0:55:03.180,0:55:05.810
Ok, so overall when you kind of write it all down

0:55:06.570,0:55:13.339
Si is equal to the sum over j and k of Uijk Zk Xj.
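A minimal sketch of that formula with made-up sizes; torch.einsum computes s_i = sum over j and k of U_ijk z_k x_j directly, which is the same thing as first building the data-dependent weight matrix W(z) and then applying it to x.

import torch

d_out, d_in, d_z = 4, 5, 3
U = torch.randn(d_out, d_in, d_z)    # third-order tensor of parameters U_ijk
x = torch.randn(d_in)                # input x_j
z = torch.randn(d_z)                 # side input z_k that "programs" the weights

s      = torch.einsum('ijk,j,k->i', U, x, z)   # s_i = sum_jk U_ijk x_j z_k
W_of_z = torch.einsum('ijk,k->ij', U, z)       # data-dependent weight matrix W(z)
assert torch.allclose(s, W_of_z @ x)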
This is a double sum + +0:55:15.750,0:55:18.169 +People used to call this Sigma Pi units, yes + +0:55:22.890,0:55:27.290 +We'll come to this in just a second basically + +0:55:31.500,0:55:33.500 +If you want a neural net that can + +0:55:34.740,0:55:36.740 +perform a transformation from + +0:55:37.440,0:55:41.929 +A vector into another and that transformation is to be programmable + +0:55:42.990,0:55:50.089 +Right, you can have that transformation be computed by a neural net but the weight of that neural net would be it themselves the output + +0:55:50.089,0:55:51.390 +of + +0:55:51.390,0:55:54.200 +Another neural net that figures out what the transformation is + +0:55:55.349,0:56:01.399 +That's kind of the more general form more specifically is very useful if you want to route + +0:56:03.359,0:56:08.389 +Signals through a neural net in different ways on a data dependent way so + +0:56:10.980,0:56:16.669 +You in fact that's exactly what is mentioned below so the attention module is a special case of this + +0:56:17.460,0:56:20.510 +It's not a quadratic layer. It's kind of a different type, but it's a + +0:56:21.510,0:56:23.510 +particular type of + +0:56:25.140,0:56:26.849 +Architecture that + +0:56:26.849,0:56:28.849 +basically computes a + +0:56:29.339,0:56:32.029 +convex linear combination of a bunch of vectors, so + +0:56:32.790,0:56:34.849 +x₁ and x₂ here are vectors + +0:56:37.770,0:56:42.499 +w₁ and w₂ are scalars, basically, okay and + +0:56:45.540,0:56:47.870 +What the system computes here is a weighted sum of + +0:56:49.590,0:56:55.069 +x₁ and x₂ weighted by w₁ w₂ and again w₁ w₂ are scalars in this case + +0:56:56.910,0:56:58.910 +Here the sum at the output + +0:56:59.730,0:57:01.020 +so + +0:57:01.020,0:57:07.999 +Imagine that those two weights. w₁ w₂ are between 0 and 1 and sum to 1 that's what's called a convex linear combination + +0:57:10.260,0:57:13.760 +So by changing w₁ w₂ so essentially + +0:57:15.480,0:57:18.139 +If this sum to 1 there are the output of a softmax + +0:57:18.810,0:57:23.629 +Which means w₂ is equal to 1 - w₁ right? 
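A minimal sketch of that convex combination, assuming a handful of toy vectors: a softmax over the scores z gives coefficients between 0 and 1 that sum to 1, and the output is the corresponding mixture of the x vectors. Making z very peaked drives one coefficient toward 1 and focuses the output on a single x.

import torch

xs = torch.randn(4, 8)           # candidate vectors x_1..x_4, dimension 8
z  = torch.randn(4)              # scores, e.g. produced by another small network
w  = torch.softmax(z, dim=0)     # w_i in (0, 1), summing to 1
out = w @ xs                     # convex combination: sum_i w_i * x_i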
That's kind of the direct consequence + +0:57:27.450,0:57:29.450 +So basically by changing + +0:57:29.790,0:57:34.340 +the size of w₁ w₂ you kind of switch the output to + +0:57:34.530,0:57:39.860 +Being either x₁ or x₂ or some linear combination of the two some interpolation between the two + +0:57:41.610,0:57:43.050 +Okay + +0:57:43.050,0:57:47.179 +You can have more than just x₁ and x₂ you can have a whole bunch of x vectors + +0:57:48.360,0:57:50.360 +and that + +0:57:50.730,0:57:54.800 +system will basically choose an appropriate linear combination or focus + +0:57:55.140,0:58:02.210 +Is called an attention mechanism because it allows a neural net to basically focus its attention on a particular input and ignoring ignoring the others + +0:58:02.880,0:58:05.240 +The choice of this is made by another variable Z + +0:58:05.790,0:58:09.679 +Which itself could be the output to some other neural net that looks at Xs for example + +0:58:10.740,0:58:12.270 +okay, and + +0:58:12.270,0:58:18.409 +This has become a hugely important type of function, it's used in a lot of different situations now + +0:58:19.440,0:58:22.700 +In particular it's used in LSTM and GRU but it's also used in + +0:58:26.730,0:58:30.020 +Pretty much every natural language processing system nowadays that use + +0:58:31.830,0:58:37.939 +Either transformer architectures or all the types of attention they all use this kind of this kind of trick + +0:58:43.280,0:58:46.570 +Okay, so you have a vector Z pass it to a softmax + +0:58:46.570,0:58:52.509 +You get a bunch of numbers between 0 & 1 that sum to 1 use those as coefficient to compute a weighted sum + +0:58:52.700,0:58:54.560 +of a bunch of vectors X + +0:58:54.560,0:58:56.589 +xᵢ and you get the weighted sum + +0:58:57.290,0:59:00.070 +Weighted by those coefficients those coefficients are data dependent + +0:59:00.890,0:59:02.890 +Because Z is data dependent + +0:59:05.390,0:59:07.390 +All right, so + +0:59:09.800,0:59:13.659 +Here's an example of how you use this whenever you have this symbol here + +0:59:15.530,0:59:17.859 +This circle with the dots in the middle, that's a + +0:59:20.510,0:59:26.739 +Component by component multiplication of two vectors some people call this Hadamard product + +0:59:29.660,0:59:34.629 +Anyway, it's turn-by-turn multiplication. So this is a + +0:59:36.200,0:59:41.020 +a type of a kind of functional module + +0:59:43.220,0:59:47.409 +GRU, gated recurrent Nets, was proposed by Kyunghyun Cho who is professor here + +0:59:50.420,0:59:51.880 +And it attempts + +0:59:51.880,0:59:54.430 +It's an attempt at fixing the problem that naturally occur in + +0:59:54.560,0:59:58.479 +recurrent Nets that I mentioned the fact that you have exploding gradient the fact that the + +1:00:00.050,1:00:04.629 +recurrent nets don't really remember their states for very long. They tend to kind of forget really quickly + +1:00:05.150,1:00:07.540 +And so it's basically a memory cell + +1:00:08.060,1:00:14.080 +Okay, and I have to say this is the kind of second big family of sort of + +1:00:16.820,1:00:20.919 +Recurrent net with memory. 
The first one is LSTM, but I'm going to talk about it just afterwards + +1:00:21.650,1:00:23.650 +Just because this one is a little simpler + +1:00:24.950,1:00:27.550 +The equations are written at the bottom here so + +1:00:28.280,1:00:30.280 +basically, there is a + +1:00:31.280,1:00:32.839 +a + +1:00:32.839,1:00:34.839 +gating vector Z + +1:00:35.720,1:00:37.550 +which is + +1:00:37.550,1:00:41.919 +simply the application of a nonlinear function the sigmoid function + +1:00:42.950,1:00:44.089 +to + +1:00:44.089,1:00:49.119 +two linear layers and a bias and those two linear layers take into account the input X(t) and + +1:00:49.400,1:00:54.389 +The previous state which they did note H in their case, not Z like I did + +1:00:55.930,1:01:01.889 +Okay, so you take X you take H you compute matrices + +1:01:02.950,1:01:04.140 +You pass a result + +1:01:04.140,1:01:07.440 +you add the results you pass them through sigmoid functions and you get a bunch of + +1:01:07.539,1:01:11.939 +values between 0 & 1 because the sigmoid is between 0 & 1 gives you a coefficient and + +1:01:14.140,1:01:16.140 +You use those coefficients + +1:01:16.660,1:01:20.879 +You see the formula at the bottom the Z is used to basically compute a linear combination + +1:01:21.700,1:01:24.210 +of two inputs if Z is equal to 1 + +1:01:25.420,1:01:28.379 +You basically only look at h(t-1). If Z + +1:01:29.859,1:01:35.669 +Is equal to 0 then 1 - Z is equal to 1 then you you look at this + +1:01:36.400,1:01:38.109 +expression here and + +1:01:38.109,1:01:43.528 +That expression is, you know some weight matrix multiplied by the input passed through a hyperbolic tangent function + +1:01:43.529,1:01:46.439 +It could be a ReLU but it's a hyperbolic tangent in this case + +1:01:46.839,1:01:49.528 +And it's combined with other stuff here that we can ignore for now + +1:01:50.829,1:01:58.439 +Okay. So basically what what the Z value does is that it tells the system just copy if Z equal 1 it just copies its + +1:01:58.440,1:02:00.440 +previous state and ignores the input + +1:02:00.789,1:02:04.978 +Ok, so it acts like a memory essentially. It just copies its previous state on its output + +1:02:06.430,1:02:08.430 +and if Z + +1:02:09.549,1:02:17.189 +Equals 0 then the current state is forgotten essentially and is basically you would you just read the input + +1:02:19.450,1:02:24.629 +Ok multiplied by some matrix so it changes the state of the system + +1:02:28.960,1:02:35.460 +Yeah, you do this component by component essentially, okay vector 1 yeah exactly + +1:02:47.500,1:02:53.459 +Well, it's just like the number of independent multiplications, right, what is the derivative of + +1:02:54.880,1:02:59.220 +some objective function with respect to the input of a product. It's equal to the + +1:03:01.240,1:03:07.829 +Derivative of that objective function with respect to the add, to the product multiplied by the other term. That's the as simple as that + +1:03:18.039,1:03:20.039 +So it's because by default + +1:03:20.529,1:03:22.529 +essentially unless Z is + +1:03:23.619,1:03:25.509 +your Z is + +1:03:25.509,1:03:30.689 +More less by default equal to one and so by default the system just copies its previous state + +1:03:33.039,1:03:35.999 +And if it's just you know slightly less than one it + +1:03:37.210,1:03:42.539 +It puts a little bit of the input into the state but doesn't significantly change the state and what that means. 
Is that it + +1:03:43.630,1:03:44.799 +preserves + +1:03:44.799,1:03:46.919 +Norm, and it preserves information, right? + +1:03:48.940,1:03:53.099 +Since basically memory cell that you can change continuously + +1:04:00.480,1:04:04.159 +Well because you need something between zero and one it's a coefficient, right + +1:04:04.160,1:04:07.789 +And so it needs to be between zero and one that's what we do sigmoids + +1:04:11.850,1:04:13.080 +I + +1:04:13.080,1:04:16.850 +mean you need one that is monotonic that goes between 0 and 1 and + +1:04:17.970,1:04:20.059 +is monotonic and differentiable I mean + +1:04:20.730,1:04:22.849 +There's lots of sigmoid functions, but you know + +1:04:24.000,1:04:26.000 +Why not? + +1:04:26.100,1:04:29.779 +Yeah, I mean there is some argument for using others, but you know doesn't make a huge + +1:04:30.540,1:04:32.540 +amount of difference + +1:04:32.700,1:04:37.009 +Okay in the full form of gru. there is also a reset gate. So the reset gate is + +1:04:37.650,1:04:44.989 +Is this guy here? So R is another vector that's computed also as a linear combination of inputs and previous state and + +1:04:45.660,1:04:51.319 +It serves to multiply the previous state. So if R is 0 then the previous state is + +1:04:52.020,1:04:54.410 +if R is 0 and Z is 1 + +1:04:55.950,1:05:00.499 +The system is basically completely reset to 0 because that is 0 + +1:05:01.350,1:05:03.330 +So it only looks at the input + +1:05:03.330,1:05:09.950 +But that's basically a simplified version of something that came out way earlier in 1997 called + +1:05:10.260,1:05:12.260 +LSTM long short-term memory + +1:05:13.050,1:05:14.820 +Which you know attempted + +1:05:14.820,1:05:19.519 +Which was an attempt at solving the same issue that you know recurrent Nets basically lose memory for too long + +1:05:19.520,1:05:21.520 +and so you build them as + +1:05:22.860,1:05:26.120 +As memory cells by default and by default they will preserve the information + +1:05:26.760,1:05:28.430 +It's essentially the same idea here + +1:05:28.430,1:05:33.979 +It's a you know, the details are slightly different here don't have dots in the middle of the round shape here for the product + +1:05:33.980,1:05:35.610 +But it's the same thing + +1:05:35.610,1:05:41.539 +And there's a little more kind of moving parts. 
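Before the LSTM discussion continues, a minimal sketch of the GRU step as just described (z close to 1 copies the previous state, z close to 0 overwrites it with a new candidate; r resets the old state when building that candidate). The parameter names and shapes are made up, and in practice one would simply call torch.nn.GRU; the LSTM, picked up next, adds more gates and a separate cell state.

import torch

def gru_step(x, h_prev, Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh):
    z = torch.sigmoid(Wz @ x + Uz @ h_prev + bz)      # update gate
    r = torch.sigmoid(Wr @ x + Ur @ h_prev + br)      # reset gate
    h_new = torch.tanh(Wh @ x + Uh @ (r * h_prev) + bh)
    return z * h_prev + (1 - z) * h_new               # z = 1 -> keep the memory untouched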
It's basically it looks more like an actual run sale + +1:05:41.540,1:05:44.060 +So it's like a flip-flop they can you know preserve + +1:05:44.430,1:05:48.200 +Information and there is some leakage that you can have, you can reset it to 0 or to 1 + +1:05:48.810,1:05:50.810 +It's fairly complicated + +1:05:52.050,1:05:59.330 +Thankfully people at NVIDIA Facebook Google and various other places have very efficient implementations of those so you don't need to + +1:05:59.550,1:06:01.550 +figure out how to write the + +1:06:01.620,1:06:03.710 +CUDA code for this or write the back pop + +1:06:05.430,1:06:07.430 +Works really well + +1:06:07.500,1:06:12.689 +it's it's quite what you'd use but it's used less and less because + +1:06:13.539,1:06:15.539 +people use recurrent Nets + +1:06:16.150,1:06:18.210 +people used to use recurrent Nets for natural language processing + +1:06:19.329,1:06:21.220 +mostly and + +1:06:21.220,1:06:25.949 +Things like speech recognition and speech recognition is moving towards using convolutional Nets + +1:06:27.490,1:06:29.200 +temporal conditional Nets + +1:06:29.200,1:06:34.109 +while the natural language processing is moving towards using what's called transformers + +1:06:34.630,1:06:36.900 +Which we'll hear a lot about tomorrow, right? + +1:06:37.630,1:06:38.950 +no? + +1:06:38.950,1:06:40.950 +when + +1:06:41.109,1:06:43.109 +two weeks from now, okay + +1:06:46.599,1:06:48.599 +So what transformers are + +1:06:49.119,1:06:51.119 +Okay, I'm not gonna talk about transformers just now + +1:06:51.759,1:06:56.219 +but these key transformers are kind of a generalization so + +1:06:57.009,1:06:58.619 +General use of attention if you want + +1:06:58.619,1:07:02.038 +So the big neural Nets that use attention that you know + +1:07:02.039,1:07:06.329 +Every block of neuron uses attention and that tends to work pretty well it works + +1:07:06.329,1:07:09.538 +So well that people are kind of basically dropping everything else for NLP + +1:07:10.869,1:07:12.869 +so the problem is + +1:07:13.269,1:07:15.299 +Systems like LSTM are not very good at this so + +1:07:16.599,1:07:20.219 +Transformers are much better. The biggest transformers have billions of parameters + +1:07:21.430,1:07:26.879 +Like the biggest one is by 15 billion something like that that order of magnitude the t5 or whatever it's called + +1:07:27.910,1:07:29.910 +from Google so + +1:07:30.460,1:07:36.779 +That's an enormous amount of memory and it's because of the particular type of architecture that's used in transformers + +1:07:36.779,1:07:40.319 +They they can actually store a lot of knowledge if you want + +1:07:41.289,1:07:43.559 +So that's the stuff people would use for + +1:07:44.440,1:07:47.069 +What you're talking about like question answering systems + +1:07:47.769,1:07:50.099 +Translation systems etc. They will use transformers + +1:07:52.869,1:07:54.869 +Okay + +1:07:57.619,1:08:01.778 +So because LSTM kind of was sort of you know one of the first + +1:08:02.719,1:08:04.958 +architectures recurrent architecture that kind of worked + +1:08:05.929,1:08:11.408 +People tried to use them for things that at first you would think are crazy but turned out to work + +1:08:12.109,1:08:16.689 +And one example of this is translation. It's called neural machine translation + +1:08:17.509,1:08:19.509 +So there was a paper + +1:08:19.639,1:08:22.149 +by Ilya Sutskever at NIPS 2014 where he + +1:08:22.969,1:08:29.799 +Trained this giant multi-layer LSTM. So what's a multi-layered LSTM? 
It's an LSTM where you have + +1:08:30.589,1:08:36.698 +so it's the unfolded version, right? So at the bottom here you have an LSTM which is here unfolded for three time steps + +1:08:36.699,1:08:41.618 +But it will have to be unfolded for the length of a sentence you want to translate, let's say a + +1:08:42.259,1:08:43.969 +sentence in French + +1:08:43.969,1:08:45.529 +and + +1:08:45.529,1:08:48.038 +And then you take the hidden + +1:08:48.289,1:08:53.709 +state at every time step of this LSTM and you feed that as input to a second LSTM and + +1:08:53.929,1:08:55.150 +I think in his network + +1:08:55.150,1:08:58.329 +he actually had four layers of that so you can think of this as a + +1:08:58.639,1:09:02.139 +Stacked LSTM that you know each of them are recurrent in time + +1:09:02.139,1:09:05.589 +But they are kind of stacked as the layers of a neural net + +1:09:06.500,1:09:07.670 +so + +1:09:07.670,1:09:14.769 +At the last time step in the last layer, you have a vector here, which is meant to represent the entire meaning of that sentence + +1:09:16.309,1:09:18.879 +Okay, so it could be a fairly large vector + +1:09:19.849,1:09:24.819 +and then you feed that to another multi-layer LSTM, which + +1:09:27.319,1:09:31.028 +You know you run for a sort of undetermined number of steps and + +1:09:32.119,1:09:37.209 +The role of this LSTM is to produce words in a target language if you do translation say German + +1:09:38.869,1:09:40.839 +Okay, so this is time, you know + +1:09:40.839,1:09:44.499 +It takes the state you run through the first two layers of the LSTM + +1:09:44.630,1:09:48.849 +Produce a word and then take that word and feed it as input to the next time step + +1:09:49.940,1:09:52.359 +So that you can generate text sequentially, right? + +1:09:52.909,1:09:58.899 +Run through this produce another word take that word feed it back to the input and keep going. So this is a + +1:10:00.619,1:10:02.619 +Should do this for translation you get this gigantic + +1:10:03.320,1:10:07.480 +Neural net you train and this is the it's a system of this type + +1:10:07.480,1:10:12.010 +The one that Sutskever represented at NIPS 2014 it was was the first neural + +1:10:13.130,1:10:19.209 +Translation system that had performance that could rival sort of more classical approaches not based on neural nets + +1:10:21.350,1:10:23.950 +And people were really surprised that you could get such results + +1:10:26.840,1:10:28.840 +That success was very short-lived + +1:10:31.280,1:10:33.280 +Yeah, so the problem is + +1:10:34.340,1:10:37.449 +The word you're gonna say at a particular time depends on the word you just said + +1:10:38.180,1:10:41.320 +Right, and if you ask the system to just produce a word + +1:10:42.800,1:10:45.729 +And then you don't feed that word back to the input + +1:10:45.730,1:10:49.120 +the system could be used in other word that has that is inconsistent with the previous one you produced + +1:10:55.790,1:10:57.790 +It should but it doesn't + +1:10:58.760,1:11:05.590 +I mean not well enough that that it works. 
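A hedged sketch of the stacked-LSTM encoder/decoder just described, using torch.nn.LSTM with num_layers=4; the vocabulary size, dimensions and the start-of-sentence index 0 are all made up, and real systems add attention, beam search and much more. The last two lines show exactly the point just made: each produced word is fed back in as the next input.

import torch
import torch.nn as nn

d_emb, d_hid, vocab = 256, 512, 10000
embed   = nn.Embedding(vocab, d_emb)
encoder = nn.LSTM(d_emb, d_hid, num_layers=4)   # stacked LSTM, 4 layers
decoder = nn.LSTM(d_emb, d_hid, num_layers=4)
readout = nn.Linear(d_hid, vocab)

src = torch.randint(vocab, (12, 1))             # a 12-token source sentence, batch of 1
_, state = encoder(embed(src))                  # (h, c): the compressed "meaning" handed over

word = torch.tensor([[0]])                      # assumed start-of-sentence token
for _ in range(20):                             # produce up to 20 target words greedily
    out, state = decoder(embed(word), state)
    word = readout(out[-1]).argmax(dim=-1, keepdim=True)   # feed the prediction back in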
So so this is so this is kind of sequential production is pretty much required + +1:11:07.790,1:11:09.790 +In principle, you're right + +1:11:10.910,1:11:12.910 +It's not very satisfying + +1:11:13.610,1:11:19.089 +so there's a problem with this which is that the entire meaning of the sentence has to be kind of squeezed into + +1:11:19.430,1:11:22.419 +That hidden state that is between the encoder of the decoder + +1:11:24.530,1:11:29.829 +That's one problem the second problem is that despite the fact that that LSTM are built to preserve information + +1:11:31.040,1:11:36.010 +They are basically memory cells. They don't actually preserve information for more than about 20 words + +1:11:36.860,1:11:40.299 +So if your sentence is more than 20 words by the time you get to the end of the sentence + +1:11:40.520,1:11:43.270 +Your your hidden state will have forgotten the beginning of it + +1:11:43.640,1:11:49.269 +so what people use for this the fix for this is a huge hack is called BiLSTM and + +1:11:50.060,1:11:54.910 +It's a completely trivial idea that consists in running two LSTMs in opposite directions + +1:11:56.210,1:11:59.020 +Okay, and then you get two codes one that is + +1:11:59.720,1:12:04.419 +running the LSTM from beginning to end of the sentence that's one vector and then the second vector is from + +1:12:04.730,1:12:09.939 +Running an LSTM in the other direction you get a second vector. That's the meaning of your sentence + +1:12:10.280,1:12:16.809 +You can basically double the length of your sentence without losing too much information this way, but it's not a very satisfying solution + +1:12:17.120,1:12:19.450 +So if you see biLSTM, that's what that's what it is + +1:12:22.830,1:12:29.179 +So as I said, the success was short-lived because in fact before the paper was published at NIPS + +1:12:30.390,1:12:32.390 +There was a paper by + +1:12:34.920,1:12:37.969 +Dzmitry Bahdanau, Kyunghyun Cho and Yoshua Bengio + +1:12:38.670,1:12:42.319 +which was published on arxiv in September 14 that said + +1:12:43.560,1:12:47.209 +We can use attention. 
So the attention mechanism I mentioned earlier + +1:12:49.320,1:12:51.300 +Instead of having those gigantic + +1:12:51.300,1:12:54.890 +Networks and squeezing the entire meaning of a sentence into this small vector + +1:12:55.800,1:12:58.190 +it would make more sense to the translation if + +1:12:58.710,1:13:03.169 +Every time said, you know, we want to produce a word in French corresponding to a sentence in English + +1:13:04.469,1:13:08.509 +If we looked at the location in the English sentence that had that word + +1:13:09.390,1:13:10.620 +Okay + +1:13:10.620,1:13:12.090 +so + +1:13:12.090,1:13:17.540 +Our decoder is going to produce french words one at a time and when it comes to produce a word + +1:13:18.449,1:13:21.559 +that has an equivalent in the input english sentence it's + +1:13:21.960,1:13:29.750 +going to focus its attention on that word and then the translation from French to English of that word would be simple or the + +1:13:30.360,1:13:32.300 +You know, it may not be a single word + +1:13:32.300,1:13:34.050 +it could be a group of words right because + +1:13:34.050,1:13:39.590 +Very often you have to turn a group of word in English into a group of words in French to kind of say the same + +1:13:39.590,1:13:41.590 +thing if it's German you have to + +1:13:42.150,1:13:43.949 +put the + +1:13:43.949,1:13:47.479 +You know the verb at the end of the sentence whereas in English, it might be at the beginning + +1:13:48.060,1:13:51.109 +So basically you use this attention mechanism + +1:13:51.110,1:13:57.440 +so this attention module here is the one that I showed a couple slides earlier which basically decides + +1:13:58.739,1:14:04.428 +Which of the time steps which of the hidden representation for which other word in the input sentence it is going to focus on + +1:14:06.570,1:14:12.259 +To kind of produce a representation that is going to produce the current word at a particular time step + +1:14:12.260,1:14:15.320 +So here we're at time step number three, we're gonna produce a third word + +1:14:16.140,1:14:21.829 +And we're gonna have to decide which of the input word corresponds to this and we're gonna have this attention mechanism + +1:14:21.830,1:14:23.830 +so essentially we're gonna have a + +1:14:25.140,1:14:28.759 +Small piece of neural net that's going to look at the the inputs on this side + +1:14:31.809,1:14:35.879 +It's going to have an output which is going to go through a soft max that is going to produce a bunch of + +1:14:35.979,1:14:42.269 +Coefficients that sum to 1 between 0 and 1 and they're going to compute a linear combination of the states at different time steps + +1:14:43.719,1:14:48.899 +Ok by setting one of those coefficients to 1 and the other ones to 0 it is going to focus the attention of the system on + +1:14:48.900,1:14:50.900 +one particular word + +1:14:50.949,1:14:56.938 +So the magic of this is that this neural net that decides that runs to the softmax and decides on those coefficients actually + +1:14:57.159,1:14:59.159 +Can be trained with back prop is just another + +1:14:59.590,1:15:03.420 +Set of weights in a neural net and you don't have to built it by hand. 
It just figures it out + +1:15:06.550,1:15:10.979 +This completely revolutionized the field of neural machine translation in the sense that + +1:15:11.889,1:15:13.889 +within a + +1:15:14.050,1:15:20.309 +Few months team from Stanford won a big competition with this beating all the other methods + +1:15:22.119,1:15:28.199 +And then within three months every big company that works on translation had basically deployed systems based on this + +1:15:29.289,1:15:31.469 +So this just changed everything + +1:15:33.189,1:15:40.349 +And then people started paying attention to attention, okay pay more attention to attention in a sense that + +1:15:41.170,1:15:44.879 +And then there was a paper by a bunch of people at Google + +1:15:45.729,1:15:52.529 +What the title was attention is all you need and It was basically a paper that solved a bunch of natural language processing tasks + +1:15:53.050,1:15:59.729 +by using a neural net where every layer, every group of neurons basically was implementing attention and that's what a + +1:16:00.459,1:16:03.149 +Or something called self attention. That's what a transformer is + +1:16:08.829,1:16:15.449 +Yes, you can have a variable number of outputs of inputs that you focus attention on + +1:16:18.340,1:16:20.849 +Okay, I'm gonna talk now about memory networks + +1:16:35.450,1:16:40.309 +So this stems from work at Facebook that was started by Antoine Bordes + +1:16:41.970,1:16:43.970 +I think in 2014 and + +1:16:45.480,1:16:47.480 +By + +1:16:49.650,1:16:51.799 +Sainbayar Sukhbaatar, I + +1:16:56.760,1:16:58.760 +Think in 2015 or 16 + +1:16:59.040,1:17:01.040 +Called end-to-end memory networks + +1:17:01.520,1:17:06.890 +Sainbayar Sukhbaatar was a PhD student here and it was an intern at Facebook when he worked on this + +1:17:07.650,1:17:10.220 +together with a bunch of other people Facebook and + +1:17:10.860,1:17:12.090 +the idea of memory + +1:17:12.090,1:17:17.270 +Network is that you'd like to have a short-term memory you'd like your neural net to have a short-term memory or working memory + +1:17:18.300,1:17:23.930 +Okay, you'd like it to you know, you you tell okay, if I tell you a story I tell you + +1:17:25.410,1:17:27.410 +John goes to the kitchen + +1:17:28.170,1:17:30.170 +John picks up the milk + +1:17:34.440,1:17:36.440 +Jane goes to the kitchen + +1:17:37.290,1:17:40.910 +And then John goes to the bedroom and drops the milk there + +1:17:41.430,1:17:44.899 +And then goes back to the kitchen and ask you. Where's the milk? Okay + +1:17:44.900,1:17:47.720 +so every time I had told you a sentence you kind of + +1:17:48.330,1:17:50.330 +updated in your mind a + +1:17:50.340,1:17:52.340 +Kind of current state of the world if you want + +1:17:52.920,1:17:56.870 +and so by telling you the story you now you have a representation of the state to the world and if I ask you a + +1:17:56.870,1:17:59.180 +Question about the state of the world you can answer it. Okay + +1:18:00.270,1:18:02.270 +You store this in a short-term memory + +1:18:03.720,1:18:06.769 +You didn't store it, ok, so there's kind of this + +1:18:06.770,1:18:10.399 +There's a number of different parts in your brain, but it's two important parts, one is the cortex + +1:18:10.470,1:18:13.279 +The cortex is where you have long term memory. 
Where you + +1:18:15.120,1:18:17.120 +You know you + +1:18:17.700,1:18:22.129 +Where all your your thinking is done and all that stuff and there is a separate + +1:18:24.720,1:18:26.460 +You know + +1:18:26.460,1:18:28.879 +Chunk of neurons called the hippocampus which is sort of + +1:18:29.100,1:18:32.359 +Its kind of two formations in the middle of the brain and they kind of send + +1:18:34.320,1:18:36.650 +Wires to pretty much everywhere in the cortex and + +1:18:37.110,1:18:44.390 +The hippocampus is thought that to be used as a short-term memory. So it can just you know, remember things for relatively short time + +1:18:45.950,1:18:47.450 +The prevalent + +1:18:47.450,1:18:53.530 +theory is that when you when you sleep and you dream there's a lot of information that is being transferred from your + +1:18:53.810,1:18:56.800 +hippocampus to your cortex to be solidified in long-term memory + +1:18:59.000,1:19:01.090 +Because the hippocampus has limited capacity + +1:19:04.520,1:19:08.859 +When you get senile like you get really old very often your hippocampus shrinks and + +1:19:09.620,1:19:13.570 +You don't have short-term memory anymore. So you keep repeating the same stories to the same people + +1:19:14.420,1:19:16.420 +Okay, it's very common + +1:19:19.430,1:19:25.930 +Or you go to a room to do something and by the time you get to the room you forgot what you were there for + +1:19:29.450,1:19:31.869 +This starts happening by the time you're 50, by the way + +1:19:36.290,1:19:40.390 +So, I don't remember what I said last week of two weeks ago, um + +1:19:41.150,1:19:44.950 +Okay, but anyway, so memory network, here's the idea of memory network + +1:19:46.340,1:19:50.829 +You have an input to the memory network. Let's call it X and think of it as an address + +1:19:51.770,1:19:53.770 +Of the memory, okay + +1:19:53.930,1:19:56.409 +What you're going to do is you're going to compare this X + +1:19:58.040,1:20:03.070 +With a bunch of vectors, we're gonna call K + +1:20:08.180,1:20:10.180 +So k₁ k₂ k₃ + +1:20:12.890,1:20:18.910 +Okay, so you compare those two vectors and the way you compare them is via dot product very simple + +1:20:28.460,1:20:33.460 +Okay, so now you have the three dot products of all the three Ks with the X + +1:20:34.730,1:20:37.990 +They are scalar values, you know plug them to a softmax + +1:20:47.630,1:20:50.589 +So what you get are three numbers between 0 & 1 that sum to 1 + +1:20:53.840,1:20:59.259 +What you do with those you have 3 other vectors that I'm gonna call V + +1:21:00.680,1:21:02.680 +v₁, v₂ and v₃ + +1:21:03.770,1:21:07.120 +And what you do is you multiply + +1:21:08.990,1:21:13.570 +These vectors by those scalars, so this is very much like the attention mechanism that we just talked about + +1:21:17.870,1:21:20.950 +Okay, and you sum them up + +1:21:27.440,1:21:34.870 +Okay, so take an X compare X with each of the K each of the Ks those are called keys + +1:21:39.170,1:21:44.500 +You get a bunch of coefficients between the zero and one that sum to one and then compute a linear combination of the values + +1:21:45.260,1:21:47.260 +Those are value vectors + +1:21:50.510,1:21:51.650 +And + +1:21:51.650,1:21:53.150 +Sum them up + +1:21:53.150,1:22:00.400 +Okay, so imagine that one of the key exactly matches X you're gonna have a large coefficient here and small coefficients there + +1:22:00.400,1:22:06.609 +So the output of the system will essentially be V2, if K 2 matches X the output would essentially be V 2 + +1:22:08.060,1:22:09.500 +Okay + 
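A minimal sketch of the soft lookup just described, with made-up sizes: the input x is compared to every key by dot product, the scores go through a softmax, and the output is the resulting weighted sum of the value vectors. Because K and V are ordinary tensors, gradients flow into them, which is what "writing" to this memory by gradient descent means below.

import torch

d, n = 16, 3                       # vector dimension, number of memory slots
K = torch.randn(n, d)              # keys k_1..k_n
V = torch.randn(n, d)              # values v_1..v_n
x = torch.randn(d)                 # query, the "address" X

alpha = K @ x                      # alpha_i = k_i^T x
c = torch.softmax(alpha, dim=0)    # coefficients between 0 and 1, summing to 1
out = c @ V                        # sum_i c_i v_i  (close to v_j when x matches k_j)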
+1:22:09.500,1:22:11.890 +So this is an addressable associative memory + +1:22:12.620,1:22:19.419 +Associative memory is exactly that where you have keys with values and if your input matches a key you get the value here + +1:22:19.420,1:22:21.420 +It's a kind of soft differentiable version of that + +1:22:26.710,1:22:28.710 +So you can + +1:22:29.019,1:22:34.559 +you can back propagate to this you can you can write into this memory by changing the V vectors or + +1:22:34.929,1:22:38.609 +Even changing the K vectors. You can change the V vectors by gradient descent + +1:22:39.489,1:22:45.598 +Okay, so if you wanted the output of the memory to be something in particular by backpropagating gradient through this + +1:22:47.019,1:22:52.259 +you're going to change the currently active V to whatever it needs for the + +1:22:53.530,1:22:55.530 +for the output + +1:22:56.050,1:22:58.050 +So in those papers + +1:22:59.800,1:23:02.460 +What what they did was I + +1:23:03.969,1:23:06.299 +Mean there's a series of papers on every network, but + +1:23:08.409,1:23:11.879 +What they did was exactly scenario I just explained where you you kind of + +1:23:12.909,1:23:16.319 +Tell a story to a system so give it a sequence of sentences + +1:23:17.530,1:23:22.800 +Those sentences are encoded into vectors by running through a neural net which is not pre-trained, you know + +1:23:25.269,1:23:29.279 +it just through the training of the entire system it figures out how to encode this + +1:23:30.039,1:23:35.009 +and then those sentences are written to the memory of this type and + +1:23:35.829,1:23:41.129 +Then when you ask a question to the system you encode the question at the input of a neural net, the neural net produces + +1:23:41.130,1:23:44.999 +An X to the memory the memory returns a value + +1:23:46.510,1:23:47.590 +And + +1:23:47.590,1:23:49.480 +Then you use this value + +1:23:49.480,1:23:54.329 +and the previous state of the network to kind of reaccess the memory, you can do this multiple times and + +1:23:54.550,1:23:58.139 +You train this entire network to produce or an answer to your to your question + +1:23:59.139,1:24:03.748 +And if you have lots and lots of scenarios lots and lots of questions or also lots of answers + +1:24:04.119,1:24:10.169 +Which they did in this case with by artificially generating stories questions and answers + +1:24:11.440,1:24:12.940 +this thing actually + +1:24:12.940,1:24:15.989 +learns to store stories and + +1:24:16.780,1:24:18.760 +answer questions + +1:24:18.760,1:24:20.409 +Which is pretty amazing + +1:24:20.409,1:24:22.409 +So that's the memory Network + +1:24:27.110,1:24:29.860 +Okay, so the first step is you compute + +1:24:32.210,1:24:34.300 +Alpha I equals + +1:24:36.590,1:24:43.899 +KI transpose X. Okay, just a dot product. Okay, and then you compute + +1:24:48.350,1:24:51.519 +CI or the vector C I should say + +1:24:54.530,1:24:57.579 +Is the softmax function + +1:25:00.320,1:25:02.979 +Applied to the vector of alphas, okay + +1:25:02.980,1:25:07.840 +So the C's are between 0 and 1 and sum to 1 and then the output of the system + +1:25:09.080,1:25:11.080 +is + +1:25:11.150,1:25:13.360 +sum over I of + +1:25:14.930,1:25:16.930 +Ci + +1:25:17.240,1:25:21.610 +Vi where Vis are the value vectors. Okay. That's the memory + +1:25:30.420,1:25:34.489 +Yes, yes, yes, absolutely + +1:25:37.140,1:25:38.640 +Not really + +1:25:38.640,1:25:41.869 +No, I mean all you need is everything to be encoded as vectors? 
+ +1:25:42.660,1:25:48.200 +Right and so run for your favorite convnet, you get a vector that represents the image and then you can do the QA + +1:25:50.880,1:25:52.880 +Yeah, I mean so + +1:25:53.490,1:25:57.050 +You can imagine lots of applications of this so in particular + +1:25:58.110,1:26:00.110 +When application is I + +1:26:00.690,1:26:02.690 +Mean you can you can think of + +1:26:06.630,1:26:09.109 +You know think of this as a kind of a memory + +1:26:11.160,1:26:14.000 +And then you can have some sort of neural net + +1:26:16.020,1:26:16.970 +That you know + +1:26:16.970,1:26:24.230 +it takes takes an input and then produces an address for the memory gets a value back and + +1:26:25.050,1:26:27.739 +Then keeps growing and eventually produces an output + +1:26:28.830,1:26:30.830 +This was very much like a computer + +1:26:31.050,1:26:33.650 +Ok. Well the neural net here is the + +1:26:34.920,1:26:37.099 +the CPU the ALU the CPU + +1:26:37.680,1:26:43.099 +Ok, and the memory is just an external memory you can access whenever you need it, or you can write to it if you want + +1:26:43.890,1:26:49.040 +It's a recurrent net in this case. You can unfold it in time, which is what these guys did + +1:26:51.330,1:26:52.650 +And + +1:26:52.650,1:26:58.009 +And then so then there are people who kind of imagined that you could actually build kind of differentiable computers out of this + +1:26:58.410,1:27:03.530 +There's something called neural Turing machine, which is essentially a form of this where the memory is not of this type + +1:27:03.530,1:27:07.040 +It's kind of a soft tape like in a regular Turing machine + +1:27:07.890,1:27:14.030 +This is somewhere from deep mind that the interesting story about this which is that the facebook people put out + +1:27:14.760,1:27:19.909 +The paper on the memory network on arxiv and three days later + +1:27:22.110,1:27:24.110 +The deepmind people put out a paper + +1:27:25.290,1:27:30.679 +About neural Turing machine and the reason they put three days later is that they've been working on the all Turing machine and + +1:27:31.350,1:27:32.640 +in their + +1:27:32.640,1:27:37.160 +Tradition they kind of keep project secret unless you know until they can make a big splash + +1:27:37.770,1:27:40.699 +But there they got scooped so they put the paper out on arxiv + +1:27:45.060,1:27:50.539 +Eventually, they made a big splash with another with a paper but that was a year later or so + +1:27:52.230,1:27:54.230 +So what's happened + +1:27:55.020,1:28:01.939 +since then is that people have kind of taken this module this idea that you compare inputs to keys and + +1:28:02.550,1:28:04.550 +that gives you coefficients and + +1:28:04.950,1:28:07.819 +You know you you produce values + +1:28:08.520,1:28:09.990 +as + +1:28:09.990,1:28:14.449 +Kind of a essential module in a neural net and that's basically where the transformer is + +1:28:15.060,1:28:18.049 +so a transformer is basically a neural net in which + +1:28:19.290,1:28:21.290 +Every group of neurons is one of those + +1:28:21.720,1:28:29.449 +It's a it's a whole bunch of memories. Essentially. There's some more twist to it. Okay, but that's kind of the basic the basic idea + +1:28:32.460,1:28:34.460 +But you'll hear about this + +1:28:34.980,1:28:36.750 +in a week Oh + +1:28:36.750,1:28:38.250 +in two weeks + +1:28:38.250,1:28:40.140 +one week one week + +1:28:40.140,1:28:42.140 +Okay any more questions? + +1:28:44.010,1:28:46.640 +Cool. All right. 
Thank you very much diff --git a/docs/pt/week06/practicum06.sbv b/docs/pt/week06/practicum06.sbv new file mode 100644 index 000000000..f05301b02 --- /dev/null +++ b/docs/pt/week06/practicum06.sbv @@ -0,0 +1,1742 @@ +0:00:00.030,0:00:03.959 +so today we are gonna be covering quite a lot of materials so I will try not to + +0:00:03.959,0:00:08.309 +run but then yesterday young scooped me completely so young talked about exactly + +0:00:08.309,0:00:15.269 +the same things I wanted to talk today so I'm gonna go a bit faster please slow + +0:00:15.269,0:00:18.210 +me down if you actually are somehow lost okay + +0:00:18.210,0:00:21.420 +so I will just try to be a little bit faster than you sir + +0:00:21.420,0:00:26.250 +so today we are gonna be talking about recurrent neural networks record neural + +0:00:26.250,0:00:31.050 +networks are one type of architecture we can use in order to be to deal with + +0:00:31.050,0:00:37.430 +sequences of data what are sequences what type of signal is a sequence + +0:00:39.890,0:00:44.219 +temporal is a temporal component but we already seen data with temporal + +0:00:44.219,0:00:49.350 +component how what are they called what dimensional what is the dimension + +0:00:49.350,0:00:55.320 +of that kind of signal so on the convolutional net lesson we have seen + +0:00:55.320,0:00:59.969 +that a signal could be one this signal to this signal 3d signal based on the + +0:00:59.969,0:01:06.270 +domain and the domain is what you map from to go to right so temporal handling + +0:01:06.270,0:01:10.580 +sequential sequences of data is basically dealing with one the data + +0:01:10.580,0:01:15.119 +because the domain is going to be just the temporal axis nevertheless you can + +0:01:15.119,0:01:18.689 +also use RNN to deal with you know two dimensional data you have double + +0:01:18.689,0:01:28.049 +Direction okay okay so this is a classical neural network in the diagram + +0:01:28.049,0:01:33.299 +that is I'm used to draw where I represent each in this case bunch of + +0:01:33.299,0:01:37.590 +neurons like each of those is a vector and for example the X is my input vector + +0:01:37.590,0:01:42.450 +it's in pink as usual then I have my hidden layer in a green in the center + +0:01:42.450,0:01:46.200 +then I have my final blue eared lane layer which is the output network so + +0:01:46.200,0:01:52.320 +this is a three layer neural network in my for my notation and so if some of you + +0:01:52.320,0:01:57.960 +are familiar with digital electronics this is like talking about a + +0:01:57.960,0:02:03.329 +combinatorial logic your current output depends only on the current input and + +0:02:03.329,0:02:08.420 +that's it there is no there is no other input instead when we + +0:02:08.420,0:02:12.590 +are talking about our men we are gonna be talking about something that looks + +0:02:12.590,0:02:17.420 +like this in this case our output here on the right hand side depends on the + +0:02:17.420,0:02:21.860 +current input and on the state of the system and again if you're a king of + +0:02:21.860,0:02:26.750 +digital electronics this is simply sequential logic whereas you have an + +0:02:26.750,0:02:31.580 +internal state the onion is the dimension flip-flop if you have no idea + +0:02:31.580,0:02:37.040 +what a flip-flop you know check it out it's just some very basic memory unit in + +0:02:37.040,0:02:41.810 +digital electronics nevertheless this is the only difference right in the first + +0:02:41.810,0:02:45.290 +case you have an output which is just function of the input 
in the second case + +0:02:45.290,0:02:49.580 +you have an output which is function of the input and the state of the system + +0:02:49.580,0:02:54.130 +okay that's the big difference yeah vanilla is in American term for saying + +0:02:58.040,0:03:04.670 +it's plane doesn't have a taste that American sorry I try to be the most + +0:03:04.670,0:03:11.390 +American I can in Italy you feel taken an ice cream which is doesn't have a + +0:03:11.390,0:03:15.950 +taste it's gonna be fior di latte which is milk taste in here we don't have milk + +0:03:15.950,0:03:20.049 +tests they have vanilla taste which is the plain ice cream + +0:03:20.049,0:03:28.360 +okay Americans sorry all right so oh so let's see what does + +0:03:28.360,0:03:32.760 +it change this with young representation so young draws those kind of little + +0:03:32.760,0:03:38.170 +funky things here which represent a mapping between a TENS tensor to another + +0:03:38.170,0:03:41.800 +painter from one a vector to another vector right so there you have your + +0:03:41.800,0:03:46.630 +input vector X is gonna be mapped through this item here to this hidden + +0:03:46.630,0:03:50.620 +representation so that actually represent my fine transformation so my + +0:03:50.620,0:03:54.130 +rotation Plus this question then you have the heater representation that you + +0:03:54.130,0:03:57.850 +have another rotation is question then you get the final output right similarly + +0:03:57.850,0:04:03.220 +in the recurrent diagram you can have these additional things this is a fine + +0:04:03.220,0:04:06.640 +transformation squashing that's like a delay module with a final transformation + +0:04:06.640,0:04:10.900 +excursion and now you have the final one affine transformation and squashing + +0:04:10.900,0:04:18.100 +right these things is making noise okay sorry all right so what is the first + +0:04:18.100,0:04:24.250 +case first case is this one is a vector to sequence so we input one bubble the + +0:04:24.250,0:04:28.270 +pink wonder and then you're gonna have this evolution of the internal state of + +0:04:28.270,0:04:33.070 +the system the green one and then as the state of the system evolves you can be + +0:04:33.070,0:04:38.470 +spitting out at every time stamp one specific output what can be an example + +0:04:38.470,0:04:43.240 +of this kind of architecture so this one could be the following my input is gonna + +0:04:43.240,0:04:46.750 +be one of these images and then the output is going to be a sequence of + +0:04:46.750,0:04:53.140 +characters representing the English description of whatever this input is so + +0:04:53.140,0:04:57.940 +for example in the center when we have a herd of elephants so the last one herd + +0:04:57.940,0:05:03.880 +of elephants walking across a dry grass field so it's very very very well + +0:05:03.880,0:05:09.130 +refined right then you have in the center here for example two dogs play in + +0:05:09.130,0:05:15.640 +the in the grass maybe there are three but okay they play they're playing in + +0:05:15.640,0:05:20.500 +the grass right so it's cool in this case you know a red motorcycle park on + +0:05:20.500,0:05:24.610 +the side of the road looks more pink or you know a little + +0:05:24.610,0:05:30.490 +blow a little a little girl in the pink that is blowing bubbles that she's not + +0:05:30.490,0:05:35.650 +blowing right anything there all right and then you also have you know even + +0:05:35.650,0:05:41.560 +more wrong examples right so you have like yellow school bus parked in the + +0:05:41.560,0:05:44.050 
+parking lot well it's CL um but it's not a school + +0:05:44.050,0:05:49.860 +bus so it can be failing as well but I also can do a very very nice you know + +0:05:49.860,0:05:56.470 +you can also perform very well so this was from one input vector which is B for + +0:05:56.470,0:06:01.720 +example representation of my image to a sequence of symbols which are D for + +0:06:01.720,0:06:05.620 +example characters or words that are making here my English sentence okay + +0:06:05.620,0:06:11.440 +clear so far yeah okay another kind of usage you can have is maybe the + +0:06:11.440,0:06:17.560 +following so you're gonna have sequence two final vector okay so I don't care + +0:06:17.560,0:06:22.120 +about the intermediate sequences so okay the top right is called Auto regressive + +0:06:22.120,0:06:26.590 +network and outer regressive network is a network which is outputting an output + +0:06:26.590,0:06:29.950 +given that you feel as input the previous output okay + +0:06:29.950,0:06:33.700 +so this is called Auto regressive you have this kind of loopy part on the + +0:06:33.700,0:06:37.780 +network on the left hand side instead I'm gonna be providing several sequences + +0:06:37.780,0:06:40.140 +yeah that's gonna be the English translation + +0:06:51.509,0:06:55.380 +so you have a sequence of words that are going to make up your final sentence + +0:06:55.380,0:07:00.330 +it's it's blue there you can think about a index in a dictionary and then each + +0:07:00.330,0:07:03.300 +blue is going to tell you which word you're gonna pick on an indexed + +0:07:03.300,0:07:09.780 +dictionary right so this is a school bus right so oh yeah a yellow school bus you + +0:07:09.780,0:07:14.940 +go to a index of a then you have second index you can figure out that is yellow + +0:07:14.940,0:07:17.820 +and then school box right so the sequence here is going to be + +0:07:17.820,0:07:22.590 +representing the sequence of words the model is out on the other side there on + +0:07:22.590,0:07:26.460 +the left you're gonna have instead I keep feeding a sequence of symbols and + +0:07:26.460,0:07:30.750 +only at the end I'm gonna look what is my final output what can be an + +0:07:30.750,0:07:36.150 +application of this one so something yun also mentioned was different so let's + +0:07:36.150,0:07:40.789 +see if I can get my network to compile Python or to an open pilot own + +0:07:40.789,0:07:45.599 +interpretation so in this case I have my current input which I feed my network + +0:07:45.599,0:07:54.979 +which is going to be J equal 8580 for then for X in range eight some - J 920 + +0:07:54.979,0:07:59.430 +blah blah blah and then print this one and then my network is going to be + +0:07:59.430,0:08:04.860 +tasked with the just you know giving me twenty five thousand and eleven okay so + +0:08:04.860,0:08:09.210 +this is the final output of a program and I enforced in the network to be able + +0:08:09.210,0:08:13.860 +to output me the correct output the correct in your solution of this program + +0:08:13.860,0:08:18.330 +or even more complicated things for example I can provide a sequence of + +0:08:18.330,0:08:21.900 +other symbols which are going to be eighty eight thousand eight hundred + +0:08:21.900,0:08:26.669 +thirty seven then I have C is going to be something then I have print this one + +0:08:26.669,0:08:33.360 +if something that is always true as the other one and then you know the output + +0:08:33.360,0:08:38.849 +should be twelve thousand eight 184 right so you can train a neural net to + 
+0:08:38.849,0:08:42.690 +do these operations so you feed a sequence of symbols and then at the + +0:08:42.690,0:08:48.870 +output you just enforce that the final target should be a specific value okay + +0:08:48.870,0:08:56.190 +and these things making noise okay maybe I'm better + +0:08:56.190,0:09:02.589 +all right so what's next next is going to be for example a sequence to vector + +0:09:02.589,0:09:07.210 +to sequence this used to be the standard way of performing length language + +0:09:07.210,0:09:13.000 +translation so you start with a sequence of symbols here shown in pink so you + +0:09:13.000,0:09:17.290 +have a sequence of inputs then everything gets condensed into this kind + +0:09:17.290,0:09:23.020 +of final age which is this H over here which is going to be somehow my concept + +0:09:23.020,0:09:27.880 +right so I have a sentence I squeeze the sentence temporal information into just + +0:09:27.880,0:09:31.600 +one vector which is representing the meaning the message I'd like to send + +0:09:31.600,0:09:36.310 +across and then I get this meaning in whatever representation unrolled back in + +0:09:36.310,0:09:41.380 +a different language right so I can encode I don't know today I'm very happy + +0:09:41.380,0:09:47.350 +in English as a sequence of word and then you know you can get LG Sonoma to + +0:09:47.350,0:09:53.170 +Felicia and then I speak outside Thailand today or whatever now today I'm + +0:09:53.170,0:09:58.480 +very tired Jin Chen walk han lei or whatever ok so + +0:09:58.480,0:10:02.020 +again you have some kind of encoding then you have a compressed + +0:10:02.020,0:10:08.110 +representation and then you get like the decoding given the same compressed + +0:10:08.110,0:10:15.040 +version ok and so for example I guess language translation again recently we + +0:10:15.040,0:10:20.709 +have seen transformers and a lot of things like in the recent time so we're + +0:10:20.709,0:10:25.300 +going to cover that the next lesson I think but this used to be the state of + +0:10:25.300,0:10:31.000 +the art until few two years ago and here you can see that if you actually check + +0:10:31.000,0:10:38.950 +if you do a PCA over the latent space you have that words are grouped by + +0:10:38.950,0:10:43.630 +semantics ok so if we zoom in that region there are we're gonna see that in + +0:10:43.630,0:10:48.400 +what in the same location you find all the amounts december february november + +0:10:48.400,0:10:52.750 +whatever right if you put a few focus on a different region you get that a few + +0:10:52.750,0:10:55.250 +days next few miles and so on right so + +0:10:55.250,0:11:00.230 +different location will have some specific you know common meaning so we + +0:11:00.230,0:11:05.780 +basically see in this case how by training these networks you know just + +0:11:05.780,0:11:09.680 +with symbols they will pick up on some specific semantics + +0:11:09.680,0:11:16.130 +you know features right in this case you can see like there is a vector so the + +0:11:16.130,0:11:20.900 +vector that is connecting women to men is gonna be the same vector that is well + +0:11:20.900,0:11:27.590 +woman - man which is this one I think is gonna be equal to Queen - King right and + +0:11:27.590,0:11:32.890 +so yeah it's correct and so you're gonna have that the same distance in this + +0:11:32.890,0:11:37.730 +embedding space will be applied to things that are female and male for + +0:11:37.730,0:11:43.370 +example or in the other case you have walk-in and walked swimming and swamp so + 
+0:11:43.370,0:11:47.960 +you always have this you know specific linear transformation you can apply in + +0:11:47.960,0:11:53.690 +order to go from one type of word to the other one or this one you have the + +0:11:53.690,0:11:59.180 +connection between cities and the capitals all right so one more right I + +0:11:59.180,0:12:05.210 +think what's missing from the big picture here it's a big picture because + +0:12:05.210,0:12:09.560 +it's so large no no it's such a big picture because it's the overview okay + +0:12:09.560,0:12:18.590 +you didn't get the joke it's okay what's missing here vector to seek with no okay + +0:12:18.590,0:12:23.330 +good but no because you can still use the other one so you have this one the + +0:12:23.330,0:12:27.830 +vector is sequence to sequence right so this one is you start feeding inside + +0:12:27.830,0:12:31.580 +inputs you start outputting something right what can be an example of this + +0:12:31.580,0:12:38.900 +stuff so if you had a Nokia phone and you use the t9 you know this stuff from + +0:12:38.900,0:12:43.100 +20 years ago you have basically suggestions on what your typing is + +0:12:43.100,0:12:47.150 +you're typing right so this would be one type of these suggestions where like one + +0:12:47.150,0:12:50.570 +type of this architecture as you getting suggestions as you're typing things + +0:12:50.570,0:12:57.290 +through or you may have like speech to captions right I talked and you have the + +0:12:57.290,0:13:02.520 +things below or something very cool which is + +0:13:02.520,0:13:08.089 +the following so I start writing here the rings of Saturn glitter while the + +0:13:08.089,0:13:16.260 +harsh ice two men look at each other hmm okay they were enemies but the server + +0:13:16.260,0:13:20.100 +robots weren't okay okay hold on so this network was trained on some + +0:13:20.100,0:13:24.360 +sci-fi novels and therefore you can just type something then you let the network + +0:13:24.360,0:13:28.290 +start outputting some suggestions for you so you know if you don't know how to + +0:13:28.290,0:13:34.620 +write a book then you can you know ask your computer to help you out okay + +0:13:34.620,0:13:39.740 +that's so cool or one more that I really like it this one is fantastic I think + +0:13:39.740,0:13:45.959 +you should read read it I think so you put some kind of input there like the + +0:13:45.959,0:13:51.630 +scientist named alone what is it or the prompt right so you put in the + +0:13:51.630,0:13:56.839 +the top prompt and then you get you know this network start writing about very + +0:13:56.839,0:14:05.690 +interesting unicorns with multiple horns is called horns say unicorn right okay + +0:14:05.690,0:14:09.480 +alright let's so cool just check it out later and you can take a screenshot of + +0:14:09.480,0:14:14.970 +the screen anyhow so that was like the eye candy such that you get you know + +0:14:14.970,0:14:21.089 +hungry now let's go into be PTT which is the thing that they aren't really like + +0:14:21.089,0:14:27.390 +yesterday's PTT said okay alright let's see how this stuff works okay so on the + +0:14:27.390,0:14:31.620 +left hand side we see again this vector middle in the representation the output + +0:14:31.620,0:14:35.520 +to a fine transformation and then there we have the classical equations right + +0:14:35.520,0:14:42.450 +all right so let's see how this stuff is similar or not similar and you can't see + +0:14:42.450,0:14:46.620 +anything so for the next two seconds I will want one minute I will turn off the + 
+0:14:46.620,0:14:51.300 +lights then I turn them on [Music] + +0:14:51.300,0:14:55.570 +okay now you can see something all right so let's see what are the questions of + +0:14:55.570,0:15:00.490 +this new architecture don't stand up you're gonna be crushing yourself + +0:15:00.490,0:15:04.270 +alright so you have here the hidden representation now there's gonna be this + +0:15:04.270,0:15:10.000 +nonlinear function of this rotation of a stack version of my input which I + +0:15:10.000,0:15:15.520 +appended the previous configuration of the hidden layer okay and so this is a + +0:15:15.520,0:15:19.420 +very nice compact notation it's just I just put the two vectors one on top of + +0:15:19.420,0:15:24.640 +each other and then I sign assign I sum the bias I also and define initial + +0:15:24.640,0:15:29.920 +condition my initial H is gonna be 0 so at the beginning whenever I have t=1 + +0:15:29.920,0:15:34.360 +this stuff is gonna be settle is a vector of zeros and then I have this + +0:15:34.360,0:15:39.880 +matrix Wh is gonna be two separate matrices so sometimes you see this a + +0:15:39.880,0:15:48.130 +question is Wₕₓ times x plus Wₕₕ times h[t-1] but you can also figure out + +0:15:48.130,0:15:52.450 +that if you stock those two matrices you know one attached to the other that you + +0:15:52.450,0:15:56.620 +just put this two vertical lines completely equivalent notation but it + +0:15:56.620,0:16:01.360 +looked like very similar to whatever we had here so hidden layer is affine + +0:16:01.360,0:16:05.230 +transformation of the input inner layer is affine transformation of the input + +0:16:05.230,0:16:11.440 +and the previous value okay and then you have the final output is going to be + +0:16:11.440,0:16:20.140 +again my final rotation so I'm gonna turn on the light so no magic so far + +0:16:20.140,0:16:27.690 +right you're okay right you're with me to shake the heads what about the others + +0:16:27.690,0:16:34.930 +no yes okay whatever so this one is simply on the right hand + +0:16:34.930,0:16:40.330 +side I simply unroll over time such that you can see how things are just not very + +0:16:40.330,0:16:43.990 +crazy like this loop here is not actually a loop this is like a + +0:16:43.990,0:16:48.500 +connection to next time steps right so that around + +0:16:48.500,0:16:52.760 +arrow means is just this right arrow so this is a neural net it's dinkley a + +0:16:52.760,0:16:57.950 +neural net which is extended in in length rather also not only in a in a + +0:16:57.950,0:17:01.639 +thickness right so you have a network that is going this direction input and + +0:17:01.639,0:17:05.600 +output but as you can think as there's been an extended input and this been an + +0:17:05.600,0:17:10.220 +extended output while all these intermediate weights are all share right + +0:17:10.220,0:17:14.120 +so all of these weights are the same weights and then you use this kind of + +0:17:14.120,0:17:17.510 +shared weights so it's similar to a convolutional net in the sense that you + +0:17:17.510,0:17:21.410 +had this parameter sharing right across different time domains because you + +0:17:21.410,0:17:28.820 +assume there is some kind of you know stationarity right of the signal make + +0:17:28.820,0:17:32.870 +sense so this is a kind of convolution right you can see how this is kind of a + +0:17:32.870,0:17:40.130 +convolution alright so that was kind of you know a little bit of the theory we + +0:17:40.130,0:17:46.160 +already seen that so let's see how this works for a practical example so in 
this + +0:17:46.160,0:17:51.830 +case we we are just reading this code here so this is world language model you + +0:17:51.830,0:17:57.770 +can find it at the PyTorch examples so you have a sequence of symbols I have + +0:17:57.770,0:18:01.910 +just represented there every symbol is like a letter in the alphabet and then + +0:18:01.910,0:18:05.419 +the first part is gonna be basically splitting this one in this way right + +0:18:05.419,0:18:10.309 +so you preserve vertically in the time domain but then I split the long long + +0:18:10.309,0:18:16.640 +long sequence such that I can now chop I can use best bets bets how do you say + +0:18:16.640,0:18:21.980 +computation so the first thing you have the best size is gonna be 4 in this case + +0:18:21.980,0:18:27.410 +and then I'm gonna be getting in my first batch and then I will force the + +0:18:27.410,0:18:33.650 +network to be able to so this will be my best back propagation through time + +0:18:33.650,0:18:38.270 +period and I will force the network to output the next sequence of characters + +0:18:38.270,0:18:44.510 +ok so given that I have a,b,c, I will force my network to say d given that I have + +0:18:44.510,0:18:50.000 +g,h,i, I will force the network to come up with j. Given m,n,o, + +0:18:50.000,0:18:54.980 +I want p, given s,t,u, I want v. So how can you actually make + +0:18:54.980,0:18:59.660 +sure you understand what I'm saying whenever you are able to predict my next + +0:18:59.660,0:19:04.010 +world you're actually able to you know you basically know in already what I'm + +0:19:04.010,0:19:11.720 +saying right yeah so by trying to predict an upcoming word you're going to + +0:19:11.720,0:19:15.170 +be showing some kind of comprehension of whatever is going to be this temporal + +0:19:15.170,0:19:22.700 +information in the data all right so after we get the beds we have so how + +0:19:22.700,0:19:26.510 +does it work let's actually see you know and about a bit of a detail this is + +0:19:26.510,0:19:30.650 +gonna be my first output is going to be a batch with four items I feed this + +0:19:30.650,0:19:34.220 +inside the near corner all night and then my neural net we come up with a + +0:19:34.220,0:19:39.740 +prediction of the upcoming sample right and I will force that one to be my b,h,n,t + +0:19:39.740,0:19:47.450 +okay then I'm gonna be having my second input I will provide the previous + +0:19:47.450,0:19:53.420 +hidden state to the current RNN I will feel these inside and then I expect to + +0:19:53.420,0:19:58.670 +get the second line of the output the target right and then so on right I get + +0:19:58.670,0:20:03.410 +the next state and sorry the next input I get the next state and then I'm gonna + +0:20:03.410,0:20:07.700 +get inside the neural net the RNN I which I will try to force to get the + +0:20:07.700,0:20:13.840 +final target okay so far yeah each one is gonna be the output of the + +0:20:18.730,0:20:28.280 +internet recurrent neural net right I'll show you the equation before you have h[1] + +0:20:28.280,0:20:43.460 +comes out from this one right second the output I'm gonna be forcing the output + +0:20:43.460,0:20:48.170 +actually to be my target my next word in the sequence of letters right so I have + +0:20:48.170,0:20:52.610 +a sequence of words force my network to predict what's the next word given the + +0:20:52.610,0:21:02.480 +previous word know h1 is going to be fed inside here and you stuck the next word + +0:21:02.480,0:21:07.880 +the next word together with the previous state and then you'll do 
a rotation of + +0:21:07.880,0:21:13.670 +the previous word with a previous sorry the new word with the next state the new + +0:21:13.670,0:21:17.720 +word with the previous state you'll do our rotation here find transformation + +0:21:17.720,0:21:21.230 +right and then you apply the non-linearity so you always get a new + +0:21:21.230,0:21:25.610 +word that is the current X and then you get the previous state just to see in + +0:21:25.610,0:21:30.650 +what state the system once and then you output a new output right and so we are + +0:21:30.650,0:21:35.000 +in this situation here we have a bunch of inputs I have my first input and then + +0:21:35.000,0:21:39.200 +I get the first output I have this internal memory that is sent forward and + +0:21:39.200,0:21:44.240 +then this network will now be aware of what happened here and then I input the + +0:21:44.240,0:21:49.450 +next input and so on I get the next output and I force the output to be the + +0:21:49.450,0:21:57.040 +output here the value inside the batch ok alright what's missing now + +0:21:57.070,0:22:00.160 +[Music] this is for PowerPoint drawing + +0:22:02.890,0:22:08.370 +constraint all right what's happening now so here I'm gonna be sending the + +0:22:08.370,0:22:13.300 +here I just drawn an arrow with the final h[T] but there is a slash on the + +0:22:13.300,0:22:16.780 +arrow what is the slash on the arrow who can + +0:22:16.780,0:22:27.100 +understand what the slash mean of course there will be there is gonna be the next + +0:22:27.100,0:22:31.570 +batch they're gonna be starting from here D and so on this is gonna be my + +0:22:31.570,0:22:46.690 +next batch d,j,p,v e,k,q,w and f,l,r,x. This slash here means do not back + +0:22:46.690,0:22:51.550 +propagate through okay so that one is gonna be calling dot detach in Porsche + +0:22:51.550,0:22:56.560 +which is gonna be stopping the gradient to be you know propagated back to + +0:22:56.560,0:23:01.450 +forever okay so this one say know that and so whenever I get the sorry no no + +0:23:01.450,0:23:06.970 +gradient such that when I input the next gradient the first input here it's gonna + +0:23:06.970,0:23:11.530 +be this guy over here and also of course without gradient such that we don't have + +0:23:11.530,0:23:17.170 +an infinite length RNN okay make sense yes + +0:23:17.170,0:23:24.640 +no I assume it's a yes okay so vanishing and exploding + +0:23:24.640,0:23:30.730 +gradients we touch them upon these also yesterday so again I'm kind of going a + +0:23:30.730,0:23:35.620 +little bit faster to the intent user so let's see how this works + +0:23:35.620,0:23:40.390 +so usually for our recurrent neural network you have an input you have a + +0:23:40.390,0:23:45.160 +hidden layer and then you have an output then this value of here how do you get + +0:23:45.160,0:23:50.680 +this information through here what what what does this R represent do you + +0:23:50.680,0:23:55.840 +remember the equation of the hidden layer so the new hidden layer is gonna + +0:23:55.840,0:24:01.050 +be the previous hidden layer which we rotate + +0:24:03.100,0:24:08.030 +alright so we rotate the previous hidden layer and so how do you rotate hidden + +0:24:08.030,0:24:15.220 +layers matrices right and so every time you see all ads on tile arrow there is a + +0:24:15.220,0:24:21.920 +rotation there is a matrix now if the you know this matrix can + +0:24:21.920,0:24:26.900 +change the sizing of your final output right so if you think about perhaps + +0:24:26.900,0:24:31.190 +let's say the determinant 
right if the terminal is unitary it's a mapping the + +0:24:31.190,0:24:34.610 +same areas for the same area if it's larger than one they're going to be + +0:24:34.610,0:24:39.560 +getting you know this radians to getting larger and larger or if it's smaller + +0:24:39.560,0:24:44.660 +than I'm gonna get these gradients to go to zero whenever you perform the back + +0:24:44.660,0:24:48.920 +propagation in this direction okay so the problem here is that whenever we do + +0:24:48.920,0:24:53.390 +is send gradients back so the gains are going to be going down like that are + +0:24:53.390,0:24:57.800 +gonna be going like down like this then down like this way and down like this + +0:24:57.800,0:25:01.610 +way and also all down this way and so on right so the gradients are going to be + +0:25:01.610,0:25:06.380 +always going against the direction of the arrow in H ro has a matrix inside + +0:25:06.380,0:25:11.510 +right and again this matrix will affect how these gradients propagate and that's + +0:25:11.510,0:25:18.590 +why you can see here although we have a very bright input that one like gets + +0:25:18.590,0:25:23.720 +lost through oh well if you have like a gradient coming down here the gradient + +0:25:23.720,0:25:30.410 +gets you know kill over time okay so how do we fix that to fix this one we simply + +0:25:30.410,0:25:40.420 +remove the matrices in this horizontal operation does it make sense no yes no + +0:25:40.420,0:25:47.630 +the problem is that the next hidden state will have you know its own input + +0:25:47.630,0:25:52.910 +memory coming from the previous step through a matrix multiplication now this + +0:25:52.910,0:25:58.760 +matrix multiplication will affect what's gonna be the gradient that comes in the + +0:25:58.760,0:26:02.630 +other direction okay so whenever you have an output here you + +0:26:02.630,0:26:06.740 +have a final loss now you have the grade that are gonna be going against the + +0:26:06.740,0:26:12.050 +arrows up to the input the problem is that this gradient which is going + +0:26:12.050,0:26:16.910 +through the in the opposite direction of these arrows will be multiplied by the + +0:26:16.910,0:26:22.460 +matrix right the transpose of the matrix and again these matrices will affect + +0:26:22.460,0:26:26.030 +what is the overall norm of this gradient right and it will be all + +0:26:26.030,0:26:28.310 +killing it you have vanishing gradient or you're + +0:26:28.310,0:26:32.690 +gonna have exploding the gradient which is going to be whenever is going to be + +0:26:32.690,0:26:37.880 +getting amplified right so in order to be avoiding that we have to avoid so you + +0:26:37.880,0:26:41.960 +can see this is a very deep network so recurrently our network where the first + +0:26:41.960,0:26:45.320 +deep networks back in the night is actually and the word + +0:26:45.320,0:26:49.850 +depth was actually in time which and of course they were facing the same issues + +0:26:49.850,0:26:54.350 +we face with deep learning in modern day days where ever we were still like + +0:26:54.350,0:26:58.450 +stacking several layers we were observing that the gradients get lost as + +0:26:58.450,0:27:05.750 +depth right so how do we solve gradient getting lost through the depth in a + +0:27:05.750,0:27:08.770 +current days skipping constant connection right the + +0:27:11.270,0:27:15.530 +receiver connections we use and similarly here we can use skip + +0:27:15.530,0:27:21.860 +connections as well when we go down well up in in time okay so let's see how this + 
+0:27:21.860,0:27:30.500 +works yeah so the problem is that the + +0:27:30.500,0:27:34.250 +gradients are only going in the backward paths right back + +0:27:34.250,0:27:38.990 +[Music] well the gradient has to go the same way + +0:27:38.990,0:27:42.680 +it went forward by the opposite direction right I mean you're computing + +0:27:42.680,0:27:46.970 +chain rule so if you have a function of a function of a function then you just + +0:27:46.970,0:27:52.220 +use those functions to go back right the point is that whenever you have these + +0:27:52.220,0:27:55.790 +gradients coming back they will not have to go through matrices therefore also + +0:27:55.790,0:28:01.250 +the forward part has not doesn't have to go through the matrices meaning that the + +0:28:01.250,0:28:07.310 +memory cannot go through matrix multiplication if you don't want to have + +0:28:07.310,0:28:11.770 +this effect when you perform back propagation okay + +0:28:14.050,0:28:19.420 +yeah it's gonna be worth much better working I show you in the next slide + +0:28:19.420,0:28:22.539 +[Music] show you next slide + +0:28:27.740,0:28:32.270 +so how do we fix this problem well instead of using one recurrent neural + +0:28:32.270,0:28:36.650 +network we're gonna using for recurrent neural network okay so the first + +0:28:36.650,0:28:41.510 +RNN on the first network is gonna be the one that goes + +0:28:41.510,0:28:46.370 +from the input to this intermediate state then I have other three networks + +0:28:46.370,0:28:51.410 +and each of those are represented by these three symbols 1 2 & 3. + +0:28:51.410,0:28:56.870 +okay think about this as our open mouth and it's like a closed mouth okay like + +0:28:56.870,0:29:04.580 +the emoji okay so if you use this kind of for net for recurrent neural network + +0:29:04.580,0:29:09.740 +be regular Network you gotta have for example from the input I send things + +0:29:09.740,0:29:14.390 +through in the open mouth therefore it gets here I have a closed mouth here so + +0:29:14.390,0:29:18.920 +nothing goes forward then I'm gonna have this open mouth here such that the + +0:29:18.920,0:29:23.600 +history goes forward so the history gets sent forward without going through a + +0:29:23.600,0:29:29.120 +neural network matrix multiplication it just gets through our open mouth and + +0:29:29.120,0:29:34.670 +all the other inputs find a closed mouth so the hidden state will not change upon + +0:29:34.670,0:29:40.820 +new inputs okay and then here you're gonna have a open mouth here such that + +0:29:40.820,0:29:44.960 +you can get the final output here then the open mouth keeps going here such + +0:29:44.960,0:29:48.560 +that you have another output there and then finally you get the last closed + +0:29:48.560,0:29:54.620 +mouth at the last one now if you perform back prop you will have the gradients + +0:29:54.620,0:29:58.880 +flowing through the open mouth and you don't get any kind of matrix + +0:29:58.880,0:30:04.400 +multiplication so now let's figure out how these open mouths are represented + +0:30:04.400,0:30:10.010 +how are they instantiated in like in in terms of mathematics is it clear design + +0:30:10.010,0:30:13.130 +right so now we are using open and closed mouths and each of those mouths + +0:30:13.130,0:30:17.880 +is plus the the first guy here that connects the input to the hidden are + +0:30:17.880,0:30:25.580 +brn ends so these on here that is a gated recurrent network it's simply for + +0:30:25.580,0:30:32.060 +normal recurrent neural network combined in a clever way such 
that you have + +0:30:32.060,0:30:37.920 +multiplicative interaction and not matrix interaction is it clear so far + +0:30:37.920,0:30:42.000 +this is like intuition I haven't shown you how all right so let's figure out + +0:30:42.000,0:30:48.570 +who made this and how it works okay so we're gonna see now those long short + +0:30:48.570,0:30:55.530 +term memory or gated recurrent neural networks so I'm sorry okay that was the + +0:30:55.530,0:30:59.730 +dude okay this is the guy who actually invented this stuff actually him and his + +0:30:59.730,0:31:07.620 +students back some in 1997 and we were drinking here together okay all right so + +0:31:07.620,0:31:14.010 +that is the question of a recurrent neural network and on the top left are + +0:31:14.010,0:31:18.000 +you gonna see in the diagram so I just make a very compact version of this + +0:31:18.000,0:31:23.310 +recurrent neural network here is going to be the collection of equations that + +0:31:23.310,0:31:27.840 +are expressed in a long short term memory they look a little bit dense so I + +0:31:27.840,0:31:32.970 +just draw it for you here okay let's actually goes through how this stuff + +0:31:32.970,0:31:36.320 +works so I'm gonna be drawing an interactive + +0:31:36.320,0:31:40.500 +animation here so you have your input gate here which is going to be an affine + +0:31:40.500,0:31:43.380 +transformation so all of these are recurrent Network write the same + +0:31:43.380,0:31:49.920 +equation I show you here so this input transformation will be multiplying my C + +0:31:49.920,0:31:55.440 +tilde which is my candidate gate here I have a don't forget gate which is + +0:31:55.440,0:32:01.920 +multiplying my previous value of my cell memory and then my Poppa stylist maybe + +0:32:01.920,0:32:08.100 +don't forget previous plus input ii i'm gonna show you now how it works then i + +0:32:08.100,0:32:12.600 +have my final hidden representations to be multiplication element wise + +0:32:12.600,0:32:17.850 +multiplication between my output gate and my you know whatever hyperbolic + +0:32:17.850,0:32:22.740 +tangent version of the cell such that things are bounded and then I have + +0:32:22.740,0:32:26.880 +finally my C tilde which is my candidate gate is simply + +0:32:26.880,0:32:31.110 +Anette right so you have one recurrent network one that modulates the output + +0:32:31.110,0:32:35.730 +one that modulates this is don't forget gate and this is the input gate + +0:32:35.730,0:32:40.050 +so all this interaction between the memory and the gates is a multiplicative + +0:32:40.050,0:32:44.490 +interaction and this forget input and don't forget the input and output are + +0:32:44.490,0:32:48.780 +all sigmoids and therefore they are going from 0 to 1 so I can multiply by a + +0:32:48.780,0:32:53.340 +0 you have a closed mouth or you can multiply by 1 if it's open mouth right + +0:32:53.340,0:33:00.120 +if you think about being having our internal linear volume which is below + +0:33:00.120,0:33:06.120 +minus 5 or above plus 5 okay such that you using the you use the gate in the + +0:33:06.120,0:33:11.940 +saturated area or 0 or 1 right you know the sigmoid so let's see how this stuff + +0:33:11.940,0:33:16.260 +works this is the output let's turn off the + +0:33:16.260,0:33:20.450 +output how do I do turn off the output I simply put a 0 + +0:33:20.450,0:33:26.310 +inside so let's say I have a purple internal representation see I put a 0 + +0:33:26.310,0:33:29.730 +there in the output gate the output is going to be multiplying a 0 with + 
+0:33:29.730,0:33:36.300 +something you get 0 okay then let's say I have a green one I have one then I + +0:33:36.300,0:33:40.830 +multiply one with the purple I get purple and then finally I get the same + +0:33:40.830,0:33:46.170 +value similarly I can control the memory and I can for example we set it in this + +0:33:46.170,0:33:51.240 +case I'm gonna be I have my internal memory see this is purple and then I + +0:33:51.240,0:33:57.450 +have here my previous guy which is gonna be blue I guess I have a zero here and + +0:33:57.450,0:34:01.500 +therefore the multiplication gives me a zero there I have here a zero so + +0:34:01.500,0:34:05.190 +multiplication is gonna be giving a zero at some two zeros and I get a zero + +0:34:05.190,0:34:09.690 +inside of memory so I just erase the memory and you get the zero there + +0:34:09.690,0:34:15.210 +otherwise I can keep the memory I still do the internal thing I did a new one + +0:34:15.210,0:34:19.919 +but I keep a wonder such that the multiplication gets blue the Sun gets + +0:34:19.919,0:34:25.649 +blue and then I keep sending out my bloom finally I can write such that I + +0:34:25.649,0:34:31.110 +can get a 1 in the input gate the multiplication gets purple then the I + +0:34:31.110,0:34:35.010 +set a zero in the don't forget such that the + +0:34:35.010,0:34:40.679 +we forget and then multiplication gives me zero I some do I get purple and then + +0:34:40.679,0:34:45.780 +I get the final purple output okay so here we control how to send how to write + +0:34:45.780,0:34:50.850 +in memory how to reset the memory and how to output something okay so we have + +0:34:50.850,0:35:04.770 +all different operation this looks like a computer - and in an yeah it is + +0:35:04.770,0:35:08.700 +assumed in this case to show you like how the logic works as we are like + +0:35:08.700,0:35:14.250 +having a value inside the sigmoid has been or below minus 5 or being above + +0:35:14.250,0:35:27.780 +plus 5 such that we are working as a switch 0 1 switch okay the network can + +0:35:27.780,0:35:32.790 +choose to use this kind of operation to me make sense I believe this is the + +0:35:32.790,0:35:37.110 +rationale behind how this network has been put together the network can decide + +0:35:37.110,0:35:42.690 +to do anything it wants usually they do whatever they want but this seems like + +0:35:42.690,0:35:46.800 +they can work at least if they've had to saturate the gates it looks like things + +0:35:46.800,0:35:51.930 +can work pretty well so in the remaining 15 minutes of kind of I'm gonna be + +0:35:51.930,0:35:56.880 +showing you two notebooks I kind of went a little bit faster because again there + +0:35:56.880,0:36:04.220 +is much more to be seen here in the notebooks so yeah + +0:36:10.140,0:36:17.440 +so this the the actual weight the actual gradient you care here is gonna be the + +0:36:17.440,0:36:21.970 +gradient with respect to previous C's right the thing you care is gonna be + +0:36:21.970,0:36:25.000 +basically the partial derivative of the current seen with respect to previous + +0:36:25.000,0:36:30.160 +C's such that you if you have the original initial C here and you have + +0:36:30.160,0:36:35.140 +multiple C over time you want to change something in the original C you still + +0:36:35.140,0:36:39.130 +have the gradient coming down all the way until the first C which comes down + +0:36:39.130,0:36:43.740 +to getting gradients through that matrix Wc here right so if you want to change + +0:36:46.660,0:36:52.089 +those weights here you just 
go through the chain of multiplications that are + +0:36:52.089,0:36:56.890 +not involving any matrix multiplication as such that you when you get the + +0:36:56.890,0:37:00.490 +gradient it still gets multiplied by one all the time and it gets down to + +0:37:00.490,0:37:05.760 +whatever we want to do okay did I answer your question + +0:37:09.150,0:37:16.660 +so the matrices will change the amplitude of your gradient right so if + +0:37:16.660,0:37:22.000 +you have like these largest eigenvalue being you know 0.0001 every time you + +0:37:22.000,0:37:26.079 +multiply you get the norm of this vector getting killed right so you have like an + +0:37:26.079,0:37:31.569 +exponential decay in this case if my forget gate is actually always equal to + +0:37:31.569,0:37:37.510 +1 then you get c = c-t. What is the partial + +0:37:37.510,0:37:43.299 +derivative of c[t]/c[t-1]? + +0:37:43.299,0:37:48.579 +1 right so the parts of the relative that is the + +0:37:48.579,0:37:52.390 +thing that you actually multiply every time there's gonna be 1 so output + +0:37:52.390,0:37:57.609 +gradient output gradients can be input gradients right yeah i'll pavillions + +0:37:57.609,0:38:01.510 +gonna be implicit because it would apply the output gradient by the derivative of + +0:38:01.510,0:38:05.599 +this module right if the this module is e1 then the thing that is + +0:38:05.599,0:38:14.660 +here keeps going that is the rationale behind this now this is just for drawing + +0:38:14.660,0:38:24.710 +purposes I assumed it's like a switch okay such that I can make things you + +0:38:24.710,0:38:29.089 +know you have a switch on and off to show like how it should be working maybe + +0:38:29.089,0:38:46.579 +doesn't work like that but still it works it can work this way right yeah so + +0:38:46.579,0:38:50.089 +that's the implementation of pro question is gonna be simply you just pad + +0:38:50.089,0:38:55.069 +all the other sync when sees with zeros before the sequence so if you have + +0:38:55.069,0:38:59.920 +several several sequences yes several sequences that are of a different length + +0:38:59.920,0:39:03.619 +you just put them all aligned to the right + +0:39:03.619,0:39:08.960 +and then you put some zeros here okay such that you always have in the last + +0:39:08.960,0:39:14.599 +column the latest element if you put two zeros here it's gonna be a mess in right + +0:39:14.599,0:39:17.299 +in the code if you put the zeros in the in the beginning you just stop doing + +0:39:17.299,0:39:21.319 +back propagation when you hit the last symbol right so you start from here you + +0:39:21.319,0:39:25.460 +go back here so you go forward then you go back prop and stop whenever you + +0:39:25.460,0:39:29.599 +actually reach the end of your sequence if you pad on the other side you get a + +0:39:29.599,0:39:34.730 +bunch of drop there in the next ten minutes so you're gonna be seen two + +0:39:34.730,0:39:45.049 +notebooks if you don't have other questions okay wow you're so quiet okay + +0:39:45.049,0:39:49.970 +so we're gonna be going now for sequence classification alright so in this case + +0:39:49.970,0:39:54.589 +I'm gonna be I just really stuff loud out loud the goal is to classify a + +0:39:54.589,0:40:00.259 +sequence of elements sequence elements and targets are represented locally + +0:40:00.259,0:40:05.660 +input vectors with only one nonzero bit so it's a one hot encoding the sequence + +0:40:05.660,0:40:10.770 +starts with a B for beginning and end with a E and otherwise consists of a + 
+0:40:10.770,0:40:16.370 +randomly chosen symbols from a set {a, b, c, d} which are some kind of noise + +0:40:16.370,0:40:22.380 +expect for two elements in position t1 and t2 this position can be either or X + +0:40:22.380,0:40:29.460 +or Y in for the hard difficulty level you have for example that the sequence + +0:40:29.460,0:40:35.220 +length length is chose randomly between 100 and 110 10 t1 is randomly chosen + +0:40:35.220,0:40:40.530 +between 10 and 20 Tinto is randomly chosen between 50 and 60 there are four + +0:40:40.530,0:40:47.010 +sequences classes Q, R, S and U which depends on the temporal order of x and y so if + +0:40:47.010,0:40:53.520 +you have X,X you can be getting a Q. X,Y you get an R. Y,X you get an S + +0:40:53.520,0:40:57.750 +and Y,Y get U. You so we're going to be doing a sequence classification based on + +0:40:57.750,0:41:03.720 +the X and y or whatever those to import to these kind of triggers okay + +0:41:03.720,0:41:08.370 +and in the middle in the middle you can have a,b,c,d in random positions like you + +0:41:08.370,0:41:12.810 +know randomly generated is it clear so far so we do cast a classification of + +0:41:12.810,0:41:23.180 +sequences where you may have these X,X X,Y Y,X ou Y,Y. So in this case + +0:41:23.210,0:41:29.460 +I'm showing you first the first input so the return type is a tuple of sequence + +0:41:29.460,0:41:36.780 +of two which is going to be what is the output of this example generator and so + +0:41:36.780,0:41:43.050 +let's see what is what is this thing here so this is my data I'm going to be + +0:41:43.050,0:41:48.030 +feeding to the network so I have 1, 2, 3, 4, 5, 6, 7, 8 + +0:41:48.030,0:41:54.180 +different symbols here in a row every time why there are eight we + +0:41:54.180,0:42:02.970 +have X and Y and a, b, c and d beginning and end. So we have one hot out of you + +0:42:02.970,0:42:08.400 +know eight characters and then i have a sequence of rows which are my sequence + +0:42:08.400,0:42:12.980 +of symbols okay in this case you can see here i have a beginning with all zeros + +0:42:12.980,0:42:19.260 +why is all zeros padding right so in this case the sequence was shorter than + +0:42:19.260,0:42:21.329 +the expect the maximum sequence in the bed + +0:42:21.329,0:42:29.279 +and then the first first sequence has an extra zero item at the beginning in them + +0:42:29.279,0:42:34.859 +you're gonna have like in this case the second item is of the two a pole to pole + +0:42:34.859,0:42:41.160 +is the corresponding best class for example I have a batch size of 32 and + +0:42:41.160,0:42:51.930 +then I'm gonna have an output size of 4. Why 4 ? Q, R, S and U. + +0:42:51.930,0:42:57.450 +so I have 4 a 4 dimensional target vector and I have a sequence of 8 + +0:42:57.450,0:43:04.499 +dimensional vectors as input okay so let's see how this sequence looks like + +0:43:04.499,0:43:12.779 +in this case is gonna be BbXcXcbE. 
So X,X let's see X X X X is Q + +0:43:12.779,0:43:18.569 +right so we have our Q sequence and that's why the final target is a Q the 1 + +0:43:18.569,0:43:25.019 +0 0 0 and then you're gonna see B B X C so the second item and the second last + +0:43:25.019,0:43:30.390 +is gonna be B lowercase B you can see here the second item and the second last + +0:43:30.390,0:43:36.390 +item is going to be a be okay all right so let's now create a recurrent Network + +0:43:36.390,0:43:41.249 +in a very quick way so here I can simply say my recurrent network is going to be + +0:43:41.249,0:43:47.369 +torch and an RNN and I'm gonna be using a reader network really non-linearity + +0:43:47.369,0:43:52.709 +and then I have my final linear layer in the other case I'm gonna be using a led + +0:43:52.709,0:43:57.119 +STM and then I'm gonna have a final inner layer so I just execute these guys + +0:43:57.119,0:44:07.920 +I have my training loop and I'm gonna be training for 10 books so in the training + +0:44:07.920,0:44:13.259 +group you can be always looking for those five different steps first step is + +0:44:13.259,0:44:18.900 +gonna be get the data inside the model right so that's step number one what is + +0:44:18.900,0:44:30.669 +step number two there are five steps we remember hello + +0:44:30.669,0:44:35.089 +you feel that you feed the network if you feed the network with some data then + +0:44:35.089,0:44:41.539 +what do you do you compute the loss okay then we have compute step to compute the + +0:44:41.539,0:44:52.549 +loss fantastic number three is zero the cash right then number four which is + +0:44:52.549,0:45:09.699 +computing the off yes lost dog backwards lost not backward don't compute the + +0:45:09.699,0:45:16.449 +partial derivative of the loss with respect to the network's parameters yeah + +0:45:16.449,0:45:27.380 +here backward finally number five which is step in opposite direction of the + +0:45:27.380,0:45:31.819 +gradient okay all right those are the five steps you always want to see in any + +0:45:31.819,0:45:37.909 +training blueprint if someone is missing then you're [ __ ] up okay so we try now + +0:45:37.909,0:45:42.469 +the RNN and the LSTM and you get something looks like this + +0:45:42.469,0:45:55.929 +so our NN goes up to 50% in accuracy and then the LSTM got 100% okay oh okay + +0:45:56.439,0:46:06.019 +first of all how many weights does this LSTM have compared to the RNN four + +0:46:06.019,0:46:11.059 +times more weights right so it's not a fair comparison I would say because LSTM + +0:46:11.059,0:46:16.819 +is simply for rnns combined somehow right so this is a two layer neural + +0:46:16.819,0:46:20.659 +network whereas the other one is at one layer right always both ever like it has + +0:46:20.659,0:46:25.009 +one hidden layer they are an end if Alice TM we can think about having two + +0:46:25.009,0:46:33.199 +hidden so again one layer two layers well one hidden to lead in one set of + +0:46:33.199,0:46:37.610 +parameters four sets of the same numbers like okay not fair okay anyway + +0:46:37.610,0:46:43.610 +let's go with hundred iterations okay so now I just go with 100 iterations and I + +0:46:43.610,0:46:49.490 +show you how if they work or not and also when I be just clicking things such + +0:46:49.490,0:46:56.000 +that we have time to go through stuff okay now my computer's going to be + +0:46:56.000,0:47:02.990 +complaining all right so again what are the five types of operations like five + +0:47:02.990,0:47:06.860 +okay now is already done sorry I was 
going to do okay so this is + +0:47:06.860,0:47:16.280 +the RNN right RNN and finally actually gave to 100% okay so iron and it just + +0:47:16.280,0:47:20.030 +let it more time like a little bit more longer training actually works the other + +0:47:20.030,0:47:26.060 +one okay and here you can see that we got 100% in twenty eight bucks okay the + +0:47:26.060,0:47:30.650 +other case we got 2,100 percent in roughly twice as long + +0:47:30.650,0:47:35.690 +twice longer at a time okay so let's first see how they perform here so I + +0:47:35.690,0:47:42.200 +have this sequence BcYdYdaE which is a U sequence and then we ask the network + +0:47:42.200,0:47:46.760 +and he actually meant for actually like labels it as you okay so below we're + +0:47:46.760,0:47:51.140 +gonna be seeing something very cute so in this case we were using sequences + +0:47:51.140,0:47:56.870 +that are very very very very small right so even the RNN is able to train on + +0:47:56.870,0:48:02.390 +these small sequences so what is the point of using a LSTM well we can first + +0:48:02.390,0:48:07.430 +of all increase the difficulty of the training part and we're gonna see that + +0:48:07.430,0:48:13.280 +the RNN can be miserably failing whereas the LSTM keeps working in this + +0:48:13.280,0:48:19.790 +visualization part below okay I train a network now Alice and LSTM now with the + +0:48:19.790,0:48:26.000 +moderate level which has eighty symbols rather than eight or ten ten symbols so + +0:48:26.000,0:48:31.430 +you can see here how this model actually managed to succeed at the end although + +0:48:31.430,0:48:38.870 +there is like a very big spike and I'm gonna be now drawing the value of the + +0:48:38.870,0:48:43.970 +cell state over time okay so I'm going to be input in our sequence of eighty + +0:48:43.970,0:48:49.090 +symbols and I'm gonna be showing you what is the value of the hidden state + +0:48:49.090,0:48:53.330 +hidden State so in this case I'm gonna be showing you + +0:48:53.330,0:48:56.910 +[Music] hidden hold on + +0:48:56.910,0:49:01.140 +yeah I'm gonna be showing I'm gonna send my input through a hyperbolic tangent + +0:49:01.140,0:49:06.029 +such that if you're below minus 2.5 I'm gonna be mapping to minus 1 if you're + +0:49:06.029,0:49:12.329 +above 2.5 you get mapped to plus 1 more or less and so let's see how this stuff + +0:49:12.329,0:49:18.029 +looks so in this case here you can see that this specific hidden layer picked + +0:49:18.029,0:49:27.720 +on the X here and then it became red until you got the other X right so this + +0:49:27.720,0:49:33.710 +is visualizing the internal state of the LSD and so you can see that in specific + +0:49:33.710,0:49:39.599 +unit because in this case I use hidden representation like hidden dimension of + +0:49:39.599,0:49:47.700 +10 and so in this case the 1 2 3 4 5 the fifth hidden unit of the cell lay the + +0:49:47.700,0:49:52.829 +5th cell actually is trigger by observing the first X and then it goes + +0:49:52.829,0:49:58.410 +quiet after seen the other acts this allows me to basically you know take + +0:49:58.410,0:50:07.440 +care of I mean recognize if the sequence is U, P, R or S. 
Okay does it make sense okay + +0:50:07.440,0:50:14.519 +oh this one more notebook I'm gonna be showing just quickly which is the 09-echo_data + +0:50:14.519,0:50:22.410 +in this case I'm gonna be in South corner I'm gonna have a network echo in + +0:50:22.410,0:50:27.059 +whatever I'm saying so if I say something I asked a network to say if I + +0:50:27.059,0:50:30.960 +say something I asked my neighbor to say if I say something I ask ok Anderson + +0:50:30.960,0:50:42.150 +right ok so in this case here and I'll be inputting this is the first sequence + +0:50:42.150,0:50:50.579 +is going to be 0 1 1 1 1 0 and you'll have the same one here 0 1 1 1 1 0 and I + +0:50:50.579,0:50:57.259 +have 1 0 1 1 0 1 etc right so in this case if you want to output something + +0:50:57.259,0:51:00.900 +after some right this in this case is three time + +0:51:00.900,0:51:06.809 +step after you have to have some kind of short-term memory where you keep in mind + +0:51:06.809,0:51:11.780 +what I just said where you keep in mind what I just said where you keep in mind + +0:51:11.780,0:51:16.890 +[Music] what I just said yeah that's correct so + +0:51:16.890,0:51:22.099 +you know pirating actually requires having some kind of working memory + +0:51:22.099,0:51:27.569 +whereas the other one the language model which it was prompted prompted to say + +0:51:27.569,0:51:33.539 +something that I haven't already said right so that was a different kind of + +0:51:33.539,0:51:38.700 +task you actually had to predict what is the most likely next word in keynote you + +0:51:38.700,0:51:42.329 +cannot be always right right but this one you can always be right you know + +0:51:42.329,0:51:49.079 +this is there is no random stuff anyhow so I have my first batch here and then + +0:51:49.079,0:51:53.549 +the sec the white patch which is the same similar thing which is shifted over + +0:51:53.549,0:52:01.319 +time and then we have we have to chunk this long long long sequence so before I + +0:52:01.319,0:52:05.250 +was sending a whole sequence inside the network and I was enforcing the final + +0:52:05.250,0:52:09.569 +target to be something right in this case I had to chunk if the sequence goes + +0:52:09.569,0:52:13.319 +this direction I had to chunk my long sequence in little chunks and then you + +0:52:13.319,0:52:18.869 +have to fill the first chunk keep trace of whatever is the hidden state send a + +0:52:18.869,0:52:23.549 +new chunk where you feed and initially as the initial hidden state the output + +0:52:23.549,0:52:28.319 +of this chant right so you feed this chunk you have a final hidden state then + +0:52:28.319,0:52:33.960 +you feed this chunk and as you put you have to put these two as input to the + +0:52:33.960,0:52:38.430 +internal memory right now you feed the next chunk where you put this one as + +0:52:38.430,0:52:44.670 +input as to the internal state and you we are going to be comparing here RNN + +0:52:44.670,0:52:57.059 +with analyst TMS I think so at the end here you can see that okay we managed to + +0:52:57.059,0:53:02.789 +actually get we are an n/a accuracy that goes 100 100 percent then if you are + +0:53:02.789,0:53:08.220 +starting now to mess with the size of the memory chunk with a memory interval + +0:53:08.220,0:53:11.619 +you can be seen with the LSTM you can keep this memory + +0:53:11.619,0:53:16.399 +for a long time as long as you have enough capacity the RNN after you reach + +0:53:16.399,0:53:22.880 +some kind of length you start forgetting what happened in the past and it was + 
+0:53:22.880,0:53:29.809 +pretty much everything for today so stay warm wash your hands and I'll see you + +0:53:29.809,0:53:34.929 +next week bye bye From d0143adaf2c36f009bd3639962256320518ffd8a Mon Sep 17 00:00:00 2001 From: Leon Silva Date: Fri, 19 Nov 2021 20:59:13 -0300 Subject: [PATCH 2/3] [PT] Adding weeks 03 to 06 --- docs/_config.yml | 18 + docs/pt/week03/03-1.md | 487 +++++ docs/pt/week03/03-2.md | 476 +++++ docs/pt/week03/03-3.md | 285 +++ docs/pt/week03/03.md | 40 + docs/pt/week03/lecture03.sbv | 3429 ++++++++++++++++++++++++++++++ docs/pt/week03/practicum03.sbv | 1751 ++++++++++++++++ docs/pt/week04/04-1.md | 596 ++++++ docs/pt/week04/04.md | 18 + docs/pt/week04/practicum04.sbv | 1517 ++++++++++++++ docs/pt/week05/05-1.md | 451 ++++ docs/pt/week05/05-2.md | 512 +++++ docs/pt/week05/05-3.md | 490 +++++ docs/pt/week05/05.md | 40 + docs/pt/week05/lecture05.sbv | 3572 ++++++++++++++++++++++++++++++++ docs/pt/week05/practicum05.sbv | 1241 +++++++++++ docs/pt/week06/06-1.md | 285 +++ docs/pt/week06/06-2.md | 586 ++++++ docs/pt/week06/06-3.md | 734 +++++++ docs/pt/week06/06.md | 36 + docs/pt/week06/lecture06.sbv | 3338 +++++++++++++++++++++++++++++ docs/pt/week06/practicum06.sbv | 1742 ++++++++++++++++ 22 files changed, 21644 insertions(+) create mode 100644 docs/pt/week03/03-1.md create mode 100644 docs/pt/week03/03-2.md create mode 100644 docs/pt/week03/03-3.md create mode 100644 docs/pt/week03/03.md create mode 100644 docs/pt/week03/lecture03.sbv create mode 100644 docs/pt/week03/practicum03.sbv create mode 100644 docs/pt/week04/04-1.md create mode 100644 docs/pt/week04/04.md create mode 100644 docs/pt/week04/practicum04.sbv create mode 100644 docs/pt/week05/05-1.md create mode 100644 docs/pt/week05/05-2.md create mode 100644 docs/pt/week05/05-3.md create mode 100644 docs/pt/week05/05.md create mode 100644 docs/pt/week05/lecture05.sbv create mode 100644 docs/pt/week05/practicum05.sbv create mode 100644 docs/pt/week06/06-1.md create mode 100644 docs/pt/week06/06-2.md create mode 100644 docs/pt/week06/06-3.md create mode 100644 docs/pt/week06/06.md create mode 100644 docs/pt/week06/lecture06.sbv create mode 100644 docs/pt/week06/practicum06.sbv diff --git a/docs/_config.yml b/docs/_config.yml index 94f33a605..cf1b3725f 100644 --- a/docs/_config.yml +++ b/docs/_config.yml @@ -757,6 +757,24 @@ pt: - path: pt/week02/02-1.md - path: pt/week02/02-2.md - path: pt/week02/02-3.md + - path: pt/week03/03.md + sections: + - path: pt/week03/03-1.md + - path: pt/week03/03-2.md + - path: pt/week03/03-3.md + - path: pt/week04/04.md + sections: + - path: pt/week04/04-1.md + - path: pt/week05/05.md + sections: + - path: pt/week05/05-1.md + - path: pt/week05/05-2.md + - path: pt/week05/05-3.md + - path: pt/week06/06.md + sections: + - path: pt/week06/06-1.md + - path: pt/week06/06-2.md + - path: pt/week06/06-3.md ################################## Hungarian ################################### hu: diff --git a/docs/pt/week03/03-1.md b/docs/pt/week03/03-1.md new file mode 100644 index 000000000..e887a7733 --- /dev/null +++ b/docs/pt/week03/03-1.md @@ -0,0 +1,487 @@ +--- +lang: pt +lang-ref: ch.03-1 +lecturer: Yann LeCun +title: Visualização da Transformação de Parâmetros de Redes Neurais e Conceitos Fundamentais de Convoluções +authors: Jiuhong Xiao, Trieu Trinh, Elliot Silva, Calliea Pan +date: 10 Feb 2020 +typora-root-url: 03-1 +translation-date: 14 Nov 2021 +translator: Leon Solon +--- + + + + +## [Visualização de redes neurais](https://www.youtube.com/watch?v=FW5gFiJb-ig&t=5s) + + + +Nesta seção, 
visualizaremos o funcionamento interno de uma rede neural. + + + +
Network
+Fig. 1 Estrutura da rede
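A título de ilustração, segue um esboço mínimo em PyTorch (suposição nossa, não o código original da aula) de uma rede com a estrutura da Fig. 1: entrada bidimensional, 4 camadas ocultas de 2 neurônios com ReLU e uma camada de saída, de modo que cada matriz de pesos $W$ tenha dimensão 2 por 2, como descrito no parágrafo a seguir.

```python
import torch
from torch import nn

# Esboço (hipotético) da rede da Fig. 1: 2 entradas, 4 camadas ocultas de
# 2 neurônios com ReLU e uma camada de saída de 2 neurônios.
model = nn.Sequential(
    nn.Linear(2, 2), nn.ReLU(),
    nn.Linear(2, 2), nn.ReLU(),
    nn.Linear(2, 2), nn.ReLU(),
    nn.Linear(2, 2), nn.ReLU(),
    nn.Linear(2, 2),            # camada de saída
)

x = torch.randn(8, 2)           # lote de 8 pontos do plano de entrada
print(model(x).shape)           # torch.Size([8, 2])
```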
+ + + +A Figura 1 mostra a estrutura da rede neural que gostaríamos de visualizar. Normalmente, quando desenhamos a estrutura de uma rede neural, a entrada aparece na parte inferior ou à esquerda e a saída aparece na parte superior ou direita. Na Figura 1, os neurônios de cor rosa representam as entradas e os neurônios azuis representam as saídas. Nesta rede, temos 4 camadas ocultas (em verde), o que significa que temos 6 camadas no total (4 camadas ocultas + 1 camada de entrada + 1 camada de saída). Nesse caso, temos 2 neurônios por camada oculta e, portanto, a dimensão da matriz de peso ($W$) para cada camada é 2 por 2. Isso ocorre porque queremos transformar nosso plano de entrada em outro plano que possamos visualizar. + + + +
Network
+Fig. 2 Visualização do espaço dobrável
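Para reproduzir uma visualização no espírito da Fig. 2, um caminho possível (esboço nosso, reaproveitando o `model` definido no trecho anterior) é passar um grid de pontos do plano de entrada pela rede e guardar o resultado após cada ReLU:

```python
import torch
from torch import nn

# Grid de pontos do plano de entrada.
lin = torch.linspace(-1, 1, steps=50)
grid = torch.cartesian_prod(lin, lin)   # (2500, 2)

# Coleta o "plano dobrado" após cada não linearidade do `model` esboçado acima.
planos = [grid]
h = grid
with torch.no_grad():
    for camada in model:
        h = camada(h)
        if isinstance(camada, nn.ReLU):
            planos.append(h)

# Cada elemento de `planos` pode ser desenhado (com matplotlib, por exemplo)
# para observar como as camadas dobram o plano de entrada.
print([p.shape for p in planos])
```

Com os pesos recém-inicializados as dobras são arbitrárias; é depois do treinamento que elas passam a organizar os dados, como sugerem as figuras seguintes.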
+ + + +A transformação de cada camada é como dobrar nosso plano em algumas regiões específicas, conforme mostrado na Figura 2. Esse dobramento é muito abrupto, isso porque todas as transformações são realizadas na camada 2D. No experimento, descobrimos que, se tivermos apenas 2 neurônios em cada camada oculta, a otimização será mais demorada; a otimização é mais fácil se tivermos mais neurônios nas camadas ocultas. Isso nos deixa com uma questão importante a considerar: por que é mais difícil treinar a rede com menos neurônios nas camadas ocultas? Você mesmo deve considerar esta questão e retornaremos a ela após a visualização de $\texttt{ReLU}$. + + + +| Network | Network | +|(a)|(b)| + + + +
Fig. 3 Visualização do operador ReLU
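O efeito ilustrado na Fig. 3 pode ser conferido diretamente: a ReLU age componente a componente como $\max(0, x)$, zerando as coordenadas negativas (os valores abaixo são apenas ilustrativos).

```python
import torch

# ReLU ponto a ponto: coordenadas negativas são levadas a zero.
pontos = torch.tensor([[ 0.8,  0.5],
                       [-0.6,  0.4],
                       [ 0.3, -0.7],
                       [-0.5, -0.2]])
print(torch.relu(pontos))
# tensor([[0.8000, 0.5000],
#         [0.0000, 0.4000],
#         [0.3000, 0.0000],
#         [0.0000, 0.0000]])
```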
+ + + +Quando percorremos a rede, uma camada oculta de cada vez, vemos que, em cada camada, realizamos alguma transformação afim, seguida pela aplicação da operação ReLU não linear, que elimina quaisquer valores negativos. Nas Figuras 3 (a) e (b), podemos ver a visualização do operador ReLU. O operador ReLU nos ajuda a fazer transformações não lineares. Após várias etapas de realização de uma transformação afim seguida pelo operador ReLU, somos eventualmente capazes de separar linearmente os dados, como pode ser visto na Figura 4. + + + +
Network
+Fig. 4 Visualização de saídas
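Um experimento rápido torna essa afirmação concreta. O esboço abaixo é ilustrativo (dados sintéticos e hiperparâmetros escolhidos por nós): treinamos uma rede pequena em duas classes que não são linearmente separáveis no plano de entrada; como a última camada é apenas linear, uma boa acurácia indica que as transformações afins seguidas de ReLU tornaram a representação (quase) linearmente separável, como sugere a Fig. 4.

```python
import math
import torch
from torch import nn

torch.manual_seed(0)

# Duas classes não separáveis linearmente: um disco central e um anel ao redor.
n = 500
raio = torch.cat((0.3 * torch.rand(n), 0.7 + 0.3 * torch.rand(n)))
ang = 2 * math.pi * torch.rand(2 * n)
X = torch.stack((raio * torch.cos(ang), raio * torch.sin(ang)), dim=1)
y = torch.cat((torch.zeros(n, dtype=torch.long), torch.ones(n, dtype=torch.long)))

largura = 20   # camadas ocultas mais "gordas" tornam o treinamento mais fácil
modelo = nn.Sequential(
    nn.Linear(2, largura), nn.ReLU(),
    nn.Linear(largura, largura), nn.ReLU(),
    nn.Linear(largura, 2),   # a camada final é apenas linear
)
otimizador = torch.optim.Adam(modelo.parameters(), lr=1e-2)
criterio = nn.CrossEntropyLoss()

for _ in range(1000):
    otimizador.zero_grad()
    perda = criterio(modelo(X), y)
    perda.backward()
    otimizador.step()

acuracia = (modelo(X).argmax(dim=1) == y).float().mean()
print(f"acurácia de treino: {acuracia:.2f}")   # tipicamente próxima de 1.0
```

Reduzindo `largura` para 2, a otimização tende a ficar bem mais difícil, em linha com a discussão do parágrafo anterior.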
+ + + +Isso nos fornece algumas dicas sobre por que as camadas ocultas de 2 neurônios são mais difíceis de treinar. Nossa rede de 6 camadas tem um viés em cada camada oculta. Portanto, se uma dessas polarizações mover pontos para fora do quadrante superior direito, a aplicação do operador ReLU eliminará esses pontos para zero. Depois disso, não importa o quanto as camadas posteriores transformem os dados, os valores permanecerão zero. Podemos tornar uma rede neural mais fácil de treinar tornando a rede "mais gorda" - *ou seja,* adicionando mais neurônios em camadas ocultas - ou podemos adicionar mais camadas ocultas, ou uma combinação dos dois métodos. Ao longo deste curso, exploraremos como determinar a melhor arquitetura de rede para um determinado problema, fique atento. + + + +## [Transformações de parâmetro](https://www.youtube.com/watch?v=FW5gFiJb-ig&t=477s) + + + +A transformação de parâmetro geral significa que nosso vetor de parâmetro $w$ é a saída de uma função. Por meio dessa transformação, podemos mapear o espaço de parâmetro original em outro espaço. Na Figura 5, $ w $ é na verdade a saída de $H$ com o parâmetro $u$. $G (x, w)$ é uma rede e $C(y,\bar y)$ é uma função de custo. A fórmula de retropropagação também é adaptada da seguinte forma, + + + +$$ +u \leftarrow u - \eta\frac{\partial H}{\partial u}^\top\frac{\partial C}{\partial w}^\top +$$ + + + +$$ +w \leftarrow w - \eta\frac{\partial H}{\partial u}\frac{\partial H}{\partial u}^\top\frac{\partial C}{\partial w}^\top +$$ + + + +Essas fórmulas são aplicadas em forma de matriz. Observe que as dimensões dos termos devem ser consistentes. As dimensões de $u$,$w$,$\frac{\partial H}{\partial u}^\top$,$\frac{\partial C}{\partial w}^\top$ são $[N_u \times 1]$,$[N_w \times 1]$,$[N_u \times N_w]$,$[N_w \times 1]$, respectivamente. Portanto, a dimensão de nossa fórmula de retropropagação é consistente. + + + +
Network
+Fig. 5 Forma geral das transformações de parâmetros
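No PyTorch, a regra da cadeia acima é aplicada automaticamente pelo autograd quando expressamos $w = H(u)$ no grafo de computação. Abaixo, um esboço mínimo (com $H$ e $G$ lineares, escolhas nossas apenas para ilustrar) que confere numericamente que $\frac{\partial C}{\partial u} = \frac{\partial H}{\partial u}^\top\frac{\partial C}{\partial w}^\top$:

```python
import torch

torch.manual_seed(0)

N_u, N_w = 3, 5
A = torch.randn(N_w, N_u)            # H(u) = A u, logo dH/du = A  ([N_w x N_u])
u = torch.randn(N_u, requires_grad=True)

x = torch.randn(N_w)
alvo = torch.tensor(1.0)

w = A @ u                            # transformação de parâmetros: w = H(u)
y = w @ x                            # G(x, w): um modelo linear simples
custo = (y - alvo) ** 2              # C(y, \bar{y})
custo.backward()                     # autograd compõe dC/du = (dH/du)^T (dC/dw)^T

dC_dw = 2 * (y - alvo).detach() * x  # derivada de C em relação a w, calculada à mão
print(torch.allclose(u.grad, A.t() @ dC_dw))   # True
```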
+ + + +### Uma transformação de parâmetro simples: compartilhamento de peso + + + +Uma transformação de compartilhamento de peso significa que $H(u)$ apenas replica um componente de $u$ em vários componentes de $w$. $H(u)$ é como uma divisão em **Y** para copiar $u_1$ para $w_1$, $w_2$. Isso pode ser expressado como, + + + +$$ +w_1 = w_2 = u_1, w_3 = w_4 = u_2 +$$ + + + +Forçamos os parâmetros compartilhados a serem iguais, então o gradiente em relação aos parâmetros compartilhados será somado na retroprogação. Por exemplo, o gradiente da função de custo $C(y, \bar y)$ em relação a $u_1$ será a soma do gradiente da função de custo $C(y, \bar y)$ em relação a $w_1$ e o gradiente da função de custo $C(y, \bar y)$ em relação a $w_2$. + + + +### Hiper-rede + + + +Uma hiper-rede é uma rede em que os pesos de uma rede são a saída de outra rede. A Figura 6 mostra o gráfico de computação de uma "hiper-rede". Aqui, a função $H$ é uma rede com vetor de parâmetro $u$ e entrada $x$. Como resultado, os pesos de $G(x,w)$ são configurados dinamicamente pela rede $H(x,u)$. Embora seja uma ideia antiga, continua muito poderosa. + + + +
Network
+Fig. 6 "Hypernetwork"
+ + + +### Detecção de motivos em dados sequenciais + + + +A transformação de compartilhamento de peso pode ser aplicada à detecção de motivos. A detecção de motivos significa encontrar alguns motivos em dados sequenciais, como palavras-chave em voz ou texto. Uma maneira de conseguir isso, conforme mostrado na Figura 7, é usar uma janela deslizante de dados, que move a função de divisão de peso para detectar um motivo específico (*ou seja* um determinado som no sinal de fala) e as saídas (*ou seja,* uma pontuação) vai para uma função máxima. + + + +
Network
+Fig. 7 Detecção de Motivos para Dados Sequenciais
+ + + +Neste exemplo, temos 5 dessas funções. Como resultado dessa solução, somamos cinco gradientes e retropropagamos o erro para atualizar o parâmetro $w$. Ao implementar isso no PyTorch, queremos evitar o acúmulo implícito desses gradientes, então precisamos usar `zero_grad()` para inicializar o gradiente. + + + +### Detecção de motivos em imagens + + + +A outra aplicação útil é a detecção de motivos em imagens. Normalmente, passamos nossos "modelos" sobre as imagens para detectar as formas, independentemente da posição e distorção das formas. Um exemplo simples é distinguir entre "C" e "D", como mostra a Figura 8. A diferença entre "C" e "D" é que "C" tem dois pontos finais e "D" tem dois cantos. Assim, podemos projetar "modelos de terminal" e "modelos de canto". Se a forma for semelhante aos "modelos", ele terá saídas limitadas. Então, podemos distinguir as letras dessas saídas, somando-as. Na Figura 8, a rede detecta dois pontos finais e zero cantos, portanto, ativa "C". + + + +
Network
+Fig. 8 Detecção de motivos para imagens
+ + + +Também é importante que o nosso "modelo de correspondência" seja invariante ao deslocamento - quando mudamos a entrada, a saída (*ou seja,* a letra detectada) não deve mudar. Isso pode ser resolvido com a transformação do compartilhamento de peso. Como mostra a Figura 9, quando mudamos a localização de "D", ainda podemos detectar os motivos dos cantos, embora eles estejam deslocados. Ao somarmos os motivos, ele ativará a detecção "D". + + + +
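O esboço a seguir (hipotético, com nomes ilustrativos) torna a ideia concreta: um detector de pesos compartilhados aplicado em janelas deslizantes, seguido de um máximo, produz a mesma pontuação quando o motivo é deslocado; e, como as cópias usam o mesmo vetor de pesos, os gradientes se acumulam nele a cada retropropagação, daí a necessidade de zerá-los entre os passos.

```python
import torch

w = torch.randn(3, requires_grad=True)    # pesos compartilhados por todas as janelas

def pontuacao(x):
    # aplica o mesmo detector em todas as janelas de tamanho 3 e toma o máximo
    janelas = x.unfold(0, 3, 1)            # janela deslizante com passo 1
    return (janelas * w).sum(dim=1).max()

motivo = torch.tensor([1.0, -1.0, 1.0])
x1 = torch.zeros(10); x1[2:5] = motivo     # motivo na posição 2
x2 = torch.zeros(10); x2[4:7] = motivo     # mesmo motivo, deslocado

print(pontuacao(x1), pontuacao(x2))        # mesma pontuação: invariância ao deslocamento

pontuacao(x1).backward()                   # o gradiente chega ao vetor de pesos compartilhado
g1 = w.grad.clone()
pontuacao(x1).backward()                   # sem zerar, os gradientes se acumulam em w.grad
print(torch.allclose(w.grad, 2 * g1))      # True
w.grad.zero_()                             # papel do zero_grad() antes do próximo passo
```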
Network
+Fig. 9 Invariância de deslocamento
+ + + +Este método artesanal de usar detectores locais e soma para reconhecimento de dígitos foi usado por muitos anos. Mas isso nos apresenta o seguinte problema: Como podemos criar esses "modelos" automaticamente? Podemos usar redes neurais para aprender esses "modelos"? A seguir, apresentaremos o conceito de **convoluções**, ou seja, a operação que usamos para combinar imagens com "modelos". + + + +## Convolução discreta + + + +### Convolução + + + +A definição matemática precisa de uma convolução no caso unidimensional entre a entrada $x$ e $w$ é: + + + +$$y_i = \sum_j w_j x_{i-j}$$ + + + +Em palavras, a $i$-ésima saída é calculada como o produto escalar entre o $w$ **invertido** e uma janela do mesmo tamanho em $x$. Para calcular a saída completa, inicie a janela no início, avance esta janela um passo de cada vez e repita até que $x$ se esgote. + + + +### Correlação cruzada + + + +Na prática, a convenção adotada em estruturas de aprendizado profundo, como o PyTorch, é um pouco diferente. Na implementação das convoluções no PyTorch, $w$ **não é invertido**: + + + +$$y_i = \sum_j w_j x_{i+j}$$ + + + +Os matemáticos chamam essa formulação de "correlação cruzada". Em nosso contexto, essa diferença é apenas uma diferença na convenção. Praticamente, correlação cruzada e convolução podem ser intercambiáveis ​​se alguém ler os pesos armazenados na memória para frente ou para trás. + + + +Estar ciente dessa diferença é importante, por exemplo, quando se deseja fazer uso de certas propriedades matemáticas de convolução / correlação de textos matemáticos. + + + +### Convolução dimensional superior + + + +Para entradas bidimensionais, como imagens, usamos a versão bidimensional da convolução: + + + +$$y_{ij} = \sum_{kl} w_{kl} x_{i+k, j+l}$$ + + + +Essa definição pode ser facilmente estendida além de duas dimensões para três ou quatro dimensões. Aqui $w$ é chamado de *kernel de convolução* + + + +### Torções regulares que podem ser feitas com o operador convolucional em DCNNs + + + +1. **Striding**: em vez de mudar a janela em $x$ uma entrada por vez, pode-se fazer isso com um passo maior (por exemplo, duas ou três entradas por vez). +Exemplo: suponha que a entrada $x$ seja unidimensional e tenha tamanho 100 e $w$ tenha tamanho 5. O tamanho de saída com uma passada de 1 ou 2 é mostrado na tabela abaixo: + + + +| Stride | 1 | 2 | +| ------------ | -------------------------- | -------------------------- | +| Tamanho da saída: | $\frac{100 - (5-1)}{1}=96$ | $\frac{100 - (5-1)}{2}=48$ | + + + +2. **Preenchimento (Padding)**: Muito frequentemente, ao projetar arquiteturas de Redes Neurais Profundas, queremos que a saída de convolução seja do mesmo tamanho que a entrada. Isso pode ser obtido preenchendo as extremidades da entrada com um número de entradas (normalmente) de zero, geralmente em ambos os lados. O enchimento é feito principalmente por conveniência. Isso às vezes pode afetar o desempenho e resultar em efeitos de borda estranhos, ou seja, ao usar uma não linearidade ReLU, o preenchimento de zero não é irracional. + + + +## Redes Neurais de Convolução Profunda (DCNNs) + + + +Conforme descrito anteriormente, as redes neurais profundas são normalmente organizadas como alternância repetida entre operadores lineares e camadas não lineares pontuais. Em redes neurais convolucionais, o operador linear será o operador de convolução descrito acima. Também existe um terceiro tipo opcional de camada, denominado camada de pool. 
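Como verificação concreta das definições e da tabela de *striding* acima, segue um esboço (hipotético) com `torch.nn.functional.conv1d`, que também evidencia a convenção de correlação cruzada adotada pelo PyTorch e o efeito do preenchimento:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 100)                     # lote=1, 1 canal, 100 entradas
w = torch.randn(1, 1, 5)                       # 1 kernel de tamanho 5

# Tamanhos de saída da tabela acima (sem preenchimento)
print(F.conv1d(x, w, stride=1).shape)          # torch.Size([1, 1, 96])
print(F.conv1d(x, w, stride=2).shape)          # torch.Size([1, 1, 48])

# O conv1d do PyTorch implementa correlação cruzada;
# invertendo o kernel obtemos a convolução no sentido matemático
correlacao = F.conv1d(x, w)
convolucao = F.conv1d(x, w.flip(-1))
print(torch.allclose(correlacao, convolucao))  # False em geral: as convenções diferem

# Com preenchimento de (tamanho do kernel - 1)/2, a saída mantém o tamanho da entrada
print(F.conv1d(x, w, padding=2).shape)         # torch.Size([1, 1, 100])
```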
A razão para empilhar várias dessas camadas é que queremos construir uma representação hierárquica dos dados. As CNNs não precisam se limitar a processar imagens; elas também foram aplicadas com sucesso à fala e à linguagem. Tecnicamente, elas podem ser aplicadas a qualquer tipo de dado que venha na forma de arrays, desde que esses arrays satisfaçam certas propriedades.

Por que desejaríamos capturar a representação hierárquica do mundo? Porque o mundo em que vivemos é composicional. Este ponto foi mencionado nas seções anteriores. Essa natureza hierárquica pode ser observada a partir do fato de que pixels locais se reúnem para formar motivos simples, como bordas orientadas. Essas bordas, por sua vez, são montadas para formar características locais, como cantos, junções em T etc. Esses cantos e junções, por fim, são combinados para formar motivos ainda mais abstratos. Podemos continuar construindo sobre essas representações hierárquicas para, eventualmente, formar os objetos que observamos no mundo real.
CNN Features
+Figura 10. Visualização de características de uma rede convolucional treinada na ImageNet, de [Zeiler & Fergus 2013]
+ + + +Essa natureza composicional e hierárquica que observamos no mundo natural não é, portanto, apenas o resultado de nossa percepção visual, mas também verdadeira no nível físico. No nível mais baixo de descrição, temos partículas elementares, que se agrupam para formar átomos, átomos juntos formam moléculas. Continuamos a desenvolver esse processo para formar materiais, partes de objetos e, eventualmente, objetos completos no mundo físico. + + + +A natureza composicional do mundo pode ser a resposta à pergunta retórica de Einstein sobre como os humanos entendem o mundo em que vivem: + + + +> A coisa mais incompreensível sobre o universo é que ele é compreensível. + + + +O fato de os humanos entenderem o mundo graças a essa natureza composicional ainda parece uma conspiração para Yann. No entanto, argumenta-se que, sem composicionalidade, será necessário ainda mais magia para os humanos compreenderem o mundo em que vivem. Citando o grande matemático Stuart Geman: + + + +> O mundo é composicional ou Deus existe. + + + +## [Inspirações na Biologia](https://www.youtube.com/watch?v=FW5gFiJb-ig&t=2254s) + + + +Então, por que o Deep Learning deveria estar enraizado na ideia de que nosso mundo é compreensível e tem uma natureza composicional? A pesquisa conduzida por Simon Thorpe ajudou a motivar isso ainda mais. Ele mostrou que a maneira como reconhecemos objetos do cotidiano é extremamente rápida. Seus experimentos envolviam gerar um conjunto de imagens a cada 100 ms e, em seguida, pedir aos usuários que identificassem essas imagens, o que eles foram capazes de fazer com sucesso. Isso demonstrou que leva cerca de 100 ms para os humanos detectarem objetos. Além disso, considere o diagrama abaixo, ilustrando partes do cérebro anotadas com o tempo que leva para os neurônios se propagarem de uma área para a próxima: + + + +
Simon_Thorpe
+ + + +
Figura 11. Modelo de Simon Thorpe de fluxo de informações visuais no cérebro
+ + + +Os sinais passam da retina para o LGN (ajuda com aumento de contraste, controle de porta, etc.), em seguida, para o córtex visual primário V1, V2, V4 e, em seguida, para o córtex inferotemporal (PIT), que é a parte do cérebro onde categorias são definidas. As observações da cirurgia de cérebro aberto mostraram que, se você mostrar um filme a um humano, os neurônios no PIT serão disparados apenas quando detectarem certas imagens - como Jennifer Aniston ou a avó de uma pessoa - e nada mais. Os disparos neurais são invariáveis ​​a coisas como posição, tamanho, iluminação, orientação de sua avó, o que ela está vestindo, etc. + + + +Além disso, os tempos de reação rápidos com os quais os humanos foram capazes de categorizar esses itens - apenas o tempo suficiente para alguns picos passarem - demonstra que é possível fazer isso sem tempo adicional gasto em cálculos recorrentes complexos. Em vez disso, este é um único processo de feed-forward. + + + +Esses insights sugeriram que poderíamos desenvolver uma arquitetura de rede neural que é completamente feed-forward, mas ainda capaz de resolver o problema de reconhecimento, de uma forma que é invariável para transformações irrelevantes de entrada. + + + +Um outro insight do cérebro humano vem de Gallant & Van Essen, cujo modelo do cérebro humano ilustra duas vias distintas: + + + +
Gallant_and_Van_Essen
+ + + +
Figura 12. Modelo de Gallant e Van Essen das vias dorsais e ventrais no cérebro
+ + + +O lado direito mostra a via ventral, que indica o que você está olhando, enquanto o lado esquerdo mostra a via dorsal, que identifica localizações, geometria e movimento. Eles parecem bastante separados no córtex visual humano (e primata) (com algumas interações entre eles, é claro). + + + +### Contribuições de Hubel & Weisel (1962) + + + +
Hubel_and_Weisel
+ + + +
Figura 13. Experimentos de Hubel e Wiesel com estímulos visuais em cérebros de gatos
Os experimentos de Hubel e Wiesel usaram eletrodos para medir disparos neurais em cérebros de gatos em resposta a estímulos visuais. Eles descobriram que os neurônios na região V1 são sensíveis apenas a certas áreas do campo visual (chamadas de "campos receptivos") e detectam bordas orientadas nessa área. Por exemplo, eles demonstraram que, se você mostrasse ao gato uma barra vertical e começasse a girá-la, em um determinado ângulo o neurônio dispararia. Da mesma forma, conforme a barra se afasta desse ângulo, a ativação do neurônio diminui. Hubel e Wiesel denominaram esses neurônios seletivos de "células simples", por sua capacidade de detectar características locais.

Eles também descobriram que, se você mover a barra para fora do campo receptivo, aquele neurônio específico não dispara mais, mas outro neurônio o faz. Existem detectores de características locais correspondentes a todas as áreas do campo visual, daí a ideia de que o cérebro humano processa a informação visual como uma coleção de "convoluções".

Outro tipo de neurônio, que eles chamaram de "células complexas", agrega a saída de várias células simples dentro de uma determinada área. Podemos pensar nisso como o cálculo de um agregado das ativações usando uma função como o máximo, a soma, a soma dos quadrados ou qualquer outra função que não dependa da ordem. Essas células complexas detectam bordas e orientações em uma região, independentemente de onde esses estímulos estejam especificamente dentro dela. Em outras palavras, elas são invariantes a pequenos deslocamentos nas posições da entrada.

### Contribuições de Fukushima (1982)
Fukushima
+ + + +
Figura 14. Modelo CNN de Fukushima
+ + + +Fukushima foi o primeiro a implementar a ideia de múltiplas camadas de células simples e células complexas com modelos de computador, usando um conjunto de dados de dígitos escritos à mão. Alguns desses detectores de recursos foram feitos à mão ou aprendidos, embora o aprendizado usasse algoritmos de agrupamento não supervisionados, treinados separadamente para cada camada, já que a retropropagação ainda não estava em uso. + + + +Yann LeCun veio alguns anos depois (1989, 1998) e implementou a mesma arquitetura, mas desta vez os treinou em um ambiente supervisionado usando retropropagação. Isso é amplamente considerado como a gênese das redes neurais convolucionais modernas. (Observação: Riesenhuber no MIT em 1999 também redescobriu essa arquitetura, embora ele não tenha usado a retropropagação.) diff --git a/docs/pt/week03/03-2.md b/docs/pt/week03/03-2.md new file mode 100644 index 000000000..8b178c8ba --- /dev/null +++ b/docs/pt/week03/03-2.md @@ -0,0 +1,476 @@ +--- +lang: pt +lang-ref: ch.03-2 +lecturer: Yann LeCun +title: Evolução, Arquiteturas, Detalhes de Implementação e Vantagens das Redes Convolucionais. +authors: Chris Ick, Soham Tamba, Ziyu Lei, Hengyu Tang +date: 10 Feb 2020 +translation-date: 14 Nov 2021 +translator: Leon Solon +--- + + + + +## [Proto-CNNs e evolução para CNNs modernas](https://www.youtube.com/watch?v=FW5gFiJb-ig&t=2949s) + + + + +### Redes neurais protoconvolucionais em pequenos conjuntos de dados + + + +Inspirado pelo trabalho de Fukushima na modelagem do córtex visual, o uso da hierarquia celular simples / complexa combinada com treinamento supervisionado e retropropagação levou ao desenvolvimento da primeira CNN na Universidade de Toronto em '88-'89 pelo Prof. Yann LeCun. Os experimentos usaram um pequeno conjunto de dados de 320 dígitos 'escritos no mouse'. Os desempenhos das seguintes arquiteturas foram comparados: + + + +1. Camada FC única (totalmente conectada) +2. Duas camadas FC +3. Camadas conectadas localmente sem pesos compartilhados +4. Rede restrita com pesos compartilhados e conexões locais +5. Rede restrita com pesos compartilhados e conexões locais 2 (mais mapas de recursos) + + + +As redes mais bem-sucedidas (rede restrita com pesos compartilhados) tiveram a generalização mais forte e formam a base para as CNNs modernas. Enquanto isso, a camada FC simples tende a se ajustar demais. + + + + +### Primeiras ConvNets "reais" na Bell Labs + + + +Depois de se mudar para o Bell Labs, a pesquisa de LeCunn mudou para o uso de códigos postais manuscritos dos Correios dos EUA para treinar uma CNN maior: + + + +* 256 (16$\times$16) camada de entrada +* 12 5$\times$5 kernels com *stride* 2 (com passos de 2 pixels): a próxima camada tem resolução mais baixa +* **Sem** pooling em separado + + + + +### Arquitetura de rede convolucional com pooling + + + +No ano seguinte, algumas mudanças foram feitas: o pooling separado foi introduzido. O agrupamento separado é feito calculando a média dos valores de entrada, adicionando uma polarização e passando para uma função não linear (função tangente hiperbólica). O pool de 2 $ \ vezes $ 2 foi realizado com um passo de 2, reduzindo, portanto, as resoluções pela metade. + + + +
+
+ Fig. 1 Arquitetura ConvNet +
Um exemplo de uma única camada convolucional seria o seguinte:

1. Pegue uma entrada de tamanho *32$\times$32*
2. A camada de convolução passa um kernel 5$\times$5 com *stride* 1 sobre a imagem, resultando em um mapa de características de tamanho *28$\times$28*
3. Passe o mapa de características por uma função não linear: tamanho *28$\times$28*
4. Passe para a camada de agrupamento, que calcula a média sobre uma janela de 2$\times$2 com *stride* 2 (passo 2): tamanho *14$\times$14*
5. Repita 1-4 para 4 kernels

As combinações simples de convolução/pooling da primeira camada geralmente detectam características simples, como bordas orientadas. Após a primeira camada de convolução/pooling, o objetivo passa a ser detectar combinações das características das camadas anteriores. Para isso, as etapas 2 a 4 são repetidas com vários kernels sobre os mapas de características da camada anterior, e os resultados são somados em um novo mapa de características:

1. Um novo kernel 5$\times$5 desliza sobre todos os mapas de características das camadas anteriores, com os resultados somados. (Observação: no experimento do Prof. LeCun em 1989, as conexões não eram completas, por razões de custo computacional. As configurações modernas geralmente impõem conexões completas.): tamanho *10$\times$10*
2. Passe a saída da convolução por uma função não linear: tamanho *10$\times$10*
3. Repita 1-2 para 16 kernels
4. Passe o resultado para a camada de agrupamento, que calcula a média sobre uma janela de 2$\times$2 com passo 2: tamanho *5$\times$5* para cada mapa de características

Para gerar a saída, aplica-se uma última camada de convolução, que se parece com um conjunto de conexões completas (totalmente conectadas), mas na verdade é convolucional:

1. A camada de convolução final desliza um kernel 5$\times$5 sobre todos os mapas de características, com os resultados somados: tamanho *1$\times$1*
2. Passagem pela função não linear: tamanho *1$\times$1*
3. Gere uma única saída para uma categoria
4. Repita todas as etapas anteriores para cada uma das 10 categorias (em paralelo)

Veja [esta animação](http://cs231n.github.io/convolutional-networks/) no site de Andrej Karpathy sobre como as convoluções mudam a forma dos mapas de características da camada seguinte. O artigo completo pode ser encontrado [aqui](https://papers.nips.cc/paper/293-handwritten-digit-recognition-with-a-back-propagation-network.pdf).

### Equivariância de deslocamento (Shift equivariance)
+
+ Fig. 2 Equivariância de deslocamento (Shift Equivariance) +
Conforme demonstrado pela animação nos slides (aqui está outro exemplo), a translação da imagem de entrada resulta na mesma translação dos mapas de características. No entanto, esses deslocamentos nos mapas de características são reescalados pelas operações de convolução/agrupamento. Por exemplo, o agrupamento de 2$\times$2 com *stride* 2 (passo de 2) reduz um deslocamento de 1 pixel na camada de entrada a um deslocamento de 0,5 pixel nos mapas de características seguintes. A resolução espacial é, então, trocada por um número maior de tipos de características, *ou seja,* a representação torna-se mais abstrata e menos sensível a deslocamentos e distorções.

### Análise geral da arquitetura

A arquitetura genérica de uma CNN pode ser dividida em alguns arquétipos básicos de camadas:

* **Normalização**
  * Branqueamento (*whitening*) (opcional)
  * Métodos subtrativos, *por exemplo* remoção da média, filtragem passa-alta
  * Divisivos: normalização de contraste local, normalização de variância

* **Bancos de filtros**
  * Aumentam a dimensionalidade
  * Projeção em uma base supercompleta
  * Detecção de bordas

* **Não linearidades**
  * Esparsificação
  * Unidade Linear Retificada (ReLU): $\text{ReLU}(x) = \max(x, 0)$

* **Pooling**
  * Agregação sobre um mapa de características
  * Max pooling: $\text{MAX}= \text{Max}_i(X_i)$

  * Pooling de norma $L_p$: $$\text{L}_p = \left(\sum_{i=1}^n \|X_i\|^p \right)^{\frac{1}{p}}$$

  * Pooling de probabilidade logarítmica: $\text{Prob}= \frac{1}{b} \left(\sum_{i=1}^n e^{b X_i} \right)$

## [LeNet5 e reconhecimento de dígitos](https://www.youtube.com/watch?v=FW5gFiJb-ig&t=3830s)

### Implementação da LeNet5 no PyTorch

A LeNet5 consiste nas seguintes camadas (sendo 1 a camada superior):

1. Log-softmax
2. Camada totalmente conectada de dimensões 500$\times$10
3. ReLU
4. Camada totalmente conectada de dimensões (5$\times$5$\times$50)$\times$500
5. Max pooling de dimensões 2$\times$2, *stride* de 2 (passo de 2)
6. ReLU
7. Convolução com 50 canais de saída, kernel 5$\times$5, *stride* de 1
8. Max pooling de dimensões 2$\times$2, *stride* de 2
9. ReLU
10. Convolução com 20 canais de saída, kernel 5$\times$5, *stride* de 1

A entrada é uma imagem em escala de cinza de 32$\times$32 (1 canal de entrada).

A LeNet5 pode ser implementada em PyTorch com o seguinte código:

```python
import torch.nn as nn
import torch.nn.functional as F

class LeNet5(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 20, 5, 1)    # 32x32 -> 28x28, 20 canais
        self.conv2 = nn.Conv2d(20, 50, 5, 1)   # 14x14 -> 10x10, 50 canais
        self.fc1 = nn.Linear(5*5*50, 500)
        self.fc2 = nn.Linear(500, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2, 2)              # 28x28 -> 14x14
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2, 2)              # 10x10 -> 5x5
        x = x.view(-1, 5*5*50)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)

        return F.log_softmax(x, dim=1)
```

Embora `fc1` e `fc2` sejam camadas totalmente conectadas, elas podem ser consideradas camadas convolucionais cujos kernels cobrem toda a entrada. Camadas totalmente conectadas são usadas por questão de eficiência.

O mesmo código pode ser expresso usando `nn.Sequential`, mas essa forma é considerada desatualizada.

## Vantagens das CNNs

Em uma rede totalmente convolucional, não há necessidade de especificar o tamanho da entrada. No entanto, alterar o tamanho da entrada altera o tamanho da saída.

Considere um sistema de reconhecimento de escrita cursiva. Não precisamos quebrar a imagem de entrada em segmentos.
Podemos aplicar a CNN sobre a imagem inteira: os kernels cobrirão todas as posições da imagem e registrarão a mesma saída, independentemente de onde o padrão estiver localizado. Aplicar a CNN sobre uma imagem inteira é muito mais barato do que aplicá-la em vários locais separadamente. Nenhuma segmentação prévia é necessária, o que é um alívio, porque a tarefa de segmentar uma imagem é semelhante à de reconhecê-la.

### Exemplo: MNIST

A LeNet5 é treinada em imagens MNIST de tamanho 32$\times$32 para classificar dígitos individuais no centro da imagem. O aumento de dados foi aplicado deslocando o dígito, alterando o seu tamanho e inserindo dígitos ao lado. A rede também foi treinada com uma 11ª categoria, que não representa nenhuma das anteriores: as imagens rotuladas com essa categoria foram geradas produzindo imagens em branco ou colocando dígitos na lateral, mas não no centro.
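Um esboço (hipotético, não presente nas notas originais) de uma rede puramente convolucional nos moldes da LeNet5: as camadas "totalmente conectadas" viram convoluções, de modo que uma entrada mais larga produz um mapa de pontuações, como na figura a seguir.

```python
import torch
import torch.nn as nn

# Versão totalmente convolucional: as camadas "totalmente conectadas"
# são trocadas por convoluções cujos kernels cobrem todo o mapa de características
rede = nn.Sequential(
    nn.Conv2d(1, 20, 5), nn.ReLU(), nn.MaxPool2d(2, 2),
    nn.Conv2d(20, 50, 5), nn.ReLU(), nn.MaxPool2d(2, 2),
    nn.Conv2d(50, 500, 5), nn.ReLU(),   # equivale à fc1 para uma entrada 32x32
    nn.Conv2d(500, 10, 1),              # equivale à fc2: 10 pontuações por posição
)

print(rede(torch.randn(1, 1, 32, 32)).shape)  # torch.Size([1, 10, 1, 1]): uma pontuação por classe
print(rede(torch.randn(1, 1, 32, 64)).shape)  # torch.Size([1, 10, 1, 9]): um mapa de pontuações
```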
+
+ Fig. 3 ConvNet (Rede Convolucional) com janela deslizante +
+ + + +A imagem acima demonstra que uma rede LeNet5 treinada em 32$\times$32 pode ser aplicada em uma imagem de entrada de 32$\times$64 para reconhecer o dígito em vários locais. + + + + +## [Problema de vinculação de características](https://www.youtube.com/watch?v=FW5gFiJb-ig&t=4827s) + + + + +### Qual é o problema de vinculação de recursos (Feature Binding)? + + + +Cientistas da neurologia visual e especialistas em visão computacional têm o problema de definir o objeto como um objeto. Um objeto é uma coleção de recursos, mas como vincular todos os recursos para formar esse objeto? + + + + +### Como solucionar? + + + +Podemos resolver esse problema de vinculação de recursos usando um CNN muito simples: apenas duas camadas de convoluções com agrupamentos mais outras duas camadas totalmente conectadas sem nenhum mecanismo específico para isso, visto que temos não linearidades e dados suficientes para treinar nosso CNN. + + + +
+
+ Fig. 4 ConvNet solucionando a vinculação de características (Feature Binding) +
A animação acima mostra a capacidade da CNN de reconhecer diferentes dígitos movendo um único traço, demonstrando sua capacidade de resolver o problema de vinculação de características, *ou seja,* de reconhecer características de forma hierárquica e composicional.

### Exemplo: comprimento de entrada dinâmico

Podemos construir uma CNN com 2 camadas de convolução com passo 1 e 2 camadas de pooling com passo 2, de modo que o passo total seja 4. Assim, para obter uma nova saída, precisamos deslocar a janela de entrada em 4 posições. Para ser mais explícito, podemos ver a figura abaixo (unidades verdes). Primeiro, temos uma entrada de tamanho 10 e realizamos uma convolução de tamanho 3 para obter 8 unidades. Depois disso, realizamos um pooling de tamanho 2 para obter 4 unidades. Da mesma forma, repetimos a convolução e o pooling e, eventualmente, obtemos 1 saída.
+
+ Fig. 5 Arquitetura de ConvNet aplicada a entradas de tamanho variável +
+ + + +Vamos supor que adicionamos 4 unidades na camada de entrada (unidades rosa acima), de modo que possamos obter mais 4 unidades após a primeira camada de convolução, mais 2 unidades após a primeira camada de agrupamento, mais 2 unidades após a segunda camada de convolução e 1 mais saída. Portanto, o tamanho da janela para gerar uma nova saída é 4 (*stride* de 2 $\times$2). Além disso, esta é uma demonstração de que se aumentarmos o tamanho da entrada, aumentaremos o tamanho de cada camada, comprovando a capacidade das CNNs em lidar com entradas de comprimento dinâmico. + + + + +## No que as CNNs são boas + + + +CNNs são bons para sinais naturais que vêm na forma de matrizes multidimensionais e têm três propriedades principais: +1. **Localidade**: O primeiro é que existe uma forte correlação local entre os valores. Se pegarmos dois pixels próximos de uma imagem natural, é muito provável que esses pixels tenham a mesma cor. À medida que dois pixels se tornam mais distantes, a semelhança entre eles diminui. As correlações locais podem nos ajudar a detectar características locais, que é o que as CNNs estão fazendo. Se alimentarmos a CNN com pixels permutados, ela não terá um bom desempenho no reconhecimento das imagens de entrada, enquanto o FC não será afetado. A correlação local justifica conexões locais. +2. **Estacionaridade**: O segundo caráter é que as características são essenciais e podem aparecer em qualquer lugar da imagem, justificando os pesos compartilhados e pooling. Além disso, os sinais estatísticos são distribuídos uniformemente, o que significa que precisamos repetir a detecção do recurso para cada local na imagem de entrada. +3. **Composicionalidade**: O terceiro caráter é que as imagens naturais são composicionais, ou seja, os recursos compõem uma imagem de forma hierárquica. Isso justifica o uso de múltiplas camadas de neurônios, o que também corresponde intimamente às pesquisas de Hubel e Weisel sobre células simples e complexas. + + + +Além disso, as pessoas fazem bom uso das CNNs em vídeos, imagens, textos e reconhecimento de fala. \ No newline at end of file diff --git a/docs/pt/week03/03-3.md b/docs/pt/week03/03-3.md new file mode 100644 index 000000000..2adb70108 --- /dev/null +++ b/docs/pt/week03/03-3.md @@ -0,0 +1,285 @@ +--- +lang: pt +lang-ref: ch.03-3 +title: Propriedades dos sinais naturais +lecturer: Alfredo Canziani +authors: Ashwin Bhola, Nyutian Long, Linfeng Zhang, and Poornima Haridas +date: 11 Feb 2020 +translation-date: 14 Nov 2021 +translator: Leon Solon +--- + + + + +## [Propriedades dos sinais naturais](https://www.youtube.com/watch?v=kwPWpVverkw&t=26s) + + + +Todos os sinais podem ser considerados vetores. Como exemplo, um sinal de áudio é um sinal 1D $\boldsymbol{x} = [x_1, x_2, \cdots, x_T]$ onde cada valor $x_t$ representa a amplitude da forma de onda no tempo $ t $. Para entender o que alguém está falando, sua cóclea primeiro converte as vibrações da pressão do ar em sinais e, em seguida, seu cérebro usa um modelo de linguagem para converter esse sinal em uma linguagem *ou seja,* ele precisa escolher a expressão mais provável dado o sinal. Para música, o sinal é estereofônico, com 2 ou mais canais para dar a ilusão de que o som vem de várias direções. Embora tenha 2 canais, ainda é um sinal 1D porque o tempo é a única variável ao longo da qual o sinal está mudando. + + + +Uma imagem é um sinal 2D porque a informação é representada espacialmente. Observe que cada ponto pode ser um vetor em si. 
Isso significa que se temos $d$ canais em uma imagem, cada ponto espacial na imagem é um vetor de dimensão $d$. Uma imagem colorida tem planos RGB, o que significa $d = 3$. Para qualquer ponto $x_{i,j}$, isso corresponde à intensidade das cores vermelha, verde e azul, respectivamente. + + + +Podemos até representar a linguagem com a lógica acima. Cada palavra corresponde a um vetor one-hot com um na posição em que ocorre em nosso vocabulário e zeros em todas as outras. Isso significa que cada palavra é um vetor do tamanho do vocabulário. + + + +Os sinais de dados naturais seguem estas propriedades: +1. Estacionariedade: Certos motivos se repetem ao longo de um sinal. Em sinais de áudio, observamos o mesmo tipo de padrões repetidamente em todo o domínio temporal. Em imagens, isso significa que podemos esperar que padrões visuais semelhantes se repitam em toda a dimensionalidade. +2. Localidade: os pontos próximos são mais correlacionados do que os pontos distantes. Para o sinal 1D, isso significa que se observarmos um pico em algum ponto $t_i$, esperamos que os pontos em uma pequena janela em torno de $t_i$ tenham valores semelhantes a $t_i$, mas para um ponto $t_j$ longe de $t_i$, $x_{t_i}$ tem muito menos influência em $x_{t_j}$. Mais formalmente, a convolução entre um sinal e sua contraparte invertida tem um pico quando o sinal está perfeitamente sobreposto à sua versão invertida. Uma convolução entre dois sinais 1D (correlação cruzada) nada mais é do que seu produto escalar, que é uma medida de quão semelhantes ou próximos os dois vetores são. Assim, a informação está contida em porções e partes específicas do sinal. Para imagens, isso significa que a correlação entre dois pontos em uma imagem diminui à medida que afastamos os pontos. Se $x_{0,0}$ pixel for azul, a probabilidade de que o próximo pixel ($x_{1,0}, x_{0,1}$) também seja azul é muito alta, mas conforme você se move para a extremidade oposta da imagem ($x_{- 1, -1}$), o valor deste pixel é independente do valor do pixel em $x_{0,0}$. +3. Composicionalidade: Tudo na natureza é composto de partes que são compostas de sub-partes e assim por diante. Por exemplo, os caracteres formam cadeias de caracteres que formam palavras, que também formam frases. As frases podem ser combinadas para formar documentos. A composicionalidade permite que o mundo seja explicável. + + + +Se nossos dados exibem estacionariedade, localidade e composicionalidade, podemos explorá-los com redes que usam dispersão, compartilhamento de peso e empilhamento de camadas. + + + +## [Explorando propriedades de sinais naturais para construir invariância e equivariância](https://www.youtube.com/watch?v=kwPWpVverkw&t=1074s) + + + +### Localidade $\Rightarrow$ esparcidade + + + +A Fig.1 mostra uma rede de 5 camadas totalmente conectada. Cada seta representa um peso a ser multiplicado pelas entradas. Como podemos ver, essa rede é muito cara em termos computacionais. + + + +

+ Figura 1: Rede totalmente conectada
+ + + +Se nossos dados exibem localidade, cada neurônio precisa ser conectado a apenas alguns neurônios locais da camada anterior. Assim, algumas conexões podem ser interrompidas, conforme mostrado na Fig.2. A Fig.2 (a) representa uma rede FC. Aproveitando a propriedade de localidade de nossos dados, eliminamos as conexões entre neurônios distantes na Fig.2 (b). Embora os neurônios da camada oculta (verde) na Fig.2 (b) não abranjam toda a entrada, a arquitetura geral será capaz de dar conta de todos os neurônios de entrada. O campo receptivo (RF) é o número de neurônios das camadas anteriores, que cada neurônio de uma determinada camada pode ver ou levou em consideração. Portanto, o RF da camada de saída com a camada oculta é 3, o RF da camada oculta com a camada de entrada é 3, mas o RF da camada de saída com a camada de entrada é 5. + + + +| | | +|Figura 2 (a):Antes de aplicar a esparcidade | Figura 2(b): Após a aplicação da esparcidade | + + + +### Estacionariedade $\Rightarrow$ Compartilhamento de parâmetros + + + +Se nossos dados exibirem estacionariedade, poderíamos usar um pequeno conjunto de parâmetros várias vezes na arquitetura da rede. Por exemplo, em nossa rede esparsa, Fig.3 (a), podemos usar um conjunto de 3 parâmetros compartilhados (amarelo, laranja e vermelho). O número de parâmetros cairá de 9 para 3! A nova arquitetura pode até funcionar melhor porque temos mais dados para treinar esses pesos específicos. +Os pesos após a aplicação de dispersão e compartilhamento de parâmetros são chamados de kernel de convolução. + + + +| | | Figura 3 (a): Antes de Aplicar o Compartilhamento de Parâmetro | Figura 3 (b): Após aplicar o compartilhamento de parâmetro | + + + +A seguir estão algumas vantagens de usar esparsidade e compartilhamento de parâmetros: - +* Compartilhamento de parâmetros + * convergência mais rápida + * melhor generalização + * não restrito ao tamanho de entrada + * independência do kernel $\Rightarrow$ alta paralelização +* Esparsidade de conexão + * quantidade reduzida de computação + + + +A Fig.4 mostra um exemplo de kernels em dados 1D, onde o tamanho do kernel é: 2(número de kernels) * 7(espessura da camada anterior) * 3(número de conexões / pesos únicos). + + + +A escolha do tamanho do kernel é empírica. A convolução 3 * 3 parece ser o tamanho mínimo para dados espaciais. A convolução de tamanho 1 pode ser usada para obter uma camada final que pode ser aplicada a uma imagem de entrada maior. O tamanho do kernel de número par pode diminuir a qualidade dos dados, portanto, sempre temos o tamanho do kernel de números ímpares, geralmente 3 ou 5. + + + +| | | +| Figura 4 (a): Kernels em dados 1D | Figura 4 (b): Dados com Preenchimento com Zeros | + + + +### Preenchimento (Padding) + + + +O preenchimento (padding) geralmente prejudica os resultados finais, mas é conveniente programaticamente. Normalmente usamos preenchimento com zeros (zero-padding): `tamanho = (tamanho do kernel - 1)/2`. + + + +### CNN espacial padrão + + + +Uma CNN espacial padrão tem as seguintes propriedades: + + + +* Múltiplas camadas + * Convolução + * Não linearidade (ReLU e Leaky) + * Pooling + * Normalização em lote (batch normalization) +* Conexão de bypass residual + + + +A normalização em lote e as conexões de bypass residuais são muito úteis para fazer com que a rede treine bem. 
+Partes de um sinal podem se perder se muitas camadas forem empilhadas, portanto, conexões adicionais via bypass residual garantem um caminho de baixo para cima e também um caminho para gradientes vindo de cima para baixo. + + + +Na Fig.5, enquanto a imagem de entrada contém principalmente informações espaciais em duas dimensões (além das informações características, que são a cor de cada pixel), a camada de saída é espessa. No meio do caminho, há uma troca entre as informações espaciais e as informações características e a representação torna-se mais densa. Portanto, à medida que subimos na hierarquia, obtemos uma representação mais densa à medida que perdemos as informações espaciais. + + + +
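Um esboço (hipotético) de um bloco de CNN espacial padrão com os ingredientes listados acima, combinando convolução, normalização em lote, ReLU e uma conexão de bypass residual:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlocoResidual(nn.Module):
    """Convolução -> normalização em lote -> ReLU, com bypass residual."""
    def __init__(self, canais):
        super().__init__()
        self.conv = nn.Conv2d(canais, canais, 3, padding=1)  # preenchimento (3 - 1)/2 = 1
        self.bn = nn.BatchNorm2d(canais)

    def forward(self, x):
        # caminho residual (x) + caminho convolucional, preservando o tamanho espacial
        return F.relu(x + self.bn(self.conv(x)))

bloco = BlocoResidual(16)
x = torch.randn(1, 16, 32, 32)
print(bloco(x).shape)   # torch.Size([1, 16, 32, 32])
```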

+ Figura 5: Representações de informações subindo na hierarquia
+ + + +### [Pooling](https://www.youtube.com/watch?v=kwPWpVverkw&t=2376s) + + + +

+ Figura 6: Ilustração de pooling
+ + + +Um operador específico, $L_p$-norm, é aplicado a diferentes regiões (consulte a Fig.6). Esse operador fornece apenas um valor por região (1 valor para 4 pixels em nosso exemplo). Em seguida, iteramos sobre todos os dados, região por região, realizando etapas com base no passo. Se começarmos com $m * n$ dados com $c$ canais, terminaremos com $\frac{m}{2} * \frac{n}{2}$ dados ainda com $c$ canais (consulte Fig.7). +O agrupamento não é parametrizado; no entanto, podemos escolher diferentes tipos de sondagem, como pooling máximo, pooling médio e assim por diante. O objetivo principal do agrupamento reduz a quantidade de dados para que possamos computar em um período de tempo razoável. + + + +
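Um esboço (hipotético) do agrupamento descrito acima: com janela 2$\times$2 e passo 2, um mapa $m \times n$ com $c$ canais torna-se $\frac{m}{2} \times \frac{n}{2}$ com os mesmos $c$ canais, tanto para o max pooling quanto para o pooling por norma $L_p$.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 8, 10)           # c=3 canais, m=8, n=10

max_pool = nn.MaxPool2d(2, stride=2)   # agrupamento máximo
lp_pool = nn.LPPool2d(2, 2, stride=2)  # agrupamento por norma L_p com p=2

print(max_pool(x).shape)               # torch.Size([1, 3, 4, 5])
print(lp_pool(x).shape)                # torch.Size([1, 3, 4, 5])
```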

+ Figura 7: Agrupando resultados
+ + + +## CNN - Jupyter Notebook + + + +O Jupyter Notebook pode ser encontrado [aqui](https://github.com/Atcold/pytorch-Deep-Learning/blob/master/06-convnet.ipynb). Para executar o notebook, certifique-se de ter o ambiente `pDL` instalado conforme especificado em [`README.md`](https://github.com/Atcold/pytorch-Deep-Learning/blob/master/README.md) . + + + +Neste notebook, treinamos um perceptron multicamadas (rede totalmente conectada - FC) e uma rede neural convolucional (CNN) para a tarefa de classificação no conjunto de dados MNIST. Observe que ambas as redes têm um número igual de parâmetros. (Fig.8) + + + +

+ Figura 8: instâncias do conjunto de dados MNIST original
+ + + +Antes do treinamento, normalizamos nossos dados para que a inicialização da rede corresponda à nossa distribuição de dados (muito importante!). Além disso, certifique-se de que as cinco operações/etapas a seguir estejam presentes em seu treinamento: + + + +1. Alimentando dados para o modelo +2. Calculando a perda +3. Limpar o cache de gradientes acumulados com `zero_grad()` +4. Calculando os gradientes +5. Executar uma etapa no método do otimizador + + + +Primeiro, treinamos ambas as redes nos dados MNIST normalizados. A precisão da rede totalmente conectada acabou sendo $87\%$, enquanto a precisão da CNN acabou sendo $95\%$. Dado o mesmo número de parâmetros, a CNN conseguiu treinar muitos mais filtros. Na rede FC, os filtros que tentam obter algumas dependências entre coisas que estão mais distantes com coisas que estão por perto são treinados. Eles estão completamente perdidos. Em vez disso, na rede convolucional, todos esses parâmetros se concentram na relação entre os pixels vizinhos. + + + +Em seguida, realizamos uma permutação aleatória de todos os pixels em todas as imagens de nosso conjunto de dados MNIST. Isso transforma nossa Fig.8 em Fig.9. Em seguida, treinamos ambas as redes neste conjunto de dados modificado. + + + +
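Um esboço (hipotético) de laço de treinamento com as cinco etapas listadas acima; os nomes `modelo`, `otimizador` e `dados_treino` são apenas ilustrativos, e assume-se que o modelo devolve log-probabilidades, como a LeNet5 vista anteriormente.

```python
import torch.nn.functional as F

def treinar_uma_epoca(modelo, otimizador, dados_treino):
    modelo.train()
    for x, y in dados_treino:
        saida = modelo(x)                 # 1. alimentar os dados ao modelo
        perda = F.nll_loss(saida, y)      # 2. calcular a perda (modelo devolve log-probabilidades)
        otimizador.zero_grad()            # 3. limpar o cache de gradientes acumulados
        perda.backward()                  # 4. calcular os gradientes
        otimizador.step()                 # 5. executar uma etapa do otimizador
```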

+ Figura 9: instâncias do conjunto de dados MNIST permutado
+ + + +O desempenho da rede totalmente conectada quase permaneceu inalterado ($85\%$), mas a precisão da CNN caiu para $83\%$. Isso ocorre porque, após uma permutação aleatória, as imagens não possuem mais as três propriedades de localidade, estacionariedade e composicionalidade, que são exploradas por uma CNN. + diff --git a/docs/pt/week03/03.md b/docs/pt/week03/03.md new file mode 100644 index 000000000..8e652a19d --- /dev/null +++ b/docs/pt/week03/03.md @@ -0,0 +1,40 @@ +--- +lang: pt +lang-ref: ch.03 +title: Semana 3 +translator: Leon Solon +translation-date: 14 Nov 2021 +--- + + + +## Aula parte A + + + +Iniciamos com a visualização de uma rede neural de 6 camadas. A seguir, começamos com o tópico de Convoluções e Redes Neurais Convolucionais (CNN). Revisamos vários tipos de transformações de parâmetros no contexto de CNNs e apresentamos a ideia de um kernel, que é usado para aprender características de maneira hierárquica. Assim, podemos classificar nossos dados de entrada, que é a ideia básica que motiva o uso de CNNs. + + + +## Aula parte B + + + +Damos uma introdução de como as CNNs evoluíram ao longo do tempo. Discutimos em detalhes diferentes arquiteturas de CNN, incluindo uma implementação moderna de LeNet5 para exemplificar a tarefa de reconhecimento de dígitos no conjunto de dados MNIST. Com base em seus princípios de design, expandimos as vantagens das CNNs, o que nos permite explorar as características de composicionalidade, estacionariedade e localidade de imagens naturais. + + + +## Prática + + + +As propriedades dos sinais naturais que são mais relevantes para as CNNs são discutidas em mais detalhes, a saber: Localidade, Estacionaridade e Composicionalidade. Exploramos precisamente como um kernel explora essas características por meio de dispersão, compartilhamento de peso e empilhamento de camadas, além de motivar os conceitos de preenchimento (padding) e pooling. Finalmente, uma comparação de desempenho entre FCN (Redes Convolucionais Completas) e CNN foi feita para diferentes modalidades de dados. \ No newline at end of file diff --git a/docs/pt/week03/lecture03.sbv b/docs/pt/week03/lecture03.sbv new file mode 100644 index 000000000..a6444f462 --- /dev/null +++ b/docs/pt/week03/lecture03.sbv @@ -0,0 +1,3429 @@ +0:00:04.819,0:00:08.319 +In this case, we have a network which has an input on the left-hand side + +0:00:08.959,0:00:14.259 +Usually you have the input on the bottom side or on the left. They are pink in my slides + +0:00:14.260,0:00:17.409 +So if you take notes, make them pink. No, just kidding! + +0:00:18.400,0:00:23.020 +And then we have... How many activations? How many hidden layers do you count there? + +0:00:23.539,0:00:27.789 +Four hidden layers. So overall how many layers does the network have here? + +0:00:28.820,0:00:32.980 +Six, right? Because we have four hidden, plus one input, plus one output layer + +0:00:33.649,0:00:37.568 +So in this case, I have two neurons per layer, right? + +0:00:37.569,0:00:41.739 +So what does it mean? What are the dimensions of the matrices we are using here? + +0:00:43.339,0:00:46.119 +Two by two. So what does that two by two matrix do? + +0:00:48.739,0:00:51.998 +Come on! You have... You know the answer to this question + +0:00:53.359,0:00:57.579 +Rotation, yeah. Then scaling, then shearing and... + +0:00:59.059,0:01:05.469 +reflection. Fantastic, right? 
So we constrain our network to perform all the operations on the plan + +0:01:05.540,0:01:12.380 +We have seen the first time if I allow the hidden layer to be a hundred neurons long we can... + +0:01:12.380,0:01:13.680 +Wow okay! + +0:01:13.680,0:01:15.680 +We can easily... + +0:01:18.079,0:01:20.079 +Ah fantastic. What is it? + +0:01:21.170,0:01:23.170 +We are watching movies now. I see... + +0:01:24.409,0:01:29.889 +See? Fantastic. What is it? Mandalorian is so cool, no? Okay... + +0:01:32.479,0:01:39.428 +Okay, how nice is this lesson. Is it even recorded? Okay, we have no idea + +0:01:40.789,0:01:43.719 +Okay, give me a sec. Okay, so we go here... + +0:01:47.810,0:01:49.810 +Done + +0:01:50.390,0:01:52.070 +Listen + +0:01:52.070,0:01:53.600 +All right + +0:01:53.600,0:01:59.679 +So we started from this network here, right? Which had this intermediate layer and we forced them to be + +0:02:00.289,0:02:05.229 +2-dimensional, right? Such that all the transformations are enforced to be on a plane + +0:02:05.270,0:02:08.319 +So this is what the network does to our plan + +0:02:08.319,0:02:14.269 +It folds it on specific regions, right? And those foldings are very abrupt + +0:02:14.370,0:02:18.499 +This is because all the transformations are performed on the 2d layer, right? + +0:02:18.500,0:02:22.550 +So this training took me really a lot of effort because the + +0:02:23.310,0:02:25.310 +optimization is actually quite hard + +0:02:25.740,0:02:30.769 +Whenever I had a hundred-neuron hidden layer, that was very easy to train + +0:02:30.770,0:02:35.299 +This one really took a lot of effort and you have to tell me why, okay? + +0:02:35.400,0:02:39.469 +If you don't know the answer right now, you'd better know the answer for the midterm + +0:02:40.470,0:02:43.370 +So you can take note of what are the questions for the midterm... + +0:02:43.980,0:02:49.600 +Right, so this is the final output of the network, which is also that 2d layer + +0:02:50.010,0:02:55.489 +to the embedding, so I have no non-linearity on my last layer. And these are the final + +0:02:56.370,0:03:01.850 +classification regions. So let's see what each layer does. This is the first layer, affine transformation + +0:03:01.850,0:03:06.710 +so it looks like it's a 3d rotation, but it's not right? It's just a 2d rotation + +0:03:07.740,0:03:15.600 +reflection, scaling, and shearing. And then what is this part? Ah, what's happened right now? Do you see? + +0:03:17.820,0:03:21.439 +We have like the ReLU part, which is killing all the negative + +0:03:22.800,0:03:27.079 +sides of the network, right? Sorry, all the negative sides of this + +0:03:28.080,0:03:33.499 +space, right? It is the second affine transformation and then here you apply again + +0:03:34.770,0:03:37.460 +the ReLU, you can see all the negative + +0:03:38.220,0:03:41.149 +subspaces have been erased and they've been set to zero + +0:03:41.730,0:03:44.509 +Then we keep going with a third affine transformation + +0:03:45.120,0:03:46.790 +We zoom it... it's zooming a lot... + +0:03:46.790,0:03:54.469 +And then you're gonna have the ReLU layer which is gonna be killing one of those... all three quadrants, right? + +0:03:54.470,0:03:59.240 +Only one quadrant survives every time. And then we go with the fourth affine transformation + +0:03:59.790,0:04:06.200 +where it's elongating a lot because given that we confine all the transformation to be living in this space + +0:04:06.210,0:04:12.439 +it really needs to stretch and use all the power it can, right? 
Again, this is the + +0:04:13.170,0:04:18.589 +second last. Then we have the last affine transformation, which is the final one. And then we reach finally + +0:04:19.320,0:04:20.910 +linearly separable + +0:04:20.910,0:04:26.359 +regions here. Finally, we're gonna see how each affine transformation can be + +0:04:27.240,0:04:31.759 +split in each component. So we have the rotation, we have now squashing, like zooming + +0:04:32.340,0:04:38.539 +Then we have rotation, reflection because the determinant is minus one, and then we have the final bias + +0:04:38.539,0:04:42.769 +You have the positive part of the ReLU (Rectified Linear Unit), again rotation + +0:04:43.650,0:04:47.209 +flipping because we had a negative, a minus one determinant + +0:04:47.849,0:04:49.849 +Zooming, rotation + +0:04:49.889,0:04:54.258 +One more reflection and then the final bias. This was the second affine transformation + +0:04:54.259,0:04:58.609 +Then we have here the positive part again. We have third layer so rotation, reflection + +0:05:00.000,0:05:05.629 +zooming and then we have... this is SVD decomposition, right? You should be aware of that, right? + +0:05:05.629,0:05:09.799 +You should know. And then final is the translation and the third + +0:05:10.229,0:05:15.589 +ReLU, then we had the fourth layer, so rotation, reflection because the determinant was negative + +0:05:16.169,0:05:18.169 +zooming, again the other rotation + +0:05:18.599,0:05:21.769 +Once more... reflection and bias + +0:05:22.379,0:05:24.559 +Finally a ReLU and then we have the last... + +0:05:25.259,0:05:27.259 +the fifth layer. So rotation + +0:05:28.139,0:05:32.059 +zooming, we didn't have reflection because the determinant was +1 + +0:05:32.490,0:05:37.069 +Again, reflection in this case because the determinant was negative and then finally the final bias, right? + +0:05:37.139,0:05:41.478 +And so this was pretty much how this network, which was + +0:05:42.599,0:05:44.599 +just made of + +0:05:44.759,0:05:46.759 +a sequence of layers of + +0:05:47.159,0:05:52.218 +neurons that are only two neurons per layer, is performing the classification task + +0:05:54.990,0:05:58.159 +And all those transformation have been constrained to be + +0:05:58.680,0:06:03.199 +living on the plane. Okay, so this was really hard to train + +0:06:03.419,0:06:05.959 +Can you figure out why it was really hard to train? + +0:06:06.539,0:06:08.539 +What does it happen if my... + +0:06:09.270,0:06:16.219 +if my bias of one of the four layers puts my points away from the top right quadrant? + +0:06:21.060,0:06:25.519 +Exactly, so if you have one of the four biases putting my + +0:06:26.189,0:06:28.549 +initial point away from the top right quadrant + +0:06:29.189,0:06:34.039 +then the ReLUs are going to be completely killing everything, and everything gets collapsed into zero + +0:06:34.560,0:06:38.399 +Okay? And so there you can't do any more of anything, so + +0:06:38.980,0:06:44.129 +this network here was really hard to train. If you just make it a little bit fatter than... + +0:06:44.320,0:06:48.659 +instead of constraining it to be two neurons for each of the hidden layers + +0:06:48.660,0:06:52.230 +then it is much easier to train. Or you can do a combination of the two, right? + +0:06:52.230,0:06:54.300 +So instead of having just a fat network + +0:06:54.300,0:07:01.589 +you can have a network that is less fat, but then you have a few hidden layers, okay? + +0:07:02.770,0:07:06.659 +So the fatness is how many neurons you have per hidden layer, right? 
+ +0:07:07.810,0:07:11.429 +Okay. So the question is how do we determine the structure or the + +0:07:12.730,0:07:15.150 +configuration of our network, right? How do we design the network? + +0:07:15.580,0:07:20.550 +And the answer is going to be, that's what Yann is gonna be teaching across the semester, right? + +0:07:20.550,0:07:27.300 +So keep your attention high because that's what we're gonna be teaching here + +0:07:28.090,0:07:30.840 +That's a good question right? There is no + +0:07:32.410,0:07:34.679 +mathematical rule, there is a lot of experimental + +0:07:35.710,0:07:39.569 +empirical evidence and a lot of people are trying different configurations + +0:07:39.570,0:07:42.000 +We found something that actually works pretty well now. + +0:07:42.100,0:07:46.200 +We're gonna be covering these architectures in the following lessons. Other questions? + +0:07:48.790,0:07:50.790 +Don't be shy + +0:07:51.880,0:07:56.130 +No? Okay, so I guess then we can switch to the second part of the class + +0:07:57.880,0:08:00.630 +Okay, so we're gonna talk about convolutional nets today + +0:08:02.710,0:08:05.879 +Let's dive right in. So I'll start with + +0:08:06.820,0:08:09.500 +something that's relevant to convolutional nets but not just [to them] + +0:08:10.000,0:08:12.500 +which is the idea of transforming the parameters of a neural net + +0:08:12.570,0:08:17.010 +So here we have a diagram that we've seen before except for a small twist + +0:08:17.920,0:08:22.300 +The diagram we're seeing here is that we have a neural net G of X and W + +0:08:22.360,0:08:27.960 +W being the parameters, X being the input that makes a prediction about an output, and that goes into a cost function + +0:08:27.960,0:08:29.500 +We've seen this before + +0:08:29.500,0:08:34.500 +But the twist here is that the weight vector instead of being a + +0:08:35.830,0:08:39.660 +parameter that's being optimized, is actually itself the output of some other function + +0:08:40.599,0:08:43.589 +possibly parameterized. In this case this function is + +0:08:44.320,0:08:50.369 +not a parameterized function, or it's a parameterized function but the only input is another parameter U, okay? + +0:08:50.750,0:08:56.929 +So what we've done here is make the weights of that neural net be the function of some more elementary... + +0:08:57.480,0:08:59.480 +some more elementary parameters U + +0:09:00.420,0:09:02.420 +through a function and + +0:09:02.940,0:09:07.880 +you realize really quickly that backprop just works there, right? If you back propagate gradients + +0:09:09.210,0:09:15.049 +through the G function to get the gradient of whatever objective function we're minimizing with respect to the + +0:09:15.600,0:09:21.290 +weight parameters, you can keep back propagating through the H function here to get the gradients with respect to U + +0:09:22.620,0:09:27.229 +So in the end you're sort of propagating things like this + +0:09:30.600,0:09:42.220 +So when you're updating U, you're multiplying the Jacobian of the objective function with respect to the parameters, and then by the... + +0:09:42.750,0:09:46.760 +Jacobian of the H function with respect to its own parameters, okay? + +0:09:46.760,0:09:50.960 +So you get the product of two Jacobians here, which is just what you get from back propagating + +0:09:50.960,0:09:54.919 +You don't have to do anything in PyTorch for this. 
This will happen automatically as you define the network + +0:09:59.130,0:10:03.080 +And that's kind of the update that occurs + +0:10:03.840,0:10:10.820 +Now, of course, W being a function of U through the function H, the change in W + +0:10:12.390,0:10:16.460 +will be the change in U multiplied by the Jacobian of H transpose + +0:10:18.090,0:10:24.739 +And so this is the kind of thing you get here, the effective change in W that you get without updating W + +0:10:24.740,0:10:30.260 +--you actually are updating U-- is the update in U multiplied by the Jacobian of H + +0:10:30.690,0:10:37.280 +And we had a transpose here. We have the opposite there. This is a square matrix + +0:10:37.860,0:10:41.720 +which is Nw by Nw, which is the number of... the dimension of W squared, okay? + +0:10:42.360,0:10:44.690 +So this matrix here + +0:10:45.780,0:10:47.780 +has as many rows as + +0:10:48.780,0:10:52.369 +W has components and then the number of columns is the number of + +0:10:52.560,0:10:57.470 +components of U. And then this guy, of course, is the other way around so it's an Nu by Nw + +0:10:57.540,0:11:02.669 +So when you make the product, do the product of those two matrices you get an Nw by Nw matrix + +0:11:03.670,0:11:05.670 +And then you multiply this by this + +0:11:06.190,0:11:10.380 +Nw vector and you get an Nw vector which is what you need for updating + +0:11:11.440,0:11:13.089 +the weights + +0:11:13.089,0:11:16.828 +Okay, so that's kind of a general form of transforming the parameter space and there's + +0:11:18.430,0:11:22.979 +many ways you can use this and a particular way of using it is when + +0:11:23.769,0:11:25.389 +H is what's called a... + +0:11:26.709,0:11:30.089 +what we talked about last week, which is a "Y connector" + +0:11:30.089,0:11:35.578 +So imagine the only thing that H does is that it takes one component of U and it copies it multiple times + +0:11:36.029,0:11:40.000 +So that you have the same value, the same weight replicated across the G function + +0:11:40.000,0:11:43.379 +the G function we use the same value multiple times + +0:11:45.639,0:11:47.639 +So this would look like this + +0:11:48.339,0:11:50.339 +So let's imagine U is two dimensional + +0:11:51.279,0:11:54.448 +u1, u2 and then W is four dimensional but + +0:11:55.000,0:11:59.969 +w1 and w2 are equal to u1 and w3, w4 are equal to u2 + +0:12:01.060,0:12:04.400 +So basically you only have two free parameters + +0:12:04.700 +and when you're changing one component of U changing two components of W at the same time + +0:12:08.560,0:12:14.579 +in a very simple manner. And that's called weight sharing, okay? When two weights are forced to be equal + +0:12:14.579,0:12:19.200 +They are actually equal to a more elementary parameter that controls both + +0:12:19.300,0:12:21.419 +That's weight sharing and that's kind of the basis of + +0:12:21.940,0:12:23.940 +a lot of + +0:12:24.670,0:12:26.880 +ideas... 
you know, convolutional nets among others + +0:12:27.730,0:12:31.890 +but you can think of this as a very simple form of H of U + +0:12:33.399,0:12:38.489 +So you don't need to do anything for this in the sense that when you have weight sharing + +0:12:39.100,0:12:45.810 +If you do it explicitly with a module that does kind of a Y connection on the way back, when the gradients are back propagated + +0:12:45.810,0:12:47.800 +the gradients are summed up + +0:12:47.800,0:12:53.099 +so the gradient of some cost function with respect to u1, for example, will be the sum of the gradient so that + +0:12:53.199,0:12:55.559 +cost function with respect to w1 and w2 + +0:12:56.860,0:13:02.219 +And similarly for the gradient with respect to u2 would be the sum of the gradients with respect to w3 and w4, okay? + +0:13:02.709,0:13:06.328 +That's just the effect of backpropagating through the two Y connectors + +0:13:13.310,0:13:19.119 +Okay, here is a slightly more general view of this parameter transformation that some people have called hypernetworks + +0:13:19.970,0:13:23.350 +So a hypernetwork is a network where + +0:13:23.839,0:13:28.299 +the weights of one network are computed as the output of another network + +0:13:28.459,0:13:33.969 +Okay, so you have a network H that looks at the input, it has its own parameters U + +0:13:35.569,0:13:37.929 +And it computes the weights of a second network + +0:13:38.959,0:13:44.199 +Okay? so the advantage of doing this... there are various names for it + +0:13:44.199,0:13:46.508 +The idea is very old, it goes back to the 80s + +0:13:46.880,0:13:52.539 +people using what's called multiplicative interactions, or three-way network, or sigma-pi units and they're basically + +0:13:53.600,0:13:59.050 +this idea --and this is maybe a slightly more general general formulation of it + +0:14:00.949,0:14:02.949 +that you have sort of a dynamically + +0:14:04.069,0:14:06.519 +Your function that's dynamically defined + +0:14:07.310,0:14:09.669 +In G of X and W + +0:14:10.459,0:14:14.318 +Because W is really a complex function of the input and some other parameter + +0:14:16.189,0:14:17.959 +This is particularly + +0:14:17.959,0:14:22.419 +interesting architecture when what you're doing to X is transforming it in some ways + +0:14:23.000,0:14:29.889 +Right? So you can think of W as being the parameters of that transformation, so Y would be a transformed version of X + +0:14:32.569,0:14:37.809 +And the X, I mean the function H basically computes that transformation + +0:14:38.899,0:14:41.739 +Okay? But we'll come back to that in a few weeks + +0:14:42.829,0:14:46.209 +Just wanted to mention this because it's basically a small modification of + +0:14:46.579,0:14:52.869 +of this right? You just have one more wire that goes from X to H, and that's how you get those hypernetworks + +0:14:56.569,0:15:03.129 +Okay, so we're showing the idea that you can have one parameter controlling + +0:15:06.500,0:15:12.549 +multiple effective parameters in another network. And one reason that's useful is + +0:15:13.759,0:15:16.779 +if you want to detect a motif on an input + +0:15:17.300,0:15:20.139 +And you want to detect this motif regardless of where it appears, okay? 
+ +0:15:20.689,0:15:27.099 +So let's say you have an input, let's say it's a sequence but it could be an image, in this case is a sequence + +0:15:27.100,0:15:28.000 +Sequence of vectors, let's say + +0:15:28.300,0:15:33.279 +And you have a network that takes a collection of three of those vectors, three successive vectors + +0:15:34.010,0:15:36.339 +It's this network G of X and W and + +0:15:37.010,0:15:42.249 +it's trained to detect a particular motif of those three vectors. Maybe this is... I don't know + +0:15:42.889,0:15:44.750 +the power consumption + +0:15:44.750,0:15:51.880 +Electrical power consumption, and sometimes you might want to be able to detect like a blip or a trend or something like that + +0:15:52.519,0:15:54.519 +Or maybe it's, you know... + +0:15:56.120,0:15:58.120 +financial instruments of some kind + +0:15:59.149,0:16:05.289 +Some sort of time series. Maybe it's a speech signal and you want to detect a particular sound that consists in three + +0:16:06.050,0:16:10.899 +vectors that define the sort of audio content of that speech signal + +0:16:12.440,0:16:15.709 +And so you'd like to be able to detect + +0:16:15.709,0:16:20.469 +if it's a speech signal and there's a particular sound you need to detect for doing speech recognition + +0:16:20.470,0:16:22.630 +You might want to detect the sound + +0:16:23.180,0:16:28.690 +The vowel P, right? The sound P wherever it occurs in a sequence + +0:16:28.690,0:16:31.299 +You want some detector that fires when the sound P is... + +0:16:33.589,0:16:41.439 +...is pronounced. And so what we'd like to have is a detector you can slide over and regardless of where this motif occurs + +0:16:42.470,0:16:47.500 +detect it. So what you need to have is some network, some parameterized function that... + +0:16:48.920,0:16:55.029 +You have multiple copies of that function that you can apply to various regions on the input and they all share the same weight + +0:16:55.029,0:16:58.600 +but you'd like to train this entire system end to end + +0:16:58.700,0:17:01.459 +So for example, let's say... + +0:17:01.459,0:17:03.459 +Let's talk about a slightly more sophisticated + +0:17:05.569,0:17:07.688 +thing here where you have... + +0:17:11.059,0:17:13.059 +Let's see... + +0:17:14.839,0:17:17.349 +A keyword that's being being pronounced so + +0:17:18.169,0:17:22.959 +the system listens to sound and wants to detect when a particular keyword, a wakeup + +0:17:24.079,0:17:28.329 +word has been has been pronounced, right? So this is Alexa, right? + +0:17:28.459,0:17:32.709 +And you say "Alexa!" and Alexa wakes up it goes bong, right? + +0:17:35.260,0:17:40.619 +So what you'd like to have is some network that kind of takes a window over the sound and then sort of keeps + +0:17:41.890,0:17:44.189 +in the background sort of detecting + +0:17:44.860,0:17:47.219 +But you'd like to be able to detect + +0:17:47.220,0:17:52.020 +wherever the sound occurs within the frame that is being looked at, or it's been listened to, I should say + +0:17:52.300,0:17:56.639 +So you can have a network like this where you have replicated detectors + +0:17:56.640,0:17:59.520 +They all share the same weight and then the output which is + +0:17:59.520,0:18:03.329 +the score as to whether something has been detected or not, goes to a max function + +0:18:04.090,0:18:07.500 +Okay? And that's the output. 
And the way you train a system like this + +0:18:08.290,0:18:10.290 +you will have a bunch of samples + +0:18:10.780,0:18:14.140 +Audio examples where the keyword + +0:18:14.140,0:18:18.000 +has been pronounced and a bunch of audio samples where the keyword was not pronounced + +0:18:18.100,0:18:20.249 +And then you train a 2 class classifier + +0:18:20.470,0:18:24.689 +Turn on when "Alexa" is somewhere in this frame, turn off when it's not + +0:18:25.059,0:18:30.899 +But nobody tells you where the word "Alexa" occurs within the window that you train the system on, okay? + +0:18:30.900,0:18:35.729 +Because it's really expensive for labelers to look at the audio signal and tell you exactly + +0:18:35.730,0:18:37.570 +This is where the word "Alexa" is being pronounced + +0:18:37.570,0:18:42.720 +The only thing they know is that within this segment of a few seconds, the word has been pronounced somewhere + +0:18:43.450,0:18:48.390 +Okay, so you'd like to apply a network like this that has those replicated detectors? + +0:18:48.390,0:18:53.429 +You don't know exactly where it is, but you run through this max and you want to train the system to... + +0:18:53.950,0:18:59.370 +You want to back propagate gradient to it so that it learns to detect "Alexa", or whatever... + +0:19:00.040,0:19:01.900 +wake up word occurs + +0:19:01.900,0:19:09.540 +And so there what happens is you have those multiple copies --five copies in this example + +0:19:09.580,0:19:11.580 +of this network and they all share the same weight + +0:19:11.710,0:19:16.650 +You can see there's just one weight vector sending its value to five different + +0:19:17.410,0:19:22.559 +instances of the same network and so we back propagate through the + +0:19:23.260,0:19:27.689 +five copies of the network, you get five gradients, so those gradients get added up... + +0:19:29.679,0:19:34.949 +for the parameter. Now, there's this slightly strange way this is implemented in PyTorch and other + +0:19:35.740,0:19:41.760 +Deep Learning frameworks, which is that this accumulation of gradient in a single parameter is done implicitly + +0:19:42.550,0:19:46.659 +And it's one reason why before you do a backprop in PyTorch, you have to zero out the gradient + +0:19:47.840,0:19:49.840 +Because there's sort of implicit + +0:19:50.510,0:19:52.510 +accumulation of gradients when you do back propagation + +0:19:58.640,0:20:02.000 +Okay, so here's another situation where that would be useful + +0:20:02.100,0:20:07.940 +And this is the real motivation behind conditional nets in the first place + +0:20:07.940,0:20:09.940 +Which is the problem of + +0:20:10.850,0:20:15.000 +training a system to recognize the shape independently of the position + +0:20:16.010,0:20:17.960 +of where the shape occurs + +0:20:17.960,0:20:22.059 +and whether there are distortions of that shape in the input + +0:20:22.850,0:20:28.929 +So this is a very simple type of convolutional net that is has been built by hand. It's not been trained + +0:20:28.929,0:20:30.929 +It's been designed by hand + +0:20:31.760,0:20:36.200 +And it's designed explicitly to distinguish C's from D's + +0:20:36.400,0:20:38.830 +Okay, so you can draw a C on the input + +0:20:39.770,0:20:41.770 +image which is very low resolution + +0:20:43.880,0:20:48.459 +And what distinguishes C's from D's is that C's have end points, right? + +0:20:48.460,0:20:54.610 +The stroke kind of ends, and you can imagine designing a detector for that. 
Whereas these have corners + +0:20:55.220,0:20:59.679 +So if you have an endpoint detector or something that detects the end of a segment and + +0:21:00.290,0:21:02.290 +a corner detector + +0:21:02.330,0:21:06.699 +Wherever you have corners detected, it's a D and wherever you have + +0:21:07.700,0:21:09.700 +segments that end, it's a C + +0:21:11.870,0:21:16.989 +So here's an example of a C. You take the first detector, so the little + +0:21:17.750,0:21:19.869 +black and white motif here at the top + +0:21:20.870,0:21:24.640 +is an endpoint detector, okay? It detects the end of a + +0:21:25.610,0:21:28.059 +of a segment and the way this + +0:21:28.760,0:21:33.969 +is represented here is that the black pixels here... + +0:21:35.840,0:21:37.929 +So think of this as some sort of template + +0:21:38.990,0:21:43.089 +Okay, you're going to take this template and you're going to swipe it over the input image + +0:21:44.510,0:21:51.160 +and you're going to compare that template to the little image that is placed underneath, okay? + +0:21:51.980,0:21:56.490 +And if those two match, the way you're going to determine whether they match is that you're going to do a dot product + +0:21:56.490,0:22:03.930 +So you're gonna think of those black and white pixels as value of +1 or -1, say +1 for black and -1 for white + +0:22:05.020,0:22:09.420 +And you're gonna think of those pixels also as being +1 for blacks and -1 for white and + +0:22:10.210,0:22:16.800 +when you compute the dot product of a little window with that template + +0:22:17.400,0:22:22.770 +If they are similar, you're gonna get a large positive value. If they are dissimilar, you're gonna get a... + +0:22:24.010,0:22:27.629 +zero or negative value. Or a smaller value, okay? + +0:22:29.020,0:22:35.489 +So you take that little detector here and you compute the dot product with the first window, second window, third window, etc. + +0:22:35.650,0:22:42.660 +You shift by one pixel every time for every location and you recall the result. And what you what you get is this, right? + +0:22:42.660,0:22:43.660 +So this is... + +0:22:43.660,0:22:51.640 +Here the grayscale is an indication of the matching + +0:22:51.640,0:22:57.959 +which is actually the dot product between the vector formed by those values + +0:22:58.100,0:23:05.070 +And the patch of the corresponding location on the input. So this image here is roughly the same size as that image + +0:23:06.250,0:23:08.250 +minus border effects + +0:23:08.290,0:23:13.469 +And you see there is a... whenever the output is dark there is a match + +0:23:14.380,0:23:16.380 +So you see a match here + +0:23:16.810,0:23:20.249 +because this endpoint detector here matches the + +0:23:20.980,0:23:24.810 +the endpoint. You see sort of a match here at the bottom + +0:23:25.630,0:23:27.930 +And the other kind of values are not as + +0:23:28.750,0:23:32.459 +dark, okay? Not as strong if you want + +0:23:33.250,0:23:38.820 +Now, if you threshold those those values you set the output to +1 if it's above the threshold + +0:23:39.520,0:23:41.520 +Zero if it's below the threshold + +0:23:42.070,0:23:46.499 +You get those maps here, you have to set the threshold appropriately but what you get is that + +0:23:46.500,0:23:50.880 +this little guy here detected a match at the two end points of the C, okay? 
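[Editor's sketch] A small sketch of that swipe-and-threshold operation, with a made-up ±1 image and a made-up 3×3 ±1 template standing in for the endpoint detector (the actual templates from the slide are not reproduced here): the dot product is computed at every location and the resulting map is thresholded.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Made-up binary image and 3x3 template, encoded as +1 (black) / -1 (white)
image = torch.randint(0, 2, (1, 1, 12, 12)).float() * 2 - 1
template = torch.randint(0, 2, (1, 1, 3, 3)).float() * 2 - 1

# Swipe the template over the image: a dot product at every location
# (F.conv2d actually computes a cross-correlation, which is what we want here)
match = F.conv2d(image, template)        # shape (1, 1, 10, 10): input size minus border

# Threshold: 1 where the template matches strongly, 0 elsewhere
detections = (match >= 7.0).float()      # >= 7 means at most one mismatching pixel
```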
+ +0:23:52.150,0:23:54.749 +So now if you take this map and you sum it up + +0:23:56.050,0:23:58.050 +Just add all the values + +0:23:58.600,0:24:00.430 +You get a positive number + +0:24:00.430,0:24:03.989 +Pass that through threshold, and that's your C detector. It's not a very good C detector + +0:24:03.990,0:24:07.859 +It's not a very good detector of anything, but for those particular examples of C's + +0:24:08.429,0:24:10.210 +and maybe those D's + +0:24:10.210,0:24:16.980 +It will work, it'll be enough. Now for the D is similar, those other detectors here are meant to detect the corners of the D + +0:24:17.679,0:24:24.538 +So this guy here, this detector, as you swipe it over the input will detect the + +0:24:25.659,0:24:29.189 +upper left corner and that guy will detect the lower right corner + +0:24:29.649,0:24:33.689 +Once you threshold, you will get those two maps where the corners are detected + +0:24:34.509,0:24:37.019 +and then you can sum those up and the + +0:24:37.360,0:24:44.729 +D detector will turn on. Now what you see here is an example of why this is good because that detection now is shift invariant + +0:24:44.730,0:24:49.169 +So if I take the same input D here, and I shift it by a couple pixels + +0:24:50.340,0:24:56.279 +And I run this detector again, it will detect the motifs wherever they appear. The output will be shifted + +0:24:56.379,0:25:01.559 +Okay, so this is called equivariance to shift. So the output of that network + +0:25:02.590,0:25:10.499 +is equivariant to shift, which means that if I shift the input the output gets shifted, but otherwise unchanged. Okay? That's equivariance + +0:25:11.289,0:25:12.909 +Invariance would be + +0:25:12.909,0:25:17.398 +if I shift it, the output will be completely unchanged but here it is modified + +0:25:17.399,0:25:19.739 +It just modified the same way as the input + +0:25:23.950,0:25:31.080 +And so if I just sum up the activities in the feature maps here, it doesn't matter where they occur + +0:25:31.809,0:25:34.199 +My D detector will still activate + +0:25:34.929,0:25:38.998 +if I just compute the sum. So this is sort of a handcrafted + +0:25:39.700,0:25:47.100 +pattern recognizer that uses local feature detectors and then kind of sums up their activity and what you get is an invariant detection + +0:25:47.710,0:25:52.529 +Okay, this is a fairly classical way actually of building certain types of pattern recognition systems + +0:25:53.049,0:25:55.049 +Going back many years + +0:25:57.730,0:26:03.929 +But the trick here, what's important of course, what's interesting would be to learn those templates + +0:26:04.809,0:26:10.258 +Can we view this as just a neural net and we back propagate to it and we learn those templates? + +0:26:11.980,0:26:18.779 +As weights of a neural net? After all we're using them to do that product which is a weighted sum, so basically + +0:26:21.710,0:26:29.059 +This layer here to go from the input to those so-called feature maps that are weighted sums + +0:26:29.520,0:26:33.080 +is a linear operation, okay? And we know how to back propagate through that + +0:26:35.850,0:26:41.750 +We'd have to use a kind of a soft threshold, a ReLU or something like this here because otherwise we can't do backprop + +0:26:43.470,0:26:48.409 +Okay, so this operation here of taking the dot product of a bunch of coefficients + +0:26:49.380,0:26:53.450 +with an input window and then swiping it over, that's a convolution + +0:26:57.810,0:27:03.409 +Okay, so that's the definition of a convolution. 
It's actually the one up there so this is in the one dimensional case + +0:27:05.400,0:27:07.170 +where imagine you have + +0:27:10.530,0:27:16.639 +An input Xj, so X indexed by the j in the index + +0:27:20.070,0:27:22.070 +You take a window + +0:27:23.310,0:27:26.029 +of X at a particular location i + +0:27:27.330,0:27:30.080 +Okay, and then you sum + +0:27:31.890,0:27:40.340 +You do a weighted sum of the window of the X values and you multiply those by the weights wⱼ's + +0:27:41.070,0:27:50.359 +Okay, and the sum presumably runs over a kind of a small window so j here would go from 1 to 5 + +0:27:51.270,0:27:54.259 +Something like that, which is the case in the little example I showed earlier + +0:27:58.020,0:28:00.950 +and that gives you one Yi + +0:28:01.770,0:28:05.510 +Okay, so take the first window of 5 values of X + +0:28:06.630,0:28:13.280 +Compute the weighted sum with the weights, that gives you Y1. Then shift that window by 1, compute the weighted sum of the + +0:28:13.620,0:28:18.320 +dot product of that window by the Y's, that gives you Y2, shift again, etc. + +0:28:23.040,0:28:26.839 +Now, in practice when people implement in things like PyTorch + +0:28:26.840,0:28:31.069 +there is a confusion between two things that mathematicians think are very different + +0:28:31.070,0:28:37.009 +but in fact, they're pretty much the same. It's convolution and cross correlation. So in convolution, the convention is that the... + +0:28:37.979,0:28:44.359 +the index goes backwards in the window when it goes forwards in the weights + +0:28:44.359,0:28:49.519 +In cross correlation, they both go forward. In the end, it's just a convention, it depends on how you lay... + +0:28:51.659,0:28:59.598 +organize the data and your weights. You can interpret this as a convolution if you read the weights backwards, so really doesn't make any difference + +0:29:01.259,0:29:06.949 +But for certain mathematical properties of a convolution if you want everything to be consistent you have to have the... + +0:29:07.440,0:29:10.849 +The j in the W having an opposite sign to the j in the X + +0:29:11.879,0:29:13.879 +So the two dimensional version of this... + +0:29:15.419,0:29:17.419 +If you have an image X + +0:29:17.789,0:29:21.258 +that has two indices --in this case i and j + +0:29:23.339,0:29:25.909 +You do a weighted sum over two indices k and l + +0:29:25.909,0:29:31.368 +And so you have a window a two-dimensional window indexed by k and l and you compute the dot product + +0:29:31.769,0:29:34.008 +of that window over X with the... + +0:29:35.099,0:29:39.679 +the weight, and that gives you one value in Yij which is the output + +0:29:43.349,0:29:51.319 +So the vector W or the matrix W in the 2d version, there is obvious extensions of this to 3d and 4d, etc. + +0:29:52.080,0:29:55.639 +It's called a kernel, it's called a convolutional kernel, okay? + +0:30:00.380,0:30:03.309 +Is it clear? I'm sure this is known for many of you but... + +0:30:10.909,0:30:13.449 +So what we're going to do with this is that + +0:30:14.750,0:30:18.699 +We're going to organize... 
build a network as a succession of + +0:30:20.120,0:30:23.769 +convolutions where in a regular neural net you have + +0:30:25.340,0:30:29.100 +alternation of linear operators and pointwise non-linearity + +0:30:29.250,0:30:34.389 +In convolutional nets, we're going to have an alternation of linear operators that will happen to be convolutions, so multiple convolutions + +0:30:34.940,0:30:40.179 +Then also pointwise non-linearity and there's going to be a third type of operation called pooling... + +0:30:42.620,0:30:44.620 +which is actually optional + +0:30:45.470,0:30:50.409 +Before I go further, I should mention that there are + +0:30:52.220,0:30:56.889 +twists you can make to this convolution. So one twist is what's called a stride + +0:30:57.380,0:31:01.239 +So a stride in a convolution consists in moving the window + +0:31:01.760,0:31:07.509 +from one position to another instead of moving it by just one value + +0:31:07.940,0:31:13.510 +You move it by two or three or four, okay? That's called a stride of a convolution + +0:31:14.149,0:31:17.138 +And so if you have an input of a certain length and... + +0:31:19.700,0:31:26.590 +So let's say you have an input which is kind of a one-dimensional and size 100 hundred + +0:31:27.019,0:31:31.059 +And you have a convolution kernel of size five + +0:31:32.330,0:31:34.330 +Okay, and you convolve + +0:31:34.909,0:31:38.409 +this kernel with the input + +0:31:39.350,0:31:46.120 +And you make sure that the window stays within the input of size 100 + +0:31:46.730,0:31:51.639 +The output you get has 96 outputs, okay? It's got the number of inputs + +0:31:52.519,0:31:56.019 +minus the size of the kernel, which is 5 minus 1 + +0:31:57.110,0:32:00.610 +Okay, so that makes it 4. So you get 100 minus 4, that's 96 + +0:32:02.299,0:32:08.709 +That's the number of windows of size 5 that fit within this big input of size 100 + +0:32:11.760,0:32:13.760 +Now, if I use this stride... + +0:32:13.760,0:32:21.960 +So what I do now is I take my window of 5 where I applied the kernel and I shift not by one pixel but by 2 pixels + +0:32:21.960,0:32:24.710 +Or two values, let's say. They're not necessarily pixels + +0:32:26.310,0:32:31.880 +Okay, the number of outputs I'm gonna get is gonna be divided by two roughly + +0:32:33.570,0:32:36.500 +Okay, instead of 96 I'm gonna have + +0:32:37.080,0:32:42.949 +a little less than 50, 48 or something like that. The number is not exact, you can... + +0:32:44.400,0:32:46.400 +figure it out in your head + +0:32:47.430,0:32:51.470 +Very often when people run convolutions in convolutional nets they actually pad the convolution + +0:32:51.470,0:32:59.089 +So they sometimes like to have the output being the same size as the input, and so they actually displace the input window + +0:32:59.490,0:33:02.479 +past the end of the vector assuming that it's padded with zeros + +0:33:04.230,0:33:06.230 +usually on both sides + +0:33:16.110,0:33:19.849 +Does it have any effect on performance or is it just for convenience? + +0:33:21.480,0:33:25.849 +If it has an effect on performance is bad, okay? 
But it is convenient + +0:33:28.350,0:33:30.350 +That's pretty much the answer + +0:33:32.700,0:33:37.800 +The assumption that's bad is assuming that when you don't have data it's equal to zero + +0:33:38.000,0:33:41.720 +So when your nonlinearities are ReLU, it's not necessarily completely unreasonable + +0:33:43.650,0:33:48.079 +But it sometimes creates funny border effects (boundary effects) + +0:33:51.120,0:33:53.539 +Okay, everything clear so far? + +0:33:54.960,0:33:59.059 +Right. Okay. So what we're going to build is a + +0:34:01.050,0:34:03.050 +neural net composed of those + +0:34:03.690,0:34:08.120 +convolutions that are going to be used as feature detectors, local feature detectors + +0:34:09.090,0:34:13.069 +followed by nonlinearities, and then we're gonna stack multiple layers of those + +0:34:14.190,0:34:18.169 +And the reason for stacking multiple layers is because + +0:34:19.170,0:34:21.090 +We want to build + +0:34:21.090,0:34:25.809 +hierarchical representations of the visual world of the data + +0:34:26.089,0:34:32.258 +It's not... convolutional nets are not necessarily applied to images. They can be applied to speech and other signals + +0:34:32.299,0:34:35.619 +They basically can be applied to any signal that comes to you in the form of an array + +0:34:36.889,0:34:41.738 +And I'll come back to the properties that this array has to verify + +0:34:43.789,0:34:45.789 +So what you want is... + +0:34:46.459,0:34:48.698 +Why do you want to build hierarchical representations? + +0:34:48.699,0:34:54.369 +Because the world is compositional --and I alluded to this I think you the first lecture if remember correctly + +0:34:55.069,0:35:03.519 +It's the fact that pixes assemble to form simple motifs like oriented edges + +0:35:04.430,0:35:10.839 +Oriented edges kind of assemble to form local features like corners and T junctions and... + +0:35:11.539,0:35:14.018 +things like that... gratings, you know, and... + +0:35:14.719,0:35:19.600 +then those assemble to form motifs that are slightly more abstract. + +0:35:19.700,0:35:23.559 +Then those assemble to form parts of objects, and those assemble to form objects + +0:35:23.559,0:35:28.000 +So there is a sort of natural compositional hierarchy in the natural world + +0:35:28.100,0:35:33.129 +And this natural compositional hierarchy in the natural world is not just because of + +0:35:34.369,0:35:38.438 +perception --visual perception-- is true at a physical level, right? + +0:35:41.390,0:35:46.808 +You start at the lowest level of the description + +0:35:47.719,0:35:50.079 +You have elementary particles and they form... + +0:35:50.079,0:35:56.438 +they clump to form less elementary particles, and they clump to form atoms, and they clump to form molecules, and molecules clump to form + +0:35:57.229,0:36:00.399 +materials, and materials parts of objects and + +0:36:01.130,0:36:03.609 +parts of objects into objects, and things like that, right? + +0:36:04.670,0:36:07.599 +Or macromolecules or polymers, bla bla bla + +0:36:08.239,0:36:13.239 +And then you have this natural composition or hierarchy the world is built this way + +0:36:14.719,0:36:19.000 +And it may be why the world is understandable, right? 
+ +0:36:19.100,0:36:22.419 +So there's this famous quote from Einstein that says: + +0:36:23.329,0:36:26.750 +"the most incomprehensible thing about the world is that the world is comprehensible" + +0:36:26.800,0:36:30.069 +And it seems like a conspiracy that we live in a world that we are able to comprehend + +0:36:31.130,0:36:35.019 +But we can comprehend it because the world is compositional and + +0:36:36.970,0:36:38.970 +it happens to be easy to build + +0:36:39.760,0:36:44.370 +brains in a compositional world that actually can interpret compositional world + +0:36:45.580,0:36:47.580 +It still seems like a conspiracy to me + +0:36:49.660,0:36:51.660 +So there's a famous quote from... + +0:36:53.650,0:36:54.970 +from a... + +0:36:54.970,0:37:00.780 +Not that famous, but somewhat famous, from a statistician at Brown called Stuart Geman. + +0:37:01.360,0:37:04.799 +And he says that sounds like a conspiracy, like magic + +0:37:06.070,0:37:08.070 +But you know... + +0:37:08.440,0:37:15.570 +If the world were not compositional we would need some even more magic to be able to understand it + +0:37:17.260,0:37:21.540 +The way he says this is: "the world is compositional or there is a God" + +0:37:25.390,0:37:32.339 +You would need to appeal to superior powers if the world was not compositional to explain how we can understand it + +0:37:35.830,0:37:37.830 +Okay, so this idea of hierarchy + +0:37:38.440,0:37:44.520 +and local feature detection comes from biology. So the whole idea of convolutional nets comes from biology. It's been + +0:37:45.850,0:37:47.850 +so inspired by biology and + +0:37:48.850,0:37:53.399 +what you see here on the right is a diagram by Simon Thorpe who's a + +0:37:54.160,0:37:56.160 +psycho-physicist and + +0:37:56.500,0:38:02.939 +did some relatively famous experiments where he showed that the way we recognize everyday objects + +0:38:03.580,0:38:05.969 +seems to be extremely fast. So if you show... + +0:38:06.640,0:38:10.409 +if you flash the image of an everyday object to a person and + +0:38:11.110,0:38:12.730 +you flash + +0:38:12.730,0:38:16.649 +one of them every 100 milliseconds or so, you realize that the + +0:38:18.070,0:38:23.549 +the time it takes for a person to identify in a long sequence, whether there was a particular object, let's say a tiger + +0:38:25.780,0:38:27.640 +is about 100 milliseconds + +0:38:27.640,0:38:34.769 +So the time it takes for brain to interpret an image and recognize basic objects in them is about 100 milliseconds + +0:38:35.650,0:38:37.740 +A tenth of a second, right? + +0:38:39.490,0:38:42.120 +And that's just about the time it takes for the + +0:38:43.000,0:38:45.000 +nerve signal to propagate from + +0:38:45.700,0:38:47.550 +the retina + +0:38:47.550,0:38:54.090 +where images are formed in the eye to what's called the LGN (lateral geniculate nucleus) + +0:38:54.340,0:38:56.340 +which is a small + +0:38:56.350,0:39:02.640 +piece of the brain that basically does sort of contrast enhancement and gain control, and things like that + +0:39:03.580,0:39:08.789 +And then that signal goes to the back of your brain v1. That's the primary visual cortex area + +0:39:09.490,0:39:15.600 +in humans and then v2, which is very close to v1. 
There's a fold that sort of makes v1 sort of + +0:39:17.380,0:39:20.549 +right in front of v2, and there is lots of wires between them + +0:39:21.580,0:39:28.890 +And then v4, and then the inferior temporal cortex, which is on the side here and that's where object categories are represented + +0:39:28.890,0:39:35.369 +So there are neurons in your inferior temporal cortex that represent generic object categories + +0:39:38.350,0:39:41.370 +And people have done experiments with this where... + +0:39:44.320,0:39:51.150 +epileptic patients are in hospital and have their skull open because they need to locate the... + +0:39:52.570,0:40:00.200 +exact position of the source of their epilepsy seizures + +0:40:02.080,0:40:04.650 +And because they have electrodes on the surface of their brain + +0:40:05.770,0:40:11.000 +you can show the movies and then observe if a particular neuron turns on for particular movies + +0:40:11.100,0:40:14.110 +And you show them a movie with Jennifer Aniston and there is this + +0:40:14.110,0:40:17.900 +neuron that only turns on when Jennifer Aniston is there, okay? + +0:40:18.000,0:40:21.000 +It doesn't turn on for anything else as far as we could tell, okay? + +0:40:21.700,0:40:27.810 +So you seem to have very selective neurons in the inferior temporal cortex that react to a small number of categories + +0:40:30.760,0:40:35.669 +There's a joke, kind of a running joke, in neuroscience of a concept called the grandmother cell + +0:40:35.670,0:40:40.350 +So this is the one neuron in your inferior temporal cortex that turns on when you see your grandmother + +0:40:41.050,0:40:45.120 +regardless of what position what she's wearing, how far, whether it's a photo or not + +0:40:46.510,0:40:50.910 +Nobody really believes in this concept, what people really believe in is distributed representations + +0:40:50.910,0:40:54.449 +So there is no such thing as a cell that just turns on for you grandmother + +0:40:54.970,0:41:00.820 +There are this collection of cells that turn on for various things and they serve to represent general categories + +0:41:01.100,0:41:04.060 +But the important thing is that they are invariant to + +0:41:04.700,0:41:06.700 +position, size... + +0:41:06.920,0:41:11.080 +illumination, all kinds of different things and the real motivation behind + +0:41:11.930,0:41:14.349 +convolutional nets is to build + +0:41:15.140,0:41:18.670 +neural nets that are invariant to irrelevant transformation of the inputs + +0:41:19.510,0:41:27.070 +You can still recognize a C or D or your grandmother regardless of the position and to some extent the orientation, the style, etc. + +0:41:29.150,0:41:36.790 +So this idea that the signal only takes 100 milliseconds to go from the retina to the inferior temporal cortex + +0:41:37.160,0:41:40.330 +Seems to suggest that if you count the delay + +0:41:40.850,0:41:42.850 +to go through every neuron or every + +0:41:43.340,0:41:45.489 +stage in that pathway + +0:41:46.370,0:41:48.880 +There's barely enough time for a few spikes to get through + +0:41:48.880,0:41:55.720 +So there's no time for complex recurrent computation, is basically a feed-forward process. It's very fast + +0:41:56.930,0:41:59.980 +Okay, and we need it to be fast because that's a question of survival for us + +0:41:59.980,0:42:06.159 +There's a lot of... for most animals, you need to be able to recognize really quickly what's going on, particularly... 
+ +0:42:07.850,0:42:12.820 +fast-moving predators or preys for that matter + +0:42:17.570,0:42:20.830 +So that kind of suggests the idea that we can do + +0:42:21.920,0:42:26.230 +perhaps we could come up with some sort of neuronal net architecture that is completely feed-forward and + +0:42:27.110,0:42:29.110 +still can do recognition + +0:42:30.230,0:42:32.230 +The diagram on the right + +0:42:34.430,0:42:39.280 +is from Gallent & Van Essen, so this is a type of sort of abstract + +0:42:39.920,0:42:43.450 +conceptual diagram of the two pathways in the visual cortex + +0:42:43.490,0:42:50.530 +There is the ventral pathway and the dorsal pathway. The ventral pathway is, you know, basically the v1, v2, v4, IT hierarchy + +0:42:50.530,0:42:54.999 +which is sort of from the back of the brain, and goes to the bottom and to the side and + +0:42:55.280,0:42:58.179 +then the dorsal pathway kind of goes + +0:42:59.060,0:43:02.469 +through the top also towards the inferior temporal cortex and + +0:43:04.040,0:43:09.619 +there is this idea somehow that the ventral pathway is there to tell you what you're looking at, right? + +0:43:10.290,0:43:12.499 +The dorsal pathway basically identifies + +0:43:13.200,0:43:15.200 +locations + +0:43:15.390,0:43:17.390 +geometry and motion + +0:43:17.460,0:43:25.040 +Okay? So there is a pathway for what, and another pathway for where, and that seems fairly separate in the + +0:43:25.040,0:43:29.030 +human or primate visual cortex + +0:43:32.610,0:43:34.610 +And of course there are interactions between them + +0:43:39.390,0:43:45.499 +So various people had the idea of kind of using... so where does that idea come from? There is + +0:43:46.080,0:43:48.799 +classic work in neuroscience from the late 50s early 60s + +0:43:49.650,0:43:52.129 +By Hubel & Wiesel, they're on the picture here + +0:43:53.190,0:43:57.440 +They won a Nobel Prize for it, so it's really classic work and what they showed + +0:43:58.290,0:44:01.519 +with cats --basically by poking electrodes into cat brains + +0:44:02.310,0:44:08.480 +is that neurons in the cat brain --in v1-- detect... + +0:44:09.150,0:44:13.789 +are only sensitive to a small area of the visual field and they detect oriented edges + +0:44:14.970,0:44:17.030 +contours in that particular area, okay? + +0:44:17.880,0:44:22.160 +So the area to which a particular neuron is sensitive is called a receptive field + +0:44:23.700,0:44:27.859 +And you take a particular neuron and you show it + +0:44:29.070,0:44:35.719 +kind of an oriented bar that you rotate, and at one point the neuron will fire + +0:44:36.270,0:44:40.640 +for a particular angle, and as you move away from that angle the activation of the neuron kind of + +0:44:42.690,0:44:50.149 +diminishes, okay? So that's called orientation selective neurons, and Hubel & Wiesel called it simple cells + +0:44:51.420,0:44:56.930 +If you move the bar a little bit, you go out of the receptive field, that neuron doesn't fire anymore + +0:44:57.150,0:45:03.049 +it doesn't react to it. This could be another neuron almost exactly identical to it, just a little bit + +0:45:04.830,0:45:09.620 +Away from the first one that does exactly the same function. 
It will react to a slightly different + +0:45:10.380,0:45:12.440 +receptive field but with the same orientation + +0:45:14.700,0:45:18.889 +So you start getting this idea that you have local feature detectors that are positioned + +0:45:20.220,0:45:23.689 +replicated all over the visual field, which is basically this idea of + +0:45:24.960,0:45:26.960 +convolution, okay? + +0:45:27.870,0:45:33.470 +So they are called simple cells. And then another idea that or discovery that + +0:45:35.100,0:45:40.279 +Hubel & Wiesel did is the idea of complex cells. So what a complex cell is is another type of neuron + +0:45:41.100,0:45:45.200 +that integrates the output of multiple simple cells within a certain area + +0:45:46.170,0:45:50.120 +Okay? So they will take different simple cells that all detect + +0:45:51.180,0:45:54.079 +contours at a particular orientation, edges at a particular orientation + +0:45:55.350,0:46:02.240 +And compute an aggregate of all those activations. It will either do a max, or a sum, or + +0:46:02.760,0:46:08.239 +a sum of squares, or square root of sum of squares. Some sort of function that does not depend on the order of the arguments + +0:46:08.820,0:46:11.630 +Okay? Let's say max for the sake of simplicity + +0:46:12.900,0:46:17.839 +So basically a complex cell will turn on if any of the simple cells within its + +0:46:19.740,0:46:22.399 +input group turns on + +0:46:22.680,0:46:29.480 +Okay? So that complex cell will detect an edge at a particular orientation regardless of its position within that little region + +0:46:30.210,0:46:32.210 +So it builds a little bit of + +0:46:32.460,0:46:34.609 +shift invariance of the + +0:46:35.250,0:46:40.159 +representation coming out of the complex cells with respect to small variation of positions of + +0:46:40.890,0:46:42.890 +features in the input + +0:46:46.680,0:46:52.010 +So a gentleman by the name of Kunihiko Fukushima + +0:46:54.420,0:46:56.569 +--No real relationship with the nuclear power plant + +0:46:58.230,0:47:00.230 +In the late 70s early 80s + +0:47:00.330,0:47:07.190 +experimented with computer models that sort of implemented this idea of simple cell / complex cell, and he had the idea of sort of replicating this + +0:47:07.500,0:47:09.500 +with multiple layers, so basically... + +0:47:11.310,0:47:17.810 +The architecture he did was very similar to the one I showed earlier here with this sort of handcrafted + +0:47:18.570,0:47:20.490 +feature detector + +0:47:20.490,0:47:24.559 +Some of those feature detectors in his model were handcrafted but some of them were learned + +0:47:25.230,0:47:30.709 +They were learned by an unsupervised method. He didn't have have backprop, right? Backprop didn't exist + +0:47:30.710,0:47:36.770 +I mean, it existed but it wasn't really popular and people didn't use it + +0:47:38.609,0:47:43.338 +So he trained those filters basically with something that amounts to a + +0:47:44.190,0:47:46.760 +sort of clustering algorithm a little bit... + +0:47:49.830,0:47:53.569 +and separately for each layer. And so he would + +0:47:56.609,0:48:02.389 +train the filters for the first layer, train this with handwritten digits --he also had a dataset of handwritten digits + +0:48:03.390,0:48:06.470 +and then feed this to complex cells that + +0:48:06.470,0:48:10.820 +pool the activity of simple cells together, and then that would + +0:48:11.880,0:48:18.440 +form the input to the next layer, and it would repeat the same running algorithm. 
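[Editor's sketch] A toy sketch of the simple-cell / complex-cell idea, with made-up responses: the complex cell aggregates a small group of simple cells with a permutation-invariant function (a max here, for simplicity), so its output does not change when the active unit moves within its group.

```python
import torch

# Made-up responses of replicated simple cells (edge detectors at one orientation)
simple = torch.tensor([0.0, 0.0, 0.9, 0.0, 0.0, 0.0, 0.0, 0.0])

# A complex cell pools a group of simple cells with a permutation-invariant function
complex_a = simple[0:4].max()            # fires: the edge is somewhere in its group
complex_b = simple[4:8].max()            # stays silent

# Move the edge by one position inside the same group: the complex cell's
# response is unchanged -- a little bit of shift invariance
shifted = torch.tensor([0.0, 0.0, 0.0, 0.9, 0.0, 0.0, 0.0, 0.0])
assert shifted[0:4].max() == complex_a
```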
His model of neuron was very complicated + +0:48:18.440,0:48:19.589 +It was kind of inspired by biology + +0:48:19.589,0:48:27.229 +So it had separate inhibitory neurons, the other neurons only have positive weights and outgoing weights, etc. + +0:48:27.839,0:48:29.839 +He managed to get this thing to kind of work + +0:48:30.510,0:48:33.800 +Not very well, but sort of worked + +0:48:36.420,0:48:39.170 +Then a few years later + +0:48:40.770,0:48:44.509 +I basically kind of got inspired by similar architectures, but + +0:48:45.780,0:48:51.169 +trained them supervised with backprop, okay? So that's the genesis of convolutional nets, if you want + +0:48:51.750,0:48:53.869 +And then independently more or less + +0:48:57.869,0:49:04.969 +Max Riesenhuber and Tony Poggio's lab at MIT kind of rediscovered this architecture also, but also didn't use backprop for some reason + +0:49:06.060,0:49:08.060 +He calls this H-max + +0:49:12.150,0:49:20.039 +So this is sort of early experiments I did with convolutional nets when I was finishing my postdoc in the University of Toronto in 1988 + +0:49:20.040,0:49:22.040 +So that goes back a long time + +0:49:22.840,0:49:26.730 +And I was trying to figure out, does this work better on a small data set? + +0:49:26.730,0:49:27.870 +So if you have a tiny amount of data + +0:49:27.870,0:49:31.109 +you're trying to fully connect to network or linear network with just one layer or + +0:49:31.480,0:49:34.529 +a network with local connections but no shared weights or compare this with + +0:49:35.170,0:49:39.299 +what was not yet called a convolutional net, where you have shared weights and local connections + +0:49:39.400,0:49:42.749 +Which one works best? And it turned out that in terms of + +0:49:43.450,0:49:46.439 +generalization ability, which are the curves on the bottom left + +0:49:49.270,0:49:52.499 +which you see here, the top curve here, is... + +0:49:53.500,0:50:00.330 +basically the baby convolutional net architecture trained with very a simple data set of handwritten digits that were drawn with a mouse, right? + +0:50:00.330,0:50:02.490 +We didn't have any way of collecting images, basically + +0:50:03.640,0:50:05.640 +at that time + +0:50:05.860,0:50:09.240 +And then if you have real connections without shared weights + +0:50:09.240,0:50:12.119 +it works a little worse. And then if you have fully connected + +0:50:14.470,0:50:22.230 +networks it works worse, and if you have a linear network, it not only works worse, but but it also overfits, it over trains + +0:50:23.110,0:50:28.410 +So the test error goes down after a while, and this was trained with 320 + +0:50:29.410,0:50:35.519 +320 training samples, which is really small. Those networks had on the order of + +0:50:36.760,0:50:43.170 +five thousand connections, one thousand parameters. So this is a billion times smaller than what we do today + +0:50:43.990,0:50:45.990 +A million times I would say + +0:50:47.890,0:50:53.730 +And then I finished my postdoc, I went to Bell Labs, and Bell Labs had slightly bigger computers + +0:50:53.730,0:50:57.389 +but what they had was a data set that came from the Postal Service + +0:50:57.390,0:51:00.629 +So they had zip codes for envelopes and we built a + +0:51:00.730,0:51:05.159 +data set out of those zip codes and then trained a slightly bigger a neural net for three weeks + +0:51:06.430,0:51:12.749 +and got really good results. 
So this convolutional net did not have separate + +0:51:13.960,0:51:15.960 +convolution and pooling + +0:51:16.240,0:51:22.769 +It had strided convolution, so convolutions where the window is shifted by more than one pixel. So that's... + +0:51:23.860,0:51:29.739 +What's the result of this? So the result is that the output map when you do a convolution where the stride is + +0:51:30.710,0:51:36.369 +more than one, you get an output whose resolution is smaller than the input and you see an example here + +0:51:36.370,0:51:40.390 +So here the input is 16 by 16 pixels. That's what we could afford + +0:51:41.900,0:51:49.029 +The kernels are 5 by 5, but they are shifted by 2 pixels every time and so the + +0:51:51.950,0:51:56.919 +the output here is smaller because of that + +0:52:11.130,0:52:13.980 +Okay? And then one year later this was the next generation + +0:52:14.830,0:52:16.830 +convolutional net. This one had separate + +0:52:17.680,0:52:19.680 +convolution and pooling so... + +0:52:20.740,0:52:24.389 +Where's the pooling operation? At that time, the pooling operation was just another + +0:52:25.690,0:52:31.829 +neuron except that all the weights of that neuron were equal, okay? So a pooling unit was basically + +0:52:32.680,0:52:36.839 +a unit that computed an average of its inputs + +0:52:37.180,0:52:41.730 +it added a bias, and then passed it to a non-linearity, which in this case was a hyperbolic tangent function + +0:52:42.820,0:52:48.450 +Okay? All the non-linearities in this network were hyperbolic tangents at the time. That's what people were doing + +0:52:53.200,0:52:55.200 +And the pooling operation was + +0:52:56.380,0:52:58.440 +performed by shifting + +0:52:59.680,0:53:01.710 +the window over which you compute the + +0:53:02.770,0:53:09.240 +the aggregate of the output of the previous layer by 2 pixels, okay? So here + +0:53:10.090,0:53:13.470 +you get a 32 by 32 input window + +0:53:14.470,0:53:20.730 +You convolve this with filters that are 5 by 5. I should mention that a convolution kernel sometimes is also called a filter + +0:53:22.540,0:53:25.230 +And so what you get here are + +0:53:27.520,0:53:29.520 +outputs that are + +0:53:30.520,0:53:33.749 +I guess minus 4 so is 28 by 28, okay? + +0:53:34.540,0:53:40.380 +And then there is a pooling which computes an average of + +0:53:41.530,0:53:44.400 +pixels here over a 2 by 2 window and + +0:53:45.310,0:53:47.310 +then shifts that window by 2 + +0:53:48.160,0:53:50.160 +So how many such windows do you have? + +0:53:51.220,0:53:56.279 +Since the image is 28 by 28, you divide by 2, is 14 by 14, okay? So those images + +0:53:57.460,0:54:00.359 +here are 14 by 14 pixels + +0:54:02.050,0:54:05.759 +And they are basically half the resolution as the previous window + +0:54:07.420,0:54:09.420 +because of this stride + +0:54:10.360,0:54:16.470 +Okay? Now it becomes interesting because what you want is, you want the next layer to detect combinations of features from the previous layer + +0:54:17.200,0:54:19.200 +And so... + +0:54:20.200,0:54:22.619 +the way to do this is... you have + +0:54:23.440,0:54:26.579 +different convolution filters apply to each of those feature maps + +0:54:27.730,0:54:29.730 +Okay? + +0:54:29.950,0:54:35.939 +And you sum them up, you sum the results of those four convolutions and you pass the result to a non-linearity and that gives you + +0:54:36.910,0:54:42.239 +one feature map of the next layer. 
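[Editor's sketch] A minimal shape check of that pipeline in PyTorch; the channel counts are only illustrative, not the exact configuration of the network being described, but the spatial sizes follow the same arithmetic: 5×5 convolutions without padding, 2×2 average pooling with stride 2, and tanh non-linearities.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)                  # one 32x32 input image

conv1 = nn.Conv2d(1, 4, kernel_size=5)         # 5x5 kernels, a few feature maps
pool  = nn.AvgPool2d(kernel_size=2, stride=2)  # average over 2x2, shift by 2
conv2 = nn.Conv2d(4, 16, kernel_size=5)        # each output map sums convolutions
                                               # over all input maps before the tanh

h1 = torch.tanh(conv1(x))                      # (1, 4, 28, 28)
p1 = torch.tanh(pool(h1))                      # (1, 4, 14, 14)
h2 = torch.tanh(conv2(p1))                     # (1, 16, 10, 10)
print(h1.shape, p1.shape, h2.shape)
```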
So because those filters are 5 by 5 and those + +0:54:43.330,0:54:46.380 +images are 14 by 14, those guys are 10 by 10 + +0:54:47.290,0:54:49.739 +Okay? To not have border effects + +0:54:52.270,0:54:56.999 +So each of these feature maps --of which there are sixteen if I remember correctly + +0:54:59.290,0:55:01.290 +uses a different set of + +0:55:02.860,0:55:04.860 +kernels to... + +0:55:06.340,0:55:09.509 +convolve the previous layers. In fact + +0:55:10.630,0:55:13.799 +the connection pattern between the feature map... + +0:55:14.650,0:55:18.720 +the feature map at this layer and the feature map at the next layer is actually not full + +0:55:18.720,0:55:22.349 +so not every feature map is connected to every feature map. There's a particular scheme of + +0:55:23.680,0:55:25.950 +different combinations of feature map from the previous layer + +0:55:28.030,0:55:33.600 +combining to four feature maps at the next layer. And the reason for doing this is just to save computer time + +0:55:34.000,0:55:40.170 +We just could not afford to connect everything to everything. It would have taken twice the time to run or more + +0:55:41.890,0:55:48.359 +Nowadays we are kind of forced more or less to actually have a complete connection between feature maps in a convolutional net + +0:55:49.210,0:55:52.289 +Because of the way that multiple convolutions are implemented in GPUs + +0:55:53.440,0:55:55.440 +Which is sad + +0:55:56.560,0:55:59.789 +And then the next layer up. So again those maps are 10 by 10 + +0:55:59.790,0:56:02.729 +Those feature maps are 10 by 10 and the next layer up + +0:56:03.970,0:56:06.389 +is produced by pooling and subsampling + +0:56:07.330,0:56:09.330 +by a factor of 2 + +0:56:09.370,0:56:11.370 +and so those are 5 by 5 + +0:56:12.070,0:56:14.880 +Okay? And then again there is a 5 by 5 convolution here + +0:56:14.880,0:56:18.089 +Of course, you can't move the window 5 by 5 over a 5 by 5 image + +0:56:18.090,0:56:21.120 +So it looks like a full connection, but it's actually a convolution + +0:56:22.000,0:56:24.000 +Okay? Keep this in mind + +0:56:24.460,0:56:26.460 +But you basically just sum in only one location + +0:56:27.250,0:56:33.960 +And those feature maps at the top here are really outputs. And so you have one special location + +0:56:33.960,0:56:39.399 +Okay? Because you can only place one 5 by 5 window within a 5 by 5 image + +0:56:40.460,0:56:45.340 +And you have 10 of those feature maps each of which corresponds to a category so you train the system to classify + +0:56:45.560,0:56:47.619 +digits from 0 to 9, you have ten categories + +0:56:59.750,0:57:03.850 +This is a little animation that I borrowed from Andrej Karpathy + +0:57:05.570,0:57:08.439 +He spent the time to build this really nice real animation + +0:57:09.470,0:57:16.780 +which is to represent several convolutions, right? So you have three feature maps here on the input and you have six + +0:57:18.650,0:57:21.100 +convolution kernels and two feature maps on the output + +0:57:21.100,0:57:26.709 +So here the first group of three feature maps are convolved with... + +0:57:28.520,0:57:31.899 +kernels are convolved with the three input feature maps to produce + +0:57:32.450,0:57:37.330 +the first group, the first of the two feature maps, the green one at the top + +0:57:38.390,0:57:40.370 +Okay? + +0:57:40.370,0:57:42.820 +And then... 
+ +0:57:44.180,0:57:49.000 +Okay, so this is the first group of three kernels convolved with the three feature maps + +0:57:49.000,0:57:53.349 +And they produce the green map at the top, and then you switch to the second group of + +0:57:54.740,0:57:58.479 +of convolution kernels. You convolve with the + +0:57:59.180,0:58:04.149 +three input feature maps to produce the map at the bottom. Okay? So that's + +0:58:05.810,0:58:07.810 +an example of + +0:58:10.070,0:58:17.709 +n-feature map on the input, n-feature map on the output, and N times M convolution kernels to get all combinations + +0:58:25.000,0:58:27.000 +Here's another animation which I made a long time ago + +0:58:28.100,0:58:34.419 +That shows convolutional net after it's been trained in action trying to recognize digits + +0:58:35.330,0:58:38.529 +And so what's interesting to look at here is you have + +0:58:39.440,0:58:41.440 +an input here, which is I believe + +0:58:42.080,0:58:44.590 +32 rows by 64 columns + +0:58:45.770,0:58:52.570 +And after doing six convolutions with six convolution kernels passing it through a hyperbolic tangent non-linearity after a bias + +0:58:52.570,0:58:59.229 +you get those feature maps here, each of which kind of activates for a different type of feature. So, for example + +0:58:59.990,0:59:01.990 +the feature map at the top here + +0:59:02.390,0:59:04.690 +turns on when there is some sort of a horizontal edge + +0:59:07.400,0:59:10.090 +This guy here it turns on whenever there is a vertical edge + +0:59:10.940,0:59:15.340 +Okay? And those convolutional kernels have been learned through backprop, the thing has been just been trained + +0:59:15.980,0:59:20.980 +with backprop. Not set by hand. They're set randomly usually + +0:59:21.620,0:59:26.769 +So you see this notion of equivariance here, if I shift the input image the + +0:59:27.500,0:59:31.600 +activations on the feature maps shift, but otherwise stay unchanged + +0:59:32.540,0:59:34.540 +All right? + +0:59:34.940,0:59:36.940 +That's shift equivariance + +0:59:36.950,0:59:38.860 +Okay, and then we go to the pooling operation + +0:59:38.860,0:59:42.519 +So this first feature map here corresponds to a pooled version of + +0:59:42.800,0:59:46.149 +this first one, the second one to the second one, third went to the third one + +0:59:46.250,0:59:51.370 +and the pooling operation here again is an average, then a bias, then a similar non-linearity + +0:59:52.070,0:59:55.029 +And so if this map shifts by + +0:59:56.570,0:59:59.499 +one pixel this map will shift by one half pixel + +1:00:01.370,1:00:02.780 +Okay? + +1:00:02.780,1:00:05.259 +So you still have equavariance, but + +1:00:06.260,1:00:11.830 +shifts are reduced by a factor of two, essentially + +1:00:11.830,1:00:15.850 +and then you have the second stage where each of those maps here is a result of + +1:00:16.160,1:00:23.440 +doing a convolution on each, or a subset of the previous maps with different kernels, summing up the result, passing the result through + +1:00:24.170,1:00:27.070 +a sigmoid, and so you get those kind of abstract features + +1:00:28.730,1:00:32.889 +here that are a little hard to interpret visually, but it's still equivariant to shift + +1:00:33.860,1:00:40.439 +Okay? And then again you do pooling and subsampling. So the pooling also has this stride by a factor of two + +1:00:40.630,1:00:42.580 +So what you get here are + +1:00:42.580,1:00:47.609 +our maps, so that those maps shift by one quarter pixel if the input shifts by one pixel + +1:00:48.730,1:00:55.290 +Okay? 
So we reduce the shift and it becomes... it might become easier and easier for following layers to kind of interpret what the shape is + +1:00:55.290,1:00:57.290 +because you exchange + +1:00:58.540,1:01:00.540 +spatial resolution for + +1:01:01.030,1:01:05.009 +feature type resolution. You increase the number of feature types as you go up the layers + +1:01:06.040,1:01:08.879 +The spatial resolution goes down because of the pooling and subsampling + +1:01:09.730,1:01:14.459 +But the number of feature maps increases and so you make the representation a little more abstract + +1:01:14.460,1:01:19.290 +but less sensitive to shift and distortions. And the next layer + +1:01:20.740,1:01:25.080 +again performs convolutions, but now the size of the convolution kernel is equal to the height of the image + +1:01:25.080,1:01:27.449 +And so what you get is a single band + +1:01:28.359,1:01:32.219 +for this feature map. It basically becomes one dimensional and + +1:01:32.920,1:01:39.750 +so now any vertical shift is basically eliminated, right? It's turned into some variation of activation, but it's not + +1:01:40.840,1:01:42.929 +It's not a shift anymore. It's some sort of + +1:01:44.020,1:01:45.910 +simpler --hopefully + +1:01:45.910,1:01:49.020 +transformation of the input. In fact, you can show it's simpler + +1:01:51.160,1:01:53.580 +It's flatter in some ways + +1:01:56.650,1:02:00.330 +Okay? So that's the sort of generic convolutional net architecture we have + +1:02:01.570,1:02:05.699 +This is a slightly more modern version of it, where you have some form of normalization + +1:02:07.450,1:02:09.450 +Batch norm + +1:02:10.600,1:02:15.179 +Good norm, whatever. A filter bank, those are the multiple convolutions + +1:02:16.660,1:02:18.690 +In signal processing they're called filter banks + +1:02:19.840,1:02:27.149 +Pointwise non-linearity, generally a ReLU, and then some pooling, generally max pooling in the most common + +1:02:28.330,1:02:30.629 +implementations of convolutional nets. You can, of course + +1:02:30.630,1:02:35.880 +imagine other types of pooling. I talked about the average but the more generic version is the LP norm + +1:02:36.640,1:02:38.640 +which is... + +1:02:38.770,1:02:45.530 +take all the inputs through a complex cell, elevate them to some power and then take the... + +1:02:45.530,1:02:47.530 +Sum them up, and then take the... + +1:02:49.860,1:02:51.860 +Elevate that to 1 over the power + +1:02:53.340,1:02:58.489 +Yeah, this should be a sum inside of the P-th root here + +1:03:00.870,1:03:02.870 +Another way to pool and again + +1:03:03.840,1:03:07.759 +a good pooling operation is an operation that is + +1:03:07.920,1:03:11.719 +invariant to a permutation of the input. It gives you the same result + +1:03:12.750,1:03:14.750 +regardless of the order in which you put the input + +1:03:15.780,1:03:22.670 +Here's another example. We talked about this function before: 1 over b log sum of our inputs of e to the bXᵢ + +1:03:25.920,1:03:30.649 +Exponential bX. Again, that's a kind of symmetric aggregation operation that you can use + +1:03:32.400,1:03:35.539 +So that's kind of a stage of a convolutional net, and then you can repeat that + +1:03:36.270,1:03:43.729 +There's sort of various ways of positioning the normalization. Some people put it after the non-linearity before the pooling + +1:03:43.730,1:03:45.730 +You know, it depends + +1:03:46.590,1:03:48.590 +But it's typical + +1:03:53.640,1:03:56.569 +So, how do you do this in PyTorch? 
there's a number of different ways + +1:03:56.570,1:04:02.479 +You can do it by writing it explicitly, writing a class. So this is an example of a convolutional net class + +1:04:04.020,1:04:10.520 +In particular one here where you do convolutions, ReLU and max pooling + +1:04:12.600,1:04:17.900 +Okay, so the constructor here creates convolutional layers which have parameters in them + +1:04:18.810,1:04:24.499 +And this one has what's called fully-connected layers. I hate that. Okay? + +1:04:25.980,1:04:30.919 +So there is this idea somehow that the last layer of a convolutional net + +1:04:32.760,1:04:34.790 +Like this one, is fully connected because + +1:04:37.320,1:04:42.860 +every unit in this layer is connected to every unit in that layer. So that looks like a full connection + +1:04:44.010,1:04:47.060 +But it's actually useful to think of it as a convolution + +1:04:49.200,1:04:51.060 +Okay? + +1:04:51.060,1:04:56.070 +Now, for efficiency reasons, or maybe some others bad reasons they're called + +1:04:57.370,1:05:00.959 +fully-connected layers, and we used the class linear here + +1:05:01.120,1:05:05.459 +But it kind of breaks the whole idea that your network is a convolutional network + +1:05:06.070,1:05:09.209 +So it's much better actually to view them as convolutions + +1:05:09.760,1:05:14.370 +In this case one by one convolution which is sort of a weird concept. Okay. So here we have + +1:05:15.190,1:05:20.46 +four layers, two convolutional layers and two so-called fully-connected layers + +1:05:21.790,1:05:23.440 +And then the way we... + +1:05:23.440,1:05:29.129 +So we need to create them in the constructor, and the way we use them in the forward pass is that + +1:05:30.630,1:05:35.310 +we do a convolution of the input, and then we apply the ReLU, and then we do max pooling and then we + +1:05:35.710,1:05:38.699 +run the second layer, and apply the ReLU, and do max pooling again + +1:05:38.700,1:05:44.280 +And then we reshape the output because it's a fully connected layer. So we want to make this a + +1:05:45.190,1:05:47.879 +vector so that's what the x.view(-1) does + +1:05:48.820,1:05:50.820 +And then apply a + +1:05:51.160,1:05:53.160 +ReLU to it + +1:05:53.260,1:05:55.260 +And... + +1:05:55.510,1:06:00.330 +the second fully-connected layer, and then apply a softmax if we want to do classification + +1:06:00.460,1:06:04.409 +And so this is somewhat similar to the architecture you see at the bottom + +1:06:04.900,1:06:08.370 +The numbers might be different in terms of feature maps and stuff, but... + +1:06:09.160,1:06:11.160 +but the general architecture is + +1:06:12.250,1:06:14.250 +pretty much what we're talking about + +1:06:15.640,1:06:17.640 +Yes? + +1:06:20.530,1:06:22.530 +Say again + +1:06:24.040,1:06:26.100 +You know, whatever gradient descent decides + +1:06:28.630,1:06:30.630 +We can look at them, but + +1:06:31.180,1:06:33.180 +if you train with a lot of + +1:06:33.280,1:06:37.590 +examples of natural images, the kind of filters you will see at the first layer + +1:06:37.840,1:06:44.999 +basically will end up being mostly oriented edge detectors, very much similar to what people, to what neuroscientists + +1:06:45.340,1:06:49.110 +observe in the cortex of + +1:06:49.210,1:06:50.440 +animals + +1:06:50.440,1:06:52.440 +In the visual cortex of animals + +1:06:55.780,1:06:58.469 +They will change when you train the model, that's the whole point yes + +1:07:05.410,1:07:11.160 +Okay, so it's pretty simple. Here's another way of defining those. This is... 
I guess it's kind of an

1:07:12.550,1:07:15.629
outdated way of doing it, right? Not many people do this anymore

1:07:17.170,1:07:23.340
but it's kind of a simple way. Also there is this class in PyTorch called nn.Sequential

1:07:24.550,1:07:28.469
It's basically a container and you keep putting modules in it and it just

1:07:29.080,1:07:36.269
automatically kind of use them as being kind of connected in sequence, right? And so then you just have to call

1:07:40.780,1:07:45.269
forward on it and it will just compute the right thing

1:07:46.360,1:07:50.370
In this particular form here, you pass it a bunch of pairs

1:07:50.370,1:07:55.229
It's like a dictionary so you can give a name to each of the layers, and you can later access them

1:08:08.079,1:08:10.079
It's the same architecture we were talking about earlier

1:08:18.489,1:08:24.029
Yeah, I mean the backprop is automatic, right? You get it

1:08:25.630,1:08:27.630
by default you just call

1:08:28.690,1:08:32.040
backward and it knows how to back propagate through it

1:08:44.000,1:08:49.180
Well, the class kind of encapsulates everything into an object where the parameters are

1:08:49.250,1:08:51.250
There's a particular way of...

1:08:52.220,1:08:54.220
getting the parameters out and

1:08:55.130,1:08:58.420
kind of feeding them to an optimizer

1:08:58.420,1:09:01.330
And so the optimizer doesn't need to know what your network looks like

1:09:01.330,1:09:06.910
It just knows that there is a function and there is a bunch of parameters and it gets a gradient and

1:09:06.910,1:09:08.910
it doesn't need to know what your network looks like

1:09:10.790,1:09:12.879
Yeah, you'll hear more about this

1:09:14.840,1:09:16.840
tomorrow

1:09:25.610,1:09:33.159
So here's a very interesting aspect of convolutional nets and it's one of the reasons why they've become so

1:09:33.830,1:09:37.390
successful in many applications. It's the fact that

1:09:39.440,1:09:45.280
if you view every layer in a convolutional net as a convolution, so there is no full connections, so to speak

1:09:47.660,1:09:53.320
you don't need to have a fixed size input. You can vary the size of the input and the network will

1:09:54.380,1:09:56.380
vary its size accordingly

1:09:56.780,1:09:58.780
because...

1:09:59.510,1:10:01.510
when you apply a convolution to an image

1:10:02.150,1:10:05.800
you feed it an image of a certain size, you do a convolution with a kernel

1:10:06.620,1:10:11.979
you get an image whose size is related to the size of the input

1:10:12.140,1:10:15.789
but you can change the size of the input and it just changes the size of the output

1:10:16.760,1:10:20.320
And this is true for every convolution-like operation, right?

1:10:20.320,1:10:25.509
So if your network is composed only of convolutions, then it doesn't matter what the size of the input is

1:10:26.180,1:10:31.450
It's going to go through the network and the size of every layer will change according to the size of the input

1:10:31.580,1:10:34.120
and the size of the output will also change accordingly

1:10:34.640,1:10:37.329
So here is a little example here where

1:10:38.720,1:10:40.720
I wanna do

1:10:41.300,1:10:45.729
cursive handwriting recognition and it's very hard because I don't know where the letters are

1:10:45.730,1:10:48.700
So I can't just have a character recognizer that...
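As a sketch of the nn.Sequential / named-layers pattern mentioned a moment ago: passing an OrderedDict of (name, module) pairs gives each layer a name you can access later, and the optimizer only ever sees the parameters, not the architecture. The layer names and sizes here are made up for illustration.

```python
from collections import OrderedDict

import torch
import torch.nn as nn

# nn.Sequential as a container: modules are applied one after the other.
model = nn.Sequential(OrderedDict([
    ('conv1', nn.Conv2d(1, 10, kernel_size=5)),
    ('relu1', nn.ReLU()),
    ('pool1', nn.MaxPool2d(2)),
    ('conv2', nn.Conv2d(10, 20, kernel_size=5)),
    ('relu2', nn.ReLU()),
    ('pool2', nn.MaxPool2d(2)),
    ('flatten', nn.Flatten()),
    ('fc1', nn.Linear(320, 50)),
    ('relu3', nn.ReLU()),
    ('fc2', nn.Linear(50, 10)),
]))

print(model.conv1)   # layers are accessible by the name you gave them

# The optimizer just gets the parameters; backprop is automatic via .backward().
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x, target = torch.randn(8, 1, 28, 28), torch.randint(0, 10, (8,))
loss = nn.CrossEntropyLoss()(model(x), target)
loss.backward()
optimizer.step()
```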
+ +1:10:49.260,1:10:51.980 +I mean a system that will first cut the + +1:10:52.890,1:10:56.100 +word into letters + +1:10:56.100,1:10:57.72 +because I don't know where the letters are + +1:10:57.720,1:10:59.900 +and then apply the convolutional net to each of the letters + +1:11:00.210,1:11:05.200 +So the best I can do is take the convolutional net and swipe it over the input and then record the output + +1:11:05.850,1:11:11.810 +Okay? And so you would think that to do this you will have to take a convolutional net like this that has a window + +1:11:12.060,1:11:14.389 +large enough to see a single character + +1:11:15.120,1:11:21.050 +and then you take your input image and compute your convolutional net at every location + +1:11:21.660,1:11:27.110 +shifting it by one pixel or two pixels or four pixels or something like this, a small enough number of pixels that + +1:11:27.630,1:11:30.619 +regardless of where the character occurs in the input + +1:11:30.620,1:11:35.000 +you will still get a score on the output whenever it needs to recognize one + +1:11:36.150,1:11:38.989 +But it turns out that will be extremely wasteful + +1:11:40.770,1:11:42.770 +because... + +1:11:43.290,1:11:50.179 +you will be redoing the same computation multiple times. And so the proper way to do this --and this is very important to understand + +1:11:50.880,1:11:56.659 +is that you don't do what I just described where you have a small convolutional net that you apply to every window + +1:11:58.050,1:12:00.050 +What you do is you + +1:12:01.230,1:12:07.939 +take a large input and you apply the convolutions to the input image since it's larger you're gonna get a larger output + +1:12:07.940,1:12:11.270 +you apply the second layer convolution to that, or the pooling, whatever it is + +1:12:11.610,1:12:15.170 +You're gonna get a larger input again, etc. + +1:12:15.170,1:12:16.650 +all the way to the top and + +1:12:16.650,1:12:20.929 +whereas in the original design you were getting only one output now you're going to get multiple outputs because + +1:12:21.570,1:12:23.570 +it's a convolutional layer + +1:12:27.990,1:12:29.990 +This is super important because + +1:12:30.600,1:12:35.780 +this way of applying a convolutional net with a sliding window is + +1:12:36.870,1:12:40.610 +much, much cheaper than recomputing the convolutional net at every location + +1:12:42.510,1:12:44.510 +Okay? + +1:12:45.150,1:12:51.619 +You would not believe how many decades it took to convince people that this was a good thing + +1:12:58.960,1:13:03.390 +So here's an example of how you can use this + +1:13:04.090,1:13:09.180 +This is a conventional net that was trained on individual digits, 32 by 32. It was trained on a MNIST, okay? + +1:13:09.760,1:13:11.760 +32 by 32 input windows + +1:13:12.400,1:13:15.690 +It's LeNet 5, so it's very similar to the architecture + +1:13:15.690,1:13:20.940 +I just showed the code for, okay? It's trained on individual characters to just classify + +1:13:21.970,1:13:26.369 +the character in the center of the image. And the way it was trained was there was a little bit of data + +1:13:26.770,1:13:30.359 +augmentation where the character in the center was kind of shifted a little bit in various locations + +1:13:31.420,1:13:36.629 +changed in size. And then there were two other characters + +1:13:37.420,1:13:39.600 +that were kind of added to the side to confuse it + +1:13:40.480,1:13:45.660 +in many samples. 
And then it was also trained with an 11th category + +1:13:45.660,1:13:50.249 +which was "none of the above" and the way it's trained is either you show it a blank image + +1:13:50.410,1:13:54.149 +or you show it an image where there is no character in the center but there are characters on the side + +1:13:54.940,1:13:59.399 +so that it would detect whenever it's inbetween two characters + +1:14:00.520,1:14:02.520 +and then you do this thing of + +1:14:02.650,1:14:10.970 +computing the convolutional net at every location on the input without actually shifting it but just applying the convolutions to the entire image + +1:14:11.740,1:14:13.740 +And that's what you get + +1:14:13.780,1:14:23.220 +So here the input image is 64 by 32, even though the network was trained on 32 by 32 with those kind of generated examples + +1:14:24.280,1:14:28.049 +And what you see is the activity of some of the layers, not all of them are represented + +1:14:29.410,1:14:32.309 +And what you see at the top here, those kind of funny shapes + +1:14:33.520,1:14:37.560 +You see threes and fives popping up and they basically are an + +1:14:38.830,1:14:41.850 +indication of the winning category for every location, right? + +1:14:42.670,1:14:47.339 +So the eight outputs that you see at the top are + +1:14:48.520,1:14:50.520 +basically the output corresponding to eight different + +1:14:51.250,1:14:56.790 +positions of the 32 by 32 input window on the input, shifted by 4 pixels every time + +1:14:59.530,1:15:05.859 +And what is represented is the winning category within that window and the grayscale indicates the score, okay? + +1:15:07.220,1:15:10.419 +So what you see is that there's two detectors detecting the five + +1:15:11.030,1:15:15.850 +until the three kind of starts overlapping. And then two detectors are detecting the three that kind of moved around + +1:15:18.230,1:15:22.779 +because within a 32 by 32 window + +1:15:23.390,1:15:29.919 +the three appears to the left of that 32 by 32 window, and then to the right of that other 32 by 32 windows shifted by four + +1:15:29.920,1:15:31.920 +and so those two detectors detect + +1:15:32.690,1:15:34.690 +that 3 or that 5 + +1:15:36.140,1:15:39.890 +So then what you do is you take all those scores here at the top and you + +1:15:39.890,1:15:43.809 +do a little bit of post-processing very simple and you figure out if it's a three and a five + +1:15:44.630,1:15:46.630 +What's interesting about this is that + +1:15:47.660,1:15:49.899 +you don't need to do prior segmentation + +1:15:49.900,1:15:51.860 +So something that people had to do + +1:15:51.860,1:15:58.180 +before, in computer vision, was if you wanted to recognize an object you had to separate the object from its background because the recognition system + +1:15:58.490,1:16:00.490 +would get confused by + +1:16:00.800,1:16:07.900 +the background. But here with this convolutional net, it's been trained with overlapping characters and it knows how to tell them apart + +1:16:08.600,1:16:10.809 +And so it's not confused by characters that overlap + +1:16:10.810,1:16:15.729 +I have a whole bunch of those on my web website, by the way, those animations from the early nineties + +1:16:38.450,1:16:41.679 +No, that was the main issue. That's one of the reasons why + +1:16:44.210,1:16:48.040 +computer vision wasn't working very well. It's because the very problem of + +1:16:49.850,1:16:52.539 +figure/background separation, detecting an object + +1:16:53.780,1:16:59.530 +and recognizing it is the same. 
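A sketch of the "apply the convolutions to the whole image" idea described above: if the last layers are written as convolutions rather than fully-connected layers, the very same network accepts a wider input and simply produces more output positions, shifted by the overall stride (4 here, because of the two stride-2 poolings). The layer sizes below are assumptions for illustration, not the actual LeNet5.

```python
import torch
import torch.nn as nn

# Every layer is a convolution, including the "classifier" at the end,
# so the network is not tied to a fixed input size.
net = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),    # stride 2
    nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),   # stride 2
    nn.Conv2d(16, 11, kernel_size=5),  # last layer seen as a convolution; 11 = 10 digits + "none of the above"
)

print(net(torch.randn(1, 1, 32, 32)).shape)  # torch.Size([1, 11, 1, 1]): one classification
print(net(torch.randn(1, 1, 32, 64)).shape)  # torch.Size([1, 11, 1, 9]): 9 window positions, 4 pixels apart
```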
You can't recognize the object until you segment it but you can't segment it until you recognize it + +1:16:59.840,1:17:05.290 +It's the same for cursive handwriting recognition, right? You can't... so here's an example + +1:17:07.460,1:17:09.460 +Do we have pens? + +1:17:10.650,1:17:12.650 +Doesn't look like we have pens right? + +1:17:14.969,1:17:21.859 +Here we go, that's true. I'm sorry... maybe I should use the... + +1:17:24.780,1:17:26.780 +If this works... + +1:17:34.500,1:17:36.510 +Oh, of course... + +1:17:43.409,1:17:45.409 +Okay... + +1:17:52.310,1:17:54.310 +Can you guys read this? + +1:17:55.670,1:18:01.990 +Okay, I mean it's horrible handwriting but it's also because I'm writing on the screen. Okay, now can you read it? + +1:18:08.240,1:18:10.240 +Minimum, yeah + +1:18:11.870,1:18:15.010 +Okay, there's actually no way you can segment the letters out of this right + +1:18:15.010,1:18:17.439 +I mean this is kind of a random number of waves + +1:18:17.900,1:18:23.260 +But just the fact that the two "I"s are identified, then it's basically not ambiguous at least in English + +1:18:24.620,1:18:26.620 +So that's a good example of + +1:18:28.100,1:18:30.340 +the interpretation of individual + +1:18:31.580,1:18:38.169 +objects depending on their context. And what you need is some sort of high-level language model to know what words are possible + +1:18:38.170,1:18:40.170 +If you don't know English or similar + +1:18:40.670,1:18:44.320 +languages that have the same word, there's no way you can you can read this + +1:18:45.500,1:18:48.490 +Spoken language is very similar to this + +1:18:49.700,1:18:53.679 +All of you who have had the experience of learning a foreign language + +1:18:54.470,1:18:56.470 +probably had the experience that + +1:18:57.110,1:19:04.150 +you have a hard time segmenting words from a new language and then recognizing the words because you don't have the vocabulary + +1:19:04.850,1:19:09.550 +Right? So if I speak in French -- si je commence à parler français, vous n'avez aucune idée d'où sont les limites des mots -- +[If I start speaking French, you have no idea where the limits of words are] + +1:19:09.740,1:19:13.749 +Except if you speak French. So I spoke a sentence, it's words + +1:19:13.750,1:19:17.140 +but you can't tell the boundary between the words right because it is basically no + +1:19:17.990,1:19:23.800 +clear seizure between the words unless you know where the words are in advance, right? So that's the problem of segmentation + +1:19:23.900,1:19:28.540 +You can't recognize until you segment, you can't segment until you recognize you have to do both at the same time + +1:19:29.150,1:19:32.379 +Early computer vision systems had a really hard time doing this + +1:19:40.870,1:19:46.739 +So that's why this kind of stuff is big progress because you don't have to do segmentation in advance, it just... + +1:19:47.679,1:19:52.559 +just train your system to be robust to kind of overlapping objects and things like that. Yes, in the back! + +1:19:55.510,1:19:59.489 +Yes, there is a background class. So when you see a blank response + +1:20:00.340,1:20:04.410 +it means the system says "none of the above" basically, right? So it's been trained + +1:20:05.590,1:20:07.590 +to produce "none of the above" + +1:20:07.690,1:20:11.699 +either when the input is blank or when there is one character that's too + +1:20:13.420,1:20:17.190 +outside of the center or when you have two characters + +1:20:17.620,1:20:24.029 +but there's nothing in the center. 
Or when you have two characters that overlap, but there is no central character, right? So it's... + +1:20:24.760,1:20:27.239 +trying to detect boundaries between characters essentially + +1:20:28.420,1:20:30.420 +Here's another example + +1:20:31.390,1:20:38.640 +This is an example that shows that even a very simple convolutional net with just two stages, right? convolution, pooling, convolution + +1:20:38.640,1:20:40.640 +pooling, and then two layers of... + +1:20:42.010,1:20:44.010 +two more layers afterwards + +1:20:44.770,1:20:47.429 +can solve what's called the feature-binding problem + +1:20:48.130,1:20:50.130 +So visual neuroscientists and + +1:20:50.320,1:20:56.190 +computer vision people had the issue --it was kind of a puzzle-- How is it that + +1:20:57.489,1:21:01.289 +we perceive objects as objects? Objects are collections of features + +1:21:01.290,1:21:04.229 +but how do we bind all the features together of an object to form this object? + +1:21:06.460,1:21:09.870 +Is there some kind of magical way of doing this? + +1:21:12.520,1:21:16.589 +And they did... psychologists did experiments like... + +1:21:24.210,1:21:26.210 +draw this and then that + +1:21:28.239,1:21:31.349 +and you perceive the bar as + +1:21:32.469,1:21:39.419 +a single bar because you're used to bars being obstructed by, occluded by other objects + +1:21:39.550,1:21:41.550 +and so you just assume it's an occlusion + +1:21:44.410,1:21:47.579 +And then there are experiments that figure out how much do I have to + +1:21:48.430,1:21:52.109 +shift the two bars to make me perceive them as two separate bars + +1:21:53.980,1:21:56.580 +But in fact, the minute they perfectly line and if you... + +1:21:57.250,1:21:59.080 +if you do this.. + +1:21:59.080,1:22:03.809 +maybe exactly identical to what you see here, but now you perceive them as two different objects + +1:22:06.489,1:22:12.929 +So how is it that we seem to be solving the feature-binding problem? + +1:22:15.880,1:22:21.450 +And what this shows is that you don't need any specific mechanism for it. It just happens + +1:22:22.210,1:22:25.919 +If you have enough nonlinearities and you train with enough data + +1:22:26.440,1:22:33.359 +then, as a side effect, you get a system that solves the feature-binding problem without any particular mechanism for it + +1:22:37.510,1:22:40.260 +So here you have two shapes and you move a single + +1:22:43.060,1:22:50.519 +stroke and it goes from a six and a one, to a three, to a five and a one, to a seven and a three + +1:22:53.140,1:22:55.140 +Etcetera + +1:23:00.020,1:23:07.480 +Right, good question. So the question is: how do you distinguish between the two situations? We have two fives next to each other and + +1:23:08.270,1:23:14.890 +the fact that you have a single five being detected by two different frames, right? Two different framing of that five + +1:23:15.470,1:23:17.470 +Well there is this explicit + +1:23:17.660,1:23:20.050 +training so that when you have two characters that + +1:23:20.690,1:23:25.029 +are touching and none of them is really centered you train the system to say "none of the above", right? + +1:23:25.030,1:23:29.079 +So it's always going to have five blank five + +1:23:30.020,1:23:35.800 +It's always gonna have even like one blank one, and the ones can be very close. It will you'll tell you the difference + +1:23:39.170,1:23:41.289 +Okay, so what are convnets good for? 
+ +1:24:04.970,1:24:07.599 +So what you have to look at is this + +1:24:11.510,1:24:13.510 +Every layer here is a convolution + +1:24:13.610,1:24:15.020 +Okay? + +1:24:15.020,1:24:21.070 +Including the last layer, so it looks like a full connection because every unit in the second layer goes into the output + +1:24:21.070,1:24:24.460 +But in fact, it is a convolution, it just happens to be applied to a single location + +1:24:24.950,1:24:31.300 +So now imagine that this layer at the top here now is bigger, okay? Which is represented here + +1:24:32.840,1:24:34.130 +Okay? + +1:24:34.130,1:24:37.779 +Now the size of the kernel is the size of the image you had here previously + +1:24:37.820,1:24:43.360 +But now it's a convolution that has multiple locations, right? And so what you get is multiple outputs + +1:24:46.430,1:24:55.100 +That's right, that's right. Each of which corresponds to a classification over an input window of size 32 by 32 in the example I showed + +1:24:55.100,1:25:02.710 +And those windows are shifted by 4 pixels. The reason being that the network architecture I showed + +1:25:04.280,1:25:11.739 +here has a convolution with stride one, then pooling with stride two, convolution with stride one, pooling with stride two + +1:25:13.949,1:25:17.178 +And so the overall stride is four, right? + +1:25:18.719,1:25:22.788 +And so to get a new output you need to shift the input window by four + +1:25:24.210,1:25:29.509 +to get one of those because of the two pooling layers with... + +1:25:31.170,1:25:35.480 +Maybe I should be a little more explicit about this. Let me draw a picture, that would be clearer + +1:25:39.929,1:25:43.848 +So you have an input + +1:25:49.110,1:25:53.749 +like this... a convolution, let's say a convolution of size three + +1:25:57.420,1:25:59.420 +Okay? Yeah with stride one + +1:26:01.289,1:26:04.518 +Okay, I'm not gonna draw all of them, then you have + +1:26:05.460,1:26:11.389 +pooling with subsampling of size two, so you pool over 2 and you subsample, the stride is 2, so you shift by two + +1:26:12.389,1:26:14.389 +No overlap + +1:26:18.550,1:26:25.060 +Okay, so here the input is this size --one two, three, four, five, six, seven, eight + +1:26:26.150,1:26:29.049 +because the convolution is of size three you get + +1:26:29.840,1:26:31.840 +an output here of size six and + +1:26:32.030,1:26:39.010 +then when you do pooling with subsampling with stride two, you get three outputs because that divides the output by two, okay? + +1:26:39.880,1:26:41.880 +Let me add another one + +1:26:43.130,1:26:45.130 +Actually two + +1:26:46.790,1:26:48.790 +Okay, so now the output is ten + +1:26:50.030,1:26:51.680 +This guy is eight + +1:26:51.680,1:26:53.680 +This guy is four + +1:26:54.260,1:26:56.409 +I can do convolutions now also + +1:26:57.650,1:26:59.650 +Let's say three + +1:27:01.400,1:27:03.400 +I only get two outputs + +1:27:04.490,1:27:06.490 +Okay? Oops! + +1:27:07.040,1:27:10.820 +Hmm not sure why it doesn't... draw + +1:27:10.820,1:27:13.270 +Doesn't wanna draw anymore, that's interesting + +1:27:17.060,1:27:19.060 +Aha! + +1:27:24.110,1:27:26.380 +It doesn't react to clicks, that's interesting + +1:27:34.460,1:27:39.609 +Okay, not sure what's going on! Oh "xournal" is not responding + +1:27:41.750,1:27:44.320 +All right, I guess it crashed on me + +1:27:46.550,1:27:48.550 +Well, that's annoying + +1:27:53.150,1:27:55.150 +Yeah, definitely crashed + +1:28:02.150,1:28:04.150 +And, of course, it forgot it, so... 
+ +1:28:09.860,1:28:12.760 +Okay, so we have ten, then eight + +1:28:15.230,1:28:20.470 +because of convolution with three, then we have pooling + +1:28:22.520,1:28:24.520 +of size two with + +1:28:26.120,1:28:28.120 +stride two, so we get four + +1:28:30.350,1:28:36.970 +Then we have convolution with three so we get two, okay? And then maybe pooling again + +1:28:38.450,1:28:42.700 +of size two and subsampling two, we get one. Okay, so... + +1:28:44.450,1:28:46.869 +ten input, eight + +1:28:49.370,1:28:53.079 +four, two, and... + +1:28:58.010,1:29:03.339 +then one for the pooling. This is convolution three, you're right + +1:29:06.500,1:29:08.500 +This is two + +1:29:09.140,1:29:11.140 +And those are three + +1:29:12.080,1:29:14.080 +Etcetera. Right. Now, let's assume + +1:29:14.540,1:29:17.860 +I add a few units here + +1:29:18.110,1:29:21.010 +Okay? So that's going to add, let's say + +1:29:21.890,1:29:24.160 +four units here, two units here + +1:29:27.620,1:29:29.620 +Then... + +1:29:41.190,1:29:42.840 +Yeah, this one is + +1:29:42.840,1:29:46.279 +like this and like that so I got four and + +1:29:47.010,1:29:48.960 +I got another one here + +1:29:48.960,1:29:52.460 +Okay? So now I have only one output and by adding four + +1:29:53.640,1:29:55.640 +four inputs here + +1:29:55.830,1:29:58.249 +which is not 14. I got two outputs + +1:29:59.790,1:30:02.090 +Why four? Because I have 2 + +1:30:02.970,1:30:04.830 +stride of 2 + +1:30:04.830,1:30:10.939 +Okay? So the overall subsampling ratio from input to output is 4, it's 2 times 2 + +1:30:13.140,1:30:17.540 +Now this is 12, and this is 6, and this is 4 + +1:30:20.010,1:30:22.010 +So that's a... + +1:30:22.620,1:30:24.620 +demonstration of the fact that + +1:30:24.900,1:30:26.900 +you can increase the size of the input + +1:30:26.900,1:30:32.330 +it will increase the size of every layer, and if you have a layer that has size 1 and it's a convolutional layer + +1:30:32.330,1:30:34.330 +its size is going to be increased + +1:30:42.870,1:30:44.870 +Yes + +1:30:47.250,1:30:52.760 +Change the size of a layer, like, vertically, horizontally? Yeah, so there's gonna be... + +1:30:54.390,1:30:57.950 +So first you have to train for it, if you want the system to have so invariance to size + +1:30:58.230,1:31:03.860 +you have to train it with characters of various sizes. You can do this with data augmentation if your characters are normalized + +1:31:04.740,1:31:06.740 +That's the first thing. Second thing is... + +1:31:08.850,1:31:16.579 +empirically simple convolutional nets are only invariant to size within a factor of... rather small factor, like you can increase the size by + +1:31:17.610,1:31:23.599 +maybe 40 percent or something. I mean change the size about 40 percent plus/minus 20 percent, something like that, right? + +1:31:26.250,1:31:28.250 +Beyond that... + +1:31:28.770,1:31:33.830 +you might have more trouble getting invariance, but people have trained with input... + +1:31:33.980,1:31:38.390 +I mean objects of sizes that vary by a lot. 
So the way to handle this is + +1:31:39.750,1:31:46.430 +if you want to handle variable size, is that if you have an image and you don't know what size the objects are + +1:31:46.950,1:31:50.539 +that are in this image, you apply your convolutional net to that image and + +1:31:51.180,1:31:53.979 +then you take the same image, reduce it by a factor of two + +1:31:54.440,1:31:58.179 +just scale the image by a factor of two, run the same convolutional net on that new image and + +1:31:59.119,1:32:02.949 +then reduce it by a factor of two again, and run the same convolutional net again on that image + +1:32:03.800,1:32:08.110 +Okay? So the first convolutional net will be able to detect small objects within the image + +1:32:08.630,1:32:11.859 +So let's say your network has been trained to detect objects of size... + +1:32:11.860,1:32:16.179 +I don't know, 20 pixels, like faces for example, right? They are 20 pixels + +1:32:16.789,1:32:20.739 +It will detect faces that are roughly 20 pixels within this image and + +1:32:21.320,1:32:24.309 +then when you subsample by a factor of 2 and you apply the same network + +1:32:24.309,1:32:31.209 +it will detect faces that are 20 pixels within the new image, which means there were 40 pixels in the original image + +1:32:32.179,1:32:37.899 +Okay? Which the first network will not see because the face would be bigger than its input window + +1:32:39.170,1:32:41.529 +And then the next network over will detect + +1:32:42.139,1:32:44.409 +faces that are 80 pixels, etc., right? + +1:32:44.659,1:32:49.089 +So then by kind of combining the scores from all of those, and doing something called non-maximum suppression + +1:32:49.090,1:32:51.090 +we can actually do detection and + +1:32:51.230,1:32:57.939 +localization of objects. People use considerably more sophisticated techniques for detection now, and for localization that we'll talk about next week + +1:32:58.429,1:33:00.429 +But that's the basic idea + +1:33:00.920,1:33:02.920 +So let me conclude + +1:33:03.019,1:33:09.429 +What are convnets good for? They're good for signals that come to you in the form of a multi-dimensional array + +1:33:10.190,1:33:12.190 +But that multi-dimensional array has + +1:33:13.190,1:33:17.500 +to have two characteristics at least. The first one is + +1:33:18.469,1:33:23.828 +there is strong local correlations between values. So if you take an image + +1:33:24.949,1:33:32.949 +random image, take two pixels within this image, two pixels that are nearby. Those two pixels are very likely to have very similar colors + +1:33:33.530,1:33:38.199 +Take a picture of this class, for example, two pixels on the wall basically have the same color + +1:33:39.469,1:33:42.069 +Okay? It looks like there is a ton of objects here, but + +1:33:43.280,1:33:49.509 +--animate objects-- but in fact mostly, statistically, neighboring pixels are essentially the same color + +1:33:52.699,1:34:00.129 +As you move the distance from two pixels away and you compute the statistics of how similar pixels are as a function of distance + +1:34:00.650,1:34:02.650 +they're less and less similar + +1:34:03.079,1:34:05.079 +So what does that mean? 
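A minimal sketch of the multi-scale scheme just described: run the same convolution-only network on the image, then on the image reduced by a factor of two, then by four, so that objects roughly twice or four times the trained size become detectable at the coarser scales. `net` here stands for any fully convolutional model (like the sketch earlier); combining the resulting score maps plus non-maximum suppression is left out.

```python
import torch
import torch.nn.functional as F

def multiscale_outputs(net, image, n_scales=3):
    """Apply the same convolutional net to the image at several scales."""
    outputs = []
    for _ in range(n_scales):
        outputs.append(net(image))                 # score map for objects near the trained size at this scale
        image = F.interpolate(image, scale_factor=0.5,
                              mode='bilinear', align_corners=False)  # reduce the image by a factor of two
    return outputs

# Usage (assuming `net` is convolution-only and `img` has shape [1, C, H, W]):
#   maps = multiscale_outputs(net, img)
# Each entry of `maps` is a spatial grid of detection scores at one scale.
```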
Because + +1:34:06.350,1:34:09.430 +nearby pixels are likely to have similar colors + +1:34:09.560,1:34:14.499 +that means that when you take a patch of pixels, say five by five, or eight by eight or something + +1:34:16.040,1:34:18.040 +The type of patch you're going to observe + +1:34:18.920,1:34:21.159 +is very likely to be kind of a smoothly varying + +1:34:21.830,1:34:23.830 +color or maybe with an edge + +1:34:24.770,1:34:32.080 +But among all the possible combinations of 25 pixels, the ones that you actually observe in natural images is a tiny subset + +1:34:34.130,1:34:38.380 +What that means is that it's advantageous to represent the content of that patch + +1:34:39.440,1:34:46.509 +by a vector with perhaps less than 25 values that represent the content of that patch. Is there an edge, is it uniform? + +1:34:46.690,1:34:48.520 +What color is it? You know things like that, right? + +1:34:48.520,1:34:52.660 +And that's basically what the convolutions in the first layer of a convolutional net are doing + +1:34:53.900,1:34:58.809 +Okay. So if you have local correlations, there is an advantage in detecting local features + +1:34:59.090,1:35:01.659 +That's what we observe in the brain. That's what convolutional nets are doing + +1:35:03.140,1:35:08.140 +This idea of locality. If you feed a convolutional net with permuted pixels + +1:35:09.020,1:35:15.070 +it's not going to be able to do a good job at recognizing your images, even if the permutation is fixed + +1:35:17.030,1:35:19.960 +Right? A fully connected net doesn't care + +1:35:21.410,1:35:23.410 +about permutations + +1:35:25.700,1:35:28.240 +Then the second characteristics is that + +1:35:30.050,1:35:34.869 +features that are important may appear anywhere on the image. So that's what justifies shared weights + +1:35:35.630,1:35:38.499 +Okay? The local correlation justifies local connections + +1:35:39.560,1:35:46.570 +The fact that features can appear anywhere, that the statistics of images or the signal is uniform + +1:35:47.810,1:35:52.030 +means that you need to have repeated feature detectors for every location + +1:35:52.850,1:35:54.850 +And that's where shared weights + +1:35:55.880,1:35:57.880 +come into play + +1:36:01.990,1:36:06.059 +It does justify the pooling because the pooling is if you want invariance to + +1:36:06.760,1:36:11.400 +variations in the location of those characteristic features. And so if the objects you're trying to recognize + +1:36:12.340,1:36:16.619 +don't change their nature by kind of being slightly distorted then you want pooling + +1:36:21.160,1:36:24.360 +So people have used convnets for cancer stuff, image video + +1:36:25.660,1:36:31.019 +text, speech. So speech actually is pretty... speech recognition convnets are used a lot + +1:36:32.260,1:36:34.380 +Time series prediction, you know things like that + +1:36:36.220,1:36:42.030 +And you know biomedical image analysis, so if you want to analyze an MRI, for example + +1:36:42.030,1:36:44.030 +MRI or CT scan is a 3d image + +1:36:44.950,1:36:49.170 +As humans we can't because we don't have a good visualization technology. 
We can't really + +1:36:49.960,1:36:54.960 +apprehend or understand a 3d volume, a 3-dimensional image + +1:36:55.090,1:36:58.709 +But a convnet is fine, feed it a 3d image and it will deal with it + +1:36:59.530,1:37:02.729 +That's a big advantage because you don't have to go through slices to kind of figure out + +1:37:04.000,1:37:06.030 +the object in the image + +1:37:10.390,1:37:15.300 +And then the last thing here at the bottom, I don't know if you guys know where hyperspectral images are + +1:37:15.300,1:37:19.139 +So hyperspectral image is an image where... most natural color images + +1:37:19.140,1:37:22.619 +I mean images that you collect with a normal camera you get three color components + +1:37:23.470,1:37:25.390 +RGB + +1:37:25.390,1:37:28.019 +But we can build cameras with way more + +1:37:28.660,1:37:30.660 +spectral bands than this and + +1:37:31.510,1:37:34.709 +that's particularly the case for satellite imaging where some + +1:37:36.160,1:37:40.920 +cameras have many spectral bands going from infrared to ultraviolet and + +1:37:41.890,1:37:44.610 +that gives you a lot of information about what you see in each pixel + +1:37:45.760,1:37:47.040 +Some tiny animals + +1:37:47.040,1:37:54.930 +that have small brains find it easier to process hyperspectral images of low resolution than high resolution images with just three colors + +1:37:55.750,1:38:00.450 +For example, there's a particular type of shrimp, right? They have those beautiful + +1:38:01.630,1:38:07.499 +eyes and they have like 17 spectral bands or something, but super low resolution and they have a tiny brain to process it + +1:38:09.770,1:38:12.850 +Okay, that's all for today. See you! diff --git a/docs/pt/week03/practicum03.sbv b/docs/pt/week03/practicum03.sbv new file mode 100644 index 000000000..79126d43e --- /dev/null +++ b/docs/pt/week03/practicum03.sbv @@ -0,0 +1,1751 @@ +0:00:00.020,0:00:07.840 +So convolutional neural networks, I guess today I so foundations me, you know, I post nice things on Twitter + +0:00:09.060,0:00:11.060 +Follow me. I'm just kidding + +0:00:11.290,0:00:16.649 +Alright. So again anytime you have no idea what's going on. Just stop me ask questions + +0:00:16.900,0:00:23.070 +Let's make these lessons interactive such that I can try to please you and provide the necessary information + +0:00:23.980,0:00:25.980 +For you to understand what's going on? + +0:00:26.349,0:00:27.970 +alright, so + +0:00:27.970,0:00:31.379 +Convolutional neural networks. How cool is this stuff? 
Very cool + +0:00:32.439,0:00:38.699 +mostly because before having convolutional nets we couldn't do much and we're gonna figure out why now + +0:00:39.850,0:00:43.800 +how why why and how these networks are so powerful and + +0:00:44.379,0:00:48.329 +They are going to be basically making they are making like a very large + +0:00:48.879,0:00:52.859 +Chunk of like the whole networks are used these days + +0:00:53.980,0:00:55.300 +so + +0:00:55.300,0:01:02.369 +More specifically we are gonna get used to repeat several times those three words, which are the key words for understanding + +0:01:02.920,0:01:05.610 +Convolutions, but we are going to be figuring out that soon + +0:01:06.159,0:01:09.059 +so let's get started and figuring out how + +0:01:09.580,0:01:11.470 +these + +0:01:11.470,0:01:13.470 +signals these images and these + +0:01:13.990,0:01:17.729 +different items look like so whenever we talk about + +0:01:18.670,0:01:21.000 +signals we can think about them as + +0:01:21.580,0:01:23.200 +vectors for example + +0:01:23.200,0:01:30.600 +We have there a signal which is representing a monophonic audio signal so given that is only + +0:01:31.180,0:01:38.339 +We have only the temporal dimension going in like the signal happens over one dimension, which is the temporal dimension + +0:01:38.560,0:01:46.079 +This is called 1d signal and can be represented by a singular vector as is shown up up there + +0:01:46.750,0:01:48.619 +each + +0:01:48.619,0:01:52.389 +Value of that vector represents the amplitude of the wave form + +0:01:53.479,0:01:56.589 +for example, if you have just a sign you're going to be just hearing like + +0:01:57.830,0:01:59.830 +Like some sound like that + +0:02:00.560,0:02:05.860 +If you have like different kind of you know, it's not just a sign a sign you're gonna hear + +0:02:06.500,0:02:08.500 +different kind of Timbers or + +0:02:09.200,0:02:11.200 +different kind of + +0:02:11.360,0:02:13.190 +different kind of + +0:02:13.190,0:02:15.190 +flavor of the sound + +0:02:15.440,0:02:18.190 +Moreover you're familiar. How sound works, right? So + +0:02:18.709,0:02:21.518 +Right now I'm just throwing air through my windpipe + +0:02:22.010,0:02:26.830 +where there are like some membranes which is making the air vibrate these the + +0:02:26.930,0:02:33.640 +Vibration propagates through the air there are going to be hitting your ears and the ear canal you have inside some little + +0:02:35.060,0:02:38.410 +you have likely cochlea right and then given about + +0:02:38.989,0:02:45.159 +How much the sound propagates through the cochlea you're going to be detecting the pitch and then by adding different pitch + +0:02:45.830,0:02:49.119 +information you can and also like different kind of + +0:02:50.090,0:02:53.350 +yeah, I guess speech information you're going figure out what is the + +0:02:53.930,0:02:59.170 +Sound I was making over here and then you reconstruct that using your language model you have in your brain + +0:02:59.170,0:03:03.369 +Right and the same thing Yun was mentioning if you start speaking another language + +0:03:04.310,0:03:11.410 +then you won't be able to parse the information because you're using both a speech model like a conversion between + +0:03:12.019,0:03:17.709 +Vibrations and like, you know signal your brain plus the language model in order to make sense + +0:03:18.709,0:03:22.629 +Anyhow, that was a 1d signal. Let's say I'm listening to music so + +0:03:23.570,0:03:25.570 +What kind of signal do I? 
+ +0:03:25.910,0:03:27.910 +have there + +0:03:28.280,0:03:34.449 +So if I listen to music user is going to be a stare of stereophonic, right? So it means you're gonna have how many channels? + +0:03:35.420,0:03:37.420 +Two channels, right? + +0:03:37.519,0:03:38.570 +nevertheless + +0:03:38.570,0:03:41.019 +What type of signal is gonna be this one? + +0:03:41.150,0:03:46.420 +It's still gonna be one this signal although there are two channels so you can think about you know + +0:03:46.640,0:03:54.459 +regardless of how many chanted channels like if you had Dolby Surround you're gonna have what 5.1 so six I guess so, that's the + +0:03:55.050,0:03:56.410 +You know + +0:03:56.410,0:03:58.390 +vectorial the + +0:03:58.390,0:04:02.790 +size of the signal and then the time is the only variable which is + +0:04:03.820,0:04:07.170 +Like moving forever. Okay. So those are 1d signals + +0:04:09.430,0:04:13.109 +All right, so let's have a look let's zoom in a little bit so + +0:04:14.050,0:04:18.420 +We have it. For example on the left hand side. We have something that looks like a sinusoidal + +0:04:19.210,0:04:25.619 +function here nevertheless a little bit after you're gonna have again the same type of + +0:04:27.280,0:04:29.640 +Function appearing again, so this is called + +0:04:30.460,0:04:37.139 +Stationarity you're gonna see over and over and over again the same type of pattern across the temporal + +0:04:37.810,0:04:39.810 +Dimension, okay + +0:04:40.090,0:04:47.369 +So the first property of this signal which is our natural signal because it happens in nature is gonna be we said + +0:04:49.330,0:04:51.330 +Stationarity, okay. That's the first one + +0:04:51.580,0:04:53.580 +Moreover what do you think? + +0:04:54.130,0:04:56.130 +How likely is? + +0:04:56.140,0:05:00.989 +If I have a peak on the left hand side to have a peak also very nearby + +0:05:03.430,0:05:09.510 +So how likely is to have a peak there rather than having a peak there given that you had a peak before or + +0:05:09.610,0:05:11.590 +if I keep going + +0:05:11.590,0:05:18.119 +How likely is you have a peak, you know few seconds later given that you have a peak on the left hand side. So + +0:05:19.960,0:05:24.329 +There should be like some kind of common sense common knowledge perhaps that + +0:05:24.910,0:05:27.390 +If you are close together and if you are + +0:05:28.000,0:05:33.360 +Close to the left hand side is there's gonna be a larger probability that things are gonna be looking + +0:05:33.880,0:05:40.589 +Similar, for example you have like a specific sound will have a very kind of specific shape + +0:05:41.170,0:05:43.770 +But then if you go a little bit further away from that sound + +0:05:44.050,0:05:50.010 +then there's no relation anymore about what happened here given what happened before and so if you + +0:05:50.410,0:05:55.170 +Compute the cross correlation between a signal and itself, do you know what's a cross correlation? 
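For the cross-correlation question just raised (and the "homework" that follows), a small numpy sketch: correlating a signal with itself gives a peak at zero lag whose envelope decays as the lag grows, which is one way to see the locality property of a 1D signal.

```python
import numpy as np

# A toy 1D "audio" signal: a decaying oscillation plus a little noise.
t = np.linspace(0, 1, 1000)
x = np.sin(2 * np.pi * 50 * t) * np.exp(-5 * t) + 0.1 * np.random.randn(t.size)

# Cross-correlation of the signal with itself (autocorrelation):
# slide one copy over the other and take dot products at every shift.
ac = np.correlate(x, x, mode='full')
lags = np.arange(-x.size + 1, x.size)

print(lags[np.argmax(ac)])   # 0: the largest overlap is at zero shift
# As |lag| grows, the envelope of the correlation falls off: nearby samples
# are related, far-away samples are nearly independent -- locality.
```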
+ +0:05:57.070,0:06:02.670 +Do know like if you don't know okay how many hands up who doesn't know a cross correlation + +0:06:04.360,0:06:07.680 +Okay fine, so that's gonna be homework for you + +0:06:07.680,0:06:14.489 +If you take one signal just a signal audio signal they perform convolution of that signal with itself + +0:06:14.650,0:06:15.330 +Okay + +0:06:15.330,0:06:19.680 +and so convolution is going to be you have your own signal you take the thing you flip it and then you + +0:06:20.170,0:06:22.170 +pass it across and then you multiply + +0:06:22.390,0:06:25.019 +Whenever you're gonna have them overlaid in the same + +0:06:25.780,0:06:27.780 +Like when there is zero + +0:06:28.450,0:06:33.749 +Misalignment you're gonna have like a spike. And then as you start moving around you're gonna have basically two decaying + +0:06:34.360,0:06:36.930 +sides that represents the fact that + +0:06:37.990,0:06:44.850 +Things have much things in common basically performing a dot product right? So things that have much in common when they are + +0:06:45.370,0:06:47.970 +Very close to one specific location + +0:06:47.970,0:06:55.919 +If you go further away things start, you know averaging out. So here the second property of this natural signal is locality + +0:06:56.500,0:07:04.470 +Information is contained in specific portion and parts of the in this case temporal domain. Okay. So before we had + +0:07:06.940,0:07:08.940 +Stationarity now we have + +0:07:09.640,0:07:11.640 +Locality alright don't + +0:07:12.160,0:07:17.999 +Bless you. All, right. So how about this one right? This is completely unrelated to what happened over there + +0:07:20.110,0:07:24.960 +Okay, so let's look at the nice little kitten what kind of + +0:07:25.780,0:07:27.070 +dimensions + +0:07:27.070,0:07:31.200 +What kind of yeah what dimension has this signal? What was your guess? + +0:07:32.770,0:07:34.829 +It's a 2 dimensional signal why is that + +0:07:39.690,0:07:45.469 +Okay, we have also a three-dimensional signal option here so someone said two dimensions someone said three dimensions + +0:07:47.310,0:07:51.739 +It's two-dimensional why is that sorry noise? Why is two-dimensional + +0:07:54.030,0:07:56.030 +Because the information is + +0:07:58.050,0:08:00.050 +Sorry the information is + +0:08:00.419,0:08:01.740 +especially + +0:08:01.740,0:08:03.740 +Depicted right? So the information + +0:08:03.750,0:08:05.310 +is + +0:08:05.310,0:08:08.450 +Basically encoded in the spatial location of those points + +0:08:08.760,0:08:15.439 +Although each point is a vector for example of three or if it's a hyper spectral image. It can be several planes + +0:08:16.139,0:08:23.029 +Nevertheless you still you still have two directions in which points can move right? The thickness doesn't change + +0:08:24.000,0:08:27.139 +across like in the thicknesses of a given space + +0:08:27.139,0:08:33.408 +Right so given thickness and it doesn't change right so you can have as many, you know planes as you want + +0:08:33.409,0:08:35.409 +but the information is basically + +0:08:35.640,0:08:41.779 +It's a spatial information is spread across the plane. 
So these are two dimensional data you can also + +0:08:50.290,0:08:53.940 +Okay, I see your point so like a wide image or a + +0:08:54.910,0:08:56.350 +grayscale image + +0:08:56.350,0:08:58.350 +It's definitely a 2d + +0:08:58.870,0:09:04.169 +Signal and also it can be represented by using a tensor of two dimensions + +0:09:04.870,0:09:07.739 +A color image has RGB planes + +0:09:08.350,0:09:14.550 +but the thickness is always three doesn't change and the information is still spread across the + +0:09:15.579,0:09:21.839 +Other two dimensions so you can change the size of a color image, but you won't change the thickness of a color image, right? + +0:09:22.870,0:09:28.319 +So we are talking about here. The dimension of the signal is how is the information? + +0:09:29.470,0:09:31.680 +Basically spread around right in the temporal information + +0:09:31.959,0:09:38.789 +If you have Dolby Surround mono mono signal or you have a stereo we still have over time, right? + +0:09:38.790,0:09:41.670 +So it's one dimensional images are 2d + +0:09:42.250,0:09:44.759 +so let's have a look to the little nice kitten and + +0:09:45.519,0:09:47.909 +Let's focus on the on the nose, right? Oh + +0:09:48.579,0:09:50.579 +My god, this is a monster. No + +0:09:50.949,0:09:52.949 +Okay. Nice big + +0:09:53.649,0:09:55.948 +Creature here, right? Okay, so + +0:09:56.740,0:10:03.690 +We observe there and there is some kind of dark region nearby the eye you can observe that kind of seeing a pattern + +0:10:04.329,0:10:09.809 +Appear over there, right? So what is this property of natural signals? I + +0:10:12.699,0:10:18.239 +Told you two properties, this is stationarity. Why is this stationarity? + +0:10:22.029,0:10:29.129 +Right, so the same pattern appears over and over again across the dimensionality in this case the dimension is two dimension. Sorry + +0:10:30.220,0:10:36.600 +Moreover, what is the likelihood that given that the color in the pupil is black? What is the likelihood that? + +0:10:37.149,0:10:42.448 +The pixel on the arrow or like on the tip of the arrow is also black + +0:10:42.449,0:10:47.879 +I would say it's quite likely right because it's very close. How about that point? + +0:10:48.069,0:10:51.899 +Yeah, kind of less likely right if I keep clicking + +0:10:52.480,0:10:59.649 +You know, it's completely it's bright. No, no the other pics in right so is further you go in spacial dimension + +0:11:00.290,0:11:06.879 +The less less likely you're gonna have, you know similar information. And so this is called + +0:11:08.629,0:11:10.629 +Locality which means + +0:11:12.679,0:11:16.269 +There's a higher likelihood for things to have if like + +0:11:16.549,0:11:22.509 +The information is like containers in a specific region as you move around things get much much more + +0:11:24.649,0:11:26.649 +You know independent + +0:11:27.199,0:11:32.529 +Alright, so we have two properties. The third property is gonna be the following. What is this? + +0:11:33.829,0:11:35.829 +Are you hungry? + +0:11:37.579,0:11:41.769 +So you can see here some donuts right no donuts how you called + +0:11:42.649,0:11:44.230 +Bagels, right? All right + +0:11:44.230,0:11:51.009 +So for the you the the one of you which have glasses take your glasses off and now answer my question + +0:11:53.179,0:11:55.179 +Okay + +0:11:59.210,0:12:01.210 +So the third property + +0:12:02.210,0:12:07.059 +It's compositionality right and so compositionality means that the + +0:12:07.880,0:12:10.119 +Word is actually explainable, right? 
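Coming back to the dimensionality point above, a tiny sketch of why a colour image is still a 2D signal: it is stored as a 3 × height × width tensor, but the convolution only slides over the two spatial dimensions; the three RGB planes are the fixed "thickness" and are covered all at once.

```python
import torch
import torch.nn as nn

image = torch.randn(3, 64, 48)     # RGB image: thickness 3, information spread over 64 x 48

conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3)
out = conv(image.unsqueeze(0))     # add a batch dimension

print(out.shape)                   # torch.Size([1, 8, 62, 46]): the kernel slid over height and width only
```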
+ +0:12:11.060,0:12:13.060 +okay, you enjoy the + +0:12:15.830,0:12:20.199 +The thing okay, you gotta get back to me right? I just try to keep your life + +0:12:26.180,0:12:28.100 +Hello + +0:12:28.100,0:12:33.520 +Okay. So for the one that doesn't have glasses ask the friend who has glasses and try them on. Okay now + +0:12:34.430,0:12:36.430 +Don't do it if it's not good + +0:12:37.010,0:12:43.659 +I'm just kidding. You can squint just queen don't don't don't use other people glasses. Okay? + +0:12:44.990,0:12:46.990 +Question. Yeah + +0:12:50.900,0:12:52.130 +So + +0:12:52.130,0:12:57.489 +Stationerity means you observe the same kind of pattern over and over again your data + +0:12:58.160,0:13:01.090 +Locality means that pattern are just localized + +0:13:01.820,0:13:08.109 +So you have some specific information here some information here information here as you move away from this point + +0:13:08.270,0:13:10.270 +this other value is gonna be + +0:13:10.760,0:13:11.780 +almost + +0:13:11.780,0:13:15.249 +Independent from the value of this point here. So things are correlated + +0:13:15.860,0:13:17.860 +Only within a neighborhood, okay + +0:13:19.910,0:13:27.910 +Okay, everyone has been experimenting now squinting and looking at this nice picture, okay. So this is the third part which is compositionality + +0:13:28.730,0:13:32.289 +Here you can tell how you can actually see something + +0:13:33.080,0:13:35.080 +If you blur it a little bit + +0:13:35.810,0:13:39.250 +because again things are made of small parts and you can actually + +0:13:40.010,0:13:42.429 +You know compose things in this way + +0:13:43.400,0:13:47.829 +anyhow, so these are the three main properties of natural signals, which + +0:13:48.650,0:13:50.650 +allow us to + +0:13:51.260,0:13:55.960 +Can be exploited for making, you know, a design of our architecture, which is more + +0:13:56.600,0:14:00.880 +Actually prone to extract information that has these properties + +0:14:00.880,0:14:05.169 +Okay, so we are just talking now about signals that exhibits those properties + +0:14:07.730,0:14:11.500 +Finally okay. There was the last one which I didn't talk so + +0:14:12.890,0:14:18.159 +We had the last one here. We have an English sentence, right John picked up the apple + +0:14:18.779,0:14:22.818 +whatever and here again, you can represent each word as + +0:14:23.399,0:14:26.988 +One vector, for example each of those items. It can be a + +0:14:27.869,0:14:30.469 +Vector which has a 1 in correspondent + +0:14:31.110,0:14:35.329 +Correspondence to the position of where that word happens to be in a dictionary, okay + +0:14:35.329,0:14:39.709 +so if you have a dictionary of 10,000 words, you can just check whatever is the + +0:14:40.679,0:14:44.899 +The word on this dictionary you just put the page plus the whatever number + +0:14:45.629,0:14:50.599 +Like you just figured that the position of the page in the dictionary. So also language + +0:14:51.899,0:14:56.419 +Has those kind of properties things that are close by have, you know + +0:14:56.420,0:15:01.069 +Some kind of relationship things away are not less unless you know + +0:15:01.470,0:15:05.149 +Correlated and then similar patterns happen over and over again over + +0:15:05.819,0:15:12.558 +Moreover, you can use you know words make sentences to make full essays and to make finally your write-ups for the + +0:15:12.839,0:15:16.008 +Sessions. I'm just kidding. Okay. All right, so + +0:15:17.429,0:15:19.789 +We already seen this one. 
So I'm gonna be going quite fast + +0:15:20.759,0:15:28.279 +there shouldn't be any I think questions because also we have everything written down on the website, right so you can always check the + +0:15:28.860,0:15:30.919 +summaries of the previous lesson on the website + +0:15:32.040,0:15:39.349 +So fully connected layer. So this actually perhaps is a new version of the diagram. This is my X,Y is at the bottom + +0:15:42.089,0:15:49.698 +Low level features. What's the color of the decks? Pink. Okay good. All right, so we have an arrow which represents my + +0:15:51.299,0:15:54.439 +Yeah, fine that's the proper term, but I like to call them + +0:15:55.410,0:16:02.299 +Rotations and then there is some squashing right? squashing means the non-linearity then I have my hidden layer then I have another + +0:16:04.379,0:16:06.379 +Rotation and a final + +0:16:06.779,0:16:12.888 +Squashing. Okay. It's not necessary. Maybe can be a linear, you know final transformation like a linear + +0:16:14.520,0:16:18.059 +Whatever function they're like if you do if you perform a regression task + +0:16:19.750,0:16:21.750 +There you have the equations, right + +0:16:22.060,0:16:24.060 +And those guys can be any of those + +0:16:24.610,0:16:26.260 +nonlinear functions or + +0:16:26.260,0:16:33.239 +Even a linear function right if you perform regression once more and so you can write down these layers where I expand + +0:16:33.240,0:16:39.510 +So this guy here the the bottom guy is actually a vector and I represent the vector G with just one pole there + +0:16:39.510,0:16:42.780 +I just show you all the five items elements of that vector + +0:16:43.030,0:16:45.239 +So you have the X the first layer? + +0:16:45.370,0:16:50.520 +Then you have the first hidden second hidden third hit and the last layer so we have how many layers? + +0:16:53.590,0:16:55.240 +Five okay + +0:16:55.240,0:16:56.950 +And then you can also call them + +0:16:56.950,0:17:03.689 +activation layer 1 layer 2 3 4 whatever and then the matrices are where you store your + +0:17:03.970,0:17:10.380 +Parameters you have those different W's and then in order to get each of those values you already seen the stuff, right? + +0:17:10.380,0:17:17.280 +So I go quite faster you perform just the scalar product. Which means you just do that thing + +0:17:17.860,0:17:23.400 +You get all those weights. I multiply the input for each of those weights and you keep going like that + +0:17:24.490,0:17:28.920 +And then you store those weights in those matrices and so on. So as you can tell + +0:17:30.700,0:17:37.019 +There is a lot of arrows right and regardless of the fact that I spent too many hours doing that drawing + +0:17:38.200,0:17:43.649 +This is also like very computationally expensive because there are so many computations right each arrow + +0:17:44.350,0:17:46.350 +represents a weight which you have to multiply + +0:17:46.960,0:17:49.110 +for like by its own input + +0:17:49.870,0:17:51.870 +so + +0:17:52.090,0:17:53.890 +What can we do now? + +0:17:53.890,0:17:55.150 +so + +0:17:55.150,0:17:57.150 +given that our information is + +0:17:57.700,0:18:04.679 +Has locality. No our data has this locality as a property. What does it mean if I had something here? + +0:18:05.290,0:18:07.290 +Do I care what's happening here? 
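A sketch of the fully connected picture above in code: each layer is a "rotation" (a matrix multiply) followed by a "squashing" (a pointwise non-linearity), and every output unit is connected to every input unit, which is where all those arrows and multiplications come from. The sizes match the 5-input, 3-output drawing.

```python
import torch
import torch.nn as nn

x = torch.randn(5)                  # input layer with 5 units

layer = nn.Linear(5, 3)             # "rotation": a 3x5 weight matrix plus a bias
h = torch.relu(layer(x))            # "squashing": pointwise non-linearity

print(h.shape)                      # torch.Size([3])
print(layer.weight.shape)           # torch.Size([3, 5]): 15 multiplications, one arrow per weight
print(sum(p.numel() for p in layer.parameters()))   # 18 parameters (15 weights + 3 biases)
```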
+ +0:18:09.460,0:18:12.540 +So some of you are just shaking the hand and the rest of + +0:18:13.000,0:18:17.219 +You are kind of I don't know not responsive and I have to ping you + +0:18:18.140,0:18:18.900 +so + +0:18:18.900,0:18:25.849 +We have locality, right? So things are just in specific regions. You actually care to look about far away + +0:18:27.030,0:18:28.670 +No, okay. Fantastic + +0:18:28.670,0:18:32.119 +So let's simply drop some connections, right? + +0:18:32.130,0:18:38.660 +So here we go from layer L-1 to the layer L by using the first, you know five + +0:18:39.570,0:18:45.950 +Ten and fifteen, right? Plus I have the last one here to from the layer L to L+1 + +0:18:45.950,0:18:48.529 +I have three more right so in total we have + +0:18:50.550,0:18:53.089 +Eighteen weights computations, right + +0:18:53.760,0:18:55.760 +so, how about we + +0:18:56.370,0:19:01.280 +Drop the things that we don't care, right? So like let's say for this neuron, perhaps + +0:19:01.830,0:19:04.850 +Why do we have to care about those guys there on the bottom, right? + +0:19:05.160,0:19:08.389 +So, for example, I can just use those three weights, right? + +0:19:08.390,0:19:12.770 +I just forget about the other two and then again, I just use those three weights + +0:19:12.770,0:19:15.229 +I skip the first and the last and so on + +0:19:16.170,0:19:23.570 +Okay. So right now we have just nine connections now just now nine multiplications and finally three more + +0:19:24.360,0:19:28.010 +so as we go from the left hand side to the right hand side we + +0:19:28.920,0:19:32.149 +Climb the hierarchy and we're gonna have a larger and larger + +0:19:33.960,0:19:34.790 +View right + +0:19:34.790,0:19:40.879 +so although these green bodies here and don't see the whole input is you keep climbing the + +0:19:41.310,0:19:45.109 +Hierarchy you're gonna be able to see the whole span of the input, right? + +0:19:46.590,0:19:48.590 +so in this case, we're going to be + +0:19:49.230,0:19:55.760 +Defining the RF as receptive field. So my receptive field here from the last + +0:19:56.400,0:20:03.769 +Neuron to the intermediate neuron is three. So what is gonna be? This means that the final neuron sees three + +0:20:04.500,0:20:10.820 +Neurons from the previous layer. So what is the receptive field of the hidden layer with respect to the input layer? + +0:20:14.970,0:20:21.199 +The answer was three. Yeah, correct, but what is now their septic field of the output layer with respect to the input layer + +0:20:23.549,0:20:25.549 +Five right. That's fantastic + +0:20:25.679,0:20:30.708 +Okay, sweet. So right now the whole architecture does see the whole input + +0:20:31.229,0:20:33.229 +while each sub part + +0:20:33.239,0:20:39.019 +Like intermediate layers only sees small regions and this is very nice because you will spare + +0:20:39.239,0:20:46.939 +Computations which are unnecessary because on average they have no whatsoever in information. And so we managed to speed up + +0:20:47.669,0:20:50.059 +The computations that you actually can compute + +0:20:51.119,0:20:53.208 +things in a decent amount of time + +0:20:54.809,0:20:58.998 +Clear so we can talk about sparsity only because + +0:21:02.669,0:21:05.238 +We assume that our data shows + +0:21:06.329,0:21:08.249 +locality, right + +0:21:08.249,0:21:12.708 +Question if my data doesn't show locality. Can I use sparsity? + +0:21:16.139,0:21:19.279 +No, okay fantastic, okay. 
All right + +0:21:20.549,0:21:23.898 +more stuff so we also said that this natural signals are + +0:21:24.209,0:21:28.399 +Stationary and so given that they're stationary things appear over and over again + +0:21:28.399,0:21:34.008 +So maybe we don't have to learn again again the same stuff of all over the time right? So + +0:21:34.679,0:21:37.668 +In this case we said oh we drop those two lines, right? + +0:21:38.729,0:21:41.179 +And so how about we use? + +0:21:41.969,0:21:46.999 +The first connection the oblique one from you know going in down + +0:21:47.549,0:21:52.158 +Make it yellow. So all of those are yellows then these are orange + +0:21:52.859,0:21:57.139 +And then the final one are red, right? So how many weights do I have here? + +0:21:59.639,0:22:01.639 +And I had over here + +0:22:03.089,0:22:05.089 +Nine right and before we had + +0:22:06.749,0:22:09.769 +15 right so we drop from 15 to 3 + +0:22:10.529,0:22:14.958 +This is like a huge reduction and how perhaps now it is actually won't work + +0:22:14.969,0:22:16.759 +So we have to fix that in a bit + +0:22:16.759,0:22:22.368 +But anyhow in this way when I train a network, I just had to train three weights the red + +0:22:22.840,0:22:25.980 +sorry, the yellow orange and red and + +0:22:26.889,0:22:30.959 +It's gonna be actually working even better because it just has to learn + +0:22:31.749,0:22:37.079 +You're gonna have more information you have more data for you know training those specific weights + +0:22:41.320,0:22:48.299 +So those are those three colors the yellow orange and red are gonna be called my kernel and so I stored them + +0:22:48.850,0:22:50.850 +Into a vector over here + +0:22:53.200,0:22:58.679 +And so those if you talk about you know convolutional careness those are simply the weight of these + +0:22:59.200,0:22:59.909 +over here + +0:22:59.909,0:23:04.589 +Right the weights that we are using by using sparsity and then using parameter sharing + +0:23:04.869,0:23:09.629 +Parameter sharing means you use the same parameter over over again across the architecture + +0:23:10.330,0:23:15.090 +So there are the following nice properties of using those two combined + +0:23:15.490,0:23:20.699 +So parameter sharing gives us faster convergence because you're gonna have much more information + +0:23:21.399,0:23:23.549 +To use in order to train these weights + +0:23:24.519,0:23:26.139 +You have a better + +0:23:26.139,0:23:32.008 +Generalization because you don't have to learn every time a specific type of thing that happened in different region + +0:23:32.009,0:23:34.079 +You just learn something. That makes sense + +0:23:34.720,0:23:36.720 +You know globally + +0:23:37.570,0:23:44.460 +Then we also have we are not constrained to the input size this is so important ray also Yann said this thing three times yesterday + +0:23:45.700,0:23:48.029 +Why are we not constrained to the input size? + +0:23:54.039,0:24:00.449 +Because we can keep shifting in over right before in these other case if you have more neurons you have to learn new stuff + +0:24:00.450,0:24:06.210 +Right, in this case. 
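A sketch of sparsity plus parameter sharing as just described: instead of the full 3 × 5 weight matrix (15 weights), a 1D convolution with a kernel of size 3 reuses the same three weights (the "yellow, orange, red" ones) at every position, and each output only looks at a local neighbourhood of the input.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 5)                             # 5 input neurons, thickness 1

conv = nn.Conv1d(1, 1, kernel_size=3, bias=False)    # one kernel: just 3 shared weights
print(conv.weight.shape)                             # torch.Size([1, 1, 3])
print(conv(x).shape)                                 # torch.Size([1, 1, 3]): 3 hidden units, each sees 3 inputs

# Stack another size-3 convolution: the single final output now "sees" all 5 inputs
# (receptive field 3 per layer, 5 overall), yet only 3 + 3 = 6 weights were learned
# instead of the 15 + 3 of the fully connected version.
conv2 = nn.Conv1d(1, 1, kernel_size=3, bias=False)
print(conv2(conv(x)).shape)                          # torch.Size([1, 1, 1])
```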
I can simply add more neurons and I keep using my weight across right that was + +0:24:07.240,0:24:09.809 +Some of the major points Yann, you know + +0:24:10.509,0:24:12.509 +highlighted yesterday + +0:24:12.639,0:24:14.939 +Moreover we have the kernel independence + +0:24:15.999,0:24:18.689 +So for the one of you they are interested in optimization + +0:24:19.659,0:24:21.009 +optimizing like computation + +0:24:21.009,0:24:22.299 +this is so cool because + +0:24:22.299,0:24:29.189 +This kernel and another kernel are completely independent so you can train them you can paralyze is to make things go faster + +0:24:33.580,0:24:38.549 +So finally we have also some connection sparsity property and so here we have a + +0:24:39.070,0:24:41.700 +Reduced amount of computation, which is also very good + +0:24:42.009,0:24:48.659 +So all these properties allowed us to be able to train this network on a lot of data + +0:24:48.659,0:24:55.739 +you still require a lot of data, but without having sparsity locality, so without having sparsity and + +0:24:56.409,0:25:01.859 +Parameter sharing you wouldn't be able to actually finish training this network in a reasonable amount of time + +0:25:03.639,0:25:11.039 +So, let's see, for example now how this works when you have like audio signal which is how many dimensional signal + +0:25:12.279,0:25:17.849 +1 dimensional signal, right? Okay. So for example kernels for 1d data + +0:25:18.490,0:25:24.119 +On the right hand side. You can see again. My my neurons can I'll be using my + +0:25:24.909,0:25:30.359 +Different the first scanner here. And so I'm gonna be storing my kernel there in that vector + +0:25:31.330,0:25:36.059 +For example, I can have a second kernel right. So right now we have two kernels the + +0:25:36.700,0:25:39.749 +Blue purple and pink and the yellow, orange and red + +0:25:41.559,0:25:44.158 +So let's say my output is r2 + +0:25:44.799,0:25:46.829 +So that means that each of those + +0:25:47.980,0:25:50.909 +Bubbles here. Each of those neurons are actually + +0:25:51.639,0:25:57.359 +One and two rightly come out from the from the board, right? So it's each of those are having a thickness of two + +0:25:58.929,0:26:02.819 +And let's say the other guy here are having a thickness of seven, right + +0:26:02.990,0:26:07.010 +They are coming outside from the screen and they are you know, seven euros in this way + +0:26:08.070,0:26:13.640 +so in this case, my kernel are going to be of size 2 * 7 * 3 + +0:26:13.860,0:26:17.719 +So 2 means I have two kernels which are going from 7 + +0:26:18.240,0:26:20.070 +to give me + +0:26:20.070,0:26:22.070 +3 + +0:26:22.950,0:26:24.950 +Outputs + +0:26:28.470,0:26:32.959 +Hold on my bad. So the 2 means you have ℝ² right here + +0:26:33.659,0:26:37.069 +Because you have two corners. So the first kernel will give you the first + +0:26:37.679,0:26:41.298 +The first column here and the second kernel is gonna give you the second column + +0:26:42.179,0:26:44.869 +Then it has to init 7 + +0:26:45.210,0:26:50.630 +Because it needs to match all the thickness of the previous layer and then it has 3 because there are three + +0:26:50.789,0:26:56.778 +Connections right? So maybe I miss I got confused before does it make sense the sizing? 
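A quick PyTorch check of this sizing (an illustrative sketch, not the course notebook): a 1-D convolution going from a thickness of 7 channels to 2 kernels with 3 connections stores its weights in a 2 × 7 × 3 tensor, and those same few weights apply to an input of any length.

```python
import torch
import torch.nn as nn

conv = nn.Conv1d(in_channels=7, out_channels=2, kernel_size=3)
print(conv.weight.shape)            # torch.Size([2, 7, 3]): 2 kernels x 7 channels x 3 taps

x_short = torch.randn(1, 7, 10)     # batch of 1, thickness 7, 10 samples
x_long = torch.randn(1, 7, 1000)    # same layer, much longer signal
print(conv(x_short).shape)          # torch.Size([1, 2, 8])
print(conv(x_long).shape)           # torch.Size([1, 2, 998]); the 3 taps are reused at every position
```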
+ +0:26:58.049,0:26:59.820 +so given that our + +0:26:59.820,0:27:03.710 +273 2 means you had 2 kernels and therefore you have two + +0:27:04.080,0:27:08.000 +Items here like one a one coming out for each of those columns + +0:27:08.640,0:27:15.919 +It has seven because each of these have a thickness of 7 and finally 3 means there are 3 connection connecting to the previous layer + +0:27:17.429,0:27:22.819 +Right so 1d data uses 3d kernels ok + +0:27:23.460,0:27:30.049 +so if I call this my collection of kernel, right, so if those are gonna be stored in a tensor + +0:27:30.049,0:27:32.898 +This tensor will be a three dimensional tensor + +0:27:33.690,0:27:34.919 +so + +0:27:34.919,0:27:37.939 +Question for you, if I'm gonna be playing now with images + +0:27:38.580,0:27:40.580 +What is the size of? + +0:27:40.679,0:27:43.999 +You know full pack of kernels for an image + +0:27:45.809,0:27:47.809 +Convolutional net + +0:27:49.590,0:27:56.209 +Four right. So we're gonna have the number of kernels then it's going to be the number of the thickness + +0:27:56.730,0:28:00.589 +And then you're gonna have connections in height and connection in width + +0:28:01.799,0:28:03.179 +Okay + +0:28:03.179,0:28:09.798 +So if you're gonna be checking the currently convolutional kernels later on in your notebook, actually you should check that + +0:28:09.929,0:28:12.138 +You should find the same kind of dimensions + +0:28:14.159,0:28:16.159 +All right, so + +0:28:18.059,0:28:20.478 +Questions so far, is this so clear?. Yeah + +0:28:50.460,0:28:52.460 +Okay, so good question so + +0:28:52.469,0:28:56.149 +trade-off about, you know sizing of those convolutions + +0:28:56.700,0:28:59.119 +convolutional kernels, right is it correct? Right + +0:28:59.909,0:29:06.409 +Three by three he seems to be like the minimum you can go for if you actually care about spatial information + +0:29:07.499,0:29:13.098 +As Yann pointed out you can also use one by one convolution. Oh, sorry one come one + +0:29:13.769,0:29:15.149 +like a + +0:29:15.149,0:29:20.718 +Convolution with which has only one weight or if you use like in images you have a one by one convolution + +0:29:21.179,0:29:23.179 +Those are used in order to be + +0:29:23.309,0:29:24.570 +having like a + +0:29:24.570,0:29:26.570 +final layer, which is still + +0:29:26.909,0:29:30.528 +Spatial still can be applied to a larger input image + +0:29:31.649,0:29:36.138 +Right now we just use kernels that are three or maybe five + +0:29:36.929,0:29:42.348 +it's kind of empirical so it's not like we don't have like a magic formulas, but + +0:29:43.349,0:29:44.279 +we've been + +0:29:44.279,0:29:50.329 +trying hard in the past ten years to figure out what is you know the best set of hyper parameters and if you check + +0:29:50.969,0:29:55.879 +For each field like for a speech processing visual processing like image processing + +0:29:55.879,0:29:59.718 +You're gonna figure out what is the right compromise for your specific data? + +0:30:01.769,0:30:03.769 +Yeah + +0:30:04.910,0:30:06.910 +Second + +0:30:07.970,0:30:12.279 +Okay, that's a good question why odd numbers why the kernel has an odd number + +0:30:14.390,0:30:16.220 +Of elements + +0:30:16.220,0:30:20.049 +So if you actually have a odd number of elements there would be a central element + +0:30:20.240,0:30:25.270 +Right. 
If you have a even number of elements there, we'll know there won't be a central value + +0:30:25.370,0:30:27.880 +So if you have again odd number + +0:30:27.880,0:30:30.790 +You know that from a specific point you're gonna be considering + +0:30:31.220,0:30:36.789 +Even number of left and even number of right items if it's a even size + +0:30:37.070,0:30:42.399 +Kernel that you actually don't know where the center is and the center is gonna be the average of two + +0:30:43.040,0:30:48.310 +Neighboring samples which actually creates like a low-pass filter effect. So even + +0:30:49.220,0:30:51.910 +kernel sizes are not usually + +0:30:52.580,0:30:56.080 +preferred or not usually used because they imply some kind of + +0:30:57.290,0:30:59.889 +additional lowering of the quality of the data + +0:31:02.000,0:31:08.380 +Okay, so one more thing that we mentioned also yesterday its padding padding is something + +0:31:09.590,0:31:16.629 +that if it has an effect on the final results is getting it worse, but it's very convenient for + +0:31:17.570,0:31:25.450 +programming side so if we've had our so as you can see here when we apply convolution from this layer you're gonna end up with + +0:31:27.680,0:31:31.359 +Okay, how many how many neurons we have here + +0:31:32.720,0:31:34.720 +three and we started from + +0:31:35.480,0:31:39.400 +five, so if we use a convolutional kernel of three + +0:31:40.490,0:31:42.490 +We lose how many neurons? + +0:31:43.310,0:31:50.469 +Two, okay, one per side. If you're gonna be using a convolutional kernel of size five how much you're gonna be losing + +0:31:52.190,0:31:57.639 +Four right and so that's the rule user zero padding you have to add an extra + +0:31:58.160,0:32:02.723 +Neuron here an extra neuron here. So you're gonna do number size of the kernel, right? + +0:32:02.723,0:32:05.800 +Three minus one divided by two and then you add that extra + +0:32:06.560,0:32:12.850 +Whatever number of neurons here, you've set them to zero. Why to zero? because usually you zero mean + +0:32:13.470,0:32:18.720 +Your inputs or your zero each layer output by using some normalization layers + +0:32:19.900,0:32:21.820 +in this case + +0:32:21.820,0:32:25.770 +Yeah, three comes from the size of the kernel and then you have that + +0:32:26.740,0:32:28.630 +Some animation should be playing + +0:32:28.630,0:32:31.289 +Yeah, you have one extra neuron there there then + +0:32:31.289,0:32:37.289 +I have an extra neuron there such that finally you end up with these, you know ghosts neurons there + +0:32:37.330,0:32:41.309 +But now you have the same number of input and the same number of output + +0:32:41.740,0:32:47.280 +And this is so convenient because if we started with I don't know 64 neurons you apply a convolution + +0:32:47.280,0:32:54.179 +You still have 64 neurons and therefore you can use let's say max pooling of two you're going to end up at 32 neurons + +0:32:54.179,0:32:57.809 +Otherwise you gonna have this I don't know if you consider one + +0:32:58.539,0:33:01.019 +We have a odd number right so you don't know what to do + +0:33:04.030,0:33:06.030 +after a bit, right? + +0:33:08.320,0:33:10.320 +Okay, so + +0:33:10.720,0:33:12.720 +Yeah, and you have the same size + +0:33:13.539,0:33:20.158 +All right. So, let's see how much time you have left. You have a bit of time. So, let's see how we use this + +0:33:21.130,0:33:27.270 +Convolutional net work in practice. 
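Here is a small sketch of the padding rule ((kernel − 1) / 2 zeros per side keeps the size) together with the 4-D kernel pack used for images; the channel counts below are arbitrary examples rather than values from the lecture.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)  # (3 - 1) / 2 = 1
print(conv.weight.shape)          # torch.Size([16, 3, 3, 3]): kernels x thickness x height x width

x = torch.randn(1, 3, 64, 64)
y = conv(x)
print(y.shape)                    # torch.Size([1, 16, 64, 64]); zero padding preserves 64 x 64
print(nn.MaxPool2d(2)(y).shape)   # torch.Size([1, 16, 32, 32]); pooling then halves it cleanly
```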
So this is like the theory behind and we have said that we can use convolutions + +0:33:28.000,0:33:33.839 +So this is a convolutional operator. I didn't even define. What's a convolution. We just said that if our data has + +0:33:37.090,0:33:39.929 +Stationarity locality and is actually + +0:33:42.130,0:33:45.689 +Compositional then we can exploit this by using + +0:33:49.240,0:33:51.240 +Weight sharing + +0:33:51.940,0:33:56.730 +Sparsity and then you know by stacking several of this layer. You have a like a hierarchy, right? + +0:33:58.510,0:34:06.059 +So by using this kind of operation this is a convolution I didn't even define it I don't care right now maybe next class + +0:34:07.570,0:34:11.999 +So this is like the theory behind now, we're gonna see a little bit of practical + +0:34:12.429,0:34:15.628 +You know suggestions how we actually use this stuff in practice + +0:34:16.119,0:34:22.229 +So next thing we have like a standard a spatial convolutional net which is operating which kind of data + +0:34:22.840,0:34:24.840 +If it's spatial + +0:34:25.780,0:34:28.229 +It's special because it's my network right special + +0:34:29.260,0:34:32.099 +Not just kidding so special as you know space + +0:34:33.190,0:34:37.139 +So in this case, we have multiple layers, of course we stuck them + +0:34:37.300,0:34:42.419 +We also talked about why it's better to have several layers rather than having a fat layer + +0:34:43.300,0:34:48.149 +We have convolutions. Of course, we have nonlinearities because otherwise + +0:34:55.270,0:34:56.560 +So + +0:34:56.560,0:35:04.439 +ok, next time we're gonna see how a convolution can be implemented with matrices but convolutions are just linear operator with which a lot of + +0:35:04.440,0:35:07.470 +zeros and like replication of the same by the weights + +0:35:07.570,0:35:13.019 +but otherwise if you don't use non-linearity a convolution of a convolution + +0:35:13.020,0:35:16.679 +It's gonna be a convolution. So we have to clean up stuff + +0:35:17.680,0:35:19.510 +that + +0:35:19.510,0:35:25.469 +We have to like put barriers right? in order to avoid collapse of the whole network. We had some pooling operator + +0:35:26.140,0:35:27.280 +which + +0:35:27.280,0:35:33.989 +Geoffrey says that's you know, something already bad. But you know, you're still doing that Hinton right Geoffrey Hinton + +0:35:35.410,0:35:40.950 +Then we've had something that if you don't use it, your network is not gonna be training. So just use it + +0:35:41.560,0:35:44.339 +although we don't know exactly why it works but + +0:35:45.099,0:35:48.659 +I think there is a question on Piazza. I will put a link there + +0:35:49.330,0:35:53.519 +About this batch normalization. Also Yann is going to be covering all the normalization layers + +0:35:54.910,0:36:01.889 +Finally we have something that also is quite recent which is called a receival or bypass connections + +0:36:01.990,0:36:03.990 +Which are basically these? 
+ +0:36:04.240,0:36:05.859 +extra + +0:36:05.859,0:36:07.089 +connections + +0:36:07.089,0:36:09.089 +Which allow me to + +0:36:09.250,0:36:10.320 +Get the network + +0:36:10.320,0:36:13.320 +You know the network decided whether whether to send information + +0:36:13.780,0:36:18.780 +Through this line or actually send it forward if you stack so many many layers one after each other + +0:36:18.910,0:36:24.330 +The signal get lost a little bit after sometime if you add these additional connections + +0:36:24.330,0:36:27.089 +You always have like a path in order to go back + +0:36:27.710,0:36:31.189 +The bottom to the top and also to have gradients coming down from the top to the bottom + +0:36:31.440,0:36:38.599 +so that's actually a very important both the receiver connection and the batch normalization are really really helpful to get this network to + +0:36:39.059,0:36:46.849 +Properly train if you don't use them then it's going to be quite hard to get those networks to really work for the training part + +0:36:48.000,0:36:51.949 +So how does it work we have here an image, for example + +0:36:53.010,0:36:55.939 +Where most of the information is spatial information? + +0:36:55.940,0:36:59.000 +So the information is spread across the two dimensions + +0:36:59.220,0:37:04.520 +Although there is a thickness and I call the thickness as characteristic information + +0:37:04.770,0:37:07.339 +Which means it provides a information? + +0:37:07.890,0:37:11.569 +At that specific point. So what is my characteristic information? + +0:37:12.180,0:37:15.740 + in this image let's say it's a RGB image + +0:37:16.680,0:37:18.680 +It's a color image right? + +0:37:19.230,0:37:27.109 +So we have the most of the information is spread on a spatial information. Like if you have me making funny faces + +0:37:28.109,0:37:30.109 +but then at each point + +0:37:30.300,0:37:33.769 +This is not a grayscale image is a color image, right? + +0:37:33.770,0:37:39.199 +So each point will have an additional information which is my you know specific + +0:37:39.990,0:37:42.439 +Characteristic information. What is it in this case? + +0:37:44.640,0:37:46.910 +It's a vector of three values which represent + +0:37:48.630,0:37:51.530 +RGB are the three letters by the __ as they represent + +0:37:54.780,0:37:57.949 +Okay, overall, what does it represent like + +0:37:59.160,0:38:02.480 +Yes intensity. Just you know, tell me in English without weird + +0:38:03.359,0:38:05.130 +things + +0:38:05.130,0:38:11.480 +The color of the pixel, right? So my specific information. My characteristic information. Yeah. I don't know what you're saying + +0:38:11.480,0:38:18.500 +Sorry, the characteristic information in this case is just a color right so the color is the only information that is specific there + +0:38:18.500,0:38:20.780 +But then otherwise information is spread around + +0:38:21.359,0:38:23.359 +As if we climb climb the hierarchy + +0:38:23.730,0:38:31.189 +You can see now some final vector which has let's say we are doing classification in this case. 
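As a rough illustration of the ingredients listed above (convolution, batch normalization, a nonlinearity and a bypass path), one possible arrangement is sketched below; it is a generic block, not the specific architecture used in the course.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Conv -> BatchNorm -> ReLU twice, plus a skip connection that adds the input back."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(self.body(x) + x)  # "+ x" is the bypass: gradients can flow straight through

print(ResidualBlock(16)(torch.randn(1, 16, 32, 32)).shape)  # torch.Size([1, 16, 32, 32])
```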
So my + +0:38:31.770,0:38:36.530 +You know the height and width or the thing is going to be one by one so it's just one vector + +0:38:37.080,0:38:43.590 +And then let's say there you have the specific final logit, which is the highest one so which is representing the class + +0:38:43.590,0:38:47.400 +Which is most likely to be the correct one if it's trained well + +0:38:48.220,0:38:51.630 +in the Midway, you have something that is, you know a trade-off between + +0:38:52.330,0:38:59.130 +Spatial information and then these characteristic information. Okay. So basically it's like a conversion between + +0:39:00.070,0:39:01.630 +spatial information + +0:39:01.630,0:39:03.749 +into this characteristic information + +0:39:04.360,0:39:07.049 +Do you see so it basically go from a thing? + +0:39:07.660,0:39:08.740 +input + +0:39:08.740,0:39:13.920 +Data to something. It is very thick, but then has no more information spatial information + +0:39:14.710,0:39:20.760 +and so you can see here with my ninja PowerPoint skills how you can get you know a + +0:39:22.240,0:39:27.030 +Reduction of the ___ thickener like a figure thicker in our presentation + +0:39:27.070,0:39:30.840 +Whereas you actually lose the spatial special one + +0:39:32.440,0:39:39.870 +Okay, so that was oh one more pooling so pooling is simply again for example + +0:39:41.620,0:39:43.600 +It can be performed in this way + +0:39:43.600,0:39:48.660 +So there you have some hand drawing because I didn't want to do you have time to make it in latex? + +0:39:49.270,0:39:52.410 +So you have different regions you apply a specific? + +0:39:53.500,0:39:57.060 +Operator to that specific region, for example, you have the P norm + +0:39:58.150,0:39:59.680 +and then + +0:39:59.680,0:40:02.760 +Yes, the P goes to plus infinity. You have the Max + +0:40:03.730,0:40:09.860 +And then that one is not give you one value right then you perform a stride. + +0:40:09.860,0:40:12.840 +jump to Pixels further and then you again you compute the same thing + +0:40:12.840,0:40:18.150 +you're gonna get another value there and so on until you end up from + +0:40:18.700,0:40:24.900 +Your data which was m by n with c channels you get still c channels + +0:40:24.900,0:40:31.199 +But then in this case you gonna get m/2 and c and n/2. Okay, and this is for images + +0:40:35.029,0:40:41.079 +There are no parameters on the pooling how you can nevertheless choose which kind of pooling, right you can choose max pooling + +0:40:41.390,0:40:44.229 +Average pooling any pooling is wrong. So + +0:40:45.769,0:40:48.879 +Yeah, let's also the problem, okay, so + +0:40:49.999,0:40:55.809 +This was the mean part with the slides. We are gonna see now the notebooks will go a bit slower this time + +0:40:55.809,0:40:58.508 +I noticed that last time I kind of rushed + +0:40:59.900,0:41:02.529 +Are there any questions so far on this part that we cover? + +0:41:04.519,0:41:06.519 +Yeah + +0:41:10.670,0:41:12.469 +So there is like + +0:41:12.469,0:41:17.769 +Geoffrey Hinton is renowned for saying that max pooling is something which is just + +0:41:18.259,0:41:23.319 +Wrong because you just throw away information as you average or you take the max you just throw away things + +0:41:24.380,0:41:29.140 +He's been working on like something called capsule networks, which have you know specific + +0:41:29.660,0:41:33.849 +routing paths that are choosing, you know some + +0:41:34.519,0:41:41.319 +Better strategies in order to avoid like throwing away information. Okay. 
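A brief sketch of this pooling step (sizes are arbitrary): a windowed operator applied with stride 2 keeps the c channels and halves each spatial dimension, and a p-norm pool with large p behaves like max pooling.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 28, 28)                             # c = 3 channels, m = n = 28

print(nn.MaxPool2d(kernel_size=2)(x).shape)               # torch.Size([1, 3, 14, 14])
print(nn.LPPool2d(norm_type=4, kernel_size=2)(x).shape)   # torch.Size([1, 3, 14, 14]), p-norm over 2x2 windows
```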
Basically that's the the argument behind yeah + +0:41:45.469,0:41:52.329 +Yes, so the main purpose of using this pooling or the stride is actually to get rid of a lot of data such that you + +0:41:52.329,0:41:54.579 +Can compute things in a reasonable amount of time? + +0:41:54.619,0:42:00.939 +Usually you need a lot of stride or pooling at the first layers at the bottom because otherwise it's absolutely you know + +0:42:01.339,0:42:03.339 +Too computationally expensive + +0:42:03.979,0:42:05.979 +Yeah + +0:42:21.459,0:42:23.459 +So on that sit + +0:42:24.339,0:42:32.068 +Those network architectures are so far driven by you know the state of the art, which is completely an empirical base + +0:42:33.279,0:42:40.109 +we try hard and we actually go to I mean now we actually arrive to some kind of standard so a + +0:42:40.359,0:42:44.399 +Few years back. I was answering like I don't know but right now we actually have + +0:42:45.099,0:42:47.049 +Determined some good configurations + +0:42:47.049,0:42:53.968 +Especially using those receiver connections and the batch normalization. We actually can get to train basically everything + +0:42:54.759,0:42:56.759 +Yeah + +0:43:05.859,0:43:11.038 +So basically you're gonna have your gradient at a specific point coming down as well + +0:43:11.039,0:43:13.679 +And then you have the other gradient coming down down + +0:43:13.839,0:43:18.238 +Then you had a branch right a branching and if you have branch what's happening with the gradient? + +0:43:19.720,0:43:25.439 +That's correct. Yeah, they get added right so you have the two gradients coming from two different branches getting added together + +0:43:26.470,0:43:31.769 +All right. So let's go to the notebook such that we can cover we don't rush too much + +0:43:32.859,0:43:37.139 +So here I just go through the convnet part. So here I train + +0:43:39.519,0:43:41.289 +Initially I + +0:43:41.289,0:43:43.979 +Load the MNIST data set so I show you a few + +0:43:44.680,0:43:45.849 +characters here + +0:43:45.849,0:43:52.828 +Okay, and I train now a multi-layer perceptron like a fully connected Network like a mood, you know + +0:43:53.440,0:44:00.509 +Yeah, fully connected Network and a convolutional neural net which have the same number of parameters. Okay. So these two models will have the same + +0:44:01.150,0:44:05.819 +Dimension in terms of D. If you save them we'll wait the same so + +0:44:07.269,0:44:11.219 +I'm training here this guy here with the fully connected Network + +0:44:12.640,0:44:14.640 +It takes a little bit of time + +0:44:14.829,0:44:21.028 +And he gets some 87% Okay. This is trained on classification of the MNIST digits from Yann + +0:44:21.999,0:44:24.419 +We actually download from his website if you check + +0:44:25.239,0:44:32.189 +Anyhow, I train a convolutional neural net with the same number of parameters what you expect to have a better a worse result + +0:44:32.349,0:44:35.548 +So my multi-layer perceptron gets 87 percent + +0:44:36.190,0:44:38.190 +What do we get with a convolutional net? 
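The exact models from the notebook are not reproduced here; the sketch below only shows the kind of comparison being made, a fully connected net and a small convolutional net whose parameter counts can be printed and matched by adjusting the widths.

```python
import torch.nn as nn

def n_params(model):
    return sum(p.numel() for p in model.parameters())

mlp = nn.Sequential(                                   # fully connected baseline (illustrative sizes)
    nn.Flatten(), nn.Linear(28 * 28, 16), nn.ReLU(), nn.Linear(16, 10),
)
cnn = nn.Sequential(                                   # convolutional counterpart (also illustrative)
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(16 * 7 * 7, 10),
)
print(n_params(mlp), n_params(cnn))                    # tune the widths until the two counts roughly match
```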
+ +0:44:41.739,0:44:43.739 +Yes, why + +0:44:46.910,0:44:50.950 +Okay, so what is the point here of using sparsity what does it mean + +0:44:52.640,0:44:55.089 +Given that we have the same number of parameters + +0:44:56.690,0:44:58.690 +We manage to train much + +0:44:59.570,0:45:05.440 +more filters right in the second case because in the first case we use filters that are completely trying to get some + +0:45:05.960,0:45:12.549 +dependencies between things that are further away with things that are closed by so they are completely wasted basically they learn 0 + +0:45:12.830,0:45:19.930 +Instead in the convolutional net. I have all these parameters. They're just concentrated for figuring out. What is the relationship within a + +0:45:20.480,0:45:23.799 +Neighboring pixels. All right. So now it takes the pictures I + +0:45:24.740,0:45:26.740 +Shake everything just got scrambled + +0:45:27.410,0:45:33.369 +But I keep the same I scramble the same same way all the images. So I perform a random permutation + +0:45:34.850,0:45:38.710 +Always the same random permutation of all my images or the pixels on my images + +0:45:39.500,0:45:41.090 +What does it happen? + +0:45:41.090,0:45:43.299 +If I train both networks + +0:45:47.990,0:45:50.049 +So here I trained see here + +0:45:50.050,0:45:56.950 +I have my pics images and here I just scrambled with the same scrambling function all the pixels + +0:46:00.200,0:46:04.240 +All my inputs are going to be these images here + +0:46:06.590,0:46:10.870 +The output is going to be still the class of the original so this is a four you + +0:46:11.450,0:46:13.780 +Can see this this is a four. This is a nine + +0:46:14.920,0:46:19.889 +This is a 1 this is a 7 is a 3 in this is a 4 so I keep the same labels + +0:46:19.930,0:46:24.450 +But I scrambled the order of the pixels and I perform the same scrambling every time + +0:46:25.239,0:46:27.239 +What do you expect is performance? + +0:46:31.029,0:46:33.299 +Who's better who's working who's the same? + +0:46:38.619,0:46:46.258 +Perception how does it do with the perception? Does he see any difference? No, okay. So the guy still 83 + +0:46:47.920,0:46:49.920 +Yann's network + +0:46:52.029,0:46:54.029 +What do you guys + +0:47:04.089,0:47:09.988 +Know that's a fully connected. Sorry. I'll change the order. Yeah, see. Okay. There you go + +0:47:12.460,0:47:14.999 +So I can't even show you this thing + +0:47:17.920,0:47:18.730 +All right + +0:47:18.730,0:47:24.659 +So the fully connected guy basically performed the same the differences are just basic based on the initial + +0:47:25.059,0:47:30.899 +The random initialization the convolutional net which was winning by kind of large advance + +0:47:31.509,0:47:33.509 +advantage before actually performs + +0:47:34.059,0:47:38.008 +Kind of each similarly, but I mean worse than much worse than before + +0:47:38.499,0:47:42.449 +Why is the convolutional network now performing worse than my fully connected Network? + +0:47:44.829,0:47:46.829 +Because we fucked up + +0:47:47.739,0:47:55.379 +Okay, and so every time you use a convolutional network, you actually have to think can I use of convolutional network, okay + +0:47:56.440,0:47:59.700 +If it holds now, you have the three properties then yeah + +0:47:59.700,0:48:05.759 +Maybe of course, it should be giving you a better performance if those three properties don't hold + +0:48:06.579,0:48:09.058 +then using convolutional networks is + +0:48:11.499,0:48:17.939 +BS right, which was the bias? No. Okay. Never mind. All right. 
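A sketch of the scrambling trick being described (assumed details, the notebook's own code may differ): draw one random permutation of the 28 × 28 pixel positions and apply that same permutation to every image, leaving the labels untouched.

```python
import torch

perm = torch.randperm(28 * 28)            # ONE fixed permutation, reused for every image

def scramble(img):
    """img: tensor of shape (1, 28, 28); returns the same pixels in permuted positions."""
    return img.view(-1)[perm].view(1, 28, 28)

img = torch.randn(1, 28, 28)
print(torch.allclose(scramble(img).sum(), img.sum()))  # True: only the layout changed, not the content
```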
Well, good night diff --git a/docs/pt/week04/04-1.md b/docs/pt/week04/04-1.md new file mode 100644 index 000000000..78ecbe56c --- /dev/null +++ b/docs/pt/week04/04-1.md @@ -0,0 +1,596 @@ +--- +lang: pt +lang-ref: ch.04-1 +lecturer: Alfredo Canziani +title: Álgebra Linear e Convoluções +authors: Yuchi Ge, Anshan He, Shuting Gu e Weiyang Wen +date: 18 Feb 2020 +translation-date: 05 Nov 2021 +translator: Leon Solon +--- + + + +## [Revisão de Álgebra Linear](https://www.youtube.com/watch?v=OrBEon3VlQg&t=68s) + + + +Esta parte é uma recapitulação de Álgebra Linear básica no contexto das redes neurais. Começamos com uma camada oculta simples $\boldsymbol{h}$: + + + +$$ +\boldsymbol{h} = f(\boldsymbol{z}) +$$ + + + +A saída é uma função não linear $f$ aplicada a um vetor $z$. Aqui $z$ é a saída de uma transformação afim (affine transformation) $\boldsymbol{A} \in\mathbb{R^{m\times n}}$ para o vetor de entrada $\boldsymbol{x} \in\mathbb{R^n}$: + + + +$$ +\boldsymbol{z} = \boldsymbol{A} \boldsymbol{x} +$$ + + + +Para simplificar, os viéses (biases) são ignorados. A equação linear pode ser expandida como: + + + +$$ +\boldsymbol{A}\boldsymbol{x} = +\begin{pmatrix} +a_{11} & a_{12} & \cdots & a_{1n}\\ +a_{21} & a_{22} & \cdots & a_{2n} \\ +\vdots & \vdots & \ddots & \vdots \\ +a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix} \begin{pmatrix} +x_1 \\ \vdots \\x_n \end{pmatrix} = +\begin{pmatrix} + \text{---} \; \boldsymbol{a}^{(1)} \; \text{---} \\ + \text{---} \; \boldsymbol{a}^{(2)} \; \text{---} \\ + \vdots \\ + \text{---} \; \boldsymbol{a}^{(m)} \; \text{---} \\ +\end{pmatrix} +\begin{matrix} + \rvert \\ \boldsymbol{x} \\ \rvert +\end{matrix} = +\begin{pmatrix} + {\boldsymbol{a}}^{(1)} \boldsymbol{x} \\ {\boldsymbol{a}}^{(2)} \boldsymbol{x} \\ \vdots \\ {\boldsymbol{a}}^{(m)} \boldsymbol{x} +\end{pmatrix}_{m \times 1} +$$ + + + +onde $\boldsymbol{a}^{(i)}$ é a $i$-ésima linha da matriz $\boldsymbol{A}$. + + + +Para entender o significado dessa transformação, vamos analisar um componente de $\boldsymbol{z}$ como $a^{(1)}\boldsymbol{x}$. Seja $n=2$, então $\boldsymbol{a} = (a_1,a_2)$ e $\boldsymbol{x} = (x_1,x_2)$. + + + +$\boldsymbol{a}$ e $\boldsymbol{x}$ podem ser desenhados como vetores no eixo de coordenadas 2D. Agora, se o ângulo entre $\boldsymbol{a}$ e $\hat{\boldsymbol{\imath}}$ é $\alpha$ e o ângulo entre $\boldsymbol{x}$ e $\hat{\boldsymbol{\imath}}$ é $\xi$, então com fórmulas trigonométricas $a^\top\boldsymbol{x}$ pode ser expandido como: + + + +$$ +\begin {aligned} +\boldsymbol{a}^\top\boldsymbol{x} &= a_1x_1+a_2x_2\\ +&=\lVert \boldsymbol{a} \rVert \cos(\alpha)\lVert \boldsymbol{x} \rVert \cos(\xi) + \lVert \boldsymbol{a} \rVert \sin(\alpha)\lVert \boldsymbol{x} \rVert \sin(\xi)\\ +&=\lVert \boldsymbol{a} \rVert \lVert \boldsymbol{x} \rVert \big(\cos(\alpha)\cos(\xi)+\sin(\alpha)\sin(\xi)\big)\\ +&=\lVert \boldsymbol{a} \rVert \lVert \boldsymbol{x} \rVert \cos(\xi-\alpha) +\end {aligned} +$$ + + + +A saída mede o alinhamento da entrada a uma linha específica da matriz $\boldsymbol{A}$. Isso pode ser entendido observando o ângulo entre os dois vetores, $\xi-\alpha$. Quando $\xi = \alpha$, os dois vetores estão perfeitamente alinhados e o máximo é atingido. Se $\xi - \alpha = \pi$, então $\boldsymbol{a}^\top\boldsymbol{x}$ atinge seu mínimo e os dois vetores estão apontando em direções opostas. Em essência, a transformação linear permite ver a projeção de uma entrada para várias orientações definidas por $A$. Essa intuição também pode ser expandida para dimensões superiores. 
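Como verificação rápida dessa fórmula, segue um pequeno esboço numérico (os ângulos e módulos abaixo são valores arbitrários, apenas ilustrativos): construindo $\boldsymbol{a}$ e $\boldsymbol{x}$ a partir de seus módulos e ângulos, o produto escalar coincide com $\lVert \boldsymbol{a} \rVert \lVert \boldsymbol{x} \rVert \cos(\xi-\alpha)$.

```python
import math
import torch

alpha, xi = 0.3, 1.1                                  # ângulos arbitrários (em radianos)
ra, rx = 2.0, 0.5                                     # módulos arbitrários

a = ra * torch.tensor([math.cos(alpha), math.sin(alpha)])
x = rx * torch.tensor([math.cos(xi), math.sin(xi)])

print(torch.dot(a, x).item())                         # produto escalar a^T x
print(ra * rx * math.cos(xi - alpha))                 # ||a|| ||x|| cos(xi - alpha): mesmo valor, a menos de arredondamento
```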
+ + + +Outra maneira de entender a transformação linear é entendendo que $\boldsymbol{z}$ também pode ser expandido como: + + + +$$ +\boldsymbol{A}\boldsymbol{x} = +\begin{pmatrix} + \vert & \vert & & \vert \\ + \boldsymbol{a}_1 & \boldsymbol{a}_2 & \cdots & \boldsymbol{a}_n \\ + \vert & \vert & & \vert \\ +\end{pmatrix} +\begin{matrix} + \rvert \\ \boldsymbol{x} \\ \rvert +\end{matrix} = +x_1 \begin{matrix} \rvert \\ \boldsymbol{a}_1 \\ \rvert \end{matrix} + +x_2 \begin{matrix} \rvert \\ \boldsymbol{a}_2 \\ \rvert \end{matrix} + + \cdots + +x_n \begin{matrix} \rvert \\ \boldsymbol{a}_n \\ \rvert \end{matrix} +$$ + + + +A saída é a soma ponderada das colunas da matriz $\boldsymbol{A}$. Portanto, o sinal nada mais é do que uma composição da entrada. + + + +## [Extender Álgebra Linear para convoluções](https://www.youtube.com/watch?v=OrBEon3VlQg&t=1030s) + + + +Agora estendemos a álgebra linear às convoluções, usando o exemplo de análise de dados de áudio. Começamos representando uma camada totalmente conectada como uma forma de multiplicação de matriz: - + + + +$$ +\begin{bmatrix} +w_{11} & w_{12} & w_{13}\\ +w_{21} & w_{22} & w_{23}\\ +w_{31} & w_{32} & w_{33}\\ +w_{41} & w_{42} & w_{43} +\end{bmatrix} +\begin{bmatrix} +x_1\\ +x_2\\ +x_3 +\end{bmatrix} = \begin{bmatrix} +y_1\\ +y_2\\ +y_3\\ +y_4 +\end{bmatrix} +$$ + + + +Neste exemplo, a matriz de peso tem um tamanho de $4 \times 3$, o vetor de entrada tem um tamanho de $3 \times 1$ e o vetor de saída tem um tamanho de $4 \times 1$. + + + +No entanto, para dados de áudio, os dados são muito mais longos (não com 3 amostras). O número de amostras nos dados de áudio é igual à duração do áudio (*por exemplo,* 3 segundos) vezes a taxa de amostragem (*por exemplo,* 22,05 kHz). Conforme mostrado abaixo, o vetor de entrada $\boldsymbol{x}$ será bem longo. Correspondentemente, a matriz de peso se tornará "gorda". + + + +$$ +\begin{bmatrix} +w_{11} & w_{12} & w_{13} & w_{14} & \cdots &w_{1k}& \cdots &w_{1n}\\ +w_{21} & w_{22} & w_{23}& w_{24} & \cdots & w_{2k}&\cdots &w_{2n}\\ +w_{31} & w_{32} & w_{33}& w_{34} & \cdots & w_{3k}&\cdots &w_{3n}\\ +w_{41} & w_{42} & w_{43}& w_{44} & \cdots & w_{4k}&\cdots &w_{4n} +\end{bmatrix} +\begin{bmatrix} +x_1\\ +x_2\\ +x_3\\ +x_4\\ +\vdots\\ +x_k\\ +\vdots\\ +x_n +\end{bmatrix} = \begin{bmatrix} +y_1\\ +y_2\\ +y_3\\ +y_4 +\end{bmatrix} +$$ + + + +A formulação acima será difícil de treinar. Felizmente, existem maneiras de simplificar o mesmo. + + + +### Propriedade: localidade + + + +Devido à localidade (*ou seja,* não nos importamos com pontos de dados distantes) dos dados, $ w_ {1k} $ da matriz de peso acima pode ser preenchido com 0 quando $ k $ é relativamente grande. Portanto, a primeira linha da matriz torna-se um kernel de tamanho 3. Vamos denotar este kernel de tamanho 3 como $\boldsymbol{a}^{(1)} = \begin{bmatrix} a_1^{(1)} & a_2^{(1)} & a_3^{(1)} \end{bmatrix}$. + + + +$$ +\begin{bmatrix} +a_1^{(1)} & a_2^{(1)} & a_3^{(1)} & 0 & \cdots &0& \cdots &0\\ +w_{21} & w_{22} & w_{23}& w_{24} & \cdots & w_{2k}&\cdots &w_{2n}\\ +w_{31} & w_{32} & w_{33}& w_{34} & \cdots & w_{3k}&\cdots &w_{3n}\\ +w_{41} & w_{42} & w_{43}& w_{44} & \cdots & w_{4k}&\cdots &w_{4n} +\end{bmatrix} +\begin{bmatrix} +x_1\\ +x_2\\ +x_3\\ +x_4\\ +\vdots\\ +x_k\\ +\vdots\\ +x_n +\end{bmatrix} = \begin{bmatrix} +y_1\\ +y_2\\ +y_3\\ +y_4 +\end{bmatrix} +$$ + + + +### Propriedade: estacionariedade + + + +Os sinais de dados naturais têm a propriedade de estacionariedade (*ou seja,* certos padrões / motivos se repetirão). 
Isso nos ajuda a reutilizar o kernel $\mathbf{a}^{(1)}$ que definimos anteriormente. Usamos este kernel colocando-o um passo adiante a cada vez (*ou seja,* o passo é 1), resultando no seguinte: + + + +$$ +\begin{bmatrix} +a_1^{(1)} & a_2^{(1)} & a_3^{(1)} & 0 & 0 & 0 & 0&\cdots &0\\ +0 & a_1^{(1)} & a_2^{(1)} & a_3^{(1)} & 0&0&0&\cdots &0\\ +0 & 0 & a_1^{(1)} & a_2^{(1)} & a_3^{(1)} & 0&0&\cdots &0\\ +0 & 0 & 0& a_1^{(1)} & a_2^{(1)} &a_3^{(1)} &0&\cdots &0\\ +0 & 0 & 0& 0 & a_1^{(1)} &a_2^{(1)} &a_3^{(1)} &\cdots &0\\ +\vdots&&\vdots&&\vdots&&\vdots&&\vdots +\end{bmatrix} +\begin{bmatrix} +x_1\\ +x_2\\ +x_3\\ +x_4\\ +\vdots\\ +x_k\\ +\vdots\\ +x_n +\end{bmatrix} +$$ + + + +Tanto a parte superior direita quanto a parte inferior esquerda da matriz são preenchidas com $ 0 $ s graças à localidade, levando à dispersão. A reutilização de um determinado kernel repetidamente é chamada de divisão de peso. + + + +### Múltiplas camadas de matriz Toeplitz + + + +Após essas alterações, o número de parâmetros que resta é 3 (*ou seja,* $a_1,a_2,a_3$). Em comparação com a matriz de peso anterior, que tinha 12 parâmetros (*por exemplo* $w_{11},w_{12},\cdots,w_{43}$), o número atual de parâmetros é muito restritivo e gostaríamos de expandir o mesmo. + + + +A matriz anterior pode ser considerada uma camada (*ou seja,* uma camada convolucional) com o kernel $\boldsymbol{a}^{(1)}$. Então podemos construir múltiplas camadas com diferentes kernels $\boldsymbol{a}^{(2)}$, $\boldsymbol{a}^{(3)}$, etc, aumentando assim os parâmetros. + + + +Cada camada possui uma matriz contendo apenas um kernel que é replicado várias vezes. Este tipo de matriz é denominado matriz de Toeplitz. Em cada matriz de Toeplitz, cada diagonal descendente da esquerda para a direita é constante. As matrizes Toeplitz que usamos aqui também são matrizes esparsas. + + + +Dado o primeiro kernel $\boldsymbol{a}^{(1)}$ e o vetor de entrada $\boldsymbol{x}$, a primeira entrada na saída fornecida por esta camada é, $a_1^{(1)} x_1 + a_2^{(1)} x_2 + a_3^{(1)}x_3$. Portanto, todo o vetor de saída se parece com o seguinte: - + + + +$$ +\begin{bmatrix} +\mathbf{a}^{(1)}x[1:3]\\ +\mathbf{a}^{(1)}x[2:4]\\ +\mathbf{a}^{(1)}x[3:5]\\ +\vdots +\end{bmatrix} +$$ + + + +O mesmo método de multiplicação de matriz pode ser aplicado nas seguintes camadas convolucionais com outros kernels (*por exemplo* $\boldsymbol{a}^{(2)}$ e $\boldsymbol{a}^{(3)}$) para obter similar resultados. + + + +## [Ouvindo as convoluções - Jupyter Notebook](https://www.youtube.com/watch?v=OrBEon3VlQg&t=1709s) + + + +O Jupyter Notebook pode ser encontrado [aqui](https://github.com/Atcold/pytorch-Deep-Learning/blob/master/07-listening_to_kernels.ipynb). + + + +Neste bloco de notas, vamos explorar a Convolução como um 'produto escalar em execução'. + + + +A biblioteca `librosa` nos permite carregar o clipe de áudio $\boldsymbol{x}$ e sua taxa de amostragem. Nesse caso, existem 70641 amostras, a taxa de amostragem é de 22,05 kHz e a duração total do clipe é de 3,2 s. O sinal de áudio importado é ondulado (consulte a Figura 1) e podemos adivinhar como ele soa a partir da amplitude do eixo $ y $. O sinal de áudio $x(t)$ é na verdade o som reproduzido ao desligar o sistema Windows (consulte a Fig 2). + + + +
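Como referência, um esboço de como esse carregamento costuma ser feito com a biblioteca `librosa` (o nome do arquivo abaixo é hipotético; no notebook o clipe acompanha o repositório):

```python
import librosa

# Esboço: carrega o clipe e a taxa de amostragem; sr=None preserva a taxa original do arquivo.
x, sr = librosa.load('win_xp_shutdown.wav', sr=None)   # caminho hipotético, apenas ilustrativo
print(x.shape, sr)                                     # por exemplo: (70641,) 22050
print(len(x) / sr)                                     # duração em segundos (~3,2 s)
```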
+
+Fig. 1: Uma visualização do sinal de áudio.
+
+ + + +
+
+Fig. 2: Observações para o sinal de áudio acima.
+
+ + + +Precisamos separar as notas da forma de onda. Para conseguir isso, se usarmos a transformada de Fourier (FT), todas as notas sairão juntas e será difícil descobrir a hora exata e a localização de cada afinação. Portanto, um FT localizado é necessário (também conhecido como espectrograma). Como é observado no espectrograma (consulte a Fig. 3), diferentes tons de pico em diferentes frequências (*por exemplo* primeiros picos de tom em 1600). A concatenação dos quatro tons em suas frequências nos dá uma versão do sinal original. + + + +
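Um esboço, com os parâmetros padrão da `librosa` e apenas ilustrativo, de como obter essa TF localizada (o caminho do arquivo continua sendo hipotético):

```python
import numpy as np
import librosa

x, sr = librosa.load('win_xp_shutdown.wav', sr=None)   # mesmo arquivo hipotético do trecho anterior
D = librosa.stft(x)                                    # TF aplicada em janelas curtas do sinal
S_db = librosa.amplitude_to_db(np.abs(D), ref=np.max)  # espectro em dB, eixos: frequência x tempo
print(S_db.shape)                                      # (1 + n_fft/2, n_quadros); (1025, ...) com os padrões
```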
+
+Fig. 3: Sinal de áudio e seu espectrograma.
+
+ + + +A convolução do sinal de entrada com todos os tons (todas as teclas do piano, por exemplo) pode ajudar a extrair todas as notas na peça de entrada (*ou seja,* os hits quando o áudio corresponde aos núcleos específicos). Os espectrogramas do sinal original e o sinal dos tons concatenados são mostrados na Fig. 4, enquanto as frequências do sinal original e os quatro tons são mostrados na Fig. 5. O gráfico das convoluções dos quatro núcleos com o sinal de entrada (original sinal) é mostrado na Fig 6. A Fig 6 junto com os clipes de áudio das convoluções comprovam a eficácia das convoluções na extração das notas. + + + +
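Segue um esboço dessa "correspondência" por convolução; o sinal e a nota abaixo são tensores aleatórios hipotéticos, apenas para mostrar a mecânica (no notebook eles vêm do áudio real e das notas geradas):

```python
import torch
import torch.nn.functional as F

x = torch.randn(70641)                  # sinal de áudio 1-D (hipotético neste esboço)
nota = torch.randn(2048)                # kernel: um trecho curto de uma nota (hipotético)

# O conv1d do PyTorch é, na prática, um "produto escalar em execução" (correlação cruzada):
resposta = F.conv1d(x.view(1, 1, -1), nota.view(1, 1, -1)).squeeze()
print(resposta.shape)                   # 70641 - 2048 + 1 deslocamentos
print(resposta.abs().argmax())          # posição em que a nota "casa" melhor com o sinal
```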
+
+Fig. 4: Espectrograma do sinal original (esquerda) e Espectrograma da concatenação de tons (direita).
+
+ + + +
+
+Fig. 5: Primeira nota da melodia.
+
+ + + +
+
+Fig. 6: Convolução de quatro grãos.
+
+ + + +## Dimensionalidade de diferentes conjuntos de dados + + + +A última parte é uma pequena digressão sobre as diferentes representações da dimensionalidade e exemplos para as mesmas. Aqui, consideramos que o conjunto de entrada $X$ é feito de mapeamento de funções dos domínios $\Omega$ para os canais $c$. + + + +### Exemplos + + + +* Dados de áudio: o domínio é 1-D, sinal discreto indexado pelo tempo; o número de canais $ c $ pode variar de 1 (mono), 2 (estéreo), 5 + 1 (Dolby 5.1), *etc.* +* Dados da imagem: o domínio é 2-D (pixels); $ c $ pode variar de 1 (escala de cinza), 3 (cor), 20 (hiperespectral), *etc.* +* Relatividade especial: o domínio é $\mathbb{R^4} \times \mathbb{R^4}$ (espaço-tempo $\times$ quatro-momento); quando $c = 1$ é chamado de Hamiltoniano. + + + +
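Em código, um esboço (com tamanhos arbitrários) de como essas convenções de dimensionalidade aparecem na prática, no formato dos tensores e na escolha entre `Conv1d` e `Conv2d`:

```python
import torch
import torch.nn as nn

audio = torch.randn(1, 2, 66150)        # lote, c = 2 canais (estéreo), 3 s a 22,05 kHz
imagem = torch.randn(1, 3, 224, 224)    # lote, c = 3 canais (RGB), domínio 2-D de pixels

print(nn.Conv1d(2, 16, kernel_size=3)(audio).shape)    # torch.Size([1, 16, 66148])
print(nn.Conv2d(3, 16, kernel_size=3)(imagem).shape)   # torch.Size([1, 16, 222, 222])
```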
+
+Fig. 7: Dimensões diferentes de tipos diferentes de sinais.
+
diff --git a/docs/pt/week04/04.md b/docs/pt/week04/04.md new file mode 100644 index 000000000..942afab61 --- /dev/null +++ b/docs/pt/week04/04.md @@ -0,0 +1,18 @@ +--- +lang: pt +lang-ref: ch.04 +title: Semana 4 +translation-date: 05 Nov 2021 +translator: Leon Solon +--- + + + +## Prática + + + +Começamos com uma breve revisão de Álgebra Linear e, em seguida, estendemos o tópico para as convoluções usando dados de áudio como exemplo. Conceitos chave como localidade, estacionariedade e matriz de Toeplitz são reiterados. Em seguida, oferecemos uma demonstração ao vivo do desempenho da convolução na análise do tom. Finalmente, há uma pequena digressão sobre a dimensionalidade de diferentes dados. \ No newline at end of file diff --git a/docs/pt/week04/practicum04.sbv b/docs/pt/week04/practicum04.sbv new file mode 100644 index 000000000..5b8ec5afc --- /dev/null +++ b/docs/pt/week04/practicum04.sbv @@ -0,0 +1,1517 @@ +0:00:00.030,0:00:04.730 +então, desde a última vez, ok, bem-vindo de volta, obrigado por estar aqui. + +0:00:04.730,0:00:09.059 +A última vez que Yann usou o tablet, certo? e como você pode usar o tablet e eu + +0:00:09.059,0:00:13.040 +não usa o tablet, certo? Então, eu deveria ser tão legal quanto Yann, pelo menos eu acho. + +0:00:13.040,0:00:18.900 +Mais uma coisa para começar, há uma planilha onde você pode decidir se + +0:00:18.900,0:00:22.890 +você gostaria de entrar no canal do Slack, onde colaboramos para fazer + +0:00:22.890,0:00:28.529 +alguns desenhos para o site, corrigindo algumas notações matemáticas, tendo alguns + +0:00:28.529,0:00:32.640 +tipo de, você sabe, consertar o erro em inglês na gramática inglesa ou + +0:00:32.640,0:00:37.290 +seja o que for, então, se você estiver interessado em ajudar a melhorar o conteúdo de + +0:00:37.290,0:00:42.450 +esta turma, fique à vontade para preencher a planilha, ok? Já somos alguns de + +0:00:42.450,0:00:49.789 +nós no canal do Slack, então quero dizer, se você quiser entrar, de nada. Então + +0:00:49.789,0:00:53.399 +em vez de escrever no quadro branco, porque é impossível ver, eu acho + +0:00:53.399,0:01:00.270 +do lado superior, vamos experimentar um novo brinquedo aqui. Tudo + +0:01:00.270,0:01:03.930 +direito. Primeira vez, então você sabe, estou um pouco + +0:01:03.930,0:01:11.250 +tenso. Da última vez, estraguei um notebook, então tudo bem, tudo bem. Então nós vamos + +0:01:11.250,0:01:15.659 +comece com uma pequena revisão sobre álgebra linear. Espero que não seja + +0:01:15.659,0:01:20.670 +ofender alguém, estou ciente de que você já estudou álgebra linear e está + +0:01:20.670,0:01:25.920 +muito forte nisso, mas, no entanto, gostaria de lhe fornecer minha intuição, meu + +0:01:25.920,0:01:31.320 +perspectiva, ok? é apenas um slide, não muito, então talvez você queira + +0:01:31.320,0:01:36.290 +para tirar papel e caneta ou você pode apenas saber, o que for, acompanhar. + +0:01:36.290,0:01:49.850 +Portanto, esta será uma revisão de álgebra linear. + +0:01:51.170,0:02:06.450 +OK. Estou esperando um pouco? Deixe-me esperar um segundo. Preparar?? Sim? Não? mexa sua + +0:02:06.450,0:02:12.150 +cabeça. Fantástico, certo, estávamos conversando da última vez que tivemos um + +0:02:12.150,0:02:15.270 +rede com a entrada na parte inferior, então tínhamos um afim + +0:02:15.270,0:02:18.510 +transformação, então temos uma camada oculta à direita. Então, vou apenas escrever o + +0:02:18.510,0:02:23.280 +primeira equação. 
Teremos essa minha camada oculta, e como estou escrevendo + +0:02:23.280,0:02:26.970 +com uma caneta, você pode ver alguma coisa? sim? então, como estou escrevendo com uma caneta, estou + +0:02:26.970,0:02:30.360 +vai colocar um sublinhado embaixo da variável para + +0:02:30.360,0:02:35.120 +indicam que é um vetor. OK? é assim que escrevo vetores. Então meu H vai ser um + +0:02:35.120,0:02:42.660 +função não linear f aplicada ao meu z e z vai ser minha entrada linear, + +0:02:42.660,0:02:46.230 +a saída da transformação afim, portanto, neste caso + +0:02:46.230,0:02:54.989 +Vou escrever aqui z será igual à minha matriz A vezes x. Nós + +0:02:54.989,0:03:00.630 +podemos imaginar que não há preconceito neste caso, é genérico o suficiente porque podemos + +0:03:00.630,0:03:06.030 +inclua o viés dentro da matriz e tenha o primeiro item de x igual a + +0:03:06.030,0:03:18.360 +1. Então, se este x aqui pertence a R n e este z aqui pertence a R m, a primeira pergunta: + +0:03:18.360,0:03:24.989 +qual é o tamanho desta matriz? Ok, fantástico, então esta matriz aqui é + +0:03:24.989,0:03:30.360 +vai ser o nosso m vezes n, você tem tantas linhas quanto a dimensão para onde você atira + +0:03:30.360,0:03:34.200 +e você tem tantas colunas quanto a dimensão de onde você está filmando, ok? + +0:03:34.200,0:03:39.930 +Tudo bem, então vamos expandir este, então esta matriz aqui vai ser igual a + +0:03:39.930,0:03:49.549 +o que você tem a_ {1,1} a_ {1,2} assim por diante até o último, qual vai ser? gritar. + +0:03:49.549,0:03:56.269 +Obrigado, 1 ...? sim 1 e então você tem o segundo + +0:03:56.269,0:04:04.249 +vai ter a_ {2,1} a_ {2,2} assim por diante até o último que é a {2, n}, certo? e aí você + +0:04:04.249,0:04:14.840 +continuar descendo até o último qual vai ser? quais são os índices? m 1, + +0:04:14.840,0:04:24.740 +direito? Ok, então você tem a_ {m, 1}, a_ {m, 2} e assim por diante até a_ {m, n}, ok, obrigado. + +0:04:24.740,0:04:37.220 +E então temos aqui o nosso x, certo? então você tem x_1, x_2 e assim por diante até x_n, + +0:04:37.220,0:04:39.490 +direito? + +0:04:41.680,0:04:47.479 +Você está mais responsivo do que no ano passado, ótimo, obrigado. Tudo bem, então nós também podemos + +0:04:47.479,0:04:52.099 +reescrever este de maneiras diferentes. Então, a primeira maneira que vou escrever este + +0:04:52.099,0:05:00.020 +será o seguinte, então terei aqui estes 1, então terei + +0:05:00.020,0:05:12.409 +aqui meu a 2 e então eu tenho o último que vai ser meu an, ok? e então + +0:05:12.409,0:05:15.800 +aqui vou multiplicar isso por um vetor de coluna, certo? então meu vetor de coluna + +0:05:15.800,0:05:23.409 +Eu vou escrever assim. Tudo bem, então qual é o resultado desta operação? + +0:05:23.409,0:05:28.820 +então essas são métricas, você tem um vetor, o resultado será um? vetor. Então + +0:05:28.820,0:05:35.780 +qual vai ser o primeiro item do meu vetor? Eu não uso pontos porque eu não sou + +0:05:35.780,0:05:40.219 +um físico, na verdade eu sou, mas estamos fazendo álgebra linear, então o que devo + +0:05:40.219,0:05:42.400 +escrever? + +0:05:43.810,0:05:47.509 +Tudo bem, isso já foi transposto porque esses são um vetor linha, então eu apenas + +0:05:47.509,0:05:54.979 +escreva um Eu vou dizer apenas certo então eu tenho um 1 x ok então não há + +0:05:54.979,0:06:02.780 +transposição aqui, sem pontos ao redor e assim por diante. O segundo elemento vai ser? um 2, ok, + +0:06:02.780,0:06:12.970 +xe então até o último qual vai ser? ok, não há ponto, mas com certeza. 
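Um esboço rápido, com números aleatórios, do que acabou de ser escrito no quadro: cada componente de z = Ax é o produto escalar de uma linha de A com x.

```python
import torch

A = torch.randn(4, 3)                   # m = 4 linhas, n = 3 colunas (exemplo arbitrário)
x = torch.randn(3)

z = A @ x                               # transformação linear completa
z_linha_a_linha = torch.stack([torch.dot(A[i], x) for i in range(4)])
print(torch.allclose(z, z_linha_a_linha))   # True: z_i = a^(i) . x
```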
+ +0:06:12.970,0:06:17.810 +Como se alguém estivesse chamando aquele produto escalar em vez de produto vetorial, mas isso é + +0:06:17.810,0:06:20.389 +assume que você usa um tipo diferente de notação. + +0:06:20.389,0:06:26.710 +Tudo bem, então esse será o meu, quantos elementos esse vetor tem? + +0:06:26.710,0:06:35.120 +estou bem, então temos z 1, z 2 e assim por diante até zm e este é meu conjunto final, certo? + +0:06:35.120,0:06:39.710 +meu vetor z, ok? Fantástico. Agora vamos nos concentrar um pouco + +0:06:39.710,0:06:48.590 +sobre o significado dessa coisa aqui ok, outras perguntas até agora? + +0:06:48.590,0:06:54.530 +tudo bem, isso é muito trivial até agora, espero, quero dizer, deixe-me saber se não, ok + +0:06:54.530,0:07:00.140 +então vamos analisar um desses caras aqui, então eu gostaria de descobrir o que é + +0:07:00.140,0:07:10.970 +o significado de escrever um T vezes x certo, então meu a T será meu ai genérico, então + +0:07:10.970,0:07:24.680 +vamos supor, neste caso, quando n é igual a 2, certo? Então, o que é um T x? então um T x está indo + +0:07:24.680,0:07:29.780 +ser igual a quê? então deixe-me desenhar aqui algo para que seja mais fácil para + +0:07:29.780,0:07:37.720 +você entender. Então este vai ser meu a, esses vão ser meu alfa e + +0:07:37.720,0:07:48.620 +então aqui você tem como se fosse meu x e isso vai ser aqui meu xi, então + +0:07:48.620,0:07:57.430 +qual é a saída deste produto aqui? Este aqui. + +0:07:58.210,0:08:08.470 +Diga novamente, desculpe. Uma transposição, vamos chamá-la, digamos que é um vetor linha + +0:08:10.030,0:08:27.880 +Você pode ver? não? qual vai ser o resultado desta operação aqui? não por que? + +0:08:29.410,0:08:34.610 +Isso é como um genérico de muitos a's, certo? há muitos, + +0:08:34.610,0:08:39.169 +este é um daqueles m a's, então estou multiplicando um daqueles a's vezes meu x + +0:08:39.169,0:08:43.310 +certo, vamos supor que existam apenas duas dimensões, então qual será a + +0:08:43.310,0:08:50.660 +saída deste produto escalar? alguem pode me dizer? não não não normal + +0:08:50.660,0:09:02.720 +produto escalar. Espere então você tem um aqui, você tem + +0:09:02.720,0:09:08.510 +aqui, essa parte aqui vai ser 1, vai ser 2, certo? então você tem + +0:09:08.510,0:09:18.550 +aqui x 1 e agora você tem x 2, certo? então, como você expressa este produto escalar aqui? + +0:09:19.980,0:09:26.730 +ok, então estou apenas escrevendo, deixe-me saber se está tudo claro, então vou escrever aqui: + +0:09:28.890,0:09:37.780 +a 1 vezes x 1 mais a 2 vezes x 2, certo? esta é a definição. Claro, certo? Então + +0:09:37.780,0:09:44.830 +longe né? não? ok, sim, pergunta sim. Isso é para ser uma transposição. Uma fila + +0:09:44.830,0:09:49.600 +vetor de coluna de vezes, então vamos assumir que a é um vetor de coluna, então eu tenho tempos de linha + +0:09:49.600,0:10:02.230 +coluna. Então, vamos continuar escrevendo essas coisas aqui, então o que é 1? Como posso + +0:10:02.230,0:10:15.220 +computar 1? Repita? ok ok Então, vou escrever aqui que a 1 está indo + +0:10:15.220,0:10:20.770 +sendo o comprimento do vetor a vezes o cosseno alfa, então o que dizer de x 1? alguém + +0:10:20.770,0:10:30.010 +outro. O mesmo certo? Não, espere, o quê? o que é x 1? mesma coisa, certo, diferente + +0:10:30.010,0:10:37.450 +cartas. então alguém fala algo. Você está seguindo, você está completamente confuso, + +0:10:37.450,0:10:43.210 +você não está tendo ideia, é muito fácil? Não tenho ideia do que está acontecendo aqui. 
Isso é + +0:10:43.210,0:10:53.770 +boa direita? até agora ok? qual vai ser o segundo mandato? Este vai ser x + +0:10:53.770,0:11:00.430 +aqui né? vezes cos xi certo, e então você tinha o segundo termo que vai ser + +0:11:00.430,0:11:13.920 +que? magnitude de um ... grito, não consigo ouvir. Ok, seno de alfa e então + +0:11:17.660,0:11:21.290 +ok obrigado ok + +0:11:23.930,0:11:27.870 +tudo bem, vou apenas juntar aqueles dois caras, então você vai + +0:11:27.870,0:11:40.370 +obter igual magnitude de uma magnitude de vezes de x vezes cosseno alfa cosseno Xi mais + +0:11:40.370,0:11:48.210 +seno alfa e seno, cosseno, seno Xi, desculpe. + +0:11:48.210,0:11:54.210 +o que é o material entre parênteses? tudo bem, então é o cosseno do + +0:11:54.210,0:11:57.240 +diferença dos dois ângulos certo? todo mundo sabe trigonometria aqui, certo? + +0:11:57.240,0:12:05.160 +então, coisas do ensino médio, então este será igual a machado vezes o cosseno + +0:12:05.160,0:12:11.000 +de cos xi menos alfa, certo? ou o contrário, alfa menos xi. + +0:12:11.000,0:12:17.040 +Então o que isso quer dizer? Você pode pensar em cada elemento. até agora está claro? eu + +0:12:17.040,0:12:22.680 +não fez nenhuma mágica, sim, sacuda a cabeça assim para sim, isso para não, isso + +0:12:22.680,0:12:27.810 +porque talvez nada esteja funcionando. Ok, então você pode pensar sempre que quiser + +0:12:27.810,0:12:33.600 +multiplique uma matriz por um vetor que basicamente cada saída desta operação + +0:12:33.600,0:12:39.570 +vai medir, então ok espere um pouco, o que é esse cosseno? quanto é cosseno + +0:12:39.570,0:12:45.780 +de zero? 1. Então, isso significa que se esses dois ângulos, se os dois vetores estiverem alinhados, + +0:12:45.780,0:12:50.460 +o que significa que há um ângulo zero entre os dois vetores, você pode ter o + +0:12:50.460,0:12:57.090 +valor máximo desse elemento, certo? sempre que você tiver o menos o + +0:12:57.090,0:13:03.240 +valor mais negativo? quando eles são opostos, certo? então quando eles estão em + +0:13:03.240,0:13:08.310 +oposição de fase, você obterá a magnitude mais negativa, mas se você + +0:13:08.310,0:13:11.730 +aplique apenas digamos um ReLU, você vai cortar todas as coisas negativas que você está apenas + +0:13:11.730,0:13:16.200 +verificando as correspondências positivas, então a rede neural basicamente apenas talvez + +0:13:16.200,0:13:20.520 +vai descobrir apenas as correspondências positivas, certo? e então novamente quando + +0:13:20.520,0:13:23.620 +você multiplica uma matriz por um vetor de coluna que você + +0:13:23.620,0:13:31.330 +estar realizando um produto escalar lamentável em termos de elemento entre cada coluna, cada linha de + +0:13:31.330,0:13:36.220 +a matriz que representa o seu kernel certo? então, sempre que você tiver um + +0:13:36.220,0:13:40.000 +camada seu kernel vai ser toda a linha da matriz e agora você vê o que + +0:13:40.000,0:13:47.050 +é a projeção dessa entrada nessa coluna, quero dizer, na entrada dessa linha + +0:13:47.050,0:13:52.900 +direito? então cada elemento deste produto vai lhe dizer o alinhamento com + +0:13:52.900,0:13:57.580 +qual a entrada é qual é o alinhamento da entrada em relação ao + +0:13:57.580,0:14:04.600 +linha específica da matriz ok? sim? não? isso deve moldar um pouco mais como + +0:14:04.600,0:14:08.290 +intuição, enquanto usamos essas transformações lineares, elas são como + +0:14:08.290,0:14:13.140 +permitindo que você veja a projeção da entrada em diferentes tipos de + +0:14:13.140,0:14:22.300 +orientações digamos assim. Certo? 
você pode tentar saber extrapolar isso em + +0:14:22.300,0:14:26.140 +dimensões altas, eu acho que a intuição pelo menos eu posso dar a você funciona + +0:14:26.140,0:14:30.580 +definitivamente em duas e três dimensões, em dimensões superiores eu meio que acho + +0:14:30.580,0:14:34.209 +funciona de maneira semelhante. próxima lição que vamos assistir, na verdade somos nós + +0:14:34.209,0:14:38.890 +vamos ver como qual é a distribuição das projeções em um + +0:14:38.890,0:14:43.240 +espaço dimensional mais alto é esse tipo que vai ser tão legal, eu acho. tudo bem então + +0:14:43.240,0:14:49.779 +essa foi a primeira parte de eu penso na aula ah bem, tem mais uma + +0:14:49.779,0:14:54.940 +parte, então na verdade aqui este z aqui também podemos escrever de uma maneira diferente, + +0:14:54.940,0:15:00.130 +talvez isso seja talvez seja conhecido talvez não seja conhecido. quando eu vi pela primeira vez + +0:15:00.130,0:15:05.050 +Eu não sabia, então você sabe que é legal às vezes você ver essas coisas uma vez + +0:15:05.050,0:15:10.450 +de novo talvez então vamos voltar aqui é o mesmo z ali e então você pode expressar + +0:15:10.450,0:15:18.820 +este z como sendo igual ao vetor a 1, neste caso a 1 será o primeiro + +0:15:18.820,0:15:22.830 +coluna da matriz a ok e esta vai ser multiplicada pelo escalar + +0:15:22.830,0:15:29.380 +x1 agora você tem a segunda coluna da matriz, então eu tenho um 2 que é multiplicado + +0:15:29.380,0:15:34.870 +pelo segundo elemento do X à direita até o último qual vai ser? + +0:15:34.870,0:15:44.930 +novamente? Não consigo ouvir se é m ou n? m? assim? ou n? você conhece a linguagem de sinais? que + +0:15:44.930,0:15:52.880 +1? n? certo, você conhece a linguagem de sinais? não? você deve aprender que é bom, sabe? + +0:15:52.880,0:15:59.630 +inclusividade. um certo? então a última coluna vezes seu xn, é claro porque x + +0:15:59.630,0:16:04.100 +tem um tamanho de n, existem n itens, certo? e então basicamente quando você também quando você + +0:16:04.100,0:16:07.940 +faça uma transformação linear ou aplique um operador linear + +0:16:07.940,0:16:12.890 +vão pesar basicamente cada coluna da matriz com o coeficiente que + +0:16:12.890,0:16:16.640 +está em um, você sabe, você tem a primeira coluna vezes o primeiro coeficiente do + +0:16:16.640,0:16:22.130 +vetor, segunda coluna e pelo segundo item, mais a terceira coluna vezes o terceiro + +0:16:22.130,0:16:26.420 +item e, portanto, você pode ver que a saída dessa transformação de DN é uma soma ponderada + +0:16:26.420,0:16:32.180 +das colunas da matriz a ok? então este é um tipo diferente de intuição + +0:16:32.180,0:16:36.620 +às vezes você vê isso como se você quisesse expressar seu sinal, seu + +0:16:36.620,0:16:45.290 +dados são uma combinação de diferentes, você sabe, a composição, isso é uma espécie de + +0:16:45.290,0:16:50.600 +composição linear de sua entrada. tudo bem então essa foi a primeira parte, é o + +0:16:50.600,0:16:57.530 +recapitulação sobre a álgebra linear. uma segunda parte vai ser algo ainda mais + +0:16:57.530,0:17:06.790 +legal eu acho. perguntas até agora? não? fácil? muito fácil? você está ficando entediado? + +0:17:06.790,0:17:10.670 +desculpe ok tudo bem então eu vou acelerar eu acho. 
+ +0:17:10.670,0:17:15.170 +tudo bem, então vamos ver como podemos estender o que as coisas que vimos + +0:17:15.170,0:17:19.040 +agora para as convoluções certas, então talvez as convoluções às vezes sejam um pouco + +0:17:19.040,0:17:28.900 +estranho, vamos ver como podemos fazer uma extensão para convoluções + +0:17:31.720,0:17:38.390 +tudo bem. então, digamos que eu comece com a mesma matriz. então vou ter aqui quatro + +0:17:38.390,0:17:53.660 +linhas e, em seguida, três colunas. Certo. então meus dados têm que ser? se eu tenho, se eu tenho isso + +0:17:53.660,0:17:59.360 +matriz, se eu multiplicar isso por uma coluna, meu vetor de coluna deve ser? do tamanho? três, + +0:17:59.360,0:18:04.250 +obrigada. tudo bem, deixe-me desenhar aqui meu vetor de coluna de tamanho três e este + +0:18:04.250,0:18:08.809 +vai te dar uma saída de tamanho quatro ok fantástico + +0:18:08.809,0:18:15.740 +mas então são seus dados, digamos que você vai ouvir um bom áudio, áudio + +0:18:15.740,0:18:21.260 +arquivo, seus dados têm apenas três amostras de comprimento? quanto tempo vão ficar seus dados? Digamos + +0:18:21.260,0:18:24.590 +você ouvindo uma música que dura três minutos + +0:18:24.590,0:18:32.330 +quantas amostras tem três minutos de áudio? sim, eu acho, o que é + +0:18:32.330,0:18:40.070 +vai ser minha taxa de amostragem? digamos vinte e dois, ok. vinte e dois mil quilos + +0:18:40.070,0:18:46.480 +Hertz, certo? 22 quilohertz então quantas amostras de três minutos de música tem? + +0:18:47.799,0:18:58.010 +Repita? Tem certeza que? é monofônico ou estereofônico? estou brincando. OK, então + +0:18:58.010,0:19:02.650 +você vai multiplicar o número de amostras, o número de segundos, certo? a + +0:19:02.650,0:19:08.660 +número de segundos vezes a taxa de quadros, certo? ok, a frequência neste + +0:19:08.660,0:19:12.620 +caso. de qualquer forma, esse sinal vai ser muito, muito longo, certo? vai ser mantido + +0:19:12.620,0:19:16.940 +indo para baixo, então se eu tiver um vetor que é muito, muito longo, eu tenho que usar um + +0:19:16.940,0:19:23.540 +matriz que vai ficar muito, muito gorda, larga, certo? ok fantástico então este top + +0:19:23.540,0:19:27.080 +continua indo nessa direção, tudo bem, então minha pergunta para você vai + +0:19:27.080,0:19:31.540 +ser o que devo colocar neste local aqui? + +0:19:35.570,0:19:48.020 +o que devo colocar aqui? então, nós nos importamos com coisas que estão mais distantes? + +0:19:48.740,0:19:54.000 +Não porque não? porque nossos dados têm a propriedade de + +0:19:54.000,0:20:00.299 +localidade, fantástica. então o que vou fazer o que vou colocar aqui? Uma grande + +0:20:00.299,0:20:06.690 +zero, certo, fantástico, bom trabalho, ok então colocamos um zero aqui e então qual é o outro + +0:20:06.690,0:20:09.809 +propriedade, então deixe-me começar a desenhar essas coisas novamente para que eu possa ter meu + +0:20:09.809,0:20:16.890 +kernel de tamanho três e aqui estou meus dados que serão muito longos + +0:20:16.890,0:20:31.169 +direito? e assim por diante. Não posso desenhar, espere. Eu não consigo ver Tudo bem, então aqui há zero, então vamos dizer o que + +0:20:31.169,0:20:36.960 +é a outra propriedade que meus dados naturais têm? estacionariedade, que + +0:20:36.960,0:20:45.059 +meios? o padrão que você espera encontrar pode ser uma espécie de repetição + +0:20:45.059,0:20:49.140 +e de novo certo? e então se eu tiver aqui meus três valores aqui talvez eu + +0:20:49.140,0:20:53.789 +gostaria de reutilizá-los uma e outra vez, certo? 
e então, se esses três valores permitirem + +0:20:53.789,0:21:00.779 +eu mude a cor talvez para que você possa ver que há a mesma coisa. então eu tenho + +0:21:00.779,0:21:07.230 +três valores aqui e então vou usar esses três mesmos valores em uma etapa + +0:21:07.230,0:21:16.770 +mais, certo? e eu continuo descendo e continuo assim + +0:21:16.770,0:21:22.799 +direito. Então, o que devo colocar aqui no fundo? O que devo colocar aqui? um zero, + +0:21:22.799,0:21:28.529 +direito? por que isso por que isso? devido à localidade dos dados. direito? tão colocando + +0:21:28.529,0:21:36.350 +zeros ao redor é chamado também é chamado de preenchimento, mas neste caso é chamado + +0:21:36.350,0:21:44.850 +esparsidade, certo? então isso é como esparsidade e, em seguida, a replicação dessa coisa + +0:21:44.850,0:21:52.730 +isso é repetidamente chamado de estacionariedade era a propriedade + +0:21:52.730,0:21:57.430 +do sinal, isso é chamado de divisão de peso. sim? + +0:22:04.330,0:22:10.090 +ok fantástico tudo bem então quantos valores nós temos agora? quantos + +0:22:10.090,0:22:13.530 +parâmetros que tenho do lado direito? + +0:22:15.150,0:22:22.750 +bem, então temos três parâmetros. No lado esquerdo, ao invés, nós + +0:22:22.750,0:22:28.900 +teve? Doze, certo? então o lado direito vai ser, o lado direito vai ser + +0:22:28.900,0:22:36.810 +trabalhar em tudo? você tem três parâmetros de um lado do outro lado você tem 12. + +0:22:36.810,0:22:41.860 +OK? isso é bom, usando localidade e qualquer coisa diferente de dispersão + +0:22:41.860,0:22:45.610 +e compartilhamento de parâmetros, mas acabamos com apenas três parâmetros, não é + +0:22:45.610,0:22:51.190 +isso é muito restritivo? como podemos ter vários parâmetros múltiplos? o que é + +0:22:51.190,0:22:56.290 +faltando aqui no quadro geral? existem vários canais, certo? então isso é + +0:22:56.290,0:23:01.570 +apenas uma camada aqui e então você tem essas coisas saindo do + +0:23:01.570,0:23:08.230 +placa aqui, então você tem o primeiro kernel aqui, agora você tem algum + +0:23:08.230,0:23:20.500 +segundo kernel, digamos este, e eu tenho o último aqui, certo? e então você tem + +0:23:20.500,0:23:24.760 +cada plano dessas métricas contendo apenas um kernel que é + +0:23:24.760,0:23:36.190 +replicado várias vezes. Quem sabe o nome desta matriz? então isso vai + +0:23:36.190,0:23:40.650 +ser chamada de matriz Toeplitz + +0:23:43.419,0:23:48.219 +Certo? então qual é a principal característica dessas matrizes Toeplitz? qual é o grande + +0:23:48.219,0:24:01.479 +grande coisa que você não notará? é uma matriz esparsa. ok ok o que vai + +0:24:01.479,0:24:16.149 +estar aqui, esse primeiro item aqui? qual é o conteúdo do primeiro cara? sim? tão + +0:24:16.149,0:24:21.669 +este aqui vai ser a extensão da minha transformação linear que foi, + +0:24:21.669,0:24:26.169 +você sabe, eu tenho um sinal que é maior do que três amostras, portanto, eu tenho que + +0:24:26.169,0:24:32.229 +tornar esta matriz mais gorda, a segunda parte será dada que eu não me importo + +0:24:32.229,0:24:36.129 +que coisas, tipo, coisas que estão aqui embaixo, não me importo com coisas que estão + +0:24:36.129,0:24:40.299 +aqui, se eu olhar para os pontos que estão aqui em cima, vou colocar um grande 0 aqui + +0:24:40.299,0:24:45.070 +para que tudo que está aqui embaixo seja limpo, né? 
e + +0:24:45.070,0:24:49.899 +finalmente vou usar o mesmo kernel repetidamente porque + +0:24:49.899,0:24:55.839 +suponho que meus dados estão estacionários e, portanto, suponho que padrões semelhantes + +0:24:55.839,0:24:58.929 +vão acontecer uma e outra vez, portanto, vou usar este + +0:24:58.929,0:25:03.429 +aquele que está escrito aqui: divisão de peso. + +0:25:03.429,0:25:08.019 +Finalmente, dado que este fornece apenas três parâmetros para + +0:25:08.019,0:25:12.940 +trabalhar com vou usar várias camadas para ter diferentes, sabe, + +0:25:12.940,0:25:17.739 +canais. Portanto, este é um kernel. antes, um kernel era toda a linha de + +0:25:17.739,0:25:21.849 +a matriz, ok? então, quando você tem uma camada totalmente conectada, a única diferença + +0:25:21.849,0:25:25.179 +entre uma camada totalmente conectada e uma convolução é que você tem todo o + +0:25:25.179,0:25:37.019 +linha da matriz. Então, o que vai estar neste primeiro item aqui? qualquer um? + +0:25:38.480,0:25:43.230 +então o kernel verde, vamos chamar o kernel verde de apenas 1, deixe-me realmente + +0:25:43.230,0:25:55.200 +faça com que ela brilhe em verde porque é uma semente verde. Então você tem 1 vezes ... o quê? Está + +0:25:55.200,0:25:58.890 +vai ser do número um ao número três certo? e então o segundo item é + +0:25:58.890,0:26:05.640 +vai ser o mesmo cara aqui um 1 e então você vai ter o x mudado por + +0:26:05.640,0:26:20.039 +um e assim por diante certo? faz sentido? sim e então nós teremos este está indo + +0:26:20.039,0:26:23.970 +para ser a saída verde, então você terá a saída azul uma camada chegando + +0:26:23.970,0:26:26.730 +para fora e então você tem o outro vindo o vermelho. + +0:26:26.730,0:26:32.929 +mesmo uma camada de fora. OK? A experiência com o iPad foi legal? + +0:26:32.929,0:26:40.520 +sim? não? Eu gostei. OK. Outras perguntas? + +0:26:41.029,0:26:49.049 +Repita? o círculo azul este aqui? é um grande zero + +0:26:49.049,0:26:53.480 +essa é a dispersão que o mesmo está aqui. + +0:26:53.899,0:26:58.520 +sim? não? Yeah, yeah. + +0:27:01.530,0:27:16.200 +então aqui eu coloquei muitos zeros aqui dentro então matei todos os + +0:27:16.200,0:27:21.000 +valores que estão longe da pequena parte e depois repito os mesmos três + +0:27:21.000,0:27:24.480 +valores repetidamente porque espero encontrar o mesmo padrão em + +0:27:24.480,0:27:33.360 +diferentes regiões deste, este grande grande sinal que eu tenho. Este aqui? Então eu disse + +0:27:33.360,0:27:36.840 +que neste caso terei apenas três valores, certo? e começamos com + +0:27:36.840,0:27:41.040 +12 valores e acabei com 3, que é realmente muito pouco, então se eu quiser + +0:27:41.040,0:27:44.760 +tem, digamos, 6 valores, então se eu quiser ter seis valores e posso ter meu + +0:27:44.760,0:27:49.530 +segundo 3 em um plano diferente e eu realizo a mesma operação sempre que você + +0:27:49.530,0:27:55.110 +multiplique esta matriz por um vetor e você realiza uma convolução para que ela apenas diga + +0:27:55.110,0:27:58.770 +você que uma convolução é apenas uma multiplicação de matriz com muitos zeros + +0:27:58.770,0:28:09.600 +é isso. Sim, então eles vão ter este aqui, então você tem um segundo, + +0:28:09.600,0:28:14.790 +então você tem um terceiro, então você tem três versões da entrada. Tudo bem. Então, para o + +0:28:14.790,0:28:18.000 +segunda parte da aula, vou mostrar a vocês algumas coisas mais interativas + +0:28:18.000,0:28:27.810 +por favor, participe da segunda parte também, certo? 
então vamos tentar então eu tenho + +0:28:27.810,0:28:37.950 +reformulei a marca Eu mudei a marca do site e agora o + +0:28:37.950,0:28:42.450 +ambiente será chamado de pDL, portanto, o aprendizado profundo PyTorch em vez de + +0:28:42.450,0:28:49.980 +minicurso de aprendizagem profunda, era muito longo. Então, deixe-me começar executando este + +0:28:49.980,0:28:52.460 +tão + +0:28:55.320,0:29:04.960 +Aprendizado profundo do PyTorch para que possamos fazer apenas Conda ativar ativar o PyTorch profundo + +0:29:04.960,0:29:12.940 +aprendizagem (pDL) e a seguir vamos abrir o caderno, o caderno de Júpiter. Tudo bem, então agora você está + +0:29:12.940,0:29:18.520 +vai estar assistindo, repassando a escuta de kernels. Então eu te mostrei um + +0:29:18.520,0:29:22.630 +convolução no papel bem no meu tablet agora você vai ouvir + +0:29:22.630,0:29:25.990 +convolução também pode, de modo que você pode realmente apreciar o que essas convoluções + +0:29:25.990,0:29:35.919 +estão. Aqui dissemos, o novo kernel certo que é chamado pDL PyTorch deep + +0:29:35.919,0:29:42.520 +aprendendo, então você notará o mesmo tipo de procedimento se atualizar + +0:29:42.520,0:29:49.690 +Seu sistema. Tudo bem, então, neste caso, podemos ler o topo aqui, então deixe-me esconder o + +0:29:49.690,0:29:52.200 +topo aqui. + +0:29:52.890,0:29:56.950 +Tudo bem, considerando a suposição de localidade, estacionariedade e + +0:29:56.950,0:30:00.280 +composicionalidade, podemos reduzir a quantidade de computação para uma matriz + +0:30:00.280,0:30:05.169 +multiplicação de vetores usando uma matriz de Toeplitz esparsa porque local porque + +0:30:05.169,0:30:09.850 +esquema estacionário, desta forma, podemos simplesmente acabar redescobrindo o + +0:30:09.850,0:30:14.980 +operador de convolução, certo? além disso, também podemos lembrar que um produto escalar é + +0:30:14.980,0:30:19.150 +uma distância cosseno simplesmente normalizada que nos diz o alinhamento de dois + +0:30:19.150,0:30:21.850 +vetores, mais especificamente, calculamos o + +0:30:21.850,0:30:26.320 +magnitude da projeção ortogonal de dois vetores um sobre o outro e vice + +0:30:26.320,0:30:29.590 +versa. Então, vamos descobrir agora como tudo isso + +0:30:29.590,0:30:34.270 +pode fazer sentido usando nossos ouvidos, certo? então vou importar uma biblioteca que + +0:30:34.270,0:30:39.880 +professor aqui da NYU feito e aqui vou carregar meus dados de áudio e + +0:30:39.880,0:30:43.600 +Eu vou ter isso no meu x, e então minha taxa de amostragem vai ser + +0:30:43.600,0:30:48.570 +na outra variável. Então, aqui vou apenas mostrar que terei cerca de 70.000 + +0:30:48.570,0:30:54.430 +amostras neste caso porque eu tenho uma taxa de amostragem de 22 quilo Hertz e então + +0:30:54.430,0:31:01.720 +meu tempo total será de três segundos, ok, então três segundos vezes 22 você começa + +0:31:01.720,0:31:06.910 +que? então não é 180 que você estava dizendo, era cento e oitenta, era três, certo? + +0:31:06.910,0:31:11.380 +Oh, foram três minutos, você está certo, são três segundos, então você realmente está + +0:31:11.380,0:31:16.540 +corrigir meu mal. Então, são três segundos, então vezes 22 quilo Hertz você tem 70 + +0:31:16.540,0:31:22.390 +cerca de 70.000 amostras. Aqui, vou importar algumas bibliotecas para + +0:31:22.390,0:31:28.180 +mostrarei algo e então mostrarei o primeiro gráfico, então este é + +0:31:28.180,0:31:37.270 +o sinal de áudio que importei agora, como está? ondulado, ok legal. + +0:31:37.270,0:31:50.680 +Você pode me dizer como isso soa? 
Aluno: "aaaaaaaaaaahhhhhh". Esse foi um bom palpite. O palpite era 'aaah'. Sim, você não pode dizer exatamente + +0:31:50.680,0:31:55.450 +qual é o conteúdo, certo? a partir deste diagrama porque a amplitude de, + +0:31:55.450,0:32:01.240 +o eixo y aqui vai mostrar apenas a amplitude. posso + +0:32:01.240,0:32:05.530 +apague a luz? está tudo bem? ou ... tem certeza? ok obrigado, + +0:32:05.530,0:32:16.090 +Eu realmente não gosto dessas luzes. OK. Boa noite. Oh, vê como isso é bom? Certo + +0:32:16.090,0:32:19.930 +legal. Tudo bem, então você não pode dizer nada aqui, certo? + +0:32:19.930,0:32:26.580 +você não pode dizer o que é o que é o som, certo? então como podemos descobrir + +0:32:26.580,0:32:31.870 +qual é o som aqui dentro? então, por exemplo, posso mostrar a você uma transcrição de + +0:32:31.870,0:32:37.660 +o som e, na verdade, deixe-me realmente forçá-los em seu + +0:32:37.660,0:32:44.810 +sua cabeça, certo? então você vai ter ... espere, não funcionou. * Ouve-se um som * + +0:32:44.810,0:32:50.610 +tudo bem, agora nós realmente ouvimos, ok, agora você pode realmente ver * imita o som * + +0:32:50.610,0:32:56.400 +você sabe que pode imaginar um pouco, mas tudo bem e daí + +0:32:56.400,0:33:00.830 +notas que tocamos lá? como posso descobrir quais são as notas que + +0:33:00.830,0:33:05.550 +eles estão dentro? então vou mostrar este, já que é um pouco mais claro + +0:33:05.550,0:33:13.820 +Eu posso ver seus rostos. Quantos de vocês não podem ler isso? Oh, ai ... + +0:33:13.820,0:33:20.960 +Ok, deixe-me ver se posso pedir ajuda. + +0:33:23.620,0:33:26.620 +Talvez alguém possa nos ajudar aqui. + +0:33:29.400,0:33:32.480 +OK. Vamos ver. + +0:33:40.140,0:33:42.140 +Ei, ei Alf! Oh, oi Alf! + +0:33:42.880,0:33:45.420 +Como está indo? Sim, estou bem, obrigado. + +0:33:45.420,0:33:47.040 +Óculos bonitos lá! Oh, obrigado pelos óculos. + +0:33:47.040,0:33:49.040 +Oh, belo suéter! você também! Belo suéter! + +0:33:49.040,0:33:51.040 +Oh, estamos usando o mesmo suéter! + +0:33:51.040,0:33:54.480 +Você pode nos ajudar? Eles não sabem ler o + +0:33:54.480,0:33:57.120 +Oh, a conexão ... Que diabos! + +0:33:57.120,0:34:00.380 +Eles não podem ler a partitura! Você pode nos ajudar, por favor? + +0:34:00.380,0:34:02.380 +Tudo bem! Deixe-me tentar ajudá-lo. + +0:34:02.380,0:34:04.380 +Obrigado! Deixe-me trocar a câmera. + +0:34:04.380,0:34:06.380 +Tudo bem. Por favor faça. + +0:34:06.380,0:34:08.380 +Então, aqui podemos ir como ... + +0:34:08.380,0:34:10.860 +e ouça primeiro como tudo soa. + +0:34:10.860,0:34:14.320 +Então, vai ser assim. + +0:34:14.440,0:34:23.380 +Quão legal é isso? * alunos aplaudem * + +0:34:23.380,0:34:27.990 +Obrigada. Demorou quatro lições para você me aplaudir. Então agora… + +0:34:27.990,0:34:32.360 +Isso é muito legal da sua parte. Vamos continuar. + +0:34:32.360,0:34:36.320 +A ♭, então temos um E ♭, e então um A ♭. + +0:34:36.320,0:34:40.380 +A diferença entre o primeiro A ♭ e o outro em frequências + +0:34:40.380,0:34:46.540 +é que o primeiro A ♭ terá o dobro da frequência do outro. + +0:34:46.540,0:34:51.400 +E em vez disso, no meio, temos o 5º. Vamos descobrir qual é a frequência disso. + +0:34:51.400,0:34:55.320 +E então, vamos para um B ♭, aqui. + +0:34:55.320,0:34:57.680 +No lado esquerdo, em vez disso, temos o acompanhamento, + +0:34:57.680,0:35:01.220 +e então teremos um A ♭ e B ♭ + +0:35:01.220,0:35:05.620 +e então B ♭ e E ♭. + +0:35:05.680,0:35:10.900 +Então, se juntarmos todos, vamos conseguir este. + +0:35:11.300,0:35:14.020 +Tudo bem? Simples, não? Sim! 
Obrigado! + +0:35:14.020,0:35:17.320 +Bye Bye! Tchaaaau! + +0:35:18.820,0:35:23.540 +Ver? Demorou um dia inteiro para se preparar ... + +0:35:23.540,0:35:27.480 +Eu estava tão nervoso antes de vir aqui ... + +0:35:27.480,0:35:32.140 +Eu não sabia se realmente teria funcionado ... Ambos, tablet e este. + +0:35:32.140,0:35:35.060 +Estou tão feliz! Agora posso realmente dormir, mais tarde. + +0:35:35.060,0:35:39.280 +De qualquer forma, isso foi como no primeiro + +0:35:39.290,0:35:43.280 +parte você vai ter a primeira nota, há A ♭ que você tem um B ♭ + +0:35:43.280,0:35:51.550 +A ♭ e B ♭ para que você * recrie o som * e oe a diferença entre o primeiro tom e + +0:35:51.550,0:35:57.440 +é uma oitava, portanto a primeira frequência será o dobro da segunda + +0:35:57.440,0:36:02.870 +frequência. OK? então, sempre que vamos observar a forma de onda, um sinal + +0:36:02.870,0:36:07.730 +tem um mais curto igual a metade do período do outro, certo? + +0:36:07.730,0:36:12.410 +especialmente o A ♭ no topo terá um período que é a metade de + +0:36:12.410,0:36:20.000 +o período do A ♭ no inferior, certo, então você * recria o som * ok, se você for a metade de + +0:36:20.000,0:36:27.290 +este que você obteve * soa * certo, ok ok, então, como realmente tiramos essas notas de + +0:36:27.290,0:36:33.770 +esse espectro, da forma de onda? quem pode me dizer como posso extrair estes + +0:36:33.770,0:36:40.790 +arremessos, essas frequências do outro sinal? qualquer palpite? ok transformada de Fourier + +0:36:40.790,0:36:45.530 +que eu acho que é um bom palpite. O que acontece se eu executar agora um + +0:36:45.530,0:36:50.660 +Transformada de Fourier desse sinal? alguém pode realmente me responder? você não pode aumentar + +0:36:50.660,0:36:54.100 +sua mão porque eu não vejo, apenas grite. + +0:36:55.120,0:36:59.690 +Então, se você basicamente realizar a transformada de Fourier de todo o sinal, você + +0:36:59.690,0:37:06.470 +vai ouvir * faz som * como todas as notas juntas * faz som *? todos juntos, certo, mas então você não pode + +0:37:06.470,0:37:13.260 +descobrir qual pitch está tocando, onde ou quando, neste caso, certo. + +0:37:13.260,0:37:18.210 +Ha! Então, precisamos de uma espécie de transformada de Fourier que é localizada e, portanto, um + +0:37:18.210,0:37:23.190 +transformada de Fourier localizada no tempo ou no espaço, dependendo de qualquer domínio + +0:37:23.190,0:37:27.390 +você está usando seu espectrograma denominado. certo, e assim por diante eu vou ser + +0:37:27.390,0:37:30.000 +imprimindo para você o espectrograma, desculpe. + +0:37:30.000,0:37:34.380 +e estarei imprimindo aqui o espectrograma deste aqui. E então aqui + +0:37:34.380,0:37:39.960 +você pode comparar os dois, certo, na primeira parte aqui deste lado aqui você está + +0:37:39.960,0:37:47.970 +vai ter esse pico aqui em 1600 que é o * faz som *, o tom é realmente mais alto. * faz som * lá vamos nós. Agora + +0:37:47.970,0:37:56.640 +você tem um segundo que é este pico aqui * faz som * e então este * faz som *. Você pode ver esse pico, certo? E você + +0:37:56.640,0:38:04.260 +veja este pico, tudo bem, então esses picos serão as notas reais que toco + +0:38:04.260,0:38:07.560 +com a mão direita, então vamos colocá-los juntos e + +0:38:07.560,0:38:13.140 +Vou ter aqui as frequências. Então eu tenho 1600, 1200 e 800, você pode ver aqui? + +0:38:13.140,0:38:20.550 +Eu tenho 1600, 800, por que um é o dobro do outro? 
porque eles são uma oitava + +0:38:20.550,0:38:27.540 +separados, então se isso é * faz som * isso vai ser * faz som * certo e este é um quinto que também + +0:38:27.540,0:38:32.280 +tem um bom intervalo. Então, deixe-me gerar esses sinais aqui e depois + +0:38:32.280,0:38:36.720 +tem que ser concatená-los todos, então vou jogar os dois. o primeiro + +0:38:36.720,0:38:42.109 +um é na verdade o áudio original, mas + +0:38:42.109,0:38:45.470 +deixe-me tentar de novo, enquanto se eu jogar + +0:38:45.680,0:38:54.860 +o segundo, a concatenação, sim, está um pouco alto, agora não consigo nem + +0:38:54.860,0:39:01.720 +reduza o volume. Oh, eu posso reduzir isso aqui. Demais. Ok, deixe-me ir de novo. Tudo + +0:39:02.020,0:39:05.350 +direito. Então, esta é a concatenação desses + +0:39:05.350,0:39:12.380 +quatro pitches diferentes, então adivinhe o que faremos a seguir? então como posso + +0:39:12.380,0:39:20.420 +extrair todas as notas que posso ouvir em uma peça específica? então vamos dizer + +0:39:20.420,0:39:29.360 +você joga uma partitura completa e eu gostaria de saber qual campo é jogado e a que horas. Do + +0:39:29.360,0:39:36.230 +que? então a resposta foi convolução, apenas para a gravação, então estou pedindo convolução + +0:39:36.230,0:39:43.370 +sobre o que? sem convolução do espectrograma, então você tem convolução de sua entrada + +0:39:43.370,0:39:49.460 +sinalizar com o quê? com algum tipo diferente de pitches, os quais irão + +0:39:49.460,0:39:59.150 +sua vez? digamos que você não veja o espectro, porque digamos que eu só vou + +0:39:59.150,0:40:03.770 +tocar qualquer tipo de música, então eu gostaria de saber todas as notas possíveis que são + +0:40:03.770,0:40:06.220 +aí o que você faria? + +0:40:06.220,0:40:13.430 +você não conhece todos os arremessos, como você tentaria? certo, então em que estão todos os + +0:40:13.430,0:40:20.900 +tons que você pode querer usar, se estiver tocando piano? todas as chaves de + +0:40:20.900,0:40:24.530 +o piano, certo? então, se eu tocar um concerto com o piano, eu quero + +0:40:24.530,0:40:28.010 +tenho um pedaço de áudio para cada uma dessas teclas e vou estar executando + +0:40:28.010,0:40:32.690 +circunvoluções de toda a minha peça com as chaves antigas, certo? e portanto você é + +0:40:32.690,0:40:36.470 +veremos picos que são o alinhamento da similaridade do cosseno + +0:40:36.470,0:40:41.349 +sempre que você obtiver basicamente o áudio correspondente ao seu kernel específico. + +0:40:41.349,0:40:46.989 +então vou fazer isso, mas com esses tons específicos, na verdade, extraio + +0:40:46.989,0:40:52.929 +aqui. Então, aqui vou mostrar primeiro como os dois espectrogramas se parecem + +0:40:52.929,0:40:57.699 +como se o lado esquerdo fosse o espectrograma do meu sinal real X de t + +0:40:57.699,0:41:01.630 +e no lado direito eu tenho apenas o espectrograma desta concatenação de + +0:41:01.630,0:41:10.749 +meus argumentos, então aqui você pode ver claramente que isso * faz som *, mas aqui, em primeiro lugar, o que + +0:41:10.749,0:41:14.429 +são essas barras aqui, essas barras verticais? + +0:41:15.269,0:41:20.589 +você está seguindo, certo? Eu não posso te ver, tenho que realmente responder. O que são esses vermelhos + +0:41:20.589,0:41:24.160 +barras aqui, barras verticais? Agora, o horizontal eu já falei pra vocês, né? + +0:41:24.160,0:41:34.390 +* faz som * e a vertical? o que é? problemas de amostragem, certo, transições. 
Então + +0:41:34.390,0:41:39.099 +sempre que você tem o * faz som *, você na verdade tem uma forma de onda branca, uma forma de onda e depois + +0:41:39.099,0:41:44.019 +o outro, uma forma de onda tem que parar para que não seja mais periódica e sempre + +0:41:44.019,0:41:47.609 +você faz uma transformada de Fourier de um sinal não periódico, você sabe uma porcaria. + +0:41:47.609,0:41:53.589 +É por isso que sempre que você consegue a junção entre eles o * faz o som * o salto + +0:41:53.589,0:41:57.729 +aqui você vai ter este pico porque você pode + +0:41:57.729,0:42:01.749 +pensar no salto é como ter uma frequência muito alta né? Porque + +0:42:01.749,0:42:05.469 +é como um delta, então você realmente consegue todas as frequências, é por isso que você + +0:42:05.469,0:42:12.549 +obtenha todas as frequências aqui. Estrondo. OK? faz sentido até agora? tipo de? tudo bem. + +0:42:12.549,0:42:18.720 +Esta é a versão limpa * faz som * Não consigo nem assinar e o que + +0:42:18.720,0:42:25.800 +lado esquerdo aqui? por que está do lado esquerdo todo vermelho aí embaixo? OK + +0:42:25.800,0:42:31.440 +sim, você sabia. então o lado esquerdo do lado esquerdo do cabo é o que eu mostro a vocês no + +0:42:31.440,0:42:37.320 +lado esquerdo inferior. Ok, então deixe-me terminar esta aula e depois deixo você ir. Então + +0:42:37.320,0:42:42.990 +aqui vou te mostrar primeiro todos os kernels, você pode dizer agora + +0:42:42.990,0:42:48.090 +o vermelho vai ser o primeiro pedaço do meu sinal, o real + +0:42:48.090,0:42:53.280 +um e então você pode ver que o primeiro tom tem a mesma frequência, + +0:42:53.280,0:42:58.460 +você pode ver? Portanto, o * faz som * tem o mesmo + +0:42:58.460,0:43:04.230 +delta t o mesmo intervalo, período, você pode ver? você não pode acenar com a cabeça + +0:43:04.230,0:43:08.700 +cabeça porque de novo eu não vejo você, tem que me responder. Você pode ver ou não? OK, + +0:43:08.700,0:43:13.050 +obrigado, fantástico. E então este é o terceiro, você pode ver que + +0:43:13.050,0:43:17.130 +começa aqui no período e termina aqui, se você subir aqui você está + +0:43:17.130,0:43:20.520 +veremos exatamente que havia dois desses caras, certo, então isso é + +0:43:20.520,0:43:24.619 +como você pode ver isso é como o dobro da frequência do abaixo. + +0:43:24.619,0:43:30.089 +Finalmente, irei realizar a convolução desses quatro kernels com + +0:43:30.089,0:43:36.839 +meu sinal de entrada, e é assim que parecemos, ok, então o primeiro kernel tem um alto + +0:43:36.839,0:43:42.150 +coincidir na primeira parte do placar. Então, entre zero e zero cinco + +0:43:42.150,0:43:46.830 +segundos. O segundo começa logo após o primeiro, então você tem o + +0:43:46.830,0:43:50.820 +terceiro começando em zero três eu acho e então você tem o último + +0:43:50.820,0:43:56.940 +começando do zero seis, certo? Então adivinhe? Eu vou fazer você ouvir + +0:43:56.940,0:44:02.520 +convoluções agora, você está animado? ok, você realmente está respondendo agora, ótimo! + +0:44:02.520,0:44:07.440 +Tudo bem e esses são os resultados. Deixe-me baixar um pouco os volumes + +0:44:07.440,0:44:14.900 +caso contrário, você reclamará, sim, eu não posso diminuir o, + +0:44:16.800,0:44:23.690 +ok, então o primeiro, vamos tentar novamente + +0:44:28.880,0:44:37.110 +* toca som * não é legal? Você escuta as convoluções. 
Ok, então basicamente isso era + +0:44:37.110,0:44:41.280 +quase isso, tenho mais um slide porque senti que houve alguma confusão no último + +0:44:41.280,0:44:45.360 +tempo sobre qual é a diferente dimensionalidade de diferentes tipos de + +0:44:45.360,0:44:50.850 +sinais, então estou realmente recomendando ir e fazer a aula da Joan Bruna que + +0:44:50.850,0:44:56.580 +é matemática para aprendizado profundo e eu roubei uma das pequenas coisas que ele era + +0:44:56.580,0:45:04.320 +ensinando, acabei de colocar um slide aqui para você. Portanto, este slide é o + +0:45:04.320,0:45:13.140 +Segue. Portanto, temos a camada de entrada ou as amostras que fornecemos em + +0:45:13.140,0:45:18.420 +esta rede e então normalmente nossa última vez eu defino isso eu tenho este X encaracolado + +0:45:18.420,0:45:23.010 +que será feito daqueles xi, que são todas as minhas amostras de dados corretas + +0:45:23.010,0:45:29.100 +e geralmente tenho m amostras de dados, então meu i vai de m = 1 para n, ok, então + +0:45:29.100,0:45:34.080 +está claro? em que esta notação está clara? porque é um pouco mais formal, normalmente sou um pouco + +0:45:34.080,0:45:39.030 +menos formal, mas, de alguma forma, alguém estava se sentindo um pouco desconfortável. + +0:45:39.030,0:45:46.470 +acho que este é apenas meus exemplos de entrada, mas também podemos ver este + +0:45:46.470,0:45:52.950 +é este X encaracolado que é minha entrada definida como o conjunto de todas essas funções como xi + +0:45:52.950,0:45:59.850 +que estão mapeando meu Omega capital Omega, que é meu domínio, para um RC que é + +0:45:59.850,0:46:06.150 +serão basicamente meus canais desse exemplo específico e aqui estou + +0:46:06.150,0:46:14.040 +mapearei aqueles Omega minúsculos para esses xi's de ômega, então vamos ver como + +0:46:14.040,0:46:17.550 +estes são diferentes da notação anterior. Então eu vou te dar agora três + +0:46:17.550,0:46:21.300 +exemplos e você deve ser capaz de dizer agora qual é a dimensionalidade e + +0:46:21.300,0:46:24.820 +neste exemplo. Então, o primeiro, digamos, + +0:46:24.820,0:46:29.560 +Eu gostei do que mostrei a você agora, apenas uma parte divertida de você + +0:46:29.560,0:46:34.870 +sinal de áudio, então meu Omega será apenas amostras como a amostra número um + +0:46:34.870,0:46:39.550 +amostra número dois como o índice, certo? então você tem índice um, índice dois, índice + +0:46:39.550,0:46:44.740 +até esses 70.000 seja o que for que acabamos de ver agora, ok? e o último valor é + +0:46:44.740,0:46:49.330 +vai ser o T, T maiúsculo, que é o número de segundos dividido pelo delta T + +0:46:49.330,0:46:53.200 +que seria o 1 sobre a frequência e isso vai ser um + +0:46:53.200,0:46:57.250 +subconjunto de n certo? então este é um número discreto de amostras, porque você tem um + +0:46:57.250,0:47:03.220 +computador, você sempre tem amostras discretas. Portanto, estes são meus dados de entrada, e + +0:47:03.220,0:47:09.310 +então que tal a imagem desta função? então quando eu pergunto o que é + +0:47:09.310,0:47:13.330 +dimensionalidade deste tipo de sinal, você deve responder que é um + +0:47:13.330,0:47:19.360 +sinal unidimensional porque a potência desses n aqui é 1 ok? 
então isso é como + +0:47:19.360,0:47:25.000 +um sinal unidimensional, embora você possa ter o tempo total e o + +0:47:25.000,0:47:28.900 +1 lá estava um intervalo de amostragem, do lado direito você tem o número + +0:47:28.900,0:47:33.340 +de canais pode ser 1 se você tiver um sinal mono ou 2 se tiver um + +0:47:33.340,0:47:38.230 +estereofônico, então você tem mono aí, você tem 2 para estereofônico ou o que é 5 + +0:47:38.230,0:47:44.650 +mais 1? esse é o Dolby como 5.1 não é legal? tudo bem então este ainda é um + +0:47:44.650,0:47:48.430 +sinal dimensional que pode ter vários canais, mas ainda é + +0:47:48.430,0:47:53.020 +sinal unidimensional porque há apenas uma variável em execução lá, ok? é isso + +0:47:53.020,0:47:58.270 +de alguma forma melhor do que da última vez? sim? não? Melhor? obrigada. + +0:47:58.270,0:48:03.070 +vamos agradecer a Joan. Tudo bem, segundo exemplo que tenho aqui, meu Omega vai + +0:48:03.070,0:48:07.630 +ser o produto cartesiano desses dois conjuntos, o primeiro conjunto vai + +0:48:07.630,0:48:13.150 +de 1 em altura, e também esta discreta e a outra vai + +0:48:13.150,0:48:17.020 +indo de 1 para a largura, então estes são os pixels reais, e este + +0:48:17.020,0:48:21.690 +é um sinal bidimensional porque tenho 2 graus de liberdade no meu + +0:48:21.690,0:48:28.690 +domínio. Quais são os canais possíveis que temos? Então, aqui, os canais possíveis que + +0:48:28.690,0:48:32.440 +são muito comuns são os seguintes: Assim, você pode ter uma imagem em tons de cinza e + +0:48:32.440,0:48:36.849 +portanto, você apenas produz um valor escalar ou obtém o + +0:48:36.849,0:48:42.789 +arco-íris ali, a cor e, portanto, você fica como o meu X que é uma função de + +0:48:42.789,0:48:49.299 +as coordenadas Omega 1 Omega 2 em que cada ponto é + +0:48:49.299,0:48:52.690 +representado por um vetor de três componentes que será o R + +0:48:52.690,0:48:59.109 +componente do ponto Omega 1 Omega 2, o componente G do Omega 1 Omega 2, e + +0:48:59.109,0:49:04.299 +o componente azul do Omega 1 Omega 2. Então, novamente, vocês podem pensar nisso como um + +0:49:04.299,0:49:08.980 +ponto de big big data ou você pode pensar nisso como um mapeamento de função + +0:49:08.980,0:49:12.640 +domínio dimensional que é um domínio bidimensional para um domínio tridimensional + +0:49:12.640,0:49:18.490 +domínio dimensional, certo? finalmente os vinte, quem sabe o nome do + +0:49:18.490,0:49:24.400 +imagem de vinte canais? sim, esta é uma imagem hiperespectral. É muito comum + +0:49:24.400,0:49:31.869 +tem 20 bandas. Finalmente, quem pode adivinhar este? + +0:49:31.869,0:49:40.829 +se meu domínio for r4 x r4, o que pode ser? + +0:49:41.099,0:49:50.589 +Não, não, isso discreto né? Este é o r4, então nem é computador. Ha! + +0:49:50.589,0:49:56.740 +quem disse algo ai? Ouvi! Sim, está correto, então este é o espaço-tempo, o que + +0:49:56.740,0:50:00.849 +é o segundo? Sim, qual impulso? Tem um especial + +0:50:00.849,0:50:07.779 +nome. É chamado de quatro momentos porque tem uma informação temporal como + +0:50:07.779,0:50:12.160 +bem, certo? E então qual será a minha possível imagem + +0:50:12.160,0:50:24.790 +da função X? digamos que c é igual a 1. O que é? você sabe? + +0:50:24.790,0:50:29.630 +Então esse poderia ser, por exemplo, o hamiltoniano do sistema, certo? então, é isso + +0:50:29.630,0:50:36.460 +foi como um pouco mais de introdução matemática ou matemática + +0:50:37.000,0:50:42.890 +procedimento, como se diz, você fará uma definição mais precisa. 
De modo a + +0:50:42.890,0:50:48.980 +foi praticamente tudo por hoje, deixa eu acender a luz e vejo você + +0:50:48.980,0:50:54.969 +na próxima segunda-feira. Obrigado por estar comigo. \ No newline at end of file diff --git a/docs/pt/week05/05-1.md b/docs/pt/week05/05-1.md new file mode 100644 index 000000000..92eae3fca --- /dev/null +++ b/docs/pt/week05/05-1.md @@ -0,0 +1,451 @@ +--- +lang: pt +lang-ref: ch.05-1 +title: Técnicas de Otimização I +lecturer: Aaron Defazio +authors: Vaibhav Gupta, Himani Shah, Gowri Addepalli, Lakshmi Addepalli +date: 24 Feb 2020 +translation-date: 06 Nov 2021 +translator: Felipe Schiavon +--- + + + + +## [Gradiente Descendente](https://www.youtube.com/watch?v=--NZb480zlg&t=88s) + + + +Começamos nosso estudo de Métodos de Otimização com o pior e mais básico método (raciocínio a seguir) do lote, o Gradiente Descendente. + + + +**Problema:** + + + +$$ +\min_w f(w) +$$ + + + +**Solução Iterativa:** + + + +$$ +w_{k+1} = w_k - \gamma_k \nabla f(w_k) +$$ + + + +onde, + - $w_{k+1}$ é o valor atualizado depois da $k$-ésima iteração, + - $w_k$ é o valor inicial antes da $k$-ésima iteração, + - $\gamma_k$ é o tamanho do passo, + - $\nabla f(w_k)$ é o gradiente de $f$. + + + +A suposição aqui é que a função $f$ é contínua e diferenciável. Nosso objetivo é encontrar o ponto mais baixo (vale) da função de otimização. No entanto, a direção real para este vale não é conhecida. Só podemos olhar localmente e, portanto, a direção do gradiente negativo é a melhor informação que temos. Dar um pequeno passo nessa direção só pode nos levar mais perto do mínimo. Assim que tivermos dado o pequeno passo, calculamos novamente o novo gradiente e novamente nos movemos um pouco nessa direção, até chegarmos ao vale. Portanto, basicamente tudo o que o gradiente descendente está fazendo é seguir a direção da descida mais acentuada (gradiente negativo). + + + +O parâmetro $\gamma$ na equação de atualização iterativa é chamado de **tamanho do passo**. Geralmente não sabemos o valor do tamanho ideal do passo; então temos que tentar valores diferentes. A prática padrão é tentar vários valores em uma escala logarítmica e, a seguir, usar o melhor valor. Existem alguns cenários diferentes que podem ocorrer. A imagem acima descreve esses cenários para uma função de erro quadrática de uma dimensão (1D). Se a taxa de aprendizado for muito baixa, faremos um progresso constante em direção ao mínimo. No entanto, isso pode levar mais tempo do que o ideal. Geralmente é muito difícil (ou impossível) obter um tamanho de passo que nos leve diretamente ao mínimo. O que desejaríamos idealmente é ter um tamanho de degrau um pouco maior do que o ideal. Na prática, isso dá a convergência mais rápida. No entanto, se usarmos uma taxa de aprendizado muito grande, as iterações se distanciam cada vez mais dos mínimos e obtemos divergência. Na prática, gostaríamos de usar uma taxa de aprendizado um pouco menor do que divergente. + + + +
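+
+A título de ilustração, segue um esboço mínimo da regra de atualização iterativa acima, aplicado a uma função quadrática hipotética e com um tamanho de passo $\gamma$ escolhido apenas como exemplo (na prática, testaríamos vários valores em escala logarítmica):
+
+```python
+import torch
+
+# Esboço ilustrativo: gradiente descendente em f(w) = (w - 3)^2, cujo mínimo está em w = 3
+def f(w):
+    return (w - 3.0) ** 2
+
+w = torch.tensor(0.0, requires_grad=True)
+gamma = 0.1  # tamanho do passo (valor hipotético)
+
+for k in range(25):
+    perda = f(w)
+    perda.backward()             # calcula o gradiente de f em w_k
+    with torch.no_grad():
+        w -= gamma * w.grad      # w_{k+1} = w_k - gamma * grad f(w_k)
+    w.grad.zero_()
+
+print(w.item())  # aproxima-se de 3.0
+```
+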
+
+Figure 1: Tamanhos dos passos para função de erro quadrática de uma dimensão (1D) +
+ + + + +## [Gradiente Descendente Estocástico](https://www.youtube.com/watch?v=--NZb480zlg&t=898s) + + + +No Gradiente Descendente Estocástico, substituímos o vetor gradiente real por uma estimativa estocástica do vetor gradiente. Especificamente para uma rede neural, a estimativa estocástica significa o gradiente da perda para um único ponto dos dados (única instância). + + + +Seja $f_i$ a perda da rede para a $i$-ésima instância. + + + +$$ +f_i = l(x_i, y_i, w) +$$ + + + +A função que queremos minimizar é $f$, a perda total de todas as instâncias. + + + +$$ +f = \frac{1}{n}\sum_i^n f_i +$$ + + + + +No SGD, atualizamos os pesos de acordo com o gradiente sobre $f_i$ (em oposição ao gradiente sobre a perda total $f$). + + + +$$ +\begin{aligned} +w_{k+1} &= w_k - \gamma_k \nabla f_i(w_k) & \quad\text{(i escolhido uniformemente ao acaso)} +\end{aligned} +$$ + + + +Se $i$ for escolhido aleatoriamente, então $f_i$ é um estimador com ruído, mas sem viés, de $f$, que é matematicamente escrito como: + + + +$$ +\mathbb{E}[\nabla f_i(w_k)] = \nabla f(w_k) +$$ + + + +Como resultado disso, a $k$-ésima etapa esperada do SGD é a mesma que a $k$-ésima etapa da Gradiente Descendente completo: + + + +$$ +\mathbb{E}[w_{k+1}] = w_k - \gamma_k \mathbb{E}[\nabla f_i(w_k)] = w_k - \gamma_k \nabla f(w_k) +$$ + + + +Portanto, qualquer atualização do SGD é igual à atualização de lote completo em expectativa. No entanto, o SGD não é apenas um gradiente descendente mais rápida com algum ruído. Além de ser mais rápido, o SGD também pode nos dar melhores resultados do que o gradiente descendente completo. O ruído no SGD pode nos ajudar a evitar os mínimos locais superficiais e a encontrar mínimos melhores (mais profundos). Este fenômeno é denominado **recozimento** (**annealing**). + + + +
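+
+Um esboço mínimo (com dados sintéticos hipotéticos) do passo do SGD descrito acima, em que o índice $i$ é sorteado uniformemente a cada iteração:
+
+```python
+import torch
+
+# Esboço ilustrativo de SGD em uma regressão linear com dados sintéticos
+torch.manual_seed(0)
+n = 1000
+x = torch.randn(n, 1)
+y = 2.0 * x + 0.1 * torch.randn(n, 1)        # alvo com um pouco de ruído
+
+w = torch.zeros(1, requires_grad=True)
+gamma = 0.05
+
+for k in range(2000):
+    i = torch.randint(0, n, (1,)).item()     # i escolhido uniformemente ao acaso
+    f_i = (w * x[i] - y[i]).pow(2).mean()    # perda de uma única instância
+    f_i.backward()
+    with torch.no_grad():
+        w -= gamma * w.grad                  # w_{k+1} = w_k - gamma * grad f_i(w_k)
+    w.grad.zero_()
+
+print(w.item())  # deve ficar próximo de 2.0
+```
+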
+
+Figure 2: Recozimento com SGD +
+ + + +Em resumo, as vantagens do Gradiente Descendente Estocástico são as seguintes: + + + +1. Há muitas informações redundantes entre as instâncias. O SGD evita muitos desses cálculos redundantes. + 2. Nos estágios iniciais, o ruído é pequeno em comparação com as informações no gradiente. Portanto, uma etapa SGD é *virtualmente tão boa quanto* uma etapa de Gradiente Descendente. + 3. *Recozimento* - O ruído na atualização do SGD pode impedir a convergência para mínimos locais ruins (rasos). + 4. O Gradiente Descendente Estocástico é drasticamente mais barato para calcular (já que você não passa por todos os pontos de dados). + + + + +### Mini-lotes + + + +Em mini-lotes, consideramos a perda em várias instâncias selecionadas aleatoriamente em vez de calculá-la em apenas uma instância. Isso reduz o ruído em cada etapa da atualização do passo. + + + +$$ +w_{k+1} = w_k - \gamma_k \frac{1}{|B_i|} \sum_{j \in B_i}\nabla f_j(w_k) +$$ + + + +Freqüentemente, podemos fazer melhor uso de nosso hardware usando mini-lotes em vez de uma única instância. Por exemplo, as GPUs são mal utilizadas quando usamos o treinamento de instância única. As técnicas de treinamento de rede distribuída dividem um grande mini-lote entre as máquinas de um cluster e, em seguida, agregam os gradientes resultantes. O Facebook treinou recentemente uma rede em dados ImageNet em uma hora, usando treinamento distribuído. + + + +É importante observar que o Gradiente Descendente nunca deve ser usado com lotes de tamanho normal. Caso você queira treinar no tamanho total do lote, use uma técnica de otimização chamada LBFGS. O PyTorch e o SciPy possuem implementações desta técnica. + + + +## [Momento](https://www.youtube.com/watch?v=--NZb480zlg&t=1672s) + + + +No Momento, temos duas iterações ($p$ e $w$) ao invés de apenas uma. As atualizações são as seguintes: + + + +$$ +\begin{aligned} +p_{k+1} &= \hat{\beta_k}p_k + \nabla f_i(w_k) \\ +w_{k+1} &= w_k - \gamma_kp_{k+1} \\ +\end{aligned} +$$ + + + +$p$ é chamado de momento SGD. Em cada etapa de atualização do passo, adicionamos o gradiente estocástico ao antigo valor do momento, após amortecê-lo por um fator $\beta$ (valor entre 0 e 1). $p$ pode ser considerado uma média contínua dos gradientes. Finalmente, movemos $w$ na direção do novo momento $p$. + + + +Forma alternativa: Método Estocástico de Bola Pesada + + + +$$ +\begin{aligned} +w_{k+1} &= w_k - \gamma_k\nabla f_i(w_k) + \beta_k(w_k - w_{k-1}) & 0 \leq \beta < 1 +\end{aligned} +$$ + + + +Esta forma é matematicamente equivalente à forma anterior. Aqui, o próximo passo é uma combinação da direção do passo anterior ($w_k - w_{k-1}$) e o novo gradiente negativo. + + + +### Intuição + + + +O Momento do SGD é semelhante ao conceito de momentum na física. O processo de otimização se assemelha a uma bola pesada rolando colina abaixo. O momento mantém a bola se movendo na mesma direção em que já está se movendo. O gradiente pode ser considerado como uma força que empurra a bola em alguma outra direção. + + + +
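+
+A título de ilustração, um esboço das duas iterações ($p$ e $w$) da atualização com momento, sobre uma quadrática mal condicionada hipotética:
+
+```python
+import torch
+
+# Esboço ilustrativo da atualização com momento (nomes e valores hipotéticos)
+def grad_f(w):
+    # gradiente de f(w) = 0.5 * w^T A w, uma quadrática mal condicionada
+    A = torch.tensor([[10.0, 0.0], [0.0, 1.0]])
+    return A @ w
+
+w = torch.tensor([1.0, 1.0])
+p = torch.zeros(2)            # momento: média contínua dos gradientes
+gamma, beta = 0.05, 0.9
+
+for k in range(100):
+    p = beta * p + grad_f(w)  # p_{k+1} = beta * p_k + grad f(w_k)
+    w = w - gamma * p         # w_{k+1} = w_k - gamma * p_{k+1}
+
+print(w)  # aproxima-se do mínimo em (0, 0)
+```
+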
+
+Figure 3: Efeito do Momento
+Source: distill.pub
+
+ + + + +Ao invés de fazer mudanças dramáticas na direção do caminho (como na figura à esquerda), o momento faz mudanças pequenas. O momento amortece as oscilações que são comuns quando usamos apenas SGD. + + + +O parâmetro $\beta$ é chamado de fator de amortecimento. $\beta$ tem que ser maior que zero, porque se for igual a zero, você está apenas fazendo um gradiente descendente comum. Também deve ser menor que 1, caso contrário, tudo explodirá. Valores menores de $\beta$ resultam em mudanças de direção mais rápidas. Para valores maiores, leva mais tempo para fazer curvas. + + + +
+
+Figure 4: Efeito do Beta na Convergência +
+ + + + +### Diretrizes práticas + + + +O momento deve quase sempre ser usado com o Gradiente Descendente Estocástico. +$\beta$ = 0,9 ou 0,99 quase sempre funciona bem. + + + +O parâmetro de tamanho do passo geralmente precisa ser reduzido quando o parâmetro de momento é aumentado para manter a convergência. Se $\beta$ mudar de 0,9 para 0,99, a taxa de aprendizagem deve ser reduzida em um fator de 10. + + + + +### Por que o momento funciona? + + + + +#### Aceleração + + + + +A seguir estão as regras de atualização para o Momento de Nesterov. + + + +$$ +p_{k+1} = \hat{\beta_k}p_k + \nabla f_i(w_k) \\ +w_{k+1} = w_k - \gamma_k(\nabla f_i(w_k) +\hat{\beta_k}p_{k+1}) +$$ + + + +Com o Momento de Nesterov, você pode obter uma convergência acelerada se escolher as constantes com cuidado. Mas isso se aplica apenas a problemas convexos e não a redes neurais. + + + +Muitas pessoas dizem que o momento normal também é um método acelerado. Mas, na realidade, ele é acelerado apenas para funções quadráticas. Além disso, a aceleração não funciona bem com SGD, pois SGD tem ruído e a aceleração não funciona bem com ruído. Portanto, embora um pouco de aceleração esteja presente no SGD com Momento, por si só não é uma boa explicação para o alto desempenho da técnica. + + + + +#### Suavização de ruído + + + +Provavelmente, uma razão mais prática e provável de por que o momento funciona é a Suavização de ruído. + + + + +O momento calcula a média dos gradientes. É uma média contínua de gradientes que usamos para cada atualização do passo. + + + +Teoricamente, para que o SGD funcione, devemos obter a média de todas as atualizações dos passos. + + + +$$ +\bar w_k = \frac{1}{K} \sum_{k=1}^K w_k +$$ + + + +A grande vantagem do SGD com momento é que essa média não é mais necessária. O Momento adiciona suavização ao processo de otimização, o que torna cada atualização uma boa aproximação da solução. Com o SGD, você desejaria calcular a média de um monte de atualizações e, em seguida, dar um passo nessa direção. + + + +Tanto a aceleração quanto a suavização de ruído contribuem para um alto desempenho do Momento. + + + +
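+
+Na prática, não é preciso implementar o momento manualmente: o otimizador `torch.optim.SGD` já oferece os argumentos `momentum` e `nesterov`. Um esboço de uso, com modelo e hiperparâmetros hipotéticos:
+
+```python
+import torch
+import torch.nn as nn
+
+# Esboço de uso: SGD com momento (e variante de Nesterov) em PyTorch
+model = nn.Linear(10, 1)
+criterion = nn.MSELoss()
+
+# momentum=0.9 segue a sugestão do texto; nesterov=True ativa o Momento de Nesterov
+optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
+
+x = torch.randn(32, 10)    # mini-lote sintético
+y = torch.randn(32, 1)
+
+for epoch in range(100):
+    optimizer.zero_grad()
+    loss = criterion(model(x), y)
+    loss.backward()
+    optimizer.step()
+```
+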
+
+Figure 5: SGD vs. Momento +
+ + + +Com o Gradiente Descendente Estocástico, inicialmente, fazemos um bom progresso em direção à solução, mas quando alcançamos o fundo da "tigela", ficamos rodeando em volta deste piso. Se ajustarmos a taxa de aprendizado, vamos rodear mais devagar. Com o impulso, suavizamos os degraus, para que não haja saltos. diff --git a/docs/pt/week05/05-2.md b/docs/pt/week05/05-2.md new file mode 100644 index 000000000..6c925fb0f --- /dev/null +++ b/docs/pt/week05/05-2.md @@ -0,0 +1,512 @@ +--- +lang: pt +lang-ref: ch.05-2 +title: Técnicas de Otimização II +lecturer: Aaron Defazio +authors: Guido Petri, Haoyue Ping, Chinmay Singhal, Divya Juneja +date: 24 Feb 2020 +translator: Felipe Schiavon +translation-date: 14 Nov 2021 +--- + + + + +## [Métodos Adaptativos](https://www.youtube.com/watch?v=--NZb480zlg&t=2675s) + + + +Momento com SGD é atualmente o método de otimização de última geração para muitos problemas de aprendizagem de máquina. Mas existem outros métodos, geralmente chamados de Métodos Adaptativos, inovados ao longo dos anos que são particularmente úteis para problemas mal condicionados (se o SGD não funcionar). + + + +Na formulação SGD, cada peso na rede é atualizado usando uma equação com a mesma taxa de aprendizado (global $\gamma$). Aqui, para métodos adaptativos, *adaptamos uma taxa de aprendizagem para cada peso individualmente*. Para tanto, são utilizadas as informações que obtemos dos gradientes para cada peso. + + + +As redes que são frequentemente usadas na prática têm estruturas diferentes em diferentes partes delas. Por exemplo, partes iniciais da CNN podem ser camadas de convolução muito rasas em imagens grandes e, posteriormente, na rede, podemos ter convoluções de grande número de canais em imagens pequenas. Ambas as operações são muito diferentes, portanto, uma taxa de aprendizado que funciona bem para o início da rede pode não funcionar bem para as últimas seções da rede. Isso significa que as taxas de aprendizagem adaptativa por camada podem ser úteis. + + + +Os pesos na última parte da rede (4096 na figura 1 abaixo) ditam diretamente a saída e têm um efeito muito forte sobre ela. Portanto, precisamos de taxas de aprendizado menores para eles. Em contraste, pesos anteriores terão efeitos individuais menores na saída, especialmente quando inicializados aleatoriamente. + + + +
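+
+Antes de passar aos métodos totalmente adaptativos, a ideia de taxas de aprendizado diferentes por parte da rede pode ser concretizada com grupos de parâmetros no PyTorch. Um esboço com modelo e valores puramente hipotéticos:
+
+```python
+import torch
+import torch.nn as nn
+
+# Esboço ilustrativo: uma taxa de aprendizado para as camadas iniciais e outra para a camada final
+model = nn.Sequential(
+    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # parte inicial da rede
+    nn.ReLU(),
+    nn.Flatten(),
+    nn.Linear(16 * 32 * 32, 10),                  # camada final, próxima da saída
+)
+
+optimizer = torch.optim.SGD(
+    [
+        {"params": model[0].parameters()},              # camadas iniciais: usam a taxa padrão
+        {"params": model[3].parameters(), "lr": 1e-3},  # camada final: taxa menor
+    ],
+    lr=1e-2,
+    momentum=0.9,
+)
+```
+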
+
+Figure 1: VGG16 +
+ + + + +### RMSprop + + + +A ideia principal da *Propagação da raiz do valor quadrático médio* (*Root Mean Square Propagation*) é que o gradiente é normalizado por sua raiz quadrada média. + + + +Na equação abaixo, elevar ao quadrado o gradiente denota que cada elemento do vetor é elevado ao quadrado individualmente. + + + +$$ +\begin{aligned} +v_{t+1} &= {\alpha}v_t + (1 - \alpha) \nabla f_i(w_t)^2 \\ +w_{t+1} &= w_t - \gamma \frac {\nabla f_i(w_t)}{ \sqrt{v_{t+1}} + \epsilon} +\end{aligned} +$$ + + + +onde $\gamma$ é a taxa de aprendizagem global, $\epsilon$ é um valor próximo a máquina $\epsilon$ (na ordem de $10^{-7}$ ou $10^{-8}$) - na ordem para evitar erros de divisão por zero, e $v_{t+1}$ é a estimativa do segundo momento. + + + +Atualizamos $v$ para estimar essa quantidade ruidosa por meio de uma *média móvel exponencial* (que é uma maneira padrão de manter uma média de uma quantidade que pode mudar com o tempo). Precisamos colocar pesos maiores nos valores mais novos, pois eles fornecem mais informações. Uma maneira de fazer isso é reduzir exponencialmente os valores antigos. Os valores no cálculo de $v$ que são muito antigos são reduzidos a cada etapa por uma constante $\alpha$, que varia entre 0 e 1. Isso amortece os valores antigos até que eles não sejam mais uma parte importante do exponencial média móvel. + + + +O método original mantém uma média móvel exponencial de um segundo momento não central, portanto, não subtraímos a média aqui. O *segundo momento* é usado para normalizar o gradiente em termos de elemento, o que significa que cada elemento do gradiente é dividido pela raiz quadrada da estimativa do segundo momento. Se o valor esperado do gradiente for pequeno, esse processo é semelhante a dividir o gradiente pelo desvio padrão. + + + +Usar um $\epsilon$ pequeno no denominador não diverge porque quando $v$ é muito pequeno, o momento também é muito pequeno. + + + + +### ADAM + + + +ADAM, ou *Estimativa Adaptativa do Momento*, que é RMSprop mais o Momento, é o método mais comumente usado. A atualização do Momento é convertida em uma média móvel exponencial e não precisamos alterar a taxa de aprendizagem quando lidamos com $\beta$. Assim como no RMSprop, pegamos uma média móvel exponencial do gradiente quadrado aqui. + + + +$$ +\begin{aligned} +m_{t+1} &= {\beta}m_t + (1 - \beta) \nabla f_i(w_t) \\ +v_{t+1} &= {\alpha}v_t + (1 - \alpha) \nabla f_i(w_t)^2 \\ +w_{t+1} &= w_t - \gamma \frac {m_{t}}{ \sqrt{v_{t+1}} + \epsilon} +\end{aligned} +$$ + + + +onde $m_{t+1}$ é a média móvel exponencial do momento. + + + +A correção de viés que é usada para manter a média móvel imparcial durante as iterações iniciais não é mostrada aqui. + + + + +### Lado Prático + + + +Ao treinar redes neurais, o SGD geralmente vai na direção errada no início do processo de treinamento, enquanto o RMSprop aprimora a direção certa. No entanto, o RMSprop sofre de ruído da mesma forma que o SGD normal, então ele oscila em torno do ótimo significativamente quando está perto de um minimizador local. Assim como quando adicionamos impulso ao SGD, obtemos o mesmo tipo de melhoria com o ADAM. É uma estimativa boa e não ruidosa da solução, portanto **ADAM é geralmente recomendado em vez de RMSprop**. + + + +
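+
+Um esboço mínimo do passo do ADAM descrito acima (sem a correção de viés, como observado no texto), com nomes de variáveis e hiperparâmetros hipotéticos; na prática usa-se a implementação pronta `torch.optim.Adam`:
+
+```python
+import torch
+
+# Esboço ilustrativo de um passo do ADAM (sem correção de viés)
+def adam_step(w, grad, m, v, gamma=1e-3, beta=0.9, alpha=0.999, eps=1e-8):
+    m = beta * m + (1 - beta) * grad          # média móvel exponencial do gradiente
+    v = alpha * v + (1 - alpha) * grad ** 2   # média móvel exponencial do gradiente ao quadrado
+    w = w - gamma * m / (v.sqrt() + eps)      # atualização normalizada elemento a elemento
+    return w, m, v
+
+w = torch.zeros(3)
+m = torch.zeros(3)
+v = torch.zeros(3)
+w, m, v = adam_step(w, torch.tensor([0.1, -0.2, 0.3]), m, v)
+
+# Na prática: optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8)
+```
+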
+
+Figure 2: SGD *vs.* RMSprop *vs.* ADAM +

+ + + +O ADAM é necessário para treinar algumas das redes para usar modelos de linguagem. Para otimizar redes neurais, SGD com momentum ou ADAM é geralmente preferido. No entanto, a teoria do ADAM em artigos é mal compreendida e também tem várias desvantagens: + + + +* Pode ser mostrado em problemas de teste muito simples que o método não converge. +* É conhecido por fornecer erros de generalização. Se a rede neural for treinada para fornecer perda zero nos dados em que você a treinou, ela não fornecerá perda zero em outros pontos de dados que nunca viu antes. É bastante comum, principalmente em problemas de imagem, obtermos erros de generalização piores do que quando se usa SGD. Os fatores podem incluir que ele encontre o mínimo local mais próximo, ou menos ruído no ADAM, ou sua estrutura, por exemplo. +* Com o ADAM, precisamos manter 3 buffers, enquanto o SGD precisa de 2 buffers. Isso realmente não importa, a menos que treinemos um modelo da ordem de vários gigabytes de tamanho; nesse caso, ele pode não caber na memória. +* 2 parâmetros de momentum precisam ser ajustados em vez de 1. + + + + +## [Camadas de Normalização](https://www.youtube.com/watch?v=--NZb480zlg&t=3907s) + + + +Em vez de melhorar os algoritmos de otimização, as *camadas de normalização* melhoram a própria estrutura da rede. Eles são camadas adicionais entre as camadas existentes. O objetivo é melhorar o desempenho de otimização e generalização. + + + +Em redes neurais, normalmente alternamos operações lineares com operações não lineares. As operações não lineares também são conhecidas como funções de ativação, como ReLU. Podemos colocar camadas de normalização antes das camadas lineares ou após as funções de ativação. A prática mais comum é colocá-los entre as camadas lineares e as funções de ativação, como na figura abaixo. + + + +|
|
|
| +| (a) Antes de adicionar a normalização | (b) Depois de adicionar a normalização | (c) Um exemplo em CNNs | + + + +
Figura 3: Posições típicas de camadas de normalização.
+ + + +Na figura 3 (c), a convolução é a camada linear, seguida pela normalização do lote, seguida por ReLU. + + + +Observe que as camadas de normalização afetam os dados que fluem, mas não alteram o poder da rede no sentido de que, com a configuração adequada dos pesos, a rede não normalizada ainda pode dar a mesma saída que uma rede normalizada. + + + + +### Operações de normalização + + + +Esta é a notação genérica para normalização: + + + +$$ +y = \frac{a}{\sigma}(x - \mu) + b +$$ + + + +onde $x$ é o vetor de entrada, $y$ é o vetor de saída, $\mu$ é a estimativa da média de $x$, $\sigma$ é a estimativa do desvio padrão (std) de $x$ , $a$ é o fator de escala que pode ser aprendido e $b$ é o termo de polarização que pode ser aprendido. + + + +Sem os parâmetros aprendíveis $a$ e $b$, a distribuição do vetor de saída $y$ terá média fixa 0 e padrão 1. O fator de escala $a$ e o termo de polarização $b$ mantêm o poder de representação da rede, *ou seja,* os valores de saída ainda podem estar acima de qualquer faixa específica. Observe que $a$ e $b$ não invertem a normalização, porque eles são parâmetros aprendíveis e são muito mais estáveis do que $\mu$ e $\sigma$. + + + +
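+
+Um esboço numérico da operação genérica de normalização acima, com um vetor de entrada sintético e $a$, $b$ inicializados em 1 e 0:
+
+```python
+import torch
+
+# Esboço ilustrativo de y = (a / sigma) * (x - mu) + b
+x = torch.randn(100) * 5 + 3    # vetor de entrada hipotético (média ~3, desvio ~5)
+
+mu = x.mean()                   # estimativa da média de x
+sigma = x.std()                 # estimativa do desvio padrão de x
+a = torch.ones(1)               # fator de escala aprendível
+b = torch.zeros(1)              # termo de viés aprendível
+
+y = (a / sigma) * (x - mu) + b
+print(y.mean().item(), y.std().item())  # aproximadamente 0 e 1 com a=1, b=0
+```
+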
+
+Figure 4: Operações de Normalização. +
+ + + +Existem várias maneiras de normalizar o vetor de entrada, com base em como selecionar amostras para normalização. A Figura 4 lista 4 abordagens diferentes de normalização, para um minilote de $N$ imagens de altura $H$ e largura $W$, com canais $C$: + + + +- *Normalização em lote (Batch Norm)*: a normalização é aplicada apenas em um canal da entrada. Esta é a primeira proposta e a abordagem mais conhecida. Leia [Como treinar seu ResNet 7: norma de lote](https://myrtle.ai/learn/how-to-train-your-resnet-7-batch-norm/) para obter mais informações. +- *Normalização de camada (Layer Norm)*: a normalização é aplicada dentro de uma imagem em todos os canais. +- *Normalização de instância (Instance Norm)*: a normalização é aplicada apenas sobre uma imagem e um canal. +- *Normalização de grupo (Group Norm)*: a normalização é aplicada sobre uma imagem, mas em vários canais. Por exemplo, os canais 0 a 9 são um grupo, os canais 10 a 19 são outro grupo e assim por diante. Na prática, o tamanho do grupo é quase sempre de 32. Essa é a abordagem recomendada por Aaron Defazio, pois tem um bom desempenho na prática e não conflita com o SGD. + + + +Na prática, a norma de lote e a norma de grupo funcionam bem para problemas de visão computacional, enquanto a norma de camada e a norma de instância são muito usadas para problemas de linguagem. + + + + +### Por que a normalização ajuda? + + + +Embora a normalização funcione bem na prática, as razões por trás de sua eficácia ainda são contestadas. Originalmente, a normalização é proposta para reduzir a "mudança interna da covariável" ("internal covariate shift"), mas alguns estudiosos provaram que estava errada em experimentos. No entanto, a normalização claramente tem uma combinação dos seguintes fatores: + + + +- Redes com camadas de normalização são mais fáceis de otimizar, permitindo o uso de maiores taxas de aprendizado. A normalização tem um efeito de otimização que acelera o treinamento das redes neurais. +- As estimativas de média/padrão são ruidosas devido à aleatoriedade das amostras no lote. Este "ruído" extra resulta em melhor generalização em alguns casos. A normalização tem um efeito de regularização. +- A normalização reduz a sensibilidade à inicialização do peso. + + + +Como resultado, a normalização permite que você seja mais "descuidado" - você pode combinar quase todos os blocos de construção de rede neural e ter uma boa chance de treiná-la sem ter que considerar o quão mal condicionada ela pode estar. + + + + +### Considerações práticas + + + +É importante que a retropropagação seja feita por meio do cálculo da média e do padrão, bem como a aplicação da normalização: o treinamento da rede irá divergir de outra forma. O cálculo da propagação reversa é bastante difícil e sujeito a erros, mas o PyTorch é capaz de calculá-lo automaticamente para nós, o que é muito útil. Duas classes de camada de normalização em PyTorch estão listadas abaixo: + + + +```python +torch.nn.BatchNorm2d(num_features, ...) +torch.nn.GroupNorm(num_groups, num_channels, ...) +``` + + + +A normalização em lote (batch norm) foi o primeiro método desenvolvido e é o mais amplamente conhecido. No entanto, **Aaron Defazio recomenda usar a normalização de grupo (group norm)** ao invés da primeira. Ele é mais estável, teoricamente mais simples e geralmente funciona melhor. O tamanho do grupo 32 é um bom padrão. 
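+
+Como complemento, um esboço de uso da normalização de grupo entre uma convolução e a ReLU, seguindo o arranjo da figura 3(c); o número de canais é hipotético e o tamanho de grupo segue o padrão de 32 canais por grupo:
+
+```python
+import torch
+import torch.nn as nn
+
+# Esboço de uso: convolução -> normalização de grupo -> ReLU
+block = nn.Sequential(
+    nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1),
+    nn.GroupNorm(num_groups=4, num_channels=128),   # 4 grupos de 32 canais cada
+    nn.ReLU(),
+)
+
+x = torch.randn(16, 64, 32, 32)   # mini-lote sintético: N=16, C=64, H=W=32
+y = block(x)
+print(y.shape)                    # torch.Size([16, 128, 32, 32])
+```
+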
+ + + +Observe que para normalização em lote e normalização de instância, a média/padrão usada é fixada após o treinamento, em vez de recalculada toda vez que a rede é avaliada, porque várias amostras de treinamento são necessárias para realizar a normalização. Isso não é necessário para normalização de grupo e normalização de camada, uma vez que sua normalização é sobre apenas uma amostra de treinamento. + + + + +## [A morte da otimização](https://www.youtube.com/watch?v=--NZb480zlg&t=4817s) + + + +Às vezes, podemos invadir um campo sobre o qual nada sabemos e melhorar a forma como eles estão implementando as coisas. Um exemplo é o uso de redes neurais profundas no campo da exames de Ressonância Magnética (MRI) para acelerar a reconstrução de imagens de MRI. + + + +
+
+Figure 5: Às vezes realmente funciona! +
+ + + + +### Reconstrução de Ressonância Magnética + + + +No problema de reconstrução tradicional de exames de ressonância magnética (MRI), os dados brutos são obtidos de uma máquina de MRI e uma imagem é reconstruída a partir dele usando um pipeline/algoritmo simples. As máquinas de ressonância magnética capturam dados em um domínio de Fourier bidimensional, uma linha ou uma coluna por vez (a cada poucos milissegundos). Esta entrada bruta é composta por uma frequência e um canal de fase e o valor representa a magnitude de uma onda senoidal com aquela frequência e fase específicas. Em termos simples, pode ser pensada como uma imagem de valor complexo, possuindo um canal real e outro imaginário. Se aplicarmos uma transformada inversa de Fourier nesta entrada, ou seja, somarmos todas essas ondas senoidais ponderadas por seus valores, podemos obter a imagem anatômica original. + + + +
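+
+A etapa de reconstrução por transformada inversa de Fourier pode ser esboçada com `torch.fft`; abaixo usamos um "espaço k" sintético apenas para ilustrar a operação:
+
+```python
+import torch
+
+# Esboço ilustrativo: reconstrução por transformada inversa de Fourier
+imagem = torch.randn(256, 256)                  # imagem anatômica hipotética
+espaco_k = torch.fft.fft2(imagem)               # dados brutos (complexos) no domínio de Fourier
+
+reconstruida = torch.fft.ifft2(espaco_k).real   # soma das senoides ponderadas pelos seus valores
+print(torch.allclose(reconstruida, imagem, atol=1e-4))  # True: a reconstrução é (quase) exata
+```
+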
+
+Fig. 6: Reconstrução de ressonância magnética +

+ + + +Existe atualmente um mapeamento linear para ir do domínio de Fourier ao domínio da imagem e é muito eficiente, levando literalmente milissegundos, não importa o tamanho da imagem. Mas a questão é: podemos fazer isso ainda mais rápido? + + + + +### Ressonância magnética acelerada + + + +O novo problema que precisa ser resolvido é a ressonância magnética acelerada, onde por aceleração queremos dizer tornar o processo de reconstrução por ressonância magnética muito mais rápido. Queremos operar as máquinas mais rapidamente e ainda ser capazes de produzir imagens de qualidade idêntica. Uma maneira de fazer isso e a maneira mais bem-sucedida até agora tem sido não capturar todas as colunas da varredura de ressonância magnética. Podemos pular algumas colunas aleatoriamente, embora seja útil na prática capturar as colunas do meio, pois elas contêm muitas informações na imagem, mas fora delas apenas capturamos aleatoriamente. O problema é que não podemos mais usar nosso mapeamento linear para reconstruir a imagem. A imagem mais à direita na Figura 7 mostra a saída de um mapeamento linear aplicado ao espaço de Fourier subamostrado. É claro que esse método não nos dá resultados muito úteis e que há espaço para fazer algo um pouco mais inteligente. + + + +
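+
+Um esboço (com frações de amostragem hipotéticas) da subamostragem por colunas do espaço de Fourier e do mapeamento linear aplicado a ela:
+
+```python
+import torch
+
+# Esboço ilustrativo da subamostragem por colunas do espaço de Fourier
+torch.manual_seed(0)
+imagem = torch.randn(256, 256)
+espaco_k = torch.fft.fftshift(torch.fft.fft2(imagem))   # baixas frequências no centro
+
+mascara = (torch.rand(256) < 0.25).float()              # ~25% das colunas sorteadas ao acaso
+mascara[96:160] = 1.0                                   # mantém as colunas centrais, que concentram informação
+espaco_k_sub = espaco_k * mascara                       # zera as colunas não amostradas
+
+# mapeamento linear (transformada inversa) aplicado diretamente à subamostra
+reconstrucao_linear = torch.fft.ifft2(torch.fft.ifftshift(espaco_k_sub)).real
+```
+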
+
+Figura 7: Mapeamento linear no espaço de Fourier subamostrado +

+ + + + +### Sensoriamento comprimido (Compressed sensing) + + + +Um dos maiores avanços da matemática teórica nos últimos tempos foi o sensoriamento comprimido. Um artigo de Candes et al. mostrou que, teoricamente, podemos obter uma reconstrução perfeita a partir de uma subamostragem da imagem no domínio de Fourier. Em outras palavras, quando o sinal que estamos tentando reconstruir é esparso ou esparsamente estruturado, então é possível reconstruí-lo perfeitamente a partir de menos medições. Mas existem alguns requisitos práticos para que isso funcione - não precisamos amostrar aleatoriamente; em vez disso, precisamos amostrar incoerentemente - embora, na prática, as pessoas acabem apenas amostrando aleatoriamente. Além disso, leva o mesmo tempo para amostrar uma coluna inteira ou meia coluna, portanto, na prática, também amostramos colunas inteiras. + + + +Outra condição é que precisamos ter *esparsidade* em nossa imagem, onde por esparsidade queremos dizer muitos zeros ou pixels pretos na imagem. A entrada bruta pode ser representada esparsamente se fizermos uma decomposição em *wavelets*, mas mesmo essa decomposição nos dá uma imagem aproximadamente esparsa, e não exatamente esparsa. Portanto, essa abordagem nos dá uma reconstrução muito boa, mas não perfeita, como podemos ver na Figura 8. No entanto, se a entrada fosse muito esparsa no domínio das *wavelets*, com certeza obteríamos uma imagem perfeita. + + + +
+
+Figura 8: Sensoriamento comprimido +

+ + + +O sensoriamento comprimido é baseado na teoria da otimização. A maneira como podemos obter essa reconstrução é resolvendo um pequeno problema de otimização que tem um termo de regularização adicional: + + + +$$ +\hat{x} = \arg\min_x \frac{1}{2} \Vert M (\mathcal{F}(x)) - y \Vert^2 + \lambda TV(x) +$$ + + + +onde $M$ é a função de máscara que zera as entradas não amostradas, $\mathcal{F}$ é a transformada de Fourier, $y$ são os dados observados do domínio de Fourier, $\lambda$ é a força da penalidade de regularização e $TV$ é a função de regularização (variação total). + + + +O problema de otimização deve ser resolvido para cada etapa de tempo ou cada "fatia" em uma ressonância magnética, o que geralmente leva muito mais tempo do que a própria varredura. Isso nos dá outro motivo para encontrar algo melhor. + + + + +### Quem precisa de otimização? + + + +Em vez de resolver o pequeno problema de otimização em cada etapa do tempo, por que não usar uma grande rede neural para produzir a solução necessária diretamente? Nossa esperança é que possamos treinar uma rede neural com complexidade suficiente para que essencialmente resolva o problema de otimização em uma etapa e produza uma saída que seja tão boa quanto a solução obtida ao resolver o problema de otimização em cada etapa de tempo. + + + +$$ +\hat{x} = B(y) +$$ + + + +onde $B$ é o nosso modelo de aprendizado profundo e $y$ são os dados observados do domínio de Fourier. + + + +Há 15 anos, essa abordagem era difícil - mas hoje em dia é muito mais fácil de implementar. A Figura 9 mostra o resultado de uma abordagem de aprendizado profundo para esse problema e podemos ver que a saída é muito melhor do que a abordagem de sensoriamento comprimido e é muito semelhante ao exame de imagem real. + + + +
+
+Figura 9: Abordagem com aprendizado profundo +

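+
+Para contrastar com a abordagem $\hat{x} = B(y)$ da Figura 9, o esboço abaixo concretiza o problema de otimização do sensoriamento comprimido mostrado acima. É apenas didático e hipotético (máscara, dados e $\lambda$ são fictícios; na prática usam-se solucionadores bem mais eficientes): minimiza $\frac{1}{2}\Vert M(\mathcal{F}(x)) - y\Vert^2 + \lambda TV(x)$ por descida de gradiente com o autograd do PyTorch.
+
+```python
+import torch
+
+# Dados fictícios: uma "imagem" alvo, uma máscara M de amostragem e as medições y.
+H, W = 64, 64
+alvo = torch.randn(H, W)
+mascara = (torch.rand(H, W) < 0.3).float()            # M: 1 nas entradas amostradas, 0 nas demais
+y = mascara * torch.fft.fft2(alvo, norm='ortho')      # dados observados no domínio de Fourier
+lam = 1e-2                                            # força da regularização (lambda)
+
+def tv(img):
+    # variação total (versão anisotrópica): soma dos valores absolutos das diferenças entre vizinhos
+    return (img[1:, :] - img[:-1, :]).abs().sum() + (img[:, 1:] - img[:, :-1]).abs().sum()
+
+x = torch.zeros(H, W, requires_grad=True)
+otimizador = torch.optim.SGD([x], lr=0.1)
+for _ in range(200):
+    otimizador.zero_grad()
+    # norm='ortho' torna a transformada de Fourier unitária (apenas uma escolha de normalização)
+    residuo = mascara * torch.fft.fft2(x, norm='ortho') - y   # M(F(x)) - y
+    perda = 0.5 * (residuo.real**2 + residuo.imag**2).sum() + lam * tv(x)
+    perda.backward()
+    otimizador.step()
+```
+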
+ + + +O modelo usado para gerar essa reconstrução usa um otimizador ADAM, camadas de normalização de norma de grupo e uma rede neural convolucional baseada em U-Net. Essa abordagem está muito próxima de aplicações práticas e esperamos ver esses exames de ressonância magnética acelerados acontecendo na prática clínica em alguns anos. \ No newline at end of file diff --git a/docs/pt/week05/05-3.md b/docs/pt/week05/05-3.md new file mode 100644 index 000000000..029a327af --- /dev/null +++ b/docs/pt/week05/05-3.md @@ -0,0 +1,490 @@ +--- +lang: pt +lang-ref: ch.05-3 +title: Noções básicas sobre convoluções e mecanismo de diferenciação automática +lecturer: Alfredo Canziani +authors: Leyi Zhu, Siqi Wang, Tao Wang, Anqi Zhang +date: 25 Feb 2020 +translator: Felipe Schiavon +translation-date: 14 Nov 2021 +--- + + + + +## [Entendendo a convolução 1D](https://www.youtube.com/watch?v=eEzCZnOFU1w&t=140s) + + + +Nesta parte discutiremos a convolução, uma vez que gostaríamos de explorar a esparsidade, estacionariedade e composicionalidade dos dados. + + + +Ao invés de usar a matriz $A$ discutida na [semana anterior]({{site.baseurl}}/pt/week04/04-1), vamos alterar a largura da matriz para o tamanho do kernel $k$. Portanto, cada linha da matriz é um kernel. Podemos usar os kernels os empilhando e deslocando (veja a Fig. 1). Então podemos ter $m$ camadas de altura $n-k+1$. + + +
+1
+Fig 1: Ilustração de uma Convolução 1D +
+ + + +A saída consiste em $m$ vetores (a espessura) de tamanho $n-k+1$. + + + +
+2
+Fig 2: Resultado da Convolução 1D +
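+
+Um esboço mínimo (com $n$ e $k$ ilustrativos) que reproduz as Figuras 1 e 2: empilhar e deslocar um kernel equivale a multiplicar o sinal por uma matriz cujas linhas são o kernel deslocado, e o resultado coincide com `torch.nn.functional.conv1d`:
+
+```python
+import torch
+import torch.nn.functional as F
+
+n, k = 7, 3                      # tamanho do sinal e do kernel (ilustrativos)
+x = torch.randn(n)
+w = torch.randn(k)
+
+# Matriz (n-k+1) x n: cada linha é o kernel deslocado de uma posição.
+A = torch.zeros(n - k + 1, n)
+for i in range(n - k + 1):
+    A[i, i:i + k] = w
+
+saida_matriz = A @ x                                        # forma: (n-k+1,)
+saida_conv = F.conv1d(x.view(1, 1, n), w.view(1, 1, k))     # forma: (1, 1, n-k+1)
+print(torch.allclose(saida_matriz, saida_conv.view(-1)))    # True
+```
+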
+ + + +Além disso, um único vetor de entrada pode ser visto como um sinal monofônico. + + + +
+3
+Fig 3: Sinal Monofônico +
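+
+Em termos de tensores (esboço com $64$ amostras ilustrativas), o sinal monofônico da Fig. 3 corresponde a $c = 1$ canal; com $c = 2$ canais, temos um sinal estereofônico:
+
+```python
+import torch
+
+mono = torch.randn(1, 64)         # c = 1 canal, 64 amostras
+estereo = torch.randn(2, 64)      # c = 2 canais, 64 amostras
+print(mono.shape, estereo.shape)  # torch.Size([1, 64]) torch.Size([2, 64])
+```
+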
+ + + +Agora, a entrada $x$ é o mapeamento + + + +$$ +x:\Omega\rightarrow\mathbb{R}^{c} +$$ + + + +onde $\Omega = \lbrace 1, 2, 3, \cdots \rbrace \subset \mathbb{N}^1$ (uma vez que este é um sinal unidimensional, ou seja, tem um domínio unidimensional) e, neste caso, o número de canais $c$ é $1$. Quando $c = 2$, isso se torna um sinal estereofônico. + + + +Para a convolução 1D, podemos apenas calcular o produto escalar, kernel por kernel (consulte a Figura 4). + + + +
+4
+Fig 4: Produto escalar camada por camada da convolução 1D +
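+
+O produto escalar "kernel por kernel" da Fig. 4 pode ser verificado à mão (esboço com tamanhos ilustrativos), extraindo as janelas deslizantes com `unfold` e comparando com `nn.Conv1d`:
+
+```python
+import torch
+
+c, n, m, k = 2, 10, 4, 3                    # canais de entrada, amostras, nº de kernels, tamanho do kernel
+x = torch.randn(1, c, n)
+conv = torch.nn.Conv1d(c, m, k, bias=False)
+
+janelas = x.unfold(2, k, 1)                 # forma (1, c, n-k+1, k): janelas deslizantes do sinal
+# produto escalar de cada kernel (m, c, k) com cada janela (c, k)
+manual = torch.einsum('mck,bcjk->bmj', conv.weight, janelas)
+print(torch.allclose(manual, conv(x), atol=1e-6))  # True
+```
+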
+ + + + +## [Dimensões das larguras dos kernels e saídas no PyTorch](https://www.youtube.com/watch?v=eEzCZnOFU1w&t=1095s) + + + +Dicas: Podemos usar ***ponto de interrogação*** no IPython para obter acesso à documentação das funções. Por exemplo, + + + +```python +Init signature: +nn.Conv1d( + in_channels, # número de canais na imagem de entrada + out_channels, # número de canais produzidos pela convolução + kernel_size, # tamanho do kernel convolvente + stride=1, # stride (passo) da convolução + padding=0, # zero-padding (preenchimento com zero) adicionado nos dois lados da entrada + dilation=1, # espaçamento entre os elementos do kernel + groups=1, # número de conexões bloqueadas da entrada para a saída + bias=True, # se `True`, adiciona um viés "aprendível" na saída + padding_mode='zeros', # modo de preenchimento, aceita valores `zeros` e `circular` +) +``` + + + + +### Convolução 1D + + + +Temos $1$ convolução dimensional indo de $2$ canais (sinal estereofônico) para $16$ canais ($16$ kernels) com tamanho de kernel de $3$ e *stride* (passo) de $1$. Temos então $16$ kernels com espessura $2$ e comprimento $3$. Vamos supor que o sinal de entrada tenha um lote de tamanho $1$ (um sinal), $2$ canais e $64$ amostras. A camada de saída resultante tem $1$ sinal, $16$ canais e o comprimento do sinal é $62$ ($=64-3+1$). Além disso, se gerarmos o tamanho do enviesamento, descobriremos que o tamanho do viés é $16$, já que temos um viés para cada peso. + + + +```python +conv = nn.Conv1d(2, 16, 3) # 2 canais (sinal estéreo), 16 kernels de tamanho 3 +conv.weight.size() # saída: torch.Size([16, 2, 3]) +conv.bias.size() # saída: torch.Size([16]) +x = torch.rand(1, 2, 64) # lote de tamanho 1, 2 canais, 64 amostras +conv(x).size() # saída: torch.Size([1, 16, 62]) +conv = nn.Conv1d(2, 16, 5) # 2 canais, 16 kernels de tamanho 5 +conv(x).size() # saída: torch.Size([1, 16, 60]) + +``` + + + + +### Convolução 2D + + + +Primeiro definimos os dados de entrada como $1$ amostra, $20$ canais (digamos, estamos usando uma imagem hiperespectral) com altura $64$ e largura $128$. A convolução 2D tem $20$ canais de entrada e $16$ kernels com tamanho de $3\times 5$. Após a convolução, os dados de saída têm $1$ amostra, $16$ canais com altura $62$ ($=64-3+1$) e largura $124$ ($=128-5+1$). + + + +```python +x = torch.rand(1, 20, 64, 128) # 1 amostra, 20 canais, altura 64 e largura 128 +conv = nn.Conv2d(20, 16, (3, 5)) # 20 canais, 16 kernels, kernel de tamanho 3 x 5 +conv.weight.size() # saída: torch.Size([16, 20, 3, 5]) +conv(x).size() # saída: torch.Size([1, 16, 62, 124]) +``` + + + +Se quisermos atingir a mesma dimensionalidade, podemos ter preenchimentos. Continuando o código acima, podemos adicionar novos parâmetros à função de convolução: `stride = 1` e` padding = (1, 2) `, o que significa $1$ em $y$ direction ($1$ no topo e $1$ na parte inferior) e $2$ na direção $x$. Então, o sinal de saída tem o mesmo tamanho em comparação com o sinal de entrada. O número de dimensões necessárias para armazenar a coleção de kernels ao realizar a convolução 2D é $4$. + + + +```python +# 20 canais, 16 kernels de tamanho 3 x 5, stride de 1, preenchimento (padding) de 1 e 2 +conv = nn.Conv2d(20, 16, (3, 5), 1, (1, 2)) +conv(x).size() # saída: torch.Size([1, 16, 64, 128]) +``` + + + + +## [Como funciona o gradiente automático?](https://www.youtube.com/watch?v=eEzCZnOFU1w&t=1634s) + + + +Nesta seção, vamos pedir ao torch para verificar todos os cálculos sobre os tensores para que possamos realizar o cálculo das derivadas parciais. 
+ + + +- Crie um tensor $2\times2$ $\boldsymbol{x}$ com capacidade de acumulação de gradiente; +- Subtraia $2$ de todos os elementos de $\boldsymbol{x}$ e obtenha $\boldsymbol{y}$; (se imprimirmos `y.grad_fn`, obteremos algo como `<SubBackward0>`, o que significa que `y` é gerado pelo módulo da subtração $\boldsymbol{x}-2$. Também podemos usar `y.grad_fn.next_functions[0][0].variable` para recuperar o tensor original.) +- Faça mais operações: $\boldsymbol{z}=3 \boldsymbol{y}^2$; +- Calcule a média de $\boldsymbol{z}$ (o esboço de código logo após a Fig. 5 reproduz estes passos). + + + +
+5
+Fig 5: Fluxograma do Exemplo de gradiente automático +
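+
+Um esboço que reproduz os passos listados acima (usando, para conferência, os mesmos valores do cálculo manual mostrado logo a seguir):
+
+```python
+import torch
+
+x = torch.tensor([[1., 2.], [3., 4.]], requires_grad=True)  # tensor 2x2 com acúmulo de gradiente
+y = x - 2
+z = 3 * y ** 2
+a = z.mean()
+
+print(y.grad_fn)                                     # <SubBackward0 ...>: y veio da subtração x - 2
+print(y.grad_fn.next_functions[0][0].variable is x)  # True: recupera o tensor original
+
+a.backward()
+print(x.grad)   # [[-1.5, 0.0], [1.5, 3.0]], igual ao cálculo manual abaixo
+```
+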
+ + + +A retropropagação (backpropagation) é usada para calcular os gradientes. Neste exemplo, o processo de retropropagação pode ser visto como o cálculo do gradiente $\frac{d\boldsymbol{a}}{d\boldsymbol{x}}$. Depois de calcular $\frac{d\boldsymbol{a}}{d\boldsymbol{x}}$ manualmente como uma validação, podemos descobrir que a execução de `a.backward()` nos dá o mesmo valor de *x.grad* como nosso cálculo. + + + +Aqui está o processo de cálculo da retropropagação manualmente: + + + +$$ +\begin{aligned} +a &= \frac{1}{4} (z_1 + z_2 + z_3 + z_4) \\ +z_i &= 3y_i^2 = 3(x_i-2)^2 \\ +\frac{da}{dx_i} &= \frac{1}{4}\times3\times2(x_i-2) = \frac{3}{2}x_i-3 \\ +x &= \begin{pmatrix} 1&2\\3&4\end{pmatrix} \\ +\left(\frac{da}{dx_i}\right)^\top &= \begin{pmatrix} 1.5-3&3-3\\[2mm]4.5-3&6-3\end{pmatrix}=\begin{pmatrix} -1.5&0\\[2mm]1.5&3\end{pmatrix} +\end{aligned} +$$ + + + +Sempre que você usa derivada parcial em PyTorch, obtém a mesma forma dos dados originais. Mas a coisa jacobiana correta deveria ser a transposição. + + + + +### Do básico ao mais louco + + + +Agora temos um vetor $1\times3$ $x$, atribua $y$ ao dobro de $x$ e continue dobrando $y$ até que sua norma seja menor que $1000$. Devido à aleatoriedade que temos para $x$, não podemos saber diretamente o número de iterações quando o procedimento termina. + + + +```python +x = torch.randn(3, requires_grad=True) + +y = x * 2 +i = 0 +while y.data.norm() < 1000: + y = y * 2 + i += 1 +``` + + + +No entanto, podemos inferir isso facilmente conhecendo os gradientes que temos. + + + +```python +gradients = torch.FloatTensor([0.1, 1.0, 0.0001]) +y.backward(gradients) + +print(x.grad) +tensor([1.0240e+02, 1.0240e+03, 1.0240e-01]) +print(i) +9 +``` + + + +Quanto à inferência, podemos usar `requires_grad=True` para rotular que queremos rastrear o acúmulo de gradiente conforme mostrado abaixo. Se omitirmos `requires_grad=True` na declaração de $x$ ou $w$ e chamar`backward ()`em $z$, haverá um erro de execução devido a não termos acumulação de gradiente em $x$ ou $w$. + + + +```python +# Tanto x quanto w que permitem o acúmulo de gradiente +x = torch.arange(1., n + 1, requires_grad=True) +w = torch.ones(n, requires_grad=True) +z = w @ x +z.backward() +print(x.grad, w.grad, sep='\n') +``` + + + +E, podemos usar o comando `with torch.no_grad()` para omitir o acúmulo de gradiente. + + + + +```python +x = torch.arange(1., n + 1) +w = torch.ones(n, requires_grad=True) + +# Todos os tensores do torch não terão gradientes acumulados +with torch.no_grad(): + z = w @ x + +try: + z.backward() # PyTorch vai lançar um erro aqui, pois z não tem acumulador de gradientes +except RuntimeError as e: + print('RuntimeError!!! >:[') + print(e) +``` + + + + +## Mais coisas - gradientes personalizados + + + +Além disso, em vez de operações numéricas básicas, podemos criar nossos próprios módulos / funções, que podem ser plugados no grafo da rede neural. O Jupyter Notebook pode ser encontrado [aqui](https://github.com/Atcold/pytorch-Deep-Learning/blob/master/extra/b-custom_grads.ipynb). + + + +Para fazer isso, precisamos herdar `torch.autograd.Function` e substituir as funções `forward ()` e `backward()`. Por exemplo, se quisermos treinar redes, precisamos obter a passagem pelo *forward* e saber as derivadas parciais da entrada em relação à saída, de forma que possamos usar este módulo em qualquer tipo de ponto do código. 
Então, usando retropropagação (regra da cadeia), podemos conectar a coisa em qualquer lugar na cadeia de operações, desde que conheçamos as derivadas parciais da entrada em relação à saída. + + + +Neste caso, existem três exemplos de ***módulos personalizados*** no *notebook*, os módulos `add`,`split` e `max`. Por exemplo, o módulo de adição personalizado: + + + +```python +# Custom addition module +class MyAdd(torch.autograd.Function): + + @staticmethod + def forward(ctx, x1, x2): + # ctx is a context where we can save + # computations for backward. + ctx.save_for_backward(x1, x2) + return x1 + x2 + + @staticmethod + def backward(ctx, grad_output): + x1, x2 = ctx.saved_tensors + grad_x1 = grad_output * torch.ones_like(x1) + grad_x2 = grad_output * torch.ones_like(x2) + # need to return grads in order + # of inputs to forward (excluding ctx) + return grad_x1, grad_x2 +``` + + + +Se adicionarmos duas coisas e obtivermos uma saída, precisamos sobrescrever a função forward desta forma. E quando descemos para fazer a propagação reversa, os gradientes são copiados em ambos os lados. Portanto, sobrescrevemos a função de retrocesso copiando. + + + +Para `split` e `max`, veja o código de como sobrescrevemos as funções de avanço e retrocesso no *bloco de notas*. Se viermos da mesma coisa e **Dividir**, ao descermos fazendo gradientes, devemos somar / somar. Para `argmax`, ele seleciona o índice da coisa mais alta, então o índice da mais alta deve ser $1$ enquanto os outros devem ser $0$. Lembre-se, de acordo com diferentes módulos personalizados, precisamos sobrescrever sua própria passagem do *forward* e como eles fazem os gradientes na função *backward*. diff --git a/docs/pt/week05/05.md b/docs/pt/week05/05.md new file mode 100644 index 000000000..10ca81a95 --- /dev/null +++ b/docs/pt/week05/05.md @@ -0,0 +1,40 @@ +--- +lang: pt +lang-ref: ch.05 +title: Semana 5 +translation-date: 05 Nov 2021 +translator: Felipe Schiavon +--- + + + +## Aula parte A + + + +Começamos apresentando o Gradiente Descendente. Discutimos a intuição e também falamos sobre como os tamanhos dos passos desempenham um papel importante para se chegar à solução. Em seguida, passamos para Gradiente Descendente Estocástico (SGD) e seu desempenho em comparação com Gradiente Descendente completo (Full Batch GD). Por fim, falamos sobre as atualizações de momento, especificamente as duas regras de atualização, a intuição por trás do momento e seu efeito na convergência. + + + +## Aula parte B + + + +Discutimos métodos adaptativos para SGD, como RMSprop e ADAM. Também falamos sobre camadas de normalização e seus efeitos no processo de treinamento das redes neurais. Finalmente, discutimos um exemplo do mundo real de redes neurais sendo usadas na indústria para tornar os exames de ressonância magnética mais rápidos e eficientes. + + + +## Prática + + + +Revisamos brevemente as multiplicações de matrizes e, em seguida, discutimos as convoluções. O ponto principal é que usamos kernels por empilhamento e deslocamento. Primeiro entendemos a convolução de uma dimensão (1D) manualmente e, em seguida, usamos o PyTorch para aprender a dimensão dos kernels e da largura da saída em exemplos de convoluções de uma (1D) e duas dimensões (2D). Além disso, usamos o PyTorch para aprender sobre como o funciona o gradiente automático e os gradientes customizados. 
+ diff --git a/docs/pt/week05/lecture05.sbv b/docs/pt/week05/lecture05.sbv new file mode 100644 index 000000000..8d3d55657 --- /dev/null +++ b/docs/pt/week05/lecture05.sbv @@ -0,0 +1,3572 @@ +0:00:00.000,0:00:04.410 +All right so as you can see today we don't have Yann. Yann is somewhere else + +0:00:04.410,0:00:09.120 +having fun. Hi Yann. Okay so today's that we have + +0:00:09.120,0:00:13.740 +Aaron DeFazio he's a research scientist at Facebook working mostly on + +0:00:13.740,0:00:16.619 +optimization he's been there for the past three years + +0:00:16.619,0:00:21.900 +and before he was a data scientist at Ambiata and then a student at the + +0:00:21.900,0:00:27.599 +Australian National University so why don't we give a round of applause to the + +0:00:27.599,0:00:37.350 +our speaker today I'll be talking about optimization and if we have time at the + +0:00:37.350,0:00:42.739 +end the death of optimization so these are the topics I will be covering today + +0:00:42.739,0:00:47.879 +now optimization is at the heart of machine learning and some of the things + +0:00:47.879,0:00:52.680 +are going to be talking about today will be used every day in your role + +0:00:52.680,0:00:56.640 +potentially as an applied scientist or even as a research scientist or a data + +0:00:56.640,0:01:01.590 +scientist and I'm gonna focus on the application of these methods + +0:01:01.590,0:01:05.850 +particularly rather than the theory behind them part of the reason for this + +0:01:05.850,0:01:10.260 +is that we don't fully understand all of these methods so for me to come up here + +0:01:10.260,0:01:15.119 +and say this is why it works I would be oversimplifying things but what I can + +0:01:15.119,0:01:22.320 +tell you is how to use them how we know that they work in certain situations and + +0:01:22.320,0:01:28.320 +what the best method may be to use to train your neural network and to + +0:01:28.320,0:01:31.770 +introduce you to the topic of optimization I need to start with the + +0:01:31.770,0:01:36.720 +worst method in the world gradient descent and I'll explain in a minute why + +0:01:36.720,0:01:43.850 +it's the worst method but to begin with we're going to use the most generic + +0:01:43.850,0:01:47.549 +formulation of optimization now the problems you're going to be considering + +0:01:47.549,0:01:51.659 +will have more structure than this but it's very useful useful notationally to + +0:01:51.659,0:01:56.969 +start this way so we talked about a function f now we're trying to prove + +0:01:56.969,0:02:03.930 +properties of our optimizer will assume additional structure on f but in + +0:02:03.930,0:02:07.049 +practice the structure in our neural networks essentially obey no of the + +0:02:07.049,0:02:09.239 +assumptions none of the assumptions people make in + +0:02:09.239,0:02:12.030 +practice I'm just gonna start with the generic F + +0:02:12.030,0:02:17.070 +and we'll assume it's continuous and differentiable even though we're already + +0:02:17.070,0:02:20.490 +getting into the realm of incorrect assumptions since the neural networks + +0:02:20.490,0:02:25.170 +most people are using in practice these days are not differentiable instead you + +0:02:25.170,0:02:29.460 +have a equivalent sub differential which you can essentially plug into all these + +0:02:29.460,0:02:33.570 +formulas and if you cross your fingers there's no theory to support this it + +0:02:33.570,0:02:38.910 +should work so the method of gradient descent is shown here it's an iterative + +0:02:38.910,0:02:44.790 +method so 
you start at a point k equals zero and at each step you update your + +0:02:44.790,0:02:49.410 +point and here we're going to use W to represent our current iterate either it + +0:02:49.410,0:02:54.000 +being the standard nomenclature for the point for your neural network this w + +0:02:54.000,0:03:00.420 +will be some large collection of weights one weight tensor per layer but notation + +0:03:00.420,0:03:03.540 +we we kind of squash the whole thing down to a single vector and you can + +0:03:03.540,0:03:09.000 +imagine just doing that literally by reshaping all your vectors to all your + +0:03:09.000,0:03:13.740 +tensors two vectors and just concatenate them together and this method is + +0:03:13.740,0:03:17.519 +remarkably simple all we do is we follow the direction of the negative gradient + +0:03:17.519,0:03:24.750 +and the rationale for this it's pretty simple so let me give you a diagram and + +0:03:24.750,0:03:28.410 +maybe this will help explain exactly why following the negative gradient + +0:03:28.410,0:03:33.570 +direction is a good idea so we don't know enough about our function to do + +0:03:33.570,0:03:38.760 +better this is a high level idea when we're optimizing a function we look at + +0:03:38.760,0:03:45.060 +the landscape the optimization landscape locally so by optimization landscape I + +0:03:45.060,0:03:49.230 +mean the domain of all possible weights of our network now we don't know what's + +0:03:49.230,0:03:53.459 +going to happen if we use any particular weights on your network we don't know if + +0:03:53.459,0:03:56.930 +it'll be better at the task we're trying to train it to or worse but we do know + +0:03:56.930,0:04:01.530 +locally is the point that are currently ad and the gradient and this gradient + +0:04:01.530,0:04:05.190 +provides some information about a direction which we can travel in that + +0:04:05.190,0:04:09.870 +may improve the performance of our network or in this case reduce the value + +0:04:09.870,0:04:14.340 +of our function were minimizing here in this set up this general setup + +0:04:14.340,0:04:19.380 +minimizing a function is essentially training in your network so minimizing + +0:04:19.380,0:04:23.520 +the loss will give you the best performance on your classification task + +0:04:23.520,0:04:26.550 +or whatever you're trying to do and because we only look at the world + +0:04:26.550,0:04:31.110 +locally here this gradient is basically the best information we have and you can + +0:04:31.110,0:04:36.270 +think of this as descending a valley where you start somewhere horrible some + +0:04:36.270,0:04:39.600 +pinkie part of the landscape the top of a mountain for instance and you travel + +0:04:39.600,0:04:43.590 +down from there and at each point you follow the direction near you that has + +0:04:43.590,0:04:50.040 +the most sorry the steepest descent and in fact the go the method of grading % + +0:04:50.040,0:04:53.820 +is sometimes called the method of steepest descent and this direction will + +0:04:53.820,0:04:57.630 +change as you move in the space now if you move locally by only an + +0:04:57.630,0:05:02.040 +infinitesimal amount assuming this smoothness that I mentioned before which + +0:05:02.040,0:05:04.740 +is actually not true in practice but we'll get to that assuming the + +0:05:04.740,0:05:08.280 +smoothness this small step will only change the gradient a small amount so + +0:05:08.280,0:05:11.820 +the direction you're traveling in is at least a good direction when you take + +0:05:11.820,0:05:18.120 +small steps and we 
essentially just follow this path taking as larger steps + +0:05:18.120,0:05:20.669 +as we can traversing the landscape until we reach + +0:05:20.669,0:05:25.229 +the valley at the bottom which is the minimizer our function now there's a + +0:05:25.229,0:05:30.690 +little bit more we can say for some problem classes and I'm going to use the + +0:05:30.690,0:05:34.950 +most simplistic problem class we can just because it's the only thing that I + +0:05:34.950,0:05:39.210 +can really do any mathematics for on one slide so bear with me + +0:05:39.210,0:05:44.580 +this class is quadratics so for a quadratic optimization problem we + +0:05:44.580,0:05:51.570 +actually know quite a bit just based off the gradient so firstly a gradient cuts + +0:05:51.570,0:05:55.440 +off an entire half of a space and now illustrate this here with this green + +0:05:55.440,0:06:02.130 +line so we're at that point there where the line starts near the Green Line we + +0:06:02.130,0:06:05.789 +know the solution cannot be in the rest of the space and this is not true from + +0:06:05.789,0:06:09.930 +your networks but it's still a genuinely a good guideline that we want to follow + +0:06:09.930,0:06:13.710 +the direction of negative gradient there could be better solutions elsewhere in + +0:06:13.710,0:06:17.910 +the space but finding them is is much harder than just trying to find the best + +0:06:17.910,0:06:21.300 +solution near to where we are so that's what we do we trying to find the best + +0:06:21.300,0:06:24.930 +solution near to where we are you could imagine this being the surface of the + +0:06:24.930,0:06:28.410 +earth where there are many hills and valleys and we can't hope to know + +0:06:28.410,0:06:31.020 +something about a mountain on the other side of the planet but we can certainly + +0:06:31.020,0:06:34.559 +look for the valley directly beneath the mountain where we currently are + +0:06:34.559,0:06:39.089 +in fact you can think of these functions here as being represented with these + +0:06:39.089,0:06:44.369 +topographic maps this is the same as topographic maps you use that you may be + +0:06:44.369,0:06:50.369 +familiar with from from the planet Earth where mountains are shown by these rings + +0:06:50.369,0:06:53.309 +now here the rings are representing descent so this is the bottom of the + +0:06:53.309,0:06:57.839 +valley we're showing here not the top of a hill at the center there so yes our + +0:06:57.839,0:07:02.459 +gradient knocks off a whole half of the possible space now it's very reasonable + +0:07:02.459,0:07:06.059 +then to go in the direction find this negative gradient because it's kind of + +0:07:06.059,0:07:10.199 +orthogonal to this line that cuts off after space and you can see that I've + +0:07:10.199,0:07:21.409 +got the indication of orthogonal you there the little la square so the + +0:07:21.409,0:07:25.319 +properties of gradient to spend a gradient descent depend greatly on the + +0:07:25.319,0:07:28.889 +structure of the problem for these quadratic problems it's actually + +0:07:28.889,0:07:32.549 +relatively simple to characterize what will happen so I'm going to give you a + +0:07:32.549,0:07:35.369 +little bit of an overview here and I'll spend a few minutes on this because it's + +0:07:35.369,0:07:38.339 +quite interesting and I'm hoping that those of you with some background in + +0:07:38.339,0:07:42.629 +linear algebra can follow this derivation but we're going to consider a + +0:07:42.629,0:07:47.309 +quadratic optimization problem now the problem stated in the 
gray box + +0:07:47.309,0:07:53.309 +at the top you can see that this is a quadratic where a is a positive definite + +0:07:53.309,0:07:58.769 +matrix we can handle broader classes of Quadra quadratics and this potentially + +0:07:58.769,0:08:04.649 +but the analysis is most simple in the positive definite case and the grating + +0:08:04.649,0:08:09.539 +of that function is very simple of course as Aw - b and u the solution of + +0:08:09.539,0:08:13.379 +this problem has a closed form in the case of quadratics it's as inverse of a + +0:08:13.379,0:08:20.179 +times B now what we do is we take the steps they're shown in the green box and + +0:08:20.179,0:08:26.519 +we just plug it into the distance from solution. So this || wₖ₊₁ – w*|| + +0:08:26.519,0:08:30.479 +is a distance from solution so we want to see how this changes over time and + +0:08:30.479,0:08:34.050 +the idea is that if we're moving closer to the solution over time the method is + +0:08:34.050,0:08:38.579 +converging so we start with that distance from solution to be plug in the + +0:08:38.579,0:08:44.509 +value of the update now with a little bit of rearranging we can pull + +0:08:45.050,0:08:50.950 +the terms we can group the terms together and we can write B as a inverse + +0:08:50.950,0:09:05.090 +so we can pull or we can pull the W star inside the inside the brackets there and + +0:09:05.090,0:09:11.960 +then we get this expression where it's matrix times the previous distance to + +0:09:11.960,0:09:16.040 +the solution matrix times previous distance solution now we don't know + +0:09:16.040,0:09:20.720 +anything about which directions this quadratic it varies most extremely in + +0:09:20.720,0:09:24.890 +but we can just not bound this very simply by taking the product of the + +0:09:24.890,0:09:28.850 +matrix as norm and the distance to the solution here this norm at the bottom so + +0:09:28.850,0:09:34.070 +that's the bottom line now now when you're considering matrix norms it's + +0:09:34.070,0:09:39.590 +pretty straightforward to see that you're going to have an expression where + +0:09:39.590,0:09:45.710 +the eigen values of this matrix are going to be 1 minus μ γ or 1 minus + +0:09:45.710,0:09:48.950 +L γ now the way I get this is I just look at what are the extreme eigen + +0:09:48.950,0:09:54.050 +values of a which we call them μ and L and by plugging these into the + +0:09:54.050,0:09:56.930 +expression we can see what the extreme eigen values will be of this combined + +0:09:56.930,0:10:03.050 +matrix I minus γ a and you have this absolute value here now you can optimize + +0:10:03.050,0:10:06.320 +this and get an optimal learning rate for the quadratics + +0:10:06.320,0:10:09.920 +but that optimal learning rate is not robust in practice you probably don't + +0:10:09.920,0:10:16.910 +want to use that so a simpler value you can use is 1/L. 
L being the largest + +0:10:16.910,0:10:22.420 +eigen value and this gives you this convergence rate of 1 – μ/L + +0:10:22.420,0:10:29.240 +reduction in distance to solution every step do we have any questions here I + +0:10:29.240,0:10:32.020 +know it's a little dense yes yes it's it's a substitution from in + +0:10:41.120,0:10:46.010 +that gray box do you see the bottom line on the gray box yeah that's that's just + +0:10:46.010,0:10:51.230 +a by definition we can solve the gradient so by taking the gradient to + +0:10:51.230,0:10:53.060 +zero if you see in that second line in the box + +0:10:53.060,0:10:55.720 +taking the gradient to zero this so replaced our gradient with zero and + +0:10:55.720,0:11:01.910 +rearranging you get the closed form solution to the problem here so the + +0:11:01.910,0:11:04.490 +problem with using that closed form solution in practice is we have to + +0:11:04.490,0:11:08.420 +invert a matrix and by using gradient descent we can solve this problem by + +0:11:08.420,0:11:12.920 +only doing matrix multiplications instead I'm not that I would suggest you + +0:11:12.920,0:11:15.560 +actually use this technique to solve the matrix as I mentioned before it's the + +0:11:15.560,0:11:20.750 +worst method in the world and the convergence rate of this method is + +0:11:20.750,0:11:25.100 +controlled by this new overall quantity now these are standard notations so + +0:11:25.100,0:11:27.950 +we're going from linear algebra where you talk about the min and Max eigen + +0:11:27.950,0:11:33.430 +value to the notation typically used in the field of optimization. + +0:11:33.430,0:11:39.380 +μ is smallest eigen value L being largest eigen value and this μ/L is the + +0:11:39.380,0:11:44.570 +inverse of the condition number condition number being L/μ this + +0:11:44.570,0:11:51.140 +gives you a broad characterization of how quickly optimization methods will + +0:11:51.140,0:11:57.440 +work on this problem and this these military terms they don't exist for + +0:11:57.440,0:12:02.870 +neural networks only in the very simplest situations do we have L exists + +0:12:02.870,0:12:06.740 +and we essentially never have μ existing nevertheless we want to talk + +0:12:06.740,0:12:10.520 +about network networks being polar conditioned and well conditioned and + +0:12:10.520,0:12:14.930 +poorly conditioned would typically be some approximation to L is very large + +0:12:14.930,0:12:21.260 +and well conditioned maybe L is very close to one so the step size we can + +0:12:21.260,0:12:27.770 +select in one summer training depends very heavily on these constants so let + +0:12:27.770,0:12:30.800 +me give you a little bit of an intuition for step sizes and this is very + +0:12:30.800,0:12:34.640 +important in practice I myself find a lot of my time is spent treating + +0:12:34.640,0:12:40.310 +learning rates and I'm sure you'll be involved in similar procedure so we have + +0:12:40.310,0:12:45.740 +a couple of situations that can occur if we use a learning rate that's too low + +0:12:45.740,0:12:49.310 +we'll find that we make steady progress towards the solution here we're + +0:12:49.310,0:12:56.480 +minimizing a little 1d quadratic and by steady progress I mean that every + +0:12:56.480,0:13:00.920 +iteration the gradient stays in buffer the same direction and you make similar + +0:13:00.920,0:13:05.420 +progress as you approach the solution this is slower than it is possible so + +0:13:05.420,0:13:09.910 +what you would ideally want to do is go straight to the solution for a quadratic + 
+0:13:09.910,0:13:12.650 +especially a 1d one like this that's going to be pretty straightforward + +0:13:12.650,0:13:16.340 +there's going to be an exact step size that'll get you all the way to solution + +0:13:16.340,0:13:20.810 +but more generally you can't do that and what you typically want to use is + +0:13:20.810,0:13:26.150 +actually a step size a bit above that optimal and this is for a number of + +0:13:26.150,0:13:29.570 +reasons it tends to be quicker in practice we have to be very very careful + +0:13:29.570,0:13:33.800 +because you get divergence and the term divergence means that the iterates will + +0:13:33.800,0:13:37.160 +get further away than from the solution instead of closer this will typically + +0:13:37.160,0:13:42.530 +happen if you use two larger learning rate unfortunately for us we want to use + +0:13:42.530,0:13:45.590 +learning rates as large as possible to get as quick learning as possible so + +0:13:45.590,0:13:50.180 +we're always at the edge of divergence in fact it's very rare that you'll see + +0:13:50.180,0:13:55.400 +that the gradients follow this nice trajectory where they all point the same + +0:13:55.400,0:13:58.670 +direction until you kind of reach the solution what almost always happens in + +0:13:58.670,0:14:02.960 +practice especially with gradient descent invariants is that you observe + +0:14:02.960,0:14:06.770 +this zigzagging behavior now we can't actually see zigzagging in million + +0:14:06.770,0:14:10.940 +dimensional spaces that we train your networks in but it's very evident in + +0:14:10.940,0:14:15.680 +these 2d plots of a quadratic so here I'm showing the level sets you can see + +0:14:15.680,0:14:20.560 +the numbers or the function value indicated there on the level sets and + +0:14:20.560,0:14:27.830 +when we use a learning rate that is good not optimal but good we get pretty close + +0:14:27.830,0:14:31.760 +to that blue dot the solution are for the 10 steps when we use a learning rate + +0:14:31.760,0:14:35.450 +that seems nicer in that it's not oscillating it's well-behaved when we + +0:14:35.450,0:14:38.330 +use such a learning rate we actually end up quite a bit further away from the + +0:14:38.330,0:14:42.830 +solution so it's a fact of life that we have to deal with these learning rates + +0:14:42.830,0:14:50.690 +that are stressfully high it's kind of like a race right you know no one wins a + +0:14:50.690,0:14:55.730 +a race by driving safely so our network training should be very comparable to + +0:14:55.730,0:15:01.940 +that so the core topic we want to talk about is actually it stochastic + +0:15:01.940,0:15:08.600 +optimization and this is the method that we will be using every day for training + +0:15:08.600,0:15:14.660 +neural networks in practice so it's de casting optimization is actually not so + +0:15:14.660,0:15:19.190 +different what we're gonna do is we're going to replace the gradients in our + +0:15:19.190,0:15:25.700 +gradient descent step with a stochastic approximation to the gradient now in a + +0:15:25.700,0:15:29.930 +neural network we can be a bit more precise here by stochastic approximation + +0:15:29.930,0:15:36.310 +what we mean is the gradient of the loss for a single data point single instance + +0:15:36.310,0:15:42.970 +you might want to call it so I've got that in the notation here this function + +0:15:42.970,0:15:49.430 +L is the loss of one day the point here the data point is indexed by AI and we + +0:15:49.430,0:15:52.970 +would write this typically in the optimization literature as the 
function + +0:15:52.970,0:15:57.380 +fᵢ and I'm going to use this notation but you should imagine fᵢ as being the + +0:15:57.380,0:16:02.390 +loss for a single instance I and here I'm using supervised learning setup + +0:16:02.390,0:16:08.330 +where we have data points I labels yᵢ so they points xᵢ labels yᵢ the full + +0:16:08.330,0:16:14.290 +loss for a function is shown at the top there it's a sum of all these fᵢ. Now + +0:16:14.290,0:16:17.600 +let me give you a bit more explanation for what we're doing here we're placing + +0:16:17.600,0:16:24.230 +this through gradient with a stochastic gradient this is a noisy approximation + +0:16:24.230,0:16:30.350 +and this is how it's often explained in the stochastic optimization setup so we + +0:16:30.350,0:16:36.440 +have this function the gradient and in our setup it's expected value is equal + +0:16:36.440,0:16:41.150 +to the full gradient so you can think of a stochastic gradient descent step as + +0:16:41.150,0:16:47.210 +being a full gradient step in expectation now this is not actually the + +0:16:47.210,0:16:50.480 +best way to view it because there's a lot more going on than that it's not + +0:16:50.480,0:16:58.310 +just gradient descent with noise so let me give you a little bit more detail but + +0:16:58.310,0:17:03.050 +first I let anybody ask any questions I have here before I move on yes + +0:17:03.050,0:17:08.420 +mm-hmm yeah I could talk a bit more about that but yes so you're right so + +0:17:08.420,0:17:12.500 +using your entire dataset to calculate a gradient is here what I mean by gradient + +0:17:12.500,0:17:17.720 +descent we also call that full batch gradient descent just to be clear now in + +0:17:17.720,0:17:22.280 +machine learning we virtually always use mini batches so people may use the name + +0:17:22.280,0:17:24.620 +gradient descent or something when they're really talking about stochastic + +0:17:24.620,0:17:29.150 +gradient descent and what you mentioned is absolutely true so there are some + +0:17:29.150,0:17:33.920 +difficulties of training neural networks using very large batch sizes and this is + +0:17:33.920,0:17:37.010 +understood to some degree and I'll actually explain that on the very next + +0:17:37.010,0:17:39.230 +slide so let me let me get to to your point first + +0:17:39.230,0:17:45.679 +so the point the answer to your question is actually the third point here the + +0:17:45.679,0:17:50.780 +noise in stochastic gradient descent induces this phenomena known as + +0:17:50.780,0:17:54.770 +annealing and the diagram directly to the right of it illustrates this + +0:17:54.770,0:18:00.260 +phenomena so your network training landscapes have a bumpy structure to + +0:18:00.260,0:18:05.330 +them where there are lots of small minima that are not good minima that + +0:18:05.330,0:18:09.320 +appear on the path to the good minima so the theory that a lot of people + +0:18:09.320,0:18:13.760 +subscribe to is that SGD in particular the noise induced in the gradient + +0:18:13.760,0:18:18.919 +actually helps the optimizer to jump over these bad minima and the theory is + +0:18:18.919,0:18:22.669 +that these bad minima are quite small in the space and so they're easy to jump + +0:18:22.669,0:18:27.380 +over we're good minima that results in good performance around your own network + +0:18:27.380,0:18:34.070 +are larger and harder to skip so does this answer your question yes so besides + +0:18:34.070,0:18:39.440 +that annealing point of view there's there's actually a few other reasons so + 
+0:18:39.440,0:18:45.559 +we have a lot of redundancy in the information we get from each terms + +0:18:45.559,0:18:51.679 +gradient and using stochastic gradient lets us exploit this redundancy in a lot + +0:18:51.679,0:18:56.870 +of situations the gradient computed on a few hundred examples is almost as good + +0:18:56.870,0:19:01.460 +as a gradient computed on the full data set and often thousands of times cheaper + +0:19:01.460,0:19:05.300 +depending on your problem so it's it's hard to come up with a compelling reason + +0:19:05.300,0:19:09.320 +to use gradient descent given the success of stochastic gradient descent + +0:19:09.320,0:19:13.809 +and this is part of the reason why disgusted gradient said is one of the + +0:19:15.659,0:19:19.859 +best misses we have but gradient descent is one of the worst and in fact early + +0:19:19.859,0:19:23.580 +stages the correlation is remarkable this disgusted gradient can be + +0:19:23.580,0:19:28.499 +correlated up to a coefficient of 0.999 correlation coefficient to the true + +0:19:28.499,0:19:33.869 +gradient at those early steps of optimization so I want to briefly talk + +0:19:33.869,0:19:38.179 +about a something you need to know about I think Yann has already mentioned this + +0:19:38.179,0:19:43.259 +briefly but in practice we don't use individual instances in stochastic + +0:19:43.259,0:19:48.749 +gradient descent how we use mini batches of instances so I'm just using some + +0:19:48.749,0:19:52.649 +notation here but everybody uses different notation for mini batching so + +0:19:52.649,0:19:56.970 +you shouldn't get too attached to the notation but essentially at every step + +0:19:56.970,0:20:03.149 +you have some batch here I'm going to call it B an index with I for step and + +0:20:03.149,0:20:09.299 +you basically use the average of the gradients over this mini batch which is + +0:20:09.299,0:20:13.470 +a subset of your data rather than a single instance or the full full batch + +0:20:13.470,0:20:19.799 +now almost everybody will use this mini batch selected uniformly at random + +0:20:19.799,0:20:23.009 +some people use with replacement sampling and some people use without + +0:20:23.009,0:20:26.669 +with replacement sampling but the differences are not important for this + +0:20:26.669,0:20:31.729 +purposes you can use either and there's a lot of advantages to mini batching so + +0:20:31.729,0:20:35.220 +there's actually some good impelling theoretical reasons to not be any batch + +0:20:35.220,0:20:38.609 +but the practical reasons are overwhelming part of these practical + +0:20:38.609,0:20:43.950 +reasons are computational we make ammonia may utilize our hardware say at + +0:20:43.950,0:20:47.489 +1% efficiency when training some of the network's we use if we try and use + +0:20:47.489,0:20:51.239 +single instances and we get the most efficient utilization of the hardware + +0:20:51.239,0:20:55.979 +with batch sizes often in the hundreds if you're training on the typical + +0:20:55.979,0:20:59.999 +ImageNet data set for in for instance you don't use batch sizes less than + +0:20:59.999,0:21:08.429 +about 64 to get good efficiency maybe can go down to 32 but another important + +0:21:08.429,0:21:13.080 +application is distributed training and this is really becoming a big thing so + +0:21:13.080,0:21:17.309 +as was mentioned before people were recently able to Train ImageNet days + +0:21:17.309,0:21:21.639 +said that normally takes two days to train and not so long ago it took + +0:21:21.639,0:21:25.779 +in a week to train in 
only one hour and the way they did that was using very + +0:21:25.779,0:21:29.889 +large mini batches and along with using large many batches there are some tricks + +0:21:29.889,0:21:34.059 +that you need to use to get it to work it's probably not something that you + +0:21:34.059,0:21:37.149 +would cover an introductory lecture so I encourage you to check out that paper if + +0:21:37.149,0:21:40.409 +you're interested it's ImageNet in one hour + +0:21:40.409,0:21:45.279 +leaves face book authors I can't recall the first author at the moment as a side + +0:21:45.279,0:21:51.459 +note there are some situations where you need to do full batch optimization do + +0:21:51.459,0:21:54.759 +not use gradient descent in that situation I can't emphasize it enough to + +0:21:54.759,0:21:59.950 +not use gradient ascent ever if you have full batch data by far the most + +0:21:59.950,0:22:03.249 +effective method that is kind of plug-and-play you don't to think about + +0:22:03.249,0:22:08.859 +it is known as l-bfgs it's accumulation of 50 years of optimization research and + +0:22:08.859,0:22:12.519 +it works really well torch's implementation is pretty good + +0:22:12.519,0:22:17.379 +but the Scipy implementation causes some filtering code that was written 15 years + +0:22:17.379,0:22:23.440 +ago that is pretty much bulletproof so because they were those so that's a good + +0:22:23.440,0:22:26.619 +question classically you do need to use the full + +0:22:26.619,0:22:28.809 +data set now PyTorch implementation actually + +0:22:28.809,0:22:34.209 +supports using mini battery now this is somewhat of a gray area in that there's + +0:22:34.209,0:22:37.899 +really no theory to support the use of this and it may work well for your + +0:22:37.899,0:22:43.839 +problem or it may not so it could be worth trying I mean you want to use your + +0:22:43.839,0:22:49.929 +whole data set for each gradient evaluation or probably more likely since + +0:22:49.929,0:22:52.359 +it's very rarely you want to do that probably more likely you're solving some + +0:22:52.359,0:22:56.889 +other optimization problem that isn't isn't training in your network but maybe + +0:22:56.889,0:23:01.869 +some ancillary problem related and you need to solve an optimization problem + +0:23:01.869,0:23:06.669 +without this data point structure that doesn't summer isn't a sum of data + +0:23:06.669,0:23:12.239 +points yeah hopefully it was another question yep oh yes the question was + +0:23:12.239,0:23:16.869 +Yann recommended we used mini batches equal to the size of the number of + +0:23:16.869,0:23:20.079 +classes we have in our data set why is that reasonable that was the question + +0:23:20.079,0:23:23.889 +the answer is that we want any vectors to be representative of the full data + +0:23:23.889,0:23:28.329 +set and typically each class is quite distinct from the other classes in its + +0:23:28.329,0:23:33.490 +properties so about using a mini batch that contains on average + +0:23:33.490,0:23:36.850 +one instance from each class in fact we can enforce that explicitly although + +0:23:36.850,0:23:39.820 +it's not necessary by having an approximately equal to that + +0:23:39.820,0:23:44.590 +size we can assume it has the kind of structure of a food gradient so you + +0:23:44.590,0:23:49.870 +capture a lot of the correlations in the data you see with the full gradient and + +0:23:49.870,0:23:54.279 +it's a good guide especially if you're using training on CPU where you're not + +0:23:54.279,0:23:58.690 +constrained too much by hardware 
efficiency here when training on energy + +0:23:58.690,0:24:05.080 +on a CPU batch size is not critical for hardware utilization it's problem + +0:24:05.080,0:24:09.370 +dependent I would always recommend mini batching I don't think it's worth trying + +0:24:09.370,0:24:13.899 +size one as a starting point if you try to eke out small gains maybe that's + +0:24:13.899,0:24:19.779 +worth exploring yes there was another question so in the annealing example so + +0:24:19.779,0:24:24.760 +the question was why is the lost landscape so wobbly and this is this is + +0:24:24.760,0:24:31.600 +actually something that is very a very realistic depiction of actual law slams + +0:24:31.600,0:24:37.630 +codes for neural networks they're incredibly in the sense that they have a + +0:24:37.630,0:24:41.860 +lot of hills and valleys and this is something that is actively researched + +0:24:41.860,0:24:47.140 +now what we can say for instance is that there is a very large number of good + +0:24:47.140,0:24:52.720 +minima and and so hills and valleys we know this because your networks have + +0:24:52.720,0:24:56.590 +this combinatorial aspect to them you can reaper ammeter eyes a neural network + +0:24:56.590,0:25:00.309 +by shifting all the weights around and you can get in your work you'll know if + +0:25:00.309,0:25:04.750 +it outputs exactly the same output for whatever task you're looking at with all + +0:25:04.750,0:25:07.419 +these weights moved around and that correspondence essentially to a + +0:25:07.419,0:25:12.460 +different location in parameter space so given that there's an exponential number + +0:25:12.460,0:25:16.270 +of these possible ways of rearranging the weights to get the same network + +0:25:16.270,0:25:18.940 +you're going to end up with the space that's incredibly spiky exponential + +0:25:18.940,0:25:24.789 +number of these spikes now the reason why these these local minima appear that + +0:25:24.789,0:25:27.580 +is something that is still active research so I'm not sure I can give you + +0:25:27.580,0:25:32.890 +a great answer there but they're definitely observed in practice and what + +0:25:32.890,0:25:39.000 +I can say is they appear to be less of a problem we've very + +0:25:39.090,0:25:42.810 +like close to state-of-the-art networks so these local minima were considered + +0:25:42.810,0:25:47.940 +big problems 15 years ago but so much at the moment people essentially never hit + +0:25:47.940,0:25:52.350 +them in practice when using kind of recommended parameters and things like + +0:25:52.350,0:25:55.980 +that when you use very large batches you can run into these problems it's not + +0:25:55.980,0:25:59.490 +even clear that the the poor performance when using large batches is even + +0:25:59.490,0:26:03.900 +attributable to these larger minima to these local minima so this is yes to + +0:26:03.900,0:26:08.550 +ongoing research yes the problem is you can't really see this local structure + +0:26:08.550,0:26:10.920 +because we're in this million dimensional space it's not a good way to + +0:26:10.920,0:26:15.090 +see it so yeah I don't know if people might have explored that already I'm not + +0:26:15.090,0:26:18.840 +familiar with papers on that but I bet someone has looked at it so you might + +0:26:18.840,0:26:23.520 +want to google that yeah so a lot of the advances in neural network design have + +0:26:23.520,0:26:27.420 +actually been in reducing this bumpiness in a lot of ways so this is part of the + +0:26:27.420,0:26:30.510 +reason why it's not considered a huge problem 
anymore whether it was it was + +0:26:30.510,0:26:35.960 +considered a big problem in the past there's any other questions yes so it's + +0:26:35.960,0:26:41.550 +it is hard to see but there are certain things you can do that we make the the + +0:26:41.550,0:26:46.830 +peaks and valleys smaller certainly and by rescaling some parts the neural + +0:26:46.830,0:26:50.010 +network you can amplify certain directions the curvature in certain + +0:26:50.010,0:26:54.320 +directions can be stretched and squashed the particular innovation residual + +0:26:54.320,0:27:00.000 +connections that were mentioned they're very easy to see that they smooth out + +0:27:00.000,0:27:03.600 +the the loss in fact you can kind of draw two line between two points in the + +0:27:03.600,0:27:06.570 +space and you can see what happens along that line that's really the best way we + +0:27:06.570,0:27:10.170 +have a visualizing million dimensional spaces so I turn him into one dimension + +0:27:10.170,0:27:13.200 +and you can see that it's that it's a much nicer between these two points + +0:27:13.200,0:27:17.370 +whatever two points you choose when using these residual connections I'll be + +0:27:17.370,0:27:21.570 +talking all about dodging or later in the lecture so yeah if hopefully I'll + +0:27:21.570,0:27:24.870 +answer that question without you having to ask it again but we'll see + +0:27:24.870,0:27:31.560 +thanks any other questions yes so l-bfgs excellent method it's it's kind of a + +0:27:31.560,0:27:34.650 +constellation of optimization researchers that we still use SGD a + +0:27:34.650,0:27:40.470 +method invented in the 60s or earlier is still state of the art but there has + +0:27:40.470,0:27:44.880 +been some innovation in fact only a couple years later but there was some + +0:27:44.880,0:27:49.180 +innovation since the invention of sed and one of these innovations is + +0:27:49.180,0:27:54.730 +and I'll talk about another later so momentum it's a trick + +0:27:54.730,0:27:57.520 +that you should pretty much always be using when you're using stochastic + +0:27:57.520,0:28:00.880 +gradient descent it's worth be going into this in a little bit of detail + +0:28:00.880,0:28:04.930 +you'll often be tuning the momentum parameter and your network and it's + +0:28:04.930,0:28:09.340 +useful to understand what it's actually doing when you're tuning up so part of + +0:28:09.340,0:28:15.970 +the problem with momentum it's very misunderstood and this can be explained + +0:28:15.970,0:28:18.760 +by the fact that there's actually three different ways of writing momentum that + +0:28:18.760,0:28:21.790 +look completely different but turn out to be equivalent I'm only going to + +0:28:21.790,0:28:25.120 +present two of these ways because the third way is not as well known but is + +0:28:25.120,0:28:30.070 +actually in my opinion the correct way to view it I don't talk about my + +0:28:30.070,0:28:32.470 +research here so we'll talk about how it's actually implemented in the + +0:28:32.470,0:28:37.390 +packages you'll be using and this first form here is what's actually implemented + +0:28:37.390,0:28:42.040 +in PyTorch and other software that you'll be using here we maintain two variables + +0:28:42.040,0:28:47.650 +now you'll see lots of papers using different notation here P is the + +0:28:47.650,0:28:51.580 +notation used in physics for momentum and it's very common to use that also as + +0:28:51.580,0:28:55.720 +the momentum variable when talking about sed with momentum so I'll be following + 
+0:28:55.720,0:29:01.000 +that convention so instead of having a single iterate we now have to Eretz P + +0:29:01.000,0:29:06.940 +and W and at every step we update both and this is quite a simple update so the + +0:29:06.940,0:29:13.060 +P update involves adding to the old P and instead of adding exactly to the old + +0:29:13.060,0:29:16.720 +P we kind of damp the old P we reduce it by multiplying it by a constant that's + +0:29:16.720,0:29:21.310 +worse than one so reduce the old P and here I'm using β̂ as the constant + +0:29:21.310,0:29:24.880 +there so that would probably be 0.9 in practice a small amount of damping and + +0:29:24.880,0:29:32.650 +we add to that the new gradient so P is kind of this accumulated gradient buffer + +0:29:32.650,0:29:38.170 +you can think of where new gradients come in at full value and past gradients + +0:29:38.170,0:29:42.490 +are reduced at each step by a certain factor usually 0.9 which used to reduce + +0:29:42.490,0:29:47.910 +reduced so the buffer tends to be a some sort of running sum of gradients and + +0:29:47.910,0:29:53.080 +it's basically we just modify this to custer gradient two-step descent step by + +0:29:53.080,0:29:56.440 +using this P instead of the negative gradient instead of the gradient sorry + +0:29:56.440,0:30:00.260 +using P instead of the in the update since the two line formula + +0:30:00.260,0:30:05.790 +it may be better to understand this by the second form that I put below this is + +0:30:05.790,0:30:09.600 +equivalent you've got a map the β with a small transformation so it's not + +0:30:09.600,0:30:12.750 +exactly the same β between the two methods but it's practically the same + +0:30:12.750,0:30:20.300 +for in practice so these are essentially the same up to reap romanization and + +0:30:21.260,0:30:25.530 +this film I think is maybe clearer this form is called the stochastic heavy ball + +0:30:25.530,0:30:31.170 +method and here our update still includes the gradient but we're also + +0:30:31.170,0:30:40.020 +adding on a multiplied copy of the past direction we traveled in now what does + +0:30:40.020,0:30:43.320 +this mean what are we actually doing here so it's actually not too difficult + +0:30:43.320,0:30:49.170 +to visualize and I'm going to kind of use a visualization from a distilled + +0:30:49.170,0:30:52.710 +publication you can see the dress at the bottom there and I disagree with a lot + +0:30:52.710,0:30:55.620 +of what they talked about in that document but I like the visualizations + +0:30:55.620,0:31:02.820 +so let's use had and I'll explain why I disagreed some regards later but it's + +0:31:02.820,0:31:07.440 +quite simple so you can think of momentum as the physical process and I + +0:31:07.440,0:31:10.650 +mention those of you have done introductory physics courses would have + +0:31:10.650,0:31:17.340 +covered this so momentum is the property of something to keep moving in the + +0:31:17.340,0:31:21.330 +direction that's currently moving in all right if you're familiar with Newton's + +0:31:21.330,0:31:24.240 +laws things want to keep going in the direction they're going and this is + +0:31:24.240,0:31:28.860 +momentum and when you do this mapping the physics the gradient is kind of a + +0:31:28.860,0:31:34.020 +force that is pushing you're literate which by this analogy is a heavy ball + +0:31:34.020,0:31:39.860 +it's pushing this heavy ball at each point so rather than making dramatic + +0:31:39.860,0:31:44.030 +changes in the direction we travel at every step which is shown in that left + 
+0:31:44.030,0:31:48.480 +diagram instead of making these dramatic changes we're going to make kind of a + +0:31:48.480,0:31:51.480 +bit more modest changes so when we realize we're going in the wrong + +0:31:51.480,0:31:55.740 +direction we kind of do a u-turn instead of putting the hand brake on and + +0:31:55.740,0:31:59.440 +swinging around it turns out in a lot of practical + +0:31:59.440,0:32:01.810 +problems this gives you a big improvement so here you can see you're + +0:32:01.810,0:32:06.280 +getting much closer to the solution by the end of it with much less oscillation + +0:32:06.280,0:32:10.840 +and you can see this oscillation so it's kind of a fact of life if you're using + +0:32:10.840,0:32:14.650 +gradient descent type methods so here we talk about momentum on top of gradient + +0:32:14.650,0:32:18.550 +descent in the visualization you're gonna get this oscillation it's just a + +0:32:18.550,0:32:22.240 +property of gradient descent no way to get rid of it without modifying the + +0:32:22.240,0:32:27.490 +method and we're meant to them to some degree dampens this oscillation I've got + +0:32:27.490,0:32:30.760 +another visualization here which will kind of give you an intuition for how + +0:32:30.760,0:32:34.660 +this β parameter controls things now the Department of these to be greater + +0:32:34.660,0:32:39.280 +than zero if it's equal to zero you distr in gradient descent and it's gotta + +0:32:39.280,0:32:43.330 +be less than one otherwise the Met everything blows up as you start + +0:32:43.330,0:32:45.970 +including past gradients with more and more weight over times it's gotta be + +0:32:45.970,0:32:54.070 +between zero and one and typical values range from you know small 0.25 up to + +0:32:54.070,0:32:59.230 +like 0.99 so in practice you can get pretty close to one and what happens is + +0:32:59.230,0:33:09.130 +the smaller values they result in you're changing direction quicker okay so in + +0:33:09.130,0:33:12.820 +this diagram you can see on the left with the small β you as soon as you + +0:33:12.820,0:33:16.120 +get close to the solution you kind of change direction pretty rapidly and head + +0:33:16.120,0:33:19.900 +towards a solution when you use these larger βs it takes longer for you to + +0:33:19.900,0:33:23.530 +make this dramatic turn you can think of it as a car with a bad turning circle + +0:33:23.530,0:33:26.170 +takes you quite a long time to get around that corner and head towards + +0:33:26.170,0:33:31.180 +solution now this may seem like a bad thing but actually in practice this + +0:33:31.180,0:33:35.110 +significantly dampens the oscillations that you get from gradient descent and + +0:33:35.110,0:33:40.450 +that's the nice property of it now in terms of practice I can give you some + +0:33:40.450,0:33:45.760 +pretty clear guidance here you pretty much always want to use momentum it's + +0:33:45.760,0:33:48.820 +pretty hard to find problems where it's actually not beneficial to some degree + +0:33:48.820,0:33:52.960 +now part of the reason for this is it's just an extra parameter now typically + +0:33:52.960,0:33:55.870 +when you take some method and just add more parameters to it you can usually + +0:33:55.870,0:34:01.000 +find some value of that parameter that makes us slightly better now that is + +0:34:01.000,0:34:04.330 +sometimes the case here but often these improvements from using momentum are + +0:34:04.330,0:34:08.810 +actually quite substantial and using a momentum value of point nine is + +0:34:08.810,0:34:13.610 +really a default 
value used in machine learning quite often and often in some + +0:34:13.610,0:34:19.010 +situations 0.99 may be better so I would recommend trying both values if you have + +0:34:19.010,0:34:24.770 +time otherwise just try point nine but I have to do a warning the way momentum is + +0:34:24.770,0:34:29.300 +stated in this expression if you look at it carefully when we increase the + +0:34:29.300,0:34:36.440 +momentum we kind of increase the step size now it's not the step size of the + +0:34:36.440,0:34:39.380 +current gradient so the current gradient is included in the step with the same + +0:34:39.380,0:34:43.399 +strengths but past gradients become included in the step with a higher + +0:34:43.399,0:34:48.290 +strength when you increase momentum now when you write momentum in other forms + +0:34:48.290,0:34:53.179 +this becomes a lot more obvious so this firm kind of occludes that but what you + +0:34:53.179,0:34:58.820 +should generally do when you change momentum you want to change it so that + +0:34:58.820,0:35:04.310 +you have your step size divided by one minus β is your new step size so if + +0:35:04.310,0:35:07.790 +your old step size was using a certain B do you want to map it to that equation + +0:35:07.790,0:35:11.690 +then map it back to get the the new step size now this may be very modest change + +0:35:11.690,0:35:16.400 +but if you're going from momentum 0.9 to momentum 0.99 you may need to reduce + +0:35:16.400,0:35:20.480 +your learning rate by a factor of 10 approximately so just be wary of that + +0:35:20.480,0:35:22.850 +you can't expect to keep the same learning rate and change the momentum + +0:35:22.850,0:35:27.260 +parameter at wallmart work now I want to go into a bit of detail about why + +0:35:27.260,0:35:31.880 +momentum works is very misunderstood and the explanation you'll see in that + +0:35:31.880,0:35:38.570 +Distilled post is acceleration and this is certainly a contributor to the + +0:35:38.570,0:35:44.380 +performance of momentum now acceleration is a topic yes if you've got a question + +0:35:44.380,0:35:48.170 +the question was is there a big difference between using momentum and + +0:35:48.170,0:35:54.890 +using a mini batch of two and there is so momentum has advantages in for when + +0:35:54.890,0:35:59.150 +using gradient descent as well as stochastic gradient descent so in fact + +0:35:59.150,0:36:03.110 +this acceleration explanation were about to use applies both in the stochastic + +0:36:03.110,0:36:07.520 +and non stochastic case so no matter what batch size you're going to use the + +0:36:07.520,0:36:13.100 +benefits of momentum still are shown now it also has benefits in the stochastic + +0:36:13.100,0:36:17.000 +case as well which I'll cover in a slide or two so the answer is it's quite + +0:36:17.000,0:36:19.579 +distinct from batch size and you shouldn't complete them + +0:36:19.579,0:36:22.459 +learn it like really you should be changing your learning rate when you + +0:36:22.459,0:36:26.239 +change your bat size rather than changing the momentum and for very large + +0:36:26.239,0:36:30.380 +batch sizes there's a clear relationship between learning rate and batch size but + +0:36:30.380,0:36:34.729 +for small batch sizes it's not clear so it's problem dependent any other + +0:36:34.729,0:36:38.599 +questions before I move on on momentum yes yes it's it's just blow up so it's + +0:36:38.599,0:36:42.979 +actually in the in the in the physics interpretation it's conservation of + +0:36:42.979,0:36:48.499 +momentum would be exactly equal 
to one now that's not good because if you're in + +0:36:48.499,0:36:51.890 +a world with no friction then you drop a heavy ball somewhere it's gonna keep + +0:36:51.890,0:36:56.479 +moving forever it's not good stuff so we need some dampening and this is where + +0:36:56.479,0:37:01.069 +the physics interpretation breaks down so you do need some damping now now you + +0:37:01.069,0:37:05.209 +can imagine if you use a larger value than one those past gradients get + +0:37:05.209,0:37:09.410 +amplified every step so in fact the first gradient you evaluate in your + +0:37:09.410,0:37:13.940 +network is not relevant information content wise later in optimization but + +0:37:13.940,0:37:16.910 +if it used to be the larger than 1 it would dominate the step that you're + +0:37:16.910,0:37:21.170 +using does that answer your question yeah ok any other questions about + +0:37:21.170,0:37:26.359 +momentum before we move on they are for a particular value of β yes it's + +0:37:26.359,0:37:30.859 +strictly equivalent it's not very hard to you should be able to do it in like + +0:37:30.859,0:37:38.359 +two lines if you try and do the equivalence yourself no the bidders are + +0:37:38.359,0:37:40.910 +not quite the same but the the γ is the same that's why I use the same + +0:37:40.910,0:37:45.319 +notation for it oh yes so that's what I mentioned yes so when you change β + +0:37:45.319,0:37:48.349 +you want to scale your learning rate by the learning rate divided by one over + +0:37:48.349,0:37:52.369 +β so in this form I'm not sure if it appears in this form it could be a + +0:37:52.369,0:37:55.969 +mistake but I think I'm okay here I think it's not in this formula but yeah + +0:37:55.969,0:37:59.269 +what you definitely when you change β you need to change learning rate as well + +0:37:59.269,0:38:09.300 +to keep things balanced yeah Oh either averaging form it's probably + +0:38:09.300,0:38:13.830 +not worth going over but you can think of it as momentum is basically changing + +0:38:13.830,0:38:17.850 +the point that you evaluate the gradient at in the standard firm you evaluate the + +0:38:17.850,0:38:22.230 +gradient at this W point in the inner averaging form you take a running + +0:38:22.230,0:38:25.890 +average of the points you've been evaluating the Grady Nutt and you + +0:38:25.890,0:38:30.630 +evaluate at that point so it's basically instead of averaging gradients to + +0:38:30.630,0:38:37.530 +average points it's clear sense Jewell yes yes so acceleration now this is + +0:38:37.530,0:38:43.260 +something you can spend the whole career studying and it's it's somewhat poorly + +0:38:43.260,0:38:47.070 +understood now if you try and read Nesterov original work on it now + +0:38:47.070,0:38:53.520 +Nesterov is kind of the grandfather of modern optimization in practically half + +0:38:53.520,0:38:56.460 +the methods we use are named after him to some degree which is can be confusing + +0:38:56.460,0:39:01.740 +at times and in the 80s he came up with this formulation he didn't write it in + +0:39:01.740,0:39:04.650 +this form he wrote it in another form which people realized a while later + +0:39:04.650,0:39:09.450 +could be written in this form and his analysis is also very opaque and + +0:39:09.450,0:39:15.590 +originally written in Russian doesn't help no for understanding unfortunately + +0:39:15.590,0:39:21.180 +those nice people the NSA translated all of the Russian literature back then so + +0:39:21.180,0:39:27.330 +so we have access to them and it's actually a very small modification of 
+ +0:39:27.330,0:39:31.890 +the momentum step but I think that small modification belittles what it's + +0:39:31.890,0:39:36.600 +actually doing it's really not the same method at all what I can say is with + +0:39:36.600,0:39:41.400 +Nesterov Swimmer momentum if you very carefully choose these constants you can + +0:39:41.400,0:39:46.050 +get what's known as accelerated convergence now this doesn't apply in + +0:39:46.050,0:39:49.560 +your networks but for convex problems I won't go into details of convexity but + +0:39:49.560,0:39:52.230 +some of you may know what that means it's kind of a simple structure but + +0:39:52.230,0:39:55.740 +convex problems it's a radically improved convergence rate from this + +0:39:55.740,0:39:59.940 +acceleration but only for very carefully chosen constants and you really can't + +0:39:59.940,0:40:03.030 +choose these carefully ahead of time so you've got to do quite a large search + +0:40:03.030,0:40:05.640 +over your parameters your hyper parameters sorry to find the right + +0:40:05.640,0:40:10.710 +constants to get that acceleration what I can say is this actually occurs for + +0:40:10.710,0:40:14.779 +quadratics when using regular momentum and this is confused a lot of people + +0:40:14.779,0:40:18.559 +so you'll see a lot of people say that momentum is an accelerated method it's + +0:40:18.559,0:40:23.449 +excited only for quadratics and even then it's it's a little bit iffy I would + +0:40:23.449,0:40:27.529 +not recommend using it for quadratics use conjugate gradients or some new + +0:40:27.529,0:40:33.499 +methods that have been developed over the last few years and this is + +0:40:33.499,0:40:36.919 +definitely a contributing factor to our momentum works so well in practice and + +0:40:36.919,0:40:42.499 +there's definitely some acceleration going on but this acceleration is hard + +0:40:42.499,0:40:46.669 +to realize when you have stochastic gradients now when you look at what + +0:40:46.669,0:40:51.679 +makes acceleration work noise really kills it and it's it's hard to believe + +0:40:51.679,0:40:55.549 +that it's the main factor contributing to the performance but it's certainly + +0:40:55.549,0:40:59.989 +there and the the still post I mentioned attributes or the performance of + +0:40:59.989,0:41:02.689 +momentum to acceleration but I wouldn't go that quite that far but it's + +0:41:02.689,0:41:08.390 +definitely a contributing factor but probably the practical and provable + +0:41:08.390,0:41:13.669 +reason why acceleration why knows sorry why momentum helps is noise smoothing + +0:41:13.669,0:41:21.619 +and this is very intuitive momentum averages gradients in a sense we keep + +0:41:21.619,0:41:25.099 +this running buffer gradients that we use as a step instead of individual + +0:41:25.099,0:41:30.259 +gradients this is kind of a form of averaging and it turns out that when you + +0:41:30.259,0:41:33.229 +use s to D without momentum to prove anything at all about it + +0:41:33.229,0:41:37.449 +you actually have to work with the average of all the points you visited + +0:41:37.449,0:41:42.380 +you can get really weak bounds on the last point that you ended up at but + +0:41:42.380,0:41:45.349 +really you've got to work with this average of points and this is suboptimal + +0:41:45.349,0:41:48.529 +like we never want to actually take this average in practice it's heavily + +0:41:48.529,0:41:52.099 +weighted with points that we visited a long time ago which may be irrelevant + +0:41:52.099,0:41:55.159 +and in fact this averaging doesn't 
work very well in practice for neural

0:41:55.159,0:41:59.150
networks it's really only important for convex problems but nevertheless it's

0:41:59.150,0:42:03.380
necessary to analyze regular SGD and one of the remarkable facts about momentum

0:42:03.380,0:42:09.019
is actually this averaging is no longer theoretically necessary so essentially

0:42:09.019,0:42:14.509
momentum adds smoothing during optimization that makes it so

0:42:14.509,0:42:19.459
the last point you visit is still a good approximation to the solution with SGD

0:42:19.459,0:42:23.329
really you want to average a whole bunch of the last points you've seen in order to

0:42:23.329,0:42:26.700
get a good approximation to the solution now let me illustrate that

0:42:26.700,0:42:31.190
here so this is a very typical example of what happens when using SGD

0:42:31.190,0:42:36.329
at the beginning you make great progress the gradient is essentially

0:42:36.329,0:42:39.960
almost the same as the stochastic gradient so first few steps you make

0:42:39.960,0:42:44.490
great progress towards the solution but then you end up in this ball now recall here

0:42:44.490,0:42:47.579
that's a valley that we're heading down so this ball here is kind of the floor

0:42:47.579,0:42:53.550
of the valley and you kind of bounce around in this floor and the most common

0:42:53.550,0:42:56.579
solution to this is if you reduce your learning rate you'll bounce around

0:42:56.579,0:43:01.290
slower not exactly a great solution but it's one way to handle it but when you

0:43:01.290,0:43:04.710
use SGD with momentum you can kind of smooth out this bouncing around and

0:43:04.710,0:43:08.160
you kind of just wheel around now the path is not always going to be

0:43:08.160,0:43:12.300
this corkscrew style path it's actually quite random you could kind of wobble

0:43:12.300,0:43:15.990
left and right but when I seeded it with 42 this is what it spat out so that's

0:43:15.990,0:43:20.790
what I'm using here you typically get this corkscrew you get this corkscrewing

0:43:20.790,0:43:24.660
for this set of parameters and yeah I think this is a good explanation so some

0:43:24.660,0:43:27.960
combination of acceleration and noise smoothing is why momentum works

0:43:27.960,0:43:33.180
oh yes yes so I should say that when we inject noise here the gradient may not

0:43:33.180,0:43:37.470
even be the right direction to travel in fact it could be in the opposite

0:43:37.470,0:43:40.800
direction from where you want to go and this is why you kind of bounce around in

0:43:40.800,0:43:46.410
the valley there so in fact you can see here that the first step with

0:43:46.410,0:43:49.980
SGD is practically orthogonal to the level set there that's because it is

0:43:49.980,0:43:52.770
such a good step at the beginning but once you get further down it can point

0:43:52.770,0:44:00.300
in pretty much any direction vaguely around the solution so SGD with

0:44:00.300,0:44:03.540
momentum is currently the state of the art optimization method for a lot of machine

0:44:03.540,0:44:08.730
learning problems so you'll probably be using it in your course for a lot of

0:44:08.730,0:44:12.990
problems but there have been some other innovations over the years and these are

0:44:12.990,0:44:16.829
particularly useful for poorly conditioned problems now as I mentioned

0:44:16.829,0:44:19.770
earlier in the lecture some
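A small sketch of the noise-smoothing effect described above: SGD and SGD with momentum are run on a toy noisy quadratic, and the distance of the last iterate from the optimum is compared. The problem, the noise level, and the constants are assumptions for illustration; the learning rate is rescaled by (1 - β) so the effective step size stays matched, as the lecture advises when changing β.

```python
import numpy as np

rng = np.random.default_rng(42)
A = np.diag([10.0, 1.0])                                  # toy quadratic, minimum at the origin
noisy_grad = lambda w: A @ w + rng.normal(scale=1.0, size=2)   # true gradient plus injected noise

def last_iterate_error(beta, eff_lr=0.02, steps=2000):
    gamma = eff_lr * (1 - beta)       # keep the effective step size gamma / (1 - beta) fixed
    w, p = np.array([5.0, 5.0]), np.zeros(2)
    for _ in range(steps):
        p = beta * p + noisy_grad(w)  # running buffer of (noisy) gradients
        w = w - gamma * p
    return np.linalg.norm(w)          # how far the *last* point is from the solution

print("plain SGD     :", last_iterate_error(beta=0.0))
print("SGD + momentum:", last_iterate_error(beta=0.9))    # typically much closer on this toy problem
```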
problems have this kind of well condition + +0:44:19.770,0:44:22.530 +property that we can't really characterize for neural networks but we + +0:44:22.530,0:44:27.450 +can measure it by the test that if s to D works then it's well conditioned + +0:44:27.450,0:44:31.470 +eventually there doesent works and if I must be walking poorly conditioned so we + +0:44:31.470,0:44:34.410 +have other methods we can handle we can use to handle this in some + +0:44:34.410,0:44:39.690 +situations and these generally are called adaptive methods now you need to + +0:44:39.690,0:44:43.500 +be a little bit careful because what are you adapting to people in literature use + +0:44:43.500,0:44:51.780 +this nomenclature for adapting learning rates adapting momentum parameters but + +0:44:51.780,0:44:56.339 +in our our situation we're talk about a specific type of adaptivity roman this + +0:44:56.339,0:45:03.780 +adaptivity is individual learning rates now what I mean by that so in the + +0:45:03.780,0:45:06.869 +simulation I already showed you a stochastic gradient descent + +0:45:06.869,0:45:10.619 +I used a global learning rate by that I mean every single rate in your network + +0:45:10.619,0:45:16.800 +is updated using an equation with the same γ now γ could vary over + +0:45:16.800,0:45:21.720 +time step so you used γ K in the notation but often you use a fixed + +0:45:21.720,0:45:26.310 +camera for quite a long time but for adaptive methods we want to adapt a + +0:45:26.310,0:45:30.240 +learning rate for every weight individually and we want to use + +0:45:30.240,0:45:37.109 +information we get from gradients for each weight to adapt this so this seems + +0:45:37.109,0:45:39.900 +like the obvious thing to do and people have been trying to get this stuff to + +0:45:39.900,0:45:43.200 +work for decades and we're kind of stumbled upon some methods that work and + +0:45:43.200,0:45:48.510 +some that don't but I want to ask for questions here if there's any any + +0:45:48.510,0:45:53.040 +explanation needed so I can say that it's not entirely clear why you need to + +0:45:53.040,0:45:56.880 +do this right if your network is well conditioned you don't need to do this + +0:45:56.880,0:46:01.349 +potentially but often the network's we use in practice have very different + +0:46:01.349,0:46:05.069 +structure in different parts of the network so for instance the early parts + +0:46:05.069,0:46:10.619 +of your convolutional neural network may be very shallow convolutional layers on + +0:46:10.619,0:46:14.849 +large images later in the network you're going to be doing convolutions with + +0:46:14.849,0:46:18.359 +large numbers of channels on small images now these operations are very + +0:46:18.359,0:46:21.150 +different and there's no reason to believe that a learning rate that works + +0:46:21.150,0:46:26.310 +well for one would work well for the other and this is why the adaptive + +0:46:26.310,0:46:28.140 +learning rates can be useful any questions here + +0:46:28.140,0:46:32.250 +yes so unfortunately there's no good definition for neural networks we + +0:46:32.250,0:46:35.790 +couldn't measure it even if there was a good definition so I'm going to use it + +0:46:35.790,0:46:40.109 +in a vague sense that it actually doesn't works and it's poorly + +0:46:40.109,0:46:42.619 +conditioned yes so in the sort of quadratic case if + +0:46:45.830,0:46:51.380 +you recall I have an explicit definition of this condition number L over μ. 
+ +0:46:51.380,0:46:55.910 +L being maximized in value μ being smallest eigen value and yeah the large + +0:46:55.910,0:47:00.140 +of this gap between largest larger and smaller eigen value the worst condition + +0:47:00.140,0:47:03.320 +it is this does not imply if in your network so that μ does not exist in + +0:47:03.320,0:47:07.610 +your networks L still has some information in it but I wouldn't say + +0:47:07.610,0:47:12.800 +it's a determining factor there's just a lot going on so there are some ways that + +0:47:12.800,0:47:15.619 +your looks behave a lot like simple problems but there are other ways where + +0:47:15.619,0:47:23.090 +we just kind of hang wave and say that they like them yeah yeah yes so for this + +0:47:23.090,0:47:25.910 +particular network this is a network that actually isn't too poorly + +0:47:25.910,0:47:30.920 +conditioned already in fact this is a VDD 16 which is practically the best net + +0:47:30.920,0:47:34.490 +method best network when you had a train before the invention of certain + +0:47:34.490,0:47:37.369 +techniques to improve conditioning so this is almost the best of first + +0:47:37.369,0:47:40.910 +condition you can actually get and there are a lot of the structure of this + +0:47:40.910,0:47:45.140 +network is actually defined by this conditioning like we double the number + +0:47:45.140,0:47:48.680 +of channels after certain steps because that seems to result in networks at a + +0:47:48.680,0:47:53.600 +world condition rather than any other reason but it's certainly what you can + +0:47:53.600,0:47:57.170 +say is that weights very light the network have very large effect on the + +0:47:57.170,0:48:02.630 +output that very last layer there with if there are 4096 weights in it that's a + +0:48:02.630,0:48:06.400 +very small number of whites this network has millions of whites I believe those + +0:48:06.400,0:48:10.640 +4096 weights have a very strong effect on the output because they directly + +0:48:10.640,0:48:14.450 +dictate that output and for that reason you generally want to use smaller + +0:48:14.450,0:48:19.190 +learning rates for those whereas yeah weights early in the network some of + +0:48:19.190,0:48:21.770 +them might have a large effect but especially when you've initialized + +0:48:21.770,0:48:25.910 +network of randomly they typically will have a smaller effect of those those + +0:48:25.910,0:48:29.840 +earlier weights and this is very hand wavy and the reason why is because we + +0:48:29.840,0:48:33.859 +really don't understand this well enough for me to give you a precise precise + +0:48:33.859,0:48:41.270 +statement here 120 million weights in this network actually so yeah so that + +0:48:41.270,0:48:47.710 +last layer is like 4096 by 4096 matrix so + +0:48:47.950,0:48:53.510 +yeah okay any other questions yeah yes I would recommend only using them when + +0:48:53.510,0:48:59.120 +your problem doesn't have a structure that decomposes into a large sum of + +0:48:59.120,0:49:04.880 +similar things okay yeah that's a bit of a mouthful but sut works well when you + +0:49:04.880,0:49:09.830 +have an objective that is a sum where each term of the sum is is vaguely + +0:49:09.830,0:49:14.990 +comparable so in machine learning each sub term in this sum is a loss of one + +0:49:14.990,0:49:18.290 +data point and these have very similar structures individual losses that's a + +0:49:18.290,0:49:21.080 +hand-wavy sense that they have very similar structure because of course each + +0:49:21.080,0:49:25.220 +data point could be quite 
different but when your problem doesn't have a large

0:49:25.220,0:49:30.440
sum as the main part of its structure then L-BFGS would be useful that's the

0:49:30.440,0:49:35.840
general answer I doubt you'll make use of it in this course L-BFGS I doubt it but

0:49:35.840,0:49:40.660
it can be very handy for small networks you can experiment around with it with

0:49:40.660,0:49:44.720
the LeNet network or something which I'm sure you'll probably use in this course

0:49:44.720,0:49:51.230
you could experiment with L-BFGS probably and have some success there one

0:49:51.230,0:49:58.670
of the kind of founding techniques in modern neural network training is RMSprop

0:49:58.670,0:50:03.680
and I'm going to talk about this here now at some point kind of the standard

0:50:03.680,0:50:07.640
practice in the field of optimization as in research in optimization kind of

0:50:07.640,0:50:10.640
diverged from what people were actually doing when training neural networks and

0:50:10.640,0:50:14.150
this RMSprop was kind of the fracturing point where we all went off in different

0:50:14.150,0:50:19.820
directions and this RMSprop is usually attributed to Geoffrey Hinton's slides

0:50:19.820,0:50:23.380
which he then attributes to an unpublished paper from someone else

0:50:23.380,0:50:28.790
which is really unsatisfying to be citing someone's slides in a paper but

0:50:28.790,0:50:34.400
anyway it's a method that has no proof behind why it works but

0:50:34.400,0:50:38.050
it's similar to methods that you can prove work so that's at least something

0:50:38.050,0:50:43.520
and it works pretty well in practice and that's why a lot of people use it so I want

0:50:43.520,0:50:46.310
to give you that kind of introduction before I explain what it actually

0:50:46.310,0:50:51.020
is and RMSprop stands for root mean squared propagation

0:50:51.020,0:50:54.579
this was from the era where everything we did with neural networks we

0:50:54.579,0:50:58.690
called propagation such-and-such like backprop which now we call deep so it

0:50:58.690,0:51:02.920
would probably be called RMS deep prop or something if it was invented now and

0:51:02.920,0:51:08.470
it's a little bit of a modification so it's still a two line algorithm but a little

0:51:08.470,0:51:11.200
bit different so I'm gonna go over these terms in some detail because it's

0:51:11.200,0:51:19.450
important to understand this now we keep around this V buffer now this is

0:51:19.450,0:51:22.720
not a momentum buffer okay so we're using different notation here we're doing

0:51:22.720,0:51:27.069
something different and I'm going to use some notation that some people

0:51:27.069,0:51:30.760
really hate but I think it's convenient I'm going to write the element wise

0:51:30.760,0:51:36.040
square of a vector just by squaring the vector this is not really confusing

0:51:36.040,0:51:40.390
notationally in almost all situations but it's a nice way to write it so here

0:51:40.390,0:51:43.480
I'm writing the gradient squared I really mean you take every element in

0:51:43.480,0:51:47.109
that vector million element vector or whatever it is and square each element

0:51:47.109,0:51:51.309
individually so this V update is what's known as an exponential moving

0:51:51.309,0:51:55.480
average can I have a quick show of hands who's familiar with exponential

0:51:55.480,0:51:59.890
moving averages I want to know if I need to talk
about it in some more seems like + +0:51:59.890,0:52:03.270 +it's probably need to explain it in some depth but in expose for a moving average + +0:52:03.270,0:52:08.020 +it's a standard way this has been used for many many decades across many fields + +0:52:08.020,0:52:14.650 +for maintaining an average that are the quantity that may change over time okay + +0:52:14.650,0:52:19.630 +so when a quantity is changing over time we need to put larger weights on newer + +0:52:19.630,0:52:24.210 +values because they provide more information and one way to do that is + +0:52:24.210,0:52:30.700 +down weight old values exponentially and when you do this exponentially you mean + +0:52:30.700,0:52:36.880 +that the weight of an old value from say ten steps ago will have weight alpha to + +0:52:36.880,0:52:41.109 +the ten in your thing so that's where the exponential comes in the output of + +0:52:41.109,0:52:43.900 +the ten now it's that's not really in the notation and in the notation at each + +0:52:43.900,0:52:49.390 +step we just download the pass vector by this alpha constant and as if you can + +0:52:49.390,0:52:53.440 +imagine in your head things in that buffer the V buffer that are very old at + +0:52:53.440,0:52:57.760 +each step they get downloaded by alpha at every step and just as before alpha + +0:52:57.760,0:53:01.359 +here is something between zero and one so we can't use values greater than one + +0:53:01.359,0:53:04.280 +there so this will damp those all values until they no longer + +0:53:04.280,0:53:08.180 +the exponential moving average so this method keeps an exponential moving + +0:53:08.180,0:53:12.860 +average of the second moment I mean non-central second moment so we do not + +0:53:12.860,0:53:18.920 +subtract off the mean here the PyTorch implementation has a switch where you + +0:53:18.920,0:53:22.370 +can tell it to subtract off the mean play with that if you like it'll + +0:53:22.370,0:53:25.460 +probably perform very similarly in practice there's a paper on that I'm + +0:53:25.460,0:53:30.620 +sure but the original method does not subtract off the mean there and we use + +0:53:30.620,0:53:35.000 +this second moment to normalize the gradient and we do this element-wise so + +0:53:35.000,0:53:39.560 +all this notation is element wise every element of the gradient is divided + +0:53:39.560,0:53:43.310 +through by the square root of the second moment estimate and if you think that + +0:53:43.310,0:53:47.090 +this square root is really being the standard deviation even though this is + +0:53:47.090,0:53:50.990 +not a central moment so it's not actually the standard deviation it's + +0:53:50.990,0:53:55.580 +useful to think of it that way and the name you know root means square is kind + +0:53:55.580,0:54:03.590 +of alluding to that division by the root of the mean of the squares and the + +0:54:03.590,0:54:07.820 +important technical detail here you have to add epsilon here for the annoying + +0:54:07.820,0:54:12.950 +problem that when you divide 0 by 0 everything breaks so you occasionally + +0:54:12.950,0:54:16.310 +have zeros in your network there are some situations where it makes a + +0:54:16.310,0:54:20.060 +difference outside of when your gradients zero but you absolutely do + +0:54:20.060,0:54:25.310 +need that epsilon in your method and you'll see this is a recurring theme all + +0:54:25.310,0:54:29.900 +of these no adaptive methods basically you've got to put an epsilon when your + +0:54:29.900,0:54:34.040 +the divide something just to avoiding to avoid dividing by 0 
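Here is a minimal sketch of the exponential moving average and the RMSprop-style update being described; it follows the two-line form from the lecture, but the constants and the toy gradient are illustrative assumptions, not the exact library implementation.

```python
import numpy as np

def rmsprop_step(w, grad, v, lr=1e-3, alpha=0.9, eps=1e-7):
    """One RMSprop-style step (a sketch of the update described above).

    v is an exponential moving average of the element-wise squared gradient:
    old entries are down-weighted by alpha at every step, so a gradient from
    t steps ago contributes with weight roughly alpha**t.
    """
    v = alpha * v + (1 - alpha) * grad**2        # EMA of the non-central second moment
    w = w - lr * grad / (np.sqrt(v) + eps)       # element-wise normalization; eps avoids 0/0
    return w, v

# Tiny usage example on a toy objective ||w||^2
w, v = np.ones(3), np.zeros(3)
for _ in range(5):
    g = 2 * w                                    # gradient of the toy objective
    w, v = rmsprop_step(w, g, v)
print(w)
```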
and typically that + +0:54:34.040,0:54:38.690 +epsilon will be close to your machine Epsilon I don't know if so if you're + +0:54:38.690,0:54:41.750 +familiar with that term but it's something like 10 to a negative 7 + +0:54:41.750,0:54:45.710 +sometimes 10 to the negative 8 something of that order so really only has a small + +0:54:45.710,0:54:49.790 +effect on the value before I talk about why this method works I want to talk + +0:54:49.790,0:54:53.150 +about the the most recent kind of innovation on top of this method and + +0:54:53.150,0:54:57.560 +that is the method that we actually use in practice so rmsprop is sometimes + +0:54:57.560,0:55:03.170 +still use but more often we use a method notice atom an atom means adaptive + +0:55:03.170,0:55:10.790 +moment estimation so Adam is rmsprop with momentum so I spent 20 minutes + +0:55:10.790,0:55:13.760 +telling you I should use momentum so I'm going to say well you should put it on + +0:55:13.760,0:55:18.420 +top of rmsprop as well there's always of doing that at least + +0:55:18.420,0:55:21.569 +half a dozen in this papers for each of them but Adam is the one that caught on + +0:55:21.569,0:55:25.770 +and the way we do have a mention here is we actually convert the momentum update + +0:55:25.770,0:55:32.609 +to an exponential moving average as well now this may seem like a quantity + +0:55:32.609,0:55:37.200 +qualitatively different update like doing momentum by moving average in fact + +0:55:37.200,0:55:40.829 +what we were doing before is essentially equivalent to that you can work out some + +0:55:40.829,0:55:44.490 +constants where you can get a method where you use a moving exponential + +0:55:44.490,0:55:47.760 +moving average momentum that is equivalent to the regular mentum so + +0:55:47.760,0:55:50.460 +don't think of this moving average momentum as being anything different + +0:55:50.460,0:55:54.000 +than your previous momentum but it has a nice property that you don't need to + +0:55:54.000,0:55:57.660 +change the learning rate when you mess with the β here which I think it's a + +0:55:57.660,0:56:03.780 +big improvement so yeah we added momentum of the gradient and just as + +0:56:03.780,0:56:07.980 +before with rmsprop we have this exponential moving average of the + +0:56:07.980,0:56:13.050 +squared gradient on top of that we basically just plug in this moving + +0:56:13.050,0:56:17.010 +average gradient where we had the gradient in the previous update so it's + +0:56:17.010,0:56:20.579 +not too complicated now if you actually read the atom paper you'll see a whole + +0:56:20.579,0:56:23.880 +bunch of additional notation the algorithm is like ten lines long instead + +0:56:23.880,0:56:28.859 +of three and that is because they add something called bias correction this is + +0:56:28.859,0:56:34.260 +actually not necessary but it'll help a little bit so everybody uses it and all + +0:56:34.260,0:56:39.780 +it does is it increases the value of these parameters during the early stages + +0:56:39.780,0:56:43.319 +of optimization and the reason you do that is because you initialize this + +0:56:43.319,0:56:48.150 +momentum buffer at zero typically now imagine your initial initializer at zero + +0:56:48.150,0:56:52.440 +then after the first step we're going to be adding to that a value of 1 minus + +0:56:52.440,0:56:56.700 +β times the gradient now 1 minus β will typically be 0.1 because we + +0:56:56.700,0:57:00.599 +typically use momentum point 9 so when we do that our gradient step is actually + +0:57:00.599,0:57:05.069 +using 
a learning rate 10 times smaller because this momentum buffer has a tenth

0:57:05.069,0:57:08.670
of a gradient in it and that's undesirable so all the bias

0:57:08.670,0:57:13.890
correction does is just multiply by 10 the step in those early iterations and

0:57:13.890,0:57:18.420
the bias correction formula is just basically the correct way to do that to

0:57:18.420,0:57:23.030
result in a step that's unbiased and unbiased here means just that the expectation

0:57:23.030,0:57:28.420
of the momentum buffer is the gradient so it's nothing too mysterious

0:57:28.420,0:57:32.960
yeah don't think of it as being like a huge addition although I do think that

0:57:32.960,0:57:37.190
the Adam paper was the first one to use bias correction in a mainstream

0:57:37.190,0:57:40.310
optimization method I don't know if they invented it but it certainly pioneered

0:57:40.310,0:57:44.990
the bias correction so these methods work really well in practice let me just

0:57:44.990,0:57:48.590
give you a common empirical comparison here now this quadratic I'm using is a

0:57:48.590,0:57:52.220
diagonal quadratic so it's a little bit cheating to use a method that works well

0:57:52.220,0:57:55.060
on diagonal quadratics on a diagonal quadratic but I'm gonna do that anyway

0:57:55.060,0:58:00.320
and you can see that the direction they travel is quite an improvement over SGD

0:58:00.320,0:58:03.950
so in this simplified problem SGD kind of goes in the wrong direction at the

0:58:03.950,0:58:08.780
beginning where RMSprop basically heads in the right direction now the problem

0:58:08.780,0:58:15.140
is RMSprop suffers from noise just as regular SGD with noise suffers so you

0:58:15.140,0:58:19.490
get this situation where it kind of bounces around the optimum quite significantly

0:58:19.490,0:58:24.710
and just as with SGD with momentum when we add momentum to get Adam we get the same

0:58:24.710,0:58:29.210
kind of improvement where we kind of corkscrew or sometimes reverse corkscrew

0:58:29.210,0:58:32.240
around the solution that kind of thing and this gets you to the solution

0:58:32.240,0:58:35.960
quicker and it means that the last point you're currently at is a good estimate

0:58:35.960,0:58:39.370
of the solution not a noisy estimate but it's kind of the best estimate you have

0:58:39.370,0:58:45.350
so I would generally recommend using Adam over RMSprop and it's certainly the case

0:58:45.350,0:58:50.750
that for some problems you just can't use SGD Adam is necessary for training

0:58:50.750,0:58:53.690
some of the neural networks we're using for language models say our language

0:58:53.690,0:58:57.290
models it's necessary for training the network that I'm going to talk about near

0:58:57.290,0:59:03.580
the end of this presentation and it's generally the case that if I have to

0:59:07.490,0:59:10.670
recommend something you should use you should try either SGD with momentum

0:59:10.670,0:59:14.690
or Adam as your go to methods for optimizing your networks so there's some

0:59:14.690,0:59:19.430
practical advice for you personally I hate Adam because I'm an optimization

0:59:19.430,0:59:24.920
researcher and the theory in their paper is wrong this has been shown

0:59:24.920,0:59:29.360
recently so the method in fact does not converge and you can show this on very

0:59:29.360,0:59:32.430
simple test problems so one of the most heavily

0:59:32.430,0:59:35.820
used methods in
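A minimal sketch of an Adam-style step as just described, including the bias correction of the two moving averages; the constants are the commonly used defaults and are assumptions for illustration, not values prescribed by the lecture.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style step: RMSprop plus an EMA form of momentum, with bias correction."""
    m = beta1 * m + (1 - beta1) * grad           # EMA of the gradient (momentum buffer)
    v = beta2 * v + (1 - beta2) * grad**2        # EMA of the element-wise squared gradient
    m_hat = m / (1 - beta1**t)                   # bias correction: boosts the early steps,
    v_hat = v / (1 - beta2**t)                   # since both buffers start at zero
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # eps again guards against dividing by zero
    return w, m, v

# Tiny usage example on a toy objective ||w||^2
w, m, v = np.ones(3), np.zeros(3), np.zeros(3)
for t in range(1, 6):                            # t starts at 1 so the correction is well defined
    g = 2 * w
    w, m, v = adam_step(w, g, m, v, t)
print(w)
```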
modern machine learning actually doesn't work in a lot of + +0:59:35.820,0:59:40.740 +situations this is unsatisfying and it's I'm kind of an ongoing research question + +0:59:40.740,0:59:44.670 +of the best way to fix this I don't think just modifying Adam a little bit + +0:59:44.670,0:59:47.160 +to try and fix it is really the best solution I think it's got some more + +0:59:47.160,0:59:52.620 +fundamental problems but I won't go into any detail for that there is a very + +0:59:52.620,0:59:56.460 +practical problem they need to talk about though Adam is known to sometimes + +0:59:56.460,1:00:01.140 +give worse generalization error I think Yara's talked in detail about + +1:00:01.140,1:00:08.730 +generalization error do I go over that so yeah generalization error is the + +1:00:08.730,1:00:14.100 +error on data that you didn't train your model on basically so your networks are + +1:00:14.100,1:00:17.370 +very heavily parameter over parameterised and if you train them to + +1:00:17.370,1:00:22.200 +give zero loss on the data you trained it on they won't give zero loss on other + +1:00:22.200,1:00:27.240 +data points data that it's never seen before and this generalization error is + +1:00:27.240,1:00:32.310 +that error typically the best thing we can do is minimize the loss and the data + +1:00:32.310,1:00:37.080 +we have but sometimes that's suboptimal and it turns out when you use Adam it's + +1:00:37.080,1:00:40.860 +quite common on particularly on image problems that you get worst + +1:00:40.860,1:00:46.140 +generalization error than when you use STD and people attribute this to a whole + +1:00:46.140,1:00:50.400 +bunch of different things it may be finding those bad local minima that I + +1:00:50.400,1:00:54.180 +mentioned earlier the ones that are smaller it's kind of unfortunate that + +1:00:54.180,1:00:57.840 +the better your optimization method the more likely it is to hit those small + +1:00:57.840,1:01:02.460 +local minima because they're closer to where you currently are and kind of it's + +1:01:02.460,1:01:06.510 +the goal of an optimization method to find you the closest minima in a sense + +1:01:06.510,1:01:10.620 +these local optimization methods we use but there's a whole bunch of other + +1:01:10.620,1:01:16.950 +reasons that you can attribute to it less noise in Adam perhaps it could be + +1:01:16.950,1:01:20.100 +some structure maybe these methods where you rescale + +1:01:20.100,1:01:23.070 +space like this have this fundamental problem where they give worst + +1:01:23.070,1:01:26.430 +generalization we don't really understand this but it's important to + +1:01:26.430,1:01:30.390 +know that this may be a problem or in some cases it's not to say that it will + +1:01:30.390,1:01:33.450 +give horrible performance you'll still get a pretty good neuron that workout at + +1:01:33.450,1:01:37.200 +the end and what I can tell you is the language models that we trained at + +1:01:37.200,1:01:41.890 +Facebook use methods like atom or atom itself and they + +1:01:41.890,1:01:46.960 +much better results than if you use STD and there's a kind of a small thing that + +1:01:46.960,1:01:51.490 +won't affect you at all I would expect but with Adam you have to maintain these + +1:01:51.490,1:01:56.410 +three buffers where's sed you have two buffers of parameters this doesn't + +1:01:56.410,1:01:59.230 +matter except when you're training a model that's like 12 gigabytes and then + +1:01:59.230,1:02:02.790 +it really becomes a problem I don't think you'll encounter that in practice + 
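For reference, here is how the two "go to" choices discussed above are set up with PyTorch's built-in optimizers; in practice you would pick one of the two, and the hyperparameter values shown are common defaults used for illustration, not settings recommended by the lecture.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)   # placeholder model, just to have parameters to optimize

# SGD with momentum keeps one extra buffer per weight; Adam keeps two (its two EMAs).
sgd  = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
adam = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8)

x, y = torch.randn(8, 10), torch.randn(8, 2)
loss = nn.functional.mse_loss(model(x), y)
sgd.zero_grad()
loss.backward()
sgd.step()                 # one SGD-with-momentum step; Adam would be used the same way
```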
+1:02:02.790,1:02:06.280 +and surely there's a little bit iffy so you gotta trim two parameters instead of + +1:02:06.280,1:02:13.060 +one so yeah that's practical advice use Adam arrest you do but onto something + +1:02:13.060,1:02:18.220 +that is also sup is also kind of a core thing oh sorry have a question yes yes + +1:02:18.220,1:02:22.600 +you absolutely correct but typically I guess the question the question was + +1:02:22.600,1:02:28.000 +weren't using a small epsilon in the denominator result in blow-up certainly + +1:02:28.000,1:02:32.440 +if the numerator was equal to roughly one than dividing through by ten to the + +1:02:32.440,1:02:37.900 +negative seven could be catastrophic and this this is a legitimate question but + +1:02:37.900,1:02:45.250 +typically in order for the V buffer to have very small values the gradient also + +1:02:45.250,1:02:48.340 +has to have had very small values you can see that from the way the + +1:02:48.340,1:02:53.110 +exponential moving averages are updated so in fact it's not a practical problem + +1:02:53.110,1:02:56.860 +when this when this V is incredibly small the momentum is also very small + +1:02:56.860,1:03:01.180 +and when you're dividing small thing by a small thing you don't get blow-up oh + +1:03:01.180,1:03:08.050 +yeah so the question is should I you buy an SUV and atom separately at the same + +1:03:08.050,1:03:11.860 +time and just see which one works better in fact that is pretty much what we do + +1:03:11.860,1:03:14.620 +because we have lots of computers we just have one computer runners you need + +1:03:14.620,1:03:17.890 +one computer one atom and see which one works better although we kind of know + +1:03:17.890,1:03:21.730 +from most problems which one is the better choice for whatever problems + +1:03:21.730,1:03:24.460 +you're working with maybe you can try both it depends how long it's going to + +1:03:24.460,1:03:27.940 +take to train I'm not sure exactly what you're gonna be doing in terms of + +1:03:27.940,1:03:31.150 +practice in this course yeah certainly legitimate way to do it + +1:03:31.150,1:03:35.020 +in fact some people use SGD at the beginning and then switch to atom at the + +1:03:35.020,1:03:39.430 +end that's certainly a good approach it just makes it more complicated and + +1:03:39.430,1:03:44.740 +complexity should be avoided if possible yes this is one of those deep unanswered + +1:03:44.740,1:03:48.400 +questions so the question was should we 1s you deal with lots of different + +1:03:48.400,1:03:51.850 +initializations and see which one gets the best solution won't I help with the + +1:03:51.850,1:03:54.990 +bumpiness this is the case with small neural net + +1:03:54.990,1:03:59.160 +that you will get different solutions depending on your initialization now + +1:03:59.160,1:04:02.369 +there's a remarkable property of the kind of large networks we use at the + +1:04:02.369,1:04:07.349 +moment and the art networks as long as you use similar random initialization in + +1:04:07.349,1:04:11.400 +terms of the variance of initialization you'll end up practically at a similar + +1:04:11.400,1:04:16.380 +quality solutions and this is not well understood so yeah it's it's quite + +1:04:16.380,1:04:19.319 +remarkable that your neural network can train for three hundred epochs and you + +1:04:19.319,1:04:23.550 +end up with solution the test error is like almost exactly the same as what you + +1:04:23.550,1:04:26.220 +got with some completely different initialization we don't understand this + 
+1:04:26.220,1:04:31.800 +so if you really need to eke out tiny performance gains you may be able to get + +1:04:31.800,1:04:36.150 +a little bit better Network by running multiple and picking the best and it + +1:04:36.150,1:04:39.180 +seems the bigger your network and the harder your problem the less game you + +1:04:39.180,1:04:44.190 +get from doing that yes so the question was we have three buffers for each + +1:04:44.190,1:04:49.470 +weight on the answer answer is yes so essentially yeah we basically in memory + +1:04:49.470,1:04:53.160 +we have a copy of the same size as our weight data so our weight will be a + +1:04:53.160,1:04:55.920 +whole bunch of tensors in memory we have a separate whole bunch of tensors that + +1:04:55.920,1:05:01.849 +our momentum tensors and we have a whole bunch of other tensors that are the the + +1:05:01.849,1:05:09.960 +second moment tensors so yeah so normalization layers so this is kind of + +1:05:09.960,1:05:14.369 +a clever idea why try and salt why try and come up with a better optimization + +1:05:14.369,1:05:20.540 +algorithm where we can just come up with a better network and this is the idea so + +1:05:20.960,1:05:24.960 +modern neural networks typically we modify the network by adding additional + +1:05:24.960,1:05:32.280 +layers in between existing layers and the goal of these layers to improve the + +1:05:32.280,1:05:36.450 +optimization and generalization performance of the network and the way + +1:05:36.450,1:05:39.059 +they do this can happen in a few different ways but let me give you an + +1:05:39.059,1:05:44.430 +example so we would typically take standard kind of combinations so as you + +1:05:44.430,1:05:48.930 +know in modern your networks we typically alternate linear operations + +1:05:48.930,1:05:52.319 +with nonlinear operations and here I call that activation functions we + +1:05:52.319,1:05:56.069 +alternate them linear nonlinear linear nonlinear what we could do is we can + +1:05:56.069,1:06:01.819 +place these normalization layers either between the linear order non-linear or + +1:06:01.819,1:06:11.009 +before so there in this case we are using for instance this is the kind of + +1:06:11.009,1:06:14.369 +structure we have in real networks where we have a convolution recover that + +1:06:14.369,1:06:18.240 +convolutions or linear operations followed by batch normalization this is + +1:06:18.240,1:06:20.789 +a type of normalization which I will detail in a minute + +1:06:20.789,1:06:28.140 +followed by riilu which is currently the most popular activation function and we + +1:06:28.140,1:06:31.230 +place this mobilization between these existing layers and what I want to make + +1:06:31.230,1:06:35.940 +clear is this normalization layers they affect the flow of data through so they + +1:06:35.940,1:06:39.150 +modify the data that's flowing through but they don't change the power of the + +1:06:39.150,1:06:43.380 +network in the sense that that you can set up the weights in the network in + +1:06:43.380,1:06:46.769 +some way that'll still give whatever output you had in an unknown alized + +1:06:46.769,1:06:50.220 +network with a normalized network so normalization layers you're not making + +1:06:50.220,1:06:53.670 +that work more powerful they improve it in other ways normally when we add + +1:06:53.670,1:06:57.660 +things to a neural network the goal is to make it more powerful and yes this + +1:06:57.660,1:07:01.740 +normalization layer can also be after the activation or before the linear or + +1:07:01.740,1:07:05.009 +you 
know because this wraps around we do this in order a lot of them are + +1:07:05.009,1:07:11.400 +equivalent but any questions here this is this bits yes yes so that's certainly + +1:07:11.400,1:07:16.140 +true but we kind of want that we want the real o2 sensor some of the data but + +1:07:16.140,1:07:20.009 +not too much but it's also not quite accurate because normalization layers + +1:07:20.009,1:07:24.989 +can also scale and ship the data and so it won't necessarily be that although + +1:07:24.989,1:07:28.739 +it's certainly at initialization they do not do that scaling in ship so typically + +1:07:28.739,1:07:32.460 +cut off half the data and in fact if you try to do a theoretical analysis of this + +1:07:32.460,1:07:37.470 +it's very convenient that it cuts off half the data so the structure this + +1:07:37.470,1:07:42.239 +normalization layers they all pretty much do the same kind of operation and + +1:07:42.239,1:07:47.640 +how many use kind of generic notation here so you should imagine that X is an + +1:07:47.640,1:07:54.930 +input to the normalization layer and Y is an output and what you do is use do a + +1:07:54.930,1:08:00.119 +whitening or normalization operation where you subtract off some estimate of + +1:08:00.119,1:08:05.190 +the mean of the data and you divide through by some estimate of the standard + +1:08:05.190,1:08:10.259 +deviation and remember before that I mentioned we want to keep the + +1:08:10.259,1:08:12.630 +representational power of the network the same + +1:08:12.630,1:08:17.430 +what we do to ensure that is we multiply by an alpha and we add a sorry in height + +1:08:17.430,1:08:22.050 +multiplied by an hey and we add a B and this is just so that the layer can still + +1:08:22.050,1:08:27.120 +output values over any particular range or if we just always had every layer + +1:08:27.120,1:08:30.840 +output in white and data the network couldn't output like a value million or + +1:08:30.840,1:08:35.370 +something like that it wouldn't it could only do that you know with very in very + +1:08:35.370,1:08:38.520 +rare cases because that would be very heavy on the tail of the normal + +1:08:38.520,1:08:41.850 +distribution so this allows our layers to essentially output things that are + +1:08:41.850,1:08:49.200 +the same range as before and yes so normalization layers have parameters and + +1:08:49.200,1:08:51.900 +in the network is a little bit more complicated in the sensor has more + +1:08:51.900,1:08:56.010 +parameters it's typically a very small number of parameters like rounding error + +1:08:56.010,1:09:04.290 +in your counts of network parameters typically and yeah so the complexity of + +1:09:04.290,1:09:06.840 +this is on being kind of vague about how you compute the mean and standard + +1:09:06.840,1:09:10.170 +deviation the reason I'm doing that is because all the methods compute in a + +1:09:10.170,1:09:18.210 +different way and I'll detail that in a second yes question weighs re lb oh it's + +1:09:18.210,1:09:24.630 +just a shift parameter so the data could have had a nonzero mean and we want it + +1:09:24.630,1:09:28.470 +delayed to be able to produce outputs with a nonzero mean so if we always just + +1:09:28.470,1:09:30.570 +subtract off the mean it couldn't do that + +1:09:30.570,1:09:34.950 +so it just adds back representational power to the layer yes so the question + +1:09:34.950,1:09:40.110 +is don't these a and B parameters reverse the normalization and and in + +1:09:40.110,1:09:44.730 +fact that often is the case that they do something 
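A minimal sketch of the generic normalization operation just described, whitening with some estimate of the mean and standard deviation and then rescaling and shifting with learnable a and b; the choice of statistics here is only one illustrative possibility, since how they are computed is exactly what distinguishes the different normalization layers.

```python
import torch

def normalize(x, mean, std, a, b, eps=1e-5):
    """y = a * (x - mean) / std + b, the generic form of a normalization layer."""
    return a * (x - mean) / (std + eps) + b

x = torch.randn(8, 16)                      # toy activations: 8 instances, 16 features
mean, std = x.mean(dim=0), x.std(dim=0)     # one possible choice of statistics (illustrative)
a, b = torch.ones(16), torch.zeros(16)      # learnable scale and shift, initialized to identity
y = normalize(x, mean, std, a, b)
```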
similar but they move at + +1:09:44.730,1:09:48.750 +different time scales so between the steps or between evaluations your + +1:09:48.750,1:09:52.410 +network the mean and variance can can shift quite substantially based off the + +1:09:52.410,1:09:55.320 +data you're feeding but these a and B parameters are quite stable they move + +1:09:55.320,1:10:01.260 +slowly as you learn them so because they're most stable this has beneficial + +1:10:01.260,1:10:04.530 +properties and I'll describe those a little bit later but I want to talk + +1:10:04.530,1:10:08.610 +about is exactly how you normalize the data and this is where the crucial thing + +1:10:08.610,1:10:11.760 +so the earliest of these methods developed was batch norm and he is this + +1:10:11.760,1:10:16.429 +kind of a bizarre normalization that I I think is a horrible idea + +1:10:16.429,1:10:22.460 +but unfortunately works fantastically well so it normalizes across batches so + +1:10:22.460,1:10:28.370 +we want information about a certain channel recall for a convolutional + +1:10:28.370,1:10:32.000 +neural network which channel is one of these latent images that you have in + +1:10:32.000,1:10:34.610 +your network that part way through the network you have some data it doesn't + +1:10:34.610,1:10:37.070 +really look like an image if you actually look at it but it's it's shaped + +1:10:37.070,1:10:41.000 +like an image anyway and that's a channel so we want to compute an average + +1:10:41.000,1:10:47.239 +over this over this channel but we only have a small amount of data that's + +1:10:47.239,1:10:51.380 +what's in this channel basically height times width if it's a if it's an image + +1:10:51.380,1:10:56.000 +and it turns out that's not enough data to get good estimates of these mean and + +1:10:56.000,1:10:58.969 +variance parameters so what batchman does is it takes a mean and variance + +1:10:58.969,1:11:05.570 +estimate across all the instances in your mini-batch pretty straightforward + +1:11:05.570,1:11:09.890 +and that's what it divides blue by the reason why I don't like this is it is no + +1:11:09.890,1:11:12.830 +longer actually stochastic gradient descent if you using batch normalization + +1:11:12.830,1:11:19.429 +so it breaks all the theory that I work on for a living so I prefer some other + +1:11:19.429,1:11:24.409 +normalization strategies there in fact quite a soon after Bachelor and people + +1:11:24.409,1:11:27.409 +tried normalizing via every other possible combination of things you can + +1:11:27.409,1:11:31.699 +normalize by and it turns out the three that kind of work a layer instance and + +1:11:31.699,1:11:37.370 +group norm and layer norm here in this diagram you averaged across all of the + +1:11:37.370,1:11:43.820 +channels and across height and width now this doesn't work on all problems so I + +1:11:43.820,1:11:47.000 +would only recommend it on a problem where you know it already works and + +1:11:47.000,1:11:49.940 +that's typically a problem where people already using it so look at what the + +1:11:49.940,1:11:53.989 +network's people are using if that's a good idea or not will depend the + +1:11:53.989,1:11:57.140 +instance normalization is something that's used a lot in modern language + +1:11:57.140,1:12:03.380 +models and this you do not average across the batch anymore which is nice I + +1:12:03.380,1:12:07.310 +won't we talk about that much depth I really the one I would rather you rather + +1:12:07.310,1:12:12.440 +you use in practice is group normalization so here we have which + 
+1:12:12.440,1:12:16.219 +across a group of channels and this group is trapped is chosen arbitrarily + +1:12:16.219,1:12:20.090 +and fixed at the beginning so typically we just group things numerically so + +1:12:20.090,1:12:23.580 +channel 0 to 10 would be a group channel you know 10 to + +1:12:23.580,1:12:31.110 +20 making sure you don't overlap of course disjoint groups of channels and + +1:12:31.110,1:12:34.560 +the size of these groups is a parameter that you need to tune although we always + +1:12:34.560,1:12:39.150 +use 32 in practice you could tune that and you just do this because there's not + +1:12:39.150,1:12:42.600 +enough information on a single channel and using all the channels is too much + +1:12:42.600,1:12:46.170 +so you just use something in between it's it's really quite a simple idea and + +1:12:46.170,1:12:50.790 +it turns out this group norm often works better than batch normal a lot of + +1:12:50.790,1:12:55.410 +problems and it does mean that my HUD theory that I work on is still balanced + +1:12:55.410,1:12:57.890 +so I like that so why does normalization help this is a + +1:13:02.190,1:13:06.330 +matter of dispute so in fact in the last few years several papers have come out + +1:13:06.330,1:13:08.790 +on this topic unfortunately the papers did not agree + +1:13:08.790,1:13:13.590 +on why it works they all have completely separate explanations but there's some + +1:13:13.590,1:13:16.260 +things that are definitely going on so we can shape it we can say for sure + +1:13:16.260,1:13:24.120 +that the network appears to be easier to optimize so by that I mean you can use + +1:13:24.120,1:13:28.140 +large learning rates better in a better condition network you can use larger + +1:13:28.140,1:13:31.590 +learning rates and therefore get faster convergence so that does seem to be the + +1:13:31.590,1:13:35.030 +case when you uses normalization layers another factor which is a little bit + +1:13:38.070,1:13:39.989 +disputed but I think is reasonably well-established + +1:13:39.989,1:13:44.489 +you get noise in the data passing through your network when you use + +1:13:44.489,1:13:49.940 +normalization in vaginal and this noise comes from other instances in the bash + +1:13:49.940,1:13:53.969 +because it's random what I like instances are in your batch when you + +1:13:53.969,1:13:57.239 +compute the mean using those other instances that mean is noisy and this + +1:13:57.239,1:14:01.469 +noise is then added or sorry subtracted from your weight so when you do the + +1:14:01.469,1:14:06.050 +normalization operation so this noise is actually potentially helping + +1:14:06.050,1:14:11.790 +generalization performance in your network now there has been a lot of + +1:14:11.790,1:14:15.180 +papers on injecting noise internet works to help generalization so it's not such + +1:14:15.180,1:14:20.370 +a crazy idea that this noise can be helping and in terms of a practical + +1:14:20.370,1:14:24.030 +consideration this normalization makes the weight initialization that you use a + +1:14:24.030,1:14:28.260 +lot less important it used to be kind of a black art to select the initialization + +1:14:28.260,1:14:32.460 +your new your network and the people who really good motive is often it was just + +1:14:32.460,1:14:35.340 +because they're really good at changing their initialization and this is just + +1:14:35.340,1:14:39.540 +less the case now when we use normalization layers and also gives the + +1:14:39.540,1:14:45.930 +benefit if you can kind of tile together layers with impunity so 
again it used to + +1:14:45.930,1:14:49.050 +be the situation that if you just plug together two possible ways in your + +1:14:49.050,1:14:52.740 +network it probably wouldn't work now that we use normalization layers it + +1:14:52.740,1:14:57.900 +probably will work and even if it's a horrible idea and this has spurred a + +1:14:57.900,1:15:02.310 +whole field of automated architecture search where they just randomly calm + +1:15:02.310,1:15:05.940 +build together blocks and it's try thousands of them and see what works and + +1:15:05.940,1:15:09.540 +that really wasn't possible before because that would typically result in a + +1:15:09.540,1:15:14.010 +poorly conditioned Network you couldn't train and with normalization typically + +1:15:14.010,1:15:19.590 +you can train it some practical considerations so the the bachelor on + +1:15:19.590,1:15:23.310 +paper one of the reasons why it wasn't invented earlier is the kind of + +1:15:23.310,1:15:27.480 +non-obvious thing that you have to back propagate through the calculation of the + +1:15:27.480,1:15:32.160 +mean and standard deviation if you don't do this everything blows up now you + +1:15:32.160,1:15:35.190 +might have to do this yourself as it'll be implemented in the implementation + +1:15:35.190,1:15:42.000 +that you use oh yes so I do not have the expertise to answer that I feel like + +1:15:42.000,1:15:45.060 +it's kind of sometimes it's just a patent pet method like people like + +1:15:45.060,1:15:49.710 +layering in suits normally that field more and in fact a good norm if you it's + +1:15:49.710,1:15:53.640 +just the group size covers both so I would be sure that you could probably + +1:15:53.640,1:15:56.640 +get the same performance using group norm with a particular group size chosen + +1:15:56.640,1:16:00.980 +carefully yeah the choice of national does affect + +1:16:00.980,1:16:06.720 +parallelization so the implementation zinc in your computer library or your + +1:16:06.720,1:16:10.380 +CPU library are pretty efficient for each of these but it's complicated when + +1:16:10.380,1:16:14.820 +you are spreading your computation across machines and you kind of have to + +1:16:14.820,1:16:18.630 +synchronize these these these things and batch norm is a bit of a pain there + +1:16:18.630,1:16:23.790 +because it would mean that you need to compute an average across all machines + +1:16:23.790,1:16:27.540 +and aggregator whereas if you're using group norm every instance is on a + +1:16:27.540,1:16:30.450 +different machine you can just completely compute the norm so in all + +1:16:30.450,1:16:34.350 +those other three it's separate normalization for each instance it + +1:16:34.350,1:16:37.560 +doesn't depend on the other instances in the batch so it's nicer when you're + +1:16:37.560,1:16:40.570 +distributing it's when people use batch norm on a cluster + +1:16:40.570,1:16:45.100 +they actually do not sync the statistics across which makes it even less like SGD + +1:16:45.100,1:16:51.250 +and makes me even more annoyed so what was it already + +1:16:51.250,1:16:57.610 +yes yeah Bachelor basically has a lot of momentum not in the optimization sense + +1:16:57.610,1:17:01.300 +but in the sense of people's minds so it's very heavily used for that reason + +1:17:01.300,1:17:05.860 +but I would recommend group norm instead and there's kind of like a technical + +1:17:05.860,1:17:09.760 +data with batch norm you don't want to compute these mean and standard + +1:17:09.760,1:17:14.950 +deviations on batches during evaluation time by 
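
The point about having to backpropagate through the mean and standard deviation can be made explicit with a hand-written group norm. This is a sketch assuming an (N, C, H, W) input; the statistics are ordinary tensor operations, so autograd differentiates through them, which is exactly the step earlier normalization attempts got wrong.

```python
import torch

def group_norm_manual(x, num_groups, eps=1e-5):
    # x: (N, C, H, W). Statistics are per instance and per channel group,
    # and mean/var stay inside the autograd graph, so gradients flow through them.
    N, C, H, W = x.shape
    g = x.view(N, num_groups, C // num_groups, H, W)
    mean = g.mean(dim=(2, 3, 4), keepdim=True)
    var = g.var(dim=(2, 3, 4), unbiased=False, keepdim=True)
    g = (g - mean) / torch.sqrt(var + eps)
    return g.view(N, C, H, W)

x = torch.randn(2, 64, 8, 8, requires_grad=True)
y = group_norm_manual(x, num_groups=2)
y.sum().backward()          # works because the statistics were not detached
print(x.grad.shape)         # torch.Size([2, 64, 8, 8])
```
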
evaluation time I mean when you
+
+1:17:14.950,1:17:20.170
+actually run your network on the test data set or we use it in the real world
+
+1:17:20.170,1:17:24.370
+for some application it's typically in those situations you don't have batches
+
+1:17:24.370,1:17:29.050
+anymore, batches are more for training things so you need some substitution in
+
+1:17:29.050,1:17:33.100
+that case you can compute an exponential moving average as we talked about before
+
+1:17:33.100,1:17:37.930
+an EMA of these mean and standard deviations you may think to yourself why
+
+1:17:37.930,1:17:41.260
+don't we use an EMA in the implementation of batch norm the answer
+
+1:17:41.260,1:17:44.860
+is because it doesn't work it seems like a very reasonable idea though and
+
+1:17:44.860,1:17:48.880
+people have explored that in quite a lot of depth but it doesn't work oh yes
+
+1:17:48.880,1:17:52.900
+this is quite crucial so yes people have tried normalizing things in neural
+
+1:17:52.900,1:17:55.480
+networks before batch norm was invented but they always made the
+
+1:17:55.480,1:17:59.380
+mistake of not backpropagating through the mean and standard deviation and the
+
+1:17:59.380,1:18:02.290
+reason why they didn't do that is because the math is really tricky and if
+
+1:18:02.290,1:18:05.650
+you try to implement it yourself it will probably be wrong now that we have
+
+1:18:05.650,1:18:09.460
+PyTorch which computes gradients correctly for you in all situations you
+
+1:18:09.460,1:18:12.850
+could actually do this in practice and there are just a little bit but only a
+
+1:18:12.850,1:18:16.780
+little bit because it's surprisingly difficult yeah so the question is
+
+1:18:16.780,1:18:21.070
+there a difference if we apply normalization before or after the
+
+1:18:21.070,1:18:25.690
+non-linearity and the answer is there will be a small difference in the
+
+1:18:25.690,1:18:28.930
+performance of your network now I can't tell you which one's better because it
+
+1:18:28.930,1:18:32.110
+appears in some situations one works a little bit better in other situations
+
+1:18:32.110,1:18:35.350
+the other one works better what I can tell you is the way I draw it here is
+
+1:18:35.350,1:18:39.100
+what's used in the PyTorch implementation of ResNet and most
+
+1:18:39.100,1:18:43.330
+ResNet implementations so that's probably almost as good as you can get I
+
+1:18:43.330,1:18:49.270
+think they would use the other form if it was better and it's certainly problem
+
+1:18:49.270,1:18:51.460
+dependent this is another one of those things where maybe there's
+
+1:18:51.460,1:18:55.420
+no correct answer for how you do it and it's just random which works better I don't
+
+1:18:55.420,1:19:03.190
+know yes yeah any other questions on this before I move on so you need
+
+1:19:03.190,1:19:06.850
+more data to get accurate estimates of the mean and standard deviation the
+
+1:19:06.850,1:19:10.570
+question was why is it a good idea to compute it across multiple channels
+
+1:19:10.570,1:19:13.450
+rather than a single channel and yes it is because you just have more data to
+
+1:19:13.450,1:19:17.800
+make better estimates but you want to be careful you don't have too much data
+
+1:19:17.800,1:19:21.130
+in there because then you don't get the noise and recall that the noise is
+
+1:19:21.130,1:19:25.300
+actually useful so basically the group size in group norm is just adjusting the
+
+1:19:25.300,1:19:28.870
+amount of noise we have basically the question was how is this related to
+
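
The evaluation-time substitution described a little earlier is what the stock batch norm layers already do: during training they update running estimates of the mean and variance as an exponential moving average, and in eval mode they normalize with those stored values instead of batch statistics. A small sketch, with arbitrary sizes:

```python
import torch
from torch import nn

bn = nn.BatchNorm1d(8, momentum=0.1)   # momentum here is the EMA update factor
x = torch.randn(32, 8)

bn.train()
_ = bn(x)                      # normalizes with batch statistics and updates the EMA
print(bn.running_mean[:3])     # the stored running estimate

bn.eval()
y = bn(torch.randn(5, 8))      # normalizes with the EMA; no batch statistics needed
```
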
+1:19:28.870,1:19:32.950 +group convolutions this was all pioneered before good convolutions were + +1:19:32.950,1:19:38.260 +used it certainly has some interaction with group convolutions if you use them + +1:19:38.260,1:19:41.920 +and so you want to be a little bit careful there I don't know exactly what + +1:19:41.920,1:19:44.800 +the correct thing to do is in those cases but I can tell you they definitely + +1:19:44.800,1:19:48.610 +use normalization in those situations probably Batchelor more more than group + +1:19:48.610,1:19:53.260 +norm because of the momentum I mentioned it's just more popular vaginal yes so + +1:19:53.260,1:19:56.890 +the question is do we ever use our Beck instances from the mini-batch in group + +1:19:56.890,1:20:00.310 +norm or is it always just a single instance we always just use a single + +1:20:00.310,1:20:04.450 +instance because there's so many benefits to that it's so much simpler in + +1:20:04.450,1:20:08.469 +implementation and in theory to do that maybe you can get some improvement from + +1:20:08.469,1:20:11.530 +that in fact I bet you there's a paper that does that somewhere because they've + +1:20:11.530,1:20:15.190 +tried have any combination of this in practice I suspect if it worked well + +1:20:15.190,1:20:19.450 +we'd probably be using it so probably probably doesn't work well under the the + +1:20:19.450,1:20:24.370 +death of optimization I wanted to put something a little bit interesting + +1:20:24.370,1:20:27.610 +because you've all been sitting through kind of a pretty dense lecture so this + +1:20:27.610,1:20:31.870 +is something that I've kind of been working on a little bit I thought you + +1:20:31.870,1:20:36.580 +might find interesting so you might have seen the the xkcd comic here that I've + +1:20:36.580,1:20:42.790 +modified it's not always this way it's kind of point of what it makes so + +1:20:42.790,1:20:46.270 +sometimes we can just barge into a field we know nothing about it and improve on + +1:20:46.270,1:20:50.469 +how they're currently doing it although you have to be a little bit careful so + +1:20:50.469,1:20:53.560 +the problem I want to talk about is one that young I think mentioned briefly in + +1:20:53.560,1:20:58.530 +the first lecture but I want to go into a bit of detail it's MRI reconstruction + +1:20:58.530,1:21:04.639 +now in the MRI reconstruction problem we take a raw data from an MRI machine a + +1:21:04.639,1:21:08.540 +medical imaging machine we take raw data from that machine and we reconstruct an + +1:21:08.540,1:21:12.530 +image and there's some pipeline an algorithm in the middle there that + +1:21:12.530,1:21:17.900 +produces the image and the goal basically here is to replace 30 years of + +1:21:17.900,1:21:21.020 +research into what algorithm they should use their with with neural networks + +1:21:21.020,1:21:27.949 +because that's that's what I'll get paid to do and I'll give you a bit of detail + +1:21:27.949,1:21:31.810 +so these MRI machines capture data in what's known as the Fourier domain I + +1:21:31.810,1:21:34.909 +know a lot of you have done signal processing some of you may have no idea + +1:21:34.909,1:21:42.070 +what this is and you don't need to understand it for this problem oh yeah + +1:21:44.770,1:21:49.639 +yes so you may have seen the the further domain in one dimensional case + +1:21:49.639,1:21:54.710 +so for neural networks sorry for MRI reconstruction we have two dimensional + +1:21:54.710,1:21:58.340 +Fourier domain the thing you need to know is it's a linear mapping to get 
+ +1:21:58.340,1:22:02.389 +from the fluid domain to image domain it's just linear and it's very efficient + +1:22:02.389,1:22:06.350 +to do that mapping it literally takes milliseconds no matter how big your + +1:22:06.350,1:22:09.980 +images on modern computers so linear and easy to convert back and forth between + +1:22:09.980,1:22:15.619 +the two and the MRI machines actually capture either rows or columns of this + +1:22:15.619,1:22:20.540 +Fourier domain as samples they're called sample in the literature so each time + +1:22:20.540,1:22:25.280 +the machine computes a sample which is every few milliseconds it gets a role + +1:22:25.280,1:22:28.940 +column of this image and this is actually technically a complex-valued + +1:22:28.940,1:22:33.380 +image but this does not matter for my discussion of it so you can imagine it's + +1:22:33.380,1:22:38.300 +just a two channel image if you imagine a real and imaginary channel just think + +1:22:38.300,1:22:42.830 +of them as color channels the problem we want to do we want to solve is + +1:22:42.830,1:22:48.800 +accelerating MRI acceleration here is in the sense of faster so we want to run + +1:22:48.800,1:22:53.830 +the machines quicker and produce identical quality images + +1:22:55.400,1:23:00.050 +and one way we can do that in the most successful way so far is by just not + +1:23:00.050,1:23:05.540 +capturing all of the columns we just skip some randomly it's useful in + +1:23:05.540,1:23:09.320 +practice to also capture some of the middle columns it turns out they contain + +1:23:09.320,1:23:14.150 +a lot of the information but outside the middle we just capture randomly and we + +1:23:14.150,1:23:16.699 +can't just use a nice linear operation anymore + +1:23:16.699,1:23:20.270 +that diagram on the right is the output of that linear operation I mentioned + +1:23:20.270,1:23:23.810 +applied to this data so it doesn't give useful Apple they only do something a + +1:23:23.810,1:23:27.100 +little bit more intelligent any questions on this before I move on + +1:23:27.100,1:23:35.030 +it is frequency and phase dimensions so in this particular case I'm actually + +1:23:35.030,1:23:38.510 +sure this diagram one of the dimensions is frequency and one is phase and the + +1:23:38.510,1:23:44.390 +value is the magnitude of a sine wave with that frequency and phase so if you + +1:23:44.390,1:23:48.980 +add together all the sine waves wave them with the frequency oh so with the + +1:23:48.980,1:23:54.620 +weight in this image you get the original image so it's it's a little bit + +1:23:54.620,1:23:58.429 +more complicated because it's in two dimensions and the sine waves you gotta + +1:23:58.429,1:24:02.030 +be little bit careful but it's basically just each pixel is the magnitude of a + +1:24:02.030,1:24:06.230 +sine wave or if you want to compare to a 1d analogy + +1:24:06.230,1:24:11.960 +you'll just have frequencies so the pixel intensity is the strength of that + +1:24:11.960,1:24:16.580 +frequency if you have a musical note say a piano note with a C major as one of + +1:24:16.580,1:24:19.340 +the frequencies that would be one pixel this image would be the C major + +1:24:19.340,1:24:24.140 +frequency and another might be a minor or something like that and the magnitude + +1:24:24.140,1:24:28.370 +of it is just how hard they press the key on the piano so you have frequency + +1:24:28.370,1:24:34.370 +information yes so the video doesn't work there was one of the biggest + +1:24:34.370,1:24:38.750 +breakthroughs in in Threat achill mathematics for 
a long time was the + +1:24:38.750,1:24:41.690 +invention of compressed sensing I'm sure some of you have heard of compressed + +1:24:41.690,1:24:45.710 +sensing a hands of show of hands compressed sensing yeah some of you + +1:24:45.710,1:24:48.980 +especially work in the mathematical sciences would be aware of it + +1:24:48.980,1:24:53.330 +basically there's this phenomenal political paper that showed that we + +1:24:53.330,1:24:57.770 +could actually in theory get a perfect reconstruction from these subsampled + +1:24:57.770,1:25:02.080 +measurements and we had some requirements for this to work the + +1:25:02.080,1:25:06.010 +requirements were that we needed to sample randomly + +1:25:06.010,1:25:10.150 +in fact it's a bit weaker you have to sample incoherently but in practice + +1:25:10.150,1:25:14.710 +everybody samples randomly so it's essentially the same thing now here + +1:25:14.710,1:25:18.910 +we're randomly sampling columns but within the columns we do not randomly + +1:25:18.910,1:25:22.330 +sample the reason being is it's not faster in the machine the machine can + +1:25:22.330,1:25:25.930 +capture one column as quickly as you could capture half a column so we just + +1:25:25.930,1:25:29.350 +kind of capture a whole column so that makes it no longer random so that's one + +1:25:29.350,1:25:33.760 +kind of problem with it the other problem is kind of the the assumptions + +1:25:33.760,1:25:36.850 +of this compressed sensing theory are violated by the kind of images we want + +1:25:36.850,1:25:41.020 +to reconstruct I show you on the right they're an example of compressed sensing + +1:25:41.020,1:25:44.560 +Theory reconstruction this was a big step forward from what they could do + +1:25:44.560,1:25:48.940 +before you would you'll get something that looks like this previously that was + +1:25:48.940,1:25:53.020 +really considered the best in fact some people would when this result came out + +1:25:53.020,1:25:57.430 +swore though this was impossible it's actually not but you need some + +1:25:57.430,1:26:00.550 +assumptions and these assumptions are pretty critical and I mention them there + +1:26:00.550,1:26:05.080 +so you need sparsity of the image now that mi a-- majors not sparse by sparse + +1:26:05.080,1:26:09.370 +I mean it has a lot of zero or black pixels it's clearly not sparse but it + +1:26:09.370,1:26:13.660 +can be represented sparsely or approximately sparsely if you do a + +1:26:13.660,1:26:18.160 +wavelet decomposition now I won't go to the details there's a little bit of + +1:26:18.160,1:26:20.920 +problem though it's only approximately sparse and when you do that wavelet + +1:26:20.920,1:26:24.489 +decomposition that's why this is not a perfect reconstruction if it was very + +1:26:24.489,1:26:28.060 +sparse in the wavelet domain and perfectly that would be in exactly the + +1:26:28.060,1:26:33.160 +same as the left image and this compressed sensing is based off of the + +1:26:33.160,1:26:36.220 +field of optimization it kind of revitalize a lot of the techniques + +1:26:36.220,1:26:39.550 +people have been using for a long time the way you get this reconstruction is + +1:26:39.550,1:26:45.130 +you solve a little mini optimization problem at every step you every image + +1:26:45.130,1:26:47.830 +you want to reconstruct how many other machines so your machine has to solve an + +1:26:47.830,1:26:51.030 +optimization problem for every image every time it solves this little + +1:26:51.030,1:26:57.340 +quadratic problem with this kind of complicated 
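
The per-image optimization problem mentioned above is typically a least-squares data term plus a sparsity-promoting L1 regularizer, solved with an iterative method. The following is a generic ISTA toy on a random linear system, not the actual MRI pipeline; the matrix sizes, sparsity level and regularization weight are all invented for illustration.

```python
import torch

torch.manual_seed(0)
A = torch.randn(64, 128)                               # under-determined measurement operator
x_true = torch.zeros(128)
x_true[torch.randperm(128)[:10]] = torch.randn(10)     # a sparse ground truth
y = A @ x_true                                         # the measurements

lam = 0.05
step = 1.0 / torch.linalg.svdvals(A)[0] ** 2           # 1 / Lipschitz constant of the data term
x = torch.zeros(128)
for _ in range(500):
    x = x - step * (A.T @ (A @ x - y))                            # gradient step on the quadratic data term
    x = torch.sign(x) * torch.clamp(x.abs() - step * lam, min=0)  # soft-threshold: the L1 prox

print((x - x_true).norm() / x_true.norm())             # relative error; well below 1 in this regime
```
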
regularization term so this + +1:26:57.340,1:27:00.700 +is great for optimization or all these people who had been getting low paid + +1:27:00.700,1:27:03.780 +jobs at universities all of a sudden there of their research was trendy and + +1:27:03.780,1:27:09.370 +corporations needed their help so this is great but we can do better so we + +1:27:09.370,1:27:13.120 +instead of solving this minimization problem at every time step I will use a + +1:27:13.120,1:27:16.960 +neural network so obviously being here arbitrarily to represent the huge in + +1:27:16.960,1:27:24.190 +your network beef a big of course we we hope that we can learn in your network + +1:27:24.190,1:27:28.000 +of such sufficient complexity that it can essentially solve the optimization + +1:27:28.000,1:27:31.240 +problem in one step it just outputs a solution that's as good as the + +1:27:31.240,1:27:35.200 +optimization problem solution now this would have been considered impossible 15 + +1:27:35.200,1:27:39.820 +years ago now we know better so it's actually not very difficult in fact we + +1:27:39.820,1:27:44.980 +can just take an example of we can solve a few of these a few I mean like a few + +1:27:44.980,1:27:48.520 +hundred thousand of these optimization problems take the solution and the input + +1:27:48.520,1:27:53.620 +and we're gonna strain a neural network to map from input to solution that's + +1:27:53.620,1:27:56.830 +actually a little bit suboptimal because we get weakened in some cases we know a + +1:27:56.830,1:28:00.070 +better solution than the solution to the optimization problem we can gather that + +1:28:00.070,1:28:04.780 +by measuring the patient and that's what we actually do in practice so we don't + +1:28:04.780,1:28:07.000 +try and solve the optimization problem we try and get to an even better + +1:28:07.000,1:28:11.260 +solution and this works really well so I'll give you a very simple example of + +1:28:11.260,1:28:14.740 +this so this is what you can do much better than the compressed sensory + +1:28:14.740,1:28:18.580 +reconstruction using a neural network and this network involves the tricks + +1:28:18.580,1:28:23.140 +I've mentioned so it's trained using Adam it uses group norm normalization + +1:28:23.140,1:28:28.690 +layers and convolutional neural networks as you've already been taught and it + +1:28:28.690,1:28:33.970 +uses a technique known as u nets which you may go over later in the course not + +1:28:33.970,1:28:37.390 +sure about that but it's not a very complicated modification of only one it + +1:28:37.390,1:28:40.660 +works as yeah this is the kind of thing you can do and this is this is very + +1:28:40.660,1:28:44.880 +close to practical applications so you'll be seeing these accelerated MRI + +1:28:44.880,1:28:49.750 +scans happening in in clinical practice in only a few years tired this is not + +1:28:49.750,1:28:53.980 +vaporware and yeah that's everything i wanted to talk about you talk about + +1:28:53.980,1:28:58.620 +today optimization and the death of optimization thank you diff --git a/docs/pt/week05/practicum05.sbv b/docs/pt/week05/practicum05.sbv new file mode 100644 index 000000000..72ed0c5f4 --- /dev/null +++ b/docs/pt/week05/practicum05.sbv @@ -0,0 +1,1241 @@ +0:00:00.000,0:00:05.339 +last time we have seen that a matrix can be written basically let me draw here + +0:00:05.339,0:00:12.719 +the matrix so we had similar roles right and then we multiplied usually design by + +0:00:12.719,0:00:18.210 +one one column all right and so whenever we multiply these guys you 
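
Referring back to the MRI discussion that closes the lecture above: the "replace the per-image optimization with a network" idea amounts to ordinary supervised training from the cheap zero-filled reconstruction to a target image. The sketch below uses random tensors and a tiny convolutional model purely as a stand-in for the U-Net mentioned in the lecture; only the ingredients (Adam, group norm, convolutions) follow the transcript.

```python
import torch
from torch import nn

# A deliberately tiny stand-in for the U-Net; real systems are much larger.
model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.GroupNorm(2, 16), nn.ReLU(),
    nn.Conv2d(16, 16, 3, padding=1), nn.GroupNorm(2, 16), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

zero_filled = torch.randn(8, 1, 64, 64)   # network input (random data here)
target = torch.randn(8, 1, 64, 64)        # the reconstruction we want it to produce

for _ in range(10):
    loss = nn.functional.mse_loss(model(zero_filled), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```
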
can see these + +0:00:18.210,0:00:23.340 +and as two types two different equivalent types of representation it + +0:00:23.340,0:00:28.980 +can you see right you don't is it legible okay so you can see basically as + +0:00:28.980,0:00:35.430 +the output of this product has been a sequence of like the first row times + +0:00:35.430,0:00:40.469 +this column vector and then again I'm just okay shrinking them this should be + +0:00:40.469,0:00:46.170 +the same size right right because otherwise you can't multiply them so you + +0:00:46.170,0:00:52.170 +have this one and so on right until the last one and this is gonna be my final + +0:00:52.170,0:01:00.960 +vector and we have seen that each of these bodies here what are these I talk + +0:01:00.960,0:01:05.339 +to me please there's a scalar products right but what + +0:01:05.339,0:01:08.820 +do they represent what is it how can we call it what's another name for calling + +0:01:08.820,0:01:13.290 +a scalar product I show you last time a demonstration with some Chi government + +0:01:13.290,0:01:18.119 +trigonometry right what is it so this is all the projection if you + +0:01:18.119,0:01:22.619 +talk about geometry or you can think about this as a nun normalized cosine + +0:01:22.619,0:01:29.310 +value right so this one is going to be my projection basically of one kernel or + +0:01:29.310,0:01:36.030 +my input signal onto the kernel right so these are projections projection alright + +0:01:36.030,0:01:40.619 +and so then there was also a another interpretation of this like there is + +0:01:40.619,0:01:45.390 +another way of seeing this which was what basically we had the first column + +0:01:45.390,0:01:53.579 +of the matrix a multiplied by the first element of the X of these of this vector + +0:01:53.579,0:01:58.260 +right so back element number one then you had a second call + +0:01:58.260,0:02:04.020 +time's the second element of the X vector until you get to the last column + +0:02:04.020,0:02:11.100 +right times the last an element right suppose that this is long N and this is + +0:02:11.100,0:02:16.110 +M times n right so the height again is going to be the dimension towards we + +0:02:16.110,0:02:19.550 +should - and the width of a matrix is dimension where we're coming from + +0:02:19.550,0:02:24.810 +second part was the following so we said instead of using this matrix here + +0:02:24.810,0:02:29.450 +instead since we are doing convolutions because we'd like to exploit sparsity a + +0:02:29.450,0:02:35.400 +stationarity and compositionality of the data we still use the same matrix here + +0:02:35.400,0:02:41.370 +perhaps right we use the same guy here but then those kernels we are going to + +0:02:41.370,0:02:45.510 +be using them over and over again the same current across the whole signal + +0:02:45.510,0:02:51.360 +right so in this case the width of this matrix is no longer be it's no longer n + +0:02:51.360,0:02:56.820 +as it was here is going to be K which is gonna be the kernel size right so here + +0:02:56.820,0:03:03.090 +I'm gonna be drawing my thinner matrix and this one is gonna be K lowercase K + +0:03:03.090,0:03:10.140 +and the height maybe we can still call it n okay all right so let's say here I + +0:03:10.140,0:03:18.230 +have several kernels for example let me have my tsiyon carnal then I may have my + +0:03:18.230,0:03:25.080 +other non green let me change let's put pink so you have this one and + +0:03:25.080,0:03:33.180 +then you may have green one right and so on so how do we use these kernels right + 
+0:03:33.180,0:03:38.280 +now so we basically can use these kernels by stacking them and shifted + +0:03:38.280,0:03:43.650 +them a little bit right so we get the first kernel out of here and then you're + +0:03:43.650,0:03:50.519 +gonna get basically you get the first guy here then you shift it shift it + +0:03:50.519,0:03:58.290 +shift it and so on right until you get the whole matrix and we were putting a 0 + +0:03:58.290,0:04:02.100 +here and a 0 here right this is just recap and then you have this one for the + +0:04:02.100,0:04:11.379 +blue color now you do magic here and just do copy copy and I you do paste + +0:04:11.379,0:04:19.370 +and now you can also do color see fantastic magic and we have pink one and + +0:04:19.370,0:04:25.360 +then you have the last one right can I do the same copy yes I can do fantastic + +0:04:25.360,0:04:29.080 +so you cannot do copy and paste on the paper + +0:04:29.080,0:04:38.419 +all right color and the last one light green okay all right so we just + +0:04:38.419,0:04:44.479 +duplicate how many matrices do we have now how many layers no don't count the + +0:04:44.479,0:04:50.600 +number like there are letters on the on the screen and K or M what is it what is + +0:04:50.600,0:05:00.620 +K the side usually you're just guessing you shouldn't be guessing you should + +0:05:00.620,0:05:07.120 +tell me the correct answer I think about this as a job interview I'm training you + +0:05:07.120,0:05:14.990 +so how many maps we have and right so this one here are as many as my M which + +0:05:14.990,0:05:21.470 +is the number of rows of this initial thing over here right all right so what + +0:05:21.470,0:05:30.289 +is instead the width of this little kernel here okay right okay what is the + +0:05:30.289,0:05:41.349 +height of this matrix what is the height of the matrix + +0:05:42.340,0:05:45.480 +you sure try again + +0:05:49.220,0:06:04.310 +I can't hear and minus k plus one okay and the final what is the output of this + +0:06:04.310,0:06:08.660 +thing right so the output is going to be one vector which is gonna be of height + +0:06:08.660,0:06:19.430 +the same right and minus k plus 1 and then it should be correct yeah but then + +0:06:19.430,0:06:27.890 +how many what is the thickness of this final vector M right so this stuff here + +0:06:27.890,0:06:35.600 +and goes as thick as M right so this is where we left last time right but then + +0:06:35.600,0:06:39.770 +someone asked me now then I realized so we have here as many as the different + +0:06:39.770,0:06:45.170 +colors right so for example in this case if I just draw to make sure we + +0:06:45.170,0:06:49.730 +understand what's going on you have the first thing here now you have the second + +0:06:49.730,0:06:55.600 +one here and I have the third one right in this case all right so last time they + +0:06:59.750,0:07:03.650 +asked me if someone asked me at the end of the class so how do we do convolution + +0:07:03.650,0:07:09.760 +when we end up in this situation over here because here we assume that my + +0:07:09.760,0:07:14.990 +corners are just you know whatever K long let's say three long but then they + +0:07:14.990,0:07:21.380 +are just one little vector right and so somebody told me no then what do you do + +0:07:21.380,0:07:24.950 +from here like how do we keep going because now we have a thickness before + +0:07:24.950,0:07:32.510 +we started with a something here this vector which had just n elements right + +0:07:32.510,0:07:35.690 +are you following so far I'm going faster because we 
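
The matrix of shifted kernels drawn above can be checked numerically: building the (n − k + 1) × n banded matrix row by row and multiplying reproduces what `nn.Conv1d` computes (PyTorch's convolution is really a correlation, so the kernel goes into the rows as-is, without flipping). Sizes below are arbitrary.

```python
import torch
from torch import nn

n, k = 8, 3
x = torch.randn(n)
w = torch.randn(k)

# (n - k + 1) x n matrix whose rows are the kernel shifted one step at a time.
A = torch.zeros(n - k + 1, n)
for i in range(n - k + 1):
    A[i, i:i + k] = w

conv = nn.Conv1d(1, 1, kernel_size=k, bias=False)
with torch.no_grad():
    conv.weight.copy_(w.view(1, 1, k))

print(torch.allclose(A @ x, conv(x.view(1, 1, n)).view(-1), atol=1e-5))   # True
```
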
already seen these + +0:07:35.690,0:07:44.030 +things I'm just reviewing but are you with me until now yes no yes okay + +0:07:44.030,0:07:47.720 +fantastic so let's see how we actually keep going so the thing is + +0:07:47.720,0:07:51.680 +show you right now is actually assuming that we start with that long vector + +0:07:51.680,0:08:01.400 +which was of height what was the height and right but in this case also this one + +0:08:01.400,0:08:13.060 +means that we have something that looks like this and so you have basically here + +0:08:13.060,0:08:20.720 +this is 1 this is also 1 so we only have a monophonic signal for example and this + +0:08:20.720,0:08:26.300 +was n the height right all right so let's assume now we're using a + +0:08:26.300,0:08:33.950 +stereophonic system so what is gonna be my domain here so you know my X can be + +0:08:33.950,0:08:39.740 +thought as a function that goes from the domain to the ℝ^{number of channels} so + +0:08:39.740,0:08:47.840 +what is this guy here yeah x is one dimension and somewhere so what is this + +0:08:47.840,0:08:59.930 +Ω we have seen this slide last slide of Tuesday lesson right second Ω is + +0:08:59.930,0:09:11.720 +not set of real numbers no someone else tries we are using computers it's time + +0:09:11.720,0:09:16.520 +line yes and how many samples you you have one sample number sample number two + +0:09:16.520,0:09:21.710 +or sample number three so you have basically a subset of the natural space + +0:09:21.710,0:09:30.860 +right so this one is going to be something like 0 1 2 so on set which is + +0:09:30.860,0:09:36.410 +gonna be subset of ℕ right so it's not ℝ. ℝ is gonna be if you have time + +0:09:36.410,0:09:45.850 +continuous domain what you see in this case the in the case I just showed you + +0:09:45.850,0:09:55.160 +so far what is seen in this case now number of input channels because this is + +0:09:55.160,0:10:00.740 +going to be my X right this is my input so in this case we show so far in this + +0:10:00.740,0:10:07.220 +case here we were just using one so it means we have a monophonic audio let's + +0:10:07.220,0:10:10.880 +seven now the assumption make the assumption that this guy is that it's + +0:10:10.880,0:10:22.780 +gonna be two such that you're gonna be talking about stereo phonic signal right + +0:10:23.200,0:10:27.380 +okay so let's see how this stuff changes so + +0:10:27.380,0:10:38.450 +in this case my let me think yeah so how do I draw I'm gonna just draw right + +0:10:38.450,0:10:43.400 +little complain if you don't follow are you following so far yes because if + +0:10:43.400,0:10:46.550 +i watch my tablet I don't see you right so you should be complaining if + +0:10:46.550,0:10:50.750 +something doesn't make sense right otherwise becomes boring from waiting + +0:10:50.750,0:10:56.390 +and watching you all the time right yes no yes okay I'm boring okay + +0:10:56.390,0:11:00.080 +thank you all right so we have here this signal + +0:11:00.080,0:11:07.280 +right and then now we have some thickness in this case what is the + +0:11:07.280,0:11:14.660 +thickness of this guy see right so in this case this one is going to be C and + +0:11:14.660,0:11:18.589 +in the case of the stereophonic signal you're gonna just have two channels left + +0:11:18.589,0:11:30.170 +and right and this one keeps going down right all right so our kernels if I'd + +0:11:30.170,0:11:35.030 +like to perform a convolution over this signal right so you have different same + +0:11:35.030,0:11:44.150 +pussy right and so on right if I'd 
like to perform a convolution one big + +0:11:44.150,0:11:47.089 +convolution I'm not talking about two deconvolution right because they are + +0:11:47.089,0:11:52.670 +still using domain which is here number one right so this is actually important + +0:11:52.670,0:11:58.510 +so if I ask you what type of signal this is you're gonna be basically + +0:11:58.510,0:12:02.890 +you have to look at this number over here right so we are talking about one + +0:12:02.890,0:12:12.490 +dimensional signal which is one dimensional domain right 1d domain okay + +0:12:12.490,0:12:17.710 +so we are still using a 1d signal but in this case it has you know you have two + +0:12:17.710,0:12:25.750 +values per point so what kind of kernels are we gonna be using so I'm gonna just + +0:12:25.750,0:12:31.450 +draw it in this case we're gonna be using something similar like this so I'm + +0:12:31.450,0:12:37.990 +gonna be drawing this guy let's say I have K here which is gonna be my width + +0:12:37.990,0:12:42.700 +of the kernel but in this case I'm gonna be also have some thickness in this case + +0:12:42.700,0:12:56.230 +here right so basically you apply this thing here okay and then you can go + +0:12:56.230,0:13:04.060 +second line and third line and so on right so you may still have like here m + +0:13:04.060,0:13:11.590 +kernels but in this case you also have some thickness which has to match the + +0:13:11.590,0:13:17.680 +other thickness right so this thickness here has to match the thickness of the + +0:13:17.680,0:13:23.980 +input size so let me show you how to apply the convolution so you're gonna + +0:13:23.980,0:13:37.980 +get one of these slices here and then you're gonna be applying this over here + +0:13:39.320,0:13:46.190 +okay and then you simply go down this way + +0:13:46.190,0:13:53.870 +alright so whenever you apply these you perform this guy here the inner product + +0:13:53.870,0:14:04.410 +with these over here what you get it's actually a one by one is a scalar so + +0:14:04.410,0:14:09.540 +whenever I use this orange thingy here on the left hand side and I do a dot + +0:14:09.540,0:14:14.190 +product scalar product with this one I just get a scalar so this is actually my + +0:14:14.190,0:14:19.620 +convolution in 1d the convolution in 1d means that it goes down this way and + +0:14:19.620,0:14:27.480 +only in one way that's why it's called 1d but we multiply each element of this + +0:14:27.480,0:14:36.290 +mask times this guy here now a second row and this guy here okay + +0:14:36.290,0:14:41.090 +you saw you multiply all of them you sum all of them and then you get your first + +0:14:41.090,0:14:47.250 +output here okay so whenever I make this multiplication I get my first output + +0:14:47.250,0:14:52.050 +here then I keep sliding this kernel down and then you're gonna get the + +0:14:52.050,0:14:58.380 +second output third out fourth and so on until you go down at the end then what + +0:14:58.380,0:15:03.780 +happens then happens that I'm gonna be picking up different kernel I'm gonna + +0:15:03.780,0:15:07.950 +back it let's say I get the third one okay let's get the second one I get a + +0:15:07.950,0:15:19.050 +second one and I perform the same operation you're gonna get here this one + +0:15:19.050,0:15:23.240 +actually let's actually make it like a matrix + +0:15:26.940,0:15:33.790 +you go down okay until you go with the last one which is gonna be the end right + +0:15:33.790,0:15:45.450 +the empty kernel which is gonna be going down this way you get the last one here + 
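
One way to confirm the picture above, with arbitrary sizes: each output sample of a multi-channel 1-D convolution is just the dot product between one kernel (whose thickness equals the number of input channels) and the input patch it currently covers.

```python
import torch
from torch import nn

c, m, k, n = 2, 16, 3, 10          # input channels, kernels, kernel size, samples
conv = nn.Conv1d(c, m, k, bias=False)
x = torch.randn(1, c, n)

y = conv(x)                               # shape (1, m, n - k + 1)
patch = x[0, :, 0:k]                      # the first c-by-k slice of the signal
manual = (conv.weight[5] * patch).sum()   # kernel number 5 laid over that slice

print(torch.allclose(manual, y[0, 5, 0], atol=1e-6))   # True
```
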
+0:15:51.680,0:15:58.790 +okay yes no confusing clearing so this was the question I got at the end of the + +0:15:58.790,0:16:10.339 +class yeah Suzy yeah because it's a dot product of all those values between so + +0:16:10.339,0:16:18.259 +basically do the projection of this part of the signal onto this kernel so you'd + +0:16:18.259,0:16:22.879 +like to see what is the contribution like what is the alignment of this part + +0:16:22.879,0:16:27.350 +of the signal on to this specific subspace okay this is how a convolution + +0:16:27.350,0:16:31.850 +works when you have multiple channels so far I'll show you just with single + +0:16:31.850,0:16:35.319 +channel now we have multiple channels okay so oh yeah yeah in one second one + +0:16:54.259,0:16:59.509 +and one one at the top one at the bottom so you actually lose the first row here + +0:16:59.509,0:17:04.850 +and you lose the last row here so at the end in this case the output is going to + +0:17:04.850,0:17:10.490 +be n minus three plus one so you lose two one on top okay in this case you + +0:17:10.490,0:17:15.140 +lose two at the bottom if you actually do a Center at the center the + +0:17:15.140,0:17:20.390 +convolution usually you lose one at the beginning one at the end every time you + +0:17:20.390,0:17:24.409 +perform a convolution you lose the number of the dimension of the kernel + +0:17:24.409,0:17:28.789 +minus one you can try if you put your hand like this you have a kernel of + +0:17:28.789,0:17:34.340 +three you get the first one here and it is matching then you switch one and then + +0:17:34.340,0:17:39.440 +you switch to right so okay with fight let's tell a parent of two right so you + +0:17:39.440,0:17:44.149 +have your signal of five you have your kernel with two you have one two three + +0:17:44.149,0:17:49.070 +and four so we started with five and you end up with four because you use a + +0:17:49.070,0:17:54.500 +kernel size of two if you use a kernel size of three you get one two and three + +0:17:54.500,0:17:57.289 +so you goes to if you use a kernel size of three okay + +0:17:57.289,0:18:01.010 +so you can always try to do this alright so I'm gonna show you now the + +0:18:01.010,0:18:07.040 +dimensions of these kernels and the outputs with PyTorch okay Yes No + +0:18:07.040,0:18:18.500 +all right good okay mister can you see anything + +0:18:18.500,0:18:25.520 +yes right I mean zoom a little bit more okay so now we can go we do + +0:18:25.520,0:18:33.770 +conda activate pDL, pytorch Deep Learning. 
+ +0:18:33.770,0:18:40.520 +So here we can just run ipython if i press ctrl L I clear the screen and + +0:18:40.520,0:18:49.820 +we can do import torch then I can do from torch import nn so now we can see + +0:18:49.820,0:18:54.500 +for example called let's set my convolutional convolutional layer it's + +0:18:54.500,0:18:59.930 +going to be equal to NN conf and then I can keep going until I get + +0:18:59.930,0:19:04.220 +this one let's say yeah let's say I have no idea how to use this function I just + +0:19:04.220,0:19:08.750 +put a question mark I press ENTER and I'm gonna see here now the documentation + +0:19:08.750,0:19:13.460 +okay so in this case you're gonna have the first item is going to be the input + +0:19:13.460,0:19:19.820 +channel then I have the output channels then I have the corner sighs alright so + +0:19:19.820,0:19:24.290 +for example we are going to be putting here input channels we have a stereo + +0:19:24.290,0:19:30.530 +signal so we put two channels the number of corners we said that was M and let's + +0:19:30.530,0:19:36.650 +say we have 16 kernels so this is the number of kernels I'm gonna be using and + +0:19:36.650,0:19:41.810 +then let's have our kernel size of what the same I use here so let's have K or + +0:19:41.810,0:19:47.570 +the kernel size equal 3 okay in so here I'm going to define my first convolution + +0:19:47.570,0:19:52.910 +object so if I print this one comes you're gonna see we have a convolution a + +0:19:52.910,0:19:57.580 +2d combo sorry 1 deconvolution made that okay so we have a 1d convolution + +0:20:02.149,0:20:08.869 +which is going from two channels so a stereophonic to a sixteen channels means + +0:20:08.869,0:20:16.039 +I use sixteen kernels the skirmish size is 3 and then the stride is also 1 ok so + +0:20:16.039,0:20:23.859 +in this case I'm gonna be checking what is gonna be my convolutional weights + +0:20:27.429,0:20:33.379 +what is the size of the weights how many weights do we have how many how + +0:20:33.379,0:20:40.069 +many planes do we have for the weights 16 right so we have 16 weights what is + +0:20:40.069,0:20:53.649 +the length of the the day of the key of D of the kernel okay Oh what is this - + +0:20:54.549,0:21:00.349 +Janis right so I have 16 of these scanners which have thickness - and then + +0:21:00.349,0:21:05.539 +length of 3 ok makes sense right because you're gonna be applying each of these + +0:21:05.539,0:21:11.629 +16 across the whole signal so let's have my signal now you're gonna be is gonna + +0:21:11.629,0:21:20.599 +be equal toage dot R and and and oh sighs I don't know let's say 64 I also + +0:21:20.599,0:21:25.129 +have to say I have a batch of size 1 so I have a virtual site one so I just have + +0:21:25.129,0:21:31.879 +one signal and then this is gonna be 64 how many channels we said this has two + +0:21:31.879,0:21:37.819 +right so I have one signal one example which has two channels and has 64 + +0:21:37.819,0:21:46.689 +samples so this is my X hold on what is the convolutional bias size + +0:21:48.320,0:21:54.380 +a 16 right because you have one bias / plain / / / way ok so what's gonna be in + +0:21:54.380,0:22:07.539 +our my convolution of X the output hello so I'm gonna still have one sample right + +0:22:07.539,0:22:15.919 +how many channels 16 what is gonna be the length of the signal okay that's + +0:22:15.919,0:22:22.700 +good 6 fix it okay fantastic all right so what if I'm gonna be using + +0:22:22.700,0:22:32.240 +a convolution with size of the kernel 5 what do I get now yet to 
shout I can't + +0:22:32.240,0:22:36.320 +hear you 60 okay you're following fantastic okay + +0:22:36.320,0:22:44.059 +so let's try now instead to use a hyper spectral image with a 2d convolution + +0:22:44.059,0:22:49.100 +okay so I'm going to be coding now my convolution here is going to be my in + +0:22:49.100,0:22:55.490 +this case is correct or is going to be a conf come to D again I don't know how to + +0:22:55.490,0:22:59.059 +use it so I put a question mark and then I have here input channel output channel + +0:22:59.059,0:23:05.450 +criticize strident padding okay so I'm going to be putting inputs tried input + +0:23:05.450,0:23:10.429 +channel so it's a hyper spectral image with 20 planes so what's gonna be the + +0:23:10.429,0:23:16.149 +input in this case 20 right because you have you start from 20 spectral bands + +0:23:16.149,0:23:20.419 +then we're gonna be inputting the output number of channels we let's say we're + +0:23:20.419,0:23:25.330 +gonna be using again 16 in this case I'm going to be inputting the kernel size + +0:23:25.330,0:23:33.440 +since I'm planning to use okay let's actually define let's actually define my + +0:23:33.440,0:23:40.120 +signal first so my X is gonna be a torch dot R and and let's say one sample with + +0:23:40.120,0:23:52.820 +20 channels of height for example I guess 6128 well hold on 64 and then with + +0:23:52.820,0:23:58.820 +128 okay so this is gonna be my my input my eople data okay + +0:23:58.820,0:24:04.370 +so my convolution now it can be something like this so I have 20 + +0:24:04.370,0:24:09.110 +channels from input 16 our Mike Ernest I'm gonna be using then I'm gonna be + +0:24:09.110,0:24:15.050 +specifying the kernel size in this case let's use something that is like three + +0:24:15.050,0:24:24.580 +times five okay so what is going to be the output what are the kernel size + +0:24:29.170,0:24:47.630 +anyone yes no what no 20 Janice is the channels of the input data right so you + +0:24:47.630,0:24:51.680 +have how many kernels here 16 right there you go + +0:24:51.680,0:24:56.420 +we have 16 kernels which have 20 channels such that they can lay over the + +0:24:56.420,0:25:03.410 +input 3 by 5 right teeny like a short like yeah short but large ok so what is + +0:25:03.410,0:25:08.140 +gonna be my conv(x).size ? [1, 16, 62, 124]. 
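
For reference, a cleaned-up version of the interactive Conv1d session shown above (stereo input, 16 kernels), with the sizes that come out of each call:

```python
import torch
from torch import nn

conv = nn.Conv1d(2, 16, 3)        # 2 input channels (stereo), 16 kernels, kernel size 3
print(conv.weight.size())         # torch.Size([16, 2, 3])
print(conv.bias.size())           # torch.Size([16])  -- one bias per kernel

x = torch.randn(1, 2, 64)         # batch of 1, 2 channels, 64 samples
print(conv(x).size())             # torch.Size([1, 16, 62]) = 64 - 3 + 1

conv5 = nn.Conv1d(2, 16, 5)
print(conv5(x).size())            # torch.Size([1, 16, 60])
```
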
Let's say I'd like to + +0:25:16.310,0:25:22.190 +actually add back the I'd like to head the sing dimensionality I can add some + +0:25:22.190,0:25:25.730 +padding right so here there is going to be the stride I'm gonna have a stride of + +0:25:25.730,0:25:29.930 +1 again if you don't remember the the syntax you can just put the question + +0:25:29.930,0:25:35.120 +mark can you figure out and then how much strive should I add now how much + +0:25:35.120,0:25:41.870 +stride in the y-direction sorry yes how much padding should I add in the + +0:25:41.870,0:25:46.490 +y-direction one because it's gonna be one on top one on the bottom but then + +0:25:46.490,0:25:51.890 +then on the x-direction okay you know you're following fantastic and so now if + +0:25:51.890,0:25:57.320 +I just run this one you wanna get the initial size okay so now you have both + +0:25:57.320,0:26:05.500 +1d and 2d the point is that what is the dimension of a convolutional kernel and + +0:26:05.500,0:26:12.470 +symbol for to the dimensional signal again I repeat what is the + +0:26:12.470,0:26:20.049 +dimensionality of the collection of careness use for two-dimensional data + +0:26:20.860,0:26:27.679 +again for right so four is gonna be the number of dimensions that are required + +0:26:27.679,0:26:35.659 +to store the collection of kernels when you perform 2d convolutions the one is + +0:26:35.659,0:26:40.370 +going to be the stride so if you don't know how this works you just put a + +0:26:40.370,0:26:44.000 +question mark and gonna tell you here so stride is gonna be telling you you + +0:26:44.000,0:26:50.929 +stride off you move every time the kernel by one if you are the first one + +0:26:50.929,0:26:55.460 +means you only is the batch size so torch expects you to always use batches + +0:26:55.460,0:27:00.110 +meaning how many signals you're using just one right so that our expectation + +0:27:00.110,0:27:04.549 +if you send an input vector which is going to be input tensor which has + +0:27:04.549,0:27:12.289 +dimension three is gonna be breaking and complain okay so we have still some time + +0:27:12.289,0:27:18.049 +to go in the second part all right second part is going to be so you've + +0:27:18.049,0:27:23.779 +been computing some derivatives right for the first homework right so the + +0:27:23.779,0:27:31.909 +following homework maybe you have to do you have to compute this one okay you're + +0:27:31.909,0:27:35.510 +supposed to be laughing it's a joke okay there you go + +0:27:35.510,0:27:43.340 +fantastic so this is what you can wrote back in the 90s for the computation of + +0:27:43.340,0:27:50.029 +the gradients of the of the lsdm which are gonna be covered I guess in next + +0:27:50.029,0:27:54.950 +next lesson so how somehow so they had to still do these things right it's kind + +0:27:54.950,0:28:00.769 +of crazy nevertheless we can use PyTorch to have automatic computation of these + +0:28:00.769,0:28:06.500 +gradients so we can go and check out how these automatic gradient works + +0:28:06.500,0:28:12.159 +okay all right so all right so we are going to be going + +0:28:23.090,0:28:28.490 +now to the notebook number three which is the yeah + +0:28:28.490,0:28:33.590 +invisible let me see if I can highlight it now it's even worse okay number three + +0:28:33.590,0:28:41.619 +Auto gratitute Oriole okay let me go fullscreen + +0:28:41.619,0:28:53.029 +okay so out of our tutorial was gonna be here here just create my tensor which + +0:28:53.029,0:28:57.499 +has as well these required gradients equal true 
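
And the 2-D counterpart from the session above: hyperspectral input with 20 bands and a 3×5 kernel. Note that the collection of kernels is a 4-dimensional tensor, and padding of (1, 2) puts back the rows and columns the kernel eats up.

```python
import torch
from torch import nn

x = torch.randn(1, 20, 64, 128)                  # 1 image, 20 spectral bands, 64 x 128

conv = nn.Conv2d(20, 16, kernel_size=(3, 5))
print(conv.weight.size())                        # torch.Size([16, 20, 3, 5]) -- 4 dimensions
print(conv(x).size())                            # torch.Size([1, 16, 62, 124])

conv_pad = nn.Conv2d(20, 16, kernel_size=(3, 5), stride=1, padding=(1, 2))
print(conv_pad(x).size())                        # torch.Size([1, 16, 64, 128])
```
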
in this case I mean asking + +0:28:57.499,0:29:02.539 +torch please track all the gradient computations did it got the competition + +0:29:02.539,0:29:07.749 +over the tensor such that we can perform computation of partial derivatives okay + +0:29:07.749,0:29:13.279 +in this case I'm gonna have my Y is going to be so X is simply gonna be one + +0:29:13.279,0:29:20.419 +two three four the Y is going to be X subtracted number two okay alright so + +0:29:20.419,0:29:26.869 +now we can notice that there is this grad F n grad f NN FN function here so + +0:29:26.869,0:29:32.059 +let's see what this stuff is we go sit there and see oh this is a sub backward + +0:29:32.059,0:29:37.629 +what is it meaning that the Y has been generated by a module which performs the + +0:29:37.629,0:29:43.669 +subtraction between X and and - right so you have X minus 2 therefore if you + +0:29:43.669,0:29:51.860 +check who generated Y well there's a sub a subtraction module ok so what's gonna + +0:29:51.860,0:30:01.009 +be now the God function of X you're supposed to answer oh okay + +0:30:01.009,0:30:03.580 +why is none because they should have written there + +0:30:07.580,0:30:12.020 +Alfredo generated that right okay all right none is fine as well + +0:30:12.020,0:30:17.000 +okay so let's actually put our nose inside we were here we can actually + +0:30:17.000,0:30:23.770 +access the first element you have the accumulation why is the accumulation I + +0:30:25.090,0:30:29.830 +don't know I forgot but then if you go inside there you're gonna see the + +0:30:29.830,0:30:34.760 +initial vector the initial tensor we are using is the one two three four okay so + +0:30:34.760,0:30:41.390 +inside this computational graph you can also find the original tensor okay all + +0:30:41.390,0:30:46.880 +right so let's now get the Z and inside is gonna be my Y square times three and + +0:30:46.880,0:30:51.620 +then I compute my average a it's gonna be the mean of Z right so if I compute + +0:30:51.620,0:30:56.330 +the square of this thing here and I multiply by three and I take the average + +0:30:56.330,0:31:00.500 +so this is the square part times 3 and then this is the average okay so you can + +0:31:00.500,0:31:06.200 +try if you don't believe me all right so let's see how this thing looks like so + +0:31:06.200,0:31:10.549 +I'm gonna be promoting here all these sequence of computations so we started + +0:31:10.549,0:31:16.669 +by from a two by two matrix what was this guy here to buy - who is this X + +0:31:16.669,0:31:22.399 +okay you're following it cool then we subtracted - right and then we + +0:31:22.399,0:31:27.440 +multiplied by Y twice right that's why you have to ro so you get the same + +0:31:27.440,0:31:31.669 +subtraction that is the whyatt the X minus 2 multiplied by itself then + +0:31:31.669,0:31:36.649 +you have another multiplication what is this okay multiply by three and then you + +0:31:36.649,0:31:42.980 +have the final the mean backward because this Y is green because it's mean no + +0:31:42.980,0:31:51.140 +okay yeah thank you for laughing okay so I compute back prop right + +0:31:51.140,0:31:59.409 +what does backdrop do what does this line do + +0:32:00.360,0:32:08.610 +I want to hear everyone you know already we compute what radians right so black + +0:32:08.610,0:32:11.580 +propagation is how you compute the gradients how do we train your networks + +0:32:11.580,0:32:20.730 +with gradients ain't right or whatever Aaron said yesterday back + +0:32:20.730,0:32:27.000 +propagation is that is used for 
computing the gradient completely + +0:32:27.000,0:32:29.970 +different things okay please keep them separate don't merge + +0:32:29.970,0:32:34.559 +them everyone after a bit that don't they don't see me those two things keep + +0:32:34.559,0:32:43.740 +colliding into one mushy thought don't it's painful okay she'll compute the + +0:32:43.740,0:32:51.659 +gradients right so guess what we are computing some gradients now okay so we + +0:32:51.659,0:33:02.580 +go on your page it's going to be what what was a it was the average right so + +0:33:02.580,0:33:10.529 +this is 1/4 right the summation of all those zᵢ + +0:33:10.529,0:33:17.460 +what so I goes from 1 to 4 okay so what is that I said I is going + +0:33:17.460,0:33:27.539 +to be equal to 3yᵢ² right yeah no questions no okay all right and then + +0:33:27.539,0:33:36.840 +this one is was equal to 3(x-2)² right so a what does it belong + +0:33:36.840,0:33:38.899 +to where does a belong to what is the ℝ + +0:33:44.279,0:33:51.200 +right so it's a scaler okay all right so now we can compute ∂a/∂x. + +0:33:51.200,0:33:58.110 +So how much is this stuff you're gonna have 1/4 comes out forum here and + +0:33:58.110,0:34:03.090 +then you have you know let's have this one with respect to the xᵢ element + +0:34:03.090,0:34:09.179 +okay so we're gonna have this one zᵢ inside is that, I have the 3yᵢ², + +0:34:09.179,0:34:15.899 +and it's gonna be 3(xᵢ- 2)². Right so these three comes + +0:34:15.899,0:34:22.080 +out here the two comes down as well and then you multiply by (xᵢ – 2). + +0:34:22.080,0:34:33.260 +So far should be correct okay fantastic all right so my X was this element here + +0:34:33.589,0:34:38.190 +actually let me compute as well this one so this one goes away this one becomes + +0:34:38.190,0:34:47.690 +true this is 1.5 times xᵢ – 3. Right - 2 - 3 + +0:34:55.159,0:35:06.780 +ok mathematics okay okay thank you all right. So what's gonna be ∂a/∂x ? 
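
The derivation above, a = ¼ Σᵢ 3(xᵢ − 2)², gives ∂a/∂xᵢ = 1.5 xᵢ − 3, and autograd reproduces it exactly on the notebook's example tensor:

```python
import torch

x = torch.tensor([[1., 2.], [3., 4.]], requires_grad=True)
y = x - 2                 # grad_fn is SubBackward0
z = 3 * y ** 2
a = z.mean()              # a = (1/4) * sum of 3 * (x_i - 2)^2

a.backward()
print(x.grad)             # tensor([[-1.5000, 0.0000], [1.5000, 3.0000]]) = 1.5 * x - 3
```
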
+ +0:35:06.780,0:35:11.339 +I'm actually writing the transpose directly here so for the first element + +0:35:11.339,0:35:18.859 +you have one you have one times 1.5 so 1.5 minus 3 you get 1 minus 1.5 right + +0:35:18.859,0:35:23.670 +second one is going to be 3 minus 3 you get 0 Ryan this is 3 minus 3 + +0:35:23.670,0:35:27.420 +maybe I should write everything right so you're actually following so you have + +0:35:27.420,0:35:37.589 +1.5 minus 3 now you have 3 minus 3 below you have 4 point 5 minus 3 and then the + +0:35:37.589,0:35:47.160 +last one is going to be 6 minus 3 which is going to be equal to minus 1 point 5 + +0:35:47.160,0:35:59.789 +0 1 point 5 and then 3 right you agree ok let me just write this on here + +0:35:59.789,0:36:06.149 +okay just remember so we have you be computed the backpropagation here I'm + +0:36:06.149,0:36:14.609 +gonna just bring it to the gradients and then the right it's the same stuff we + +0:36:14.609,0:36:27.630 +got here right such that I don't have to transpose it here whenever you perform + +0:36:27.630,0:36:33.209 +the partial derivative in PyTorch you get the same the same shape is the input + +0:36:33.209,0:36:37.469 +dimension so if you have a weight whatever dimension then when you compute + +0:36:37.469,0:36:41.069 +the partial you still have the same dimension they don't swap they don't + +0:36:41.069,0:36:44.789 +turn okay they just use this for practicality at the correct version I + +0:36:44.789,0:36:49.919 +mean the the gradient should be the transpose of that thing sorry did + +0:36:49.919,0:36:54.479 +Jacobian which is the transpose of the gradient right if it's a vector but this + +0:36:54.479,0:37:08.130 +is a tensor so whatever we just used the same same shape thing no so this one + +0:37:08.130,0:37:13.639 +should be a flipping I believe maybe I'm wrong but I don't think all right so + +0:37:13.639,0:37:19.919 +this is like basic these basic PyTorch now you can do crazy stuff because we + +0:37:19.919,0:37:23.609 +like crazy right I mean I do I think if you like me you + +0:37:23.609,0:37:29.669 +like crazy right okay so here I just create my + +0:37:29.669,0:37:34.259 +vector X which is going to be a three dimensional well a one-dimensional + +0:37:34.259,0:37:43.769 +tensor of three items I'm going to be multiplying X by two then I call this + +0:37:43.769,0:37:49.859 +one Y then I start my counter to zero and then until the norm of the Y is long + +0:37:49.859,0:37:56.699 +thousand below thousand I keep doubling Y okay and so you can get like a dynamic + +0:37:56.699,0:38:01.529 +graph right the graph is base is conditional to the actual random + +0:38:01.529,0:38:04.979 +initialization which you can't even tell because I didn't even use a seed so + +0:38:04.979,0:38:08.999 +everyone that is running this stuff is gonna get different numbers so these are + +0:38:08.999,0:38:11.910 +the final values of the why can you tell me + +0:38:11.910,0:38:23.549 +how many iterations we run so the mean of this stuff is actually lower than a + +0:38:23.549,0:38:27.630 +thousand yeah but then I'm asking whether you know how many times this + +0:38:27.630,0:38:41.119 +loop went through no good why it's random Rises you know it's bad question + +0:38:41.119,0:38:45.539 +about bad questions next time I have a something for you okay so I'm gonna be + +0:38:45.539,0:38:51.569 +printing this one now I'm telling you the grabbed are 2048 right + +0:38:51.569,0:38:55.589 +just check the central one for the moment right this is the actual gradient + 
+0:38:55.589,0:39:04.739 +so can you tell me now how many times the loop went on so someone said 11 how + +0:39:04.739,0:39:14.420 +many ends up for 11 okay for people just roast their hands what about the others + +0:39:14.809,0:39:17.809 +21 okay any other guys 11 10 + +0:39:25.529,0:39:30.749 +okay we have actually someone that has the right solution and this loop went on + +0:39:30.749,0:39:35.759 +for 10 times why is that because you have the first multiplication by 2 here + +0:39:35.759,0:39:40.589 +and then loop goes on over and over and multiplies by 2 right so the final + +0:39:40.589,0:39:45.239 +number is gonna be the least number of iterations in the loop plus the + +0:39:45.239,0:39:50.779 +additional like addition and multiplication outside right yes no + +0:39:50.779,0:39:56.670 +you're sleeping maybe okay I told you not to eat before class otherwise you + +0:39:56.670,0:40:05.009 +get groggy okay so inference this is cool so here I'm gonna be just having + +0:40:05.009,0:40:09.420 +both my X & Y we are gonna just do linear regression right linear or + +0:40:09.420,0:40:17.670 +whatever think the add operator is just the scalar product okay so both the X + +0:40:17.670,0:40:21.589 +and W has have the requires gradient equal to true + +0:40:21.589,0:40:27.119 +being this means we are going to be keeping track of the the gradients and + +0:40:27.119,0:40:31.290 +the computational graph so if I execute this one you're gonna get the partial + +0:40:31.290,0:40:37.710 +derivatives of the inner product with respect to the Z with respect to the + +0:40:37.710,0:40:43.920 +input is gonna be the weights right so in the range is the input right and the + +0:40:43.920,0:40:47.160 +ones are the weights so partial derivative with respect to the input is + +0:40:47.160,0:40:50.070 +gonna be the weights partial with respect to the weights are gonna be the + +0:40:50.070,0:40:56.670 +input right yes no yes okay now I just you know usually it's this one is the + +0:40:56.670,0:41:00.359 +case I just have required gradients for my parameters because I'm gonna be using + +0:41:00.359,0:41:06.030 +the gradients for updating later on the the parameters of the mother is so in + +0:41:06.030,0:41:12.300 +this case you get none let's have in this case instead what I usually do + +0:41:12.300,0:41:17.250 +wanna do inference when I do inference I tell torch a torch stop tracking any + +0:41:17.250,0:41:22.950 +kind of operation so I say torch no God please so this one regardless of whether + +0:41:22.950,0:41:28.859 +your input always have the required grass true or false whatever when I say + +0:41:28.859,0:41:35.060 +torch no brats you do not have any computation a graph taken care of right + +0:41:35.060,0:41:41.130 +therefore if I try to run back propagation on a tensor which was + +0:41:41.130,0:41:46.320 +generated from like doesn't have actually you know graph because this one + +0:41:46.320,0:41:50.940 +doesn't have a graph you're gonna get an error okay so if I run this one you get + +0:41:50.940,0:41:55.410 +an error and you have a very angry face here because it's an error and then it + +0:41:55.410,0:42:00.720 +takes your element 0 of tensor does not require grads and does not have a god + +0:42:00.720,0:42:07.650 +function right so II which was the yeah whatever they reside here actually then + +0:42:07.650,0:42:11.400 +you couldn't run back problems that because there is no graph attached to + +0:42:11.400,0:42:19.710 +that ok questions this is so powerful you cannot do it this time 
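
The two behaviours discussed above can be reproduced in a few lines: the loop builds a graph whose depth depends on the random draw (so x.grad ends up a power of two that reveals how many doublings happened), and under `torch.no_grad()` no graph is recorded at all, so calling `backward()` raises the error quoted in the transcript.

```python
import torch

# Dynamic graph: the number of doublings depends on the random initialization.
x = torch.randn(3, requires_grad=True)
y = x * 2
i = 0
while y.norm() < 1000:
    y = y * 2
    i += 1
y.backward(torch.ones(3))
print(i, x.grad)           # every entry of x.grad equals 2 ** (i + 1)

# Inference: no graph is tracked inside no_grad, so backward() must fail.
w = torch.ones(3, requires_grad=True)
with torch.no_grad():
    z = w @ torch.arange(3.0)
try:
    z.backward()
except RuntimeError as err:
    print("expected:", err)
```
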
with tensor + +0:42:19.710,0:42:26.790 +you okay tensor flow is like whatever yeah more stuff here actually more stuff + +0:42:26.790,0:42:30.600 +coming right now [Applause] + +0:42:30.600,0:42:36.340 +so we go back here we have inside the extra folder he has some nice cute + +0:42:36.340,0:42:40.450 +things I wanted to cover both of them just that we go just for the second I + +0:42:40.450,0:42:47.290 +think sorry the second one is gonna be the following so in this case we are + +0:42:47.290,0:42:52.750 +going to be generating our own specific modules so I like let's say I'd like to + +0:42:52.750,0:42:58.030 +define my own function which is super special amazing function I can decide if + +0:42:58.030,0:43:02.560 +I want to use it for you know training Nets I need to get the forward pass and + +0:43:02.560,0:43:06.220 +also have to know what is the partial derivative of the input respect to the + +0:43:06.220,0:43:10.930 +output such that I can use this module in any kind of you know point in my + +0:43:10.930,0:43:15.670 +inner code such that you know by using back prop you know chain rule you just + +0:43:15.670,0:43:20.320 +plug the thing. Yann went on several times as long as you know partial + +0:43:20.320,0:43:23.410 +derivative of the output with respect to the input you can plug these things + +0:43:23.410,0:43:31.690 +anywhere in your chain of operations so in this case we define my addition which + +0:43:31.690,0:43:35.620 +is performing the addition of the two inputs in this case but then when you + +0:43:35.620,0:43:41.130 +perform the back propagation if you have an addition what is the back propagation + +0:43:41.130,0:43:47.020 +so if you have a addition of the two things you get an output when you send + +0:43:47.020,0:43:53.320 +down the gradients what does it happen with the with the gradient it gets you + +0:43:53.320,0:43:57.160 +know copied over both sides right and that's why you get both of them are + +0:43:57.160,0:44:01.390 +copies or the same thing and they are sent through one side of the other you + +0:44:01.390,0:44:05.170 +can execute this stuff you're gonna see here you get the same gradient both ways + +0:44:05.170,0:44:09.460 +in this case I have a split so I come from the same thing and then I split and + +0:44:09.460,0:44:13.180 +I have those two things doing something else if I go down with the gradient what + +0:44:13.180,0:44:20.080 +do I do you add them right and that's why we have here the add install you can + +0:44:20.080,0:44:23.680 +execute this one you're going to see here that we had these two initial + +0:44:23.680,0:44:27.910 +gradients here and then when you went up or sorry when you went down the two + +0:44:27.910,0:44:30.790 +things the two gradients sum together and they are here okay + +0:44:30.790,0:44:36.190 +so again if you use pre-made things in PyTorch. 
They are correct this one you + +0:44:36.190,0:44:41.080 +can mess around you can put any kind of different in + +0:44:41.080,0:44:47.950 +for a function and backward function I think we ran out of time other questions + +0:44:47.950,0:44:58.800 +before we actually leave no all right so I see on Monday and stay warm diff --git a/docs/pt/week06/06-1.md b/docs/pt/week06/06-1.md new file mode 100644 index 000000000..cb6439af0 --- /dev/null +++ b/docs/pt/week06/06-1.md @@ -0,0 +1,285 @@ +--- +lang: pt +lang-ref: ch.06-1 +lecturer: Yann LeCun +title: Aplicações de Redes Convolucionais +authors: Shiqing Li, Chenqin Yang, Yakun Wang, Jimin Tan +date: 2 Mar 2020 +translator: Bernardo Lago +translation-date: 14 Nov 2021 +--- + + + + +## [Reconhecimento de código postal](https://www.youtube.com/watch?v=ycbMGyCPzvE&t=43s) + + + +Na aula anterior, demonstramos que uma rede convolucional pode reconhecer dígitos, no entanto, a questão permanece, como o modelo escolhe cada dígito e evita perturbação nos dígitos vizinhos. A próxima etapa é detectar objetos não sobrepostos e usar a abordagem geral de Supressão Não Máxima (NMS). Agora, dada a suposição de que a entrada é uma série de dígitos não sobrepostos, a estratégia é treinar várias redes convolucionais e usando o voto da maioria ou escolhendo os dígitos correspondentes à pontuação mais alta gerada pela rede convolucional. + + + +### Reconhecimento com CNN + + + +Aqui apresentamos a tarefa de reconhecer 5 CEPs não sobrepostos. O sistema não recebeu instruções sobre como separar cada dígito, mas sabe que deve prever 5 dígitos. O sistema (Figura 1) consiste em 4 redes convolucionais de tamanhos diferentes, cada uma produzindo um conjunto de saídas. A saída é representada em matrizes. As quatro matrizes de saída são de modelos com largura de kernel diferente na última camada. Em cada saída, há 10 linhas, representando 10 categorias de 0 a 9. O quadrado branco maior representa uma pontuação mais alta nessa categoria. Nestes quatro blocos de saída, os tamanhos horizontais das últimas camadas do kernel são 5, 4, 3 e 2, respectivamente. O tamanho do kernel decide a largura da janela de visualização do modelo na entrada, portanto, cada modelo está prevendo dígitos com base em tamanhos de janela diferentes. O modelo, então, obtém uma votação majoritária e seleciona a categoria que corresponde à pontuação mais alta naquela janela. Para extrair informações úteis, deve-se ter em mente que nem todas as combinações de caracteres são possíveis, portanto, a correção de erros com base nas restrições de entrada é útil para garantir que as saídas sejam códigos postais verdadeiros. + + + +
+
+ Figura 1: Múltiplos classificadores no reconhecimento do CEP +
+ + + +Agora, para impor a ordem dos personagens. O truque é utilizar um algoritmo de caminho mais curto. Uma vez que recebemos faixas de caracteres possíveis e o número total de dígitos a prever, podemos abordar esse problema calculando o custo mínimo de produção de dígitos e transições entre os dígitos. O caminho deve ser contínuo da célula inferior esquerda para a célula superior direita no gráfico, e o caminho é restrito para conter apenas movimentos da esquerda para a direita e de baixo para cima. Observe que se o mesmo número for repetido um ao lado do outro, o algoritmo deve ser capaz de distinguir que há números repetidos em vez de prever um único dígito. + + + +## [Detecção de faces](https://www.youtube.com/watch?v=ycbMGyCPzvE&t=1241s) + + + +As redes neurais convolucionais têm um bom desempenho em tarefas de detecção e a detecção de faces não é exceção. Para realizar a detecção de faces, coletamos um conjunto de dados de imagens com faces e sem faces, no qual treinamos uma rede convolucional com um tamanho de janela de 30 $\times$ 30 pixels e pedimos à rede para dizer se há um rosto ou não. Uma vez treinado, aplicamos o modelo a uma nova imagem e se houver faces dentro de uma janela de 30 $\times$ 30 pixels, a rede convolucional iluminará a saída nos locais correspondentes. No entanto, existem dois problemas. + + + +- **Falsos positivos**: Existem muitas variações diferentes de objetos não-face que podem aparecer em um patch de uma imagem. Durante o estágio de treinamento, o modelo pode não ver todos eles (*ou seja*, um conjunto totalmente representativo de remendos não faciais). Portanto, o modelo pode apresentar muitos falsos positivos no momento do teste. Por exemplo, se a rede não foi treinada em imagens contendo mãos, ela pode detectar rostos com base em tons de pele e classificar incorretamente manchas de imagens contendo mãos como rostos, dando origem a falsos positivos. + + + +- **Tamanho de rosto diferente:** Nem todos os rostos têm 30 $\times$ 30 pixels, portanto, rostos de tamanhos diferentes podem não ser detectados. Uma maneira de lidar com esse problema é gerar versões em várias escalas da mesma imagem. O detector original detectará rostos em torno de 30 $\times$ 30 pixels. Se aplicar uma escala na imagem do fator $\sqrt 2$, o modelo detectará faces que eram menores na imagem original, pois o que era 30 $\times$ 30 agora é 20 $\times$ 20 pixels aproximadamente. Para detectar rostos maiores, podemos reduzir o tamanho da imagem. Esse processo é barato, pois metade das despesas vem do processamento da imagem original sem escala. A soma das despesas de todas as outras redes combinadas é quase a mesma do processamento da imagem original sem escala. O tamanho da rede é o quadrado do tamanho da imagem de um lado, então, se você reduzir a imagem em $\sqrt 2$, a rede que você precisa para executar é menor em um fator de 2. Portanto, o custo geral é $1+1/2+1/4+1/8+1/16…$, que é 2. Executar um modelo em várias escalas apenas duplica o custo computacional. + + + +### Um sistema de detecção de faces em várias escalas + + + +
+
+ Figura 2: Sistema de detecção de faces +
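+
+Um esboço mínimo, em PyTorch, da ideia de detecção em várias escalas descrita acima: construímos uma pirâmide de imagens com fator $\sqrt 2$ e aplicamos o mesmo detector convolucional em cada escala. Os nomes (`detector`, `build_pyramid`) e o número de escalas são apenas ilustrativos, não fazem parte do material original.
+
+```python
+import torch
+import torch.nn.functional as F
+
+def build_pyramid(img, num_scales=5, factor=2 ** 0.5):
+    """Gera versões reduzidas da imagem (pirâmide), com fator sqrt(2)."""
+    pyramid = [img]                    # img: tensor (1, C, H, W)
+    for _ in range(num_scales - 1):
+        h, w = pyramid[-1].shape[-2:]
+        pyramid.append(F.interpolate(pyramid[-1],
+                                     size=(int(h / factor), int(w / factor)),
+                                     mode='bilinear', align_corners=False))
+    return pyramid
+
+def detect_multiscale(detector, img, threshold=0.5):
+    """Aplica o mesmo detector (totalmente convolucional) em cada escala e
+    devolve, para cada uma, o mapa booleano das janelas acima do limiar."""
+    results = []
+    for s, scaled in enumerate(build_pyramid(img)):
+        scores = detector(scaled)      # mapa de pontuações (1, 1, H', W')
+        results.append((s, scores > threshold))
+    return results
+```
+
+Como o custo de processamento cai aproximadamente pela metade a cada escala, o custo total fica em torno do dobro do custo da imagem original, como discutido acima.
+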
+ + + +Os mapas mostrados na (Figura 3) indicam a pontuação dos detectores de face. Este detector facial reconhece rostos com tamanho de 20 $\times$ 20 pixels. Em escala fina (Escala 3), há muitas pontuações altas, mas não são muito definitivas. Quando o fator de escala aumenta (Escala 6), vemos mais regiões brancas agrupadas. Essas regiões brancas representam rostos detectados. Em seguida, aplicamos a supressão não máxima para obter a localização final do rosto. + + + +
+
+ Figura 3: Pontuações do detector facial para vários fatores de escala +
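+
+A título de ilustração, um esboço (hipotético, não faz parte do material original) de como um mapa de pontuações como os da Figura 3 pode ser convertido em caixas nas coordenadas da imagem original, levando em conta o passo efetivo da rede e o fator de escala; o passo 4 e a janela de 20 pixels são apenas valores de exemplo. As caixas sobrepostas seriam então tratadas pela supressão não máxima descrita a seguir.
+
+```python
+def score_map_to_boxes(score_map, scale, stride=4, window=20, threshold=0.5):
+    """Converte um mapa de pontuações 2D (H', W') de uma dada escala em caixas
+    (x0, y0, x1, y1, pontuação) nas coordenadas da imagem original."""
+    boxes = []
+    height, width = score_map.shape
+    for i in range(height):
+        for j in range(width):
+            s = float(score_map[i, j])
+            if s < threshold:
+                continue
+            # posição da janela na imagem reescalada, reprojetada para a original
+            x0, y0 = j * stride * scale, i * stride * scale
+            size = window * scale
+            boxes.append((x0, y0, x0 + size, y0 + size, s))
+    return boxes
+```
+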
+ + + +### Supressão não máxima + + + +Para cada região de alta pontuação, provavelmente há um rosto por baixo. Se forem detectados mais rostos muito próximos do primeiro, significa que apenas um deve ser considerado correto e os demais estão errados. Com a supressão não máxima, pegamos a pontuação mais alta das caixas delimitadoras sobrepostas e removemos as outras. O resultado será uma única caixa delimitadora no local ideal. + + + +### Mineração negativa + + + +Na última seção, discutimos como o modelo pode ser executado em um grande número de falsos positivos no momento do teste, pois há muitas maneiras de objetos não-face parecerem semelhantes a um rosto. Nenhum conjunto de treinamento incluirá todos os possíveis objetos não-rostos que se parecem com rostos. Podemos mitigar esse problema por meio da mineração negativa. Na mineração negativa, criamos um conjunto de dados negativos de patches não faciais que o modelo detectou (erroneamente) como faces. Os dados são coletados executando o modelo em entradas que são conhecidas por não conterem faces. Em seguida, treinamos novamente o detector usando o conjunto de dados negativo. Podemos repetir esse processo para aumentar a robustez do nosso modelo contra falsos positivos. + + + +## Segmentação semântica + + + +A segmentação semântica é a tarefa de atribuir uma categoria a cada pixel em uma imagem de entrada. + + + +### [CNN para Visão de Robôs Adaptável de Longo Alcance](https://www.youtube.com/watch?v=ycbMGyCPzvE&t=1669s) + + + +Neste projeto, o objetivo era rotular regiões a partir de imagens de entrada para que um robô pudesse distinguir entre estradas e obstáculos. Na figura, as regiões verdes são áreas nas quais o robô pode dirigir e as regiões vermelhas são obstáculos como grama alta. Para treinar a rede para essa tarefa, pegamos um patch da imagem e rotulamos manualmente como atravessável ou não (verde ou vermelho). Em seguida, treinamos a rede convolucional nos patches, pedindo-lhe para prever a cor do patch. Uma vez que o sistema esteja suficientemente treinado, ele é aplicado em toda a imagem, rotulando todas as regiões da imagem como verdes ou vermelhas. + + + +
+
+ Figura 4: CNN para Visão do Robô Adaptável de Longo Alcance (programa DARPA LAGR 2005-2008) +
+ + + +Havia cinco categorias de previsão: 1) superverde, 2) verde, 3) roxo: linha do pé do obstáculo, 4) obstáculo vermelho 5) super vermelho: definitivamente um obstáculo. + + + +**Rótulos estéreo** (Figura 4, Coluna 2) + As imagens são capturadas pelas 4 câmeras do robô, que são agrupadas em 2 pares de visão estéreo. Usando as distâncias conhecidas entre as câmeras do par estéreo, as posições de cada pixel no espaço 3D são estimadas medindo as distâncias relativas entre os pixels que aparecem em ambas as câmeras em um par estéreo. Este é o mesmo processo que nosso cérebro usa para estimar a distância dos objetos que vemos. Usando as informações de posição estimada, um plano é ajustado ao solo e os pixels são rotulados como verdes se estiverem próximos do solo e vermelhos se estiverem acima dele. + + + +* **Limitações e motivação para ConvNet**: A visão estéreo funciona apenas até 10 metros e dirigir um robô requer visão de longo alcance. Um ConvNet, entretanto, é capaz de detectar objetos em distâncias muito maiores, se treinado corretamente. + + + +
+
+ Figura 5: Pirâmide invariante de escala de imagens normalizadas por distância +
+ + + +* **Servido como entradas do modelo**: o pré-processamento importante inclui a construção de uma pirâmide invariante de escala de imagens normalizadas por distância (Figura 5). É semelhante ao que fizemos anteriormente nesta aula, quando tentamos detectar faces de escalas múltiplas. + + + +**Saídas do modelo** (Figura 4, Coluna 3) + + + +O modelo gera um rótulo para cada pixel na imagem **até o horizonte**. Estas são as saídas do classificador de uma rede convolucional multi-escala. + + + +* **Como o modelo se torna adaptativo**: Os robôs têm acesso contínuo às etiquetas estéreo, permitindo que a rede seja treinada novamente, adaptando-se ao novo ambiente em que se encontra. Observe que apenas a última camada da rede seria refeita -treinado. As camadas anteriores são treinadas em laboratório e fixas. + + + +**Performance do sistema** + + + +Ao tentar chegar a uma coordenada GPS do outro lado de uma barreira, o robô "avistou" a barreira de longe e planejou uma rota para evitá-la. Isso é graças à CNN detectando objetos até 50-100m de distância. + + + +**Limitação** + + + +Na década de 2000, os recursos de computação eram restritos. O robô foi capaz de processar cerca de 1 quadro por segundo, o que significa que ele não seria capaz de detectar uma pessoa que andasse em seu caminho por um segundo inteiro antes de ser capaz de reagir. A solução para essa limitação é um modelo de **Odometria visual de baixo custo**. Não é baseado em redes neurais, tem uma visão de ~2,5m, mas reage rapidamente. + + + +### Análise e rotulagem de cenas + + + +Nesta tarefa, o modelo gera uma categoria de objeto (edifícios, carros, céu, etc.) para cada pixel. A arquitetura também é multi-escala (Figura 6). + + + +
+
+ Figura 6: CNN em várias escalas para análise de cena +
+ + + +Observe que se projetarmos de volta uma saída da CNN na entrada, ela corresponderá a uma janela de entrada de tamanho $46\times46$ na imagem original na parte inferior da Pirâmide Laplaciana. Isso significa que estamos **usando o contexto de $46\times46$ pixels para decidir a categoria do pixel central**. + + + +No entanto, às vezes, esse tamanho de contexto não é suficiente para determinar a categoria de objetos maiores. + + + +**A abordagem multiescala permite uma visão mais ampla, fornecendo imagens extras redimensionadas como entradas.** As etapas são as seguintes: +1. Pegue a mesma imagem, reduza-a pelo fator de 2 e pelo fator de 4, separadamente. +2. Essas duas imagens redimensionadas extras são alimentadas **a mesma ConvNet** (mesmos pesos, mesmos kernels) e obtemos outros dois conjuntos de recursos de nível 2. +3. **Aumente a amostra** desses recursos para que tenham o mesmo tamanho que os Recursos de Nível 2 da imagem original. +4. **Empilhe** os três conjuntos de recursos (amostrados) e os envie a um classificador. + + + +Agora, o maior tamanho efetivo de conteúdo, que é da imagem redimensionada de 1/4, é $184\times 184\, (46\times 4=184)$. + + + +**Desempenho**: sem pós-processamento e execução quadro a quadro, o modelo funciona muito rápido, mesmo em hardware padrão. Tem um tamanho bastante pequeno de dados de treinamento (2k ~ 3k), mas os resultados ainda são recordes. + diff --git a/docs/pt/week06/06-2.md b/docs/pt/week06/06-2.md new file mode 100644 index 000000000..885769e86 --- /dev/null +++ b/docs/pt/week06/06-2.md @@ -0,0 +1,586 @@ +--- +lang: pt +lang-ref: ch.06-2 +lecturer: Yann LeCun +title: RNNs, GRUs, LSTMs, Modelos de Atenção, Seq2Seq e Redes com Memória +authors: Jiayao Liu, Jialing Xu, Zhengyang Bian, Christina Dominguez +date: 2 March 2020 +translator: Bernardo Lago +translation-date: 14 Nov 2021 +--- + + + +## [Arquitetura de Aprendizagem Profunda](https://www.youtube.com/watch?v=ycbMGyCPzvE&t=2620s) + + + +Na aprendizagem profunda, existem diferentes módulos para realizar diferentes funções. A especialização em aprendizagem profunda envolve o projeto de arquiteturas para concluir tarefas específicas. Semelhante a escrever programas com algoritmos para dar instruções a um computador nos dias anteriores, o aprendizado profundo reduz uma função complexa em um gráfico de módulos funcionais (possivelmente dinâmicos), cujas funções são finalizadas pelo aprendizado. + + + +Como com o que vimos com redes convolucionais, a arquitetura de rede é importante. + + + +## Redes Neurais Recorrentes + + + +Em uma Rede Neural Convolucional, o gráfico ou as interconexões entre os módulos não podem ter laços. Existe pelo menos uma ordem parcial entre os módulos, de modo que as entradas estão disponíveis quando calculamos as saídas. + + + +Conforme mostrado na Figura 1, existem loops nas Redes Neurais Recorrentes. + + + +
+
+Figura 1. Rede Neural Recorrente com loops +
+ + + +- $x(t)$: entrada que varia ao longo do tempo + - $\text{Enc}(x(t))$: codificador que gera uma representação de entrada + - $h(t)$: uma representação da entrada + - $w$: parâmetros treináveis + - $z(t-1)$: estado oculto anterior, que é a saída da etapa de tempo anterior + - $z(t)$: estado oculto atual + - $g$: função que pode ser uma rede neural complicada; uma das entradas é $z(t-1)$ que é a saída da etapa de tempo anterior + - $\text{Dec}(z(t))$: decodificador que gera uma saída + + + + +## Redes Neurais Recorrentes: desenrolando os loops + + + +Desenrole o loop no tempo. A entrada é uma sequência $x_1, x_2, \cdots, x_T$. + + + +
+ " +
+Figura 2. Redes recorrentes com loop desenrolado +
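+
+Um esboço mínimo, em PyTorch, do desenrolar no tempo ilustrado na Figura 2, usando a notação definida acima ($\text{Enc}$, $G$, $\text{Dec}$, com os mesmos pesos em todos os passos de tempo); os módulos lineares e as dimensões são apenas ilustrativos.
+
+```python
+import torch
+import torch.nn as nn
+
+class UnrolledRNN(nn.Module):
+    def __init__(self, d_in, d_hid, d_out):
+        super().__init__()
+        self.enc = nn.Linear(d_in, d_hid)        # Enc(x[t]) -> h(x[t])
+        self.g = nn.Linear(2 * d_hid, d_hid)     # G(h[t], z[t-1], w)
+        self.dec = nn.Linear(d_hid, d_out)       # Dec(z[t])
+
+    def forward(self, xs):                       # xs: (T, lote, d_in)
+        z = torch.zeros(xs.shape[1], self.g.out_features)
+        ys = []
+        for x in xs:                             # mesmos pesos em cada passo
+            h = torch.tanh(self.enc(x))
+            z = torch.tanh(self.g(torch.cat([h, z], dim=1)))
+            ys.append(self.dec(z))
+        return torch.stack(ys)                   # (T, lote, d_out)
+```
+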
+ + + +Na Figura 2, a entrada é $x_1, x_2, x_3$. + + + +No tempo t = 0, a entrada $x(0)$ é passada para o codificador e ele gera a representação $h(x(0)) = \text{Enc}(x(0))$ e então a passa para G para gerar o estado oculto $z(0) = G(h_0, z', w)$. Em $t = 0$, $z'$ em $G$ pode ser inicializado como $0$ ou inicializado aleatoriamente. $z(0)$ é passado para o decodificador para gerar uma saída e também para a próxima etapa de tempo. + + + +Como não há loops nesta rede, podemos implementar a retropropagação. + + + +A Figura 2 mostra uma rede regular com uma característica particular: cada bloco compartilha os mesmos pesos. Três codificadores, decodificadores e funções G têm os mesmos pesos, respectivamente, em diferentes intervalos de tempo. + + + +BPTT: Retropropagação através do tempo (Backpropagation through time). Infelizmente, o BPTT não funciona tão bem na forma mais simples de RNN. + + + +Problemas com RNNs: + + + +1. Perda da informação do Gradiente (Dissipação do Gradiente) + - Em uma longa sequência, os gradientes são multiplicados pela matriz de peso (transposição) a cada passo de tempo. Se houver valores pequenos na matriz de peso, a norma dos gradientes fica cada vez menor exponencialmente. +2. Explosão de gradientes + - Se tivermos uma matriz de peso grande e a não linearidade na camada recorrente não for saturada, os gradientes explodirão. Os pesos irão divergir na etapa de atualização. Podemos ter que usar uma pequena taxa de aprendizado para que o gradiente descendente funcione. + + + +Uma razão para usar RNNs é a vantagem de lembrar informações do passado. No entanto, ele pode falhar ao memorizar as informações há muito tempo em um RNN simples sem truques. + + + +Um exemplo que tem problema de perda da informação do gradiente: + + + +A entrada são os caracteres de um programa em C. O sistema dirá se é um programa sintaticamente correto. Um programa sintaticamente correto deve ter um número válido de chaves e parênteses. Portanto, a rede deve lembrar quantos parênteses e colchetes devem ser verificados e se todos eles foram fechados. A rede precisa armazenar essas informações em estados ocultos, como um contador. No entanto, devido ao desaparecimento de gradientes, ele deixará de preservar essas informações em um programa longo. + + + +## Truques em RNN + + + +- gradientes de recorte: (evite a explosão de gradientes) + Esmague os gradientes quando eles ficarem muito grandes. +- Inicialização (começar no estádio certo evita explodir / desaparecer) + Inicialize as matrizes de peso para preservar a norma até certo ponto. Por exemplo, a inicialização ortogonal inicializa a matriz de peso como uma matriz ortogonal aleatória. + + + +## Módulos Multiplicativos + + + +Em módulos multiplicativos, ao invés de apenas computar uma soma ponderada de entradas, calculamos produtos de entradas e, em seguida, calculamos a soma ponderada disso. + + + +Suponha que $x \in {R}^{n\times1}$, $W \in {R}^{m \times n}$, $U \in {R}^{m \times n \times d}$ e $z \in {R}^{d\times1}$. Aqui U é um tensor. 
+ + + +$$ +w_{ij} = u_{ij}^\top z = +\begin{pmatrix} +u_{ij1} & u_{ij2} & \cdots &u_{ijd}\\ +\end{pmatrix} +\begin{pmatrix} +z_1\\ +z_2\\ +\vdots\\ +z_d\\ +\end{pmatrix} = \sum_ku_{ijk}z_k +$$ + + + +$$ +s = +\begin{pmatrix} +s_1\\ +s_2\\ +\vdots\\ +s_m\\ +\end{pmatrix} = Wx = \begin{pmatrix} +w_{11} & w_{12} & \cdots &w_{1n}\\ +w_{21} & w_{22} & \cdots &w_{2n}\\ +\vdots\\ +w_{m1} & w_{m2} & \cdots &w_{mn} +\end{pmatrix} +\begin{pmatrix} +x_1\\ +x_2\\ +\vdots\\ +x_n\\ +\end{pmatrix} +$$ + + + +onde $s_i = w_{i}^\top x = \sum_j w_{ij}x_j$. + + + +A saída do sistema é uma soma ponderada clássica de entradas e pesos. Os próprios pesos também são somas ponderadas de pesos e entradas. + + + +Arquitetura de hiper-rede: os pesos são calculados por outra rede. + + + + +## Atenção (Attention) + + + +$x_1$ e $x_2$ são vetores, $w_1$ e $w_2$ são escalares após softmax onde $w_1 + w_2 = 1$, e $w_1$ e $w_2$ estão entre 0 e 1. + + + +$w_1x_1 + w_2x_2$ é uma soma ponderada de $x_1$ e $x_2$ ponderada pelos coeficientes $w_1$ e $w_2$. + + + +Alterando o tamanho relativo de $w_1$ e $w_2$, podemos mudar a saída de $w_1x_1 + w_2x_2$ para $x_1$ ou $x_2$ ou algumas combinações lineares de $x_1$ e $x_2$. + + + +As entradas podem ter vários vetores $x$ (mais de $x_1$ e $x_2$). O sistema escolherá uma combinação apropriada, cuja escolha é determinada por outra variável z. Um mecanismo de atenção permite que a rede neural concentre sua atenção em determinadas entradas e ignore as outras. + + + +A atenção é cada vez mais importante em sistemas de PNL que usam arquiteturas de transformador ou outros tipos de atenção. + + + +Os pesos são independentes dos dados porque z é independente dos dados. + + + + +## [Gated Recurrent Units (GRU)](https://www.youtube.com/watch?v=ycbMGyCPzvE&t=3549s) + + + +Como mencionado acima, RNN sofre de dissipação e explosão de gradientes e não consegue se lembrar dos estados por muito tempo. GRU, [Cho, 2014](https://arxiv.org/abs/1406.1078), é uma aplicação de módulos multiplicativos que tenta resolver esses problemas. É um exemplo de rede recorrente com memória (outra é LSTM). A estrutura de uma unidade GRU é mostrada abaixo: + + + +
+
+Figura 3. Gated Recurrent Unit +
+ + + +$$ +\begin{array}{l} +z_t = \sigma_g(W_zx_t + U_zh_{t-1} + b_z)\\ +r_t = \sigma_g(W_rx_t + U_rh_{t-1} + b_r)\\ +h_t = z_t\odot h_{t-1} + (1- z_t)\odot\phi_h(W_hx_t + U_h(r_t\odot h_{t-1}) + b_h) +\end{array} +$$ + + + +onde $\odot$ denota multiplicação elemento a elemento (produto Hadamard), $ x_t $ é o vetor de entrada, $h_t$é o vetor de saída, $z_t$ é o vetor de porta de atualização, $r_t$ é o vetor de porta de reset, $\phi_h$ é um tanh hiperbólico e $W$, $U$, $b$ são parâmetros que podem ser aprendidos. + + + +Para ser específico, $z_t$ é um vetor de passagem que determina quanto das informações do passado deve ser repassado para o futuro. Ele aplica uma função sigmóide à soma de duas camadas lineares e um viés sobre a entrada $x_t$ e o estado anterior $h_{t-1}$. $z_t$ contém coeficientes entre 0 e 1 como resultado da aplicação de sigmóide. O estado de saída final $ h_t $ é uma combinação convexa de $h_{t-1}$ e $\phi_h(W_hx_t + U_h(r_t\odot h_{t-1}) + b_h)$ via $z_t$. Se o coeficiente for 1, a saída da unidade atual é apenas uma cópia do estado anterior e ignora a entrada (que é o comportamento padrão). Se for menor que um, leva em consideração algumas novas informações da entrada. + + + +A porta de reinicialização $r_t$ é usada para decidir quanto das informações anteriores deve ser esquecido. No novo conteúdo de memória $\phi_h(W_hx_t + U_h(r_t\odot h_{t-1}) + b_h)$, se o coeficiente em $r_t$ for 0, então ele não armazena nenhuma das informações do passado. Se ao mesmo tempo $z_t$ for 0, então o sistema será completamente reiniciado, já que $h_t$ só olharia para a entrada. + + + + +## LSTM (Long Short-Term Memory) + + + +GRU é na verdade uma versão simplificada do LSTM que saiu muito antes, [Hochreiter, Schmidhuber, 1997](https://www.bioinf.jku.at/publications/older/2604.pdf). Ao construir células de memória para preservar informações anteriores, os LSTMs também visam resolver problemas de perda de memória de longo prazo em RNNs. A estrutura dos LSTMs é mostrada abaixo: + + + +
+
+Figura 4. LSTM +
+
+$$
+\begin{array}{l}
+f_t = \sigma_g(W_fx_t + U_fh_{t-1} + b_f)\\
+i_t = \sigma_g(W_ix_t + U_ih_{t-1} + b_i)\\
+o_t = \sigma_o(W_ox_t + U_oh_{t-1} + b_o)\\
+c_t = f_t\odot c_{t-1} + i_t\odot \tanh(W_cx_t + U_ch_{t-1} + b_c)\\
+h_t = o_t \odot\tanh(c_t)
+\end{array}
+$$
+
+onde $\odot$ denota multiplicação elemento a elemento, $x_t\in\mathbb{R}^a$ é um vetor de entrada para a unidade LSTM, $f_t\in\mathbb{R}^h$ é o vetor de ativação da porta de esquecimento, $i_t\in\mathbb{R}^h$ é o vetor de ativação da porta de entrada / atualização, $o_t\in\mathbb{R}^h$ é o vetor de ativação da porta de saída, $h_t\in\mathbb{R}^h$ é o vetor de estado oculto (também conhecido como saída) e $c_t\in\mathbb{R}^h$ é o vetor de estado da célula.
+
+Uma unidade LSTM usa um estado de célula $c_t$ para transmitir as informações através da unidade. Ela regula como as informações são preservadas ou removidas do estado da célula por meio de estruturas chamadas portas. A porta de esquecimento $f_t$ decide quanta informação queremos manter do estado da célula anterior $c_{t-1}$ olhando para a entrada atual e o estado oculto anterior, e produz um número entre 0 e 1 como coeficiente de $c_{t-1}$. O termo $\tanh(W_cx_t + U_ch_{t-1} + b_c)$ calcula um novo candidato para atualizar o estado da célula e, assim como a porta de esquecimento, a porta de entrada $i_t$ decide quanto dessa atualização será aplicada. Finalmente, a saída $h_t$ é baseada no estado da célula $c_t$, mas passa por um $\tanh$ e então é filtrada pela porta de saída $o_t$.
+
+Embora os LSTMs sejam amplamente usados na PNL, sua popularidade está diminuindo. Por exemplo, o reconhecimento de voz está se movendo em direção ao uso de CNNs temporais, e a PNL está se movendo em direção ao uso de transformadores.
+
+
+## Modelo Sequência para Sequência (Seq2Seq)
+
+A abordagem proposta por [Sutskever NIPS 2014](https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf) é o primeiro sistema de tradução automática neural com desempenho comparável ao das abordagens clássicas. Ela usa uma arquitetura do tipo codificador-decodificador em que tanto o codificador quanto o decodificador são LSTMs de várias camadas.
+
+
+Figura 5. Seq2Seq +
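+
+Um esboço simplificado e hipotético, em PyTorch, da arquitetura codificador-decodificador com LSTMs de várias camadas mencionada acima; vocabulário, embeddings e a geração autoregressiva propriamente dita foram omitidos.
+
+```python
+import torch.nn as nn
+
+class Seq2Seq(nn.Module):
+    def __init__(self, d_emb, d_hid, vocab_out, num_layers=4):
+        super().__init__()
+        self.encoder = nn.LSTM(d_emb, d_hid, num_layers)  # pilha de LSTMs
+        self.decoder = nn.LSTM(d_emb, d_hid, num_layers)
+        self.proj = nn.Linear(d_hid, vocab_out)
+
+    def forward(self, src_emb, tgt_emb):
+        # src_emb: (T_src, lote, d_emb); o estado final resume a frase de entrada
+        _, state = self.encoder(src_emb)
+        # o decodificador gera a frase de destino condicionado nesse estado
+        out, _ = self.decoder(tgt_emb, state)
+        return self.proj(out)                             # (T_tgt, lote, vocab_out)
+```
+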
+ + + +Cada célula na figura é um LSTM. Para o codificador (a parte à esquerda), o número de intervalos de tempo é igual ao comprimento da frase a ser traduzida. Em cada etapa, há uma pilha de LSTMs (quatro camadas no papel) onde o estado oculto do LSTM anterior é alimentado para o próximo. A última camada da última etapa de tempo produz um vetor que representa o significado de toda a frase, que é então alimentado em outro LSTM de várias camadas (o decodificador), que produz palavras no idioma de destino. No decodificador, o texto é gerado de forma sequencial. Cada etapa produz uma palavra, que é alimentada como uma entrada para a próxima etapa de tempo. + + + +Essa arquitetura não é satisfatória de duas maneiras: primeiro, todo o significado da frase deve ser comprimido no estado oculto entre o codificador e o decodificador. Em segundo lugar, os LSTMs na verdade não preservam informações por mais de cerca de 20 palavras. A correção para esses problemas é chamada de Bi-LSTM, que executa dois LSTMs em direções opostas. Em um Bi-LSTM, o significado é codificado em dois vetores, um gerado pela execução do LSTM da esquerda para a direita e outro da direita para a esquerda. Isso permite dobrar o comprimento da frase sem perder muitas informações. + + + + +## Seq2seq com Atenção (Attention) + + + +O sucesso da abordagem acima teve vida curta. Outro artigo de [Bahdanau, Cho, Bengio](https://arxiv.org/abs/1409.0473) sugeriu que, em vez de ter uma rede gigantesca que comprime o significado de toda a frase em um vetor, faria mais sentido se em a cada passo, nós apenas focamos a atenção nos locais relevantes no idioma original com significado equivalente, ou seja, o mecanismo de atenção. + + + +
+
+Figura 6. Seq2seq com Atenção +
+ + + +Em Atenção, para produzir a palavra atual em cada etapa de tempo, primeiro precisamos decidir em quais representações ocultas de palavras na frase de entrada nos concentrar. Essencialmente, uma rede aprenderá a pontuar quão bem cada entrada codificada corresponde à saída atual do decodificador. Essas pontuações são normalizadas por um softmax, então os coeficientes são usados para calcular uma soma ponderada dos estados ocultos no codificador em diferentes etapas de tempo. Ao ajustar os pesos, o sistema pode ajustar a área de entradas para focar. A mágica desse mecanismo é que a rede usada para calcular os coeficientes pode ser treinada por meio de retropropagação. Não há necessidade de construí-los manualmente! + + + +Os mecanismos de atenção transformaram completamente a tradução automática feita por redes neurais. Posteriormente, o Google publicou um artigo [*Attention Is All You Need*](https://arxiv.org/abs/1706.03762) e apresentou o transformer, em que cada camada e grupo de neurônios está implementando a atenção. + + + + +## [Redes com Memória](https://www.youtube.com/watch?v=ycbMGyCPzvE&t=4575s) + + + +Redes de memória derivam do trabalho no Facebook iniciado por [Antoine Bordes](https://arxiv.org/abs/1410.3916) em 2014 e [Sainbayar Sukhbaatar](https://arxiv.org/abs/1503.08895) em 2015. + + + +A ideia de uma rede com memória é que existem duas partes importantes em seu cérebro: uma é o **córtex**, que é onde você tem memória de longo prazo. Há um grupo separado de neurônios chamado **hipocampo**, que envia fios para quase todos os cantos do córtex. Acredita-se que o hipocampo seja usado para memória de curto prazo, lembrando coisas por um período de tempo relativamente curto. A teoria prevalente é que, quando você dorme, muitas informações são transferidas do hipocampo para o córtex para serem solidificadas na memória de longo prazo, já que o hipocampo tem capacidade limitada. + + + +Para uma rede com memória, há uma entrada para a rede, $ x $ (pense nisso como um endereço da memória), e compare este $ x $ com os vetores $k_1, k_2, k_3, \cdots$ ("chaves") por meio de um produto escalar. Coloque-os em um softmax, o que você obtém é uma matriz de números que somam um. E há um conjunto de outros vetores $v_1, v_2, v_3, \cdots$ ("valores"). Multiplique esses vetores pelos escalonadores de softmax e some esses vetores (observe a semelhança com o mecanismo de atenção) para obter o resultado. + + + +
+
+Figura 7. Redes com Memória +
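+
+Um esboço mínimo, em PyTorch, do endereçamento descrito acima: o produto escalar entre a entrada $x$ e cada chave $k_i$ passa por um softmax e os coeficientes resultantes ponderam a soma dos valores $v_i$ (dimensões e nomes de variáveis são apenas ilustrativos).
+
+```python
+import torch
+
+def memory_lookup(x, keys, values):
+    """x: (d,), keys: (n, d), values: (n, d_v) -> soma ponderada dos valores."""
+    alpha = keys @ x                 # alpha_i = k_i^T x
+    c = torch.softmax(alpha, dim=0)  # coeficientes que somam um
+    return c @ values                # s = soma_i c_i v_i
+
+# Uso: se uma chave for (quase) igual a x, seu coeficiente fica próximo de um
+x = torch.randn(8)
+keys = torch.randn(5, 8)
+values = torch.randn(5, 16)
+s = memory_lookup(x, keys, values)
+```
+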
+ + + +Se uma das chaves (*por exemplo* $ k_i $) corresponder exatamente a $ x $, então o coeficiente associado a esta chave será muito próximo de um. Portanto, a saída do sistema será essencialmente $ v_i $. + + + +Esta é a **memória associativa endereçável**. A memória associativa é que, se sua entrada corresponder a uma chave, você obtém *aquele* valor. E esta é apenas uma versão soft diferenciável dele, que permite retropropagar e alterar os vetores por meio de gradiente descendente. + + + +O que os autores fizeram foi contar uma história a um sistema, dando-lhe uma sequência de frases. As sentenças são codificadas em vetores, passando-as por uma rede neural que não foi pré-treinada. As frases são devolvidas à memória deste tipo. Quando você faz uma pergunta ao sistema, você codifica a pergunta e a coloca como a entrada de uma rede neural, a rede neural produz um $ x $ para a memória, e a memória retorna um valor. + + + +Este valor, junto com o estado anterior da rede, é usado para acessar novamente a memória. E você treina toda essa rede para produzir uma resposta à sua pergunta. Após um treinamento extensivo, esse modelo realmente aprende a armazenar histórias e responder a perguntas. + + + +$$ +\alpha_i = k_i^\top x \\ +c = \text{softmax}(\alpha) \\ +s = \sum_i c_i v_i +$$ + + + +Na rede de memória, há uma rede neural que recebe uma entrada e, em seguida, produz um endereço para a memória, retorna o valor para a rede, continua e, por fim, produz uma saída. É muito parecido com um computador, pois há uma CPU e uma memória externa para ler e escrever. + + + +
+ +
+ + + +Figura 8. Comparação entre a rede com memória e um computador (Foto Khan Acadamy) +
+ + + +Existem pessoas que imaginam que você pode realmente construir **computadores diferenciáveis** a partir disso. Um exemplo é a [Máquina de Turing Neural](https://arxiv.org/abs/1410.5401) da DeepMind, que se tornou pública três dias depois que o artigo do Facebook foi publicado no arXiv. + + + +A ideia é comparar entradas para chaves, gerar coeficientes e produzir valores - que é basicamente o que é um transformador. Um transformador é basicamente uma rede neural em que cada grupo de neurônios é uma dessas redes. + diff --git a/docs/pt/week06/06-3.md b/docs/pt/week06/06-3.md new file mode 100644 index 000000000..03426aadb --- /dev/null +++ b/docs/pt/week06/06-3.md @@ -0,0 +1,734 @@ +--- +lang: pt +lang-ref: ch.06-3 +title: Propriedades dos Sinais Naturais +lecturer: Alfredo Canziani +authors: Zhengyuan Ding, Biao Huang, Lin Jiang, Nhung Le +date: 3 Mar 2020 +translator: Bernardo Lago +translation-date: 14 Nov 2021 +--- + + + + +## [Visão geral](https://www.youtube.com/watch?v=8cAffg2jaT0&t=21s) + + + +RNN é um tipo de arquitetura que podemos usar para lidar com sequências de dados. O que é uma sequência? Com a lição da CNN, aprendemos que um sinal pode ser 1D, 2D ou 3D, dependendo do domínio. O domínio é definido pelo que você está mapeando e para o que está mapeando. Manipular dados sequenciais é basicamente lidar com dados 1D, uma vez que o domínio é o eixo temporal. No entanto, você também pode usar RNN para lidar com dados 2D, onde você tem duas direções. + + + +### Rede Neural "Comum" * vs. * Redes Neurais Recorrentes + + + +A Figura 1 é um diagrama de rede neural comum (vanilla) com três camadas. "Vanilla" é um termo americano que significa simples, comum. O círculo cor-de-rosa é o vetor de entrada x, no centro está a camada oculta em verde e a camada azul final é a saída. Usando um exemplo da eletrônica digital à direita, isso é como uma lógica combinatória, onde a saída de corrente depende apenas da entrada de corrente. + + + +
+
+ Figura 1: Arquitetura "Vanilla" +
+ + + +Em contraste com uma rede neural comum, em redes neurais recorrentes (RNN) a saída atual depende não apenas da entrada atual, mas também do estado do sistema, mostrado na Figura 2. Isso é como uma lógica sequencial na eletrônica digital, onde a saída também depende de um "flip-flop" (uma unidade de memória básica em eletrônica digital). Portanto, a principal diferença aqui é que a saída de uma rede neural comum depende apenas da entrada atual, enquanto a de RNN depende também do estado do sistema. + + + +
+
+ Figura 2: Arquitetura RNN +
+ + + +
+
+ Figura 3: Arquitetura de uma Rede Neural básica +
+ + + +O diagrama de Yann adiciona essas formas entre os neurônios para representar o mapeamento entre um tensor e outro (um vetor para outro). Por exemplo, na Figura 3, o vetor de entrada x será mapeado por meio desse item adicional para as representações ocultas h. Este item é na verdade uma transformação afim, ou seja, rotação mais distorção. Em seguida, por meio de outra transformação, passamos da camada oculta para a saída final. Da mesma forma, no diagrama RNN, você pode ter os mesmos itens adicionais entre os neurônios. + + + +
+
+ Figura 4: Arquitetura RNN de Yann +
+ + + +### Quatro tipos de arquiteturas RNN e exemplos + + + +O primeiro caso é vetor para sequência. A entrada é uma bolha e então haverá evoluções do estado interno do sistema anotadas como essas bolhas verdes. Conforme o estado do sistema evolui, em cada etapa de tempo haverá uma saída específica. + + + +
+
+ Figura 5: Vec para Seq +
+ + + +Um exemplo desse tipo de arquitetura é ter a entrada como uma imagem, enquanto a saída será uma sequência de palavras representando as descrições em inglês da imagem de entrada. Para explicar usando a Figura 6, cada bolha azul aqui pode ser um índice em um dicionário de palavras em inglês. Por exemplo, se o resultado for a frase "Este é um ônibus escolar amarelo". Primeiro, você obtém o índice da palavra "Isto" e, em seguida, obtém o índice da palavra "é" e assim por diante. Alguns dos resultados desta rede são mostrados a seguir. Por exemplo, na primeira coluna a descrição da última imagem é "Uma manada de elefantes caminhando por um campo de grama seca.", Que é muito bem refinada. Então, na segunda coluna, a primeira imagem mostra "Dois cachorros brincando na grama.", Enquanto na verdade são três cachorros. Na última coluna estão os exemplos mais errados, como "Um ônibus escolar amarelo estacionado em um estacionamento". Em geral, esses resultados mostram que essa rede pode falhar drasticamente e funcionar bem às vezes. É o caso de um vetor de entrada, que é a representação de uma imagem, para uma sequência de símbolos, que são, por exemplo, caracteres ou palavras que constituem as frases em inglês. Este tipo de arquitetura é denominado rede autoregressiva. Uma rede autoregressiva é uma rede que fornece uma saída, dado que você alimenta como entrada a saída anterior. + + + +
+
+ Figura 6: vec2seq Exemplo: Imagem para Texto +
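+
+Um esboço (hipotético) do laço autoregressivo descrito acima: partimos da representação da imagem e, a cada passo, o índice da palavra produzida é realimentado como entrada do passo seguinte, até um símbolo de fim de frase. A função `decoder_step` é apenas um nome ilustrativo para um passo do decodificador.
+
+```python
+import torch
+
+def generate_caption(decoder_step, img_repr, start_idx, end_idx, max_len=20):
+    """decoder_step(indice_token, estado) -> (logits, novo_estado)."""
+    words, state = [], img_repr          # o estado inicial vem da imagem
+    token = start_idx
+    for _ in range(max_len):
+        logits, state = decoder_step(token, state)
+        token = int(torch.argmax(logits))    # índice no dicionário de palavras
+        if token == end_idx:
+            break
+        words.append(token)
+    return words
+```
+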
+ + + +O segundo tipo é a sequência para um vetor final. Essa rede continua alimentando uma sequência de símbolos e somente no final dá uma saída final. Uma aplicação disso pode ser usar a rede para interpretar Python. Por exemplo, a entrada são essas linhas do programa Python. + + + +
+
+ Figura 7: Seq para Vec +
+ + + +
+
+ Figura 8: Linhas de entrada de códigos Python +
+ + + +Então, a rede será capaz de produzir a solução correta deste programa. Outro programa mais complicado como este: +
+
+ Figura 9: Linhas de entrada de códigos Python em um caso mais completo +
+ + + +Então, a saída deve ser 12184. Esses dois exemplos mostram que você pode treinar uma rede neural para fazer esse tipo de operação. Precisamos apenas alimentar uma sequência de símbolos e fazer com que a saída final seja um valor específico. + + + +O terceiro é seqüência para vetor para seqüência, mostrado na Figura 10. Essa arquitetura costumava ser a forma padrão de realizar a tradução de idiomas. Você começa com uma sequência de símbolos mostrados aqui em rosa. Então, tudo se condensa neste h final, que representa um conceito. Por exemplo, podemos ter uma frase como entrada e comprimi-la temporariamente em um vetor, que representa o significado e a mensagem a ser enviada. Então, depois de obter esse significado em qualquer representação, a rede o desenrola de volta para uma linguagem diferente. Por exemplo, "Hoje estou muito feliz" em uma sequência de palavras em inglês pode ser traduzido para o italiano ou chinês. Em geral, a rede obtém algum tipo de codificação como entradas e as transforma em uma representação compactada. Finalmente, ele realiza a decodificação dada a mesma versão compactada. Recentemente, vimos redes como Transformers, que abordaremos na próxima lição, superar esse método em tarefas de tradução de idiomas. Este tipo de arquitetura era o estado da arte há cerca de dois anos (2018). + + + +
+
+ Figura 10: Seq para Vec para Seq +
+ + + +Se você fizer um PCA sobre o espaço latente, terá as palavras agrupadas por semântica como mostrado neste gráfico. + + + +
+
+ Figura 11: Palavras agrupadas por semântica após PCA +
+ + + +Se aumentarmos o zoom, veremos que no mesmo local estão todos os meses, como janeiro e novembro. +
+
+ Figura 12: Ampliação de grupos de palavras +
+ + + +Se você focar em uma região diferente, obterá frases como "alguns dias atrás" "nos próximos meses" etc. +
+
+ Figura 13: Grupos de palavras em outra região +
+ + + +A partir desses exemplos, vemos que diferentes locais terão alguns significados comuns específicos. + + + +A Figura 14 mostra como, com o treinamento, esse tipo de rede irá captar alguns recursos semânticos. Por exemplo, neste caso, você pode ver que há um vetor conectando homem a mulher e outro entre rei e rainha, o que significa que mulher menos homem será igual a rainha menos rei. Você obterá a mesma distância neste espaço de embeddings aplicado a casos como masculino-feminino. Outro exemplo será caminhar para caminhar e nadar para nadar. Você sempre pode aplicar esse tipo de transformação linear específica, indo de uma palavra para outra ou de um país para a capital. + + + +
+
+ Figura 14: recursos semânticos escolhidos durante o treinamento +
+ + + +O quarto e último caso é seqüência a seqüência. Nessa rede, conforme você começa a alimentar a entrada, a rede começa a gerar saídas. Um exemplo desse tipo de arquitetura é o T9. Se você se lembra de usar um telefone Nokia, receberá sugestões de texto enquanto digita. Outro exemplo é a fala com legendas. Um exemplo legal é este escritor RNN. Quando você começa a digitar "os anéis de Saturno brilharam enquanto", isso sugere o seguinte "dois homens se entreolharam". Esta rede foi treinada em alguns romances de ficção científica para que você simplesmente digite algo e deixe que ela faça sugestões para ajudá-lo a escrever um livro. Mais um exemplo é mostrado na Figura 16. Você insere o prompt superior e, em seguida, esta rede tentará completar o resto. + + + +
+
+ Figura 15: Seq a Seq +
+ + + +
+
+ Figura 16: Modelo de preenchimento automático de texto do modelo Seq para Seq +
+ + + +## [Retropropagação no tempo](https://www.youtube.com/watch?v=8cAffg2jaT0&t=855s) + + + +### Arquitetura do modelo + + + +Para treinar um RNN, a retropropagação através do tempo (BPTT) deve ser usada. A arquitetura do modelo do RNN é fornecida na figura abaixo. O design da esquerda usa a representação do loop, enquanto a figura da direita desdobra o loop em uma linha ao longo do tempo. + + + +
+
+ Figura 17: Retropropagação ao longo do tempo +
+
+As representações ocultas são indicadas como
+
+$$
+\begin{aligned}
+\begin{cases}
+h[t]&= g(W_{h}\begin{bmatrix}
+x[t] \\
+h[t-1]
+\end{bmatrix}
++b_h) \\
+h[0]&\dot=\ \boldsymbol{0},\ W_h\dot=\left[ W_{hx} W_{hh}\right] \\
+\hat{y}[t]&= g(W_yh[t]+b_y)
+\end{cases}
+\end{aligned}
+$$
+
+A primeira equação indica uma função não linear aplicada a uma rotação da versão empilhada da entrada, à qual a configuração anterior da camada oculta é anexada. No início, $h[0]$ é definido como 0. Para simplificar a equação, $W_h$ pode ser escrita como duas matrizes separadas, $\left[ W_{hx}\ W_{hh}\right]$; portanto, às vezes a transformação pode ser escrita como
+
+$$
+W_{hx}\cdot x[t] + W_{hh}\cdot h[t-1]
+$$
+
+que corresponde à representação empilhada da entrada.
+
+$y[t]$ é calculado na rotação final e então podemos usar a regra da cadeia para retropropagar o erro até a etapa de tempo anterior.
+
+
+### "Loteamento" na Modelagem de Linguagem
+
+Ao lidar com uma sequência de símbolos, podemos agrupar o texto em lotes de diferentes tamanhos. Por exemplo, ao lidar com as sequências mostradas na figura a seguir, o "loteamento" (*batch-ification*) pode ser aplicado primeiro, de modo que o domínio do tempo seja preservado verticalmente. Nesse caso, o tamanho do lote é definido como 4.
+
+
+ Figura 18: "Loteamento" (Batch-Ification) +
+ + + +Se o período $T$ da retropropagação baseada no tempo (BPTT) for definido como 3, a primeira entrada $x[1:T]$ e a saída $y[1:T]$ para RNN é determinada como + + + +$$ +\begin{aligned} +x[1:T] &= \begin{bmatrix} +a & g & m & s \\ +b & h & n & t \\ +c & i & o & u \\ +\end{bmatrix} \\ +y[1:T] &= \begin{bmatrix} +b & h & n & t \\ +c & i & o & u \\ +d & j & p & v +\end{bmatrix} +\end{aligned} +$$ + + + +Ao realizar RNN no primeiro lote, em primeiro lugar, alimentamos $x[1] = [a\ g\ m\ s]$ em RNN e forçamos a saída a ser $y[1] = [b\ h\ n\ t]$. A representação oculta $h[1]$ será enviada para a próxima etapa de tempo para ajudar o RNN a prever $y[2]$ a partir de $x[2]$. Depois de enviar $h[T-1]$ para o conjunto final de $x[T]$ e $y[T]$, cortamos o processo de propagação de gradiente para $h[T]$ e $h[0]$ então que os gradientes não se propagam infinitamente (.detach () no Pytorch). Todo o processo é mostrado na figura abaixo. + + + +
+
+ Figura 19: "Loteamento" (Batch-Ification) +
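+
+Um esboço mínimo, em PyTorch, do procedimento descrito acima: a sequência é percorrida em blocos de $T$ passos, o estado oculto é carregado de um bloco para o outro e `.detach()` impede que os gradientes se propaguem indefinidamente. Modelo, critério e otimizador são apenas ilustrativos; assume-se que `model(x, h)` devolve a previsão e o novo estado oculto.
+
+```python
+def tbptt_train(model, criterion, optimiser, x_chunks, y_chunks, h0):
+    """x_chunks/y_chunks: listas de blocos (T, lote, ...) já fatiados."""
+    h = h0
+    for x, y in zip(x_chunks, y_chunks):
+        h = h.detach()            # corta o grafo: o gradiente não sai do bloco
+        optimiser.zero_grad()
+        y_hat, h = model(x, h)    # o estado oculto segue para o próximo bloco
+        loss = criterion(y_hat, y)
+        loss.backward()
+        optimiser.step()
+```
+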
+ + + +## Dissipação e Explosão de Gradiente + + + +### Problema + + + +
+
+ Figura 20: Problema de dissipação +
+
+A figura acima é uma arquitetura RNN típica. Para realizar a rotação pelas etapas anteriores no RNN, usamos matrizes, que podem ser vistas como as setas horizontais no modelo acima. Uma vez que as matrizes podem alterar o tamanho das saídas, se o determinante da matriz selecionada for maior que 1, o gradiente crescerá com o tempo e causará a explosão do gradiente. Em contrapartida, se o autovalor selecionado for próximo de 0, o processo de propagação encolherá os gradientes e levará à dissipação do gradiente.
+
+Em RNNs típicas, os gradientes são propagados por todas as setas possíveis, o que dá aos gradientes uma grande chance de desaparecer ou explodir. Por exemplo, o gradiente no tempo 1 é grande, o que é indicado pela cor brilhante. Quando ele passa por uma rotação, o gradiente encolhe muito e, no tempo 3, ele morre.
+
+
+### Solução
+
+Uma forma ideal de evitar que os gradientes explodam ou desapareçam é usar conexões de salto (*skip connections*). Para isso, podem ser usadas redes multiplicativas.
+
+
+ Figura 21: Pular conexão +
+ + + +No caso acima, dividimos a rede original em 4 redes. Pegue a primeira rede, por exemplo. Ele obtém um valor da entrada no tempo 1 e envia a saída para o primeiro estado intermediário na camada oculta. O estado tem 3 outras redes onde $ \ circ $ s permite que os gradientes passem enquanto $ - $ s bloqueia a propagação. Essa técnica é chamada de rede recorrente com portas. + + + +O LSTM é um RNN fechado predominante e é apresentado em detalhes nas seções a seguir. + + + +## [Long Short-Term Memory](https://www.youtube.com/watch?v=8cAffg2jaT0&t=1838s) + + + +### Arquitetura do Modelo + + + +Abaixo estão as equações que expressam um LSTM. A porta de entrada é destacada por caixas amarelas, que será uma transformação afim. Essa transformação de entrada multiplicará $ c [t] $, que é nossa porta candidata. + + + +
+
+ Figura 22: Arquitetura LSTM +
+
+A porta de esquecimento (*forget gate*) multiplica o valor anterior da memória da célula, $c[t-1]$. O valor total da célula $c[t]$ é a contribuição da porta de esquecimento mais a da porta de entrada. A representação oculta final é a multiplicação elemento a elemento entre a porta de saída $o[t]$ e a versão em tangente hiperbólica da célula $c[t]$, de forma que os valores fiquem limitados. Finalmente, a porta candidata $\tilde{c}[t]$ é simplesmente uma rede recorrente. Portanto, temos $o[t]$ para modular a saída, $f[t]$ para modular a porta de esquecimento e $i[t]$ para modular a porta de entrada. Todas essas interações entre memória e portas são interações multiplicativas. $i[t]$, $f[t]$ e $o[t]$ são todos sigmoides, indo de zero a um. Portanto, ao multiplicar por zero, você tem uma porta fechada; ao multiplicar por um, você tem uma porta aberta.
+
+Como desligamos a saída? Digamos que temos uma representação interna roxa $th$ e colocamos um zero na porta de saída. Então, a saída será zero multiplicado por alguma coisa, e obteremos um zero. Se colocarmos um na porta de saída, obteremos o mesmo valor da representação roxa.
+
+
+ Figura 23: Arquitetura LSTM - Saída Ligada +
+ + + +
+
+ Figura 24: Arquitetura LSTM - Saída Desligada +
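+
+Uma ilustração numérica mínima, em PyTorch, do efeito da porta de saída descrito acima: com $o[t] = 0$ a saída é anulada; com $o[t] = 1$ ela reproduz $\tanh(c[t])$.
+
+```python
+import torch
+
+c = torch.tensor([0.5, -1.0, 2.0])            # estado da célula c[t]
+for o in (torch.zeros(3), torch.ones(3)):     # porta de saída fechada / aberta
+    h = o * torch.tanh(c)                     # h[t] = o[t] ⊙ tanh(c[t])
+    print(o, h)
+# o = 0 -> h = [0, 0, 0];  o = 1 -> h = tanh(c)
+```
+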
+
+Da mesma forma, podemos controlar a memória. Por exemplo, podemos redefini-la fazendo com que $f[t]$ e $i[t]$ sejam zero: após a multiplicação e a soma, temos um zero na memória. Caso contrário, podemos manter a memória: ainda zeramos a representação interna $th$, mas mantemos um em $f[t]$; assim, a soma recebe $c[t-1]$ e continua sendo enviada adiante. Finalmente, para escrever, podemos colocar um na porta de entrada, de modo que a multiplicação fique roxa, e então colocar um zero na porta de esquecimento, para que a memória anterior seja de fato esquecida.
+
+
+ Figura 25: Visualização da célula de memória +
+ + + +
+
+ Figura 26: Arquitetura LSTM - Redefinir memória +
+ + + +
+
+ Figura 27: Arquitetura LSTM - Manter memória +
+ + + +
+
+ Figura 28: Arquitetura LSTM - Memória de Gravação +
+
+## Exemplos de Notebook
+
+
+### Classificação de sequências
+
+
+O objetivo é classificar sequências. Elementos e alvos são representados localmente (vetores de entrada com apenas um bit diferente de zero). A sequência **começa** com um `B`, **termina** com um `E` (o "símbolo de gatilho") e, no restante, consiste em símbolos escolhidos aleatoriamente do conjunto `{a, b, c, d}`, exceto por dois elementos nas posições $t_1$ e $t_2$, que são `X` ou `Y`. Para o caso `DifficultyLevel.HARD`, o comprimento da sequência é escolhido aleatoriamente entre 100 e 110, $t_1$ é escolhido aleatoriamente entre 10 e 20, e $t_2$ é escolhido aleatoriamente entre 50 e 60. Existem 4 classes de sequência, `Q`, `R`, `S` e `U`, que dependem da ordem temporal de `X` e `Y`. As regras são: `X, X -> Q`; `X, Y -> R`; `Y, X -> S`; `Y, Y -> U`.
+
+1). Exploração do conjunto de dados
+
+O tipo de retorno de um gerador de dados é uma tupla de comprimento 2. O primeiro item da tupla é o lote de sequências com forma $(32, 9, 8)$. Esses são os dados que serão alimentados na rede. Existem oito símbolos diferentes em cada linha (`X`, `Y`, `a`, `b`, `c`, `d`, `B`, `E`). Cada linha é um vetor *one-hot*. Uma sequência de linhas representa uma sequência de símbolos. A primeira linha, totalmente zero, é o preenchimento (*padding*). Usamos preenchimento quando o comprimento da sequência é menor que o comprimento máximo no lote. O segundo item da tupla é o lote correspondente de rótulos de classe com forma $(32, 4)$, uma vez que temos 4 classes (`Q`, `R`, `S` e `U`). A primeira sequência é `BbXcXcbE`; portanto, seu rótulo de classe decodificado é $[1, 0, 0, 0]$, correspondendo a `Q`.
+
+
+ Figura 29: Exemplo de vetor de entrada +
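+
+Um esboço autocontido (com valores meramente ilustrativos, inclusive a ordem dos símbolos) que reconstrói a primeira sequência do lote descrito acima e a decodifica a partir dos vetores one-hot:
+
+```python
+import torch
+
+symbols = ['X', 'Y', 'a', 'b', 'c', 'd', 'B', 'E']   # ordem assumida, ilustrativa
+
+x = torch.zeros(32, 9, 8)              # lote de sequências one-hot, forma (32, 9, 8)
+y = torch.zeros(32, 4); y[0, 0] = 1    # rótulos de classe, forma (32, 4)
+for t, s in enumerate('BbXcXcbE'):     # codifica a 1ª sequência a partir da linha 1
+    x[0, t + 1, symbols.index(s)] = 1  # a linha 0, toda zero, é o preenchimento
+
+decoded = [symbols[int(r.argmax())] for r in x[0] if r.sum() > 0]
+print(''.join(decoded), y[0])          # 'BbXcXcbE', classe Q -> [1, 0, 0, 0]
+```
+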
+
+2). Definição do modelo e treinamento
+
+Vamos criar uma rede recorrente simples e um LSTM, e treinar por 10 épocas. No ciclo de treinamento, devemos sempre executar cinco etapas (veja o esboço de código logo após a Figura 30):
+
+ * Executar o passo para a frente (*forward*) do modelo
+ * Calcular a perda
+ * Zerar o cache de gradientes
+ * Retropropagar para calcular a derivada parcial da perda em relação aos parâmetros
+ * Dar um passo na direção oposta à do gradiente
+
+
+ Figura 30: RNN Simples *vs.* LSTM - 10 épocas +
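+
+Um esboço mínimo, em PyTorch, das cinco etapas do ciclo de treinamento listadas acima (modelo, critério, otimizador e dimensões são apenas ilustrativos); o recorte de gradientes, um dos truques mencionados na parte B da aula, aparece como passo opcional.
+
+```python
+import torch
+import torch.nn as nn
+
+model = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
+readout = nn.Linear(16, 4)             # 4 classes: Q, R, S, U
+criterion = nn.CrossEntropyLoss()
+optimiser = torch.optim.Adam(list(model.parameters()) + list(readout.parameters()))
+
+def train_step(x, target):             # x: (lote, T, 8); target: índices de classe (lote,)
+    optimiser.zero_grad()              # zera o cache de gradientes
+    out, _ = model(x)                  # passo para a frente do modelo
+    logits = readout(out[:, -1])       # classifica a partir do último estado
+    loss = criterion(logits, target)   # calcula a perda
+    loss.backward()                    # retropropaga a perda até os parâmetros
+    # torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # recorte opcional
+    optimiser.step()                   # passo na direção oposta à do gradiente
+    return loss.item()
+```
+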
+ + + +Com um nível de dificuldade fácil, RNN obtém 50% de precisão enquanto LSTM obtém 100% após 10 épocas. Mas LSTM tem quatro vezes mais pesos do que RNN e tem duas camadas ocultas, portanto, não é uma comparação justa. Após 100 épocas, o RNN também obtém 100% de precisão, levando mais tempo para treinar do que o LSTM. + + + +
+
+Figure 31: RNN Simples *vs.* LSTM - 100 Épocas +
+ + + +Se aumentarmos a dificuldade da parte de treinamento (usando sequências mais longas), veremos o RNN falhar enquanto o LSTM continua a funcionar. + + + +
+
+Figure 32: Visualização do valor do estado oculto +
+ + + +A visualização acima está desenhando o valor do estado oculto ao longo do tempo no LSTM. Enviaremos as entradas por meio de uma tangente hiperbólica, de forma que se a entrada estiver abaixo de $-2.5$, ela será mapeada para $-1$, e se estiver acima de $2,5$, será mapeada para $1$. Portanto, neste caso, podemos ver a camada oculta específica escolhida em `X` (quinta linha na imagem) e então ela se tornou vermelha até que obtivemos o outro` X`. Assim, a quinta unidade oculta da célula é acionada observando o `X` e fica quieta após ver o outro` X`. Isso nos permite reconhecer a classe de sequência. + + + +### Eco de sinal + + + +Ecoar o sinal n etapas é um exemplo de tarefa muitos-para-muitos sincronizada. Por exemplo, a 1ª sequência de entrada é `"1 1 0 0 1 0 1 1 0 0 0 0 0 0 0 0 1 1 1 1 ..."`, e a 1ª sequência de destino é `"0 0 0 1 1 0 0 1 0 1 1 0 0 0 0 0 0 0 0 1 ..."`. Nesse caso, a saída ocorre três etapas depois. Portanto, precisamos de uma memória de trabalho de curta duração para manter as informações. Já no modelo de linguagem, diz algo que ainda não foi dito. + + + +Antes de enviarmos toda a sequência para a rede e forçarmos o destino final a ser algo, precisamos cortar a sequência longa em pequenos pedaços. Ao alimentar um novo pedaço, precisamos acompanhar o estado oculto e enviá-lo como entrada para o estado interno ao adicionar o próximo novo pedaço. No LSTM, você pode manter a memória por muito tempo, desde que tenha capacidade suficiente. No RNN, depois de atingir um determinado comprimento, começa a esquecer o que aconteceu no passado. + diff --git a/docs/pt/week06/06.md b/docs/pt/week06/06.md new file mode 100644 index 000000000..6834deefc --- /dev/null +++ b/docs/pt/week06/06.md @@ -0,0 +1,36 @@ +--- +lang: pt +lang-ref: ch.06 +title: Semana 6 +translator: Bernardo Lago +--- + + + +## Aula parte A + +Discutimos três aplicações de redes neurais convolucionais. Começamos com o reconhecimento de dígitos e a aplicação para um reconhecimento de código postal (CEP) de 5 dígitos. Na detecção de objetos, falamos sobre como usar a arquitetura multi-escala em uma configuração de detecção de faces. Por último, vimos como ConvNets são usados em tarefas de segmentação semântica com exemplos concretos em um sistema de visão robótica e segmentação de objetos em um ambiente urbano. + + + +## Aula parte B + +Examinamos redes neurais recorrentes, seus problemas e técnicas comuns para mitigar esses problemas. Em seguida, revisamos uma variedade de módulos desenvolvidos para resolver os problemas do modelo RNN, incluindo Atenção (Attention), GRUs (Gated Recurrent Unit), LSTMs (Long Short-Term Memory) e Seq2Seq. + + + + +## Prática + +Discutimos a arquitetura dos modelos de RNN básica (vanilla) e LSTM e comparamos o desempenho entre os dois. O LSTM herda as vantagens do RNN, ao mesmo tempo em que melhora os pontos fracos do RNN ao incluir uma 'célula de memória' para armazenar informações na memória por longos períodos de tempo. Os modelos LSTM superam significativamente os modelos RNN. 
\ No newline at end of file diff --git a/docs/pt/week06/lecture06.sbv b/docs/pt/week06/lecture06.sbv new file mode 100644 index 000000000..1eab7f0b2 --- /dev/null +++ b/docs/pt/week06/lecture06.sbv @@ -0,0 +1,3338 @@ +0:00:04.960,0:00:08.970 +So I want to do two things, talk about + +0:00:11.019,0:00:14.909 +Talk a little bit about like some ways to use Convolutional Nets in various ways + +0:00:16.119,0:00:18.539 +Which I haven't gone through last time + +0:00:19.630,0:00:21.630 +and + +0:00:22.689,0:00:24.689 +And I'll also + +0:00:26.619,0:00:29.518 +Talk about different types of architectures that + +0:00:30.820,0:00:33.389 +Some of which are very recently designed + +0:00:34.059,0:00:35.710 +that people have been + +0:00:35.710,0:00:40.320 +Kind of playing with for quite a while. So let's see + +0:00:43.660,0:00:47.489 +So last time when we talked about Convolutional Nets we stopped that the + +0:00:47.890,0:00:54.000 +idea that we can use Convolutional Nets with kind of a sliding we do over large images and it consists in just + +0:00:54.550,0:00:56.550 +applying the convolution on large images + +0:00:57.070,0:01:01.559 +which is a very general image, a very general method, so we're gonna + +0:01:03.610,0:01:06.900 +See a few more things on how to use convolutional Nets and + +0:01:07.659,0:01:08.580 +to some extent + +0:01:08.580,0:01:09.520 +I'm going to + +0:01:09.520,0:01:16.020 +Rely on a bit of sort of historical papers and things like this to explain kind of simple forms of all of those ideas + +0:01:17.409,0:01:21.269 +so as I said last time + +0:01:21.850,0:01:27.720 +I had this example where there's multiple characters on an image and you can, you have a convolutional net that + +0:01:28.360,0:01:32.819 +whose output is also a convolution like everyday air is a convolution so you can interpret the output as + +0:01:33.250,0:01:40.739 +basically giving you a score for every category and for every window on the input and the the framing of the window depends on + +0:01:41.860,0:01:47.879 +Like the the windows that the system observes when your back project for my particular output + +0:01:49.000,0:01:54.479 +Kind of steps by the amount of subsampling the total amount of sub something you have in a network + +0:01:54.640,0:01:59.849 +So if you have two layers that subsample by a factor of two, you have two pooling layers, for example + +0:01:59.850,0:02:02.219 +That's a factor of two the overall + +0:02:02.920,0:02:07.199 +subsampling ratio is 4 and what that means is that every output is + +0:02:07.509,0:02:14.288 +Gonna basically look at a window on the input and successive outputs is going to look at the windows that are separated by four pixels + +0:02:14.630,0:02:17.350 +Okay, it's just a product of all the subsampling layers + +0:02:20.480,0:02:21.500 +So + +0:02:21.500,0:02:24.610 +this this is nice, but then you're gonna have to make sense of + +0:02:25.220,0:02:30.190 +All the stuff that's on the input. How do you pick out objects objects that + +0:02:31.310,0:02:33.020 +overlap each other + +0:02:33.020,0:02:38.949 +Etc. 
And one thing you can do for this is called "Non maximum suppression" + +0:02:41.180,0:02:43.480 +Which is what people use in sort of object detection + +0:02:44.750,0:02:47.350 +so basically what that consists in is that if you have + +0:02:49.160,0:02:53.139 +Outputs that kind of are more or less at the same place and + +0:02:53.989,0:02:58.749 +or also like overlapping places and one of them tells you I see a + +0:02:58.910,0:03:02.199 +Bear and the other one tells you I see a horse one of them wins + +0:03:02.780,0:03:07.330 +Okay, it's probably one that's wrong. And you can't have a bear on a horse at the same time at the same place + +0:03:07.330,0:03:10.119 +So you do what's called? No, maximum suppression you can + +0:03:10.700,0:03:11.959 +Look at which + +0:03:11.959,0:03:15.429 +which of those has the highest score and you kind of pick that one or you see if + +0:03:15.500,0:03:19.660 +any neighbors also recognize that as a bear or a horse and you kind of make a + +0:03:20.360,0:03:24.999 +vote if you want, a local vote, okay, and I'm gonna go to the details of this because + +0:03:25.760,0:03:28.719 +Just just kind of rough ideas. Well, this is + +0:03:29.930,0:03:34.269 +already implemented in code that you can download and also it's kind of the topic of a + +0:03:35.030,0:03:37.509 +full-fledged computer vision course + +0:03:38.239,0:03:42.939 +So here we just allude to kind of how we use deep learning for this kind of application + +0:03:46.970,0:03:48.970 +Let's see, so here's + +0:03:50.480,0:03:55.750 +Again going back to history a little bit some ideas of how you use + +0:03:57.049,0:03:59.739 +neural nets to or convolutional nets in this case to + +0:04:00.500,0:04:04.690 +Recognize strings of characters which is kind of the same program as recognizing multiple objects, really + +0:04:05.450,0:04:12.130 +So if you have, you have an image that contains the image at the top... "two, three two, zero, six" + +0:04:12.130,0:04:15.639 +It's a zip code and the characters touch so you don't know how to separate them in advance + +0:04:15.979,0:04:22.629 +So you just apply a convolutional net to the entire string but you don't know in advance what width the characters will take and so + +0:04:24.500,0:04:30.739 +what you see here are four different sets of outputs and those four different sets of outputs of + +0:04:31.170,0:04:33.170 +the convolutional net + +0:04:33.300,0:04:36.830 +Each of which has ten rows and the ten words corresponds to each of the ten categories + +0:04:38.220,0:04:43.489 +so if you look at the top for example the top, the top block + +0:04:44.220,0:04:46.940 +the white squares represent high-scoring categories + +0:04:46.940,0:04:53.450 +So what you see on the left is that the number two is being recognized. 
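Before continuing with the zip-code figure, here is a generic greedy sketch of the non-maximum suppression idea just mentioned; the box format, the IoU helper and the threshold are assumptions, not anything specified in the lecture.

```python
# Generic greedy non-maximum suppression: keep the highest-scoring detection,
# drop anything that overlaps it too much, repeat with what is left.
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(detections, overlap_threshold=0.5):
    """detections: list of (score, box); returns the surviving detections."""
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    for score, box in detections:
        if all(iou(box, kept_box) < overlap_threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept
```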
So the window that is looked at by the + +0:04:54.120,0:04:59.690 +Output units that are on the first column is on the, on the left side of the image and it, and it detects a two + +0:05:00.330,0:05:03.499 +Because the you know their order 0 1 2 3 4 etc + +0:05:03.810,0:05:07.160 +So you see a white square that corresponds to the detection of a 2 + +0:05:07.770,0:05:09.920 +and then as the window is + +0:05:11.400,0:05:13.400 +shifted over the, over the input + +0:05:14.310,0:05:19.549 +Is a 3 or low scoring 3 that is seen then the 2 again there's three character + +0:05:19.550,0:05:24.980 +It's three detectors that see this 2 and then nothing then the 0 and then the 6 + +0:05:26.670,0:05:28.670 +Now this first + +0:05:29.580,0:05:32.419 +System looks at a fairly narrow window and + +0:05:35.940,0:05:40.190 +Or maybe it's a wide window no, I think it's a wide window so it looks at a pretty wide window and + +0:05:41.040,0:05:42.450 +it + +0:05:42.450,0:05:44.450 +when it looks at the, the + +0:05:45.240,0:05:50.030 +The two, the two that's on the left for example, it actually sees a piece of the three with it, with it + +0:05:50.030,0:05:55.459 +So it's kind of in the window the different sets of outputs here correspond to different size + +0:05:55.830,0:06:01.009 +Of the kernel of the last layer. So the second row the second block + +0:06:01.890,0:06:05.689 +The the size of the kernel is four in the horizontal dimension + +0:06:07.590,0:06:11.869 +The next one is 3 and the next one is 2. what this allows the system to do is look at + +0:06:13.380,0:06:19.010 +Regions of various width on the input without being kind of too confused by the characters that are on the side if you want + +0:06:19.500,0:06:20.630 +so for example + +0:06:20.630,0:06:28.189 +the, the, the second to the zero is very high-scoring on the, on the, the + +0:06:29.370,0:06:36.109 +Second third and fourth map but not very high-scoring on the top map. Similarly, the three is kind of high-scoring on the + +0:06:37.020,0:06:38.400 +second third and fourth map + +0:06:38.400,0:06:41.850 +but not on the first map because the three kind of overlaps with the two and so + +0:06:42.009,0:06:45.059 +It wants to really look at in our window to be able to recognize it + +0:06:45.639,0:06:47.639 +Okay. Yes + +0:06:51.400,0:06:55.380 +So it's the size of the white square that indicates the score basically, okay + +0:06:57.759,0:07:02.038 +So look at you know, this this column here you have a high-scoring zero + +0:07:03.009,0:07:06.179 +Here because it's the first the first row correspond to the category zero + +0:07:06.430,0:07:10.079 +but it's not so high-scoring from the top, the top one because that + +0:07:10.539,0:07:15.419 +output unit looks at a pretty wide input and it gets confused by the stuff that's on the side + +0:07:16.479,0:07:17.910 +Okay, so you have something like this + +0:07:17.910,0:07:23.579 +so now you have to make sense out of it and extract the best interpretation of that, of that sequence and + +0:07:24.760,0:07:31.349 +It's true for zip code, but it's true for just about every piece of text. Not every combination of characters is possible + +0:07:31.599,0:07:36.149 +so when you read English text there is, you know, an English dictionary English grammar and + +0:07:36.699,0:07:40.919 +Not every combination of character is possible so you can have a language model that + +0:07:41.470,0:07:42.610 +attempts to + +0:07:42.610,0:07:48.720 +Tell you what is the most likely sequence of characters. 
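A toy sketch of using such a constraint: rescore candidate strings so that only allowed strings are considered, picking the one whose characters get the highest total score from the network. The candidate list here is made up for illustration.

```python
# Toy sketch: combine per-position digit scores with a list of allowed strings
# (the list below is invented, not real data).
import numpy as np

def best_allowed_string(char_scores, allowed):
    """char_scores: (positions, 10) array of digit scores; allowed: list of strings."""
    def total(s):
        return sum(char_scores[i, int(ch)] for i, ch in enumerate(s))
    return max(allowed, key=total)

scores = np.random.rand(5, 10)
print(best_allowed_string(scores, ["10012", "10003", "90210"]))
```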
So we're looking at here given that this is English or whatever language + +0:07:49.510,0:07:54.929 +Or given that this is a zip code not every zip code are possible. So this --- possibility for error correction + +0:07:56.949,0:08:00.719 +So how do we take that into account? I'll come to this in a second but + +0:08:03.460,0:08:06.930 +But here what we need to do is kind of you know + +0:08:08.169,0:08:10.169 +Come up with a consistent interpretation + +0:08:10.389,0:08:15.809 +That you know, there's obviously a three there's obviously a two, a three,a zero somewhere + +0:08:16.630,0:08:19.439 +Another two etc. How to return this + +0:08:20.110,0:08:22.710 +array of scores into, into a consistent + +0:08:23.470,0:08:25.470 +interpretation + +0:08:28.610,0:08:31.759 +Is the width of the, the horizontal width of the, + +0:08:33.180,0:08:35.180 +the kernel of the last layer + +0:08:35.400,0:08:36.750 +Okay + +0:08:36.750,0:08:44.090 +Which means when you backprop---, back project on the input the, the viewing window on the input that influences that particular unit + +0:08:44.550,0:08:48.409 +has various size depending on which unit you look at. Yes + +0:08:52.500,0:08:54.500 +The width of the block yeah + +0:08:56.640,0:08:58.070 +It's a, it corresponds + +0:08:58.070,0:08:58.890 +it's how wide the + +0:08:58.890,0:09:05.090 +Input image is divided by 4 because the substantive issue is 4 so you get one of one column of those for every four pixel + +0:09:05.340,0:09:11.660 +so remember we had this, this way of using a neural net, convolutional net which is that you, you basically make every + +0:09:12.240,0:09:17.270 +Convolution larger and you view the last layer as a convolution as well. And now what you get is multiple + +0:09:17.790,0:09:23.119 +Outputs. Okay. So what I'm representing here on the slide you just saw + +0:09:23.760,0:09:30.470 +is the, is this 2d array on the output which corresponds where, where the, the row corresponds to categories + +0:09:31.320,0:09:35.030 +Okay, and each column corresponds to a different location on the input + +0:09:39.180,0:09:41.750 +And I showed you those examples here so + +0:09:42.300,0:09:50.029 +Here, this is a different representation here where the, the character that is displayed just before the title bar is you know + +0:09:50.030,0:09:56.119 +Indicates the winning category, so I'm not displaying the scores of every category. I'm just, just, just displaying the winning category here + +0:09:57.180,0:09:58.260 +but each + +0:09:58.260,0:10:04.640 +Output looks at a 32 by 32 window and the next output by looks at a 32 by 32 window shifted by 4 pixels + +0:10:04.650,0:10:06.650 +Ok, etc. 
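A tiny helper, assuming the numbers just given (total subsampling of 4, a 32-pixel viewing window), that back-projects an output column to the input span it looks at; the function name and convention are mine, not the lecture's.

```python
# Back-projection sketch: with total subsampling 4 and a 32-pixel window,
# output column i of the ConvNet scores the input pixel range below.
def input_window(i, total_stride=4, window=32):
    start = i * total_stride
    return start, start + window   # half-open range [start, start + window)

for i in range(4):
    print(i, input_window(i))
# 0 (0, 32)
# 1 (4, 36)
# 2 (8, 40)
# 3 (12, 44)
```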
+ +0:10:08.340,0:10:14.809 +So how do you turn this you know sequence of characters into the fact that it is either 3 5 or 5 3 + +0:10:29.880,0:10:33.979 +Ok, so here the reason why we have four of those is so that is because the last player + +0:10:34.800,0:10:36.270 +this different + +0:10:36.270,0:10:42.889 +Is different last layers, if you want this four different last layers each of which is trained to recognize the ten categories + +0:10:43.710,0:10:50.839 +And those last layers have different kernel width so they essentially look at different width of Windows on the input + +0:10:53.670,0:10:59.510 +So you want some that look at wide windows so they can they can recognize kind of large characters and some that look at, look + +0:10:59.510,0:11:02.119 +At narrow windows so they can recognize narrow characters without being + +0:11:03.210,0:11:05.210 +perturbed by the the neighboring characters + +0:11:09.150,0:11:14.329 +So if you know a priori that there are five five characters here because it's a zip code + +0:11:16.529,0:11:18.529 +You can do you can use a trick and + +0:11:20.010,0:11:22.010 +There is sort of few specific tricks that + +0:11:23.130,0:11:27.140 +I can explain but I'm going to explain sort of the general trick if you want. I + +0:11:27.959,0:11:30.619 +Didn't want to talk about this actually at least not now + +0:11:31.709,0:11:37.729 +Okay here so here's a general trick the general trick is or the you know, kind of a somewhat specific trick + +0:11:38.370,0:11:40.609 +Oops, I don't know way it keeps changing slide + +0:11:43.890,0:11:50.809 +You say I have I know I have five characters in this word, is there a + +0:11:57.990,0:12:01.760 +So that's one of those arrays that produces scores so for each category + +0:12:03.060,0:12:07.279 +Let's say I have four categories here and each location + +0:12:11.339,0:12:18.049 +There's a score, okay and let's say I know that I want five characters out + +0:12:20.250,0:12:27.469 +I'm gonna draw them vertically one two, three, four five because it's a zip code + +0:12:29.579,0:12:34.279 +So the question I'm going to ask now is what is the best character I can put in this and + +0:12:35.220,0:12:37.220 +In this slot in the first slot + +0:12:38.699,0:12:43.188 +And the way I'm going to do this is I'm gonna draw an array + +0:12:48.569,0:12:50.569 +And on this array + +0:12:54.120,0:13:01.429 +I'm going to say what's the score here for, at every intersection in the array? + +0:13:07.860,0:13:11.659 +It's gonna be, what is the, what is the score of putting + +0:13:12.269,0:13:17.899 +A particular character here at that location given the score that I have at the output of my neural net + +0:13:19.560,0:13:21.560 +Okay, so let's say that + +0:13:24.480,0:13:28.159 +So what I'm gonna have to decide is since I have fewer characters + +0:13:29.550,0:13:32.539 +On the on the output to the system five + +0:13:33.329,0:13:39.919 +Then I have viewing windows and scores produced by the by the system. 
I'm gonna have to figure out which one I drop + +0:13:40.949,0:13:42.949 +okay, and + +0:13:43.860,0:13:47.689 +What I can do is build this, build this array + +0:13:55.530,0:13:57.530 +And + +0:14:01.220,0:14:09.010 +What I need to do is go from here to here by finding a path through this through this array + +0:14:15.740,0:14:17.859 +In such a way that I have exactly five + +0:14:20.420,0:14:24.640 +Steps if you want, so each step corresponds to to a character and + +0:14:25.790,0:14:31.630 +the overall score of a particular string is the overall is the sum of all the scores that + +0:14:33.050,0:14:37.060 +Are along this path in other words if I get + +0:14:39.560,0:14:41.560 +Three + +0:14:41.930,0:14:47.890 +Instances here, three locations where I have a high score for this particular category, which is category one. Okay let's call it 0 + +0:14:48.440,0:14:50.440 +So 1 2 3 + +0:14:51.140,0:14:54.129 +I'm gonna say this is the same guy and it's a 1 + +0:14:55.460,0:14:57.460 +and here if I have + +0:14:58.160,0:15:03.160 +Two guys. I have high score for 3, I'm gonna say those are the 3 and here + +0:15:03.160,0:15:08.800 +I have only one guy that has high score for 2. So that's a 2 etc + +0:15:11.930,0:15:13.370 +So + +0:15:13.370,0:15:15.880 +This path here has to be sort of continuous + +0:15:16.580,0:15:23.080 +I can't jump from one position to another because that would be kind of breaking the order of the characters. Okay? + +0:15:24.650,0:15:31.809 +And I need to find a path that goes through high-scoring cells if you want that correspond to + +0:15:33.500,0:15:36.489 +High scoring categories along this path and it's a way of + +0:15:37.190,0:15:39.190 +saying you know if I have + +0:15:39.950,0:15:43.150 +if those three cells here or + +0:15:44.000,0:15:47.530 +Give me the same character. It's only one character. I'm just going to output + +0:15:48.440,0:15:50.799 +One here that corresponds to this + +0:15:51.380,0:15:57.189 +Ok, those three guys have high score. I stay on the one, on the one and then I transition + +0:15:57.770,0:16:02.379 +To the second character. So now I'm going to fill out this slot and this guy has high score for three + +0:16:02.750,0:16:06.880 +So I'm going to put three here and this guy has a high score for two + +0:16:07.400,0:16:08.930 +as two + +0:16:08.930,0:16:10.930 +Etc + +0:16:14.370,0:16:19.669 +The principle to find this this path is a shortest path algorithm + +0:16:19.670,0:16:25.190 +You can think of this as a graph where I can go from the lower left cell to the upper right cell + +0:16:25.560,0:16:27.560 +By either going to the left + +0:16:28.410,0:16:32.269 +or going up and to the left and + +0:16:35.220,0:16:38.660 +For each of those transitions there is a there's a cost and for each of the + +0:16:39.060,0:16:45.169 +For putting a character at that location, there is also a cost or a score if you want + +0:16:47.460,0:16:49.460 +So the overall + +0:16:50.700,0:16:57.049 +Score of the one at the bottom would be the combined score of the three locations that detect that one and + +0:16:59.130,0:17:01.340 +Because it's more all three of them are + +0:17:02.730,0:17:04.730 +contributing evidence to the fact that there is a 1 + +0:17:06.720,0:17:08.959 +When you constrain the path to have 5 steps + +0:17:10.530,0:17:14.930 +Ok, it has to go from the bottom left to the top right and + +0:17:15.930,0:17:18.169 +It has 5 steps, so it has to go through 5 steps + +0:17:18.750,0:17:24.290 +There's no choice. 
That's that's how you force the system to kind of give you 5 characters basically, right? + +0:17:24.810,0:17:28.909 +And because the path can only go from left to right and from top to bottom + +0:17:30.330,0:17:33.680 +It has to give you the characters in the order in which they appear in the image + +0:17:34.350,0:17:41.240 +So it's a way of imposing the order of the character and imposing that there are fives, there are five characters in the string. Yes + +0:17:42.840,0:17:48.170 +Yes, okay in the back, yes, right. Yes + +0:17:52.050,0:17:55.129 +Well, so if we have just the string of one you have to have + +0:17:55.680,0:18:02.539 +Trained the system in advance so that when it's in between two ones or two characters, whatever they are, it says nothing + +0:18:02.540,0:18:04.540 +it says none of the above + +0:18:04.740,0:18:06.740 +Otherwise you can tell, right + +0:18:07.140,0:18:11.359 +Yeah, a system like this needs to be able to tell you this is none of the above. It's not a character + +0:18:11.360,0:18:16.160 +It's a piece of it or I'm in the middle of two characters or I have two characters on the side + +0:18:16.160,0:18:17.550 +But nothing in the middle + +0:18:17.550,0:18:19.550 +Yeah, absolutely + +0:18:24.300,0:18:26.300 +It's a form of non maximum suppression + +0:18:26.300,0:18:31.099 +so you can think of this as kind of a smart form of non maximum suppression where you say like for every location you can only + +0:18:31.100,0:18:31.950 +have one + +0:18:31.950,0:18:33.950 +character + +0:18:33.990,0:18:40.370 +And the order in which you produce the five characters must correspond to the order in which they appear on the image + +0:18:41.640,0:18:47.420 +What you don't know is how to warp one into the other. Okay. So how to kind of you know, how many + +0:18:48.210,0:18:53.780 +detectors are gonna see the number two. It may be three of them and we're gonna decide they're all the same + +0:19:00.059,0:19:02.748 +So the thing is for all of you who + +0:19:03.629,0:19:06.469 +are on computer science, which is not everyone + +0:19:07.590,0:19:12.379 +The the way you compute this path is just a shortest path algorithm. You do this with dynamic programming + +0:19:13.499,0:19:15.090 +Okay + +0:19:15.090,0:19:21.350 +so find the shortest path to go from bottom left to top right by going through by only going to + +0:19:22.080,0:19:25.610 +only taking transition to the right or diagonally and + +0:19:26.369,0:19:28.369 +by minimizing the + +0:19:28.830,0:19:31.069 +cost so if you think each of those + +0:19:31.710,0:19:38.659 +Is is filled by a cost or maximizing the score if you think that scores there are probabilities, for example + +0:19:38.789,0:19:41.479 +And it's just a shortest path algorithm in a graph + +0:19:54.840,0:19:56.840 +This kind of method by the way was + +0:19:57.090,0:20:04.730 +So many early methods of speech recognition kind of work this way, not with neural nets though. We sort of hand extracted features from + +0:20:05.909,0:20:13.189 +but it would basically match the sequence of vectors extracted from a speech signal to a template of a word and then you + +0:20:13.409,0:20:17.809 +know try to see how you warp the time to match the the + +0:20:19.259,0:20:24.559 +The word to be recognized to to the templates and you had a template for every word over fixed size + +0:20:25.679,0:20:32.569 +This was called DTW, dynamic time working. 
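A small dynamic-programming sketch of the decoding idea just described, in my own formulation rather than the lecture's code: given a (windows x categories) array of scores and a known string length, group consecutive windows into exactly that many slots and return the highest-scoring string, exactly the shortest-path-through-the-grid picture above.

```python
import numpy as np

def decode_fixed_length(scores, length):
    """scores: (T, C) array, one score per viewing window and category.
    Consecutive windows are grouped into exactly `length` slots; the
    best-scoring sequence of slot categories is returned (Viterbi-style)."""
    T, C = scores.shape
    assert T >= length
    NEG = -1e18
    dp = np.full((T, length, C), NEG)        # best score at (window t, slot k, category c)
    prev = np.zeros((T, length, C, 2), int)  # backpointer: (previous slot, previous category)
    dp[0, 0, :] = scores[0]
    for t in range(1, T):
        for k in range(length):
            for c in range(C):
                # either window t stays in the same slot with the same category...
                best, arg = dp[t - 1, k, c], (k, c)
                # ...or it opens slot k, coming from the best category of slot k-1
                if k > 0:
                    c_prev = int(dp[t - 1, k - 1].argmax())
                    if dp[t - 1, k - 1, c_prev] > best:
                        best, arg = dp[t - 1, k - 1, c_prev], (k - 1, c_prev)
                dp[t, k, c] = best + scores[t, c]
                prev[t, k, c] = arg
    # backtrack from the best final state (all `length` slots must have been used)
    k, c = length - 1, int(dp[T - 1, length - 1].argmax())
    slots = [None] * length
    for t in range(T - 1, -1, -1):
        slots[k] = int(c)
        k, c = prev[t, k, c]
    return slots

# Toy usage: 6 window positions, 10 digit categories, a known string length of 3.
rng = np.random.default_rng(0)
print(decode_fixed_length(rng.random((6, 10)), 3))
```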
There's more sophisticated version of it called hidden markov models, but it's very similar + +0:20:33.600,0:20:35.600 +People still do this to some extent + +0:20:43.000,0:20:44.940 +Okay + +0:20:44.940,0:20:49.880 +So detection, so if you want to apply commercial net for detection + +0:20:50.820,0:20:55.380 +it works amazingly well, and it's surprisingly simple, but you + +0:20:56.020,0:20:57.210 +You know what you need to do + +0:20:57.210,0:20:59.210 +You basically need to let's say you wanna do face detection + +0:20:59.440,0:21:05.130 +Which is a very easy problem one of the first problems that computer vision started solving really well for kind of recognition + +0:21:05.500,0:21:07.500 +you collect a data set of + +0:21:08.260,0:21:11.249 +images with faces and images without faces and + +0:21:12.160,0:21:13.900 +you train a + +0:21:13.900,0:21:19.379 +convolutional net with input window in something like 20 by 20 or 30 by 30 pixels? + +0:21:19.870,0:21:21.959 +To tell you whether there is a face in it or not + +0:21:22.570,0:21:28.620 +Okay. Now you take this convolutional net, you apply it on an image and if there is a face that happens to be roughly + +0:21:29.230,0:21:31.230 +30 by 30 pixels the + +0:21:31.809,0:21:35.699 +the content will will light up at the corresponding output and + +0:21:36.460,0:21:38.460 +Not light up when there is no face + +0:21:39.130,0:21:41.999 +now there is two problems with this, the first problem is + +0:21:42.940,0:21:47.370 +there is many many ways a patch of an image can be a non face and + +0:21:48.130,0:21:53.489 +During your training, you probably haven't seen all of them. You haven't seen even a representative set of them + +0:21:53.950,0:21:56.250 +So your system is gonna have lots of false positives + +0:21:58.390,0:22:04.709 +That's the first problem. Second problem is in the picture not all faces are 30 by 30 pixels. So how do you handle + +0:22:05.380,0:22:10.229 +Size variation so one way to handle size variation, which is very simple + +0:22:10.230,0:22:14.010 +but it's mostly unnecessary in modern versions, well + +0:22:14.860,0:22:16.860 + at least it's not completely necessary + +0:22:16.929,0:22:22.499 +Is you do a multiscale approach. So you take your image you run your detector on it. It fires whenever it wants + +0:22:23.440,0:22:27.299 +And you will detect faces are small then you reduce the image by + +0:22:27.850,0:22:30.179 +Some scale in this case, in this case here + +0:22:30.179,0:22:31.419 +I take a square root of two + +0:22:31.419,0:22:36.599 +You apply the convolutional net again on that smaller image and now it's going to be able to detect faces that are + +0:22:38.350,0:22:45.750 +That were larger in the original image because now what was 30 by 30 pixel is now about 20 by 20 pixels, roughly + +0:22:47.169,0:22:48.850 +Okay + +0:22:48.850,0:22:53.309 +But there may be bigger faces there. So you scale the image again by a factor of square root of 2 + +0:22:53.309,0:22:57.769 +So now the images the size of the original one and you run the convolutional net again + +0:22:57.770,0:23:01.070 +And now it's going to detect faces that were 60 by 60 pixels + +0:23:02.190,0:23:06.109 +In the original image, but are now 30 by 30 because you reduce the size by half + +0:23:07.800,0:23:10.369 +You might think that this is expensive but it's not. 
Tthe + +0:23:11.220,0:23:15.439 +expense is, half of the expense is the final scale + +0:23:16.080,0:23:18.379 +the sum of the expense of the other networks are + +0:23:19.590,0:23:21.859 +Combined is about the same as the final scale + +0:23:26.070,0:23:29.720 +It's because the size of the network is you know + +0:23:29.720,0:23:33.019 +Kind of the square of the the size of the image on one side + +0:23:33.020,0:23:38.570 +And so you scale down the image by square root of 2 the network you have to run is smaller by a factor of 2 + +0:23:40.140,0:23:45.619 +Okay, so the overall cost of this is 1 plus 1/2 plus 1/4 plus 1/8 plus 1/16 etc + +0:23:45.990,0:23:51.290 +Which is 2 you waste a factor of 2 by doing multi scale, which is very small. Ok + +0:23:51.290,0:23:53.290 +you can afford a factor of 2 so + +0:23:54.570,0:23:59.600 +This is a completely ancient face detection system from the early 90s and + +0:24:00.480,0:24:02.600 +the maps that you see here are all kind of + +0:24:03.540,0:24:05.540 +maps that indicate kind of + +0:24:06.120,0:24:13.160 +Scores of face detectors, the face detector here I think is 20 by 20 pixels. So it's very low res and + +0:24:13.890,0:24:19.070 +It's a big mess at the fine scales. You see kind of high-scoring areas, but it's not really very definite + +0:24:19.710,0:24:21.710 +But you see more + +0:24:22.530,0:24:24.150 +More definite + +0:24:24.150,0:24:26.720 +Things down here. So here you see + +0:24:27.780,0:24:33.290 +A white blob here white blob here white blob here same here. You see white blob here, White blob here and + +0:24:34.020,0:24:35.670 +Those are faces + +0:24:35.670,0:24:41.060 +and so that's now how you, you need to do maximum suppression to get those + +0:24:41.580,0:24:46.489 +little red squares that are kind of the winning categories if you want the winning locations where you have a face + +0:24:50.940,0:24:52.470 +So + +0:24:52.470,0:24:57.559 +Known as sumo suppression in this case means I have a high-scoring white white blob here + +0:24:57.560,0:25:01.340 +That means there is probably the face underneath which is roughly 20 by 20 + +0:25:01.370,0:25:06.180 +It is another face in a window of 20 by 20. That means one of those two is wrong + +0:25:06.250,0:25:10.260 +so I'm just gonna take the highest-scoring one within the window of 20 by 20 and + +0:25:10.600,0:25:15.239 +Suppress all the others and you'll suppress the others at that location at that scale + +0:25:15.240,0:25:22.410 +I mean that nearby location at that scale but also at other scales. 
Okay, so you you pick the highest-scoring + +0:25:23.680,0:25:25.680 +blob if you want + +0:25:26.560,0:25:28.560 +For every location every scale + +0:25:28.720,0:25:34.439 +And whenever you pick one you you suppress the other ones that could be conflicting with it either + +0:25:34.780,0:25:37.259 +because they are a different scale at the same place or + +0:25:37.960,0:25:39.960 +At the same scale, but you know nearby + +0:25:44.350,0:25:46.350 +Okay, so that's the + +0:25:46.660,0:25:53.670 +that's the first problem and the second problem is the fact that as I said, there's many ways to be different from your face and + +0:25:54.730,0:25:59.820 +Most likely your training set doesn't have all the non-faces, things that look like faces + +0:26:00.790,0:26:05.249 +So the way people deal with this is that they do what's called negative mining + +0:26:05.950,0:26:07.390 +so + +0:26:07.390,0:26:09.390 +You go through a large collection of images + +0:26:09.460,0:26:14.850 +when you know for a fact that there is no face and you run your detector and you keep all the + +0:26:16.720,0:26:19.139 +Patches where you detector fires + +0:26:21.190,0:26:26.580 +You verify that there is no faces in them and if there is no face you add them to your negative set + +0:26:27.610,0:26:31.830 +Okay, then you retrain your detector. And then you use your retrained detector to do the same + +0:26:31.990,0:26:35.580 +Go again through a large dataset of images where there you know + +0:26:35.580,0:26:40.710 +There is no face and whenever your detector fires add that as a negative sample + +0:26:41.410,0:26:43.410 +you do this four or five times and + +0:26:43.840,0:26:50.129 +In the end you have a very robust face detector that does not fall victim to negative samples + +0:26:53.080,0:26:56.669 +These are all things that look like faces in natural images are not faces + +0:27:03.049,0:27:05.049 +This works really well + +0:27:10.380,0:27:17.209 +This is over 15 years old work this is my grandparents marriage, their wedding + +0:27:18.480,0:27:20.480 +their wedding + +0:27:22.410,0:27:24.410 +Okay + +0:27:24.500,0:27:29.569 +So here's a another interesting use of convolutional nets and this is for + +0:27:30.299,0:27:34.908 +Semantic segmentation what's called semantic segmentation, I alluded to this in the first the first lecture + +0:27:36.390,0:27:44.239 +so what is semantic segmentation is the problem of assigning a category to every pixel in an image and + +0:27:46.020,0:27:49.280 +Every pixel will be labeled with a category of the object it belongs to + +0:27:50.250,0:27:55.429 +So imagine this would be very useful if you want to say drive a robot in nature. So this is a + +0:27:56.039,0:28:00.769 +Robotics project that I worked on, my students and I worked on a long time ago + +0:28:01.770,0:28:07.520 +And what you like is to label the image so that regions that the robot can drive on + +0:28:08.820,0:28:10.820 +are indicated and + +0:28:10.860,0:28:15.199 +Areas that are obstacles also indicated so the robot doesn't drive there. Okay + +0:28:15.200,0:28:22.939 +So here the green areas are things that the robot can drive on and the red areas are obstacles like tall grass in that case + +0:28:28.049,0:28:34.729 +So the way you you train a convolutional net to do to do this kind of semantic segmentation is very similar to what I just + +0:28:35.520,0:28:38.659 +Described you you take a patch from the image + +0:28:39.360,0:28:41.360 +In this case. 
I think the patches were + +0:28:42.419,0:28:44.719 +20 by 40 or something like that, they are actually small + +0:28:46.080,0:28:51.860 +For which, you know what the central pixel is whether it's traversable or not, whether it's green or red? + +0:28:52.470,0:28:56.390 +okay, either is being manually labeled or the label has been obtained in some way and + +0:28:57.570,0:29:00.110 +You run a conv net on this patch and you train it, you know + +0:29:00.110,0:29:02.479 +tell me if it's if he's green or red tell me if it's + +0:29:03.000,0:29:05.000 +Drivable area or not + +0:29:05.970,0:29:09.439 +And once the system is trained you apply it on the entire image and it you know + +0:29:09.440,0:29:14.540 +It puts green or red depending on where it is. in this particular case actually, there were five categories + +0:29:14.830,0:29:18.990 +There's the super green green purple, which is a foot of an object + +0:29:19.809,0:29:24.269 +Red, which is an obstacle that you know threw off and super red, which is like a definite obstacle + +0:29:25.600,0:29:30.179 +Over here. We're only showing three three colors now in this particular + +0:29:31.809,0:29:37.319 +Project the the labels were actually collected automatically you didn't have to manually + +0:29:39.160,0:29:44.160 +Label the images and the patches what we do would be to run the robot around and then + +0:29:44.890,0:29:49.379 +through stereo vision figure out if a pixel is a + +0:29:51.130,0:29:53.669 +Correspond to an object that sticks out of the ground or is on the ground + +0:29:55.540,0:29:59.309 +So the the middle column here it says stereo labels these are + +0:30:00.309,0:30:05.789 +Labels, so the color green or red is computed from stereo vision from basically 3d reconstruction + +0:30:06.549,0:30:08.639 +okay, so for, you have two cameras and + +0:30:09.309,0:30:15.659 +The two cameras can estimate the distance of every pixel by basically comparing patches. It's relatively expensive, but it kind of works + +0:30:15.730,0:30:17.819 +It's not completely reliable, but it sort of works + +0:30:18.820,0:30:21.689 +So now for every pixel you have a depth the distance from the camera + +0:30:22.360,0:30:25.890 +Which means you know the position of that pixel in 3d which means you know + +0:30:25.890,0:30:30.030 +If it sticks out out of the ground or if it's on the ground because you can fit a plane to the ground + +0:30:30.880,0:30:33.900 +okay, so the green pixels are the ones that are basically + +0:30:34.450,0:30:37.980 +You know near the ground and the red ones are the ones that are up + +0:30:39.280,0:30:42.479 +so now you have labels you can try and accomplish on that to + +0:30:43.330,0:30:44.919 +predict those labels + +0:30:44.919,0:30:49.529 +Then you will tell me why would you want to train a convolutional net on that to do this if you can do this from stereo? + +0:30:50.260,0:30:53.760 +And the answer is stereo only works up to ten meters, roughly + +0:30:54.669,0:30:59.789 +Past ten meters you can't really using binocular vision and stereo vision, you can't really estimate the distance very well + +0:30:59.790,0:31:04.799 +And so that only works out to about ten meters and driving a robot by only looking + +0:31:05.200,0:31:07.770 +ten meters ahead of you is not a good idea + +0:31:08.950,0:31:13.230 +It's like driving a car in the fog right? 
It's gonna it's not very efficient + +0:31:14.380,0:31:21.089 +So what you used to accomplished on that for is to label every pixel in the image up to the horizon + +0:31:21.790,0:31:23.790 +essentially + +0:31:24.130,0:31:30.239 +Okay, so the cool thing about about this system is that as I said the labels were collected automatically but also + +0:31:32.080,0:31:33.730 +The robot + +0:31:33.730,0:31:38.849 +Adapted itself as it run because he collects stereo labels constantly + +0:31:39.340,0:31:43.350 +It can constantly retrain its neural net to adapt to the environment + +0:31:43.360,0:31:49.199 +it's in. In this particular instance of this robot, it would only will only retrain the last layer + +0:31:49.540,0:31:53.879 +So the N minus 1 layers of the ConvNet were fixed, were trained in the in the lab + +0:31:53.880,0:32:01.499 +And then the last layer was kind of adapted as the robot run, it allowed the robot to deal with environments + +0:32:01.500,0:32:02.680 +He'd never seen before + +0:32:02.680,0:32:04.120 +essentially + +0:32:04.120,0:32:06.120 +You still have long-range vision? + +0:32:10.000,0:32:17.520 +The input to the the conv network basically multiscale views of sort of bands of the image around the horizon + +0:32:18.700,0:32:20.700 +no need to go into details + +0:32:21.940,0:32:25.710 +Is a very small neural net by today's standard but that's what we could afford I + +0:32:27.070,0:32:29.970 +Have a video. I'm not sure it's gonna work, but I'll try + +0:32:31.990,0:32:33.990 +Yeah, it works + +0:32:41.360,0:32:45.010 +So I should tell you a little bit about the castor character he characters here so + +0:32:47.630,0:32:49.630 +Huh + +0:32:51.860,0:32:53.860 +You don't want the audio + +0:32:55.370,0:32:59.020 +So Pierre Semanet and Raia Hadsell were two students + +0:32:59.600,0:33:02.560 +working with me on this project two PhD students + +0:33:03.170,0:33:08.200 +Pierre Sermanet is at Google Brain. He works on robotics and Raia Hadsell is the sales director of Robotics at DeepMind + +0:33:09.050,0:33:11.050 +Marco Scoffier is NVIDIA + +0:33:11.150,0:33:15.249 +Matt Grimes is a DeepMind, Jan Ben is at Mobile Eye which is now Intel + +0:33:15.920,0:33:17.920 +Ayse Erkan is at + +0:33:18.260,0:33:20.260 +Twitter and + +0:33:20.540,0:33:22.540 +Urs Muller is still working with us, he is + +0:33:22.910,0:33:29.139 +Actually head of a big group that works on autonomous driving at Nvidia and he is collaborating with us + +0:33:30.800,0:33:32.800 +Actually + +0:33:33.020,0:33:38.020 +Our further works on this project, so this is a robot + +0:33:39.290,0:33:44.440 +And it can drive it about you know, sort of fast walking speed + +0:33:46.310,0:33:48.999 +And it's supposed to drive itself in sort of nature + +0:33:50.720,0:33:55.930 +So it's got this mass with four eyes, there are two stereo pairs to two stereo camera pairs and + +0:33:57.020,0:34:02.320 +It has three computers in the belly. So it's completely autonomous. It doesn't talk to the network or anything + +0:34:03.200,0:34:05.200 +And those those three computers + +0:34:07.580,0:34:10.120 +I'm on the left. 
That's when I had a pony tail + +0:34:13.640,0:34:19.659 +Okay, so here the the system is the the neural net is crippled so the we didn't turn on the neural Nets + +0:34:19.659,0:34:22.029 +It's only using stereo vision and now it's using the neural net + +0:34:22.130,0:34:26.529 +so it's it's pretty far away from this barrier, but it sees it and so it directly goes to + +0:34:27.169,0:34:31.599 +The side it wants to go to a goal, a GPS coordinate. That's behind it. Same here + +0:34:31.600,0:34:33.429 +He wants to go to a GPS coordinate behind it + +0:34:33.429,0:34:37.689 +And it sees right away that there is this wall of people that he can't go through + +0:34:38.360,0:34:43.539 +The guy on the right here is Marcos, He is holding the transmitter,he is not driving the robot but is holding the kill switch + +0:34:48.849,0:34:50.849 +And so + +0:34:51.039,0:34:54.689 +You know, that's what the the the convolutional net looks like + +0:34:55.659,0:34:57.659 +really small by today's standards + +0:35:00.430,0:35:02.430 +And + +0:35:03.700,0:35:05.700 +And it produces for every + +0:35:06.400,0:35:08.400 +every location every patch on the input + +0:35:08.829,0:35:13.859 +The second last layer is a 100 dimensional vector that goes into a classifier that classifies into five categories + +0:35:14.650,0:35:16.650 +so once the system classifies + +0:35:16.779,0:35:20.189 +Each of those five categories in the image you can you can warp the image + +0:35:20.349,0:35:25.979 +Into a map that's centered on the robot and you can you can do planning in this map to figure out like how to avoid + +0:35:25.980,0:35:31.379 +Obstacles and stuff like that. So this is what this thing does. It's a particular map called a hyperbolic map, but + +0:35:33.999,0:35:36.239 +It's not important for now + +0:35:38.380,0:35:40.380 +Now that + +0:35:40.509,0:35:42.509 +because this was you know + +0:35:42.970,0:35:49.199 +2007 the computers were slowly there were no GPUs so we could run this we could run this neural net only at about one frame per + +0:35:49.200,0:35:50.859 +second + +0:35:50.859,0:35:54.268 +As you can see here the at the bottom it updates about one frame per second + +0:35:54.269,0:35:54.640 +and + +0:35:54.640,0:35:59.609 +So if you have someone kind of walking in front of the robot the robot won't see it for a second and will you know? + +0:35:59.680,0:36:01.329 +Run over him + +0:36:01.329,0:36:07.079 +So that's why we have a second vision system here at the top. This one is stereo. 
It doesn't use a neural net + +0:36:09.039,0:36:13.949 +Odometry I think we don't care this is the controller which is also learned, but we don't care and + +0:36:15.730,0:36:21.989 +This is the the system here again, it's vision is crippled they can only see up to two point two and a half meters + +0:36:21.989,0:36:23.989 +So it's very short + +0:36:24.099,0:36:26.099 +But it kind of does a decent job + +0:36:26.529,0:36:28.529 +and + +0:36:28.930,0:36:34.109 +This is to test this sort of fast reacting vision systems or here pierre-simon a is jumping in front of it and + +0:36:34.420,0:36:40.950 +the robot stops right away so that now that's the full system with long-range vision and + +0:36:41.950,0:36:43.950 +annoying grad students + +0:36:49.370,0:36:52.150 +Right, so it's kind of giving up + +0:37:03.970,0:37:06.149 +Okay, oops + +0:37:09.400,0:37:11.049 +Okay, so + +0:37:11.049,0:37:12.690 +That's called semantic segmentation + +0:37:12.690,0:37:18.329 +But the real form of semantic segmentation is the one in which you you give an object category for every location + +0:37:18.729,0:37:21.599 +So that's the kind of problem here we're talking about where + +0:37:22.569,0:37:25.949 +every pixel is either building or sky or + +0:37:26.769,0:37:28.769 +Street or a car or something like this? + +0:37:29.799,0:37:37.409 +And around 2010 a couple datasets started appearing with a few thousand images where you could train vision systems to do this + +0:37:39.940,0:37:42.059 +And so the technique here is + +0:37:42.849,0:37:44.849 +essentially identical to the one I + +0:37:45.309,0:37:47.309 +Described it's also multi scale + +0:37:48.130,0:37:52.920 +So you basically have an input image you have a convolutional net + +0:37:53.259,0:37:57.959 +that has a set of outputs that you know, one for each category + +0:37:58.539,0:38:01.258 +Of objects for which you have label, which in this case is 33 + +0:38:02.680,0:38:05.879 +When you back project one output of the convolutional net onto the input + +0:38:06.219,0:38:11.249 +It corresponds to an input window of 46 by 46 windows. So it's using a context of 46 + +0:38:12.309,0:38:16.889 +by 46 pixels to make the decision about a single pixel at least that's the the + +0:38:17.589,0:38:19.589 +neural net at the back, at the bottom + +0:38:19.900,0:38:24.569 +But it has out 46 but 46 is not enough if you want to decide what a gray pixel is + +0:38:24.569,0:38:27.359 +Is it the shirt of the person is it the street? Is it the + +0:38:28.119,0:38:31.679 +Cloud or kind of pixel on the mountain. 
You have to look at a wider + +0:38:32.650,0:38:34.650 +context to be able to make that decision so + +0:38:35.529,0:38:39.179 +We use again this kind of multiscale approach where the same image is + +0:38:39.759,0:38:45.478 +Reduced by a factor of 2 and a factor of 4 and you run those two extra images to the same convolutional + +0:38:45.479,0:38:47.789 +net same weight same kernel same everything + +0:38:48.940,0:38:54.089 +Except the the last feature map you upscale them so that they have the same size as the original one + +0:38:54.089,0:38:58.859 +And now you take those combined feature Maps and send them to a couple layers of a classifier + +0:38:59.410,0:39:01.410 +So now the classifier to make its decision + +0:39:01.749,0:39:07.738 +Has four 46 by 46 windows on images that have been rescaled and so the effective + +0:39:08.289,0:39:12.718 +size of the context now is is 184 by 184 window because + +0:39:13.269,0:39:15.269 +the the core scale + +0:39:15.610,0:39:17.910 +Network basically looks at more this entire + +0:39:19.870,0:39:21.870 +Image + +0:39:24.310,0:39:30.299 +Then you can clean it up in various way I'm not gonna go to details for this but it works quite well + +0:39:33.970,0:39:36.330 +So this is the result + +0:39:37.870,0:39:40.140 +The guy who did this in my lab is Clément Farabet + +0:39:40.170,0:39:46.319 +He's a VP at Nvidia now in charge of all of machine learning infrastructure and the autonomous driving + +0:39:47.080,0:39:49.080 +Not surprisingly + +0:39:51.100,0:39:57.959 +And and so that system, you know, this is this is Washington Square Park by the way, so this is the NYU campus + +0:39:59.440,0:40:02.429 +It's not perfect far from that from that. You know it + +0:40:03.220,0:40:06.300 +Identified some areas of the street as sand + +0:40:07.330,0:40:09.160 +or desert and + +0:40:09.160,0:40:12.479 +There's no beach. 
I'm aware of in Washington Square Park + +0:40:13.750,0:40:15.750 +and + +0:40:16.480,0:40:17.320 +But you know + +0:40:17.320,0:40:22.469 +At the time this was the kind of system of this kind at the the number of training samples for this was very small + +0:40:22.470,0:40:24.400 +so it was kind of + +0:40:24.400,0:40:27.299 +It was about 2,000 or 3,000 images something like that + +0:40:31.630,0:40:34.410 +You run you take a you take a full resolution image + +0:40:36.220,0:40:42.689 +You run it to the first n minus 2 layers of your ConvNet that gives you your future Maps + +0:40:42.970,0:40:45.570 +Then you reduce the image by a factor of two run it again + +0:40:45.570,0:40:50.009 +You get a bunch of feature maps that are smaller then running again by reducing by a factor of four + +0:40:50.320,0:40:51.900 +You get smaller feature maps + +0:40:51.900,0:40:52.420 +now + +0:40:52.420,0:40:57.420 +You take the small feature map and you rescale it you up sample it so it's the same size as the first one same + +0:40:57.420,0:41:00.089 +for the second one, you stack all those feature maps together + +0:41:00.880,0:41:07.199 +Okay, and that you feed to two layers for a classifier for every patch + +0:41:07.980,0:41:12.240 +Yeah, the paper was rejected from CVPR 2012 even though the results were + +0:41:13.090,0:41:14.710 +record-breaking and + +0:41:14.710,0:41:17.520 +It was faster than the best competing + +0:41:18.400,0:41:20.400 +method by a factor of 50 + +0:41:20.950,0:41:25.920 +Even running on standard hardware, but we also had implementation on special hardware that was incredibly fast + +0:41:26.980,0:41:28.130 +and + +0:41:28.130,0:41:34.600 +people didn't know what the convolutional net was at the time and so the reviewers basically could not fathom that + +0:41:35.660,0:41:37.359 +The method they'd never heard of could work + +0:41:37.359,0:41:40.899 +So well. There is way more to say about convolutional nets + +0:41:40.900,0:41:44.770 +But I encourage you to take a computer vision course for to hear about this + +0:41:45.950,0:41:49.540 +Yeah, this is okay this data set this particular dataset that we used + +0:41:51.590,0:41:57.969 +Is a collection of images street images that was collected mostly by Antonio Torralba at MIT and + +0:42:02.690,0:42:04.130 +He had a + +0:42:04.130,0:42:08.530 +sort of a tool for kind of labeling so you could you know, you could sort of + +0:42:09.140,0:42:12.100 +draw the contour over the object and then label of the object and + +0:42:12.650,0:42:18.129 +So if it would kind of, you know fill up the object most of the segmentations were done by his mother + +0:42:20.030,0:42:22.030 +Who's in Spain + +0:42:22.310,0:42:24.310 +she had a lot of time to + +0:42:25.220,0:42:27.220 +Spend doing this + +0:42:27.380,0:42:29.300 +Huh? + +0:42:29.300,0:42:34.869 +His mother yeah labeled that stuff. Yeah. 
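Going back to the multi-scale scheme recapped above, here is a hedged PyTorch sketch: one shared feature extractor run on the image at scales 1, 1/2 and 1/4, the coarse feature maps upsampled back to full resolution, stacked, and fed to a small per-pixel classifier. Channel counts, kernel sizes and the use of bilinear resizing are assumptions, not the original system's exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleSegmenter(nn.Module):
    def __init__(self, num_classes=33, feat=16):
        super().__init__()
        self.features = nn.Sequential(          # same weights reused at every scale
            nn.Conv2d(3, feat, 7, padding=3), nn.ReLU(),
            nn.Conv2d(feat, feat, 7, padding=3), nn.ReLU(),
        )
        self.classifier = nn.Sequential(         # two layers on the stacked maps
            nn.Conv2d(3 * feat, 64, 1), nn.ReLU(),
            nn.Conv2d(64, num_classes, 1),
        )

    def forward(self, x):
        maps = []
        for scale in (1.0, 0.5, 0.25):
            xs = x if scale == 1.0 else F.interpolate(
                x, scale_factor=scale, mode='bilinear', align_corners=False)
            f = self.features(xs)
            # upsample coarse feature maps back to the finest resolution
            maps.append(F.interpolate(f, size=x.shape[-2:], mode='bilinear',
                                      align_corners=False))
        return self.classifier(torch.cat(maps, dim=1))   # (N, num_classes, H, W)

scores = MultiScaleSegmenter()(torch.randn(1, 3, 64, 96))
print(scores.shape)   # torch.Size([1, 33, 64, 96]): one score per class per pixel
```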
This was in the late late 2000 + +0:42:37.190,0:42:41.530 +Okay, now let's talk about a bunch of different architectures, right so + +0:42:43.400,0:42:45.520 +You know as I mentioned before + +0:42:45.950,0:42:51.159 +the idea of deep learning is that you have this catalog of modules that you can assemble in sort of different graphs and + +0:42:52.040,0:42:54.879 +and together to do different functions and + +0:42:56.210,0:42:58.210 +and a lot of the + +0:42:58.430,0:43:03.280 +Expertise in deep learning is to design those architectures to do something in particular + +0:43:03.619,0:43:06.909 +It's a little bit like, you know in the early days of computer science + +0:43:08.180,0:43:11.740 +Coming up with an algorithm to write a program was kind of a new concept + +0:43:12.830,0:43:14.830 +you know reducing a + +0:43:15.560,0:43:19.209 +Problem to kind of a set of instructions that could be run on a computer + +0:43:19.210,0:43:21.580 +It was kind of something new and here it's the same problem + +0:43:21.830,0:43:26.109 +you have to sort of imagine how to reduce a complex function into sort of a + +0:43:27.500,0:43:29.560 +graph possibly dynamic graph of + +0:43:29.720,0:43:35.830 +Functional modules that you don't need to know completely the function of but that you're going to whose function is gonna be finalized by learning + +0:43:36.109,0:43:38.199 +But the architecture is super important, of course + +0:43:38.920,0:43:43.359 +As we saw with convolutional Nets. the first important category is recurrent net. So + +0:43:44.180,0:43:47.379 +We've we've seen when we talked about the backpropagation + +0:43:48.140,0:43:50.140 +There's a big + +0:43:50.510,0:43:58.029 +Condition of the condition was that the graph of the interconnection of the module could not have loops. Okay. It had to be a + +0:43:59.299,0:44:04.059 +graph for which there is sort of at least a partial order of the module so that you can compute the + +0:44:04.819,0:44:09.489 +The the modules in such a way that when you compute the output of a module all of its inputs are available + +0:44:11.240,0:44:13.299 +But recurrent net is one in which you have loops + +0:44:14.480,0:44:15.490 +How do you deal with this? + +0:44:15.490,0:44:18.459 +So here is an example of a recurrent net architecture + +0:44:18.920,0:44:25.210 +Where you have an input which varies over time X(t) that goes through the first neural net. Let's call it an encoder + +0:44:25.789,0:44:29.349 +That produces a representation of the of the input + +0:44:29.349,0:44:32.679 +Let's call it H(t) and it goes into a recurrent layer + +0:44:32.680,0:44:38.409 +This recurrent layer is a function G that depends on trainable parameters W this trainable parameters also for the encoder + +0:44:38.410,0:44:40.410 +but I didn't mention it and + +0:44:41.150,0:44:42.680 +that + +0:44:42.680,0:44:46.480 +Recurrent layer takes into account H(t), which is the representation of the input + +0:44:46.480,0:44:49.539 +but it also takes into account Z(t-1), which is the + +0:44:50.150,0:44:55.509 +Sort of a hidden state, which is its output at a previous time step its own output at a previous time step + +0:44:56.299,0:44:59.709 +Okay, this G function can be a very complicated neural net inside + +0:45:00.950,0:45:06.519 +convolutional net whatever could be as complicated as you want. 
But what's important is that one of its inputs is + +0:45:08.869,0:45:10.869 +Its output at a previous time step + +0:45:11.630,0:45:13.160 +Okay + +0:45:13.160,0:45:15.049 +Z(t-1) + +0:45:15.049,0:45:21.788 +So that's why this delay indicates here. The input of G at time t is actually Z(t-1) + +0:45:21.789,0:45:24.459 +Which is the output its output at a previous time step + +0:45:27.230,0:45:32.349 +Ok, then the output of that recurrent module goes into a decoder which basically produces an output + +0:45:32.450,0:45:35.710 +Ok, so it turns a hidden representation Z into an output + +0:45:39.859,0:45:41.979 +So, how do you deal with this, you unroll the loop + +0:45:44.230,0:45:47.439 +So this is basically the same diagram, but I've unrolled it in time + +0:45:49.160,0:45:56.170 +Okay, so at time at times 0 I have X(0) that goes through the encoder produces H of 0 and then I apply + +0:45:56.170,0:46:00.129 +The G function I start with a Z arbitrary Z, maybe 0 or something + +0:46:01.160,0:46:05.980 +And I apply the function and I get Z(0) and that goes into the decoder produces an output + +0:46:06.650,0:46:08.270 +Okay + +0:46:08.270,0:46:09.740 +and then + +0:46:09.740,0:46:16.479 +Now that has Z(0) at time step 1. I can use the Z(0) as the previous output for the time step. Ok + +0:46:17.570,0:46:22.570 +Now the output is X(1) and time 1. I run through the encoder I run through the recurrent layer + +0:46:22.570,0:46:24.570 +Which is now no longer recurrent + +0:46:24.890,0:46:28.510 +And run through the decoder and then the next time step, etc + +0:46:29.810,0:46:34.269 +Ok, this network that's involved in time doesn't have any loops anymore + +0:46:37.130,0:46:39.040 +Which means I can run backpropagation through it + +0:46:39.040,0:46:44.259 +So if I have an objective function that says the last output should be that particular one + +0:46:45.020,0:46:48.609 +Or maybe the trajectory should be a particular one of the outputs. I + +0:46:49.730,0:46:51.760 +Can just back propagate gradient through this thing + +0:46:52.940,0:46:55.510 +It's a regular network with one + +0:46:56.900,0:46:59.980 +Particular characteristic, which is that every block + +0:47:01.609,0:47:03.609 +Shares the same weights + +0:47:04.040,0:47:07.509 +Okay, so the three instances of the encoder + +0:47:08.150,0:47:11.379 +They are the same encoder at three different time steps + +0:47:11.380,0:47:16.869 +So they have the same weights the same G functions has the same weights, the three decoders have the same weights. 
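A minimal sketch (dimensions are mine) of the encoder / recurrent layer / decoder structure just described, unrolled over a sequence with the same three modules reused at every time step, so that a loss on the output backpropagates through time.

```python
# Tiny recurrent net: x(t) -> encoder -> h(t); G(h(t), z(t-1)) -> z(t) -> decoder.
import torch
import torch.nn as nn

class TinyRNN(nn.Module):
    def __init__(self, in_dim=8, hid_dim=16, out_dim=4):
        super().__init__()
        self.encoder = nn.Linear(in_dim, hid_dim)     # x(t) -> h(t)
        self.recur = nn.Linear(2 * hid_dim, hid_dim)  # G(h(t), z(t-1)) -> z(t)
        self.decoder = nn.Linear(hid_dim, out_dim)    # z(t) -> output

    def forward(self, x):                  # x: (T, in_dim)
        z = torch.zeros(self.recur.out_features)
        outputs = []
        for x_t in x:                      # same weights at every time step
            h = torch.tanh(self.encoder(x_t))
            z = torch.tanh(self.recur(torch.cat([h, z])))
            outputs.append(self.decoder(z))
        return torch.stack(outputs)

model = TinyRNN()
x = torch.randn(20, 8)                     # a 20-step input sequence
loss = model(x)[-1].pow(2).sum()           # objective on the final output only
loss.backward()                            # backpropagation through time (BPTT)
```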
Yes + +0:47:20.990,0:47:23.260 +It can be variable, you know, I have to decide in advance + +0:47:25.160,0:47:27.399 +But it depends on the length of your input sequence + +0:47:28.579,0:47:30.109 +basically + +0:47:30.109,0:47:33.159 +Right and you know, it's you can you can run it for as long as you want + +0:47:33.890,0:47:38.290 +You know, it's the same weights over so you can just you know, repeat the operation + +0:47:40.130,0:47:46.390 +Okay this technique of unrolling and then back propagating through time basically is called surprisingly + +0:47:47.060,0:47:49.060 +BPTT back prop through time + +0:47:50.000,0:47:52.000 +It's pretty obvious + +0:47:53.470,0:47:55.470 +That's all there is to it + +0:47:56.710,0:48:01.439 +Unfortunately, they don't work very well at least not in their naive form + +0:48:03.910,0:48:06.000 +So in the naive form + +0:48:07.360,0:48:11.519 +So a simple form of recurrent net is one in which the encoder is linear + +0:48:11.770,0:48:16.560 +The G function is linear with high probably tangent or sigmoid or perhaps ReLU + +0:48:17.410,0:48:22.680 +And the decoder also is linear something like this maybe with a ReLU or something like that, right so it could be very simple + +0:48:23.530,0:48:24.820 +and + +0:48:24.820,0:48:27.539 +You get a number of problems with this and one problem is? + +0:48:29.290,0:48:32.969 +The so called vanishing gradient problem or exploding gradient problem + +0:48:34.060,0:48:38.640 +And it comes from the fact that if you have a long sequence, let's say I don't know 50 time steps + +0:48:40.060,0:48:44.400 +Every time you back propagate gradients + +0:48:45.700,0:48:52.710 +The gradients that get multiplied by the weight matrix of the G function. Okay at every time step + +0:48:54.010,0:48:58.560 +the gradients get multiplied by the the weight matrix now imagine the weight matrix has + +0:48:59.110,0:49:00.820 +small values in it + +0:49:00.820,0:49:07.049 +Which means that means that every time you take your gradient you multiply it by the transpose of this matrix to get the gradient at previous + +0:49:07.050,0:49:08.290 +time step + +0:49:08.290,0:49:10.529 +You get a shorter vector you get a smaller vector + +0:49:11.200,0:49:14.520 +And you keep rolling the the vector gets shorter and shorter exponentially + +0:49:14.980,0:49:18.449 +That's called the vanishing gradient problem by the time you get to the 50th + +0:49:19.210,0:49:23.100 +Time steps which is really the first time step. You don't get any gradient + +0:49:28.660,0:49:32.970 +Conversely if the weight matrix is really large and the non-linearity and your + +0:49:33.760,0:49:36.120 +Recurrent layer is not saturating + +0:49:36.670,0:49:41.130 +your gradients can explode if the weight matrix is large every time you multiply the + +0:49:41.650,0:49:43.650 +gradient by the transpose of the matrix + +0:49:43.660,0:49:46.920 +the vector gets larger and it explodes which means + +0:49:47.290,0:49:51.810 +your weights are going to diverge when you do a gradient step or you're gonna have to use a tiny learning rate for it to + +0:49:51.810,0:49:53.810 +work + +0:49:54.490,0:49:56.290 +So + +0:49:56.290,0:49:58.529 +You have to use a lot of tricks to make those things work + +0:49:59.860,0:50:04.620 +Here's another problem. The reason why you would want to use a recurrent net. Why would you want to use a recurrent net? 
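A toy numerical illustration of the vanishing / exploding gradient argument above (an invented example, and it ignores the extra shrinking from the nonlinearity's derivative): backpropagating through T steps multiplies the gradient by the transpose of the recurrent weight matrix T times.

```python
import torch

def gradient_norm_after(T, weight_scale, dim=16):
    torch.manual_seed(0)
    W = weight_scale * torch.randn(dim, dim) / dim**0.5
    g = torch.ones(dim)           # pretend gradient at the last time step
    for _ in range(T):
        g = W.t() @ g             # one step of backprop through the recurrence
    return g.norm().item()

for scale in (0.5, 1.0, 2.0):
    print(scale, [round(gradient_norm_after(T, scale), 4) for T in (1, 10, 50)])
# small weights -> the norm collapses toward 0 (vanishing gradient)
# large weights -> the norm blows up (exploding gradient)
```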
+ +0:50:05.690,0:50:12.639 +The purported advantage of recurrent net is that they can remember remember things from far away in the past + +0:50:13.850,0:50:15.850 +Okay + +0:50:16.970,0:50:24.639 +If for example you imagine that the the X's are our characters that you enter one by one + +0:50:25.940,0:50:31.300 +The characters come from I don't know a C program or something like that, right? + +0:50:34.070,0:50:35.300 +And + +0:50:35.300,0:50:37.870 +What your system is supposed to tell you at the end, you know? + +0:50:37.870,0:50:42.699 +it reads a few hundred characters corresponding to the source code of a function and at the end is + +0:50:43.730,0:50:49.090 +you want to train your system so that it produces one if it's a syntactically correct program and + +0:50:49.910,0:50:51.910 +Minus one if it's not okay + +0:50:52.430,0:50:54.320 +hypothetical problem + +0:50:54.320,0:50:57.489 +Recurrent Nets won't do it. Okay, at least not with our tricks + +0:50:59.180,0:51:02.500 +Now there is a thing here which is the issue which is that + +0:51:03.860,0:51:07.599 +Among other things this program has to have balanced braces and parentheses + +0:51:09.110,0:51:10.280 +So + +0:51:10.280,0:51:13.540 +It has to have a way of remembering how many open parentheses + +0:51:13.540,0:51:20.350 +there are so that it can check that you're closing them all or how many open braces there are so so all of them get + +0:51:21.620,0:51:24.939 +Get closed right so it has to store eventually, you know + +0:51:27.380,0:51:29.410 +Essentially within its hidden state Z + +0:51:29.410,0:51:32.139 +it has to store like how many braces and and + +0:51:32.630,0:51:37.240 +Parentheses were open if it wants to be able to tell at the end that all of them have been closed + +0:51:38.620,0:51:41.040 +So it has to have some sort of counter inside right + +0:51:43.180,0:51:45.080 +Yes + +0:51:45.080,0:51:47.840 +It's going to be a topic tomorrow + +0:51:51.050,0:51:56.469 +Now if the program is very long that means, you know Z has to kind of preserve information for a long time and + +0:51:57.230,0:52:02.679 +Recurrent net, you know give you the hope that maybe a system like this can do this, but because of a vanishing gradient problem + +0:52:02.810,0:52:05.259 +They actually don't at least not simple + +0:52:07.280,0:52:09.280 +Recurrent Nets + +0:52:09.440,0:52:11.440 +Of the type. I just described + +0:52:12.080,0:52:14.080 +So you have to use a bunch of tricks + +0:52:14.200,0:52:18.460 +Those are tricks from you know Yoshua Bengio's lab, but there is a bunch of them that were published by various people + +0:52:19.700,0:52:22.090 +Like Thomas Mikolov and various other people + +0:52:24.050,0:52:27.789 +So to avoid exploding gradients you can clip the gradients just you know, make it you know + +0:52:27.790,0:52:30.279 +If the gradients get too large, you just kind of squash them down + +0:52:30.950,0:52:32.950 +Just normalize them + +0:52:35.180,0:52:41.800 +Weak integration momentum I'm not gonna mention that. a good initialization so you want to initialize the weight matrices so that + +0:52:42.380,0:52:44.380 +They preserves the norm more or less + +0:52:44.660,0:52:49.180 +this is actually a whole bunch of papers on this on orthogonal neural nets and invertible + +0:52:49.700,0:52:51.700 +recurrent Nets + +0:52:54.770,0:52:56.770 +But the big trick is + +0:52:57.470,0:53:04.630 +LSTM and GRUs. Okay. 
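The gradient-clipping trick mentioned above is available directly in PyTorch (and the LSTM/GRU modules discussed next exist as `torch.nn.LSTM` and `torch.nn.GRU`). A hedged snippet of a typical training step; the model, optimizer and loss function here are placeholders, not anything from the lecture.

```python
import torch

def training_step(model, optimizer, loss_fn, x, y, max_norm=1.0):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # rescale all gradients so their combined norm is at most max_norm
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()
```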
So what is that before I talk about that I'm gonna talk about multiplicative modules + +0:53:06.410,0:53:08.470 +So what are multiplicative modules + +0:53:09.500,0:53:11.000 +They're basically + +0:53:11.000,0:53:14.709 +Modules in which you you can multiply things with each other + +0:53:14.710,0:53:20.590 +So instead of just computing a weighted sum of inputs you compute products of inputs and then weighted sum of that + +0:53:20.600,0:53:23.110 +Okay, so you have an example of this on the top left + +0:53:23.720,0:53:25.040 +on the top + +0:53:25.040,0:53:29.080 +so the output of a system here is just a weighted sum of + +0:53:30.080,0:53:32.080 +weights and inputs + +0:53:32.240,0:53:37.810 +Okay classic, but the weights actually themselves are weighted sums of weights and inputs + +0:53:38.780,0:53:43.149 +okay, so Wij here, which is the ij'th term in the weight matrix of + +0:53:43.820,0:53:46.479 +The module we're considering is actually itself + +0:53:47.270,0:53:49.270 +a weighted sum of + +0:53:50.060,0:53:53.439 +three third order tenser Uijk + +0:53:54.410,0:53:56.560 +weighted by variables Zk. + +0:53:58.220,0:54:02.080 +Okay, so basically what you get is that Wij is kind of a weighted sum of + +0:54:04.160,0:54:06.160 +Matrices + +0:54:06.800,0:54:08.800 +Uk + +0:54:09.020,0:54:13.419 +Weighted by a coefficient Zk and the Zk can change there are input variables the same way + +0:54:13.460,0:54:17.230 +So in effect, it's like having a neural net + +0:54:18.260,0:54:22.600 +With weight matrix W whose weight matrix is computed itself by another neural net + +0:54:24.710,0:54:30.740 +There is a general form of this where you don't just multiply matrices, but you have a neural net that is some complex function + +0:54:31.650,0:54:33.650 +turns X into S + +0:54:34.859,0:54:40.819 +Some generic function. Ok, give you ConvNet whatever and the weights of those neural nets + +0:54:41.910,0:54:44.839 +are not variables that you learn directly but they are the output of + +0:54:44.970,0:54:48.800 +Another neuron that that takes maybe another input into account or maybe the same input + +0:54:49.830,0:54:55.069 +Some people call those architectures hyper networks. Ok. There are networks whose weights are computed by another network + +0:54:56.160,0:54:59.270 +But here's just a simple form of it, which is kind of a bilinear form + +0:54:59.970,0:55:01.740 +or quadratic + +0:55:01.740,0:55:03.180 +form + +0:55:03.180,0:55:05.810 +Ok, so overall when you kind of write it all down + +0:55:06.570,0:55:13.339 +SI is equal to sum over j And k of Uijk Zk Xj. 
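[Editor's note] Here is a small sketch of the multiplicative module just described, with made-up sizes: the effective weight matrix W is itself a z-weighted sum of matrices taken from a third-order tensor U, so the output is s_i = sum over j,k of U_ijk z_k x_j.

```python
import torch

d_out, d_in, d_z = 4, 5, 3
U = torch.randn(d_out, d_in, d_z)   # third-order tensor U_ijk
x = torch.randn(d_in)               # input
z = torch.randn(d_z)                # side input that "programs" the weights

W = torch.einsum('ijk,k->ij', U, z)      # W_ij = sum_k U_ijk z_k
s = W @ x                                # s_i = sum_j W_ij x_j

# Same thing computed in one shot, as the double sum over j and k:
s_direct = torch.einsum('ijk,k,j->i', U, z, x)
assert torch.allclose(s, s_direct, atol=1e-5)
```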
This is a double sum + +0:55:15.750,0:55:18.169 +People used to call this Sigma Pi units, yes + +0:55:22.890,0:55:27.290 +We'll come to this in just a second basically + +0:55:31.500,0:55:33.500 +If you want a neural net that can + +0:55:34.740,0:55:36.740 +perform a transformation from + +0:55:37.440,0:55:41.929 +A vector into another and that transformation is to be programmable + +0:55:42.990,0:55:50.089 +Right, you can have that transformation be computed by a neural net but the weight of that neural net would be it themselves the output + +0:55:50.089,0:55:51.390 +of + +0:55:51.390,0:55:54.200 +Another neural net that figures out what the transformation is + +0:55:55.349,0:56:01.399 +That's kind of the more general form more specifically is very useful if you want to route + +0:56:03.359,0:56:08.389 +Signals through a neural net in different ways on a data dependent way so + +0:56:10.980,0:56:16.669 +You in fact that's exactly what is mentioned below so the attention module is a special case of this + +0:56:17.460,0:56:20.510 +It's not a quadratic layer. It's kind of a different type, but it's a + +0:56:21.510,0:56:23.510 +particular type of + +0:56:25.140,0:56:26.849 +Architecture that + +0:56:26.849,0:56:28.849 +basically computes a + +0:56:29.339,0:56:32.029 +convex linear combination of a bunch of vectors, so + +0:56:32.790,0:56:34.849 +x₁ and x₂ here are vectors + +0:56:37.770,0:56:42.499 +w₁ and w₂ are scalars, basically, okay and + +0:56:45.540,0:56:47.870 +What the system computes here is a weighted sum of + +0:56:49.590,0:56:55.069 +x₁ and x₂ weighted by w₁ w₂ and again w₁ w₂ are scalars in this case + +0:56:56.910,0:56:58.910 +Here the sum at the output + +0:56:59.730,0:57:01.020 +so + +0:57:01.020,0:57:07.999 +Imagine that those two weights. w₁ w₂ are between 0 and 1 and sum to 1 that's what's called a convex linear combination + +0:57:10.260,0:57:13.760 +So by changing w₁ w₂ so essentially + +0:57:15.480,0:57:18.139 +If this sum to 1 there are the output of a softmax + +0:57:18.810,0:57:23.629 +Which means w₂ is equal to 1 - w₁ right? 
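[Editor's note] To make the convex-combination point concrete, here is a minimal sketch with toy numbers of my own: a softmax turns a score vector z (in general produced by another network) into weights between 0 and 1 that sum to 1, so with two inputs w2 = 1 - w1, and the output is a weighted mix of x1 and x2.

```python
import torch

x1 = torch.tensor([1.0, 0.0, 0.0])
x2 = torch.tensor([0.0, 1.0, 0.0])

z = torch.tensor([2.0, -2.0])            # scores, data dependent in general
w = torch.softmax(z, dim=0)              # w1, w2 in (0, 1) with w1 + w2 = 1
output = w[0] * x1 + w[1] * x2           # convex combination, mostly x1 here

# Pushing the scores further apart switches the output toward one input:
print(torch.softmax(torch.tensor([5.0, -5.0]), dim=0))  # ~[1, 0], so output ~ x1
```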
That's kind of the direct consequence + +0:57:27.450,0:57:29.450 +So basically by changing + +0:57:29.790,0:57:34.340 +the size of w₁ w₂ you kind of switch the output to + +0:57:34.530,0:57:39.860 +Being either x₁ or x₂ or some linear combination of the two some interpolation between the two + +0:57:41.610,0:57:43.050 +Okay + +0:57:43.050,0:57:47.179 +You can have more than just x₁ and x₂ you can have a whole bunch of x vectors + +0:57:48.360,0:57:50.360 +and that + +0:57:50.730,0:57:54.800 +system will basically choose an appropriate linear combination or focus + +0:57:55.140,0:58:02.210 +Is called an attention mechanism because it allows a neural net to basically focus its attention on a particular input and ignoring ignoring the others + +0:58:02.880,0:58:05.240 +The choice of this is made by another variable Z + +0:58:05.790,0:58:09.679 +Which itself could be the output to some other neural net that looks at Xs for example + +0:58:10.740,0:58:12.270 +okay, and + +0:58:12.270,0:58:18.409 +This has become a hugely important type of function, it's used in a lot of different situations now + +0:58:19.440,0:58:22.700 +In particular it's used in LSTM and GRU but it's also used in + +0:58:26.730,0:58:30.020 +Pretty much every natural language processing system nowadays that use + +0:58:31.830,0:58:37.939 +Either transformer architectures or all the types of attention they all use this kind of this kind of trick + +0:58:43.280,0:58:46.570 +Okay, so you have a vector Z pass it to a softmax + +0:58:46.570,0:58:52.509 +You get a bunch of numbers between 0 & 1 that sum to 1 use those as coefficient to compute a weighted sum + +0:58:52.700,0:58:54.560 +of a bunch of vectors X + +0:58:54.560,0:58:56.589 +xᵢ and you get the weighted sum + +0:58:57.290,0:59:00.070 +Weighted by those coefficients those coefficients are data dependent + +0:59:00.890,0:59:02.890 +Because Z is data dependent + +0:59:05.390,0:59:07.390 +All right, so + +0:59:09.800,0:59:13.659 +Here's an example of how you use this whenever you have this symbol here + +0:59:15.530,0:59:17.859 +This circle with the dots in the middle, that's a + +0:59:20.510,0:59:26.739 +Component by component multiplication of two vectors some people call this Hadamard product + +0:59:29.660,0:59:34.629 +Anyway, it's turn-by-turn multiplication. So this is a + +0:59:36.200,0:59:41.020 +a type of a kind of functional module + +0:59:43.220,0:59:47.409 +GRU, gated recurrent Nets, was proposed by Kyunghyun Cho who is professor here + +0:59:50.420,0:59:51.880 +And it attempts + +0:59:51.880,0:59:54.430 +It's an attempt at fixing the problem that naturally occur in + +0:59:54.560,0:59:58.479 +recurrent Nets that I mentioned the fact that you have exploding gradient the fact that the + +1:00:00.050,1:00:04.629 +recurrent nets don't really remember their states for very long. They tend to kind of forget really quickly + +1:00:05.150,1:00:07.540 +And so it's basically a memory cell + +1:00:08.060,1:00:14.080 +Okay, and I have to say this is the kind of second big family of sort of + +1:00:16.820,1:00:20.919 +Recurrent net with memory. 
The first one is LSTM, but I'm going to talk about it just afterwards + +1:00:21.650,1:00:23.650 +Just because this one is a little simpler + +1:00:24.950,1:00:27.550 +The equations are written at the bottom here so + +1:00:28.280,1:00:30.280 +basically, there is a + +1:00:31.280,1:00:32.839 +a + +1:00:32.839,1:00:34.839 +gating vector Z + +1:00:35.720,1:00:37.550 +which is + +1:00:37.550,1:00:41.919 +simply the application of a nonlinear function the sigmoid function + +1:00:42.950,1:00:44.089 +to + +1:00:44.089,1:00:49.119 +two linear layers and a bias and those two linear layers take into account the input X(t) and + +1:00:49.400,1:00:54.389 +The previous state which they did note H in their case, not Z like I did + +1:00:55.930,1:01:01.889 +Okay, so you take X you take H you compute matrices + +1:01:02.950,1:01:04.140 +You pass a result + +1:01:04.140,1:01:07.440 +you add the results you pass them through sigmoid functions and you get a bunch of + +1:01:07.539,1:01:11.939 +values between 0 & 1 because the sigmoid is between 0 & 1 gives you a coefficient and + +1:01:14.140,1:01:16.140 +You use those coefficients + +1:01:16.660,1:01:20.879 +You see the formula at the bottom the Z is used to basically compute a linear combination + +1:01:21.700,1:01:24.210 +of two inputs if Z is equal to 1 + +1:01:25.420,1:01:28.379 +You basically only look at h(t-1). If Z + +1:01:29.859,1:01:35.669 +Is equal to 0 then 1 - Z is equal to 1 then you you look at this + +1:01:36.400,1:01:38.109 +expression here and + +1:01:38.109,1:01:43.528 +That expression is, you know some weight matrix multiplied by the input passed through a hyperbolic tangent function + +1:01:43.529,1:01:46.439 +It could be a ReLU but it's a hyperbolic tangent in this case + +1:01:46.839,1:01:49.528 +And it's combined with other stuff here that we can ignore for now + +1:01:50.829,1:01:58.439 +Okay. So basically what what the Z value does is that it tells the system just copy if Z equal 1 it just copies its + +1:01:58.440,1:02:00.440 +previous state and ignores the input + +1:02:00.789,1:02:04.978 +Ok, so it acts like a memory essentially. It just copies its previous state on its output + +1:02:06.430,1:02:08.430 +and if Z + +1:02:09.549,1:02:17.189 +Equals 0 then the current state is forgotten essentially and is basically you would you just read the input + +1:02:19.450,1:02:24.629 +Ok multiplied by some matrix so it changes the state of the system + +1:02:28.960,1:02:35.460 +Yeah, you do this component by component essentially, okay vector 1 yeah exactly + +1:02:47.500,1:02:53.459 +Well, it's just like the number of independent multiplications, right, what is the derivative of + +1:02:54.880,1:02:59.220 +some objective function with respect to the input of a product. It's equal to the + +1:03:01.240,1:03:07.829 +Derivative of that objective function with respect to the add, to the product multiplied by the other term. That's the as simple as that + +1:03:18.039,1:03:20.039 +So it's because by default + +1:03:20.529,1:03:22.529 +essentially unless Z is + +1:03:23.619,1:03:25.509 +your Z is + +1:03:25.509,1:03:30.689 +More less by default equal to one and so by default the system just copies its previous state + +1:03:33.039,1:03:35.999 +And if it's just you know slightly less than one it + +1:03:37.210,1:03:42.539 +It puts a little bit of the input into the state but doesn't significantly change the state and what that means. 
Is that it + +1:03:43.630,1:03:44.799 +preserves + +1:03:44.799,1:03:46.919 +Norm, and it preserves information, right? + +1:03:48.940,1:03:53.099 +Since basically memory cell that you can change continuously + +1:04:00.480,1:04:04.159 +Well because you need something between zero and one it's a coefficient, right + +1:04:04.160,1:04:07.789 +And so it needs to be between zero and one that's what we do sigmoids + +1:04:11.850,1:04:13.080 +I + +1:04:13.080,1:04:16.850 +mean you need one that is monotonic that goes between 0 and 1 and + +1:04:17.970,1:04:20.059 +is monotonic and differentiable I mean + +1:04:20.730,1:04:22.849 +There's lots of sigmoid functions, but you know + +1:04:24.000,1:04:26.000 +Why not? + +1:04:26.100,1:04:29.779 +Yeah, I mean there is some argument for using others, but you know doesn't make a huge + +1:04:30.540,1:04:32.540 +amount of difference + +1:04:32.700,1:04:37.009 +Okay in the full form of gru. there is also a reset gate. So the reset gate is + +1:04:37.650,1:04:44.989 +Is this guy here? So R is another vector that's computed also as a linear combination of inputs and previous state and + +1:04:45.660,1:04:51.319 +It serves to multiply the previous state. So if R is 0 then the previous state is + +1:04:52.020,1:04:54.410 +if R is 0 and Z is 1 + +1:04:55.950,1:05:00.499 +The system is basically completely reset to 0 because that is 0 + +1:05:01.350,1:05:03.330 +So it only looks at the input + +1:05:03.330,1:05:09.950 +But that's basically a simplified version of something that came out way earlier in 1997 called + +1:05:10.260,1:05:12.260 +LSTM long short-term memory + +1:05:13.050,1:05:14.820 +Which you know attempted + +1:05:14.820,1:05:19.519 +Which was an attempt at solving the same issue that you know recurrent Nets basically lose memory for too long + +1:05:19.520,1:05:21.520 +and so you build them as + +1:05:22.860,1:05:26.120 +As memory cells by default and by default they will preserve the information + +1:05:26.760,1:05:28.430 +It's essentially the same idea here + +1:05:28.430,1:05:33.979 +It's a you know, the details are slightly different here don't have dots in the middle of the round shape here for the product + +1:05:33.980,1:05:35.610 +But it's the same thing + +1:05:35.610,1:05:41.539 +And there's a little more kind of moving parts. 
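[Editor's note] Here is a from-scratch sketch of the GRU step that was just walked through, following the convention used above that z close to 1 means "copy the previous state"; biases are omitted and all names and sizes are my own, so the exact slide notation may differ.

```python
import torch

def gru_step(x, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    z = torch.sigmoid(Wz @ x + Uz @ h_prev)          # update gate: z ~ 1 keeps the old state
    r = torch.sigmoid(Wr @ x + Ur @ h_prev)          # reset gate:  r ~ 0 drops the old state
    h_new = torch.tanh(Wh @ x + Uh @ (r * h_prev))   # candidate state (* is component-wise)
    return z * h_prev + (1 - z) * h_new              # convex combination of old and candidate

d_in, d_h = 5, 4
Wz, Wr, Wh = (torch.randn(d_h, d_in) for _ in range(3))
Uz, Ur, Uh = (torch.randn(d_h, d_h) for _ in range(3))
h = gru_step(torch.randn(d_in), torch.zeros(d_h), Wz, Uz, Wr, Ur, Wh, Uh)
```

In practice you would rely on torch.nn.GRU or torch.nn.GRUCell, which implement the same equations (plus biases) with the efficient kernels mentioned in the lecture.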
It's basically it looks more like an actual run sale + +1:05:41.540,1:05:44.060 +So it's like a flip-flop they can you know preserve + +1:05:44.430,1:05:48.200 +Information and there is some leakage that you can have, you can reset it to 0 or to 1 + +1:05:48.810,1:05:50.810 +It's fairly complicated + +1:05:52.050,1:05:59.330 +Thankfully people at NVIDIA Facebook Google and various other places have very efficient implementations of those so you don't need to + +1:05:59.550,1:06:01.550 +figure out how to write the + +1:06:01.620,1:06:03.710 +CUDA code for this or write the back pop + +1:06:05.430,1:06:07.430 +Works really well + +1:06:07.500,1:06:12.689 +it's it's quite what you'd use but it's used less and less because + +1:06:13.539,1:06:15.539 +people use recurrent Nets + +1:06:16.150,1:06:18.210 +people used to use recurrent Nets for natural language processing + +1:06:19.329,1:06:21.220 +mostly and + +1:06:21.220,1:06:25.949 +Things like speech recognition and speech recognition is moving towards using convolutional Nets + +1:06:27.490,1:06:29.200 +temporal conditional Nets + +1:06:29.200,1:06:34.109 +while the natural language processing is moving towards using what's called transformers + +1:06:34.630,1:06:36.900 +Which we'll hear a lot about tomorrow, right? + +1:06:37.630,1:06:38.950 +no? + +1:06:38.950,1:06:40.950 +when + +1:06:41.109,1:06:43.109 +two weeks from now, okay + +1:06:46.599,1:06:48.599 +So what transformers are + +1:06:49.119,1:06:51.119 +Okay, I'm not gonna talk about transformers just now + +1:06:51.759,1:06:56.219 +but these key transformers are kind of a generalization so + +1:06:57.009,1:06:58.619 +General use of attention if you want + +1:06:58.619,1:07:02.038 +So the big neural Nets that use attention that you know + +1:07:02.039,1:07:06.329 +Every block of neuron uses attention and that tends to work pretty well it works + +1:07:06.329,1:07:09.538 +So well that people are kind of basically dropping everything else for NLP + +1:07:10.869,1:07:12.869 +so the problem is + +1:07:13.269,1:07:15.299 +Systems like LSTM are not very good at this so + +1:07:16.599,1:07:20.219 +Transformers are much better. The biggest transformers have billions of parameters + +1:07:21.430,1:07:26.879 +Like the biggest one is by 15 billion something like that that order of magnitude the t5 or whatever it's called + +1:07:27.910,1:07:29.910 +from Google so + +1:07:30.460,1:07:36.779 +That's an enormous amount of memory and it's because of the particular type of architecture that's used in transformers + +1:07:36.779,1:07:40.319 +They they can actually store a lot of knowledge if you want + +1:07:41.289,1:07:43.559 +So that's the stuff people would use for + +1:07:44.440,1:07:47.069 +What you're talking about like question answering systems + +1:07:47.769,1:07:50.099 +Translation systems etc. They will use transformers + +1:07:52.869,1:07:54.869 +Okay + +1:07:57.619,1:08:01.778 +So because LSTM kind of was sort of you know one of the first + +1:08:02.719,1:08:04.958 +architectures recurrent architecture that kind of worked + +1:08:05.929,1:08:11.408 +People tried to use them for things that at first you would think are crazy but turned out to work + +1:08:12.109,1:08:16.689 +And one example of this is translation. It's called neural machine translation + +1:08:17.509,1:08:19.509 +So there was a paper + +1:08:19.639,1:08:22.149 +by Ilya Sutskever at NIPS 2014 where he + +1:08:22.969,1:08:29.799 +Trained this giant multi-layer LSTM. So what's a multi-layered LSTM? 
It's an LSTM where you have + +1:08:30.589,1:08:36.698 +so it's the unfolded version, right? So at the bottom here you have an LSTM which is here unfolded for three time steps + +1:08:36.699,1:08:41.618 +But it will have to be unfolded for the length of a sentence you want to translate, let's say a + +1:08:42.259,1:08:43.969 +sentence in French + +1:08:43.969,1:08:45.529 +and + +1:08:45.529,1:08:48.038 +And then you take the hidden + +1:08:48.289,1:08:53.709 +state at every time step of this LSTM and you feed that as input to a second LSTM and + +1:08:53.929,1:08:55.150 +I think in his network + +1:08:55.150,1:08:58.329 +he actually had four layers of that so you can think of this as a + +1:08:58.639,1:09:02.139 +Stacked LSTM that you know each of them are recurrent in time + +1:09:02.139,1:09:05.589 +But they are kind of stacked as the layers of a neural net + +1:09:06.500,1:09:07.670 +so + +1:09:07.670,1:09:14.769 +At the last time step in the last layer, you have a vector here, which is meant to represent the entire meaning of that sentence + +1:09:16.309,1:09:18.879 +Okay, so it could be a fairly large vector + +1:09:19.849,1:09:24.819 +and then you feed that to another multi-layer LSTM, which + +1:09:27.319,1:09:31.028 +You know you run for a sort of undetermined number of steps and + +1:09:32.119,1:09:37.209 +The role of this LSTM is to produce words in a target language if you do translation say German + +1:09:38.869,1:09:40.839 +Okay, so this is time, you know + +1:09:40.839,1:09:44.499 +It takes the state you run through the first two layers of the LSTM + +1:09:44.630,1:09:48.849 +Produce a word and then take that word and feed it as input to the next time step + +1:09:49.940,1:09:52.359 +So that you can generate text sequentially, right? + +1:09:52.909,1:09:58.899 +Run through this produce another word take that word feed it back to the input and keep going. So this is a + +1:10:00.619,1:10:02.619 +Should do this for translation you get this gigantic + +1:10:03.320,1:10:07.480 +Neural net you train and this is the it's a system of this type + +1:10:07.480,1:10:12.010 +The one that Sutskever represented at NIPS 2014 it was was the first neural + +1:10:13.130,1:10:19.209 +Translation system that had performance that could rival sort of more classical approaches not based on neural nets + +1:10:21.350,1:10:23.950 +And people were really surprised that you could get such results + +1:10:26.840,1:10:28.840 +That success was very short-lived + +1:10:31.280,1:10:33.280 +Yeah, so the problem is + +1:10:34.340,1:10:37.449 +The word you're gonna say at a particular time depends on the word you just said + +1:10:38.180,1:10:41.320 +Right, and if you ask the system to just produce a word + +1:10:42.800,1:10:45.729 +And then you don't feed that word back to the input + +1:10:45.730,1:10:49.120 +the system could be used in other word that has that is inconsistent with the previous one you produced + +1:10:55.790,1:10:57.790 +It should but it doesn't + +1:10:58.760,1:11:05.590 +I mean not well enough that that it works. 
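[Editor's note] Here is a toy sketch of the stacked-LSTM encoder/decoder just described, using PyTorch's built-in nn.LSTM. The vocabulary sizes, the four layers, the start-token convention and the greedy feed-the-word-back loop are my own simplifications, not the actual Sutskever et al. setup.

```python
import torch
import torch.nn as nn

src_vocab, tgt_vocab, d = 1000, 1200, 64

emb_src = nn.Embedding(src_vocab, d)
emb_tgt = nn.Embedding(tgt_vocab, d)
encoder = nn.LSTM(d, d, num_layers=4)      # stacked LSTM, four layers as recalled above
decoder = nn.LSTM(d, d, num_layers=4)
readout = nn.Linear(d, tgt_vocab)

src = torch.randint(src_vocab, (12, 1))    # (seq_len, batch) of source token ids
_, state = encoder(emb_src(src))           # final (h, c): the squeezed "meaning" of the sentence

tok = torch.zeros(1, 1, dtype=torch.long)  # assume id 0 is a start-of-sentence token
outputs = []
for _ in range(15):                        # produce one target word at a time...
    out, state = decoder(emb_tgt(tok), state)
    tok = readout(out).argmax(-1)          # ...and feed it back in as the next input
    outputs.append(tok.item())
```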
So this kind of sequential production is pretty much required

1:11:07.790,1:11:09.790
In principle, you're right

1:11:10.910,1:11:12.910
It's not very satisfying

1:11:13.610,1:11:19.089
so there's a problem with this which is that the entire meaning of the sentence has to be kind of squeezed into

1:11:19.430,1:11:22.419
That hidden state that is between the encoder and the decoder

1:11:24.530,1:11:29.829
That's one problem the second problem is that despite the fact that LSTMs are built to preserve information

1:11:31.040,1:11:36.010
They are basically memory cells. They don't actually preserve information for more than about 20 words

1:11:36.860,1:11:40.299
So if your sentence is more than 20 words by the time you get to the end of the sentence

1:11:40.520,1:11:43.270
Your hidden state will have forgotten the beginning of it

1:11:43.640,1:11:49.269
so the fix people use for this, which is a huge hack, is called BiLSTM and

1:11:50.060,1:11:54.910
It's a completely trivial idea that consists in running two LSTMs in opposite directions

1:11:56.210,1:11:59.020
Okay, and then you get two codes one that is

1:11:59.720,1:12:04.419
running the LSTM from beginning to end of the sentence that's one vector and then the second vector is from

1:12:04.730,1:12:09.939
Running an LSTM in the other direction you get a second vector. That's the meaning of your sentence

1:12:10.280,1:12:16.809
You can basically double the length of your sentence without losing too much information this way, but it's not a very satisfying solution

1:12:17.120,1:12:19.450
So if you see biLSTM, that's what it is

1:12:22.830,1:12:29.179
So as I said, the success was short-lived because in fact before the paper was published at NIPS

1:12:30.390,1:12:32.390
There was a paper by

1:12:34.920,1:12:37.969
Dzmitry Bahdanau, Kyunghyun Cho and Yoshua Bengio

1:12:38.670,1:12:42.319
which was published on arXiv in September 2014 that said

1:12:43.560,1:12:47.209
We can use attention. 
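[Editor's note] Before the attention discussion continues, here is a minimal sketch of the biLSTM trick just mentioned. PyTorch exposes the two opposite-direction LSTMs directly through bidirectional=True; the sizes are arbitrary.

```python
import torch
import torch.nn as nn

d = 64
bilstm = nn.LSTM(d, d, bidirectional=True)
sentence = torch.randn(40, 1, d)               # a 40-word sentence, batch of 1
out, (h_n, c_n) = bilstm(sentence)
# h_n has shape (2, 1, d): h_n[0] read the sentence forward, h_n[1] read it backward.
meaning = torch.cat([h_n[0], h_n[1]], dim=-1)  # the two "codes" for the sentence
```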
So the attention mechanism I mentioned earlier + +1:12:49.320,1:12:51.300 +Instead of having those gigantic + +1:12:51.300,1:12:54.890 +Networks and squeezing the entire meaning of a sentence into this small vector + +1:12:55.800,1:12:58.190 +it would make more sense to the translation if + +1:12:58.710,1:13:03.169 +Every time said, you know, we want to produce a word in French corresponding to a sentence in English + +1:13:04.469,1:13:08.509 +If we looked at the location in the English sentence that had that word + +1:13:09.390,1:13:10.620 +Okay + +1:13:10.620,1:13:12.090 +so + +1:13:12.090,1:13:17.540 +Our decoder is going to produce french words one at a time and when it comes to produce a word + +1:13:18.449,1:13:21.559 +that has an equivalent in the input english sentence it's + +1:13:21.960,1:13:29.750 +going to focus its attention on that word and then the translation from French to English of that word would be simple or the + +1:13:30.360,1:13:32.300 +You know, it may not be a single word + +1:13:32.300,1:13:34.050 +it could be a group of words right because + +1:13:34.050,1:13:39.590 +Very often you have to turn a group of word in English into a group of words in French to kind of say the same + +1:13:39.590,1:13:41.590 +thing if it's German you have to + +1:13:42.150,1:13:43.949 +put the + +1:13:43.949,1:13:47.479 +You know the verb at the end of the sentence whereas in English, it might be at the beginning + +1:13:48.060,1:13:51.109 +So basically you use this attention mechanism + +1:13:51.110,1:13:57.440 +so this attention module here is the one that I showed a couple slides earlier which basically decides + +1:13:58.739,1:14:04.428 +Which of the time steps which of the hidden representation for which other word in the input sentence it is going to focus on + +1:14:06.570,1:14:12.259 +To kind of produce a representation that is going to produce the current word at a particular time step + +1:14:12.260,1:14:15.320 +So here we're at time step number three, we're gonna produce a third word + +1:14:16.140,1:14:21.829 +And we're gonna have to decide which of the input word corresponds to this and we're gonna have this attention mechanism + +1:14:21.830,1:14:23.830 +so essentially we're gonna have a + +1:14:25.140,1:14:28.759 +Small piece of neural net that's going to look at the the inputs on this side + +1:14:31.809,1:14:35.879 +It's going to have an output which is going to go through a soft max that is going to produce a bunch of + +1:14:35.979,1:14:42.269 +Coefficients that sum to 1 between 0 and 1 and they're going to compute a linear combination of the states at different time steps + +1:14:43.719,1:14:48.899 +Ok by setting one of those coefficients to 1 and the other ones to 0 it is going to focus the attention of the system on + +1:14:48.900,1:14:50.900 +one particular word + +1:14:50.949,1:14:56.938 +So the magic of this is that this neural net that decides that runs to the softmax and decides on those coefficients actually + +1:14:57.159,1:14:59.159 +Can be trained with back prop is just another + +1:14:59.590,1:15:03.420 +Set of weights in a neural net and you don't have to built it by hand. 
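[Editor's note] A minimal sketch of the attention step just described: for the current decoder state, score every encoder time step, softmax the scores into coefficients between 0 and 1 that sum to 1, and take the weighted sum of the encoder states. Dot-product scoring is used here only for brevity; the Bahdanau et al. model scores with a small network instead, and all sizes are made up.

```python
import torch

T, d = 12, 64
encoder_states = torch.randn(T, d)        # one vector per input word
decoder_state = torch.randn(d)            # state while producing the current output word

scores = encoder_states @ decoder_state   # one score per input position
coeffs = torch.softmax(scores, dim=0)     # between 0 and 1, summing to 1
context = coeffs @ encoder_states         # focus: weighted sum of the encoder states
```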
It just figures it out + +1:15:06.550,1:15:10.979 +This completely revolutionized the field of neural machine translation in the sense that + +1:15:11.889,1:15:13.889 +within a + +1:15:14.050,1:15:20.309 +Few months team from Stanford won a big competition with this beating all the other methods + +1:15:22.119,1:15:28.199 +And then within three months every big company that works on translation had basically deployed systems based on this + +1:15:29.289,1:15:31.469 +So this just changed everything + +1:15:33.189,1:15:40.349 +And then people started paying attention to attention, okay pay more attention to attention in a sense that + +1:15:41.170,1:15:44.879 +And then there was a paper by a bunch of people at Google + +1:15:45.729,1:15:52.529 +What the title was attention is all you need and It was basically a paper that solved a bunch of natural language processing tasks + +1:15:53.050,1:15:59.729 +by using a neural net where every layer, every group of neurons basically was implementing attention and that's what a + +1:16:00.459,1:16:03.149 +Or something called self attention. That's what a transformer is + +1:16:08.829,1:16:15.449 +Yes, you can have a variable number of outputs of inputs that you focus attention on + +1:16:18.340,1:16:20.849 +Okay, I'm gonna talk now about memory networks + +1:16:35.450,1:16:40.309 +So this stems from work at Facebook that was started by Antoine Bordes + +1:16:41.970,1:16:43.970 +I think in 2014 and + +1:16:45.480,1:16:47.480 +By + +1:16:49.650,1:16:51.799 +Sainbayar Sukhbaatar, I + +1:16:56.760,1:16:58.760 +Think in 2015 or 16 + +1:16:59.040,1:17:01.040 +Called end-to-end memory networks + +1:17:01.520,1:17:06.890 +Sainbayar Sukhbaatar was a PhD student here and it was an intern at Facebook when he worked on this + +1:17:07.650,1:17:10.220 +together with a bunch of other people Facebook and + +1:17:10.860,1:17:12.090 +the idea of memory + +1:17:12.090,1:17:17.270 +Network is that you'd like to have a short-term memory you'd like your neural net to have a short-term memory or working memory + +1:17:18.300,1:17:23.930 +Okay, you'd like it to you know, you you tell okay, if I tell you a story I tell you + +1:17:25.410,1:17:27.410 +John goes to the kitchen + +1:17:28.170,1:17:30.170 +John picks up the milk + +1:17:34.440,1:17:36.440 +Jane goes to the kitchen + +1:17:37.290,1:17:40.910 +And then John goes to the bedroom and drops the milk there + +1:17:41.430,1:17:44.899 +And then goes back to the kitchen and ask you. Where's the milk? Okay + +1:17:44.900,1:17:47.720 +so every time I had told you a sentence you kind of + +1:17:48.330,1:17:50.330 +updated in your mind a + +1:17:50.340,1:17:52.340 +Kind of current state of the world if you want + +1:17:52.920,1:17:56.870 +and so by telling you the story you now you have a representation of the state to the world and if I ask you a + +1:17:56.870,1:17:59.180 +Question about the state of the world you can answer it. Okay + +1:18:00.270,1:18:02.270 +You store this in a short-term memory + +1:18:03.720,1:18:06.769 +You didn't store it, ok, so there's kind of this + +1:18:06.770,1:18:10.399 +There's a number of different parts in your brain, but it's two important parts, one is the cortex + +1:18:10.470,1:18:13.279 +The cortex is where you have long term memory. 
Where you + +1:18:15.120,1:18:17.120 +You know you + +1:18:17.700,1:18:22.129 +Where all your your thinking is done and all that stuff and there is a separate + +1:18:24.720,1:18:26.460 +You know + +1:18:26.460,1:18:28.879 +Chunk of neurons called the hippocampus which is sort of + +1:18:29.100,1:18:32.359 +Its kind of two formations in the middle of the brain and they kind of send + +1:18:34.320,1:18:36.650 +Wires to pretty much everywhere in the cortex and + +1:18:37.110,1:18:44.390 +The hippocampus is thought that to be used as a short-term memory. So it can just you know, remember things for relatively short time + +1:18:45.950,1:18:47.450 +The prevalent + +1:18:47.450,1:18:53.530 +theory is that when you when you sleep and you dream there's a lot of information that is being transferred from your + +1:18:53.810,1:18:56.800 +hippocampus to your cortex to be solidified in long-term memory + +1:18:59.000,1:19:01.090 +Because the hippocampus has limited capacity + +1:19:04.520,1:19:08.859 +When you get senile like you get really old very often your hippocampus shrinks and + +1:19:09.620,1:19:13.570 +You don't have short-term memory anymore. So you keep repeating the same stories to the same people + +1:19:14.420,1:19:16.420 +Okay, it's very common + +1:19:19.430,1:19:25.930 +Or you go to a room to do something and by the time you get to the room you forgot what you were there for + +1:19:29.450,1:19:31.869 +This starts happening by the time you're 50, by the way + +1:19:36.290,1:19:40.390 +So, I don't remember what I said last week of two weeks ago, um + +1:19:41.150,1:19:44.950 +Okay, but anyway, so memory network, here's the idea of memory network + +1:19:46.340,1:19:50.829 +You have an input to the memory network. Let's call it X and think of it as an address + +1:19:51.770,1:19:53.770 +Of the memory, okay + +1:19:53.930,1:19:56.409 +What you're going to do is you're going to compare this X + +1:19:58.040,1:20:03.070 +With a bunch of vectors, we're gonna call K + +1:20:08.180,1:20:10.180 +So k₁ k₂ k₃ + +1:20:12.890,1:20:18.910 +Okay, so you compare those two vectors and the way you compare them is via dot product very simple + +1:20:28.460,1:20:33.460 +Okay, so now you have the three dot products of all the three Ks with the X + +1:20:34.730,1:20:37.990 +They are scalar values, you know plug them to a softmax + +1:20:47.630,1:20:50.589 +So what you get are three numbers between 0 & 1 that sum to 1 + +1:20:53.840,1:20:59.259 +What you do with those you have 3 other vectors that I'm gonna call V + +1:21:00.680,1:21:02.680 +v₁, v₂ and v₃ + +1:21:03.770,1:21:07.120 +And what you do is you multiply + +1:21:08.990,1:21:13.570 +These vectors by those scalars, so this is very much like the attention mechanism that we just talked about + +1:21:17.870,1:21:20.950 +Okay, and you sum them up + +1:21:27.440,1:21:34.870 +Okay, so take an X compare X with each of the K each of the Ks those are called keys + +1:21:39.170,1:21:44.500 +You get a bunch of coefficients between the zero and one that sum to one and then compute a linear combination of the values + +1:21:45.260,1:21:47.260 +Those are value vectors + +1:21:50.510,1:21:51.650 +And + +1:21:51.650,1:21:53.150 +Sum them up + +1:21:53.150,1:22:00.400 +Okay, so imagine that one of the key exactly matches X you're gonna have a large coefficient here and small coefficients there + +1:22:00.400,1:22:06.609 +So the output of the system will essentially be V2, if K 2 matches X the output would essentially be V 2 + +1:22:08.060,1:22:09.500 +Okay + 
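[Editor's note] Here is a sketch of the key/value memory lookup just described: compare the query x with each key by dot product, softmax the scores, and return the weighted sum of the values. The number of slots and dimensions are arbitrary, and the query is built to strongly match the second key so the output lands near its value.

```python
import torch

n_slots, d = 3, 16
K = torch.randn(n_slots, d)       # keys k1, k2, k3
V = torch.randn(n_slots, d)       # values v1, v2, v3
x = K[1] * 5                      # a query that strongly matches the second key

alpha = K @ x                     # alpha_i = k_i . x
c = torch.softmax(alpha, dim=0)   # coefficients between 0 and 1 summing to 1
out = c @ V                       # ~ V[1], since the query matched the second key
```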
+1:22:09.500,1:22:11.890 +So this is an addressable associative memory + +1:22:12.620,1:22:19.419 +Associative memory is exactly that where you have keys with values and if your input matches a key you get the value here + +1:22:19.420,1:22:21.420 +It's a kind of soft differentiable version of that + +1:22:26.710,1:22:28.710 +So you can + +1:22:29.019,1:22:34.559 +you can back propagate to this you can you can write into this memory by changing the V vectors or + +1:22:34.929,1:22:38.609 +Even changing the K vectors. You can change the V vectors by gradient descent + +1:22:39.489,1:22:45.598 +Okay, so if you wanted the output of the memory to be something in particular by backpropagating gradient through this + +1:22:47.019,1:22:52.259 +you're going to change the currently active V to whatever it needs for the + +1:22:53.530,1:22:55.530 +for the output + +1:22:56.050,1:22:58.050 +So in those papers + +1:22:59.800,1:23:02.460 +What what they did was I + +1:23:03.969,1:23:06.299 +Mean there's a series of papers on every network, but + +1:23:08.409,1:23:11.879 +What they did was exactly scenario I just explained where you you kind of + +1:23:12.909,1:23:16.319 +Tell a story to a system so give it a sequence of sentences + +1:23:17.530,1:23:22.800 +Those sentences are encoded into vectors by running through a neural net which is not pre-trained, you know + +1:23:25.269,1:23:29.279 +it just through the training of the entire system it figures out how to encode this + +1:23:30.039,1:23:35.009 +and then those sentences are written to the memory of this type and + +1:23:35.829,1:23:41.129 +Then when you ask a question to the system you encode the question at the input of a neural net, the neural net produces + +1:23:41.130,1:23:44.999 +An X to the memory the memory returns a value + +1:23:46.510,1:23:47.590 +And + +1:23:47.590,1:23:49.480 +Then you use this value + +1:23:49.480,1:23:54.329 +and the previous state of the network to kind of reaccess the memory, you can do this multiple times and + +1:23:54.550,1:23:58.139 +You train this entire network to produce or an answer to your to your question + +1:23:59.139,1:24:03.748 +And if you have lots and lots of scenarios lots and lots of questions or also lots of answers + +1:24:04.119,1:24:10.169 +Which they did in this case with by artificially generating stories questions and answers + +1:24:11.440,1:24:12.940 +this thing actually + +1:24:12.940,1:24:15.989 +learns to store stories and + +1:24:16.780,1:24:18.760 +answer questions + +1:24:18.760,1:24:20.409 +Which is pretty amazing + +1:24:20.409,1:24:22.409 +So that's the memory Network + +1:24:27.110,1:24:29.860 +Okay, so the first step is you compute + +1:24:32.210,1:24:34.300 +Alpha I equals + +1:24:36.590,1:24:43.899 +KI transpose X. Okay, just a dot product. Okay, and then you compute + +1:24:48.350,1:24:51.519 +CI or the vector C I should say + +1:24:54.530,1:24:57.579 +Is the softmax function + +1:25:00.320,1:25:02.979 +Applied to the vector of alphas, okay + +1:25:02.980,1:25:07.840 +So the C's are between 0 and 1 and sum to 1 and then the output of the system + +1:25:09.080,1:25:11.080 +is + +1:25:11.150,1:25:13.360 +sum over I of + +1:25:14.930,1:25:16.930 +Ci + +1:25:17.240,1:25:21.610 +Vi where Vis are the value vectors. Okay. That's the memory + +1:25:30.420,1:25:34.489 +Yes, yes, yes, absolutely + +1:25:37.140,1:25:38.640 +Not really + +1:25:38.640,1:25:41.869 +No, I mean all you need is everything to be encoded as vectors? 
+ +1:25:42.660,1:25:48.200 +Right and so run for your favorite convnet, you get a vector that represents the image and then you can do the QA + +1:25:50.880,1:25:52.880 +Yeah, I mean so + +1:25:53.490,1:25:57.050 +You can imagine lots of applications of this so in particular + +1:25:58.110,1:26:00.110 +When application is I + +1:26:00.690,1:26:02.690 +Mean you can you can think of + +1:26:06.630,1:26:09.109 +You know think of this as a kind of a memory + +1:26:11.160,1:26:14.000 +And then you can have some sort of neural net + +1:26:16.020,1:26:16.970 +That you know + +1:26:16.970,1:26:24.230 +it takes takes an input and then produces an address for the memory gets a value back and + +1:26:25.050,1:26:27.739 +Then keeps growing and eventually produces an output + +1:26:28.830,1:26:30.830 +This was very much like a computer + +1:26:31.050,1:26:33.650 +Ok. Well the neural net here is the + +1:26:34.920,1:26:37.099 +the CPU the ALU the CPU + +1:26:37.680,1:26:43.099 +Ok, and the memory is just an external memory you can access whenever you need it, or you can write to it if you want + +1:26:43.890,1:26:49.040 +It's a recurrent net in this case. You can unfold it in time, which is what these guys did + +1:26:51.330,1:26:52.650 +And + +1:26:52.650,1:26:58.009 +And then so then there are people who kind of imagined that you could actually build kind of differentiable computers out of this + +1:26:58.410,1:27:03.530 +There's something called neural Turing machine, which is essentially a form of this where the memory is not of this type + +1:27:03.530,1:27:07.040 +It's kind of a soft tape like in a regular Turing machine + +1:27:07.890,1:27:14.030 +This is somewhere from deep mind that the interesting story about this which is that the facebook people put out + +1:27:14.760,1:27:19.909 +The paper on the memory network on arxiv and three days later + +1:27:22.110,1:27:24.110 +The deepmind people put out a paper + +1:27:25.290,1:27:30.679 +About neural Turing machine and the reason they put three days later is that they've been working on the all Turing machine and + +1:27:31.350,1:27:32.640 +in their + +1:27:32.640,1:27:37.160 +Tradition they kind of keep project secret unless you know until they can make a big splash + +1:27:37.770,1:27:40.699 +But there they got scooped so they put the paper out on arxiv + +1:27:45.060,1:27:50.539 +Eventually, they made a big splash with another with a paper but that was a year later or so + +1:27:52.230,1:27:54.230 +So what's happened + +1:27:55.020,1:28:01.939 +since then is that people have kind of taken this module this idea that you compare inputs to keys and + +1:28:02.550,1:28:04.550 +that gives you coefficients and + +1:28:04.950,1:28:07.819 +You know you you produce values + +1:28:08.520,1:28:09.990 +as + +1:28:09.990,1:28:14.449 +Kind of a essential module in a neural net and that's basically where the transformer is + +1:28:15.060,1:28:18.049 +so a transformer is basically a neural net in which + +1:28:19.290,1:28:21.290 +Every group of neurons is one of those + +1:28:21.720,1:28:29.449 +It's a it's a whole bunch of memories. Essentially. There's some more twist to it. Okay, but that's kind of the basic the basic idea + +1:28:32.460,1:28:34.460 +But you'll hear about this + +1:28:34.980,1:28:36.750 +in a week Oh + +1:28:36.750,1:28:38.250 +in two weeks + +1:28:38.250,1:28:40.140 +one week one week + +1:28:40.140,1:28:42.140 +Okay any more questions? + +1:28:44.010,1:28:46.640 +Cool. All right. 
Thank you very much diff --git a/docs/pt/week06/practicum06.sbv b/docs/pt/week06/practicum06.sbv new file mode 100644 index 000000000..f05301b02 --- /dev/null +++ b/docs/pt/week06/practicum06.sbv @@ -0,0 +1,1742 @@ +0:00:00.030,0:00:03.959 +so today we are gonna be covering quite a lot of materials so I will try not to + +0:00:03.959,0:00:08.309 +run but then yesterday young scooped me completely so young talked about exactly + +0:00:08.309,0:00:15.269 +the same things I wanted to talk today so I'm gonna go a bit faster please slow + +0:00:15.269,0:00:18.210 +me down if you actually are somehow lost okay + +0:00:18.210,0:00:21.420 +so I will just try to be a little bit faster than you sir + +0:00:21.420,0:00:26.250 +so today we are gonna be talking about recurrent neural networks record neural + +0:00:26.250,0:00:31.050 +networks are one type of architecture we can use in order to be to deal with + +0:00:31.050,0:00:37.430 +sequences of data what are sequences what type of signal is a sequence + +0:00:39.890,0:00:44.219 +temporal is a temporal component but we already seen data with temporal + +0:00:44.219,0:00:49.350 +component how what are they called what dimensional what is the dimension + +0:00:49.350,0:00:55.320 +of that kind of signal so on the convolutional net lesson we have seen + +0:00:55.320,0:00:59.969 +that a signal could be one this signal to this signal 3d signal based on the + +0:00:59.969,0:01:06.270 +domain and the domain is what you map from to go to right so temporal handling + +0:01:06.270,0:01:10.580 +sequential sequences of data is basically dealing with one the data + +0:01:10.580,0:01:15.119 +because the domain is going to be just the temporal axis nevertheless you can + +0:01:15.119,0:01:18.689 +also use RNN to deal with you know two dimensional data you have double + +0:01:18.689,0:01:28.049 +Direction okay okay so this is a classical neural network in the diagram + +0:01:28.049,0:01:33.299 +that is I'm used to draw where I represent each in this case bunch of + +0:01:33.299,0:01:37.590 +neurons like each of those is a vector and for example the X is my input vector + +0:01:37.590,0:01:42.450 +it's in pink as usual then I have my hidden layer in a green in the center + +0:01:42.450,0:01:46.200 +then I have my final blue eared lane layer which is the output network so + +0:01:46.200,0:01:52.320 +this is a three layer neural network in my for my notation and so if some of you + +0:01:52.320,0:01:57.960 +are familiar with digital electronics this is like talking about a + +0:01:57.960,0:02:03.329 +combinatorial logic your current output depends only on the current input and + +0:02:03.329,0:02:08.420 +that's it there is no there is no other input instead when we + +0:02:08.420,0:02:12.590 +are talking about our men we are gonna be talking about something that looks + +0:02:12.590,0:02:17.420 +like this in this case our output here on the right hand side depends on the + +0:02:17.420,0:02:21.860 +current input and on the state of the system and again if you're a king of + +0:02:21.860,0:02:26.750 +digital electronics this is simply sequential logic whereas you have an + +0:02:26.750,0:02:31.580 +internal state the onion is the dimension flip-flop if you have no idea + +0:02:31.580,0:02:37.040 +what a flip-flop you know check it out it's just some very basic memory unit in + +0:02:37.040,0:02:41.810 +digital electronics nevertheless this is the only difference right in the first + +0:02:41.810,0:02:45.290 +case you have an output which is just function of the input 
in the second case + +0:02:45.290,0:02:49.580 +you have an output which is function of the input and the state of the system + +0:02:49.580,0:02:54.130 +okay that's the big difference yeah vanilla is in American term for saying + +0:02:58.040,0:03:04.670 +it's plane doesn't have a taste that American sorry I try to be the most + +0:03:04.670,0:03:11.390 +American I can in Italy you feel taken an ice cream which is doesn't have a + +0:03:11.390,0:03:15.950 +taste it's gonna be fior di latte which is milk taste in here we don't have milk + +0:03:15.950,0:03:20.049 +tests they have vanilla taste which is the plain ice cream + +0:03:20.049,0:03:28.360 +okay Americans sorry all right so oh so let's see what does + +0:03:28.360,0:03:32.760 +it change this with young representation so young draws those kind of little + +0:03:32.760,0:03:38.170 +funky things here which represent a mapping between a TENS tensor to another + +0:03:38.170,0:03:41.800 +painter from one a vector to another vector right so there you have your + +0:03:41.800,0:03:46.630 +input vector X is gonna be mapped through this item here to this hidden + +0:03:46.630,0:03:50.620 +representation so that actually represent my fine transformation so my + +0:03:50.620,0:03:54.130 +rotation Plus this question then you have the heater representation that you + +0:03:54.130,0:03:57.850 +have another rotation is question then you get the final output right similarly + +0:03:57.850,0:04:03.220 +in the recurrent diagram you can have these additional things this is a fine + +0:04:03.220,0:04:06.640 +transformation squashing that's like a delay module with a final transformation + +0:04:06.640,0:04:10.900 +excursion and now you have the final one affine transformation and squashing + +0:04:10.900,0:04:18.100 +right these things is making noise okay sorry all right so what is the first + +0:04:18.100,0:04:24.250 +case first case is this one is a vector to sequence so we input one bubble the + +0:04:24.250,0:04:28.270 +pink wonder and then you're gonna have this evolution of the internal state of + +0:04:28.270,0:04:33.070 +the system the green one and then as the state of the system evolves you can be + +0:04:33.070,0:04:38.470 +spitting out at every time stamp one specific output what can be an example + +0:04:38.470,0:04:43.240 +of this kind of architecture so this one could be the following my input is gonna + +0:04:43.240,0:04:46.750 +be one of these images and then the output is going to be a sequence of + +0:04:46.750,0:04:53.140 +characters representing the English description of whatever this input is so + +0:04:53.140,0:04:57.940 +for example in the center when we have a herd of elephants so the last one herd + +0:04:57.940,0:05:03.880 +of elephants walking across a dry grass field so it's very very very well + +0:05:03.880,0:05:09.130 +refined right then you have in the center here for example two dogs play in + +0:05:09.130,0:05:15.640 +the in the grass maybe there are three but okay they play they're playing in + +0:05:15.640,0:05:20.500 +the grass right so it's cool in this case you know a red motorcycle park on + +0:05:20.500,0:05:24.610 +the side of the road looks more pink or you know a little + +0:05:24.610,0:05:30.490 +blow a little a little girl in the pink that is blowing bubbles that she's not + +0:05:30.490,0:05:35.650 +blowing right anything there all right and then you also have you know even + +0:05:35.650,0:05:41.560 +more wrong examples right so you have like yellow school bus parked in the + +0:05:41.560,0:05:44.050 
+parking lot well it's CL um but it's not a school + +0:05:44.050,0:05:49.860 +bus so it can be failing as well but I also can do a very very nice you know + +0:05:49.860,0:05:56.470 +you can also perform very well so this was from one input vector which is B for + +0:05:56.470,0:06:01.720 +example representation of my image to a sequence of symbols which are D for + +0:06:01.720,0:06:05.620 +example characters or words that are making here my English sentence okay + +0:06:05.620,0:06:11.440 +clear so far yeah okay another kind of usage you can have is maybe the + +0:06:11.440,0:06:17.560 +following so you're gonna have sequence two final vector okay so I don't care + +0:06:17.560,0:06:22.120 +about the intermediate sequences so okay the top right is called Auto regressive + +0:06:22.120,0:06:26.590 +network and outer regressive network is a network which is outputting an output + +0:06:26.590,0:06:29.950 +given that you feel as input the previous output okay + +0:06:29.950,0:06:33.700 +so this is called Auto regressive you have this kind of loopy part on the + +0:06:33.700,0:06:37.780 +network on the left hand side instead I'm gonna be providing several sequences + +0:06:37.780,0:06:40.140 +yeah that's gonna be the English translation + +0:06:51.509,0:06:55.380 +so you have a sequence of words that are going to make up your final sentence + +0:06:55.380,0:07:00.330 +it's it's blue there you can think about a index in a dictionary and then each + +0:07:00.330,0:07:03.300 +blue is going to tell you which word you're gonna pick on an indexed + +0:07:03.300,0:07:09.780 +dictionary right so this is a school bus right so oh yeah a yellow school bus you + +0:07:09.780,0:07:14.940 +go to a index of a then you have second index you can figure out that is yellow + +0:07:14.940,0:07:17.820 +and then school box right so the sequence here is going to be + +0:07:17.820,0:07:22.590 +representing the sequence of words the model is out on the other side there on + +0:07:22.590,0:07:26.460 +the left you're gonna have instead I keep feeding a sequence of symbols and + +0:07:26.460,0:07:30.750 +only at the end I'm gonna look what is my final output what can be an + +0:07:30.750,0:07:36.150 +application of this one so something yun also mentioned was different so let's + +0:07:36.150,0:07:40.789 +see if I can get my network to compile Python or to an open pilot own + +0:07:40.789,0:07:45.599 +interpretation so in this case I have my current input which I feed my network + +0:07:45.599,0:07:54.979 +which is going to be J equal 8580 for then for X in range eight some - J 920 + +0:07:54.979,0:07:59.430 +blah blah blah and then print this one and then my network is going to be + +0:07:59.430,0:08:04.860 +tasked with the just you know giving me twenty five thousand and eleven okay so + +0:08:04.860,0:08:09.210 +this is the final output of a program and I enforced in the network to be able + +0:08:09.210,0:08:13.860 +to output me the correct output the correct in your solution of this program + +0:08:13.860,0:08:18.330 +or even more complicated things for example I can provide a sequence of + +0:08:18.330,0:08:21.900 +other symbols which are going to be eighty eight thousand eight hundred + +0:08:21.900,0:08:26.669 +thirty seven then I have C is going to be something then I have print this one + +0:08:26.669,0:08:33.360 +if something that is always true as the other one and then you know the output + +0:08:33.360,0:08:38.849 +should be twelve thousand eight 184 right so you can train a neural net to + 
+0:08:38.849,0:08:42.690 +do these operations so you feed a sequence of symbols and then at the + +0:08:42.690,0:08:48.870 +output you just enforce that the final target should be a specific value okay + +0:08:48.870,0:08:56.190 +and these things making noise okay maybe I'm better + +0:08:56.190,0:09:02.589 +all right so what's next next is going to be for example a sequence to vector + +0:09:02.589,0:09:07.210 +to sequence this used to be the standard way of performing length language + +0:09:07.210,0:09:13.000 +translation so you start with a sequence of symbols here shown in pink so you + +0:09:13.000,0:09:17.290 +have a sequence of inputs then everything gets condensed into this kind + +0:09:17.290,0:09:23.020 +of final age which is this H over here which is going to be somehow my concept + +0:09:23.020,0:09:27.880 +right so I have a sentence I squeeze the sentence temporal information into just + +0:09:27.880,0:09:31.600 +one vector which is representing the meaning the message I'd like to send + +0:09:31.600,0:09:36.310 +across and then I get this meaning in whatever representation unrolled back in + +0:09:36.310,0:09:41.380 +a different language right so I can encode I don't know today I'm very happy + +0:09:41.380,0:09:47.350 +in English as a sequence of word and then you know you can get LG Sonoma to + +0:09:47.350,0:09:53.170 +Felicia and then I speak outside Thailand today or whatever now today I'm + +0:09:53.170,0:09:58.480 +very tired Jin Chen walk han lei or whatever ok so + +0:09:58.480,0:10:02.020 +again you have some kind of encoding then you have a compressed + +0:10:02.020,0:10:08.110 +representation and then you get like the decoding given the same compressed + +0:10:08.110,0:10:15.040 +version ok and so for example I guess language translation again recently we + +0:10:15.040,0:10:20.709 +have seen transformers and a lot of things like in the recent time so we're + +0:10:20.709,0:10:25.300 +going to cover that the next lesson I think but this used to be the state of + +0:10:25.300,0:10:31.000 +the art until few two years ago and here you can see that if you actually check + +0:10:31.000,0:10:38.950 +if you do a PCA over the latent space you have that words are grouped by + +0:10:38.950,0:10:43.630 +semantics ok so if we zoom in that region there are we're gonna see that in + +0:10:43.630,0:10:48.400 +what in the same location you find all the amounts december february november + +0:10:48.400,0:10:52.750 +whatever right if you put a few focus on a different region you get that a few + +0:10:52.750,0:10:55.250 +days next few miles and so on right so + +0:10:55.250,0:11:00.230 +different location will have some specific you know common meaning so we + +0:11:00.230,0:11:05.780 +basically see in this case how by training these networks you know just + +0:11:05.780,0:11:09.680 +with symbols they will pick up on some specific semantics + +0:11:09.680,0:11:16.130 +you know features right in this case you can see like there is a vector so the + +0:11:16.130,0:11:20.900 +vector that is connecting women to men is gonna be the same vector that is well + +0:11:20.900,0:11:27.590 +woman - man which is this one I think is gonna be equal to Queen - King right and + +0:11:27.590,0:11:32.890 +so yeah it's correct and so you're gonna have that the same distance in this + +0:11:32.890,0:11:37.730 +embedding space will be applied to things that are female and male for + +0:11:37.730,0:11:43.370 +example or in the other case you have walk-in and walked swimming and swamp so + 
+0:11:43.370,0:11:47.960 +you always have this you know specific linear transformation you can apply in + +0:11:47.960,0:11:53.690 +order to go from one type of word to the other one or this one you have the + +0:11:53.690,0:11:59.180 +connection between cities and the capitals all right so one more right I + +0:11:59.180,0:12:05.210 +think what's missing from the big picture here it's a big picture because + +0:12:05.210,0:12:09.560 +it's so large no no it's such a big picture because it's the overview okay + +0:12:09.560,0:12:18.590 +you didn't get the joke it's okay what's missing here vector to seek with no okay + +0:12:18.590,0:12:23.330 +good but no because you can still use the other one so you have this one the + +0:12:23.330,0:12:27.830 +vector is sequence to sequence right so this one is you start feeding inside + +0:12:27.830,0:12:31.580 +inputs you start outputting something right what can be an example of this + +0:12:31.580,0:12:38.900 +stuff so if you had a Nokia phone and you use the t9 you know this stuff from + +0:12:38.900,0:12:43.100 +20 years ago you have basically suggestions on what your typing is + +0:12:43.100,0:12:47.150 +you're typing right so this would be one type of these suggestions where like one + +0:12:47.150,0:12:50.570 +type of this architecture as you getting suggestions as you're typing things + +0:12:50.570,0:12:57.290 +through or you may have like speech to captions right I talked and you have the + +0:12:57.290,0:13:02.520 +things below or something very cool which is + +0:13:02.520,0:13:08.089 +the following so I start writing here the rings of Saturn glitter while the + +0:13:08.089,0:13:16.260 +harsh ice two men look at each other hmm okay they were enemies but the server + +0:13:16.260,0:13:20.100 +robots weren't okay okay hold on so this network was trained on some + +0:13:20.100,0:13:24.360 +sci-fi novels and therefore you can just type something then you let the network + +0:13:24.360,0:13:28.290 +start outputting some suggestions for you so you know if you don't know how to + +0:13:28.290,0:13:34.620 +write a book then you can you know ask your computer to help you out okay + +0:13:34.620,0:13:39.740 +that's so cool or one more that I really like it this one is fantastic I think + +0:13:39.740,0:13:45.959 +you should read read it I think so you put some kind of input there like the + +0:13:45.959,0:13:51.630 +scientist named alone what is it or the prompt right so you put in the + +0:13:51.630,0:13:56.839 +the top prompt and then you get you know this network start writing about very + +0:13:56.839,0:14:05.690 +interesting unicorns with multiple horns is called horns say unicorn right okay + +0:14:05.690,0:14:09.480 +alright let's so cool just check it out later and you can take a screenshot of + +0:14:09.480,0:14:14.970 +the screen anyhow so that was like the eye candy such that you get you know + +0:14:14.970,0:14:21.089 +hungry now let's go into be PTT which is the thing that they aren't really like + +0:14:21.089,0:14:27.390 +yesterday's PTT said okay alright let's see how this stuff works okay so on the + +0:14:27.390,0:14:31.620 +left hand side we see again this vector middle in the representation the output + +0:14:31.620,0:14:35.520 +to a fine transformation and then there we have the classical equations right + +0:14:35.520,0:14:42.450 +all right so let's see how this stuff is similar or not similar and you can't see + +0:14:42.450,0:14:46.620 +anything so for the next two seconds I will want one minute I will turn off the + 
+0:14:46.620,0:14:51.300 +lights then I turn them on [Music] + +0:14:51.300,0:14:55.570 +okay now you can see something all right so let's see what are the questions of + +0:14:55.570,0:15:00.490 +this new architecture don't stand up you're gonna be crushing yourself + +0:15:00.490,0:15:04.270 +alright so you have here the hidden representation now there's gonna be this + +0:15:04.270,0:15:10.000 +nonlinear function of this rotation of a stack version of my input which I + +0:15:10.000,0:15:15.520 +appended the previous configuration of the hidden layer okay and so this is a + +0:15:15.520,0:15:19.420 +very nice compact notation it's just I just put the two vectors one on top of + +0:15:19.420,0:15:24.640 +each other and then I sign assign I sum the bias I also and define initial + +0:15:24.640,0:15:29.920 +condition my initial H is gonna be 0 so at the beginning whenever I have t=1 + +0:15:29.920,0:15:34.360 +this stuff is gonna be settle is a vector of zeros and then I have this + +0:15:34.360,0:15:39.880 +matrix Wh is gonna be two separate matrices so sometimes you see this a + +0:15:39.880,0:15:48.130 +question is Wₕₓ times x plus Wₕₕ times h[t-1] but you can also figure out + +0:15:48.130,0:15:52.450 +that if you stock those two matrices you know one attached to the other that you + +0:15:52.450,0:15:56.620 +just put this two vertical lines completely equivalent notation but it + +0:15:56.620,0:16:01.360 +looked like very similar to whatever we had here so hidden layer is affine + +0:16:01.360,0:16:05.230 +transformation of the input inner layer is affine transformation of the input + +0:16:05.230,0:16:11.440 +and the previous value okay and then you have the final output is going to be + +0:16:11.440,0:16:20.140 +again my final rotation so I'm gonna turn on the light so no magic so far + +0:16:20.140,0:16:27.690 +right you're okay right you're with me to shake the heads what about the others + +0:16:27.690,0:16:34.930 +no yes okay whatever so this one is simply on the right hand + +0:16:34.930,0:16:40.330 +side I simply unroll over time such that you can see how things are just not very + +0:16:40.330,0:16:43.990 +crazy like this loop here is not actually a loop this is like a + +0:16:43.990,0:16:48.500 +connection to next time steps right so that around + +0:16:48.500,0:16:52.760 +arrow means is just this right arrow so this is a neural net it's dinkley a + +0:16:52.760,0:16:57.950 +neural net which is extended in in length rather also not only in a in a + +0:16:57.950,0:17:01.639 +thickness right so you have a network that is going this direction input and + +0:17:01.639,0:17:05.600 +output but as you can think as there's been an extended input and this been an + +0:17:05.600,0:17:10.220 +extended output while all these intermediate weights are all share right + +0:17:10.220,0:17:14.120 +so all of these weights are the same weights and then you use this kind of + +0:17:14.120,0:17:17.510 +shared weights so it's similar to a convolutional net in the sense that you + +0:17:17.510,0:17:21.410 +had this parameter sharing right across different time domains because you + +0:17:21.410,0:17:28.820 +assume there is some kind of you know stationarity right of the signal make + +0:17:28.820,0:17:32.870 +sense so this is a kind of convolution right you can see how this is kind of a + +0:17:32.870,0:17:40.130 +convolution alright so that was kind of you know a little bit of the theory we + +0:17:40.130,0:17:46.160 +already seen that so let's see how this works for a practical example so in 
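Written out, the recurrence just described is (notation assumed from the slides, with $f$ the chosen non-linearity; stacking the two vectors is equivalent to keeping two separate weight matrices):

$$
h[t] = f\big(W_h\,[\,x[t];\,h[t-1]\,] + b_h\big)
     = f\big(W_{hx}\,x[t] + W_{hh}\,h[t-1] + b_h\big),
\qquad h[0] = 0,
$$

$$
\hat{y}[t] = W_{yh}\,h[t] + b_y .
$$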
this + +0:17:46.160,0:17:51.830 +case we we are just reading this code here so this is world language model you + +0:17:51.830,0:17:57.770 +can find it at the PyTorch examples so you have a sequence of symbols I have + +0:17:57.770,0:18:01.910 +just represented there every symbol is like a letter in the alphabet and then + +0:18:01.910,0:18:05.419 +the first part is gonna be basically splitting this one in this way right + +0:18:05.419,0:18:10.309 +so you preserve vertically in the time domain but then I split the long long + +0:18:10.309,0:18:16.640 +long sequence such that I can now chop I can use best bets bets how do you say + +0:18:16.640,0:18:21.980 +computation so the first thing you have the best size is gonna be 4 in this case + +0:18:21.980,0:18:27.410 +and then I'm gonna be getting in my first batch and then I will force the + +0:18:27.410,0:18:33.650 +network to be able to so this will be my best back propagation through time + +0:18:33.650,0:18:38.270 +period and I will force the network to output the next sequence of characters + +0:18:38.270,0:18:44.510 +ok so given that I have a,b,c, I will force my network to say d given that I have + +0:18:44.510,0:18:50.000 +g,h,i, I will force the network to come up with j. Given m,n,o, + +0:18:50.000,0:18:54.980 +I want p, given s,t,u, I want v. So how can you actually make + +0:18:54.980,0:18:59.660 +sure you understand what I'm saying whenever you are able to predict my next + +0:18:59.660,0:19:04.010 +world you're actually able to you know you basically know in already what I'm + +0:19:04.010,0:19:11.720 +saying right yeah so by trying to predict an upcoming word you're going to + +0:19:11.720,0:19:15.170 +be showing some kind of comprehension of whatever is going to be this temporal + +0:19:15.170,0:19:22.700 +information in the data all right so after we get the beds we have so how + +0:19:22.700,0:19:26.510 +does it work let's actually see you know and about a bit of a detail this is + +0:19:26.510,0:19:30.650 +gonna be my first output is going to be a batch with four items I feed this + +0:19:30.650,0:19:34.220 +inside the near corner all night and then my neural net we come up with a + +0:19:34.220,0:19:39.740 +prediction of the upcoming sample right and I will force that one to be my b,h,n,t + +0:19:39.740,0:19:47.450 +okay then I'm gonna be having my second input I will provide the previous + +0:19:47.450,0:19:53.420 +hidden state to the current RNN I will feel these inside and then I expect to + +0:19:53.420,0:19:58.670 +get the second line of the output the target right and then so on right I get + +0:19:58.670,0:20:03.410 +the next state and sorry the next input I get the next state and then I'm gonna + +0:20:03.410,0:20:07.700 +get inside the neural net the RNN I which I will try to force to get the + +0:20:07.700,0:20:13.840 +final target okay so far yeah each one is gonna be the output of the + +0:20:18.730,0:20:28.280 +internet recurrent neural net right I'll show you the equation before you have h[1] + +0:20:28.280,0:20:43.460 +comes out from this one right second the output I'm gonna be forcing the output + +0:20:43.460,0:20:48.170 +actually to be my target my next word in the sequence of letters right so I have + +0:20:48.170,0:20:52.610 +a sequence of words force my network to predict what's the next word given the + +0:20:52.610,0:21:02.480 +previous word know h1 is going to be fed inside here and you stuck the next word + +0:21:02.480,0:21:07.880 +the next word together with the previous state and then you'll do 
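A minimal sketch of the batched unrolling being walked through, assuming a batch of 4 symbol vectors and a BPTT window of 3; `rnn_cell`, `decoder` and the random tensors are illustrative stand-ins, not the word_language_model code itself:

```python
import torch

vocab_size, hidden_size, batch_size, bptt = 8, 16, 4, 3
rnn_cell = torch.nn.RNNCell(vocab_size, hidden_size)     # h[t] = f(W_hx x[t] + W_hh h[t-1] + b_h)
decoder = torch.nn.Linear(hidden_size, vocab_size)       # predicts the next symbol from h[t]
criterion = torch.nn.CrossEntropyLoss()

x = torch.randn(bptt, batch_size, vocab_size)            # stands in for the columns a,g,m,s / b,h,n,t / c,i,o,u
targets = torch.randint(vocab_size, (bptt, batch_size))  # symbols we force the net to emit: b,h,n,t / c,i,o,u / d,j,p,v

h = torch.zeros(batch_size, hidden_size)                 # initial hidden state is a vector of zeros
loss = 0.0
for t in range(bptt):                                    # unroll over the BPTT window
    h = rnn_cell(x[t], h)                                # carry the hidden state forward to the next step
    loss = loss + criterion(decoder(h), targets[t])      # force the prediction of the upcoming symbol
```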
a rotation of + +0:21:07.880,0:21:13.670 +the previous word with a previous sorry the new word with the next state the new + +0:21:13.670,0:21:17.720 +word with the previous state you'll do our rotation here find transformation + +0:21:17.720,0:21:21.230 +right and then you apply the non-linearity so you always get a new + +0:21:21.230,0:21:25.610 +word that is the current X and then you get the previous state just to see in + +0:21:25.610,0:21:30.650 +what state the system once and then you output a new output right and so we are + +0:21:30.650,0:21:35.000 +in this situation here we have a bunch of inputs I have my first input and then + +0:21:35.000,0:21:39.200 +I get the first output I have this internal memory that is sent forward and + +0:21:39.200,0:21:44.240 +then this network will now be aware of what happened here and then I input the + +0:21:44.240,0:21:49.450 +next input and so on I get the next output and I force the output to be the + +0:21:49.450,0:21:57.040 +output here the value inside the batch ok alright what's missing now + +0:21:57.070,0:22:00.160 +[Music] this is for PowerPoint drawing + +0:22:02.890,0:22:08.370 +constraint all right what's happening now so here I'm gonna be sending the + +0:22:08.370,0:22:13.300 +here I just drawn an arrow with the final h[T] but there is a slash on the + +0:22:13.300,0:22:16.780 +arrow what is the slash on the arrow who can + +0:22:16.780,0:22:27.100 +understand what the slash mean of course there will be there is gonna be the next + +0:22:27.100,0:22:31.570 +batch they're gonna be starting from here D and so on this is gonna be my + +0:22:31.570,0:22:46.690 +next batch d,j,p,v e,k,q,w and f,l,r,x. This slash here means do not back + +0:22:46.690,0:22:51.550 +propagate through okay so that one is gonna be calling dot detach in Porsche + +0:22:51.550,0:22:56.560 +which is gonna be stopping the gradient to be you know propagated back to + +0:22:56.560,0:23:01.450 +forever okay so this one say know that and so whenever I get the sorry no no + +0:23:01.450,0:23:06.970 +gradient such that when I input the next gradient the first input here it's gonna + +0:23:06.970,0:23:11.530 +be this guy over here and also of course without gradient such that we don't have + +0:23:11.530,0:23:17.170 +an infinite length RNN okay make sense yes + +0:23:17.170,0:23:24.640 +no I assume it's a yes okay so vanishing and exploding + +0:23:24.640,0:23:30.730 +gradients we touch them upon these also yesterday so again I'm kind of going a + +0:23:30.730,0:23:35.620 +little bit faster to the intent user so let's see how this works + +0:23:35.620,0:23:40.390 +so usually for our recurrent neural network you have an input you have a + +0:23:40.390,0:23:45.160 +hidden layer and then you have an output then this value of here how do you get + +0:23:45.160,0:23:50.680 +this information through here what what what does this R represent do you + +0:23:50.680,0:23:55.840 +remember the equation of the hidden layer so the new hidden layer is gonna + +0:23:55.840,0:24:01.050 +be the previous hidden layer which we rotate + +0:24:03.100,0:24:08.030 +alright so we rotate the previous hidden layer and so how do you rotate hidden + +0:24:08.030,0:24:15.220 +layers matrices right and so every time you see all ads on tile arrow there is a + +0:24:15.220,0:24:21.920 +rotation there is a matrix now if the you know this matrix can + +0:24:21.920,0:24:26.900 +change the sizing of your final output right so if you think about perhaps + +0:24:26.900,0:24:31.190 +let's say the determinant 
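The "slash on the arrow" is implemented by detaching the carried hidden state in PyTorch, so the next chunk starts from the same value but gradients stop at the chunk boundary; a tiny illustration (the shapes are made up):

```python
import torch

h = torch.zeros(4, 16, requires_grad=True)   # stand-in for the final hidden state h[T] of a chunk
h = h.detach()                               # same values, gradient history cut: the next chunk starts "fresh"
print(h.requires_grad)                       # False, so back-prop from the next chunk cannot run back forever
```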
right if the terminal is unitary it's a mapping the + +0:24:31.190,0:24:34.610 +same areas for the same area if it's larger than one they're going to be + +0:24:34.610,0:24:39.560 +getting you know this radians to getting larger and larger or if it's smaller + +0:24:39.560,0:24:44.660 +than I'm gonna get these gradients to go to zero whenever you perform the back + +0:24:44.660,0:24:48.920 +propagation in this direction okay so the problem here is that whenever we do + +0:24:48.920,0:24:53.390 +is send gradients back so the gains are going to be going down like that are + +0:24:53.390,0:24:57.800 +gonna be going like down like this then down like this way and down like this + +0:24:57.800,0:25:01.610 +way and also all down this way and so on right so the gradients are going to be + +0:25:01.610,0:25:06.380 +always going against the direction of the arrow in H ro has a matrix inside + +0:25:06.380,0:25:11.510 +right and again this matrix will affect how these gradients propagate and that's + +0:25:11.510,0:25:18.590 +why you can see here although we have a very bright input that one like gets + +0:25:18.590,0:25:23.720 +lost through oh well if you have like a gradient coming down here the gradient + +0:25:23.720,0:25:30.410 +gets you know kill over time okay so how do we fix that to fix this one we simply + +0:25:30.410,0:25:40.420 +remove the matrices in this horizontal operation does it make sense no yes no + +0:25:40.420,0:25:47.630 +the problem is that the next hidden state will have you know its own input + +0:25:47.630,0:25:52.910 +memory coming from the previous step through a matrix multiplication now this + +0:25:52.910,0:25:58.760 +matrix multiplication will affect what's gonna be the gradient that comes in the + +0:25:58.760,0:26:02.630 +other direction okay so whenever you have an output here you + +0:26:02.630,0:26:06.740 +have a final loss now you have the grade that are gonna be going against the + +0:26:06.740,0:26:12.050 +arrows up to the input the problem is that this gradient which is going + +0:26:12.050,0:26:16.910 +through the in the opposite direction of these arrows will be multiplied by the + +0:26:16.910,0:26:22.460 +matrix right the transpose of the matrix and again these matrices will affect + +0:26:22.460,0:26:26.030 +what is the overall norm of this gradient right and it will be all + +0:26:26.030,0:26:28.310 +killing it you have vanishing gradient or you're + +0:26:28.310,0:26:32.690 +gonna have exploding the gradient which is going to be whenever is going to be + +0:26:32.690,0:26:37.880 +getting amplified right so in order to be avoiding that we have to avoid so you + +0:26:37.880,0:26:41.960 +can see this is a very deep network so recurrently our network where the first + +0:26:41.960,0:26:45.320 +deep networks back in the night is actually and the word + +0:26:45.320,0:26:49.850 +depth was actually in time which and of course they were facing the same issues + +0:26:49.850,0:26:54.350 +we face with deep learning in modern day days where ever we were still like + +0:26:54.350,0:26:58.450 +stacking several layers we were observing that the gradients get lost as + +0:26:58.450,0:27:05.750 +depth right so how do we solve gradient getting lost through the depth in a + +0:27:05.750,0:27:08.770 +current days skipping constant connection right the + +0:27:11.270,0:27:15.530 +receiver connections we use and similarly here we can use skip + +0:27:15.530,0:27:21.860 +connections as well when we go down well up in in time okay so let's see how this + 
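One standard way to write the effect being described, assuming the recurrence $h[k] = f(a[k])$ with pre-activation $a[k] = W_{hh}\,h[k-1] + W_{hx}\,x[k] + b_h$:

$$
\frac{\partial h[T]}{\partial h[t]}
= \prod_{k=t+1}^{T} \frac{\partial h[k]}{\partial h[k-1]}
= \prod_{k=t+1}^{T} \operatorname{diag}\!\big(f'(a[k])\big)\, W_{hh} .
$$

If the largest singular value of $W_{hh}$ stays below 1 this product shrinks exponentially with the time gap (vanishing gradients); if it stays above 1 it can blow up (exploding gradients). Removing the weight matrix from that horizontal path is exactly what the gated architectures introduced next do.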
+0:27:21.860,0:27:30.500 +works yeah so the problem is that the + +0:27:30.500,0:27:34.250 +gradients are only going in the backward paths right back + +0:27:34.250,0:27:38.990 +[Music] well the gradient has to go the same way + +0:27:38.990,0:27:42.680 +it went forward by the opposite direction right I mean you're computing + +0:27:42.680,0:27:46.970 +chain rule so if you have a function of a function of a function then you just + +0:27:46.970,0:27:52.220 +use those functions to go back right the point is that whenever you have these + +0:27:52.220,0:27:55.790 +gradients coming back they will not have to go through matrices therefore also + +0:27:55.790,0:28:01.250 +the forward part has not doesn't have to go through the matrices meaning that the + +0:28:01.250,0:28:07.310 +memory cannot go through matrix multiplication if you don't want to have + +0:28:07.310,0:28:11.770 +this effect when you perform back propagation okay + +0:28:14.050,0:28:19.420 +yeah it's gonna be worth much better working I show you in the next slide + +0:28:19.420,0:28:22.539 +[Music] show you next slide + +0:28:27.740,0:28:32.270 +so how do we fix this problem well instead of using one recurrent neural + +0:28:32.270,0:28:36.650 +network we're gonna using for recurrent neural network okay so the first + +0:28:36.650,0:28:41.510 +RNN on the first network is gonna be the one that goes + +0:28:41.510,0:28:46.370 +from the input to this intermediate state then I have other three networks + +0:28:46.370,0:28:51.410 +and each of those are represented by these three symbols 1 2 & 3. + +0:28:51.410,0:28:56.870 +okay think about this as our open mouth and it's like a closed mouth okay like + +0:28:56.870,0:29:04.580 +the emoji okay so if you use this kind of for net for recurrent neural network + +0:29:04.580,0:29:09.740 +be regular Network you gotta have for example from the input I send things + +0:29:09.740,0:29:14.390 +through in the open mouth therefore it gets here I have a closed mouth here so + +0:29:14.390,0:29:18.920 +nothing goes forward then I'm gonna have this open mouth here such that the + +0:29:18.920,0:29:23.600 +history goes forward so the history gets sent forward without going through a + +0:29:23.600,0:29:29.120 +neural network matrix multiplication it just gets through our open mouth and + +0:29:29.120,0:29:34.670 +all the other inputs find a closed mouth so the hidden state will not change upon + +0:29:34.670,0:29:40.820 +new inputs okay and then here you're gonna have a open mouth here such that + +0:29:40.820,0:29:44.960 +you can get the final output here then the open mouth keeps going here such + +0:29:44.960,0:29:48.560 +that you have another output there and then finally you get the last closed + +0:29:48.560,0:29:54.620 +mouth at the last one now if you perform back prop you will have the gradients + +0:29:54.620,0:29:58.880 +flowing through the open mouth and you don't get any kind of matrix + +0:29:58.880,0:30:04.400 +multiplication so now let's figure out how these open mouths are represented + +0:30:04.400,0:30:10.010 +how are they instantiated in like in in terms of mathematics is it clear design + +0:30:10.010,0:30:13.130 +right so now we are using open and closed mouths and each of those mouths + +0:30:13.130,0:30:17.880 +is plus the the first guy here that connects the input to the hidden are + +0:30:17.880,0:30:25.580 +brn ends so these on here that is a gated recurrent network it's simply for + +0:30:25.580,0:30:32.060 +normal recurrent neural network combined in a clever way such 
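A toy illustration of the "open/closed mouth" idea: the gates are sigmoid outputs, so when they saturate near 0 or 1 they act element-wise as switches on whatever they multiply (the numbers here are made up):

```python
import torch

gate = torch.tensor([0.0, 1.0, 1.0])      # closed, open, open "mouths" (saturated sigmoids)
memory = torch.tensor([2.5, -1.0, 0.3])   # whatever the gate is modulating
print(gate * memory)                      # tensor([ 0.0000, -1.0000,  0.3000]): blocked, passed, passed
```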
that you have + +0:30:32.060,0:30:37.920 +multiplicative interaction and not matrix interaction is it clear so far + +0:30:37.920,0:30:42.000 +this is like intuition I haven't shown you how all right so let's figure out + +0:30:42.000,0:30:48.570 +who made this and how it works okay so we're gonna see now those long short + +0:30:48.570,0:30:55.530 +term memory or gated recurrent neural networks so I'm sorry okay that was the + +0:30:55.530,0:30:59.730 +dude okay this is the guy who actually invented this stuff actually him and his + +0:30:59.730,0:31:07.620 +students back some in 1997 and we were drinking here together okay all right so + +0:31:07.620,0:31:14.010 +that is the question of a recurrent neural network and on the top left are + +0:31:14.010,0:31:18.000 +you gonna see in the diagram so I just make a very compact version of this + +0:31:18.000,0:31:23.310 +recurrent neural network here is going to be the collection of equations that + +0:31:23.310,0:31:27.840 +are expressed in a long short term memory they look a little bit dense so I + +0:31:27.840,0:31:32.970 +just draw it for you here okay let's actually goes through how this stuff + +0:31:32.970,0:31:36.320 +works so I'm gonna be drawing an interactive + +0:31:36.320,0:31:40.500 +animation here so you have your input gate here which is going to be an affine + +0:31:40.500,0:31:43.380 +transformation so all of these are recurrent Network write the same + +0:31:43.380,0:31:49.920 +equation I show you here so this input transformation will be multiplying my C + +0:31:49.920,0:31:55.440 +tilde which is my candidate gate here I have a don't forget gate which is + +0:31:55.440,0:32:01.920 +multiplying my previous value of my cell memory and then my Poppa stylist maybe + +0:32:01.920,0:32:08.100 +don't forget previous plus input ii i'm gonna show you now how it works then i + +0:32:08.100,0:32:12.600 +have my final hidden representations to be multiplication element wise + +0:32:12.600,0:32:17.850 +multiplication between my output gate and my you know whatever hyperbolic + +0:32:17.850,0:32:22.740 +tangent version of the cell such that things are bounded and then I have + +0:32:22.740,0:32:26.880 +finally my C tilde which is my candidate gate is simply + +0:32:26.880,0:32:31.110 +Anette right so you have one recurrent network one that modulates the output + +0:32:31.110,0:32:35.730 +one that modulates this is don't forget gate and this is the input gate + +0:32:35.730,0:32:40.050 +so all this interaction between the memory and the gates is a multiplicative + +0:32:40.050,0:32:44.490 +interaction and this forget input and don't forget the input and output are + +0:32:44.490,0:32:48.780 +all sigmoids and therefore they are going from 0 to 1 so I can multiply by a + +0:32:48.780,0:32:53.340 +0 you have a closed mouth or you can multiply by 1 if it's open mouth right + +0:32:53.340,0:33:00.120 +if you think about being having our internal linear volume which is below + +0:33:00.120,0:33:06.120 +minus 5 or above plus 5 okay such that you using the you use the gate in the + +0:33:06.120,0:33:11.940 +saturated area or 0 or 1 right you know the sigmoid so let's see how this stuff + +0:33:11.940,0:33:16.260 +works this is the output let's turn off the + +0:33:16.260,0:33:20.450 +output how do I do turn off the output I simply put a 0 + +0:33:20.450,0:33:26.310 +inside so let's say I have a purple internal representation see I put a 0 + +0:33:26.310,0:33:29.730 +there in the output gate the output is going to be multiplying a 0 with + 
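For reference, the standard LSTM equations consistent with the description above, with $\sigma$ the sigmoid and $\odot$ element-wise multiplication (the "don't forget" gate is the usual forget gate $f$):

$$
\begin{aligned}
i[t] &= \sigma\big(W_i\,[\,x[t];\,h[t-1]\,] + b_i\big) &&\text{(input gate)}\\
f[t] &= \sigma\big(W_f\,[\,x[t];\,h[t-1]\,] + b_f\big) &&\text{(don't-forget gate)}\\
o[t] &= \sigma\big(W_o\,[\,x[t];\,h[t-1]\,] + b_o\big) &&\text{(output gate)}\\
\tilde{c}[t] &= \tanh\big(W_c\,[\,x[t];\,h[t-1]\,] + b_c\big) &&\text{(candidate)}\\
c[t] &= f[t]\odot c[t-1] + i[t]\odot \tilde{c}[t] &&\text{(cell memory)}\\
h[t] &= o[t]\odot \tanh\big(c[t]\big) &&\text{(hidden state)}
\end{aligned}
$$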
+0:33:29.730,0:33:36.300 +something you get 0 okay then let's say I have a green one I have one then I + +0:33:36.300,0:33:40.830 +multiply one with the purple I get purple and then finally I get the same + +0:33:40.830,0:33:46.170 +value similarly I can control the memory and I can for example we set it in this + +0:33:46.170,0:33:51.240 +case I'm gonna be I have my internal memory see this is purple and then I + +0:33:51.240,0:33:57.450 +have here my previous guy which is gonna be blue I guess I have a zero here and + +0:33:57.450,0:34:01.500 +therefore the multiplication gives me a zero there I have here a zero so + +0:34:01.500,0:34:05.190 +multiplication is gonna be giving a zero at some two zeros and I get a zero + +0:34:05.190,0:34:09.690 +inside of memory so I just erase the memory and you get the zero there + +0:34:09.690,0:34:15.210 +otherwise I can keep the memory I still do the internal thing I did a new one + +0:34:15.210,0:34:19.919 +but I keep a wonder such that the multiplication gets blue the Sun gets + +0:34:19.919,0:34:25.649 +blue and then I keep sending out my bloom finally I can write such that I + +0:34:25.649,0:34:31.110 +can get a 1 in the input gate the multiplication gets purple then the I + +0:34:31.110,0:34:35.010 +set a zero in the don't forget such that the + +0:34:35.010,0:34:40.679 +we forget and then multiplication gives me zero I some do I get purple and then + +0:34:40.679,0:34:45.780 +I get the final purple output okay so here we control how to send how to write + +0:34:45.780,0:34:50.850 +in memory how to reset the memory and how to output something okay so we have + +0:34:50.850,0:35:04.770 +all different operation this looks like a computer - and in an yeah it is + +0:35:04.770,0:35:08.700 +assumed in this case to show you like how the logic works as we are like + +0:35:08.700,0:35:14.250 +having a value inside the sigmoid has been or below minus 5 or being above + +0:35:14.250,0:35:27.780 +plus 5 such that we are working as a switch 0 1 switch okay the network can + +0:35:27.780,0:35:32.790 +choose to use this kind of operation to me make sense I believe this is the + +0:35:32.790,0:35:37.110 +rationale behind how this network has been put together the network can decide + +0:35:37.110,0:35:42.690 +to do anything it wants usually they do whatever they want but this seems like + +0:35:42.690,0:35:46.800 +they can work at least if they've had to saturate the gates it looks like things + +0:35:46.800,0:35:51.930 +can work pretty well so in the remaining 15 minutes of kind of I'm gonna be + +0:35:51.930,0:35:56.880 +showing you two notebooks I kind of went a little bit faster because again there + +0:35:56.880,0:36:04.220 +is much more to be seen here in the notebooks so yeah + +0:36:10.140,0:36:17.440 +so this the the actual weight the actual gradient you care here is gonna be the + +0:36:17.440,0:36:21.970 +gradient with respect to previous C's right the thing you care is gonna be + +0:36:21.970,0:36:25.000 +basically the partial derivative of the current seen with respect to previous + +0:36:25.000,0:36:30.160 +C's such that you if you have the original initial C here and you have + +0:36:30.160,0:36:35.140 +multiple C over time you want to change something in the original C you still + +0:36:35.140,0:36:39.130 +have the gradient coming down all the way until the first C which comes down + +0:36:39.130,0:36:43.740 +to getting gradients through that matrix Wc here right so if you want to change + +0:36:46.660,0:36:52.089 +those weights here you just 
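The walkthrough above corresponds to these extreme gate settings, treating the saturated sigmoids as hard 0/1 switches:

$$
\begin{aligned}
\text{reset:}&\quad f[t]=0,\; i[t]=0 &&\Rightarrow\; c[t]=0\\
\text{keep:}&\quad f[t]=1,\; i[t]=0 &&\Rightarrow\; c[t]=c[t-1]\\
\text{write:}&\quad f[t]=0,\; i[t]=1 &&\Rightarrow\; c[t]=\tilde{c}[t]\\
\text{read or mute:}&\quad o[t]=1 \text{ or } 0 &&\Rightarrow\; h[t]=\tanh(c[t]) \text{ or } 0
\end{aligned}
$$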
go through the chain of multiplications that are + +0:36:52.089,0:36:56.890 +not involving any matrix multiplication as such that you when you get the + +0:36:56.890,0:37:00.490 +gradient it still gets multiplied by one all the time and it gets down to + +0:37:00.490,0:37:05.760 +whatever we want to do okay did I answer your question + +0:37:09.150,0:37:16.660 +so the matrices will change the amplitude of your gradient right so if + +0:37:16.660,0:37:22.000 +you have like these largest eigenvalue being you know 0.0001 every time you + +0:37:22.000,0:37:26.079 +multiply you get the norm of this vector getting killed right so you have like an + +0:37:26.079,0:37:31.569 +exponential decay in this case if my forget gate is actually always equal to + +0:37:31.569,0:37:37.510 +1 then you get c = c-t. What is the partial + +0:37:37.510,0:37:43.299 +derivative of c[t]/c[t-1]? + +0:37:43.299,0:37:48.579 +1 right so the parts of the relative that is the + +0:37:48.579,0:37:52.390 +thing that you actually multiply every time there's gonna be 1 so output + +0:37:52.390,0:37:57.609 +gradient output gradients can be input gradients right yeah i'll pavillions + +0:37:57.609,0:38:01.510 +gonna be implicit because it would apply the output gradient by the derivative of + +0:38:01.510,0:38:05.599 +this module right if the this module is e1 then the thing that is + +0:38:05.599,0:38:14.660 +here keeps going that is the rationale behind this now this is just for drawing + +0:38:14.660,0:38:24.710 +purposes I assumed it's like a switch okay such that I can make things you + +0:38:24.710,0:38:29.089 +know you have a switch on and off to show like how it should be working maybe + +0:38:29.089,0:38:46.579 +doesn't work like that but still it works it can work this way right yeah so + +0:38:46.579,0:38:50.089 +that's the implementation of pro question is gonna be simply you just pad + +0:38:50.089,0:38:55.069 +all the other sync when sees with zeros before the sequence so if you have + +0:38:55.069,0:38:59.920 +several several sequences yes several sequences that are of a different length + +0:38:59.920,0:39:03.619 +you just put them all aligned to the right + +0:39:03.619,0:39:08.960 +and then you put some zeros here okay such that you always have in the last + +0:39:08.960,0:39:14.599 +column the latest element if you put two zeros here it's gonna be a mess in right + +0:39:14.599,0:39:17.299 +in the code if you put the zeros in the in the beginning you just stop doing + +0:39:17.299,0:39:21.319 +back propagation when you hit the last symbol right so you start from here you + +0:39:21.319,0:39:25.460 +go back here so you go forward then you go back prop and stop whenever you + +0:39:25.460,0:39:29.599 +actually reach the end of your sequence if you pad on the other side you get a + +0:39:29.599,0:39:34.730 +bunch of drop there in the next ten minutes so you're gonna be seen two + +0:39:34.730,0:39:45.049 +notebooks if you don't have other questions okay wow you're so quiet okay + +0:39:45.049,0:39:49.970 +so we're gonna be going now for sequence classification alright so in this case + +0:39:49.970,0:39:54.589 +I'm gonna be I just really stuff loud out loud the goal is to classify a + +0:39:54.589,0:40:00.259 +sequence of elements sequence elements and targets are represented locally + +0:40:00.259,0:40:05.660 +input vectors with only one nonzero bit so it's a one hot encoding the sequence + +0:40:05.660,0:40:10.770 +starts with a B for beginning and end with a E and otherwise consists of a + 
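The point about the forget gate can be written explicitly: along the cell-state path, treating the gate values as given,

$$
c[t] = f[t]\odot c[t-1] + i[t]\odot\tilde{c}[t]
\;\;\Rightarrow\;\;
\frac{\partial c[t]}{\partial c[t-1]} = \operatorname{diag}\big(f[t]\big),
$$

so with $f[t] = 1$ the gradient flowing back along the memory is multiplied by 1 at every step instead of by a weight matrix, which is why it neither vanishes nor explodes.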
+0:40:10.770,0:40:16.370 +randomly chosen symbols from a set {a, b, c, d} which are some kind of noise + +0:40:16.370,0:40:22.380 +expect for two elements in position t1 and t2 this position can be either or X + +0:40:22.380,0:40:29.460 +or Y in for the hard difficulty level you have for example that the sequence + +0:40:29.460,0:40:35.220 +length length is chose randomly between 100 and 110 10 t1 is randomly chosen + +0:40:35.220,0:40:40.530 +between 10 and 20 Tinto is randomly chosen between 50 and 60 there are four + +0:40:40.530,0:40:47.010 +sequences classes Q, R, S and U which depends on the temporal order of x and y so if + +0:40:47.010,0:40:53.520 +you have X,X you can be getting a Q. X,Y you get an R. Y,X you get an S + +0:40:53.520,0:40:57.750 +and Y,Y get U. You so we're going to be doing a sequence classification based on + +0:40:57.750,0:41:03.720 +the X and y or whatever those to import to these kind of triggers okay + +0:41:03.720,0:41:08.370 +and in the middle in the middle you can have a,b,c,d in random positions like you + +0:41:08.370,0:41:12.810 +know randomly generated is it clear so far so we do cast a classification of + +0:41:12.810,0:41:23.180 +sequences where you may have these X,X X,Y Y,X ou Y,Y. So in this case + +0:41:23.210,0:41:29.460 +I'm showing you first the first input so the return type is a tuple of sequence + +0:41:29.460,0:41:36.780 +of two which is going to be what is the output of this example generator and so + +0:41:36.780,0:41:43.050 +let's see what is what is this thing here so this is my data I'm going to be + +0:41:43.050,0:41:48.030 +feeding to the network so I have 1, 2, 3, 4, 5, 6, 7, 8 + +0:41:48.030,0:41:54.180 +different symbols here in a row every time why there are eight we + +0:41:54.180,0:42:02.970 +have X and Y and a, b, c and d beginning and end. So we have one hot out of you + +0:42:02.970,0:42:08.400 +know eight characters and then i have a sequence of rows which are my sequence + +0:42:08.400,0:42:12.980 +of symbols okay in this case you can see here i have a beginning with all zeros + +0:42:12.980,0:42:19.260 +why is all zeros padding right so in this case the sequence was shorter than + +0:42:19.260,0:42:21.329 +the expect the maximum sequence in the bed + +0:42:21.329,0:42:29.279 +and then the first first sequence has an extra zero item at the beginning in them + +0:42:29.279,0:42:34.859 +you're gonna have like in this case the second item is of the two a pole to pole + +0:42:34.859,0:42:41.160 +is the corresponding best class for example I have a batch size of 32 and + +0:42:41.160,0:42:51.930 +then I'm gonna have an output size of 4. Why 4 ? Q, R, S and U. + +0:42:51.930,0:42:57.450 +so I have 4 a 4 dimensional target vector and I have a sequence of 8 + +0:42:57.450,0:43:04.499 +dimensional vectors as input okay so let's see how this sequence looks like + +0:43:04.499,0:43:12.779 +in this case is gonna be BbXcXcbE. 
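A hedged illustration of the data format just described; the ordering of the 8-symbol alphabet below is an assumption, not necessarily the notebook's:

```python
import torch

alphabet = ['B', 'E', 'X', 'Y', 'a', 'b', 'c', 'd']   # 8 one-hot symbols (assumed ordering)
classes = ['Q', 'R', 'S', 'U']                        # X,X -> Q, X,Y -> R, Y,X -> S, Y,Y -> U

seq = 'BbXcXcbE'                                      # the example sequence shown above
x = torch.stack([torch.eye(len(alphabet))[alphabet.index(s)] for s in seq])  # shape (8, 8): one one-hot row per symbol
y = torch.tensor(classes.index('Q'))                  # both triggers are X, so the class is Q
```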
So X,X let's see X X X X is Q + +0:43:12.779,0:43:18.569 +right so we have our Q sequence and that's why the final target is a Q the 1 + +0:43:18.569,0:43:25.019 +0 0 0 and then you're gonna see B B X C so the second item and the second last + +0:43:25.019,0:43:30.390 +is gonna be B lowercase B you can see here the second item and the second last + +0:43:30.390,0:43:36.390 +item is going to be a be okay all right so let's now create a recurrent Network + +0:43:36.390,0:43:41.249 +in a very quick way so here I can simply say my recurrent network is going to be + +0:43:41.249,0:43:47.369 +torch and an RNN and I'm gonna be using a reader network really non-linearity + +0:43:47.369,0:43:52.709 +and then I have my final linear layer in the other case I'm gonna be using a led + +0:43:52.709,0:43:57.119 +STM and then I'm gonna have a final inner layer so I just execute these guys + +0:43:57.119,0:44:07.920 +I have my training loop and I'm gonna be training for 10 books so in the training + +0:44:07.920,0:44:13.259 +group you can be always looking for those five different steps first step is + +0:44:13.259,0:44:18.900 +gonna be get the data inside the model right so that's step number one what is + +0:44:18.900,0:44:30.669 +step number two there are five steps we remember hello + +0:44:30.669,0:44:35.089 +you feel that you feed the network if you feed the network with some data then + +0:44:35.089,0:44:41.539 +what do you do you compute the loss okay then we have compute step to compute the + +0:44:41.539,0:44:52.549 +loss fantastic number three is zero the cash right then number four which is + +0:44:52.549,0:45:09.699 +computing the off yes lost dog backwards lost not backward don't compute the + +0:45:09.699,0:45:16.449 +partial derivative of the loss with respect to the network's parameters yeah + +0:45:16.449,0:45:27.380 +here backward finally number five which is step in opposite direction of the + +0:45:27.380,0:45:31.819 +gradient okay all right those are the five steps you always want to see in any + +0:45:31.819,0:45:37.909 +training blueprint if someone is missing then you're [ __ ] up okay so we try now + +0:45:37.909,0:45:42.469 +the RNN and the LSTM and you get something looks like this + +0:45:42.469,0:45:55.929 +so our NN goes up to 50% in accuracy and then the LSTM got 100% okay oh okay + +0:45:56.439,0:46:06.019 +first of all how many weights does this LSTM have compared to the RNN four + +0:46:06.019,0:46:11.059 +times more weights right so it's not a fair comparison I would say because LSTM + +0:46:11.059,0:46:16.819 +is simply for rnns combined somehow right so this is a two layer neural + +0:46:16.819,0:46:20.659 +network whereas the other one is at one layer right always both ever like it has + +0:46:20.659,0:46:25.009 +one hidden layer they are an end if Alice TM we can think about having two + +0:46:25.009,0:46:33.199 +hidden so again one layer two layers well one hidden to lead in one set of + +0:46:33.199,0:46:37.610 +parameters four sets of the same numbers like okay not fair okay anyway + +0:46:37.610,0:46:43.610 +let's go with hundred iterations okay so now I just go with 100 iterations and I + +0:46:43.610,0:46:49.490 +show you how if they work or not and also when I be just clicking things such + +0:46:49.490,0:46:56.000 +that we have time to go through stuff okay now my computer's going to be + +0:46:56.000,0:47:02.990 +complaining all right so again what are the five types of operations like five + +0:47:02.990,0:47:06.860 +okay now is already done sorry I was 
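A hedged sketch of the two classifiers being compared and of the five training-loop steps listed above; layer sizes, variable names and the placeholder batch are assumptions rather than the notebook's exact code:

```python
import torch

input_size, hidden_size, num_classes = 8, 10, 4        # 8 one-hot symbols, 4 classes Q, R, S, U

class SeqClassifier(torch.nn.Module):
    def __init__(self, recurrent):
        super().__init__()
        self.recurrent = recurrent                      # an nn.RNN (ReLU non-linearity) or an nn.LSTM
        self.linear = torch.nn.Linear(hidden_size, num_classes)

    def forward(self, x):                               # x: (batch, seq_len, input_size)
        out, _ = self.recurrent(x)
        return self.linear(out[:, -1, :])               # classify from the last time step

rnn_model = SeqClassifier(torch.nn.RNN(input_size, hidden_size, batch_first=True, nonlinearity='relu'))
lstm_model = SeqClassifier(torch.nn.LSTM(input_size, hidden_size, batch_first=True))

model = lstm_model
criterion = torch.nn.CrossEntropyLoss()
optimiser = torch.optim.Adam(model.parameters())
x = torch.zeros(32, 12, input_size)                     # placeholder batch (batch size 32, arbitrary length 12)
y = torch.zeros(32, dtype=torch.long)                   # placeholder class indices

y_hat = model(x)                                        # 1. feed the network with the data
loss = criterion(y_hat, y)                              # 2. compute the loss
optimiser.zero_grad()                                   # 3. zero the cached gradients
loss.backward()                                         # 4. compute d(loss)/d(parameters)
optimiser.step()                                        # 5. step in the opposite direction of the gradient
```

Note that for the same hidden size the LSTM carries roughly four times as many recurrent parameters as the plain RNN, which is the fairness caveat raised above.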
going to do okay so this is + +0:47:06.860,0:47:16.280 +the RNN right RNN and finally actually gave to 100% okay so iron and it just + +0:47:16.280,0:47:20.030 +let it more time like a little bit more longer training actually works the other + +0:47:20.030,0:47:26.060 +one okay and here you can see that we got 100% in twenty eight bucks okay the + +0:47:26.060,0:47:30.650 +other case we got 2,100 percent in roughly twice as long + +0:47:30.650,0:47:35.690 +twice longer at a time okay so let's first see how they perform here so I + +0:47:35.690,0:47:42.200 +have this sequence BcYdYdaE which is a U sequence and then we ask the network + +0:47:42.200,0:47:46.760 +and he actually meant for actually like labels it as you okay so below we're + +0:47:46.760,0:47:51.140 +gonna be seeing something very cute so in this case we were using sequences + +0:47:51.140,0:47:56.870 +that are very very very very small right so even the RNN is able to train on + +0:47:56.870,0:48:02.390 +these small sequences so what is the point of using a LSTM well we can first + +0:48:02.390,0:48:07.430 +of all increase the difficulty of the training part and we're gonna see that + +0:48:07.430,0:48:13.280 +the RNN can be miserably failing whereas the LSTM keeps working in this + +0:48:13.280,0:48:19.790 +visualization part below okay I train a network now Alice and LSTM now with the + +0:48:19.790,0:48:26.000 +moderate level which has eighty symbols rather than eight or ten ten symbols so + +0:48:26.000,0:48:31.430 +you can see here how this model actually managed to succeed at the end although + +0:48:31.430,0:48:38.870 +there is like a very big spike and I'm gonna be now drawing the value of the + +0:48:38.870,0:48:43.970 +cell state over time okay so I'm going to be input in our sequence of eighty + +0:48:43.970,0:48:49.090 +symbols and I'm gonna be showing you what is the value of the hidden state + +0:48:49.090,0:48:53.330 +hidden State so in this case I'm gonna be showing you + +0:48:53.330,0:48:56.910 +[Music] hidden hold on + +0:48:56.910,0:49:01.140 +yeah I'm gonna be showing I'm gonna send my input through a hyperbolic tangent + +0:49:01.140,0:49:06.029 +such that if you're below minus 2.5 I'm gonna be mapping to minus 1 if you're + +0:49:06.029,0:49:12.329 +above 2.5 you get mapped to plus 1 more or less and so let's see how this stuff + +0:49:12.329,0:49:18.029 +looks so in this case here you can see that this specific hidden layer picked + +0:49:18.029,0:49:27.720 +on the X here and then it became red until you got the other X right so this + +0:49:27.720,0:49:33.710 +is visualizing the internal state of the LSD and so you can see that in specific + +0:49:33.710,0:49:39.599 +unit because in this case I use hidden representation like hidden dimension of + +0:49:39.599,0:49:47.700 +10 and so in this case the 1 2 3 4 5 the fifth hidden unit of the cell lay the + +0:49:47.700,0:49:52.829 +5th cell actually is trigger by observing the first X and then it goes + +0:49:52.829,0:49:58.410 +quiet after seen the other acts this allows me to basically you know take + +0:49:58.410,0:50:07.440 +care of I mean recognize if the sequence is U, P, R or S. 
Okay does it make sense okay + +0:50:07.440,0:50:14.519 +oh this one more notebook I'm gonna be showing just quickly which is the 09-echo_data + +0:50:14.519,0:50:22.410 +in this case I'm gonna be in South corner I'm gonna have a network echo in + +0:50:22.410,0:50:27.059 +whatever I'm saying so if I say something I asked a network to say if I + +0:50:27.059,0:50:30.960 +say something I asked my neighbor to say if I say something I ask ok Anderson + +0:50:30.960,0:50:42.150 +right ok so in this case here and I'll be inputting this is the first sequence + +0:50:42.150,0:50:50.579 +is going to be 0 1 1 1 1 0 and you'll have the same one here 0 1 1 1 1 0 and I + +0:50:50.579,0:50:57.259 +have 1 0 1 1 0 1 etc right so in this case if you want to output something + +0:50:57.259,0:51:00.900 +after some right this in this case is three time + +0:51:00.900,0:51:06.809 +step after you have to have some kind of short-term memory where you keep in mind + +0:51:06.809,0:51:11.780 +what I just said where you keep in mind what I just said where you keep in mind + +0:51:11.780,0:51:16.890 +[Music] what I just said yeah that's correct so + +0:51:16.890,0:51:22.099 +you know pirating actually requires having some kind of working memory + +0:51:22.099,0:51:27.569 +whereas the other one the language model which it was prompted prompted to say + +0:51:27.569,0:51:33.539 +something that I haven't already said right so that was a different kind of + +0:51:33.539,0:51:38.700 +task you actually had to predict what is the most likely next word in keynote you + +0:51:38.700,0:51:42.329 +cannot be always right right but this one you can always be right you know + +0:51:42.329,0:51:49.079 +this is there is no random stuff anyhow so I have my first batch here and then + +0:51:49.079,0:51:53.549 +the sec the white patch which is the same similar thing which is shifted over + +0:51:53.549,0:52:01.319 +time and then we have we have to chunk this long long long sequence so before I + +0:52:01.319,0:52:05.250 +was sending a whole sequence inside the network and I was enforcing the final + +0:52:05.250,0:52:09.569 +target to be something right in this case I had to chunk if the sequence goes + +0:52:09.569,0:52:13.319 +this direction I had to chunk my long sequence in little chunks and then you + +0:52:13.319,0:52:18.869 +have to fill the first chunk keep trace of whatever is the hidden state send a + +0:52:18.869,0:52:23.549 +new chunk where you feed and initially as the initial hidden state the output + +0:52:23.549,0:52:28.319 +of this chant right so you feed this chunk you have a final hidden state then + +0:52:28.319,0:52:33.960 +you feed this chunk and as you put you have to put these two as input to the + +0:52:33.960,0:52:38.430 +internal memory right now you feed the next chunk where you put this one as + +0:52:38.430,0:52:44.670 +input as to the internal state and you we are going to be comparing here RNN + +0:52:44.670,0:52:57.059 +with analyst TMS I think so at the end here you can see that okay we managed to + +0:52:57.059,0:53:02.789 +actually get we are an n/a accuracy that goes 100 100 percent then if you are + +0:53:02.789,0:53:08.220 +starting now to mess with the size of the memory chunk with a memory interval + +0:53:08.220,0:53:11.619 +you can be seen with the LSTM you can keep this memory + +0:53:11.619,0:53:16.399 +for a long time as long as you have enough capacity the RNN after you reach + +0:53:16.399,0:53:22.880 +some kind of length you start forgetting what happened in the past and it was + 
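A hedged sketch of the chunking pattern described for the echo task: the long bit sequence is cut into chunks, and the final hidden state of one chunk is passed, detached, as the initial state of the next (all names, sizes and the synthetic data are illustrative):

```python
import torch

rnn = torch.nn.RNN(input_size=1, hidden_size=16, batch_first=True)
readout = torch.nn.Linear(16, 1)
criterion = torch.nn.BCEWithLogitsLoss()
optimiser = torch.optim.Adam(list(rnn.parameters()) + list(readout.parameters()))

x_long = torch.randint(0, 2, (1, 1000, 1)).float()     # stand-in 0/1 sequence
y_long = torch.roll(x_long, shifts=3, dims=1)          # echo target: the same bits, three steps later

h, chunk = None, 100
for start in range(0, x_long.size(1), chunk):
    x = x_long[:, start:start + chunk]
    y = y_long[:, start:start + chunk]
    out, h = rnn(x, h)                                  # initial state of this chunk = final state of the previous one
    h = h.detach()                                      # keep the value, drop the gradient history
    loss = criterion(readout(out), y)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
```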
+0:53:22.880,0:53:29.809 +pretty much everything for today so stay warm wash your hands and I'll see you + +0:53:29.809,0:53:34.929 +next week bye bye From 4a340e1229cd167c5d70d1b752f083885e79e36d Mon Sep 17 00:00:00 2001 From: Leon Solon Date: Sat, 12 Mar 2022 22:31:09 -0300 Subject: [PATCH 3/3] [PT] fix: Alf reviews for PR #805 > > Co-authored-by: Felipe Schiavon Co-authored-by: Bernardo Lago " --- docs/pt/week03/03-1.md | 8 +- docs/pt/week03/lecture03.sbv | 3161 ++++++++++++++++--------------- docs/pt/week03/practicum03.sbv | 1164 ++++++------ docs/pt/week05/lecture05.sbv | 3188 ++++++++++++++++---------------- docs/pt/week05/practicum05.sbv | 828 ++++----- docs/pt/week06/06-2.md | 1 - docs/pt/week06/06-3.md | 2 +- docs/pt/week06/lecture06.sbv | 2944 ++++++++++++++--------------- docs/pt/week06/practicum06.sbv | 1162 ++++++------ 9 files changed, 6228 insertions(+), 6230 deletions(-) diff --git a/docs/pt/week03/03-1.md b/docs/pt/week03/03-1.md index e887a7733..5d3d0db79 100644 --- a/docs/pt/week03/03-1.md +++ b/docs/pt/week03/03-1.md @@ -392,7 +392,7 @@ Então, por que o Deep Learning deveria estar enraizado na ideia de que nosso mu -
Figura 11. Modelo de Simon Thorpe de fluxo de informações visuais no cérebro
+
Figura 11. Modelo de Simon Thorpe de fluxo de informações visuais no cérebro
@@ -422,7 +422,7 @@ Um outro insight do cérebro humano vem de Gallant & Van Essen, cujo modelo do c -
Figura 12. Modelo de Gallen e Van Essen das vias dorsais e ventrais no cérebro
+
Figura 12. Modelo de Gallant e Van Essen das vias dorsais e ventrais no cérebro
@@ -443,7 +443,7 @@ O lado direito mostra a via ventral, que indica o que você está olhando, enqua -
Figura 13. Experimentos de Hubel e Wiesel com estímulos visuais em cérebros de gatos
+
Figura 13. Experimentos de Hubel e Wiesel com estímulos visuais em cérebros de gatos
@@ -474,7 +474,7 @@ Outro tipo de neurônio, que eles chamaram de "células complexas", agregam a sa -
Figura 14. Modelo CNN de Fukushima
+
Figura 14. Modelo CNN de Fukushima
diff --git a/docs/pt/week03/lecture03.sbv b/docs/pt/week03/lecture03.sbv index a6444f462..0a0a91f44 100644 --- a/docs/pt/week03/lecture03.sbv +++ b/docs/pt/week03/lecture03.sbv @@ -1,3429 +1,3428 @@ 0:00:04.819,0:00:08.319 -In this case, we have a network which has an input on the left-hand side +Neste caso, temos uma rede que tem uma entrada do lado esquerdo 0:00:08.959,0:00:14.259 -Usually you have the input on the bottom side or on the left. They are pink in my slides +Normalmente você tem a entrada no lado inferior ou no lado esquerdo. Eles são rosa em meus slides 0:00:14.260,0:00:17.409 -So if you take notes, make them pink. No, just kidding! +Então, se você tomar notas, faça-as cor-de-rosa. Não, é brincadeira! 0:00:18.400,0:00:23.020 -And then we have... How many activations? How many hidden layers do you count there? +E então temos... Quantas ativações? Quantas camadas ocultas você conta lá? 0:00:23.539,0:00:27.789 -Four hidden layers. So overall how many layers does the network have here? +Quatro camadas ocultas. Então, no geral, quantas camadas a rede tem aqui? 0:00:28.820,0:00:32.980 -Six, right? Because we have four hidden, plus one input, plus one output layer +Seis, certo? Porque temos quatro camadas ocultas, mais uma de entrada, mais uma de saída 0:00:33.649,0:00:37.568 -So in this case, I have two neurons per layer, right? +Então, neste caso, eu tenho dois neurônios por camada, certo? 0:00:37.569,0:00:41.739 -So what does it mean? What are the dimensions of the matrices we are using here? +Então o que isso significa? Quais são as dimensões das matrizes que estamos usando aqui? 0:00:43.339,0:00:46.119 -Two by two. So what does that two by two matrix do? +Dois por dois. Então, o que essa matriz dois por dois faz? 0:00:48.739,0:00:51.998 -Come on! You have... You know the answer to this question +Vamos! Você tem... Você sabe a resposta para esta pergunta 0:00:53.359,0:00:57.579 -Rotation, yeah. Then scaling, then shearing and... +Rotação, sim. Depois dimensionar, depois cortar e... 0:00:59.059,0:01:05.469 -reflection. Fantastic, right? So we constrain our network to perform all the operations on the plan +reflexão. Fantástico, certo? Então, restringimos nossa rede para realizar todas as operações no plano 0:01:05.540,0:01:12.380 -We have seen the first time if I allow the hidden layer to be a hundred neurons long we can... +Vimos pela primeira vez se eu permitir que a camada oculta tenha uma centena de neurônios de comprimento, podemos... 0:01:12.380,0:01:13.680 -Wow okay! +Uau tudo bem! 0:01:13.680,0:01:15.680 -We can easily... +Podemos facilmente... 0:01:18.079,0:01:20.079 -Ah fantastic. What is it? +Ai fantástico. O que é isso? 0:01:21.170,0:01:23.170 -We are watching movies now. I see... +Estamos assistindo filmes agora. Eu vejo... 0:01:24.409,0:01:29.889 -See? Fantastic. What is it? Mandalorian is so cool, no? Okay... +Ver? Fantástico. O que é isso? Mandaloriano é tão legal, não? OK... 0:01:32.479,0:01:39.428 -Okay, how nice is this lesson. Is it even recorded? Okay, we have no idea +Ok, como é bom esta lição. É mesmo gravado? Ok, não temos ideia 0:01:40.789,0:01:43.719 -Okay, give me a sec. Okay, so we go here... +Certo, me dê um segundo. Ok, então vamos aqui... 0:01:47.810,0:01:49.810 -Done +Feito 0:01:50.390,0:01:52.070 -Listen +Ouvir 0:01:52.070,0:01:53.600 -All right +Tudo bem 0:01:53.600,0:01:59.679 -So we started from this network here, right? Which had this intermediate layer and we forced them to be +Então a gente começou dessa rede aqui, né? 
Que tinha essa camada intermediária e nós os forçamos a serem 0:02:00.289,0:02:05.229 -2-dimensional, right? Such that all the transformations are enforced to be on a plane +2 dimensões, certo? De tal forma que todas as transformações são forçadas para estar em um plano 0:02:05.270,0:02:08.319 -So this is what the network does to our plan +Então é isso que a rede faz com o nosso plano 0:02:08.319,0:02:14.269 -It folds it on specific regions, right? And those foldings are very abrupt +Ele dobra em regiões específicas, certo? E essas dobras são muito abruptas 0:02:14.370,0:02:18.499 -This is because all the transformations are performed on the 2d layer, right? +Isso porque todas as transformações são realizadas na camada 2d, certo? 0:02:18.500,0:02:22.550 -So this training took me really a lot of effort because the +Então esse treinamento me exigiu muito esforço porque o 0:02:23.310,0:02:25.310 -optimization is actually quite hard +otimização é realmente muito difícil 0:02:25.740,0:02:30.769 -Whenever I had a hundred-neuron hidden layer, that was very easy to train +Sempre que eu tinha uma camada oculta de cem neurônios, isso era muito fácil de treinar 0:02:30.770,0:02:35.299 -This one really took a lot of effort and you have to tell me why, okay? +Este realmente exigiu muito esforço e você tem que me dizer por quê, ok? 0:02:35.400,0:02:39.469 -If you don't know the answer right now, you'd better know the answer for the midterm +Se você não sabe a resposta agora, é melhor saber a resposta para o semestre 0:02:40.470,0:02:43.370 -So you can take note of what are the questions for the midterm... +Então você pode tomar nota de quais são as perguntas para o meio-termo... 0:02:43.980,0:02:49.600 -Right, so this is the final output of the network, which is also that 2d layer +Certo, então esta é a saída final da rede, que também é aquela camada 2d 0:02:50.010,0:02:55.489 -to the embedding, so I have no non-linearity on my last layer. And these are the final +para a incorporação, então não tenho não linearidade na minha última camada. E estes são os finais 0:02:56.370,0:03:01.850 -classification regions. So let's see what each layer does. This is the first layer, affine transformation +regiões de classificação. Então vamos ver o que cada camada faz. Esta é a primeira camada, transformação afim 0:03:01.850,0:03:06.710 -so it looks like it's a 3d rotation, but it's not right? It's just a 2d rotation +então parece que é uma rotação 3d, mas não está certo? É apenas uma rotação 2D 0:03:07.740,0:03:15.600 -reflection, scaling, and shearing. And then what is this part? Ah, what's happened right now? Do you see? +reflexão, escala e cisalhamento. E então qual é essa parte? Ah, o que aconteceu agora? Você vê? 0:03:17.820,0:03:21.439 -We have like the ReLU part, which is killing all the negative +Temos como a parte ReLU, que está matando todos os negativos 0:03:22.800,0:03:27.079 -sides of the network, right? Sorry, all the negative sides of this +lados da rede, certo? Desculpe, todos os lados negativos disso 0:03:28.080,0:03:33.499 -space, right? It is the second affine transformation and then here you apply again +espaço, certo? 
É a segunda transformação afim e aqui você aplica novamente 0:03:34.770,0:03:37.460 -the ReLU, you can see all the negative +o ReLU, você pode ver todos os pontos negativos 0:03:38.220,0:03:41.149 -subspaces have been erased and they've been set to zero +subespaços foram apagados e eles foram definidos como zero 0:03:41.730,0:03:44.509 -Then we keep going with a third affine transformation +Então continuamos com uma terceira transformação afim 0:03:45.120,0:03:46.790 -We zoom it... it's zooming a lot... +Nós ampliamos... está aumentando muito... 0:03:46.790,0:03:54.469 -And then you're gonna have the ReLU layer which is gonna be killing one of those... all three quadrants, right? +E então você vai ter a camada ReLU que vai matar um desses... todos os três quadrantes, certo? 0:03:54.470,0:03:59.240 -Only one quadrant survives every time. And then we go with the fourth affine transformation +Apenas um quadrante sobrevive a cada vez. E então vamos com a quarta transformação afim 0:03:59.790,0:04:06.200 -where it's elongating a lot because given that we confine all the transformation to be living in this space +onde está a alongar-se muito porque dado que confinamos toda a transformação a viver neste espaço 0:04:06.210,0:04:12.439 -it really needs to stretch and use all the power it can, right? Again, this is the +ele realmente precisa esticar e usar toda a energia que puder, certo? Novamente, este é o 0:04:13.170,0:04:18.589 -second last. Then we have the last affine transformation, which is the final one. And then we reach finally +penúltimo. Então temos a última transformação afim, que é a final. E então chegamos finalmente 0:04:19.320,0:04:20.910 -linearly separable +linearmente separável 0:04:20.910,0:04:26.359 -regions here. Finally, we're gonna see how each affine transformation can be +regiões aqui. Por fim, veremos como cada transformação afim pode ser 0:04:27.240,0:04:31.759 -split in each component. So we have the rotation, we have now squashing, like zooming +dividido em cada componente. Então, temos a rotação, agora temos o esmagamento, como o zoom 0:04:32.340,0:04:38.539 -Then we have rotation, reflection because the determinant is minus one, and then we have the final bias +Então temos rotação, reflexão porque o determinante é menos um, e então temos o viés final 0:04:38.539,0:04:42.769 -You have the positive part of the ReLU (Rectified Linear Unit), again rotation +Você tem a parte positiva da ReLU (Unidade Linear Retificada), novamente rotação 0:04:43.650,0:04:47.209 -flipping because we had a negative, a minus one determinant +lançando porque tínhamos um determinante negativo, menos um 0:04:47.849,0:04:49.849 -Zooming, rotation +Zoom, rotação 0:04:49.889,0:04:54.258 -One more reflection and then the final bias. This was the second affine transformation +Mais uma reflexão e depois o viés final. Esta foi a segunda transformação afim 0:04:54.259,0:04:58.609 -Then we have here the positive part again. We have third layer so rotation, reflection +Então temos aqui a parte positiva novamente. Nós temos a terceira camada então rotação, reflexão 0:05:00.000,0:05:05.629 -zooming and then we have... this is SVD decomposition, right? You should be aware of that, right? +zoom e então temos... esta é a decomposição SVD, certo? Você deve estar ciente disso, certo? 0:05:05.629,0:05:09.799 -You should know. And then final is the translation and the third +Você deveria saber. 
E então final é a tradução e a terceira 0:05:10.229,0:05:15.589 -ReLU, then we had the fourth layer, so rotation, reflection because the determinant was negative +ReLU, então tivemos a quarta camada, então rotação, reflexão porque o determinante era negativo 0:05:16.169,0:05:18.169 -zooming, again the other rotation +zoom, novamente a outra rotação 0:05:18.599,0:05:21.769 -Once more... reflection and bias +Mais uma vez... reflexão e preconceito 0:05:22.379,0:05:24.559 -Finally a ReLU and then we have the last... +Finalmente um ReLU e então temos o último... 0:05:25.259,0:05:27.259 -the fifth layer. So rotation +a quinta camada. Então rotação 0:05:28.139,0:05:32.059 -zooming, we didn't have reflection because the determinant was +1 +zoom, não tivemos reflexão porque o determinante foi +1 0:05:32.490,0:05:37.069 -Again, reflection in this case because the determinant was negative and then finally the final bias, right? +Novamente, reflexão neste caso porque o determinante foi negativo e depois finalmente o viés final, certo? 0:05:37.139,0:05:41.478 -And so this was pretty much how this network, which was +E foi assim que essa rede, que foi 0:05:42.599,0:05:44.599 -just made of +apenas feito de 0:05:44.759,0:05:46.759 -a sequence of layers of +uma sequência de camadas de 0:05:47.159,0:05:52.218 -neurons that are only two neurons per layer, is performing the classification task +neurônios que são apenas dois neurônios por camada, está realizando a tarefa de classificação 0:05:54.990,0:05:58.159 -And all those transformation have been constrained to be +E todas essas transformações foram restringidas a ser 0:05:58.680,0:06:03.199 -living on the plane. Okay, so this was really hard to train +morando no avião. Ok, então isso foi muito difícil de treinar 0:06:03.419,0:06:05.959 -Can you figure out why it was really hard to train? +Você consegue descobrir por que foi tão difícil treinar? 0:06:06.539,0:06:08.539 -What does it happen if my... +O que acontece se o meu... 0:06:09.270,0:06:16.219 -if my bias of one of the four layers puts my points away from the top right quadrant? +se meu viés de uma das quatro camadas afasta meus pontos do quadrante superior direito? 0:06:21.060,0:06:25.519 -Exactly, so if you have one of the four biases putting my +Exatamente, então se você tem um dos quatro preconceitos colocando meu 0:06:26.189,0:06:28.549 -initial point away from the top right quadrant +ponto inicial longe do quadrante superior direito 0:06:29.189,0:06:34.039 -then the ReLUs are going to be completely killing everything, and everything gets collapsed into zero +então os ReLUs estarão matando tudo completamente, e tudo será reduzido a zero 0:06:34.560,0:06:38.399 -Okay? And so there you can't do any more of anything, so +OK? E aí você não pode fazer mais nada, então 0:06:38.980,0:06:44.129 -this network here was really hard to train. If you just make it a little bit fatter than... +essa rede aqui foi muito difícil de treinar. Se você apenas torná-lo um pouco mais gordo do que... 0:06:44.320,0:06:48.659 -instead of constraining it to be two neurons for each of the hidden layers +em vez de restringi-lo a ser dois neurônios para cada uma das camadas ocultas 0:06:48.660,0:06:52.230 -then it is much easier to train. Or you can do a combination of the two, right? +então é muito mais fácil treinar. Ou você pode fazer uma combinação dos dois, certo? 
0:06:52.230,0:06:54.300 -So instead of having just a fat network +Então, em vez de ter apenas uma rede gorda 0:06:54.300,0:07:01.589 -you can have a network that is less fat, but then you have a few hidden layers, okay? +você pode ter uma rede menos gorda, mas aí você tem algumas camadas escondidas, ok? 0:07:02.770,0:07:06.659 -So the fatness is how many neurons you have per hidden layer, right? +Então a gordura é quantos neurônios você tem por camada oculta, certo? 0:07:07.810,0:07:11.429 -Okay. So the question is how do we determine the structure or the +OK. Então a questão é como determinamos a estrutura ou o 0:07:12.730,0:07:15.150 -configuration of our network, right? How do we design the network? +configuração da nossa rede, certo? Como projetamos a rede? 0:07:15.580,0:07:20.550 -And the answer is going to be, that's what Yann is gonna be teaching across the semester, right? +E a resposta vai ser, isso é o que Yann vai ensinar ao longo do semestre, certo? 0:07:20.550,0:07:27.300 -So keep your attention high because that's what we're gonna be teaching here +Então mantenha sua atenção alta porque é isso que vamos ensinar aqui 0:07:28.090,0:07:30.840 -That's a good question right? There is no +Essa é uma boa pergunta certo? Não há 0:07:32.410,0:07:34.679 -mathematical rule, there is a lot of experimental +regra matemática, há um monte de experimentos 0:07:35.710,0:07:39.569 -empirical evidence and a lot of people are trying different configurations +evidências empíricas e muitas pessoas estão tentando configurações diferentes 0:07:39.570,0:07:42.000 -We found something that actually works pretty well now. +Encontramos algo que realmente funciona muito bem agora. 0:07:42.100,0:07:46.200 -We're gonna be covering these architectures in the following lessons. Other questions? +Vamos cobrir essas arquiteturas nas lições a seguir. Outras perguntas? 0:07:48.790,0:07:50.790 -Don't be shy +Não seja tímido 0:07:51.880,0:07:56.130 -No? Okay, so I guess then we can switch to the second part of the class +Não? Ok, então acho que podemos mudar para a segunda parte da aula 0:07:57.880,0:08:00.630 -Okay, so we're gonna talk about convolutional nets today +Ok, então vamos falar sobre redes convolucionais hoje 0:08:02.710,0:08:05.879 -Let's dive right in. So I'll start with +Vamos mergulhar de cabeça. 
Então vou começar com 0:08:06.820,0:08:09.500 -something that's relevant to convolutional nets but not just [to them] +algo que é relevante para redes convolucionais, mas não apenas [para eles] 0:08:10.000,0:08:12.500 -which is the idea of transforming the parameters of a neural net +que é a ideia de transformar os parâmetros de uma rede neural 0:08:12.570,0:08:17.010 -So here we have a diagram that we've seen before except for a small twist +Então aqui temos um diagrama que vimos antes, exceto por uma pequena reviravolta 0:08:17.920,0:08:22.300 -The diagram we're seeing here is that we have a neural net G of X and W +O diagrama que estamos vendo aqui é que temos uma rede neural G de X e W 0:08:22.360,0:08:27.960 -W being the parameters, X being the input that makes a prediction about an output, and that goes into a cost function +W sendo os parâmetros, X sendo a entrada que faz uma previsão sobre uma saída, e isso entra em uma função de custo 0:08:27.960,0:08:29.500 -We've seen this before +Já vimos isso antes 0:08:29.500,0:08:34.500 -But the twist here is that the weight vector instead of being a +Mas a diferença aqui é que o vetor peso em vez de ser um 0:08:35.830,0:08:39.660 -parameter that's being optimized, is actually itself the output of some other function +parâmetro que está sendo otimizado, é na verdade a saída de alguma outra função 0:08:40.599,0:08:43.589 -possibly parameterized. In this case this function is +possivelmente parametrizado. Neste caso esta função é 0:08:44.320,0:08:50.369 -not a parameterized function, or it's a parameterized function but the only input is another parameter U, okay? +não é uma função parametrizada, ou é uma função parametrizada mas a única entrada é outro parâmetro U, ok? 0:08:50.750,0:08:56.929 -So what we've done here is make the weights of that neural net be the function of some more elementary... +Então, o que fizemos aqui é fazer com que os pesos dessa rede neural sejam a função de algo mais elementar... 0:08:57.480,0:08:59.480 -some more elementary parameters U +alguns parâmetros mais elementares U 0:09:00.420,0:09:02.420 -through a function and +através de uma função e 0:09:02.940,0:09:07.880 -you realize really quickly that backprop just works there, right? If you back propagate gradients +você percebe muito rapidamente que o backprop simplesmente funciona lá, certo? Se você voltar a propagar gradientes 0:09:09.210,0:09:15.049 -through the G function to get the gradient of whatever objective function we're minimizing with respect to the +através da função G para obter o gradiente de qualquer função objetivo que estamos minimizando em relação ao 0:09:15.600,0:09:21.290 -weight parameters, you can keep back propagating through the H function here to get the gradients with respect to U +parâmetros de peso, você pode continuar propagando através da função H aqui para obter os gradientes em relação a U 0:09:22.620,0:09:27.229 -So in the end you're sort of propagating things like this +Então, no final, você está meio que propagando coisas assim 0:09:30.600,0:09:42.220 -So when you're updating U, you're multiplying the Jacobian of the objective function with respect to the parameters, and then by the... +Então, quando você está atualizando U, você está multiplicando o Jacobiano da função objetivo em relação aos parâmetros, e então pelo... 0:09:42.750,0:09:46.760 -Jacobian of the H function with respect to its own parameters, okay? +Jacobiano da função H em relação aos seus próprios parâmetros, ok? 
0:09:46.760,0:09:50.960 -So you get the product of two Jacobians here, which is just what you get from back propagating +Então você obtém o produto de dois jacobianos aqui, que é exatamente o que você obtém da propagação de volta 0:09:50.960,0:09:54.919 -You don't have to do anything in PyTorch for this. This will happen automatically as you define the network +Você não precisa fazer nada no PyTorch para isso. Isso acontecerá automaticamente conforme você define a rede 0:09:59.130,0:10:03.080 -And that's kind of the update that occurs +E esse é o tipo de atualização que ocorre 0:10:03.840,0:10:10.820 -Now, of course, W being a function of U through the function H, the change in W +Agora, é claro, W sendo uma função de U através da função H, a mudança em W 0:10:12.390,0:10:16.460 -will be the change in U multiplied by the Jacobian of H transpose +será a mudança em U multiplicada pela Jacobiana de H transposta 0:10:18.090,0:10:24.739 -And so this is the kind of thing you get here, the effective change in W that you get without updating W +E esse é o tipo de coisa que você obtém aqui, a mudança efetiva no W que você obtém sem atualizar o W 0:10:24.740,0:10:30.260 ---you actually are updating U-- is the update in U multiplied by the Jacobian of H +--você realmente está atualizando U-- é a atualização em U multiplicada pelo jacobiano de H 0:10:30.690,0:10:37.280 -And we had a transpose here. We have the opposite there. This is a square matrix +E nós tivemos uma transposição aqui. Temos o oposto aí. Esta é uma matriz quadrada 0:10:37.860,0:10:41.720 -which is Nw by Nw, which is the number of... the dimension of W squared, okay? +que é Nw por Nw, que é o número de... a dimensão de W ao quadrado, ok? 0:10:42.360,0:10:44.690 -So this matrix here +Então essa matriz aqui 0:10:45.780,0:10:47.780 -has as many rows as +tem tantas linhas quanto 0:10:48.780,0:10:52.369 -W has components and then the number of columns is the number of +W tem componentes e então o número de colunas é o número de 0:10:52.560,0:10:57.470 -components of U. And then this guy, of course, is the other way around so it's an Nu by Nw +componentes de U. E então esse cara, é claro, é o contrário, então é um Nu por Nw 0:10:57.540,0:11:02.669 -So when you make the product, do the product of those two matrices you get an Nw by Nw matrix +Então quando você faz o produto, faça o produto dessas duas matrizes você obtém uma matriz Nw por Nw 0:11:03.670,0:11:05.670 -And then you multiply this by this +E então você multiplica isso por isso 0:11:06.190,0:11:10.380 -Nw vector and you get an Nw vector which is what you need for updating +Nw vetor e você obtém um vetor Nw que é o que você precisa para atualizar 0:11:11.440,0:11:13.089 -the weights +os pesos 0:11:13.089,0:11:16.828 -Okay, so that's kind of a general form of transforming the parameter space and there's +Ok, então essa é uma forma geral de transformar o espaço de parâmetros e há 0:11:18.430,0:11:22.979 -many ways you can use this and a particular way of using it is when +muitas maneiras de usar isso e uma maneira particular de usá-lo é quando 0:11:23.769,0:11:25.389 -H is what's called a... +H é o que se chama de... 
0:11:26.709,0:11:30.089 -what we talked about last week, which is a "Y connector" +sobre o que falamos na semana passada, que é um "conector Y" 0:11:30.089,0:11:35.578 -So imagine the only thing that H does is that it takes one component of U and it copies it multiple times +Então imagine que a única coisa que H faz é pegar um componente de U e copiá-lo várias vezes 0:11:36.029,0:11:40.000 -So that you have the same value, the same weight replicated across the G function +Para que você tenha o mesmo valor, o mesmo peso replicado na função G 0:11:40.000,0:11:43.379 -the G function we use the same value multiple times +a função G usamos o mesmo valor várias vezes 0:11:45.639,0:11:47.639 -So this would look like this +Então isso ficaria assim 0:11:48.339,0:11:50.339 -So let's imagine U is two dimensional +Então vamos imaginar que U é bidimensional 0:11:51.279,0:11:54.448 -u1, u2 and then W is four dimensional but +u1, u2 e então W é quadridimensional, mas 0:11:55.000,0:11:59.969 -w1 and w2 are equal to u1 and w3, w4 are equal to u2 +w1 e w2 são iguais a u1 e w3, w4 são iguais a u2 0:12:01.060,0:12:04.400 -So basically you only have two free parameters +Então, basicamente, você só tem dois parâmetros livres -0:12:04.700 -and when you're changing one component of U changing two components of W at the same time +0:12:04.700,0:00:00.000 +e quando você está alterando um componente de U alterando dois componentes de W ao mesmo tempo 0:12:08.560,0:12:14.579 -in a very simple manner. And that's called weight sharing, okay? When two weights are forced to be equal +de uma forma muito simples. E isso se chama compartilhamento de peso, ok? Quando dois pesos são forçados a serem iguais 0:12:14.579,0:12:19.200 -They are actually equal to a more elementary parameter that controls both +Eles são na verdade iguais a um parâmetro mais elementar que controla tanto 0:12:19.300,0:12:21.419 -That's weight sharing and that's kind of the basis of +Isso é compartilhamento de peso e essa é a base da 0:12:21.940,0:12:23.940 -a lot of +um monte de 0:12:24.670,0:12:26.880 -ideas... you know, convolutional nets among others +idéias... você sabe, redes convolucionais entre outras 0:12:27.730,0:12:31.890 -but you can think of this as a very simple form of H of U +mas você pode pensar nisso como uma forma muito simples de H de U 0:12:33.399,0:12:38.489 -So you don't need to do anything for this in the sense that when you have weight sharing +Então você não precisa fazer nada para isso no sentido de que quando você tem compartilhamento de peso 0:12:39.100,0:12:45.810 -If you do it explicitly with a module that does kind of a Y connection on the way back, when the gradients are back propagated +Se você fizer isso explicitamente com um módulo que faz uma conexão Y no caminho de volta, quando os gradientes são propagados de volta 0:12:45.810,0:12:47.800 -the gradients are summed up +os gradientes são somados 0:12:47.800,0:12:53.099 -so the gradient of some cost function with respect to u1, for example, will be the sum of the gradient so that +então o gradiente de alguma função de custo em relação a u1, por exemplo, será a soma do gradiente de modo que 0:12:53.199,0:12:55.559 -cost function with respect to w1 and w2 +função de custo em relação a w1 e w2 0:12:56.860,0:13:02.219 -And similarly for the gradient with respect to u2 would be the sum of the gradients with respect to w3 and w4, okay? +E da mesma forma para o gradiente em relação a u2 seria a soma dos gradientes em relação a w3 e w4, ok? 
0:13:02.709,0:13:06.328 -That's just the effect of backpropagating through the two Y connectors +Esse é apenas o efeito da retropropagação através dos dois conectores Y 0:13:13.310,0:13:19.119 -Okay, here is a slightly more general view of this parameter transformation that some people have called hypernetworks +Ok, aqui está uma visão um pouco mais geral dessa transformação de parâmetros que algumas pessoas chamam de hiperredes 0:13:19.970,0:13:23.350 -So a hypernetwork is a network where +Assim, uma hiper-rede é uma rede onde 0:13:23.839,0:13:28.299 -the weights of one network are computed as the output of another network +os pesos de uma rede são calculados como a saída de outra rede 0:13:28.459,0:13:33.969 -Okay, so you have a network H that looks at the input, it has its own parameters U +Ok, então você tem uma rede H que olha a entrada, ela tem seus próprios parâmetros U 0:13:35.569,0:13:37.929 -And it computes the weights of a second network +E calcula os pesos de uma segunda rede 0:13:38.959,0:13:44.199 -Okay? so the advantage of doing this... there are various names for it +OK? então a vantagem de fazer isso... existem vários nomes para isso 0:13:44.199,0:13:46.508 -The idea is very old, it goes back to the 80s +A ideia é muito antiga, remonta aos anos 80 0:13:46.880,0:13:52.539 -people using what's called multiplicative interactions, or three-way network, or sigma-pi units and they're basically +pessoas usando o que é chamado de interações multiplicativas, ou rede de três vias, ou unidades sigma-pi e eles são basicamente 0:13:53.600,0:13:59.050 -this idea --and this is maybe a slightly more general general formulation of it +esta ideia -- e esta é talvez uma formulação geral um pouco mais geral dela 0:14:00.949,0:14:02.949 -that you have sort of a dynamically +que você tem uma espécie de dinamismo 0:14:04.069,0:14:06.519 -Your function that's dynamically defined +Sua função que é definida dinamicamente 0:14:07.310,0:14:09.669 -In G of X and W +Em G de X e W 0:14:10.459,0:14:14.318 -Because W is really a complex function of the input and some other parameter +Porque W é realmente uma função complexa da entrada e algum outro parâmetro 0:14:16.189,0:14:17.959 -This is particularly +Isto é particularmente 0:14:17.959,0:14:22.419 -interesting architecture when what you're doing to X is transforming it in some ways +arquitetura interessante quando o que você está fazendo com o X está transformando-o de algumas maneiras 0:14:23.000,0:14:29.889 -Right? So you can think of W as being the parameters of that transformation, so Y would be a transformed version of X +Certo? Então você pode pensar em W como sendo os parâmetros dessa transformação, então Y seria uma versão transformada de X 0:14:32.569,0:14:37.809 -And the X, I mean the function H basically computes that transformation +E o X, quero dizer, a função H basicamente calcula essa transformação 0:14:38.899,0:14:41.739 -Okay? But we'll come back to that in a few weeks +OK? Mas voltaremos a isso em algumas semanas 0:14:42.829,0:14:46.209 -Just wanted to mention this because it's basically a small modification of +Só queria mencionar isso porque é basicamente uma pequena modificação do 0:14:46.579,0:14:52.869 -of this right? You just have one more wire that goes from X to H, and that's how you get those hypernetworks +deste certo? 
Você só tem mais um fio que vai de X a H, e é assim que você obtém essas hiper-redes 0:14:56.569,0:15:03.129 -Okay, so we're showing the idea that you can have one parameter controlling +Ok, então estamos mostrando a ideia de que você pode ter um parâmetro controlando 0:15:06.500,0:15:12.549 -multiple effective parameters in another network. And one reason that's useful is +vários parâmetros efetivos em outra rede. E uma razão que é útil é 0:15:13.759,0:15:16.779 -if you want to detect a motif on an input +se você quiser detectar um motivo em uma entrada 0:15:17.300,0:15:20.139 -And you want to detect this motif regardless of where it appears, okay? +E você quer detectar esse motivo independentemente de onde ele apareça, ok? 0:15:20.689,0:15:27.099 -So let's say you have an input, let's say it's a sequence but it could be an image, in this case is a sequence +Então digamos que você tenha uma entrada, digamos que é uma sequência, mas pode ser uma imagem, neste caso é uma sequência 0:15:27.100,0:15:28.000 -Sequence of vectors, let's say +Sequência de vetores, digamos 0:15:28.300,0:15:33.279 -And you have a network that takes a collection of three of those vectors, three successive vectors +E você tem uma rede que pega uma coleção de três desses vetores, três vetores sucessivos 0:15:34.010,0:15:36.339 -It's this network G of X and W and +É esta rede G de X e W e 0:15:37.010,0:15:42.249 -it's trained to detect a particular motif of those three vectors. Maybe this is... I don't know +ele é treinado para detectar um motivo específico desses três vetores. Talvez isso seja... eu não sei 0:15:42.889,0:15:44.750 -the power consumption +o consumo de energia 0:15:44.750,0:15:51.880 -Electrical power consumption, and sometimes you might want to be able to detect like a blip or a trend or something like that +Consumo de energia elétrica e, às vezes, você pode querer detectar um sinal ou uma tendência ou algo assim 0:15:52.519,0:15:54.519 -Or maybe it's, you know... +Ou talvez seja, você sabe... 0:15:56.120,0:15:58.120 -financial instruments of some kind +instrumentos financeiros de algum tipo 0:15:59.149,0:16:05.289 -Some sort of time series. Maybe it's a speech signal and you want to detect a particular sound that consists in three +Algum tipo de série temporal. Talvez seja um sinal de fala e você queira detectar um som específico que consiste em três 0:16:06.050,0:16:10.899 -vectors that define the sort of audio content of that speech signal +vetores que definem o tipo de conteúdo de áudio desse sinal de fala 0:16:12.440,0:16:15.709 -And so you'd like to be able to detect +E então você gostaria de ser capaz de detectar 0:16:15.709,0:16:20.469 -if it's a speech signal and there's a particular sound you need to detect for doing speech recognition +se for um sinal de fala e houver um som específico que você precisa detectar para fazer o reconhecimento de fala 0:16:20.470,0:16:22.630 -You might want to detect the sound +Você pode querer detectar o som 0:16:23.180,0:16:28.690 -The vowel P, right? The sound P wherever it occurs in a sequence +A vogal P, certo? O som P onde quer que ocorra em uma sequência 0:16:28.690,0:16:31.299 -You want some detector that fires when the sound P is... +Você quer algum detector que dispare quando o som P for... 0:16:33.589,0:16:41.439 -...is pronounced. And so what we'd like to have is a detector you can slide over and regardless of where this motif occurs +... é pronunciado. 
E então o que gostaríamos de ter é um detector que você possa deslizar e independentemente de onde esse motivo ocorra 0:16:42.470,0:16:47.500 -detect it. So what you need to have is some network, some parameterized function that... +detectá-lo. Então o que você precisa ter é alguma rede, alguma função parametrizada que... 0:16:48.920,0:16:55.029 -You have multiple copies of that function that you can apply to various regions on the input and they all share the same weight +Você tem várias cópias dessa função que você pode aplicar a várias regiões na entrada e todas compartilham o mesmo peso 0:16:55.029,0:16:58.600 -but you'd like to train this entire system end to end +mas você gostaria de treinar todo esse sistema de ponta a ponta 0:16:58.700,0:17:01.459 -So for example, let's say... +Então, por exemplo, digamos... 0:17:01.459,0:17:03.459 -Let's talk about a slightly more sophisticated +Vamos falar um pouco mais sofisticado 0:17:05.569,0:17:07.688 -thing here where you have... +coisa aqui onde você tem... 0:17:11.059,0:17:13.059 -Let's see... +Vamos ver... 0:17:14.839,0:17:17.349 -A keyword that's being being pronounced so +Uma palavra-chave que está sendo pronunciada de forma 0:17:18.169,0:17:22.959 -the system listens to sound and wants to detect when a particular keyword, a wakeup +o sistema ouve o som e deseja detectar quando uma determinada palavra-chave, uma ativação 0:17:24.079,0:17:28.329 -word has been has been pronounced, right? So this is Alexa, right? +palavra foi pronunciada, certo? Então essa é a Alexa, certo? 0:17:28.459,0:17:32.709 -And you say "Alexa!" and Alexa wakes up it goes bong, right? +E você diz "Alexa!" e Alexa acorda faz um bong, certo? 0:17:35.260,0:17:40.619 -So what you'd like to have is some network that kind of takes a window over the sound and then sort of keeps +Então, o que você gostaria de ter é uma rede que faça uma janela sobre o som e mantenha 0:17:41.890,0:17:44.189 -in the background sort of detecting +em segundo plano, detectando 0:17:44.860,0:17:47.219 -But you'd like to be able to detect +Mas você gostaria de ser capaz de detectar 0:17:47.220,0:17:52.020 -wherever the sound occurs within the frame that is being looked at, or it's been listened to, I should say +onde quer que o som ocorra dentro do quadro que está sendo observado, ou foi ouvido, devo dizer 0:17:52.300,0:17:56.639 -So you can have a network like this where you have replicated detectors +Então você pode ter uma rede como esta onde você tem detectores replicados 0:17:56.640,0:17:59.520 -They all share the same weight and then the output which is +Todos eles compartilham o mesmo peso e, em seguida, a saída que é 0:17:59.520,0:18:03.329 -the score as to whether something has been detected or not, goes to a max function +a pontuação para saber se algo foi detectado ou não, vai para uma função max 0:18:04.090,0:18:07.500 -Okay? And that's the output. And the way you train a system like this +OK? E essa é a saída. 
E a maneira como você treina um sistema como este 0:18:08.290,0:18:10.290 -you will have a bunch of samples +você terá um monte de amostras 0:18:10.780,0:18:14.140 -Audio examples where the keyword +Exemplos de áudio em que a palavra-chave 0:18:14.140,0:18:18.000 -has been pronounced and a bunch of audio samples where the keyword was not pronounced +foi pronunciada e um monte de amostras de áudio onde a palavra-chave não foi pronunciada 0:18:18.100,0:18:20.249 -And then you train a 2 class classifier +E então você treina um classificador de 2 classes 0:18:20.470,0:18:24.689 -Turn on when "Alexa" is somewhere in this frame, turn off when it's not +Ligue quando "Alexa" estiver em algum lugar neste quadro, desligue quando não estiver 0:18:25.059,0:18:30.899 -But nobody tells you where the word "Alexa" occurs within the window that you train the system on, okay? +Mas ninguém lhe diz onde a palavra "Alexa" ocorre dentro da janela em que você treina o sistema, ok? 0:18:30.900,0:18:35.729 -Because it's really expensive for labelers to look at the audio signal and tell you exactly +Porque é muito caro para os rotuladores olharem para o sinal de áudio e dizerem exatamente 0:18:35.730,0:18:37.570 -This is where the word "Alexa" is being pronounced +É aqui que a palavra "Alexa" está sendo pronunciada 0:18:37.570,0:18:42.720 -The only thing they know is that within this segment of a few seconds, the word has been pronounced somewhere +A única coisa que eles sabem é que dentro deste segmento de alguns segundos, a palavra foi pronunciada em algum lugar 0:18:43.450,0:18:48.390 -Okay, so you'd like to apply a network like this that has those replicated detectors? +Ok, então você gostaria de aplicar uma rede como esta que tem esses detectores replicados? 0:18:48.390,0:18:53.429 -You don't know exactly where it is, but you run through this max and you want to train the system to... +Você não sabe exatamente onde está, mas você percorre esse máximo e deseja treinar o sistema para... 0:18:53.950,0:18:59.370 -You want to back propagate gradient to it so that it learns to detect "Alexa", or whatever... +Você deseja propagar o gradiente de volta para que ele aprenda a detectar "Alexa" ou qualquer outra coisa ... 0:19:00.040,0:19:01.900 -wake up word occurs +palavra de despertar ocorre 0:19:01.900,0:19:09.540 -And so there what happens is you have those multiple copies --five copies in this example +E então o que acontece é que você tem várias cópias -- cinco cópias neste exemplo 0:19:09.580,0:19:11.580 -of this network and they all share the same weight +desta rede e todos compartilham o mesmo peso 0:19:11.710,0:19:16.650 -You can see there's just one weight vector sending its value to five different +Você pode ver que há apenas um vetor de peso enviando seu valor para cinco 0:19:17.410,0:19:22.559 -instances of the same network and so we back propagate through the +instâncias da mesma rede e então voltamos a propagar através do 0:19:23.260,0:19:27.689 -five copies of the network, you get five gradients, so those gradients get added up... +cinco cópias da rede, você obtém cinco gradientes, então esses gradientes são somados... 0:19:29.679,0:19:34.949 -for the parameter. Now, there's this slightly strange way this is implemented in PyTorch and other +para o parâmetro. 
Agora, há essa maneira um pouco estranha de implementar no PyTorch e em outros 0:19:35.740,0:19:41.760 -Deep Learning frameworks, which is that this accumulation of gradient in a single parameter is done implicitly +Frameworks de Deep Learning, que é que esse acúmulo de gradiente em um único parâmetro é feito implicitamente 0:19:42.550,0:19:46.659 -And it's one reason why before you do a backprop in PyTorch, you have to zero out the gradient +E é uma razão pela qual antes de fazer um backprop no PyTorch, você precisa zerar o gradiente 0:19:47.840,0:19:49.840 -Because there's sort of implicit +Porque há uma espécie de implícito 0:19:50.510,0:19:52.510 -accumulation of gradients when you do back propagation +acúmulo de gradientes quando você faz retropropagação 0:19:58.640,0:20:02.000 -Okay, so here's another situation where that would be useful +Ok, então aqui está outra situação em que isso seria útil 0:20:02.100,0:20:07.940 -And this is the real motivation behind conditional nets in the first place +E esta é a verdadeira motivação por trás das redes condicionais em primeiro lugar 0:20:07.940,0:20:09.940 -Which is the problem of +Qual é o problema de 0:20:10.850,0:20:15.000 -training a system to recognize the shape independently of the position +treinar um sistema para reconhecer a forma independentemente da posição 0:20:16.010,0:20:17.960 -of where the shape occurs +de onde a forma ocorre 0:20:17.960,0:20:22.059 -and whether there are distortions of that shape in the input +e se há distorções dessa forma na entrada 0:20:22.850,0:20:28.929 -So this is a very simple type of convolutional net that is has been built by hand. It's not been trained +Portanto, este é um tipo muito simples de rede convolucional que foi construída à mão. não foi treinado 0:20:28.929,0:20:30.929 -It's been designed by hand +Ele foi projetado à mão 0:20:31.760,0:20:36.200 -And it's designed explicitly to distinguish C's from D's +E é projetado explicitamente para distinguir C's de D's 0:20:36.400,0:20:38.830 -Okay, so you can draw a C on the input +Ok, então você pode desenhar um C na entrada 0:20:39.770,0:20:41.770 -image which is very low resolution +imagem com resolução muito baixa 0:20:43.880,0:20:48.459 -And what distinguishes C's from D's is that C's have end points, right? +E o que distingue os Cs dos Ds é que os Cs têm pontos finais, certo? 0:20:48.460,0:20:54.610 -The stroke kind of ends, and you can imagine designing a detector for that. Whereas these have corners +O curso meio que acaba, e você pode imaginar projetar um detector para isso. Considerando que estes têm cantos 0:20:55.220,0:20:59.679 -So if you have an endpoint detector or something that detects the end of a segment and +Então, se você tem um detector de endpoint ou algo que detecta o final de um segmento e 0:21:00.290,0:21:02.290 -a corner detector +um detector de canto 0:21:02.330,0:21:06.699 -Wherever you have corners detected, it's a D and wherever you have +Onde quer que você detecte cantos, é um D e onde quer que você tenha 0:21:07.700,0:21:09.700 -segments that end, it's a C +segmentos que terminam, é um C 0:21:11.870,0:21:16.989 -So here's an example of a C. You take the first detector, so the little +Então aqui está um exemplo de um C. Você pega o primeiro detector, então o pequeno 0:21:17.750,0:21:19.869 -black and white motif here at the top +motivo preto e branco aqui no topo 0:21:20.870,0:21:24.640 -is an endpoint detector, okay? It detects the end of a +é um detector de endpoint, ok? 
Detecta o fim de um 0:21:25.610,0:21:28.059 -of a segment and the way this +de um segmento e a forma como este 0:21:28.760,0:21:33.969 -is represented here is that the black pixels here... +é representado aqui é que os pixels pretos aqui... 0:21:35.840,0:21:37.929 -So think of this as some sort of template +Então pense nisso como algum tipo de modelo 0:21:38.990,0:21:43.089 -Okay, you're going to take this template and you're going to swipe it over the input image +Ok, você vai pegar este modelo e passar sobre a imagem de entrada 0:21:44.510,0:21:51.160 -and you're going to compare that template to the little image that is placed underneath, okay? +e você vai comparar esse modelo com a pequena imagem que está colocada embaixo, ok? 0:21:51.980,0:21:56.490 -And if those two match, the way you're going to determine whether they match is that you're going to do a dot product +E se esses dois corresponderem, a maneira como você determinará se eles correspondem é fazendo um produto escalar 0:21:56.490,0:22:03.930 -So you're gonna think of those black and white pixels as value of +1 or -1, say +1 for black and -1 for white +Então você vai pensar nesses pixels preto e branco como valor de +1 ou -1, digamos +1 para preto e -1 para branco 0:22:05.020,0:22:09.420 -And you're gonna think of those pixels also as being +1 for blacks and -1 for white and +E você vai pensar nesses pixels também como sendo +1 para pretos e -1 para branco e 0:22:10.210,0:22:16.800 -when you compute the dot product of a little window with that template +quando você calcula o produto escalar de uma pequena janela com esse modelo 0:22:17.400,0:22:22.770 -If they are similar, you're gonna get a large positive value. If they are dissimilar, you're gonna get a... +Se eles forem semelhantes, você obterá um grande valor positivo. Se eles são diferentes, você vai ter um... 0:22:24.010,0:22:27.629 -zero or negative value. Or a smaller value, okay? +valor zero ou negativo. Ou um valor menor, ok? 0:22:29.020,0:22:35.489 -So you take that little detector here and you compute the dot product with the first window, second window, third window, etc. +Então você pega esse pequeno detector aqui e calcula o produto escalar com a primeira janela, segunda janela, terceira janela, etc. 0:22:35.650,0:22:42.660 -You shift by one pixel every time for every location and you recall the result. And what you what you get is this, right? +Você muda um pixel toda vez para cada local e lembra o resultado. E o que você ganha é isso, certo? 0:22:42.660,0:22:43.660 -So this is... +Então isso é... 0:22:43.660,0:22:51.640 -Here the grayscale is an indication of the matching +Aqui a escala de cinza é uma indicação da correspondência 0:22:51.640,0:22:57.959 -which is actually the dot product between the vector formed by those values +que na verdade é o produto escalar entre o vetor formado por esses valores 0:22:58.100,0:23:05.070 -And the patch of the corresponding location on the input. So this image here is roughly the same size as that image +E o patch do local correspondente na entrada. Esta imagem aqui é aproximadamente do mesmo tamanho que aquela imagem 0:23:06.250,0:23:08.250 -minus border effects +menos efeitos de borda 0:23:08.290,0:23:13.469 -And you see there is a... whenever the output is dark there is a match +E você vê que há um... 
sempre que a saída estiver escura, há uma correspondência

0:23:14.380,0:23:16.380
-So you see a match here
+Então você vê uma correspondência aqui

0:23:16.810,0:23:20.249
-because this endpoint detector here matches the
+porque este detector de endpoint aqui corresponde ao

0:23:20.980,0:23:24.810
-the endpoint. You see sort of a match here at the bottom
+ponto final. Você vê uma espécie de correspondência aqui na parte inferior

0:23:25.630,0:23:27.930
-And the other kind of values are not as
+E os outros valores não são tão

0:23:28.750,0:23:32.459
-dark, okay? Not as strong if you want
+escuros, ok? Não tão fortes, se você quiser

0:23:33.250,0:23:38.820
-Now, if you threshold those those values you set the output to +1 if it's above the threshold
+Agora, se você aplicar um limiar a esses valores, você define a saída para +1 se estiver acima do limiar

0:23:39.520,0:23:41.520
-Zero if it's below the threshold
+Zero se estiver abaixo do limiar

0:23:42.070,0:23:46.499
-You get those maps here, you have to set the threshold appropriately but what you get is that
+Você obtém esses mapas aqui, você precisa definir o limiar adequadamente, mas o que você obtém é que

0:23:46.500,0:23:50.880
-this little guy here detected a match at the two end points of the C, okay?
+esse carinha aqui detectou uma correspondência nas duas pontas do C, ok?

0:23:52.150,0:23:54.749
-So now if you take this map and you sum it up
+Então agora, se você pegar este mapa e somar tudo

0:23:56.050,0:23:58.050
-Just add all the values
+Basta adicionar todos os valores

0:23:58.600,0:24:00.430
-You get a positive number
+Você obtém um número positivo

0:24:00.430,0:24:03.989
-Pass that through threshold, and that's your C detector. It's not a very good C detector
+Passe isso pelo limiar, e esse é o seu detector C. Não é um detector C muito bom

0:24:03.990,0:24:07.859
-It's not a very good detector of anything, but for those particular examples of C's
+Não é um detector muito bom de nada, mas para esses exemplos particulares de C's

0:24:08.429,0:24:10.210
-and maybe those D's
+e talvez aqueles D's

0:24:10.210,0:24:16.980
-It will work, it'll be enough. Now for the D is similar, those other detectors here are meant to detect the corners of the D
+Vai funcionar, vai ser o suficiente. Agora para o D é semelhante, esses outros detectores aqui destinam-se a detectar os cantos do D

0:24:17.679,0:24:24.538
-So this guy here, this detector, as you swipe it over the input will detect the
+Então, esse cara aqui, esse detector, conforme você passa sobre a entrada, detectará o

0:24:25.659,0:24:29.189
-upper left corner and that guy will detect the lower right corner
+canto superior esquerdo e esse cara detectará o canto inferior direito

0:24:29.649,0:24:33.689
-Once you threshold, you will get those two maps where the corners are detected
+Depois de aplicar o limiar, você obterá esses dois mapas onde os cantos são detectados

0:24:34.509,0:24:37.019
-and then you can sum those up and the
+e então você pode somar esses valores e o

0:24:37.360,0:24:44.729
-D detector will turn on. Now what you see here is an example of why this is good because that detection now is shift invariant
+detector D será ligado. Agora, o que você vê aqui é um exemplo de por que isso é bom, porque essa detecção agora é invariante ao deslocamento

0:24:44.730,0:24:49.169
-So if I take the same input D here, and I shift it by a couple pixels
+Então, se eu pegar a mesma entrada D aqui, e eu a deslocar por alguns pixels

0:24:50.340,0:24:56.279
-And I run this detector again, it will detect the motifs wherever they appear. The output will be shifted
+E eu executo este detector novamente, ele detectará os motivos onde quer que eles apareçam. A saída será deslocada

0:24:56.379,0:25:01.559
-Okay, so this is called equivariance to shift. So the output of that network
+Ok, então isso é chamado de equivariância ao deslocamento. Então a saída dessa rede

0:25:02.590,0:25:10.499
-is equivariant to shift, which means that if I shift the input the output gets shifted, but otherwise unchanged. Okay? That's equivariance
+é equivariante ao deslocamento, o que significa que, se eu deslocar a entrada, a saída será deslocada, mas, fora isso, inalterada. OK? Isso é equivariância

0:25:11.289,0:25:12.909
-Invariance would be
+A invariância seria

0:25:12.909,0:25:17.398
-if I shift it, the output will be completely unchanged but here it is modified
+se eu deslocar a entrada, a saída ficar completamente inalterada, mas aqui ela é modificada

0:25:17.399,0:25:19.739
-It just modified the same way as the input
+Ela é apenas modificada da mesma maneira que a entrada

0:25:23.950,0:25:31.080
-And so if I just sum up the activities in the feature maps here, it doesn't matter where they occur
+E se eu apenas somar as atividades nos mapas de recursos aqui, não importa onde elas ocorram

0:25:31.809,0:25:34.199
-My D detector will still activate
+Meu detector D ainda será ativado

0:25:34.929,0:25:38.998
-if I just compute the sum. So this is sort of a handcrafted
+se eu apenas calcular a soma. Então isso é uma espécie de

0:25:39.700,0:25:47.100
-pattern recognizer that uses local feature detectors and then kind of sums up their activity and what you get is an invariant detection
+reconhecedor de padrões feito à mão que usa detectores de recursos locais e, em seguida, meio que soma sua atividade, e o que você obtém é uma detecção invariante

0:25:47.710,0:25:52.529
-Okay, this is a fairly classical way actually of building certain types of pattern recognition systems
+Ok, esta é uma maneira bastante clássica de construir certos tipos de sistemas de reconhecimento de padrões

0:25:53.049,0:25:55.049
-Going back many years
+Que remonta a muitos anos

0:25:57.730,0:26:03.929
-But the trick here, what's important of course, what's interesting would be to learn those templates
+Mas o truque aqui, o importante, é claro, o interessante seria aprender esses modelos

0:26:04.809,0:26:10.258
-Can we view this as just a neural net and we back propagate to it and we learn those templates?
+Podemos ver isso apenas como uma rede neural, retropropagarmos por ela e aprendermos esses modelos?

0:26:11.980,0:26:18.779
-As weights of a neural net? After all we're using them to do that product which is a weighted sum, so basically
+Como pesos de uma rede neural? Afinal, estamos usando-os para fazer aquele produto que é uma soma ponderada, então basicamente

0:26:21.710,0:26:29.059
-This layer here to go from the input to those so-called feature maps that are weighted sums
+Esta camada aqui para ir da entrada para os chamados mapas de recursos que são somas ponderadas

0:26:29.520,0:26:33.080
-is a linear operation, okay? And we know how to back propagate through that
+é uma operação linear, ok? 
E sabemos como voltar a propagar através disso 0:26:35.850,0:26:41.750 -We'd have to use a kind of a soft threshold, a ReLU or something like this here because otherwise we can't do backprop +Teríamos que usar um tipo de limiar suave, um ReLU ou algo assim aqui porque senão não podemos fazer backprop 0:26:43.470,0:26:48.409 -Okay, so this operation here of taking the dot product of a bunch of coefficients +Ok, então esta operação aqui de pegar o produto escalar de um monte de coeficientes 0:26:49.380,0:26:53.450 -with an input window and then swiping it over, that's a convolution +com uma janela de entrada e, em seguida, passando-a, isso é uma convolução 0:26:57.810,0:27:03.409 -Okay, so that's the definition of a convolution. It's actually the one up there so this is in the one dimensional case +Ok, então essa é a definição de uma convolução. Na verdade, é o que está lá em cima, então este é o caso unidimensional 0:27:05.400,0:27:07.170 -where imagine you have +onde imagine que você tem 0:27:10.530,0:27:16.639 -An input Xj, so X indexed by the j in the index +Uma entrada Xj, então X indexado pelo j no índice 0:27:20.070,0:27:22.070 -You take a window +Você pega uma janela 0:27:23.310,0:27:26.029 -of X at a particular location i +de X em um determinado local i 0:27:27.330,0:27:30.080 -Okay, and then you sum +Ok, e então você soma 0:27:31.890,0:27:40.340 -You do a weighted sum of the window of the X values and you multiply those by the weights wⱼ's +Você faz uma soma ponderada da janela dos valores X e os multiplica pelos pesos wⱼ's 0:27:41.070,0:27:50.359 -Okay, and the sum presumably runs over a kind of a small window so j here would go from 1 to 5 +Ok, e a soma presumivelmente passa por uma espécie de pequena janela, então j aqui iria de 1 a 5 0:27:51.270,0:27:54.259 -Something like that, which is the case in the little example I showed earlier +Algo assim, que é o caso do pequeno exemplo que mostrei anteriormente 0:27:58.020,0:28:00.950 -and that gives you one Yi +e isso lhe dá um Yi 0:28:01.770,0:28:05.510 -Okay, so take the first window of 5 values of X +Ok, então pegue a primeira janela de 5 valores de X 0:28:06.630,0:28:13.280 -Compute the weighted sum with the weights, that gives you Y1. Then shift that window by 1, compute the weighted sum of the +Calcule a soma ponderada com os pesos, que lhe dá Y1. Em seguida, desloque essa janela por 1, calcule a soma ponderada dos 0:28:13.620,0:28:18.320 -dot product of that window by the Y's, that gives you Y2, shift again, etc. +produto escalar dessa janela pelos Y's, que lhe dá Y2, shift novamente, etc. 0:28:23.040,0:28:26.839 -Now, in practice when people implement in things like PyTorch +Agora, na prática, quando as pessoas implementam coisas como PyTorch 0:28:26.840,0:28:31.069 -there is a confusion between two things that mathematicians think are very different +há uma confusão entre duas coisas que os matemáticos pensam que são muito diferentes 0:28:31.070,0:28:37.009 -but in fact, they're pretty much the same. It's convolution and cross correlation. So in convolution, the convention is that the... +mas na verdade são praticamente iguais. É convolução e correlação cruzada. Então, em convolução, a convenção é que o... 0:28:37.979,0:28:44.359 -the index goes backwards in the window when it goes forwards in the weights +o índice retrocede na janela quando avança nos pesos 0:28:44.359,0:28:49.519 -In cross correlation, they both go forward. In the end, it's just a convention, it depends on how you lay... 
+Na correlação cruzada, ambos avançam. No final, é apenas uma convenção, depende de como você...

0:28:51.659,0:28:59.598
-organize the data and your weights. You can interpret this as a convolution if you read the weights backwards, so really doesn't make any difference
+organiza os dados e seus pesos. Você pode interpretar isso como uma convolução se ler os pesos de trás para frente, então realmente não faz diferença

0:29:01.259,0:29:06.949
-But for certain mathematical properties of a convolution if you want everything to be consistent you have to have the...
+Mas para certas propriedades matemáticas de uma convolução, se você quer que tudo seja consistente, você precisa ter o...

0:29:07.440,0:29:10.849
-The j in the W having an opposite sign to the j in the X
+O j no W com sinal oposto ao j no X

0:29:11.879,0:29:13.879
-So the two dimensional version of this...
+Então a versão bidimensional disso...

0:29:15.419,0:29:17.419
-If you have an image X
+Se você tem uma imagem X

0:29:17.789,0:29:21.258
-that has two indices --in this case i and j
+que tem dois índices --neste caso i e j

0:29:23.339,0:29:25.909
-You do a weighted sum over two indices k and l
+Você faz uma soma ponderada sobre dois índices k e l

0:29:25.909,0:29:31.368
-And so you have a window a two-dimensional window indexed by k and l and you compute the dot product
+E então você tem uma janela bidimensional indexada por k e l e calcula o produto escalar

0:29:31.769,0:29:34.008
-of that window over X with the...
+daquela janela sobre X com o...

0:29:35.099,0:29:39.679
-the weight, and that gives you one value in Yij which is the output
+o peso, e isso lhe dá um valor em Yij que é a saída

0:29:43.349,0:29:51.319
-So the vector W or the matrix W in the 2d version, there is obvious extensions of this to 3d and 4d, etc.
+Portanto, o vetor W ou a matriz W na versão 2d, há extensões óbvias disso para 3d e 4d, etc.

0:29:52.080,0:29:55.639
-It's called a kernel, it's called a convolutional kernel, okay?
+É chamado de kernel, é chamado de kernel convolucional, ok?

0:30:00.380,0:30:03.309
-Is it clear? I'm sure this is known for many of you but...
+Está claro? Tenho certeza de que isso é conhecido por muitos de vocês, mas...

0:30:10.909,0:30:13.449
-So what we're going to do with this is that
+Então, o que vamos fazer com isso é que

0:30:14.750,0:30:18.699
-We're going to organize... build a network as a succession of
+Vamos organizar... construir uma rede como uma sucessão de

0:30:20.120,0:30:23.769
-convolutions where in a regular neural net you have
+convoluções onde em uma rede neural regular você tem

0:30:25.340,0:30:29.100
-alternation of linear operators and pointwise non-linearity
+alternância de operadores lineares e não linearidade pontual

0:30:29.250,0:30:34.389
-In convolutional nets, we're going to have an alternation of linear operators that will happen to be convolutions, so multiple convolutions
+Em redes convolucionais, teremos uma alternância de operadores lineares que serão convoluções, então várias convoluções

0:30:34.940,0:30:40.179
-Then also pointwise non-linearity and there's going to be a third type of operation called pooling...
+Então também não linearidade pontual e haverá um terceiro tipo de operação chamada pooling...

0:30:42.620,0:30:44.620
-which is actually optional
+que na verdade é opcional

0:30:45.470,0:30:50.409
-Before I go further, I should mention that there are
+Antes de prosseguir, devo mencionar que existem

0:30:52.220,0:30:56.889
-twists you can make to this convolution. 
So one twist is what's called a stride +torções que você pode fazer para esta convolução. Então uma torção é o que é chamado de passo 0:30:57.380,0:31:01.239 -So a stride in a convolution consists in moving the window +Assim, um passo em uma convolução consiste em mover a janela 0:31:01.760,0:31:07.509 -from one position to another instead of moving it by just one value +de uma posição para outra em vez de movê-lo por apenas um valor 0:31:07.940,0:31:13.510 -You move it by two or three or four, okay? That's called a stride of a convolution +Você o move por dois ou três ou quatro, ok? Isso é chamado de passo de uma convolução 0:31:14.149,0:31:17.138 -And so if you have an input of a certain length and... +E então se você tem uma entrada de um certo comprimento e... 0:31:19.700,0:31:26.590 -So let's say you have an input which is kind of a one-dimensional and size 100 hundred +Então, digamos que você tenha uma entrada que é uma espécie de unidimensional e tamanho 100 cem 0:31:27.019,0:31:31.059 -And you have a convolution kernel of size five +E você tem um kernel de convolução de tamanho cinco 0:31:32.330,0:31:34.330 -Okay, and you convolve +Ok, e você se envolve 0:31:34.909,0:31:38.409 -this kernel with the input +este kernel com a entrada 0:31:39.350,0:31:46.120 -And you make sure that the window stays within the input of size 100 +E você garante que a janela fique dentro da entrada de tamanho 100 0:31:46.730,0:31:51.639 -The output you get has 96 outputs, okay? It's got the number of inputs +A saída que você obtém tem 96 saídas, ok? Tem o número de entradas 0:31:52.519,0:31:56.019 -minus the size of the kernel, which is 5 minus 1 +menos o tamanho do kernel, que é 5 menos 1 0:31:57.110,0:32:00.610 -Okay, so that makes it 4. So you get 100 minus 4, that's 96 +Ok, isso dá 4. Então você obtém 100 menos 4, que é 96 0:32:02.299,0:32:08.709 -That's the number of windows of size 5 that fit within this big input of size 100 +Esse é o número de janelas de tamanho 5 que cabem nessa grande entrada de tamanho 100 0:32:11.760,0:32:13.760 -Now, if I use this stride... +Agora, se eu usar este passo... 0:32:13.760,0:32:21.960 -So what I do now is I take my window of 5 where I applied the kernel and I shift not by one pixel but by 2 pixels +Então, o que eu faço agora é pegar minha janela de 5 onde apliquei o kernel e desloco não por um pixel, mas por 2 pixels 0:32:21.960,0:32:24.710 -Or two values, let's say. They're not necessarily pixels +Ou dois valores, digamos. Eles não são necessariamente pixels 0:32:26.310,0:32:31.880 -Okay, the number of outputs I'm gonna get is gonna be divided by two roughly +Ok, o número de saídas que vou obter será dividido por dois aproximadamente 0:32:33.570,0:32:36.500 -Okay, instead of 96 I'm gonna have +Ok, em vez de 96 eu vou ter 0:32:37.080,0:32:42.949 -a little less than 50, 48 or something like that. The number is not exact, you can... +um pouco menos de 50, 48 ou algo assim. O número não é exato, você pode... 
0:32:44.400,0:32:46.400 -figure it out in your head +descobrir isso na sua cabeça 0:32:47.430,0:32:51.470 -Very often when people run convolutions in convolutional nets they actually pad the convolution +Muitas vezes, quando as pessoas executam convoluções em redes convolucionais, elas realmente preenchem a convolução 0:32:51.470,0:32:59.089 -So they sometimes like to have the output being the same size as the input, and so they actually displace the input window +Então, às vezes, eles gostam de ter a saída do mesmo tamanho que a entrada e, na verdade, deslocam a janela de entrada 0:32:59.490,0:33:02.479 -past the end of the vector assuming that it's padded with zeros +passado o final do vetor assumindo que é preenchido com zeros 0:33:04.230,0:33:06.230 -usually on both sides +geralmente dos dois lados 0:33:16.110,0:33:19.849 -Does it have any effect on performance or is it just for convenience? +Tem algum efeito no desempenho ou é apenas por conveniência? 0:33:21.480,0:33:25.849 -If it has an effect on performance is bad, okay? But it is convenient +Se isso tem um efeito sobre o desempenho é ruim, ok? Mas é conveniente 0:33:28.350,0:33:30.350 -That's pretty much the answer +Essa é praticamente a resposta 0:33:32.700,0:33:37.800 -The assumption that's bad is assuming that when you don't have data it's equal to zero +A suposição que é ruim é assumir que quando você não tem dados é igual a zero 0:33:38.000,0:33:41.720 -So when your nonlinearities are ReLU, it's not necessarily completely unreasonable +Então, quando suas não linearidades são ReLU, não é necessariamente completamente irracional 0:33:43.650,0:33:48.079 -But it sometimes creates funny border effects (boundary effects) +Mas às vezes cria efeitos de borda engraçados (efeitos de fronteira) 0:33:51.120,0:33:53.539 -Okay, everything clear so far? +Ok, tudo claro até agora? 0:33:54.960,0:33:59.059 -Right. Okay. So what we're going to build is a +Certo. OK. Então, o que vamos construir é um 0:34:01.050,0:34:03.050 -neural net composed of those +rede neural composta por 0:34:03.690,0:34:08.120 -convolutions that are going to be used as feature detectors, local feature detectors +convoluções que serão usadas como detectores de recursos, detectores de recursos locais 0:34:09.090,0:34:13.069 -followed by nonlinearities, and then we're gonna stack multiple layers of those +seguido por não linearidades, e então vamos empilhar várias camadas dessas 0:34:14.190,0:34:18.169 -And the reason for stacking multiple layers is because +E a razão para empilhar várias camadas é porque 0:34:19.170,0:34:21.090 -We want to build +Nós queremos construir 0:34:21.090,0:34:25.809 -hierarchical representations of the visual world of the data +representações hierárquicas do mundo visual dos dados 0:34:26.089,0:34:32.258 -It's not... convolutional nets are not necessarily applied to images. They can be applied to speech and other signals +Não é... redes convolucionais não são necessariamente aplicadas a imagens. Eles podem ser aplicados à fala e outros sinais 0:34:32.299,0:34:35.619 -They basically can be applied to any signal that comes to you in the form of an array +Eles basicamente podem ser aplicados a qualquer sinal que chegue até você na forma de uma matriz 0:34:36.889,0:34:41.738 -And I'll come back to the properties that this array has to verify +E eu vou voltar para as propriedades que este array tem que verificar 0:34:43.789,0:34:45.789 -So what you want is... +Então o que você quer é... 
0:34:46.459,0:34:48.698 -Why do you want to build hierarchical representations? +Por que você quer construir representações hierárquicas? 0:34:48.699,0:34:54.369 -Because the world is compositional --and I alluded to this I think you the first lecture if remember correctly +Porque o mundo é composicional - e eu aludi a isso, acho que a primeira palestra se lembra corretamente 0:34:55.069,0:35:03.519 -It's the fact that pixes assemble to form simple motifs like oriented edges +É o fato de que os pixels se reúnem para formar motivos simples, como bordas orientadas 0:35:04.430,0:35:10.839 -Oriented edges kind of assemble to form local features like corners and T junctions and... +As arestas orientadas são montadas para formar recursos locais, como cantos e junções em T e ... 0:35:11.539,0:35:14.018 -things like that... gratings, you know, and... +coisas assim... grades, você sabe, e... 0:35:14.719,0:35:19.600 -then those assemble to form motifs that are slightly more abstract. +em seguida, esses se reúnem para formar motivos um pouco mais abstratos. 0:35:19.700,0:35:23.559 -Then those assemble to form parts of objects, and those assemble to form objects +Então, aqueles se reúnem para formar partes de objetos, e aqueles se reúnem para formar objetos 0:35:23.559,0:35:28.000 -So there is a sort of natural compositional hierarchy in the natural world +Portanto, há uma espécie de hierarquia de composição natural no mundo natural 0:35:28.100,0:35:33.129 -And this natural compositional hierarchy in the natural world is not just because of +E essa hierarquia de composição natural no mundo natural não é apenas por causa de 0:35:34.369,0:35:38.438 -perception --visual perception-- is true at a physical level, right? +percepção --percepção visual-- é verdade em um nível físico, certo? 0:35:41.390,0:35:46.808 -You start at the lowest level of the description +Você começa no nível mais baixo da descrição 0:35:47.719,0:35:50.079 -You have elementary particles and they form... +Você tem partículas elementares e elas formam... 0:35:50.079,0:35:56.438 -they clump to form less elementary particles, and they clump to form atoms, and they clump to form molecules, and molecules clump to form +eles se aglomeram para formar partículas menos elementares, e se aglomeram para formar átomos, e se aglomeram para formar moléculas, e as moléculas se aglomeram para formar 0:35:57.229,0:36:00.399 -materials, and materials parts of objects and +materiais e materiais partes de objetos e 0:36:01.130,0:36:03.609 -parts of objects into objects, and things like that, right? +partes de objetos em objetos, e coisas assim, certo? 0:36:04.670,0:36:07.599 -Or macromolecules or polymers, bla bla bla +Ou macromoléculas ou polímeros, bla bla bla 0:36:08.239,0:36:13.239 -And then you have this natural composition or hierarchy the world is built this way +E então você tem essa composição natural ou hierarquia, o mundo é construído dessa maneira 0:36:14.719,0:36:19.000 -And it may be why the world is understandable, right? +E pode ser por isso que o mundo é compreensível, certo? 
0:36:19.100,0:36:22.419 -So there's this famous quote from Einstein that says: +Então há esta famosa citação de Einstein que diz: 0:36:23.329,0:36:26.750 -"the most incomprehensible thing about the world is that the world is comprehensible" +"a coisa mais incompreensível sobre o mundo é que o mundo é compreensível" 0:36:26.800,0:36:30.069 -And it seems like a conspiracy that we live in a world that we are able to comprehend +E parece uma conspiração que vivemos em um mundo que somos capazes de compreender 0:36:31.130,0:36:35.019 -But we can comprehend it because the world is compositional and +Mas podemos compreendê-lo porque o mundo é composicional e 0:36:36.970,0:36:38.970 -it happens to be easy to build +acontece de ser fácil de construir 0:36:39.760,0:36:44.370 -brains in a compositional world that actually can interpret compositional world +cérebros em um mundo composicional que realmente pode interpretar o mundo composicional 0:36:45.580,0:36:47.580 -It still seems like a conspiracy to me +Ainda me parece uma conspiração 0:36:49.660,0:36:51.660 -So there's a famous quote from... +Então, há uma frase famosa de... 0:36:53.650,0:36:54.970 -from a... +a partir de um... 0:36:54.970,0:37:00.780 -Not that famous, but somewhat famous, from a statistician at Brown called Stuart Geman. +Não tão famoso, mas um tanto famoso, de um estatístico de Brown chamado Stuart Geman. 0:37:01.360,0:37:04.799 -And he says that sounds like a conspiracy, like magic +E ele diz que isso soa como uma conspiração, como mágica 0:37:06.070,0:37:08.070 -But you know... +Mas você sabe... 0:37:08.440,0:37:15.570 -If the world were not compositional we would need some even more magic to be able to understand it +Se o mundo não fosse composicional, precisaríamos de ainda mais magia para poder entendê-lo 0:37:17.260,0:37:21.540 -The way he says this is: "the world is compositional or there is a God" +A maneira como ele diz isso é: "o mundo é composicional ou existe um Deus" 0:37:25.390,0:37:32.339 -You would need to appeal to superior powers if the world was not compositional to explain how we can understand it +Você precisaria apelar para poderes superiores se o mundo não fosse composicional para explicar como podemos entendê-lo 0:37:35.830,0:37:37.830 -Okay, so this idea of hierarchy +Ok, então essa ideia de hierarquia 0:37:38.440,0:37:44.520 -and local feature detection comes from biology. So the whole idea of convolutional nets comes from biology. It's been +e a detecção de características locais vem da biologia. Portanto, toda a ideia de redes convolucionais vem da biologia. Tem sido 0:37:45.850,0:37:47.850 -so inspired by biology and +tão inspirado pela biologia e 0:37:48.850,0:37:53.399 -what you see here on the right is a diagram by Simon Thorpe who's a +o que você vê aqui à direita é um diagrama de Simon Thorpe, que é um 0:37:54.160,0:37:56.160 -psycho-physicist and +psicofísico e 0:37:56.500,0:38:02.939 -did some relatively famous experiments where he showed that the way we recognize everyday objects +fez alguns experimentos relativamente famosos onde mostrou que a maneira como reconhecemos objetos do cotidiano 0:38:03.580,0:38:05.969 -seems to be extremely fast. So if you show... +parece ser extremamente rápido. Então se você mostrar... 
0:38:06.640,0:38:10.409 -if you flash the image of an everyday object to a person and +se você mostrar a imagem de um objeto cotidiano para uma pessoa e 0:38:11.110,0:38:12.730 -you flash +você pisca 0:38:12.730,0:38:16.649 -one of them every 100 milliseconds or so, you realize that the +um deles a cada 100 milissegundos mais ou menos, você percebe que o 0:38:18.070,0:38:23.549 -the time it takes for a person to identify in a long sequence, whether there was a particular object, let's say a tiger +o tempo que leva para uma pessoa identificar em uma longa sequência, se havia um objeto em particular, digamos um tigre 0:38:25.780,0:38:27.640 -is about 100 milliseconds +é cerca de 100 milissegundos 0:38:27.640,0:38:34.769 -So the time it takes for brain to interpret an image and recognize basic objects in them is about 100 milliseconds +Portanto, o tempo que o cérebro leva para interpretar uma imagem e reconhecer objetos básicos nelas é de cerca de 100 milissegundos 0:38:35.650,0:38:37.740 -A tenth of a second, right? +Um décimo de segundo, certo? 0:38:39.490,0:38:42.120 -And that's just about the time it takes for the +E isso é quase o tempo que leva para o 0:38:43.000,0:38:45.000 -nerve signal to propagate from +sinal nervoso para se propagar 0:38:45.700,0:38:47.550 -the retina +a retina 0:38:47.550,0:38:54.090 -where images are formed in the eye to what's called the LGN (lateral geniculate nucleus) +onde as imagens são formadas no olho para o que é chamado de LGN (núcleo geniculado lateral) 0:38:54.340,0:38:56.340 -which is a small +que é um pequeno 0:38:56.350,0:39:02.640 -piece of the brain that basically does sort of contrast enhancement and gain control, and things like that +pedaço do cérebro que basicamente faz uma espécie de aprimoramento de contraste e ganha controle, e coisas assim 0:39:03.580,0:39:08.789 -And then that signal goes to the back of your brain v1. That's the primary visual cortex area +E então esse sinal vai para a parte de trás do seu cérebro v1. Essa é a área primária do córtex visual 0:39:09.490,0:39:15.600 -in humans and then v2, which is very close to v1. There's a fold that sort of makes v1 sort of +em humanos e depois v2, que é muito próximo de v1. Há uma dobra que meio que faz a v1 meio que 0:39:17.380,0:39:20.549 -right in front of v2, and there is lots of wires between them +bem na frente da v2, e há muitos fios entre eles 0:39:21.580,0:39:28.890 -And then v4, and then the inferior temporal cortex, which is on the side here and that's where object categories are represented +E então v4, e então o córtex temporal inferior, que está do lado aqui e é onde as categorias de objetos são representadas 0:39:28.890,0:39:35.369 -So there are neurons in your inferior temporal cortex that represent generic object categories +Portanto, existem neurônios em seu córtex temporal inferior que representam categorias genéricas de objetos 0:39:38.350,0:39:41.370 -And people have done experiments with this where... +E as pessoas fizeram experimentos com isso onde... 0:39:44.320,0:39:51.150 -epileptic patients are in hospital and have their skull open because they need to locate the... +pacientes epilépticos estão no hospital e têm o crânio aberto porque precisam localizar o... 
0:39:52.570,0:40:00.200 -exact position of the source of their epilepsy seizures +posição exata da fonte de suas crises de epilepsia 0:40:02.080,0:40:04.650 -And because they have electrodes on the surface of their brain +E porque eles têm eletrodos na superfície do cérebro 0:40:05.770,0:40:11.000 -you can show the movies and then observe if a particular neuron turns on for particular movies +você pode mostrar os filmes e observar se um neurônio específico liga para filmes específicos 0:40:11.100,0:40:14.110 -And you show them a movie with Jennifer Aniston and there is this +E você mostra a eles um filme com Jennifer Aniston e tem isso 0:40:14.110,0:40:17.900 -neuron that only turns on when Jennifer Aniston is there, okay? +neurônio que só liga quando Jennifer Aniston está lá, ok? 0:40:18.000,0:40:21.000 -It doesn't turn on for anything else as far as we could tell, okay? +Ele não liga para mais nada até onde sabemos, ok? 0:40:21.700,0:40:27.810 -So you seem to have very selective neurons in the inferior temporal cortex that react to a small number of categories +Então você parece ter neurônios muito seletivos no córtex temporal inferior que reagem a um pequeno número de categorias 0:40:30.760,0:40:35.669 -There's a joke, kind of a running joke, in neuroscience of a concept called the grandmother cell +Há uma piada, uma espécie de piada corrente, na neurociência de um conceito chamado célula avó 0:40:35.670,0:40:40.350 -So this is the one neuron in your inferior temporal cortex that turns on when you see your grandmother +Então este é o único neurônio em seu córtex temporal inferior que liga quando você vê sua avó 0:40:41.050,0:40:45.120 -regardless of what position what she's wearing, how far, whether it's a photo or not +independentemente da posição que ela está vestindo, a que distância, se é uma foto ou não 0:40:46.510,0:40:50.910 -Nobody really believes in this concept, what people really believe in is distributed representations +Ninguém realmente acredita nesse conceito, o que as pessoas realmente acreditam são representações distribuídas 0:40:50.910,0:40:54.449 -So there is no such thing as a cell that just turns on for you grandmother +Então não existe celular que só liga pra sua avó 0:40:54.970,0:41:00.820 -There are this collection of cells that turn on for various things and they serve to represent general categories +Há essa coleção de células que ligam para várias coisas e servem para representar categorias gerais 0:41:01.100,0:41:04.060 -But the important thing is that they are invariant to +Mas o importante é que eles são invariáveis 0:41:04.700,0:41:06.700 -position, size... +posição, tamanho... 0:41:06.920,0:41:11.080 -illumination, all kinds of different things and the real motivation behind +iluminação, todos os tipos de coisas diferentes e a verdadeira motivação por trás 0:41:11.930,0:41:14.349 -convolutional nets is to build +redes convolucionais é construir 0:41:15.140,0:41:18.670 -neural nets that are invariant to irrelevant transformation of the inputs +redes neurais que são invariantes à transformação irrelevante das entradas 0:41:19.510,0:41:27.070 -You can still recognize a C or D or your grandmother regardless of the position and to some extent the orientation, the style, etc. +Você ainda pode reconhecer um C ou D ou sua avó, independentemente da posição e, até certo ponto, da orientação, do estilo etc. 
0:41:29.150,0:41:36.790 -So this idea that the signal only takes 100 milliseconds to go from the retina to the inferior temporal cortex +Então essa ideia de que o sinal leva apenas 100 milissegundos para ir da retina ao córtex temporal inferior 0:41:37.160,0:41:40.330 -Seems to suggest that if you count the delay +Parece sugerir que se você contar o atraso 0:41:40.850,0:41:42.850 -to go through every neuron or every +passar por cada neurônio ou cada 0:41:43.340,0:41:45.489 -stage in that pathway +fase nesse caminho 0:41:46.370,0:41:48.880 -There's barely enough time for a few spikes to get through +Mal há tempo suficiente para alguns picos passarem 0:41:48.880,0:41:55.720 -So there's no time for complex recurrent computation, is basically a feed-forward process. It's very fast +Portanto, não há tempo para computação recorrente complexa, é basicamente um processo feed-forward. é muito rápido 0:41:56.930,0:41:59.980 -Okay, and we need it to be fast because that's a question of survival for us +Ok, e precisamos que seja rápido porque isso é uma questão de sobrevivência para nós 0:41:59.980,0:42:06.159 -There's a lot of... for most animals, you need to be able to recognize really quickly what's going on, particularly... +Há muito... para a maioria dos animais, você precisa ser capaz de reconhecer muito rapidamente o que está acontecendo, particularmente... 0:42:07.850,0:42:12.820 -fast-moving predators or preys for that matter +predadores ou presas em movimento rápido para esse assunto 0:42:17.570,0:42:20.830 -So that kind of suggests the idea that we can do +Então, isso sugere a ideia de que podemos fazer 0:42:21.920,0:42:26.230 -perhaps we could come up with some sort of neuronal net architecture that is completely feed-forward and +talvez possamos criar algum tipo de arquitetura de rede neuronal que seja completamente feed-forward e 0:42:27.110,0:42:29.110 -still can do recognition +ainda pode fazer o reconhecimento 0:42:30.230,0:42:32.230 -The diagram on the right +O diagrama à direita 0:42:34.430,0:42:39.280 -is from Gallent & Van Essen, so this is a type of sort of abstract +é de Gallent & Van Essen, então este é um tipo de resumo 0:42:39.920,0:42:43.450 -conceptual diagram of the two pathways in the visual cortex +diagrama conceitual das duas vias no córtex visual 0:42:43.490,0:42:50.530 -There is the ventral pathway and the dorsal pathway. The ventral pathway is, you know, basically the v1, v2, v4, IT hierarchy +Existe a via ventral e a via dorsal. O caminho ventral é basicamente a hierarquia v1, v2, v4, TI 0:42:50.530,0:42:54.999 -which is sort of from the back of the brain, and goes to the bottom and to the side and +que é meio que da parte de trás do cérebro, e vai para o fundo e para o lado e 0:42:55.280,0:42:58.179 -then the dorsal pathway kind of goes +então o caminho dorsal meio que vai 0:42:59.060,0:43:02.469 -through the top also towards the inferior temporal cortex and +pelo topo também em direção ao córtex temporal inferior e 0:43:04.040,0:43:09.619 -there is this idea somehow that the ventral pathway is there to tell you what you're looking at, right? +existe essa ideia de alguma forma que o caminho ventral está lá para dizer o que você está olhando, certo? 0:43:10.290,0:43:12.499 -The dorsal pathway basically identifies +A via dorsal basicamente identifica 0:43:13.200,0:43:15.200 -locations +Localizações 0:43:15.390,0:43:17.390 -geometry and motion +geometria e movimento 0:43:17.460,0:43:25.040 -Okay? 
So there is a pathway for what, and another pathway for where, and that seems fairly separate in the +OK? Portanto, há um caminho para o quê, e outro caminho para onde, e isso parece bastante separado no 0:43:25.040,0:43:29.030 -human or primate visual cortex +córtex visual humano ou primata 0:43:32.610,0:43:34.610 -And of course there are interactions between them +E é claro que há interações entre eles 0:43:39.390,0:43:45.499 -So various people had the idea of kind of using... so where does that idea come from? There is +Então várias pessoas tiveram a ideia de usar... então de onde vem essa ideia? Há 0:43:46.080,0:43:48.799 -classic work in neuroscience from the late 50s early 60s +trabalho clássico em neurociência do final dos anos 50 início dos anos 60 0:43:49.650,0:43:52.129 -By Hubel & Wiesel, they're on the picture here +Por Hubel & Wiesel, eles estão na foto aqui 0:43:53.190,0:43:57.440 -They won a Nobel Prize for it, so it's really classic work and what they showed +Eles ganharam um Prêmio Nobel por isso, então é um trabalho realmente clássico e o que eles mostraram 0:43:58.290,0:44:01.519 -with cats --basically by poking electrodes into cat brains +com gatos --basicamente enfiando eletrodos em cérebros de gatos 0:44:02.310,0:44:08.480 -is that neurons in the cat brain --in v1-- detect... +é que os neurônios no cérebro do gato --em v1-- detectam... 0:44:09.150,0:44:13.789 -are only sensitive to a small area of the visual field and they detect oriented edges +são sensíveis apenas a uma pequena área do campo visual e detectam bordas orientadas 0:44:14.970,0:44:17.030 -contours in that particular area, okay? +contornos nessa área em particular, ok? 0:44:17.880,0:44:22.160 -So the area to which a particular neuron is sensitive is called a receptive field +Assim, a área à qual um neurônio em particular é sensível é chamada de campo receptivo. 0:44:23.700,0:44:27.859 -And you take a particular neuron and you show it +E você pega um neurônio em particular e mostra 0:44:29.070,0:44:35.719 -kind of an oriented bar that you rotate, and at one point the neuron will fire +tipo de uma barra orientada que você gira, e em um ponto o neurônio irá disparar 0:44:36.270,0:44:40.640 -for a particular angle, and as you move away from that angle the activation of the neuron kind of +para um determinado ângulo, e à medida que você se afasta desse ângulo, a ativação do neurônio 0:44:42.690,0:44:50.149 -diminishes, okay? So that's called orientation selective neurons, and Hubel & Wiesel called it simple cells +diminui, ok? Então isso é chamado de neurônios seletivos de orientação, e Hubel & Wiesel chamaram de células simples 0:44:51.420,0:44:56.930 -If you move the bar a little bit, you go out of the receptive field, that neuron doesn't fire anymore +Se você mover um pouco a barra, você sai do campo receptivo, aquele neurônio não dispara mais 0:44:57.150,0:45:03.049 -it doesn't react to it. This could be another neuron almost exactly identical to it, just a little bit +não reage a isso. Este poderia ser outro neurônio quase exatamente idêntico a ele, apenas um pouco 0:45:04.830,0:45:09.620 -Away from the first one that does exactly the same function. It will react to a slightly different +Longe do primeiro que faz exatamente a mesma função. 
Ele vai reagir a um pouco diferente 0:45:10.380,0:45:12.440 -receptive field but with the same orientation +campo receptivo, mas com a mesma orientação 0:45:14.700,0:45:18.889 -So you start getting this idea that you have local feature detectors that are positioned +Então você começa a ter essa ideia de que tem detectores de recursos locais posicionados 0:45:20.220,0:45:23.689 -replicated all over the visual field, which is basically this idea of +replicado por todo o campo visual, que é basicamente essa ideia de 0:45:24.960,0:45:26.960 -convolution, okay? +convolução, ok? 0:45:27.870,0:45:33.470 -So they are called simple cells. And then another idea that or discovery that +Por isso são chamadas de células simples. E então outra ideia que ou descoberta que 0:45:35.100,0:45:40.279 -Hubel & Wiesel did is the idea of complex cells. So what a complex cell is is another type of neuron +Hubel & Wiesel fizeram é a ideia de células complexas. Então, o que é uma célula complexa é outro tipo de neurônio 0:45:41.100,0:45:45.200 -that integrates the output of multiple simple cells within a certain area +que integra a saída de várias células simples dentro de uma determinada área 0:45:46.170,0:45:50.120 -Okay? So they will take different simple cells that all detect +OK? Então eles vão pegar diferentes células simples que todas detectam 0:45:51.180,0:45:54.079 -contours at a particular orientation, edges at a particular orientation +contornos em uma orientação específica, bordas em uma orientação específica 0:45:55.350,0:46:02.240 -And compute an aggregate of all those activations. It will either do a max, or a sum, or +E calcule um agregado de todas essas ativações. Ele fará um máximo, ou uma soma, ou 0:46:02.760,0:46:08.239 -a sum of squares, or square root of sum of squares. Some sort of function that does not depend on the order of the arguments +uma soma de quadrados, ou raiz quadrada de soma de quadrados. Algum tipo de função que não depende da ordem dos argumentos 0:46:08.820,0:46:11.630 -Okay? Let's say max for the sake of simplicity +OK? Digamos max por uma questão de simplicidade 0:46:12.900,0:46:17.839 -So basically a complex cell will turn on if any of the simple cells within its +Então, basicamente, uma célula complexa será ativada se qualquer uma das células simples dentro de sua 0:46:19.740,0:46:22.399 -input group turns on +grupo de entrada é ativado 0:46:22.680,0:46:29.480 -Okay? So that complex cell will detect an edge at a particular orientation regardless of its position within that little region +OK? 
Assim, essa célula complexa detectará uma borda em uma orientação específica, independentemente de sua posição dentro dessa pequena região 0:46:30.210,0:46:32.210 -So it builds a little bit of +Então ele constrói um pouco de 0:46:32.460,0:46:34.609 -shift invariance of the +invariância de deslocamento do 0:46:35.250,0:46:40.159 -representation coming out of the complex cells with respect to small variation of positions of +representação que sai das células complexas com relação à pequena variação de posições de 0:46:40.890,0:46:42.890 -features in the input +características na entrada 0:46:46.680,0:46:52.010 -So a gentleman by the name of Kunihiko Fukushima +Então, um cavalheiro com o nome de Kunihiko Fukushima 0:46:54.420,0:46:56.569 ---No real relationship with the nuclear power plant +--Nenhuma relação real com a usina nuclear 0:46:58.230,0:47:00.230 -In the late 70s early 80s +No final dos anos 70 início dos anos 80 0:47:00.330,0:47:07.190 -experimented with computer models that sort of implemented this idea of simple cell / complex cell, and he had the idea of sort of replicating this +experimentou com modelos de computador que meio que implementavam essa ideia de célula simples / célula complexa, e ele teve a ideia de replicar isso 0:47:07.500,0:47:09.500 -with multiple layers, so basically... +com várias camadas, então basicamente... 0:47:11.310,0:47:17.810 -The architecture he did was very similar to the one I showed earlier here with this sort of handcrafted +A arquitetura que ele fez foi muito parecida com a que mostrei anteriormente aqui com esse tipo de artesanato 0:47:18.570,0:47:20.490 -feature detector +detector de recursos 0:47:20.490,0:47:24.559 -Some of those feature detectors in his model were handcrafted but some of them were learned +Alguns desses detectores de recursos em seu modelo foram feitos à mão, mas alguns deles foram aprendidos 0:47:25.230,0:47:30.709 -They were learned by an unsupervised method. He didn't have have backprop, right? Backprop didn't exist +Eles foram aprendidos por um método não supervisionado. Ele não tinha backprop, certo? Backprop não existia 0:47:30.710,0:47:36.770 -I mean, it existed but it wasn't really popular and people didn't use it +Quer dizer, existia, mas não era muito popular e as pessoas não o usavam 0:47:38.609,0:47:43.338 -So he trained those filters basically with something that amounts to a +Então ele treinou esses filtros basicamente com algo que equivale a um 0:47:44.190,0:47:46.760 -sort of clustering algorithm a little bit... +tipo de algoritmo de agrupamento um pouco ... 0:47:49.830,0:47:53.569 -and separately for each layer. And so he would +e separadamente para cada camada. E assim ele faria 0:47:56.609,0:48:02.389 -train the filters for the first layer, train this with handwritten digits --he also had a dataset of handwritten digits +treinar os filtros para a primeira camada, treinar isso com dígitos manuscritos -- ele também tinha um conjunto de dados de dígitos manuscritos 0:48:03.390,0:48:06.470 -and then feed this to complex cells that +e, em seguida, alimentar isso para células complexas que 0:48:06.470,0:48:10.820 -pool the activity of simple cells together, and then that would +agrupar a atividade de células simples, e então isso 0:48:11.880,0:48:18.440 -form the input to the next layer, and it would repeat the same running algorithm. His model of neuron was very complicated +formar a entrada para a próxima camada, e repetiria o mesmo algoritmo em execução. 
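A minimal PyTorch sketch of the simple-cell / complex-cell idea just described: a "simple cell" is an oriented-edge filter replicated over the whole image (a convolution), and a "complex cell" aggregates nearby simple-cell responses with an order-independent function such as the max, which buys a little local shift invariance. The 3×3 filter values and the 2×2 pooling window below are illustrative choices, not something taken from the lecture.

```python
import torch
import torch.nn.functional as F

# "Simple cells": an oriented (vertical) edge detector applied at every location,
# i.e. a convolution. The kernel values are just an illustrative edge filter.
edge_kernel = torch.tensor([[-1., 0., 1.],
                            [-2., 0., 2.],
                            [-1., 0., 1.]]).view(1, 1, 3, 3)

image = torch.randn(1, 1, 32, 32)          # one grayscale image
simple = F.conv2d(image, edge_kernel)      # simple-cell responses at every position

# "Complex cell": max over a small neighbourhood of simple-cell responses,
# so the output barely changes if the edge moves by a pixel or two.
complex_ = F.max_pool2d(simple, kernel_size=2, stride=2)

print(simple.shape, complex_.shape)        # [1, 1, 30, 30] and [1, 1, 15, 15]
```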
Seu modelo de neurônio era muito complicado 0:48:18.440,0:48:19.589 -It was kind of inspired by biology +Foi meio que inspirado pela biologia 0:48:19.589,0:48:27.229 -So it had separate inhibitory neurons, the other neurons only have positive weights and outgoing weights, etc. +Então tinha neurônios inibitórios separados, os outros neurônios só têm pesos positivos e pesos de saída, etc. 0:48:27.839,0:48:29.839 -He managed to get this thing to kind of work +Ele conseguiu fazer essa coisa funcionar 0:48:30.510,0:48:33.800 -Not very well, but sort of worked +Não muito bem, mas meio que funcionou 0:48:36.420,0:48:39.170 -Then a few years later +Então, alguns anos depois 0:48:40.770,0:48:44.509 -I basically kind of got inspired by similar architectures, but +Eu basicamente me inspirei em arquiteturas semelhantes, mas 0:48:45.780,0:48:51.169 -trained them supervised with backprop, okay? So that's the genesis of convolutional nets, if you want +treinou-os supervisionado com backprop, ok? Essa é a gênese das redes convolucionais, se você quiser 0:48:51.750,0:48:53.869 -And then independently more or less +E então independentemente mais ou menos 0:48:57.869,0:49:04.969 -Max Riesenhuber and Tony Poggio's lab at MIT kind of rediscovered this architecture also, but also didn't use backprop for some reason +O laboratório de Max Riesenhuber e Tony Poggio no MIT meio que redescobriu essa arquitetura também, mas também não usou backprop por algum motivo 0:49:06.060,0:49:08.060 -He calls this H-max +Ele chama isso de H-max 0:49:12.150,0:49:20.039 -So this is sort of early experiments I did with convolutional nets when I was finishing my postdoc in the University of Toronto in 1988 +Este é um tipo de experimento que fiz com redes convolucionais quando estava terminando meu pós-doutorado na Universidade de Toronto em 1988 0:49:20.040,0:49:22.040 -So that goes back a long time +Então isso remonta a muito tempo 0:49:22.840,0:49:26.730 -And I was trying to figure out, does this work better on a small data set? +E eu estava tentando descobrir, isso funciona melhor em um pequeno conjunto de dados? 0:49:26.730,0:49:27.870 -So if you have a tiny amount of data +Então, se você tem uma pequena quantidade de dados 0:49:27.870,0:49:31.109 -you're trying to fully connect to network or linear network with just one layer or +você está tentando se conectar totalmente à rede ou rede linear com apenas uma camada ou 0:49:31.480,0:49:34.529 -a network with local connections but no shared weights or compare this with +uma rede com conexões locais, mas sem pesos compartilhados ou compare isso com 0:49:35.170,0:49:39.299 -what was not yet called a convolutional net, where you have shared weights and local connections +o que ainda não era chamado de rede convolucional, onde você compartilha pesos e conexões locais 0:49:39.400,0:49:42.749 -Which one works best? And it turned out that in terms of +Qual deles funciona melhor? E descobriu-se que em termos de 0:49:43.450,0:49:46.439 -generalization ability, which are the curves on the bottom left +capacidade de generalização, que são as curvas no canto inferior esquerdo 0:49:49.270,0:49:52.499 -which you see here, the top curve here, is... +que você vê aqui, a curva superior aqui, é... 0:49:53.500,0:50:00.330 -basically the baby convolutional net architecture trained with very a simple data set of handwritten digits that were drawn with a mouse, right? 
+basicamente a arquitetura de rede convolucional bebê treinada com um conjunto de dados muito simples de dígitos manuscritos que foram desenhados com um mouse, certo? 0:50:00.330,0:50:02.490 -We didn't have any way of collecting images, basically +Não tínhamos como coletar imagens, basicamente 0:50:03.640,0:50:05.640 -at that time +naquela época 0:50:05.860,0:50:09.240 -And then if you have real connections without shared weights +E então, se você tiver conexões reais sem pesos compartilhados 0:50:09.240,0:50:12.119 -it works a little worse. And then if you have fully connected +funciona um pouco pior. E então, se você tiver redes totalmente conectadas 0:50:14.470,0:50:22.230 -networks it works worse, and if you have a linear network, it not only works worse, but but it also overfits, it over trains +funciona pior, e se você tiver uma rede linear, ela não só funciona pior, como também se sobreajusta, treina demais 0:50:23.110,0:50:28.410 -So the test error goes down after a while, and this was trained with 320 +Então o erro de teste diminui depois de um tempo, e isso foi treinado com 320 0:50:29.410,0:50:35.519 -320 training samples, which is really small. Those networks had on the order of +320 amostras de treinamento, o que é muito pouco. Essas redes tinham da ordem de 0:50:36.760,0:50:43.170 -five thousand connections, one thousand parameters. So this is a billion times smaller than what we do today +cinco mil conexões, mil parâmetros. Então isso é um bilhão de vezes menor do que o que fazemos hoje 0:50:43.990,0:50:45.990 -A million times I would say +Um milhão de vezes, eu diria 0:50:47.890,0:50:53.730 -And then I finished my postdoc, I went to Bell Labs, and Bell Labs had slightly bigger computers +E então terminei meu pós-doutorado, fui para a Bell Labs, e a Bell Labs tinha computadores um pouco maiores 0:50:53.730,0:50:57.389 -but what they had was a data set that came from the Postal Service +mas o que eles tinham era um conjunto de dados que veio do Serviço Postal 0:50:57.390,0:51:00.629 -So they had zip codes for envelopes and we built a +Então eles tinham códigos postais de envelopes e nós construímos um 0:51:00.730,0:51:05.159 -data set out of those zip codes and then trained a slightly bigger a neural net for three weeks +conjunto de dados a partir desses códigos postais e, em seguida, treinamos uma rede neural um pouco maior por três semanas 0:51:06.430,0:51:12.749 -and got really good results. So this convolutional net did not have separate +e obtivemos resultados muito bons. Portanto, esta rede convolucional não tinha separados 0:51:13.960,0:51:15.960 -convolution and pooling +convolução e agrupamento 0:51:16.240,0:51:22.769 -It had strided convolution, so convolutions where the window is shifted by more than one pixel. So that's... +Tinha convolução com passo (stride), ou seja, convoluções em que a janela é deslocada em mais de um pixel. Então isso é... 0:51:23.860,0:51:29.739 -What's the result of this? So the result is that the output map when you do a convolution where the stride is +Qual é o resultado disso? Então o resultado é que o mapa de saída, quando você faz uma convolução em que o passo é 0:51:30.710,0:51:36.369 -more than one, you get an output whose resolution is smaller than the input and you see an example here +mais de um, você obtém uma saída cuja resolução é menor que a da entrada, e você vê um exemplo aqui 0:51:36.370,0:51:40.390 -So here the input is 16 by 16 pixels. That's what we could afford +Então aqui a entrada é de 16 por 16 pixels.
Isso é o que poderíamos pagar 0:51:41.900,0:51:49.029 -The kernels are 5 by 5, but they are shifted by 2 pixels every time and so the +Os kernels são de 5 por 5, mas são deslocados em 2 pixels a cada vez e, portanto, o 0:51:51.950,0:51:56.919 -the output here is smaller because of that +a saída aqui é menor por causa disso 0:52:11.130,0:52:13.980 -Okay? And then one year later this was the next generation +OK? E então um ano depois esta era a próxima geração 0:52:14.830,0:52:16.830 -convolutional net. This one had separate +rede convolucional. Este tinha separado 0:52:17.680,0:52:19.680 -convolution and pooling so... +convolução e pooling então... 0:52:20.740,0:52:24.389 -Where's the pooling operation? At that time, the pooling operation was just another +Onde está a operação de pooling? Naquela época, a operação de pooling era apenas mais uma 0:52:25.690,0:52:31.829 -neuron except that all the weights of that neuron were equal, okay? So a pooling unit was basically +neurônio exceto que todos os pesos desse neurônio eram iguais, ok? Assim, uma unidade de pooling era basicamente 0:52:32.680,0:52:36.839 -a unit that computed an average of its inputs +uma unidade que calculou uma média de suas entradas 0:52:37.180,0:52:41.730 -it added a bias, and then passed it to a non-linearity, which in this case was a hyperbolic tangent function +adicionou um viés e, em seguida, passou para uma não linearidade, que neste caso era uma função tangente hiperbólica 0:52:42.820,0:52:48.450 -Okay? All the non-linearities in this network were hyperbolic tangents at the time. That's what people were doing +OK? Todas as não linearidades nesta rede eram tangentes hiperbólicas na época. Isso é o que as pessoas estavam fazendo 0:52:53.200,0:52:55.200 -And the pooling operation was +E a operação de agrupamento foi 0:52:56.380,0:52:58.440 -performed by shifting +realizado por deslocamento 0:52:59.680,0:53:01.710 -the window over which you compute the +a janela sobre a qual você calcula o 0:53:02.770,0:53:09.240 -the aggregate of the output of the previous layer by 2 pixels, okay? So here +o agregado da saída da camada anterior por 2 pixels, ok? Então aqui 0:53:10.090,0:53:13.470 -you get a 32 by 32 input window +você obtém uma janela de entrada de 32 por 32 0:53:14.470,0:53:20.730 -You convolve this with filters that are 5 by 5. I should mention that a convolution kernel sometimes is also called a filter +Você envolve isso com filtros que são 5 por 5. Devo mencionar que um kernel de convolução às vezes também é chamado de filtro 0:53:22.540,0:53:25.230 -And so what you get here are +E então o que você tem aqui são 0:53:27.520,0:53:29.520 -outputs that are +saídas que são 0:53:30.520,0:53:33.749 -I guess minus 4 so is 28 by 28, okay? +Acho que menos 4, então é 28 por 28, ok? 0:53:34.540,0:53:40.380 -And then there is a pooling which computes an average of +E então há um pooling que calcula uma média de 0:53:41.530,0:53:44.400 -pixels here over a 2 by 2 window and +pixels aqui em uma janela de 2 por 2 e 0:53:45.310,0:53:47.310 -then shifts that window by 2 +então muda essa janela por 2 0:53:48.160,0:53:50.160 -So how many such windows do you have? +Então, quantas dessas janelas você tem? 0:53:51.220,0:53:56.279 -Since the image is 28 by 28, you divide by 2, is 14 by 14, okay? So those images +Já que a imagem é 28 por 28, você divide por 2, é 14 por 14, ok? 
Então essas imagens 0:53:57.460,0:54:00.359 -here are 14 by 14 pixels +aqui estão 14 por 14 pixels 0:54:02.050,0:54:05.759 -And they are basically half the resolution as the previous window +E eles são basicamente metade da resolução da janela anterior 0:54:07.420,0:54:09.420 -because of this stride +por causa desse passo 0:54:10.360,0:54:16.470 -Okay? Now it becomes interesting because what you want is, you want the next layer to detect combinations of features from the previous layer +OK? Agora fica interessante porque o que você quer é que a próxima camada detecte combinações de recursos da camada anterior 0:54:17.200,0:54:19.200 -And so... +E assim... 0:54:20.200,0:54:22.619 -the way to do this is... you have +a maneira de fazer isso é... você tem 0:54:23.440,0:54:26.579 -different convolution filters apply to each of those feature maps +diferentes filtros de convolução se aplicam a cada um desses mapas de recursos 0:54:27.730,0:54:29.730 -Okay? +OK? 0:54:29.950,0:54:35.939 -And you sum them up, you sum the results of those four convolutions and you pass the result to a non-linearity and that gives you +E você os soma, soma os resultados dessas quatro circunvoluções e passa o resultado para uma não linearidade e isso lhe dá 0:54:36.910,0:54:42.239 -one feature map of the next layer. So because those filters are 5 by 5 and those +um mapa de características da próxima camada. Então, porque esses filtros são 5 por 5 e aqueles 0:54:43.330,0:54:46.380 -images are 14 by 14, those guys are 10 by 10 +as imagens são 14 por 14, esses caras são 10 por 10 0:54:47.290,0:54:49.739 -Okay? To not have border effects +OK? Para não ter efeitos de borda 0:54:52.270,0:54:56.999 -So each of these feature maps --of which there are sixteen if I remember correctly +Então, cada um desses mapas de recursos -- dos quais há dezesseis se bem me lembro 0:54:59.290,0:55:01.290 -uses a different set of +usa um conjunto diferente de 0:55:02.860,0:55:04.860 -kernels to... +núcleos para... 0:55:06.340,0:55:09.509 -convolve the previous layers. In fact +convolva as camadas anteriores. Na verdade 0:55:10.630,0:55:13.799 -the connection pattern between the feature map... +o padrão de conexão entre o mapa de recursos... 0:55:14.650,0:55:18.720 -the feature map at this layer and the feature map at the next layer is actually not full +o mapa de feição nesta camada e o mapa de feição na próxima camada não estão cheios 0:55:18.720,0:55:22.349 -so not every feature map is connected to every feature map. There's a particular scheme of +portanto, nem todo mapa de recursos está conectado a todos os mapas de recursos. Há um esquema particular de 0:55:23.680,0:55:25.950 -different combinations of feature map from the previous layer +diferentes combinações de mapa de recursos da camada anterior 0:55:28.030,0:55:33.600 -combining to four feature maps at the next layer. And the reason for doing this is just to save computer time +combinando com quatro mapas de recursos na próxima camada. E a razão para fazer isso é apenas para economizar tempo no computador 0:55:34.000,0:55:40.170 -We just could not afford to connect everything to everything. It would have taken twice the time to run or more +Nós simplesmente não podíamos nos dar ao luxo de conectar tudo a tudo. 
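As a sketch of the size bookkeeping walked through above: with no padding, a convolution or pooling maps an input of size $n$ to $\lfloor (n-k)/s \rfloor + 1$, so 5×5 kernels with stride 1 and 2×2 average pooling with stride 2 take a 32×32 window down to a single output location per category. The channel counts (4, 16, 10) follow the numbers mentioned in the description but should be treated as illustrative, and details such as the trainable bias in the pooling and the sparse connection scheme between feature maps are omitted.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)                  # one 32x32 input window

c1 = nn.Conv2d(1, 4, kernel_size=5)            # 32x32 -> 28x28
p1 = nn.AvgPool2d(kernel_size=2, stride=2)     # 28x28 -> 14x14
c2 = nn.Conv2d(4, 16, kernel_size=5)           # 14x14 -> 10x10
p2 = nn.AvgPool2d(kernel_size=2, stride=2)     # 10x10 -> 5x5
c3 = nn.Conv2d(16, 10, kernel_size=5)          # 5x5 -> 1x1: "looks like a full
                                               # connection, but it's a convolution"

h = torch.tanh(c1(x))
h = torch.tanh(p1(h))                          # pooling followed by the tanh non-linearity
h = torch.tanh(c2(h))
h = torch.tanh(p2(h))
out = c3(h)
print(out.shape)                               # torch.Size([1, 10, 1, 1]), one score per digit
```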
Levaria o dobro do tempo para correr ou mais 0:55:41.890,0:55:48.359 -Nowadays we are kind of forced more or less to actually have a complete connection between feature maps in a convolutional net +Hoje em dia somos meio que forçados a ter uma conexão completa entre mapas de características em uma rede convolucional 0:55:49.210,0:55:52.289 -Because of the way that multiple convolutions are implemented in GPUs +Devido à maneira como várias convoluções são implementadas em GPUs 0:55:53.440,0:55:55.440 -Which is sad +O que é triste 0:55:56.560,0:55:59.789 -And then the next layer up. So again those maps are 10 by 10 +E então a próxima camada para cima. Então, novamente, esses mapas são 10 por 10 0:55:59.790,0:56:02.729 -Those feature maps are 10 by 10 and the next layer up +Esses mapas de recursos são 10 por 10 e a próxima camada acima 0:56:03.970,0:56:06.389 -is produced by pooling and subsampling +é produzido por agrupamento e subamostragem 0:56:07.330,0:56:09.330 -by a factor of 2 +por um fator de 2 0:56:09.370,0:56:11.370 -and so those are 5 by 5 +e então esses são 5 por 5 0:56:12.070,0:56:14.880 -Okay? And then again there is a 5 by 5 convolution here +OK? E então, novamente, há uma convolução de 5 por 5 aqui 0:56:14.880,0:56:18.089 -Of course, you can't move the window 5 by 5 over a 5 by 5 image +Claro, você não pode mover a janela 5 por 5 sobre uma imagem de 5 por 5 0:56:18.090,0:56:21.120 -So it looks like a full connection, but it's actually a convolution +Parece uma conexão completa, mas na verdade é uma convolução 0:56:22.000,0:56:24.000 -Okay? Keep this in mind +OK? Mantenha isso em mente 0:56:24.460,0:56:26.460 -But you basically just sum in only one location +Mas você basicamente apenas soma em apenas um local 0:56:27.250,0:56:33.960 -And those feature maps at the top here are really outputs. And so you have one special location +E esses mapas de recursos no topo aqui são realmente saídas. E assim você tem um local especial 0:56:33.960,0:56:39.399 -Okay? Because you can only place one 5 by 5 window within a 5 by 5 image +OK? Porque você só pode colocar uma janela de 5 por 5 em uma imagem de 5 por 5 0:56:40.460,0:56:45.340 -And you have 10 of those feature maps each of which corresponds to a category so you train the system to classify +E você tem 10 desses mapas de recursos, cada um correspondendo a uma categoria, então você treina o sistema para classificar 0:56:45.560,0:56:47.619 -digits from 0 to 9, you have ten categories +dígitos de 0 a 9, você tem dez categorias 0:56:59.750,0:57:03.850 -This is a little animation that I borrowed from Andrej Karpathy +Esta é uma pequena animação que peguei emprestada de Andrej Karpathy 0:57:05.570,0:57:08.439 -He spent the time to build this really nice real animation +Ele gastou o tempo para construir esta animação real muito legal 0:57:09.470,0:57:16.780 -which is to represent several convolutions, right? So you have three feature maps here on the input and you have six +que é para representar várias circunvoluções, certo? Então você tem três mapas de recursos aqui na entrada e você tem seis 0:57:18.650,0:57:21.100 -convolution kernels and two feature maps on the output +kernels de convolução e dois mapas de recursos na saída 0:57:21.100,0:57:26.709 -So here the first group of three feature maps are convolved with... +Então, aqui o primeiro grupo de três mapas de recursos é convoluído com... 
0:57:28.520,0:57:31.899 -kernels are convolved with the three input feature maps to produce +kernels são convolvidos com os três mapas de recursos de entrada para produzir 0:57:32.450,0:57:37.330 -the first group, the first of the two feature maps, the green one at the top +o primeiro grupo, o primeiro dos dois mapas de recursos, o verde no topo 0:57:38.390,0:57:40.370 -Okay? +OK? 0:57:40.370,0:57:42.820 -And then... +E então... 0:57:44.180,0:57:49.000 -Okay, so this is the first group of three kernels convolved with the three feature maps +Ok, então este é o primeiro grupo de três kernels convoluídos com os três mapas de recursos 0:57:49.000,0:57:53.349 -And they produce the green map at the top, and then you switch to the second group of +E eles produzem o mapa verde no topo, e então você muda para o segundo grupo de 0:57:54.740,0:57:58.479 -of convolution kernels. You convolve with the +de núcleos de convolução. Você se envolve com o 0:57:59.180,0:58:04.149 -three input feature maps to produce the map at the bottom. Okay? So that's +três mapas de recursos de entrada para produzir o mapa na parte inferior. OK? Então isso é 0:58:05.810,0:58:07.810 -an example of +um exemplo de 0:58:10.070,0:58:17.709 -n-feature map on the input, n-feature map on the output, and N times M convolution kernels to get all combinations +Mapa de n recursos na entrada, mapa de n recursos na saída e N vezes M kernels de convolução para obter todas as combinações 0:58:25.000,0:58:27.000 -Here's another animation which I made a long time ago +Aqui está outra animação que fiz há muito tempo 0:58:28.100,0:58:34.419 -That shows convolutional net after it's been trained in action trying to recognize digits +Isso mostra a rede convolucional depois de ser treinada em ação tentando reconhecer dígitos 0:58:35.330,0:58:38.529 -And so what's interesting to look at here is you have +E o que é interessante ver aqui é que você tem 0:58:39.440,0:58:41.440 -an input here, which is I believe +uma entrada aqui, que é eu acredito 0:58:42.080,0:58:44.590 -32 rows by 64 columns +32 linhas por 64 colunas 0:58:45.770,0:58:52.570 -And after doing six convolutions with six convolution kernels passing it through a hyperbolic tangent non-linearity after a bias +E depois de fazer seis convoluções com seis núcleos de convolução passando por uma tangente hiperbólica não linear após um viés 0:58:52.570,0:58:59.229 -you get those feature maps here, each of which kind of activates for a different type of feature. So, for example +você obtém esses mapas de recursos aqui, cada um dos quais é ativado para um tipo diferente de recurso. Assim, por exemplo 0:58:59.990,0:59:01.990 -the feature map at the top here +o mapa de recursos no topo aqui 0:59:02.390,0:59:04.690 -turns on when there is some sort of a horizontal edge +acende quando há algum tipo de borda horizontal 0:59:07.400,0:59:10.090 -This guy here it turns on whenever there is a vertical edge +Esse cara aqui liga sempre que tem uma borda vertical 0:59:10.940,0:59:15.340 -Okay? And those convolutional kernels have been learned through backprop, the thing has been just been trained +OK? E esses kernels convolucionais foram aprendidos através do backprop, a coisa acabou de ser treinada 0:59:15.980,0:59:20.980 -with backprop. Not set by hand. They're set randomly usually +com suporte traseiro. Não definido à mão. 
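A small check of the counting just mentioned (N input feature maps, M output feature maps, hence N × M convolution kernels), using hypothetical sizes:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=2, kernel_size=3)

# One kernel per (output map, input map) pair: 2 x 3 = 6 kernels of size 3x3.
print(conv.weight.shape)        # torch.Size([2, 3, 3, 3])

x = torch.randn(1, 3, 8, 8)     # three 8x8 input feature maps
y = conv(x)
print(y.shape)                  # torch.Size([1, 2, 6, 6]) -- two output feature maps
```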
Eles são definidos aleatoriamente geralmente 0:59:21.620,0:59:26.769 -So you see this notion of equivariance here, if I shift the input image the +Então você vê essa noção de equivariância aqui, se eu mudar a imagem de entrada 0:59:27.500,0:59:31.600 -activations on the feature maps shift, but otherwise stay unchanged +as ativações nos mapas de recursos mudam, mas permanecem inalteradas 0:59:32.540,0:59:34.540 -All right? +Tudo bem? 0:59:34.940,0:59:36.940 -That's shift equivariance +Isso é equivariância de deslocamento 0:59:36.950,0:59:38.860 -Okay, and then we go to the pooling operation +Ok, e então vamos para a operação de pooling 0:59:38.860,0:59:42.519 -So this first feature map here corresponds to a pooled version of +Portanto, este primeiro mapa de recursos aqui corresponde a uma versão agrupada de 0:59:42.800,0:59:46.149 -this first one, the second one to the second one, third went to the third one +este primeiro, o segundo para o segundo, o terceiro foi para o terceiro 0:59:46.250,0:59:51.370 -and the pooling operation here again is an average, then a bias, then a similar non-linearity +e a operação de agrupamento aqui novamente é uma média, depois um viés, depois uma não linearidade semelhante 0:59:52.070,0:59:55.029 -And so if this map shifts by +E assim, se este mapa muda de 0:59:56.570,0:59:59.499 -one pixel this map will shift by one half pixel +um pixel este mapa mudará em meio pixel -1:00:01.370,1:00:02.780 -Okay? +0:00:01.370,0:00:02.780 +OK? -1:00:02.780,1:00:05.259 -So you still have equavariance, but +0:00:02.780,0:00:05.259 +Então você ainda tem equavariância, mas -1:00:06.260,1:00:11.830 -shifts are reduced by a factor of two, essentially +0:00:06.260,0:00:11.830 +os deslocamentos são reduzidos por um fator de dois, essencialmente -1:00:11.830,1:00:15.850 -and then you have the second stage where each of those maps here is a result of +0:00:11.830,0:00:15.850 +e então você tem o segundo estágio onde cada um desses mapas aqui é resultado de -1:00:16.160,1:00:23.440 -doing a convolution on each, or a subset of the previous maps with different kernels, summing up the result, passing the result through +0:00:16.160,0:00:23.440 +fazendo uma convolução em cada um, ou um subconjunto dos mapas anteriores com kernels diferentes, somando o resultado, passando o resultado por -1:00:24.170,1:00:27.070 -a sigmoid, and so you get those kind of abstract features +0:00:24.170,0:00:27.070 +um sigmóide, e assim você obtém esse tipo de recursos abstratos -1:00:28.730,1:00:32.889 -here that are a little hard to interpret visually, but it's still equivariant to shift +0:00:28.730,0:00:32.889 +aqui que são um pouco difíceis de interpretar visualmente, mas ainda é equivalente mudar -1:00:33.860,1:00:40.439 -Okay? And then again you do pooling and subsampling. So the pooling also has this stride by a factor of two +0:00:33.860,0:00:40.439 +OK? E então, novamente, você faz o agrupamento e a subamostragem. Então o pooling também tem esse passo por um fator de dois -1:00:40.630,1:00:42.580 -So what you get here are +0:00:40.630,0:00:42.580 +Então o que você tem aqui são -1:00:42.580,1:00:47.609 -our maps, so that those maps shift by one quarter pixel if the input shifts by one pixel +0:00:42.580,0:00:47.609 +nossos mapas, para que esses mapas mudem em um quarto de pixel se a entrada mudar em um pixel -1:00:48.730,1:00:55.290 -Okay? So we reduce the shift and it becomes... it might become easier and easier for following layers to kind of interpret what the shape is +0:00:48.730,0:00:55.290 +OK? 
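A quick numerical sanity check of the shift equivariance being described, under the assumption of a random kernel and no padding; away from the border, convolving a shifted image gives the original feature map shifted by the same amount.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
kernel = torch.randn(1, 1, 3, 3)
image = torch.randn(1, 1, 16, 16)
shifted = torch.roll(image, shifts=2, dims=3)      # shift the input 2 pixels to the right

out = F.conv2d(image, kernel)                      # 14x14 feature map
out_shifted = F.conv2d(shifted, kernel)

# Ignoring the columns affected by the wrap-around at the border, the second map
# is just the first map shifted by 2 pixels: same activations, new positions.
print(torch.allclose(out[..., :, :-2], out_shifted[..., :, 2:]))   # True
```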
Então reduzimos a mudança e torna-se... pode tornar-se cada vez mais fácil para as camadas seguintes para interpretar qual é a forma -1:00:55.290,1:00:57.290 -because you exchange +0:00:55.290,0:00:57.290 +porque você troca -1:00:58.540,1:01:00.540 -spatial resolution for +0:00:58.540,0:01:00.540 +resolução espacial para -1:01:01.030,1:01:05.009 -feature type resolution. You increase the number of feature types as you go up the layers +0:01:01.030,0:01:05.009 +resolução do tipo de recurso. Você aumenta o número de tipos de feição à medida que sobe as camadas -1:01:06.040,1:01:08.879 -The spatial resolution goes down because of the pooling and subsampling +0:01:06.040,0:01:08.879 +A resolução espacial diminui devido ao agrupamento e subamostragem -1:01:09.730,1:01:14.459 -But the number of feature maps increases and so you make the representation a little more abstract +0:01:09.730,0:01:14.459 +Mas o número de mapas de recursos aumenta e você torna a representação um pouco mais abstrata -1:01:14.460,1:01:19.290 -but less sensitive to shift and distortions. And the next layer +0:01:14.460,0:01:19.290 +mas menos sensível a mudanças e distorções. E a próxima camada -1:01:20.740,1:01:25.080 -again performs convolutions, but now the size of the convolution kernel is equal to the height of the image +0:01:20.740,0:01:25.080 +novamente executa convoluções, mas agora o tamanho do kernel de convolução é igual à altura da imagem -1:01:25.080,1:01:27.449 -And so what you get is a single band +0:01:25.080,0:01:27.449 +E então o que você ganha é uma única banda -1:01:28.359,1:01:32.219 -for this feature map. It basically becomes one dimensional and +0:01:28.359,0:01:32.219 +para este mapa de recursos. Basicamente torna-se unidimensional e -1:01:32.920,1:01:39.750 -so now any vertical shift is basically eliminated, right? It's turned into some variation of activation, but it's not +0:01:32.920,0:01:39.750 +então agora qualquer deslocamento vertical é basicamente eliminado, certo? Ele se transformou em alguma variação de ativação, mas não é -1:01:40.840,1:01:42.929 -It's not a shift anymore. It's some sort of +0:01:40.840,0:01:42.929 +Não é mais uma mudança. É algum tipo de -1:01:44.020,1:01:45.910 -simpler --hopefully +0:01:44.020,0:01:45.910 +mais simples --espero -1:01:45.910,1:01:49.020 -transformation of the input. In fact, you can show it's simpler +0:01:45.910,0:01:49.020 +transformação da entrada. Na verdade, você pode mostrar que é mais simples -1:01:51.160,1:01:53.580 -It's flatter in some ways +0:01:51.160,0:01:53.580 +É mais plano em alguns aspectos -1:01:56.650,1:02:00.330 -Okay? So that's the sort of generic convolutional net architecture we have +0:01:56.650,0:02:00.330 +OK? Esse é o tipo de arquitetura de rede convolucional genérica que temos -1:02:01.570,1:02:05.699 -This is a slightly more modern version of it, where you have some form of normalization +0:02:01.570,0:02:05.699 +Esta é uma versão um pouco mais moderna, onde você tem alguma forma de normalização -1:02:07.450,1:02:09.450 -Batch norm +0:02:07.450,0:02:09.450 +Norma de lote -1:02:10.600,1:02:15.179 -Good norm, whatever. A filter bank, those are the multiple convolutions +0:02:10.600,0:02:15.179 +Boa norma, tanto faz. 
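A sketch of the point above about the layer whose kernel height equals the height of the feature maps: that convolution collapses the vertical dimension, so the output becomes a single one-dimensional band. The channel counts and widths here are arbitrary.

```python
import torch
import torch.nn as nn

# Suppose that after several conv/pool stages the maps are 5 pixels tall and 23 wide.
maps = torch.randn(1, 16, 5, 23)

# A kernel whose height matches the map height removes the vertical dimension:
# the result is a single band, one-dimensional in space.
conv = nn.Conv2d(16, 10, kernel_size=(5, 5))
print(conv(maps).shape)      # torch.Size([1, 10, 1, 19])
```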
Um banco de filtros, essas são as múltiplas convoluções -1:02:16.660,1:02:18.690 -In signal processing they're called filter banks +0:02:16.660,0:02:18.690 +No processamento de sinal, eles são chamados de bancos de filtros -1:02:19.840,1:02:27.149 -Pointwise non-linearity, generally a ReLU, and then some pooling, generally max pooling in the most common +0:02:19.840,0:02:27.149 +Não linearidade pontual, geralmente um ReLU e, em seguida, algum pooling, geralmente max pooling no mais comum -1:02:28.330,1:02:30.629 -implementations of convolutional nets. You can, of course +0:02:28.330,0:02:30.629 +implementações de redes convolucionais. Você pode, claro -1:02:30.630,1:02:35.880 -imagine other types of pooling. I talked about the average but the more generic version is the LP norm +0:02:30.630,0:02:35.880 +imagine outros tipos de pooling. Eu falei sobre a média, mas a versão mais genérica é a norma LP -1:02:36.640,1:02:38.640 -which is... +0:02:36.640,0:02:38.640 +qual é... -1:02:38.770,1:02:45.530 -take all the inputs through a complex cell, elevate them to some power and then take the... +0:02:38.770,0:02:45.530 +pegue todas as entradas através de uma célula complexa, eleve-as a alguma potência e então pegue o... -1:02:45.530,1:02:47.530 -Sum them up, and then take the... +0:02:45.530,0:02:47.530 +Soma-os e depois toma o... -1:02:49.860,1:02:51.860 -Elevate that to 1 over the power +0:02:49.860,0:02:51.860 +Eleve isso para 1 sobre o poder -1:02:53.340,1:02:58.489 -Yeah, this should be a sum inside of the P-th root here +0:02:53.340,0:02:58.489 +Sim, isso deve ser uma soma dentro da raiz P-th aqui -1:03:00.870,1:03:02.870 -Another way to pool and again +0:03:00.870,0:03:02.870 +Outra maneira de piscina e novamente -1:03:03.840,1:03:07.759 -a good pooling operation is an operation that is +0:03:03.840,0:03:07.759 +uma boa operação de pooling é uma operação que é -1:03:07.920,1:03:11.719 -invariant to a permutation of the input. It gives you the same result +0:03:07.920,0:03:11.719 +invariante a uma permutação da entrada. Dá o mesmo resultado -1:03:12.750,1:03:14.750 -regardless of the order in which you put the input +0:03:12.750,0:03:14.750 +independentemente da ordem em que você coloca a entrada -1:03:15.780,1:03:22.670 -Here's another example. We talked about this function before: 1 over b log sum of our inputs of e to the bXᵢ +0:03:15.780,0:03:22.670 +Aqui está outro exemplo. Falamos sobre essa função antes: 1 sobre b log soma de nossas entradas de e para o bXᵢ -1:03:25.920,1:03:30.649 -Exponential bX. Again, that's a kind of symmetric aggregation operation that you can use +0:03:25.920,0:03:30.649 +Exponencial bX. Novamente, esse é um tipo de operação de agregação simétrica que você pode usar -1:03:32.400,1:03:35.539 -So that's kind of a stage of a convolutional net, and then you can repeat that +0:03:32.400,0:03:35.539 +Então, isso é uma espécie de estágio de uma rede convolucional, e então você pode repetir isso -1:03:36.270,1:03:43.729 -There's sort of various ways of positioning the normalization. Some people put it after the non-linearity before the pooling +0:03:36.270,0:03:43.729 +Existem várias maneiras de posicionar a normalização. Algumas pessoas colocam após a não linearidade antes do agrupamento -1:03:43.730,1:03:45.730 -You know, it depends +0:03:43.730,0:03:45.730 +Você sabe, depende -1:03:46.590,1:03:48.590 -But it's typical +0:03:46.590,0:03:48.590 +Mas é típico -1:03:53.640,1:03:56.569 -So, how do you do this in PyTorch? 
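Written out, the two aggregation (pooling) functions just mentioned, with the sum placed inside the $p$-th root as corrected in the lecture, are, for inputs $x_1, \dots, x_n$:

$$
\text{LP pooling:}\quad \Big(\sum_{i=1}^{n} |x_i|^p\Big)^{1/p}
\qquad\qquad
\text{log-sum-exp pooling:}\quad \frac{1}{b}\,\log\sum_{i=1}^{n} e^{\,b x_i}
$$

Both are symmetric in their arguments, so the result does not depend on the order of the inputs, and both approach the max as $p$ or $b$ grows large.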
there's a number of different ways +0:03:53.640,0:03:56.569 +Então, como você faz isso no PyTorch? há várias maneiras diferentes -1:03:56.570,1:04:02.479 -You can do it by writing it explicitly, writing a class. So this is an example of a convolutional net class +0:03:56.570,0:04:02.479 +Você pode fazer isso escrevendo explicitamente, escrevendo uma classe. Este é um exemplo de uma classe de rede convolucional -1:04:04.020,1:04:10.520 -In particular one here where you do convolutions, ReLU and max pooling +0:04:04.020,0:04:10.520 +Em particular aqui onde você faz convoluções, ReLU e pool máximo -1:04:12.600,1:04:17.900 -Okay, so the constructor here creates convolutional layers which have parameters in them +0:04:12.600,0:04:17.900 +Ok, então o construtor aqui cria camadas convolucionais que possuem parâmetros nelas -1:04:18.810,1:04:24.499 -And this one has what's called fully-connected layers. I hate that. Okay? +0:04:18.810,0:04:24.499 +E este tem o que chamamos de camadas totalmente conectadas. Eu odeio isso. OK? -1:04:25.980,1:04:30.919 -So there is this idea somehow that the last layer of a convolutional net +0:04:25.980,0:04:30.919 +Então existe essa ideia de alguma forma que a última camada de uma rede convolucional -1:04:32.760,1:04:34.790 -Like this one, is fully connected because +0:04:32.760,0:04:34.790 +Como este, está totalmente conectado porque -1:04:37.320,1:04:42.860 -every unit in this layer is connected to every unit in that layer. So that looks like a full connection +0:04:37.320,0:04:42.860 +cada unidade nesta camada está conectada a cada unidade naquela camada. Então isso parece uma conexão completa -1:04:44.010,1:04:47.060 -But it's actually useful to think of it as a convolution +0:04:44.010,0:04:47.060 +Mas é realmente útil pensar nisso como uma convolução -1:04:49.200,1:04:51.060 -Okay? +0:04:49.200,0:04:51.060 +OK? -1:04:51.060,1:04:56.070 -Now, for efficiency reasons, or maybe some others bad reasons they're called +0:04:51.060,0:04:56.070 +Agora, por razões de eficiência, ou talvez outras razões ruins, eles são chamados -1:04:57.370,1:05:00.959 -fully-connected layers, and we used the class linear here +0:04:57.370,0:05:00.959 +camadas totalmente conectadas, e usamos a classe linear aqui -1:05:01.120,1:05:05.459 -But it kind of breaks the whole idea that your network is a convolutional network +0:05:01.120,0:05:05.459 +Mas meio que quebra toda a ideia de que sua rede é uma rede convolucional -1:05:06.070,1:05:09.209 -So it's much better actually to view them as convolutions +0:05:06.070,0:05:09.209 +Portanto, é muito melhor vê-los como convoluções -1:05:09.760,1:05:14.370 -In this case one by one convolution which is sort of a weird concept. Okay. So here we have +0:05:09.760,0:05:14.370 +Neste caso, uma por uma convolução que é uma espécie de conceito estranho. OK. Então aqui temos -1:05:15.190,1:05:20.46 -four layers, two convolutional layers and two so-called fully-connected layers +0:05:15.190,0:05:20.460 +quatro camadas, duas camadas convolucionais e duas chamadas camadas totalmente conectadas -1:05:21.790,1:05:23.440 -And then the way we... +0:05:21.790,0:05:23.440 +E então a maneira como nós... 
-1:05:23.440,1:05:29.129 -So we need to create them in the constructor, and the way we use them in the forward pass is that +0:05:23.440,0:05:29.129 +Então, precisamos criá-los no construtor, e a maneira como os usamos na passagem para frente é que -1:05:30.630,1:05:35.310 -we do a convolution of the input, and then we apply the ReLU, and then we do max pooling and then we +0:05:30.630,0:05:35.310 +nós fazemos uma convolução da entrada, e então aplicamos o ReLU, e então fazemos o pool máximo e então nós -1:05:35.710,1:05:38.699 -run the second layer, and apply the ReLU, and do max pooling again +0:05:35.710,0:05:38.699 +execute a segunda camada e aplique o ReLU e faça o pool máximo novamente -1:05:38.700,1:05:44.280 -And then we reshape the output because it's a fully connected layer. So we want to make this a +0:05:38.700,0:05:44.280 +E então remodelamos a saída porque é uma camada totalmente conectada. Então queremos fazer disso um -1:05:45.190,1:05:47.879 -vector so that's what the x.view(-1) does +0:05:45.190,0:05:47.879 +vetor então é isso que o x.view(-1) faz -1:05:48.820,1:05:50.820 -And then apply a +0:05:48.820,0:05:50.820 +E depois aplique um -1:05:51.160,1:05:53.160 -ReLU to it +0:05:51.160,0:05:53.160 +ReLU para isso -1:05:53.260,1:05:55.260 -And... +0:05:53.260,0:05:55.260 +E... -1:05:55.510,1:06:00.330 -the second fully-connected layer, and then apply a softmax if we want to do classification +0:05:55.510,0:06:00.330 +a segunda camada totalmente conectada e, em seguida, aplique um softmax se quisermos fazer a classificação -1:06:00.460,1:06:04.409 -And so this is somewhat similar to the architecture you see at the bottom +0:06:00.460,0:06:04.409 +E isso é um pouco semelhante à arquitetura que você vê na parte inferior -1:06:04.900,1:06:08.370 -The numbers might be different in terms of feature maps and stuff, but... +0:06:04.900,0:06:08.370 +Os números podem ser diferentes em termos de mapas de recursos e outras coisas, mas ... -1:06:09.160,1:06:11.160 -but the general architecture is +0:06:09.160,0:06:11.160 +mas a arquitetura geral é -1:06:12.250,1:06:14.250 -pretty much what we're talking about +0:06:12.250,0:06:14.250 +praticamente o que estamos falando -1:06:15.640,1:06:17.640 -Yes? +0:06:15.640,0:06:17.640 +Sim? 
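A minimal sketch of the kind of class being walked through here: two convolution → ReLU → max-pooling stages, a reshape, and two linear ("fully connected") layers followed by a log-softmax. The channel counts, the 28×28 input and the layer names are assumptions made for the example, not the exact code shown on the slide.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Constructor: the layers that hold parameters.
        self.conv1 = nn.Conv2d(1, 16, kernel_size=5)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=5)
        self.fc1 = nn.Linear(32 * 4 * 4, 100)   # "fully connected", really a 1x1-style convolution in spirit
        self.fc2 = nn.Linear(100, 10)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)   # conv -> ReLU -> max pool
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)   # conv -> ReLU -> max pool
        x = x.view(x.size(0), -1)                    # reshape into a vector for the linear layers
        x = F.relu(self.fc1(x))
        return F.log_softmax(self.fc2(x), dim=1)     # class scores for the 10 digits

model = ConvNet()
print(model(torch.randn(8, 1, 28, 28)).shape)        # torch.Size([8, 10])
```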
-1:06:20.530,1:06:22.530 -Say again +0:06:20.530,0:06:22.530 +Repita -1:06:24.040,1:06:26.100 -You know, whatever gradient descent decides +0:06:24.040,0:06:26.100 +Você sabe, o que quer que a descida do gradiente decida -1:06:28.630,1:06:30.630 -We can look at them, but +0:06:28.630,0:06:30.630 +Podemos olhar para eles, mas -1:06:31.180,1:06:33.180 -if you train with a lot of +0:06:31.180,0:06:33.180 +se você treinar com muito -1:06:33.280,1:06:37.590 -examples of natural images, the kind of filters you will see at the first layer +0:06:33.280,0:06:37.590 +exemplos de imagens naturais, o tipo de filtro que você verá na primeira camada -1:06:37.840,1:06:44.999 -basically will end up being mostly oriented edge detectors, very much similar to what people, to what neuroscientists +0:06:37.840,0:06:44.999 +basicamente acabarão sendo detectores de bordas orientados, muito semelhantes ao que as pessoas, ao que os neurocientistas -1:06:45.340,1:06:49.110 -observe in the cortex of +0:06:45.340,0:06:49.110 +observar no córtex de -1:06:49.210,1:06:50.440 -animals +0:06:49.210,0:06:50.440 +animais -1:06:50.440,1:06:52.440 -In the visual cortex of animals +0:06:50.440,0:06:52.440 +No córtex visual dos animais -1:06:55.780,1:06:58.469 -They will change when you train the model, that's the whole point yes +0:06:55.780,0:06:58.469 +Eles vão mudar quando você treinar o modelo, esse é o ponto sim -1:07:05.410,1:07:11.160 -Okay, so it's pretty simple. Here's another way of defining those. This is... I guess it's kind of an +0:07:05.410,0:07:11.160 +Ok, então é bem simples. Aqui está outra maneira de defini-los. Isso é... eu acho que é uma espécie de -1:07:12.550,1:07:15.629 -outdated way of doing it, right? Not many people do this anymore +0:07:12.550,0:07:15.629 +maneira desatualizada de fazer isso, certo? Muitas pessoas não fazem mais isso -1:07:17.170,1:07:23.340 -but it's kind of a simple way. Also there is this class in PyTorch called nn.Sequential +0:07:17.170,0:07:23.340 +mas é uma forma simples. Também existe essa classe no PyTorch chamada nn.Sequential -1:07:24.550,1:07:28.469 -It's basically a container and you keep putting modules in it and it just +0:07:24.550,0:07:28.469 +É basicamente um container e você continua colocando módulos nele e simplesmente -1:07:29.080,1:07:36.269 -automatically kind of use them as being kind of connected in sequence, right? And so then you just have to call +0:07:29.080,0:07:36.269 +automaticamente meio que usá-los como sendo meio conectados em sequência, certo? E então você só tem que ligar -1:07:40.780,1:07:45.269 -forward on it and it will just compute the right thing +0:07:40.780,0:07:45.269 +adiante e ele apenas calculará a coisa certa -1:07:46.360,1:07:50.370 -In this particular form here, you pass it a bunch of pairs +0:07:46.360,0:07:50.370 +Neste formulário específico aqui, você passa um monte de pares -1:07:50.370,1:07:55.229 -It's like a dictionary so you can give a name to each of the layers, and you can later access them +0:07:50.370,0:07:55.229 +É como um dicionário para que você possa dar um nome a cada uma das camadas e depois acessá-las -1:08:08.079,1:08:10.079 -It's the same architecture we were talking about earlier +0:08:08.079,0:08:10.079 +É a mesma arquitetura que estávamos falando anteriormente -1:08:18.489,1:08:24.029 -Yeah, I mean the backprop is automatic, right? You get it +0:08:18.489,0:08:24.029 +Sim, quero dizer que o backprop é automático, certo? 
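A sketch of the nn.Sequential variant being described, where the modules are passed as named pairs (an OrderedDict), so each layer gets a name and can be retrieved later; the layer names and sizes are illustrative.

```python
import torch
import torch.nn as nn
from collections import OrderedDict

model = nn.Sequential(OrderedDict([
    ('conv1', nn.Conv2d(1, 16, kernel_size=5)),
    ('relu1', nn.ReLU()),
    ('pool1', nn.MaxPool2d(2)),
    ('conv2', nn.Conv2d(16, 32, kernel_size=5)),
    ('relu2', nn.ReLU()),
    ('pool2', nn.MaxPool2d(2)),
    ('flatten', nn.Flatten()),
    ('fc', nn.Linear(32 * 4 * 4, 10)),
]))

print(model.conv1)                             # layers are accessible by the name you gave them
y = model(torch.randn(8, 1, 28, 28))           # modules are applied in sequence
y.sum().backward()                             # backprop through the whole stack is automatic
print(model.conv1.weight.grad.shape)           # torch.Size([16, 1, 5, 5])
```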
Você entendeu -1:08:25.630,1:08:27.630 -by default you just call +0:08:25.630,0:08:27.630 +por padrão você apenas chama -1:08:28.690,1:08:32.040 -backward and it knows how to back propagate through it +0:08:28.690,0:08:32.040 +para trás e sabe como voltar a propagar através dele -1:08:44.000,1:08:49.180 -Well, the class kind of encapsulates everything into an object where the parameters are +0:08:44.000,0:08:49.180 +Bem, a classe meio que encapsula tudo em um objeto onde os parâmetros são -1:08:49.250,1:08:51.250 -There's a particular way of... +0:08:49.250,0:08:51.250 +Existe uma forma específica de... -1:08:52.220,1:08:54.220 -getting the parameters out and +0:08:52.220,0:08:54.220 +tirando os parâmetros e -1:08:55.130,1:08:58.420 -kind of feeding them to an optimizer +0:08:55.130,0:08:58.420 +tipo de alimentá-los para um otimizador -1:08:58.420,1:09:01.330 -And so the optimizer doesn't need to know what your network looks like +0:08:58.420,0:09:01.330 +E assim o otimizador não precisa saber como é sua rede -1:09:01.330,1:09:06.910 -It just knows that there is a function and there is a bunch of parameters and it gets a gradient and +0:09:01.330,0:09:06.910 +Ele apenas sabe que existe uma função e um monte de parâmetros e obtém um gradiente e -1:09:06.910,1:09:08.910 -it doesn't need to know what your network looks like +0:09:06.910,0:09:08.910 +ele não precisa saber como é sua rede -1:09:10.790,1:09:12.879 -Yeah, you'll hear more about this +0:09:10.790,0:09:12.879 +Sim, você vai ouvir mais sobre isso -1:09:14.840,1:09:16.840 -tomorrow +0:09:14.840,0:09:16.840 +amanhã -1:09:25.610,1:09:33.159 -So here's a very interesting aspect of convolutional nets and it's one of the reasons why they've become so +0:09:25.610,0:09:33.159 +Então aqui está um aspecto muito interessante das redes convolucionais e é uma das razões pelas quais elas se tornaram tão -1:09:33.830,1:09:37.390 -successful in many applications. It's the fact that +0:09:33.830,0:09:37.390 +sucesso em muitas aplicações. É o fato de -1:09:39.440,1:09:45.280 -if you view every layer in a convolutional net as a convolution, so there is no full connections, so to speak +0:09:39.440,0:09:45.280 +se você visualizar cada camada em uma rede convolucional como uma convolução, então não há conexões completas, por assim dizer -1:09:47.660,1:09:53.320 -you don't need to have a fixed size input. You can vary the size of the input and the network will +0:09:47.660,0:09:53.320 +você não precisa ter uma entrada de tamanho fixo. Você pode variar o tamanho da entrada e a rede -1:09:54.380,1:09:56.380 -vary its size accordingly +0:09:54.380,0:09:56.380 +variar seu tamanho de acordo -1:09:56.780,1:09:58.780 -because... +0:09:56.780,0:09:58.780 +Porque... 
-1:09:59.510,1:10:01.510 -when you apply a convolution to an image +0:09:59.510,0:10:01.510 +quando você aplica uma convolução a uma imagem -1:10:02.150,1:10:05.800 -you fit it an image of a certain size, you do a convolution with a kernel +0:10:02.150,0:10:05.800 +você encaixa uma imagem de um certo tamanho, você faz uma convolução com um kernel -1:10:06.620,1:10:11.979 -you get an image whose size is related to the size of the input +0:10:06.620,0:10:11.979 +você obtém uma imagem cujo tamanho está relacionado ao tamanho da entrada -1:10:12.140,1:10:15.789 -but you can change the size of the input and it just changes the size of the output +0:10:12.140,0:10:15.789 +mas você pode alterar o tamanho da entrada e apenas altera o tamanho da saída -1:10:16.760,1:10:20.320 -And this is true for every convolutional-like like operation, right? +0:10:16.760,0:10:20.320 +E isso é verdade para todas as operações do tipo convolucional, certo? -1:10:20.320,1:10:25.509 -So if your network is composed only of convolutions, then it doesn't matter what the size of the input is +0:10:20.320,0:10:25.509 +Portanto, se sua rede é composta apenas de convoluções, não importa o tamanho da entrada -1:10:26.180,1:10:31.450 -It's going to go through the network and the size of every layer will change according to the size of the input +0:10:26.180,0:10:31.450 +Vai passar pela rede e o tamanho de cada camada mudará de acordo com o tamanho da entrada -1:10:31.580,1:10:34.120 -and the size of the output will also change accordingly +0:10:31.580,0:10:34.120 +e o tamanho da saída também mudará de acordo -1:10:34.640,1:10:37.329 -So here is a little example here where +0:10:34.640,0:10:37.329 +Então aqui está um pequeno exemplo aqui onde -1:10:38.720,1:10:40.720 -I wanna do +0:10:38.720,0:10:40.720 +Eu quero fazer -1:10:41.300,1:10:45.729 -cursive handwriting recognition and it's very hard because I don't know where the letters are +0:10:41.300,0:10:45.729 +reconhecimento de caligrafia cursiva e é muito difícil porque não sei onde estão as letras -1:10:45.730,1:10:48.700 -So I can't just have a character recognizer that... +0:10:45.730,0:10:48.700 +Então eu não posso simplesmente ter um reconhecedor de caracteres que... -1:10:49.260,1:10:51.980 -I mean a system that will first cut the +0:10:49.260,0:10:51.980 +Refiro-me a um sistema que primeiro cortará o -1:10:52.890,1:10:56.100 -word into letters +0:10:52.890,0:10:56.100 +palavra em letras -1:10:56.100,1:10:57.72 -because I don't know where the letters are +0:10:56.100,0:10:57.720 +porque não sei onde estão as letras -1:10:57.720,1:10:59.900 -and then apply the convolutional net to each of the letters +0:10:57.720,0:10:59.900 +e, em seguida, aplique a rede convolucional a cada uma das letras -1:11:00.210,1:11:05.200 -So the best I can do is take the convolutional net and swipe it over the input and then record the output +0:11:00.210,0:11:05.200 +Então, o melhor que posso fazer é pegar a rede convolucional e passá-la sobre a entrada e gravar a saída -1:11:05.850,1:11:11.810 -Okay? And so you would think that to do this you will have to take a convolutional net like this that has a window +0:11:05.850,0:11:11.810 +OK? 
E então você pensaria que para fazer isso você terá que pegar uma rede convolucional como esta que tem uma janela -1:11:12.060,1:11:14.389 -large enough to see a single character +0:11:12.060,0:11:14.389 +grande o suficiente para ver um único caractere -1:11:15.120,1:11:21.050 -and then you take your input image and compute your convolutional net at every location +0:11:15.120,0:11:21.050 +e então você pega sua imagem de entrada e calcula sua rede convolucional em cada local -1:11:21.660,1:11:27.110 -shifting it by one pixel or two pixels or four pixels or something like this, a small enough number of pixels that +0:11:21.660,0:11:27.110 +deslocando-o em um pixel ou dois pixels ou quatro pixels ou algo assim, um número pequeno o suficiente de pixels que -1:11:27.630,1:11:30.619 -regardless of where the character occurs in the input +0:11:27.630,0:11:30.619 +independentemente de onde o caractere ocorre na entrada -1:11:30.620,1:11:35.000 -you will still get a score on the output whenever it needs to recognize one +0:11:30.620,0:11:35.000 +você ainda obterá uma pontuação na saída sempre que precisar reconhecer um -1:11:36.150,1:11:38.989 -But it turns out that will be extremely wasteful +0:11:36.150,0:11:38.989 +Mas acontece que será extremamente desperdício -1:11:40.770,1:11:42.770 -because... +0:11:40.770,0:11:42.770 +Porque... -1:11:43.290,1:11:50.179 -you will be redoing the same computation multiple times. And so the proper way to do this --and this is very important to understand +0:11:43.290,0:11:50.179 +você estará refazendo o mesmo cálculo várias vezes. E a maneira correta de fazer isso -- e isso é muito importante para entender -1:11:50.880,1:11:56.659 -is that you don't do what I just described where you have a small convolutional net that you apply to every window +0:11:50.880,0:11:56.659 +é que você não faz o que acabei de descrever onde você tem uma pequena rede convolucional que você aplica a todas as janelas -1:11:58.050,1:12:00.050 -What you do is you +0:11:58.050,0:12:00.050 +O que você faz é você -1:12:01.230,1:12:07.939 -take a large input and you apply the convolutions to the input image since it's larger you're gonna get a larger output +0:12:01.230,0:12:07.939 +pegue uma entrada grande e aplique as convoluções à imagem de entrada, pois é maior, você obterá uma saída maior -1:12:07.940,1:12:11.270 -you apply the second layer convolution to that, or the pooling, whatever it is +0:12:07.940,0:12:11.270 +você aplica a convolução da segunda camada a isso, ou o pooling, seja lá o que for -1:12:11.610,1:12:15.170 -You're gonna get a larger input again, etc. +0:12:11.610,0:12:15.170 +Você obterá uma entrada maior novamente, etc. 
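Um esboço (hipotético) do ponto acima: se todas as camadas são convoluções ou pooling, a mesma rede aceita entradas de tamanhos diferentes, e uma entrada maior simplesmente produz um mapa de saídas maior, o que equivale a deslizar a rede menor sobre a imagem, só que com muito menos cálculo repetido. Os tamanhos abaixo são escolhas deste exemplo; os números exatos da aula dependem da arquitetura usada lá.

```python
import torch
import torch.nn as nn

rede = nn.Sequential(
    nn.Conv2d(1, 6, 5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(6, 16, 5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 10, 5),   # a "camada final" também é uma convolução (10 classes)
)

print(rede(torch.randn(1, 1, 32, 32)).shape)  # torch.Size([1, 10, 1, 1]): uma única saída
print(rede(torch.randn(1, 1, 32, 64)).shape)  # torch.Size([1, 10, 1, 9]): um mapa de saídas,
                                              # uma por janela 32x32 deslocada de 4 em 4 pixels
```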
-1:12:15.170,1:12:16.650 -all the way to the top and +0:12:15.170,0:12:16.650 +todo o caminho até o topo e -1:12:16.650,1:12:20.929 -whereas in the original design you were getting only one output now you're going to get multiple outputs because +0:12:16.650,0:12:20.929 +enquanto no design original você estava obtendo apenas uma saída, agora você obterá várias saídas porque -1:12:21.570,1:12:23.570 -it's a convolutional layer +0:12:21.570,0:12:23.570 +é uma camada convolucional -1:12:27.990,1:12:29.990 -This is super important because +0:12:27.990,0:12:29.990 +Isso é superimportante porque -1:12:30.600,1:12:35.780 -this way of applying a convolutional net with a sliding window is +0:12:30.600,0:12:35.780 +esta maneira de aplicar uma rede convolucional com uma janela deslizante é -1:12:36.870,1:12:40.610 -much, much cheaper than recomputing the convolutional net at every location +0:12:36.870,0:12:40.610 +muito, muito mais barato do que recalcular a rede convolucional em cada local -1:12:42.510,1:12:44.510 -Okay? +0:12:42.510,0:12:44.510 +OK? -1:12:45.150,1:12:51.619 -You would not believe how many decades it took to convince people that this was a good thing +0:12:45.150,0:12:51.619 +Você não acreditaria em quantas décadas foram necessárias para convencer as pessoas de que isso era uma coisa boa -1:12:58.960,1:13:03.390 -So here's an example of how you can use this +0:12:58.960,0:13:03.390 +Então aqui está um exemplo de como você pode usar isso -1:13:04.090,1:13:09.180 -This is a conventional net that was trained on individual digits, 32 by 32. It was trained on a MNIST, okay? +0:13:04.090,0:13:09.180 +Esta é uma rede convencional que foi treinada em dígitos individuais, 32 por 32. Foi treinada em um MNIST, ok? -1:13:09.760,1:13:11.760 -32 by 32 input windows +0:13:09.760,0:13:11.760 +32 por 32 janelas de entrada -1:13:12.400,1:13:15.690 -It's LeNet 5, so it's very similar to the architecture +0:13:12.400,0:13:15.690 +É LeNet 5, então é muito semelhante à arquitetura -1:13:15.690,1:13:20.940 -I just showed the code for, okay? It's trained on individual characters to just classify +0:13:15.690,0:13:20.940 +Acabei de mostrar o código para, ok? É treinado em personagens individuais para apenas classificar -1:13:21.970,1:13:26.369 -the character in the center of the image. And the way it was trained was there was a little bit of data +0:13:21.970,0:13:26.369 +o personagem no centro da imagem. E a maneira como foi treinado foi que havia um pouco de dados -1:13:26.770,1:13:30.359 -augmentation where the character in the center was kind of shifted a little bit in various locations +0:13:26.770,0:13:30.359 +aumento onde o personagem no centro foi meio que mudou um pouco em vários locais -1:13:31.420,1:13:36.629 -changed in size. And then there were two other characters +0:13:31.420,0:13:36.629 +mudou de tamanho. E então havia dois outros personagens -1:13:37.420,1:13:39.600 -that were kind of added to the side to confuse it +0:13:37.420,0:13:39.600 +que foram meio que adicionados ao lado para confundir -1:13:40.480,1:13:45.660 -in many samples. And then it was also trained with an 11th category +0:13:40.480,0:13:45.660 +em muitas amostras. 
E então também foi treinado com uma 11ª categoria -1:13:45.660,1:13:50.249 -which was "none of the above" and the way it's trained is either you show it a blank image +0:13:45.660,0:13:50.249 +que foi "nenhuma das opções acima" e a maneira como é treinado é ou você mostra uma imagem em branco -1:13:50.410,1:13:54.149 -or you show it an image where there is no character in the center but there are characters on the side +0:13:50.410,0:13:54.149 +ou você mostra uma imagem onde não há nenhum caractere no centro, mas há caracteres na lateral -1:13:54.940,1:13:59.399 -so that it would detect whenever it's inbetween two characters +0:13:54.940,0:13:59.399 +para detectar sempre que estiver entre dois caracteres -1:14:00.520,1:14:02.520 -and then you do this thing of +0:14:00.520,0:14:02.520 +e então você faz essa coisa de -1:14:02.650,1:14:10.970 -computing the convolutional net at every location on the input without actually shifting it but just applying the convolutions to the entire image +0:14:02.650,0:14:10.970 +computando a rede convolucional em cada local na entrada sem realmente deslocá-la, mas apenas aplicando as convoluções à imagem inteira -1:14:11.740,1:14:13.740 -And that's what you get +0:14:11.740,0:14:13.740 +E é isso que você ganha -1:14:13.780,1:14:23.220 -So here the input image is 64 by 32, even though the network was trained on 32 by 32 with those kind of generated examples +0:14:13.780,0:14:23.220 +Então aqui a imagem de entrada é 64 por 32, embora a rede tenha sido treinada em 32 por 32 com esse tipo de exemplo gerado -1:14:24.280,1:14:28.049 -And what you see is the activity of some of the layers, not all of them are represented +0:14:24.280,0:14:28.049 +E o que você vê é a atividade de algumas das camadas, nem todas estão representadas -1:14:29.410,1:14:32.309 -And what you see at the top here, those kind of funny shapes +0:14:29.410,0:14:32.309 +E o que você vê no topo aqui, essas formas engraçadas -1:14:33.520,1:14:37.560 -You see threes and fives popping up and they basically are an +0:14:33.520,0:14:37.560 +Você vê três e cinco aparecendo e eles são basicamente um -1:14:38.830,1:14:41.850 -indication of the winning category for every location, right? +0:14:38.830,0:14:41.850 +indicação da categoria vencedora para cada localidade, certo? -1:14:42.670,1:14:47.339 -So the eight outputs that you see at the top are +0:14:42.670,0:14:47.339 +Portanto, as oito saídas que você vê no topo são -1:14:48.520,1:14:50.520 -basically the output corresponding to eight different +0:14:48.520,0:14:50.520 +basicamente a saída correspondente a oito -1:14:51.250,1:14:56.790 -positions of the 32 by 32 input window on the input, shifted by 4 pixels every time +0:14:51.250,0:14:56.790 +posições da janela de entrada de 32 por 32 na entrada, deslocadas em 4 pixels a cada vez -1:14:59.530,1:15:05.859 -And what is represented is the winning category within that window and the grayscale indicates the score, okay? +0:14:59.530,0:15:05.859 +E o que está representado é a categoria vencedora dentro dessa janela e a escala de cinza indica a pontuação, ok? -1:15:07.220,1:15:10.419 -So what you see is that there's two detectors detecting the five +0:15:07.220,0:15:10.419 +Então o que você vê é que há dois detectores detectando os cinco -1:15:11.030,1:15:15.850 -until the three kind of starts overlapping. And then two detectors are detecting the three that kind of moved around +0:15:11.030,0:15:15.850 +até que os três tipos comecem a se sobrepor. 
E então dois detectores estão detectando os três que meio que se moveram -1:15:18.230,1:15:22.779 -because within a 32 by 32 window +0:15:18.230,0:15:22.779 +porque dentro de uma janela de 32 por 32 -1:15:23.390,1:15:29.919 -the three appears to the left of that 32 by 32 window, and then to the right of that other 32 by 32 windows shifted by four +0:15:23.390,0:15:29.919 +os três aparecem à esquerda dessa janela de 32 por 32 e, em seguida, à direita das outras janelas de 32 por 32 deslocadas por quatro -1:15:29.920,1:15:31.920 -and so those two detectors detect +0:15:29.920,0:15:31.920 +e então esses dois detectores detectam -1:15:32.690,1:15:34.690 -that 3 or that 5 +0:15:32.690,0:15:34.690 +aquele 3 ou aquele 5 -1:15:36.140,1:15:39.890 -So then what you do is you take all those scores here at the top and you +0:15:36.140,0:15:39.890 +Então o que você faz é pegar todas essas pontuações aqui no topo e você -1:15:39.890,1:15:43.809 -do a little bit of post-processing very simple and you figure out if it's a three and a five +0:15:39.890,0:15:43.809 +faça um pouco de pós-processamento muito simples e você descobre se é um três e um cinco -1:15:44.630,1:15:46.630 -What's interesting about this is that +0:15:44.630,0:15:46.630 +O que é interessante nisso é que -1:15:47.660,1:15:49.899 -you don't need to do prior segmentation +0:15:47.660,0:15:49.899 +você não precisa fazer segmentação prévia -1:15:49.900,1:15:51.860 -So something that people had to do +0:15:49.900,0:15:51.860 +Então, algo que as pessoas tinham que fazer -1:15:51.860,1:15:58.180 -before, in computer vision, was if you wanted to recognize an object you had to separate the object from its background because the recognition system +0:15:51.860,0:15:58.180 +antes, em visão computacional, era se você quisesse reconhecer um objeto que você tinha que separar o objeto de seu fundo porque o sistema de reconhecimento -1:15:58.490,1:16:00.490 -would get confused by +0:15:58.490,0:16:00.490 +ficaria confuso com -1:16:00.800,1:16:07.900 -the background. But here with this convolutional net, it's been trained with overlapping characters and it knows how to tell them apart +0:16:00.800,0:16:07.900 +o fundo. Mas aqui com esta rede convolucional, ela foi treinada com caracteres sobrepostos e sabe como diferenciá-los -1:16:08.600,1:16:10.809 -And so it's not confused by characters that overlap +0:16:08.600,0:16:10.809 +E por isso não é confundido por caracteres que se sobrepõem -1:16:10.810,1:16:15.729 -I have a whole bunch of those on my web website, by the way, those animations from the early nineties +0:16:10.810,0:16:15.729 +Eu tenho um monte deles no meu site, a propósito, aquelas animações do início dos anos noventa -1:16:38.450,1:16:41.679 -No, that was the main issue. That's one of the reasons why +0:16:38.450,0:16:41.679 +Não, essa era a questão principal. Essa é uma das razões pelas quais -1:16:44.210,1:16:48.040 -computer vision wasn't working very well. It's because the very problem of +0:16:44.210,0:16:48.040 +a visão computacional não estava funcionando muito bem. É porque o próprio problema de -1:16:49.850,1:16:52.539 -figure/background separation, detecting an object +0:16:49.850,0:16:52.539 +separação figura/fundo, detectando um objeto -1:16:53.780,1:16:59.530 -and recognizing it is the same. You can't recognize the object until you segment it but you can't segment it until you recognize it +0:16:53.780,0:16:59.530 +e reconhecê-lo é o mesmo. 
Você não pode reconhecer o objeto até segmentá-lo, mas não pode segmentá-lo até reconhecê-lo -1:16:59.840,1:17:05.290 -It's the same for cursive handwriting recognition, right? You can't... so here's an example +0:16:59.840,0:17:05.290 +É o mesmo para o reconhecimento de caligrafia cursiva, certo? Você não pode... então aqui está um exemplo -1:17:07.460,1:17:09.460 -Do we have pens? +0:17:07.460,0:17:09.460 +Temos canetas? -1:17:10.650,1:17:12.650 -Doesn't look like we have pens right? +0:17:10.650,0:17:12.650 +Não parece que temos canetas né? -1:17:14.969,1:17:21.859 -Here we go, that's true. I'm sorry... maybe I should use the... +0:17:14.969,0:17:21.859 +Aqui vamos nós, isso é verdade. Me desculpe... talvez eu deva usar o... -1:17:24.780,1:17:26.780 -If this works... +0:17:24.780,0:17:26.780 +Se isso funcionar... -1:17:34.500,1:17:36.510 -Oh, of course... +0:17:34.500,0:17:36.510 +Ah, claro... -1:17:43.409,1:17:45.409 -Okay... +0:17:43.409,0:17:45.409 +OK... -1:17:52.310,1:17:54.310 -Can you guys read this? +0:17:52.310,0:17:54.310 +Vocês podem ler isso? -1:17:55.670,1:18:01.990 -Okay, I mean it's horrible handwriting but it's also because I'm writing on the screen. Okay, now can you read it? +0:17:55.670,0:18:01.990 +Ok, quero dizer que é uma caligrafia horrível, mas também é porque estou escrevendo na tela. Ok, agora você pode lê-lo? -1:18:08.240,1:18:10.240 -Minimum, yeah +0:18:08.240,0:18:10.240 +Mínimo sim -1:18:11.870,1:18:15.010 -Okay, there's actually no way you can segment the letters out of this right +0:18:11.870,0:18:15.010 +Ok, na verdade não há como você segmentar as letras dessa direita -1:18:15.010,1:18:17.439 -I mean this is kind of a random number of waves +0:18:15.010,0:18:17.439 +Quero dizer, isso é meio que um número aleatório de ondas -1:18:17.900,1:18:23.260 -But just the fact that the two "I"s are identified, then it's basically not ambiguous at least in English +0:18:17.900,0:18:23.260 +Mas apenas o fato de que os dois "I"s são identificados, então basicamente não é ambíguo, pelo menos em inglês -1:18:24.620,1:18:26.620 -So that's a good example of +0:18:24.620,0:18:26.620 +Então esse é um bom exemplo de -1:18:28.100,1:18:30.340 -the interpretation of individual +0:18:28.100,0:18:30.340 +a interpretação do indivíduo -1:18:31.580,1:18:38.169 -objects depending on their context. And what you need is some sort of high-level language model to know what words are possible +0:18:31.580,0:18:38.169 +objetos dependendo de seu contexto. 
E o que você precisa é de algum tipo de modelo de linguagem de alto nível para saber quais palavras são possíveis
-1:18:38.170,1:18:40.170
-If you don't know English or similar
+0:18:38.170,0:18:40.170
+Se você não sabe inglês ou similar
-1:18:40.670,1:18:44.320
-languages that have the same word, there's no way you can you can read this
+0:18:40.670,0:18:44.320
+idiomas que têm as mesmas palavras, não há como você ler isso
-1:18:45.500,1:18:48.490
-Spoken language is very similar to this
+0:18:45.500,0:18:48.490
+A linguagem falada é muito semelhante a esta
-1:18:49.700,1:18:53.679
-All of you who have had the experience of learning a foreign language
+0:18:49.700,0:18:53.679
+Todos vocês que tiveram a experiência de aprender uma língua estrangeira
-1:18:54.470,1:18:56.470
-probably had the experience that
+0:18:54.470,0:18:56.470
+provavelmente tiveram a experiência de que
-1:18:57.110,1:19:04.150
-you have a hard time segmenting words from a new language and then recognizing the words because you don't have the vocabulary
+0:18:57.110,0:19:04.150
+você tem dificuldade em segmentar palavras de um novo idioma e depois reconhecer as palavras porque não tem o vocabulário
-1:19:04.850,1:19:09.550
-Right? So if I speak in French -- si je commence à parler français, vous n'avez aucune idée d'où sont les limites des mots --
-[If I start speaking French, you have no idea where the limits of words are]
+0:19:04.850,0:19:09.550
+Certo? Então, se eu falar em francês -- si je commence à parler français, vous n'avez aucune idée d'où sont les limites des mots --
+[Se eu começar a falar francês, você não tem ideia de onde estão os limites das palavras]
-1:19:09.740,1:19:13.749
-Except if you speak French. So I spoke a sentence, it's words
+0:19:09.740,0:19:13.749
+Exceto se você fala francês. Então eu falei uma frase, são palavras
-1:19:13.750,1:19:17.140
-but you can't tell the boundary between the words right because it is basically no
+0:19:13.750,0:19:17.140
+mas você não consegue dizer onde estão os limites entre as palavras, porque basicamente não há
-1:19:17.990,1:19:23.800
-clear seizure between the words unless you know where the words are in advance, right? So that's the problem of segmentation
+0:19:17.990,0:19:23.800
+uma separação clara entre as palavras, a menos que você saiba de antemão onde as palavras estão, certo? Então esse é o problema da segmentação
-1:19:23.900,1:19:28.540
-You can't recognize until you segment, you can't segment until you recognize you have to do both at the same time
+0:19:23.900,0:19:28.540
+Você não pode reconhecer até segmentar, e não pode segmentar até reconhecer; você precisa fazer as duas coisas ao mesmo tempo
-1:19:29.150,1:19:32.379
-Early computer vision systems had a really hard time doing this
+0:19:29.150,0:19:32.379
+Os primeiros sistemas de visão computacional tiveram muita dificuldade em fazer isso
-1:19:40.870,1:19:46.739
-So that's why this kind of stuff is big progress because you don't have to do segmentation in advance, it just...
+0:19:40.870,0:19:46.739
+Então é por isso que esse tipo de coisa é um grande progresso, porque você não precisa fazer a segmentação com antecedência, apenas...
-1:19:47.679,1:19:52.559
-just train your system to be robust to kind of overlapping objects and things like that. Yes, in the back!
+0:19:47.679,0:19:52.559
+apenas treine seu sistema para ser robusto a objetos sobrepostos e coisas assim. Sim, você aí no fundo!
-1:19:55.510,1:19:59.489
-Yes, there is a background class. So when you see a blank response
+0:19:55.510,0:19:59.489
+Sim, há uma classe de fundo.
Então, quando você vê uma resposta em branco -1:20:00.340,1:20:04.410 -it means the system says "none of the above" basically, right? So it's been trained +0:20:00.340,0:20:04.410 +significa que o sistema diz "nenhuma das opções acima" basicamente, certo? Então foi treinado -1:20:05.590,1:20:07.590 -to produce "none of the above" +0:20:05.590,0:20:07.590 +para produzir "nenhuma das anteriores" -1:20:07.690,1:20:11.699 -either when the input is blank or when there is one character that's too +0:20:07.690,0:20:11.699 +ou quando a entrada está em branco ou quando há um caractere que é muito -1:20:13.420,1:20:17.190 -outside of the center or when you have two characters +0:20:13.420,0:20:17.190 +fora do centro ou quando você tem dois personagens -1:20:17.620,1:20:24.029 -but there's nothing in the center. Or when you have two characters that overlap, but there is no central character, right? So it's... +0:20:17.620,0:20:24.029 +mas não há nada no centro. Ou quando você tem dois personagens que se sobrepõem, mas não tem um personagem central, certo? Então é... -1:20:24.760,1:20:27.239 -trying to detect boundaries between characters essentially +0:20:24.760,0:20:27.239 +tentando detectar limites entre personagens essencialmente -1:20:28.420,1:20:30.420 -Here's another example +0:20:28.420,0:20:30.420 +Aqui está outro exemplo -1:20:31.390,1:20:38.640 -This is an example that shows that even a very simple convolutional net with just two stages, right? convolution, pooling, convolution +0:20:31.390,0:20:38.640 +Este é um exemplo que mostra que mesmo uma rede convolucional muito simples com apenas dois estágios, certo? convolução, agrupamento, convolução -1:20:38.640,1:20:40.640 -pooling, and then two layers of... +0:20:38.640,0:20:40.640 +pooling e, em seguida, duas camadas de ... -1:20:42.010,1:20:44.010 -two more layers afterwards +0:20:42.010,0:20:44.010 +mais duas camadas depois -1:20:44.770,1:20:47.429 -can solve what's called the feature-binding problem +0:20:44.770,0:20:47.429 +pode resolver o que é chamado de problema de vinculação de recursos -1:20:48.130,1:20:50.130 -So visual neuroscientists and +0:20:48.130,0:20:50.130 +Assim, neurocientistas visuais e -1:20:50.320,1:20:56.190 -computer vision people had the issue --it was kind of a puzzle-- How is it that +0:20:50.320,0:20:56.190 +as pessoas de visão computacional tinham o problema --era uma espécie de quebra-cabeça-- Como é que -1:20:57.489,1:21:01.289 -we perceive objects as objects? Objects are collections of features +0:20:57.489,0:21:01.289 +percebemos os objetos como objetos? Objetos são coleções de recursos -1:21:01.290,1:21:04.229 -but how do we bind all the features together of an object to form this object? +0:21:01.290,0:21:04.229 +mas como ligamos todos os recursos de um objeto para formar esse objeto? -1:21:06.460,1:21:09.870 -Is there some kind of magical way of doing this? +0:21:06.460,0:21:09.870 +Existe algum tipo de maneira mágica de fazer isso? -1:21:12.520,1:21:16.589 -And they did... psychologists did experiments like... +0:21:12.520,0:21:16.589 +E eles fizeram... psicólogos fizeram experimentos como... 
-1:21:24.210,1:21:26.210 -draw this and then that +0:21:24.210,0:21:26.210 +desenhe isso e depois aquilo -1:21:28.239,1:21:31.349 -and you perceive the bar as +0:21:28.239,0:21:31.349 +e você percebe a barra como -1:21:32.469,1:21:39.419 -a single bar because you're used to bars being obstructed by, occluded by other objects +0:21:32.469,0:21:39.419 +uma única barra porque você está acostumado a barras sendo obstruídas por outros objetos -1:21:39.550,1:21:41.550 -and so you just assume it's an occlusion +0:21:39.550,0:21:41.550 +e então você apenas assume que é uma oclusão -1:21:44.410,1:21:47.579 -And then there are experiments that figure out how much do I have to +0:21:44.410,0:21:47.579 +E depois há experimentos que descobrem o quanto eu tenho que -1:21:48.430,1:21:52.109 -shift the two bars to make me perceive them as two separate bars +0:21:48.430,0:21:52.109 +mude as duas barras para me fazer percebê-las como duas barras separadas -1:21:53.980,1:21:56.580 -But in fact, the minute they perfectly line and if you... +0:21:53.980,0:21:56.580 +Mas, na verdade, no minuto em que eles se alinham perfeitamente e se você... -1:21:57.250,1:21:59.080 -if you do this.. +0:21:57.250,0:21:59.080 +se você fizer isto.. -1:21:59.080,1:22:03.809 -maybe exactly identical to what you see here, but now you perceive them as two different objects +0:21:59.080,0:22:03.809 +talvez exatamente idêntico ao que você vê aqui, mas agora você os percebe como dois objetos diferentes -1:22:06.489,1:22:12.929 -So how is it that we seem to be solving the feature-binding problem? +0:22:06.489,0:22:12.929 +Então, como parece que estamos resolvendo o problema de vinculação de recursos? -1:22:15.880,1:22:21.450 -And what this shows is that you don't need any specific mechanism for it. It just happens +0:22:15.880,0:22:21.450 +E o que isso mostra é que você não precisa de nenhum mecanismo específico para isso. Simplesmente acontece -1:22:22.210,1:22:25.919 -If you have enough nonlinearities and you train with enough data +0:22:22.210,0:22:25.919 +Se você tiver não linearidades suficientes e treinar com dados suficientes -1:22:26.440,1:22:33.359 -then, as a side effect, you get a system that solves the feature-binding problem without any particular mechanism for it +0:22:26.440,0:22:33.359 +então, como efeito colateral, você obtém um sistema que resolve o problema de vinculação de recursos sem nenhum mecanismo específico para isso -1:22:37.510,1:22:40.260 -So here you have two shapes and you move a single +0:22:37.510,0:22:40.260 +Então aqui você tem duas formas e você move uma única -1:22:43.060,1:22:50.519 -stroke and it goes from a six and a one, to a three, to a five and a one, to a seven and a three +0:22:43.060,0:22:50.519 +curso e vai de um seis e um, para um três, para um cinco e um, para um sete e um três -1:22:53.140,1:22:55.140 -Etcetera +0:22:53.140,0:22:55.140 +etc. -1:23:00.020,1:23:07.480 -Right, good question. So the question is: how do you distinguish between the two situations? We have two fives next to each other and +0:23:00.020,0:23:07.480 +Certo, boa pergunta. Então a pergunta é: como você distingue entre as duas situações? Temos dois cincos um ao lado do outro e -1:23:08.270,1:23:14.890 -the fact that you have a single five being detected by two different frames, right? Two different framing of that five +0:23:08.270,0:23:14.890 +o fato de você ter um único cinco sendo detectado por dois quadros diferentes, certo? 
Dois enquadramentos diferentes desses cinco
-1:23:15.470,1:23:17.470
-Well there is this explicit
+0:23:15.470,0:23:17.470
+Bem, há esse
-1:23:17.660,1:23:20.050
-training so that when you have two characters that
+0:23:17.660,0:23:20.050
+treinamento explícito para que, quando você tiver dois caracteres que
-1:23:20.690,1:23:25.029
-are touching and none of them is really centered you train the system to say "none of the above", right?
+0:23:20.690,0:23:25.029
+estão se tocando e nenhum deles está realmente centralizado, você treina o sistema para dizer "nenhuma das opções acima", certo?
-1:23:25.030,1:23:29.079
-So it's always going to have five blank five
+0:23:25.030,0:23:29.079
+Então sempre vai haver cinco, espaço em branco, cinco
-1:23:30.020,1:23:35.800
-It's always gonna have even like one blank one, and the ones can be very close. It will you'll tell you the difference
+0:23:30.020,0:23:35.800
+Sempre vai haver até mesmo um, espaço em branco, um, e os uns podem estar bem próximos. Ele vai te dizer a diferença
-1:23:39.170,1:23:41.289
-Okay, so what are convnets good for?
+0:23:39.170,0:23:41.289
+Ok, então para que servem os convnets?
-1:24:04.970,1:24:07.599
-So what you have to look at is this
+0:24:04.970,0:24:07.599
+Então o que você tem que olhar é isso
-1:24:11.510,1:24:13.510
-Every layer here is a convolution
+0:24:11.510,0:24:13.510
+Cada camada aqui é uma convolução
-1:24:13.610,1:24:15.020
-Okay?
+0:24:13.610,0:24:15.020
+OK?
-1:24:15.020,1:24:21.070
-Including the last layer, so it looks like a full connection because every unit in the second layer goes into the output
+0:24:15.020,0:24:21.070
+Incluindo a última camada, então parece uma conexão completa porque cada unidade na segunda camada vai para a saída
-1:24:21.070,1:24:24.460
-But in fact, it is a convolution, it just happens to be applied to a single location
+0:24:21.070,0:24:24.460
+Mas, na verdade, é uma convolução que por acaso é aplicada a um único local
-1:24:24.950,1:24:31.300
-So now imagine that this layer at the top here now is bigger, okay? Which is represented here
+0:24:24.950,0:24:31.300
+Então agora imagine que essa camada no topo aqui agora é maior, ok? O que está representado aqui
-1:24:32.840,1:24:34.130
-Okay?
+0:24:32.840,0:24:34.130
+OK?
-1:24:34.130,1:24:37.779
-Now the size of the kernel is the size of the image you had here previously
+0:24:34.130,0:24:37.779
+Agora o tamanho do kernel é o tamanho da imagem que você tinha aqui anteriormente
-1:24:37.820,1:24:43.360
-But now it's a convolution that has multiple locations, right? And so what you get is multiple outputs
+0:24:37.820,0:24:43.360
+Mas agora é uma convolução que tem vários locais, certo? E então o que você obtém são várias saídas
-1:24:46.430,1:24:55.100
-That's right, that's right. Each of which corresponds to a classification over an input window of size 32 by 32 in the example I showed
+0:24:46.430,0:24:55.100
+Isso mesmo, isso mesmo. Cada uma delas corresponde a uma classificação em uma janela de entrada de tamanho 32 por 32 no exemplo que mostrei
-1:24:55.100,1:25:02.710
-And those windows are shifted by 4 pixels. The reason being that the network architecture I showed
+0:24:55.100,0:25:02.710
+E essas janelas são deslocadas em 4 pixels.
A razão é que a arquitetura de rede que mostrei -1:25:04.280,1:25:11.739 -here has a convolution with stride one, then pooling with stride two, convolution with stride one, pooling with stride two +0:25:04.280,0:25:11.739 +aqui tem uma convolução com passo um, então juntando com passo dois, convolução com passo um, juntando com passo dois -1:25:13.949,1:25:17.178 -And so the overall stride is four, right? +0:25:13.949,0:25:17.178 +E assim o passo geral é quatro, certo? -1:25:18.719,1:25:22.788 -And so to get a new output you need to shift the input window by four +0:25:18.719,0:25:22.788 +E assim, para obter uma nova saída, você precisa deslocar a janela de entrada em quatro -1:25:24.210,1:25:29.509 -to get one of those because of the two pooling layers with... +0:25:24.210,0:25:29.509 +para obter um desses por causa das duas camadas de pool com ... -1:25:31.170,1:25:35.480 -Maybe I should be a little more explicit about this. Let me draw a picture, that would be clearer +0:25:31.170,0:25:35.480 +Talvez eu devesse ser um pouco mais explícito sobre isso. Deixe-me desenhar uma imagem, isso seria mais claro -1:25:39.929,1:25:43.848 -So you have an input +0:25:39.929,0:25:43.848 +Então você tem uma entrada -1:25:49.110,1:25:53.749 -like this... a convolution, let's say a convolution of size three +0:25:49.110,0:25:53.749 +assim... uma convolução, digamos uma convolução de tamanho três -1:25:57.420,1:25:59.420 -Okay? Yeah with stride one +0:25:57.420,0:25:59.420 +OK? Sim com passo um -1:26:01.289,1:26:04.518 -Okay, I'm not gonna draw all of them, then you have +0:26:01.289,0:26:04.518 +Ok, eu não vou desenhar todos eles, então você tem -1:26:05.460,1:26:11.389 -pooling with subsampling of size two, so you pool over 2 and you subsample, the stride is 2, so you shift by two +0:26:05.460,0:26:11.389 +agrupando com subamostragem de tamanho dois, então você agrupa mais de 2 e subamostra, o passo é 2, então você muda por dois -1:26:12.389,1:26:14.389 -No overlap +0:26:12.389,0:26:14.389 +Sem sobreposição -1:26:18.550,1:26:25.060 -Okay, so here the input is this size --one two, three, four, five, six, seven, eight +0:26:18.550,0:26:25.060 +Ok, então aqui a entrada é deste tamanho -- um dois, três, quatro, cinco, seis, sete, oito -1:26:26.150,1:26:29.049 -because the convolution is of size three you get +0:26:26.150,0:26:29.049 +porque a convolução é de tamanho três, você obtém -1:26:29.840,1:26:31.840 -an output here of size six and +0:26:29.840,0:26:31.840 +uma saída aqui de tamanho seis e -1:26:32.030,1:26:39.010 -then when you do pooling with subsampling with stride two, you get three outputs because that divides the output by two, okay? +0:26:32.030,0:26:39.010 +então quando você faz o agrupamento com subamostragem com passo dois, você obtém três saídas porque isso divide a saída por dois, ok? 
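Só para conferir a aritmética de tamanhos desenhada acima (entrada 8, convolução de tamanho 3 com passo 1, depois pooling 2 com passo 2), segue uma verificação rápida e meramente ilustrativa com operações 1D do PyTorch.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 8)                       # entrada de tamanho 8
conv = nn.Conv1d(1, 1, kernel_size=3)          # convolução de tamanho 3, passo 1
pool = nn.MaxPool1d(kernel_size=2, stride=2)   # pooling 2 com subamostragem 2, sem sobreposição

h = conv(x)
print(h.shape)        # torch.Size([1, 1, 6])  -> 8 - 3 + 1 = 6
print(pool(h).shape)  # torch.Size([1, 1, 3])  -> 6 / 2 = 3
```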
-1:26:39.880,1:26:41.880 -Let me add another one +0:26:39.880,0:26:41.880 +Deixe-me adicionar outro -1:26:43.130,1:26:45.130 -Actually two +0:26:43.130,0:26:45.130 +Na verdade dois -1:26:46.790,1:26:48.790 -Okay, so now the output is ten +0:26:46.790,0:26:48.790 +Ok, então agora a saída é dez -1:26:50.030,1:26:51.680 -This guy is eight +0:26:50.030,0:26:51.680 +Esse cara tem oito -1:26:51.680,1:26:53.680 -This guy is four +0:26:51.680,0:26:53.680 +Esse cara tem quatro -1:26:54.260,1:26:56.409 -I can do convolutions now also +0:26:54.260,0:26:56.409 +Eu posso fazer convoluções agora também -1:26:57.650,1:26:59.650 -Let's say three +0:26:57.650,0:26:59.650 +Digamos três -1:27:01.400,1:27:03.400 -I only get two outputs +0:27:01.400,0:27:03.400 +só tenho duas saídas -1:27:04.490,1:27:06.490 -Okay? Oops! +0:27:04.490,0:27:06.490 +OK? Ops! -1:27:07.040,1:27:10.820 -Hmm not sure why it doesn't... draw +0:27:07.040,0:27:10.820 +Hmm não tenho certeza porque não... desenha -1:27:10.820,1:27:13.270 -Doesn't wanna draw anymore, that's interesting +0:27:10.820,0:27:13.270 +Não quer mais desenhar, isso é interessante -1:27:17.060,1:27:19.060 -Aha! +0:27:17.060,0:27:19.060 +Ah! -1:27:24.110,1:27:26.380 -It doesn't react to clicks, that's interesting +0:27:24.110,0:27:26.380 +Ele não reage a cliques, isso é interessante -1:27:34.460,1:27:39.609 -Okay, not sure what's going on! Oh "xournal" is not responding +0:27:34.460,0:27:39.609 +Ok, não tenho certeza do que está acontecendo! Oh "xournal" não está respondendo -1:27:41.750,1:27:44.320 -All right, I guess it crashed on me +0:27:41.750,0:27:44.320 +Tudo bem, acho que caiu em mim -1:27:46.550,1:27:48.550 -Well, that's annoying +0:27:46.550,0:27:48.550 +Bem, isso é irritante -1:27:53.150,1:27:55.150 -Yeah, definitely crashed +0:27:53.150,0:27:55.150 +Sim, definitivamente caiu -1:28:02.150,1:28:04.150 -And, of course, it forgot it, so... +0:28:02.150,0:28:04.150 +E, claro, ele esqueceu, então... -1:28:09.860,1:28:12.760 -Okay, so we have ten, then eight +0:28:09.860,0:28:12.760 +Ok, então temos dez, então oito -1:28:15.230,1:28:20.470 -because of convolution with three, then we have pooling +0:28:15.230,0:28:20.470 +por causa da convolução com três, então temos o agrupamento -1:28:22.520,1:28:24.520 -of size two with +0:28:22.520,0:28:24.520 +tamanho dois com -1:28:26.120,1:28:28.120 -stride two, so we get four +0:28:26.120,0:28:28.120 +passo dois, então temos quatro -1:28:30.350,1:28:36.970 -Then we have convolution with three so we get two, okay? And then maybe pooling again +0:28:30.350,0:28:36.970 +Então temos convolução com três, então temos dois, ok? E então talvez juntando novamente -1:28:38.450,1:28:42.700 -of size two and subsampling two, we get one. Okay, so... +0:28:38.450,0:28:42.700 +de tamanho dois e subamostrando dois, obtemos um. OK, então... -1:28:44.450,1:28:46.869 -ten input, eight +0:28:44.450,0:28:46.869 +dez entradas, oito -1:28:49.370,1:28:53.079 -four, two, and... +0:28:49.370,0:28:53.079 +quatro, dois e... -1:28:58.010,1:29:03.339 -then one for the pooling. This is convolution three, you're right +0:28:58.010,0:29:03.339 +então um para o pooling. Esta é a convolução três, você está certo -1:29:06.500,1:29:08.500 -This is two +0:29:06.500,0:29:08.500 +Isso é dois -1:29:09.140,1:29:11.140 -And those are three +0:29:09.140,0:29:11.140 +E esses são três -1:29:12.080,1:29:14.080 -Etcetera. Right. Now, let's assume +0:29:12.080,0:29:14.080 +etc. Certo. 
Agora, vamos supor -1:29:14.540,1:29:17.860 -I add a few units here +0:29:14.540,0:29:17.860 +Eu adiciono algumas unidades aqui -1:29:18.110,1:29:21.010 -Okay? So that's going to add, let's say +0:29:18.110,0:29:21.010 +OK? Então isso vai adicionar, digamos -1:29:21.890,1:29:24.160 -four units here, two units here +0:29:21.890,0:29:24.160 +quatro unidades aqui, duas unidades aqui -1:29:27.620,1:29:29.620 -Then... +0:29:27.620,0:29:29.620 +Então... -1:29:41.190,1:29:42.840 -Yeah, this one is +0:29:41.190,0:29:42.840 +Sim, este é -1:29:42.840,1:29:46.279 -like this and like that so I got four and +0:29:42.840,0:29:46.279 +assim e assim, então eu tenho quatro e -1:29:47.010,1:29:48.960 -I got another one here +0:29:47.010,0:29:48.960 +tenho outro aqui -1:29:48.960,1:29:52.460 -Okay? So now I have only one output and by adding four +0:29:48.960,0:29:52.460 +OK? Então agora eu tenho apenas uma saída e adicionando quatro -1:29:53.640,1:29:55.640 -four inputs here +0:29:53.640,0:29:55.640 +quatro entradas aqui -1:29:55.830,1:29:58.249 -which is not 14. I got two outputs +0:29:55.830,0:29:58.249 +que não é 14. Eu tenho duas saídas -1:29:59.790,1:30:02.090 -Why four? Because I have 2 +0:29:59.790,0:30:02.090 +Por que quatro? Porque eu tenho 2 -1:30:02.970,1:30:04.830 -stride of 2 +0:30:02.970,0:30:04.830 +passo de 2 -1:30:04.830,1:30:10.939 -Okay? So the overall subsampling ratio from input to output is 4, it's 2 times 2 +0:30:04.830,0:30:10.939 +OK? Portanto, a proporção geral de subamostragem da entrada para a saída é 4, é 2 vezes 2 -1:30:13.140,1:30:17.540 -Now this is 12, and this is 6, and this is 4 +0:30:13.140,0:30:17.540 +Agora isso é 12, e isso é 6, e isso é 4 -1:30:20.010,1:30:22.010 -So that's a... +0:30:20.010,0:30:22.010 +Então isso é um... -1:30:22.620,1:30:24.620 -demonstration of the fact that +0:30:22.620,0:30:24.620 +demonstração do fato de que -1:30:24.900,1:30:26.900 -you can increase the size of the input +0:30:24.900,0:30:26.900 +você pode aumentar o tamanho da entrada -1:30:26.900,1:30:32.330 -it will increase the size of every layer, and if you have a layer that has size 1 and it's a convolutional layer +0:30:26.900,0:30:32.330 +aumentará o tamanho de cada camada, e se você tiver uma camada com tamanho 1 e for uma camada convolucional -1:30:32.330,1:30:34.330 -its size is going to be increased +0:30:32.330,0:30:34.330 +seu tamanho vai aumentar -1:30:42.870,1:30:44.870 -Yes +0:30:42.870,0:30:44.870 +sim -1:30:47.250,1:30:52.760 -Change the size of a layer, like, vertically, horizontally? Yeah, so there's gonna be... +0:30:47.250,0:30:52.760 +Alterar o tamanho de uma camada, tipo, verticalmente, horizontalmente? Sim, então vai ter... -1:30:54.390,1:30:57.950 -So first you have to train for it, if you want the system to have so invariance to size +0:30:54.390,0:30:57.950 +Então, primeiro você tem que treinar para isso, se você quer que o sistema tenha tanta invariância de tamanho -1:30:58.230,1:31:03.860 -you have to train it with characters of various sizes. You can do this with data augmentation if your characters are normalized +0:30:58.230,0:31:03.860 +você tem que treiná-lo com personagens de vários tamanhos. Você pode fazer isso com aumento de dados se seus personagens forem normalizados -1:31:04.740,1:31:06.740 -That's the first thing. Second thing is... +0:31:04.740,0:31:06.740 +Essa é a primeira coisa. Segunda coisa é... -1:31:08.850,1:31:16.579 -empirically simple convolutional nets are only invariant to size within a factor of... 
rather small factor, like you can increase the size by +0:31:08.850,0:31:16.579 +redes convolucionais empiricamente simples são apenas invariantes ao tamanho dentro de um fator de... fator bastante pequeno, como você pode aumentar o tamanho por -1:31:17.610,1:31:23.599 -maybe 40 percent or something. I mean change the size about 40 percent plus/minus 20 percent, something like that, right? +0:31:17.610,0:31:23.599 +talvez 40 por cento ou algo assim. Quero dizer, mude o tamanho cerca de 40% mais/menos 20%, algo assim, certo? -1:31:26.250,1:31:28.250 -Beyond that... +0:31:26.250,0:31:28.250 +Além disso... -1:31:28.770,1:31:33.830 -you might have more trouble getting invariance, but people have trained with input... +0:31:28.770,0:31:33.830 +você pode ter mais problemas para obter invariância, mas as pessoas treinaram com entrada ... -1:31:33.980,1:31:38.390 -I mean objects of sizes that vary by a lot. So the way to handle this is +0:31:33.980,0:31:38.390 +Quero dizer objetos de tamanhos que variam muito. Então, a maneira de lidar com isso é -1:31:39.750,1:31:46.430 -if you want to handle variable size, is that if you have an image and you don't know what size the objects are +0:31:39.750,0:31:46.430 +se você quiser lidar com tamanho variável, é que se você tem uma imagem e não sabe o tamanho dos objetos -1:31:46.950,1:31:50.539 -that are in this image, you apply your convolutional net to that image and +0:31:46.950,0:31:50.539 +que estão nesta imagem, você aplica sua rede convolucional a essa imagem e -1:31:51.180,1:31:53.979 -then you take the same image, reduce it by a factor of two +0:31:51.180,0:31:53.979 +então você pega a mesma imagem, reduz por um fator de dois -1:31:54.440,1:31:58.179 -just scale the image by a factor of two, run the same convolutional net on that new image and +0:31:54.440,0:31:58.179 +apenas dimensione a imagem por um fator de dois, execute a mesma rede convolucional nessa nova imagem e -1:31:59.119,1:32:02.949 -then reduce it by a factor of two again, and run the same convolutional net again on that image +0:31:59.119,0:32:02.949 +em seguida, reduza-o por um fator de dois novamente e execute a mesma rede convolucional novamente nessa imagem -1:32:03.800,1:32:08.110 -Okay? So the first convolutional net will be able to detect small objects within the image +0:32:03.800,0:32:08.110 +OK? Assim, a primeira rede convolucional será capaz de detectar pequenos objetos dentro da imagem -1:32:08.630,1:32:11.859 -So let's say your network has been trained to detect objects of size... +0:32:08.630,0:32:11.859 +Então, digamos que sua rede foi treinada para detectar objetos de tamanho... -1:32:11.860,1:32:16.179 -I don't know, 20 pixels, like faces for example, right? They are 20 pixels +0:32:11.860,0:32:16.179 +Eu não sei, 20 pixels, como rostos por exemplo, certo? São 20 pixels -1:32:16.789,1:32:20.739 -It will detect faces that are roughly 20 pixels within this image and +0:32:16.789,0:32:20.739 +Ele detectará rostos com aproximadamente 20 pixels nesta imagem e -1:32:21.320,1:32:24.309 -then when you subsample by a factor of 2 and you apply the same network +0:32:21.320,0:32:24.309 +então, quando você subamostra por um fator de 2 e aplica a mesma rede -1:32:24.309,1:32:31.209 -it will detect faces that are 20 pixels within the new image, which means there were 40 pixels in the original image +0:32:24.309,0:32:31.209 +ele detectará rostos com 20 pixels na nova imagem, o que significa que havia 40 pixels na imagem original -1:32:32.179,1:32:37.899 -Okay? 
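Um esboço (hipotético) da pirâmide multi-escala descrita acima: a mesma rede é aplicada à imagem reduzida sucessivamente por um fator de 2, de modo que objetos grandes demais para a janela original acabam "cabendo" em alguma das escalas. Aqui, `detector` é apenas um nome ilustrativo para uma rede totalmente convolucional já treinada; a combinação dos mapas com supressão não máxima viria depois.

```python
import torch
import torch.nn.functional as F

def detectar_multiescala(detector, imagem, n_escalas=3):
    # imagem: tensor (N, C, H, W); detector: rede totalmente convolucional
    mapas = []
    atual = imagem
    for _ in range(n_escalas):
        mapas.append(detector(atual))                   # mapa de pontuações nesta escala
        atual = F.interpolate(atual, scale_factor=0.5,  # reduz a imagem por um fator de 2
                              mode='bilinear', align_corners=False)
    return mapas
```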
Which the first network will not see because the face would be bigger than its input window +0:32:32.179,0:32:37.899 +OK? Que a primeira rede não verá porque o rosto seria maior que sua janela de entrada -1:32:39.170,1:32:41.529 -And then the next network over will detect +0:32:39.170,0:32:41.529 +E então a próxima rede detectará -1:32:42.139,1:32:44.409 -faces that are 80 pixels, etc., right? +0:32:42.139,0:32:44.409 +rostos de 80 pixels, etc., certo? -1:32:44.659,1:32:49.089 -So then by kind of combining the scores from all of those, and doing something called non-maximum suppression +0:32:44.659,0:32:49.089 +Então, combinando as pontuações de todos esses, e fazendo algo chamado supressão não máxima -1:32:49.090,1:32:51.090 -we can actually do detection and +0:32:49.090,0:32:51.090 +podemos realmente fazer a detecção e -1:32:51.230,1:32:57.939 -localization of objects. People use considerably more sophisticated techniques for detection now, and for localization that we'll talk about next week +0:32:51.230,0:32:57.939 +localização de objetos. As pessoas usam técnicas consideravelmente mais sofisticadas para detecção agora e para localização, sobre as quais falaremos na próxima semana -1:32:58.429,1:33:00.429 -But that's the basic idea +0:32:58.429,0:33:00.429 +Mas essa é a ideia básica -1:33:00.920,1:33:02.920 -So let me conclude +0:33:00.920,0:33:02.920 +Então deixe-me concluir -1:33:03.019,1:33:09.429 -What are convnets good for? They're good for signals that come to you in the form of a multi-dimensional array +0:33:03.019,0:33:09.429 +Para que servem os convnets? Eles são bons para sinais que chegam até você na forma de uma matriz multidimensional -1:33:10.190,1:33:12.190 -But that multi-dimensional array has +0:33:10.190,0:33:12.190 +Mas essa matriz multidimensional tem -1:33:13.190,1:33:17.500 -to have two characteristics at least. The first one is +0:33:13.190,0:33:17.500 +ter pelo menos duas características. O primeiro é -1:33:18.469,1:33:23.828 -there is strong local correlations between values. So if you take an image +0:33:18.469,0:33:23.828 +há fortes correlações locais entre os valores. Então, se você tirar uma imagem -1:33:24.949,1:33:32.949 -random image, take two pixels within this image, two pixels that are nearby. Those two pixels are very likely to have very similar colors +0:33:24.949,0:33:32.949 +imagem aleatória, pegue dois pixels dentro desta imagem, dois pixels que estão próximos. Esses dois pixels provavelmente terão cores muito semelhantes -1:33:33.530,1:33:38.199 -Take a picture of this class, for example, two pixels on the wall basically have the same color +0:33:33.530,0:33:38.199 +Tire uma foto dessa aula, por exemplo, dois pixels na parede têm basicamente a mesma cor -1:33:39.469,1:33:42.069 -Okay? It looks like there is a ton of objects here, but +0:33:39.469,0:33:42.069 +OK? 
Parece que há uma tonelada de objetos aqui, mas -1:33:43.280,1:33:49.509 ---animate objects-- but in fact mostly, statistically, neighboring pixels are essentially the same color +0:33:43.280,0:33:49.509 +--animar objetos-- mas na verdade principalmente, estatisticamente, os pixels vizinhos são essencialmente da mesma cor -1:33:52.699,1:34:00.129 -As you move the distance from two pixels away and you compute the statistics of how similar pixels are as a function of distance +0:33:52.699,0:34:00.129 +À medida que você move a distância de dois pixels de distância e calcula as estatísticas de quão semelhantes os pixels são em função da distância -1:34:00.650,1:34:02.650 -they're less and less similar +0:34:00.650,0:34:02.650 +são cada vez menos parecidos -1:34:03.079,1:34:05.079 -So what does that mean? Because +0:34:03.079,0:34:05.079 +Então, o que isso significa? Porque -1:34:06.350,1:34:09.430 -nearby pixels are likely to have similar colors +0:34:06.350,0:34:09.430 +pixels próximos provavelmente terão cores semelhantes -1:34:09.560,1:34:14.499 -that means that when you take a patch of pixels, say five by five, or eight by eight or something +0:34:09.560,0:34:14.499 +isso significa que quando você pega um pedaço de pixels, digamos cinco por cinco, ou oito por oito ou algo assim -1:34:16.040,1:34:18.040 -The type of patch you're going to observe +0:34:16.040,0:34:18.040 +O tipo de patch que você vai observar -1:34:18.920,1:34:21.159 -is very likely to be kind of a smoothly varying +0:34:18.920,0:34:21.159 +é muito provável que seja uma espécie de variação suave -1:34:21.830,1:34:23.830 -color or maybe with an edge +0:34:21.830,0:34:23.830 +cor ou talvez com uma borda -1:34:24.770,1:34:32.080 -But among all the possible combinations of 25 pixels, the ones that you actually observe in natural images is a tiny subset +0:34:24.770,0:34:32.080 +Mas entre todas as combinações possíveis de 25 pixels, as que você realmente observa em imagens naturais é um pequeno subconjunto -1:34:34.130,1:34:38.380 -What that means is that it's advantageous to represent the content of that patch +0:34:34.130,0:34:38.380 +O que isso significa é que é vantajoso representar o conteúdo desse patch -1:34:39.440,1:34:46.509 -by a vector with perhaps less than 25 values that represent the content of that patch. Is there an edge, is it uniform? +0:34:39.440,0:34:46.509 +por um vetor com talvez menos de 25 valores que representam o conteúdo desse patch. Existe uma borda, é uniforme? -1:34:46.690,1:34:48.520 -What color is it? You know things like that, right? +0:34:46.690,0:34:48.520 +Que cor é essa? Você sabe coisas assim, certo? -1:34:48.520,1:34:52.660 -And that's basically what the convolutions in the first layer of a convolutional net are doing +0:34:48.520,0:34:52.660 +E é basicamente isso que as convoluções na primeira camada de uma rede convolucional estão fazendo -1:34:53.900,1:34:58.809 -Okay. So if you have local correlations, there is an advantage in detecting local features +0:34:53.900,0:34:58.809 +OK. Portanto, se você tiver correlações locais, há uma vantagem em detectar recursos locais -1:34:59.090,1:35:01.659 -That's what we observe in the brain. That's what convolutional nets are doing +0:34:59.090,0:35:01.659 +Isso é o que observamos no cérebro. Isso é o que as redes convolucionais estão fazendo -1:35:03.140,1:35:08.140 -This idea of locality. If you feed a convolutional net with permuted pixels +0:35:03.140,0:35:08.140 +Essa ideia de localidade. 
Se você alimentar uma rede convolucional com pixels permutados
-1:35:09.020,1:35:15.070
-it's not going to be able to do a good job at recognizing your images, even if the permutation is fixed
+0:35:09.020,0:35:15.070
+não será capaz de fazer um bom trabalho em reconhecer suas imagens, mesmo que a permutação seja fixa
-1:35:17.030,1:35:19.960
-Right? A fully connected net doesn't care
+0:35:17.030,0:35:19.960
+Certo? Uma rede totalmente conectada não se importa
-1:35:21.410,1:35:23.410
-about permutations
+0:35:21.410,0:35:23.410
+com permutações
-1:35:25.700,1:35:28.240
-Then the second characteristics is that
+0:35:25.700,0:35:28.240
+Então a segunda característica é que
-1:35:30.050,1:35:34.869
-features that are important may appear anywhere on the image. So that's what justifies shared weights
+0:35:30.050,0:35:34.869
+recursos importantes podem aparecer em qualquer lugar da imagem. Então é isso que justifica pesos compartilhados
-1:35:35.630,1:35:38.499
-Okay? The local correlation justifies local connections
+0:35:35.630,0:35:38.499
+OK? A correlação local justifica conexões locais
-1:35:39.560,1:35:46.570
-The fact that features can appear anywhere, that the statistics of images or the signal is uniform
+0:35:39.560,0:35:46.570
+O fato de que os recursos podem aparecer em qualquer lugar, que as estatísticas das imagens ou o sinal são uniformes
-1:35:47.810,1:35:52.030
-means that you need to have repeated feature detectors for every location
+0:35:47.810,0:35:52.030
+significa que você precisa ter detectores de recursos repetidos para cada local
-1:35:52.850,1:35:54.850
-And that's where shared weights
+0:35:52.850,0:35:54.850
+E é aí que pesos compartilhados
-1:35:55.880,1:35:57.880
-come into play
+0:35:55.880,0:35:57.880
+entram em jogo
-1:36:01.990,1:36:06.059
-It does justify the pooling because the pooling is if you want invariance to
+0:36:01.990,0:36:06.059
+Isso justifica o pooling, porque o pooling serve para quando você quer invariância a
-1:36:06.760,1:36:11.400
-variations in the location of those characteristic features. And so if the objects you're trying to recognize
+0:36:06.760,0:36:11.400
+variações na localização dessas características. E se os objetos que você está tentando reconhecer
-1:36:12.340,1:36:16.619
-don't change their nature by kind of being slightly distorted then you want pooling
+0:36:12.340,0:36:16.619
+não mudam de natureza por serem levemente distorcidos, então você quer pooling
-1:36:21.160,1:36:24.360
-So people have used convnets for cancer stuff, image video
+0:36:21.160,0:36:24.360
+Então, as pessoas têm usado convnets para coisas de câncer, imagem, vídeo
-1:36:25.660,1:36:31.019
-text, speech. So speech actually is pretty... speech recognition convnets are used a lot
+0:36:25.660,0:36:31.019
+texto, fala. A fala, na verdade, é bastante... convnets de reconhecimento de fala são muito usadas
-1:36:32.260,1:36:34.380
-Time series prediction, you know things like that
+0:36:32.260,0:36:34.380
+Previsão de séries temporais, coisas assim
-1:36:36.220,1:36:42.030
-And you know biomedical image analysis, so if you want to analyze an MRI, for example
+0:36:36.220,0:36:42.030
+E, sabe, análise de imagens biomédicas; então, se você quiser analisar uma ressonância magnética, por exemplo
-1:36:42.030,1:36:44.030
-MRI or CT scan is a 3d image
+0:36:42.030,0:36:44.030
+A ressonância magnética ou tomografia computadorizada é uma imagem 3D
-1:36:44.950,1:36:49.170
-As humans we can't because we don't have a good visualization technology. We can't really
+0:36:44.950,0:36:49.170
+Como humanos, não podemos porque não temos uma boa tecnologia de visualização.
We can't really +0:36:44.950,0:36:49.170 +Como humanos, não podemos porque não temos uma boa tecnologia de visualização. Nós não podemos realmente -1:36:49.960,1:36:54.960 -apprehend or understand a 3d volume, a 3-dimensional image +0:36:49.960,0:36:54.960 +apreender ou compreender um volume 3D, uma imagem tridimensional -1:36:55.090,1:36:58.709 -But a convnet is fine, feed it a 3d image and it will deal with it +0:36:55.090,0:36:58.709 +Mas um convnet está bem, alimente-o com uma imagem 3d e ele lidará com isso -1:36:59.530,1:37:02.729 -That's a big advantage because you don't have to go through slices to kind of figure out +0:36:59.530,0:37:02.729 +Isso é uma grande vantagem porque você não precisa passar por fatias para descobrir -1:37:04.000,1:37:06.030 -the object in the image +0:37:04.000,0:37:06.030 +o objeto na imagem -1:37:10.390,1:37:15.300 -And then the last thing here at the bottom, I don't know if you guys know where hyperspectral images are +0:37:10.390,0:37:15.300 +E a última coisa aqui embaixo, não sei se vocês sabem onde estão as imagens hiperespectrais -1:37:15.300,1:37:19.139 -So hyperspectral image is an image where... most natural color images +0:37:15.300,0:37:19.139 +Então, a imagem hiperespectral é uma imagem onde... a maioria das imagens de cores naturais -1:37:19.140,1:37:22.619 -I mean images that you collect with a normal camera you get three color components +0:37:19.140,0:37:22.619 +Quero dizer, imagens que você coleta com uma câmera normal, você obtém três componentes de cores -1:37:23.470,1:37:25.390 +0:37:23.470,0:37:25.390 RGB -1:37:25.390,1:37:28.019 -But we can build cameras with way more +0:37:25.390,0:37:28.019 +Mas podemos construir câmeras com muito mais -1:37:28.660,1:37:30.660 -spectral bands than this and +0:37:28.660,0:37:30.660 +bandas espectrais do que isso e -1:37:31.510,1:37:34.709 -that's particularly the case for satellite imaging where some +0:37:31.510,0:37:34.709 +esse é particularmente o caso de imagens de satélite, onde alguns -1:37:36.160,1:37:40.920 -cameras have many spectral bands going from infrared to ultraviolet and +0:37:36.160,0:37:40.920 +câmeras têm muitas bandas espectrais indo do infravermelho ao ultravioleta e -1:37:41.890,1:37:44.610 -that gives you a lot of information about what you see in each pixel +0:37:41.890,0:37:44.610 +que fornece muitas informações sobre o que você vê em cada pixel -1:37:45.760,1:37:47.040 -Some tiny animals +0:37:45.760,0:37:47.040 +Alguns pequenos animais -1:37:47.040,1:37:54.930 -that have small brains find it easier to process hyperspectral images of low resolution than high resolution images with just three colors +0:37:47.040,0:37:54.930 +que têm cérebros pequenos acham mais fácil processar imagens hiperespectrais de baixa resolução do que imagens de alta resolução com apenas três cores -1:37:55.750,1:38:00.450 -For example, there's a particular type of shrimp, right? They have those beautiful +0:37:55.750,0:38:00.450 +Por exemplo, há um tipo específico de camarão, certo? Eles têm aqueles lindos -1:38:01.630,1:38:07.499 -eyes and they have like 17 spectral bands or something, but super low resolution and they have a tiny brain to process it +0:38:01.630,0:38:07.499 +olhos e eles têm tipo 17 bandas espectrais ou algo assim, mas resolução super baixa e eles têm um cérebro minúsculo para processá-lo -1:38:09.770,1:38:12.850 -Okay, that's all for today. See you! +0:38:09.770,0:38:12.850 +Ok, isso é tudo por hoje. Vê você! 
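For reference, a minimal sketch of the point above about feeding a 3D image (an MRI or CT volume) directly to a convolutional net. It assumes PyTorch is available, as in the course notebooks; every size below is made up purely for illustration.

```python
import torch
import torch.nn as nn

# Sketch only: a 3-D convolution consuming a volumetric image (e.g. MRI/CT).
volume = torch.randn(1, 1, 32, 64, 64)   # (batch, channels, depth, height, width)
conv3d = nn.Conv3d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
features = conv3d(volume)
print(features.shape)                    # torch.Size([1, 8, 32, 64, 64])
```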
\ No newline at end of file diff --git a/docs/pt/week03/practicum03.sbv b/docs/pt/week03/practicum03.sbv index 79126d43e..e9d44d1a9 100644 --- a/docs/pt/week03/practicum03.sbv +++ b/docs/pt/week03/practicum03.sbv @@ -1,1751 +1,1751 @@ 0:00:00.020,0:00:07.840 -So convolutional neural networks, I guess today I so foundations me, you know, I post nice things on Twitter +Então, redes neurais convolucionais, acho que hoje eu me funda, sabe, eu posto coisas legais no Twitter 0:00:09.060,0:00:11.060 -Follow me. I'm just kidding +Me siga. Estou brincando 0:00:11.290,0:00:16.649 -Alright. So again anytime you have no idea what's going on. Just stop me ask questions +Tudo bem. Então, novamente, sempre que você não tem ideia do que está acontecendo. Apenas me pare de fazer perguntas 0:00:16.900,0:00:23.070 -Let's make these lessons interactive such that I can try to please you and provide the necessary information +Vamos tornar essas aulas interativas para que eu possa tentar agradá-lo e fornecer as informações necessárias 0:00:23.980,0:00:25.980 -For you to understand what's going on? +Para você entender o que está acontecendo? 0:00:26.349,0:00:27.970 -alright, so +tudo bem, então 0:00:27.970,0:00:31.379 -Convolutional neural networks. How cool is this stuff? Very cool +Redes neurais convolucionais. Quão legal é essa coisa? Muito legal 0:00:32.439,0:00:38.699 -mostly because before having convolutional nets we couldn't do much and we're gonna figure out why now +principalmente porque antes de ter redes convolucionais não podíamos fazer muito e vamos descobrir por que agora 0:00:39.850,0:00:43.800 -how why why and how these networks are so powerful and +como por que e como essas redes são tão poderosas e 0:00:44.379,0:00:48.329 -They are going to be basically making they are making like a very large +Eles vão estar basicamente fazendo eles estão fazendo como um grande 0:00:48.879,0:00:52.859 -Chunk of like the whole networks are used these days +Pedaço de como as redes inteiras são usadas nos dias de hoje 0:00:53.980,0:00:55.300 -so +assim 0:00:55.300,0:01:02.369 -More specifically we are gonna get used to repeat several times those three words, which are the key words for understanding +Mais especificamente, vamos nos acostumar a repetir várias vezes essas três palavras, que são as palavras-chave para entender 0:01:02.920,0:01:05.610 -Convolutions, but we are going to be figuring out that soon +Convoluções, mas vamos descobrir isso em breve 0:01:06.159,0:01:09.059 -so let's get started and figuring out how +então vamos começar e descobrir como 0:01:09.580,0:01:11.470 -these +esses 0:01:11.470,0:01:13.470 -signals these images and these +sinaliza essas imagens e essas 0:01:13.990,0:01:17.729 -different items look like so whenever we talk about +itens diferentes parecem assim sempre que falamos sobre 0:01:18.670,0:01:21.000 -signals we can think about them as +sinais que podemos pensar sobre eles como 0:01:21.580,0:01:23.200 -vectors for example +vetores por exemplo 0:01:23.200,0:01:30.600 -We have there a signal which is representing a monophonic audio signal so given that is only +Temos aí um sinal que está representando um sinal de áudio monofônico, dado que é apenas 0:01:31.180,0:01:38.339 -We have only the temporal dimension going in like the signal happens over one dimension, which is the temporal dimension +Temos apenas a dimensão temporal entrando como o sinal acontece em uma dimensão, que é a dimensão temporal 0:01:38.560,0:01:46.079 -This is called 1d signal and can be represented by a 
singular vector as is shown up up there +Isso é chamado de sinal 1d e pode ser representado por um vetor singular como é mostrado lá em cima 0:01:46.750,0:01:48.619 -each +cada 0:01:48.619,0:01:52.389 -Value of that vector represents the amplitude of the wave form +O valor desse vetor representa a amplitude da forma de onda 0:01:53.479,0:01:56.589 -for example, if you have just a sign you're going to be just hearing like +por exemplo, se você tiver apenas um sinal, você vai ouvir como 0:01:57.830,0:01:59.830 -Like some sound like that +Como alguns soam assim 0:02:00.560,0:02:05.860 -If you have like different kind of you know, it's not just a sign a sign you're gonna hear +Se você tem um tipo diferente de você sabe, não é apenas um sinal, um sinal que você vai ouvir 0:02:06.500,0:02:08.500 -different kind of Timbers or +diferentes tipos de madeiras ou 0:02:09.200,0:02:11.200 -different kind of +tipo diferente de 0:02:11.360,0:02:13.190 -different kind of +tipo diferente de 0:02:13.190,0:02:15.190 -flavor of the sound +sabor do som 0:02:15.440,0:02:18.190 -Moreover you're familiar. How sound works, right? So +Além disso, você está familiarizado. Como o som funciona, certo? assim 0:02:18.709,0:02:21.518 -Right now I'm just throwing air through my windpipe +Agora estou apenas jogando ar pela minha traqueia 0:02:22.010,0:02:26.830 -where there are like some membranes which is making the air vibrate these the +onde existem como algumas membranas que estão fazendo o ar vibrar essas 0:02:26.930,0:02:33.640 -Vibration propagates through the air there are going to be hitting your ears and the ear canal you have inside some little +A vibração se propaga pelo ar, atingindo seus ouvidos e o canal auditivo que você tem dentro de alguns pequenos 0:02:35.060,0:02:38.410 -you have likely cochlea right and then given about +você provavelmente tem a cóclea certa e, em seguida, deu cerca de 0:02:38.989,0:02:45.159 -How much the sound propagates through the cochlea you're going to be detecting the pitch and then by adding different pitch +Quanto o som se propaga pela cóclea, você detectará o tom e, em seguida, adicionará um tom diferente 0:02:45.830,0:02:49.119 -information you can and also like different kind of +informações que você pode e também gosta de diferentes tipos de 0:02:50.090,0:02:53.350 -yeah, I guess speech information you're going figure out what is the +sim, acho que informações de fala você vai descobrir qual é o 0:02:53.930,0:02:59.170 -Sound I was making over here and then you reconstruct that using your language model you have in your brain +Som que eu estava fazendo aqui e então você reconstrói isso usando seu modelo de linguagem que você tem em seu cérebro 0:02:59.170,0:03:03.369 -Right and the same thing Yun was mentioning if you start speaking another language +Certo e a mesma coisa que Yun estava mencionando se você começar a falar outro idioma 0:03:04.310,0:03:11.410 -then you won't be able to parse the information because you're using both a speech model like a conversion between +então você não poderá analisar as informações porque está usando um modelo de fala como uma conversão entre 0:03:12.019,0:03:17.709 -Vibrations and like, you know signal your brain plus the language model in order to make sense +Vibrações e afins, você sabe sinalizar seu cérebro mais o modelo de linguagem para fazer sentido 0:03:18.709,0:03:22.629 -Anyhow, that was a 1d signal. Let's say I'm listening to music so +De qualquer forma, isso era um sinal 1d. 
Digamos que estou ouvindo música, então 0:03:23.570,0:03:25.570 -What kind of signal do I? +Que tipo de sinal eu faço? 0:03:25.910,0:03:27.910 -have there +tem lá 0:03:28.280,0:03:34.449 -So if I listen to music user is going to be a stare of stereophonic, right? So it means you're gonna have how many channels? +Então, se eu ouvir a música do usuário vai ser um olhar estereofônico, certo? Então significa que você vai ter quantos canais? 0:03:35.420,0:03:37.420 -Two channels, right? +Dois canais, certo? 0:03:37.519,0:03:38.570 -nevertheless +no entanto 0:03:38.570,0:03:41.019 -What type of signal is gonna be this one? +Que tipo de sinal vai ser esse? 0:03:41.150,0:03:46.420 -It's still gonna be one this signal although there are two channels so you can think about you know +Ainda vai ser um este sinal, embora existam dois canais para que você possa pensar, você sabe 0:03:46.640,0:03:54.459 -regardless of how many chanted channels like if you had Dolby Surround you're gonna have what 5.1 so six I guess so, that's the +independentemente de quantos canais cantados, como se você tivesse Dolby Surround, você terá o que 5.1, então seis, acho que sim, esse é o 0:03:55.050,0:03:56.410 -You know +Você sabe 0:03:56.410,0:03:58.390 -vectorial the +vetorial o 0:03:58.390,0:04:02.790 -size of the signal and then the time is the only variable which is +tamanho do sinal e então o tempo é a única variável que é 0:04:03.820,0:04:07.170 -Like moving forever. Okay. So those are 1d signals +Como se mover para sempre. OK. Então esses são sinais 1d 0:04:09.430,0:04:13.109 -All right, so let's have a look let's zoom in a little bit so +Tudo bem, então vamos dar uma olhada, vamos ampliar um pouco para 0:04:14.050,0:04:18.420 -We have it. For example on the left hand side. We have something that looks like a sinusoidal +Nós temos isso. Por exemplo, do lado esquerdo. Temos algo que se parece com um senoidal 0:04:19.210,0:04:25.619 -function here nevertheless a little bit after you're gonna have again the same type of +funcionar aqui, no entanto, um pouco depois você terá novamente o mesmo tipo de 0:04:27.280,0:04:29.640 -Function appearing again, so this is called +Função aparecendo novamente, então isso é chamado 0:04:30.460,0:04:37.139 -Stationarity you're gonna see over and over and over again the same type of pattern across the temporal +Estacionaridade você verá repetidas vezes o mesmo tipo de padrão ao longo do tempo 0:04:37.810,0:04:39.810 -Dimension, okay +Dimensão, ok 0:04:40.090,0:04:47.369 -So the first property of this signal which is our natural signal because it happens in nature is gonna be we said +Então, a primeira propriedade deste sinal, que é o nosso sinal natural, porque acontece na natureza, será que dissemos 0:04:49.330,0:04:51.330 -Stationarity, okay. That's the first one +Estacionaridade, ok. Esse é o primeiro 0:04:51.580,0:04:53.580 -Moreover what do you think? +Além disso o que você acha? 0:04:54.130,0:04:56.130 -How likely is? +Quão provável é? 
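A small sketch (editorial addition, not from the practicum) of the audio examples above as arrays: mono, stereo and 5.1 differ only in the number of channels, while the signal itself stays one-dimensional along time.

```python
import numpy as np

# Mono, stereo and 5.1 audio: the channel count (the "thickness") changes,
# but there is still a single dimension along which the signal evolves.
sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate     # one second of time stamps
mono = np.sin(2 * np.pi * 440 * t)           # shape (16000,)   -> 1 channel
stereo = np.stack([mono, 0.5 * mono])        # shape (2, 16000) -> 2 channels
surround = np.tile(mono, (6, 1))             # shape (6, 16000) -> 5.1 = 6 channels
print(mono.shape, stereo.shape, surround.shape)
```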
0:04:56.140,0:05:00.989 -If I have a peak on the left hand side to have a peak also very nearby +Se eu tiver um pico do lado esquerdo para ter um pico também muito próximo 0:05:03.430,0:05:09.510 -So how likely is to have a peak there rather than having a peak there given that you had a peak before or +Então, qual é a probabilidade de ter um pico lá em vez de ter um pico lá, dado que você teve um pico antes ou 0:05:09.610,0:05:11.590 -if I keep going +se eu continuar 0:05:11.590,0:05:18.119 -How likely is you have a peak, you know few seconds later given that you have a peak on the left hand side. So +Qual é a probabilidade de você ter um pico, você sabe alguns segundos depois, pois tem um pico no lado esquerdo. assim 0:05:19.960,0:05:24.329 -There should be like some kind of common sense common knowledge perhaps that +Deveria haver algum tipo de conhecimento comum de senso comum, talvez, que 0:05:24.910,0:05:27.390 -If you are close together and if you are +Se vocês estão juntos e se estão 0:05:28.000,0:05:33.360 -Close to the left hand side is there's gonna be a larger probability that things are gonna be looking +Perto do lado esquerdo, haverá uma probabilidade maior de que as coisas pareçam 0:05:33.880,0:05:40.589 -Similar, for example you have like a specific sound will have a very kind of specific shape +Semelhante, por exemplo, você tem como um som específico terá um tipo muito específico de forma 0:05:41.170,0:05:43.770 -But then if you go a little bit further away from that sound +Mas então se você for um pouco mais longe desse som 0:05:44.050,0:05:50.010 -then there's no relation anymore about what happened here given what happened before and so if you +então não há mais relação sobre o que aconteceu aqui dado o que aconteceu antes e então se você 0:05:50.410,0:05:55.170 -Compute the cross correlation between a signal and itself, do you know what's a cross correlation? +Calcule a correlação cruzada entre um sinal e ele mesmo, você sabe o que é uma correlação cruzada? 0:05:57.070,0:06:02.670 -Do know like if you don't know okay how many hands up who doesn't know a cross correlation +Sabe como se você não sabe ok quantas mãos para cima quem não sabe uma correlação cruzada 0:06:04.360,0:06:07.680 -Okay fine, so that's gonna be homework for you +Tudo bem, então isso vai ser lição de casa para você 0:06:07.680,0:06:14.489 -If you take one signal just a signal audio signal they perform convolution of that signal with itself +Se você pegar um sinal apenas um sinal de áudio, eles realizam a convolução desse sinal consigo mesmo 0:06:14.650,0:06:15.330 -Okay +OK 0:06:15.330,0:06:19.680 -and so convolution is going to be you have your own signal you take the thing you flip it and then you +e então a convolução vai ser você tem seu próprio sinal você pega a coisa você vira e então você 0:06:20.170,0:06:22.170 -pass it across and then you multiply +passá-lo e então você multiplica 0:06:22.390,0:06:25.019 -Whenever you're gonna have them overlaid in the same +Sempre que você vai tê-los sobrepostos no mesmo 0:06:25.780,0:06:27.780 -Like when there is zero +Como quando há zero 0:06:28.450,0:06:33.749 -Misalignment you're gonna have like a spike. And then as you start moving around you're gonna have basically two decaying +Desalinhamento você vai ter como um pico. 
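A possible NumPy sketch of the homework described above (flip the signal, slide it across itself, multiply and sum): for real signals, correlating `x` with itself is the same as convolving `x` with its flipped copy, and the spike sits at zero misalignment.

```python
import numpy as np

# Auto-correlation exercise sketched above: flip, slide, multiply, sum.
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * np.arange(200) / 50) + 0.1 * rng.standard_normal(200)

corr = np.correlate(x, x, mode="full")       # slide x over itself and multiply
conv = np.convolve(x, x[::-1], mode="full")  # flip first, then convolve: same result
print(np.allclose(corr, conv))               # True
print(int(np.argmax(corr)) - (len(x) - 1))   # 0 -> the peak is at zero lag
```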
E então, quando você começar a se mover, terá basicamente dois 0:06:34.360,0:06:36.930 -sides that represents the fact that +lados que representa o fato de que 0:06:37.990,0:06:44.850 -Things have much things in common basically performing a dot product right? So things that have much in common when they are +As coisas têm muitas coisas em comum basicamente realizando um produto escalar certo? Então, coisas que têm muito em comum quando são 0:06:45.370,0:06:47.970 -Very close to one specific location +Muito perto de um local específico 0:06:47.970,0:06:55.919 -If you go further away things start, you know averaging out. So here the second property of this natural signal is locality +Se você for mais longe, as coisas começam, você sabe fazer a média. Então aqui a segunda propriedade deste sinal natural é a localidade 0:06:56.500,0:07:04.470 -Information is contained in specific portion and parts of the in this case temporal domain. Okay. So before we had +A informação está contida em uma porção específica e partes do domínio temporal neste caso. OK. Então, antes que tivéssemos 0:07:06.940,0:07:08.940 -Stationarity now we have +Estacionaridade agora temos 0:07:09.640,0:07:11.640 -Locality alright don't +Localidade tudo bem não 0:07:12.160,0:07:17.999 -Bless you. All, right. So how about this one right? This is completely unrelated to what happened over there +Abençoe. Tudo bem. Então que tal esse certo? Isso não tem nada a ver com o que aconteceu lá 0:07:20.110,0:07:24.960 -Okay, so let's look at the nice little kitten what kind of +Ok, então vamos olhar para o gatinho simpático que tipo de 0:07:25.780,0:07:27.070 -dimensions +dimensões 0:07:27.070,0:07:31.200 -What kind of yeah what dimension has this signal? What was your guess? +Que tipo de sim que dimensão tem esse sinal? Qual foi o seu palpite? 0:07:32.770,0:07:34.829 -It's a 2 dimensional signal why is that +É um sinal bidimensional por que isso 0:07:39.690,0:07:45.469 -Okay, we have also a three-dimensional signal option here so someone said two dimensions someone said three dimensions +Ok, também temos uma opção de sinal tridimensional aqui, então alguém disse duas dimensões alguém disse três dimensões 0:07:47.310,0:07:51.739 -It's two-dimensional why is that sorry noise? Why is two-dimensional +É bidimensional, por que é esse barulho lamentável? Por que é bidimensional 0:07:54.030,0:07:56.030 -Because the information is +Porque a informação é 0:07:58.050,0:08:00.050 -Sorry the information is +Desculpe a informação é 0:08:00.419,0:08:01.740 -especially +especialmente 0:08:01.740,0:08:03.740 -Depicted right? So the information +retratado certo? Então as informações 0:08:03.750,0:08:05.310 -is +é 0:08:05.310,0:08:08.450 -Basically encoded in the spatial location of those points +Basicamente codificado na localização espacial desses pontos 0:08:08.760,0:08:15.439 -Although each point is a vector for example of three or if it's a hyper spectral image. It can be several planes +Embora cada ponto seja um vetor por exemplo de três ou se for uma imagem hiperespectral. Pode ser vários aviões 0:08:16.139,0:08:23.029 -Nevertheless you still you still have two directions in which points can move right? The thickness doesn't change +No entanto, você ainda tem duas direções nas quais os pontos podem se mover, certo? 
A espessura não muda

0:08:24.000,0:08:27.139
-across like in the thicknesses of a given space
+transversalmente como nas espessuras de um determinado espaço

0:08:27.139,0:08:33.408
-Right so given thickness and it doesn't change right so you can have as many, you know planes as you want
+Certo, dada a espessura, e não muda, certo, então você pode ter quantos planos você quiser

0:08:33.409,0:08:35.409
-but the information is basically
+mas a informação é basicamente

0:08:35.640,0:08:41.779
-It's a spatial information is spread across the plane. So these are two dimensional data you can also
+É uma informação espacial espalhada pelo plano. Então, esses são dados bidimensionais que você também pode

0:08:50.290,0:08:53.940
-Okay, I see your point so like a wide image or a
+Ok, eu vejo seu ponto como uma imagem ampla ou um

0:08:54.910,0:08:56.350
-grayscale image
+imagem em tons de cinza

0:08:56.350,0:08:58.350
-It's definitely a 2d
+com certeza é 2d

0:08:58.870,0:09:04.169
-Signal and also it can be represented by using a tensor of two dimensions
+Sinal e também pode ser representado usando um tensor de duas dimensões

0:09:04.870,0:09:07.739
-A color image has RGB planes
+Uma imagem colorida tem planos RGB

0:09:08.350,0:09:14.550
-but the thickness is always three doesn't change and the information is still spread across the
+mas a espessura é sempre três não muda e a informação ainda está espalhada pelo

0:09:15.579,0:09:21.839
-Other two dimensions so you can change the size of a color image, but you won't change the thickness of a color image, right?
+Outras duas dimensões para que você possa alterar o tamanho de uma imagem colorida, mas não alterará a espessura de uma imagem colorida, certo?

0:09:22.870,0:09:28.319
-So we are talking about here. The dimension of the signal is how is the information?
+Então estamos falando aqui. A dimensão do sinal é como fica a informação?

0:09:29.470,0:09:31.680
-Basically spread around right in the temporal information
+Basicamente espalhados pela informação temporal

0:09:31.959,0:09:38.789
-If you have Dolby Surround mono mono signal or you have a stereo we still have over time, right?
+Se você tem sinal mono mono Dolby Surround ou tem um estéreo ainda temos com o tempo, certo?

0:09:38.790,0:09:41.670
-So it's one dimensional images are 2d
+Então é unidimensional; as imagens são 2d

0:09:42.250,0:09:44.759
-so let's have a look to the little nice kitten and
+então vamos dar uma olhada no gatinho simpático e

0:09:45.519,0:09:47.909
-Let's focus on the on the nose, right? Oh
+Vamos focar no nariz, certo? Oh

0:09:48.579,0:09:50.579
-My god, this is a monster. No
+Meu Deus, isso é um monstro. Não

0:09:50.949,0:09:52.949
-Okay. Nice big
+OK. Bem grande

0:09:53.649,0:09:55.948
-Creature here, right? Okay, so
+Criatura aqui, certo? OK, então

0:09:56.740,0:10:03.690
-We observe there and there is some kind of dark region nearby the eye you can observe that kind of seeing a pattern
+Nós observamos lá e há algum tipo de região escura perto do olho você pode observar esse tipo de ver um padrão

0:10:04.329,0:10:09.809
-Appear over there, right? So what is this property of natural signals? I
+Aparece por lá, certo? Então, qual é essa propriedade dos sinais naturais? eu

0:10:12.699,0:10:18.239
-Told you two properties, this is stationarity. Why is this stationarity?
+Disse-lhe duas propriedades, esta é a estacionaridade. Por que essa estacionariedade?
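A quick sketch of the point above (editorial addition, assuming PyTorch): the spatial size of an image can change freely, but the thickness, i.e. the number of colour planes, stays fixed.

```python
import torch

# The spatial extent (H, W) varies; the number of colour planes does not.
small_rgb = torch.rand(3, 32, 32)     # 3 planes, 32 x 32 pixels
large_rgb = torch.rand(3, 256, 256)   # same thickness, larger spatial extent
grayscale = torch.rand(1, 64, 64)     # a single plane: still a 2-D signal
print(small_rgb.shape, large_rgb.shape, grayscale.shape)
```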
0:10:22.029,0:10:29.129 -Right, so the same pattern appears over and over again across the dimensionality in this case the dimension is two dimension. Sorry +Certo, então o mesmo padrão aparece repetidamente em toda a dimensionalidade, neste caso a dimensão é de duas dimensões. Desculpe 0:10:30.220,0:10:36.600 -Moreover, what is the likelihood that given that the color in the pupil is black? What is the likelihood that? +Além disso, qual é a probabilidade de que, dado que a cor da pupila seja preta? Qual é a probabilidade disso? 0:10:37.149,0:10:42.448 -The pixel on the arrow or like on the tip of the arrow is also black +O pixel na seta ou como na ponta da seta também é preto 0:10:42.449,0:10:47.879 -I would say it's quite likely right because it's very close. How about that point? +Eu diria que é bem provável que esteja certo porque é muito próximo. Que tal esse ponto? 0:10:48.069,0:10:51.899 -Yeah, kind of less likely right if I keep clicking +Sim, meio menos provável certo se eu continuar clicando 0:10:52.480,0:10:59.649 -You know, it's completely it's bright. No, no the other pics in right so is further you go in spacial dimension +Você sabe, é completamente brilhante. Não, não as outras fotos à direita, então você vai mais longe na dimensão espacial 0:11:00.290,0:11:06.879 -The less less likely you're gonna have, you know similar information. And so this is called +Quanto menos provável você tiver, você saberá informações semelhantes. E assim se chama isso 0:11:08.629,0:11:10.629 -Locality which means +Localidade que significa 0:11:12.679,0:11:16.269 -There's a higher likelihood for things to have if like +Há uma maior probabilidade de as coisas terem se como 0:11:16.549,0:11:22.509 -The information is like containers in a specific region as you move around things get much much more +As informações são como contêineres em uma região específica à medida que você se move, as coisas ficam muito, muito mais 0:11:24.649,0:11:26.649 -You know independent +Você sabe independente 0:11:27.199,0:11:32.529 -Alright, so we have two properties. The third property is gonna be the following. What is this? +Tudo bem, então temos duas propriedades. A terceira propriedade será a seguinte. O que é isso? 0:11:33.829,0:11:35.829 -Are you hungry? +Está com fome? 0:11:37.579,0:11:41.769 -So you can see here some donuts right no donuts how you called +Então você pode ver aqui alguns donuts, sem donuts como você chamou 0:11:42.649,0:11:44.230 -Bagels, right? All right +Bagels, certo? Tudo bem 0:11:44.230,0:11:51.009 -So for the you the the one of you which have glasses take your glasses off and now answer my question +Então, para você, aquele de vocês que tem óculos, tire os óculos e agora responda à minha pergunta 0:11:53.179,0:11:55.179 -Okay +OK 0:11:59.210,0:12:01.210 -So the third property +Então a terceira propriedade 0:12:02.210,0:12:07.059 -It's compositionality right and so compositionality means that the +É certo de composicionalidade e assim composicionalidade significa que o 0:12:07.880,0:12:10.119 -Word is actually explainable, right? +A palavra é realmente explicável, certo? 0:12:11.060,0:12:13.060 -okay, you enjoy the +tudo bem, você gosta de 0:12:15.830,0:12:20.199 -The thing okay, you gotta get back to me right? I just try to keep your life +A coisa está bem, você tem que voltar para mim certo? Eu apenas tento manter sua vida 0:12:26.180,0:12:28.100 -Hello +Olá 0:12:28.100,0:12:33.520 -Okay. So for the one that doesn't have glasses ask the friend who has glasses and try them on. 
Okay now +OK. Então, para quem não tem óculos, pergunte ao amigo que tem óculos e experimente-os. Okay agora 0:12:34.430,0:12:36.430 -Don't do it if it's not good +Não faça isso se não for bom 0:12:37.010,0:12:43.659 -I'm just kidding. You can squint just queen don't don't don't use other people glasses. Okay? +Estou brincando. Você pode apertar os olhos apenas rainha, não, não use óculos de outras pessoas. OK? 0:12:44.990,0:12:46.990 -Question. Yeah +Pergunta. Sim 0:12:50.900,0:12:52.130 -So +assim 0:12:52.130,0:12:57.489 -Stationerity means you observe the same kind of pattern over and over again your data +Estacionaridade significa que você observa o mesmo tipo de padrão repetidamente em seus dados 0:12:58.160,0:13:01.090 -Locality means that pattern are just localized +Localidade significa que o padrão é apenas localizado 0:13:01.820,0:13:08.109 -So you have some specific information here some information here information here as you move away from this point +Então você tem algumas informações específicas aqui algumas informações aqui informações aqui conforme você se afasta deste ponto 0:13:08.270,0:13:10.270 -this other value is gonna be +esse outro valor vai ser 0:13:10.760,0:13:11.780 -almost +quase 0:13:11.780,0:13:15.249 -Independent from the value of this point here. So things are correlated +Independente do valor deste ponto aqui. Então as coisas estão correlacionadas 0:13:15.860,0:13:17.860 -Only within a neighborhood, okay +Apenas dentro de um bairro, ok 0:13:19.910,0:13:27.910 -Okay, everyone has been experimenting now squinting and looking at this nice picture, okay. So this is the third part which is compositionality +Ok, todo mundo tem experimentado agora apertando os olhos e olhando para esta bela foto, ok. Então esta é a terceira parte que é composicionalidade 0:13:28.730,0:13:32.289 -Here you can tell how you can actually see something +Aqui você pode dizer como você pode realmente ver algo 0:13:33.080,0:13:35.080 -If you blur it a little bit +Se você borrar um pouco 0:13:35.810,0:13:39.250 -because again things are made of small parts and you can actually +porque novamente as coisas são feitas de pequenas peças e você pode realmente 0:13:40.010,0:13:42.429 -You know compose things in this way +Você sabe compor as coisas dessa maneira 0:13:43.400,0:13:47.829 -anyhow, so these are the three main properties of natural signals, which +de qualquer forma, então essas são as três propriedades principais dos sinais naturais, que 0:13:48.650,0:13:50.650 -allow us to +permita-nos 0:13:51.260,0:13:55.960 -Can be exploited for making, you know, a design of our architecture, which is more +Pode ser explorado para fazer, você sabe, um projeto de nossa arquitetura, que é mais 0:13:56.600,0:14:00.880 -Actually prone to extract information that has these properties +Realmente propenso a extrair informações que tenham essas propriedades 0:14:00.880,0:14:05.169 -Okay, so we are just talking now about signals that exhibits those properties +Ok, então estamos falando agora sobre sinais que exibem essas propriedades 0:14:07.730,0:14:11.500 -Finally okay. There was the last one which I didn't talk so +Finalmente tudo bem. Teve o último que eu não falei então 0:14:12.890,0:14:18.159 -We had the last one here. We have an English sentence, right John picked up the apple +Tivemos o último aqui. 
Temos uma frase em inglês, certo John pegou a maçã 0:14:18.779,0:14:22.818 -whatever and here again, you can represent each word as +qualquer coisa e aqui novamente, você pode representar cada palavra como 0:14:23.399,0:14:26.988 -One vector, for example each of those items. It can be a +Um vetor, por exemplo, cada um desses itens. Pode ser um 0:14:27.869,0:14:30.469 -Vector which has a 1 in correspondent +Vetor que tem 1 no correspondente 0:14:31.110,0:14:35.329 -Correspondence to the position of where that word happens to be in a dictionary, okay +Correspondência para a posição de onde essa palavra está no dicionário, ok 0:14:35.329,0:14:39.709 -so if you have a dictionary of 10,000 words, you can just check whatever is the +então se você tem um dicionário de 10.000 palavras, você pode simplesmente checar o que for 0:14:40.679,0:14:44.899 -The word on this dictionary you just put the page plus the whatever number +A palavra neste dicionário você acabou de colocar a página mais o número 0:14:45.629,0:14:50.599 -Like you just figured that the position of the page in the dictionary. So also language +Como você acabou de descobrir que a posição da página no dicionário. Assim também a linguagem 0:14:51.899,0:14:56.419 -Has those kind of properties things that are close by have, you know +Tem esse tipo de propriedades que as coisas que estão por perto têm, você sabe 0:14:56.420,0:15:01.069 -Some kind of relationship things away are not less unless you know +Algum tipo de relacionamento, as coisas não são menos, a menos que você saiba 0:15:01.470,0:15:05.149 -Correlated and then similar patterns happen over and over again over +Padrões correlacionados e, em seguida, semelhantes acontecem repetidamente 0:15:05.819,0:15:12.558 -Moreover, you can use you know words make sentences to make full essays and to make finally your write-ups for the +Além disso, você pode usar as palavras que conhece para fazer frases para fazer redações completas e, finalmente, fazer suas redações para o 0:15:12.839,0:15:16.008 -Sessions. I'm just kidding. Okay. All right, so +Sessões. Estou brincando. OK. Tudo bem, então 0:15:17.429,0:15:19.789 -We already seen this one. So I'm gonna be going quite fast +Já vimos este. Então eu vou muito rápido 0:15:20.759,0:15:28.279 -there shouldn't be any I think questions because also we have everything written down on the website, right so you can always check the +acho que não deve haver nenhuma pergunta, porque também temos tudo escrito no site, certo para que você possa sempre verificar o 0:15:28.860,0:15:30.919 -summaries of the previous lesson on the website +resumos da lição anterior no site 0:15:32.040,0:15:39.349 -So fully connected layer. So this actually perhaps is a new version of the diagram. This is my X,Y is at the bottom +Camada tão totalmente conectada. Então, na verdade, talvez seja uma nova versão do diagrama. Este é o meu X,Y está na parte inferior 0:15:42.089,0:15:49.698 -Low level features. What's the color of the decks? Pink. Okay good. All right, so we have an arrow which represents my +Características de baixo nível. Qual a cor dos decks? Cor de rosa. OK, bom. Tudo bem, então temos uma seta que representa minha 0:15:51.299,0:15:54.439 -Yeah, fine that's the proper term, but I like to call them +Sim, tudo bem, esse é o termo adequado, mas eu gosto de chamá-los 0:15:55.410,0:16:02.299 -Rotations and then there is some squashing right? 
squashing means the non-linearity then I have my hidden layer then I have another +Rotações e depois há algum esmagamento certo? esmagamento significa a não linearidade, então eu tenho minha camada oculta, então eu tenho outra 0:16:04.379,0:16:06.379 -Rotation and a final +Rotação e um final 0:16:06.779,0:16:12.888 -Squashing. Okay. It's not necessary. Maybe can be a linear, you know final transformation like a linear +Esmagamento. OK. Não é necessário. Talvez possa ser um linear, você sabe a transformação final como um linear 0:16:14.520,0:16:18.059 -Whatever function they're like if you do if you perform a regression task +Seja qual for a função que eles são, se você fizer uma tarefa de regressão 0:16:19.750,0:16:21.750 -There you have the equations, right +Aí você tem as equações, certo 0:16:22.060,0:16:24.060 -And those guys can be any of those +E esses caras podem ser qualquer um desses 0:16:24.610,0:16:26.260 -nonlinear functions or +funções não lineares ou 0:16:26.260,0:16:33.239 -Even a linear function right if you perform regression once more and so you can write down these layers where I expand +Mesmo uma função linear né se você fizer a regressão mais uma vez e assim você pode anotar essas camadas onde eu expando 0:16:33.240,0:16:39.510 -So this guy here the the bottom guy is actually a vector and I represent the vector G with just one pole there +Então esse cara aqui embaixo é na verdade um vetor e eu represento o vetor G com apenas um polo ali 0:16:39.510,0:16:42.780 -I just show you all the five items elements of that vector +Acabei de mostrar todos os cinco elementos de itens desse vetor 0:16:43.030,0:16:45.239 -So you have the X the first layer? +Então você tem o X na primeira camada? 0:16:45.370,0:16:50.520 -Then you have the first hidden second hidden third hit and the last layer so we have how many layers? +Então você tem o primeiro segundo terceiro hit oculto e a última camada, então temos quantas camadas? 0:16:53.590,0:16:55.240 -Five okay +Cinco ok 0:16:55.240,0:16:56.950 -And then you can also call them +E então você também pode chamá-los 0:16:56.950,0:17:03.689 -activation layer 1 layer 2 3 4 whatever and then the matrices are where you store your +camada de ativação 1 camada 2 3 4 o que for e então as matrizes são onde você armazena seu 0:17:03.970,0:17:10.380 -Parameters you have those different W's and then in order to get each of those values you already seen the stuff, right? +Parâmetros você tem esses W's diferentes e então para pegar cada um desses valores você já viu as coisas, certo? 0:17:10.380,0:17:17.280 -So I go quite faster you perform just the scalar product. Which means you just do that thing +Então eu vou bem mais rápido você executa apenas o produto escalar. O que significa que você acabou de fazer aquela coisa 0:17:17.860,0:17:23.400 -You get all those weights. I multiply the input for each of those weights and you keep going like that +Você recebe todos esses pesos. Eu multiplico a entrada para cada um desses pesos e você continua assim 0:17:24.490,0:17:28.920 -And then you store those weights in those matrices and so on. So as you can tell +E então você armazena esses pesos nessas matrizes e assim por diante. 
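For concreteness, a minimal sketch of the fully connected forward pass being described: each layer is an affine map (the "rotation") followed by a squashing non-linearity. Layer widths here are hypothetical and biases are omitted.

```python
import torch

# Hypothetical fully connected stack: rotate (W @ h), then squash (tanh), repeat.
torch.manual_seed(0)
sizes = [5, 4, 4, 4, 3]                 # made-up widths: input, three hidden, output
h = torch.randn(sizes[0])               # the input vector x
for n_in, n_out in zip(sizes[:-1], sizes[1:]):
    W = torch.randn(n_out, n_in)        # the weight matrix stored for this layer
    h = torch.tanh(W @ h)               # scalar product per unit, then squashing
print(h.shape)                          # torch.Size([3])
```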
Então, como você pode dizer 0:17:30.700,0:17:37.019 -There is a lot of arrows right and regardless of the fact that I spent too many hours doing that drawing +Tem muitas flechas né e independente do fato de eu ter passado muitas horas fazendo aquele desenho 0:17:38.200,0:17:43.649 -This is also like very computationally expensive because there are so many computations right each arrow +Isso também é muito caro computacionalmente porque há tantos cálculos certos em cada seta 0:17:44.350,0:17:46.350 -represents a weight which you have to multiply +representa um peso que você tem que multiplicar 0:17:46.960,0:17:49.110 -for like by its own input +para like por sua própria entrada 0:17:49.870,0:17:51.870 -so +assim 0:17:52.090,0:17:53.890 -What can we do now? +O que podemos fazer agora? 0:17:53.890,0:17:55.150 -so +assim 0:17:55.150,0:17:57.150 -given that our information is +visto que nossas informações 0:17:57.700,0:18:04.679 -Has locality. No our data has this locality as a property. What does it mean if I had something here? +Tem localidade. Não os nossos dados têm esta localidade como propriedade. O que significa se eu tivesse algo aqui? 0:18:05.290,0:18:07.290 -Do I care what's happening here? +Eu me importo com o que está acontecendo aqui? 0:18:09.460,0:18:12.540 -So some of you are just shaking the hand and the rest of +Então, alguns de vocês estão apenas apertando a mão e o resto 0:18:13.000,0:18:17.219 -You are kind of I don't know not responsive and I have to ping you +Você é meio que não sei não responde e eu tenho que te dar um ping 0:18:18.140,0:18:18.900 -so +assim 0:18:18.900,0:18:25.849 -We have locality, right? So things are just in specific regions. You actually care to look about far away +Temos localidade, certo? Então as coisas estão apenas em regiões específicas. Você realmente se importa em olhar para longe 0:18:27.030,0:18:28.670 -No, okay. Fantastic +Não, tudo bem. Fantástico 0:18:28.670,0:18:32.119 -So let's simply drop some connections, right? +Então, vamos simplesmente descartar algumas conexões, certo? 0:18:32.130,0:18:38.660 -So here we go from layer L-1 to the layer L by using the first, you know five +Então aqui vamos da camada L-1 para a camada L usando a primeira, você conhece cinco 0:18:39.570,0:18:45.950 -Ten and fifteen, right? Plus I have the last one here to from the layer L to L+1 +Dez e quinze, certo? Além disso, eu tenho o último aqui da camada L para L+1 0:18:45.950,0:18:48.529 -I have three more right so in total we have +Eu tenho mais três, então no total temos 0:18:50.550,0:18:53.089 -Eighteen weights computations, right +Dezoito cálculos de pesos, certo 0:18:53.760,0:18:55.760 -so, how about we +então, que tal nós 0:18:56.370,0:19:01.280 -Drop the things that we don't care, right? So like let's say for this neuron, perhaps +Largue as coisas que não nos importamos, certo? Então, digamos, para este neurônio, talvez 0:19:01.830,0:19:04.850 -Why do we have to care about those guys there on the bottom, right? +Por que temos que nos preocupar com aqueles caras lá no fundo, certo? 0:19:05.160,0:19:08.389 -So, for example, I can just use those three weights, right? +Então, por exemplo, eu posso usar esses três pesos, certo? 0:19:08.390,0:19:12.770 -I just forget about the other two and then again, I just use those three weights +Eu apenas esqueço os outros dois e, novamente, eu apenas uso esses três pesos 0:19:12.770,0:19:15.229 -I skip the first and the last and so on +Eu pulo o primeiro e o último e assim por diante 0:19:16.170,0:19:23.570 -Okay. 
So right now we have just nine connections now just now nine multiplications and finally three more
+OK. Então agora temos apenas nove conexões agora apenas nove multiplicações e finalmente mais três

0:19:24.360,0:19:28.010
-so as we go from the left hand side to the right hand side we
+então, à medida que vamos do lado esquerdo para o lado direito,

0:19:28.920,0:19:32.149
-Climb the hierarchy and we're gonna have a larger and larger
+Suba na hierarquia e teremos uma cada vez maior

0:19:33.960,0:19:34.790
-View right
+Visão, certo

0:19:34.790,0:19:40.879
-so although these green bodies here and don't see the whole input is you keep climbing the
+então, embora esses corpos verdes aqui e não vejam toda a entrada, você continua subindo a

0:19:41.310,0:19:45.109
-Hierarchy you're gonna be able to see the whole span of the input, right?
+Hierarquia você poderá ver toda a extensão da entrada, certo?

0:19:46.590,0:19:48.590
-so in this case, we're going to be
+então, neste caso, vamos estar

0:19:49.230,0:19:55.760
-Defining the RF as receptive field. So my receptive field here from the last
+Definindo a RF como campo receptivo. Então meu campo receptivo aqui desde o último

0:19:56.400,0:20:03.769
-Neuron to the intermediate neuron is three. So what is gonna be? This means that the final neuron sees three
+Neurônio para o neurônio intermediário é três. Então o que vai ser? Isso significa que o neurônio final vê três

0:20:04.500,0:20:10.820
-Neurons from the previous layer. So what is the receptive field of the hidden layer with respect to the input layer?
+Neurônios da camada anterior. Então, qual é o campo receptivo da camada oculta em relação à camada de entrada?

0:20:14.970,0:20:21.199
-The answer was three. Yeah, correct, but what is now their septic field of the output layer with respect to the input layer
+A resposta foi três. Sim, correto, mas qual é agora o campo receptivo da camada de saída em relação à camada de entrada

0:20:23.549,0:20:25.549
-Five right. That's fantastic
+Cinco certo. Isso é fantástico

0:20:25.679,0:20:30.708
-Okay, sweet. So right now the whole architecture does see the whole input
+Ok, ótimo. Então agora toda a arquitetura vê toda a entrada

0:20:31.229,0:20:33.229
-while each sub part
+enquanto cada subparte

0:20:33.239,0:20:39.019
-Like intermediate layers only sees small regions and this is very nice because you will spare
+Como as camadas intermediárias só veem pequenas regiões e isso é muito bom porque você vai poupar

0:20:39.239,0:20:46.939
-Computations which are unnecessary because on average they have no whatsoever in information. And so we managed to speed up
+Cálculos que são desnecessários porque em média não têm qualquer informação. E assim conseguimos acelerar

0:20:47.669,0:20:50.059
-The computations that you actually can compute
+Os cálculos que você realmente pode calcular

0:20:51.119,0:20:53.208
-things in a decent amount of time
+coisas em um tempo razoável

0:20:54.809,0:20:58.998
-Clear so we can talk about sparsity only because
+Claro? Então podemos falar sobre esparsidade apenas porque

0:21:02.669,0:21:05.238
-We assume that our data shows
+Assumimos que nossos dados mostram

0:21:06.329,0:21:08.249
-locality, right
+localidade, certo

0:21:08.249,0:21:12.708
-Question if my data doesn't show locality. Can I use sparsity?
+Pergunta: se meus dados não mostram localidade, posso usar esparsidade?

0:21:16.139,0:21:19.279
-No, okay fantastic, okay. All right
+Não, tudo bem, fantástico, tudo bem.
Tudo bem 0:21:20.549,0:21:23.898 -more stuff so we also said that this natural signals are +mais coisas, então também dissemos que esses sinais naturais são 0:21:24.209,0:21:28.399 -Stationary and so given that they're stationary things appear over and over again +Estacionária e, portanto, dado que são estacionárias, as coisas aparecem repetidamente 0:21:28.399,0:21:34.008 -So maybe we don't have to learn again again the same stuff of all over the time right? So +Então talvez não tenhamos que aprender de novo as mesmas coisas de todo o tempo certo? assim 0:21:34.679,0:21:37.668 -In this case we said oh we drop those two lines, right? +Neste caso, dissemos oh, deixamos de lado essas duas linhas, certo? 0:21:38.729,0:21:41.179 -And so how about we use? +E então que tal usarmos? 0:21:41.969,0:21:46.999 -The first connection the oblique one from you know going in down +A primeira conexão a oblíqua de você conhece indo para baixo 0:21:47.549,0:21:52.158 -Make it yellow. So all of those are yellows then these are orange +Faça-o amarelo. Então todos esses são amarelos, então estes são laranja 0:21:52.859,0:21:57.139 -And then the final one are red, right? So how many weights do I have here? +E então o último é vermelho, certo? Então, quantos pesos eu tenho aqui? 0:21:59.639,0:22:01.639 -And I had over here +E eu tinha aqui 0:22:03.089,0:22:05.089 -Nine right and before we had +Nove à direita e antes que tivéssemos 0:22:06.749,0:22:09.769 -15 right so we drop from 15 to 3 +15 certo, então caímos de 15 para 3 0:22:10.529,0:22:14.958 -This is like a huge reduction and how perhaps now it is actually won't work +Isso é como uma grande redução e como talvez agora não funcione 0:22:14.969,0:22:16.759 -So we have to fix that in a bit +Então, temos que corrigir isso em um pouco 0:22:16.759,0:22:22.368 -But anyhow in this way when I train a network, I just had to train three weights the red +Mas de qualquer forma desta forma quando eu treino uma rede, eu só tive que treinar três pesos o vermelho 0:22:22.840,0:22:25.980 -sorry, the yellow orange and red and +desculpe, o amarelo laranja e vermelho e 0:22:26.889,0:22:30.959 -It's gonna be actually working even better because it just has to learn +Na verdade, vai funcionar ainda melhor porque só precisa aprender 0:22:31.749,0:22:37.079 -You're gonna have more information you have more data for you know training those specific weights +Você terá mais informações, terá mais dados para saber treinar esses pesos específicos 0:22:41.320,0:22:48.299 -So those are those three colors the yellow orange and red are gonna be called my kernel and so I stored them +Então essas são essas três cores, o amarelo laranja e o vermelho vão ser chamados de meu kernel e então eu as armazenei 0:22:48.850,0:22:50.850 -Into a vector over here +Em um vetor aqui 0:22:53.200,0:22:58.679 -And so those if you talk about you know convolutional careness those are simply the weight of these +E então aqueles se você fala sobre você sabe cuidados convolucionais esses são simplesmente o peso desses 0:22:59.200,0:22:59.909 -over here +por aqui 0:22:59.909,0:23:04.589 -Right the weights that we are using by using sparsity and then using parameter sharing +Corrija os pesos que estamos usando usando esparsidade e, em seguida, usando o compartilhamento de parâmetros 0:23:04.869,0:23:09.629 -Parameter sharing means you use the same parameter over over again across the architecture +Compartilhamento de parâmetros significa que você usa o mesmo parâmetro novamente em toda a arquitetura 
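A short sketch of the count above, assuming PyTorch: mapping 5 inputs to 3 outputs with a dense layer needs 15 weights, while sparsity plus parameter sharing reduces this to a single 3-weight kernel (the yellow/orange/red weights) that still produces 3 outputs.

```python
import torch
import torch.nn as nn

# Dense 5 -> 3 mapping versus one shared 3-weight kernel (sparsity + sharing).
dense = nn.Linear(5, 3, bias=False)
conv = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=3, bias=False)
print(sum(p.numel() for p in dense.parameters()))  # 15 weights
print(sum(p.numel() for p in conv.parameters()))   # 3 weights

x = torch.randn(1, 1, 5)       # (batch, channels, length): a length-5 input
print(conv(x).shape)           # torch.Size([1, 1, 3]) -- still 3 outputs
```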
0:23:10.330,0:23:15.090 -So there are the following nice properties of using those two combined +Portanto, existem as seguintes propriedades interessantes de usar esses dois combinados 0:23:15.490,0:23:20.699 -So parameter sharing gives us faster convergence because you're gonna have much more information +Assim, o compartilhamento de parâmetros nos dá uma convergência mais rápida porque você terá muito mais informações 0:23:21.399,0:23:23.549 -To use in order to train these weights +Para usar para treinar esses pesos 0:23:24.519,0:23:26.139 -You have a better +Você tem um melhor 0:23:26.139,0:23:32.008 -Generalization because you don't have to learn every time a specific type of thing that happened in different region +Generalização porque você não precisa aprender toda vez um tipo específico de coisa que aconteceu em uma região diferente 0:23:32.009,0:23:34.079 -You just learn something. That makes sense +Você acabou de aprender alguma coisa. Isso faz sentido 0:23:34.720,0:23:36.720 -You know globally +Você sabe globalmente 0:23:37.570,0:23:44.460 -Then we also have we are not constrained to the input size this is so important ray also Yann said this thing three times yesterday +Então nós também não estamos restritos ao tamanho da entrada isso é tão importante ray também Yann disse isso três vezes ontem 0:23:45.700,0:23:48.029 -Why are we not constrained to the input size? +Por que não estamos restritos ao tamanho da entrada? 0:23:54.039,0:24:00.449 -Because we can keep shifting in over right before in these other case if you have more neurons you have to learn new stuff +Porque podemos continuar mudando logo antes, neste outro caso, se você tiver mais neurônios, precisará aprender coisas novas 0:24:00.450,0:24:06.210 -Right, in this case. I can simply add more neurons and I keep using my weight across right that was +Certo, neste caso. 
Eu posso simplesmente adicionar mais neurônios e continuo usando meu peso certo que foi

0:24:07.240,0:24:09.809
-Some of the major points Yann, you know
+Alguns dos principais pontos que o Yann, você sabe

0:24:10.509,0:24:12.509
-highlighted yesterday
+destacou ontem

0:24:12.639,0:24:14.939
-Moreover we have the kernel independence
+Além disso, temos a independência do kernel

0:24:15.999,0:24:18.689
-So for the one of you they are interested in optimization
+Então, para aqueles de vocês que estão interessados em otimização

0:24:19.659,0:24:21.009
-optimizing like computation
+otimizando como computação

0:24:21.009,0:24:22.299
-this is so cool because
+isso é tão legal porque

0:24:22.299,0:24:29.189
-This kernel and another kernel are completely independent so you can train them you can paralyze is to make things go faster
+Este kernel e outro kernel são completamente independentes então você pode treiná-los, você pode paralelizar isso para fazer as coisas andarem mais rápido

0:24:33.580,0:24:38.549
-So finally we have also some connection sparsity property and so here we have a
+Então, finalmente, temos também alguma propriedade de esparsidade de conexão e aqui temos uma

0:24:39.070,0:24:41.700
-Reduced amount of computation, which is also very good
+Quantidade reduzida de computação, o que também é muito bom

0:24:42.009,0:24:48.659
-So all these properties allowed us to be able to train this network on a lot of data
+Então, todas essas propriedades nos permitiram treinar essa rede em muitos dados

0:24:48.659,0:24:55.739
-you still require a lot of data, but without having sparsity locality, so without having sparsity and
+você ainda precisa de muitos dados, mas sem ter localidade esparsa, portanto, sem ter esparsidade e

0:24:56.409,0:25:01.859
-Parameter sharing you wouldn't be able to actually finish training this network in a reasonable amount of time
+Compartilhamento de parâmetros, você não conseguiria concluir o treinamento desta rede em um período de tempo razoável

0:25:03.639,0:25:11.039
-So, let's see, for example now how this works when you have like audio signal which is how many dimensional signal
+Então, vamos ver, por exemplo, agora como isso funciona quando você tem um sinal de áudio, que é um sinal de quantas dimensões

0:25:12.279,0:25:17.849
-1 dimensional signal, right? Okay. So for example kernels for 1d data
+Sinal unidimensional, certo? OK. Então, por exemplo, kernels para dados 1d

0:25:18.490,0:25:24.119
-On the right hand side. You can see again. My my neurons can I'll be using my
+No lado direito. Você pode ver novamente. Meus meus neurônios posso estar usando meu

0:25:24.909,0:25:30.359
-Different the first scanner here. And so I'm gonna be storing my kernel there in that vector
+Diferente do primeiro kernel aqui. E então eu vou armazenar meu kernel lá nesse vetor

0:25:31.330,0:25:36.059
-For example, I can have a second kernel right. So right now we have two kernels the
+Por exemplo, posso ter um segundo kernel certo. Então agora temos dois kernels o

0:25:36.700,0:25:39.749
-Blue purple and pink and the yellow, orange and red
+Azul roxo e rosa e o amarelo, laranja e vermelho

0:25:41.559,0:25:44.158
-So let's say my output is r2
+Então vamos dizer que minha saída é r2

0:25:44.799,0:25:46.829
-So that means that each of those
+Isso significa que cada um desses

0:25:47.980,0:25:50.909
-Bubbles here. Each of those neurons are actually
+Bolhas aqui. Cada um desses neurônios é realmente

0:25:51.639,0:25:57.359
-One and two rightly come out from the from the board, right? So it's each of those are having a thickness of two
+Um e dois saem do quadro, certo? Então, cada um deles tem uma espessura de dois

0:25:58.929,0:26:02.819
-And let's say the other guy here are having a thickness of seven, right
+E digamos que o outro cara aqui está tendo uma espessura de sete, certo

0:26:02.990,0:26:07.010
-They are coming outside from the screen and they are you know, seven euros in this way
+Eles estão saindo da tela e são, você sabe, sete neurônios dessa maneira

0:26:08.070,0:26:13.640
-so in this case, my kernel are going to be of size 2 * 7 * 3
+então, neste caso, meu kernel terá tamanho 2 * 7 * 3

0:26:13.860,0:26:17.719
-So 2 means I have two kernels which are going from 7
+Então 2 significa que eu tenho dois kernels que vão de 7

0:26:18.240,0:26:20.070
-to give me
+para me dar

0:26:20.070,0:26:22.070
3

0:26:22.950,0:26:24.950
-Outputs
+Saídas

0:26:28.470,0:26:32.959
-Hold on my bad. So the 2 means you have ℝ² right here
+Espera, foi mal. Então o 2 significa que você tem ℝ² aqui

0:26:33.659,0:26:37.069
-Because you have two corners. So the first kernel will give you the first
+Porque você tem dois kernels. Então o primeiro kernel lhe dará o primeiro

0:26:37.679,0:26:41.298
-The first column here and the second kernel is gonna give you the second column
+A primeira coluna aqui e o segundo kernel vai te dar a segunda coluna

0:26:42.179,0:26:44.869
-Then it has to init 7
+Então tem que ter 7

0:26:45.210,0:26:50.630
-Because it needs to match all the thickness of the previous layer and then it has 3 because there are three
+Porque precisa combinar com toda a espessura da camada anterior e aí tem 3 porque são três

0:26:50.789,0:26:56.778
-Connections right? So maybe I miss I got confused before does it make sense the sizing?
+Conexões, certo? Então, talvez eu tenha me confundido antes; faz sentido o dimensionamento?

0:26:58.049,0:26:59.820
-so given that our
+assim dado que o nosso

0:26:59.820,0:27:03.710
-273 2 means you had 2 kernels and therefore you have two
+2 * 7 * 3: o 2 significa que você tinha 2 kernels e, portanto, você tem dois

0:27:04.080,0:27:08.000
-Items here like one a one coming out for each of those columns
+Itens aqui como um a um saindo para cada uma dessas colunas

0:27:08.640,0:27:15.919
-It has seven because each of these have a thickness of 7 and finally 3 means there are 3 connection connecting to the previous layer
+Tem sete porque cada um deles tem uma espessura de 7 e finalmente 3 significa que existem 3 conexões conectando à camada anterior

0:27:17.429,0:27:22.819
-Right so 1d data uses 3d kernels ok
+Certo, então os dados 1d usam kernels 3d ok

0:27:23.460,0:27:30.049
-so if I call this my collection of kernel, right, so if those are gonna be stored in a tensor
+então se eu chamar isso de minha coleção de kernel, certo, então se eles forem armazenados em um tensor

0:27:30.049,0:27:32.898
-This tensor will be a three dimensional tensor
+Este tensor será um tensor tridimensional

0:27:33.690,0:27:34.919
-so
+assim

0:27:34.919,0:27:37.939
-Question for you, if I'm gonna be playing now with images
+Pergunta para você, se eu vou brincar agora com imagens

0:27:38.580,0:27:40.580
-What is the size of?
+Qual é o tamanho de?

0:27:40.679,0:27:43.999
-You know full pack of kernels for an image
+Você conhece o pacote completo de kernels para uma imagem

0:27:45.809,0:27:47.809
-Convolutional net
+Rede convolucional

0:27:49.590,0:27:56.209
-Four right. 
So we're gonna have the number of kernels then it's going to be the number of the thickness +Quatro certo. Então, teremos o número de kernels, então será o número da espessura 0:27:56.730,0:28:00.589 -And then you're gonna have connections in height and connection in width +E então você terá conexões em altura e conexões em largura 0:28:01.799,0:28:03.179 -Okay +OK 0:28:03.179,0:28:09.798 -So if you're gonna be checking the currently convolutional kernels later on in your notebook, actually you should check that +Então, se você for verificar os kernels convolucionais atuais mais tarde em seu notebook, na verdade, você deve verificar isso 0:28:09.929,0:28:12.138 -You should find the same kind of dimensions +Você deve encontrar o mesmo tipo de dimensões 0:28:14.159,0:28:16.159 -All right, so +Tudo bem, então 0:28:18.059,0:28:20.478 -Questions so far, is this so clear?. Yeah +Perguntas até agora, isso é tão claro?. Sim 0:28:50.460,0:28:52.460 -Okay, so good question so +Ok, boa pergunta então 0:28:52.469,0:28:56.149 -trade-off about, you know sizing of those convolutions +trade-off sobre, você sabe o dimensionamento dessas circunvoluções 0:28:56.700,0:28:59.119 -convolutional kernels, right is it correct? Right +kernels convolucionais, certo está correto? Certo 0:28:59.909,0:29:06.409 -Three by three he seems to be like the minimum you can go for if you actually care about spatial information +Três por três, ele parece ser o mínimo que você pode obter se realmente se importa com informações espaciais 0:29:07.499,0:29:13.098 -As Yann pointed out you can also use one by one convolution. Oh, sorry one come one +Como Yann apontou, você também pode usar uma convolução uma por uma. Oh, desculpe, venha um 0:29:13.769,0:29:15.149 -like a +como um 0:29:15.149,0:29:20.718 -Convolution with which has only one weight or if you use like in images you have a one by one convolution +Convolução com que tem apenas um peso ou se usar como nas imagens tem uma convolução uma a uma 0:29:21.179,0:29:23.179 -Those are used in order to be +Esses são usados ​​para serem 0:29:23.309,0:29:24.570 -having like a +tendo como um 0:29:24.570,0:29:26.570 -final layer, which is still +camada final, que ainda 0:29:26.909,0:29:30.528 -Spatial still can be applied to a larger input image +Ainda espacial pode ser aplicado a uma imagem de entrada maior 0:29:31.649,0:29:36.138 -Right now we just use kernels that are three or maybe five +No momento, usamos apenas kernels que são três ou talvez cinco 0:29:36.929,0:29:42.348 -it's kind of empirical so it's not like we don't have like a magic formulas, but +é meio empírico, então não é como se não tivéssemos fórmulas mágicas, mas 0:29:43.349,0:29:44.279 -we've been +temos sido 0:29:44.279,0:29:50.329 -trying hard in the past ten years to figure out what is you know the best set of hyper parameters and if you check +tentando arduamente nos últimos dez anos para descobrir qual é o melhor conjunto de hiperparâmetros e se você verificar 0:29:50.969,0:29:55.879 -For each field like for a speech processing visual processing like image processing +Para cada campo como para um processamento de fala processamento visual como processamento de imagem 0:29:55.879,0:29:59.718 -You're gonna figure out what is the right compromise for your specific data? +Você vai descobrir qual é o compromisso certo para seus dados específicos? 
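A sketch of the dimension check suggested above, assuming PyTorch as in the course notebooks: the kernel tensor for 1-D data has three dimensions (the 2 × 7 × 3 example), and for images it has four.

```python
import torch.nn as nn

# Kernel tensors: (number of kernels, input thickness, connections per dimension).
conv1d = nn.Conv1d(in_channels=7, out_channels=2, kernel_size=3)
conv2d = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
print(conv1d.weight.shape)   # torch.Size([2, 7, 3])     -- the 2 * 7 * 3 example
print(conv2d.weight.shape)   # torch.Size([16, 3, 3, 3]) -- four numbers for images
```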
0:30:01.769,0:30:03.769 -Yeah +Sim 0:30:04.910,0:30:06.910 -Second +Segundo 0:30:07.970,0:30:12.279 -Okay, that's a good question why odd numbers why the kernel has an odd number +Ok, essa é uma boa pergunta por que números ímpares por que o kernel tem um número ímpar 0:30:14.390,0:30:16.220 -Of elements +De elementos 0:30:16.220,0:30:20.049 -So if you actually have a odd number of elements there would be a central element +Então, se você realmente tiver um número ímpar de elementos, haveria um elemento central 0:30:20.240,0:30:25.270 -Right. If you have a even number of elements there, we'll know there won't be a central value +Certo. Se você tiver um número par de elementos, saberemos que não haverá um valor central 0:30:25.370,0:30:27.880 -So if you have again odd number +Então, se você tiver novamente um número ímpar 0:30:27.880,0:30:30.790 -You know that from a specific point you're gonna be considering +Você sabe que a partir de um ponto específico você vai considerar 0:30:31.220,0:30:36.789 -Even number of left and even number of right items if it's a even size +Número par de itens à esquerda e número par de itens à direita se for um tamanho par 0:30:37.070,0:30:42.399 -Kernel that you actually don't know where the center is and the center is gonna be the average of two +Kernel que você realmente não sabe onde está o centro e o centro será a média de dois 0:30:43.040,0:30:48.310 -Neighboring samples which actually creates like a low-pass filter effect. So even +Amostras vizinhas que realmente criam um efeito de filtro passa-baixa. Então mesmo 0:30:49.220,0:30:51.910 -kernel sizes are not usually +tamanhos de kernel geralmente não são 0:30:52.580,0:30:56.080 -preferred or not usually used because they imply some kind of +preferidos ou não usualmente usados ​​porque implicam algum tipo de 0:30:57.290,0:30:59.889 -additional lowering of the quality of the data +redução adicional da qualidade dos dados 0:31:02.000,0:31:08.380 -Okay, so one more thing that we mentioned also yesterday its padding padding is something +Ok, então mais uma coisa que mencionamos também ontem, seu preenchimento é algo 0:31:09.590,0:31:16.629 -that if it has an effect on the final results is getting it worse, but it's very convenient for +que se isso afeta os resultados finais está piorando, mas é muito conveniente para 0:31:17.570,0:31:25.450 -programming side so if we've had our so as you can see here when we apply convolution from this layer you're gonna end up with +lado da programação, então, se tivermos o nosso, como você pode ver aqui, quando aplicarmos a convolução desta camada, você terminará com 0:31:27.680,0:31:31.359 -Okay, how many how many neurons we have here +Ok, quantos neurônios temos aqui 0:31:32.720,0:31:34.720 -three and we started from +três e começamos de 0:31:35.480,0:31:39.400 -five, so if we use a convolutional kernel of three +cinco, então se usarmos um kernel convolucional de três 0:31:40.490,0:31:42.490 -We lose how many neurons? +Perdemos quantos neurônios? 0:31:43.310,0:31:50.469 -Two, okay, one per side. If you're gonna be using a convolutional kernel of size five how much you're gonna be losing +Dois, ok, um de cada lado. Se você estiver usando um kernel convolucional de tamanho cinco, quanto você perderá 0:31:52.190,0:31:57.639 -Four right and so that's the rule user zero padding you have to add an extra +Quatro à direita e, portanto, essa é a regra de preenchimento zero do usuário, você precisa adicionar um extra 0:31:58.160,0:32:02.723 -Neuron here an extra neuron here. 
So you're gonna do number size of the kernel, right? +Neurônio aqui um neurônio extra aqui. Então você vai fazer o tamanho do número do kernel, certo? 0:32:02.723,0:32:05.800 -Three minus one divided by two and then you add that extra +Três menos um dividido por dois e então você adiciona esse extra 0:32:06.560,0:32:12.850 -Whatever number of neurons here, you've set them to zero. Why to zero? because usually you zero mean +Qualquer que seja o número de neurônios aqui, você os configurou para zero. Por que zerar? porque geralmente você zero significa 0:32:13.470,0:32:18.720 -Your inputs or your zero each layer output by using some normalization layers +Suas entradas ou zerar a saída de cada camada usando algumas camadas de normalização 0:32:19.900,0:32:21.820 -in this case +nesse caso 0:32:21.820,0:32:25.770 -Yeah, three comes from the size of the kernel and then you have that +Sim, três vem do tamanho do kernel e então você tem isso 0:32:26.740,0:32:28.630 -Some animation should be playing +Alguma animação deve estar tocando 0:32:28.630,0:32:31.289 -Yeah, you have one extra neuron there there then +Sim, você tem um neurônio extra lá então 0:32:31.289,0:32:37.289 -I have an extra neuron there such that finally you end up with these, you know ghosts neurons there +Eu tenho um neurônio extra lá de tal forma que finalmente você acaba com isso, você sabe, neurônios fantasmas lá 0:32:37.330,0:32:41.309 -But now you have the same number of input and the same number of output +Mas agora você tem o mesmo número de entrada e o mesmo número de saída 0:32:41.740,0:32:47.280 -And this is so convenient because if we started with I don't know 64 neurons you apply a convolution +E isso é tão conveniente porque se começamos com eu não sei 64 neurônios, você aplica uma convolução 0:32:47.280,0:32:54.179 -You still have 64 neurons and therefore you can use let's say max pooling of two you're going to end up at 32 neurons +Você ainda tem 64 neurônios e, portanto, você pode usar, digamos, o agrupamento máximo de dois, você terminará em 32 neurônios 0:32:54.179,0:32:57.809 -Otherwise you gonna have this I don't know if you consider one +Caso contrário você vai ter isso eu não sei se você considera um 0:32:58.539,0:33:01.019 -We have a odd number right so you don't know what to do +Temos um número ímpar certo, então você não sabe o que fazer 0:33:04.030,0:33:06.030 -after a bit, right? +depois de um tempo né? 0:33:08.320,0:33:10.320 -Okay, so +OK, então 0:33:10.720,0:33:12.720 -Yeah, and you have the same size +Sim, e você tem o mesmo tamanho 0:33:13.539,0:33:20.158 -All right. So, let's see how much time you have left. You have a bit of time. So, let's see how we use this +Tudo bem. Então, vamos ver quanto tempo você tem. Você tem um pouco de tempo. Então, vamos ver como usamos isso 0:33:21.130,0:33:27.270 -Convolutional net work in practice. So this is like the theory behind and we have said that we can use convolutions +Rede convolucional na prática. Então, isso é como a teoria por trás e dissemos que podemos usar convoluções 0:33:28.000,0:33:33.839 -So this is a convolutional operator. I didn't even define. What's a convolution. We just said that if our data has +Portanto, este é um operador convolucional. Eu nem defini. O que é uma convolução. 
Acabamos de dizer que se nossos dados 0:33:37.090,0:33:39.929 -Stationarity locality and is actually +têm estacionariedade, localidade e são, na verdade 0:33:42.130,0:33:45.689 -Compositional then we can exploit this by using +composicionais, então podemos explorar isso usando 0:33:49.240,0:33:51.240 -Weight sharing +Compartilhamento de peso 0:33:51.940,0:33:56.730 -Sparsity and then you know by stacking several of this layer. You have a like a hierarchy, right? +Esparsidade e então, você sabe, empilhando várias dessas camadas. Você tem um tipo de hierarquia, certo? 0:33:58.510,0:34:06.059 -So by using this kind of operation this is a convolution I didn't even define it I don't care right now maybe next class +Então, usando esse tipo de operação, isso é uma convolução, eu nem defini isso, não me importo agora, talvez na próxima aula 0:34:07.570,0:34:11.999 -So this is like the theory behind now, we're gonna see a little bit of practical +Então essa é a teoria por trás; agora, vamos ver um pouco da prática 0:34:12.429,0:34:15.628 -You know suggestions how we actually use this stuff in practice +Sabe, sugestões de como realmente usamos essas coisas na prática 0:34:16.119,0:34:22.229 -So next thing we have like a standard a spatial convolutional net which is operating which kind of data +Então, a próxima coisa: temos uma rede convolucional espacial padrão, que opera sobre que tipo de dados? 0:34:22.840,0:34:24.840 -If it's spatial +Se é espacial 0:34:25.780,0:34:28.229 -It's special because it's my network right special +É especial porque é a minha rede, né, especial 0:34:29.260,0:34:32.099 -Not just kidding so special as you know space +Não, só brincando; espacial no sentido de espaço, sabe 0:34:33.190,0:34:37.139 -So in this case, we have multiple layers, of course we stuck them +Então, neste caso, temos várias camadas, é claro que as empilhamos 0:34:37.300,0:34:42.419 -We also talked about why it's better to have several layers rather than having a fat layer +Também falamos sobre por que é melhor ter várias camadas em vez de ter uma única camada gorda (larga) 0:34:43.300,0:34:48.149 -We have convolutions. Of course, we have nonlinearities because otherwise +Temos convoluções. É claro que temos não linearidades porque, caso contrário, 0:34:55.270,0:34:56.560 -So +assim 0:34:56.560,0:35:04.439 -ok, next time we're gonna see how a convolution can be implemented with matrices but convolutions are just linear operator with which a lot of +ok, da próxima vez vamos ver como uma convolução pode ser implementada com matrizes, mas as convoluções são apenas operadores lineares com um monte de 0:35:04.440,0:35:07.470 -zeros and like replication of the same by the weights +zeros e, tipo, replicação dos mesmos pesos 0:35:07.570,0:35:13.019 -but otherwise if you don't use non-linearity a convolution of a convolution +mas, caso contrário, se você não usar não linearidade, uma convolução de uma convolução 0:35:13.020,0:35:16.679 -It's gonna be a convolution. So we have to clean up stuff +Vai ser uma convolução. Então temos que limpar as coisas 0:35:17.680,0:35:19.510 -that +que 0:35:19.510,0:35:25.469 -We have to like put barriers right? in order to avoid collapse of the whole network. We had some pooling operator +A gente tem que, tipo, colocar barreiras, né? Para evitar o colapso de toda a rede. Tínhamos algum operador de pooling 0:35:26.140,0:35:27.280 -which +que 0:35:27.280,0:35:33.989 -Geoffrey says that's you know, something already bad.
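As a hedged illustration of the point that a nonlinearity is needed between convolutions (otherwise a convolution of a convolution collapses back into a single linear convolution), a minimal block might look like the sketch below; the channel counts and image size are arbitrary assumptions.

```python
import torch
import torch.nn as nn

# Two stacked convolutions with a nonlinearity in between.
# Without the ReLU, conv2(conv1(x)) would still be a single linear
# convolution, so the extra layer would add no expressive power.
block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, padding=1),
    nn.ReLU(),
)

x = torch.randn(1, 3, 28, 28)
print(block(x).shape)   # torch.Size([1, 16, 28, 28])
```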
But you know, you're still doing that Hinton right Geoffrey Hinton +Geoffrey diz que é você sabe, algo já ruim. Mas você sabe, você ainda está fazendo isso Hinton certo Geoffrey Hinton 0:35:35.410,0:35:40.950 -Then we've had something that if you don't use it, your network is not gonna be training. So just use it +Então, tivemos algo que, se você não usar, sua rede não estará treinando. Então é só usar 0:35:41.560,0:35:44.339 -although we don't know exactly why it works but +embora não saibamos exatamente por que funciona, mas 0:35:45.099,0:35:48.659 -I think there is a question on Piazza. I will put a link there +Acho que há uma pergunta na Piazza. vou colocar um link lá 0:35:49.330,0:35:53.519 -About this batch normalization. Also Yann is going to be covering all the normalization layers +Sobre esta normalização de lote. Além disso, Yann cobrirá todas as camadas de normalização 0:35:54.910,0:36:01.889 -Finally we have something that also is quite recent which is called a receival or bypass connections +Finalmente, temos algo que também é bastante recente que é chamado de conexões de recebimento ou desvio 0:36:01.990,0:36:03.990 -Which are basically these? +Quais são basicamente estes? 0:36:04.240,0:36:05.859 extra 0:36:05.859,0:36:07.089 -connections +conexões 0:36:07.089,0:36:09.089 -Which allow me to +Que me permitem 0:36:09.250,0:36:10.320 -Get the network +Obtenha a rede 0:36:10.320,0:36:13.320 -You know the network decided whether whether to send information +Você sabe que a rede decidiu se deve enviar informações 0:36:13.780,0:36:18.780 -Through this line or actually send it forward if you stack so many many layers one after each other +Através desta linha ou realmente envie-a para frente se você empilhar tantas camadas uma após a outra 0:36:18.910,0:36:24.330 -The signal get lost a little bit after sometime if you add these additional connections +O sinal se perde um pouco depois de algum tempo se você adicionar essas conexões adicionais 0:36:24.330,0:36:27.089 -You always have like a path in order to go back +Você sempre tem como um caminho para voltar 0:36:27.710,0:36:31.189 -The bottom to the top and also to have gradients coming down from the top to the bottom +De baixo para cima e também ter gradientes descendo de cima para baixo 0:36:31.440,0:36:38.599 -so that's actually a very important both the receiver connection and the batch normalization are really really helpful to get this network to +então isso é realmente muito importante, tanto a conexão do receptor quanto a normalização do lote são realmente muito úteis para fazer com que essa rede 0:36:39.059,0:36:46.849 -Properly train if you don't use them then it's going to be quite hard to get those networks to really work for the training part +Treine adequadamente se você não os usar, então será muito difícil fazer com que essas redes realmente funcionem para a parte do treinamento 0:36:48.000,0:36:51.949 -So how does it work we have here an image, for example +Então como funciona temos aqui uma imagem, por exemplo 0:36:53.010,0:36:55.939 -Where most of the information is spatial information? +Onde a maior parte da informação é informação espacial? 
0:36:55.940,0:36:59.000 -So the information is spread across the two dimensions +Assim, a informação está espalhada pelas duas dimensões 0:36:59.220,0:37:04.520 -Although there is a thickness and I call the thickness as characteristic information +Embora haja uma espessura e eu chamo a espessura como informação característica 0:37:04.770,0:37:07.339 -Which means it provides a information? +O que significa que fornece uma informação? 0:37:07.890,0:37:11.569 -At that specific point. So what is my characteristic information? +Nesse ponto específico. Então, qual é a minha informação característica? 0:37:12.180,0:37:15.740 - in this image let's say it's a RGB image +nesta imagem digamos que é uma imagem RGB 0:37:16.680,0:37:18.680 -It's a color image right? +É uma imagem colorida certo? 0:37:19.230,0:37:27.109 -So we have the most of the information is spread on a spatial information. Like if you have me making funny faces +Assim temos que a maior parte da informação está espalhada em uma informação espacial. Como se você me fizesse fazer caretas 0:37:28.109,0:37:30.109 -but then at each point +mas então em cada ponto 0:37:30.300,0:37:33.769 -This is not a grayscale image is a color image, right? +Esta não é uma imagem em tons de cinza, é uma imagem colorida, certo? 0:37:33.770,0:37:39.199 -So each point will have an additional information which is my you know specific +Então cada ponto terá uma informação adicional que é minha você sabe 0:37:39.990,0:37:42.439 -Characteristic information. What is it in this case? +Informações características. O que é neste caso? 0:37:44.640,0:37:46.910 -It's a vector of three values which represent +É um vetor de três valores que representam 0:37:48.630,0:37:51.530 -RGB are the three letters by the __ as they represent +RGB são as três letras do __, pois representam 0:37:54.780,0:37:57.949 -Okay, overall, what does it represent like +Ok, no geral, como isso representa 0:37:59.160,0:38:02.480 -Yes intensity. Just you know, tell me in English without weird +Sim intensidade. Só você sabe, me diga em inglês sem estranho 0:38:03.359,0:38:05.130 -things +coisas 0:38:05.130,0:38:11.480 -The color of the pixel, right? So my specific information. My characteristic information. Yeah. I don't know what you're saying +A cor do pixel, certo? Então minhas informações específicas. Minha informação característica. Sim. Eu não sei o que você está dizendo 0:38:11.480,0:38:18.500 -Sorry, the characteristic information in this case is just a color right so the color is the only information that is specific there +Desculpe, a informação característica neste caso é apenas uma cor certa então a cor é a única informação que é específica lá 0:38:18.500,0:38:20.780 -But then otherwise information is spread around +Mas, caso contrário, a informação é espalhada por aí 0:38:21.359,0:38:23.359 -As if we climb climb the hierarchy +Como se subíssemos escalar a hierarquia 0:38:23.730,0:38:31.189 -You can see now some final vector which has let's say we are doing classification in this case. So my +Você pode ver agora algum vetor final que digamos que estamos fazendo a classificação neste caso. 
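To make the trade between spatial information and characteristic (channel) information concrete, the sketch below prints the shapes through a few convolution + pooling stages: the channel thickness grows while the height and width shrink. The channel counts and input size are arbitrary assumptions.

```python
import torch
import torch.nn as nn

stages = nn.ModuleList([
    nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
    nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
    nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
])

x = torch.randn(1, 3, 32, 32)    # RGB image: thin in channels, wide in space
for stage in stages:
    x = stage(x)
    print(x.shape)
# torch.Size([1, 16, 16, 16])
# torch.Size([1, 32, 8, 8])
# torch.Size([1, 64, 4, 4])      # thick in channels, little spatial extent left
```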
Então meu 0:38:31.770,0:38:36.530 -You know the height and width or the thing is going to be one by one so it's just one vector +Sabe, a altura e a largura da coisa vão ser um por um, então é apenas um vetor 0:38:37.080,0:38:43.590 -And then let's say there you have the specific final logit, which is the highest one so which is representing the class +E então digamos que ali você tenha o logit final específico, que é o mais alto, e que representa a classe 0:38:43.590,0:38:47.400 -Which is most likely to be the correct one if it's trained well +Que é a mais provável de ser a correta, se for bem treinado 0:38:48.220,0:38:51.630 -in the Midway, you have something that is, you know a trade-off between +No meio do caminho, você tem algo que é, sabe, um compromisso (trade-off) entre 0:38:52.330,0:38:59.130 -Spatial information and then these characteristic information. Okay. So basically it's like a conversion between +Informação espacial e essa informação característica. OK. Então, basicamente, é como uma conversão de 0:39:00.070,0:39:01.630 -spatial information +informação espacial 0:39:01.630,0:39:03.749 -into this characteristic information +para esta informação característica 0:39:04.360,0:39:07.049 -Do you see so it basically go from a thing? +Estão vendo? Então basicamente vai de uma coisa, dos dados de 0:39:07.660,0:39:08.740 -input +entrada 0:39:08.740,0:39:13.920 -Data to something. It is very thick, but then has no more information spatial information +para algo que é muito espesso, mas que depois não tem mais informação espacial 0:39:14.710,0:39:20.760 -and so you can see here with my ninja PowerPoint skills how you can get you know a +e assim você pode ver aqui, com minhas habilidades ninja de PowerPoint, como você obtém, sabe, uma 0:39:22.240,0:39:27.030 -Reduction of the ___ thickener like a figure thicker in our presentation +Redução do ___ e um espessamento, como uma figura mais espessa em nossa apresentação 0:39:27.070,0:39:30.840 -Whereas you actually lose the spatial special one +Enquanto você, na verdade, perde a parte espacial 0:39:32.440,0:39:39.870 -Okay, so that was oh one more pooling so pooling is simply again for example +Ok, então era isso. Ah, mais uma coisa, o pooling (agrupamento); o pooling é simplesmente, novamente, por exemplo 0:39:41.620,0:39:43.600 -It can be performed in this way +Pode ser realizado desta forma 0:39:43.600,0:39:48.660 -So there you have some hand drawing because I didn't want to do you have time to make it in latex? +Então aí você tem um desenho à mão, porque eu não tive tempo de fazer em LaTeX 0:39:49.270,0:39:52.410 -So you have different regions you apply a specific? +Então você tem diferentes regiões, às quais você aplica um 0:39:53.500,0:39:57.060 -Operator to that specific region, for example, you have the P norm +Operador específico para aquela região, por exemplo, você tem a norma p 0:39:58.150,0:39:59.680 -and then +e então 0:39:59.680,0:40:02.760 -Yes, the P goes to plus infinity. You have the Max +Sim, se o p vai para mais infinito, você tem o máximo 0:40:03.730,0:40:09.860 -And then that one is not give you one value right then you perform a stride. +E então esse vai te dar um valor, certo; então você executa um passo (stride).
0:40:09.860,0:40:12.840 -jump to Pixels further and then you again you compute the same thing +pula dois pixels adiante e então você novamente calcula a mesma coisa 0:40:12.840,0:40:18.150 -you're gonna get another value there and so on until you end up from +você vai obter outro valor lá e assim sucessivamente, até que você acabe: de 0:40:18.700,0:40:24.900 -Your data which was m by n with c channels you get still c channels +Seus dados, que eram m por n com c canais, você ainda obtém c canais 0:40:24.900,0:40:31.199 -But then in this case you gonna get m/2 and c and n/2. Okay, and this is for images +Mas então, neste caso, você obterá m/2 e n/2, com os c canais. Ok, e isso é para imagens 0:40:35.029,0:40:41.079 -There are no parameters on the pooling how you can nevertheless choose which kind of pooling, right you can choose max pooling +Não há parâmetros no pooling; você pode, no entanto, escolher o tipo de pooling, certo: você pode escolher max pooling 0:40:41.390,0:40:44.229 -Average pooling any pooling is wrong. So +Agrupamento médio, qualquer agrupamento está errado. Então 0:40:45.769,0:40:48.879 -Yeah, let's also the problem, okay, so +Sim, vamos também o problema, ok, então 0:40:49.999,0:40:55.809 -This was the mean part with the slides. We are gonna see now the notebooks will go a bit slower this time +Esta foi a parte principal, com os slides. Vamos ver agora os notebooks; vamos um pouco mais devagar desta vez 0:40:55.809,0:40:58.508 -I noticed that last time I kind of rushed +Percebi que da última vez eu meio que me apressei 0:40:59.900,0:41:02.529 -Are there any questions so far on this part that we cover? +Há alguma pergunta até agora sobre esta parte que abordamos? 0:41:04.519,0:41:06.519 -Yeah +Sim 0:41:10.670,0:41:12.469 -So there is like +Então tem, tipo 0:41:12.469,0:41:17.769 -Geoffrey Hinton is renowned for saying that max pooling is something which is just +Geoffrey Hinton é conhecido por dizer que o max pooling é algo que é simplesmente 0:41:18.259,0:41:23.319 -Wrong because you just throw away information as you average or you take the max you just throw away things +Errado, porque você simplesmente joga fora informação: ao tirar a média ou pegar o máximo, você joga coisas fora 0:41:24.380,0:41:29.140 -He's been working on like something called capsule networks, which have you know specific +Ele tem trabalhado em algo chamado redes de cápsulas (capsule networks), que têm, sabe, específicos 0:41:29.660,0:41:33.849 -routing paths that are choosing, you know some +caminhos de roteamento que escolhem, sabe, algumas 0:41:34.519,0:41:41.319 -Better strategies in order to avoid like throwing away information. Okay. Basically that's the the argument behind yeah +Estratégias melhores para evitar jogar fora informação. OK. Basicamente, esse é o argumento por trás disso, sim 0:41:45.469,0:41:52.329 -Yes, so the main purpose of using this pooling or the stride is actually to get rid of a lot of data such that you +Sim, então o objetivo principal de usar esse pooling ou o stride é, na verdade, livrar-se de muitos dados, de modo que você 0:41:52.329,0:41:54.579 -Can compute things in a reasonable amount of time? +Possa computar as coisas em uma quantidade razoável de tempo
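A quick check of the pooling behaviour described above: an m x n input with c channels becomes m/2 x n/2 with the same c channels, and the pooling itself has no learnable parameters. The tensor sizes below are arbitrary, and the LP pooling layer stands in for the p-norm pooling mentioned in the lecture.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 32, 32)       # c = 8 channels, m = n = 32

max_pool = nn.MaxPool2d(kernel_size=2, stride=2)              # the p -> infinity case
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
lp_pool = nn.LPPool2d(norm_type=2, kernel_size=2, stride=2)   # p = 2 norm over each region

for pool in (max_pool, avg_pool, lp_pool):
    n_params = sum(p.numel() for p in pool.parameters())
    print(pool(x).shape, n_params)
# each prints torch.Size([1, 8, 16, 16]) 0: same channels, halved m and n, no parameters
```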
0:41:54.619,0:42:00.939 -Usually you need a lot of stride or pooling at the first layers at the bottom because otherwise it's absolutely you know +Normalmente você precisa de muito passo ou pooling nas primeiras camadas na parte inferior porque, caso contrário, é absolutamente você saber 0:42:01.339,0:42:03.339 -Too computationally expensive +Muito caro computacionalmente 0:42:03.979,0:42:05.979 -Yeah +Sim 0:42:21.459,0:42:23.459 -So on that sit +Então, naquele sentar 0:42:24.339,0:42:32.068 -Those network architectures are so far driven by you know the state of the art, which is completely an empirical base +Essas arquiteturas de rede são até agora impulsionadas por você conhece o estado da arte, que é completamente uma base empírica 0:42:33.279,0:42:40.109 -we try hard and we actually go to I mean now we actually arrive to some kind of standard so a +nós nos esforçamos e na verdade vamos para, quero dizer, agora chegamos a algum tipo de padrão, então um 0:42:40.359,0:42:44.399 -Few years back. I was answering like I don't know but right now we actually have +Alguns anos atrás. Eu estava respondendo como se eu não soubesse, mas agora nós realmente temos 0:42:45.099,0:42:47.049 -Determined some good configurations +Determinado algumas boas configurações 0:42:47.049,0:42:53.968 -Especially using those receiver connections and the batch normalization. We actually can get to train basically everything +Especialmente usando essas conexões de receptor e a normalização de lote. Na verdade, podemos treinar basicamente tudo 0:42:54.759,0:42:56.759 -Yeah +Sim 0:43:05.859,0:43:11.038 -So basically you're gonna have your gradient at a specific point coming down as well +Então, basicamente, você terá seu gradiente em um ponto específico descendo também 0:43:11.039,0:43:13.679 -And then you have the other gradient coming down down +E então você tem o outro gradiente descendo 0:43:13.839,0:43:18.238 -Then you had a branch right a branching and if you have branch what's happening with the gradient? +Então você tinha uma ramificação certa uma ramificação e se tiver ramificação o que está acontecendo com o gradiente? 0:43:19.720,0:43:25.439 -That's correct. Yeah, they get added right so you have the two gradients coming from two different branches getting added together +Está correto. Sim, eles são adicionados corretamente, então você tem os dois gradientes provenientes de dois ramos diferentes sendo adicionados juntos 0:43:26.470,0:43:31.769 -All right. So let's go to the notebook such that we can cover we don't rush too much +Tudo bem. Então vamos ao caderno para que possamos cobrir não nos apressarmos muito 0:43:32.859,0:43:37.139 -So here I just go through the convnet part. So here I train +Então aqui eu apenas passo pela parte do convnet. Então aqui eu treino 0:43:39.519,0:43:41.289 -Initially I +Inicialmente eu 0:43:41.289,0:43:43.979 -Load the MNIST data set so I show you a few +Carregue o conjunto de dados MNIST, então mostro alguns 0:43:44.680,0:43:45.849 -characters here +personagens aqui 0:43:45.849,0:43:52.828 -Okay, and I train now a multi-layer perceptron like a fully connected Network like a mood, you know +Ok, e eu treino agora um perceptron multicamada como uma rede totalmente conectada como um humor, você sabe 0:43:53.440,0:44:00.509 -Yeah, fully connected Network and a convolutional neural net which have the same number of parameters. Okay. So these two models will have the same +Sim, rede totalmente conectada e uma rede neural convolucional que possui o mesmo número de parâmetros. OK. 
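The remark that gradients coming from two branches get added at the branching point can be verified with a small scalar autograd example; the function below is an arbitrary toy.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# Two branches starting from the same node x.
a = 3 * x        # da/dx = 3
b = x ** 2       # db/dx = 2x = 4 at x = 2

y = a + b        # the two branches merge again
y.backward()

# At the branch, the gradients of the two paths are summed: 3 + 4 = 7.
print(x.grad)    # tensor(7.)
```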
Então esses dois modelos terão o mesmo 0:44:01.150,0:44:05.819 -Dimension in terms of D. If you save them we'll wait the same so +Dimensão em termos de D. Se você salvá-los vamos esperar o mesmo para 0:44:07.269,0:44:11.219 -I'm training here this guy here with the fully connected Network +Estou treinando aqui esse cara aqui com a Rede totalmente conectada 0:44:12.640,0:44:14.640 -It takes a little bit of time +Leva um pouco de tempo 0:44:14.829,0:44:21.028 -And he gets some 87% Okay. This is trained on classification of the MNIST digits from Yann +E ele fica 87% bem. Isso é treinado na classificação dos dígitos MNIST de Yann 0:44:21.999,0:44:24.419 -We actually download from his website if you check +Na verdade, baixamos do site dele, se você verificar 0:44:25.239,0:44:32.189 -Anyhow, I train a convolutional neural net with the same number of parameters what you expect to have a better a worse result +De qualquer forma, eu treino uma rede neural convolucional com o mesmo número de parâmetros que você espera ter um resultado melhor ou pior 0:44:32.349,0:44:35.548 -So my multi-layer perceptron gets 87 percent +Então, meu perceptron multicamada obtém 87% 0:44:36.190,0:44:38.190 -What do we get with a convolutional net? +O que obtemos com uma rede convolucional? 0:44:41.739,0:44:43.739 -Yes, why +Sim porque 0:44:46.910,0:44:50.950 -Okay, so what is the point here of using sparsity what does it mean +Ok, então qual é o ponto aqui de usar esparsidade o que isso significa 0:44:52.640,0:44:55.089 -Given that we have the same number of parameters +Dado que temos o mesmo número de parâmetros 0:44:56.690,0:44:58.690 -We manage to train much +Conseguimos treinar muito 0:44:59.570,0:45:05.440 -more filters right in the second case because in the first case we use filters that are completely trying to get some +mais filtros certo no segundo caso porque no primeiro caso usamos filtros que estão tentando obter algum 0:45:05.960,0:45:12.549 -dependencies between things that are further away with things that are closed by so they are completely wasted basically they learn 0 +dependências entre coisas que estão mais distantes com coisas que estão fechadas por isso são completamente desperdiçadas basicamente elas aprendem 0 0:45:12.830,0:45:19.930 -Instead in the convolutional net. I have all these parameters. They're just concentrated for figuring out. What is the relationship within a +Em vez disso, na rede convolucional. Eu tenho todos esses parâmetros. Eles estão apenas concentrados para descobrir. Qual é a relação dentro de um 0:45:20.480,0:45:23.799 -Neighboring pixels. All right. So now it takes the pictures I +Pixels vizinhos. Tudo bem. Então agora ele tira as fotos que eu 0:45:24.740,0:45:26.740 -Shake everything just got scrambled +Agite tudo acabou de ser mexido 0:45:27.410,0:45:33.369 -But I keep the same I scramble the same same way all the images. So I perform a random permutation +Mas mantenho o mesmo embaralhe da mesma forma todas as imagens. Então eu executo uma permutação aleatória 0:45:34.850,0:45:38.710 -Always the same random permutation of all my images or the pixels on my images +Sempre a mesma permutação aleatória de todas as minhas imagens ou pixels nas minhas imagens 0:45:39.500,0:45:41.090 -What does it happen? +O que acontece? 
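A sketch of the scrambling experiment described here: one fixed random permutation of the 28 x 28 pixel positions is drawn once and applied to every image, while the labels stay untouched. Random tensors stand in for the actual MNIST batch, so this is illustrative rather than the course notebook's code.

```python
import torch

torch.manual_seed(0)
perm = torch.randperm(28 * 28)          # one fixed permutation, reused for every image

def scramble(images):
    """Apply the same pixel permutation to a batch of 1x28x28 images."""
    b = images.shape[0]
    flat = images.reshape(b, -1)        # flatten the spatial dimensions
    return flat[:, perm].reshape(b, 1, 28, 28)

images = torch.randn(16, 1, 28, 28)     # stand-in for a batch of MNIST digits
scrambled = scramble(images)
print(scrambled.shape)                  # torch.Size([16, 1, 28, 28])

# Labels are untouched: a scrambled "4" is still labelled 4.
# A fully connected net is indifferent to this reordering, while a
# convolutional net loses the locality its kernels rely on.
```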
0:45:41.090,0:45:43.299 -If I train both networks +Se eu treinar as duas redes 0:45:47.990,0:45:50.049 -So here I trained see here +Então aqui eu treinei veja aqui 0:45:50.050,0:45:56.950 -I have my pics images and here I just scrambled with the same scrambling function all the pixels +Eu tenho minhas imagens de fotos e aqui eu apenas embaralhei com a mesma função de embaralhamento todos os pixels 0:46:00.200,0:46:04.240 -All my inputs are going to be these images here +Todas as minhas entradas serão essas imagens aqui 0:46:06.590,0:46:10.870 -The output is going to be still the class of the original so this is a four you +A saída ainda será a classe do original, então este é um quatro você 0:46:11.450,0:46:13.780 -Can see this this is a four. This is a nine +Pode ver que isso é um quatro. Este é um nove 0:46:14.920,0:46:19.889 -This is a 1 this is a 7 is a 3 in this is a 4 so I keep the same labels +Este é um 1 este é um 7 é um 3 este é um 4 então eu mantenho os mesmos rótulos 0:46:19.930,0:46:24.450 -But I scrambled the order of the pixels and I perform the same scrambling every time +Mas eu embaralhei a ordem dos pixels e executo o mesmo embaralhamento todas as vezes 0:46:25.239,0:46:27.239 -What do you expect is performance? +O que você espera é desempenho? 0:46:31.029,0:46:33.299 -Who's better who's working who's the same? +Quem é melhor quem está trabalhando quem é o mesmo? 0:46:38.619,0:46:46.258 -Perception how does it do with the perception? Does he see any difference? No, okay. So the guy still 83 +Percepção como faz com a percepção? Ele vê alguma diferença? Não, tudo bem. Então o cara ainda 83 0:46:47.920,0:46:49.920 -Yann's network +rede de Yann 0:46:52.029,0:46:54.029 -What do you guys +O que vocês 0:47:04.089,0:47:09.988 -Know that's a fully connected. Sorry. I'll change the order. Yeah, see. Okay. There you go +Saiba que é um totalmente conectado. Desculpe. Vou mudar a ordem. Sim, veja. OK. Ai está 0:47:12.460,0:47:14.999 -So I can't even show you this thing +Então eu não posso nem te mostrar essa coisa 0:47:17.920,0:47:18.730 -All right +Tudo bem 0:47:18.730,0:47:24.659 -So the fully connected guy basically performed the same the differences are just basic based on the initial +Então, o cara totalmente conectado basicamente executou o mesmo, as diferenças são apenas básicas com base no inicial 0:47:25.059,0:47:30.899 -The random initialization the convolutional net which was winning by kind of large advance +A inicialização aleatória da rede convolucional que estava ganhando por meio de grande avanço 0:47:31.509,0:47:33.509 -advantage before actually performs +vantagem antes de realmente realizar 0:47:34.059,0:47:38.008 -Kind of each similarly, but I mean worse than much worse than before +Tipo de cada um de forma semelhante, mas quero dizer pior do que muito pior do que antes 0:47:38.499,0:47:42.449 -Why is the convolutional network now performing worse than my fully connected Network? +Por que a rede convolucional agora está tendo um desempenho pior do que minha rede totalmente conectada? 
0:47:44.829,0:47:46.829 -Because we fucked up +Porque nós fodemos 0:47:47.739,0:47:55.379 -Okay, and so every time you use a convolutional network, you actually have to think can I use of convolutional network, okay +Ok, e toda vez que você usa uma rede convolucional, você realmente tem que pensar que posso usar uma rede convolucional, ok 0:47:56.440,0:47:59.700 -If it holds now, you have the three properties then yeah +Se valer agora, você tem as três propriedades, então sim 0:47:59.700,0:48:05.759 -Maybe of course, it should be giving you a better performance if those three properties don't hold +Talvez, é claro, devesse oferecer um desempenho melhor se essas três propriedades não se mantiverem 0:48:06.579,0:48:09.058 -then using convolutional networks is +então usar redes convolucionais é 0:48:11.499,0:48:17.939 -BS right, which was the bias? No. Okay. Never mind. All right. Well, good night +BS certo, qual foi o viés? Não. Ok. Deixa pra lá. Tudo bem. Bem boa noite \ No newline at end of file diff --git a/docs/pt/week05/lecture05.sbv b/docs/pt/week05/lecture05.sbv index 8d3d55657..402b9231e 100644 --- a/docs/pt/week05/lecture05.sbv +++ b/docs/pt/week05/lecture05.sbv @@ -1,3572 +1,3572 @@ 0:00:00.000,0:00:04.410 -All right so as you can see today we don't have Yann. Yann is somewhere else +Tudo bem, como você pode ver hoje não temos Yann. Yann está em outro lugar 0:00:04.410,0:00:09.120 -having fun. Hi Yann. Okay so today's that we have +se divertindo. Olá Yan. Ok, então é hoje que temos 0:00:09.120,0:00:13.740 -Aaron DeFazio he's a research scientist at Facebook working mostly on +Aaron DeFazio ele é um cientista de pesquisa no Facebook trabalhando principalmente em 0:00:13.740,0:00:16.619 -optimization he's been there for the past three years +otimização ele está lá nos últimos três anos 0:00:16.619,0:00:21.900 -and before he was a data scientist at Ambiata and then a student at the +e antes de ser cientista de dados na Ambiata e depois estudante da 0:00:21.900,0:00:27.599 -Australian National University so why don't we give a round of applause to the +Australian National University, então por que não damos uma salva de palmas ao 0:00:27.599,0:00:37.350 -our speaker today I'll be talking about optimization and if we have time at the +nosso palestrante hoje falarei sobre otimização e se tivermos tempo no 0:00:37.350,0:00:42.739 -end the death of optimization so these are the topics I will be covering today +acabar com a morte da otimização, então esses são os tópicos que abordarei hoje 0:00:42.739,0:00:47.879 -now optimization is at the heart of machine learning and some of the things +agora a otimização está no centro do aprendizado de máquina e algumas das coisas 0:00:47.879,0:00:52.680 -are going to be talking about today will be used every day in your role +vai falar hoje será usado todos os dias no seu papel 0:00:52.680,0:00:56.640 -potentially as an applied scientist or even as a research scientist or a data +potencialmente como um cientista aplicado ou mesmo como um cientista de pesquisa ou um 0:00:56.640,0:01:01.590 -scientist and I'm gonna focus on the application of these methods +cientista e vou me concentrar na aplicação desses métodos 0:01:01.590,0:01:05.850 -particularly rather than the theory behind them part of the reason for this +particularmente, em vez da teoria por trás deles, parte da razão para isso 0:01:05.850,0:01:10.260 -is that we don't fully understand all of these methods so for me to come up here +é que não entendemos completamente todos esses métodos, então 
para eu vir até aqui 0:01:10.260,0:01:15.119 -and say this is why it works I would be oversimplifying things but what I can +e dizer que é por isso que funciona, eu estaria simplificando demais as coisas, mas o que posso 0:01:15.119,0:01:22.320 -tell you is how to use them how we know that they work in certain situations and +dizer é como usá-los como sabemos que eles funcionam em determinadas situações e 0:01:22.320,0:01:28.320 -what the best method may be to use to train your neural network and to +qual pode ser o melhor método para treinar sua rede neural e 0:01:28.320,0:01:31.770 -introduce you to the topic of optimization I need to start with the +apresentá-lo ao tópico de otimização, preciso começar com o 0:01:31.770,0:01:36.720 -worst method in the world gradient descent and I'll explain in a minute why +pior método do mundo gradiente descendente e explicarei em um minuto por que 0:01:36.720,0:01:43.850 -it's the worst method but to begin with we're going to use the most generic +é o pior método, mas para começar vamos usar o mais genérico 0:01:43.850,0:01:47.549 -formulation of optimization now the problems you're going to be considering +formulação de otimização agora os problemas que você vai considerar 0:01:47.549,0:01:51.659 -will have more structure than this but it's very useful useful notationally to +terá mais estrutura do que isso, mas é muito útil em notação para 0:01:51.659,0:01:56.969 -start this way so we talked about a function f now we're trying to prove +comece desta forma, então falamos sobre uma função f agora estamos tentando provar 0:01:56.969,0:02:03.930 -properties of our optimizer will assume additional structure on f but in +propriedades do nosso otimizador assumirão estrutura adicional em f, mas em 0:02:03.930,0:02:07.049 -practice the structure in our neural networks essentially obey no of the +praticar a estrutura em nossas redes neurais essencialmente obedecem a nenhuma das 0:02:07.049,0:02:09.239 -assumptions none of the assumptions people make in +suposições nenhuma das suposições que as pessoas fazem em 0:02:09.239,0:02:12.030 -practice I'm just gonna start with the generic F +prática eu vou começar com o F genérico 0:02:12.030,0:02:17.070 -and we'll assume it's continuous and differentiable even though we're already +e vamos supor que é contínuo e diferenciável mesmo que já estejamos 0:02:17.070,0:02:20.490 -getting into the realm of incorrect assumptions since the neural networks +entrando no reino de suposições incorretas, uma vez que as redes neurais 0:02:20.490,0:02:25.170 -most people are using in practice these days are not differentiable instead you +a maioria das pessoas está usando na prática hoje em dia não são diferenciáveis, em vez disso você 0:02:25.170,0:02:29.460 -have a equivalent sub differential which you can essentially plug into all these +tem um subdiferencial equivalente que você pode essencialmente conectar a todos esses 0:02:29.460,0:02:33.570 -formulas and if you cross your fingers there's no theory to support this it +fórmulas e se você cruzar os dedos não há teoria para apoiar isso 0:02:33.570,0:02:38.910 -should work so the method of gradient descent is shown here it's an iterative +deve funcionar para que o método de gradiente descendente seja mostrado aqui, é um iterativo 0:02:38.910,0:02:44.790 -method so you start at a point k equals zero and at each step you update your +método para que você comece em um ponto k igual a zero e a cada passo você atualiza seu 0:02:44.790,0:02:49.410 -point and here we're going to use W 
to represent our current iterate either it +ponto e aqui vamos usar W para representar nossa iteração atual ou 0:02:49.410,0:02:54.000 -being the standard nomenclature for the point for your neural network this w +sendo a nomenclatura padrão para o ponto da sua rede neural isso w 0:02:54.000,0:03:00.420 -will be some large collection of weights one weight tensor per layer but notation +será uma grande coleção de pesos um tensor de peso por camada, mas notação 0:03:00.420,0:03:03.540 -we we kind of squash the whole thing down to a single vector and you can +nós meio que esmagamos a coisa toda em um único vetor e você pode 0:03:03.540,0:03:09.000 -imagine just doing that literally by reshaping all your vectors to all your +imagine apenas fazer isso literalmente remodelando todos os seus vetores para todos os seus 0:03:09.000,0:03:13.740 -tensors two vectors and just concatenate them together and this method is +tensores dois vetores e apenas concatená-los e esse método é 0:03:13.740,0:03:17.519 -remarkably simple all we do is we follow the direction of the negative gradient +notavelmente simples, tudo o que fazemos é seguir a direção do gradiente negativo 0:03:17.519,0:03:24.750 -and the rationale for this it's pretty simple so let me give you a diagram and +e a razão para isso é bem simples, então deixe-me dar-lhe um diagrama e 0:03:24.750,0:03:28.410 -maybe this will help explain exactly why following the negative gradient +talvez isso ajude a explicar exatamente por que seguir o gradiente negativo 0:03:28.410,0:03:33.570 -direction is a good idea so we don't know enough about our function to do +direção é uma boa ideia, então não sabemos o suficiente sobre nossa função para fazer 0:03:33.570,0:03:38.760 -better this is a high level idea when we're optimizing a function we look at +melhor, esta é uma ideia de alto nível quando estamos otimizando uma função que analisamos 0:03:38.760,0:03:45.060 -the landscape the optimization landscape locally so by optimization landscape I +o cenário o cenário de otimização localmente então pelo cenário de otimização I 0:03:45.060,0:03:49.230 -mean the domain of all possible weights of our network now we don't know what's +significa o domínio de todos os pesos possíveis da nossa rede agora não sabemos o que é 0:03:49.230,0:03:53.459 -going to happen if we use any particular weights on your network we don't know if +vai acontecer se usarmos pesos específicos em sua rede, não sabemos se 0:03:53.459,0:03:56.930 -it'll be better at the task we're trying to train it to or worse but we do know +será melhor na tarefa para a qual estamos tentando treiná-lo ou pior, mas sabemos 0:03:56.930,0:04:01.530 -locally is the point that are currently ad and the gradient and this gradient +localmente é o ponto que está atualmente ad e o gradiente e este gradiente 0:04:01.530,0:04:05.190 -provides some information about a direction which we can travel in that +fornece algumas informações sobre uma direção que podemos seguir nesse 0:04:05.190,0:04:09.870 -may improve the performance of our network or in this case reduce the value +pode melhorar o desempenho de nossa rede ou, neste caso, reduzir o valor 0:04:09.870,0:04:14.340 -of our function were minimizing here in this set up this general setup +da nossa função foram minimizados aqui nesta configuração esta configuração geral 0:04:14.340,0:04:19.380 -minimizing a function is essentially training in your network so minimizing +minimizar uma função é essencialmente treinar em sua rede, portanto, minimizar 
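The update described here, w_{k+1} = w_k - gamma * grad f(w_k), is only a few lines of code. The smooth test function and the step size below are arbitrary stand-ins for a real training loss.

```python
import torch

def f(w):
    # A simple smooth test function standing in for the training loss.
    return (w ** 2).sum() + (w[0] - 1) ** 2

w = torch.tensor([3.0, -2.0], requires_grad=True)
gamma = 0.1                                   # step size (learning rate)

for k in range(100):
    loss = f(w)
    loss.backward()                           # gradient at the current point
    with torch.no_grad():
        w -= gamma * w.grad                   # w_{k+1} = w_k - gamma * grad f(w_k)
    w.grad.zero_()

print(w.detach(), f(w).item())                # approaches the minimiser of f
```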
0:04:19.380,0:04:23.520 -the loss will give you the best performance on your classification task +a perda lhe dará o melhor desempenho em sua tarefa de classificação 0:04:23.520,0:04:26.550 -or whatever you're trying to do and because we only look at the world +ou o que você está tentando fazer e porque só olhamos para o mundo 0:04:26.550,0:04:31.110 -locally here this gradient is basically the best information we have and you can +localmente aqui este gradiente é basicamente a melhor informação que temos e você pode 0:04:31.110,0:04:36.270 -think of this as descending a valley where you start somewhere horrible some +pense nisso como descer um vale onde você começa em algum lugar horrível 0:04:36.270,0:04:39.600 -pinkie part of the landscape the top of a mountain for instance and you travel +dedo mindinho parte da paisagem o topo de uma montanha por exemplo e você viaja 0:04:39.600,0:04:43.590 -down from there and at each point you follow the direction near you that has +a partir daí e em cada ponto você segue a direção perto de você que tem 0:04:43.590,0:04:50.040 -the most sorry the steepest descent and in fact the go the method of grading % +o mais triste a descida mais íngreme e, de fato, o método de classificação % 0:04:50.040,0:04:53.820 -is sometimes called the method of steepest descent and this direction will +às vezes é chamado de método de descida mais íngreme e essa direção 0:04:53.820,0:04:57.630 -change as you move in the space now if you move locally by only an +mudar à medida que você se move no espaço agora se você se mover localmente por apenas um 0:04:57.630,0:05:02.040 -infinitesimal amount assuming this smoothness that I mentioned before which +quantidade infinitesimal assumindo essa suavidade que mencionei antes e que 0:05:02.040,0:05:04.740 -is actually not true in practice but we'll get to that assuming the +não é verdade na prática, mas chegaremos a isso assumindo que 0:05:04.740,0:05:08.280 -smoothness this small step will only change the gradient a small amount so +suavidade este pequeno passo só irá alterar o gradiente um pouco para que 0:05:08.280,0:05:11.820 -the direction you're traveling in is at least a good direction when you take +a direção em que você está viajando é pelo menos uma boa direção quando você toma 0:05:11.820,0:05:18.120 -small steps and we essentially just follow this path taking as larger steps +pequenos passos e nós essencialmente apenas seguimos este caminho tomando como passos maiores 0:05:18.120,0:05:20.669 -as we can traversing the landscape until we reach +como podemos percorrer a paisagem até chegarmos 0:05:20.669,0:05:25.229 -the valley at the bottom which is the minimizer our function now there's a +o vale na parte inferior que é o minimizador nossa função agora há um 0:05:25.229,0:05:30.690 -little bit more we can say for some problem classes and I'm going to use the +pouco mais podemos dizer para algumas classes de problemas e vou usar o 0:05:30.690,0:05:34.950 -most simplistic problem class we can just because it's the only thing that I +classe de problema mais simplista que podemos apenas porque é a única coisa que eu 0:05:34.950,0:05:39.210 -can really do any mathematics for on one slide so bear with me +pode realmente fazer qualquer matemática em um slide, então tenha paciência comigo 0:05:39.210,0:05:44.580 -this class is quadratics so for a quadratic optimization problem we +esta classe é quadrática, então para um problema de otimização quadrática nós 0:05:44.580,0:05:51.570 -actually know quite a bit just based off the 
gradient so firstly a gradient cuts +na verdade, sei um pouco apenas com base no gradiente, então primeiro um gradiente corta 0:05:51.570,0:05:55.440 -off an entire half of a space and now illustrate this here with this green +metade de um espaço inteiro e agora ilustre isso aqui com este verde 0:05:55.440,0:06:02.130 -line so we're at that point there where the line starts near the Green Line we +linha, então estamos naquele ponto onde a linha começa perto da Linha Verde, 0:06:02.130,0:06:05.789 -know the solution cannot be in the rest of the space and this is not true from +sei que a solução não pode estar no resto do espaço e isso não é verdade de 0:06:05.789,0:06:09.930 -your networks but it's still a genuinely a good guideline that we want to follow +suas redes, mas ainda é uma diretriz genuinamente boa que queremos seguir 0:06:09.930,0:06:13.710 -the direction of negative gradient there could be better solutions elsewhere in +a direção do gradiente negativo pode haver melhores soluções em outros lugares 0:06:13.710,0:06:17.910 -the space but finding them is is much harder than just trying to find the best +o espaço, mas encontrá-los é muito mais difícil do que apenas tentar encontrar o melhor 0:06:17.910,0:06:21.300 -solution near to where we are so that's what we do we trying to find the best +solução perto de onde estamos, então é isso que fazemos, tentando encontrar o melhor 0:06:21.300,0:06:24.930 -solution near to where we are you could imagine this being the surface of the +solução perto de onde estamos, você pode imaginar que esta seja a superfície do 0:06:24.930,0:06:28.410 -earth where there are many hills and valleys and we can't hope to know +terra onde há muitas colinas e vales e não podemos esperar saber 0:06:28.410,0:06:31.020 -something about a mountain on the other side of the planet but we can certainly +algo sobre uma montanha do outro lado do planeta, mas certamente podemos 0:06:31.020,0:06:34.559 -look for the valley directly beneath the mountain where we currently are +procure o vale logo abaixo da montanha onde estamos atualmente 0:06:34.559,0:06:39.089 -in fact you can think of these functions here as being represented with these +na verdade, você pode pensar nessas funções aqui como sendo representadas com esses 0:06:39.089,0:06:44.369 -topographic maps this is the same as topographic maps you use that you may be +mapas topográficos isto é o mesmo que mapas topográficos que você usa que você pode estar 0:06:44.369,0:06:50.369 -familiar with from from the planet Earth where mountains are shown by these rings +familiarizado do planeta Terra, onde as montanhas são mostradas por esses anéis 0:06:50.369,0:06:53.309 -now here the rings are representing descent so this is the bottom of the +agora aqui os anéis estão representando a descida, então esta é a parte inferior do 0:06:53.309,0:06:57.839 -valley we're showing here not the top of a hill at the center there so yes our +vale que estamos mostrando aqui, não o topo de uma colina no centro, então sim, nosso 0:06:57.839,0:07:02.459 -gradient knocks off a whole half of the possible space now it's very reasonable +gradiente elimina metade do espaço possível agora é muito razoável 0:07:02.459,0:07:06.059 -then to go in the direction find this negative gradient because it's kind of +então ir na direção encontrar esse gradiente negativo porque é meio que 0:07:06.059,0:07:10.199 -orthogonal to this line that cuts off after space and you can see that I've +ortogonal a esta linha que corta após o espaço e você pode ver que 
eu 0:07:10.199,0:07:21.409 -got the indication of orthogonal you there the little la square so the +peguei a indicação de ortogonal você ai a pracinha então o 0:07:21.409,0:07:25.319 -properties of gradient to spend a gradient descent depend greatly on the +propriedades do gradiente para passar um gradiente descendente dependem muito da 0:07:25.319,0:07:28.889 -structure of the problem for these quadratic problems it's actually +estrutura do problema para esses problemas quadráticos é na verdade 0:07:28.889,0:07:32.549 -relatively simple to characterize what will happen so I'm going to give you a +relativamente simples para caracterizar o que vai acontecer, então vou dar-lhe uma 0:07:32.549,0:07:35.369 -little bit of an overview here and I'll spend a few minutes on this because it's +um pouco de uma visão geral aqui e vou gastar alguns minutos nisso porque é 0:07:35.369,0:07:38.339 -quite interesting and I'm hoping that those of you with some background in +bastante interessante e espero que aqueles de vocês com alguma experiência em 0:07:38.339,0:07:42.629 -linear algebra can follow this derivation but we're going to consider a +álgebra linear pode seguir esta derivação, mas vamos considerar um 0:07:42.629,0:07:47.309 -quadratic optimization problem now the problem stated in the gray box +problema de otimização quadrática agora o problema indicado na caixa cinza 0:07:47.309,0:07:53.309 -at the top you can see that this is a quadratic where a is a positive definite +no topo você pode ver que esta é uma quadrática onde a é uma definida positiva 0:07:53.309,0:07:58.769 -matrix we can handle broader classes of Quadra quadratics and this potentially +matriz podemos lidar com classes mais amplas de quadráticas e isso potencialmente 0:07:58.769,0:08:04.649 -but the analysis is most simple in the positive definite case and the grating +mas a análise é mais simples no caso positivo definido e na grade 0:08:04.649,0:08:09.539 -of that function is very simple of course as Aw - b and u the solution of +dessa função é muito simples, é claro, como Aw - b e u a solução de 0:08:09.539,0:08:13.379 -this problem has a closed form in the case of quadratics it's as inverse of a +este problema tem uma forma fechada no caso de quadráticas é como inversa de um 0:08:13.379,0:08:20.179 -times B now what we do is we take the steps they're shown in the green box and +vezes B agora o que fazemos é seguir os passos mostrados na caixa verde e 0:08:20.179,0:08:26.519 -we just plug it into the distance from solution. So this || wₖ₊₁ – w*|| +nós apenas o conectamos à distância da solução. 
Então isso || wₖ₊₁ – w*|| 0:08:26.519,0:08:30.479 -is a distance from solution so we want to see how this changes over time and +é uma distância da solução, então queremos ver como isso muda ao longo do tempo e 0:08:30.479,0:08:34.050 -the idea is that if we're moving closer to the solution over time the method is +a ideia é que, se estivermos nos aproximando da solução ao longo do tempo, o método é 0:08:34.050,0:08:38.579 -converging so we start with that distance from solution to be plug in the +convergindo, então começamos com essa distância da solução a ser plugada no 0:08:38.579,0:08:44.509 -value of the update now with a little bit of rearranging we can pull +valor da atualização agora com um pouco de reorganização, podemos puxar 0:08:45.050,0:08:50.950 -the terms we can group the terms together and we can write B as a inverse +os termos, podemos agrupar os termos e podemos escrever B como um inverso 0:08:50.950,0:09:05.090 -so we can pull or we can pull the W star inside the inside the brackets there and +então podemos puxar ou podemos puxar a estrela W dentro dos suportes lá e 0:09:05.090,0:09:11.960 -then we get this expression where it's matrix times the previous distance to +então obtemos essa expressão em que é a matriz vezes a distância anterior para 0:09:11.960,0:09:16.040 -the solution matrix times previous distance solution now we don't know +a matriz de solução vezes a solução de distância anterior agora não sabemos 0:09:16.040,0:09:20.720 -anything about which directions this quadratic it varies most extremely in +qualquer coisa sobre quais direções esta quadrática varia mais extremamente em 0:09:20.720,0:09:24.890 -but we can just not bound this very simply by taking the product of the +mas não podemos limitar isso simplesmente tomando o produto do 0:09:24.890,0:09:28.850 -matrix as norm and the distance to the solution here this norm at the bottom so +matriz como norma e a distância para a solução aqui esta norma na parte inferior para 0:09:28.850,0:09:34.070 -that's the bottom line now now when you're considering matrix norms it's +essa é a linha de fundo agora quando você está considerando as normas da matriz, é 0:09:34.070,0:09:39.590 -pretty straightforward to see that you're going to have an expression where +bastante simples ver que você terá uma expressão em que 0:09:39.590,0:09:45.710 -the eigen values of this matrix are going to be 1 minus μ γ or 1 minus +os valores próprios desta matriz serão 1 menos μ γ ou 1 menos 0:09:45.710,0:09:48.950 -L γ now the way I get this is I just look at what are the extreme eigen +L γ agora a maneira como eu entendo isso é apenas olhar para quais são os eigen extremos 0:09:48.950,0:09:54.050 -values of a which we call them μ and L and by plugging these into the +valores de a que nós os chamamos de μ e L e colocando-os no 0:09:54.050,0:09:56.930 -expression we can see what the extreme eigen values will be of this combined +expressão podemos ver quais serão os valores próprios extremos desta combinação 0:09:56.930,0:10:03.050 -matrix I minus γ a and you have this absolute value here now you can optimize +matriz I menos γ a e você tem esse valor absoluto aqui agora você pode otimizar 0:10:03.050,0:10:06.320 -this and get an optimal learning rate for the quadratics +isso e obter uma taxa de aprendizado ideal para as quadráticas 0:10:06.320,0:10:09.920 -but that optimal learning rate is not robust in practice you probably don't +mas essa taxa de aprendizado ideal não é robusta na prática, você provavelmente não 0:10:09.920,0:10:16.910 
-want to use that so a simpler value you can use is 1/L. L being the largest +deseja usar isso, então um valor mais simples que você pode usar é 1/L. L sendo o maior 0:10:16.910,0:10:22.420 -eigen value and this gives you this convergence rate of 1 – μ/L +valor próprio e isso lhe dá essa taxa de convergência de 1 – μ/L 0:10:22.420,0:10:29.240 -reduction in distance to solution every step do we have any questions here I +redução da distância até a solução a cada passo temos alguma dúvida aqui eu 0:10:29.240,0:10:32.020 -know it's a little dense yes yes it's it's a substitution from in +sei que é um pouco denso sim sim é uma substituição de dentro 0:10:41.120,0:10:46.010 -that gray box do you see the bottom line on the gray box yeah that's that's just +essa caixa cinza você vê a linha de fundo na caixa cinza sim é isso é apenas 0:10:46.010,0:10:51.230 -a by definition we can solve the gradient so by taking the gradient to +a, por definição, podemos resolver o gradiente, então, tomando o gradiente para 0:10:51.230,0:10:53.060 -zero if you see in that second line in the box +zero se você vir nessa segunda linha na caixa 0:10:53.060,0:10:55.720 -taking the gradient to zero this so replaced our gradient with zero and +levando o gradiente para zero, então substituímos nosso gradiente por zero e 0:10:55.720,0:11:01.910 -rearranging you get the closed form solution to the problem here so the +reorganizando você obtém a solução de forma fechada para o problema aqui para que o 0:11:01.910,0:11:04.490 -problem with using that closed form solution in practice is we have to +problema com o uso dessa solução de forma fechada na prática é que temos que 0:11:04.490,0:11:08.420 -invert a matrix and by using gradient descent we can solve this problem by +inverter uma matriz e usando gradiente descendente podemos resolver este problema por 0:11:08.420,0:11:12.920 -only doing matrix multiplications instead I'm not that I would suggest you +apenas fazendo multiplicações de matrizes, em vez disso, eu não sugiro que você 0:11:12.920,0:11:15.560 -actually use this technique to solve the matrix as I mentioned before it's the +realmente usar essa técnica para resolver a matriz, como mencionei antes, é o 0:11:15.560,0:11:20.750 -worst method in the world and the convergence rate of this method is +pior método do mundo e a taxa de convergência deste método é 0:11:20.750,0:11:25.100 -controlled by this new overall quantity now these are standard notations so +controlado por esta nova quantidade global agora estas são notações padrão, então 0:11:25.100,0:11:27.950 -we're going from linear algebra where you talk about the min and Max eigen +estamos indo da álgebra linear onde você fala sobre o min e Max eigen 0:11:27.950,0:11:33.430 -value to the notation typically used in the field of optimization. +valor à notação normalmente usada no campo da otimização. 
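The argument above is for the quadratic case: with f(w) = ½ wᵀAw − bᵀw the minimizer is w* = A⁻¹b, and gradient descent with the simple step size γ = 1/L shrinks ||wₖ − w*|| by at least the factor 1 − μ/L at every step. The NumPy sketch below is an editor's illustration, not part of the lecture materials; the 5×5 positive-definite matrix and the vector b are invented test data used only to check the bound numerically.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small symmetric positive-definite quadratic f(w) = 1/2 w^T A w - b^T w.
M = rng.standard_normal((5, 5))
A = M @ M.T + 0.5 * np.eye(5)          # positive definite by construction
b = rng.standard_normal(5)

eigvals = np.linalg.eigvalsh(A)
mu, L = eigvals.min(), eigvals.max()   # smallest / largest eigenvalue
w_star = np.linalg.solve(A, b)         # closed-form minimizer (needs a matrix solve)

gamma = 1.0 / L                        # the "simple" safe step size from the lecture
w = np.zeros(5)
for k in range(15):
    grad = A @ w - b                   # gradient of the quadratic
    w_new = w - gamma * grad
    # Per-step contraction of the distance to the solution; it should
    # never exceed the bound 1 - mu/L derived above.
    ratio = np.linalg.norm(w_new - w_star) / np.linalg.norm(w - w_star)
    print(f"step {k:2d}  contraction = {ratio:.4f}  (bound {1 - mu / L:.4f})")
    w = w_new
```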
0:11:33.430,0:11:39.380 -μ is smallest eigen value L being largest eigen value and this μ/L is the +μ é o menor valor próprio L sendo o maior valor próprio e este μ/L é o 0:11:39.380,0:11:44.570 -inverse of the condition number condition number being L/μ this +inverso do número de condição número de condição sendo L/μ isso 0:11:44.570,0:11:51.140 -gives you a broad characterization of how quickly optimization methods will +fornece uma ampla caracterização da rapidez com que os métodos de otimização 0:11:51.140,0:11:57.440 -work on this problem and this these military terms they don't exist for +trabalhar neste problema e estes termos militares eles não existem para 0:11:57.440,0:12:02.870 -neural networks only in the very simplest situations do we have L exists +redes neurais apenas nas situações mais simples temos L existe 0:12:02.870,0:12:06.740 -and we essentially never have μ existing nevertheless we want to talk +e nós essencialmente nunca temos μ existindo, no entanto, queremos falar 0:12:06.740,0:12:10.520 -about network networks being polar conditioned and well conditioned and +sobre as redes de rede serem polarizadas e bem condicionadas e 0:12:10.520,0:12:14.930 -poorly conditioned would typically be some approximation to L is very large +mal condicionado seria tipicamente alguma aproximação de L é muito grande 0:12:14.930,0:12:21.260 -and well conditioned maybe L is very close to one so the step size we can +e bem condicionado talvez L seja muito próximo de um, então o tamanho do passo que podemos 0:12:21.260,0:12:27.770 -select in one summer training depends very heavily on these constants so let +selecionar em um treinamento de verão depende muito dessas constantes, então vamos 0:12:27.770,0:12:30.800 -me give you a little bit of an intuition for step sizes and this is very +me dar um pouco de intuição para tamanhos de passos e isso é muito 0:12:30.800,0:12:34.640 -important in practice I myself find a lot of my time is spent treating +importante na prática Eu mesmo acho que muito do meu tempo é gasto tratando 0:12:34.640,0:12:40.310 -learning rates and I'm sure you'll be involved in similar procedure so we have +taxas de aprendizagem e tenho certeza que você estará envolvido em um procedimento semelhante, então temos 0:12:40.310,0:12:45.740 -a couple of situations that can occur if we use a learning rate that's too low +algumas situações que podem ocorrer se usarmos uma taxa de aprendizado muito baixa 0:12:45.740,0:12:49.310 -we'll find that we make steady progress towards the solution here we're +descobriremos que fazemos progressos constantes em direção à solução, aqui estamos 0:12:49.310,0:12:56.480 -minimizing a little 1d quadratic and by steady progress I mean that every +minimizando um pouco 1d quadrático e por progresso constante quero dizer que cada 0:12:56.480,0:13:00.920 -iteration the gradient stays in buffer the same direction and you make similar +iteração o gradiente fica no buffer na mesma direção e você faz semelhante 0:13:00.920,0:13:05.420 -progress as you approach the solution this is slower than it is possible so +progride à medida que você se aproxima da solução, isso é mais lento do que é possível, então 0:13:05.420,0:13:09.910 -what you would ideally want to do is go straight to the solution for a quadratic +o que você gostaria de fazer idealmente é ir direto para a solução para um quadrático 0:13:09.910,0:13:12.650 -especially a 1d one like this that's going to be pretty straightforward +especialmente um 1d como este que vai ser bem direto 
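To accompany the discussion of conditioning and the step-size regimes described here (too small gives slow but steady progress; too large gives zigzagging and, past 2/L, divergence), here is a small NumPy sketch. It is an editor's illustration only: the diagonal 2-D quadratic with μ = 1 and L = 50 is an invented example, not the one plotted in the lecture.

```python
import numpy as np

# Ill-conditioned 2-D quadratic f(w) = 1/2 w^T A w with minimizer w* = 0.
A = np.diag([1.0, 50.0])    # mu = 1, L = 50, condition number L/mu = 50
L = 50.0

def run_gd(gamma, steps=15):
    w = np.array([1.0, 1.0])
    dists = []
    for _ in range(steps):
        w = w - gamma * (A @ w)          # gradient of the quadratic is A @ w
        dists.append(np.linalg.norm(w))  # distance to the solution w* = 0
    return dists

for name, gamma in [("too small           ", 0.2 / L),
                    ("1/L (safe)          ", 1.0 / L),
                    ("zigzags, converges  ", 1.8 / L),
                    ("above 2/L, diverges ", 2.2 / L)]:
    d = run_gd(gamma)
    # The steep coordinate flips sign every step once gamma > 1/L (the zigzag),
    # and the iterates blow up once gamma > 2/L.
    print(f"{name} gamma={gamma:.4f}  dist after 1/5/10/15 steps: "
          f"{d[0]:.3f} {d[4]:.3f} {d[9]:.3f} {d[14]:.3f}")
```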
0:13:12.650,0:13:16.340 -there's going to be an exact step size that'll get you all the way to solution +haverá um tamanho de passo exato que levará você até a solução 0:13:16.340,0:13:20.810 -but more generally you can't do that and what you typically want to use is +mas geralmente você não pode fazer isso e o que você normalmente deseja usar é 0:13:20.810,0:13:26.150 -actually a step size a bit above that optimal and this is for a number of +na verdade, um tamanho de passo um pouco acima do ideal e isso é para vários 0:13:26.150,0:13:29.570 -reasons it tends to be quicker in practice we have to be very very careful +razões que tende a ser mais rápido na prática, temos que ter muito, muito cuidado 0:13:29.570,0:13:33.800 -because you get divergence and the term divergence means that the iterates will +porque você obtém divergência e o termo divergência significa que as iterações 0:13:33.800,0:13:37.160 -get further away than from the solution instead of closer this will typically +ficar mais longe do que da solução em vez de se aproximar, isso normalmente 0:13:37.160,0:13:42.530 -happen if you use two larger learning rate unfortunately for us we want to use +acontecer se você usar duas taxas de aprendizado maiores, infelizmente para nós queremos usar 0:13:42.530,0:13:45.590 -learning rates as large as possible to get as quick learning as possible so +taxas de aprendizado o maior possível para obter o aprendizado o mais rápido possível, 0:13:45.590,0:13:50.180 -we're always at the edge of divergence in fact it's very rare that you'll see +estamos sempre à beira da divergência, na verdade, é muito raro que você veja 0:13:50.180,0:13:55.400 -that the gradients follow this nice trajectory where they all point the same +que os gradientes sigam esta bela trajetória onde todos apontam para o mesmo 0:13:55.400,0:13:58.670 -direction until you kind of reach the solution what almost always happens in +direção até chegar à solução o que quase sempre acontece 0:13:58.670,0:14:02.960 -practice especially with gradient descent invariants is that you observe +prática especialmente com invariantes gradientes descendentes é que você observa 0:14:02.960,0:14:06.770 -this zigzagging behavior now we can't actually see zigzagging in million +esse comportamento de ziguezague agora não podemos ver ziguezague em milhões 0:14:06.770,0:14:10.940 -dimensional spaces that we train your networks in but it's very evident in +espaços dimensionais nos quais treinamos suas redes, mas é muito evidente em 0:14:10.940,0:14:15.680 -these 2d plots of a quadratic so here I'm showing the level sets you can see +esses gráficos 2d de uma quadrática, então aqui estou mostrando os conjuntos de níveis que você pode ver 0:14:15.680,0:14:20.560 -the numbers or the function value indicated there on the level sets and +os números ou o valor da função indicado lá nos conjuntos de nível e 0:14:20.560,0:14:27.830 -when we use a learning rate that is good not optimal but good we get pretty close +quando usamos uma taxa de aprendizado que é boa, não ótima, mas boa, chegamos bem perto 0:14:27.830,0:14:31.760 -to that blue dot the solution are for the 10 steps when we use a learning rate +para esse ponto azul a solução é para os 10 passos quando usamos uma taxa de aprendizado 0:14:31.760,0:14:35.450 -that seems nicer in that it's not oscillating it's well-behaved when we +que parece melhor porque não está oscilando é bem comportado quando 0:14:35.450,0:14:38.330 -use such a learning rate we actually end up quite a bit further away from the 
+usar essa taxa de aprendizado, na verdade acabamos um pouco mais longe do 0:14:38.330,0:14:42.830 -solution so it's a fact of life that we have to deal with these learning rates +solução, então é um fato da vida que temos que lidar com essas taxas de aprendizado 0:14:42.830,0:14:50.690 -that are stressfully high it's kind of like a race right you know no one wins a +que são estressantemente altos, é como uma corrida, você sabe que ninguém ganha uma 0:14:50.690,0:14:55.730 -a race by driving safely so our network training should be very comparable to +uma corrida dirigindo com segurança, então nosso treinamento em rede deve ser muito comparável ao 0:14:55.730,0:15:01.940 -that so the core topic we want to talk about is actually it stochastic +que o tópico central sobre o qual queremos falar é, na verdade, estocástico 0:15:01.940,0:15:08.600 -optimization and this is the method that we will be using every day for training +otimização e este é o método que usaremos todos os dias para treinamento 0:15:08.600,0:15:14.660 -neural networks in practice so it's de casting optimization is actually not so +redes neurais na prática, então a otimização de casting não é tão 0:15:14.660,0:15:19.190 -different what we're gonna do is we're going to replace the gradients in our +diferente o que vamos fazer é substituir os gradientes em nosso 0:15:19.190,0:15:25.700 -gradient descent step with a stochastic approximation to the gradient now in a +passo de descida do gradiente com uma aproximação estocástica do gradiente agora em um 0:15:25.700,0:15:29.930 -neural network we can be a bit more precise here by stochastic approximation +rede neural podemos ser um pouco mais precisos aqui por aproximação estocástica 0:15:29.930,0:15:36.310 -what we mean is the gradient of the loss for a single data point single instance +o que queremos dizer é o gradiente da perda para uma única instância de ponto de dados único 0:15:36.310,0:15:42.970 -you might want to call it so I've got that in the notation here this function +você pode querer chamá-lo, então eu tenho isso na notação aqui esta função 0:15:42.970,0:15:49.430 -L is the loss of one day the point here the data point is indexed by AI and we +L é a perda de um dia o ponto aqui o ponto de dados é indexado por AI e nós 0:15:49.430,0:15:52.970 -would write this typically in the optimization literature as the function +escreveria isso normalmente na literatura de otimização como a função 0:15:52.970,0:15:57.380 -fᵢ and I'm going to use this notation but you should imagine fᵢ as being the +fᵢ e vou usar essa notação, mas você deve imaginar fᵢ como sendo a 0:15:57.380,0:16:02.390 -loss for a single instance I and here I'm using supervised learning setup +perda para uma única instância I e aqui estou usando a configuração de aprendizado supervisionado 0:16:02.390,0:16:08.330 -where we have data points I labels yᵢ so they points xᵢ labels yᵢ the full +onde temos pontos de dados que eu rotula yᵢ então eles apontam xᵢ rótulos yᵢ o completo 0:16:08.330,0:16:14.290 -loss for a function is shown at the top there it's a sum of all these fᵢ. Now +a perda de uma função é mostrada na parte superior, é uma soma de todos esses fᵢ. 
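To make the fᵢ notation concrete, below is a minimal NumPy sketch (an editor's illustration with invented least-squares data, not course code) showing that the average of the per-instance gradients ∇fᵢ(w) equals the full-batch gradient, and that a uniformly sampled ∇fᵢ(w) is a noisy but unbiased estimate of it — which is the sense in which an SGD step is a full gradient step in expectation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy supervised setup: points x_i with labels y_i, linear model, squared loss.
n, d = 1000, 3
X = rng.standard_normal((n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(n)
w = np.zeros(d)

def grad_fi(w, i):
    """Gradient of f_i(w) = 1/2 (x_i . w - y_i)^2 for a single instance i."""
    return (X[i] @ w - y[i]) * X[i]

full_grad = X.T @ (X @ w - y) / n        # gradient of the average loss (1/n) sum_i f_i
avg_per_sample = np.mean([grad_fi(w, i) for i in range(n)], axis=0)
print("average of per-sample grads == full grad:",
      np.allclose(full_grad, avg_per_sample))

# A single uniformly sampled gradient is noisy but correct in expectation.
samples = np.array([grad_fi(w, rng.integers(n)) for _ in range(20000)])
print("empirical mean of sampled grads:", samples.mean(axis=0).round(3))
print("full gradient:                  ", full_grad.round(3))
```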
Agora 0:16:14.290,0:16:17.600 -let me give you a bit more explanation for what we're doing here we're placing +deixe-me dar-lhe um pouco mais de explicação para o que estamos fazendo aqui estamos colocando 0:16:17.600,0:16:24.230 -this through gradient with a stochastic gradient this is a noisy approximation +isto através de gradiente com um gradiente estocástico esta é uma aproximação ruidosa 0:16:24.230,0:16:30.350 -and this is how it's often explained in the stochastic optimization setup so we +e é assim que muitas vezes é explicado na configuração de otimização estocástica. 0:16:30.350,0:16:36.440 -have this function the gradient and in our setup it's expected value is equal +tem esta função o gradiente e em nossa configuração o valor esperado é igual 0:16:36.440,0:16:41.150 -to the full gradient so you can think of a stochastic gradient descent step as +para o gradiente completo para que você possa pensar em uma etapa de descida de gradiente estocástica como 0:16:41.150,0:16:47.210 -being a full gradient step in expectation now this is not actually the +sendo um passo de gradiente completo na expectativa agora, isso não é realmente o 0:16:47.210,0:16:50.480 -best way to view it because there's a lot more going on than that it's not +melhor maneira de vê-lo porque há muito mais acontecendo do que isso não é 0:16:50.480,0:16:58.310 -just gradient descent with noise so let me give you a little bit more detail but +apenas gradiente descendente com ruído, então deixe-me dar um pouco mais de detalhes, mas 0:16:58.310,0:17:03.050 -first I let anybody ask any questions I have here before I move on yes +primeiro eu deixo qualquer um fazer qualquer pergunta que eu tenha aqui antes de seguir em frente sim 0:17:03.050,0:17:08.420 -mm-hmm yeah I could talk a bit more about that but yes so you're right so +mm-hmm sim, eu poderia falar um pouco mais sobre isso, mas sim, então você está certo, então 0:17:08.420,0:17:12.500 -using your entire dataset to calculate a gradient is here what I mean by gradient +usar todo o seu conjunto de dados para calcular um gradiente é aqui o que quero dizer com gradiente 0:17:12.500,0:17:17.720 -descent we also call that full batch gradient descent just to be clear now in +descida, também chamamos essa descida de gradiente de lote completo apenas para ficar claro agora em 0:17:17.720,0:17:22.280 -machine learning we virtually always use mini batches so people may use the name +aprendizado de máquina, praticamente sempre usamos minilotes para que as pessoas possam usar o nome 0:17:22.280,0:17:24.620 -gradient descent or something when they're really talking about stochastic +gradiente descendente ou algo assim quando eles estão realmente falando sobre estocástico 0:17:24.620,0:17:29.150 -gradient descent and what you mentioned is absolutely true so there are some +gradiente descendente e o que você mencionou é absolutamente verdade, então existem alguns 0:17:29.150,0:17:33.920 -difficulties of training neural networks using very large batch sizes and this is +dificuldades de treinar redes neurais usando tamanhos de lote muito grandes e isso é 0:17:33.920,0:17:37.010 -understood to some degree and I'll actually explain that on the very next +entendido até certo ponto e eu realmente explicarei isso no próximo 0:17:37.010,0:17:39.230 -slide so let me let me get to to your point first +deslize então deixe-me chegar ao seu ponto primeiro 0:17:39.230,0:17:45.679 -so the point the answer to your question is actually the third point here the +então o ponto a resposta para sua 
pergunta é na verdade o terceiro ponto aqui o 0:17:45.679,0:17:50.780 -noise in stochastic gradient descent induces this phenomena known as +ruído na descida do gradiente estocástico induz esse fenômeno conhecido como 0:17:50.780,0:17:54.770 -annealing and the diagram directly to the right of it illustrates this +recozimento e o diagrama diretamente à direita ilustra isso 0:17:54.770,0:18:00.260 -phenomena so your network training landscapes have a bumpy structure to +fenômenos para que seus cenários de treinamento de rede tenham uma estrutura irregular para 0:18:00.260,0:18:05.330 -them where there are lots of small minima that are not good minima that +onde há muitos mínimos pequenos que não são bons mínimos que 0:18:05.330,0:18:09.320 -appear on the path to the good minima so the theory that a lot of people +aparecem no caminho para os bons mínimos, então a teoria de que muitas pessoas 0:18:09.320,0:18:13.760 -subscribe to is that SGD in particular the noise induced in the gradient +subscrever é que SGD em particular o ruído induzido no gradiente 0:18:13.760,0:18:18.919 -actually helps the optimizer to jump over these bad minima and the theory is +realmente ajuda o otimizador a pular esses mínimos ruins e a teoria é 0:18:18.919,0:18:22.669 -that these bad minima are quite small in the space and so they're easy to jump +que esses mínimos ruins são muito pequenos no espaço e, portanto, são fáceis de pular 0:18:22.669,0:18:27.380 -over we're good minima that results in good performance around your own network +sobre nós somos bons mínimos que resultam em bom desempenho em torno de sua própria rede 0:18:27.380,0:18:34.070 -are larger and harder to skip so does this answer your question yes so besides +são maiores e mais difíceis de pular, então isso responde sua pergunta sim, além disso 0:18:34.070,0:18:39.440 -that annealing point of view there's there's actually a few other reasons so +esse ponto de vista de recozimento, existem algumas outras razões, então 0:18:39.440,0:18:45.559 -we have a lot of redundancy in the information we get from each terms +temos muita redundância nas informações que obtemos de cada termo 0:18:45.559,0:18:51.679 -gradient and using stochastic gradient lets us exploit this redundancy in a lot +gradiente e usar gradiente estocástico nos permite explorar muito essa redundância 0:18:51.679,0:18:56.870 -of situations the gradient computed on a few hundred examples is almost as good +de situações, o gradiente calculado em algumas centenas de exemplos é quase tão bom 0:18:56.870,0:19:01.460 -as a gradient computed on the full data set and often thousands of times cheaper +como um gradiente calculado no conjunto de dados completo e muitas vezes milhares de vezes mais barato 0:19:01.460,0:19:05.300 -depending on your problem so it's it's hard to come up with a compelling reason +dependendo do seu problema, é difícil encontrar um motivo convincente 0:19:05.300,0:19:09.320 -to use gradient descent given the success of stochastic gradient descent +usar gradiente descendente dado o sucesso do gradiente descendente estocástico 0:19:09.320,0:19:13.809 -and this is part of the reason why disgusted gradient said is one of the +e isso é parte do motivo pelo qual gradiente desgostoso disse ser um dos 0:19:15.659,0:19:19.859 -best misses we have but gradient descent is one of the worst and in fact early +melhores erros que temos, mas o gradiente descendente é um dos piores e, de fato, precoce 0:19:19.859,0:19:23.580 -stages the correlation is remarkable this disgusted gradient can 
be +estágios a correlação é notável este gradiente desgostoso pode ser 0:19:23.580,0:19:28.499 -correlated up to a coefficient of 0.999 correlation coefficient to the true +correlacionado até um coeficiente de 0,999 coeficiente de correlação para o verdadeiro 0:19:28.499,0:19:33.869 -gradient at those early steps of optimization so I want to briefly talk +gradiente nessas etapas iniciais de otimização, então quero falar brevemente 0:19:33.869,0:19:38.179 -about a something you need to know about I think Yann has already mentioned this +sobre algo que você precisa saber, acho que Yann já mencionou isso 0:19:38.179,0:19:43.259 -briefly but in practice we don't use individual instances in stochastic +brevemente, mas na prática não usamos instâncias individuais em 0:19:43.259,0:19:48.749 -gradient descent how we use mini batches of instances so I'm just using some +gradiente descendente como usamos minilotes de instâncias, então estou usando apenas alguns 0:19:48.749,0:19:52.649 -notation here but everybody uses different notation for mini batching so +notação aqui, mas todo mundo usa notação diferente para mini lotes, então 0:19:52.649,0:19:56.970 -you shouldn't get too attached to the notation but essentially at every step +você não deve se apegar muito à notação, mas essencialmente a cada passo 0:19:56.970,0:20:03.149 -you have some batch here I'm going to call it B an index with I for step and +você tem algum lote aqui, vou chamá-lo de B um índice com I para o passo e 0:20:03.149,0:20:09.299 -you basically use the average of the gradients over this mini batch which is +você basicamente usa a média dos gradientes sobre este mini lote que é 0:20:09.299,0:20:13.470 -a subset of your data rather than a single instance or the full full batch +um subconjunto de seus dados em vez de uma única instância ou o lote completo completo 0:20:13.470,0:20:19.799 -now almost everybody will use this mini batch selected uniformly at random +agora quase todo mundo vai usar este mini lote selecionado uniformemente de forma aleatória 0:20:19.799,0:20:23.009 -some people use with replacement sampling and some people use without +algumas pessoas usam com amostragem de reposição e algumas pessoas usam sem 0:20:23.009,0:20:26.669 -with replacement sampling but the differences are not important for this +com amostragem de reposição, mas as diferenças não são importantes para isso 0:20:26.669,0:20:31.729 -purposes you can use either and there's a lot of advantages to mini batching so +propósitos você pode usar e há muitas vantagens em mini lotes, então 0:20:31.729,0:20:35.220 -there's actually some good impelling theoretical reasons to not be any batch +na verdade, existem algumas boas razões teóricas para não ser qualquer lote 0:20:35.220,0:20:38.609 -but the practical reasons are overwhelming part of these practical +mas as razões práticas são parte esmagadora dessas 0:20:38.609,0:20:43.950 -reasons are computational we make ammonia may utilize our hardware say at +razões são computacionais, fazemos amônia pode utilizar nosso hardware, digamos em 0:20:43.950,0:20:47.489 -1% efficiency when training some of the network's we use if we try and use +1% de eficiência ao treinar algumas das redes que usamos se tentarmos usar 0:20:47.489,0:20:51.239 -single instances and we get the most efficient utilization of the hardware +instâncias únicas e obtemos a utilização mais eficiente do hardware 0:20:51.239,0:20:55.979 -with batch sizes often in the hundreds if you're training on the typical +com tamanhos de lote muitas vezes 
na casa das centenas, se você estiver treinando no típico 0:20:55.979,0:20:59.999 -ImageNet data set for in for instance you don't use batch sizes less than +Conjunto de dados ImageNet para, por exemplo, você não usa tamanhos de lote menores que 0:20:59.999,0:21:08.429 -about 64 to get good efficiency maybe can go down to 32 but another important +cerca de 64 para obter uma boa eficiência talvez possa descer para 32, mas outro importante 0:21:08.429,0:21:13.080 -application is distributed training and this is really becoming a big thing so +aplicação é treinamento distribuído e isso está realmente se tornando uma grande coisa, então 0:21:13.080,0:21:17.309 -as was mentioned before people were recently able to Train ImageNet days +como foi mencionado antes que as pessoas pudessem treinar dias de ImageNet recentemente 0:21:17.309,0:21:21.639 -said that normally takes two days to train and not so long ago it took +disse que normalmente leva dois dias para treinar e não faz muito tempo que levava 0:21:21.639,0:21:25.779 -in a week to train in only one hour and the way they did that was using very +em uma semana para treinar em apenas uma hora e o jeito que eles fizeram isso foi usando muito 0:21:25.779,0:21:29.889 -large mini batches and along with using large many batches there are some tricks +grandes mini lotes e junto com o uso de grandes lotes, existem alguns truques 0:21:29.889,0:21:34.059 -that you need to use to get it to work it's probably not something that you +que você precisa usar para fazê-lo funcionar, provavelmente não é algo que você 0:21:34.059,0:21:37.149 -would cover an introductory lecture so I encourage you to check out that paper if +cobriria uma palestra introdutória, então eu encorajo você a verificar esse artigo se 0:21:37.149,0:21:40.409 -you're interested it's ImageNet in one hour +você está interessado é ImageNet em uma hora 0:21:40.409,0:21:45.279 -leaves face book authors I can't recall the first author at the moment as a side +deixa os autores do face book não me lembro do primeiro autor no momento como um lado 0:21:45.279,0:21:51.459 -note there are some situations where you need to do full batch optimization do +observe que existem algumas situações em que você precisa fazer a otimização completa do lote 0:21:51.459,0:21:54.759 -not use gradient descent in that situation I can't emphasize it enough to +não usar gradiente descendente nessa situação, não posso enfatizar o suficiente para 0:21:54.759,0:21:59.950 -not use gradient ascent ever if you have full batch data by far the most +não use subida de gradiente se você tiver dados de lote completos de longe 0:21:59.950,0:22:03.249 -effective method that is kind of plug-and-play you don't to think about +método eficaz que é tipo plug-and-play que você não precisa pensar 0:22:03.249,0:22:08.859 -it is known as l-bfgs it's accumulation of 50 years of optimization research and +é conhecido como l-bfgs é o acúmulo de 50 anos de pesquisa de otimização e 0:22:08.859,0:22:12.519 -it works really well torch's implementation is pretty good +funciona muito bem a implementação do torch é muito boa 0:22:12.519,0:22:17.379 -but the Scipy implementation causes some filtering code that was written 15 years +mas a implementação do Scipy causa algum código de filtragem que foi escrito há 15 anos 0:22:17.379,0:22:23.440 -ago that is pretty much bulletproof so because they were those so that's a good +atrás, isso é praticamente à prova de balas, então porque eles eram aqueles, então isso é uma boa 0:22:23.440,0:22:26.619 -question 
classically you do need to use the full +pergunta classicamente você precisa usar o 0:22:26.619,0:22:28.809 -data set now the PyTorch implementation actually +conjunto de dados; agora, a implementação do PyTorch na verdade 0:22:28.809,0:22:34.209 -supports using mini batches now this is somewhat of a gray area in that there's +suporta o uso de minilotes; agora, esta é uma área cinzenta em que há 0:22:34.209,0:22:37.899 -really no theory to support the use of this and it may work well for your +realmente nenhuma teoria para apoiar o uso disso e pode funcionar bem para o seu 0:22:37.899,0:22:43.839 -problem or it may not so it could be worth trying I mean you want to use your +problema ou pode não; então, pode valer a pena tentar. Quero dizer, você quer usar seu 0:22:43.839,0:22:49.929 -whole data set for each gradient evaluation or probably more likely since +conjunto de dados inteiro para cada avaliação de gradiente ou, mais provavelmente, já que 0:22:49.929,0:22:52.359 -it's very rare you want to do that probably more likely you're solving some +é muito raro você querer fazer isso, provavelmente você está resolvendo algum 0:22:52.359,0:22:56.889 -other optimization problem that isn't training your network but maybe +outro problema de otimização que não é o treinamento da sua rede, mas talvez 0:22:56.889,0:23:01.869 -some ancillary problem related and you need to solve an optimization problem +algum problema auxiliar relacionado, e você precisa resolver um problema de otimização 0:23:01.869,0:23:06.669 -without this data point structure that isn't a sum of data +sem essa estrutura de pontos de dados, que não é uma soma de 0:23:06.669,0:23:12.239 -points yeah hopefully that answers it was there another question yep oh yes the question was +pontos de dados. Sim, espero que isso responda; havia outra pergunta? Sim, ah sim, a pergunta foi 0:23:12.239,0:23:16.869 -Yann recommended we use mini batches equal in size to the number of +Yann recomendou que usássemos minilotes de tamanho igual ao número de 0:23:16.869,0:23:20.079 -classes we have in our data set why is that reasonable that was the question +classes que temos em nosso conjunto de dados; por que isso é razoável? Essa era a pergunta 0:23:20.079,0:23:23.889 -the answer is that we want mini batches to be representative of the full data +a resposta é que queremos que os minilotes sejam representativos do conjunto de dados 0:23:23.889,0:23:28.329 -set and typically each class is quite distinct from the other classes in its +completo e, tipicamente, cada classe é bastante distinta das outras classes em suas 0:23:28.329,0:23:33.490 -properties so by using a mini batch that contains on average +propriedades; assim, ao usar um minilote que contém em média 0:23:33.490,0:23:36.850 -one instance from each class in fact we can enforce that explicitly although +uma instância de cada classe (de fato, podemos impor isso explicitamente, embora 0:23:36.850,0:23:39.820 -it's not necessary by having a mini batch approximately equal to that +não seja necessário), tendo um minilote aproximadamente igual a esse 0:23:39.820,0:23:44.590 -size we can assume it has the kind of structure of a full gradient so you +tamanho, podemos assumir que ele tem o tipo de estrutura de um gradiente completo, para que você 0:23:44.590,0:23:49.870 -capture a lot of the correlations in the data you see with the full gradient and +capture muitas das correlações nos dados que você vê com o gradiente completo e 0:23:49.870,0:23:54.279 -it's a good guide especially if you're using training
on CPU where you're not +é um bom guia, especialmente se você estiver usando o treinamento na CPU onde você não está 0:23:54.279,0:23:58.690 -constrained too much by hardware efficiency here when training on energy +muito limitado pela eficiência do hardware aqui ao treinar em energia 0:23:58.690,0:24:05.080 -on a CPU batch size is not critical for hardware utilization it's problem +em um tamanho de lote de CPU não é crítico para a utilização de hardware, é um problema 0:24:05.080,0:24:09.370 -dependent I would always recommend mini batching I don't think it's worth trying +dependente eu sempre recomendaria mini batching não acho que vale a pena tentar 0:24:09.370,0:24:13.899 -size one as a starting point if you try to eke out small gains maybe that's +tamanho um como ponto de partida, se você tentar obter pequenos ganhos, talvez seja 0:24:13.899,0:24:19.779 -worth exploring yes there was another question so in the annealing example so +vale a pena explorar sim, havia outra pergunta, então no exemplo de recozimento, então 0:24:19.779,0:24:24.760 -the question was why is the lost landscape so wobbly and this is this is +a questão era por que a paisagem perdida é tão instável e é isso 0:24:24.760,0:24:31.600 -actually something that is very a very realistic depiction of actual law slams +na verdade, algo que é uma representação muito realista de batidas de lei reais 0:24:31.600,0:24:37.630 -codes for neural networks they're incredibly in the sense that they have a +códigos para redes neurais são incrivelmente no sentido de que eles têm um 0:24:37.630,0:24:41.860 -lot of hills and valleys and this is something that is actively researched +muitas colinas e vales e isso é algo que é pesquisado ativamente 0:24:41.860,0:24:47.140 -now what we can say for instance is that there is a very large number of good +agora o que podemos dizer, por exemplo, é que há um número muito grande de boas 0:24:47.140,0:24:52.720 -minima and and so hills and valleys we know this because your networks have +mínimo e assim por colinas e vales sabemos disso porque suas redes têm 0:24:52.720,0:24:56.590 -this combinatorial aspect to them you can reaper ammeter eyes a neural network +este aspecto combinatório para eles você pode colher os olhos do amperímetro uma rede neural 0:24:56.590,0:25:00.309 -by shifting all the weights around and you can get in your work you'll know if +deslocando todos os pesos ao redor e você pode entrar em seu trabalho, você saberá se 0:25:00.309,0:25:04.750 -it outputs exactly the same output for whatever task you're looking at with all +ele produz exatamente a mesma saída para qualquer tarefa que você esteja olhando com todos 0:25:04.750,0:25:07.419 -these weights moved around and that correspondence essentially to a +esses pesos se movimentavam e essa correspondência essencialmente a um 0:25:07.419,0:25:12.460 -different location in parameter space so given that there's an exponential number +localização diferente no espaço de parâmetros, dado que há um número exponencial 0:25:12.460,0:25:16.270 -of these possible ways of rearranging the weights to get the same network +dessas possíveis maneiras de reorganizar os pesos para obter a mesma rede 0:25:16.270,0:25:18.940 -you're going to end up with the space that's incredibly spiky exponential +você vai acabar com o espaço que é exponencial incrivelmente pontiagudo 0:25:18.940,0:25:24.789 -number of these spikes now the reason why these these local minima appear that +número desses picos agora a razão pela qual esses mínimos locais aparecem que 
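The reparameterization argument above ("shift all the weights around and the network outputs exactly the same thing") can be checked directly: permuting the hidden units of an MLP, together with the matching rows and columns of the adjacent weight matrices, gives a different point in parameter space that computes the identical function. The PyTorch sketch below is an editor's illustration of that permutation symmetry; the layer sizes and random input are arbitrary choices, not course code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
x = torch.randn(16, 4)
y_ref = net(x)                         # outputs before reparameterization

# Permute the hidden units: shuffle rows of the first layer (and its bias)
# and the matching columns of the second layer's weight matrix.
perm = torch.randperm(8)
with torch.no_grad():
    net[0].weight.copy_(net[0].weight[perm])
    net[0].bias.copy_(net[0].bias[perm])
    net[2].weight.copy_(net[2].weight[:, perm])

# A different point in parameter space, yet exactly the same function.
print("same outputs after permutation:",
      torch.allclose(y_ref, net(x), atol=1e-6))
```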
0:25:24.789,0:25:27.580 -is something that is still active research so I'm not sure I can give you +é algo que ainda é uma pesquisa ativa, então não tenho certeza se posso lhe dar 0:25:27.580,0:25:32.890 -a great answer there but they're definitely observed in practice and what +uma ótima resposta lá, mas eles são definitivamente observados na prática e o que 0:25:32.890,0:25:39.000 -I can say is they appear to be less of a problem we've very +Posso dizer que eles parecem ser um problema menor que nós 0:25:39.090,0:25:42.810 -like close to state-of-the-art networks so these local minima were considered +como perto de redes de última geração, então esses mínimos locais foram considerados 0:25:42.810,0:25:47.940 -big problems 15 years ago but so much at the moment people essentially never hit +grandes problemas há 15 anos, mas tanto no momento em que as pessoas essencialmente nunca atingem 0:25:47.940,0:25:52.350 -them in practice when using kind of recommended parameters and things like +na prática ao usar o tipo de parâmetros recomendados e coisas como 0:25:52.350,0:25:55.980 -that when you use very large batches you can run into these problems it's not +que quando você usa lotes muito grandes você pode ter esses problemas, não é 0:25:55.980,0:25:59.490 -even clear that the the poor performance when using large batches is even +mesmo claro que o baixo desempenho ao usar grandes lotes é ainda 0:25:59.490,0:26:03.900 -attributable to these larger minima to these local minima so this is yes to +atribuível a esses mínimos maiores a esses mínimos locais, então isso é sim para 0:26:03.900,0:26:08.550 -ongoing research yes the problem is you can't really see this local structure +pesquisa em andamento sim, o problema é que você não pode realmente ver essa estrutura local 0:26:08.550,0:26:10.920 -because we're in this million dimensional space it's not a good way to +porque estamos neste espaço de um milhão de dimensões, não é uma boa maneira de 0:26:10.920,0:26:15.090 -see it so yeah I don't know if people might have explored that already I'm not +veja então sim eu não sei se as pessoas podem ter explorado isso já eu não sou 0:26:15.090,0:26:18.840 -familiar with papers on that but I bet someone has looked at it so you might +familiarizado com documentos sobre isso, mas aposto que alguém olhou para isso, então você pode 0:26:18.840,0:26:23.520 -want to google that yeah so a lot of the advances in neural network design have +quero pesquisar isso no Google, então muitos dos avanços no design de redes neurais 0:26:23.520,0:26:27.420 -actually been in reducing this bumpiness in a lot of ways so this is part of the +na verdade, reduzi essa irregularidade de várias maneiras, então isso faz parte do 0:26:27.420,0:26:30.510 -reason why it's not considered a huge problem anymore whether it was it was +razão pela qual não é mais considerado um grande problema se era 0:26:30.510,0:26:35.960 -considered a big problem in the past there's any other questions yes so it's +considerado um grande problema no passado, há outras perguntas sim, então é 0:26:35.960,0:26:41.550 -it is hard to see but there are certain things you can do that we make the the +é difícil de ver, mas há certas coisas que você pode fazer que tornamos o 0:26:41.550,0:26:46.830 -peaks and valleys smaller certainly and by rescaling some parts the neural +picos e vales menores certamente e redimensionando algumas partes do sistema neural 0:26:46.830,0:26:50.010 -network you can amplify certain directions the curvature in certain +rede você pode 
amplificar certas direções da curvatura em certas 0:26:50.010,0:26:54.320 -directions can be stretched and squashed the particular innovation residual +direções podem ser esticadas e esmagadas o resíduo de inovação particular 0:26:54.320,0:27:00.000 -connections that were mentioned they're very easy to see that they smooth out +conexões que foram mencionadas são muito fáceis de ver que suavizam 0:27:00.000,0:27:03.600 -the the loss in fact you can kind of draw two line between two points in the +a perda, de fato, você pode desenhar duas linhas entre dois pontos no 0:27:03.600,0:27:06.570 -space and you can see what happens along that line that's really the best way we +espaço e você pode ver o que acontece ao longo dessa linha que é realmente a melhor maneira de 0:27:06.570,0:27:10.170 -have a visualizing million dimensional spaces so I turn him into one dimension +tenho uma visualização de milhões de espaços dimensionais, então eu o transformo em uma dimensão 0:27:10.170,0:27:13.200 -and you can see that it's that it's a much nicer between these two points +e você pode ver que é muito melhor entre esses dois pontos 0:27:13.200,0:27:17.370 -whatever two points you choose when using these residual connections I'll be +quaisquer que sejam os dois pontos que você escolher ao usar essas conexões residuais, estarei 0:27:17.370,0:27:21.570 -talking all about dodging or later in the lecture so yeah if hopefully I'll +falando tudo sobre esquivar ou mais tarde na palestra, então sim, espero que eu 0:27:21.570,0:27:24.870 -answer that question without you having to ask it again but we'll see +responda a essa pergunta sem precisar perguntar novamente, mas veremos 0:27:24.870,0:27:31.560 -thanks any other questions yes so l-bfgs excellent method it's it's kind of a +obrigado qualquer outra pergunta sim, então l-bfgs excelente método é uma espécie de 0:27:31.560,0:27:34.650 -constellation of optimization researchers that we still use SGD a +constelação de pesquisadores de otimização que ainda usamos SGD 0:27:34.650,0:27:40.470 -method invented in the 60s or earlier is still state of the art but there has +método inventado nos anos 60 ou antes ainda é o estado da arte, mas 0:27:40.470,0:27:44.880 -been some innovation in fact only a couple years later but there was some +houve alguma inovação, na verdade, apenas alguns anos depois, mas houve algumas 0:27:44.880,0:27:49.180 -innovation since the invention of sed and one of these innovations is +inovação desde a invenção do sed e uma dessas inovações é 0:27:49.180,0:27:54.730 -and I'll talk about another later so momentum it's a trick +e eu vou falar sobre outro mais tarde, então o impulso é um truque 0:27:54.730,0:27:57.520 -that you should pretty much always be using when you're using stochastic +que você deve usar sempre quando estiver usando estocástico 0:27:57.520,0:28:00.880 -gradient descent it's worth be going into this in a little bit of detail +gradiente descendente vale a pena entrar nisso com um pouco de detalhes 0:28:00.880,0:28:04.930 -you'll often be tuning the momentum parameter and your network and it's +muitas vezes você estará ajustando o parâmetro de impulso e sua rede e é 0:28:04.930,0:28:09.340 -useful to understand what it's actually doing when you're tuning up so part of +útil para entender o que está realmente fazendo quando você está ajustando 0:28:09.340,0:28:15.970 -the problem with momentum it's very misunderstood and this can be explained +o problema com o impulso é muito mal compreendido e isso pode ser explicado 
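The "draw a line between two points in parameter space" visualization mentioned here is easy to sketch: flatten the parameters of two networks, linearly interpolate between them, load each interpolated vector into a probe network, and evaluate the loss along the segment. The snippet below is an editor's minimal sketch of that idea using PyTorch's parameters_to_vector / vector_to_parameters helpers and invented random data; in practice the two endpoints would be trained solutions rather than random initializations.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_net():
    return nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))

# Two endpoints in parameter space (here just two random inits for brevity).
net_a, net_b = make_net(), make_net()
probe = make_net()                      # network whose weights we overwrite

X = torch.randn(256, 10)
y = torch.randn(256, 1)
loss_fn = nn.MSELoss()

theta_a = nn.utils.parameters_to_vector(net_a.parameters())
theta_b = nn.utils.parameters_to_vector(net_b.parameters())

for alpha in [i / 10 for i in range(11)]:
    theta = (1 - alpha) * theta_a + alpha * theta_b   # point on the line segment
    nn.utils.vector_to_parameters(theta, probe.parameters())
    with torch.no_grad():
        loss = loss_fn(probe(X), y)
    print(f"alpha={alpha:.1f}  loss={loss.item():.4f}")
```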
0:28:15.970,0:28:18.760 -by the fact that there's actually three different ways of writing momentum that +pelo fato de que na verdade existem três maneiras diferentes de escrever o momento que 0:28:18.760,0:28:21.790 -look completely different but turn out to be equivalent I'm only going to +parecem completamente diferentes, mas acabam sendo equivalentes, só vou 0:28:21.790,0:28:25.120 -present two of these ways because the third way is not as well known but is +apresentar duas dessas maneiras porque a terceira maneira não é tão conhecida, mas é 0:28:25.120,0:28:30.070 -actually in my opinion the correct way to view it I don't talk about my +na verdade na minha opinião a maneira correta de ver eu não falo sobre minha 0:28:30.070,0:28:32.470 -research here so we'll talk about how it's actually implemented in the +pesquisa aqui, então vamos falar sobre como ele é realmente implementado no 0:28:32.470,0:28:37.390 -packages you'll be using and this first form here is what's actually implemented +pacotes que você irá usar e este primeiro formulário aqui é o que está realmente implementado 0:28:37.390,0:28:42.040 -in PyTorch and other software that you'll be using here we maintain two variables +no PyTorch e em outro software que você usará aqui, mantemos duas variáveis 0:28:42.040,0:28:47.650 -now you'll see lots of papers using different notation here P is the +agora você verá muitos papéis usando notação diferente aqui P é o 0:28:47.650,0:28:51.580 -notation used in physics for momentum and it's very common to use that also as +notação usada na física para momento e é muito comum usá-la também como 0:28:51.580,0:28:55.720 -the momentum variable when talking about sed with momentum so I'll be following +a variável momentum ao falar sobre sed com momentum então estarei seguindo 0:28:55.720,0:29:01.000 -that convention so instead of having a single iterate we now have to Eretz P +essa convenção, então, em vez de ter uma única iteração, agora temos que Eretz P 0:29:01.000,0:29:06.940 -and W and at every step we update both and this is quite a simple update so the +e W e a cada passo atualizamos ambos e esta é uma atualização bastante simples, então o 0:29:06.940,0:29:13.060 -P update involves adding to the old P and instead of adding exactly to the old +A atualização de P envolve adicionar ao antigo P e em vez de adicionar exatamente ao antigo 0:29:13.060,0:29:16.720 -P we kind of damp the old P we reduce it by multiplying it by a constant that's +P nós meio que amortecemos o antigo P nós o reduzimos multiplicando-o por uma constante que é 0:29:16.720,0:29:21.310 -worse than one so reduce the old P and here I'm using β̂ as the constant +pior do que um, então reduza o antigo P e aqui estou usando β̂ como a constante 0:29:21.310,0:29:24.880 -there so that would probably be 0.9 in practice a small amount of damping and +lá de modo que provavelmente seria 0,9 na prática uma pequena quantidade de amortecimento e 0:29:24.880,0:29:32.650 -we add to that the new gradient so P is kind of this accumulated gradient buffer +adicionamos a isso o novo gradiente, então P é uma espécie de buffer de gradiente acumulado 0:29:32.650,0:29:38.170 -you can think of where new gradients come in at full value and past gradients +você pode pensar em onde novos gradientes entram em valor total e gradientes anteriores 0:29:38.170,0:29:42.490 -are reduced at each step by a certain factor usually 0.9 which used to reduce +são reduzidos em cada etapa por um certo fator geralmente 0,9 que costumava reduzir 
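For reference, here is a small NumPy sketch (an editor's illustration, not the lecture's own code) of the two formulations discussed in this passage: the buffer form p ← βp + ∇f(w), w ← w − γp that PyTorch-style SGD with momentum implements, and the "stochastic heavy ball" form w_{k+1} = w_k − γ∇f(w_k) + β(w_k − w_{k−1}). With the same β and γ, and starting from p₀ = 0 and w₀ = w₋₁, these two parameterizations produce identical iterates; the test quadratic is invented.

```python
import numpy as np

# Quadratic test problem: f(w) = 1/2 w^T A w, gradient A @ w, minimum at 0.
A = np.diag([1.0, 10.0])
grad = lambda w: A @ w

gamma, beta = 0.05, 0.9
w0 = np.array([1.0, 1.0])

# Form 1: momentum buffer, as in PyTorch-style SGD(momentum=beta).
w, p = w0.copy(), np.zeros(2)
traj_buffer = []
for _ in range(20):
    p = beta * p + grad(w)        # accumulate (damped) past gradients
    w = w - gamma * p
    traj_buffer.append(w.copy())

# Form 2: stochastic heavy ball — reuse the previous displacement.
w_prev, w = w0.copy(), w0.copy()
traj_hb = []
for _ in range(20):
    w_new = w - gamma * grad(w) + beta * (w - w_prev)
    w_prev, w = w, w_new
    traj_hb.append(w.copy())

print("identical iterates:", np.allclose(traj_buffer, traj_hb))
```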
0:29:42.490,0:29:47.910 -reduced so the buffer tends to be a some sort of running sum of gradients and +reduzido de modo que o buffer tende a ser uma espécie de soma de gradientes e 0:29:47.910,0:29:53.080 -it's basically we just modify this to custer gradient two-step descent step by +é basicamente nós apenas modificamos isso para descida de duas etapas de gradiente de Custer passo a passo 0:29:53.080,0:29:56.440 -using this P instead of the negative gradient instead of the gradient sorry +usando este P em vez do gradiente negativo em vez do gradiente desculpe 0:29:56.440,0:30:00.260 -using P instead of the in the update since the two line formula +usando P em vez do na atualização, pois a fórmula de duas linhas 0:30:00.260,0:30:05.790 -it may be better to understand this by the second form that I put below this is +pode ser melhor entender isso pelo segundo formulário que coloco abaixo, isso é 0:30:05.790,0:30:09.600 -equivalent you've got a map the β with a small transformation so it's not +equivalente você tem um mapa do β com uma pequena transformação, então não é 0:30:09.600,0:30:12.750 -exactly the same β between the two methods but it's practically the same +exatamente o mesmo β entre os dois métodos, mas é praticamente o mesmo 0:30:12.750,0:30:20.300 -for in practice so these are essentially the same up to reap romanization and +pois na prática são essencialmente os mesmos até colher romanização e 0:30:21.260,0:30:25.530 -this film I think is maybe clearer this form is called the stochastic heavy ball +este filme eu acho que talvez seja mais claro essa forma é chamada de bola pesada estocástica 0:30:25.530,0:30:31.170 -method and here our update still includes the gradient but we're also +e aqui nossa atualização ainda inclui o gradiente, mas também estamos 0:30:31.170,0:30:40.020 -adding on a multiplied copy of the past direction we traveled in now what does +adicionando uma cópia multiplicada da direção passada em que viajamos agora o que faz 0:30:40.020,0:30:43.320 -this mean what are we actually doing here so it's actually not too difficult +isso significa o que estamos realmente fazendo aqui, então não é muito difícil 0:30:43.320,0:30:49.170 -to visualize and I'm going to kind of use a visualization from a distilled +para visualizar e vou usar uma visualização de um destilado 0:30:49.170,0:30:52.710 -publication you can see the dress at the bottom there and I disagree with a lot +publicação você pode ver o vestido lá embaixo e eu discordo muito 0:30:52.710,0:30:55.620 -of what they talked about in that document but I like the visualizations +do que eles falaram nesse documento, mas eu gosto das visualizações 0:30:55.620,0:31:02.820 -so let's use had and I'll explain why I disagreed some regards later but it's +então vamos usar had e eu vou explicar porque eu discordei de alguns cumprimentos mais tarde, mas é 0:31:02.820,0:31:07.440 -quite simple so you can think of momentum as the physical process and I +bastante simples para que você possa pensar no momento como o processo físico e eu 0:31:07.440,0:31:10.650 -mention those of you have done introductory physics courses would have +mencionar aqueles de vocês que fizeram cursos introdutórios de física teria 0:31:10.650,0:31:17.340 -covered this so momentum is the property of something to keep moving in the +cobriu isso, então o impulso é a propriedade de algo para continuar se movendo no 0:31:17.340,0:31:21.330 -direction that's currently moving in all right if you're familiar with Newton's +direção que está se movendo no 
momento certo se você estiver familiarizado com Newton 0:31:21.330,0:31:24.240 -laws things want to keep going in the direction they're going and this is +leis, as coisas querem continuar na direção em que estão indo e isso é 0:31:24.240,0:31:28.860 -momentum and when you do this mapping the physics the gradient is kind of a +momento e quando você faz isso mapeando a física, o gradiente é uma espécie de 0:31:28.860,0:31:34.020 -force that is pushing you're literate which by this analogy is a heavy ball +força que está empurrando você é alfabetizado que por esta analogia é uma bola pesada 0:31:34.020,0:31:39.860 -it's pushing this heavy ball at each point so rather than making dramatic +está empurrando essa bola pesada em cada ponto, em vez de fazer drama 0:31:39.860,0:31:44.030 -changes in the direction we travel at every step which is shown in that left +mudanças na direção em que viajamos a cada passo que é mostrado na esquerda 0:31:44.030,0:31:48.480 -diagram instead of making these dramatic changes we're going to make kind of a +diagrama em vez de fazer essas mudanças dramáticas, vamos fazer uma espécie de 0:31:48.480,0:31:51.480 -bit more modest changes so when we realize we're going in the wrong +mudanças um pouco mais modestas, então quando percebemos que estamos indo no caminho errado 0:31:51.480,0:31:55.740 -direction we kind of do a u-turn instead of putting the hand brake on and +direção, meio que fazemos uma inversão de marcha em vez de colocar o freio de mão e 0:31:55.740,0:31:59.440 -swinging around it turns out in a lot of practical +balançando se torna muito prático 0:31:59.440,0:32:01.810 -problems this gives you a big improvement so here you can see you're +problemas, isso lhe dá uma grande melhoria, então aqui você pode ver que está 0:32:01.810,0:32:06.280 -getting much closer to the solution by the end of it with much less oscillation +chegando muito mais perto da solução no final dela com muito menos oscilação 0:32:06.280,0:32:10.840 -and you can see this oscillation so it's kind of a fact of life if you're using +e você pode ver essa oscilação, então é um fato da vida se você estiver usando 0:32:10.840,0:32:14.650 -gradient descent type methods so here we talk about momentum on top of gradient +métodos do tipo gradiente descendente, então aqui falamos sobre o momento em cima do gradiente 0:32:14.650,0:32:18.550 -descent in the visualization you're gonna get this oscillation it's just a +descida na visualização você vai ter essa oscilação é apenas um 0:32:18.550,0:32:22.240 -property of gradient descent no way to get rid of it without modifying the +propriedade do gradiente descendente não há como se livrar dele sem modificar o 0:32:22.240,0:32:27.490 -method and we're meant to them to some degree dampens this oscillation I've got +método e estamos destinados a eles até certo ponto amortece essa oscilação que eu tenho 0:32:27.490,0:32:30.760 -another visualization here which will kind of give you an intuition for how +outra visualização aqui que lhe dará uma intuição de como 0:32:30.760,0:32:34.660 -this β parameter controls things now the Department of these to be greater +este parâmetro β controla as coisas agora o Departamento destes para ser maior 0:32:34.660,0:32:39.280 -than zero if it's equal to zero you distr in gradient descent and it's gotta +que zero se for igual a zero você distr em gradiente descendente e tem que 0:32:39.280,0:32:43.330 -be less than one otherwise the Met everything blows up as you start +ser menor que um caso contrário o Met tudo explode 
quando você começa 0:32:43.330,0:32:45.970 -including past gradients with more and more weight over times it's gotta be +incluindo gradientes anteriores com cada vez mais peso ao longo do tempo, tem que ser 0:32:45.970,0:32:54.070 -between zero and one and typical values range from you know small 0.25 up to +entre zero e um e os valores típicos variam de você sabe pequeno 0,25 até 0:32:54.070,0:32:59.230 -like 0.99 so in practice you can get pretty close to one and what happens is +como 0,99 então na prática você pode chegar bem perto de um e o que acontece é 0:32:59.230,0:33:09.130 -the smaller values they result in you're changing direction quicker okay so in +os valores menores que resultam em você está mudando de direção mais rápido ok então em 0:33:09.130,0:33:12.820 -this diagram you can see on the left with the small β you as soon as you +este diagrama você pode ver à esquerda com o pequeno β você assim que você 0:33:12.820,0:33:16.120 -get close to the solution you kind of change direction pretty rapidly and head +aproxime-se da solução, você meio que muda de direção rapidamente e segue 0:33:16.120,0:33:19.900 -towards a solution when you use these larger βs it takes longer for you to +em direção a uma solução quando você usa esses βs maiores, leva mais tempo para você 0:33:19.900,0:33:23.530 -make this dramatic turn you can think of it as a car with a bad turning circle +fazer esta curva dramática você pode pensar nisso como um carro com um raio de viragem ruim 0:33:23.530,0:33:26.170 -takes you quite a long time to get around that corner and head towards +leva muito tempo para contornar aquela esquina e seguir em direção 0:33:26.170,0:33:31.180 -solution now this may seem like a bad thing but actually in practice this +solução agora isso pode parecer uma coisa ruim, mas na prática isso 0:33:31.180,0:33:35.110 -significantly dampens the oscillations that you get from gradient descent and +amortece significativamente as oscilações que você obtém da descida do gradiente e 0:33:35.110,0:33:40.450 -that's the nice property of it now in terms of practice I can give you some +essa é a boa propriedade disso agora em termos de prática, posso lhe dar algumas 0:33:40.450,0:33:45.760 -pretty clear guidance here you pretty much always want to use momentum it's +orientação bastante clara aqui, você sempre quer usar o impulso é 0:33:45.760,0:33:48.820 -pretty hard to find problems where it's actually not beneficial to some degree +muito difícil encontrar problemas onde na verdade não é benéfico até certo ponto 0:33:48.820,0:33:52.960 -now part of the reason for this is it's just an extra parameter now typically +agora parte da razão para isso é que é apenas um parâmetro extra agora normalmente 0:33:52.960,0:33:55.870 -when you take some method and just add more parameters to it you can usually +quando você pega algum método e apenas adiciona mais parâmetros a ele, geralmente pode 0:33:55.870,0:34:01.000 -find some value of that parameter that makes us slightly better now that is +encontrar algum valor desse parâmetro que nos torne um pouco melhor agora que é 0:34:01.000,0:34:04.330 -sometimes the case here but often these improvements from using momentum are +às vezes é o caso aqui, mas muitas vezes essas melhorias do uso do impulso são 0:34:04.330,0:34:08.810 -actually quite substantial and using a momentum value of point nine is +realmente bastante substancial e usar um valor de momento de ponto nove é 0:34:08.810,0:34:13.610 -really a default value used in machine learning quite often and 
often in some +realmente um valor padrão usado no aprendizado de máquina com bastante frequência e muitas vezes em alguns 0:34:13.610,0:34:19.010 -situations 0.99 may be better so I would recommend trying both values if you have +situações 0,99 pode ser melhor, então eu recomendaria tentar ambos os valores se você tiver 0:34:19.010,0:34:24.770 -time otherwise just try point nine but I have to do a warning the way momentum is +caso contrário, tente o ponto nove, mas eu tenho que fazer um aviso sobre a forma como o impulso é 0:34:24.770,0:34:29.300 -stated in this expression if you look at it carefully when we increase the +declarado nesta expressão se você olhar com atenção quando aumentamos o 0:34:29.300,0:34:36.440 -momentum we kind of increase the step size now it's not the step size of the +momento nós meio que aumentamos o tamanho do passo agora não é o tamanho do passo do 0:34:36.440,0:34:39.380 -current gradient so the current gradient is included in the step with the same +gradiente atual para que o gradiente atual seja incluído na etapa com o mesmo 0:34:39.380,0:34:43.399 -strengths but past gradients become included in the step with a higher +pontos fortes, mas gradientes passados ​​são incluídos na etapa com uma maior 0:34:43.399,0:34:48.290 -strength when you increase momentum now when you write momentum in other forms +força quando você aumenta o momento agora quando você escreve o momento em outras formas 0:34:48.290,0:34:53.179 -this becomes a lot more obvious so this firm kind of occludes that but what you +isso se torna muito mais óbvio, então esse tipo de empresa obstrui isso, mas o que você 0:34:53.179,0:34:58.820 -should generally do when you change momentum you want to change it so that +geralmente deve fazer quando você altera o momento, você deseja alterá-lo para que 0:34:58.820,0:35:04.310 -you have your step size divided by one minus β is your new step size so if +você tem o tamanho do seu passo dividido por um menos β é o seu novo tamanho do passo, então se 0:35:04.310,0:35:07.790 -your old step size was using a certain B do you want to map it to that equation +seu tamanho de passo antigo estava usando um certo B, você deseja mapeá-lo para essa equação 0:35:07.790,0:35:11.690 -then map it back to get the the new step size now this may be very modest change +em seguida, mapeie-o de volta para obter o novo tamanho da etapa agora, isso pode ser uma mudança muito modesta 0:35:11.690,0:35:16.400 -but if you're going from momentum 0.9 to momentum 0.99 you may need to reduce +mas se você estiver indo do momento 0,9 para o momento 0,99, talvez seja necessário reduzir 0:35:16.400,0:35:20.480 -your learning rate by a factor of 10 approximately so just be wary of that +sua taxa de aprendizado por um fator de 10 aproximadamente, então tenha cuidado com isso 0:35:20.480,0:35:22.850 -you can't expect to keep the same learning rate and change the momentum +você não pode esperar manter a mesma taxa de aprendizado e mudar o ritmo 0:35:22.850,0:35:27.260 -parameter at wallmart work now I want to go into a bit of detail about why +parâmetro no trabalho do wallmart agora quero entrar em detalhes sobre o porquê 0:35:27.260,0:35:31.880 -momentum works is very misunderstood and the explanation you'll see in that +o momentum funciona é muito mal compreendido e a explicação que você verá nisso 0:35:31.880,0:35:38.570 -Distilled post is acceleration and this is certainly a contributor to the +Pós destilado é aceleração e isso certamente contribui para a 0:35:38.570,0:35:44.380 -performance 
of momentum now acceleration is a topic yes if you've got a question +desempenho do momento; agora, aceleração é um tópico... sim, se você tem uma pergunta 0:35:44.380,0:35:48.170 -the question was is there a big difference between using momentum and +a pergunta era se há uma grande diferença entre usar momento e 0:35:48.170,0:35:54.890 -using a mini batch of two and there is so momentum has advantages in for when +usar um minilote de dois, e há; o momento tem vantagens tanto quando se está 0:35:54.890,0:35:59.150 -using gradient descent as well as stochastic gradient descent so in fact +usando gradiente descendente quanto gradiente descendente estocástico; de fato, 0:35:59.150,0:36:03.110 -this acceleration explanation were about to use applies both in the stochastic +esta explicação de aceleração que estamos prestes a usar se aplica tanto no caso estocástico 0:36:03.110,0:36:07.520 -and non stochastic case so no matter what batch size you're going to use the +quanto no não estocástico; portanto, não importa o tamanho do lote que você usará, os 0:36:07.520,0:36:13.100 -benefits of momentum still are shown now it also has benefits in the stochastic +benefícios do momento ainda aparecem; ele também tem benefícios no caso 0:36:13.100,0:36:17.000 -case as well which I'll cover in a slide or two so the answer is it's quite +estocástico também, que abordarei em um slide ou dois; então, a resposta é que é bastante 0:36:17.000,0:36:19.579 -distinct from batch size and you shouldn't complete them +distinto do tamanho do lote e você não deve confundi-los; 0:36:19.579,0:36:22.459 -learn it like really you should be changing your learning rate when you +na verdade, você deveria estar mudando sua taxa de aprendizado quando você 0:36:22.459,0:36:26.239 -change your bat size rather than changing the momentum and for very large +muda o tamanho do lote, em vez de mudar o momento; e, para tamanhos de 0:36:26.239,0:36:30.380 -batch sizes there's a clear relationship between learning rate and batch size but +lote muito grandes, há uma relação clara entre a taxa de aprendizado e o tamanho do lote, mas 0:36:30.380,0:36:34.729 -for small batch sizes it's not clear so it's problem dependent any other +para tamanhos de lote pequenos não está claro, então depende do problema; mais alguma 0:36:34.729,0:36:38.599 -questions before I move on on momentum yes yes it's it's just blow up so it's +pergunta sobre momento antes de eu seguir em frente? Sim, sim, ele simplesmente explode; é que, 0:36:38.599,0:36:42.979 -actually in the in the in the physics interpretation it's conservation of +na verdade, na interpretação da física, é a conservação do 0:36:42.979,0:36:48.499 -momentum would be exactly equal to one now that's not good because if you're in +momento, seria exatamente igual a um; isso não é bom porque, se você estiver em 0:36:48.499,0:36:51.890 -a world with no friction then you drop a heavy ball somewhere it's gonna keep +um mundo sem atrito e soltar uma bola pesada em algum lugar, ela vai continuar 0:36:51.890,0:36:56.479 -moving forever it's not good stuff so we need some dampening and this is where +se movendo para sempre, o que não é bom; então precisamos de algum amortecimento, e é aqui que 0:36:56.479,0:37:01.069 -the physics interpretation breaks down so you do need some damping now now you +a interpretação da física falha, então você realmente precisa de algum amortecimento; agora você 0:37:01.069,0:37:05.209 -can imagine if you use a larger value than one those past gradients get +pode imaginar se você
usar um valor maior do que aquele que os gradientes anteriores obtêm 0:37:05.209,0:37:09.410 -amplified every step so in fact the first gradient you evaluate in your +amplificado a cada passo, então, na verdade, o primeiro gradiente que você avalia em seu 0:37:09.410,0:37:13.940 -network is not relevant information content wise later in optimization but +rede não é conteúdo de informação relevante mais tarde na otimização, mas 0:37:13.940,0:37:16.910 -if it used to be the larger than 1 it would dominate the step that you're +se fosse maior que 1, ele dominaria a etapa que você está 0:37:16.910,0:37:21.170 -using does that answer your question yeah ok any other questions about +usando isso responde sua pergunta sim ok quaisquer outras perguntas sobre 0:37:21.170,0:37:26.359 -momentum before we move on they are for a particular value of β yes it's +momento antes de seguirmos em frente eles são para um valor particular de β sim, é 0:37:26.359,0:37:30.859 -strictly equivalent it's not very hard to you should be able to do it in like +estritamente equivalente, não é muito difícil você ser capaz de fazê-lo como 0:37:30.859,0:37:38.359 -two lines if you try and do the equivalence yourself no the bidders are +duas linhas se você tentar fazer a equivalência, não os licitantes são 0:37:38.359,0:37:40.910 -not quite the same but the the γ is the same that's why I use the same +não é bem o mesmo mas o γ é o mesmo é por isso que eu uso o mesmo 0:37:40.910,0:37:45.319 -notation for it oh yes so that's what I mentioned yes so when you change β +notação para isso oh sim, então é isso que eu mencionei sim, então quando você muda β 0:37:45.319,0:37:48.349 -you want to scale your learning rate by the learning rate divided by one over +você deseja dimensionar sua taxa de aprendizado pela taxa de aprendizado dividida por um sobre 0:37:48.349,0:37:52.369 -β so in this form I'm not sure if it appears in this form it could be a +β então neste formulário não tenho certeza se ele aparece neste formulário pode ser um 0:37:52.369,0:37:55.969 -mistake but I think I'm okay here I think it's not in this formula but yeah +erro mas acho que estou bem aqui acho que não está nessa fórmula mas sim 0:37:55.969,0:37:59.269 -what you definitely when you change β you need to change learning rate as well +o que você definitivamente quando muda β você precisa mudar a taxa de aprendizado também 0:37:59.269,0:38:09.300 -to keep things balanced yeah Oh either averaging form it's probably +para manter as coisas equilibradas, sim, ou a forma média é provavelmente 0:38:09.300,0:38:13.830 -not worth going over but you can think of it as momentum is basically changing +não vale a pena passar por cima, mas você pode pensar nisso como o momento está basicamente mudando 0:38:13.830,0:38:17.850 -the point that you evaluate the gradient at in the standard firm you evaluate the +o ponto em que você avalia o gradiente na empresa padrão, você avalia o 0:38:17.850,0:38:22.230 -gradient at this W point in the inner averaging form you take a running +gradiente neste ponto W na forma de média interna você faz uma corrida 0:38:22.230,0:38:25.890 -average of the points you've been evaluating the Grady Nutt and you +média dos pontos que você está avaliando o Grady Nutt e você 0:38:25.890,0:38:30.630 -evaluate at that point so it's basically instead of averaging gradients to +avaliar nesse ponto, então é basicamente em vez de calcular a média de gradientes para 0:38:30.630,0:38:37.530 -average points it's clear sense Jewell yes yes so acceleration now 
this is +pontos médios é sentido claro Jewell sim sim então aceleração agora isso é 0:38:37.530,0:38:43.260 -something you can spend the whole career studying and it's it's somewhat poorly +algo que você pode passar a carreira inteira estudando e é um pouco mal 0:38:43.260,0:38:47.070 -understood now if you try and read Nesterov original work on it now +entendido agora se você tentar ler o trabalho original de Nesterov agora 0:38:47.070,0:38:53.520 -Nesterov is kind of the grandfather of modern optimization in practically half +Nesterov é meio que o avô da otimização moderna em praticamente metade 0:38:53.520,0:38:56.460 -the methods we use are named after him to some degree which is can be confusing +os métodos que usamos têm o nome dele até certo ponto, o que pode ser confuso 0:38:56.460,0:39:01.740 -at times and in the 80s he came up with this formulation he didn't write it in +às vezes e nos anos 80 ele veio com essa formulação que ele não escreveu em 0:39:01.740,0:39:04.650 -this form he wrote it in another form which people realized a while later +desta forma ele escreveu em outra forma que as pessoas perceberam um tempo depois 0:39:04.650,0:39:09.450 -could be written in this form and his analysis is also very opaque and +poderia ser escrito desta forma e sua análise também é muito opaca e 0:39:09.450,0:39:15.590 -originally written in Russian doesn't help no for understanding unfortunately +originalmente escrito em russo não ajuda não para entender infelizmente 0:39:15.590,0:39:21.180 -those nice people the NSA translated all of the Russian literature back then so +aquelas pessoas legais que a NSA traduziu toda a literatura russa na época, então 0:39:21.180,0:39:27.330 -so we have access to them and it's actually a very small modification of +para que tenhamos acesso a eles e, na verdade, é uma modificação muito pequena de 0:39:27.330,0:39:31.890 -the momentum step but I think that small modification belittles what it's +a etapa de impulso, mas acho que uma pequena modificação diminui o que é 0:39:31.890,0:39:36.600 -actually doing it's really not the same method at all what I can say is with +realmente não é o mesmo método, o que posso dizer é com 0:39:36.600,0:39:41.400 -Nesterov Swimmer momentum if you very carefully choose these constants you can +Nesterov Swimmer momentum se você escolher cuidadosamente essas constantes, você pode 0:39:41.400,0:39:46.050 -get what's known as accelerated convergence now this doesn't apply in +obter o que é conhecido como convergência acelerada agora isso não se aplica em 0:39:46.050,0:39:49.560 -your networks but for convex problems I won't go into details of convexity but +suas redes, mas para problemas convexos, não entrarei em detalhes de convexidade, mas 0:39:49.560,0:39:52.230 -some of you may know what that means it's kind of a simple structure but +alguns de vocês podem saber o que isso significa, é uma estrutura simples, mas 0:39:52.230,0:39:55.740 -convex problems it's a radically improved convergence rate from this +problemas convexos é uma taxa de convergência radicalmente melhorada deste 0:39:55.740,0:39:59.940 -acceleration but only for very carefully chosen constants and you really can't +aceleração, mas apenas para constantes cuidadosamente escolhidas e você realmente não pode 0:39:59.940,0:40:03.030 -choose these carefully ahead of time so you've got to do quite a large search +escolha-os cuidadosamente com antecedência para que você tenha que fazer uma pesquisa bastante grande 0:40:03.030,0:40:05.640 -over your parameters your 
hyper parameters sorry to find the right +sobre seus parâmetros seus hiperparâmetros desculpe encontrar o certo 0:40:05.640,0:40:10.710 -constants to get that acceleration what I can say is this actually occurs for +constantes para obter essa aceleração, o que posso dizer é que isso realmente ocorre para 0:40:10.710,0:40:14.779 -quadratics when using regular momentum and this is confused a lot of people +quadráticos ao usar o momento regular e isso confunde muitas pessoas 0:40:14.779,0:40:18.559 -so you'll see a lot of people say that momentum is an accelerated method it's +então você verá muitas pessoas dizendo que o momentum é um método acelerado, é 0:40:18.559,0:40:23.449 -excited only for quadratics and even then it's it's a little bit iffy I would +animado apenas para quadráticas e mesmo assim é um pouco duvidoso que eu faria 0:40:23.449,0:40:27.529 -not recommend using it for quadratics use conjugate gradients or some new +não recomendo usá-lo para quadráticos, use gradientes conjugados ou alguns novos 0:40:27.529,0:40:33.499 -methods that have been developed over the last few years and this is +métodos que têm sido desenvolvidos ao longo dos últimos anos e isso é 0:40:33.499,0:40:36.919 -definitely a contributing factor to our momentum works so well in practice and +definitivamente um fator que contribui para o nosso impulso funciona tão bem na prática e 0:40:36.919,0:40:42.499 -there's definitely some acceleration going on but this acceleration is hard +há definitivamente alguma aceleração acontecendo, mas essa aceleração é difícil 0:40:42.499,0:40:46.669 -to realize when you have stochastic gradients now when you look at what +para perceber quando você tem gradientes estocásticos agora quando você olha para o que 0:40:46.669,0:40:51.679 -makes acceleration work noise really kills it and it's it's hard to believe +faz barulho de trabalho de aceleração realmente mata e é difícil de acreditar 0:40:51.679,0:40:55.549 -that it's the main factor contributing to the performance but it's certainly +que é o principal fator que contribui para o desempenho, mas é certamente 0:40:55.549,0:40:59.989 -there and the the still post I mentioned attributes or the performance of +lá e o post ainda que mencionei atributos ou o desempenho de 0:40:59.989,0:41:02.689 -momentum to acceleration but I wouldn't go that quite that far but it's +impulso para aceleração, mas eu não iria tão longe, mas é 0:41:02.689,0:41:08.390 -definitely a contributing factor but probably the practical and provable +definitivamente um fator contribuinte, mas provavelmente a prática e comprovável 0:41:08.390,0:41:13.669 -reason why acceleration why knows sorry why momentum helps is noise smoothing +razão pela qual a aceleração por que sabe desculpe por que o impulso ajuda é a suavização de ruído 0:41:13.669,0:41:21.619 -and this is very intuitive momentum averages gradients in a sense we keep +e isso são gradientes de médias de momento muito intuitivos no sentido de que mantemos 0:41:21.619,0:41:25.099 -this running buffer gradients that we use as a step instead of individual +esses gradientes de buffer de execução que usamos como uma etapa em vez de individual 0:41:25.099,0:41:30.259 -gradients this is kind of a form of averaging and it turns out that when you +gradientes isso é uma forma de média e acontece que quando você 0:41:30.259,0:41:33.229 -use s to D without momentum to prove anything at all about it +use s para D sem momento para provar qualquer coisa sobre isso 0:41:33.229,0:41:37.449 -you actually have to work 
with the average of all the points you visited +você realmente tem que trabalhar com a média de todos os pontos que você visitou 0:41:37.449,0:41:42.380 -you can get really weak bounds on the last point that you ended up at but +você pode obter limites muito fracos para o último ponto em que você parou, mas 0:41:42.380,0:41:45.349 -really you've got to work with this average of points and this is suboptimal +realmente você tem que trabalhar com essa média de pontos, e isso não é o ideal; 0:41:45.349,0:41:48.529 -like we never want to actually take this average in practice it's heavily +tipo, nunca queremos de fato calcular essa média na prática, ela é muito 0:41:48.529,0:41:52.099 -weighted with points that we visited a long time ago which may be irrelevant +ponderada por pontos que visitamos há muito tempo, que podem ser irrelevantes, 0:41:52.099,0:41:55.159 -and in fact this averaging doesn't work very well in practice for neural +e, de fato, essa média não funciona muito bem na prática para redes 0:41:55.159,0:41:59.150 -networks it's really only important for convex problems but nevertheless it's +neurais; ela é realmente importante apenas para problemas convexos, mas mesmo assim é 0:41:59.150,0:42:03.380 -necessary to analyze regular s2d and one of the remarkable facts about momentum +necessária para analisar o SGD comum; e um dos fatos notáveis sobre o momento 0:42:03.380,0:42:09.019 -is actually this averaging is no longer theoretically necessary so essentially +é que, na verdade, essa média não é mais teoricamente necessária; então, essencialmente, 0:42:09.019,0:42:14.509 -momentum adds smoothing dream optimization that makes it makes us so +o momento adiciona suavização durante a otimização, o que faz com que 0:42:14.509,0:42:19.459 -the last point you visit is still a good approximation to the solution with SGG +o último ponto que você visita ainda seja uma boa aproximação da solução; com o SGD, 0:42:19.459,0:42:23.329 -really you want to average a whole bunch of last points you've seen in order to +na verdade, você quer fazer a média de um monte dos últimos pontos que você viu para 0:42:23.329,0:42:26.700 -get a good approximation to the solution now let me illustrate that +obter uma boa aproximação da solução; agora deixe-me ilustrar isso 0:42:26.700,0:42:31.190 -here so this is this is a very typical example of what happens when using STD +aqui; então, este é um exemplo muito típico do que acontece ao usar o 0:42:31.190,0:42:36.329 -STD at the beginning you make great progress the gradient is essentially +SGD: no início, você faz um grande progresso, o gradiente é essencialmente 0:42:36.329,0:42:39.960 -almost the same as the stochastic gradient so first few steps you make +quase o mesmo que o gradiente estocástico; então, nos primeiros passos você faz 0:42:39.960,0:42:44.490 -great progress towards solution but then you end up in this ball now recall here +um grande progresso em direção à solução, mas então você acaba nesta bola; lembre-se aqui 0:42:44.490,0:42:47.579 -that's a valley that we're heading down so this ball here is kind of the floor +de que esse é um vale que estamos descendo, então essa bola aqui é meio que o fundo 0:42:47.579,0:42:53.550 -of the valley and you kind of bounce around in this floor and the most common +do vale e você meio que fica quicando nesse fundo; e a 0:42:53.550,0:42:56.579 -solution of this is if you reduce your learning rate you'll bounce around +solução mais comum para isso é: se você reduzir sua taxa de aprendizado, você vai quicar 0:42:56.579,0:43:01.290 -slower not exactly a great solution but it's
one way to handle it but when you +mais devagar; não é exatamente uma ótima solução, mas é uma maneira de lidar com isso; mas quando você 0:43:01.290,0:43:04.710 -use s to deal with momentum you can kind of smooth out this bouncing around and +usa SGD com momento, você consegue meio que suavizar esses quiques e 0:43:04.710,0:43:08.160 -you kind of just kind of wheel around now the path is not always going to be +você meio que apenas contorna; agora, o caminho nem sempre vai ser 0:43:08.160,0:43:12.300 -this corkscrew tile path it's actually quite random you could kind of wobble +esse caminho em estilo saca-rolhas, na verdade é bastante aleatório, você pode meio que oscilar para a 0:43:12.300,0:43:15.990 -left and right but when I seeded it with 42 this is what it spread out so that's +esquerda e para a direita, mas quando eu usei a semente 42 foi isso o que saiu, então é isso 0:43:15.990,0:43:20.790 -what I'm using here you typically get this corkscrew you get this cork scoring +que estou usando aqui; você normalmente obtém esse saca-rolhas, esse movimento de saca-rolhas, 0:43:20.790,0:43:24.660 -for this set of parameters and yeah I think this is a good explanation so some +para este conjunto de parâmetros; e sim, acho que esta é uma boa explicação: alguma 0:43:24.660,0:43:27.960 -combination of acceleration and noise smoothing is why momentum works +combinação de aceleração e suavização de ruído é o motivo pelo qual o momento funciona 0:43:27.960,0:43:33.180 -oh yes yes so I should say that when we inject noise here the gradient may not +ah sim, sim, devo dizer que, quando injetamos ruído aqui, o gradiente pode nem 0:43:33.180,0:43:37.470 -even be the right direction to travel in fact it could be in the opposite +ser a direção certa a seguir; na verdade, poderia estar na direção oposta 0:43:37.470,0:43:40.800 -direction from where you want to go and this is why you kind of bounce around in +à de onde você quer ir, e é por isso que você meio que fica quicando 0:43:40.800,0:43:46.410 -the valley there so in fact the gray you can see here that the first step with +no vale ali; na verdade, você pode ver aqui que o primeiro passo com 0:43:46.410,0:43:49.980 -SUV is practically orthogonal to the level set there that's because it is +o SGD é praticamente ortogonal ao conjunto de nível ali, e isso porque ele é 0:43:49.980,0:43:52.770 -such a good step at the beginning but once you get further down it can point +um passo tão bom no início, mas uma vez que você desce mais, ele pode apontar 0:43:52.770,0:44:00.300 -in pretty much any direction vaguely around the solution so yesterday with +em praticamente qualquer direção vagamente em torno da solução; então, SGD com 0:44:00.300,0:44:03.540 -momentum is currently state of the art optimization method for a lot of machine +momento é atualmente o método de otimização estado da arte para muitos problemas de 0:44:03.540,0:44:08.730 -learning problems so you'll probably be using it in your course for a lot of +aprendizado de máquina, então você provavelmente vai usá-lo em seu curso para muitos 0:44:08.730,0:44:12.990 -problems but there has been some other innovations over the years and these are +problemas, mas houve algumas outras inovações ao longo dos anos e estas são 0:44:12.990,0:44:16.829 -particularly useful for poorly conditioned problems now as I mentioned +particularmente úteis para problemas mal condicionados; agora, como mencionei 0:44:16.829,0:44:19.770 -earlier in the lecture some problems have this kind of well condition +no início da palestra,
alguns problemas têm esse tipo de condição de poço 0:44:19.770,0:44:22.530 -property that we can't really characterize for neural networks but we +propriedade que não podemos realmente caracterizar para redes neurais, mas 0:44:22.530,0:44:27.450 -can measure it by the test that if s to D works then it's well conditioned +pode medi-lo pelo teste de que se s para D funcionar, então está bem condicionado 0:44:27.450,0:44:31.470 -eventually there doesent works and if I must be walking poorly conditioned so we +eventualmente não funciona e se devo estar andando mal condicionado então 0:44:31.470,0:44:34.410 -have other methods we can handle we can use to handle this in some +temos outros métodos que podemos manipular, podemos usar para lidar com isso em alguns 0:44:34.410,0:44:39.690 -situations and these generally are called adaptive methods now you need to +situações e estes geralmente são chamados de métodos adaptativos agora você precisa 0:44:39.690,0:44:43.500 -be a little bit careful because what are you adapting to people in literature use +tenha um pouco de cuidado porque o que você está adaptando para as pessoas na literatura usam 0:44:43.500,0:44:51.780 -this nomenclature for adapting learning rates adapting momentum parameters but +esta nomenclatura para adaptar as taxas de aprendizagem adaptando os parâmetros de momento, mas 0:44:51.780,0:44:56.339 -in our our situation we're talk about a specific type of adaptivity roman this +em nossa situação, estamos falando de um tipo específico de adaptabilidade romana 0:44:56.339,0:45:03.780 -adaptivity is individual learning rates now what I mean by that so in the +adaptabilidade são as taxas de aprendizado individual agora o que quero dizer com isso, então no 0:45:03.780,0:45:06.869 -simulation I already showed you a stochastic gradient descent +simulação eu já mostrei uma descida de gradiente estocástica 0:45:06.869,0:45:10.619 -I used a global learning rate by that I mean every single rate in your network +Eu usei uma taxa de aprendizado global, quero dizer todas as taxas em sua rede 0:45:10.619,0:45:16.800 -is updated using an equation with the same γ now γ could vary over +é atualizado usando uma equação com o mesmo γ agora γ pode variar ao longo 0:45:16.800,0:45:21.720 -time step so you used γ K in the notation but often you use a fixed +passo de tempo, então você usou γ K na notação, mas muitas vezes você usa um fixo 0:45:21.720,0:45:26.310 -camera for quite a long time but for adaptive methods we want to adapt a +câmera por um bom tempo, mas para métodos adaptativos queremos adaptar um 0:45:26.310,0:45:30.240 -learning rate for every weight individually and we want to use +taxa de aprendizagem para cada peso individualmente e queremos usar 0:45:30.240,0:45:37.109 -information we get from gradients for each weight to adapt this so this seems +informações que obtemos de gradientes para cada peso para adaptar isso, então isso parece 0:45:37.109,0:45:39.900 -like the obvious thing to do and people have been trying to get this stuff to +como a coisa óbvia a fazer e as pessoas têm tentado fazer com que essas coisas 0:45:39.900,0:45:43.200 -work for decades and we're kind of stumbled upon some methods that work and +funcionam há décadas e meio que nos deparamos com alguns métodos que funcionam e 0:45:43.200,0:45:48.510 -some that don't but I want to ask for questions here if there's any any +alguns que não, mas eu quero fazer perguntas aqui se houver alguma 0:45:48.510,0:45:53.040 -explanation needed so I can say that it's not entirely 
clear why you need to +explicação necessária para que eu possa dizer que não está totalmente claro por que você precisa 0:45:53.040,0:45:56.880 -do this right if your network is well conditioned you don't need to do this +faça isso direito se sua rede estiver bem condicionada você não precisa fazer isso 0:45:56.880,0:46:01.349 -potentially but often the network's we use in practice have very different +potencialmente, mas muitas vezes as redes que usamos na prática têm 0:46:01.349,0:46:05.069 -structure in different parts of the network so for instance the early parts +estrutura em diferentes partes da rede, por exemplo, as primeiras partes 0:46:05.069,0:46:10.619 -of your convolutional neural network may be very shallow convolutional layers on +da sua rede neural convolucional podem ser camadas convolucionais muito rasas em 0:46:10.619,0:46:14.849 -large images later in the network you're going to be doing convolutions with +imagens grandes posteriormente na rede, você fará convoluções com 0:46:14.849,0:46:18.359 -large numbers of channels on small images now these operations are very +grande número de canais em imagens pequenas agora essas operações são muito 0:46:18.359,0:46:21.150 -different and there's no reason to believe that a learning rate that works +diferente e não há razão para acreditar que uma taxa de aprendizado que funciona 0:46:21.150,0:46:26.310 -well for one would work well for the other and this is why the adaptive +bem para um funcionaria bem para o outro e é por isso que a adaptação 0:46:26.310,0:46:28.140 -learning rates can be useful any questions here +taxas de aprendizagem podem ser úteis quaisquer perguntas aqui 0:46:28.140,0:46:32.250 -yes so unfortunately there's no good definition for neural networks we +sim, infelizmente não há uma boa definição para redes neurais que 0:46:32.250,0:46:35.790 -couldn't measure it even if there was a good definition so I'm going to use it +não poderia medi-lo mesmo que houvesse uma boa definição, então vou usá-lo 0:46:35.790,0:46:40.109 -in a vague sense that it actually doesn't works and it's poorly +em um sentido vago que realmente não funciona e é mal 0:46:40.109,0:46:42.619 -conditioned yes so in the sort of quadratic case if +condicionado sim, então no tipo de caso quadrático se 0:46:45.830,0:46:51.380 -you recall I have an explicit definition of this condition number L over μ. +você se lembra que eu tenho uma definição explícita deste número de condição L sobre μ. 
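Para ilustrar a discussão acima sobre momento e sobre o número de condição κ = L/μ, segue um esboço mínimo e puramente hipotético em Python/NumPy (não faz parte da aula; os valores de L, μ, γ e β são apenas ilustrativos). Ele compara gradiente descendente puro e com momento em uma quadrática mal condicionada e mostra o reescalonamento do passo por 1/(1 − β):

```python
import numpy as np

# Quadrática mal condicionada f(x) = 0.5 * x^T A x, com autovalores L e mu (valores hipotéticos)
L, mu = 100.0, 1.0
A = np.diag([L, mu])
print("numero de condicao kappa =", L / mu)  # quanto maior, pior condicionado

def gd(beta, gamma, steps=500):
    """Descida com momento na forma de buffer: p = beta*p + grad; x = x - gamma*p."""
    x = np.array([1.0, 1.0])
    p = np.zeros_like(x)
    for _ in range(steps):
        grad = A @ x          # gradiente da quadrática
        p = beta * p + grad   # buffer de momento (beta = 0 recupera a descida de gradiente pura)
        x = x - gamma * p
    return x

gamma = 1.8 / L                      # passo seguro para a descida de gradiente pura
print("sem momento :", gd(beta=0.0, gamma=gamma))
# Ao aumentar beta, o tamanho de passo efetivo cresce aproximadamente como gamma / (1 - beta),
# então reescalamos gamma para manter o passo efetivo comparável:
beta = 0.9
print("com momento :", gd(beta=beta, gamma=gamma * (1 - beta)))
```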
0:46:51.380,0:46:55.910 -L being maximized in value μ being smallest eigen value and yeah the large +L sendo o maior autovalor e μ sendo o menor autovalor; e sim, quanto maior 0:46:55.910,0:47:00.140 -of this gap between largest larger and smaller eigen value the worst condition +essa lacuna entre o maior e o menor autovalor, pior condicionado 0:47:00.140,0:47:03.320 -it is this does not imply if in your network so that μ does not exist in +ele é; isso não se aplica diretamente, pois esse μ não existe em 0:47:03.320,0:47:07.610 -your networks L still has some information in it but I wouldn't say +suas redes; L ainda carrega alguma informação, mas eu não diria que 0:47:07.610,0:47:12.800 -it's a determining factor there's just a lot going on so there are some ways that +é um fator determinante, há simplesmente muita coisa acontecendo; então, existem algumas maneiras pelas quais 0:47:12.800,0:47:15.619 -your looks behave a lot like simple problems but there are other ways where +suas redes se comportam muito como problemas simples, mas existem outras em que 0:47:15.619,0:47:23.090 -we just kind of hang wave and say that they like them yeah yeah yes so for this +nós meio que só gesticulamos e dizemos que elas se parecem com eles; sim, sim, então, para esta 0:47:23.090,0:47:25.910 -particular network this is a network that actually isn't too poorly +rede em particular, esta é uma rede que na verdade não é tão mal 0:47:25.910,0:47:30.920 -conditioned already in fact this is a VDD 16 which is practically the best net +condicionada; na verdade, esta é uma VGG-16, que é praticamente a melhor rede, 0:47:30.920,0:47:34.490 -method best network when you had a train before the invention of certain +a melhor rede que você conseguia treinar antes da invenção de certas 0:47:34.490,0:47:37.369 -techniques to improve conditioning so this is almost the best of first +técnicas para melhorar o condicionamento; então, isso é quase o melhor 0:47:37.369,0:47:40.910 -condition you can actually get and there are a lot of the structure of this +condicionamento que você consegue de fato obter, e muito da estrutura desta 0:47:40.910,0:47:45.140 -network is actually defined by this conditioning like we double the number +rede é na verdade definida por esse condicionamento, tipo, nós dobramos o número 0:47:45.140,0:47:48.680 -of channels after certain steps because that seems to result in networks at a +de canais após certas etapas porque isso parece resultar em redes 0:47:48.680,0:47:53.600 -world condition rather than any other reason but it's certainly what you can +bem condicionadas, mais do que por qualquer outro motivo; mas o que certamente se pode 0:47:53.600,0:47:57.170 -say is that weights very light the network have very large effect on the +dizer é que pesos bem no final da rede têm um efeito muito grande na 0:47:57.170,0:48:02.630 -output that very last layer there with if there are 4096 weights in it that's a +saída; aquela última camada ali, se tiver 4096 pesos, isso é um 0:48:02.630,0:48:06.400 -very small number of whites this network has millions of whites I believe those +número muito pequeno de pesos, esta rede tem milhões de pesos; acredito que aqueles 0:48:06.400,0:48:10.640 -4096 weights have a very strong effect on the output because they directly +4096 pesos têm um efeito muito forte na saída porque eles diretamente 0:48:10.640,0:48:14.450 -dictate that output and for that reason you generally want to use smaller +ditam essa saída e, por esse motivo, você geralmente quer usar menores 0:48:14.450,0:48:19.190 -learning rates for
those whereas yeah weights early in the network some of +taxas de aprendizado para aqueles, enquanto sim pondera no início da rede alguns dos 0:48:19.190,0:48:21.770 -them might have a large effect but especially when you've initialized +eles podem ter um grande efeito, mas especialmente quando você inicializou 0:48:21.770,0:48:25.910 -network of randomly they typically will have a smaller effect of those those +rede de aleatoriamente eles normalmente terão um efeito menor daqueles que 0:48:25.910,0:48:29.840 -earlier weights and this is very hand wavy and the reason why is because we +pesos anteriores e isso é muito ondulado e a razão é porque nós 0:48:29.840,0:48:33.859 -really don't understand this well enough for me to give you a precise precise +realmente não entendo isso bem o suficiente para eu lhe dar uma precisão precisa 0:48:33.859,0:48:41.270 -statement here 120 million weights in this network actually so yeah so that +declaração aqui 120 milhões de pesos nesta rede, na verdade, sim, para que 0:48:41.270,0:48:47.710 -last layer is like 4096 by 4096 matrix so +última camada é como 4096 por 4096 matriz, então 0:48:47.950,0:48:53.510 -yeah okay any other questions yeah yes I would recommend only using them when +sim ok qualquer outra pergunta sim sim eu recomendaria usá-los apenas quando 0:48:53.510,0:48:59.120 -your problem doesn't have a structure that decomposes into a large sum of +seu problema não tem uma estrutura que se decompõe em uma grande soma de 0:48:59.120,0:49:04.880 -similar things okay yeah that's a bit of a mouthful but sut works well when you +coisas semelhantes ok, sim, isso é um pouco difícil, mas funciona bem quando você 0:49:04.880,0:49:09.830 -have an objective that is a sum where each term of the sum is is vaguely +tem um objetivo que é uma soma onde cada termo da soma é vagamente 0:49:09.830,0:49:14.990 -comparable so in machine learning each sub term in this sum is a loss of one +comparável, portanto, no aprendizado de máquina, cada subtermo nessa soma é uma perda de um 0:49:14.990,0:49:18.290 -data point and these have very similar structures individual losses that's a +ponto de dados e estes têm estruturas muito semelhantes perdas individuais que é um 0:49:18.290,0:49:21.080 -hand-wavy sense that they have very similar structure because of course each +sentido de mão ondulada que eles têm estrutura muito semelhante porque é claro que cada 0:49:21.080,0:49:25.220 -data point could be quite different but when your problem doesn't have a large +ponto de dados pode ser bem diferente, mas quando seu problema não tem um grande 0:49:25.220,0:49:30.440 -sum as the main part of its structure then l-bfgs would be useful that's the +sum como a parte principal de sua estrutura, então l-bfgs seria útil, essa é a 0:49:30.440,0:49:35.840 -general answer I doubt you make use of it in this course l-bfgs doubt it that +resposta geral duvido que você faça uso disso neste curso l-bfgs duvido que 0:49:35.840,0:49:40.660 -it can be very handy for small networks you can experiment around with it with +pode ser muito útil para pequenas redes com as quais você pode experimentar 0:49:40.660,0:49:44.720 -the leaner v network or something which I'm sure you probably use in this course +a rede leaner v ou algo que eu tenho certeza que você provavelmente usa neste curso 0:49:44.720,0:49:51.230 -you could experiment with l-bfgs probably and have some success there one +você poderia experimentar com l-bfgs provavelmente e ter algum sucesso lá 0:49:51.230,0:49:58.670 -of the kind of 
founding techniques in modern your network training is rmsprop +uma das técnicas meio que fundadoras do treinamento moderno de redes neurais é o RMSprop, 0:49:58.670,0:50:03.680 -and i'm going to talk about this year now at some point kind of the standard +e eu vou falar sobre isso aqui agora; em algum momento, a prática padrão 0:50:03.680,0:50:07.640 -practice in the field of optimization is in research and optimization kind of +no campo da otimização, isto é, a pesquisa em otimização, meio que 0:50:07.640,0:50:10.640 -diverged with what people were actually doing when training neural networks and +divergiu do que as pessoas estavam realmente fazendo ao treinar redes neurais, e 0:50:10.640,0:50:14.150 -this IMS prop was kind of the fracturing point where we all went off in different +este RMSprop foi uma espécie de ponto de ruptura em que todos nós partimos em diferentes 0:50:14.150,0:50:19.820 -directions and this rmsprop is usually attributed to Geoffrey Hinton slides +direções; e este RMSprop é geralmente atribuído aos slides de Geoffrey Hinton, 0:50:19.820,0:50:23.380 -which he then attributes to an unpublished paper from someone else +que ele então atribui a um artigo inédito de outra pessoa, 0:50:23.380,0:50:28.790 -which is really unsatisfying to be citing someone slides in a paper but +o que é realmente insatisfatório, ficar citando os slides de alguém em um artigo, mas 0:50:28.790,0:50:34.400 -anyway it's a method that has some it has no proof behind why it works but +de qualquer forma, é um método que não tem nenhuma prova de por que funciona, mas 0:50:34.400,0:50:38.050 -it's similar to methods that you can prove work so that's at least something +é semelhante a métodos que você consegue provar que funcionam, então isso já é alguma coisa, 0:50:38.050,0:50:43.520 -and it works pretty well in practice and that's why I look if we use it so I want +e funciona muito bem na prática, e é por isso que todos nós o usamos; então eu quero 0:50:43.520,0:50:46.310 -to give you that kind of introduction before what I explained what it actually +dar a vocês esse tipo de introdução antes de explicar o que ele realmente 0:50:46.310,0:50:51.020 -is and rmsprop stands for root mean squared propagation +é; e RMSprop significa propagação da raiz da média dos quadrados (root mean squared propagation); 0:50:51.020,0:50:54.579 -this was from the era where everything we do the fuel networks we +isso foi da época em que tudo o que fazíamos com redes neurais nós 0:50:54.579,0:50:58.690 -called propagation such-and-such like back prop which now we call deep so it +chamávamos de propagação tal-e-tal, como backprop, que agora chamamos de deep; então ele 0:50:58.690,0:51:02.920 -probably be called Armas deep propyl something if it was embedded now and +provavelmente se chamaria RMS-deep-alguma-coisa se fosse inventado agora; e 0:51:02.920,0:51:08.470 -it's a little bit of a modification so it still to line algorithm but a little +é só uma pequena modificação, então ainda é um algoritmo de duas linhas, mas um 0:51:08.470,0:51:11.200 -bit different so I'm gonna go over these terms in some detail because it's +pouco diferente; então vou repassar esses termos com algum detalhe porque é 0:51:11.200,0:51:19.450 -important to understand this now we we keep around this V buffer now this is +importante entender isso; agora, nós mantemos por perto esse buffer V, e isso 0:51:19.450,0:51:22.720 -not a momentum buffer okay so we using different notation here he is doing +não é um buffer de momento, ok; então estamos usando uma notação diferente aqui, ele está fazendo
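Sobre o comentário acima a respeito do L-BFGS para redes pequenas, segue um esboço hipotético de uso em PyTorch (o modelo, os dados e os hiperparâmetros abaixo são fictícios, apenas para ilustrar a interface baseada em closure; não é o exemplo da aula):

```python
import torch

# Modelo pequeno hipotético (algo do porte de uma LeNet, por exemplo) e dados fictícios
model = torch.nn.Sequential(torch.nn.Linear(20, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))
x, y = torch.randn(128, 20), torch.randint(0, 10, (128,))
loss_fn = torch.nn.CrossEntropyLoss()

# O L-BFGS do PyTorch usa uma "closure" que reavalia a perda e os gradientes a cada iteração interna
optimizer = torch.optim.LBFGS(model.parameters(), lr=0.1, max_iter=20)

def closure():
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    return loss

for _ in range(10):          # poucas passadas de exemplo, em lote completo (sem minilotes)
    loss = optimizer.step(closure)
print("perda final:", float(loss))
```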
0:51:22.720,0:51:27.069 -something different and I'm going to use some notation that that some people +algo diferente e vou usar alguma notação de que algumas pessoas 0:51:27.069,0:51:30.760 -really hates but I think it's convenient I'm going to write the element wise +realmente odeia, mas acho conveniente vou escrever o elemento sábio 0:51:30.760,0:51:36.040 -square of a vector just by squaring the vector this is not really confusing +quadrado de um vetor apenas elevando o vetor ao quadrado, isso não é realmente confuso 0:51:36.040,0:51:40.390 -notationally in almost all situations but it's a nice way to write it so here +notadamente em quase todas as situações, mas é uma boa maneira de escrevê-lo, então aqui 0:51:40.390,0:51:43.480 -I'm writing the gradient squared I really mean you take every element in +Estou escrevendo o gradiente ao quadrado, quero dizer que você pega todos os elementos 0:51:43.480,0:51:47.109 -that vector million element vector or whatever it is and square each element +esse vetor de milhão de elementos vetor ou o que quer que seja e quadrado cada elemento 0:51:47.109,0:51:51.309 -individually so this video update is what's known as an exponential moving +individualmente, então esta atualização de vídeo é conhecida como um movimento exponencial 0:51:51.309,0:51:55.480 -average I do I have a quick show of hands who's familiar with exponential +média eu tenho um rápido show de mãos que está familiarizado com exponencial 0:51:55.480,0:51:59.890 -moving averages I want to know if I need to talk about it in some more seems like +médias móveis eu quero saber se preciso falar sobre isso em um pouco mais parece 0:51:59.890,0:52:03.270 -it's probably need to explain it in some depth but in expose for a moving average +provavelmente é necessário explicá-lo com alguma profundidade, mas em exposição para uma média móvel 0:52:03.270,0:52:08.020 -it's a standard way this has been used for many many decades across many fields +é uma maneira padrão que isso tem sido usado por muitas décadas em muitos campos 0:52:08.020,0:52:14.650 -for maintaining an average that are the quantity that may change over time okay +para manter uma média que são as quantidades que podem mudar ao longo do tempo ok 0:52:14.650,0:52:19.630 -so when a quantity is changing over time we need to put larger weights on newer +então, quando uma quantidade está mudando ao longo do tempo, precisamos colocar pesos maiores em novos 0:52:19.630,0:52:24.210 -values because they provide more information and one way to do that is +valores porque eles fornecem mais informações e uma maneira de fazer isso é 0:52:24.210,0:52:30.700 -down weight old values exponentially and when you do this exponentially you mean +diminuir os valores antigos exponencialmente e quando você faz isso exponencialmente você quer dizer 0:52:30.700,0:52:36.880 -that the weight of an old value from say ten steps ago will have weight alpha to +que o peso de um valor antigo de digamos dez passos atrás terá peso alfa para 0:52:36.880,0:52:41.109 -the ten in your thing so that's where the exponential comes in the output of +o dez em sua coisa, então é aí que o exponencial vem na saída de 0:52:41.109,0:52:43.900 -the ten now it's that's not really in the notation and in the notation at each +o dez agora é que não está realmente na notação e na notação em cada 0:52:43.900,0:52:49.390 -step we just download the pass vector by this alpha constant and as if you can +passo, basta baixar o vetor de passagem por esta constante alfa e como se você pudesse 
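Um esboço mínimo e genérico, em Python, da média móvel exponencial que acaba de ser descrita (os valores usados são hipotéticos; o ponto é apenas que observações antigas recebem peso proporcional a α elevado à sua idade):

```python
import numpy as np

def ema(values, alpha=0.9):
    """Media movel exponencial: v_t = alpha * v_{t-1} + (1 - alpha) * x_t.
    Um valor observado ha k passos contribui com peso proporcional a alpha**k."""
    v = 0.0
    history = []
    for x in values:
        v = alpha * v + (1 - alpha) * x
        history.append(v)
    return np.array(history)

ruido = np.random.randn(200) + 3.0   # sinal ruidoso em torno de 3 (exemplo hipotetico)
print("ultimo valor bruto :", ruido[-1])
print("ultima media (EMA) :", ema(ruido, alpha=0.9)[-1])  # bem mais proxima de 3
```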
0:52:49.390,0:52:53.440 -imagine in your head things in that buffer the V buffer that are very old at +imagine na sua cabeça coisas nesse buffer do buffer V que são muito antigas 0:52:53.440,0:52:57.760 -each step they get downloaded by alpha at every step and just as before alpha +cada passo eles são baixados pelo alpha em cada passo e assim como antes do alpha 0:52:57.760,0:53:01.359 -here is something between zero and one so we can't use values greater than one +aqui está algo entre zero e um, então não podemos usar valores maiores que um 0:53:01.359,0:53:04.280 -there so this will damp those all values until they no longer +lá, então isso irá amortecer todos os valores até que eles não mais 0:53:04.280,0:53:08.180 -the exponential moving average so this method keeps an exponential moving +a média móvel exponencial, então este método mantém um movimento exponencial 0:53:08.180,0:53:12.860 -average of the second moment I mean non-central second moment so we do not +média do segundo momento quero dizer segundo momento não central para que não 0:53:12.860,0:53:18.920 -subtract off the mean here the PyTorch implementation has a switch where you +subtrair a média aqui a implementação do PyTorch tem um interruptor onde você 0:53:18.920,0:53:22.370 -can tell it to subtract off the mean play with that if you like it'll +pode dizer para subtrair o jogo médio com isso, se você quiser, 0:53:22.370,0:53:25.460 -probably perform very similarly in practice there's a paper on that I'm +provavelmente terá um desempenho muito semelhante na prática, há um artigo sobre o qual estou 0:53:25.460,0:53:30.620 -sure but the original method does not subtract off the mean there and we use +com certeza, mas o método original não subtrai a média e usamos 0:53:30.620,0:53:35.000 -this second moment to normalize the gradient and we do this element-wise so +este segundo momento para normalizar o gradiente e fazemos isso por elemento para 0:53:35.000,0:53:39.560 -all this notation is element wise every element of the gradient is divided +toda essa notação é por elemento cada elemento do gradiente é dividido 0:53:39.560,0:53:43.310 -through by the square root of the second moment estimate and if you think that +pela raiz quadrada da estimativa do segundo momento e se você acha que 0:53:43.310,0:53:47.090 -this square root is really being the standard deviation even though this is +essa raiz quadrada está realmente sendo o desvio padrão, embora isso seja 0:53:47.090,0:53:50.990 -not a central moment so it's not actually the standard deviation it's +não é um momento central, então não é realmente o desvio padrão, é 0:53:50.990,0:53:55.580 -useful to think of it that way and the name you know root means square is kind +útil pensar dessa maneira e o nome que você sabe que raiz significa quadrado é tipo 0:53:55.580,0:54:03.590 -of alluding to that division by the root of the mean of the squares and the +de aludir a essa divisão pela raiz da média dos quadrados e o 0:54:03.590,0:54:07.820 -important technical detail here you have to add epsilon here for the annoying +detalhe técnico importante aqui você tem que adicionar epsilon aqui para o irritante 0:54:07.820,0:54:12.950 -problem that when you divide 0 by 0 everything breaks so you occasionally +problema que quando você divide 0 por 0 tudo quebra então você ocasionalmente 0:54:12.950,0:54:16.310 -have zeros in your network there are some situations where it makes a +tem zeros em sua rede existem algumas situações em que faz um 0:54:16.310,0:54:20.060 -difference outside 
of when your gradients zero but you absolutely do +diferença fora de quando seus gradientes zero, mas você absolutamente faz 0:54:20.060,0:54:25.310 -need that epsilon in your method and you'll see this is a recurring theme all +precisa desse epsilon em seu método e você verá que este é um tema recorrente 0:54:25.310,0:54:29.900 -of these no adaptive methods basically you've got to put an epsilon when your +desses métodos não adaptativos basicamente você tem que colocar um epsilon quando seu 0:54:29.900,0:54:34.040 -the divide something just to avoiding to avoid dividing by 0 and typically that +a dividir algo apenas para evitar evitar dividir por 0 e normalmente isso 0:54:34.040,0:54:38.690 -epsilon will be close to your machine Epsilon I don't know if so if you're +epsilon vai ficar perto da sua máquina Epsilon não sei se sim se você está 0:54:38.690,0:54:41.750 -familiar with that term but it's something like 10 to a negative 7 +familiarizado com esse termo, mas é algo como 10 a menos 7 0:54:41.750,0:54:45.710 -sometimes 10 to the negative 8 something of that order so really only has a small +às vezes 10 elevado a menos 8 algo dessa ordem então realmente só tem um pequeno 0:54:45.710,0:54:49.790 -effect on the value before I talk about why this method works I want to talk +efeito sobre o valor antes de falar sobre porque esse método funciona quero falar 0:54:49.790,0:54:53.150 -about the the most recent kind of innovation on top of this method and +sobre o tipo mais recente de inovação em cima deste método e 0:54:53.150,0:54:57.560 -that is the method that we actually use in practice so rmsprop is sometimes +esse é o método que realmente usamos na prática, então o rmsprop às vezes é 0:54:57.560,0:55:03.170 -still use but more often we use a method notice atom an atom means adaptive +ainda uso, mas mais frequentemente usamos um método note átomo um átomo significa adaptativo 0:55:03.170,0:55:10.790 -moment estimation so Adam is rmsprop with momentum so I spent 20 minutes +estimativa de momento, então Adam é rmsprop com impulso, então passei 20 minutos 0:55:10.790,0:55:13.760 -telling you I should use momentum so I'm going to say well you should put it on +dizendo que eu deveria usar o impulso, então vou dizer bem, você deve colocá-lo 0:55:13.760,0:55:18.420 -top of rmsprop as well there's always of doing that at least +no topo do rmsprop também há sempre de fazer isso pelo menos 0:55:18.420,0:55:21.569 -half a dozen in this papers for each of them but Adam is the one that caught on +meia dúzia nestes jornais para cada um deles, mas Adam é aquele que pegou 0:55:21.569,0:55:25.770 -and the way we do have a mention here is we actually convert the momentum update +e a maneira como temos uma menção aqui é que na verdade convertemos a atualização do momento 0:55:25.770,0:55:32.609 -to an exponential moving average as well now this may seem like a quantity +para uma média móvel exponencial também agora isso pode parecer uma quantidade 0:55:32.609,0:55:37.200 -qualitatively different update like doing momentum by moving average in fact +atualização qualitativamente diferente, como fazer o impulso movendo a média de fato 0:55:37.200,0:55:40.829 -what we were doing before is essentially equivalent to that you can work out some +o que estávamos fazendo antes é essencialmente equivalente a que você pode descobrir alguns 0:55:40.829,0:55:44.490 -constants where you can get a method where you use a moving exponential +constantes onde você pode obter um método onde você usa uma exponencial em movimento 
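Um esboço mínimo, em NumPy, das duas atualizações discutidas (RMSprop e Adam como RMSprop com momento em forma de média móvel). Os hiperparâmetros γ, α, β₁, β₂ e ε abaixo são apenas valores usuais de exemplo, e a correção de viés, comentada logo a seguir na aula, aparece nas linhas de m_hat/v_hat:

```python
import numpy as np

def rmsprop_step(w, grad, v, gamma=1e-3, alpha=0.99, eps=1e-8):
    """Uma atualizacao de RMSprop: EMA do gradiente ao quadrado (elemento a elemento)
    e normalizacao do passo pela raiz dessa estimativa de segundo momento."""
    v = alpha * v + (1 - alpha) * grad**2
    w = w - gamma * grad / (np.sqrt(v) + eps)
    return w, v

def adam_step(w, grad, m, v, t, gamma=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Uma atualizacao de Adam: momento como media movel exponencial (m)
    mais a normalizacao no estilo RMSprop (v), com correcao de vies."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)      # correcao de vies (t comeca em 1)
    v_hat = v / (1 - beta2**t)
    w = w - gamma * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# uso hipotetico em uma quadratica simples
A = np.diag([10.0, 1.0])
w = np.array([1.0, 1.0]); m = np.zeros(2); v = np.zeros(2)
for t in range(1, 501):
    grad = A @ w
    w, m, v = adam_step(w, grad, m, v, t, gamma=1e-2)
print("w apos Adam:", w)
```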
0:55:44.490,0:55:47.760 -moving average momentum that is equivalent to the regular mentum so +momento médio móvel que é equivalente ao mento regular, então 0:55:47.760,0:55:50.460 -don't think of this moving average momentum as being anything different +não pense nesse momento da média móvel como algo diferente 0:55:50.460,0:55:54.000 -than your previous momentum but it has a nice property that you don't need to +do que o seu impulso anterior, mas tem uma boa propriedade que você não precisa 0:55:54.000,0:55:57.660 -change the learning rate when you mess with the β here which I think it's a +alterar a taxa de aprendizado quando você mexer com o β aqui, o que eu acho que é um 0:55:57.660,0:56:03.780 -big improvement so yeah we added momentum of the gradient and just as +grande melhoria, então sim, adicionamos impulso do gradiente e, assim como 0:56:03.780,0:56:07.980 -before with rmsprop we have this exponential moving average of the +antes com rmsprop temos essa média móvel exponencial do 0:56:07.980,0:56:13.050 -squared gradient on top of that we basically just plug in this moving +gradiente quadrado em cima disso, basicamente apenas conectamos esse movimento 0:56:13.050,0:56:17.010 -average gradient where we had the gradient in the previous update so it's +gradiente médio onde tínhamos o gradiente na atualização anterior, então é 0:56:17.010,0:56:20.579 -not too complicated now if you actually read the atom paper you'll see a whole +não muito complicado agora, se você realmente ler o papel do átomo, verá um todo 0:56:20.579,0:56:23.880 -bunch of additional notation the algorithm is like ten lines long instead +monte de notação adicional, o algoritmo é como dez linhas 0:56:23.880,0:56:28.859 -of three and that is because they add something called bias correction this is +de três e isso é porque eles adicionam algo chamado correção de viés, isso é 0:56:28.859,0:56:34.260 -actually not necessary but it'll help a little bit so everybody uses it and all +na verdade não é necessário, mas vai ajudar um pouco, então todo mundo usa e tudo 0:56:34.260,0:56:39.780 -it does is it increases the value of these parameters during the early stages +faz é aumentar o valor desses parâmetros durante os estágios iniciais 0:56:39.780,0:56:43.319 -of optimization and the reason you do that is because you initialize this +de otimização e a razão pela qual você faz isso é porque você inicializa este 0:56:43.319,0:56:48.150 -momentum buffer at zero typically now imagine your initial initializer at zero +buffer de impulso em zero normalmente agora imagine seu inicializador inicial em zero 0:56:48.150,0:56:52.440 -then after the first step we're going to be adding to that a value of 1 minus +depois da primeira etapa, adicionaremos a isso um valor de 1 menos 0:56:52.440,0:56:56.700 -β times the gradient now 1 minus β will typically be 0.1 because we +β vezes o gradiente agora 1 menos β será tipicamente 0,1 porque nós 0:56:56.700,0:57:00.599 -typically use momentum point 9 so when we do that our gradient step is actually +normalmente usa o ponto de momento 9, então, quando fazemos isso, nosso passo de gradiente é na verdade 0:57:00.599,0:57:05.069 -using a learning rate 10 times smaller because this momentum buffer has a tenth +usando uma taxa de aprendizado 10 vezes menor porque esse buffer de momento tem um décimo 0:57:05.069,0:57:08.670 -of a gradient in it and that's undesirable so all the bias +de um gradiente nele e isso é indesejável, então todo o viés 0:57:08.670,0:57:13.890 -correction does is just multiply 
by 10 the step in those early iterations and +correção faz é apenas multiplicar por 10 o passo nessas iterações iniciais, e 0:57:13.890,0:57:18.420 -the bias correction formula is just basically the correct way to do that to +a fórmula de correção de viés é basicamente a maneira correta de fazer isso para 0:57:18.420,0:57:23.030 -result in a step that's unbiased and unbiased here means just the expectation +resultar em um passo não enviesado; e não enviesado aqui significa apenas que a esperança 0:57:23.030,0:57:28.420 -of the momentum buffer is the gradient so it's nothing too mysterious +do buffer de momento é o gradiente, então não é nada muito misterioso; 0:57:28.420,0:57:32.960 -yeah don't think of it as being like a huge addition although I do think that +sim, não pense nisso como uma grande adição, embora eu ache que 0:57:32.960,0:57:37.190 -the atom paper was the first one to use bicycle action in a mainstream +o artigo do Adam foi o primeiro a usar correção de viés em um método de 0:57:37.190,0:57:40.310 -optimization method I don't know if they invented it but it certainly pioneered +otimização mainstream; eu não sei se eles a inventaram, mas certamente foram pioneiros em 0:57:40.310,0:57:44.990 -the base correction so these methods work really well in practice let me just +usar a correção de viés; então, esses métodos funcionam muito bem na prática, deixe-me apenas 0:57:44.990,0:57:48.590 -give you a common empirical comparison here now this quadratic I'm using is a +dar-lhes uma comparação empírica aqui; agora, esta quadrática que estou usando é uma 0:57:48.590,0:57:52.220 -diagonal quadratic so it's a little bit shading to use a method that works well +quadrática diagonal, então é um pouco de trapaça usar um método que funciona bem 0:57:52.220,0:57:55.060 -on down or quadratics on and diagonal quadratic but I'm gonna do that anyway +em quadráticas diagonais em uma quadrática diagonal, mas vou fazer isso de qualquer maneira; 0:57:55.060,0:58:00.320 -and you can see that the direction they travel is quite an improvement over SGD +e você pode ver que a direção em que eles viajam é uma grande melhoria em relação ao SGD; 0:58:00.320,0:58:03.950 -so in this simplified problem sut kind of goes in the wrong direction at the +então, neste problema simplificado, o SGD meio que vai na direção errada no 0:58:03.950,0:58:08.780 -beginning where rmsprop basically heads in the right direction now the problem +começo, enquanto o RMSprop basicamente segue na direção certa; agora, o problema 0:58:08.780,0:58:15.140 -is rmsprop suffers from noise just as regular sut without noise suffers so you +é que o RMSprop sofre com o ruído assim como o SGD comum sofre, então você 0:58:15.140,0:58:19.490 -get this situation where kind of bounces around the optimum quite significantly +fica nesta situação em que ele meio que quica em torno do ótimo de forma bastante significativa; 0:58:19.490,0:58:24.710 -and just as with std with momentum when we add momentum to atom we get the same +e, assim como no SGD com momento, quando adicionamos momento no Adam, obtemos o mesmo 0:58:24.710,0:58:29.210 -kind of improvement where we kind of corkscrew or sometimes reverse corkscrew +tipo de melhoria, em que fazemos um movimento de saca-rolhas, às vezes um saca-rolhas invertido, 0:58:29.210,0:58:32.240 -around the solution that kind of thing and this gets you to the solution +em torno da solução, esse tipo de coisa, e isso leva você à solução 0:58:32.240,0:58:35.960 -quicker and it means that the last point you're currently at is a good estimate +mais rápido e significa que o
último ponto em que você está atualmente é uma boa estimativa 0:58:35.960,0:58:39.370 -of the solution not a noisy estimate but it's kind of the best estimate you have +da solução, não uma estimativa ruidosa, mas sim meio que a melhor estimativa que você tem; 0:58:39.370,0:58:45.350 -so I would generally recommend using a demova rmsprop and it's serving the case +então, eu geralmente recomendaria usar Adam em vez de RMSprop; e acontece de ser o caso 0:58:45.350,0:58:50.750 -that for some problems you just can't use SGD atom is necessary for training +de que, para alguns problemas, você simplesmente não pode usar SGD; o Adam é necessário para treinar 0:58:50.750,0:58:53.690 -some of the neural networks were using our language models or say our language +algumas das redes neurais que estamos usando, nossos modelos de linguagem; digamos, para nossos modelos de 0:58:53.690,0:58:57.290 -models it's necessary for training the network so I'm going to talk about near +linguagem, ele é necessário para treinar a rede, sobre a qual vou falar perto 0:58:57.290,0:59:03.580 -the end of this presentation and it's it's generally the if I have to +do final desta apresentação; e, em geral, se eu tiver que 0:59:07.490,0:59:10.670 -recommend something you should use you should try either s to D with momentum +recomendar algo que você deve usar, você deve tentar SGD com momento 0:59:10.670,0:59:14.690 -or atom as you'll go to methods for optimizing your networks so there's some +ou Adam como seus métodos padrão para otimizar suas redes; então, aí estão alguns 0:59:14.690,0:59:19.430 -practical advice for you personally I hate atom because I'm an optimization +conselhos práticos para vocês; pessoalmente, eu odeio o Adam porque sou um pesquisador de 0:59:19.430,0:59:24.920 -researcher and the theory and their paper is wrong this has been shown +otimização e a teoria no artigo deles está errada; isso foi demonstrado 0:59:24.920,0:59:29.360 -recently so the method in fact does not converge and you can show this on very +recentemente: o método de fato não converge, e você pode mostrar isso em 0:59:29.360,0:59:32.430 -simple test problems so one of the most heavily music +problemas de teste muito simples; então, um dos métodos mais 0:59:32.430,0:59:35.820 -use methods in modern machine learning actually doesn't work in a lot of +usados no aprendizado de máquina moderno na verdade não funciona em muitas 0:59:35.820,0:59:40.740 -situations this is unsatisfying and it's I'm kind of an ongoing research question +situações; isso é insatisfatório e é uma espécie de questão de pesquisa em aberto 0:59:40.740,0:59:44.670 -of the best way to fix this I don't think just modifying Adam a little bit +qual a melhor maneira de corrigir isso; não acho que apenas modificar o Adam um pouco 0:59:44.670,0:59:47.160 -to try and fix it is really the best solution I think it's got some more +para tentar consertá-lo seja realmente a melhor solução, acho que ele tem alguns problemas mais 0:59:47.160,0:59:52.620 -fundamental problems but I won't go into any detail for that there is a very +fundamentais, mas não vou entrar em detalhes sobre isso; há, porém, um problema muito 0:59:52.620,0:59:56.460 -practical problem they need to talk about though Adam is known to sometimes +prático sobre o qual precisamos falar: o Adam é conhecido por às vezes -0:59:56.460,1:00:01.140 give worse generalization error I think Yara's talked in detail about +0:59:56.460,1:00:01.140 dar um erro de generalização pior; acho que Yara falou em detalhes sobre -1:00:01.140,1:00:08.730 generalization error do I go
over that so yeah generalization error is the +0:00:01.140,0:00:08.730 +erro de generalização eu repasso isso então sim o erro de generalização é o -1:00:08.730,1:00:14.100 -error on data that you didn't train your model on basically so your networks are +0:00:08.730,0:00:14.100 +erro nos dados nos quais você não treinou seu modelo basicamente para que suas redes sejam -1:00:14.100,1:00:17.370 -very heavily parameter over parameterised and if you train them to +0:00:14.100,0:00:17.370 +muito fortemente parametrizado e se você treiná-los para -1:00:17.370,1:00:22.200 -give zero loss on the data you trained it on they won't give zero loss on other +0:00:17.370,0:00:22.200 +dão perda zero nos dados em que você treinou, eles não darão perda zero em outros -1:00:22.200,1:00:27.240 -data points data that it's never seen before and this generalization error is +0:00:22.200,0:00:27.240 +dados aponta dados que nunca foram vistos antes e esse erro de generalização é -1:00:27.240,1:00:32.310 -that error typically the best thing we can do is minimize the loss and the data +0:00:27.240,0:00:32.310 +esse erro normalmente a melhor coisa que podemos fazer é minimizar a perda e os dados -1:00:32.310,1:00:37.080 -we have but sometimes that's suboptimal and it turns out when you use Adam it's +0:00:32.310,0:00:37.080 +temos, mas às vezes isso não é o ideal e acontece que quando você usa Adam é -1:00:37.080,1:00:40.860 -quite common on particularly on image problems that you get worst +0:00:37.080,0:00:40.860 +bastante comum principalmente em problemas de imagem que você piora -1:00:40.860,1:00:46.140 -generalization error than when you use STD and people attribute this to a whole +0:00:40.860,0:00:46.140 +erro de generalização do que quando você usa STD e as pessoas atribuem isso a um todo -1:00:46.140,1:00:50.400 -bunch of different things it may be finding those bad local minima that I +0:00:46.140,0:00:50.400 +monte de coisas diferentes, pode estar encontrando aqueles mínimos locais ruins que eu -1:00:50.400,1:00:54.180 -mentioned earlier the ones that are smaller it's kind of unfortunate that +0:00:50.400,0:00:54.180 +mencionei anteriormente os que são menores, é uma pena que -1:00:54.180,1:00:57.840 -the better your optimization method the more likely it is to hit those small +0:00:54.180,0:00:57.840 +quanto melhor seu método de otimização, maior a probabilidade de atingir esses pequenos -1:00:57.840,1:01:02.460 -local minima because they're closer to where you currently are and kind of it's +0:00:57.840,0:01:02.460 +mínimos locais porque eles estão mais próximos de onde você está atualmente e é meio que -1:01:02.460,1:01:06.510 -the goal of an optimization method to find you the closest minima in a sense +0:01:02.460,0:01:06.510 +o objetivo de um método de otimização para encontrar o mínimo mais próximo em um sentido -1:01:06.510,1:01:10.620 -these local optimization methods we use but there's a whole bunch of other +0:01:06.510,0:01:10.620 +esses métodos de otimização local que usamos, mas há um monte de outros -1:01:10.620,1:01:16.950 -reasons that you can attribute to it less noise in Adam perhaps it could be +0:01:10.620,0:01:16.950 +razões que você pode atribuir a ele menos barulho em Adam talvez possa ser -1:01:16.950,1:01:20.100 -some structure maybe these methods where you rescale +0:01:16.950,0:01:20.100 +alguma estrutura talvez esses métodos onde você redimensiona -1:01:20.100,1:01:23.070 -space like this have this fundamental problem where they give worst +0:01:20.100,0:01:23.070 +espaço 
como este tem esse problema fundamental onde eles dão o pior -1:01:23.070,1:01:26.430 -generalization we don't really understand this but it's important to +0:01:23.070,0:01:26.430 +generalização, não entendemos isso, mas é importante -1:01:26.430,1:01:30.390 -know that this may be a problem or in some cases it's not to say that it will +0:01:26.430,0:01:30.390 +sei que isso pode ser um problema ou em alguns casos não quer dizer que vai -1:01:30.390,1:01:33.450 -give horrible performance you'll still get a pretty good neuron that workout at +0:01:30.390,0:01:33.450 +dar um desempenho horrível, você ainda terá um neurônio muito bom que treina em -1:01:33.450,1:01:37.200 -the end and what I can tell you is the language models that we trained at +0:01:33.450,0:01:37.200 +o final e o que posso dizer são os modelos de linguagem que treinamos -1:01:37.200,1:01:41.890 -Facebook use methods like atom or atom itself and they +0:01:37.200,0:01:41.890 +O Facebook usa métodos como o átomo ou o próprio átomo e eles -1:01:41.890,1:01:46.960 -much better results than if you use STD and there's a kind of a small thing that +0:01:41.890,0:01:46.960 +resultados muito melhores do que se você usar STD e há uma pequena coisa que -1:01:46.960,1:01:51.490 -won't affect you at all I would expect but with Adam you have to maintain these +0:01:46.960,0:01:51.490 +não vai afetá-lo, eu esperaria, mas com Adam você tem que manter esses -1:01:51.490,1:01:56.410 -three buffers where's sed you have two buffers of parameters this doesn't +0:01:51.490,0:01:56.410 +três buffers onde está sed você tem dois buffers de parâmetros isso não -1:01:56.410,1:01:59.230 -matter except when you're training a model that's like 12 gigabytes and then +0:01:56.410,0:01:59.230 +importa, exceto quando você está treinando um modelo de 12 gigabytes e depois -1:01:59.230,1:02:02.790 -it really becomes a problem I don't think you'll encounter that in practice +0:01:59.230,0:02:02.790 +realmente se torna um problema, acho que você não encontrará isso na prática -1:02:02.790,1:02:06.280 -and surely there's a little bit iffy so you gotta trim two parameters instead of +0:02:02.790,0:02:06.280 +e certamente há um pouco de dúvida, então você precisa cortar dois parâmetros em vez de -1:02:06.280,1:02:13.060 -one so yeah that's practical advice use Adam arrest you do but onto something +0:02:06.280,0:02:13.060 +um então sim, é um conselho prático, use Adam para prender você, mas em algo -1:02:13.060,1:02:18.220 -that is also sup is also kind of a core thing oh sorry have a question yes yes +0:02:13.060,0:02:18.220 +isso também é sup também é uma coisa central oh desculpe tenho uma pergunta sim sim -1:02:18.220,1:02:22.600 -you absolutely correct but typically I guess the question the question was +0:02:18.220,0:02:22.600 +você está absolutamente correto, mas normalmente eu acho que a pergunta foi -1:02:22.600,1:02:28.000 -weren't using a small epsilon in the denominator result in blow-up certainly +0:02:22.600,0:02:28.000 +não estavam usando um pequeno épsilon no denominador resultar em explosão certamente -1:02:28.000,1:02:32.440 -if the numerator was equal to roughly one than dividing through by ten to the +0:02:28.000,0:02:32.440 +se o numerador for igual a aproximadamente um do que dividir por dez para o -1:02:32.440,1:02:37.900 -negative seven could be catastrophic and this this is a legitimate question but +0:02:32.440,0:02:37.900 +sete negativo pode ser catastrófico e esta é uma pergunta legítima, mas -1:02:37.900,1:02:45.250 -typically in 
order for the V buffer to have very small values the gradient also +0:02:37.900,0:02:45.250 +normalmente para que o buffer V tenha valores muito pequenos, o gradiente também -1:02:45.250,1:02:48.340 -has to have had very small values you can see that from the way the +0:02:45.250,0:02:48.340 +deve ter tido valores muito pequenos, você pode ver isso pela forma como o -1:02:48.340,1:02:53.110 -exponential moving averages are updated so in fact it's not a practical problem +0:02:48.340,0:02:53.110 +as médias móveis exponenciais são atualizadas, então, na verdade, não é um problema prático -1:02:53.110,1:02:56.860 -when this when this V is incredibly small the momentum is also very small +0:02:53.110,0:02:56.860 +quando isso quando este V é incrivelmente pequeno o momento também é muito pequeno -1:02:56.860,1:03:01.180 -and when you're dividing small thing by a small thing you don't get blow-up oh +0:02:56.860,0:03:01.180 +e quando você está dividindo coisa pequena por coisa pequena você não explode oh -1:03:01.180,1:03:08.050 -yeah so the question is should I you buy an SUV and atom separately at the same +0:03:01.180,0:03:08.050 +sim, então a pergunta é: devo comprar um SUV e um átomo separadamente ao mesmo tempo -1:03:08.050,1:03:11.860 -time and just see which one works better in fact that is pretty much what we do +0:03:08.050,0:03:11.860 +tempo e apenas ver qual funciona melhor, na verdade, isso é praticamente o que fazemos -1:03:11.860,1:03:14.620 -because we have lots of computers we just have one computer runners you need +0:03:11.860,0:03:14.620 +porque temos muitos computadores, temos apenas um corredor de computador que você precisa -1:03:14.620,1:03:17.890 -one computer one atom and see which one works better although we kind of know +0:03:14.620,0:03:17.890 +um computador um átomo e ver qual funciona melhor, embora saibamos -1:03:17.890,1:03:21.730 -from most problems which one is the better choice for whatever problems +0:03:17.890,0:03:21.730 +da maioria dos problemas, qual é a melhor escolha para quaisquer problemas -1:03:21.730,1:03:24.460 -you're working with maybe you can try both it depends how long it's going to +0:03:21.730,0:03:24.460 +você está trabalhando, talvez você possa tentar os dois, depende de quanto tempo vai durar -1:03:24.460,1:03:27.940 -take to train I'm not sure exactly what you're gonna be doing in terms of +0:03:24.460,0:03:27.940 +levar para treinar eu não tenho certeza exatamente o que você vai fazer em termos de -1:03:27.940,1:03:31.150 -practice in this course yeah certainly legitimate way to do it +0:03:27.940,0:03:31.150 +pratique neste curso sim, certamente maneira legítima de fazê-lo -1:03:31.150,1:03:35.020 -in fact some people use SGD at the beginning and then switch to atom at the +0:03:31.150,0:03:35.020 +na verdade, algumas pessoas usam SGD no início e depois mudam para átomo no -1:03:35.020,1:03:39.430 -end that's certainly a good approach it just makes it more complicated and +0:03:35.020,0:03:39.430 +final que é certamente uma boa abordagem, apenas torna mais complicado e -1:03:39.430,1:03:44.740 -complexity should be avoided if possible yes this is one of those deep unanswered +0:03:39.430,0:03:44.740 +complexidade deve ser evitada se possível sim, este é um daqueles problemas profundos sem resposta -1:03:44.740,1:03:48.400 -questions so the question was should we 1s you deal with lots of different +0:03:44.740,0:03:48.400 +perguntas, então a pergunta era: devemos 1s você lidar com muitas -1:03:48.400,1:03:51.850 -initializations and 
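A small numerical illustration (our own construction, not from the lecture) of the answer to the epsilon question above: when the second-moment buffer is tiny, the momentum buffer is tiny as well, because both are moving averages of the same small gradients, so the ratio stays of order one instead of blowing up to 1/epsilon.

```python
import numpy as np

beta1, beta2, eps = 0.9, 0.999, 1e-7
m = v = 0.0
for t in range(1, 101):
    g = 1e-6                              # a stream of very small gradients
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g

m_hat = m / (1 - beta1 ** 100)            # bias-corrected buffers after 100 steps
v_hat = v / (1 - beta2 ** 100)
print(m_hat / (np.sqrt(v_hat) + eps))     # ~0.9: numerator and denominator shrink together
```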
see which one gets the best solution won't I help with the +0:03:48.400,0:03:51.850 +inicializações e ver qual deles obtém a melhor solução não vou ajudar com o -1:03:51.850,1:03:54.990 -bumpiness this is the case with small neural net +0:03:51.850,0:03:54.990 +irregularidade este é o caso da pequena rede neural -1:03:54.990,1:03:59.160 -that you will get different solutions depending on your initialization now +0:03:54.990,0:03:59.160 +que você obterá soluções diferentes dependendo da sua inicialização agora -1:03:59.160,1:04:02.369 -there's a remarkable property of the kind of large networks we use at the +0:03:59.160,0:04:02.369 +há uma propriedade notável do tipo de grandes redes que usamos no -1:04:02.369,1:04:07.349 -moment and the art networks as long as you use similar random initialization in +0:04:02.369,0:04:07.349 +momento e as redes de arte, desde que você use inicialização aleatória semelhante em -1:04:07.349,1:04:11.400 -terms of the variance of initialization you'll end up practically at a similar +0:04:07.349,0:04:11.400 +termos da variância de inicialização, você terminará praticamente em um -1:04:11.400,1:04:16.380 -quality solutions and this is not well understood so yeah it's it's quite +0:04:11.400,0:04:16.380 +soluções de qualidade e isso não é bem entendido, então sim, é bastante -1:04:16.380,1:04:19.319 -remarkable that your neural network can train for three hundred epochs and you +0:04:16.380,0:04:19.319 +notável que sua rede neural pode treinar por trezentas épocas e você -1:04:19.319,1:04:23.550 -end up with solution the test error is like almost exactly the same as what you +0:04:19.319,0:04:23.550 +acabar com a solução, o erro de teste é quase exatamente o mesmo que você -1:04:23.550,1:04:26.220 -got with some completely different initialization we don't understand this +0:04:23.550,0:04:26.220 +com uma inicialização completamente diferente, não entendemos isso -1:04:26.220,1:04:31.800 -so if you really need to eke out tiny performance gains you may be able to get +0:04:26.220,0:04:31.800 +então, se você realmente precisar obter pequenos ganhos de desempenho, poderá obter -1:04:31.800,1:04:36.150 -a little bit better Network by running multiple and picking the best and it +0:04:31.800,0:04:36.150 +uma rede um pouco melhor executando vários e escolhendo o melhor e -1:04:36.150,1:04:39.180 -seems the bigger your network and the harder your problem the less game you +0:04:36.150,0:04:39.180 +parece que quanto maior sua rede e mais difícil seu problema, menos jogo você -1:04:39.180,1:04:44.190 -get from doing that yes so the question was we have three buffers for each +0:04:39.180,0:04:44.190 +obter de fazer isso sim, então a questão era que temos três buffers para cada -1:04:44.190,1:04:49.470 -weight on the answer answer is yes so essentially yeah we basically in memory +0:04:44.190,0:04:49.470 +peso na resposta a resposta é sim, então essencialmente sim, basicamente na memória -1:04:49.470,1:04:53.160 -we have a copy of the same size as our weight data so our weight will be a +0:04:49.470,0:04:53.160 +temos uma cópia do mesmo tamanho que nossos dados de peso, então nosso peso será um -1:04:53.160,1:04:55.920 -whole bunch of tensors in memory we have a separate whole bunch of tensors that +0:04:53.160,0:04:55.920 +monte de tensores na memória, temos um monte separado de tensores que -1:04:55.920,1:05:01.849 -our momentum tensors and we have a whole bunch of other tensors that are the the +0:04:55.920,0:05:01.849 +nossos tensores de momento e temos um monte de 
outros tensores que são os -1:05:01.849,1:05:09.960 -second moment tensors so yeah so normalization layers so this is kind of +0:05:01.849,0:05:09.960 +tensores de segundo momento, então sim, camadas de normalização, então isso é meio que -1:05:09.960,1:05:14.369 -a clever idea why try and salt why try and come up with a better optimization +0:05:09.960,0:05:14.369 +uma ideia inteligente por que tentar e sal por que tentar criar uma otimização melhor -1:05:14.369,1:05:20.540 -algorithm where we can just come up with a better network and this is the idea so +0:05:14.369,0:05:20.540 +algoritmo onde podemos criar uma rede melhor e essa é a ideia, então -1:05:20.960,1:05:24.960 -modern neural networks typically we modify the network by adding additional +0:05:20.960,0:05:24.960 +as redes neurais modernas normalmente modificamos a rede adicionando -1:05:24.960,1:05:32.280 -layers in between existing layers and the goal of these layers to improve the +0:05:24.960,0:05:32.280 +camadas entre as camadas existentes e o objetivo dessas camadas para melhorar a -1:05:32.280,1:05:36.450 -optimization and generalization performance of the network and the way +0:05:32.280,0:05:36.450 +desempenho de otimização e generalização da rede e a maneira -1:05:36.450,1:05:39.059 -they do this can happen in a few different ways but let me give you an +0:05:36.450,0:05:39.059 +eles fazem isso pode acontecer de algumas maneiras diferentes, mas deixe-me dar-lhe uma -1:05:39.059,1:05:44.430 -example so we would typically take standard kind of combinations so as you +0:05:39.059,0:05:44.430 +por exemplo, normalmente usaríamos o tipo padrão de combinações para que você -1:05:44.430,1:05:48.930 -know in modern your networks we typically alternate linear operations +0:05:44.430,0:05:48.930 +saiba que em suas redes modernas normalmente alternamos operações lineares -1:05:48.930,1:05:52.319 -with nonlinear operations and here I call that activation functions we +0:05:48.930,0:05:52.319 +com operações não lineares e aqui chamo essas funções de ativação que -1:05:52.319,1:05:56.069 -alternate them linear nonlinear linear nonlinear what we could do is we can +0:05:52.319,0:05:56.069 +alterná-los linear não linear linear não linear o que poderíamos fazer é que podemos -1:05:56.069,1:06:01.819 -place these normalization layers either between the linear order non-linear or +0:05:56.069,0:06:01.819 +coloque essas camadas de normalização entre a ordem linear não linear ou -1:06:01.819,1:06:11.009 -before so there in this case we are using for instance this is the kind of +0:06:01.819,0:06:11.009 +antes, então, neste caso, estamos usando, por exemplo, este é o tipo de -1:06:11.009,1:06:14.369 -structure we have in real networks where we have a convolution recover that +0:06:11.009,0:06:14.369 +estrutura que temos em redes reais onde temos uma recuperação de convolução que -1:06:14.369,1:06:18.240 -convolutions or linear operations followed by batch normalization this is +0:06:14.369,0:06:18.240 +convoluções ou operações lineares seguidas de normalização em lote, isso é -1:06:18.240,1:06:20.789 -a type of normalization which I will detail in a minute +0:06:18.240,0:06:20.789 +um tipo de normalização que detalharei em um minuto -1:06:20.789,1:06:28.140 -followed by riilu which is currently the most popular activation function and we +0:06:20.789,0:06:28.140 +seguido por riilu que é atualmente a função de ativação mais popular e nós -1:06:28.140,1:06:31.230 -place this mobilization between these existing layers and what I want to make 
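To make the buffer-counting point above concrete, the short PyTorch check below (the layer size is arbitrary) counts the state tensors Adam keeps alongside the weights; SGD with momentum would keep roughly one extra buffer per parameter instead of two, which is why the difference only matters for very large models.

```python
import torch

model = torch.nn.Linear(1000, 1000)        # any model; the size here is arbitrary
n_params = sum(p.numel() for p in model.parameters())

opt = torch.optim.Adam(model.parameters())
model(torch.randn(8, 1000)).sum().backward()
opt.step()                                 # Adam allocates its state lazily on the first step

# exp_avg (momentum) and exp_avg_sq (second moment) are stored per parameter tensor.
n_state = sum(t.numel() for s in opt.state.values()
              for t in s.values() if torch.is_tensor(t))
print(n_params, n_state)                   # the optimizer state is roughly 2x the parameter count
```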
+0:06:28.140,0:06:31.230 +colocar essa mobilização entre essas camadas existentes e o que eu quero fazer -1:06:31.230,1:06:35.940 -clear is this normalization layers they affect the flow of data through so they +0:06:31.230,0:06:35.940 +claro é que essas camadas de normalização afetam o fluxo de dados, de modo que -1:06:35.940,1:06:39.150 -modify the data that's flowing through but they don't change the power of the +0:06:35.940,0:06:39.150 +modificam os dados que estão fluindo, mas eles não alteram o poder do -1:06:39.150,1:06:43.380 -network in the sense that that you can set up the weights in the network in +0:06:39.150,0:06:43.380 +rede no sentido de que você pode configurar os pesos na rede em -1:06:43.380,1:06:46.769 -some way that'll still give whatever output you had in an unknown alized +0:06:43.380,0:06:46.769 +alguma forma que ainda dará qualquer saída que você teve em um desconhecido -1:06:46.769,1:06:50.220 -network with a normalized network so normalization layers you're not making +0:06:46.769,0:06:50.220 +rede com uma rede normalizada, então camadas de normalização que você não está fazendo -1:06:50.220,1:06:53.670 -that work more powerful they improve it in other ways normally when we add +0:06:50.220,0:06:53.670 +que funcionam mais poderosos, eles melhoram de outras maneiras normalmente quando adicionamos -1:06:53.670,1:06:57.660 -things to a neural network the goal is to make it more powerful and yes this +0:06:53.670,0:06:57.660 +coisas para uma rede neural o objetivo é torná-lo mais poderoso e sim isso -1:06:57.660,1:07:01.740 -normalization layer can also be after the activation or before the linear or +0:06:57.660,0:07:01.740 +camada de normalização também pode ser após a ativação ou antes do linear ou -1:07:01.740,1:07:05.009 -you know because this wraps around we do this in order a lot of them are +0:07:01.740,0:07:05.009 +você sabe, porque isso envolve, fazemos isso para que muitos deles sejam -1:07:05.009,1:07:11.400 -equivalent but any questions here this is this bits yes yes so that's certainly +0:07:05.009,0:07:11.400 +equivalente, mas qualquer dúvida aqui são esses bits sim sim, então isso é certamente -1:07:11.400,1:07:16.140 -true but we kind of want that we want the real o2 sensor some of the data but +0:07:11.400,0:07:16.140 +verdade, mas meio que queremos que o sensor de o2 real alguns dos dados, mas -1:07:16.140,1:07:20.009 -not too much but it's also not quite accurate because normalization layers +0:07:16.140,0:07:20.009 +não muito, mas também não é muito preciso porque as camadas de normalização -1:07:20.009,1:07:24.989 -can also scale and ship the data and so it won't necessarily be that although +0:07:20.009,0:07:24.989 +também pode dimensionar e enviar os dados e, portanto, não será necessariamente que, embora -1:07:24.989,1:07:28.739 -it's certainly at initialization they do not do that scaling in ship so typically +0:07:24.989,0:07:28.739 +é certamente na inicialização que eles não fazem esse dimensionamento no navio tão tipicamente -1:07:28.739,1:07:32.460 -cut off half the data and in fact if you try to do a theoretical analysis of this +0:07:28.739,0:07:32.460 +cortar metade dos dados e, de fato, se você tentar fazer uma análise teórica disso -1:07:32.460,1:07:37.470 -it's very convenient that it cuts off half the data so the structure this +0:07:32.460,0:07:37.470 +é muito conveniente que corte metade dos dados para que a estrutura desta -1:07:37.470,1:07:42.239 -normalization layers they all pretty much do the same kind of operation and 
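The ordering just described, a linear operation followed by a normalization layer followed by the activation, can be written as a small PyTorch block like the one below; the channel count and input size are arbitrary, and `bias=False` on the convolution is the usual choice because the normalization's own shift makes a convolution bias redundant.

```python
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),  # linear operation
    nn.BatchNorm2d(64),                                        # normalization layer in between
    nn.ReLU(inplace=True),                                     # nonlinearity / activation
)

x = torch.randn(8, 64, 32, 32)   # (batch, channels, height, width)
print(block(x).shape)            # torch.Size([8, 64, 32, 32])
```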
+0:07:37.470,0:07:42.239 +camadas de normalização, todas elas fazem praticamente o mesmo tipo de operação e -1:07:42.239,1:07:47.640 -how many use kind of generic notation here so you should imagine that X is an +0:07:42.239,0:07:47.640 +quantos usam tipo de notação genérica aqui, então você deve imaginar que X é um -1:07:47.640,1:07:54.930 -input to the normalization layer and Y is an output and what you do is use do a +0:07:47.640,0:07:54.930 +entrada para a camada de normalização e Y é uma saída e o que você faz é usar fazer um -1:07:54.930,1:08:00.119 -whitening or normalization operation where you subtract off some estimate of +0:07:54.930,0:08:00.119 +operação de clareamento ou normalização onde você subtrai alguma estimativa de -1:08:00.119,1:08:05.190 -the mean of the data and you divide through by some estimate of the standard +0:08:00.119,0:08:05.190 +a média dos dados e você divide por alguma estimativa do padrão -1:08:05.190,1:08:10.259 -deviation and remember before that I mentioned we want to keep the +0:08:05.190,0:08:10.259 +desvio e lembre-se antes que eu mencionei que queremos manter o -1:08:10.259,1:08:12.630 -representational power of the network the same +0:08:10.259,0:08:12.630 +poder de representação da rede o mesmo -1:08:12.630,1:08:17.430 -what we do to ensure that is we multiply by an alpha and we add a sorry in height +0:08:12.630,0:08:17.430 +o que fazemos para garantir é que multiplicamos por um alfa e adicionamos uma desculpa na altura -1:08:17.430,1:08:22.050 -multiplied by an hey and we add a B and this is just so that the layer can still +0:08:17.430,0:08:22.050 +multiplicado por um hey e adicionamos um B e isso é apenas para que a camada ainda possa -1:08:22.050,1:08:27.120 -output values over any particular range or if we just always had every layer +0:08:22.050,0:08:27.120 +valores de saída em qualquer intervalo específico ou se sempre tivéssemos todas as camadas -1:08:27.120,1:08:30.840 -output in white and data the network couldn't output like a value million or +0:08:27.120,0:08:30.840 +saída em branco e dados que a rede não poderia produzir como um valor de milhão ou -1:08:30.840,1:08:35.370 -something like that it wouldn't it could only do that you know with very in very +0:08:30.840,0:08:35.370 +algo assim não seria, só poderia fazer isso você sabe com muito em muito -1:08:35.370,1:08:38.520 -rare cases because that would be very heavy on the tail of the normal +0:08:35.370,0:08:38.520 +casos raros porque isso seria muito pesado na cauda do normal -1:08:38.520,1:08:41.850 -distribution so this allows our layers to essentially output things that are +0:08:38.520,0:08:41.850 +distribuição, então isso permite que nossas camadas essencialmente produzam coisas que são -1:08:41.850,1:08:49.200 -the same range as before and yes so normalization layers have parameters and +0:08:41.850,0:08:49.200 +o mesmo intervalo de antes e sim, então as camadas de normalização têm parâmetros e -1:08:49.200,1:08:51.900 -in the network is a little bit more complicated in the sensor has more +0:08:49.200,0:08:51.900 +na rede é um pouco mais complicado no sensor tem mais -1:08:51.900,1:08:56.010 -parameters it's typically a very small number of parameters like rounding error +0:08:51.900,0:08:56.010 +parâmetros, normalmente é um número muito pequeno de parâmetros, como erro de arredondamento -1:08:56.010,1:09:04.290 -in your counts of network parameters typically and yeah so the complexity of +0:08:56.010,0:09:04.290 +em suas contagens de parâmetros de rede normalmente e 
sim, então a complexidade de -1:09:04.290,1:09:06.840 -this is on being kind of vague about how you compute the mean and standard +0:09:04.290,0:09:06.840 +isso é ser meio vago sobre como você calcula a média e o padrão -1:09:06.840,1:09:10.170 -deviation the reason I'm doing that is because all the methods compute in a +0:09:06.840,0:09:10.170 +desvio a razão pela qual estou fazendo isso é porque todos os métodos computam em um -1:09:10.170,1:09:18.210 -different way and I'll detail that in a second yes question weighs re lb oh it's +0:09:10.170,0:09:18.210 +maneira diferente e vou detalhar que em uma segunda pergunta sim pesa re lb oh é -1:09:18.210,1:09:24.630 -just a shift parameter so the data could have had a nonzero mean and we want it +0:09:18.210,0:09:24.630 +apenas um parâmetro de deslocamento para que os dados possam ter uma média diferente de zero e queremos -1:09:24.630,1:09:28.470 -delayed to be able to produce outputs with a nonzero mean so if we always just +0:09:24.630,0:09:28.470 +atrasado para poder produzir saídas com uma média diferente de zero, então, se sempre apenas -1:09:28.470,1:09:30.570 -subtract off the mean it couldn't do that +0:09:28.470,0:09:30.570 +subtrair a média que não poderia fazer isso -1:09:30.570,1:09:34.950 -so it just adds back representational power to the layer yes so the question +0:09:30.570,0:09:34.950 +então ele apenas adiciona de volta o poder de representação à camada sim, então a pergunta -1:09:34.950,1:09:40.110 -is don't these a and B parameters reverse the normalization and and in +0:09:34.950,0:09:40.110 +é que esses parâmetros a e B não invertem a normalização e e em -1:09:40.110,1:09:44.730 -fact that often is the case that they do something similar but they move at +0:09:40.110,0:09:44.730 +fato de que muitas vezes é o caso de eles fazerem algo semelhante, mas eles se movem -1:09:44.730,1:09:48.750 -different time scales so between the steps or between evaluations your +0:09:44.730,0:09:48.750 +diferentes escalas de tempo, portanto, entre as etapas ou entre as avaliações, seu -1:09:48.750,1:09:52.410 -network the mean and variance can can shift quite substantially based off the +0:09:48.750,0:09:52.410 +rede, a média e a variância podem mudar substancialmente com base na -1:09:52.410,1:09:55.320 -data you're feeding but these a and B parameters are quite stable they move +0:09:52.410,0:09:55.320 +dados que você está alimentando, mas esses parâmetros a e B são bastante estáveis, eles se movem -1:09:55.320,1:10:01.260 -slowly as you learn them so because they're most stable this has beneficial +0:09:55.320,0:10:01.260 +lentamente à medida que você os aprende, porque eles são mais estáveis, isso é benéfico -1:10:01.260,1:10:04.530 -properties and I'll describe those a little bit later but I want to talk +0:10:01.260,0:10:04.530 +propriedades e vou descrevê-las um pouco mais tarde, mas quero falar -1:10:04.530,1:10:08.610 -about is exactly how you normalize the data and this is where the crucial thing +0:10:04.530,0:10:08.610 +sobre é exatamente como você normaliza os dados e é aí que a coisa crucial -1:10:08.610,1:10:11.760 -so the earliest of these methods developed was batch norm and he is this +0:10:08.610,0:10:11.760 +então o primeiro desses métodos desenvolvidos foi a norma de lote e ele é este -1:10:11.760,1:10:16.429 -kind of a bizarre normalization that I I think is a horrible idea +0:10:11.760,0:10:16.429 +tipo de normalização bizarra que eu acho uma ideia horrível -1:10:16.429,1:10:22.460 -but unfortunately works 
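A minimal sketch of the generic normalization step described above, y = a * (x - mean) / (std + eps) + b; which dimensions the mean and standard deviation are estimated over is exactly what distinguishes the variants discussed next, so the choice of dimensions and the per-channel shapes of a and b below are just one common convention, not the only one.

```python
import torch

def normalize(x, a, b, eps=1e-5, dims=(-2, -1)):
    """Whiten x with estimated statistics, then rescale and shift so the layer
    keeps its ability to output values in any range."""
    mu = x.mean(dim=dims, keepdim=True)
    sigma = x.std(dim=dims, keepdim=True)
    return a * (x - mu) / (sigma + eps) + b

x = torch.randn(8, 64, 32, 32)
a = torch.ones(1, 64, 1, 1)      # learnable scale, one per channel
b = torch.zeros(1, 64, 1, 1)     # learnable shift, one per channel
print(normalize(x, a, b).shape)
```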
fantastically well so it normalizes across batches so +0:10:16.429,0:10:22.460 +mas infelizmente funciona fantasticamente bem, então normaliza em lotes, então -1:10:22.460,1:10:28.370 -we want information about a certain channel recall for a convolutional +0:10:22.460,0:10:28.370 +queremos informações sobre um certo recall de canal para um convolucional -1:10:28.370,1:10:32.000 -neural network which channel is one of these latent images that you have in +0:10:28.370,0:10:32.000 +rede neural qual canal é uma dessas imagens latentes que você tem em -1:10:32.000,1:10:34.610 -your network that part way through the network you have some data it doesn't +0:10:32.000,0:10:34.610 +sua rede que, no meio da rede, você tem alguns dados que não -1:10:34.610,1:10:37.070 -really look like an image if you actually look at it but it's it's shaped +0:10:34.610,0:10:37.070 +realmente se parece com uma imagem se você realmente olhar para ela, mas é em forma -1:10:37.070,1:10:41.000 -like an image anyway and that's a channel so we want to compute an average +0:10:37.070,0:10:41.000 +como uma imagem de qualquer maneira e isso é um canal, então queremos calcular uma média -1:10:41.000,1:10:47.239 -over this over this channel but we only have a small amount of data that's +0:10:41.000,0:10:47.239 +sobre isso neste canal, mas temos apenas uma pequena quantidade de dados que é -1:10:47.239,1:10:51.380 -what's in this channel basically height times width if it's a if it's an image +0:10:47.239,0:10:51.380 +o que há neste canal basicamente altura vezes largura se for uma imagem -1:10:51.380,1:10:56.000 -and it turns out that's not enough data to get good estimates of these mean and +0:10:51.380,0:10:56.000 +e acontece que não há dados suficientes para obter boas estimativas dessas médias e -1:10:56.000,1:10:58.969 -variance parameters so what batchman does is it takes a mean and variance +0:10:56.000,0:10:58.969 +parâmetros de variância, então o que o batchman faz é pegar uma média e variância -1:10:58.969,1:11:05.570 -estimate across all the instances in your mini-batch pretty straightforward +0:10:58.969,0:11:05.570 +estimativa em todas as instâncias em seu mini-lote bastante simples -1:11:05.570,1:11:09.890 -and that's what it divides blue by the reason why I don't like this is it is no +0:11:05.570,0:11:09.890 +e é isso que divide o azul pela razão pela qual eu não gosto disso é não -1:11:09.890,1:11:12.830 -longer actually stochastic gradient descent if you using batch normalization +0:11:09.890,0:11:12.830 +descida de gradiente realmente estocástica se você estiver usando a normalização em lote -1:11:12.830,1:11:19.429 -so it breaks all the theory that I work on for a living so I prefer some other +0:11:12.830,0:11:19.429 +por isso quebra toda a teoria em que trabalho para viver, então prefiro algum outro -1:11:19.429,1:11:24.409 -normalization strategies there in fact quite a soon after Bachelor and people +0:11:19.429,0:11:24.409 +estratégias de normalização, de fato, logo após o Bacharelado e as pessoas -1:11:24.409,1:11:27.409 -tried normalizing via every other possible combination of things you can +0:11:24.409,0:11:27.409 +tentou normalizar através de todas as outras combinações possíveis de coisas que você pode -1:11:27.409,1:11:31.699 -normalize by and it turns out the three that kind of work a layer instance and +0:11:27.409,0:11:31.699 +normalize por e acontece que os três que funcionam uma instância de camada e -1:11:31.699,1:11:37.370 -group norm and layer norm here in this diagram you averaged 
across all of the +0:11:31.699,0:11:37.370 +norma de grupo e norma de camada aqui neste diagrama você fez a média de todos os -1:11:37.370,1:11:43.820 -channels and across height and width now this doesn't work on all problems so I +0:11:37.370,0:11:43.820 +canais e em altura e largura agora isso não funciona em todos os problemas, então eu -1:11:43.820,1:11:47.000 -would only recommend it on a problem where you know it already works and +0:11:43.820,0:11:47.000 +só o recomendaria em um problema em que você sabe que ele já funciona e -1:11:47.000,1:11:49.940 -that's typically a problem where people already using it so look at what the +0:11:47.000,0:11:49.940 +esse é normalmente um problema em que as pessoas já o usam, então veja o que o -1:11:49.940,1:11:53.989 -network's people are using if that's a good idea or not will depend the +0:11:49.940,0:11:53.989 +pessoas da rede estão usando se isso é uma boa ideia ou não vai depender do -1:11:53.989,1:11:57.140 -instance normalization is something that's used a lot in modern language +0:11:53.989,0:11:57.140 +a normalização de instâncias é algo muito usado na linguagem moderna -1:11:57.140,1:12:03.380 -models and this you do not average across the batch anymore which is nice I +0:11:57.140,0:12:03.380 +modelos e isso você não faz mais a média do lote, o que é bom, eu -1:12:03.380,1:12:07.310 -won't we talk about that much depth I really the one I would rather you rather +0:12:03.380,0:12:07.310 +não vamos falar sobre tanta profundidade eu realmente o que eu preferiria que você preferisse -1:12:07.310,1:12:12.440 -you use in practice is group normalization so here we have which +0:12:07.310,0:12:12.440 +que você usa na prática é a normalização de grupo, então aqui temos qual -1:12:12.440,1:12:16.219 -across a group of channels and this group is trapped is chosen arbitrarily +0:12:12.440,0:12:16.219 +através de um grupo de canais e este grupo está preso é escolhido arbitrariamente -1:12:16.219,1:12:20.090 -and fixed at the beginning so typically we just group things numerically so +0:12:16.219,0:12:20.090 +e fixo no início, então normalmente nós apenas agrupamos as coisas numericamente para -1:12:20.090,1:12:23.580 -channel 0 to 10 would be a group channel you know 10 to +0:12:20.090,0:12:23.580 +canal 0 a 10 seria um canal de grupo que você conhece de 10 a 10 -1:12:23.580,1:12:31.110 -20 making sure you don't overlap of course disjoint groups of channels and +0:12:23.580,0:12:31.110 +20 certificando-se de não sobrepor, é claro, grupos disjuntos de canais e -1:12:31.110,1:12:34.560 -the size of these groups is a parameter that you need to tune although we always +0:12:31.110,0:12:34.560 +o tamanho desses grupos é um parâmetro que você precisa ajustar, embora sempre -1:12:34.560,1:12:39.150 -use 32 in practice you could tune that and you just do this because there's not +0:12:34.560,0:12:39.150 +usar 32 na prática você pode ajustar isso e você só faz isso porque não há -1:12:39.150,1:12:42.600 -enough information on a single channel and using all the channels is too much +0:12:39.150,0:12:42.600 +informações suficientes em um único canal e usar todos os canais é demais -1:12:42.600,1:12:46.170 -so you just use something in between it's it's really quite a simple idea and +0:12:42.600,0:12:46.170 +então você apenas usa algo no meio, é realmente uma ideia bastante simples e -1:12:46.170,1:12:50.790 -it turns out this group norm often works better than batch normal a lot of +0:12:46.170,0:12:50.790 +acontece que essa norma de grupo geralmente 
funciona melhor do que o lote normal -1:12:50.790,1:12:55.410 -problems and it does mean that my HUD theory that I work on is still balanced +0:12:50.790,0:12:55.410 +problemas e isso significa que minha teoria HUD na qual trabalho ainda está equilibrada -1:12:55.410,1:12:57.890 -so I like that so why does normalization help this is a +0:12:55.410,0:12:57.890 +então eu gosto disso, então por que a normalização ajuda isso é um -1:13:02.190,1:13:06.330 -matter of dispute so in fact in the last few years several papers have come out +0:13:02.190,0:13:06.330 +questão de disputa, de fato, nos últimos anos, vários artigos foram publicados -1:13:06.330,1:13:08.790 -on this topic unfortunately the papers did not agree +0:13:06.330,0:13:08.790 +neste tópico infelizmente os jornais não concordaram -1:13:08.790,1:13:13.590 -on why it works they all have completely separate explanations but there's some +0:13:08.790,0:13:13.590 +sobre por que funciona, todos eles têm explicações completamente separadas, mas há algumas -1:13:13.590,1:13:16.260 -things that are definitely going on so we can shape it we can say for sure +0:13:13.590,0:13:16.260 +coisas que definitivamente estão acontecendo para que possamos moldá-las, podemos dizer com certeza -1:13:16.260,1:13:24.120 -that the network appears to be easier to optimize so by that I mean you can use +0:13:16.260,0:13:24.120 +que a rede parece ser mais fácil de otimizar, então quero dizer que você pode usar -1:13:24.120,1:13:28.140 -large learning rates better in a better condition network you can use larger +0:13:24.120,0:13:28.140 +maiores taxas de aprendizado melhor em uma rede de melhor condição você pode usar maiores -1:13:28.140,1:13:31.590 -learning rates and therefore get faster convergence so that does seem to be the +0:13:28.140,0:13:31.590 +taxas de aprendizagem e, portanto, obter uma convergência mais rápida, de modo que parece ser o -1:13:31.590,1:13:35.030 -case when you uses normalization layers another factor which is a little bit +0:13:31.590,0:13:35.030 +caso quando você usa camadas de normalização outro fator que é um pouco -1:13:38.070,1:13:39.989 -disputed but I think is reasonably well-established +0:13:38.070,0:13:39.989 +contestado, mas acho que está razoavelmente bem estabelecido -1:13:39.989,1:13:44.489 -you get noise in the data passing through your network when you use +0:13:39.989,0:13:44.489 +você obtém ruído nos dados que passam pela sua rede quando usa -1:13:44.489,1:13:49.940 -normalization in vaginal and this noise comes from other instances in the bash +0:13:44.489,0:13:49.940 +normalização na vagina e esse barulho vem de outras instâncias no bash -1:13:49.940,1:13:53.969 -because it's random what I like instances are in your batch when you +0:13:49.940,0:13:53.969 +porque é aleatório o que eu gosto, as instâncias estão no seu lote quando você -1:13:53.969,1:13:57.239 -compute the mean using those other instances that mean is noisy and this +0:13:53.969,0:13:57.239 +calcule a média usando aquelas outras instâncias que a média é barulhenta e isso -1:13:57.239,1:14:01.469 -noise is then added or sorry subtracted from your weight so when you do the +0:13:57.239,0:14:01.469 +ruído é então adicionado ou subtraído do seu peso, então quando você faz o -1:14:01.469,1:14:06.050 -normalization operation so this noise is actually potentially helping +0:14:01.469,0:14:06.050 +operação de normalização, então esse ruído está realmente ajudando -1:14:06.050,1:14:11.790 -generalization performance in your network now there has been a lot 
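The variants mentioned above differ only in which axes of an (N, C, H, W) activation the statistics are taken over; a quick PyTorch sketch with 64 channels and a group size of 32 (hence two groups) as an example:

```python
import torch
import torch.nn as nn

x = torch.randn(8, 64, 32, 32)                      # (batch N, channels C, height H, width W)

bn = nn.BatchNorm2d(64)                             # stats over (N, H, W), separately per channel
ln = nn.GroupNorm(1, 64)                            # layer norm: one group spanning all channels
inorm = nn.InstanceNorm2d(64)                       # stats over (H, W), per channel and per instance
gn = nn.GroupNorm(num_groups=2, num_channels=64)    # stats over (H, W) within each 32-channel group

for layer in (bn, ln, inorm, gn):
    print(type(layer).__name__, layer(x).shape)     # shapes are unchanged; only the statistics differ
```

Note that all of these except batch norm compute their statistics from a single instance rather than across the mini-batch.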
of +0:14:06.050,0:14:11.790 +desempenho de generalização em sua rede agora tem havido muitos -1:14:11.790,1:14:15.180 -papers on injecting noise internet works to help generalization so it's not such +0:14:11.790,0:14:15.180 +artigos sobre injeção de ruído na internet funcionam para ajudar na generalização, então não é tão -1:14:15.180,1:14:20.370 -a crazy idea that this noise can be helping and in terms of a practical +0:14:15.180,0:14:20.370 +uma ideia maluca de que esse barulho pode estar ajudando e em termos práticos -1:14:20.370,1:14:24.030 -consideration this normalization makes the weight initialization that you use a +0:14:20.370,0:14:24.030 +consideração esta normalização torna a inicialização de peso que você usa um -1:14:24.030,1:14:28.260 -lot less important it used to be kind of a black art to select the initialization +0:14:24.030,0:14:28.260 +muito menos importante, costumava ser uma espécie de arte negra selecionar a inicialização -1:14:28.260,1:14:32.460 -your new your network and the people who really good motive is often it was just +0:14:28.260,0:14:32.460 +sua nova sua rede e as pessoas que realmente bom motivo é muitas vezes foi apenas -1:14:32.460,1:14:35.340 -because they're really good at changing their initialization and this is just +0:14:32.460,0:14:35.340 +porque eles são muito bons em alterar sua inicialização e isso é apenas -1:14:35.340,1:14:39.540 -less the case now when we use normalization layers and also gives the +0:14:35.340,0:14:39.540 +menos o caso agora quando usamos camadas de normalização e também fornece a -1:14:39.540,1:14:45.930 -benefit if you can kind of tile together layers with impunity so again it used to +0:14:39.540,0:14:45.930 +benefício se você puder agrupar camadas com impunidade, então, novamente, costumava -1:14:45.930,1:14:49.050 -be the situation that if you just plug together two possible ways in your +0:14:45.930,0:14:49.050 +ser a situação que se você apenas conectar duas maneiras possíveis em seu -1:14:49.050,1:14:52.740 -network it probably wouldn't work now that we use normalization layers it +0:14:49.050,0:14:52.740 +provavelmente não funcionaria agora que usamos camadas de normalização -1:14:52.740,1:14:57.900 -probably will work and even if it's a horrible idea and this has spurred a +0:14:52.740,0:14:57.900 +provavelmente funcionará e mesmo que seja uma ideia horrível e isso estimulou um -1:14:57.900,1:15:02.310 -whole field of automated architecture search where they just randomly calm +0:14:57.900,0:15:02.310 +todo o campo de pesquisa de arquitetura automatizada, onde eles simplesmente se acalmam aleatoriamente -1:15:02.310,1:15:05.940 -build together blocks and it's try thousands of them and see what works and +0:15:02.310,0:15:05.940 +construir blocos juntos e tentar milhares deles e ver o que funciona e -1:15:05.940,1:15:09.540 -that really wasn't possible before because that would typically result in a +0:15:05.940,0:15:09.540 +que realmente não era possível antes porque isso normalmente resultaria em um -1:15:09.540,1:15:14.010 -poorly conditioned Network you couldn't train and with normalization typically +0:15:09.540,0:15:14.010 +Rede mal condicionada que você não conseguiu treinar e com normalização normalmente -1:15:14.010,1:15:19.590 -you can train it some practical considerations so the the bachelor on +0:15:14.010,0:15:19.590 +você pode treinar algumas considerações práticas para que o bacharel em -1:15:19.590,1:15:23.310 -paper one of the reasons why it wasn't invented earlier is the kind of 
+0:15:19.590,0:15:23.310 +papel uma das razões pelas quais não foi inventado antes é o tipo de -1:15:23.310,1:15:27.480 -non-obvious thing that you have to back propagate through the calculation of the +0:15:23.310,0:15:27.480 +coisa não óbvia que você tem que voltar propagar através do cálculo do -1:15:27.480,1:15:32.160 -mean and standard deviation if you don't do this everything blows up now you +0:15:27.480,0:15:32.160 +média e desvio padrão se você não fizer isso tudo explode agora você -1:15:32.160,1:15:35.190 -might have to do this yourself as it'll be implemented in the implementation +0:15:32.160,0:15:35.190 +pode ter que fazer isso sozinho, pois será implementado na implementação -1:15:35.190,1:15:42.000 -that you use oh yes so I do not have the expertise to answer that I feel like +0:15:35.190,0:15:42.000 +que você usa oh sim, então eu não tenho experiência para responder que eu sinto -1:15:42.000,1:15:45.060 -it's kind of sometimes it's just a patent pet method like people like +0:15:42.000,0:15:45.060 +às vezes é apenas um método patenteado de animais de estimação, como as pessoas gostam -1:15:45.060,1:15:49.710 -layering in suits normally that field more and in fact a good norm if you it's +0:15:45.060,0:15:49.710 +camadas em ternos normalmente esse campo mais e de fato uma boa norma se você é -1:15:49.710,1:15:53.640 -just the group size covers both so I would be sure that you could probably +0:15:49.710,0:15:53.640 +apenas o tamanho do grupo cobre os dois, então eu teria certeza de que você provavelmente poderia -1:15:53.640,1:15:56.640 -get the same performance using group norm with a particular group size chosen +0:15:53.640,0:15:56.640 +obter o mesmo desempenho usando a norma do grupo com um determinado tamanho de grupo escolhido -1:15:56.640,1:16:00.980 -carefully yeah the choice of national does affect +0:15:56.640,0:16:00.980 +com cuidado sim, a escolha do nacional afeta -1:16:00.980,1:16:06.720 -parallelization so the implementation zinc in your computer library or your +0:16:00.980,0:16:06.720 +paralelização para que a implementação do zinco em sua biblioteca de computador ou seu -1:16:06.720,1:16:10.380 -CPU library are pretty efficient for each of these but it's complicated when +0:16:06.720,0:16:10.380 +A biblioteca da CPU é bastante eficiente para cada um deles, mas é complicado quando -1:16:10.380,1:16:14.820 -you are spreading your computation across machines and you kind of have to +0:16:10.380,0:16:14.820 +você está espalhando sua computação entre máquinas e você meio que tem que -1:16:14.820,1:16:18.630 -synchronize these these these things and batch norm is a bit of a pain there +0:16:14.820,0:16:18.630 +sincronizar essas essas coisas e a norma do lote é um pouco difícil -1:16:18.630,1:16:23.790 -because it would mean that you need to compute an average across all machines +0:16:18.630,0:16:23.790 +porque isso significaria que você precisa calcular uma média em todas as máquinas -1:16:23.790,1:16:27.540 -and aggregator whereas if you're using group norm every instance is on a +0:16:23.790,0:16:27.540 +e agregador, enquanto se você estiver usando a norma do grupo, todas as instâncias estarão em um -1:16:27.540,1:16:30.450 -different machine you can just completely compute the norm so in all +0:16:27.540,0:16:30.450 +máquina diferente, você pode calcular completamente a norma, então em todos -1:16:30.450,1:16:34.350 -those other three it's separate normalization for each instance it +0:16:30.450,0:16:34.350 +os outros três é normalização separada para cada 
instância -1:16:34.350,1:16:37.560 -doesn't depend on the other instances in the batch so it's nicer when you're +0:16:34.350,0:16:37.560 +não depende das outras instâncias no lote, então é melhor quando você está -1:16:37.560,1:16:40.570 -distributing it's when people use batch norm on a cluster +0:16:37.560,0:16:40.570 +distribuindo é quando as pessoas usam a norma de lote em um cluster -1:16:40.570,1:16:45.100 -they actually do not sync the statistics across which makes it even less like SGD +0:16:40.570,0:16:45.100 +eles realmente não sincronizam as estatísticas, o que o torna ainda menos parecido com o SGD -1:16:45.100,1:16:51.250 -and makes me even more annoyed so what was it already +0:16:45.100,0:16:51.250 +e me deixa ainda mais irritado então o que já era -1:16:51.250,1:16:57.610 -yes yeah Bachelor basically has a lot of momentum not in the optimization sense +0:16:51.250,0:16:57.610 +sim sim Bacharel basicamente tem muito impulso não no sentido de otimização -1:16:57.610,1:17:01.300 -but in the sense of people's minds so it's very heavily used for that reason +0:16:57.610,0:17:01.300 +mas no sentido da mente das pessoas, é muito usado por esse motivo -1:17:01.300,1:17:05.860 -but I would recommend group norm instead and there's kind of like a technical +0:17:01.300,0:17:05.860 +mas eu recomendaria a norma do grupo e há uma espécie de técnica -1:17:05.860,1:17:09.760 -data with batch norm you don't want to compute these mean and standard +0:17:05.860,0:17:09.760 +dados com norma de lote, você não deseja calcular esses valores médios e padrão -1:17:09.760,1:17:14.950 -deviations on batches during evaluation time by evaluation time I mean when you +0:17:09.760,0:17:14.950 +desvios em lotes durante o tempo de avaliação por tempo de avaliação quero dizer quando você -1:17:14.950,1:17:20.170 -actually run your network on the test data set or we use it in the real world +0:17:14.950,0:17:20.170 +realmente executa sua rede no conjunto de dados de teste ou nós a usamos no mundo real -1:17:20.170,1:17:24.370 -for some application it's typically in those situations you don't have batches +0:17:20.170,0:17:24.370 +para alguns aplicativos, normalmente é nessas situações que você não tem lotes -1:17:24.370,1:17:29.050 -any more batches or more for training things so you need some substitution in +0:17:24.370,0:17:29.050 +mais lotes ou mais para treinar coisas, então você precisa de alguma substituição em -1:17:29.050,1:17:33.100 -that case you can compute an exponential moving average as we talked about before +0:17:29.050,0:17:33.100 +nesse caso, você pode calcular uma média móvel exponencial como falamos antes -1:17:33.100,1:17:37.930 -and EMA of these mean and standard deviations you may think to yourself why +0:17:33.100,0:17:37.930 +e EMA desses desvios médios e padrão, você pode pensar por que -1:17:37.930,1:17:41.260 -don't we use an EMA in the implementation of batch norm the answer +0:17:37.930,0:17:41.260 +não usamos um EMA na implementação da norma de lote a resposta -1:17:41.260,1:17:44.860 -is because it doesn't work we it seems like a very reasonable idea though and +0:17:41.260,0:17:44.860 +é porque não funciona, mas parece uma ideia muito razoável e -1:17:44.860,1:17:48.880 -people have explored that and quite a lot of depth but it doesn't work oh yes +0:17:44.860,0:17:48.880 +as pessoas exploraram isso e bastante profundidade, mas não funciona oh sim -1:17:48.880,1:17:52.900 -this is quite crucial so yet people have tried normalizing things in neural +0:17:48.880,0:17:52.900 
+isso é bastante crucial, então as pessoas tentaram normalizar as coisas no sistema neural -1:17:52.900,1:17:55.480 -networks before a batch norm was invented but they always made the +0:17:52.900,0:17:55.480 +redes antes de uma norma de lote ser inventada, mas eles sempre fizeram o -1:17:55.480,1:17:59.380 -mistake of not back popping through the mean and standard deviation and the +0:17:55.480,0:17:59.380 +erro de não voltar a saltar pela média e desvio padrão e o -1:17:59.380,1:18:02.290 -reason why they didn't do that is because the math is really tricky and if +0:17:59.380,0:18:02.290 +razão pela qual eles não fizeram isso é porque a matemática é realmente complicada e se -1:18:02.290,1:18:05.650 -you try to implement it yourself it will probably be wrong now that we have pie +0:18:02.290,0:18:05.650 +você tenta implementá-lo sozinho, provavelmente estará errado agora que temos torta -1:18:05.650,1:18:09.460 -charts which which computes gradients correctly for you in all situations you +0:18:05.650,0:18:09.460 +gráficos que calculam gradientes corretamente para você em todas as situações que você -1:18:09.460,1:18:12.850 -could actually do this in practice and there are just a little bit but only a +0:18:09.460,0:18:12.850 +poderia realmente fazer isso na prática e há apenas um pouco, mas apenas um -1:18:12.850,1:18:16.780 -little bit because it's surprisingly difficult yeah so the question is is +0:18:12.850,0:18:16.780 +um pouco porque é surpreendentemente difícil, sim, então a questão é -1:18:16.780,1:18:21.070 -there a difference if we apply normalization before after than +0:18:16.780,0:18:21.070 +há uma diferença se aplicarmos a normalização antes depois do que -1:18:21.070,1:18:25.690 -non-linearity and the answer is there will be a small difference in the +0:18:21.070,0:18:25.690 +não linearidade e a resposta é que haverá uma pequena diferença na -1:18:25.690,1:18:28.930 -performance of your network now I can't tell you which one's better because it +0:18:25.690,0:18:28.930 +desempenho da sua rede agora não posso dizer qual é melhor porque -1:18:28.930,1:18:32.110 -appears in some situation one works a little bit better in other situations +0:18:28.930,0:18:32.110 +aparece em alguma situação funciona um pouco melhor em outras situações -1:18:32.110,1:18:35.350 -the other one works better what I can tell you is the way I draw it here is +0:18:32.110,0:18:35.350 +o outro funciona melhor o que posso te dizer é como eu desenho aqui é -1:18:35.350,1:18:39.100 -what's used in the PyTorch implementation of ResNet and most +0:18:35.350,0:18:39.100 +o que é usado na implementação PyTorch do ResNet e mais -1:18:39.100,1:18:43.330 -resonant implementations so just there's probably almost as good as you can get I +0:18:39.100,0:18:43.330 +implementações ressonantes, então provavelmente é quase tão bom quanto você pode obter I -1:18:43.330,1:18:49.270 -think that would use the other form if it was better and it's certainly problem +0:18:43.330,0:18:49.270 +acho que usaria a outra forma se fosse melhor e certamente é problema -1:18:49.270,1:18:51.460 -depended this is another one of those things where maybe the +0:18:49.270,0:18:51.460 +dependia isso é mais uma daquelas coisas onde talvez o -1:18:51.460,1:18:55.420 -no correct answer how you do it and it's just random which works better I don't +0:18:51.460,0:18:55.420 +nenhuma resposta correta como você faz isso e é apenas aleatório o que funciona melhor eu não -1:18:55.420,1:19:03.190 -know yes yeah any other questions on this before I 
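Two of the practical points above in one short PyTorch check (all sizes arbitrary): autograd differentiates through the computation of the batch mean and standard deviation for you, and in eval mode the layer switches to the exponential-moving-average statistics accumulated during training, so no batch is needed at test time.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(64, momentum=0.1)   # `momentum` here is the EMA factor for the running stats
x = torch.randn(8, 64, 32, 32, requires_grad=True)

bn.train()
y = bn(x)                               # training mode: normalize with the batch statistics
y.sum().backward()                      # gradients flow back through the mean/std computation
print(x.grad.shape)

bn.eval()
with torch.no_grad():
    y_eval = bn(x)                      # eval mode: use the stored running_mean / running_var (EMAs)
print(bn.running_mean[:3], bn.running_var[:3])
```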
move on to the so you need +0:18:55.420,0:19:03.190 +sei sim sim quaisquer outras perguntas sobre isso antes de passar para o que você precisa -1:19:03.190,1:19:06.850 -more data to get accurate estimates of the mean and standard deviation the +0:19:03.190,0:19:06.850 +mais dados para obter estimativas precisas da média e do desvio padrão -1:19:06.850,1:19:10.570 -question was why is it a good idea to compute it across multiple channels +0:19:06.850,0:19:10.570 +pergunta era por que é uma boa ideia calculá-lo em vários canais -1:19:10.570,1:19:13.450 -rather than a single channel and yes it is because you just have more data to +0:19:10.570,0:19:13.450 +em vez de um único canal e sim, é porque você só tem mais dados para -1:19:13.450,1:19:17.800 -make a better estimates but you want to be careful you don't have too much data +0:19:13.450,0:19:17.800 +fazer uma estimativa melhor, mas você quer ter cuidado para não ter muitos dados -1:19:17.800,1:19:21.130 -in that because then you don't get the noise and record that the noise is +0:19:17.800,0:19:21.130 +nisso porque então você não percebe o ruído e grava que o ruído é -1:19:21.130,1:19:25.300 -actually useful so basically the group size in group norm is just adjusting the +0:19:21.130,0:19:25.300 +realmente útil, então basicamente o tamanho do grupo na norma do grupo é apenas ajustar o -1:19:25.300,1:19:28.870 -amount of noise we have basically the question was how is this related to +0:19:25.300,0:19:28.870 +quantidade de ruído que temos basicamente a questão era como isso está relacionado -1:19:28.870,1:19:32.950 -group convolutions this was all pioneered before good convolutions were +0:19:28.870,0:19:32.950 +convoluções de grupo, tudo isso foi iniciado antes que boas convoluções fossem -1:19:32.950,1:19:38.260 -used it certainly has some interaction with group convolutions if you use them +0:19:32.950,0:19:38.260 +usado certamente tem alguma interação com convoluções de grupo se você usá-los -1:19:38.260,1:19:41.920 -and so you want to be a little bit careful there I don't know exactly what +0:19:38.260,0:19:41.920 +e então você quer ter um pouco de cuidado aí eu não sei exatamente o que -1:19:41.920,1:19:44.800 -the correct thing to do is in those cases but I can tell you they definitely +0:19:41.920,0:19:44.800 +a coisa correta a fazer é nesses casos, mas posso dizer que eles definitivamente -1:19:44.800,1:19:48.610 -use normalization in those situations probably Batchelor more more than group +0:19:44.800,0:19:48.610 +use a normalização nessas situações provavelmente mais de Batchelor do que de grupo -1:19:48.610,1:19:53.260 -norm because of the momentum I mentioned it's just more popular vaginal yes so +0:19:48.610,0:19:53.260 +normal por causa do impulso que mencionei, é apenas mais popular vaginal sim, então -1:19:53.260,1:19:56.890 -the question is do we ever use our Beck instances from the mini-batch in group +0:19:53.260,0:19:56.890 +a questão é se alguma vez usamos nossas instâncias Beck do mini-lote em grupo -1:19:56.890,1:20:00.310 -norm or is it always just a single instance we always just use a single +0:19:56.890,0:20:00.310 +norma ou é sempre apenas uma única instância, sempre usamos apenas uma única -1:20:00.310,1:20:04.450 -instance because there's so many benefits to that it's so much simpler in +0:20:00.310,0:20:04.450 +exemplo, porque há tantos benefícios que é muito mais simples em -1:20:04.450,1:20:08.469 -implementation and in theory to do that maybe you can get some improvement from +0:20:04.450,0:20:08.469 
+implementação e, em teoria, para fazer isso, talvez você possa obter alguma melhoria -1:20:08.469,1:20:11.530 -that in fact I bet you there's a paper that does that somewhere because they've +0:20:08.469,0:20:11.530 +que na verdade eu aposto que há um jornal que faz isso em algum lugar porque eles -1:20:11.530,1:20:15.190 -tried have any combination of this in practice I suspect if it worked well +0:20:11.530,0:20:15.190 +tentei ter alguma combinação disso na prática, suspeito se funcionou bem -1:20:15.190,1:20:19.450 -we'd probably be using it so probably probably doesn't work well under the the +0:20:15.190,0:20:19.450 +provavelmente estaríamos usando, então provavelmente não funciona bem sob o -1:20:19.450,1:20:24.370 -death of optimization I wanted to put something a little bit interesting +0:20:19.450,0:20:24.370 +morte da otimização eu queria colocar algo um pouco interessante -1:20:24.370,1:20:27.610 -because you've all been sitting through kind of a pretty dense lecture so this +0:20:24.370,0:20:27.610 +porque todos vocês assistiram a uma palestra bastante densa, então isso -1:20:27.610,1:20:31.870 -is something that I've kind of been working on a little bit I thought you +0:20:27.610,0:20:31.870 +é algo que eu tenho trabalhado um pouco, eu pensei que você -1:20:31.870,1:20:36.580 -might find interesting so you might have seen the the xkcd comic here that I've +0:20:31.870,0:20:36.580 +pode achar interessante, então você pode ter visto o quadrinho xkcd aqui que eu -1:20:36.580,1:20:42.790 -modified it's not always this way it's kind of point of what it makes so +0:20:36.580,0:20:42.790 +modificado, nem sempre é assim, é meio que o que faz tão -1:20:42.790,1:20:46.270 -sometimes we can just barge into a field we know nothing about it and improve on +0:20:42.790,0:20:46.270 +às vezes podemos simplesmente invadir um campo que não sabemos nada sobre isso e melhorar -1:20:46.270,1:20:50.469 -how they're currently doing it although you have to be a little bit careful so +0:20:46.270,0:20:50.469 +como eles estão fazendo isso, embora você tenha que ter um pouco de cuidado para -1:20:50.469,1:20:53.560 -the problem I want to talk about is one that young I think mentioned briefly in +0:20:50.469,0:20:53.560 +o problema sobre o qual quero falar é aquele jovem que acho que mencionei brevemente em -1:20:53.560,1:20:58.530 -the first lecture but I want to go into a bit of detail it's MRI reconstruction +0:20:53.560,0:20:58.530 +a primeira palestra, mas quero entrar em detalhes, é a reconstrução de ressonância magnética -1:20:58.530,1:21:04.639 -now in the MRI reconstruction problem we take a raw data from an MRI machine a +0:20:58.530,0:21:04.639 +agora, no problema de reconstrução de ressonância magnética, pegamos dados brutos de uma máquina de ressonância magnética -1:21:04.639,1:21:08.540 -medical imaging machine we take raw data from that machine and we reconstruct an +0:21:04.639,0:21:08.540 +máquina de imagem médica, pegamos dados brutos dessa máquina e reconstruímos um -1:21:08.540,1:21:12.530 -image and there's some pipeline an algorithm in the middle there that +0:21:08.540,0:21:12.530 +imagem e há algum pipeline de um algoritmo no meio que -1:21:12.530,1:21:17.900 -produces the image and the goal basically here is to replace 30 years of +0:21:12.530,0:21:17.900 +produz a imagem e o objetivo basicamente aqui é substituir 30 anos de -1:21:17.900,1:21:21.020 -research into what algorithm they should use their with with neural networks +0:21:17.900,0:21:21.020 +pesquise com qual algoritmo 
eles devem usar com redes neurais -1:21:21.020,1:21:27.949 -because that's that's what I'll get paid to do and I'll give you a bit of detail +0:21:21.020,0:21:27.949 +porque é para isso que serei pago para fazer e vou dar-lhe um pouco de detalhe -1:21:27.949,1:21:31.810 -so these MRI machines capture data in what's known as the Fourier domain I +0:21:27.949,0:21:31.810 +então essas máquinas de ressonância magnética capturam dados no que é conhecido como domínio de Fourier I -1:21:31.810,1:21:34.909 -know a lot of you have done signal processing some of you may have no idea +0:21:31.810,0:21:34.909 +sei que muitos de vocês fizeram processamento de sinal alguns de vocês podem não ter ideia -1:21:34.909,1:21:42.070 -what this is and you don't need to understand it for this problem oh yeah +0:21:34.909,0:21:42.070 +o que é isso e você não precisa entendê-lo para este problema oh sim -1:21:44.770,1:21:49.639 -yes so you may have seen the the further domain in one dimensional case +0:21:44.770,0:21:49.639 +sim, então você pode ter visto o domínio adicional em um caso dimensional -1:21:49.639,1:21:54.710 -so for neural networks sorry for MRI reconstruction we have two dimensional +0:21:49.639,0:21:54.710 +então, para redes neurais, desculpe pela reconstrução de ressonância magnética, temos duas dimensões -1:21:54.710,1:21:58.340 -Fourier domain the thing you need to know is it's a linear mapping to get +0:21:54.710,0:21:58.340 +Domínio de Fourier, o que você precisa saber é que é um mapeamento linear para obter -1:21:58.340,1:22:02.389 -from the fluid domain to image domain it's just linear and it's very efficient +0:21:58.340,0:22:02.389 +do domínio fluido para o domínio da imagem é apenas linear e muito eficiente -1:22:02.389,1:22:06.350 -to do that mapping it literally takes milliseconds no matter how big your +0:22:02.389,0:22:06.350 +para fazer esse mapeamento leva literalmente milissegundos, não importa o tamanho do seu -1:22:06.350,1:22:09.980 -images on modern computers so linear and easy to convert back and forth between +0:22:06.350,0:22:09.980 +imagens em computadores modernos tão lineares e fáceis de converter entre -1:22:09.980,1:22:15.619 -the two and the MRI machines actually capture either rows or columns of this +0:22:09.980,0:22:15.619 +os dois e as máquinas de ressonância magnética realmente capturam linhas ou colunas deste -1:22:15.619,1:22:20.540 -Fourier domain as samples they're called sample in the literature so each time +0:22:15.619,0:22:20.540 +Domínio de Fourier como amostras, eles são chamados de amostra na literatura, então cada vez -1:22:20.540,1:22:25.280 -the machine computes a sample which is every few milliseconds it gets a role +0:22:20.540,0:22:25.280 +a máquina calcula uma amostra que é a cada poucos milissegundos ela recebe um papel -1:22:25.280,1:22:28.940 -column of this image and this is actually technically a complex-valued +0:22:25.280,0:22:28.940 +coluna desta imagem e isso é, na verdade, tecnicamente um valor complexo -1:22:28.940,1:22:33.380 -image but this does not matter for my discussion of it so you can imagine it's +0:22:28.940,0:22:33.380 +imagem, mas isso não importa para minha discussão, então você pode imaginar que é -1:22:33.380,1:22:38.300 -just a two channel image if you imagine a real and imaginary channel just think +0:22:33.380,0:22:38.300 +apenas uma imagem de dois canais se você imaginar um canal real e imaginário apenas pense -1:22:38.300,1:22:42.830 -of them as color channels the problem we want to do we want to solve is 
+0:22:38.300,0:22:42.830 +deles como canais de cores, o problema que queremos resolver é -1:22:42.830,1:22:48.800 -accelerating MRI acceleration here is in the sense of faster so we want to run +0:22:42.830,0:22:48.800 +acelerar a aceleração de MRI aqui é no sentido de mais rápido, então queremos executar -1:22:48.800,1:22:53.830 -the machines quicker and produce identical quality images +0:22:48.800,0:22:53.830 +as máquinas mais rapidamente e produzem imagens de qualidade idêntica -1:22:55.400,1:23:00.050 -and one way we can do that in the most successful way so far is by just not +0:22:55.400,0:23:00.050 +e uma maneira de fazer isso da maneira mais bem-sucedida até agora é simplesmente não -1:23:00.050,1:23:05.540 -capturing all of the columns we just skip some randomly it's useful in +0:23:00.050,0:23:05.540 +capturando todas as colunas, apenas pulamos algumas aleatoriamente, é útil em -1:23:05.540,1:23:09.320 -practice to also capture some of the middle columns it turns out they contain +0:23:05.540,0:23:09.320 +prática para também capturar algumas das colunas do meio, elas contêm -1:23:09.320,1:23:14.150 -a lot of the information but outside the middle we just capture randomly and we +0:23:09.320,0:23:14.150 +muita informação, mas fora do meio nós apenas capturamos aleatoriamente e -1:23:14.150,1:23:16.699 -can't just use a nice linear operation anymore +0:23:14.150,0:23:16.699 +não pode mais usar uma boa operação linear -1:23:16.699,1:23:20.270 -that diagram on the right is the output of that linear operation I mentioned +0:23:16.699,0:23:20.270 +esse diagrama à direita é a saída dessa operação linear que mencionei -1:23:20.270,1:23:23.810 -applied to this data so it doesn't give useful Apple they only do something a +0:23:20.270,0:23:23.810 +aplicado a esses dados para que não seja útil à Apple, eles apenas fazem algo -1:23:23.810,1:23:27.100 -little bit more intelligent any questions on this before I move on +0:23:23.810,0:23:27.100 +um pouco mais inteligente qualquer pergunta sobre isso antes de seguir em frente -1:23:27.100,1:23:35.030 -it is frequency and phase dimensions so in this particular case I'm actually +0:23:27.100,0:23:35.030 +são dimensões de frequência e fase, então, neste caso em particular, estou -1:23:35.030,1:23:38.510 -sure this diagram one of the dimensions is frequency and one is phase and the +0:23:35.030,0:23:38.510 +certifique-se que neste diagrama uma das dimensões é frequência e uma é fase e o -1:23:38.510,1:23:44.390 -value is the magnitude of a sine wave with that frequency and phase so if you +0:23:38.510,0:23:44.390 +valor é a magnitude de uma onda senoidal com essa frequência e fase, então se você -1:23:44.390,1:23:48.980 -add together all the sine waves wave them with the frequency oh so with the +0:23:44.390,0:23:48.980 +some todas as ondas senoidais acene-as com a frequência oh então com o -1:23:48.980,1:23:54.620 -weight in this image you get the original image so it's it's a little bit +0:23:48.980,0:23:54.620 +peso nesta imagem você obtém a imagem original, então é um pouco -1:23:54.620,1:23:58.429 -more complicated because it's in two dimensions and the sine waves you gotta +0:23:54.620,0:23:58.429 +mais complicado porque é em duas dimensões e as ondas senoidais que você tem -1:23:58.429,1:24:02.030 -be little bit careful but it's basically just each pixel is the magnitude of a +0:23:58.429,0:24:02.030 +tenha um pouco de cuidado, mas é basicamente apenas cada pixel é a magnitude de um -1:24:02.030,1:24:06.230 -sine wave or if you want to 
compare to a 1d analogy +0:24:02.030,0:24:06.230 +onda senoidal ou se você quiser comparar com uma analogia 1d -1:24:06.230,1:24:11.960 -you'll just have frequencies so the pixel intensity is the strength of that +0:24:06.230,0:24:11.960 +você terá apenas frequências, então a intensidade do pixel é a força disso -1:24:11.960,1:24:16.580 -frequency if you have a musical note say a piano note with a C major as one of +0:24:11.960,0:24:16.580 +frequência se você tiver uma nota musical diga uma nota de piano com um C maior como uma das -1:24:16.580,1:24:19.340 -the frequencies that would be one pixel this image would be the C major +0:24:16.580,0:24:19.340 +as frequências que seriam de um pixel esta imagem seria o C maior -1:24:19.340,1:24:24.140 -frequency and another might be a minor or something like that and the magnitude +0:24:19.340,0:24:24.140 +frequência e outro pode ser menor ou algo assim e a magnitude -1:24:24.140,1:24:28.370 -of it is just how hard they press the key on the piano so you have frequency +0:24:24.140,0:24:28.370 +disso é o quão forte eles pressionam a tecla no piano para que você tenha frequência -1:24:28.370,1:24:34.370 -information yes so the video doesn't work there was one of the biggest +0:24:28.370,0:24:34.370 +informação sim para que o vídeo não funcione foi um dos maiores -1:24:34.370,1:24:38.750 -breakthroughs in in Threat achill mathematics for a long time was the +0:24:34.370,0:24:38.750 +avanços na matemática de Ameaças por muito tempo foi o -1:24:38.750,1:24:41.690 -invention of compressed sensing I'm sure some of you have heard of compressed +0:24:38.750,0:24:41.690 +invenção do sensoriamento comprimido Tenho certeza que alguns de vocês já ouviram falar -1:24:41.690,1:24:45.710 -sensing a hands of show of hands compressed sensing yeah some of you +0:24:41.690,0:24:45.710 +sentindo uma mão levantada de mãos comprimidas sentindo sim alguns de vocês -1:24:45.710,1:24:48.980 -especially work in the mathematical sciences would be aware of it +0:24:45.710,0:24:48.980 +especialmente o trabalho nas ciências matemáticas estaria ciente disso -1:24:48.980,1:24:53.330 -basically there's this phenomenal political paper that showed that we +0:24:48.980,0:24:53.330 +basicamente há este jornal político fenomenal que mostrou que nós -1:24:53.330,1:24:57.770 -could actually in theory get a perfect reconstruction from these subsampled +0:24:53.330,0:24:57.770 +poderia, em teoria, obter uma reconstrução perfeita a partir desses subamostrados -1:24:57.770,1:25:02.080 -measurements and we had some requirements for this to work the +0:24:57.770,0:25:02.080 +medições e tivemos alguns requisitos para que isso funcionasse -1:25:02.080,1:25:06.010 -requirements were that we needed to sample randomly +0:25:02.080,0:25:06.010 +requisitos eram que precisávamos amostrar aleatoriamente -1:25:06.010,1:25:10.150 -in fact it's a bit weaker you have to sample incoherently but in practice +0:25:06.010,0:25:10.150 +na verdade, é um pouco mais fraco, você precisa amostrar incoerentemente, mas na prática -1:25:10.150,1:25:14.710 -everybody samples randomly so it's essentially the same thing now here +0:25:10.150,0:25:14.710 +todo mundo faz amostras aleatoriamente, então é essencialmente a mesma coisa agora aqui -1:25:14.710,1:25:18.910 -we're randomly sampling columns but within the columns we do not randomly +0:25:14.710,0:25:18.910 +estamos amostrando colunas aleatoriamente, mas dentro das colunas não -1:25:18.910,1:25:22.330 -sample the reason being is it's not faster in the machine the 
machine can +0:25:18.910,0:25:22.330 +amostra a razão é que não é mais rápido na máquina que a máquina pode -1:25:22.330,1:25:25.930 -capture one column as quickly as you could capture half a column so we just +0:25:22.330,0:25:25.930 +capturar uma coluna tão rapidamente quanto você poderia capturar meia coluna, então nós apenas -1:25:25.930,1:25:29.350 -kind of capture a whole column so that makes it no longer random so that's one +0:25:25.930,0:25:29.350 +tipo de capturar uma coluna inteira para que não seja mais aleatório, então é um -1:25:29.350,1:25:33.760 -kind of problem with it the other problem is kind of the the assumptions +0:25:29.350,0:25:33.760 +tipo de problema com isso o outro problema é o tipo de suposições -1:25:33.760,1:25:36.850 -of this compressed sensing theory are violated by the kind of images we want +0:25:33.760,0:25:36.850 +desta teoria do sensoriamento comprimido são violados pelo tipo de imagens que queremos -1:25:36.850,1:25:41.020 -to reconstruct I show you on the right they're an example of compressed sensing +0:25:36.850,0:25:41.020 +para reconstruir eu mostro à direita eles são um exemplo de sensoriamento comprimido -1:25:41.020,1:25:44.560 -Theory reconstruction this was a big step forward from what they could do +0:25:41.020,0:25:44.560 +Reconstrução da teoria isso foi um grande passo à frente do que eles poderiam fazer -1:25:44.560,1:25:48.940 -before you would you'll get something that looks like this previously that was +0:25:44.560,0:25:48.940 +antes de você obterá algo parecido com isso anteriormente, que foi -1:25:48.940,1:25:53.020 -really considered the best in fact some people would when this result came out +0:25:48.940,0:25:53.020 +realmente considerado o melhor que algumas pessoas fariam quando esse resultado saiu -1:25:53.020,1:25:57.430 -swore though this was impossible it's actually not but you need some +0:25:53.020,0:25:57.430 +Jurei que isso era impossível, na verdade não é, mas você precisa de um pouco -1:25:57.430,1:26:00.550 -assumptions and these assumptions are pretty critical and I mention them there +0:25:57.430,0:26:00.550 +suposições e essas suposições são bastante críticas e eu as menciono lá -1:26:00.550,1:26:05.080 -so you need sparsity of the image now that mi a-- majors not sparse by sparse +0:26:00.550,0:26:05.080 +então você precisa de uma imagem esparsa agora que mi a -- majors não esparsos por esparsos -1:26:05.080,1:26:09.370 -I mean it has a lot of zero or black pixels it's clearly not sparse but it +0:26:05.080,0:26:09.370 +Quero dizer, tem muitos pixels zero ou pretos, claramente não é esparso, mas -1:26:09.370,1:26:13.660 -can be represented sparsely or approximately sparsely if you do a +0:26:09.370,0:26:13.660 +podem ser representados esparsamente ou aproximadamente esparsamente se você fizer um -1:26:13.660,1:26:18.160 -wavelet decomposition now I won't go to the details there's a little bit of +0:26:13.660,0:26:18.160 +decomposição wavelet agora não vou entrar em detalhes há um pouco de -1:26:18.160,1:26:20.920 -problem though it's only approximately sparse and when you do that wavelet +0:26:18.160,0:26:20.920 +problema, embora seja apenas aproximadamente esparso e quando você faz essa wavelet -1:26:20.920,1:26:24.489 -decomposition that's why this is not a perfect reconstruction if it was very +0:26:20.920,0:26:24.489 +decomposição é por isso que esta não é uma reconstrução perfeita se foi muito -1:26:24.489,1:26:28.060 -sparse in the wavelet domain and perfectly that would be in exactly the 
+0:26:24.489,0:26:28.060 +esparso no domínio wavelet e perfeitamente que seria exatamente no -1:26:28.060,1:26:33.160 -same as the left image and this compressed sensing is based off of the +0:26:28.060,0:26:33.160 +igual à imagem da esquerda e esta detecção compactada é baseada na -1:26:33.160,1:26:36.220 -field of optimization it kind of revitalize a lot of the techniques +0:26:33.160,0:26:36.220 +campo de otimização meio que revitaliza muitas das técnicas -1:26:36.220,1:26:39.550 -people have been using for a long time the way you get this reconstruction is +0:26:36.220,0:26:39.550 +que as pessoas usam há muito tempo, a maneira como você obtém essa reconstrução é -1:26:39.550,1:26:45.130 -you solve a little mini optimization problem at every step you every image +0:26:39.550,0:26:45.130 +você resolve um pequeno problema de mini otimização em cada etapa de cada imagem -1:26:45.130,1:26:47.830 -you want to reconstruct how many other machines so your machine has to solve an +0:26:45.130,0:26:47.830 +você deseja reconstruir quantas outras máquinas para que sua máquina tenha que resolver um -1:26:47.830,1:26:51.030 -optimization problem for every image every time it solves this little +0:26:47.830,0:26:51.030 +problema de otimização para cada imagem toda vez que resolve este pequeno -1:26:51.030,1:26:57.340 -quadratic problem with this kind of complicated regularization term so this +0:26:51.030,0:26:57.340 +problema quadrático com esse tipo de termo de regularização complicado, então isso -1:26:57.340,1:27:00.700 -is great for optimization or all these people who had been getting low paid +0:26:57.340,0:27:00.700 +é ótimo para otimização ou todas essas pessoas que estavam sendo mal pagas -1:27:00.700,1:27:03.780 -jobs at universities all of a sudden there of their research was trendy and +0:27:00.700,0:27:03.780 +empregos em universidades, de repente, suas pesquisas estavam na moda e -1:27:03.780,1:27:09.370 -corporations needed their help so this is great but we can do better so we +0:27:03.780,0:27:09.370 +corporações precisavam de sua ajuda, então isso é ótimo, mas podemos fazer melhor para que possamos -1:27:09.370,1:27:13.120 -instead of solving this minimization problem at every time step I will use a +0:27:09.370,0:27:13.120 +em vez de resolver esse problema de minimização a cada passo de tempo, usarei um -1:27:13.120,1:27:16.960 -neural network so obviously being here arbitrarily to represent the huge in +0:27:13.120,0:27:16.960 +rede neural tão obviamente estar aqui arbitrariamente para representar a enorme -1:27:16.960,1:27:24.190 -your network beef a big of course we we hope that we can learn in your network +0:27:16.960,0:27:24.190 +sua rede é muito importante, é claro, esperamos que possamos aprender em sua rede -1:27:24.190,1:27:28.000 -of such sufficient complexity that it can essentially solve the optimization +0:27:24.190,0:27:28.000 +de complexidade suficiente que pode essencialmente resolver a otimização -1:27:28.000,1:27:31.240 -problem in one step it just outputs a solution that's as good as the +0:27:28.000,0:27:31.240 +problema em uma etapa, ele apenas gera uma solução que é tão boa quanto a -1:27:31.240,1:27:35.200 -optimization problem solution now this would have been considered impossible 15 +0:27:31.240,0:27:35.200 +solução do problema de otimização agora isso seria considerado impossível 15 -1:27:35.200,1:27:39.820 -years ago now we know better so it's actually not very difficult in fact we +0:27:35.200,0:27:39.820 +anos atrás, agora sabemos melhor, então não é 
muito difícil, na verdade, nós -1:27:39.820,1:27:44.980 -can just take an example of we can solve a few of these a few I mean like a few +0:27:39.820,0:27:44.980 +pode apenas dar um exemplo de que podemos resolver alguns desses alguns, quero dizer, como alguns -1:27:44.980,1:27:48.520 -hundred thousand of these optimization problems take the solution and the input +0:27:44.980,0:27:48.520 +centenas de milhares desses problemas de otimização levam a solução e a entrada -1:27:48.520,1:27:53.620 -and we're gonna strain a neural network to map from input to solution that's +0:27:48.520,0:27:53.620 +e vamos forçar uma rede neural para mapear da entrada para a solução que é -1:27:53.620,1:27:56.830 -actually a little bit suboptimal because we get weakened in some cases we know a +0:27:53.620,0:27:56.830 +na verdade, um pouco abaixo do ideal porque ficamos enfraquecidos em alguns casos, conhecemos um -1:27:56.830,1:28:00.070 -better solution than the solution to the optimization problem we can gather that +0:27:56.830,0:28:00.070 +melhor solução do que a solução para o problema de otimização, podemos reunir que -1:28:00.070,1:28:04.780 -by measuring the patient and that's what we actually do in practice so we don't +0:28:00.070,0:28:04.780 +medindo o paciente e isso é o que realmente fazemos na prática para não -1:28:04.780,1:28:07.000 -try and solve the optimization problem we try and get to an even better +0:28:04.780,0:28:07.000 +tentar resolver o problema de otimização que tentamos e chegar a um ainda melhor -1:28:07.000,1:28:11.260 -solution and this works really well so I'll give you a very simple example of +0:28:07.000,0:28:11.260 +solução e isso funciona muito bem, então vou lhe dar um exemplo muito simples de -1:28:11.260,1:28:14.740 -this so this is what you can do much better than the compressed sensory +0:28:11.260,0:28:14.740 +isso então é isso que você pode fazer muito melhor do que o sensorial comprimido -1:28:14.740,1:28:18.580 -reconstruction using a neural network and this network involves the tricks +0:28:14.740,0:28:18.580 +reconstrução usando uma rede neural e esta rede envolve os truques -1:28:18.580,1:28:23.140 -I've mentioned so it's trained using Adam it uses group norm normalization +0:28:18.580,0:28:23.140 +Eu mencionei, então é treinado usando Adam, ele usa normalização de norma de grupo -1:28:23.140,1:28:28.690 -layers and convolutional neural networks as you've already been taught and it +0:28:23.140,0:28:28.690 +camadas e redes neurais convolucionais como você já aprendeu e -1:28:28.690,1:28:33.970 -uses a technique known as u nets which you may go over later in the course not +0:28:28.690,0:28:33.970 +usa uma técnica conhecida como u nets que você pode usar mais tarde no curso não -1:28:33.970,1:28:37.390 -sure about that but it's not a very complicated modification of only one it +0:28:33.970,0:28:37.390 +com certeza, mas não é uma modificação muito complicada de apenas um -1:28:37.390,1:28:40.660 -works as yeah this is the kind of thing you can do and this is this is very +0:28:37.390,0:28:40.660 +funciona como sim, este é o tipo de coisa que você pode fazer e isso é muito -1:28:40.660,1:28:44.880 -close to practical applications so you'll be seeing these accelerated MRI +0:28:40.660,0:28:44.880 +perto de aplicações práticas, então você verá essas imagens de ressonância magnética aceleradas -1:28:44.880,1:28:49.750 -scans happening in in clinical practice in only a few years tired this is not +0:28:44.880,0:28:49.750 +varreduras acontecendo na prática clínica em 
apenas alguns anos cansados, isso não é -1:28:49.750,1:28:53.980 -vaporware and yeah that's everything i wanted to talk about you talk about +0:28:49.750,0:28:53.980 +vaporware e sim, isso é tudo que eu queria falar sobre você falar -1:28:53.980,1:28:58.620 -today optimization and the death of optimization thank you +0:28:53.980,0:28:58.620 +hoje otimização e a morte da otimização obrigado \ No newline at end of file diff --git a/docs/pt/week05/practicum05.sbv b/docs/pt/week05/practicum05.sbv index 72ed0c5f4..e1143a002 100644 --- a/docs/pt/week05/practicum05.sbv +++ b/docs/pt/week05/practicum05.sbv @@ -1,1241 +1,1241 @@ 0:00:00.000,0:00:05.339 -last time we have seen that a matrix can be written basically let me draw here +última vez que vimos que uma matriz pode ser escrita basicamente deixe-me desenhar aqui 0:00:05.339,0:00:12.719 -the matrix so we had similar roles right and then we multiplied usually design by +a matriz, então tínhamos papéis semelhantes e, em seguida, multiplicamos normalmente design por 0:00:12.719,0:00:18.210 -one one column all right and so whenever we multiply these guys you can see these +uma coluna tudo bem e sempre que multiplicarmos esses caras você pode ver esses 0:00:18.210,0:00:23.340 -and as two types two different equivalent types of representation it +e como dois tipos dois tipos diferentes equivalentes de representação 0:00:23.340,0:00:28.980 -can you see right you don't is it legible okay so you can see basically as +você pode ver direito você não é legível ok então você pode ver basicamente como 0:00:28.980,0:00:35.430 -the output of this product has been a sequence of like the first row times +a saída deste produto foi uma sequência de vezes como a primeira linha 0:00:35.430,0:00:40.469 -this column vector and then again I'm just okay shrinking them this should be +este vetor de coluna e, novamente, estou bem encolhendo-os, isso deve ser 0:00:40.469,0:00:46.170 -the same size right right because otherwise you can't multiply them so you +o mesmo tamanho certo, porque senão você não pode multiplicá-los, então você 0:00:46.170,0:00:52.170 -have this one and so on right until the last one and this is gonna be my final +ter este e assim por diante até o último e este vai ser o meu final 0:00:52.170,0:01:00.960 -vector and we have seen that each of these bodies here what are these I talk +vetor e vimos que cada um desses corpos aqui o que são esses que falo 0:01:00.960,0:01:05.339 -to me please there's a scalar products right but what +para mim, por favor, há um produto escalar certo, mas o que 0:01:05.339,0:01:08.820 -do they represent what is it how can we call it what's another name for calling +eles representam o que é como podemos chamá-lo qual é outro nome para chamar 0:01:08.820,0:01:13.290 -a scalar product I show you last time a demonstration with some Chi government +um produto escalar eu mostro a você da última vez uma demonstração com algum governo Chi 0:01:13.290,0:01:18.119 -trigonometry right what is it so this is all the projection if you +trigonometria certo o que é isso então esta é toda a projeção se você 0:01:18.119,0:01:22.619 -talk about geometry or you can think about this as a nun normalized cosine +fale sobre geometria ou você pode pensar nisso como um cosseno normalizado freira 0:01:22.619,0:01:29.310 -value right so this one is going to be my projection basically of one kernel or +valor certo então este vai ser minha projeção basicamente de um kernel ou 0:01:29.310,0:01:36.030 -my input signal onto the kernel right so these are 
projections projection alright +meu sinal de entrada no kernel certo, então essas são projeções de projeção bem 0:01:36.030,0:01:40.619 -and so then there was also a another interpretation of this like there is +e então houve também uma outra interpretação disso como há 0:01:40.619,0:01:45.390 -another way of seeing this which was what basically we had the first column +outra maneira de ver isso que era basicamente o que tínhamos na primeira coluna 0:01:45.390,0:01:53.579 -of the matrix a multiplied by the first element of the X of these of this vector +da matriz a multiplicado pelo primeiro elemento do X destes deste vetor 0:01:53.579,0:01:58.260 -right so back element number one then you had a second call +certo, então de volta ao elemento número um, então você teve uma segunda chamada 0:01:58.260,0:02:04.020 -time's the second element of the X vector until you get to the last column +tempo é o segundo elemento do vetor X até chegar à última coluna 0:02:04.020,0:02:11.100 -right times the last an element right suppose that this is long N and this is +certo vezes o último um elemento certo suponha que isso é longo N e isso é 0:02:11.100,0:02:16.110 -M times n right so the height again is going to be the dimension towards we +M vezes n certo, então a altura novamente será a dimensão em direção a nós 0:02:16.110,0:02:19.550 -should - and the width of a matrix is dimension where we're coming from +deveria - e a largura de uma matriz é a dimensão de onde estamos vindo 0:02:19.550,0:02:24.810 -second part was the following so we said instead of using this matrix here +a segunda parte foi a seguinte, então dissemos em vez de usar esta matriz aqui 0:02:24.810,0:02:29.450 -instead since we are doing convolutions because we'd like to exploit sparsity a +em vez disso, já que estamos fazendo convoluções porque gostaríamos de explorar a dispersão 0:02:29.450,0:02:35.400 -stationarity and compositionality of the data we still use the same matrix here +estacionaridade e composicionalidade dos dados ainda usamos a mesma matriz aqui 0:02:35.400,0:02:41.370 -perhaps right we use the same guy here but then those kernels we are going to +talvez certo nós usamos o mesmo cara aqui, mas então esses kernels nós vamos 0:02:41.370,0:02:45.510 -be using them over and over again the same current across the whole signal +estar usando-os repetidamente a mesma corrente em todo o sinal 0:02:45.510,0:02:51.360 -right so in this case the width of this matrix is no longer be it's no longer n +certo, então, neste caso, a largura desta matriz não é mais n 0:02:51.360,0:02:56.820 -as it was here is going to be K which is gonna be the kernel size right so here +como estava aqui vai ser K que vai ser o tamanho do kernel certo então aqui 0:02:56.820,0:03:03.090 -I'm gonna be drawing my thinner matrix and this one is gonna be K lowercase K +Eu vou desenhar minha matriz mais fina e esta vai ser K minúsculo K 0:03:03.090,0:03:10.140 -and the height maybe we can still call it n okay all right so let's say here I +e a altura talvez ainda possamos chamar de n ok tudo bem então vamos dizer aqui eu 0:03:10.140,0:03:18.230 -have several kernels for example let me have my tsiyon carnal then I may have my +tenho vários kernels, por exemplo, deixe-me ter meu tsiyon carnal, então eu posso ter meu 0:03:18.230,0:03:25.080 -other non green let me change let's put pink so you have this one and +outro não verde deixa eu trocar vamos colocar rosa pra você ter esse e 0:03:25.080,0:03:33.180 -then you may have green one right and so on so how do we 
use these kernels right +então você pode ter um verde certo e assim por diante, então como usamos esses kernels certo 0:03:33.180,0:03:38.280 -now so we basically can use these kernels by stacking them and shifted +agora, basicamente, podemos usar esses kernels empilhando-os e deslocando 0:03:38.280,0:03:43.650 -them a little bit right so we get the first kernel out of here and then you're +um pouco certo, então tiramos o primeiro kernel daqui e então você está 0:03:43.650,0:03:50.519 -gonna get basically you get the first guy here then you shift it shift it +vai pegar basicamente você pega o primeiro cara aqui então você muda 0:03:50.519,0:03:58.290 -shift it and so on right until you get the whole matrix and we were putting a 0 +desloque-o e assim por diante até obter toda a matriz e estávamos colocando um 0 0:03:58.290,0:04:02.100 -here and a 0 here right this is just recap and then you have this one for the +aqui e um 0 aqui certo isso é apenas recapitulação e então você tem este para o 0:04:02.100,0:04:11.379 -blue color now you do magic here and just do copy copy and I you do paste +cor azul agora você faz mágica aqui e apenas copia copia e eu cola 0:04:11.379,0:04:19.370 -and now you can also do color see fantastic magic and we have pink one and +e agora você também pode fazer cores ver mágica fantástica e temos um rosa e 0:04:19.370,0:04:25.360 -then you have the last one right can I do the same copy yes I can do fantastic +então você tem o último certo posso fazer a mesma cópia sim eu posso fazer fantástico 0:04:25.360,0:04:29.080 -so you cannot do copy and paste on the paper +então você não pode copiar e colar no papel 0:04:29.080,0:04:38.419 -all right color and the last one light green okay all right so we just +tudo certo cor e o último verde claro tudo bem então nós apenas 0:04:38.419,0:04:44.479 -duplicate how many matrices do we have now how many layers no don't count the +duplique quantas matrizes temos agora quantas camadas não conte o 0:04:44.479,0:04:50.600 -number like there are letters on the on the screen and K or M what is it what is +número como se houvesse letras na tela e K ou M o que é o que é 0:04:50.600,0:05:00.620 -K the side usually you're just guessing you shouldn't be guessing you should +K do lado geralmente você está apenas supondo que não deveria estar supondo que deveria 0:05:00.620,0:05:07.120 -tell me the correct answer I think about this as a job interview I'm training you +me diga a resposta correta eu penso nisso como uma entrevista de emprego estou treinando você 0:05:07.120,0:05:14.990 -so how many maps we have and right so this one here are as many as my M which +então quantos mapas temos e certo então este aqui são tantos quanto o meu M que 0:05:14.990,0:05:21.470 -is the number of rows of this initial thing over here right all right so what +é o número de linhas dessa coisa inicial aqui, tudo bem, então o que 0:05:21.470,0:05:30.289 -is instead the width of this little kernel here okay right okay what is the +é em vez disso a largura deste pequeno kernel aqui ok certo ok qual é o 0:05:30.289,0:05:41.349 -height of this matrix what is the height of the matrix +altura desta matriz qual é a altura da matriz 0:05:42.340,0:05:45.480 -you sure try again +você com certeza tente novamente 0:05:49.220,0:06:04.310 -I can't hear and minus k plus one okay and the final what is the output of this +não consigo ouvir e menos k mais um ok e o final qual é a saída disso 0:06:04.310,0:06:08.660 -thing right so the output is going to be one vector which is 
gonna be of height +coisa certa, então a saída será um vetor que será de altura 0:06:08.660,0:06:19.430 -the same right and minus k plus 1 and then it should be correct yeah but then +a mesma direita e menos k mais 1 e então deve estar correto sim, mas então 0:06:19.430,0:06:27.890 -how many what is the thickness of this final vector M right so this stuff here +quantos qual é a espessura desse vetor final M certo então essas coisas aqui 0:06:27.890,0:06:35.600 -and goes as thick as M right so this is where we left last time right but then +e vai tão grosso quanto M à direita, então é aqui que saímos da última vez, mas depois 0:06:35.600,0:06:39.770 -someone asked me now then I realized so we have here as many as the different +alguém me perguntou agora então percebi que temos aqui tantos quantos os diferentes 0:06:39.770,0:06:45.170 -colors right so for example in this case if I just draw to make sure we +cores certas, por exemplo, neste caso, se eu apenas desenhar para ter certeza de que 0:06:45.170,0:06:49.730 -understand what's going on you have the first thing here now you have the second +entenda o que está acontecendo você tem a primeira coisa aqui agora você tem a segunda 0:06:49.730,0:06:55.600 -one here and I have the third one right in this case all right so last time they +um aqui e eu tenho o terceiro certo neste caso tudo bem então da última vez eles 0:06:59.750,0:07:03.650 -asked me if someone asked me at the end of the class so how do we do convolution +me perguntou se alguém me perguntou no final da aula, então como fazemos convolução 0:07:03.650,0:07:09.760 -when we end up in this situation over here because here we assume that my +quando acabamos nessa situação aqui porque aqui assumimos que meu 0:07:09.760,0:07:14.990 -corners are just you know whatever K long let's say three long but then they +cantos são apenas você sabe o que quer que K longo, digamos três longos, mas então eles 0:07:14.990,0:07:21.380 -are just one little vector right and so somebody told me no then what do you do +são apenas um pequeno vetor certo e então alguém me disse não, então o que você faz 0:07:21.380,0:07:24.950 -from here like how do we keep going because now we have a thickness before +daqui como vamos continuar porque agora temos uma espessura antes 0:07:24.950,0:07:32.510 -we started with a something here this vector which had just n elements right +começamos com algo aqui este vetor que tinha apenas n elementos certo 0:07:32.510,0:07:35.690 -are you following so far I'm going faster because we already seen these +você está acompanhando até agora estou indo mais rápido porque já vimos esses 0:07:35.690,0:07:44.030 -things I'm just reviewing but are you with me until now yes no yes okay +coisas que estou apenas revisando mas você está comigo até agora sim não sim ok 0:07:44.030,0:07:47.720 -fantastic so let's see how we actually keep going so the thing is +fantástico, então vamos ver como realmente continuamos, então a coisa é 0:07:47.720,0:07:51.680 -show you right now is actually assuming that we start with that long vector +mostrar a você agora é assumir que começamos com esse vetor longo 0:07:51.680,0:08:01.400 -which was of height what was the height and right but in this case also this one +qual era de altura qual era a altura e certo mas neste caso também este 0:08:01.400,0:08:13.060 -means that we have something that looks like this and so you have basically here +significa que temos algo parecido com isso e você tem basicamente aqui 0:08:13.060,0:08:20.720 -this is 1 this is also 
1 so we only have a monophonic signal for example and this +isso é 1 isso também é 1 então só temos um sinal monofônico por exemplo e isso 0:08:20.720,0:08:26.300 -was n the height right all right so let's assume now we're using a +estava na altura certo, então vamos supor que agora estamos usando um 0:08:26.300,0:08:33.950 -stereophonic system so what is gonna be my domain here so you know my X can be +sistema estereofônico então qual vai ser o meu domínio aqui para que você saiba que meu X pode ser 0:08:33.950,0:08:39.740 -thought as a function that goes from the domain to the ℝ^{number of channels} so +pensado como uma função que vai do domínio ao ℝ^{número de canais}, então 0:08:39.740,0:08:47.840 -what is this guy here yeah x is one dimension and somewhere so what is this +o que é esse cara aqui sim x é uma dimensão e em algum lugar então o que é isso 0:08:47.840,0:08:59.930 -Ω we have seen this slide last slide of Tuesday lesson right second Ω is +Ω vimos este slide no último slide da lição de terça-feira à direita, segundo Ω é 0:08:59.930,0:09:11.720 -not set of real numbers no someone else tries we are using computers it's time +não é um conjunto de números reais ninguém tenta estamos usando computadores é hora 0:09:11.720,0:09:16.520 -line yes and how many samples you you have one sample number sample number two +linha sim e quantas amostras você tem um número de amostra número de amostra dois 0:09:16.520,0:09:21.710 -or sample number three so you have basically a subset of the natural space +ou amostra número três para que você tenha basicamente um subconjunto do espaço natural 0:09:21.710,0:09:30.860 -right so this one is going to be something like 0 1 2 so on set which is +certo, então este vai ser algo como 0 1 2, então no set, que é 0:09:30.860,0:09:36.410 -gonna be subset of ℕ right so it's not ℝ. ℝ is gonna be if you have time +vai ser subconjunto de ℕ certo, então não é ℝ. 
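A small sketch of the x: Ω → ℝ^c framing discussed above, for the stereo case (c = 2); the 64-sample length is an assumed, illustrative value:

```python
import torch

# x: Ω -> ℝ^c for a sampled stereo signal. The domain Ω is a set of sample
# indices (a subset of ℕ, e.g. {0, 1, ..., 63}), not the continuous line ℝ,
# and each index maps to c = 2 values (left and right channel).
x = torch.rand(2, 64)     # shape: (channels, samples); 64 samples is an assumed length
print(x[:, 0])            # the ℝ² value the signal takes at sample 0 ∈ Ω
```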
ℝ vai ser se você tiver tempo 0:09:36.410,0:09:45.850 -continuous domain what you see in this case the in the case I just showed you +domínio contínuo o que você vê neste caso no caso que acabei de mostrar 0:09:45.850,0:09:55.160 -so far what is seen in this case now number of input channels because this is +até agora o que é visto neste caso agora número de canais de entrada porque isso é 0:09:55.160,0:10:00.740 -going to be my X right this is my input so in this case we show so far in this +vai ser meu X certo esta é minha entrada então neste caso nós mostramos até agora neste 0:10:00.740,0:10:07.220 -case here we were just using one so it means we have a monophonic audio let's +caso aqui estávamos usando apenas um, então significa que temos um áudio monofônico, vamos 0:10:07.220,0:10:10.880 -seven now the assumption make the assumption that this guy is that it's +sete agora a suposição faz a suposição de que esse cara é que é 0:10:10.880,0:10:22.780 -gonna be two such that you're gonna be talking about stereo phonic signal right +serão dois de tal forma que você estará falando sobre sinal fonético estéreo certo 0:10:23.200,0:10:27.380 -okay so let's see how this stuff changes so +ok, então vamos ver como isso muda, então 0:10:27.380,0:10:38.450 -in this case my let me think yeah so how do I draw I'm gonna just draw right +neste caso, deixe-me pensar sim, então como faço para desenhar, vou desenhar certo 0:10:38.450,0:10:43.400 -little complain if you don't follow are you following so far yes because if +reclame pouco se você não segue você está seguindo até agora sim porque se 0:10:43.400,0:10:46.550 -i watch my tablet I don't see you right so you should be complaining if +eu assisto meu tablet eu não vejo você direito então você deveria estar reclamando se 0:10:46.550,0:10:50.750 -something doesn't make sense right otherwise becomes boring from waiting +algo não faz sentido direito senão fica chato de esperar 0:10:50.750,0:10:56.390 -and watching you all the time right yes no yes okay I'm boring okay +e assistindo você o tempo todo certo sim não sim ok eu sou chato ok 0:10:56.390,0:11:00.080 -thank you all right so we have here this signal +obrigado tudo bem então temos aqui esse sinal 0:11:00.080,0:11:07.280 -right and then now we have some thickness in this case what is the +certo e agora temos alguma espessura, neste caso, qual é o 0:11:07.280,0:11:14.660 -thickness of this guy see right so in this case this one is going to be C and +espessura desse cara veja bem então neste caso este vai ser C e 0:11:14.660,0:11:18.589 -in the case of the stereophonic signal you're gonna just have two channels left +no caso do sinal estereofônico, você terá apenas dois canais restantes 0:11:18.589,0:11:30.170 -and right and this one keeps going down right all right so our kernels if I'd +e certo e este continua indo para baixo tudo bem, então nossos kernels se eu tivesse 0:11:30.170,0:11:35.030 -like to perform a convolution over this signal right so you have different same +gostaria de realizar uma convolução sobre este sinal certo para que você tenha o mesmo 0:11:35.030,0:11:44.150 -pussy right and so on right if I'd like to perform a convolution one big +buceta certo e assim por diante direito se eu gostaria de realizar uma convolução um grande 0:11:44.150,0:11:47.089 -convolution I'm not talking about two deconvolution right because they are +convolução não estou falando de duas deconvoluções né porque são 0:11:47.089,0:11:52.670 -still using domain which is here number one right so this is 
actually important +ainda usando o domínio que é o número um, então isso é realmente importante 0:11:52.670,0:11:58.510 -so if I ask you what type of signal this is you're gonna be basically +então se eu perguntar que tipo de sinal é esse, você vai ser basicamente 0:11:58.510,0:12:02.890 -you have to look at this number over here right so we are talking about one +você tem que olhar para este número aqui, então estamos falando de um 0:12:02.890,0:12:12.490 -dimensional signal which is one dimensional domain right 1d domain okay +sinal dimensional que é um domínio dimensional direito 1d domínio ok 0:12:12.490,0:12:17.710 -so we are still using a 1d signal but in this case it has you know you have two +então ainda estamos usando um sinal 1d, mas neste caso você sabe que tem dois 0:12:17.710,0:12:25.750 -values per point so what kind of kernels are we gonna be using so I'm gonna just +valores por ponto, então que tipo de kernels vamos usar, então vou apenas 0:12:25.750,0:12:31.450 -draw it in this case we're gonna be using something similar like this so I'm +desenhá-lo neste caso, vamos usar algo parecido com isso, então estou 0:12:31.450,0:12:37.990 -gonna be drawing this guy let's say I have K here which is gonna be my width +vai desenhar esse cara vamos dizer que eu tenho K aqui que vai ser a minha largura 0:12:37.990,0:12:42.700 -of the kernel but in this case I'm gonna be also have some thickness in this case +do kernel, mas neste caso eu também vou ter alguma espessura neste caso 0:12:42.700,0:12:56.230 -here right so basically you apply this thing here okay and then you can go +aqui, então basicamente você aplica essa coisa aqui ok e então você pode ir 0:12:56.230,0:13:04.060 -second line and third line and so on right so you may still have like here m +segunda linha e terceira linha e assim por diante, então você ainda pode ter como aqui m 0:13:04.060,0:13:11.590 -kernels but in this case you also have some thickness which has to match the +grãos, mas neste caso você também tem alguma espessura que deve corresponder ao 0:13:11.590,0:13:17.680 -other thickness right so this thickness here has to match the thickness of the +outra espessura certa, então essa espessura aqui tem que corresponder à espessura do 0:13:17.680,0:13:23.980 -input size so let me show you how to apply the convolution so you're gonna +tamanho de entrada, então deixe-me mostrar como aplicar a convolução para que você 0:13:23.980,0:13:37.980 -get one of these slices here and then you're gonna be applying this over here +pegue uma dessas fatias aqui e então você vai aplicar isso aqui 0:13:39.320,0:13:46.190 -okay and then you simply go down this way +ok e então você simplesmente desce por este caminho 0:13:46.190,0:13:53.870 -alright so whenever you apply these you perform this guy here the inner product +tudo bem, então sempre que você aplica isso, você executa esse cara aqui o produto interno 0:13:53.870,0:14:04.410 -with these over here what you get it's actually a one by one is a scalar so +com estes aqui o que você obtém é na verdade um por um é um escalar, então 0:14:04.410,0:14:09.540 -whenever I use this orange thingy here on the left hand side and I do a dot +sempre que eu uso essa coisa laranja aqui do lado esquerdo e faço um ponto 0:14:09.540,0:14:14.190 -product scalar product with this one I just get a scalar so this is actually my +produto escalar produto com este eu apenas recebo um escalar então este é realmente o meu 0:14:14.190,0:14:19.620 -convolution in 1d the convolution in 1d means that it goes 
down this way and +convolução em 1d a convolução em 1d significa que desce por aqui e 0:14:19.620,0:14:27.480 -only in one way that's why it's called 1d but we multiply each element of this +apenas de uma maneira, é por isso que é chamado de 1d, mas multiplicamos cada elemento disso 0:14:27.480,0:14:36.290 -mask times this guy here now a second row and this guy here okay +mascara vezes esse cara aqui agora uma segunda fila e esse cara aqui ok 0:14:36.290,0:14:41.090 -you saw you multiply all of them you sum all of them and then you get your first +você viu você multiplicar todos eles você soma todos eles e então você obtém seu primeiro 0:14:41.090,0:14:47.250 -output here okay so whenever I make this multiplication I get my first output +saída aqui ok então sempre que eu fizer essa multiplicação eu recebo minha primeira saída 0:14:47.250,0:14:52.050 -here then I keep sliding this kernel down and then you're gonna get the +aqui então eu continuo deslizando este kernel para baixo e então você vai obter o 0:14:52.050,0:14:58.380 -second output third out fourth and so on until you go down at the end then what +segunda saída, terceira saída, quarta e assim por diante até você descer no final, então o que 0:14:58.380,0:15:03.780 -happens then happens that I'm gonna be picking up different kernel I'm gonna +acontece então acontece que eu vou pegar um kernel diferente eu vou 0:15:03.780,0:15:07.950 -back it let's say I get the third one okay let's get the second one I get a +de volta vamos dizer que eu pego o terceiro ok vamos pegar o segundo eu pego um 0:15:07.950,0:15:19.050 -second one and I perform the same operation you're gonna get here this one +segundo e eu faço a mesma operação você vai chegar aqui 0:15:19.050,0:15:23.240 -actually let's actually make it like a matrix +na verdade, vamos torná-lo como uma matriz 0:15:26.940,0:15:33.790 -you go down okay until you go with the last one which is gonna be the end right +você desce bem até ir com o último que vai ser o final certo 0:15:33.790,0:15:45.450 -the empty kernel which is gonna be going down this way you get the last one here +o kernel vazio que vai descer desta forma você obtém o último aqui 0:15:51.680,0:15:58.790 -okay yes no confusing clearing so this was the question I got at the end of the +ok sim não esclarecimento confuso então esta foi a pergunta que recebi no final do 0:15:58.790,0:16:10.339 -class yeah Suzy yeah because it's a dot product of all those values between so +class sim Suzy sim porque é um produto escalar de todos esses valores entre então 0:16:10.339,0:16:18.259 -basically do the projection of this part of the signal onto this kernel so you'd +basicamente fazer a projeção desta parte do sinal neste kernel para que você 0:16:18.259,0:16:22.879 -like to see what is the contribution like what is the alignment of this part +gostaria de ver qual é a contribuição como qual é o alinhamento desta parte 0:16:22.879,0:16:27.350 -of the signal on to this specific subspace okay this is how a convolution +do sinal para este subespaço específico ok é assim que uma convolução 0:16:27.350,0:16:31.850 -works when you have multiple channels so far I'll show you just with single +funciona quando você tem vários canais até agora vou mostrar apenas com single 0:16:31.850,0:16:35.319 -channel now we have multiple channels okay so oh yeah yeah in one second one +canal agora temos vários canais ok então oh sim sim em um segundo 0:16:54.259,0:16:59.509 -and one one at the top one at the bottom so you actually lose the first row here +e 
um na parte superior e na parte inferior, então você realmente perde a primeira linha aqui 0:16:59.509,0:17:04.850 -and you lose the last row here so at the end in this case the output is going to +e você perde a última linha aqui, então no final, neste caso, a saída será 0:17:04.850,0:17:10.490 -be n minus three plus one so you lose two one on top okay in this case you +seja n menos três mais um então você perde dois um em cima ok neste caso você 0:17:10.490,0:17:15.140 -lose two at the bottom if you actually do a Center at the center the +perder dois na parte inferior se você realmente fizer um Centro no centro do 0:17:15.140,0:17:20.390 -convolution usually you lose one at the beginning one at the end every time you +convolução geralmente você perde um no início um no final toda vez que você 0:17:20.390,0:17:24.409 -perform a convolution you lose the number of the dimension of the kernel +execute uma convolução você perde o número da dimensão do kernel 0:17:24.409,0:17:28.789 -minus one you can try if you put your hand like this you have a kernel of +menos um você pode tentar se você colocar sua mão assim você tem um kernel de 0:17:28.789,0:17:34.340 -three you get the first one here and it is matching then you switch one and then +três você pega o primeiro aqui e está combinando, então você troca um e depois 0:17:34.340,0:17:39.440 -you switch to right so okay with fight let's tell a parent of two right so you +você muda para a direita, então tudo bem com a luta, vamos dizer a um pai de dois filhos, então você 0:17:39.440,0:17:44.149 -have your signal of five you have your kernel with two you have one two three +tem seu sinal de cinco você tem seu kernel com dois você tem um dois três 0:17:44.149,0:17:49.070 -and four so we started with five and you end up with four because you use a +e quatro, então começamos com cinco e você termina com quatro porque usa um 0:17:49.070,0:17:54.500 -kernel size of two if you use a kernel size of three you get one two and three +tamanho do kernel de dois se você usar um tamanho de kernel de três você obtém um dois e três 0:17:54.500,0:17:57.289 -so you goes to if you use a kernel size of three okay +então você vai se você usar um tamanho de kernel de três ok 0:17:57.289,0:18:01.010 -so you can always try to do this alright so I'm gonna show you now the +então você sempre pode tentar fazer isso bem, então eu vou te mostrar agora o 0:18:01.010,0:18:07.040 -dimensions of these kernels and the outputs with PyTorch okay Yes No +dimensões desses kernels e as saídas com PyTorch ok Sim Não 0:18:07.040,0:18:18.500 -all right good okay mister can you see anything +tudo bem, tudo bem, senhor, você pode ver alguma coisa 0:18:18.500,0:18:25.520 -yes right I mean zoom a little bit more okay so now we can go we do +sim certo quero dizer zoom um pouco mais ok então agora podemos ir nós fazemos 0:18:25.520,0:18:33.770 -conda activate pDL, pytorch Deep Learning. +conda ativar pDL, pytorch Deep Learning. 
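The interactive session below types these commands one at a time; collected here into a single runnable sketch, with the shapes matching the numbers used in the practicum (a stereo signal of 64 samples and 16 kernels of width 3):

```python
import torch
from torch import nn

# One example (batch size 1) of a stereo signal: 2 channels, 64 samples.
x = torch.rand(1, 2, 64)

# 16 kernels, each of thickness 2 (matching the input channels) and width 3.
conv = nn.Conv1d(in_channels=2, out_channels=16, kernel_size=3)

print(conv.weight.size())              # torch.Size([16, 2, 3])
print(conv.bias.size())                # torch.Size([16]) -- one bias per kernel
print(conv(x).size())                  # torch.Size([1, 16, 62]): 64 - 3 + 1 = 62
print(nn.Conv1d(2, 16, 5)(x).size())   # torch.Size([1, 16, 60]) with kernel size 5
```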
0:18:33.770,0:18:40.520 -So here we can just run ipython if i press ctrl L I clear the screen and +Então aqui podemos apenas executar o ipython se eu pressionar ctrl LI limpar a tela e 0:18:40.520,0:18:49.820 -we can do import torch then I can do from torch import nn so now we can see +podemos fazer a importação da tocha, então eu posso fazer a partir da tocha import nn, então agora podemos ver 0:18:49.820,0:18:54.500 -for example called let's set my convolutional convolutional layer it's +por exemplo chamado vamos definir minha camada convolucional convolucional é 0:18:54.500,0:18:59.930 -going to be equal to NN conf and then I can keep going until I get +vai ser igual a NN conf e então eu posso continuar até conseguir 0:18:59.930,0:19:04.220 -this one let's say yeah let's say I have no idea how to use this function I just +este, digamos sim, digamos que não tenho ideia de como usar essa função, apenas 0:19:04.220,0:19:08.750 -put a question mark I press ENTER and I'm gonna see here now the documentation +coloco um ponto de interrogação eu pressiono ENTER e vou ver aqui agora a documentação 0:19:08.750,0:19:13.460 -okay so in this case you're gonna have the first item is going to be the input +ok, então, neste caso, você terá o primeiro item que será a entrada 0:19:13.460,0:19:19.820 -channel then I have the output channels then I have the corner sighs alright so +canal então eu tenho os canais de saída então eu tenho o canto suspira bem então 0:19:19.820,0:19:24.290 -for example we are going to be putting here input channels we have a stereo +por exemplo vamos colocar aqui os canais de entrada temos um estéreo 0:19:24.290,0:19:30.530 -signal so we put two channels the number of corners we said that was M and let's +sinal então colocamos dois canais o número de cantos que dissemos que era M e vamos 0:19:30.530,0:19:36.650 -say we have 16 kernels so this is the number of kernels I'm gonna be using and +digamos que temos 16 kernels, então este é o número de kernels que vou usar e 0:19:36.650,0:19:41.810 -then let's have our kernel size of what the same I use here so let's have K or +então vamos ter o tamanho do nosso kernel do que eu uso aqui, então vamos ter K ou 0:19:41.810,0:19:47.570 -the kernel size equal 3 okay in so here I'm going to define my first convolution +o tamanho do kernel é igual a 3 ok então aqui vou definir minha primeira convolução 0:19:47.570,0:19:52.910 -object so if I print this one comes you're gonna see we have a convolution a +objeto então se eu imprimir este vem você vai ver que temos uma convolução a 0:19:52.910,0:19:57.580 -2d combo sorry 1 deconvolution made that okay so we have a 1d convolution +2d combo desculpe 1 deconvolução fez isso bem, então temos uma convolução 1d 0:20:02.149,0:20:08.869 -which is going from two channels so a stereophonic to a sixteen channels means +que vai de dois canais, de modo estereofônico para dezesseis canais, significa 0:20:08.869,0:20:16.039 -I use sixteen kernels the skirmish size is 3 and then the stride is also 1 ok so +Eu uso dezesseis kernels o tamanho do skirmish é 3 e então o passo também é 1 ok então 0:20:16.039,0:20:23.859 -in this case I'm gonna be checking what is gonna be my convolutional weights +neste caso, vou verificar quais serão meus pesos convolucionais 0:20:27.429,0:20:33.379 -what is the size of the weights how many weights do we have how many how +qual é o tamanho dos pesos quantos pesos temos quantos quantos 0:20:33.379,0:20:40.069 -many planes do we have for the weights 16 right so we have 16 weights 
what is +quantos aviões temos para os pesos 16 certo então temos 16 pesos o que é 0:20:40.069,0:20:53.649 -the length of the the day of the key of D of the kernel okay Oh what is this - +a duração do dia da chave de D do kernel ok Oh o que é isso - 0:20:54.549,0:21:00.349 -Janis right so I have 16 of these scanners which have thickness - and then +Janis está certo, então eu tenho 16 desses scanners que têm espessura - e então 0:21:00.349,0:21:05.539 -length of 3 ok makes sense right because you're gonna be applying each of these +comprimento de 3 ok faz sentido porque você vai aplicar cada um desses 0:21:05.539,0:21:11.629 -16 across the whole signal so let's have my signal now you're gonna be is gonna +16 em todo o sinal então vamos ter o meu sinal agora você vai ser vai 0:21:11.629,0:21:20.599 -be equal toage dot R and and and oh sighs I don't know let's say 64 I also +ser igual a idade ponto R e e e oh suspiros não sei digamos 64 eu também 0:21:20.599,0:21:25.129 -have to say I have a batch of size 1 so I have a virtual site one so I just have +tenho que dizer que eu tenho um lote de tamanho 1, então eu tenho um site virtual, então eu só tenho 0:21:25.129,0:21:31.879 -one signal and then this is gonna be 64 how many channels we said this has two +um sinal e então isso vai ser 64 quantos canais nós dissemos que isso tem dois 0:21:31.879,0:21:37.819 -right so I have one signal one example which has two channels and has 64 +certo então eu tenho um sinal um exemplo que tem dois canais e tem 64 0:21:37.819,0:21:46.689 -samples so this is my X hold on what is the convolutional bias size +samples então este é o meu X espera qual é o tamanho do viés convolucional 0:21:48.320,0:21:54.380 -a 16 right because you have one bias / plain / / / way ok so what's gonna be in +um 16 certo porque você tem um viés / simples / / / muito bem, então o que vai estar em 0:21:54.380,0:22:07.539 -our my convolution of X the output hello so I'm gonna still have one sample right +nossa minha convolução de X a saída olá, então ainda vou ter uma amostra certa 0:22:07.539,0:22:15.919 -how many channels 16 what is gonna be the length of the signal okay that's +quantos canais 16 qual vai ser a duração do sinal ok isso é 0:22:15.919,0:22:22.700 -good 6 fix it okay fantastic all right so what if I'm gonna be using +bom 6 conserte ok fantástico tudo bem e daí se eu vou usar 0:22:22.700,0:22:32.240 -a convolution with size of the kernel 5 what do I get now yet to shout I can't +uma convolução com tamanho do kernel 5 o que eu recebo agora ainda para gritar não consigo 0:22:32.240,0:22:36.320 -hear you 60 okay you're following fantastic okay +ouço você 60 ok você está seguindo fantástico ok 0:22:36.320,0:22:44.059 -so let's try now instead to use a hyper spectral image with a 2d convolution +então vamos tentar agora usar uma imagem hiperespectral com uma convolução 2d 0:22:44.059,0:22:49.100 -okay so I'm going to be coding now my convolution here is going to be my in +ok, então eu vou codificar agora minha convolução aqui vai ser minha entrada 0:22:49.100,0:22:55.490 -this case is correct or is going to be a conf come to D again I don't know how to +este caso está correto ou vai ser um conf venha para D novamente não sei como 0:22:55.490,0:22:59.059 -use it so I put a question mark and then I have here input channel output channel +uso então eu coloco um ponto de interrogação e depois tenho aqui canal de entrada canal de saída 0:22:59.059,0:23:05.450 -criticize strident padding okay so I'm going to be putting inputs tried 
input +critique o preenchimento estridente ok, então vou colocar entradas testadas entrada 0:23:05.450,0:23:10.429 -channel so it's a hyper spectral image with 20 planes so what's gonna be the +canal, então é uma imagem hiperespectral com 20 planos, então qual será o 0:23:10.429,0:23:16.149 -input in this case 20 right because you have you start from 20 spectral bands +insira neste caso 20 certo porque você começa a partir de 20 bandas espectrais 0:23:16.149,0:23:20.419 -then we're gonna be inputting the output number of channels we let's say we're +então vamos inserir o número de saída de canais, digamos que estamos 0:23:20.419,0:23:25.330 -gonna be using again 16 in this case I'm going to be inputting the kernel size +vou usar novamente 16, neste caso, vou inserir o tamanho do kernel 0:23:25.330,0:23:33.440 -since I'm planning to use okay let's actually define let's actually define my +já que estou planejando usar ok vamos definir vamos definir meu 0:23:33.440,0:23:40.120 -signal first so my X is gonna be a torch dot R and and let's say one sample with +sinal primeiro, então meu X será um ponto de tocha R e, digamos, uma amostra com 0:23:40.120,0:23:52.820 -20 channels of height for example I guess 6128 well hold on 64 and then with +20 canais de altura por exemplo eu acho que 6128 bem segure em 64 e depois com 0:23:52.820,0:23:58.820 -128 okay so this is gonna be my my input my eople data okay +128 ok então esta será minha entrada meus dados pessoais ok 0:23:58.820,0:24:04.370 -so my convolution now it can be something like this so I have 20 +então minha convolução agora pode ser algo assim, então eu tenho 20 0:24:04.370,0:24:09.110 -channels from input 16 our Mike Ernest I'm gonna be using then I'm gonna be +canais de entrada 16 nosso Mike Ernest eu vou estar usando então eu vou ser 0:24:09.110,0:24:15.050 -specifying the kernel size in this case let's use something that is like three +especificando o tamanho do kernel neste caso vamos usar algo como três 0:24:15.050,0:24:24.580 -times five okay so what is going to be the output what are the kernel size +vezes cinco ok então qual vai ser a saída qual é o tamanho do kernel 0:24:29.170,0:24:47.630 -anyone yes no what no 20 Janice is the channels of the input data right so you +alguém sim não o que não 20 Janice é os canais de entrada de dados certo então você 0:24:47.630,0:24:51.680 -have how many kernels here 16 right there you go +tem quantos kernels aqui 16 ai vai 0:24:51.680,0:24:56.420 -we have 16 kernels which have 20 channels such that they can lay over the +temos 16 kernels que possuem 20 canais de modo que eles podem se sobrepor ao 0:24:56.420,0:25:03.410 -input 3 by 5 right teeny like a short like yeah short but large ok so what is +entrada 3 por 5 direito pequenino como um curto como sim curto, mas grande ok então o que é 0:25:03.410,0:25:08.140 -gonna be my conv(x).size ? [1, 16, 62, 124]. Let's say I'd like to +vai ser meu conv(x).size ? [1, 16, 62, 124]. 
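A corresponding sketch for the 2-D case just described (a hyperspectral image with 20 spectral bands, 16 kernels of size 3×5, one 64×128 example; an illustration rather than the exact notebook code):

```python
import torch
from torch import nn

conv = nn.Conv2d(in_channels=20, out_channels=16, kernel_size=(3, 5))
print(conv.weight.size())   # torch.Size([16, 20, 3, 5]) -> 16 kernels, each 20 channels thick, 3 tall, 5 wide

x = torch.rand(1, 20, 64, 128)  # one sample, 20 channels, height 64, width 128
print(conv(x).size())           # torch.Size([1, 16, 62, 124])
```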
Digamos que eu gostaria de 0:25:16.310,0:25:22.190 -actually add back the I'd like to head the sing dimensionality I can add some +na verdade, adicione de volta o que eu gostaria de liderar a dimensionalidade de canto, posso adicionar alguns 0:25:22.190,0:25:25.730 -padding right so here there is going to be the stride I'm gonna have a stride of +preenchimento certo, então aqui vai ser o passo que eu vou ter um passo de 0:25:25.730,0:25:29.930 -1 again if you don't remember the the syntax you can just put the question +1 novamente se você não se lembra da sintaxe, basta colocar a pergunta 0:25:29.930,0:25:35.120 -mark can you figure out and then how much strive should I add now how much +marca você pode descobrir e, em seguida, quanto esforço devo adicionar agora quanto 0:25:35.120,0:25:41.870 -stride in the y-direction sorry yes how much padding should I add in the +passo na direção y desculpe sim quanto preenchimento devo adicionar no 0:25:41.870,0:25:46.490 -y-direction one because it's gonna be one on top one on the bottom but then +y-direction one porque vai ser um em cima e um embaixo, mas depois 0:25:46.490,0:25:51.890 -then on the x-direction okay you know you're following fantastic and so now if +então na direção x ok você sabe que está seguindo fantástico e agora se 0:25:51.890,0:25:57.320 -I just run this one you wanna get the initial size okay so now you have both +Acabei de executar este, você quer obter o tamanho inicial, então agora você tem os dois 0:25:57.320,0:26:05.500 -1d and 2d the point is that what is the dimension of a convolutional kernel and +1d e 2d o ponto é que qual é a dimensão de um kernel convolucional e 0:26:05.500,0:26:12.470 -symbol for to the dimensional signal again I repeat what is the +símbolo para o sinal dimensional novamente repito qual é o 0:26:12.470,0:26:20.049 -dimensionality of the collection of careness use for two-dimensional data +dimensionalidade da coleção de uso de cuidados para dados bidimensionais 0:26:20.860,0:26:27.679 -again for right so four is gonna be the number of dimensions that are required +novamente para a direita, então quatro será o número de dimensões necessárias 0:26:27.679,0:26:35.659 -to store the collection of kernels when you perform 2d convolutions the one is +para armazenar a coleção de kernels quando você executa convoluções 2d, o que é 0:26:35.659,0:26:40.370 -going to be the stride so if you don't know how this works you just put a +vai ser o passo, então se você não sabe como isso funciona, basta colocar um 0:26:40.370,0:26:44.000 -question mark and gonna tell you here so stride is gonna be telling you you +ponto de interrogação e vou te dizer aqui, então stride vai te dizer que você 0:26:44.000,0:26:50.929 -stride off you move every time the kernel by one if you are the first one +afaste-se cada vez que o kernel por um se você for o primeiro 0:26:50.929,0:26:55.460 -means you only is the batch size so torch expects you to always use batches +significa que você é apenas o tamanho do lote, então a tocha espera que você sempre use lotes 0:26:55.460,0:27:00.110 -meaning how many signals you're using just one right so that our expectation +ou seja, quantos sinais você está usando apenas um certo para que nossa expectativa 0:27:00.110,0:27:04.549 -if you send an input vector which is going to be input tensor which has +se você enviar um vetor de entrada que será o tensor de entrada que tem 0:27:04.549,0:27:12.289 -dimension three is gonna be breaking and complain okay so we have still some time +dimensão três vai 
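To keep the 64×128 spatial size, as discussed above, a padding of one row (top and bottom) and two columns (left and right) can be passed explicitly for a 3×5 kernel; the sketch below assumes stride 1:

```python
import torch
from torch import nn

conv = nn.Conv2d(20, 16, kernel_size=(3, 5), stride=1, padding=(1, 2))
x = torch.rand(1, 20, 64, 128)
print(conv(x).size())        # torch.Size([1, 16, 64, 128]) -> spatial size preserved
print(conv.weight.dim())     # 4 -> the collection of 2-D kernels is stored as a 4-D tensor
```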
quebrar e reclamar ok então ainda temos algum tempo 0:27:12.289,0:27:18.049 -to go in the second part all right second part is going to be so you've +para ir na segunda parte, tudo bem, a segunda parte vai ser assim que você 0:27:18.049,0:27:23.779 -been computing some derivatives right for the first homework right so the +computando algumas derivadas para o primeiro dever de casa, então o 0:27:23.779,0:27:31.909 -following homework maybe you have to do you have to compute this one okay you're +seguindo a lição de casa talvez você tenha que fazer você tem que calcular isso ok você está 0:27:31.909,0:27:35.510 -supposed to be laughing it's a joke okay there you go +deveria estar rindo é uma piada ok lá vai 0:27:35.510,0:27:43.340 -fantastic so this is what you can wrote back in the 90s for the computation of +fantástico, então isso é o que você pode escrever nos anos 90 para o cálculo de 0:27:43.340,0:27:50.029 -the gradients of the of the lsdm which are gonna be covered I guess in next +os gradientes do lsdm que serão cobertos, acho que no próximo 0:27:50.029,0:27:54.950 -next lesson so how somehow so they had to still do these things right it's kind +próxima lição então como de alguma forma eles ainda tinham que fazer essas coisas direito é gentil 0:27:54.950,0:28:00.769 -of crazy nevertheless we can use PyTorch to have automatic computation of these +de louco, no entanto, podemos usar o PyTorch para ter computação automática desses 0:28:00.769,0:28:06.500 -gradients so we can go and check out how these automatic gradient works +gradientes para que possamos verificar como funciona esse gradiente automático 0:28:06.500,0:28:12.159 -okay all right so all right so we are going to be going +tudo bem então tudo bem então vamos indo 0:28:23.090,0:28:28.490 -now to the notebook number three which is the yeah +agora para o notebook número três que é o sim 0:28:28.490,0:28:33.590 -invisible let me see if I can highlight it now it's even worse okay number three +invisível deixe-me ver se consigo destacar agora é ainda pior ok número três 0:28:33.590,0:28:41.619 -Auto gratitute Oriole okay let me go fullscreen +Gratidão automática Oriole ok, deixe-me ir em tela cheia 0:28:41.619,0:28:53.029 -okay so out of our tutorial was gonna be here here just create my tensor which +ok então fora do nosso tutorial ia estar aqui aqui apenas crie meu tensor que 0:28:53.029,0:28:57.499 -has as well these required gradients equal true in this case I mean asking +tem também esses gradientes necessários iguais verdadeiros neste caso, quero dizer perguntando 0:28:57.499,0:29:02.539 -torch please track all the gradient computations did it got the competition +tocha, por favor, rastreie todos os cálculos de gradiente, ele conseguiu a competição 0:29:02.539,0:29:07.749 -over the tensor such that we can perform computation of partial derivatives okay +sobre o tensor de modo que possamos realizar o cálculo de derivadas parciais ok 0:29:07.749,0:29:13.279 -in this case I'm gonna have my Y is going to be so X is simply gonna be one +neste caso eu vou ter meu Y vai ser então X vai ser simplesmente um 0:29:13.279,0:29:20.419 -two three four the Y is going to be X subtracted number two okay alright so +dois três quatro o Y vai ser X subtraído o número dois tudo bem então 0:29:20.419,0:29:26.869 -now we can notice that there is this grad F n grad f NN FN function here so +agora podemos notar que existe essa função grad F n grad f NN FN aqui, então 0:29:26.869,0:29:32.059 -let's see what this stuff is we go sit there and see oh 
this is a sub backward +vamos ver o que é isso vamos sentar lá e ver oh isso é um sub para trás 0:29:32.059,0:29:37.629 -what is it meaning that the Y has been generated by a module which performs the +o que significa que o Y foi gerado por um módulo que realiza o 0:29:37.629,0:29:43.669 -subtraction between X and and - right so you have X minus 2 therefore if you +subtração entre X e e - certo, então você tem X menos 2, portanto, se você 0:29:43.669,0:29:51.860 -check who generated Y well there's a sub a subtraction module ok so what's gonna +verifique quem gerou Y bem, há um sub um módulo de subtração ok então o que vai 0:29:51.860,0:30:01.009 -be now the God function of X you're supposed to answer oh okay +seja agora a função de Deus de X você deveria responder oh ok 0:30:01.009,0:30:03.580 -why is none because they should have written there +por que não há porque eles deveriam ter escrito lá 0:30:07.580,0:30:12.020 -Alfredo generated that right okay all right none is fine as well +Alfredo gerou esse certo ok tudo bem nenhum está bem também 0:30:12.020,0:30:17.000 -okay so let's actually put our nose inside we were here we can actually +ok, então vamos realmente colocar nosso nariz dentro de nós estávamos aqui, podemos realmente 0:30:17.000,0:30:23.770 -access the first element you have the accumulation why is the accumulation I +acesse o primeiro elemento você tem o acúmulo porque é o acúmulo eu 0:30:25.090,0:30:29.830 -don't know I forgot but then if you go inside there you're gonna see the +não sei, eu esqueci, mas então se você entrar lá você vai ver o 0:30:29.830,0:30:34.760 -initial vector the initial tensor we are using is the one two three four okay so +vetor inicial o tensor inicial que estamos usando é o um dois três quatro ok então 0:30:34.760,0:30:41.390 -inside this computational graph you can also find the original tensor okay all +dentro deste gráfico computacional você também pode encontrar o tensor original tudo bem 0:30:41.390,0:30:46.880 -right so let's now get the Z and inside is gonna be my Y square times three and +certo, então vamos agora pegar o Z e dentro vai ser meu Y quadrado vezes três e 0:30:46.880,0:30:51.620 -then I compute my average a it's gonna be the mean of Z right so if I compute +então eu calculo minha média a vai ser a média de Z certo então se eu calcular 0:30:51.620,0:30:56.330 -the square of this thing here and I multiply by three and I take the average +o quadrado dessa coisa aqui e multiplico por três e faço a média 0:30:56.330,0:31:00.500 -so this is the square part times 3 and then this is the average okay so you can +então esta é a parte quadrada vezes 3 e então esta é a média ok então você pode 0:31:00.500,0:31:06.200 -try if you don't believe me all right so let's see how this thing looks like so +tente se você não acredita em mim tudo bem então vamos ver como essa coisa se parece 0:31:06.200,0:31:10.549 -I'm gonna be promoting here all these sequence of computations so we started +Estarei promovendo aqui todas essas sequências de cálculos, então começamos 0:31:10.549,0:31:16.669 -by from a two by two matrix what was this guy here to buy - who is this X +por a partir de uma matriz dois a dois o que esse cara estava aqui para comprar - quem é esse X 0:31:16.669,0:31:22.399 -okay you're following it cool then we subtracted - right and then we +ok você está seguindo legal, então nós subtraímos - certo e então nós 0:31:22.399,0:31:27.440 -multiplied by Y twice right that's why you have to ro so you get the same +multiplicado por Y duas vezes 
certo é por isso que você tem que ro para obter o mesmo 0:31:27.440,0:31:31.669 -subtraction that is the whyatt the X minus 2 multiplied by itself then +subtração que é o porquê do X menos 2 multiplicado por ele mesmo então 0:31:31.669,0:31:36.649 -you have another multiplication what is this okay multiply by three and then you +você tem outra multiplicação o que é isso ok multiplique por três e então você 0:31:36.649,0:31:42.980 -have the final the mean backward because this Y is green because it's mean no +tem o final a média para trás porque este Y é verde porque significa não 0:31:42.980,0:31:51.140 -okay yeah thank you for laughing okay so I compute back prop right +ok, sim, obrigado por rir, ok, então eu calculo o suporte de volta certo 0:31:51.140,0:31:59.409 -what does backdrop do what does this line do +o que o pano de fundo faz o que esta linha faz 0:32:00.360,0:32:08.610 -I want to hear everyone you know already we compute what radians right so black +Eu quero ouvir todos que você conhece já computamos o que radianos certo tão preto 0:32:08.610,0:32:11.580 -propagation is how you compute the gradients how do we train your networks +propagação é como você calcula os gradientes como treinamos suas redes 0:32:11.580,0:32:20.730 -with gradients ain't right or whatever Aaron said yesterday back +com gradientes não está certo ou o que Aaron disse ontem de volta 0:32:20.730,0:32:27.000 -propagation is that is used for computing the gradient completely +propagação é que é usado para calcular o gradiente completamente 0:32:27.000,0:32:29.970 -different things okay please keep them separate don't merge +coisas diferentes ok por favor mantenha-os separados não mescle 0:32:29.970,0:32:34.559 -them everyone after a bit that don't they don't see me those two things keep +todos eles depois de um tempo que não eles não me veem essas duas coisas continuam 0:32:34.559,0:32:43.740 -colliding into one mushy thought don't it's painful okay she'll compute the +colidindo em um pensamento piegas não é doloroso ok ela vai calcular o 0:32:43.740,0:32:51.659 -gradients right so guess what we are computing some gradients now okay so we +gradientes certo, então adivinhe o que estamos calculando alguns gradientes agora ok, então nós 0:32:51.659,0:33:02.580 -go on your page it's going to be what what was a it was the average right so +vá na sua página vai ser o que era uma era a média né então 0:33:02.580,0:33:10.529 -this is 1/4 right the summation of all those zᵢ +isso é 1/4 certo a soma de todos aqueles zᵢ 0:33:10.529,0:33:17.460 -what so I goes from 1 to 4 okay so what is that I said I is going +o que então eu vou de 1 a 4 ok então o que é que eu disse que vou 0:33:17.460,0:33:27.539 -to be equal to 3yᵢ² right yeah no questions no okay all right and then +ser igual a 3yᵢ² certo sim sem perguntas não ok tudo bem e então 0:33:27.539,0:33:36.840 -this one is was equal to 3(x-2)² right so a what does it belong +este é igual a 3(x-2)² certo, então o que ele pertence 0:33:36.840,0:33:38.899 -to where does a belong to what is the ℝ +para onde pertence a qual é o ℝ 0:33:44.279,0:33:51.200 -right so it's a scaler okay all right so now we can compute ∂a/∂x. +certo então é um scaler tudo bem então agora podemos calcular ∂a/∂x. 
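The computation that follows can be summarized in one line, using the same definitions as above ($x$ is the $2 \times 2$ tensor with entries $1, 2, 3, 4$; $y = x - 2$; $z = 3y^2$; $a$ is the mean of $z$):

$$
a = \frac{1}{4}\sum_i z_i, \qquad z_i = 3\,y_i^2 = 3\,(x_i - 2)^2
\;\;\Longrightarrow\;\;
\frac{\partial a}{\partial x_i} = \frac{1}{4}\cdot 3 \cdot 2\,(x_i - 2) = \frac{3}{2}\,(x_i - 2)
$$

For this particular $x$ the result is $\begin{bmatrix}-1.5 & 0\\ 1.5 & 3\end{bmatrix}$, which is exactly what the hand calculation below arrives at.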
0:33:51.200,0:33:58.110 -So how much is this stuff you're gonna have 1/4 comes out forum here and +Então quanto é esse material que você vai ter 1/4 sai do fórum aqui e 0:33:58.110,0:34:03.090 -then you have you know let's have this one with respect to the xᵢ element +então você sabe, vamos ter este em relação ao elemento xᵢ 0:34:03.090,0:34:09.179 -okay so we're gonna have this one zᵢ inside is that, I have the 3yᵢ², +ok, então vamos ter este zᵢ dentro é esse, eu tenho o 3yᵢ², 0:34:09.179,0:34:15.899 -and it's gonna be 3(xᵢ- 2)². Right so these three comes +e será 3(xᵢ- 2)². Certo, então esses três vem 0:34:15.899,0:34:22.080 -out here the two comes down as well and then you multiply by (xᵢ – 2). +aqui os dois também descem e então você multiplica por (xᵢ – 2). 0:34:22.080,0:34:33.260 -So far should be correct okay fantastic all right so my X was this element here +Até agora deve estar correto ok fantástico tudo bem então meu X era esse elemento aqui 0:34:33.589,0:34:38.190 -actually let me compute as well this one so this one goes away this one becomes +na verdade, deixe-me calcular também este, então este vai embora este se torna 0:34:38.190,0:34:47.690 -true this is 1.5 times xᵢ – 3. Right - 2 - 3 +verdade isso é 1,5 vezes xᵢ – 3. Certo - 2 - 3 0:34:55.159,0:35:06.780 -ok mathematics okay okay thank you all right. So what's gonna be ∂a/∂x ? +ok matemática ok ok obrigado tudo bem. Então, o que vai ser ∂a/∂x ? 0:35:06.780,0:35:11.339 -I'm actually writing the transpose directly here so for the first element +Na verdade, estou escrevendo a transposição diretamente aqui, então para o primeiro elemento 0:35:11.339,0:35:18.859 -you have one you have one times 1.5 so 1.5 minus 3 you get 1 minus 1.5 right +você tem um você tem um vezes 1,5 então 1,5 menos 3 você obtém 1 menos 1,5 certo 0:35:18.859,0:35:23.670 -second one is going to be 3 minus 3 you get 0 Ryan this is 3 minus 3 +o segundo vai ser 3 menos 3 você tem 0 Ryan isso é 3 menos 3 0:35:23.670,0:35:27.420 -maybe I should write everything right so you're actually following so you have +talvez eu deva escrever tudo certo para que você esteja realmente seguindo, então você tem 0:35:27.420,0:35:37.589 -1.5 minus 3 now you have 3 minus 3 below you have 4 point 5 minus 3 and then the +1,5 menos 3 agora você tem 3 menos 3 abaixo você tem 4 ponto 5 menos 3 e então o 0:35:37.589,0:35:47.160 -last one is going to be 6 minus 3 which is going to be equal to minus 1 point 5 +o último vai ser 6 menos 3 que vai ser igual a menos 1 ponto 5 0:35:47.160,0:35:59.789 -0 1 point 5 and then 3 right you agree ok let me just write this on here +0 1 ponto 5 e depois 3 certo você concorda ok deixa eu escrever isso aqui 0:35:59.789,0:36:06.149 -okay just remember so we have you be computed the backpropagation here I'm +ok apenas lembre-se para que você seja computado a retropropagação aqui estou eu 0:36:06.149,0:36:14.609 -gonna just bring it to the gradients and then the right it's the same stuff we +vamos apenas trazê-lo para os gradientes e então a direita é a mesma coisa que nós 0:36:14.609,0:36:27.630 -got here right such that I don't have to transpose it here whenever you perform +chegou aqui certo para não ter que transpor aqui sempre que você tocar 0:36:27.630,0:36:33.209 -the partial derivative in PyTorch you get the same the same shape is the input +a derivada parcial no PyTorch você obtém a mesma forma é a entrada 0:36:33.209,0:36:37.469 -dimension so if you have a weight whatever dimension then when you compute +dimensão, então, se você tiver um peso 
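A short sketch that reproduces this example with autograd and checks the result against the hand-derived $1.5\,(x - 2)$ (same $2\times2$ tensor as above; variable names are illustrative):

```python
import torch

x = torch.tensor([[1., 2.], [3., 4.]], requires_grad=True)  # ask torch to track operations on x
y = x - 2                    # y.grad_fn is a SubBackward node
z = y * y * 3
a = z.mean()                 # a is a scalar

print(y.grad_fn)                                 # <SubBackward0 ...>, the module that produced y
print(y.grad_fn.next_functions[0][0].variable)   # the original tensor x lives inside the graph

a.backward()                 # backpropagation computes the gradients
print(x.grad)                # tensor([[-1.5000, 0.0000], [1.5000, 3.0000]])
# x.grad has the same shape as x, and matches the hand derivation 1.5 * (x - 2)
print(torch.allclose(x.grad, 1.5 * (x.detach() - 2)))   # True
```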
em qualquer dimensão, quando você calcular 0:36:37.469,0:36:41.069 -the partial you still have the same dimension they don't swap they don't +o parcial você ainda tem a mesma dimensão eles não trocam eles não 0:36:41.069,0:36:44.789 -turn okay they just use this for practicality at the correct version I +vire ok eles só usam isso por praticidade na versão correta eu 0:36:44.789,0:36:49.919 -mean the the gradient should be the transpose of that thing sorry did +significa que o gradiente deve ser a transposição daquela coisa, desculpe fiz 0:36:49.919,0:36:54.479 -Jacobian which is the transpose of the gradient right if it's a vector but this +Jacobian que é a transposição do gradiente certo se for um vetor, mas isso 0:36:54.479,0:37:08.130 -is a tensor so whatever we just used the same same shape thing no so this one +é um tensor, então o que quer que usamos a mesma coisa de forma não, então este 0:37:08.130,0:37:13.639 -should be a flipping I believe maybe I'm wrong but I don't think all right so +deve ser uma reviravolta acredito que talvez eu esteja errado, mas não acho certo então 0:37:13.639,0:37:19.919 -this is like basic these basic PyTorch now you can do crazy stuff because we +isso é como básico esses PyTorch básicos agora você pode fazer coisas malucas porque nós 0:37:19.919,0:37:23.609 -like crazy right I mean I do I think if you like me you +como um louco certo, quero dizer, eu acho que se você gosta de mim, você 0:37:23.609,0:37:29.669 -like crazy right okay so here I just create my +como um louco certo, então aqui eu apenas crio meu 0:37:29.669,0:37:34.259 -vector X which is going to be a three dimensional well a one-dimensional +vetor X que vai ser um poço tridimensional um unidimensional 0:37:34.259,0:37:43.769 -tensor of three items I'm going to be multiplying X by two then I call this +tensor de três itens eu vou multiplicar X por dois então eu chamo isso 0:37:43.769,0:37:49.859 -one Y then I start my counter to zero and then until the norm of the Y is long +um Y então eu começo meu contador a zero e depois até que a norma do Y seja longa 0:37:49.859,0:37:56.699 -thousand below thousand I keep doubling Y okay and so you can get like a dynamic +mil abaixo de mil eu continuo dobrando Y ok e assim você pode ficar como um dinâmico 0:37:56.699,0:38:01.529 -graph right the graph is base is conditional to the actual random +gráfico à direita o gráfico é base é condicional ao aleatório real 0:38:01.529,0:38:04.979 -initialization which you can't even tell because I didn't even use a seed so +inicialização que você nem pode dizer porque eu nem usei uma semente, então 0:38:04.979,0:38:08.999 -everyone that is running this stuff is gonna get different numbers so these are +todos que estão executando essas coisas vão obter números diferentes, então estes são 0:38:08.999,0:38:11.910 -the final values of the why can you tell me +os valores finais do por que você pode me dizer 0:38:11.910,0:38:23.549 -how many iterations we run so the mean of this stuff is actually lower than a +quantas iterações executamos para que a média dessas coisas seja realmente menor do que um 0:38:23.549,0:38:27.630 -thousand yeah but then I'm asking whether you know how many times this +mil sim, mas então eu estou perguntando se você sabe quantas vezes isso 0:38:27.630,0:38:41.119 -loop went through no good why it's random Rises you know it's bad question +loop passou por nada bom porque é aleatório aumenta você sabe que é uma pergunta ruim 0:38:41.119,0:38:45.539 -about bad questions next time I have a 
something for you okay so I'm gonna be +sobre perguntas ruins da próxima vez que eu tiver algo para você, então eu vou ser 0:38:45.539,0:38:51.569 -printing this one now I'm telling you the grabbed are 2048 right +imprimindo este agora estou dizendo que os agarrados são 2048 certo 0:38:51.569,0:38:55.589 -just check the central one for the moment right this is the actual gradient +apenas verifique o central no momento certo este é o gradiente real 0:38:55.589,0:39:04.739 -so can you tell me now how many times the loop went on so someone said 11 how +então você pode me dizer agora quantas vezes o loop foi para alguém disse 11 como 0:39:04.739,0:39:14.420 -many ends up for 11 okay for people just roast their hands what about the others +muitos acabam por 11 ok para as pessoas apenas assar as mãos e os outros 0:39:14.809,0:39:17.809 -21 okay any other guys 11 10 +21 ok qualquer outro cara 11 10 0:39:25.529,0:39:30.749 -okay we have actually someone that has the right solution and this loop went on +ok, na verdade, temos alguém que tem a solução certa e esse loop continuou 0:39:30.749,0:39:35.759 -for 10 times why is that because you have the first multiplication by 2 here +por 10 vezes por que é porque você tem a primeira multiplicação por 2 aqui 0:39:35.759,0:39:40.589 -and then loop goes on over and over and multiplies by 2 right so the final +e então o loop continua repetidamente e multiplica por 2 certo para que o final 0:39:40.589,0:39:45.239 -number is gonna be the least number of iterations in the loop plus the +number será o menor número de iterações no loop mais o 0:39:45.239,0:39:50.779 -additional like addition and multiplication outside right yes no +adicional como adição e multiplicação fora da direita sim não 0:39:50.779,0:39:56.670 -you're sleeping maybe okay I told you not to eat before class otherwise you +você está dormindo talvez tudo bem, eu te disse para não comer antes da aula, caso contrário você 0:39:56.670,0:40:05.009 -get groggy okay so inference this is cool so here I'm gonna be just having +fique grogue bem, então inferência isso é legal, então aqui eu vou estar apenas tendo 0:40:05.009,0:40:09.420 -both my X & Y we are gonna just do linear regression right linear or +tanto no meu X quanto no Y, vamos apenas fazer regressão linear direita linear ou 0:40:09.420,0:40:17.670 -whatever think the add operator is just the scalar product okay so both the X +o que quer que pense que o operador add é apenas o produto escalar, então tanto o X 0:40:17.670,0:40:21.589 -and W has have the requires gradient equal to true +e W tem o gradiente requerido igual a true 0:40:21.589,0:40:27.119 -being this means we are going to be keeping track of the the gradients and +sendo isso significa que vamos acompanhar os gradientes e 0:40:27.119,0:40:31.290 -the computational graph so if I execute this one you're gonna get the partial +o gráfico computacional, então, se eu executar este, você obterá o parcial 0:40:31.290,0:40:37.710 -derivatives of the inner product with respect to the Z with respect to the +derivadas do produto interno em relação ao Z em relação ao 0:40:37.710,0:40:43.920 -input is gonna be the weights right so in the range is the input right and the +entrada vai ser os pesos certos, então no intervalo está a entrada certa e o 0:40:43.920,0:40:47.160 -ones are the weights so partial derivative with respect to the input is +uns são os pesos, então a derivada parcial em relação à entrada é 0:40:47.160,0:40:50.070 -gonna be the weights partial with respect to the weights are 
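A sketch of the dynamic-graph loop just discussed; since no seed is set, the number of doublings depends on the random initialization:

```python
import torch

x = torch.randn(3, requires_grad=True)
y = x * 2                     # one doubling outside the loop
i = 0
while y.norm() < 1000:        # the graph is built on the fly, so its depth depends on the data
    y = y * 2
    i += 1

y.backward(torch.ones(3))     # y is a vector, so an upstream gradient must be supplied
print(x.grad)                 # every entry is 2 ** (i + 1), e.g. 2048 when the loop ran 10 times
print(i)                      # number of doublings inside the loop
```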
gonna be the +serão os pesos parciais em relação aos pesos serão os 0:40:50.070,0:40:56.670 -input right yes no yes okay now I just you know usually it's this one is the +entrada certo sim não sim ok agora eu só sei que normalmente é este é o 0:40:56.670,0:41:00.359 -case I just have required gradients for my parameters because I'm gonna be using +caso eu só precisei de gradientes para meus parâmetros porque vou usar 0:41:00.359,0:41:06.030 -the gradients for updating later on the the parameters of the mother is so in +os gradientes para atualizar mais tarde os parâmetros da mãe são assim 0:41:06.030,0:41:12.300 -this case you get none let's have in this case instead what I usually do +neste caso você não tem nenhum vamos ter neste caso o que eu costumo fazer 0:41:12.300,0:41:17.250 -wanna do inference when I do inference I tell torch a torch stop tracking any +quero fazer inferência quando eu faço inferência eu digo tocha uma tocha parar de rastrear qualquer 0:41:17.250,0:41:22.950 -kind of operation so I say torch no God please so this one regardless of whether +tipo de operação, então eu digo tocha, não, Deus, por favor, então este, independentemente de 0:41:22.950,0:41:28.859 -your input always have the required grass true or false whatever when I say +sua entrada sempre tem a grama necessária verdadeira ou falsa, o que quer que eu diga 0:41:28.859,0:41:35.060 -torch no brats you do not have any computation a graph taken care of right +tocha sem pirralhos você não tem nenhum cálculo um gráfico cuidado certo 0:41:35.060,0:41:41.130 -therefore if I try to run back propagation on a tensor which was +portanto, se eu tentar executar a propagação de volta em um tensor que foi 0:41:41.130,0:41:46.320 -generated from like doesn't have actually you know graph because this one +gerado a partir de como não tem realmente você sabe gráfico porque este 0:41:46.320,0:41:50.940 -doesn't have a graph you're gonna get an error okay so if I run this one you get +não tem um gráfico, você receberá um erro, então, se eu executar este, você receberá 0:41:50.940,0:41:55.410 -an error and you have a very angry face here because it's an error and then it +um erro e você tem uma cara muito brava aqui porque é um erro e então 0:41:55.410,0:42:00.720 -takes your element 0 of tensor does not require grads and does not have a god +pega seu elemento 0 do tensor não requer grads e não tem um deus 0:42:00.720,0:42:07.650 -function right so II which was the yeah whatever they reside here actually then +funcionam bem então II que era o sim, o que quer que eles residam aqui, na verdade, então 0:42:07.650,0:42:11.400 -you couldn't run back problems that because there is no graph attached to +você não poderia ter problemas que, porque não há gráfico anexado a 0:42:11.400,0:42:19.710 -that ok questions this is so powerful you cannot do it this time with tensor +que ok perguntas isso é tão poderoso que você não pode fazer isso desta vez com tensor 0:42:19.710,0:42:26.790 -you okay tensor flow is like whatever yeah more stuff here actually more stuff +você está bem, o fluxo do tensor é tipo, sim, mais coisas aqui, na verdade, mais coisas 0:42:26.790,0:42:30.600 -coming right now [Applause] +chegando agora [Aplausos] 0:42:30.600,0:42:36.340 -so we go back here we have inside the extra folder he has some nice cute +então voltamos aqui temos dentro da pasta extra ele tem uns bonitinhos 0:42:36.340,0:42:40.450 -things I wanted to cover both of them just that we go just for the second I +coisas que eu queria cobrir os dois apenas 
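A sketch of the two points above: the partial derivatives of a scalar product, and how `torch.no_grad()` switches off graph tracking (the tensor values follow the arange/ones example mentioned in the session; the notebook code may differ):

```python
import torch

x = torch.arange(1., 4., requires_grad=True)   # the "input":  tensor([1., 2., 3.])
w = torch.ones(3, requires_grad=True)          # the "weights"
z = w @ x                                      # scalar product
z.backward()
print(x.grad)    # dz/dx = w  ->  tensor([1., 1., 1.])
print(w.grad)    # dz/dw = x  ->  tensor([1., 2., 3.])

with torch.no_grad():                          # ask torch to stop tracking operations
    z = w @ x
print(z.grad_fn)                               # None: no graph was recorded,
# so calling z.backward() here would raise a RuntimeError
```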
que vamos apenas para o segundo que eu 0:42:40.450,0:42:47.290 -think sorry the second one is gonna be the following so in this case we are +acho que desculpe o segundo vai ser o seguinte, então neste caso estamos 0:42:47.290,0:42:52.750 -going to be generating our own specific modules so I like let's say I'd like to +vamos gerar nossos próprios módulos específicos, então eu gosto, digamos que eu gostaria de 0:42:52.750,0:42:58.030 -define my own function which is super special amazing function I can decide if +definir minha própria função, que é uma função incrível super especial, posso decidir se 0:42:58.030,0:43:02.560 -I want to use it for you know training Nets I need to get the forward pass and +Eu quero usá-lo para você sabe treinar Nets eu preciso pegar o passe para frente e 0:43:02.560,0:43:06.220 -also have to know what is the partial derivative of the input respect to the +também tem que saber qual é a derivada parcial da entrada em relação ao 0:43:06.220,0:43:10.930 -output such that I can use this module in any kind of you know point in my +saída de tal forma que eu possa usar este módulo em qualquer tipo de ponto que você conheça no meu 0:43:10.930,0:43:15.670 -inner code such that you know by using back prop you know chain rule you just +código interno de tal forma que você saiba que, usando back prop, você conhece a regra da cadeia, você acabou de 0:43:15.670,0:43:20.320 -plug the thing. Yann went on several times as long as you know partial +ligue a coisa. Yann continuou várias vezes, desde que você saiba parcial 0:43:20.320,0:43:23.410 -derivative of the output with respect to the input you can plug these things +derivada da saída em relação à entrada, você pode conectar essas coisas 0:43:23.410,0:43:31.690 -anywhere in your chain of operations so in this case we define my addition which +em qualquer lugar em sua cadeia de operações, então neste caso definimos minha adição que 0:43:31.690,0:43:35.620 -is performing the addition of the two inputs in this case but then when you +está realizando a adição das duas entradas neste caso, mas quando você 0:43:35.620,0:43:41.130 -perform the back propagation if you have an addition what is the back propagation +execute a retropropagação se tiver uma adição qual é a retropropagação 0:43:41.130,0:43:47.020 -so if you have a addition of the two things you get an output when you send +então, se você tiver uma adição das duas coisas, obterá uma saída quando enviar 0:43:47.020,0:43:53.320 -down the gradients what does it happen with the with the gradient it gets you +abaixo dos gradientes o que acontece com o gradiente que você obtém 0:43:53.320,0:43:57.160 -know copied over both sides right and that's why you get both of them are +sabe copiado em ambos os lados corretos e é por isso que você obtém os dois 0:43:57.160,0:44:01.390 -copies or the same thing and they are sent through one side of the other you +cópias ou a mesma coisa e eles são enviados de um lado do outro você 0:44:01.390,0:44:05.170 -can execute this stuff you're gonna see here you get the same gradient both ways +pode executar essas coisas que você verá aqui, você obtém o mesmo gradiente nos dois sentidos 0:44:05.170,0:44:09.460 -in this case I have a split so I come from the same thing and then I split and +neste caso eu tenho uma divisão então eu venho da mesma coisa e então eu dividi e 0:44:09.460,0:44:13.180 -I have those two things doing something else if I go down with the gradient what +Eu tenho essas duas coisas fazendo outra coisa se eu descer com o gradiente o 
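A minimal sketch of such a custom module for addition, where the backward pass simply copies the incoming gradient to both inputs as described above (class and variable names are illustrative, not the notebook's):

```python
import torch

class MyAdd(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x1, x2):
        # forward pass: plain element-wise addition
        return x1 + x2

    @staticmethod
    def backward(ctx, grad_output):
        # an addition fans the upstream gradient out to both inputs unchanged
        return grad_output, grad_output

x1 = torch.randn(3, requires_grad=True)
x2 = torch.randn(3, requires_grad=True)
y = MyAdd.apply(x1, x2)
y.sum().backward()
print(x1.grad, x2.grad)   # two identical copies of the same upstream gradient
```

A split, where one tensor feeds two branches, behaves the other way around: the gradients coming back from the branches are summed, which is the case discussed next.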
que 0:44:13.180,0:44:20.080 -do I do you add them right and that's why we have here the add install you can +eu faço você adicioná-los direito e é por isso que temos aqui o add install você pode 0:44:20.080,0:44:23.680 -execute this one you're going to see here that we had these two initial +execute este que você verá aqui que tivemos esses dois 0:44:23.680,0:44:27.910 -gradients here and then when you went up or sorry when you went down the two +gradientes aqui e depois quando você subiu ou desculpe quando você desceu os dois 0:44:27.910,0:44:30.790 -things the two gradients sum together and they are here okay +coisas que os dois gradientes somam e estão aqui ok 0:44:30.790,0:44:36.190 -so again if you use pre-made things in PyTorch. They are correct this one you +então, novamente, se você usar coisas pré-fabricadas no PyTorch. Eles estão corretos este você 0:44:36.190,0:44:41.080 -can mess around you can put any kind of different in +pode mexer você pode colocar qualquer tipo de diferente em 0:44:41.080,0:44:47.950 -for a function and backward function I think we ran out of time other questions +para uma função e uma função para trás, acho que ficamos sem tempo outras perguntas 0:44:47.950,0:44:58.800 -before we actually leave no all right so I see on Monday and stay warm +antes de realmente sairmos não, tudo bem, então eu vejo na segunda-feira e fico aquecido \ No newline at end of file diff --git a/docs/pt/week06/06-2.md b/docs/pt/week06/06-2.md index 885769e86..5b4e55ce6 100644 --- a/docs/pt/week06/06-2.md +++ b/docs/pt/week06/06-2.md @@ -91,7 +91,6 @@ Figure 2. Recurrent Networks with unrolled loop -->
- "
Figura 2. Redes recorrentes com loop desenrolado
diff --git a/docs/pt/week06/06-3.md b/docs/pt/week06/06-3.md index 03426aadb..51e22e580 100644 --- a/docs/pt/week06/06-3.md +++ b/docs/pt/week06/06-3.md @@ -25,7 +25,7 @@ RNN é um tipo de arquitetura que podemos usar para lidar com sequências de dad ### Vanilla *vs.* Recurrent NN --> -### Rede Neural "Comum" * vs. * Redes Neurais Recorrentes +### Rede Neural "Comum" *vs.* Redes Neurais Recorrentes diff --git a/docs/pt/week06/lecture06.sbv b/docs/pt/week06/lecture06.sbv index 1eab7f0b2..a67ffa2f9 100644 --- a/docs/pt/week06/lecture06.sbv +++ b/docs/pt/week06/lecture06.sbv @@ -1,3338 +1,3338 @@ 0:00:04.960,0:00:08.970 -So I want to do two things, talk about +Então eu quero fazer duas coisas, falar sobre 0:00:11.019,0:00:14.909 -Talk a little bit about like some ways to use Convolutional Nets in various ways +Fale um pouco sobre como algumas maneiras de usar as Redes Convolucionais de várias maneiras 0:00:16.119,0:00:18.539 -Which I haven't gone through last time +Que eu não passei da última vez 0:00:19.630,0:00:21.630 -and +e 0:00:22.689,0:00:24.689 -And I'll also +E eu também vou 0:00:26.619,0:00:29.518 -Talk about different types of architectures that +Fale sobre os diferentes tipos de arquiteturas que 0:00:30.820,0:00:33.389 -Some of which are very recently designed +Alguns dos quais são muito recentemente projetados 0:00:34.059,0:00:35.710 -that people have been +que as pessoas foram 0:00:35.710,0:00:40.320 -Kind of playing with for quite a while. So let's see +Tipo de jogar com por um bom tempo. Então vamos ver 0:00:43.660,0:00:47.489 -So last time when we talked about Convolutional Nets we stopped that the +Então, da última vez, quando falamos sobre Redes Convolucionais, paramos que o 0:00:47.890,0:00:54.000 -idea that we can use Convolutional Nets with kind of a sliding we do over large images and it consists in just +ideia de que podemos usar Redes Convolucionais com um tipo de deslizamento que fazemos sobre imagens grandes e consiste em apenas 0:00:54.550,0:00:56.550 -applying the convolution on large images +aplicando a convolução em imagens grandes 0:00:57.070,0:01:01.559 -which is a very general image, a very general method, so we're gonna +que é uma imagem muito geral, um método muito geral, então vamos 0:01:03.610,0:01:06.900 -See a few more things on how to use convolutional Nets and +Veja mais algumas coisas sobre como usar redes convolucionais e 0:01:07.659,0:01:08.580 -to some extent +até certo ponto 0:01:08.580,0:01:09.520 -I'm going to +eu vou 0:01:09.520,0:01:16.020 -Rely on a bit of sort of historical papers and things like this to explain kind of simple forms of all of those ideas +Confie em um pouco de papéis históricos e coisas assim para explicar formas simples de todas essas ideias 0:01:17.409,0:01:21.269 -so as I said last time +então como eu disse da última vez 0:01:21.850,0:01:27.720 -I had this example where there's multiple characters on an image and you can, you have a convolutional net that +Eu tive este exemplo onde há vários caracteres em uma imagem e você pode, você tem uma rede convolucional que 0:01:28.360,0:01:32.819 -whose output is also a convolution like everyday air is a convolution so you can interpret the output as +cuja saída também é uma convolução como o ar cotidiano é uma convolução para que você possa interpretar a saída como 0:01:33.250,0:01:40.739 -basically giving you a score for every category and for every window on the input and the the framing of the window depends on +basicamente dando-lhe uma pontuação para cada categoria e para 
cada janela na entrada e o enquadramento da janela depende 0:01:41.860,0:01:47.879 -Like the the windows that the system observes when your back project for my particular output +Como as janelas que o sistema observa quando seu projeto de volta para minha saída específica 0:01:49.000,0:01:54.479 -Kind of steps by the amount of subsampling the total amount of sub something you have in a network +Tipo de etapas pela quantidade de subamostragem da quantidade total de sub algo que você tem em uma rede 0:01:54.640,0:01:59.849 -So if you have two layers that subsample by a factor of two, you have two pooling layers, for example +Então, se você tem duas camadas que subamostra por um fator de dois, você tem duas camadas de pool, por exemplo 0:01:59.850,0:02:02.219 -That's a factor of two the overall +Isso é um fator de dois no total 0:02:02.920,0:02:07.199 -subsampling ratio is 4 and what that means is that every output is +razão de subamostragem é 4 e o que isso significa é que cada saída é 0:02:07.509,0:02:14.288 -Gonna basically look at a window on the input and successive outputs is going to look at the windows that are separated by four pixels +Vou basicamente olhar para uma janela na entrada e as saídas sucessivas vão olhar para as janelas que são separadas por quatro pixels 0:02:14.630,0:02:17.350 -Okay, it's just a product of all the subsampling layers +Ok, é apenas um produto de todas as camadas de subamostragem 0:02:20.480,0:02:21.500 -So +assim 0:02:21.500,0:02:24.610 -this this is nice, but then you're gonna have to make sense of +isso é legal, mas então você vai ter que entender 0:02:25.220,0:02:30.190 -All the stuff that's on the input. How do you pick out objects objects that +Todas as coisas que estão na entrada. Como você escolhe objetos objetos que 0:02:31.310,0:02:33.020 -overlap each other +sobrepor uns aos outros 0:02:33.020,0:02:38.949 -Etc. And one thing you can do for this is called "Non maximum suppression" +Etc. E uma coisa que você pode fazer para isso é chamada de "supressão não máxima" 0:02:41.180,0:02:43.480 -Which is what people use in sort of object detection +Que é o que as pessoas usam no tipo de detecção de objetos 0:02:44.750,0:02:47.350 -so basically what that consists in is that if you have +então basicamente o que isso consiste é que se você tem 0:02:49.160,0:02:53.139 -Outputs that kind of are more or less at the same place and +Saídas que estão mais ou menos no mesmo lugar e 0:02:53.989,0:02:58.749 -or also like overlapping places and one of them tells you I see a +ou também gosto de lugares sobrepostos e um deles diz que vejo um 0:02:58.910,0:03:02.199 -Bear and the other one tells you I see a horse one of them wins +Urso e o outro diz que vejo um cavalo que um deles ganha 0:03:02.780,0:03:07.330 -Okay, it's probably one that's wrong. And you can't have a bear on a horse at the same time at the same place +Ok, provavelmente é um que está errado. E você não pode ter um urso em um cavalo ao mesmo tempo no mesmo lugar 0:03:07.330,0:03:10.119 -So you do what's called? No, maximum suppression you can +Então você faz o que é chamado? 
Não, supressão máxima que você pode 0:03:10.700,0:03:11.959 -Look at which +Olha qual 0:03:11.959,0:03:15.429 -which of those has the highest score and you kind of pick that one or you see if +qual deles tem a pontuação mais alta e você meio que escolhe esse ou vê se 0:03:15.500,0:03:19.660 -any neighbors also recognize that as a bear or a horse and you kind of make a +quaisquer vizinhos também reconhecem isso como um urso ou um cavalo e você meio que faz um 0:03:20.360,0:03:24.999 -vote if you want, a local vote, okay, and I'm gonna go to the details of this because +votem se quiserem, voto local, ok, e vou detalhar isso porque 0:03:25.760,0:03:28.719 -Just just kind of rough ideas. Well, this is +Apenas um tipo de ideias grosseiras. Bem, isso é 0:03:29.930,0:03:34.269 -already implemented in code that you can download and also it's kind of the topic of a +já implementado em código que você pode baixar e também é meio que o tópico de um 0:03:35.030,0:03:37.509 -full-fledged computer vision course +curso completo de visão computacional 0:03:38.239,0:03:42.939 -So here we just allude to kind of how we use deep learning for this kind of application +Então, aqui, apenas aludimos a como usamos o aprendizado profundo para esse tipo de aplicativo 0:03:46.970,0:03:48.970 -Let's see, so here's +Vamos ver, então aqui está 0:03:50.480,0:03:55.750 -Again going back to history a little bit some ideas of how you use +Novamente voltando um pouco para a história algumas ideias de como você usa 0:03:57.049,0:03:59.739 -neural nets to or convolutional nets in this case to +redes neurais para ou redes convolucionais neste caso para 0:04:00.500,0:04:04.690 -Recognize strings of characters which is kind of the same program as recognizing multiple objects, really +Reconhecer strings de caracteres que é meio que o mesmo programa que reconhece vários objetos, realmente 0:04:05.450,0:04:12.130 -So if you have, you have an image that contains the image at the top... "two, three two, zero, six" +Então, se você tiver, você tem uma imagem que contém a imagem no topo... "dois, três dois, zero, seis" 0:04:12.130,0:04:15.639 -It's a zip code and the characters touch so you don't know how to separate them in advance +É um código postal e os caracteres se tocam para que você não saiba separá-los com antecedência 0:04:15.979,0:04:22.629 -So you just apply a convolutional net to the entire string but you don't know in advance what width the characters will take and so +Então você apenas aplica uma rede convolucional a toda a string, mas não sabe de antemão qual a largura que os caracteres terão e assim 0:04:24.500,0:04:30.739 -what you see here are four different sets of outputs and those four different sets of outputs of +o que você vê aqui são quatro conjuntos diferentes de saídas e esses quatro conjuntos diferentes de saídas de 0:04:31.170,0:04:33.170 -the convolutional net +a rede convolucional 0:04:33.300,0:04:36.830 -Each of which has ten rows and the ten words corresponds to each of the ten categories +Cada uma delas tem dez linhas e as dez palavras correspondem a cada uma das dez categorias 0:04:38.220,0:04:43.489 -so if you look at the top for example the top, the top block +então se você olhar para o topo, por exemplo, o topo, o bloco superior 0:04:44.220,0:04:46.940 -the white squares represent high-scoring categories +os quadrados brancos representam categorias de alta pontuação 0:04:46.940,0:04:53.450 -So what you see on the left is that the number two is being recognized. 
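The lecture explicitly skips the details of non-maximum suppression, but the idea mentioned earlier (among overlapping detections, keep only the highest-scoring one) can be illustrated roughly as follows. This is a generic simplification, not the lecture's implementation; the window representation and the threshold are assumptions:

```python
def non_max_suppression(detections, overlap_threshold=0.5):
    """detections: list of (score, (start, end)) windows over the input.
    Keep the highest-scoring detections, dropping lower-scoring ones that
    overlap an already-kept window too much."""
    def iou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0

    kept = []
    for score, window in sorted(detections, reverse=True):
        if all(iou(window, w) < overlap_threshold for _, w in kept):
            kept.append((score, window))
    return kept
```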
So the window that is looked at by the +Então, o que você vê à esquerda é que o número dois está sendo reconhecido. Assim, a janela que é vista pelo 0:04:54.120,0:04:59.690 -Output units that are on the first column is on the, on the left side of the image and it, and it detects a two +As unidades de saída que estão na primeira coluna estão no lado esquerdo da imagem e ela detecta duas 0:05:00.330,0:05:03.499 -Because the you know their order 0 1 2 3 4 etc +Porque você sabe o pedido deles 0 1 2 3 4 etc 0:05:03.810,0:05:07.160 -So you see a white square that corresponds to the detection of a 2 +Então você vê um quadrado branco que corresponde à detecção de um 2 0:05:07.770,0:05:09.920 -and then as the window is +e então como a janela é 0:05:11.400,0:05:13.400 -shifted over the, over the input +deslocado sobre o, sobre a entrada 0:05:14.310,0:05:19.549 -Is a 3 or low scoring 3 that is seen then the 2 again there's three character +É um 3 ou 3 de pontuação baixa que é visto, então o 2 novamente há três caracteres 0:05:19.550,0:05:24.980 -It's three detectors that see this 2 and then nothing then the 0 and then the 6 +São três detectores que veem esse 2 e depois nada, depois o 0 e depois o 6 0:05:26.670,0:05:28.670 -Now this first +Agora este primeiro 0:05:29.580,0:05:32.419 -System looks at a fairly narrow window and +O sistema olha para uma janela bastante estreita e 0:05:35.940,0:05:40.190 -Or maybe it's a wide window no, I think it's a wide window so it looks at a pretty wide window and +Ou talvez seja uma janela larga não, eu acho que é uma janela larga então ela olha para uma janela bem larga e 0:05:41.040,0:05:42.450 -it +isto 0:05:42.450,0:05:44.450 -when it looks at the, the +quando olha para o, o 0:05:45.240,0:05:50.030 -The two, the two that's on the left for example, it actually sees a piece of the three with it, with it +Os dois, os dois que estão à esquerda por exemplo, na verdade vê um pedaço dos três com ele, com ele 0:05:50.030,0:05:55.459 -So it's kind of in the window the different sets of outputs here correspond to different size +Então é meio que na janela os diferentes conjuntos de saídas aqui correspondem a tamanhos diferentes 0:05:55.830,0:06:01.009 -Of the kernel of the last layer. So the second row the second block +Do kernel da última camada. Então a segunda linha o segundo bloco 0:06:01.890,0:06:05.689 -The the size of the kernel is four in the horizontal dimension +O tamanho do kernel é quatro na dimensão horizontal 0:06:07.590,0:06:11.869 -The next one is 3 and the next one is 2. what this allows the system to do is look at +O próximo é 3 e o próximo é 2. o que isso permite que o sistema faça é olhar para 0:06:13.380,0:06:19.010 -Regions of various width on the input without being kind of too confused by the characters that are on the side if you want +Regiões de várias larguras na entrada sem ficar meio confuso com os caracteres que estão na lateral se quiser 0:06:19.500,0:06:20.630 -so for example +então por exemplo 0:06:20.630,0:06:28.189 -the, the, the second to the zero is very high-scoring on the, on the, the +o, o, o segundo a zero é uma pontuação muito alta no, no, no 0:06:29.370,0:06:36.109 -Second third and fourth map but not very high-scoring on the top map. Similarly, the three is kind of high-scoring on the +Segundo terceiro e quarto mapa, mas não com pontuação muito alta no mapa superior. 
Da mesma forma, o três é uma espécie de pontuação alta no 0:06:37.020,0:06:38.400 -second third and fourth map +segundo terceiro e quarto mapa 0:06:38.400,0:06:41.850 -but not on the first map because the three kind of overlaps with the two and so +mas não no primeiro mapa porque os três tipos de sobreposição com os dois e assim 0:06:42.009,0:06:45.059 -It wants to really look at in our window to be able to recognize it +Ele quer realmente olhar em nossa janela para poder reconhecê-lo 0:06:45.639,0:06:47.639 -Okay. Yes +OK. sim 0:06:51.400,0:06:55.380 -So it's the size of the white square that indicates the score basically, okay +Então é o tamanho do quadrado branco que indica a pontuação basicamente, ok 0:06:57.759,0:07:02.038 -So look at you know, this this column here you have a high-scoring zero +Então olhe para você sabe, esta coluna aqui você tem uma pontuação alta zero 0:07:03.009,0:07:06.179 -Here because it's the first the first row correspond to the category zero +Aqui porque é o primeiro a primeira linha corresponde à categoria zero 0:07:06.430,0:07:10.079 -but it's not so high-scoring from the top, the top one because that +mas não é tão alta pontuação do topo, o melhor porque isso 0:07:10.539,0:07:15.419 -output unit looks at a pretty wide input and it gets confused by the stuff that's on the side +unidade de saída olha para uma entrada bastante ampla e fica confusa com as coisas que estão ao lado 0:07:16.479,0:07:17.910 -Okay, so you have something like this +Ok, então você tem algo assim 0:07:17.910,0:07:23.579 -so now you have to make sense out of it and extract the best interpretation of that, of that sequence and +então agora você tem que entender isso e extrair a melhor interpretação disso, dessa sequência e 0:07:24.760,0:07:31.349 -It's true for zip code, but it's true for just about every piece of text. Not every combination of characters is possible +É verdade para o código postal, mas é verdade para quase todos os pedaços de texto. Nem todas as combinações de caracteres são possíveis 0:07:31.599,0:07:36.149 -so when you read English text there is, you know, an English dictionary English grammar and +então, quando você lê um texto em inglês, há, você sabe, um dicionário de inglês, gramática inglesa e 0:07:36.699,0:07:40.919 -Not every combination of character is possible so you can have a language model that +Nem todas as combinações de caracteres são possíveis, então você pode ter um modelo de linguagem que 0:07:41.470,0:07:42.610 -attempts to +tentativas de 0:07:42.610,0:07:48.720 -Tell you what is the most likely sequence of characters. So we're looking at here given that this is English or whatever language +Diga-lhe qual é a sequência de caracteres mais provável. Então, estamos olhando aqui dado que isso é inglês ou qualquer outro idioma 0:07:49.510,0:07:54.929 -Or given that this is a zip code not every zip code are possible. So this --- possibility for error correction +Ou dado que este é um código postal, nem todos os códigos postais são possíveis. Então esta --- possibilidade de correção de erros 0:07:56.949,0:08:00.719 -So how do we take that into account? I'll come to this in a second but +Então, como levamos isso em consideração? 
Eu vou chegar a isso em um segundo, mas 0:08:03.460,0:08:06.930 -But here what we need to do is kind of you know +Mas aqui o que precisamos fazer é meio que você sabe 0:08:08.169,0:08:10.169 -Come up with a consistent interpretation +Crie uma interpretação consistente 0:08:10.389,0:08:15.809 -That you know, there's obviously a three there's obviously a two, a three,a zero somewhere +Que você sabe, obviamente há um três, obviamente há um dois, um três, um zero em algum lugar 0:08:16.630,0:08:19.439 -Another two etc. How to return this +Outros dois etc. Como devolver isso 0:08:20.110,0:08:22.710 -array of scores into, into a consistent +matriz de pontuações em, em um consistente 0:08:23.470,0:08:25.470 -interpretation +interpretação 0:08:28.610,0:08:31.759 -Is the width of the, the horizontal width of the, +É a largura do, a largura horizontal do, 0:08:33.180,0:08:35.180 -the kernel of the last layer +o kernel da última camada 0:08:35.400,0:08:36.750 -Okay +OK 0:08:36.750,0:08:44.090 -Which means when you backprop---, back project on the input the, the viewing window on the input that influences that particular unit +O que significa que quando você faz backprop ---, projeta de volta na entrada, a janela de visualização na entrada que influencia essa unidade específica 0:08:44.550,0:08:48.409 -has various size depending on which unit you look at. Yes +tem vários tamanhos dependendo de qual unidade você olha. sim 0:08:52.500,0:08:54.500 -The width of the block yeah +A largura do bloco sim 0:08:56.640,0:08:58.070 -It's a, it corresponds +É um, corresponde 0:08:58.070,0:08:58.890 -it's how wide the +é quão largo o 0:08:58.890,0:09:05.090 -Input image is divided by 4 because the substantive issue is 4 so you get one of one column of those for every four pixel +A imagem de entrada é dividida por 4 porque o problema substantivo é 4, então você obtém uma de uma coluna para cada quatro pixels 0:09:05.340,0:09:11.660 -so remember we had this, this way of using a neural net, convolutional net which is that you, you basically make every +então lembre-se que tínhamos isso, essa maneira de usar uma rede neural, rede convolucional que é que você basicamente faz cada 0:09:12.240,0:09:17.270 -Convolution larger and you view the last layer as a convolution as well. And now what you get is multiple +Convolução maior e você visualiza a última camada como uma convolução também. E agora o que você obtém é múltiplo 0:09:17.790,0:09:23.119 -Outputs. Okay. So what I'm representing here on the slide you just saw +Saídas. OK. Então, o que estou representando aqui no slide que você acabou de ver 0:09:23.760,0:09:30.470 -is the, is this 2d array on the output which corresponds where, where the, the row corresponds to categories +é o, é este array 2d na saída que corresponde onde, onde a, a linha corresponde às categorias 0:09:31.320,0:09:35.030 -Okay, and each column corresponds to a different location on the input +Ok, e cada coluna corresponde a um local diferente na entrada 0:09:39.180,0:09:41.750 -And I showed you those examples here so +E eu te mostrei esses exemplos aqui então 0:09:42.300,0:09:50.029 -Here, this is a different representation here where the, the character that is displayed just before the title bar is you know +Aqui, esta é uma representação diferente aqui onde o caractere que é exibido logo antes da barra de título é que você conhece 0:09:50.030,0:09:56.119 -Indicates the winning category, so I'm not displaying the scores of every category. 
I'm just, just, just displaying the winning category here +Indica a categoria vencedora, portanto não estou exibindo as pontuações de todas as categorias. Estou apenas, apenas, apenas exibindo a categoria vencedora aqui 0:09:57.180,0:09:58.260 -but each +mas cada 0:09:58.260,0:10:04.640 -Output looks at a 32 by 32 window and the next output by looks at a 32 by 32 window shifted by 4 pixels +A saída olha para uma janela de 32 por 32 e a próxima saída olha para uma janela de 32 por 32 deslocada por 4 pixels 0:10:04.650,0:10:06.650 -Ok, etc. +Tudo bem, etc 0:10:08.340,0:10:14.809 -So how do you turn this you know sequence of characters into the fact that it is either 3 5 or 5 3 +Então, como você transforma essa sequência de caracteres no fato de que é 3 5 ou 5 3 0:10:29.880,0:10:33.979 -Ok, so here the reason why we have four of those is so that is because the last player +Ok, então aqui a razão pela qual temos quatro desses é porque o último jogador 0:10:34.800,0:10:36.270 -this different +isso diferente 0:10:36.270,0:10:42.889 -Is different last layers, if you want this four different last layers each of which is trained to recognize the ten categories +São últimas camadas diferentes, se você quiser essas quatro últimas camadas diferentes, cada uma delas treinada para reconhecer as dez categorias 0:10:43.710,0:10:50.839 -And those last layers have different kernel width so they essentially look at different width of Windows on the input +E essas últimas camadas têm larguras de kernel diferentes, então elas essencialmente olham para larguras diferentes do Windows na entrada 0:10:53.670,0:10:59.510 -So you want some that look at wide windows so they can they can recognize kind of large characters and some that look at, look +Então você quer alguns que olhem para janelas largas para que possam reconhecer tipos de caracteres grandes e alguns que olhem, olhem 0:10:59.510,0:11:02.119 -At narrow windows so they can recognize narrow characters without being +Em janelas estreitas para que possam reconhecer caracteres estreitos sem serem 0:11:03.210,0:11:05.210 -perturbed by the the neighboring characters +perturbado pelos personagens vizinhos 0:11:09.150,0:11:14.329 -So if you know a priori that there are five five characters here because it's a zip code +Então, se você sabe a priori que há cinco cinco caracteres aqui porque é um código postal 0:11:16.529,0:11:18.529 -You can do you can use a trick and +Você pode fazer você pode usar um truque e 0:11:20.010,0:11:22.010 -There is sort of few specific tricks that +Existem alguns truques específicos que 0:11:23.130,0:11:27.140 -I can explain but I'm going to explain sort of the general trick if you want. I +Eu posso explicar, mas vou explicar o truque geral, se você quiser. 
eu 0:11:27.959,0:11:30.619 -Didn't want to talk about this actually at least not now +Na verdade não queria falar sobre isso, pelo menos não agora 0:11:31.709,0:11:37.729 -Okay here so here's a general trick the general trick is or the you know, kind of a somewhat specific trick +Ok, aqui está um truque geral, o truque geral é, ou você sabe, um truque um pouco específico 0:11:38.370,0:11:40.609 -Oops, I don't know way it keeps changing slide +Opa, não sei por que continua mudando de slide 0:11:43.890,0:11:50.809 -You say I have I know I have five characters in this word, is there a +Você diz: eu sei que tenho cinco caracteres nesta palavra, há um 0:11:57.990,0:12:01.760 -So that's one of those arrays that produces scores so for each category +Essa é uma daquelas matrizes que produz pontuações para cada categoria 0:12:03.060,0:12:07.279 -Let's say I have four categories here and each location +Digamos que eu tenha quatro categorias aqui e cada local 0:12:11.339,0:12:18.049 -There's a score, okay and let's say I know that I want five characters out +Há uma pontuação, ok, e digamos que eu sei que quero cinco caracteres na saída 0:12:20.250,0:12:27.469 -I'm gonna draw them vertically one two, three, four five because it's a zip code +Vou desenhá-los verticalmente, um, dois, três, quatro, cinco, porque é um código postal 0:12:29.579,0:12:34.279 -So the question I'm going to ask now is what is the best character I can put in this and +Então, a pergunta que vou fazer agora é qual é o melhor caractere que posso colocar nisso e 0:12:35.220,0:12:37.220 -In this slot in the first slot +Neste slot, no primeiro slot 0:12:38.699,0:12:43.188 -And the way I'm going to do this is I'm gonna draw an array +E o jeito que vou fazer isso é desenhar uma matriz 0:12:48.569,0:12:50.569 -And on this array +E nesta matriz 0:12:54.120,0:13:01.429 -I'm going to say what's the score here for, at every intersection in the array? +Eu vou dizer qual é a pontuação aqui, em cada interseção da matriz? 0:13:07.860,0:13:11.659 -It's gonna be, what is the, what is the score of putting +Vai ser, qual é, qual é a pontuação de colocar 0:13:12.269,0:13:17.899 -A particular character here at that location given the score that I have at the output of my neural net +Um caractere em particular aqui nesse local, dada a pontuação que tenho na saída da minha rede neural 0:13:19.560,0:13:21.560 -Okay, so let's say that +Ok, então vamos dizer que 0:13:24.480,0:13:28.159 -So what I'm gonna have to decide is since I have fewer characters +Então o que eu vou ter que decidir, já que tenho menos caracteres 0:13:29.550,0:13:32.539 -On the on the output to the system five +Na saída do sistema, cinco 0:13:33.329,0:13:39.919 -Then I have viewing windows and scores produced by the by the system. I'm gonna have to figure out which one I drop +do que janelas de visualização e pontuações produzidas pelo sistema.
Eu vou ter que descobrir qual eu derrubo 0:13:40.949,0:13:42.949 -okay, and +tudo bem e 0:13:43.860,0:13:47.689 -What I can do is build this, build this array +O que posso fazer é construir isso, construir este array 0:13:55.530,0:13:57.530 -And +E 0:14:01.220,0:14:09.010 -What I need to do is go from here to here by finding a path through this through this array +O que eu preciso fazer é ir daqui até aqui encontrando um caminho através deste array 0:14:15.740,0:14:17.859 -In such a way that I have exactly five +De tal forma que eu tenho exatamente cinco 0:14:20.420,0:14:24.640 -Steps if you want, so each step corresponds to to a character and +Passos se você quiser, então cada passo corresponde a um personagem e 0:14:25.790,0:14:31.630 -the overall score of a particular string is the overall is the sum of all the scores that +a pontuação geral de uma determinada sequência é a geral é a soma de todas as pontuações que 0:14:33.050,0:14:37.060 -Are along this path in other words if I get +Estão nesse caminho, em outras palavras, se eu conseguir 0:14:39.560,0:14:41.560 -Three +Três 0:14:41.930,0:14:47.890 -Instances here, three locations where I have a high score for this particular category, which is category one. Okay let's call it 0 +Instâncias aqui, três locais onde tenho uma pontuação alta para essa categoria específica, que é a categoria um. Ok, vamos chamá-lo de 0 0:14:48.440,0:14:50.440 -So 1 2 3 +Então 1 2 3 0:14:51.140,0:14:54.129 -I'm gonna say this is the same guy and it's a 1 +Eu vou dizer que este é o mesmo cara e é um 1 0:14:55.460,0:14:57.460 -and here if I have +e aqui se eu tiver 0:14:58.160,0:15:03.160 -Two guys. I have high score for 3, I'm gonna say those are the 3 and here +Dois rapazes. Eu tenho pontuação alta para 3, vou dizer que esses são os 3 e aqui 0:15:03.160,0:15:08.800 -I have only one guy that has high score for 2. So that's a 2 etc +Eu tenho apenas um cara que tem pontuação alta para 2. Então isso é um 2 etc. 0:15:11.930,0:15:13.370 -So +assim 0:15:13.370,0:15:15.880 -This path here has to be sort of continuous +Esse caminho aqui tem que ser meio contínuo 0:15:16.580,0:15:23.080 -I can't jump from one position to another because that would be kind of breaking the order of the characters. Okay? +Eu não posso pular de uma posição para outra porque isso seria meio que quebrar a ordem dos personagens. OK? 0:15:24.650,0:15:31.809 -And I need to find a path that goes through high-scoring cells if you want that correspond to +E preciso encontrar um caminho que passe pelas células de alta pontuação, se você quiser que corresponda a 0:15:33.500,0:15:36.489 -High scoring categories along this path and it's a way of +Categorias de alta pontuação ao longo deste caminho e é uma forma de 0:15:37.190,0:15:39.190 -saying you know if I have +dizendo que você sabe se eu tenho 0:15:39.950,0:15:43.150 -if those three cells here or +se essas três células aqui ou 0:15:44.000,0:15:47.530 -Give me the same character. It's only one character. I'm just going to output +Dê-me o mesmo personagem. É apenas um personagem. só vou dar saída 0:15:48.440,0:15:50.799 -One here that corresponds to this +Um aqui que corresponde a este 0:15:51.380,0:15:57.189 -Ok, those three guys have high score. I stay on the one, on the one and then I transition +Ok, esses três caras têm pontuação alta. Eu fico no um, no um e então faço a transição 0:15:57.770,0:16:02.379 -To the second character. So now I'm going to fill out this slot and this guy has high score for three +Para o segundo personagem. 
Então agora eu vou preencher essa vaga e esse cara tem pontuação alta por três 0:16:02.750,0:16:06.880 -So I'm going to put three here and this guy has a high score for two +Então eu vou colocar três aqui e esse cara tem uma pontuação alta para dois 0:16:07.400,0:16:08.930 -as two +como dois 0:16:08.930,0:16:10.930 -Etc +etc. 0:16:14.370,0:16:19.669 -The principle to find this this path is a shortest path algorithm +O princípio para encontrar este caminho é um algoritmo de caminho mais curto 0:16:19.670,0:16:25.190 -You can think of this as a graph where I can go from the lower left cell to the upper right cell +Você pode pensar nisso como um gráfico onde eu posso ir da célula inferior esquerda para a célula superior direita 0:16:25.560,0:16:27.560 -By either going to the left +Ou indo para a esquerda 0:16:28.410,0:16:32.269 -or going up and to the left and +ou subindo e para a esquerda e 0:16:35.220,0:16:38.660 -For each of those transitions there is a there's a cost and for each of the +Para cada uma dessas transições há um custo e para cada uma das 0:16:39.060,0:16:45.169 -For putting a character at that location, there is also a cost or a score if you want +Para colocar um personagem nesse local, também há um custo ou uma pontuação, se você quiser 0:16:47.460,0:16:49.460 -So the overall +Então o geral 0:16:50.700,0:16:57.049 -Score of the one at the bottom would be the combined score of the three locations that detect that one and +A pontuação do que está na parte inferior seria a pontuação combinada dos três locais que detectam aquele e 0:16:59.130,0:17:01.340 -Because it's more all three of them are +Porque é mais que todos os três são 0:17:02.730,0:17:04.730 -contributing evidence to the fact that there is a 1 +contribuindo com evidências para o fato de que existe um 1 0:17:06.720,0:17:08.959 -When you constrain the path to have 5 steps +Quando você restringe o caminho para ter 5 passos 0:17:10.530,0:17:14.930 -Ok, it has to go from the bottom left to the top right and +Ok, tem que ir do canto inferior esquerdo para o canto superior direito e 0:17:15.930,0:17:18.169 -It has 5 steps, so it has to go through 5 steps +Tem 5 passos, então tem que passar por 5 passos 0:17:18.750,0:17:24.290 -There's no choice. That's that's how you force the system to kind of give you 5 characters basically, right? +Não há escolha. É assim que você força o sistema a fornecer basicamente 5 caracteres, certo? 0:17:24.810,0:17:28.909 -And because the path can only go from left to right and from top to bottom +E porque o caminho só pode ir da esquerda para a direita e de cima para baixo 0:17:30.330,0:17:33.680 -It has to give you the characters in the order in which they appear in the image +Ele deve fornecer os caracteres na ordem em que aparecem na imagem 0:17:34.350,0:17:41.240 -So it's a way of imposing the order of the character and imposing that there are fives, there are five characters in the string. Yes +Então é uma forma de impor a ordem do caractere e impor que são cincos, são cinco caracteres na string. sim 0:17:42.840,0:17:48.170 -Yes, okay in the back, yes, right. Yes +Sim, tudo bem na parte de trás, sim, certo. 
sim 0:17:52.050,0:17:55.129 -Well, so if we have just the string of one you have to have +Bem, então se tivermos apenas a seqüência de um você tem que ter 0:17:55.680,0:18:02.539 -Trained the system in advance so that when it's in between two ones or two characters, whatever they are, it says nothing +Treinou o sistema com antecedência para que, quando estiver entre dois ou dois personagens, sejam eles quais forem, não diga nada 0:18:02.540,0:18:04.540 -it says none of the above +não diz nenhuma das opções acima 0:18:04.740,0:18:06.740 -Otherwise you can tell, right +Caso contrário, você pode dizer, certo 0:18:07.140,0:18:11.359 -Yeah, a system like this needs to be able to tell you this is none of the above. It's not a character +Sim, um sistema como esse precisa ser capaz de dizer que isso não é nenhuma das opções acima. Não é um personagem 0:18:11.360,0:18:16.160 -It's a piece of it or I'm in the middle of two characters or I have two characters on the side +É um pedaço disso ou estou no meio de dois personagens ou tenho dois personagens ao lado 0:18:16.160,0:18:17.550 -But nothing in the middle +Mas nada no meio 0:18:17.550,0:18:19.550 -Yeah, absolutely +Sim, absolutamente 0:18:24.300,0:18:26.300 -It's a form of non maximum suppression +É uma forma de supressão não máxima 0:18:26.300,0:18:31.099 -so you can think of this as kind of a smart form of non maximum suppression where you say like for every location you can only +então você pode pensar nisso como uma forma inteligente de supressão não máxima, onde você diz como para cada local que você só pode 0:18:31.100,0:18:31.950 -have one +tem um 0:18:31.950,0:18:33.950 -character +personagem 0:18:33.990,0:18:40.370 -And the order in which you produce the five characters must correspond to the order in which they appear on the image +E a ordem em que você produz os cinco caracteres deve corresponder à ordem em que eles aparecem na imagem 0:18:41.640,0:18:47.420 -What you don't know is how to warp one into the other. Okay. So how to kind of you know, how many +O que você não sabe é como deformar um no outro. OK. Então, como você sabe, quantos 0:18:48.210,0:18:53.780 -detectors are gonna see the number two. It may be three of them and we're gonna decide they're all the same +os detectores vão ver o número dois. Pode ser três deles e vamos decidir que são todos iguais 0:19:00.059,0:19:02.748 -So the thing is for all of you who +Então a coisa é para todos vocês que 0:19:03.629,0:19:06.469 -are on computer science, which is not everyone +estão em ciência da computação, que não é todo mundo 0:19:07.590,0:19:12.379 -The the way you compute this path is just a shortest path algorithm. You do this with dynamic programming +A maneira como você calcula esse caminho é apenas um algoritmo de caminho mais curto. 
Você faz isso com programação dinâmica 0:19:13.499,0:19:15.090 -Okay +OK 0:19:15.090,0:19:21.350 -so find the shortest path to go from bottom left to top right by going through by only going to +então encontre o caminho mais curto para ir do canto inferior esquerdo ao canto superior direito, passando por apenas indo para 0:19:22.080,0:19:25.610 -only taking transition to the right or diagonally and +apenas fazendo transição para a direita ou diagonalmente e 0:19:26.369,0:19:28.369 -by minimizing the +minimizando o 0:19:28.830,0:19:31.069 -cost so if you think each of those +custo, então se você acha que cada um desses 0:19:31.710,0:19:38.659 -Is is filled by a cost or maximizing the score if you think that scores there are probabilities, for example +É preenchido por um custo ou maximizando a pontuação se você acha que as pontuações existem probabilidades, por exemplo 0:19:38.789,0:19:41.479 -And it's just a shortest path algorithm in a graph +E é apenas um algoritmo de caminho mais curto em um gráfico 0:19:54.840,0:19:56.840 -This kind of method by the way was +Esse tipo de método, aliás, foi 0:19:57.090,0:20:04.730 -So many early methods of speech recognition kind of work this way, not with neural nets though. We sort of hand extracted features from +Muitos métodos iniciais de reconhecimento de fala funcionam dessa maneira, mas não com redes neurais. Nós meio que extraímos recursos de 0:20:05.909,0:20:13.189 -but it would basically match the sequence of vectors extracted from a speech signal to a template of a word and then you +mas basicamente combinaria a sequência de vetores extraída de um sinal de fala para um modelo de uma palavra e então você 0:20:13.409,0:20:17.809 -know try to see how you warp the time to match the the +sei tentar ver como você distorce o tempo para combinar com o 0:20:19.259,0:20:24.559 -The word to be recognized to to the templates and you had a template for every word over fixed size +A palavra a ser reconhecida para os modelos e você tinha um modelo para cada palavra em tamanho fixo 0:20:25.679,0:20:32.569 -This was called DTW, dynamic time working. There's more sophisticated version of it called hidden markov models, but it's very similar +Isso foi chamado de DTW, trabalho de tempo dinâmico. 
Existe uma versão mais sofisticada chamada de modelos ocultos de Markov, mas é muito semelhante 0:20:33.600,0:20:35.600 -People still do this to some extent +As pessoas ainda fazem isso até certo ponto 0:20:43.000,0:20:44.940 -Okay +OK 0:20:44.940,0:20:49.880 -So detection, so if you want to apply commercial net for detection +Então detecção, então se você quiser aplicar uma rede convolucional para detecção 0:20:50.820,0:20:55.380 -it works amazingly well, and it's surprisingly simple, but you +funciona incrivelmente bem e é surpreendentemente simples, mas você 0:20:56.020,0:20:57.210 -You know what you need to do +Você sabe o que você precisa fazer 0:20:57.210,0:20:59.210 -You basically need to let's say you wanna do face detection +Você basicamente precisa, digamos que você queira fazer detecção de rosto 0:20:59.440,0:21:05.130 -Which is a very easy problem one of the first problems that computer vision started solving really well for kind of recognition +Que é um problema muito fácil, um dos primeiros problemas de reconhecimento que a visão computacional começou a resolver muito bem 0:21:05.500,0:21:07.500 -you collect a data set of +você coleta um conjunto de dados de 0:21:08.260,0:21:11.249 -images with faces and images without faces and +imagens com rostos e imagens sem rostos e 0:21:12.160,0:21:13.900 -you train a +você treina uma 0:21:13.900,0:21:19.379 -convolutional net with input window in something like 20 by 20 or 30 by 30 pixels? +rede convolucional com janela de entrada de algo como 20 por 20 ou 30 por 30 pixels 0:21:19.870,0:21:21.959 -To tell you whether there is a face in it or not +Para dizer se há um rosto nela ou não 0:21:22.570,0:21:28.620 -Okay. Now you take this convolutional net, you apply it on an image and if there is a face that happens to be roughly +OK. Agora você pega essa rede convolucional, aplica em uma imagem e se houver um rosto que seja aproximadamente 0:21:29.230,0:21:31.230 -30 by 30 pixels the +30 por 30 pixels 0:21:31.809,0:21:35.699 -the content will will light up at the corresponding output and +a rede acenderá na saída correspondente e 0:21:36.460,0:21:38.460 -Not light up when there is no face +Não acende quando não há rosto 0:21:39.130,0:21:41.999 -now there is two problems with this, the first problem is +agora há dois problemas com isso, o primeiro problema é 0:21:42.940,0:21:47.370 -there is many many ways a patch of an image can be a non face and +há muitas maneiras pelas quais um pedaço de uma imagem pode ser um não rosto e 0:21:48.130,0:21:53.489 -During your training, you probably haven't seen all of them. You haven't seen even a representative set of them +Durante seu treinamento, você provavelmente não viu todos eles. Você não viu nem mesmo um conjunto representativo deles 0:21:53.950,0:21:56.250 -So your system is gonna have lots of false positives +Então seu sistema vai ter muitos falsos positivos 0:21:58.390,0:22:04.709 -That's the first problem. Second problem is in the picture not all faces are 30 by 30 pixels. So how do you handle +Esse é o primeiro problema. O segundo problema é que na foto nem todos os rostos têm 30 por 30 pixels.
Então, como você lida 0:22:05.380,0:22:10.229 -Size variation so one way to handle size variation, which is very simple +Variação de tamanho, uma maneira de lidar com a variação de tamanho, que é muito simples 0:22:10.230,0:22:14.010 -but it's mostly unnecessary in modern versions, well +mas é principalmente desnecessário em versões modernas, bem 0:22:14.860,0:22:16.860 - at least it's not completely necessary +pelo menos não é totalmente necessário 0:22:16.929,0:22:22.499 -Is you do a multiscale approach. So you take your image you run your detector on it. It fires whenever it wants +Você faz uma abordagem multiescala. Então você pega sua imagem e roda seu detector nela. Ele dispara quando quer 0:22:23.440,0:22:27.299 -And you will detect faces are small then you reduce the image by +E você detectará que os rostos são pequenos e reduzirá a imagem por 0:22:27.850,0:22:30.179 -Some scale in this case, in this case here +Alguma escala neste caso, neste caso aqui 0:22:30.179,0:22:31.419 -I take a square root of two +eu tiro uma raiz quadrada de dois 0:22:31.419,0:22:36.599 -You apply the convolutional net again on that smaller image and now it's going to be able to detect faces that are +Você aplica a rede convolucional novamente nessa imagem menor e agora ela será capaz de detectar faces que são 0:22:38.350,0:22:45.750 -That were larger in the original image because now what was 30 by 30 pixel is now about 20 by 20 pixels, roughly +Isso era maior na imagem original porque agora o que era 30 por 30 pixels agora é cerca de 20 por 20 pixels, aproximadamente 0:22:47.169,0:22:48.850 -Okay +OK 0:22:48.850,0:22:53.309 -But there may be bigger faces there. So you scale the image again by a factor of square root of 2 +Mas pode haver rostos maiores lá. Então você dimensiona a imagem novamente por um fator de raiz quadrada de 2 0:22:53.309,0:22:57.769 -So now the images the size of the original one and you run the convolutional net again +Então agora as imagens do tamanho da original e você executa a rede convolucional novamente 0:22:57.770,0:23:01.070 -And now it's going to detect faces that were 60 by 60 pixels +E agora vai detectar rostos que tinham 60 por 60 pixels 0:23:02.190,0:23:06.109 -In the original image, but are now 30 by 30 because you reduce the size by half +Na imagem original, mas agora são 30 por 30 porque você reduz o tamanho pela metade 0:23:07.800,0:23:10.369 -You might think that this is expensive but it's not. Tthe +Você pode pensar que isso é caro, mas não é. 
O 0:23:11.220,0:23:15.439 -expense is, half of the expense is the final scale +despesa é, metade da despesa é a escala final 0:23:16.080,0:23:18.379 -the sum of the expense of the other networks are +a soma das despesas das outras redes são 0:23:19.590,0:23:21.859 -Combined is about the same as the final scale +Combinado é aproximadamente o mesmo que a escala final 0:23:26.070,0:23:29.720 -It's because the size of the network is you know +É porque o tamanho da rede é que você sabe 0:23:29.720,0:23:33.019 -Kind of the square of the the size of the image on one side +Tipo do quadrado do tamanho da imagem de um lado 0:23:33.020,0:23:38.570 -And so you scale down the image by square root of 2 the network you have to run is smaller by a factor of 2 +E assim você reduz a imagem pela raiz quadrada de 2, a rede que você precisa executar é menor por um fator de 2 0:23:40.140,0:23:45.619 -Okay, so the overall cost of this is 1 plus 1/2 plus 1/4 plus 1/8 plus 1/16 etc +Ok, então o custo total disso é 1 mais 1/2 mais 1/4 mais 1/8 mais 1/16 etc 0:23:45.990,0:23:51.290 -Which is 2 you waste a factor of 2 by doing multi scale, which is very small. Ok +Que é 2 você desperdiça um fator de 2 fazendo multi-escala, o que é muito pequeno. OK 0:23:51.290,0:23:53.290 -you can afford a factor of 2 so +você pode pagar um fator de 2, então 0:23:54.570,0:23:59.600 -This is a completely ancient face detection system from the early 90s and +Este é um sistema de detecção de rosto completamente antigo do início dos anos 90 e 0:24:00.480,0:24:02.600 -the maps that you see here are all kind of +os mapas que você vê aqui são todos do tipo 0:24:03.540,0:24:05.540 -maps that indicate kind of +mapas que indicam o tipo de 0:24:06.120,0:24:13.160 -Scores of face detectors, the face detector here I think is 20 by 20 pixels. So it's very low res and +Dezenas de detectores de rosto, o detector de rosto aqui eu acho que é de 20 por 20 pixels. Então é muito baixa resolução e 0:24:13.890,0:24:19.070 -It's a big mess at the fine scales. You see kind of high-scoring areas, but it's not really very definite +É uma grande confusão nas escalas finas. Você vê áreas de alta pontuação, mas não é realmente muito definido 0:24:19.710,0:24:21.710 -But you see more +Mas você vê mais 0:24:22.530,0:24:24.150 -More definite +Mais definido 0:24:24.150,0:24:26.720 -Things down here. So here you see +Coisas aqui embaixo. Então aqui você vê 0:24:27.780,0:24:33.290 -A white blob here white blob here white blob here same here. You see white blob here, White blob here and +Uma mancha branca aqui, uma mancha branca aqui, uma mancha branca aqui mesmo aqui. 
Você vê bolha branca aqui, bolha branca aqui e 0:24:34.020,0:24:35.670 -Those are faces +Esses são rostos 0:24:35.670,0:24:41.060 -and so that's now how you, you need to do maximum suppression to get those +e então é assim que você precisa fazer a supressão não máxima para obter esses 0:24:41.580,0:24:46.489 -little red squares that are kind of the winning categories if you want the winning locations where you have a face +pequenos quadrados vermelhos que são as categorias vencedoras, se você quiser, os locais vencedores onde você tem um rosto 0:24:50.940,0:24:52.470 -So +assim 0:24:52.470,0:24:57.559 -Known as sumo suppression in this case means I have a high-scoring white white blob here +Conhecida como supressão não máxima, neste caso significa que tenho uma bolha branca de alta pontuação aqui 0:24:57.560,0:25:01.340 -That means there is probably the face underneath which is roughly 20 by 20 +Isso significa que provavelmente há um rosto embaixo, que tem aproximadamente 20 por 20 0:25:01.370,0:25:06.180 -It is another face in a window of 20 by 20. That means one of those two is wrong +Se houver outro rosto em uma janela de 20 por 20, isso significa que um desses dois está errado 0:25:06.250,0:25:10.260 -so I'm just gonna take the highest-scoring one within the window of 20 by 20 and +então eu vou pegar o de maior pontuação dentro da janela de 20 por 20 e 0:25:10.600,0:25:15.239 -Suppress all the others and you'll suppress the others at that location at that scale +Suprimir todos os outros, e você suprimirá os outros naquele local, nessa escala 0:25:15.240,0:25:22.410 -I mean that nearby location at that scale but also at other scales. Okay, so you you pick the highest-scoring +Quero dizer, nesse local próximo, nessa escala, mas também em outras escalas. Ok, então você escolhe a pontuação mais alta 0:25:23.680,0:25:25.680 -blob if you want +bolha, se você quiser 0:25:26.560,0:25:28.560 -For every location every scale +Para cada local, cada escala 0:25:28.720,0:25:34.439 -And whenever you pick one you you suppress the other ones that could be conflicting with it either +E sempre que você escolhe um, você suprime os outros que podem estar em conflito com ele.
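Para ilustrar a ideia de supressão não máxima descrita logo acima, segue um esboço mínimo em Python (não faz parte da aula; os nomes `iou`, `nms`, `deteccoes` e o limiar são hipotéticos, apenas para ilustração):

```python
# Esboço mínimo de supressão não máxima (NMS), assumindo que cada detecção
# é uma tupla (pontuacao, (x1, y1, x2, y2)); nomes e limiar são hipotéticos.

def iou(a, b):
    """Interseção sobre união de duas caixas (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter + 1e-9)

def nms(deteccoes, limiar=0.5):
    """Mantém a detecção de maior pontuação e suprime as vizinhas conflitantes."""
    restantes = sorted(deteccoes, key=lambda d: d[0], reverse=True)
    mantidas = []
    while restantes:
        melhor = restantes.pop(0)          # maior pontuação ainda não suprimida
        mantidas.append(melhor)
        restantes = [d for d in restantes if iou(melhor[1], d[1]) < limiar]
    return mantidas
```

Em um detector multiescala, cada detecção viria de uma posição e de uma escala diferentes, mas a ideia é a mesma: fica a de maior pontuação e as vizinhas conflitantes (no mesmo lugar em outra escala, ou na mesma escala em um lugar próximo) são suprimidas.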
0:25:34.780,0:25:37.259 -because they are a different scale at the same place or +porque eles são uma escala diferente no mesmo lugar ou 0:25:37.960,0:25:39.960 -At the same scale, but you know nearby +Na mesma escala, mas você sabe nas proximidades 0:25:44.350,0:25:46.350 -Okay, so that's the +Ok, então esse é o 0:25:46.660,0:25:53.670 -that's the first problem and the second problem is the fact that as I said, there's many ways to be different from your face and +esse é o primeiro problema e o segundo problema é o fato de que, como eu disse, há muitas maneiras de ser diferente do seu rosto e 0:25:54.730,0:25:59.820 -Most likely your training set doesn't have all the non-faces, things that look like faces +Muito provavelmente seu conjunto de treinamento não tem todos os não rostos, coisas que parecem rostos 0:26:00.790,0:26:05.249 -So the way people deal with this is that they do what's called negative mining +Então, a maneira como as pessoas lidam com isso é que eles fazem o que é chamado de mineração negativa 0:26:05.950,0:26:07.390 -so +assim 0:26:07.390,0:26:09.390 -You go through a large collection of images +Você passa por uma grande coleção de imagens 0:26:09.460,0:26:14.850 -when you know for a fact that there is no face and you run your detector and you keep all the +quando você sabe de fato que não há rosto e você executa seu detector e mantém todos os 0:26:16.720,0:26:19.139 -Patches where you detector fires +Patches onde você detecta disparos 0:26:21.190,0:26:26.580 -You verify that there is no faces in them and if there is no face you add them to your negative set +Você verifica se não há rostos neles e, se não houver rosto, você os adiciona ao seu conjunto negativo 0:26:27.610,0:26:31.830 -Okay, then you retrain your detector. And then you use your retrained detector to do the same +Ok, então você retreinar seu detector. 
E então você usa seu detector treinado para fazer o mesmo 0:26:31.990,0:26:35.580 -Go again through a large dataset of images where there you know +Vá novamente por um grande conjunto de dados de imagens onde você conhece 0:26:35.580,0:26:40.710 -There is no face and whenever your detector fires add that as a negative sample +Não há rosto e sempre que seu detector disparar adicione isso como uma amostra negativa 0:26:41.410,0:26:43.410 -you do this four or five times and +você faz isso quatro ou cinco vezes e 0:26:43.840,0:26:50.129 -In the end you have a very robust face detector that does not fall victim to negative samples +No final você tem um detector de rosto muito robusto que não é vítima de amostras negativas 0:26:53.080,0:26:56.669 -These are all things that look like faces in natural images are not faces +Estas são todas as coisas que parecem rostos em imagens naturais não são rostos 0:27:03.049,0:27:05.049 -This works really well +Isso funciona muito bem 0:27:10.380,0:27:17.209 -This is over 15 years old work this is my grandparents marriage, their wedding +Este é um trabalho de mais de 15 anos, este é o casamento dos meus avós, o casamento deles 0:27:18.480,0:27:20.480 -their wedding +o casamento deles 0:27:22.410,0:27:24.410 -Okay +OK 0:27:24.500,0:27:29.569 -So here's a another interesting use of convolutional nets and this is for +Então aqui está outro uso interessante de redes convolucionais e isso é para 0:27:30.299,0:27:34.908 -Semantic segmentation what's called semantic segmentation, I alluded to this in the first the first lecture +Segmentação semântica o que é chamado de segmentação semântica, eu fiz alusão a isso na primeira aula 0:27:36.390,0:27:44.239 -so what is semantic segmentation is the problem of assigning a category to every pixel in an image and +então o que é segmentação semântica é o problema de atribuir uma categoria a cada pixel em uma imagem e 0:27:46.020,0:27:49.280 -Every pixel will be labeled with a category of the object it belongs to +Cada pixel será rotulado com uma categoria do objeto ao qual pertence 0:27:50.250,0:27:55.429 -So imagine this would be very useful if you want to say drive a robot in nature. So this is a +Então imagine que isso seria muito útil se você quiser dizer dirigir um robô na natureza. Então isso é um 0:27:56.039,0:28:00.769 -Robotics project that I worked on, my students and I worked on a long time ago +Projeto de robótica em que trabalhei, meus alunos e eu trabalhamos há muito tempo 0:28:01.770,0:28:07.520 -And what you like is to label the image so that regions that the robot can drive on +E o que você gosta é de rotular a imagem para que as regiões em que o robô possa dirigir 0:28:08.820,0:28:10.820 -are indicated and +são indicados e 0:28:10.860,0:28:15.199 -Areas that are obstacles also indicated so the robot doesn't drive there. Okay +As áreas que são obstáculos também são indicadas para que o robô não dirija até lá. 
OK 0:28:15.200,0:28:22.939 -So here the green areas are things that the robot can drive on and the red areas are obstacles like tall grass in that case +Então aqui as áreas verdes são coisas sobre as quais o robô pode dirigir e as áreas vermelhas são obstáculos como grama alta nesse caso 0:28:28.049,0:28:34.729 -So the way you you train a convolutional net to do to do this kind of semantic segmentation is very similar to what I just +Então, a maneira como você treina uma rede convolucional para fazer esse tipo de segmentação semântica é muito semelhante ao que acabei de 0:28:35.520,0:28:38.659 -Described you you take a patch from the image +Descrito você você pega um patch da imagem 0:28:39.360,0:28:41.360 -In this case. I think the patches were +Nesse caso. acho que os remendos foram 0:28:42.419,0:28:44.719 -20 by 40 or something like that, they are actually small +20 por 40 ou algo assim, eles são realmente pequenos 0:28:46.080,0:28:51.860 -For which, you know what the central pixel is whether it's traversable or not, whether it's green or red? +Para qual, você sabe o que é o pixel central, se é percorrível ou não, se é verde ou vermelho? 0:28:52.470,0:28:56.390 -okay, either is being manually labeled or the label has been obtained in some way and +ok, ou está sendo rotulado manualmente ou o rótulo foi obtido de alguma forma e 0:28:57.570,0:29:00.110 -You run a conv net on this patch and you train it, you know +Você executa uma rede de conversão neste patch e treina, você sabe 0:29:00.110,0:29:02.479 -tell me if it's if he's green or red tell me if it's +me diga se é se ele é verde ou vermelho me diga se é 0:29:03.000,0:29:05.000 -Drivable area or not +Área transitável ou não 0:29:05.970,0:29:09.439 -And once the system is trained you apply it on the entire image and it you know +E uma vez que o sistema é treinado você aplica na imagem inteira e você sabe 0:29:09.440,0:29:14.540 -It puts green or red depending on where it is. in this particular case actually, there were five categories +Coloca verde ou vermelho dependendo de onde está. neste caso em particular, na verdade, havia cinco categorias 0:29:14.830,0:29:18.990 -There's the super green green purple, which is a foot of an object +Há o super verde verde roxo, que é um pé de um objeto 0:29:19.809,0:29:24.269 -Red, which is an obstacle that you know threw off and super red, which is like a definite obstacle +Vermelho, que é um obstáculo que você sabe que jogou fora e super vermelho, que é como um obstáculo definitivo 0:29:25.600,0:29:30.179 -Over here. We're only showing three three colors now in this particular +Por aqui. 
Estamos mostrando apenas três três cores agora neste particular 0:29:31.809,0:29:37.319 -Project the the labels were actually collected automatically you didn't have to manually +Projete se os rótulos foram realmente coletados automaticamente, você não precisou manualmente 0:29:39.160,0:29:44.160 -Label the images and the patches what we do would be to run the robot around and then +Rotule as imagens e os patches, o que fazemos seria rodar o robô e depois 0:29:44.890,0:29:49.379 -through stereo vision figure out if a pixel is a +através da visão estéreo descobrir se um pixel é um 0:29:51.130,0:29:53.669 -Correspond to an object that sticks out of the ground or is on the ground +Corresponde a um objeto que sai do chão ou está no chão 0:29:55.540,0:29:59.309 -So the the middle column here it says stereo labels these are +A coluna do meio aqui diz rótulos estéreo que são 0:30:00.309,0:30:05.789 -Labels, so the color green or red is computed from stereo vision from basically 3d reconstruction +Rótulos, para que a cor verde ou vermelha seja calculada a partir da visão estéreo da reconstrução basicamente 3D 0:30:06.549,0:30:08.639 -okay, so for, you have two cameras and +ok, então, você tem duas câmeras e 0:30:09.309,0:30:15.659 -The two cameras can estimate the distance of every pixel by basically comparing patches. It's relatively expensive, but it kind of works +As duas câmeras podem estimar a distância de cada pixel basicamente comparando patches. É relativamente caro, mas funciona 0:30:15.730,0:30:17.819 -It's not completely reliable, but it sort of works +Não é totalmente confiável, mas funciona 0:30:18.820,0:30:21.689 -So now for every pixel you have a depth the distance from the camera +Então agora para cada pixel você tem uma profundidade a distância da câmera 0:30:22.360,0:30:25.890 -Which means you know the position of that pixel in 3d which means you know +O que significa que você sabe a posição desse pixel em 3d, o que significa que você sabe 0:30:25.890,0:30:30.030 -If it sticks out out of the ground or if it's on the ground because you can fit a plane to the ground +Se sair do chão ou se estiver no chão, porque você pode encaixar um avião no chão 0:30:30.880,0:30:33.900 -okay, so the green pixels are the ones that are basically +ok, então os pixels verdes são os que são basicamente 0:30:34.450,0:30:37.980 -You know near the ground and the red ones are the ones that are up +Você sabe perto do chão e os vermelhos são os que estão em cima 0:30:39.280,0:30:42.479 -so now you have labels you can try and accomplish on that to +então agora você tem rótulos que você pode tentar e realizar para 0:30:43.330,0:30:44.919 -predict those labels +prever esses rótulos 0:30:44.919,0:30:49.529 -Then you will tell me why would you want to train a convolutional net on that to do this if you can do this from stereo? +Então você vai me dizer por que você quer treinar uma rede convolucional para fazer isso se você pode fazer isso em estéreo? 
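Um esboço, apenas ilustrativo, de como os rótulos "transitável / obstáculo" poderiam ser obtidos automaticamente a partir da visão estéreo, como descrito acima: ajusta-se um plano ao chão e cada ponto 3D é rotulado pela altura em relação a esse plano. A função `rotular_por_plano` e o limiar são hipotéticos; na prática um ajuste robusto (por exemplo, RANSAC) seria preferível ao ajuste por mínimos quadrados simples:

```python
import numpy as np

# Esboço: pontos_3d é um array (N, 3) vindo da reconstrução estéreo.
def rotular_por_plano(pontos_3d, limiar_altura=0.2):
    """Ajusta um plano z = a*x + b*y + c por mínimos quadrados e rotula cada ponto:
    0 (transitável) se estiver perto do plano do chão, 1 (obstáculo) se estiver acima."""
    X = np.c_[pontos_3d[:, 0], pontos_3d[:, 1], np.ones(len(pontos_3d))]
    z = pontos_3d[:, 2]
    coef, *_ = np.linalg.lstsq(X, z, rcond=None)   # (a, b, c) do plano do chão
    altura = z - X @ coef                          # altura de cada ponto acima do plano
    return (altura > limiar_altura).astype(np.int64)
```

Esses rótulos, válidos apenas até a distância em que a estereoscopia funciona, servem então de alvo para treinar a rede convolucional que estende a previsão até o horizonte.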
0:30:50.260,0:30:53.760 -And the answer is stereo only works up to ten meters, roughly +E a resposta é estéreo só funciona até dez metros, aproximadamente 0:30:54.669,0:30:59.789 -Past ten meters you can't really using binocular vision and stereo vision, you can't really estimate the distance very well +Depois de dez metros você não pode realmente usar visão binocular e visão estéreo, você não pode realmente estimar a distância muito bem 0:30:59.790,0:31:04.799 -And so that only works out to about ten meters and driving a robot by only looking +E isso só funciona até cerca de dez metros e dirigir um robô apenas olhando 0:31:05.200,0:31:07.770 -ten meters ahead of you is not a good idea +dez metros à sua frente não é uma boa ideia 0:31:08.950,0:31:13.230 -It's like driving a car in the fog right? It's gonna it's not very efficient +É como dirigir um carro no meio do nevoeiro, certo? Vai não é muito eficiente 0:31:14.380,0:31:21.089 -So what you used to accomplished on that for is to label every pixel in the image up to the horizon +Então, o que você costumava fazer era rotular cada pixel na imagem até o horizonte 0:31:21.790,0:31:23.790 -essentially +essencialmente 0:31:24.130,0:31:30.239 -Okay, so the cool thing about about this system is that as I said the labels were collected automatically but also +Ok, então o legal desse sistema é que, como eu disse, os rótulos eram coletados automaticamente, mas também 0:31:32.080,0:31:33.730 -The robot +O robô 0:31:33.730,0:31:38.849 -Adapted itself as it run because he collects stereo labels constantly +Adaptou-se à medida que corre porque coleciona rótulos estéreo constantemente 0:31:39.340,0:31:43.350 -It can constantly retrain its neural net to adapt to the environment +Ele pode treinar constantemente sua rede neural para se adaptar ao ambiente 0:31:43.360,0:31:49.199 -it's in. In this particular instance of this robot, it would only will only retrain the last layer +está dentro. Nesta instância específica deste robô, ele apenas treinará novamente a última camada 0:31:49.540,0:31:53.879 -So the N minus 1 layers of the ConvNet were fixed, were trained in the in the lab +Então as camadas N menos 1 do ConvNet foram corrigidas, foram treinadas no laboratório 0:31:53.880,0:32:01.499 -And then the last layer was kind of adapted as the robot run, it allowed the robot to deal with environments +E então a última camada foi meio que adaptada à medida que o robô funcionava, permitindo que o robô lidasse com ambientes 0:32:01.500,0:32:02.680 -He'd never seen before +Ele nunca tinha visto antes 0:32:02.680,0:32:04.120 -essentially +essencialmente 0:32:04.120,0:32:06.120 -You still have long-range vision? +Você ainda tem visão de longo alcance? 0:32:10.000,0:32:17.520 -The input to the the conv network basically multiscale views of sort of bands of the image around the horizon +A entrada para a rede conv basicamente visualizações multiescala de tipos de bandas da imagem ao redor do horizonte 0:32:18.700,0:32:20.700 -no need to go into details +não precisa entrar em detalhes 0:32:21.940,0:32:25.710 -Is a very small neural net by today's standard but that's what we could afford I +É uma rede neural muito pequena para o padrão de hoje, mas é o que poderíamos pagar. 0:32:27.070,0:32:29.970 -Have a video. I'm not sure it's gonna work, but I'll try +Tenha um vídeo. 
Não tenho certeza se vai funcionar, mas vou tentar 0:32:31.990,0:32:33.990 -Yeah, it works +Sim, funciona 0:32:41.360,0:32:45.010 -So I should tell you a little bit about the castor character he characters here so +Então, eu deveria falar um pouco sobre o elenco de personagens aqui, então 0:32:47.630,0:32:49.630 Huh 0:32:51.860,0:32:53.860 -You don't want the audio +Você não quer o áudio 0:32:55.370,0:32:59.020 -So Pierre Semanet and Raia Hadsell were two students +Então Pierre Sermanet e Raia Hadsell eram dois alunos 0:32:59.600,0:33:02.560 -working with me on this project two PhD students +trabalhando comigo neste projeto, dois estudantes de doutorado 0:33:03.170,0:33:08.200 -Pierre Sermanet is at Google Brain. He works on robotics and Raia Hadsell is the sales director of Robotics at DeepMind +Pierre Sermanet está no Google Brain. Ele trabalha com robótica e Raia Hadsell é diretora de Robótica na DeepMind 0:33:09.050,0:33:11.050 -Marco Scoffier is NVIDIA +Marco Scoffier está na NVIDIA 0:33:11.150,0:33:15.249 -Matt Grimes is a DeepMind, Jan Ben is at Mobile Eye which is now Intel +Matt Grimes está na DeepMind, Jan Ben está no Mobile Eye, que agora é da Intel 0:33:15.920,0:33:17.920 -Ayse Erkan is at +Ayse Erkan está no 0:33:18.260,0:33:20.260 -Twitter and +Twitter e 0:33:20.540,0:33:22.540 -Urs Muller is still working with us, he is +Urs Muller ainda está trabalhando conosco, ele é 0:33:22.910,0:33:29.139 -Actually head of a big group that works on autonomous driving at Nvidia and he is collaborating with us +Na verdade, chefe de um grande grupo que trabalha em direção autônoma na Nvidia e está colaborando conosco 0:33:30.800,0:33:32.800 -Actually +Na realidade 0:33:33.020,0:33:38.020 -Our further works on this project, so this is a robot +Nossos trabalhos posteriores neste projeto, então este é um robô 0:33:39.290,0:33:44.440 -And it can drive it about you know, sort of fast walking speed +E ele pode se deslocar, você sabe, mais ou menos na velocidade de uma caminhada rápida 0:33:46.310,0:33:48.999 -And it's supposed to drive itself in sort of nature +E ele deve dirigir sozinho na natureza, por assim dizer 0:33:50.720,0:33:55.930 -So it's got this mass with four eyes, there are two stereo pairs to two stereo camera pairs and +Então ele tem esse mastro com quatro olhos, são dois pares de câmeras estéreo e 0:33:57.020,0:34:02.320 -It has three computers in the belly. So it's completely autonomous. It doesn't talk to the network or anything +Tem três computadores na barriga. Então é totalmente autônomo. Não fala com a rede nem nada 0:34:03.200,0:34:05.200 -And those those three computers +E aqueles três computadores 0:34:07.580,0:34:10.120 -I'm on the left. That's when I had a pony tail +Eu estou à esquerda. Foi quando eu tinha um rabo de cavalo 0:34:13.640,0:34:19.659 -Okay, so here the the system is the the neural net is crippled so the we didn't turn on the neural Nets +Ok, então aqui a rede neural do sistema está desativada, não ligamos as redes neurais 0:34:19.659,0:34:22.029 -It's only using stereo vision and now it's using the neural net +Está usando apenas visão estéreo, e agora está usando a rede neural 0:34:22.130,0:34:26.529 -so it's it's pretty far away from this barrier, but it sees it and so it directly goes to +então ele está bem longe dessa barreira, mas ele a vê e vai diretamente para 0:34:27.169,0:34:31.599 -The side it wants to go to a goal, a GPS coordinate. That's behind it.
Same here +O lado que ele quer ir para um objetivo, uma coordenada GPS. Isso está por trás disso. Mesmo aqui 0:34:31.600,0:34:33.429 -He wants to go to a GPS coordinate behind it +Ele quer ir para uma coordenada GPS atrás dele 0:34:33.429,0:34:37.689 -And it sees right away that there is this wall of people that he can't go through +E vê logo que tem essa parede de gente que ele não consegue passar 0:34:38.360,0:34:43.539 -The guy on the right here is Marcos, He is holding the transmitter,he is not driving the robot but is holding the kill switch +O cara da direita aqui é Marcos, ele está segurando o transmissor, ele não está dirigindo o robô, mas está segurando o interruptor de matar 0:34:48.849,0:34:50.849 -And so +E assim 0:34:51.039,0:34:54.689 -You know, that's what the the the convolutional net looks like +Você sabe, é assim que a rede convolucional se parece 0:34:55.659,0:34:57.659 -really small by today's standards +muito pequeno para os padrões de hoje 0:35:00.430,0:35:02.430 -And +E 0:35:03.700,0:35:05.700 -And it produces for every +E produz para cada 0:35:06.400,0:35:08.400 -every location every patch on the input +cada local cada patch na entrada 0:35:08.829,0:35:13.859 -The second last layer is a 100 dimensional vector that goes into a classifier that classifies into five categories +A segunda última camada é um vetor de 100 dimensões que entra em um classificador que classifica em cinco categorias 0:35:14.650,0:35:16.650 -so once the system classifies +então, uma vez que o sistema classifica 0:35:16.779,0:35:20.189 -Each of those five categories in the image you can you can warp the image +Cada uma dessas cinco categorias na imagem você pode deformar a imagem 0:35:20.349,0:35:25.979 -Into a map that's centered on the robot and you can you can do planning in this map to figure out like how to avoid +Em um mapa que está centrado no robô e você pode fazer o planejamento neste mapa para descobrir como evitar 0:35:25.980,0:35:31.379 -Obstacles and stuff like that. So this is what this thing does. It's a particular map called a hyperbolic map, but +Obstáculos e coisas assim. Então é isso que essa coisa faz. É um mapa particular chamado mapa hiperbólico, mas 0:35:33.999,0:35:36.239 -It's not important for now +Não é importante por enquanto 0:35:38.380,0:35:40.380 -Now that +Agora isso 0:35:40.509,0:35:42.509 -because this was you know +porque isso era você sabe 0:35:42.970,0:35:49.199 -2007 the computers were slowly there were no GPUs so we could run this we could run this neural net only at about one frame per +Em 2007, os computadores estavam lentamente, não havia GPUs, então poderíamos executar isso, poderíamos executar essa rede neural apenas em cerca de um quadro por 0:35:49.200,0:35:50.859 -second +segundo 0:35:50.859,0:35:54.268 -As you can see here the at the bottom it updates about one frame per second +Como você pode ver aqui na parte inferior, ele atualiza cerca de um quadro por segundo 0:35:54.269,0:35:54.640 -and +e 0:35:54.640,0:35:59.609 -So if you have someone kind of walking in front of the robot the robot won't see it for a second and will you know? +Então, se você tiver alguém andando na frente do robô, o robô não o verá por um segundo e você saberá? 0:35:59.680,0:36:01.329 -Run over him +Corre por cima dele 0:36:01.329,0:36:07.079 -So that's why we have a second vision system here at the top. This one is stereo. It doesn't use a neural net +Então é por isso que temos um segundo sistema de visão aqui no topo. Este é estéreo. 
Não usa uma rede neural 0:36:09.039,0:36:13.949 -Odometry I think we don't care this is the controller which is also learned, but we don't care and +Odometria acho que não nos importamos este é o controlador que também se aprende, mas não nos importamos e 0:36:15.730,0:36:21.989 -This is the the system here again, it's vision is crippled they can only see up to two point two and a half meters +Este é o sistema aqui novamente, sua visão é aleijada eles só podem ver até dois metros e meio 0:36:21.989,0:36:23.989 -So it's very short +Então é muito curto 0:36:24.099,0:36:26.099 -But it kind of does a decent job +Mas meio que faz um trabalho decente 0:36:26.529,0:36:28.529 -and +e 0:36:28.930,0:36:34.109 -This is to test this sort of fast reacting vision systems or here pierre-simon a is jumping in front of it and +Isso é para testar esse tipo de sistema de visão de reação rápida ou aqui pierre-simon a está pulando na frente dele e 0:36:34.420,0:36:40.950 -the robot stops right away so that now that's the full system with long-range vision and +o robô para imediatamente para que agora esse seja o sistema completo com visão de longo alcance e 0:36:41.950,0:36:43.950 -annoying grad students +estudantes de graduação irritantes 0:36:49.370,0:36:52.150 -Right, so it's kind of giving up +Certo, então é meio que desistir 0:37:03.970,0:37:06.149 -Okay, oops +Ok, opa 0:37:09.400,0:37:11.049 -Okay, so +OK, então 0:37:11.049,0:37:12.690 -That's called semantic segmentation +Isso se chama segmentação semântica 0:37:12.690,0:37:18.329 -But the real form of semantic segmentation is the one in which you you give an object category for every location +Mas a forma real de segmentação semântica é aquela em que você dá uma categoria de objeto para cada local 0:37:18.729,0:37:21.599 -So that's the kind of problem here we're talking about where +Então esse é o tipo de problema aqui que estamos falando sobre onde 0:37:22.569,0:37:25.949 -every pixel is either building or sky or +cada pixel é um edifício ou céu ou 0:37:26.769,0:37:28.769 -Street or a car or something like this? +Rua ou um carro ou algo assim? 0:37:29.799,0:37:37.409 -And around 2010 a couple datasets started appearing with a few thousand images where you could train vision systems to do this +E por volta de 2010 alguns conjuntos de dados começaram a aparecer com alguns milhares de imagens onde você poderia treinar sistemas de visão para fazer isso 0:37:39.940,0:37:42.059 -And so the technique here is +E então a técnica aqui é 0:37:42.849,0:37:44.849 -essentially identical to the one I +essencialmente idêntico ao que eu 0:37:45.309,0:37:47.309 -Described it's also multi scale +Descrito também é multi-escala 0:37:48.130,0:37:52.920 -So you basically have an input image you have a convolutional net +Então você basicamente tem uma imagem de entrada, você tem uma rede convolucional 0:37:53.259,0:37:57.959 -that has a set of outputs that you know, one for each category +que tem um conjunto de saídas que você conhece, uma para cada categoria 0:37:58.539,0:38:01.258 -Of objects for which you have label, which in this case is 33 +De objetos para os quais você tem rótulo, que neste caso é 33 0:38:02.680,0:38:05.879 -When you back project one output of the convolutional net onto the input +Quando você volta a projetar uma saída da rede convolucional na entrada 0:38:06.219,0:38:11.249 -It corresponds to an input window of 46 by 46 windows. So it's using a context of 46 +Corresponde a uma janela de entrada de 46 por 46 janelas. 
Então está usando um contexto de 46 0:38:12.309,0:38:16.889 -by 46 pixels to make the decision about a single pixel at least that's the the +por 46 pixels para tomar a decisão sobre um único pixel, pelo menos esse é o 0:38:17.589,0:38:19.589 -neural net at the back, at the bottom +rede neural na parte de trás, na parte inferior 0:38:19.900,0:38:24.569 -But it has out 46 but 46 is not enough if you want to decide what a gray pixel is +Mas tem 46, mas 46 não é suficiente se você quiser decidir o que é um pixel cinza 0:38:24.569,0:38:27.359 -Is it the shirt of the person is it the street? Is it the +É a camisa da pessoa é a rua? É o 0:38:28.119,0:38:31.679 -Cloud or kind of pixel on the mountain. You have to look at a wider +Nuvem ou tipo de pixel na montanha. Você tem que olhar para uma visão mais ampla 0:38:32.650,0:38:34.650 -context to be able to make that decision so +contexto para poder tomar essa decisão 0:38:35.529,0:38:39.179 -We use again this kind of multiscale approach where the same image is +Usamos novamente esse tipo de abordagem multiescala onde a mesma imagem é 0:38:39.759,0:38:45.478 -Reduced by a factor of 2 and a factor of 4 and you run those two extra images to the same convolutional +Reduzido por um fator de 2 e um fator de 4 e você executa essas duas imagens extras para o mesmo convolucional 0:38:45.479,0:38:47.789 -net same weight same kernel same everything +net mesmo peso mesmo kernel mesmo tudo 0:38:48.940,0:38:54.089 -Except the the last feature map you upscale them so that they have the same size as the original one +Exceto o último mapa de recursos, você os aprimora para que tenham o mesmo tamanho que o original 0:38:54.089,0:38:58.859 -And now you take those combined feature Maps and send them to a couple layers of a classifier +E agora você pega esses mapas de feições combinados e os envia para algumas camadas de um classificador 0:38:59.410,0:39:01.410 -So now the classifier to make its decision +Então agora o classificador para tomar sua decisão 0:39:01.749,0:39:07.738 -Has four 46 by 46 windows on images that have been rescaled and so the effective +Possui quatro janelas de 46 por 46 em imagens que foram redimensionadas e, portanto, o efetivo 0:39:08.289,0:39:12.718 -size of the context now is is 184 by 184 window because +o tamanho do contexto agora é 184 por 184 janela porque 0:39:13.269,0:39:15.269 -the the core scale +a escala central 0:39:15.610,0:39:17.910 -Network basically looks at more this entire +A rede basicamente olha mais para todo esse 0:39:19.870,0:39:21.870 -Image +Imagem 0:39:24.310,0:39:30.299 -Then you can clean it up in various way I'm not gonna go to details for this but it works quite well +Então você pode limpá-lo de várias maneiras, não vou entrar em detalhes, mas funciona muito bem 0:39:33.970,0:39:36.330 -So this is the result +Então esse é o resultado 0:39:37.870,0:39:40.140 -The guy who did this in my lab is Clément Farabet +O cara que fez isso no meu laboratório é Clément Farabet 0:39:40.170,0:39:46.319 -He's a VP at Nvidia now in charge of all of machine learning infrastructure and the autonomous driving +Ele é vice-presidente da Nvidia agora responsável por toda a infraestrutura de aprendizado de máquina e condução autônoma 0:39:47.080,0:39:49.080 -Not surprisingly +Não surpreendentemente 0:39:51.100,0:39:57.959 -And and so that system, you know, this is this is Washington Square Park by the way, so this is the NYU campus +E então esse sistema, você sabe, este é o Washington Square Park a propósito, então este é o campus da 
NYU 0:39:59.440,0:40:02.429 -It's not perfect far from that from that. You know it +Não é perfeito longe disso. Você sabe 0:40:03.220,0:40:06.300 -Identified some areas of the street as sand +Identificou algumas áreas da rua como areia 0:40:07.330,0:40:09.160 -or desert and +ou deserto e 0:40:09.160,0:40:12.479 -There's no beach. I'm aware of in Washington Square Park +Não há praia. Estou ciente de em Washington Square Park 0:40:13.750,0:40:15.750 -and +e 0:40:16.480,0:40:17.320 -But you know +Mas você sabe 0:40:17.320,0:40:22.469 -At the time this was the kind of system of this kind at the the number of training samples for this was very small +Na época, esse era o tipo de sistema desse tipo, o número de amostras de treinamento para isso era muito pequeno 0:40:22.470,0:40:24.400 -so it was kind of +então foi tipo 0:40:24.400,0:40:27.299 -It was about 2,000 or 3,000 images something like that +Foram cerca de 2.000 ou 3.000 imagens algo assim 0:40:31.630,0:40:34.410 -You run you take a you take a full resolution image +Você corre, você tira uma imagem de resolução total 0:40:36.220,0:40:42.689 -You run it to the first n minus 2 layers of your ConvNet that gives you your future Maps +Você o executa para as primeiras n menos 2 camadas do seu ConvNet que fornece seus mapas futuros 0:40:42.970,0:40:45.570 -Then you reduce the image by a factor of two run it again +Então você reduz a imagem por um fator de dois, executa-a novamente 0:40:45.570,0:40:50.009 -You get a bunch of feature maps that are smaller then running again by reducing by a factor of four +Você obtém vários mapas de recursos que são menores e executados novamente reduzindo por um fator de quatro 0:40:50.320,0:40:51.900 -You get smaller feature maps +Você obtém mapas de recursos menores 0:40:51.900,0:40:52.420 -now +agora 0:40:52.420,0:40:57.420 -You take the small feature map and you rescale it you up sample it so it's the same size as the first one same +Você pega o pequeno mapa de recursos e redimensiona-o, amostra-o para que fique do mesmo tamanho que o primeiro 0:40:57.420,0:41:00.089 -for the second one, you stack all those feature maps together +para o segundo, você empilha todos esses mapas de recursos juntos 0:41:00.880,0:41:07.199 -Okay, and that you feed to two layers for a classifier for every patch +Ok, e que você alimente duas camadas para um classificador para cada patch 0:41:07.980,0:41:12.240 -Yeah, the paper was rejected from CVPR 2012 even though the results were +Sim, o artigo foi rejeitado no CVPR 2012, embora os resultados fossem 0:41:13.090,0:41:14.710 -record-breaking and +recorde e 0:41:14.710,0:41:17.520 -It was faster than the best competing +Foi mais rápido que o melhor concorrente 0:41:18.400,0:41:20.400 -method by a factor of 50 +método por um fator de 50 0:41:20.950,0:41:25.920 -Even running on standard hardware, but we also had implementation on special hardware that was incredibly fast +Mesmo rodando em hardware padrão, mas também tivemos implementação em hardware especial que foi incrivelmente rápido 0:41:26.980,0:41:28.130 -and +e 0:41:28.130,0:41:34.600 -people didn't know what the convolutional net was at the time and so the reviewers basically could not fathom that +as pessoas não sabiam o que era a rede convolucional na época e, portanto, os revisores basicamente não conseguiam entender isso 0:41:35.660,0:41:37.359 -The method they'd never heard of could work +O método que eles nunca ouviram falar poderia funcionar 0:41:37.359,0:41:40.899 -So well. 
There is way more to say about convolutional nets +Tão bem. Há muito mais a dizer sobre redes convolucionais 0:41:40.900,0:41:44.770 -But I encourage you to take a computer vision course for to hear about this +Mas eu encorajo você a fazer um curso de visão computacional para ouvir sobre isso 0:41:45.950,0:41:49.540 -Yeah, this is okay this data set this particular dataset that we used +Sim, tudo bem este conjunto de dados este conjunto de dados específico que usamos 0:41:51.590,0:41:57.969 -Is a collection of images street images that was collected mostly by Antonio Torralba at MIT and +É uma coleção de imagens de rua que foi coletada principalmente por Antonio Torralba no MIT e 0:42:02.690,0:42:04.130 -He had a +Ele tem um 0:42:04.130,0:42:08.530 -sort of a tool for kind of labeling so you could you know, you could sort of +uma espécie de ferramenta para rotular para que você pudesse saber, você poderia meio que 0:42:09.140,0:42:12.100 -draw the contour over the object and then label of the object and +desenhe o contorno sobre o objeto e, em seguida, rotule o objeto e 0:42:12.650,0:42:18.129 -So if it would kind of, you know fill up the object most of the segmentations were done by his mother +Então se fosse meio, você sabe preencher o objeto que a maioria das segmentações foram feitas pela mãe dele 0:42:20.030,0:42:22.030 -Who's in Spain +Quem está na Espanha 0:42:22.310,0:42:24.310 -she had a lot of time to +ela teve muito tempo para 0:42:25.220,0:42:27.220 -Spend doing this +Gaste fazendo isso 0:42:27.380,0:42:29.300 Huh? 0:42:29.300,0:42:34.869 -His mother yeah labeled that stuff. Yeah. This was in the late late 2000 +Sua mãe sim rotulou essas coisas. Sim. Isso foi no final dos anos 2000 0:42:37.190,0:42:41.530 -Okay, now let's talk about a bunch of different architectures, right so +Ok, agora vamos falar sobre um monte de arquiteturas diferentes, certo então 0:42:43.400,0:42:45.520 -You know as I mentioned before +Você sabe como eu mencionei antes 0:42:45.950,0:42:51.159 -the idea of deep learning is that you have this catalog of modules that you can assemble in sort of different graphs and +a ideia de aprendizado profundo é que você tenha esse catálogo de módulos que você pode montar em diferentes gráficos e 0:42:52.040,0:42:54.879 -and together to do different functions and +e juntos para fazer diferentes funções e 0:42:56.210,0:42:58.210 -and a lot of the +e muito do 0:42:58.430,0:43:03.280 -Expertise in deep learning is to design those architectures to do something in particular +Expertise em deep learning é projetar essas arquiteturas para fazer algo em particular 0:43:03.619,0:43:06.909 -It's a little bit like, you know in the early days of computer science +É um pouco como, você sabe, nos primeiros dias da ciência da computação 0:43:08.180,0:43:11.740 -Coming up with an algorithm to write a program was kind of a new concept +Criar um algoritmo para escrever um programa foi uma espécie de novo conceito 0:43:12.830,0:43:14.830 -you know reducing a +você sabe reduzir um 0:43:15.560,0:43:19.209 -Problem to kind of a set of instructions that could be run on a computer +Problema para um tipo de conjunto de instruções que poderia ser executado em um computador 0:43:19.210,0:43:21.580 -It was kind of something new and here it's the same problem +Foi meio que algo novo e aqui está o mesmo problema 0:43:21.830,0:43:26.109 -you have to sort of imagine how to reduce a complex function into sort of a +você tem que imaginar como reduzir uma função complexa em uma espécie de 
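Retomando o esquema multiescala de segmentação semântica descrito um pouco acima (a mesma rede, com os mesmos pesos, aplicada às escalas 1, 1/2 e 1/4, com os mapas de características reamostrados para o tamanho original e concatenados antes do classificador por pixel), segue um esboço em PyTorch. A arquitetura do extrator, o número de canais e os nomes são hipotéticos, apenas para ilustrar a ideia:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentadorMultiescala(nn.Module):
    """Esboço: extrator compartilhado aplicado em 3 escalas + classificador por pixel."""
    def __init__(self, canais=16, n_classes=33):
        super().__init__()
        self.extrator = nn.Sequential(               # camadas convolucionais compartilhadas
            nn.Conv2d(3, canais, 7, padding=3), nn.ReLU(),
            nn.Conv2d(canais, canais, 7, padding=3), nn.ReLU(),
        )
        self.classificador = nn.Conv2d(3 * canais, n_classes, 1)  # decisão pixel a pixel

    def forward(self, x):                            # x: (lote, 3, H, W)
        h, w = x.shape[-2:]
        mapas = []
        for escala in (1.0, 0.5, 0.25):
            xs = x if escala == 1.0 else F.interpolate(
                x, scale_factor=escala, mode='bilinear', align_corners=False)
            f = self.extrator(xs)                    # mesmos pesos em todas as escalas
            mapas.append(F.interpolate(f, size=(h, w), mode='bilinear',
                                       align_corners=False))
        return self.classificador(torch.cat(mapas, dim=1))
```

A escala reduzida dá a cada saída um contexto efetivo maior sobre a imagem original, que é exatamente o papel das janelas de 184 por 184 mencionadas na aula.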
0:43:27.500,0:43:29.560 -graph possibly dynamic graph of +gráfico possivelmente gráfico dinâmico de 0:43:29.720,0:43:35.830 -Functional modules that you don't need to know completely the function of but that you're going to whose function is gonna be finalized by learning +Módulos funcionais dos quais você não precisa saber completamente a função, mas que você está indo para cuja função será finalizada aprendendo 0:43:36.109,0:43:38.199 -But the architecture is super important, of course +Mas a arquitetura é super importante, claro 0:43:38.920,0:43:43.359 -As we saw with convolutional Nets. the first important category is recurrent net. So +Como vimos com as redes convolucionais. a primeira categoria importante é a rede recorrente. assim 0:43:44.180,0:43:47.379 -We've we've seen when we talked about the backpropagation +Nós vimos quando falamos sobre a retropropagação 0:43:48.140,0:43:50.140 -There's a big +Há um grande 0:43:50.510,0:43:58.029 -Condition of the condition was that the graph of the interconnection of the module could not have loops. Okay. It had to be a +A condição da condição era que o gráfico da interligação do módulo não pudesse ter laços. OK. Tinha que ser um 0:43:59.299,0:44:04.059 -graph for which there is sort of at least a partial order of the module so that you can compute the +gráfico para o qual há pelo menos uma ordem parcial do módulo para que você possa calcular o 0:44:04.819,0:44:09.489 -The the modules in such a way that when you compute the output of a module all of its inputs are available +Os módulos de tal forma que quando você calcula a saída de um módulo todas as suas entradas estão disponíveis 0:44:11.240,0:44:13.299 -But recurrent net is one in which you have loops +Mas rede recorrente é aquela em que você tem loops 0:44:14.480,0:44:15.490 -How do you deal with this? +Como você lida com isso? 0:44:15.490,0:44:18.459 -So here is an example of a recurrent net architecture +Então aqui está um exemplo de uma arquitetura de rede recorrente 0:44:18.920,0:44:25.210 -Where you have an input which varies over time X(t) that goes through the first neural net. Let's call it an encoder +Onde você tem uma entrada que varia ao longo do tempo X(t) que passa pela primeira rede neural. 
Vamos chamá-lo de codificador 0:44:25.789,0:44:29.349 -That produces a representation of the of the input +Isso produz uma representação da entrada 0:44:29.349,0:44:32.679 -Let's call it H(t) and it goes into a recurrent layer +Vamos chamá-lo de H(t) e ele entra em uma camada recorrente 0:44:32.680,0:44:38.409 -This recurrent layer is a function G that depends on trainable parameters W this trainable parameters also for the encoder +Esta camada recorrente é uma função G que depende de parâmetros treináveis ​​W estes parâmetros treináveis ​​também para o encoder 0:44:38.410,0:44:40.410 -but I didn't mention it and +mas eu não mencionei isso e 0:44:41.150,0:44:42.680 -that +aquele 0:44:42.680,0:44:46.480 -Recurrent layer takes into account H(t), which is the representation of the input +A camada recorrente leva em conta H(t), que é a representação da entrada 0:44:46.480,0:44:49.539 -but it also takes into account Z(t-1), which is the +mas também leva em conta Z(t-1), que é a 0:44:50.150,0:44:55.509 -Sort of a hidden state, which is its output at a previous time step its own output at a previous time step +Tipo de um estado oculto, que é sua saída em uma etapa de tempo anterior sua própria saída em uma etapa de tempo anterior 0:44:56.299,0:44:59.709 -Okay, this G function can be a very complicated neural net inside +Ok, esta função G pode ser uma rede neural muito complicada dentro 0:45:00.950,0:45:06.519 -convolutional net whatever could be as complicated as you want. But what's important is that one of its inputs is +rede convolucional o que quer que seja tão complicado quanto você quiser. Mas o importante é que uma de suas entradas é 0:45:08.869,0:45:10.869 -Its output at a previous time step +Sua saída em uma etapa de tempo anterior 0:45:11.630,0:45:13.160 -Okay +OK 0:45:13.160,0:45:15.049 Z(t-1) 0:45:15.049,0:45:21.788 -So that's why this delay indicates here. The input of G at time t is actually Z(t-1) +Então é por isso que esse atraso indica aqui. A entrada de G no tempo t é na verdade Z(t-1) 0:45:21.789,0:45:24.459 -Which is the output its output at a previous time step +Qual é a saída sua saída em uma etapa de tempo anterior 0:45:27.230,0:45:32.349 -Ok, then the output of that recurrent module goes into a decoder which basically produces an output +Ok, então a saída desse módulo recorrente vai para um decodificador que basicamente produz uma saída 0:45:32.450,0:45:35.710 -Ok, so it turns a hidden representation Z into an output +Ok, então transforma uma representação oculta Z em uma saída 0:45:39.859,0:45:41.979 -So, how do you deal with this, you unroll the loop +Então, como você lida com isso, você desenrola o loop 0:45:44.230,0:45:47.439 -So this is basically the same diagram, but I've unrolled it in time +Este é basicamente o mesmo diagrama, mas eu o desenrolei a tempo 0:45:49.160,0:45:56.170 -Okay, so at time at times 0 I have X(0) that goes through the encoder produces H of 0 and then I apply +Ok, então no momento 0 eu tenho X(0) que passa pelo codificador produz H de 0 e então eu aplico 0:45:56.170,0:46:00.129 -The G function I start with a Z arbitrary Z, maybe 0 or something +A função G eu começo com um Z arbitrário Z, talvez 0 ou algo assim 0:46:01.160,0:46:05.980 -And I apply the function and I get Z(0) and that goes into the decoder produces an output +E eu aplico a função e recebo Z(0) e isso entra no decodificador produz uma saída 0:46:06.650,0:46:08.270 -Okay +OK 0:46:08.270,0:46:09.740 -and then +e então 0:46:09.740,0:46:16.479 -Now that has Z(0) at time step 1. 
I can use the Z(0) as the previous output for the time step. Ok +Agora que tem Z(0) no passo de tempo 1. Posso usar o Z(0) como a saída anterior para o passo de tempo. OK 0:46:17.570,0:46:22.570 -Now the output is X(1) and time 1. I run through the encoder I run through the recurrent layer +Agora a saída é X(1) e tempo 1. Eu corro pelo codificador eu corro pela camada recorrente 0:46:22.570,0:46:24.570 -Which is now no longer recurrent +Que agora não é mais recorrente 0:46:24.890,0:46:28.510 -And run through the decoder and then the next time step, etc +E percorrer o decodificador e, em seguida, o próximo passo de tempo, etc 0:46:29.810,0:46:34.269 -Ok, this network that's involved in time doesn't have any loops anymore +Ok, esta rede que está envolvida no tempo não tem mais loops 0:46:37.130,0:46:39.040 -Which means I can run backpropagation through it +O que significa que posso executar retropropagação através dele 0:46:39.040,0:46:44.259 -So if I have an objective function that says the last output should be that particular one +Então, se eu tenho uma função objetivo que diz que a última saída deve ser aquela em particular 0:46:45.020,0:46:48.609 -Or maybe the trajectory should be a particular one of the outputs. I +Ou talvez a trajetória deva ser uma das saídas em particular. eu 0:46:49.730,0:46:51.760 -Can just back propagate gradient through this thing +Pode apenas voltar a propagar o gradiente através dessa coisa 0:46:52.940,0:46:55.510 -It's a regular network with one +É uma rede regular com um 0:46:56.900,0:46:59.980 -Particular characteristic, which is that every block +Característica particular, que é que cada bloco 0:47:01.609,0:47:03.609 -Shares the same weights +Compartilha os mesmos pesos 0:47:04.040,0:47:07.509 -Okay, so the three instances of the encoder +Ok, então as três instâncias do codificador 0:47:08.150,0:47:11.379 -They are the same encoder at three different time steps +Eles são o mesmo codificador em três etapas de tempo diferentes 0:47:11.380,0:47:16.869 -So they have the same weights the same G functions has the same weights, the three decoders have the same weights. Yes +Então eles têm os mesmos pesos as mesmas funções G tem os mesmos pesos, os três decodificadores têm os mesmos pesos. 
sim 0:47:20.990,0:47:23.260 -It can be variable, you know, I have to decide in advance +Pode ser variável, sabe, mas eu tenho que decidir com antecedência 0:47:25.160,0:47:27.399 -But it depends on the length of your input sequence +Mas isso depende do comprimento da sua sequência de entrada 0:47:28.579,0:47:30.109 -basically +basicamente 0:47:30.109,0:47:33.159 -Right and you know, it's you can you can run it for as long as you want +Certo, e você pode executá-la pelo tempo que quiser 0:47:33.890,0:47:38.290 -You know, it's the same weights over so you can just you know, repeat the operation +Sabe, são os mesmos pesos repetidos, então você pode simplesmente repetir a operação 0:47:40.130,0:47:46.390 -Okay this technique of unrolling and then back propagating through time basically is called surprisingly +Ok, essa técnica de desenrolar e depois retropropagar através do tempo é chamada, surpreendentemente, de 0:47:47.060,0:47:49.060 -BPTT back prop through time +BPTT, retropropagação através do tempo (backprop through time) 0:47:50.000,0:47:52.000 -It's pretty obvious +É bem óbvio 0:47:53.470,0:47:55.470 -That's all there is to it +Isso é tudo o que há para saber 0:47:56.710,0:48:01.439 -Unfortunately, they don't work very well at least not in their naive form +Infelizmente, elas não funcionam muito bem, pelo menos não em sua forma ingênua 0:48:03.910,0:48:06.000 -So in the naive form +Então, na forma ingênua 0:48:07.360,0:48:11.519 -So a simple form of recurrent net is one in which the encoder is linear +uma forma simples de rede recorrente é aquela em que o codificador é linear 0:48:11.770,0:48:16.560 -The G function is linear with high probably tangent or sigmoid or perhaps ReLU +A função G é linear com tangente hiperbólica ou sigmoide, ou talvez ReLU 0:48:17.410,0:48:22.680 -And the decoder also is linear something like this maybe with a ReLU or something like that, right so it could be very simple +E o decodificador também é linear, algo assim, talvez com uma ReLU ou algo do tipo; então pode ser bem simples 0:48:23.530,0:48:24.820 -and +e 0:48:24.820,0:48:27.539 -You get a number of problems with this and one problem is? +Você tem uma série de problemas com isso, e um dos problemas é 0:48:29.290,0:48:32.969 -The so called vanishing gradient problem or exploding gradient problem +o chamado problema do gradiente de fuga (vanishing gradient) ou do gradiente que explode 0:48:34.060,0:48:38.640 -And it comes from the fact that if you have a long sequence, let's say I don't know 50 time steps +E isso vem do fato de que, se você tem uma sequência longa, digamos, sei lá, 50 passos de tempo 0:48:40.060,0:48:44.400 -Every time you back propagate gradients +toda vez que você retropropaga gradientes 0:48:45.700,0:48:52.710 -The gradients that get multiplied by the weight matrix of the G function. Okay at every time step +os gradientes são multiplicados pela matriz de pesos da função G.
Ok a cada passo de tempo 0:48:54.010,0:48:58.560 -the gradients get multiplied by the the weight matrix now imagine the weight matrix has +os gradientes são multiplicados pela matriz de pesos agora imagine que a matriz de pesos tem 0:48:59.110,0:49:00.820 -small values in it +pequenos valores nele 0:49:00.820,0:49:07.049 -Which means that means that every time you take your gradient you multiply it by the transpose of this matrix to get the gradient at previous +O que significa que toda vez que você pega seu gradiente, você o multiplica pela transposição desta matriz para obter o gradiente anterior 0:49:07.050,0:49:08.290 -time step +passo de tempo 0:49:08.290,0:49:10.529 -You get a shorter vector you get a smaller vector +Você obtém um vetor menor, obtém um vetor menor 0:49:11.200,0:49:14.520 -And you keep rolling the the vector gets shorter and shorter exponentially +E você continua rolando o vetor fica cada vez mais curto exponencialmente 0:49:14.980,0:49:18.449 -That's called the vanishing gradient problem by the time you get to the 50th +Isso é chamado de problema do gradiente de fuga quando você chega ao 50º 0:49:19.210,0:49:23.100 -Time steps which is really the first time step. You don't get any gradient +Passos de tempo que é realmente o primeiro passo de tempo. Você não recebe nenhum gradiente 0:49:28.660,0:49:32.970 -Conversely if the weight matrix is really large and the non-linearity and your +Por outro lado, se a matriz de peso for muito grande e a não linearidade e seu 0:49:33.760,0:49:36.120 -Recurrent layer is not saturating +A camada recorrente não está saturando 0:49:36.670,0:49:41.130 -your gradients can explode if the weight matrix is large every time you multiply the +seus gradientes podem explodir se a matriz de peso for grande toda vez que você multiplicar o 0:49:41.650,0:49:43.650 -gradient by the transpose of the matrix +gradiente pela transposição da matriz 0:49:43.660,0:49:46.920 -the vector gets larger and it explodes which means +o vetor fica maior e explode, o que significa 0:49:47.290,0:49:51.810 -your weights are going to diverge when you do a gradient step or you're gonna have to use a tiny learning rate for it to +seus pesos vão divergir quando você fizer uma etapa de gradiente ou você terá que usar uma pequena taxa de aprendizado para 0:49:51.810,0:49:53.810 -work +trabalhar 0:49:54.490,0:49:56.290 -So +assim 0:49:56.290,0:49:58.529 -You have to use a lot of tricks to make those things work +Você tem que usar muitos truques para fazer essas coisas funcionarem 0:49:59.860,0:50:04.620 -Here's another problem. The reason why you would want to use a recurrent net. Why would you want to use a recurrent net? +Aqui está outro problema. A razão pela qual você gostaria de usar uma rede recorrente. Por que você iria querer usar uma rede recorrente? 0:50:05.690,0:50:12.639 -The purported advantage of recurrent net is that they can remember remember things from far away in the past +A suposta vantagem da rede recorrente é que eles podem se lembrar de coisas distantes no passado 0:50:13.850,0:50:15.850 -Okay +OK 0:50:16.970,0:50:24.639 -If for example you imagine that the the X's are our characters that you enter one by one +Se por exemplo você imaginar que os X's são nossos caracteres que você digita um a um 0:50:25.940,0:50:31.300 -The characters come from I don't know a C program or something like that, right? +Os personagens vêm de não conheço um programa em C ou algo do tipo, certo? 
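Para tornar concreto o argumento um pouco acima sobre gradientes que desaparecem ou explodem, o gradiente sendo multiplicado pela transposta da matriz de pesos a cada passo de tempo, segue um experimento numérico mínimo. É apenas um esboço ilustrativo nosso; as dimensões e as escalas foram escolhidas arbitrariamente:

```python
import torch

torch.manual_seed(0)
d, T = 16, 50   # dimensão do estado oculto e número de passos de tempo (valores ilustrativos)

def norma_do_gradiente(escala):
    """Multiplica um 'gradiente' pela transposta de W por T passos (caso linear)."""
    W = escala * torch.randn(d, d) / d ** 0.5   # matriz de pesos da camada recorrente
    g = torch.ones(d)                           # gradiente no último passo de tempo
    for _ in range(T):
        g = W.t() @ g                           # um passo de retropropagação no tempo
    return g.norm().item()

print(norma_do_gradiente(0.5))   # tende a ~0: gradiente de fuga (vanishing)
print(norma_do_gradiente(2.0))   # cresce exponencialmente: gradiente que explode
```

Na prática, o recorte de gradientes mencionado logo adiante na aula pode ser feito, por exemplo, com `torch.nn.utils.clip_grad_norm_` do PyTorch.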
0:50:34.070,0:50:35.300 -And +E 0:50:35.300,0:50:37.870 -What your system is supposed to tell you at the end, you know? +O que seu sistema deve dizer no final, sabe? 0:50:37.870,0:50:42.699 -it reads a few hundred characters corresponding to the source code of a function and at the end is +ele lê algumas centenas de caracteres correspondentes ao código-fonte de uma função e no final é 0:50:43.730,0:50:49.090 -you want to train your system so that it produces one if it's a syntactically correct program and +você deseja treinar seu sistema para que ele produza um se for um programa sintaticamente correto e 0:50:49.910,0:50:51.910 -Minus one if it's not okay +Menos um se não estiver tudo bem 0:50:52.430,0:50:54.320 -hypothetical problem +problema hipotético 0:50:54.320,0:50:57.489 -Recurrent Nets won't do it. Okay, at least not with our tricks +Redes recorrentes não farão isso. Ok, pelo menos não com nossos truques 0:50:59.180,0:51:02.500 -Now there is a thing here which is the issue which is that +Agora há uma coisa aqui que é a questão que é que 0:51:03.860,0:51:07.599 -Among other things this program has to have balanced braces and parentheses +Entre outras coisas, este programa deve ter chaves e parênteses balanceados 0:51:09.110,0:51:10.280 -So +assim 0:51:10.280,0:51:13.540 -It has to have a way of remembering how many open parentheses +Tem que ter uma maneira de lembrar quantos parênteses abertos 0:51:13.540,0:51:20.350 -there are so that it can check that you're closing them all or how many open braces there are so so all of them get +existem para que ele possa verificar se você está fechando todos eles ou quantos colchetes abertos existem para que todos sejam 0:51:21.620,0:51:24.939 -Get closed right so it has to store eventually, you know +Feche bem para que tenha que armazenar eventualmente, você sabe 0:51:27.380,0:51:29.410 -Essentially within its hidden state Z +Essencialmente dentro de seu estado oculto Z 0:51:29.410,0:51:32.139 -it has to store like how many braces and and +ele tem que armazenar como quantos chaves e e 0:51:32.630,0:51:37.240 -Parentheses were open if it wants to be able to tell at the end that all of them have been closed +Os parênteses estavam abertos se quiser poder dizer no final que todos eles foram fechados 0:51:38.620,0:51:41.040 -So it has to have some sort of counter inside right +Então tem que ter algum tipo de contador dentro certo 0:51:43.180,0:51:45.080 -Yes +sim 0:51:45.080,0:51:47.840 -It's going to be a topic tomorrow +Vai ser um tópico amanhã 0:51:51.050,0:51:56.469 -Now if the program is very long that means, you know Z has to kind of preserve information for a long time and +Agora, se o programa for muito longo, isso significa que você sabe que Z precisa preservar as informações por um longo tempo e 0:51:57.230,0:52:02.679 -Recurrent net, you know give you the hope that maybe a system like this can do this, but because of a vanishing gradient problem +Rede recorrente, você sabe, dá a esperança de que talvez um sistema como esse possa fazer isso, mas por causa de um problema de gradiente de fuga 0:52:02.810,0:52:05.259 -They actually don't at least not simple +Eles realmente não pelo menos não são simples 0:52:07.280,0:52:09.280 -Recurrent Nets +Redes recorrentes 0:52:09.440,0:52:11.440 -Of the type. I just described +Do tipo. 
acabei de descrever 0:52:12.080,0:52:14.080 -So you have to use a bunch of tricks +Então você tem que usar um monte de truques 0:52:14.200,0:52:18.460 -Those are tricks from you know Yoshua Bengio's lab, but there is a bunch of them that were published by various people +Esses são truques que você conhece do laboratório de Yoshua Bengio, mas tem um monte deles que foram publicados por várias pessoas 0:52:19.700,0:52:22.090 -Like Thomas Mikolov and various other people +Como Thomas Mikolov e várias outras pessoas 0:52:24.050,0:52:27.789 -So to avoid exploding gradients you can clip the gradients just you know, make it you know +Então, para evitar gradientes explosivos, você pode recortar os gradientes que você conhece, faça você saber 0:52:27.790,0:52:30.279 -If the gradients get too large, you just kind of squash them down +Se os gradientes ficarem muito grandes, você apenas os esmaga 0:52:30.950,0:52:32.950 -Just normalize them +Basta normalizá-los 0:52:35.180,0:52:41.800 -Weak integration momentum I'm not gonna mention that. a good initialization so you want to initialize the weight matrices so that +Momento de integração fraco Não vou mencionar isso. uma boa inicialização, então você deseja inicializar as matrizes de peso para que 0:52:42.380,0:52:44.380 -They preserves the norm more or less +Eles preservam a norma mais ou menos 0:52:44.660,0:52:49.180 -this is actually a whole bunch of papers on this on orthogonal neural nets and invertible +este é realmente um monte de artigos sobre isso em redes neurais ortogonais e inversíveis 0:52:49.700,0:52:51.700 -recurrent Nets +Redes recorrentes 0:52:54.770,0:52:56.770 -But the big trick is +Mas o grande truque é 0:52:57.470,0:53:04.630 -LSTM and GRUs. Okay. So what is that before I talk about that I'm gonna talk about multiplicative modules +LSTM e GRUs. OK. 
Então, o que é isso? Antes de falar sobre isso, vou falar sobre módulos multiplicativos 0:53:06.410,0:53:08.470 -So what are multiplicative modules +Então, o que são módulos multiplicativos? 0:53:09.500,0:53:11.000 -They're basically +Eles são basicamente 0:53:11.000,0:53:14.709 -Modules in which you you can multiply things with each other +módulos nos quais você pode multiplicar coisas entre si 0:53:14.710,0:53:20.590 -So instead of just computing a weighted sum of inputs you compute products of inputs and then weighted sum of that +Então, em vez de apenas calcular uma soma ponderada das entradas, você calcula produtos das entradas e, em seguida, uma soma ponderada disso 0:53:20.600,0:53:23.110 -Okay, so you have an example of this on the top left +Ok, então você tem um exemplo disso no canto superior esquerdo 0:53:23.720,0:53:25.040 -on the top +no topo 0:53:25.040,0:53:29.080 -so the output of a system here is just a weighted sum of +então a saída de um sistema aqui é apenas uma soma ponderada de 0:53:30.080,0:53:32.080 -weights and inputs +pesos e entradas 0:53:32.240,0:53:37.810 -Okay classic, but the weights actually themselves are weighted sums of weights and inputs +Ok, clássico, mas os próprios pesos são, na verdade, somas ponderadas de pesos e entradas 0:53:38.780,0:53:43.149 -okay, so Wij here, which is the ij'th term in the weight matrix of +ok, então Wij aqui, que é o termo ij da matriz de pesos 0:53:43.820,0:53:46.479 -The module we're considering is actually itself +do módulo que estamos considerando, é na verdade ele mesmo 0:53:47.270,0:53:49.270 -a weighted sum of +uma soma ponderada de 0:53:50.060,0:53:53.439 -three third order tenser Uijk +termos de um tensor de terceira ordem Uijk 0:53:54.410,0:53:56.560 -weighted by variables Zk. +ponderados pelas variáveis Zk. 0:53:58.220,0:54:02.080 -Okay, so basically what you get is that Wij is kind of a weighted sum of +Ok, então basicamente o que você obtém é que Wij é uma espécie de soma ponderada de 0:54:04.160,0:54:06.160 -Matrices +matrizes 0:54:06.800,0:54:08.800 -Uk +Uk 0:54:09.020,0:54:13.419 -Weighted by a coefficient Zk and the Zk can change there are input variables the same way +ponderadas por um coeficiente Zk, e os Zk podem mudar, são variáveis de entrada da mesma forma 0:54:13.460,0:54:17.230 -So in effect, it's like having a neural net +Então, na verdade, é como ter uma rede neural 0:54:18.260,0:54:22.600 -With weight matrix W whose weight matrix is computed itself by another neural net +com matriz de pesos W cuja matriz de pesos é, ela própria, calculada por outra rede neural 0:54:24.710,0:54:30.740 -There is a general form of this where you don't just multiply matrices, but you have a neural net that is some complex function +Existe uma forma geral disso em que você não apenas multiplica matrizes, mas tem uma rede neural que é alguma função complexa que 0:54:31.650,0:54:33.650 -turns X into S +transforma X em S 0:54:34.859,0:54:40.819 -Some generic function. Ok, give you ConvNet whatever and the weights of those neural nets +Alguma função genérica. Ok, uma ConvNet, o que for, e os pesos dessas redes neurais 0:54:41.910,0:54:44.839 -are not variables that you learn directly but they are the output of +não são variáveis que você aprende diretamente, mas são a saída de 0:54:44.970,0:54:48.800 -Another neuron that that takes maybe another input into account or maybe the same input +outra rede neural, que leva em conta talvez outra entrada, ou talvez a mesma entrada 0:54:49.830,0:54:55.069 -Some people call those architectures hyper networks. Ok. +Algumas pessoas chamam essas arquiteturas de hiper-redes (hypernetworks). OK.
There are networks whose weights are computed by another network +Algumas pessoas chamam essas arquiteturas de hiper-redes. OK. Existem redes cujos pesos são calculados por outra rede 0:54:56.160,0:54:59.270 -But here's just a simple form of it, which is kind of a bilinear form +Mas aqui está apenas uma forma simples, que é uma forma bilinear 0:54:59.970,0:55:01.740 -or quadratic +ou quadrático 0:55:01.740,0:55:03.180 -form +Formato 0:55:03.180,0:55:05.810 -Ok, so overall when you kind of write it all down +Ok, então no geral, quando você escreve tudo 0:55:06.570,0:55:13.339 -SI is equal to sum over j And k of Uijk Zk Xj. This is a double sum +SI é igual a soma sobre j E k de Uijk Zk Xj. Esta é uma soma dupla 0:55:15.750,0:55:18.169 -People used to call this Sigma Pi units, yes +As pessoas costumavam chamar isso de unidades Sigma Pi, sim 0:55:22.890,0:55:27.290 -We'll come to this in just a second basically +Chegaremos a isso em apenas um segundo basicamente 0:55:31.500,0:55:33.500 -If you want a neural net that can +Se você quer uma rede neural que possa 0:55:34.740,0:55:36.740 -perform a transformation from +realizar uma transformação de 0:55:37.440,0:55:41.929 -A vector into another and that transformation is to be programmable +Um vetor em outro e essa transformação deve ser programável 0:55:42.990,0:55:50.089 -Right, you can have that transformation be computed by a neural net but the weight of that neural net would be it themselves the output +Certo, você pode fazer com que essa transformação seja calculada por uma rede neural, mas o peso dessa rede neural seria a saída 0:55:50.089,0:55:51.390 -of +do 0:55:51.390,0:55:54.200 -Another neural net that figures out what the transformation is +Outra rede neural que descobre qual é a transformação 0:55:55.349,0:56:01.399 -That's kind of the more general form more specifically is very useful if you want to route +Essa é a forma mais geral, mais especificamente, é muito útil se você quiser rotear 0:56:03.359,0:56:08.389 -Signals through a neural net in different ways on a data dependent way so +Sinais através de uma rede neural de diferentes maneiras em uma maneira dependente de dados para 0:56:10.980,0:56:16.669 -You in fact that's exactly what is mentioned below so the attention module is a special case of this +Você de fato é exatamente isso que está mencionado abaixo, então o módulo de atenção é um caso especial disso 0:56:17.460,0:56:20.510 -It's not a quadratic layer. It's kind of a different type, but it's a +Não é uma camada quadrática. É um tipo diferente, mas é um 0:56:21.510,0:56:23.510 -particular type of +determinado tipo de 0:56:25.140,0:56:26.849 -Architecture that +Arquitetura que 0:56:26.849,0:56:28.849 -basically computes a +basicamente calcula um 0:56:29.339,0:56:32.029 -convex linear combination of a bunch of vectors, so +combinação linear convexa de um monte de vetores, então 0:56:32.790,0:56:34.849 -x₁ and x₂ here are vectors +x₁ e x₂ aqui são vetores 0:56:37.770,0:56:42.499 -w₁ and w₂ are scalars, basically, okay and +w₁ e w₂ são escalares, basicamente, ok e 0:56:45.540,0:56:47.870 -What the system computes here is a weighted sum of +O que o sistema calcula aqui é uma soma ponderada de 0:56:49.590,0:56:55.069 -x₁ and x₂ weighted by w₁ w₂ and again w₁ w₂ are scalars in this case +x₁ e x₂ ponderados por w₁ w₂ e novamente w₁ w₂ são escalares neste caso 0:56:56.910,0:56:58.910 -Here the sum at the output +Aqui a soma na saída 0:56:59.730,0:57:01.020 -so +assim 0:57:01.020,0:57:07.999 -Imagine that those two weights. 
w₁ w₂ are between 0 and 1 and sum to 1 that's what's called a convex linear combination +Imagine que esses dois pesos, w₁ e w₂, estão entre 0 e 1 e somam 1; isso é o que se chama de combinação linear convexa 0:57:10.260,0:57:13.760 -So by changing w₁ w₂ so essentially +Então, alterando w₁ e w₂, essencialmente 0:57:15.480,0:57:18.139 -If this sum to 1 there are the output of a softmax +Se eles somam 1, eles são a saída de um softmax 0:57:18.810,0:57:23.629 -Which means w₂ is equal to 1 - w₁ right? That's kind of the direct consequence +o que significa que w₂ é igual a 1 - w₁, certo? Essa é meio que a consequência direta 0:57:27.450,0:57:29.450 -So basically by changing +Então, basicamente, mudando 0:57:29.790,0:57:34.340 -the size of w₁ w₂ you kind of switch the output to +o tamanho de w₁ e w₂ você meio que alterna a saída para 0:57:34.530,0:57:39.860 -Being either x₁ or x₂ or some linear combination of the two some interpolation between the two +ser x₁ ou x₂, ou alguma combinação linear dos dois, alguma interpolação entre os dois 0:57:41.610,0:57:43.050 -Okay +OK 0:57:43.050,0:57:47.179 -You can have more than just x₁ and x₂ you can have a whole bunch of x vectors +Você pode ter mais do que apenas x₁ e x₂, você pode ter um monte de vetores x 0:57:48.360,0:57:50.360 -and that +e esse 0:57:50.730,0:57:54.800 -system will basically choose an appropriate linear combination or focus +sistema vai basicamente escolher uma combinação linear apropriada, ou um foco; 0:57:55.140,0:58:02.210 -Is called an attention mechanism because it allows a neural net to basically focus its attention on a particular input and ignoring ignoring the others +isso é chamado de mecanismo de atenção porque permite que uma rede neural basicamente concentre sua atenção em uma entrada específica, ignorando as outras.
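A combinação linear convexa descrita acima cabe em poucas linhas. Um esboço ilustrativo (tamanhos arbitrários), em que os coeficientes vêm de um softmax sobre pontuações z que, na prática, seriam produzidas por outra rede a partir dos dados:

```python
import torch

torch.manual_seed(0)
x = torch.randn(5, 8)         # cinco vetores x_i, cada um de dimensão 8
z = torch.randn(5)            # pontuações dependentes dos dados (aqui, aleatórias)

w = torch.softmax(z, dim=0)   # coeficientes entre 0 e 1 que somam 1
saida = w @ x                 # combinação linear convexa dos x_i
print(w.sum().item(), saida.shape)   # ~1.0  torch.Size([8])
```

Fazer um dos coeficientes se aproximar de 1 equivale a "focar a atenção" em um único xᵢ, ignorando os demais.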
0:58:02.880,0:58:05.240 -The choice of this is made by another variable Z +A escolha desta é feita por outra variável Z 0:58:05.790,0:58:09.679 -Which itself could be the output to some other neural net that looks at Xs for example +O que em si poderia ser a saída para alguma outra rede neural que analisa Xs, por exemplo 0:58:10.740,0:58:12.270 -okay, and +tudo bem e 0:58:12.270,0:58:18.409 -This has become a hugely important type of function, it's used in a lot of different situations now +Isso se tornou um tipo de função extremamente importante, é usado em muitas situações diferentes agora 0:58:19.440,0:58:22.700 -In particular it's used in LSTM and GRU but it's also used in +Em particular, é usado em LSTM e GRU, mas também é usado em 0:58:26.730,0:58:30.020 -Pretty much every natural language processing system nowadays that use +Praticamente todos os sistemas de processamento de linguagem natural hoje em dia que usam 0:58:31.830,0:58:37.939 -Either transformer architectures or all the types of attention they all use this kind of this kind of trick +Ou arquiteturas de transformadores ou todos os tipos de atenção que todos usam esse tipo de truque 0:58:43.280,0:58:46.570 -Okay, so you have a vector Z pass it to a softmax +Ok, então você tem um vetor Z passando para um softmax 0:58:46.570,0:58:52.509 -You get a bunch of numbers between 0 & 1 that sum to 1 use those as coefficient to compute a weighted sum +Você obtém um monte de números entre 0 e 1 que somam a 1, use-os como coeficiente para calcular uma soma ponderada 0:58:52.700,0:58:54.560 -of a bunch of vectors X +de um monte de vetores X 0:58:54.560,0:58:56.589 -xᵢ and you get the weighted sum +xᵢ e você obtém a soma ponderada 0:58:57.290,0:59:00.070 -Weighted by those coefficients those coefficients are data dependent +Ponderados por esses coeficientes, esses coeficientes são dependentes de dados 0:59:00.890,0:59:02.890 -Because Z is data dependent +Porque Z é dependente de dados 0:59:05.390,0:59:07.390 -All right, so +Tudo bem, então 0:59:09.800,0:59:13.659 -Here's an example of how you use this whenever you have this symbol here +Aqui está um exemplo de como você usa isso sempre que tiver este símbolo aqui 0:59:15.530,0:59:17.859 -This circle with the dots in the middle, that's a +Este círculo com os pontos no meio, é um 0:59:20.510,0:59:26.739 -Component by component multiplication of two vectors some people call this Hadamard product +Multiplicação componente por componente de dois vetores, algumas pessoas chamam esse produto de Hadamard 0:59:29.660,0:59:34.629 -Anyway, it's turn-by-turn multiplication. So this is a +De qualquer forma, é a multiplicação passo a passo. Então isso é um 0:59:36.200,0:59:41.020 -a type of a kind of functional module +um tipo de um tipo de módulo funcional 0:59:43.220,0:59:47.409 -GRU, gated recurrent Nets, was proposed by Kyunghyun Cho who is professor here +GRU, gated recorrente Nets, foi proposto por Kyunghyun Cho que é professor aqui 0:59:50.420,0:59:51.880 -And it attempts +E ele tenta 0:59:51.880,0:59:54.430 -It's an attempt at fixing the problem that naturally occur in +É uma tentativa de corrigir o problema que ocorre naturalmente em 0:59:54.560,0:59:58.479 -recurrent Nets that I mentioned the fact that you have exploding gradient the fact that the +Nets recorrentes que mencionei o fato de você ter explodindo gradiente o fato de que o -1:00:00.050,1:00:04.629 -recurrent nets don't really remember their states for very long. 
They tend to kind of forget really quickly +0:00:00.050,0:00:04.629 +as redes recorrentes realmente não se lembram de seus estados por muito tempo. Eles tendem a esquecer muito rapidamente -1:00:05.150,1:00:07.540 -And so it's basically a memory cell +0:00:05.150,0:00:07.540 +E então é basicamente uma célula de memória -1:00:08.060,1:00:14.080 -Okay, and I have to say this is the kind of second big family of sort of +0:00:08.060,0:00:14.080 +Ok, e eu tenho que dizer que este é o tipo de segunda grande família do tipo -1:00:16.820,1:00:20.919 -Recurrent net with memory. The first one is LSTM, but I'm going to talk about it just afterwards +0:00:16.820,0:00:20.919 +Rede recorrente com memória. O primeiro é o LSTM, mas vou falar dele logo depois -1:00:21.650,1:00:23.650 -Just because this one is a little simpler +0:00:21.650,0:00:23.650 +Só porque este é um pouco mais simples -1:00:24.950,1:00:27.550 -The equations are written at the bottom here so +0:00:24.950,0:00:27.550 +As equações são escritas na parte inferior aqui, então -1:00:28.280,1:00:30.280 -basically, there is a +0:00:28.280,0:00:30.280 +basicamente, existe um -1:00:31.280,1:00:32.839 -a +0:00:31.280,0:00:32.839 +uma -1:00:32.839,1:00:34.839 -gating vector Z +0:00:32.839,0:00:34.839 +vetor Z -1:00:35.720,1:00:37.550 -which is +0:00:35.720,0:00:37.550 +qual é -1:00:37.550,1:00:41.919 -simply the application of a nonlinear function the sigmoid function +0:00:37.550,0:00:41.919 +simplesmente a aplicação de uma função não linear a função sigmóide -1:00:42.950,1:00:44.089 -to +0:00:42.950,0:00:44.089 +para -1:00:44.089,1:00:49.119 -two linear layers and a bias and those two linear layers take into account the input X(t) and +0:00:44.089,0:00:49.119 +duas camadas lineares e uma polarização e essas duas camadas lineares levam em consideração a entrada X(t) e -1:00:49.400,1:00:54.389 -The previous state which they did note H in their case, not Z like I did +0:00:49.400,0:00:54.389 +O estado anterior em que eles notaram H no caso deles, não Z como eu fiz -1:00:55.930,1:01:01.889 -Okay, so you take X you take H you compute matrices +0:00:55.930,0:01:01.889 +Ok, então você pega X você pega H você calcula matrizes -1:01:02.950,1:01:04.140 -You pass a result +0:01:02.950,0:01:04.140 +Você passa um resultado -1:01:04.140,1:01:07.440 -you add the results you pass them through sigmoid functions and you get a bunch of +0:01:04.140,0:01:07.440 +você adiciona os resultados, passa-os por funções sigmoid e obtém um monte de -1:01:07.539,1:01:11.939 -values between 0 & 1 because the sigmoid is between 0 & 1 gives you a coefficient and +0:01:07.539,0:01:11.939 +valores entre 0 e 1 porque o sigmóide está entre 0 e 1 fornece um coeficiente e -1:01:14.140,1:01:16.140 -You use those coefficients +0:01:14.140,0:01:16.140 +Você usa esses coeficientes -1:01:16.660,1:01:20.879 -You see the formula at the bottom the Z is used to basically compute a linear combination +0:01:16.660,0:01:20.879 +Você vê a fórmula na parte inferior o Z é usado basicamente para calcular uma combinação linear -1:01:21.700,1:01:24.210 -of two inputs if Z is equal to 1 +0:01:21.700,0:01:24.210 +de duas entradas se Z for igual a 1 -1:01:25.420,1:01:28.379 -You basically only look at h(t-1). If Z +0:01:25.420,0:01:28.379 +Você basicamente só olha para h(t-1). 
Se Z -1:01:29.859,1:01:35.669 -Is equal to 0 then 1 - Z is equal to 1 then you you look at this +0:01:29.859,0:01:35.669 +É igual a 0 então 1 - Z é igual a 1 então você olha para isso -1:01:36.400,1:01:38.109 -expression here and +0:01:36.400,0:01:38.109 +expressão aqui e -1:01:38.109,1:01:43.528 -That expression is, you know some weight matrix multiplied by the input passed through a hyperbolic tangent function +0:01:38.109,0:01:43.528 +Essa expressão é, você conhece alguma matriz de peso multiplicada pela entrada passada por uma função tangente hiperbólica -1:01:43.529,1:01:46.439 -It could be a ReLU but it's a hyperbolic tangent in this case +0:01:43.529,0:01:46.439 +Pode ser um ReLU, mas é uma tangente hiperbólica neste caso -1:01:46.839,1:01:49.528 -And it's combined with other stuff here that we can ignore for now +0:01:46.839,0:01:49.528 +E é combinado com outras coisas aqui que podemos ignorar por enquanto -1:01:50.829,1:01:58.439 -Okay. So basically what what the Z value does is that it tells the system just copy if Z equal 1 it just copies its +0:01:50.829,0:01:58.439 +OK. Então, basicamente, o que o valor Z faz é dizer ao sistema que apenas copie se Z for igual a 1, ele apenas copia seu -1:01:58.440,1:02:00.440 -previous state and ignores the input +0:01:58.440,0:02:00.440 +estado anterior e ignora a entrada -1:02:00.789,1:02:04.978 -Ok, so it acts like a memory essentially. It just copies its previous state on its output +0:02:00.789,0:02:04.978 +Ok, então ele age essencialmente como uma memória. Ele apenas copia seu estado anterior em sua saída -1:02:06.430,1:02:08.430 -and if Z +0:02:06.430,0:02:08.430 +e se Z -1:02:09.549,1:02:17.189 -Equals 0 then the current state is forgotten essentially and is basically you would you just read the input +0:02:09.549,0:02:17.189 +É igual a 0, então o estado atual é esquecido essencialmente e é basicamente você leria a entrada -1:02:19.450,1:02:24.629 -Ok multiplied by some matrix so it changes the state of the system +0:02:19.450,0:02:24.629 +Ok multiplicado por alguma matriz para alterar o estado do sistema -1:02:28.960,1:02:35.460 -Yeah, you do this component by component essentially, okay vector 1 yeah exactly +0:02:28.960,0:02:35.460 +Sim, você faz esse componente por componente essencialmente, ok vetor 1 sim exatamente -1:02:47.500,1:02:53.459 -Well, it's just like the number of independent multiplications, right, what is the derivative of +0:02:47.500,0:02:53.459 +Bem, é como o número de multiplicações independentes, certo, qual é a derivada de -1:02:54.880,1:02:59.220 -some objective function with respect to the input of a product. It's equal to the +0:02:54.880,0:02:59.220 +alguma função objetivo em relação à entrada de um produto. É igual ao -1:03:01.240,1:03:07.829 -Derivative of that objective function with respect to the add, to the product multiplied by the other term. That's the as simple as that +0:03:01.240,0:03:07.829 +Derivada dessa função objetivo em relação à soma, ao produto multiplicado pelo outro termo. 
Isso é tão simples quanto isso -1:03:18.039,1:03:20.039 -So it's because by default +0:03:18.039,0:03:20.039 +Então é porque por padrão -1:03:20.529,1:03:22.529 -essentially unless Z is +0:03:20.529,0:03:22.529 +essencialmente, a menos que Z seja -1:03:23.619,1:03:25.509 -your Z is +0:03:23.619,0:03:25.509 +seu Z é -1:03:25.509,1:03:30.689 -More less by default equal to one and so by default the system just copies its previous state +0:03:25.509,0:03:30.689 +Mais menos por padrão igual a um e assim por padrão o sistema apenas copia seu estado anterior -1:03:33.039,1:03:35.999 -And if it's just you know slightly less than one it +0:03:33.039,0:03:35.999 +E se é só você sabe um pouco menos de um -1:03:37.210,1:03:42.539 -It puts a little bit of the input into the state but doesn't significantly change the state and what that means. Is that it +0:03:37.210,0:03:42.539 +Ele coloca um pouco da entrada no estado, mas não altera significativamente o estado e o que isso significa. É isso -1:03:43.630,1:03:44.799 -preserves +0:03:43.630,0:03:44.799 +preserva -1:03:44.799,1:03:46.919 -Norm, and it preserves information, right? +0:03:44.799,0:03:46.919 +Norma, e preserva a informação, certo? -1:03:48.940,1:03:53.099 -Since basically memory cell that you can change continuously +0:03:48.940,0:03:53.099 +Desde basicamente célula de memória que você pode mudar continuamente -1:04:00.480,1:04:04.159 -Well because you need something between zero and one it's a coefficient, right +0:04:00.480,0:04:04.159 +Bem, porque você precisa de algo entre zero e um, é um coeficiente, certo -1:04:04.160,1:04:07.789 -And so it needs to be between zero and one that's what we do sigmoids +0:04:04.160,0:04:07.789 +E por isso precisa estar entre zero e um é o que fazemos sigmoids -1:04:11.850,1:04:13.080 -I +0:04:11.850,0:04:13.080 +eu -1:04:13.080,1:04:16.850 -mean you need one that is monotonic that goes between 0 and 1 and +0:04:13.080,0:04:16.850 +significa que você precisa de um que seja monotônico que vá entre 0 e 1 e -1:04:17.970,1:04:20.059 -is monotonic and differentiable I mean +0:04:17.970,0:04:20.059 +é monotônico e diferenciável, quero dizer -1:04:20.730,1:04:22.849 -There's lots of sigmoid functions, but you know +0:04:20.730,0:04:22.849 +Há muitas funções sigmóides, mas você sabe -1:04:24.000,1:04:26.000 -Why not? +0:04:24.000,0:04:26.000 +Por que não? -1:04:26.100,1:04:29.779 -Yeah, I mean there is some argument for using others, but you know doesn't make a huge +0:04:26.100,0:04:29.779 +Sim, quero dizer, há algum argumento para usar outros, mas você sabe que não faz um grande -1:04:30.540,1:04:32.540 -amount of difference +0:04:30.540,0:04:32.540 +quantidade de diferença -1:04:32.700,1:04:37.009 -Okay in the full form of gru. there is also a reset gate. So the reset gate is +0:04:32.700,0:04:37.009 +Ok, na forma completa de gru. há também um portão de reset. Então a porta de reset é -1:04:37.650,1:04:44.989 -Is this guy here? So R is another vector that's computed also as a linear combination of inputs and previous state and +0:04:37.650,0:04:44.989 +Esse cara está aqui? Então R é outro vetor que é calculado também como uma combinação linear de entradas e estado anterior e -1:04:45.660,1:04:51.319 -It serves to multiply the previous state. So if R is 0 then the previous state is +0:04:45.660,0:04:51.319 +Serve para multiplicar o estado anterior. 
Então, se R é 0, então o estado anterior é -1:04:52.020,1:04:54.410 -if R is 0 and Z is 1 +0:04:52.020,0:04:54.410 +se R é 0 e Z é 1 -1:04:55.950,1:05:00.499 -The system is basically completely reset to 0 because that is 0 +0:04:55.950,0:05:00.499 +O sistema é basicamente completamente redefinido para 0 porque isso é 0 -1:05:01.350,1:05:03.330 -So it only looks at the input +0:05:01.350,0:05:03.330 +Então ele só olha para a entrada -1:05:03.330,1:05:09.950 -But that's basically a simplified version of something that came out way earlier in 1997 called +0:05:03.330,0:05:09.950 +Mas isso é basicamente uma versão simplificada de algo que surgiu no início de 1997 chamado -1:05:10.260,1:05:12.260 -LSTM long short-term memory +0:05:10.260,0:05:12.260 +Memória de curto prazo longa LSTM -1:05:13.050,1:05:14.820 -Which you know attempted +0:05:13.050,0:05:14.820 +Que você sabe que tentou -1:05:14.820,1:05:19.519 -Which was an attempt at solving the same issue that you know recurrent Nets basically lose memory for too long +0:05:14.820,0:05:19.519 +Que foi uma tentativa de resolver o mesmo problema que você sabe que as redes recorrentes basicamente perdem memória por muito tempo -1:05:19.520,1:05:21.520 -and so you build them as +0:05:19.520,0:05:21.520 +e então você os constrói como -1:05:22.860,1:05:26.120 -As memory cells by default and by default they will preserve the information +0:05:22.860,0:05:26.120 +Como células de memória por padrão e por padrão, elas preservarão as informações -1:05:26.760,1:05:28.430 -It's essentially the same idea here +0:05:26.760,0:05:28.430 +É essencialmente a mesma ideia aqui -1:05:28.430,1:05:33.979 -It's a you know, the details are slightly different here don't have dots in the middle of the round shape here for the product +0:05:28.430,0:05:33.979 +É um você sabe, os detalhes são um pouco diferentes aqui não tem pontos no meio da forma redonda aqui para o produto -1:05:33.980,1:05:35.610 -But it's the same thing +0:05:33.980,0:05:35.610 +Mas é a mesma coisa -1:05:35.610,1:05:41.539 -And there's a little more kind of moving parts. It's basically it looks more like an actual run sale +0:05:35.610,0:05:41.539 +E há um pouco mais de peças móveis. 
É basicamente parece mais uma venda real -1:05:41.540,1:05:44.060 -So it's like a flip-flop they can you know preserve +0:05:41.540,0:05:44.060 +Então é como um flip-flop que você pode saber preservar -1:05:44.430,1:05:48.200 -Information and there is some leakage that you can have, you can reset it to 0 or to 1 +0:05:44.430,0:05:48.200 +Informações e há algum vazamento que você pode ter, você pode redefini-lo para 0 ou para 1 -1:05:48.810,1:05:50.810 -It's fairly complicated +0:05:48.810,0:05:50.810 +É bastante complicado -1:05:52.050,1:05:59.330 -Thankfully people at NVIDIA Facebook Google and various other places have very efficient implementations of those so you don't need to +0:05:52.050,0:05:59.330 +Felizmente, as pessoas da NVIDIA Facebook Google e vários outros lugares têm implementações muito eficientes para que você não precise -1:05:59.550,1:06:01.550 -figure out how to write the +0:05:59.550,0:06:01.550 +descobrir como escrever o -1:06:01.620,1:06:03.710 -CUDA code for this or write the back pop +0:06:01.620,0:06:03.710 +Código CUDA para isso ou escreva o pop de volta -1:06:05.430,1:06:07.430 -Works really well +0:06:05.430,0:06:07.430 +Funciona muito bem -1:06:07.500,1:06:12.689 -it's it's quite what you'd use but it's used less and less because +0:06:07.500,0:06:12.689 +é bem o que você usaria, mas é usado cada vez menos porque -1:06:13.539,1:06:15.539 -people use recurrent Nets +0:06:13.539,0:06:15.539 +as pessoas usam redes recorrentes -1:06:16.150,1:06:18.210 -people used to use recurrent Nets for natural language processing +0:06:16.150,0:06:18.210 +as pessoas costumavam usar redes recorrentes para processamento de linguagem natural -1:06:19.329,1:06:21.220 -mostly and +0:06:19.329,0:06:21.220 +principalmente e -1:06:21.220,1:06:25.949 -Things like speech recognition and speech recognition is moving towards using convolutional Nets +0:06:21.220,0:06:25.949 +Coisas como reconhecimento de fala e reconhecimento de fala estão se movendo para o uso de redes convolucionais -1:06:27.490,1:06:29.200 -temporal conditional Nets +0:06:27.490,0:06:29.200 +Redes condicionais temporais -1:06:29.200,1:06:34.109 -while the natural language processing is moving towards using what's called transformers +0:06:29.200,0:06:34.109 +enquanto o processamento de linguagem natural está se movendo para usar o que é chamado de transformadores -1:06:34.630,1:06:36.900 -Which we'll hear a lot about tomorrow, right? +0:06:34.630,0:06:36.900 +Sobre o qual ouviremos muito amanhã, certo? -1:06:37.630,1:06:38.950 -no? +0:06:37.630,0:06:38.950 +não? 
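Voltando às equações do GRU descritas um pouco acima, segue um esboço didático de uma célula, seguindo a convenção da aula (z perto de 1 copia o estado anterior; r perto de 0 apaga o estado). É apenas uma ilustração nossa; na prática usam-se as implementações eficientes prontas, como `torch.nn.GRU` e `torch.nn.LSTM`:

```python
import torch
import torch.nn as nn

class CelulaGRU(nn.Module):
    """Esboço didático de uma célula GRU (não otimizado)."""
    def __init__(self, dim_entrada, dim_oculta):
        super().__init__()
        self.Wz = nn.Linear(dim_entrada + dim_oculta, dim_oculta)  # porta de atualização z
        self.Wr = nn.Linear(dim_entrada + dim_oculta, dim_oculta)  # porta de reinício r
        self.Wh = nn.Linear(dim_entrada + dim_oculta, dim_oculta)  # estado candidato

    def forward(self, x, h_ant):
        xh = torch.cat([x, h_ant], dim=-1)
        z = torch.sigmoid(self.Wz(xh))            # z ~ 1: apenas copia o estado anterior
        r = torch.sigmoid(self.Wr(xh))            # r ~ 0: ignora o estado anterior
        h_cand = torch.tanh(self.Wh(torch.cat([x, r * h_ant], dim=-1)))
        return z * h_ant + (1 - z) * h_cand       # combinação convexa entre memória e entrada nova

x = torch.randn(4, 10)   # lote de 4 exemplos, entrada de dimensão 10
h = torch.zeros(4, 20)   # estado oculto de dimensão 20
print(CelulaGRU(10, 20)(x, h).shape)   # torch.Size([4, 20])
```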
-1:06:38.950,1:06:40.950 -when +0:06:38.950,0:06:40.950 +quando -1:06:41.109,1:06:43.109 -two weeks from now, okay +0:06:41.109,0:06:43.109 +daqui a duas semanas, ok -1:06:46.599,1:06:48.599 -So what transformers are +0:06:46.599,0:06:48.599 +Então, quais são os transformadores -1:06:49.119,1:06:51.119 -Okay, I'm not gonna talk about transformers just now +0:06:49.119,0:06:51.119 +Ok, eu não vou falar sobre transformadores agora -1:06:51.759,1:06:56.219 -but these key transformers are kind of a generalization so +0:06:51.759,0:06:56.219 +mas esses transformadores de chave são uma espécie de generalização, então -1:06:57.009,1:06:58.619 -General use of attention if you want +0:06:57.009,0:06:58.619 +Uso geral de atenção se quiser -1:06:58.619,1:07:02.038 -So the big neural Nets that use attention that you know +0:06:58.619,0:07:02.038 +Então, as grandes redes neurais que usam a atenção que você conhece -1:07:02.039,1:07:06.329 -Every block of neuron uses attention and that tends to work pretty well it works +0:07:02.039,0:07:06.329 +Cada bloco de neurônio usa atenção e isso tende a funcionar muito bem, funciona -1:07:06.329,1:07:09.538 -So well that people are kind of basically dropping everything else for NLP +0:07:06.329,0:07:09.538 +Tão bem que as pessoas estão basicamente largando todo o resto pela PNL -1:07:10.869,1:07:12.869 -so the problem is +0:07:10.869,0:07:12.869 +então o problema é -1:07:13.269,1:07:15.299 -Systems like LSTM are not very good at this so +0:07:13.269,0:07:15.299 +Sistemas como o LSTM não são muito bons nisso, então -1:07:16.599,1:07:20.219 -Transformers are much better. The biggest transformers have billions of parameters +0:07:16.599,0:07:20.219 +Os transformadores são muito melhores. Os maiores transformadores têm bilhões de parâmetros -1:07:21.430,1:07:26.879 -Like the biggest one is by 15 billion something like that that order of magnitude the t5 or whatever it's called +0:07:21.430,0:07:26.879 +Como o maior é por 15 bilhões algo assim nessa ordem de magnitude o t5 ou o que quer que seja chamado -1:07:27.910,1:07:29.910 -from Google so +0:07:27.910,0:07:29.910 +do Google então -1:07:30.460,1:07:36.779 -That's an enormous amount of memory and it's because of the particular type of architecture that's used in transformers +0:07:30.460,0:07:36.779 +Isso é uma enorme quantidade de memória e é por causa do tipo particular de arquitetura que é usado em transformadores -1:07:36.779,1:07:40.319 -They they can actually store a lot of knowledge if you want +0:07:36.779,0:07:40.319 +Eles podem realmente armazenar muito conhecimento, se você quiser -1:07:41.289,1:07:43.559 -So that's the stuff people would use for +0:07:41.289,0:07:43.559 +Então essas são as coisas que as pessoas usariam para -1:07:44.440,1:07:47.069 -What you're talking about like question answering systems +0:07:44.440,0:07:47.069 +Do que você está falando, como sistemas de resposta a perguntas -1:07:47.769,1:07:50.099 -Translation systems etc. They will use transformers +0:07:47.769,0:07:50.099 +Sistemas de tradução etc. 
Eles usarão transformadores -1:07:52.869,1:07:54.869 -Okay +0:07:52.869,0:07:54.869 +OK -1:07:57.619,1:08:01.778 -So because LSTM kind of was sort of you know one of the first +0:07:57.619,0:08:01.778 +Então, porque o LSTM meio que foi meio que você sabe, um dos primeiros -1:08:02.719,1:08:04.958 -architectures recurrent architecture that kind of worked +0:08:02.719,0:08:04.958 +arquiteturas arquitetura recorrente que meio que funcionou -1:08:05.929,1:08:11.408 -People tried to use them for things that at first you would think are crazy but turned out to work +0:08:05.929,0:08:11.408 +As pessoas tentaram usá-los para coisas que a princípio você acharia loucura, mas acabou funcionando -1:08:12.109,1:08:16.689 -And one example of this is translation. It's called neural machine translation +0:08:12.109,0:08:16.689 +E um exemplo disso é a tradução. Chama-se tradução automática neural -1:08:17.509,1:08:19.509 -So there was a paper +0:08:17.509,0:08:19.509 +Então havia um papel -1:08:19.639,1:08:22.149 -by Ilya Sutskever at NIPS 2014 where he +0:08:19.639,0:08:22.149 +por Ilya Sutskever no NIPS 2014 onde ele -1:08:22.969,1:08:29.799 -Trained this giant multi-layer LSTM. So what's a multi-layered LSTM? It's an LSTM where you have +0:08:22.969,0:08:29.799 +Treinou este gigante LSTM multicamada. Então, o que é um LSTM multicamadas? É um LSTM onde você tem -1:08:30.589,1:08:36.698 -so it's the unfolded version, right? So at the bottom here you have an LSTM which is here unfolded for three time steps +0:08:30.589,0:08:36.698 +então é a versão desdobrada, certo? Então, na parte inferior, você tem um LSTM que é desdobrado em três etapas de tempo -1:08:36.699,1:08:41.618 -But it will have to be unfolded for the length of a sentence you want to translate, let's say a +0:08:36.699,0:08:41.618 +Mas terá que ser desdobrado no comprimento de uma frase que você deseja traduzir, digamos um -1:08:42.259,1:08:43.969 -sentence in French +0:08:42.259,0:08:43.969 +frase em francês -1:08:43.969,1:08:45.529 -and +0:08:43.969,0:08:45.529 +e -1:08:45.529,1:08:48.038 -And then you take the hidden +0:08:45.529,0:08:48.038 +E então você pega o oculto -1:08:48.289,1:08:53.709 -state at every time step of this LSTM and you feed that as input to a second LSTM and +0:08:48.289,0:08:53.709 +estado em cada passo de tempo deste LSTM e você alimenta isso como entrada para um segundo LSTM e -1:08:53.929,1:08:55.150 -I think in his network +0:08:53.929,0:08:55.150 +acho que na rede dele -1:08:55.150,1:08:58.329 -he actually had four layers of that so you can think of this as a +0:08:55.150,0:08:58.329 +ele realmente tinha quatro camadas disso, então você pode pensar nisso como um -1:08:58.639,1:09:02.139 -Stacked LSTM that you know each of them are recurrent in time +0:08:58.639,0:09:02.139 +LSTM empilhado que você sabe que cada um deles é recorrente no tempo -1:09:02.139,1:09:05.589 -But they are kind of stacked as the layers of a neural net +0:09:02.139,0:09:05.589 +Mas eles são empilhados como as camadas de uma rede neural -1:09:06.500,1:09:07.670 -so +0:09:06.500,0:09:07.670 +assim -1:09:07.670,1:09:14.769 -At the last time step in the last layer, you have a vector here, which is meant to represent the entire meaning of that sentence +0:09:07.670,0:09:14.769 +No último passo de tempo na última camada, você tem um vetor aqui, que deve representar todo o significado dessa frase -1:09:16.309,1:09:18.879 -Okay, so it could be a fairly large vector +0:09:16.309,0:09:18.879 +Ok, então pode ser um vetor bastante grande 
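Um esboço mínimo da ideia de codificador–decodificador com LSTMs empilhados descrita acima. Vocabulários, dimensões e o token inicial são hipotéticos, escolhidos só para ilustrar; não é o sistema do artigo citado:

```python
import torch
import torch.nn as nn

V_origem, V_destino, dim = 1000, 1200, 256   # tamanhos de vocabulário e dimensão (hipotéticos)

emb_origem = nn.Embedding(V_origem, dim)
codificador = nn.LSTM(dim, dim, num_layers=4, batch_first=True)   # LSTM "empilhado" de 4 camadas
emb_destino = nn.Embedding(V_destino, dim)
decodificador = nn.LSTM(dim, dim, num_layers=4, batch_first=True)
proxima_palavra = nn.Linear(dim, V_destino)

frase = torch.randint(0, V_origem, (1, 12))    # frase de entrada com 12 tokens
_, (h, c) = codificador(emb_origem(frase))     # (h, c): o "significado" comprimido da frase

token = torch.zeros(1, 1, dtype=torch.long)    # token inicial hipotético
for _ in range(15):                            # gera até 15 palavras, uma por vez
    saida, (h, c) = decodificador(emb_destino(token), (h, c))
    token = proxima_palavra(saida[:, -1]).argmax(-1, keepdim=True)  # realimenta a palavra gerada
```

Note que a palavra produzida em cada passo é realimentada como entrada do passo seguinte, exatamente como descrito na fala; em um sistema real a geração pararia em um token de fim de frase.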
-1:09:19.849,1:09:24.819 -and then you feed that to another multi-layer LSTM, which +0:09:19.849,0:09:24.819 +e então você alimenta isso para outro LSTM multicamada, que -1:09:27.319,1:09:31.028 -You know you run for a sort of undetermined number of steps and +0:09:27.319,0:09:31.028 +Você sabe que corre por uma espécie de número indeterminado de passos e -1:09:32.119,1:09:37.209 -The role of this LSTM is to produce words in a target language if you do translation say German +0:09:32.119,0:09:37.209 +O papel deste LSTM é produzir palavras em um idioma de destino, se você fizer tradução, diga alemão -1:09:38.869,1:09:40.839 -Okay, so this is time, you know +0:09:38.869,0:09:40.839 +Ok, então esta é a hora, você sabe -1:09:40.839,1:09:44.499 -It takes the state you run through the first two layers of the LSTM +0:09:40.839,0:09:44.499 +Leva o estado que você executa nas duas primeiras camadas do LSTM -1:09:44.630,1:09:48.849 -Produce a word and then take that word and feed it as input to the next time step +0:09:44.630,0:09:48.849 +Produza uma palavra e, em seguida, pegue essa palavra e alimente-a como entrada para a próxima etapa de tempo -1:09:49.940,1:09:52.359 -So that you can generate text sequentially, right? +0:09:49.940,0:09:52.359 +Para que você possa gerar texto sequencialmente, certo? -1:09:52.909,1:09:58.899 -Run through this produce another word take that word feed it back to the input and keep going. So this is a +0:09:52.909,0:09:58.899 +Percorra isso, produza outra palavra, leve essa palavra de volta à entrada e continue. Então isso é um -1:10:00.619,1:10:02.619 -Should do this for translation you get this gigantic +0:10:00.619,0:10:02.619 +Deve fazer isso para tradução você fica com esse gigantesco -1:10:03.320,1:10:07.480 -Neural net you train and this is the it's a system of this type +0:10:03.320,0:10:07.480 +Rede neural você treina e esse é o sistema desse tipo -1:10:07.480,1:10:12.010 -The one that Sutskever represented at NIPS 2014 it was was the first neural +0:10:07.480,0:10:12.010 +Aquele que Sutskever representou no NIPS 2014 foi o primeiro neural -1:10:13.130,1:10:19.209 -Translation system that had performance that could rival sort of more classical approaches not based on neural nets +0:10:13.130,0:10:19.209 +Sistema de tradução que teve desempenho que poderia rivalizar com abordagens mais clássicas não baseadas em redes neurais -1:10:21.350,1:10:23.950 -And people were really surprised that you could get such results +0:10:21.350,0:10:23.950 +E as pessoas ficaram realmente surpresas que você pudesse obter tais resultados -1:10:26.840,1:10:28.840 -That success was very short-lived +0:10:26.840,0:10:28.840 +Esse sucesso durou muito pouco -1:10:31.280,1:10:33.280 -Yeah, so the problem is +0:10:31.280,0:10:33.280 +Sim, então o problema é -1:10:34.340,1:10:37.449 -The word you're gonna say at a particular time depends on the word you just said +0:10:34.340,0:10:37.449 +A palavra que você vai dizer em um determinado momento depende da palavra que você acabou de dizer -1:10:38.180,1:10:41.320 -Right, and if you ask the system to just produce a word +0:10:38.180,0:10:41.320 +Certo, e se você pedir ao sistema para produzir apenas uma palavra -1:10:42.800,1:10:45.729 -And then you don't feed that word back to the input +0:10:42.800,0:10:45.729 +E então você não alimenta essa palavra de volta para a entrada -1:10:45.730,1:10:49.120 -the system could be used in other word that has that is inconsistent with the previous one you produced +0:10:45.730,0:10:49.120 +o sistema 
pode ser usado em outra palavra que seja inconsistente com o anterior que você produziu -1:10:55.790,1:10:57.790 -It should but it doesn't +0:10:55.790,0:10:57.790 +Deveria, mas não -1:10:58.760,1:11:05.590 -I mean not well enough that that it works. So so this is so this is kind of sequential production is pretty much required +0:10:58.760,0:11:05.590 +Quero dizer, não bem o suficiente para que funcione. Então, isso é um tipo de produção sequencial que é praticamente necessária -1:11:07.790,1:11:09.790 -In principle, you're right +0:11:07.790,0:11:09.790 +Em princípio, você está certo -1:11:10.910,1:11:12.910 -It's not very satisfying +0:11:10.910,0:11:12.910 +Não é muito satisfatório -1:11:13.610,1:11:19.089 -so there's a problem with this which is that the entire meaning of the sentence has to be kind of squeezed into +0:11:13.610,0:11:19.089 +então há um problema com isso que é que todo o significado da frase tem que ser meio que espremido -1:11:19.430,1:11:22.419 -That hidden state that is between the encoder of the decoder +0:11:19.430,0:11:22.419 +Esse estado oculto que está entre o codificador do decodificador -1:11:24.530,1:11:29.829 -That's one problem the second problem is that despite the fact that that LSTM are built to preserve information +0:11:24.530,0:11:29.829 +Esse é um problema, o segundo problema é que, apesar do fato de que os LSTM são construídos para preservar informações -1:11:31.040,1:11:36.010 -They are basically memory cells. They don't actually preserve information for more than about 20 words +0:11:31.040,0:11:36.010 +São basicamente células de memória. Na verdade, eles não preservam informações por mais de 20 palavras -1:11:36.860,1:11:40.299 -So if your sentence is more than 20 words by the time you get to the end of the sentence +0:11:36.860,0:11:40.299 +Então, se sua frase tiver mais de 20 palavras quando você chegar ao final da frase -1:11:40.520,1:11:43.270 -Your your hidden state will have forgotten the beginning of it +0:11:40.520,0:11:43.270 +Seu seu estado oculto terá esquecido o início dele -1:11:43.640,1:11:49.269 -so what people use for this the fix for this is a huge hack is called BiLSTM and +0:11:43.640,0:11:49.269 +então o que as pessoas usam para isso, a correção para isso é um enorme hack chamado BiLSTM e -1:11:50.060,1:11:54.910 -It's a completely trivial idea that consists in running two LSTMs in opposite directions +0:11:50.060,0:11:54.910 +É uma ideia completamente trivial que consiste em executar dois LSTMs em direções opostas -1:11:56.210,1:11:59.020 -Okay, and then you get two codes one that is +0:11:56.210,0:11:59.020 +Ok, e então você recebe dois códigos, um que é -1:11:59.720,1:12:04.419 -running the LSTM from beginning to end of the sentence that's one vector and then the second vector is from +0:11:59.720,0:12:04.419 +executando o LSTM do início ao fim da frase que é um vetor e, em seguida, o segundo vetor é de -1:12:04.730,1:12:09.939 -Running an LSTM in the other direction you get a second vector. That's the meaning of your sentence +0:12:04.730,0:12:09.939 +Executando um LSTM na outra direção, você obtém um segundo vetor. 
Esse é o significado da sua frase -1:12:10.280,1:12:16.809 -You can basically double the length of your sentence without losing too much information this way, but it's not a very satisfying solution +0:12:10.280,0:12:16.809 +Você pode basicamente dobrar o comprimento de sua frase sem perder muita informação dessa maneira, mas não é uma solução muito satisfatória -1:12:17.120,1:12:19.450 -So if you see biLSTM, that's what that's what it is +0:12:17.120,0:12:19.450 +Então, se você vir biLSTM, é isso que é -1:12:22.830,1:12:29.179 -So as I said, the success was short-lived because in fact before the paper was published at NIPS +0:12:22.830,0:12:29.179 +Então, como eu disse, o sucesso durou pouco porque, na verdade, antes do artigo ser publicado no NIPS -1:12:30.390,1:12:32.390 -There was a paper by +0:12:30.390,0:12:32.390 +Havia um papel de -1:12:34.920,1:12:37.969 -Dzmitry Bahdanau, Kyunghyun Cho and Yoshua Bengio +0:12:34.920,0:12:37.969 +Dzmitry Bahdanau, Kyunghyun Cho e Yoshua Bengio -1:12:38.670,1:12:42.319 -which was published on arxiv in September 14 that said +0:12:38.670,0:12:42.319 +que foi publicado no arxiv em 14 de setembro que dizia -1:12:43.560,1:12:47.209 -We can use attention. So the attention mechanism I mentioned earlier +0:12:43.560,0:12:47.209 +Podemos usar a atenção. Então, o mecanismo de atenção que mencionei anteriormente -1:12:49.320,1:12:51.300 -Instead of having those gigantic +0:12:49.320,0:12:51.300 +Em vez de ter aqueles gigantescos -1:12:51.300,1:12:54.890 -Networks and squeezing the entire meaning of a sentence into this small vector +0:12:51.300,0:12:54.890 +Redes e espremendo todo o significado de uma frase neste pequeno vetor -1:12:55.800,1:12:58.190 -it would make more sense to the translation if +0:12:55.800,0:12:58.190 +faria mais sentido para a tradução se -1:12:58.710,1:13:03.169 -Every time said, you know, we want to produce a word in French corresponding to a sentence in English +0:12:58.710,0:13:03.169 +Cada vez dito, você sabe, queremos produzir uma palavra em francês correspondente a uma frase em inglês -1:13:04.469,1:13:08.509 -If we looked at the location in the English sentence that had that word +0:13:04.469,0:13:08.509 +Se olharmos para o local na frase em inglês que tinha essa palavra -1:13:09.390,1:13:10.620 -Okay +0:13:09.390,0:13:10.620 +OK -1:13:10.620,1:13:12.090 -so +0:13:10.620,0:13:12.090 +assim -1:13:12.090,1:13:17.540 -Our decoder is going to produce french words one at a time and when it comes to produce a word +0:13:12.090,0:13:17.540 +Nosso decodificador vai produzir palavras em francês uma de cada vez e quando se trata de produzir uma palavra -1:13:18.449,1:13:21.559 -that has an equivalent in the input english sentence it's +0:13:18.449,0:13:21.559 +que tem um equivalente na frase de entrada em inglês é -1:13:21.960,1:13:29.750 -going to focus its attention on that word and then the translation from French to English of that word would be simple or the +0:13:21.960,0:13:29.750 +vai focar sua atenção nessa palavra e então a tradução do francês para o inglês dessa palavra seria simples ou o -1:13:30.360,1:13:32.300 -You know, it may not be a single word +0:13:30.360,0:13:32.300 +Você sabe, pode não ser uma única palavra -1:13:32.300,1:13:34.050 -it could be a group of words right because +0:13:32.300,0:13:34.050 +poderia ser um grupo de palavras certo porque -1:13:34.050,1:13:39.590 -Very often you have to turn a group of word in English into a group of words in French to kind of say the same +0:13:34.050,0:13:39.590 +Muitas 
vezes você tem que transformar um grupo de palavras em inglês em um grupo de palavras em francês para dizer o mesmo -1:13:39.590,1:13:41.590 -thing if it's German you have to +0:13:39.590,0:13:41.590 +coisa se for alemão você tem que -1:13:42.150,1:13:43.949 -put the +0:13:42.150,0:13:43.949 +coloque o -1:13:43.949,1:13:47.479 -You know the verb at the end of the sentence whereas in English, it might be at the beginning +0:13:43.949,0:13:47.479 +Você conhece o verbo no final da frase, enquanto em inglês, pode estar no início -1:13:48.060,1:13:51.109 -So basically you use this attention mechanism +0:13:48.060,0:13:51.109 +Então, basicamente, você usa esse mecanismo de atenção -1:13:51.110,1:13:57.440 -so this attention module here is the one that I showed a couple slides earlier which basically decides +0:13:51.110,0:13:57.440 +então este módulo de atenção aqui é o que eu mostrei alguns slides antes que basicamente decide -1:13:58.739,1:14:04.428 -Which of the time steps which of the hidden representation for which other word in the input sentence it is going to focus on +0:13:58.739,0:14:04.428 +Em qual dos passos de tempo qual da representação oculta para qual outra palavra na sentença de entrada ela irá focar? -1:14:06.570,1:14:12.259 -To kind of produce a representation that is going to produce the current word at a particular time step +0:14:06.570,0:14:12.259 +Para produzir uma representação que produzirá a palavra atual em um determinado intervalo de tempo -1:14:12.260,1:14:15.320 -So here we're at time step number three, we're gonna produce a third word +0:14:12.260,0:14:15.320 +Então aqui estamos na etapa de tempo número três, vamos produzir uma terceira palavra -1:14:16.140,1:14:21.829 -And we're gonna have to decide which of the input word corresponds to this and we're gonna have this attention mechanism +0:14:16.140,0:14:21.829 +E teremos que decidir qual palavra de entrada corresponde a isso e teremos esse mecanismo de atenção -1:14:21.830,1:14:23.830 -so essentially we're gonna have a +0:14:21.830,0:14:23.830 +então, essencialmente, vamos ter um -1:14:25.140,1:14:28.759 -Small piece of neural net that's going to look at the the inputs on this side +0:14:25.140,0:14:28.759 +Pequeno pedaço de rede neural que vai olhar para as entradas deste lado -1:14:31.809,1:14:35.879 -It's going to have an output which is going to go through a soft max that is going to produce a bunch of +0:14:31.809,0:14:35.879 +Vai ter uma saída que vai passar por um soft max que vai produzir um monte de -1:14:35.979,1:14:42.269 -Coefficients that sum to 1 between 0 and 1 and they're going to compute a linear combination of the states at different time steps +0:14:35.979,0:14:42.269 +Coeficientes que somam 1 entre 0 e 1 e eles vão calcular uma combinação linear dos estados em diferentes etapas de tempo -1:14:43.719,1:14:48.899 -Ok by setting one of those coefficients to 1 and the other ones to 0 it is going to focus the attention of the system on +0:14:43.719,0:14:48.899 +Ok, definindo um desses coeficientes para 1 e os outros para 0, ele focará a atenção do sistema em -1:14:48.900,1:14:50.900 -one particular word +0:14:48.900,0:14:50.900 +uma palavra específica -1:14:50.949,1:14:56.938 -So the magic of this is that this neural net that decides that runs to the softmax and decides on those coefficients actually +0:14:50.949,0:14:56.938 +Então a mágica disso é que essa rede neural que decide que vai até o softmax e decide sobre esses coeficientes na verdade -1:14:57.159,1:14:59.159 -Can be trained with back 
prop is just another
+1:14:57.159,1:14:59.159
+Pode ser treinado com backprop, é apenas mais um

-1:14:59.590,1:15:03.420
-Set of weights in a neural net and you don't have to built it by hand. It just figures it out
+1:14:59.590,1:15:03.420
+Conjunto de pesos em uma rede neural, e você não precisa construí-lo manualmente. Ele simplesmente descobre sozinho

-1:15:06.550,1:15:10.979
-This completely revolutionized the field of neural machine translation in the sense that
+1:15:06.550,1:15:10.979
+Isso revolucionou completamente o campo da tradução automática neural no sentido de que

-1:15:11.889,1:15:13.889
-within a
+1:15:11.889,1:15:13.889
+em poucos

-1:15:14.050,1:15:20.309
-Few months team from Stanford won a big competition with this beating all the other methods
+1:15:14.050,1:15:20.309
+meses uma equipe de Stanford venceu uma grande competição com isso, batendo todos os outros métodos

-1:15:22.119,1:15:28.199
-And then within three months every big company that works on translation had basically deployed systems based on this
+1:15:22.119,1:15:28.199
+E então, em três meses, todas as grandes empresas que trabalham com tradução basicamente implantaram sistemas baseados nisso

-1:15:29.289,1:15:31.469
-So this just changed everything
+1:15:29.289,1:15:31.469
+Então isso simplesmente mudou tudo

-1:15:33.189,1:15:40.349
-And then people started paying attention to attention, okay pay more attention to attention in a sense that
+1:15:33.189,1:15:40.349
+E então as pessoas começaram a prestar atenção na atenção, ok, a prestar mais atenção na atenção, no sentido de que

-1:15:41.170,1:15:44.879
-And then there was a paper by a bunch of people at Google
+1:15:41.170,1:15:44.879
+E então houve um artigo de um monte de gente do Google

-1:15:45.729,1:15:52.529
-What the title was attention is all you need and It was basically a paper that solved a bunch of natural language processing tasks
+1:15:45.729,1:15:52.529
+Cujo título era 'atenção é tudo de que você precisa', e foi basicamente um artigo que resolveu um monte de tarefas de processamento de linguagem natural

-1:15:53.050,1:15:59.729
-by using a neural net where every layer, every group of neurons basically was implementing attention and that's what a
+1:15:53.050,1:15:59.729
+usando uma rede neural em que cada camada, cada grupo de neurônios, basicamente implementava a atenção, e é isso que um

-1:16:00.459,1:16:03.149
-Or something called self attention. That's what a transformer is
+1:16:00.459,1:16:03.149
+Ou algo chamado auto-atenção. 
Isso é o que é um transformador -1:16:08.829,1:16:15.449 -Yes, you can have a variable number of outputs of inputs that you focus attention on +0:16:08.829,0:16:15.449 +Sim, você pode ter um número variável de saídas de entradas nas quais você foca sua atenção -1:16:18.340,1:16:20.849 -Okay, I'm gonna talk now about memory networks +0:16:18.340,0:16:20.849 +Ok, vou falar agora sobre redes de memória -1:16:35.450,1:16:40.309 -So this stems from work at Facebook that was started by Antoine Bordes +0:16:35.450,0:16:40.309 +Então isso vem do trabalho no Facebook que foi iniciado por Antoine Bordes -1:16:41.970,1:16:43.970 -I think in 2014 and +0:16:41.970,0:16:43.970 +Acho que em 2014 e -1:16:45.480,1:16:47.480 -By +0:16:45.480,0:16:47.480 +De -1:16:49.650,1:16:51.799 -Sainbayar Sukhbaatar, I +0:16:49.650,0:16:51.799 +Sainbayar Sukhbaatar, eu -1:16:56.760,1:16:58.760 -Think in 2015 or 16 +0:16:56.760,0:16:58.760 +Pense em 2015 ou 16 -1:16:59.040,1:17:01.040 -Called end-to-end memory networks +0:16:59.040,0:17:01.040 +Redes de memória de ponta a ponta chamadas -1:17:01.520,1:17:06.890 -Sainbayar Sukhbaatar was a PhD student here and it was an intern at Facebook when he worked on this +0:17:01.520,0:17:06.890 +Sainbayar Sukhbaatar era estudante de doutorado aqui e era estagiário no Facebook quando trabalhou nisso -1:17:07.650,1:17:10.220 -together with a bunch of other people Facebook and +0:17:07.650,0:17:10.220 +junto com um monte de outras pessoas Facebook e -1:17:10.860,1:17:12.090 -the idea of memory +0:17:10.860,0:17:12.090 +a ideia de memória -1:17:12.090,1:17:17.270 -Network is that you'd like to have a short-term memory you'd like your neural net to have a short-term memory or working memory +0:17:12.090,0:17:17.270 +Rede é que você gostaria de ter uma memória de curto prazo você gostaria que sua rede neural tivesse uma memória de curto prazo ou memória de trabalho -1:17:18.300,1:17:23.930 -Okay, you'd like it to you know, you you tell okay, if I tell you a story I tell you +0:17:18.300,0:17:23.930 +Ok, você gostaria que você sabe, você conta tudo bem, se eu te contar uma história eu te conto -1:17:25.410,1:17:27.410 -John goes to the kitchen +0:17:25.410,0:17:27.410 +João vai para a cozinha -1:17:28.170,1:17:30.170 -John picks up the milk +0:17:28.170,0:17:30.170 +João pega o leite -1:17:34.440,1:17:36.440 -Jane goes to the kitchen +0:17:34.440,0:17:36.440 +Jane vai para a cozinha -1:17:37.290,1:17:40.910 -And then John goes to the bedroom and drops the milk there +0:17:37.290,0:17:40.910 +E então John vai para o quarto e joga o leite lá -1:17:41.430,1:17:44.899 -And then goes back to the kitchen and ask you. Where's the milk? Okay +0:17:41.430,0:17:44.899 +E então volta para a cozinha e te pergunta. Onde está o leite? OK -1:17:44.900,1:17:47.720 -so every time I had told you a sentence you kind of +0:17:44.900,0:17:47.720 +então toda vez que eu te disse uma frase você meio que -1:17:48.330,1:17:50.330 -updated in your mind a +0:17:48.330,0:17:50.330 +atualizado em sua mente um -1:17:50.340,1:17:52.340 -Kind of current state of the world if you want +0:17:50.340,0:17:52.340 +Tipo de estado atual do mundo, se você quiser -1:17:52.920,1:17:56.870 -and so by telling you the story you now you have a representation of the state to the world and if I ask you a +0:17:52.920,0:17:56.870 +e assim, ao contar a história, você agora tem uma representação do estado para o mundo e se eu lhe perguntar uma -1:17:56.870,1:17:59.180 -Question about the state of the world you can answer it. 
Okay +0:17:56.870,0:17:59.180 +Pergunta sobre o estado do mundo, você pode respondê-la. OK -1:18:00.270,1:18:02.270 -You store this in a short-term memory +0:18:00.270,0:18:02.270 +Você armazena isso em uma memória de curto prazo -1:18:03.720,1:18:06.769 -You didn't store it, ok, so there's kind of this +0:18:03.720,0:18:06.769 +Você não armazenou, ok, então tem isso -1:18:06.770,1:18:10.399 -There's a number of different parts in your brain, but it's two important parts, one is the cortex +0:18:06.770,0:18:10.399 +Há várias partes diferentes em seu cérebro, mas são duas partes importantes, uma é o córtex -1:18:10.470,1:18:13.279 -The cortex is where you have long term memory. Where you +0:18:10.470,0:18:13.279 +O córtex é onde você tem memória de longo prazo. Onde você -1:18:15.120,1:18:17.120 -You know you +0:18:15.120,0:18:17.120 +Conheces-te -1:18:17.700,1:18:22.129 -Where all your your thinking is done and all that stuff and there is a separate +0:18:17.700,0:18:22.129 +Onde todo o seu pensamento é feito e todas essas coisas e há um -1:18:24.720,1:18:26.460 -You know +0:18:24.720,0:18:26.460 +Você sabe -1:18:26.460,1:18:28.879 -Chunk of neurons called the hippocampus which is sort of +0:18:26.460,0:18:28.879 +Um pedaço de neurônios chamado hipocampo, que é uma espécie de -1:18:29.100,1:18:32.359 -Its kind of two formations in the middle of the brain and they kind of send +0:18:29.100,0:18:32.359 +São duas formações no meio do cérebro e elas meio que enviam -1:18:34.320,1:18:36.650 -Wires to pretty much everywhere in the cortex and +0:18:34.320,0:18:36.650 +Fios para praticamente todo o córtex e -1:18:37.110,1:18:44.390 -The hippocampus is thought that to be used as a short-term memory. So it can just you know, remember things for relatively short time +0:18:37.110,0:18:44.390 +O hipocampo é pensado para ser usado como uma memória de curto prazo. Então, você pode apenas saber, lembrar de coisas por um tempo relativamente curto -1:18:45.950,1:18:47.450 -The prevalent +0:18:45.950,0:18:47.450 +O predominante -1:18:47.450,1:18:53.530 -theory is that when you when you sleep and you dream there's a lot of information that is being transferred from your +0:18:47.450,0:18:53.530 +teoria é que quando você dorme e sonha, há muita informação que está sendo transferida do seu -1:18:53.810,1:18:56.800 -hippocampus to your cortex to be solidified in long-term memory +0:18:53.810,0:18:56.800 +hipocampo ao seu córtex para ser solidificado na memória de longo prazo -1:18:59.000,1:19:01.090 -Because the hippocampus has limited capacity +0:18:59.000,0:19:01.090 +Porque o hipocampo tem capacidade limitada -1:19:04.520,1:19:08.859 -When you get senile like you get really old very often your hippocampus shrinks and +0:19:04.520,0:19:08.859 +Quando você fica senil como você fica muito velho, muitas vezes seu hipocampo encolhe e -1:19:09.620,1:19:13.570 -You don't have short-term memory anymore. So you keep repeating the same stories to the same people +0:19:09.620,0:19:13.570 +Você não tem mais memória de curto prazo. Então você continua repetindo as mesmas histórias para as mesmas pessoas -1:19:14.420,1:19:16.420 -Okay, it's very common +0:19:14.420,0:19:16.420 +Ok, é muito comum -1:19:19.430,1:19:25.930 -Or you go to a room to do something and by the time you get to the room you forgot what you were there for +0:19:19.430,0:19:25.930 +Ou você vai a uma sala para fazer alguma coisa e, quando chega à sala, esqueceu para que estava lá. 
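O mecanismo de atenção descrito acima se resume a pouquíssimas operações: produtos escalares entre uma consulta e um conjunto de chaves, um softmax que produz coeficientes entre 0 e 1 somando 1, e uma combinação linear dos valores. Segue um esboço mínimo e hipotético em PyTorch, apenas para ilustrar a ideia; os nomes das funções e as dimensões são invenções deste esboço, não fazem parte da aula:

```python
import torch
import torch.nn.functional as F

def atencao_produto_escalar(x, chaves, valores):
    # escores: um produto escalar entre a consulta x e cada chave k_i
    escores = chaves @ x                  # (n,)
    # softmax -> coeficientes entre 0 e 1 que somam 1
    coeficientes = F.softmax(escores, dim=0)
    # combinação linear dos valores, ponderada pelos coeficientes
    return coeficientes @ valores         # (d_v,)

# uso ilustrativo: 5 pares chave/valor, consulta de dimensão 8, valores de dimensão 16
x = torch.randn(8)
chaves = torch.randn(5, 8)
valores = torch.randn(5, 16)
saida = atencao_produto_escalar(x, chaves, valores)
print(saida.shape)  # torch.Size([16])
```

Como todas as operações são diferenciáveis, os coeficientes podem ser aprendidos por retropropagação, como dito acima; essa mesma estrutura de chaves e valores reaparece na rede de memória descrita em seguida.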
-1:19:29.450,1:19:31.869 -This starts happening by the time you're 50, by the way +0:19:29.450,0:19:31.869 +Isso começa a acontecer quando você tem 50 anos, a propósito -1:19:36.290,1:19:40.390 -So, I don't remember what I said last week of two weeks ago, um +0:19:36.290,0:19:40.390 +Então, eu não me lembro do que eu disse na semana passada de duas semanas atrás, hum -1:19:41.150,1:19:44.950 -Okay, but anyway, so memory network, here's the idea of memory network +0:19:41.150,0:19:44.950 +Ok, mas enfim, rede de memória, aqui está a ideia de rede de memória -1:19:46.340,1:19:50.829 -You have an input to the memory network. Let's call it X and think of it as an address +0:19:46.340,0:19:50.829 +Você tem uma entrada para a rede de memória. Vamos chamá-lo de X e pensar nele como um endereço -1:19:51.770,1:19:53.770 -Of the memory, okay +0:19:51.770,0:19:53.770 +Da memória, ok -1:19:53.930,1:19:56.409 -What you're going to do is you're going to compare this X +0:19:53.930,0:19:56.409 +O que você vai fazer é comparar esse X -1:19:58.040,1:20:03.070 -With a bunch of vectors, we're gonna call K +0:19:58.040,0:20:03.070 +Com um monte de vetores, vamos chamar K -1:20:08.180,1:20:10.180 -So k₁ k₂ k₃ +0:20:08.180,0:20:10.180 +Então k₁ k₂ k₃ -1:20:12.890,1:20:18.910 -Okay, so you compare those two vectors and the way you compare them is via dot product very simple +0:20:12.890,0:20:18.910 +Ok, então você compara esses dois vetores e a maneira como você os compara é via produto escalar muito simples -1:20:28.460,1:20:33.460 -Okay, so now you have the three dot products of all the three Ks with the X +0:20:28.460,0:20:33.460 +Ok, agora você tem os três produtos escalares de todos os três Ks com o X -1:20:34.730,1:20:37.990 -They are scalar values, you know plug them to a softmax +0:20:34.730,0:20:37.990 +Eles são valores escalares, você sabe conectá-los a um softmax -1:20:47.630,1:20:50.589 -So what you get are three numbers between 0 & 1 that sum to 1 +0:20:47.630,0:20:50.589 +Então, o que você obtém são três números entre 0 e 1 que somam 1 -1:20:53.840,1:20:59.259 -What you do with those you have 3 other vectors that I'm gonna call V +0:20:53.840,0:20:59.259 +O que você faz com esses você tem 3 outros vetores que eu vou chamar de V -1:21:00.680,1:21:02.680 -v₁, v₂ and v₃ +0:21:00.680,0:21:02.680 +v₁, v₂ e v₃ -1:21:03.770,1:21:07.120 -And what you do is you multiply +0:21:03.770,0:21:07.120 +E o que você faz é multiplicar -1:21:08.990,1:21:13.570 -These vectors by those scalars, so this is very much like the attention mechanism that we just talked about +0:21:08.990,0:21:13.570 +Esses vetores por esses escalares, então isso é muito parecido com o mecanismo de atenção que acabamos de falar -1:21:17.870,1:21:20.950 -Okay, and you sum them up +0:21:17.870,0:21:20.950 +Ok, e você resumiu -1:21:27.440,1:21:34.870 -Okay, so take an X compare X with each of the K each of the Ks those are called keys +0:21:27.440,0:21:34.870 +Ok, então pegue um X compare X com cada um dos K ​​cada um dos Ks que são chamados de chaves -1:21:39.170,1:21:44.500 -You get a bunch of coefficients between the zero and one that sum to one and then compute a linear combination of the values +0:21:39.170,0:21:44.500 +Você obtém vários coeficientes entre zero e um que somam um e, em seguida, calcula uma combinação linear dos valores -1:21:45.260,1:21:47.260 -Those are value vectors +0:21:45.260,0:21:47.260 +Esses são vetores de valor -1:21:50.510,1:21:51.650 -And +0:21:50.510,0:21:51.650 +E -1:21:51.650,1:21:53.150 -Sum them up 
+0:21:51.650,0:21:53.150 +Soma-os -1:21:53.150,1:22:00.400 -Okay, so imagine that one of the key exactly matches X you're gonna have a large coefficient here and small coefficients there +0:21:53.150,0:22:00.400 +Ok, então imagine que uma das chaves corresponda exatamente a X, você terá um coeficiente grande aqui e coeficientes pequenos ali -1:22:00.400,1:22:06.609 -So the output of the system will essentially be V2, if K 2 matches X the output would essentially be V 2 +0:22:00.400,0:22:06.609 +Assim, a saída do sistema será essencialmente V2, se K 2 corresponder a X, a saída será essencialmente V 2 -1:22:08.060,1:22:09.500 -Okay +0:22:08.060,0:22:09.500 +OK -1:22:09.500,1:22:11.890 -So this is an addressable associative memory +0:22:09.500,0:22:11.890 +Então esta é uma memória associativa endereçável -1:22:12.620,1:22:19.419 -Associative memory is exactly that where you have keys with values and if your input matches a key you get the value here +0:22:12.620,0:22:19.419 +A memória associativa é exatamente onde você tem chaves com valores e se sua entrada corresponder a uma chave você obtém o valor aqui -1:22:19.420,1:22:21.420 -It's a kind of soft differentiable version of that +0:22:19.420,0:22:21.420 +É uma espécie de versão soft diferenciável disso -1:22:26.710,1:22:28.710 -So you can +0:22:26.710,0:22:28.710 +Então você pode -1:22:29.019,1:22:34.559 -you can back propagate to this you can you can write into this memory by changing the V vectors or +0:22:29.019,0:22:34.559 +você pode voltar a propagar para isso você pode escrever nessa memória alterando os vetores V ou -1:22:34.929,1:22:38.609 -Even changing the K vectors. You can change the V vectors by gradient descent +0:22:34.929,0:22:38.609 +Mesmo mudando os vetores K. Você pode alterar os vetores V por gradiente descendente -1:22:39.489,1:22:45.598 -Okay, so if you wanted the output of the memory to be something in particular by backpropagating gradient through this +0:22:39.489,0:22:45.598 +Ok, então se você quiser que a saída da memória seja algo em particular, retropropagando o gradiente por meio disso -1:22:47.019,1:22:52.259 -you're going to change the currently active V to whatever it needs for the +0:22:47.019,0:22:52.259 +você vai mudar o V atualmente ativo para o que for necessário para o -1:22:53.530,1:22:55.530 -for the output +0:22:53.530,0:22:55.530 +para a saída -1:22:56.050,1:22:58.050 -So in those papers +0:22:56.050,0:22:58.050 +Então naqueles papéis -1:22:59.800,1:23:02.460 -What what they did was I +0:22:59.800,0:23:02.460 +O que eles fizeram foi eu -1:23:03.969,1:23:06.299 -Mean there's a series of papers on every network, but +0:23:03.969,0:23:06.299 +Significa que há uma série de papéis em cada rede, mas -1:23:08.409,1:23:11.879 -What they did was exactly scenario I just explained where you you kind of +0:23:08.409,0:23:11.879 +O que eles fizeram foi exatamente o cenário que acabei de explicar onde você meio que -1:23:12.909,1:23:16.319 -Tell a story to a system so give it a sequence of sentences +0:23:12.909,0:23:16.319 +Conte uma história para um sistema, então dê uma sequência de frases -1:23:17.530,1:23:22.800 -Those sentences are encoded into vectors by running through a neural net which is not pre-trained, you know +0:23:17.530,0:23:22.800 +Essas frases são codificadas em vetores passando por uma rede neural que não é pré-treinada, você sabe -1:23:25.269,1:23:29.279 -it just through the training of the entire system it figures out how to encode this +0:23:25.269,0:23:29.279 +apenas através do 
treinamento de todo o sistema, ele descobre como codificar isso -1:23:30.039,1:23:35.009 -and then those sentences are written to the memory of this type and +0:23:30.039,0:23:35.009 +e então essas frases são escritas na memória deste tipo e -1:23:35.829,1:23:41.129 -Then when you ask a question to the system you encode the question at the input of a neural net, the neural net produces +0:23:35.829,0:23:41.129 +Então, quando você faz uma pergunta ao sistema, você codifica a pergunta na entrada de uma rede neural, a rede neural produz -1:23:41.130,1:23:44.999 -An X to the memory the memory returns a value +0:23:41.130,0:23:44.999 +Um X para a memória a memória retorna um valor -1:23:46.510,1:23:47.590 -And +0:23:46.510,0:23:47.590 +E -1:23:47.590,1:23:49.480 -Then you use this value +0:23:47.590,0:23:49.480 +Então você usa esse valor -1:23:49.480,1:23:54.329 -and the previous state of the network to kind of reaccess the memory, you can do this multiple times and +0:23:49.480,0:23:54.329 +e o estado anterior da rede para reacessar a memória, você pode fazer isso várias vezes e -1:23:54.550,1:23:58.139 -You train this entire network to produce or an answer to your to your question +0:23:54.550,0:23:58.139 +Você treina toda essa rede para produzir ou uma resposta para sua pergunta -1:23:59.139,1:24:03.748 -And if you have lots and lots of scenarios lots and lots of questions or also lots of answers +0:23:59.139,0:24:03.748 +E se você tem muitos e muitos cenários muitas e muitas perguntas ou também muitas respostas -1:24:04.119,1:24:10.169 -Which they did in this case with by artificially generating stories questions and answers +0:24:04.119,0:24:10.169 +O que eles fizeram neste caso, gerando artificialmente histórias, perguntas e respostas -1:24:11.440,1:24:12.940 -this thing actually +0:24:11.440,0:24:12.940 +essa coisa na verdade -1:24:12.940,1:24:15.989 -learns to store stories and +0:24:12.940,0:24:15.989 +aprende a armazenar histórias e -1:24:16.780,1:24:18.760 -answer questions +0:24:16.780,0:24:18.760 +responder a perguntas -1:24:18.760,1:24:20.409 -Which is pretty amazing +0:24:18.760,0:24:20.409 +O que é bem incrível -1:24:20.409,1:24:22.409 -So that's the memory Network +0:24:20.409,0:24:22.409 +Então essa é a rede de memória -1:24:27.110,1:24:29.860 -Okay, so the first step is you compute +0:24:27.110,0:24:29.860 +Ok, então o primeiro passo é você calcular -1:24:32.210,1:24:34.300 -Alpha I equals +0:24:32.210,0:24:34.300 +Alfa I é igual -1:24:36.590,1:24:43.899 -KI transpose X. Okay, just a dot product. Okay, and then you compute +0:24:36.590,0:24:43.899 +KI transpõe X. Ok, apenas um produto escalar. Ok, e então você calcula -1:24:48.350,1:24:51.519 -CI or the vector C I should say +0:24:48.350,0:24:51.519 +CI ou o vetor CI deve dizer -1:24:54.530,1:24:57.579 -Is the softmax function +0:24:54.530,0:24:57.579 +É a função softmax -1:25:00.320,1:25:02.979 -Applied to the vector of alphas, okay +0:25:00.320,0:25:02.979 +Aplicado ao vetor de alfas, ok -1:25:02.980,1:25:07.840 -So the C's are between 0 and 1 and sum to 1 and then the output of the system +0:25:02.980,0:25:07.840 +Então os C's estão entre 0 e 1 e somam 1 e então a saída do sistema -1:25:09.080,1:25:11.080 -is +0:25:09.080,0:25:11.080 +é -1:25:11.150,1:25:13.360 -sum over I of +0:25:11.150,0:25:13.360 +soma sobre I de -1:25:14.930,1:25:16.930 -Ci +0:25:14.930,0:25:16.930 +Lá -1:25:17.240,1:25:21.610 -Vi where Vis are the value vectors. Okay. That's the memory +0:25:17.240,0:25:21.610 +Vi onde Vis são os vetores de valor. 
OK. Essa é a memória -1:25:30.420,1:25:34.489 -Yes, yes, yes, absolutely +0:25:30.420,0:25:34.489 +Sim, sim, sim, absolutamente -1:25:37.140,1:25:38.640 -Not really +0:25:37.140,0:25:38.640 +Na verdade -1:25:38.640,1:25:41.869 -No, I mean all you need is everything to be encoded as vectors? +0:25:38.640,0:25:41.869 +Não, quero dizer, tudo que você precisa é que tudo seja codificado como vetores? -1:25:42.660,1:25:48.200 -Right and so run for your favorite convnet, you get a vector that represents the image and then you can do the QA +0:25:42.660,0:25:48.200 +Certo e então corra para o seu convnet favorito, você pega um vetor que representa a imagem e aí você pode fazer o QA -1:25:50.880,1:25:52.880 -Yeah, I mean so +0:25:50.880,0:25:52.880 +Sim, quero dizer assim -1:25:53.490,1:25:57.050 -You can imagine lots of applications of this so in particular +0:25:53.490,0:25:57.050 +Você pode imaginar muitas aplicações disso, em particular -1:25:58.110,1:26:00.110 -When application is I +0:25:58.110,0:26:00.110 +Quando a aplicação é eu -1:26:00.690,1:26:02.690 -Mean you can you can think of +0:26:00.690,0:26:02.690 +Significa que você pode, você pode pensar -1:26:06.630,1:26:09.109 -You know think of this as a kind of a memory +0:26:06.630,0:26:09.109 +Você sabe, pense nisso como uma espécie de memória -1:26:11.160,1:26:14.000 -And then you can have some sort of neural net +0:26:11.160,0:26:14.000 +E então você pode ter algum tipo de rede neural -1:26:16.020,1:26:16.970 -That you know +0:26:16.020,0:26:16.970 +Que você sabe -1:26:16.970,1:26:24.230 -it takes takes an input and then produces an address for the memory gets a value back and +0:26:16.970,0:26:24.230 +ele pega uma entrada e então produz um endereço para a memória pega um valor de volta e -1:26:25.050,1:26:27.739 -Then keeps growing and eventually produces an output +0:26:25.050,0:26:27.739 +Então continua crescendo e eventualmente produz uma saída -1:26:28.830,1:26:30.830 -This was very much like a computer +0:26:28.830,0:26:30.830 +Isso era muito parecido com um computador -1:26:31.050,1:26:33.650 -Ok. Well the neural net here is the +0:26:31.050,0:26:33.650 +OK. Bem, a rede neural aqui é a -1:26:34.920,1:26:37.099 -the CPU the ALU the CPU +0:26:34.920,0:26:37.099 +a CPU a ULA a CPU -1:26:37.680,1:26:43.099 -Ok, and the memory is just an external memory you can access whenever you need it, or you can write to it if you want +0:26:37.680,0:26:43.099 +Ok, e a memória é apenas uma memória externa que você pode acessar sempre que precisar, ou você pode escrever nela se quiser -1:26:43.890,1:26:49.040 -It's a recurrent net in this case. You can unfold it in time, which is what these guys did +0:26:43.890,0:26:49.040 +É uma rede recorrente neste caso. 
Você pode desdobrá-lo no tempo, que é o que esses caras fizeram

-1:26:51.330,1:26:52.650
-And
+1:26:51.330,1:26:52.650
+E

-1:26:52.650,1:26:58.009
-And then so then there are people who kind of imagined that you could actually build kind of differentiable computers out of this
+1:26:52.650,1:26:58.009
+E então há pessoas que meio que imaginaram que você poderia realmente construir computadores diferenciáveis a partir disso

-1:26:58.410,1:27:03.530
-There's something called neural Turing machine, which is essentially a form of this where the memory is not of this type
+1:26:58.410,1:27:03.530
+Existe algo chamado máquina de Turing neural, que é essencialmente uma forma disso onde a memória não é desse tipo

-1:27:03.530,1:27:07.040
-It's kind of a soft tape like in a regular Turing machine
+1:27:03.530,1:27:07.040
+É uma espécie de versão 'soft' da fita, como em uma máquina de Turing normal

-1:27:07.890,1:27:14.030
-This is somewhere from deep mind that the interesting story about this which is that the facebook people put out
+1:27:07.890,1:27:14.030
+Isso é da DeepMind, e a história interessante sobre isso é que o pessoal do Facebook publicou

-1:27:14.760,1:27:19.909
-The paper on the memory network on arxiv and three days later
+1:27:14.760,1:27:19.909
+O artigo sobre a rede de memória no arxiv e três dias depois

-1:27:22.110,1:27:24.110
-The deepmind people put out a paper
+1:27:22.110,1:27:24.110
+O pessoal da DeepMind publicou um artigo

-1:27:25.290,1:27:30.679
-About neural Turing machine and the reason they put three days later is that they've been working on the all Turing machine and
+1:27:25.290,1:27:30.679
+Sobre a máquina de Turing neural, e a razão de terem publicado três dias depois é que eles já vinham trabalhando na máquina de Turing e

-1:27:31.350,1:27:32.640
-in their
+1:27:31.350,1:27:32.640
+na sua

-1:27:32.640,1:27:37.160
-Tradition they kind of keep project secret unless you know until they can make a big splash
+1:27:32.640,1:27:37.160
+tradição, eles meio que mantêm os projetos em segredo, você sabe, até poderem fazer um grande sucesso

-1:27:37.770,1:27:40.699
-But there they got scooped so they put the paper out on arxiv
+1:27:37.770,1:27:40.699
+Mas aí eles levaram um furo, então colocaram o artigo no arxiv

-1:27:45.060,1:27:50.539
-Eventually, they made a big splash with another with a paper but that was a year later or so
+1:27:45.060,1:27:50.539
+Eventualmente, eles fizeram um grande sucesso com um outro artigo, mas isso foi cerca de um ano depois

-1:27:52.230,1:27:54.230
-So what's happened
+1:27:52.230,1:27:54.230
+Então o que aconteceu

-1:27:55.020,1:28:01.939
-since then is that people have kind of taken this module this idea that you compare inputs to keys and
+1:27:55.020,1:28:01.939
+desde então é que as pessoas pegaram esse módulo, essa ideia de que você compara entradas com chaves e

-1:28:02.550,1:28:04.550
-that gives you coefficients and
+1:28:02.550,1:28:04.550
+que isso lhe dá coeficientes e

-1:28:04.950,1:28:07.819
-You know you you produce values
+1:28:04.950,1:28:07.819
+você sabe, você produz valores

-1:28:08.520,1:28:09.990
-as
+1:28:08.520,1:28:09.990
+como

-1:28:09.990,1:28:14.449
-Kind of a essential module in a neural net and that's basically where the transformer is
+1:28:09.990,1:28:14.449
+uma espécie de módulo essencial em uma rede neural, e é basicamente daí que vem o transformador

-1:28:15.060,1:28:18.049
-so a transformer is basically a neural net in which
+1:28:15.060,1:28:18.049
+então um transformador é basicamente uma rede neural na qual

-1:28:19.290,1:28:21.290
-Every group of neurons is one of those
+1:28:19.290,1:28:21.290
+Cada grupo de neurônios é um desses

-1:28:21.720,1:28:29.449
-It's a it's a whole bunch of memories. Essentially. There's some more twist to it. Okay, but that's kind of the basic the basic idea
+1:28:21.720,1:28:29.449
+É um monte de memórias, essencialmente. Há mais alguns detalhes, ok, mas essa é meio que a ideia básica

-1:28:32.460,1:28:34.460
-But you'll hear about this
+1:28:32.460,1:28:34.460
+Mas você vai ouvir sobre isso

-1:28:34.980,1:28:36.750
-in a week Oh
+1:28:34.980,1:28:36.750
+em uma semana Ah

-1:28:36.750,1:28:38.250
-in two weeks
+1:28:36.750,1:28:38.250
+em duas semanas

-1:28:38.250,1:28:40.140
-one week one week
+1:28:38.250,1:28:40.140
+uma semana, uma semana

-1:28:40.140,1:28:42.140
-Okay any more questions?
+1:28:40.140,1:28:42.140
+Ok, mais alguma pergunta?

-1:28:44.010,1:28:46.640
-Cool. All right. Thank you very much
+1:28:44.010,1:28:46.640
+Legal. Tudo bem. Muito obrigado
\ No newline at end of file
diff --git a/docs/pt/week06/practicum06.sbv b/docs/pt/week06/practicum06.sbv
index f05301b02..0cdbe1765 100644
--- a/docs/pt/week06/practicum06.sbv
+++ b/docs/pt/week06/practicum06.sbv
@@ -1,1742 +1,1742 @@
0:00:00.030,0:00:03.959
-so today we are gonna be covering quite a lot of materials so I will try not to
+então hoje vamos cobrir bastante material, então tentarei não

0:00:03.959,0:00:08.309
-run but then yesterday young scooped me completely so young talked about exactly
+correr, mas ontem o Yann se antecipou completamente a mim; o Yann falou exatamente

0:00:08.309,0:00:15.269
-the same things I wanted to talk today so I'm gonna go a bit faster please slow
+as mesmas coisas de que eu queria falar hoje, então eu vou um pouco mais rápido; por favor, me peçam para

0:00:15.269,0:00:18.210
-me down if you actually are somehow lost okay
+desacelerar se vocês realmente estiverem de alguma forma perdidos, ok

0:00:18.210,0:00:21.420
-so I will just try to be a little bit faster than you sir
+então vou só tentar ser um pouco mais rápido que o senhor

0:00:21.420,0:00:26.250
-so today we are gonna be talking about recurrent neural networks record neural
+então hoje vamos falar sobre redes neurais recorrentes; redes neurais

0:00:26.250,0:00:31.050
-networks are one type of architecture we can use in order to be to deal with
+recorrentes são um tipo de arquitetura que podemos usar para lidar com

0:00:31.050,0:00:37.430
-sequences of data what are sequences what type of signal is a sequence
+sequências de dados; o que são sequências, que tipo de sinal é uma sequência?

0:00:39.890,0:00:44.219
-temporal is a temporal component but we already seen data with temporal
+temporal; é um componente temporal, mas já vimos dados com um componente

0:00:44.219,0:00:49.350
-component how what are they called what dimensional what is the dimension
+temporal; como eles são chamados, qual é a dimensionalidade, qual é a dimensão

0:00:49.350,0:00:55.320
-of that kind of signal so on the convolutional net lesson we have seen
+desse tipo de sinal? Então, na aula sobre redes convolucionais, nós vimos

0:00:55.320,0:00:59.969
-that a signal could be one this signal to this signal 3d signal based on the
+que um sinal pode ser um sinal 1D, um sinal 2D ou um sinal 3D, com base no

0:00:59.969,0:01:06.270
-domain and the domain is what you map from to go to right so temporal handling
+domínio, e o domínio é de onde você mapeia para onde você vai, certo; então, o tratamento temporal de

0:01:06.270,0:01:10.580
-sequential sequences of data is basically dealing with one the data
+sequências de dados é basicamente lidar com dados 1D

0:01:10.580,0:01:15.119
-because the domain is going to be just the temporal axis nevertheless you can
+porque o domínio será apenas o eixo temporal; mesmo assim, você pode

0:01:15.119,0:01:18.689
-also use RNN to deal with you know two dimensional data you have double
+também usar RNNs para lidar com, você sabe, dados bidimensionais, em que você tem duas

0:01:18.689,0:01:28.049
-Direction okay okay so this is a classical neural network in the diagram
+direções, ok. Ok, então esta é uma rede neural clássica, no diagrama

0:01:28.049,0:01:33.299
-that is I'm used to draw where I represent each in this case bunch of
+que estou acostumado a desenhar, onde eu represento, neste caso, cada grupo de

0:01:33.299,0:01:37.590
-neurons like each of those is a vector and for example the X is my input vector
+neurônios assim; cada um deles é um vetor e, por exemplo, o X é meu vetor de entrada

0:01:37.590,0:01:42.450
-it's in pink as usual then I have my hidden layer in a green in the center
+está em rosa, como de costume; então eu tenho minha camada oculta, em verde, no centro

0:01:42.450,0:01:46.200
-then I have my final blue eared lane layer which is the output network so
+e então eu tenho minha camada final, azul, que é a saída da rede; então

0:01:46.200,0:01:52.320
-this is a three layer neural network in my for my notation and so if some of you
+esta é uma rede neural de três camadas, na minha notação; e, se alguns de vocês

0:01:52.320,0:01:57.960
-are familiar with digital electronics this is like talking about a
+estão familiarizados com eletrônica digital, isto é como falar de uma

0:01:57.960,0:02:03.329
-combinatorial logic your current output depends only on the current input and
+lógica combinatória: sua saída atual depende apenas da entrada atual e

0:02:03.329,0:02:08.420
-that's it there is no there is no other input instead when we
+é isso, não há nenhuma outra entrada; em vez disso, quando

0:02:08.420,0:02:12.590
-are talking about our men we are gonna be talking about something that looks
+estamos falando de RNNs, vamos falar sobre algo que parece

0:02:12.590,0:02:17.420
-like this in this case our output here on the right hand side depends on the
+assim; neste caso, nossa saída, aqui no lado direito, depende da

0:02:17.420,0:02:21.860
-current input and on the state of the system and again if you're a king of
+entrada atual e do estado do sistema; e, novamente, se você entende de

0:02:21.860,0:02:26.750
-digital electronics this is simply sequential logic whereas you have an
+eletrônica digital, isso é simplesmente lógica sequencial, em que você tem um

0:02:26.750,0:02:31.580
-internal state the onion is the dimension flip-flop if you have no idea
+estado interno; a RNN é o equivalente do flip-flop, se você não tem ideia

0:02:31.580,0:02:37.040
-what a flip-flop you know check it out it's just some very basic memory unit in
+do que é um flip-flop, você sabe, dê uma olhada; é só uma unidade de memória muito básica em

0:02:37.040,0:02:41.810
-digital electronics nevertheless this is the only difference right in the first
+eletrônica digital. De qualquer forma, esta é a única diferença, certo: no primeiro

0:02:41.810,0:02:45.290
-case you have an output which is just function of the input in the second case
+caso, você tem uma saída que é apenas função da entrada; no segundo caso

0:02:45.290,0:02:49.580
-you have an output which is function of the input and the state of the system
+você tem uma saída que é função da entrada e do estado do sistema

0:02:49.580,0:02:54.130
-okay that's the big difference yeah vanilla is in American term for saying
+ok, essa é a grande diferença; sim, baunilha é um termo americano para dizer

0:02:58.040,0:03:04.670
-it's plane doesn't have a taste that American sorry I try to be the most
+que é simples, que não tem sabor; que coisa americana... desculpe, eu tento ser o mais

0:03:04.670,0:03:11.390
-American I can in Italy you feel taken an ice cream which is doesn't have a
+americano que eu consigo; na Itália, se você toma um sorvete que não tem

0:03:11.390,0:03:15.950
-taste it's gonna be fior di latte which is milk taste in here we don't have milk
+sabor, vai ser fior di latte, que é sabor de leite; aqui não temos sabor de

0:03:15.950,0:03:20.049
-tests they have vanilla taste which is the plain ice cream
+leite; eles têm sabor de baunilha, que é o sorvete simples

0:03:20.049,0:03:28.360
-okay Americans sorry all right so oh so let's see what does
+ok, americanos, desculpe; tudo bem, então vamos ver como isso

0:03:28.360,0:03:32.760
-it change this with young representation so young draws those kind of little
+muda com a representação do Yann; o Yann desenha esses tipos de pequenas

0:03:32.760,0:03:38.170
-funky things here which represent a mapping between a TENS tensor to another
+coisas estranhas aqui, que representam um mapeamento entre um tensor e outro

0:03:38.170,0:03:41.800
-painter from one a vector to another vector right so there you have your
+tensor, de um vetor para outro vetor, certo; então ali você tem o seu

0:03:41.800,0:03:46.630
-input vector X is gonna be mapped through this item here to this hidden
+vetor de entrada X, que será mapeado através deste item aqui para esta representação

0:03:46.630,0:03:50.620
-representation so that actually represent my fine transformation so my
+oculta; isso na verdade representa minha transformação afim, ou seja, minha

0:03:50.620,0:03:54.130
-rotation Plus this question then you have the heater representation that you
+rotação mais o esmagamento; então você tem a representação oculta, que você

0:03:54.130,0:03:57.850
-have another rotation is question then you get the final output right similarly
+passa por outra rotação e esmagamento, e então você obtém a saída final, certo; da mesma forma

0:03:57.850,0:04:03.220
-in the recurrent diagram you can have these additional things this is a fine
+no diagrama recorrente, você pode ter essas coisas adicionais: isto é uma transformação

0:04:03.220,0:04:06.640
-transformation squashing that's like a delay module with a final transformation
+afim e esmagamento, isso é como um módulo de atraso com uma transformação final

0:04:06.640,0:04:10.900
-excursion and now you have the final one affine transformation and squashing
+e esmagamento, e agora você tem a última transformação afim e esmagamento

0:04:10.900,0:04:18.100
-right these things is making noise okay sorry all right so what is the first
+certo; essas coisas estão fazendo barulho, ok, desculpe; tudo bem, então qual é o primeiro

0:04:18.100,0:04:24.250
-case first case is this one is a vector to sequence so we input one bubble the
+caso; o primeiro caso é este: um vetor para sequência; então nós inserimos uma bolha, a

0:04:24.250,0:04:28.270
-pink wonder and then you're gonna have this evolution of the internal state of
+rosa, e então você vai ter essa evolução do estado interno de

0:04:28.270,0:04:33.070
-the system the green one and then as the state of the system evolves you can be
+o sistema o verde e 
então, à medida que o estado do sistema evolui, você pode ser 0:04:33.070,0:04:38.470 -spitting out at every time stamp one specific output what can be an example +cuspindo em cada carimbo de hora uma saída específica o que pode ser um exemplo 0:04:38.470,0:04:43.240 -of this kind of architecture so this one could be the following my input is gonna +deste tipo de arquitetura para que este possa ser o seguinte, minha entrada vai 0:04:43.240,0:04:46.750 -be one of these images and then the output is going to be a sequence of +ser uma dessas imagens e, em seguida, a saída será uma sequência de 0:04:46.750,0:04:53.140 -characters representing the English description of whatever this input is so +caracteres que representam a descrição em inglês de qualquer que seja essa entrada 0:04:53.140,0:04:57.940 -for example in the center when we have a herd of elephants so the last one herd +por exemplo, no centro, quando temos uma manada de elefantes, então a última manada 0:04:57.940,0:05:03.880 -of elephants walking across a dry grass field so it's very very very well +de elefantes andando por um campo de grama seca, então está muito, muito bem 0:05:03.880,0:05:09.130 -refined right then you have in the center here for example two dogs play in +refinado, então você tem no centro aqui, por exemplo, dois cães brincam 0:05:09.130,0:05:15.640 -the in the grass maybe there are three but okay they play they're playing in +na grama talvez haja três, mas tudo bem, eles jogam, eles estão jogando 0:05:15.640,0:05:20.500 -the grass right so it's cool in this case you know a red motorcycle park on +a grama certa, então é legal, neste caso você conhece um parque de motos vermelho 0:05:20.500,0:05:24.610 -the side of the road looks more pink or you know a little +o lado da estrada parece mais rosa ou você sabe um pouco 0:05:24.610,0:05:30.490 -blow a little a little girl in the pink that is blowing bubbles that she's not +soprar uma garotinha no rosa que está soprando bolhas que ela não está 0:05:30.490,0:05:35.650 -blowing right anything there all right and then you also have you know even +soprando direito qualquer coisa lá tudo bem e então você também tem você sabe mesmo 0:05:35.650,0:05:41.560 -more wrong examples right so you have like yellow school bus parked in the +mais exemplos errados certo, então você tem como ônibus escolar amarelo estacionado no 0:05:41.560,0:05:44.050 -parking lot well it's CL um but it's not a school +estacionamento bem, é CL hum, mas não é uma escola 0:05:44.050,0:05:49.860 -bus so it can be failing as well but I also can do a very very nice you know +ônibus, então pode estar falhando também, mas eu também posso fazer um muito, muito legal, você sabe 0:05:49.860,0:05:56.470 -you can also perform very well so this was from one input vector which is B for +você também pode ter um desempenho muito bom, então isso foi de um vetor de entrada que é B para 0:05:56.470,0:06:01.720 -example representation of my image to a sequence of symbols which are D for +exemplo de representação da minha imagem para uma sequência de símbolos que são D para 0:06:01.720,0:06:05.620 -example characters or words that are making here my English sentence okay +exemplos de caracteres ou palavras que estão fazendo aqui minha frase em inglês ok 0:06:05.620,0:06:11.440 -clear so far yeah okay another kind of usage you can have is maybe the +claro até agora sim ok outro tipo de uso que você pode ter é talvez o 0:06:11.440,0:06:17.560 -following so you're gonna have sequence two final vector okay so I don't care 
+seguindo, então você terá a sequência dois vetores finais ok, então eu não me importo 0:06:17.560,0:06:22.120 -about the intermediate sequences so okay the top right is called Auto regressive +sobre as sequências intermediárias, então tudo bem, o canto superior direito é chamado de Auto regressivo 0:06:22.120,0:06:26.590 -network and outer regressive network is a network which is outputting an output +rede e rede regressiva externa é uma rede que está emitindo uma saída 0:06:26.590,0:06:29.950 -given that you feel as input the previous output okay +dado que você sente como entrada a saída anterior ok 0:06:29.950,0:06:33.700 -so this is called Auto regressive you have this kind of loopy part on the +então isso é chamado de Auto regressivo, você tem esse tipo de parte maluca no 0:06:33.700,0:06:37.780 -network on the left hand side instead I'm gonna be providing several sequences +rede no lado esquerdo, em vez disso, fornecerei várias sequências 0:06:37.780,0:06:40.140 -yeah that's gonna be the English translation +sim, essa vai ser a tradução em inglês 0:06:51.509,0:06:55.380 -so you have a sequence of words that are going to make up your final sentence +então você tem uma sequência de palavras que vão compor sua frase final 0:06:55.380,0:07:00.330 -it's it's blue there you can think about a index in a dictionary and then each +é azul lá você pode pensar em um índice em um dicionário e então cada 0:07:00.330,0:07:03.300 -blue is going to tell you which word you're gonna pick on an indexed +azul vai lhe dizer qual palavra você vai escolher em um índice 0:07:03.300,0:07:09.780 -dictionary right so this is a school bus right so oh yeah a yellow school bus you +dicionário certo então este é um ônibus escolar certo então oh sim um ônibus escolar amarelo você 0:07:09.780,0:07:14.940 -go to a index of a then you have second index you can figure out that is yellow +vá para um índice de a, então você tem o segundo índice, você pode descobrir que é amarelo 0:07:14.940,0:07:17.820 -and then school box right so the sequence here is going to be +e, em seguida, a caixa da escola para a direita, então a sequência aqui será 0:07:17.820,0:07:22.590 -representing the sequence of words the model is out on the other side there on +representando a sequência de palavras que o modelo está do outro lado lá 0:07:22.590,0:07:26.460 -the left you're gonna have instead I keep feeding a sequence of symbols and +a esquerda você terá em vez disso, continuo alimentando uma sequência de símbolos e 0:07:26.460,0:07:30.750 -only at the end I'm gonna look what is my final output what can be an +só no final eu vou olhar qual é a minha saída final o que pode ser um 0:07:30.750,0:07:36.150 -application of this one so something yun also mentioned was different so let's +aplicação deste, então algo que yun também mencionou era diferente, então vamos 0:07:36.150,0:07:40.789 -see if I can get my network to compile Python or to an open pilot own +ver se consigo fazer com que minha rede compile Python ou para um piloto aberto próprio 0:07:40.789,0:07:45.599 -interpretation so in this case I have my current input which I feed my network +interpretação, então neste caso eu tenho minha entrada atual que alimento minha rede 0:07:45.599,0:07:54.979 -which is going to be J equal 8580 for then for X in range eight some - J 920 +que será J igual a 8580 para então para X na faixa de oito alguns - J 920 0:07:54.979,0:07:59.430 -blah blah blah and then print this one and then my network is going to be +blá blá blá e depois imprima este e então 
minha rede vai ser 0:07:59.430,0:08:04.860 -tasked with the just you know giving me twenty five thousand and eleven okay so +encarregado de apenas você sabe me dar vinte e cinco mil e onze ok então 0:08:04.860,0:08:09.210 -this is the final output of a program and I enforced in the network to be able +esta é a saída final de um programa e eu apliquei na rede para poder 0:08:09.210,0:08:13.860 -to output me the correct output the correct in your solution of this program +para me enviar a saída correta, a correta na sua solução deste programa 0:08:13.860,0:08:18.330 -or even more complicated things for example I can provide a sequence of +ou coisas ainda mais complicadas, por exemplo, posso fornecer uma sequência de 0:08:18.330,0:08:21.900 -other symbols which are going to be eighty eight thousand eight hundred +outros símbolos que serão oitenta e oito mil e oitocentos 0:08:21.900,0:08:26.669 -thirty seven then I have C is going to be something then I have print this one +trinta e sete então eu tenho C vai ser algo então eu tenho que imprimir este 0:08:26.669,0:08:33.360 -if something that is always true as the other one and then you know the output +se algo que é sempre verdadeiro como o outro e então você conhece a saída 0:08:33.360,0:08:38.849 -should be twelve thousand eight 184 right so you can train a neural net to +deve ser doze mil oito 184 certo para que você possa treinar uma rede neural para 0:08:38.849,0:08:42.690 -do these operations so you feed a sequence of symbols and then at the +fazer essas operações para que você alimente uma sequência de símbolos e depois no 0:08:42.690,0:08:48.870 -output you just enforce that the final target should be a specific value okay +saída você apenas impõe que o destino final deve ser um valor específico ok 0:08:48.870,0:08:56.190 -and these things making noise okay maybe I'm better +e essas coisas fazendo barulho ok talvez eu seja melhor 0:08:56.190,0:09:02.589 -all right so what's next next is going to be for example a sequence to vector +tudo bem, então o que vem a seguir será, por exemplo, uma sequência para vetor 0:09:02.589,0:09:07.210 -to sequence this used to be the standard way of performing length language +para sequenciar isso costumava ser a maneira padrão de executar a linguagem de comprimento 0:09:07.210,0:09:13.000 -translation so you start with a sequence of symbols here shown in pink so you +tradução para que você comece com uma sequência de símbolos aqui mostrados em rosa para que você 0:09:13.000,0:09:17.290 -have a sequence of inputs then everything gets condensed into this kind +tem uma sequência de entradas, então tudo é condensado nesse tipo 0:09:17.290,0:09:23.020 -of final age which is this H over here which is going to be somehow my concept +da idade final que é este H aqui que vai ser de alguma forma o meu conceito 0:09:23.020,0:09:27.880 -right so I have a sentence I squeeze the sentence temporal information into just +certo, então eu tenho uma frase, eu espremo a informação temporal da frase em apenas 0:09:27.880,0:09:31.600 -one vector which is representing the meaning the message I'd like to send +um vetor que está representando o significado da mensagem que eu gostaria de enviar 0:09:31.600,0:09:36.310 -across and then I get this meaning in whatever representation unrolled back in +e então eu recebo esse significado em qualquer representação desenrolada de volta 0:09:36.310,0:09:41.380 -a different language right so I can encode I don't know today I'm very happy +um idioma diferente né pra eu codificar não sei 
hoje estou muito feliz 0:09:41.380,0:09:47.350 -in English as a sequence of word and then you know you can get LG Sonoma to +em inglês como uma sequência de palavras e então você sabe que pode fazer com que o LG Sonoma 0:09:47.350,0:09:53.170 -Felicia and then I speak outside Thailand today or whatever now today I'm +Felicia e depois falo fora da Tailândia hoje ou o que quer que seja agora hoje estou 0:09:53.170,0:09:58.480 -very tired Jin Chen walk han lei or whatever ok so +muito cansado Jin Chen anda han lei ou o que quer que seja ok então 0:09:58.480,0:10:02.020 -again you have some kind of encoding then you have a compressed +novamente você tem algum tipo de codificação, então você tem um compactado 0:10:02.020,0:10:08.110 -representation and then you get like the decoding given the same compressed +representação e então você fica como a decodificação dada a mesma compactação 0:10:08.110,0:10:15.040 -version ok and so for example I guess language translation again recently we +versão ok e então, por exemplo, acho que a tradução de idiomas novamente recentemente 0:10:15.040,0:10:20.709 -have seen transformers and a lot of things like in the recent time so we're +vimos transformadores e muitas coisas como nos últimos tempos, então estamos 0:10:20.709,0:10:25.300 -going to cover that the next lesson I think but this used to be the state of +vou cobrir isso na próxima lição, eu acho, mas isso costumava ser o estado de 0:10:25.300,0:10:31.000 -the art until few two years ago and here you can see that if you actually check +a arte até uns dois anos atrás e aqui você pode ver que se você realmente verificar 0:10:31.000,0:10:38.950 -if you do a PCA over the latent space you have that words are grouped by +se você fizer um PCA sobre o espaço latente você tem que as palavras são agrupadas por 0:10:38.950,0:10:43.630 -semantics ok so if we zoom in that region there are we're gonna see that in +semântica ok, então, se ampliarmos nessa região, veremos isso em 0:10:43.630,0:10:48.400 -what in the same location you find all the amounts december february november +o que no mesmo local você encontra todos os valores dezembro fevereiro novembro 0:10:48.400,0:10:52.750 -whatever right if you put a few focus on a different region you get that a few +certo, se você colocar um pouco de foco em uma região diferente, você obtém alguns 0:10:52.750,0:10:55.250 -days next few miles and so on right so +dias próximos quilômetros e assim por diante, então 0:10:55.250,0:11:00.230 -different location will have some specific you know common meaning so we +local diferente terá algum significado comum específico que você conhece para que possamos 0:11:00.230,0:11:05.780 -basically see in this case how by training these networks you know just +basicamente veja neste caso como treinando essas redes você sabe apenas 0:11:05.780,0:11:09.680 -with symbols they will pick up on some specific semantics +com símbolos eles vão pegar algumas semânticas específicas 0:11:09.680,0:11:16.130 -you know features right in this case you can see like there is a vector so the +você conhece os recursos, neste caso, você pode ver como se houvesse um vetor, então o 0:11:16.130,0:11:20.900 -vector that is connecting women to men is gonna be the same vector that is well +vetor que está conectando mulheres a homens será o mesmo vetor que está bem 0:11:20.900,0:11:27.590 -woman - man which is this one I think is gonna be equal to Queen - King right and +mulher - homem que é esse eu acho que vai ser igual a rainha - rei certo e 
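A aritmética de analogias comentada acima (mulher − homem ≈ rainha − rei) pode ser testada diretamente sobre os vetores de embedding. Abaixo, um esboço hipotético em PyTorch, com vocabulário e dimensões inventados só para ilustrar; com pesos aleatórios o resultado não é significativo, pois a propriedade só aparece em embeddings treinados:

```python
import torch
import torch.nn.functional as F

# vocabulário e dimensão inventados, apenas para ilustrar a aritmética de analogias
vocab = ["homem", "mulher", "rei", "rainha"]
E = torch.nn.Embedding(len(vocab), 3)   # na prática, use embeddings treinados

def vetor(palavra):
    return E(torch.tensor(vocab.index(palavra)))

# "mulher - homem ≈ rainha - rei": procuramos a palavra cujo vetor fica
# mais próximo de rei + (mulher - homem), pela similaridade de cosseno
alvo = vetor("rei") + vetor("mulher") - vetor("homem")
sim = F.cosine_similarity(alvo.unsqueeze(0), E.weight, dim=1)
print(vocab[sim.argmax().item()])  # com embeddings treinados, espera-se "rainha"
```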
0:11:27.590,0:11:32.890 -so yeah it's correct and so you're gonna have that the same distance in this +então sim, está correto e você terá a mesma distância neste 0:11:32.890,0:11:37.730 -embedding space will be applied to things that are female and male for +espaço de incorporação será aplicado a coisas que são femininas e masculinas para 0:11:37.730,0:11:43.370 -example or in the other case you have walk-in and walked swimming and swamp so +exemplo ou no outro caso você entrou e andou nadando e pântano assim 0:11:43.370,0:11:47.960 -you always have this you know specific linear transformation you can apply in +você sempre tem isso você sabe a transformação linear específica que você pode aplicar em 0:11:47.960,0:11:53.690 -order to go from one type of word to the other one or this one you have the +para ir de um tipo de palavra para outro ou este você tem a 0:11:53.690,0:11:59.180 -connection between cities and the capitals all right so one more right I +ligação entre as cidades e as capitais tudo certo então mais um certo eu 0:11:59.180,0:12:05.210 -think what's missing from the big picture here it's a big picture because +pense no que está faltando no quadro geral aqui é um quadro geral porque 0:12:05.210,0:12:09.560 -it's so large no no it's such a big picture because it's the overview okay +é tão grande não não é uma imagem tão grande porque é a visão geral ok 0:12:09.560,0:12:18.590 -you didn't get the joke it's okay what's missing here vector to seek with no okay +você não entendeu a piada está tudo bem o que está faltando aqui vetor para buscar sem tudo bem 0:12:18.590,0:12:23.330 -good but no because you can still use the other one so you have this one the +bom, mas não, porque você ainda pode usar o outro, então você tem este 0:12:23.330,0:12:27.830 -vector is sequence to sequence right so this one is you start feeding inside +vetor é sequência para sequência correta, então este é você começar a alimentar dentro 0:12:27.830,0:12:31.580 -inputs you start outputting something right what can be an example of this +entradas você começa a produzir algo certo o que pode ser um exemplo disso 0:12:31.580,0:12:38.900 -stuff so if you had a Nokia phone and you use the t9 you know this stuff from +coisas, então se você tem um telefone Nokia e usa o t9, você conhece essas coisas de 0:12:38.900,0:12:43.100 -20 years ago you have basically suggestions on what your typing is +20 anos atrás você tem basicamente sugestões sobre qual é a sua digitação 0:12:43.100,0:12:47.150 -you're typing right so this would be one type of these suggestions where like one +você está digitando certo, então este seria um tipo dessas sugestões, como uma 0:12:47.150,0:12:50.570 -type of this architecture as you getting suggestions as you're typing things +tipo dessa arquitetura à medida que você recebe sugestões enquanto digita coisas 0:12:50.570,0:12:57.290 -through or you may have like speech to captions right I talked and you have the +através ou você pode ter como fala para legendas né eu falei e você tem o 0:12:57.290,0:13:02.520 -things below or something very cool which is +coisas abaixo ou algo muito legal que é 0:13:02.520,0:13:08.089 -the following so I start writing here the rings of Saturn glitter while the +o seguinte então eu começo a escrever aqui os anéis de Saturno brilham enquanto o 0:13:08.089,0:13:16.260 -harsh ice two men look at each other hmm okay they were enemies but the server +gelo duro dois homens se olham hmm ok eles eram inimigos mas o servidor 0:13:16.260,0:13:20.100 -robots weren't 
okay okay hold on so this network was trained on some +os robôs não estavam bem ok espere então esta rede foi treinada em alguns 0:13:20.100,0:13:24.360 -sci-fi novels and therefore you can just type something then you let the network +romances de ficção científica e, portanto, você pode simplesmente digitar algo e deixar a rede 0:13:24.360,0:13:28.290 -start outputting some suggestions for you so you know if you don't know how to +começar a enviar algumas sugestões para você, para que você saiba se não sabe como 0:13:28.290,0:13:34.620 -write a book then you can you know ask your computer to help you out okay +escreva um livro, então você pode saber, peça ao seu computador para ajudá-lo, ok 0:13:34.620,0:13:39.740 -that's so cool or one more that I really like it this one is fantastic I think +isso é tão legal ou mais um que eu realmente gosto este é fantástico eu acho 0:13:39.740,0:13:45.959 -you should read read it I think so you put some kind of input there like the +você deveria ler, leia, eu acho, então você coloca algum tipo de entrada lá como o 0:13:45.959,0:13:51.630 -scientist named alone what is it or the prompt right so you put in the +cientista nomeou sozinho o que é ou o prompt certo para que você coloque no 0:13:51.630,0:13:56.839 -the top prompt and then you get you know this network start writing about very +o prompt superior e então você sabe que esta rede começa a escrever sobre muito 0:13:56.839,0:14:05.690 -interesting unicorns with multiple horns is called horns say unicorn right okay +unicórnios interessantes com vários chifres são chamados de chifres, diga unicórnio certo, ok 0:14:05.690,0:14:09.480 -alright that's so cool just check it out later and you can take a screenshot of +tudo bem, que legal, confira mais tarde e você pode tirar uma captura de tela 0:14:09.480,0:14:14.970 -the screen anyhow so that was like the eye candy such that you get you know +a tela de qualquer maneira, era como um colírio para os olhos, para que você saiba 0:14:14.970,0:14:21.089 -hungry now let's go into BPTT which is the thing that they aren't really like +com fome agora vamos entrar no BPTT que é a coisa que eles não gostam muito 0:14:21.089,0:14:27.390 -yesterday's BPTT said okay alright let's see how this stuff works okay so on the +o BPTT de ontem disse ok ok vamos ver como isso funciona bem então no 0:14:27.390,0:14:31.620 -left hand side we see again this vector the hidden representation the output +lado esquerdo vemos novamente este vetor a representação oculta a saída 0:14:31.620,0:14:35.520 -to an affine transformation and then there we have the classical equations right +para uma transformação afim e então temos as equações clássicas certas 0:14:35.520,0:14:42.450 -all right so let's see how this stuff is similar or not similar and you can't see +tudo bem, então vamos ver como essas coisas são semelhantes ou não semelhantes e você não pode ver 0:14:42.450,0:14:46.620 -anything so for the next two seconds I will want one minute I will turn off the +qualquer coisa então nos próximos dois segundos eu vou querer um minuto eu vou desligar o 0:14:46.620,0:14:51.300 -lights then I turn them on [Music] +luzes, então eu as ligo [Música] 0:14:51.300,0:14:55.570 -okay now you can see something all right so let's see what are the equations of +ok agora você pode ver algo certo então vamos ver quais são as equações de 0:14:55.570,0:15:00.490 -this new architecture don't stand up you're gonna be crushing yourself +essa nova arquitetura não se levante você vai se esmagar
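The "classical equations" being referred to are the recurrence h[t] = f(W_h [x[t]; h[t−1]] + b_h), and the compact notation with the stacked matrices can be checked numerically: multiplying the concatenated vector [x; h] by the side-by-side matrix [W_hx | W_hh] gives exactly W_hx x + W_hh h. A toy check with arbitrary sizes:

```python
import torch

d_x, d_h = 5, 3
W_hx, W_hh = torch.randn(d_h, d_x), torch.randn(d_h, d_h)
b_h = torch.randn(d_h)
x, h_prev = torch.randn(d_x), torch.randn(d_h)

# One recurrent step written two equivalent ways:
h_split = torch.tanh(W_hx @ x + W_hh @ h_prev + b_h)

W_h = torch.cat([W_hx, W_hh], dim=1)                 # the two matrices stacked side by side
h_stacked = torch.tanh(W_h @ torch.cat([x, h_prev]) + b_h)

assert torch.allclose(h_split, h_stacked)            # same hidden state either way
```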
0:15:00.490,0:15:04.270 -alright so you have here the hidden representation now there's gonna be this +tudo bem, então você tem aqui a representação oculta agora vai haver isso 0:15:04.270,0:15:10.000 -nonlinear function of this rotation of a stack version of my input which I +função não linear desta rotação de uma versão de pilha da minha entrada que eu 0:15:10.000,0:15:15.520 -appended the previous configuration of the hidden layer okay and so this is a +anexou a configuração anterior da camada oculta ok e então esta é uma 0:15:15.520,0:15:19.420 -very nice compact notation it's just I just put the two vectors one on top of +notação compacta muito boa, é só colocar os dois vetores um em cima 0:15:19.420,0:15:24.640 -each other and then I sign assign I sum the bias I also and define initial +um ao outro e então eu assino atribuo eu soma o viés eu também e defino inicial 0:15:24.640,0:15:29.920 -condition my initial H is gonna be 0 so at the beginning whenever I have t=1 +condicionar meu H inicial será 0, então no início sempre que eu tiver t = 1 0:15:29.920,0:15:34.360 -this stuff is gonna be settle is a vector of zeros and then I have this +esse material vai ser resolvido é um vetor de zeros e então eu tenho isso 0:15:34.360,0:15:39.880 -matrix Wh is gonna be two separate matrices so sometimes you see this a +matriz Wh serão duas matrizes separadas, então às vezes você vê isso como 0:15:39.880,0:15:48.130 -question is Wₕₓ times x plus Wₕₕ times h[t-1] but you can also figure out +pergunta é Wₕₓ vezes x mais Wₕₕ vezes h[t-1], mas você também pode descobrir 0:15:48.130,0:15:52.450 -that if you stock those two matrices you know one attached to the other that you +que se você estocar essas duas matrizes você sabe que uma está ligada à outra que você 0:15:52.450,0:15:56.620 -just put this two vertical lines completely equivalent notation but it +basta colocar essas duas linhas verticais em notação completamente equivalente, mas 0:15:56.620,0:16:01.360 -looked like very similar to whatever we had here so hidden layer is affine +parecia muito semelhante ao que tínhamos aqui, então a camada oculta é afim 0:16:01.360,0:16:05.230 -transformation of the input inner layer is affine transformation of the input +transformação da camada interna de entrada é a transformação afim da entrada 0:16:05.230,0:16:11.440 -and the previous value okay and then you have the final output is going to be +e o valor anterior ok e então você tem a saída final que será 0:16:11.440,0:16:20.140 -again my final rotation so I'm gonna turn on the light so no magic so far +novamente minha rotação final então eu vou acender a luz então sem mágica até agora 0:16:20.140,0:16:27.690 -right you're okay right you're with me to shake the heads what about the others +certo você está bem certo você está comigo para balançar a cabeça e os outros 0:16:27.690,0:16:34.930 -no yes okay whatever so this one is simply on the right hand +não sim tudo bem então este é simplesmente na mão direita 0:16:34.930,0:16:40.330 -side I simply unroll over time such that you can see how things are just not very +lado eu simplesmente desenrolo ao longo do tempo para que você possa ver como as coisas não são muito 0:16:40.330,0:16:43.990 -crazy like this loop here is not actually a loop this is like a +louco como este loop aqui não é realmente um loop isso é como um 0:16:43.990,0:16:48.500 -connection to next time steps right so that around +conexão com o próximo passo certo para que em torno 0:16:48.500,0:16:52.760 -arrow means is just this right arrow so 
this is a neural net it's dinkley a +seta significa que é apenas esta seta para a direita, então esta é uma rede neural, é dinkley a 0:16:52.760,0:16:57.950 -neural net which is extended in in length rather also not only in a in a +rede neural que é estendida em comprimento e não apenas em um em um 0:16:57.950,0:17:01.639 -thickness right so you have a network that is going this direction input and +espessura certa, então você tem uma rede que vai nessa direção de entrada e 0:17:01.639,0:17:05.600 -output but as you can think as there's been an extended input and this been an +saída, mas como você pode pensar, houve uma entrada estendida e esta foi uma 0:17:05.600,0:17:10.220 -extended output while all these intermediate weights are all share right +saída estendida enquanto todos esses pesos intermediários são todos compartilhados 0:17:10.220,0:17:14.120 -so all of these weights are the same weights and then you use this kind of +então todos esses pesos são os mesmos pesos e então você usa esse tipo de 0:17:14.120,0:17:17.510 -shared weights so it's similar to a convolutional net in the sense that you +pesos compartilhados, por isso é semelhante a uma rede convolucional no sentido de que você 0:17:17.510,0:17:21.410 -had this parameter sharing right across different time domains because you +tinha esse parâmetro de compartilhamento em diferentes domínios de tempo porque você 0:17:21.410,0:17:28.820 -assume there is some kind of you know stationarity right of the signal make +suponha que haja algum tipo de estacionaridade à direita do sinal 0:17:28.820,0:17:32.870 -sense so this is a kind of convolution right you can see how this is kind of a +sentido, então isso é um tipo de convolução, certo, você pode ver como isso é uma espécie de 0:17:32.870,0:17:40.130 -convolution alright so that was kind of you know a little bit of the theory we +convolução tudo bem então foi meio que você conhece um pouco da teoria que nós 0:17:40.130,0:17:46.160 -already seen that so let's see how this works for a practical example so in this +já vi isso então vamos ver como isso funciona para um exemplo prático então neste 0:17:46.160,0:17:51.830 -case we we are just reading this code here so this is world language model you +caso estamos apenas lendo este código aqui, então este é o modelo de linguagem mundial que você 0:17:51.830,0:17:57.770 -can find it at the PyTorch examples so you have a sequence of symbols I have +pode encontrá-lo nos exemplos do PyTorch para que você tenha uma sequência de símbolos que eu tenho 0:17:57.770,0:18:01.910 -just represented there every symbol is like a letter in the alphabet and then +apenas representado lá cada símbolo é como uma letra no alfabeto e então 0:18:01.910,0:18:05.419 -the first part is gonna be basically splitting this one in this way right +a primeira parte vai ser basicamente dividindo este desta forma certo 0:18:05.419,0:18:10.309 -so you preserve vertically in the time domain but then I split the long long +então você preserva verticalmente no domínio do tempo, mas então eu divido o longo 0:18:10.309,0:18:16.640 -long sequence such that I can now chop I can use best bets bets how do you say +sequência longa de tal forma que agora posso cortar posso usar as melhores apostas como se diz 0:18:16.640,0:18:21.980 -computation so the first thing you have the best size is gonna be 4 in this case +computação, então a primeira coisa que você tem o melhor tamanho será 4 neste caso 0:18:21.980,0:18:27.410 -and then I'm gonna be getting in my first batch and then I will 
force the +e então eu vou entrar no meu primeiro lote e então vou forçar o 0:18:27.410,0:18:33.650 -network to be able to so this will be my best back propagation through time +rede para poder, então esta será minha melhor propagação de volta ao longo do tempo 0:18:33.650,0:18:38.270 -period and I will force the network to output the next sequence of characters +período e forçarei a rede a produzir a próxima sequência de caracteres 0:18:38.270,0:18:44.510 -ok so given that I have a,b,c, I will force my network to say d given that I have +ok então dado que eu tenho a,b,c, vou forçar minha rede a dizer d dado que eu tenho 0:18:44.510,0:18:50.000 -g,h,i, I will force the network to come up with j. Given m,n,o, +g,h,i, forçarei a rede a criar j. Dado m,n,o, 0:18:50.000,0:18:54.980 -I want p, given s,t,u, I want v. So how can you actually make +Eu quero p, dado s,t,u, eu quero v. Então, como você pode realmente fazer 0:18:54.980,0:18:59.660 -sure you understand what I'm saying whenever you are able to predict my next +certeza de que você entende o que estou dizendo sempre que puder prever meu próximo 0:18:59.660,0:19:04.010 -world you're actually able to you know you basically know in already what I'm +mundo você é realmente capaz de você sabe que basicamente já sabe o que eu sou 0:19:04.010,0:19:11.720 -saying right yeah so by trying to predict an upcoming word you're going to +dizendo certo, sim, tentando prever uma próxima palavra que você vai 0:19:11.720,0:19:15.170 -be showing some kind of comprehension of whatever is going to be this temporal +estar mostrando algum tipo de compreensão do que quer que seja este temporal 0:19:15.170,0:19:22.700 -information in the data all right so after we get the beds we have so how +informações nos dados tudo bem, então depois de conseguirmos as camas que temos, então como 0:19:22.700,0:19:26.510 -does it work let's actually see you know and about a bit of a detail this is +funciona, vamos ver se você sabe e um pouco de detalhe isso é 0:19:26.510,0:19:30.650 -gonna be my first output is going to be a batch with four items I feed this +vai ser minha primeira saída vai ser um lote com quatro itens eu alimento isso 0:19:30.650,0:19:34.220 -inside the near corner all night and then my neural net we come up with a +no canto mais próximo a noite toda e, em seguida, minha rede neural, chegamos a um 0:19:34.220,0:19:39.740 -prediction of the upcoming sample right and I will force that one to be my b,h,n,t +previsão da próxima amostra certa e eu forçarei essa a ser minha b,h,n,t 0:19:39.740,0:19:47.450 -okay then I'm gonna be having my second input I will provide the previous +ok, então eu vou ter minha segunda entrada, vou fornecer a anterior 0:19:47.450,0:19:53.420 -hidden state to the current RNN I will feel these inside and then I expect to +estado oculto para o RNN atual, sentirei isso por dentro e espero 0:19:53.420,0:19:58.670 -get the second line of the output the target right and then so on right I get +obter a segunda linha da saída o destino certo e depois assim por diante, eu recebo 0:19:58.670,0:20:03.410 -the next state and sorry the next input I get the next state and then I'm gonna +o próximo estado e desculpe a próxima entrada eu recebo o próximo estado e então eu vou 0:20:03.410,0:20:07.700 -get inside the neural net the RNN I which I will try to force to get the +entrar na rede neural o RNN I que vou tentar forçar para obter o 0:20:07.700,0:20:13.840 -final target okay so far yeah each one is gonna be the output of the +alvo final ok até agora 
sim cada um vai ser a saída do 0:20:18.730,0:20:28.280 -internet recurrent neural net right I'll show you the equation before you have h[1] +direito de rede neural recorrente da internet vou mostrar a equação antes que você tenha h[1] 0:20:28.280,0:20:43.460 -comes out from this one right second the output I'm gonna be forcing the output +sai deste um segundo a saída eu vou forçar a saída 0:20:43.460,0:20:48.170 -actually to be my target my next word in the sequence of letters right so I have +na verdade, para ser meu alvo, minha próxima palavra na sequência de letras, então eu tenho 0:20:48.170,0:20:52.610 -a sequence of words force my network to predict what's the next word given the +uma sequência de palavras força minha rede a prever qual é a próxima palavra dada a 0:20:52.610,0:21:02.480 -previous word know h1 is going to be fed inside here and you stuck the next word +palavra anterior sabe que h1 será alimentado aqui e você colocou a próxima palavra 0:21:02.480,0:21:07.880 -the next word together with the previous state and then you'll do a rotation of +a próxima palavra junto com o estado anterior e então você fará uma rotação de 0:21:07.880,0:21:13.670 -the previous word with a previous sorry the new word with the next state the new +a palavra anterior com uma anterior desculpe a nova palavra com o próximo estado o novo 0:21:13.670,0:21:17.720 -word with the previous state you'll do our rotation here find transformation +palavra com o estado anterior você fará nossa rotação aqui encontre a transformação 0:21:17.720,0:21:21.230 -right and then you apply the non-linearity so you always get a new +certo e, em seguida, você aplica a não-linearidade para obter sempre um novo 0:21:21.230,0:21:25.610 -word that is the current X and then you get the previous state just to see in +palavra que é o X atual e aí você pega o estado anterior só para ver em 0:21:25.610,0:21:30.650 -what state the system once and then you output a new output right and so we are +qual estado do sistema uma vez e, em seguida, você produz uma nova saída correta e, portanto, estamos 0:21:30.650,0:21:35.000 -in this situation here we have a bunch of inputs I have my first input and then +nesta situação aqui temos um monte de entradas eu tenho minha primeira entrada e depois 0:21:35.000,0:21:39.200 -I get the first output I have this internal memory that is sent forward and +Eu recebo a primeira saída eu tenho essa memória interna que é enviada para frente e 0:21:39.200,0:21:44.240 -then this network will now be aware of what happened here and then I input the +então esta rede agora estará ciente do que aconteceu aqui e então eu insiro o 0:21:44.240,0:21:49.450 -next input and so on I get the next output and I force the output to be the +próxima entrada e assim por diante, recebo a próxima saída e forço a saída a ser a 0:21:49.450,0:21:57.040 -output here the value inside the batch ok alright what's missing now +saia aqui o valor dentro do lote ok tudo bem o que está faltando agora 0:21:57.070,0:22:00.160 -[Music] this is for PowerPoint drawing +[Música] isto é para desenho do PowerPoint 0:22:02.890,0:22:08.370 -constraint all right what's happening now so here I'm gonna be sending the +restrição tudo bem o que está acontecendo agora, então aqui eu vou enviar o 0:22:08.370,0:22:13.300 -here I just drawn an arrow with the final h[T] but there is a slash on the +aqui acabei de desenhar uma seta com o h[T] final, mas há uma barra no 0:22:13.300,0:22:16.780 -arrow what is the slash on the arrow who can +seta qual é a barra 
na seta quem pode 0:22:16.780,0:22:27.100 -understand what the slash mean of course there will be there is gonna be the next +entenda o que a barra significa, claro que haverá, haverá o próximo 0:22:27.100,0:22:31.570 -batch they're gonna be starting from here D and so on this is gonna be my +lote eles vão começar a partir daqui D e assim por diante este vai ser o meu 0:22:31.570,0:22:46.690 -next batch d,j,p,v e,k,q,w and f,l,r,x. This slash here means do not back +próximo lote d,j,p,v e,k,q,w e f,l,r,x. Esta barra aqui significa não voltar 0:22:46.690,0:22:51.550 -propagate through okay so that one is gonna be calling dot detach in PyTorch +propagar através de ok para que alguém chame o dot detach no PyTorch 0:22:51.550,0:22:56.560 -which is gonna be stopping the gradient to be you know propagated back to +que vai parar o gradiente para ser propagado de volta para 0:22:56.560,0:23:01.450 -forever okay so this one say know that and so whenever I get the sorry no no +para sempre tudo bem, então este diga, saiba disso e sempre que eu recebo, desculpe, não, não 0:23:01.450,0:23:06.970 -gradient such that when I input the next gradient the first input here it's gonna +gradiente de tal forma que, quando eu inserir o próximo gradiente, a primeira entrada aqui será 0:23:06.970,0:23:11.530 -be this guy over here and also of course without gradient such that we don't have +ser esse cara aqui e também claro sem gradiente para que não tenhamos 0:23:11.530,0:23:17.170 -an infinite length RNN okay make sense yes +um comprimento infinito RNN ok faz sentido sim
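The "slash on the arrow" corresponds to calling .detach() on the hidden state in PyTorch: the value is carried into the next chunk, but the autograd graph is cut there, so backpropagation through time stops at the chunk boundary and the unrolled network never becomes infinitely long. A rough sketch of that loop (sizes, the stand-in loss and the use of nn.RNN are assumptions, not the example's exact code):

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16)     # toy sizes
criterion = nn.MSELoss()

x = torch.randn(4, 3, 8)                       # 4 time steps, batch of 3, 8 features
hidden = torch.zeros(1, 3, 16)

for t0 in range(0, 4, 2):                      # chunks of 2 steps (truncated BPTT)
    chunk = x[t0:t0 + 2]
    hidden = hidden.detach()                   # the "slash": keep the value, cut the graph
    out, hidden = rnn(chunk, hidden)
    loss = criterion(out, torch.zeros_like(out))   # stand-in target
    loss.backward()                            # gradients stop at the detached hidden state
    # (zero_grad and the optimizer step are omitted in this sketch)
```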
0:23:17.170,0:23:24.640 -no I assume it's a yes okay so vanishing and exploding +não, eu suponho que é um sim, tudo bem, então desaparecendo e explodindo 0:23:24.640,0:23:30.730 -gradients we touched upon these also yesterday so again I'm kind of going a +gradientes nós tocamos neles também ontem, então novamente eu estou indo um 0:23:30.730,0:23:35.620 -little bit faster to the intent user so let's see how this works +um pouco mais rápido para o usuário intent, então vamos ver como isso funciona 0:23:35.620,0:23:40.390 -so usually for our recurrent neural network you have an input you have a +então geralmente para nossa rede neural recorrente você tem uma entrada você tem um 0:23:40.390,0:23:45.160 -hidden layer and then you have an output then this value of here how do you get +camada oculta e então você tem uma saída, então esse valor aqui como você obtém 0:23:45.160,0:23:50.680 -this information through here what what what does this R represent do you +esta informação por aqui o que o que esse R representa você 0:23:50.680,0:23:55.840 -remember the equation of the hidden layer so the new hidden layer is gonna +lembre-se da equação da camada oculta para que a nova camada oculta 0:23:55.840,0:24:01.050 -be the previous hidden layer which we rotate +ser a camada oculta anterior que giramos 0:24:03.100,0:24:08.030 -alright so we rotate the previous hidden layer and so how do you rotate hidden +tudo bem, então giramos a camada oculta anterior e como você gira oculta 0:24:08.030,0:24:15.220 -layers matrices right and so every time you see an arrow there is a +matrizes de camadas corretas e, portanto, toda vez que você vê uma seta, há uma 0:24:15.220,0:24:21.920 -rotation there is a matrix now if the you know this matrix can +rotação existe uma matriz agora se você sabe que esta matriz pode 0:24:21.920,0:24:26.900 -change the sizing of your final output right so if you think about perhaps +altere o tamanho da sua saída final corretamente, então se você pensar em talvez 0:24:26.900,0:24:31.190 -let's say the determinant right if the determinant is unitary it's mapping the +digamos que o determinante certo se o determinante for unitário é um mapeamento do 0:24:31.190,0:24:34.610 -same areas for the same area if it's larger than one they're going to be +mesmas áreas para a mesma área, se for maior que um, eles serão 0:24:34.610,0:24:39.560 -getting you know these gradients to getting larger and larger or if it's smaller +conhecendo esses gradientes para ficar cada vez maior ou se for menor 0:24:39.560,0:24:44.660 -than one I'm gonna get these gradients to go to zero whenever you perform the back +que um eu vou fazer esses gradientes irem para zero sempre que você executar a retro 0:24:44.660,0:24:48.920 -propagation in this direction okay so the problem here is that whenever we do +propagação nesta direção ok, então o problema aqui é que sempre que fazemos 0:24:48.920,0:24:53.390 -is send gradients back so the gradients are going to be going down like that are +é enviar gradientes de volta para que os gradientes sejam reduzidos assim 0:24:53.390,0:24:57.800 -gonna be going like down like this then down like this way and down like this +vai descer assim, então para baixo assim e para baixo assim 0:24:57.800,0:25:01.610 -way and also all down this way and so on right so the gradients are going to be +caminho e também por este caminho e assim por diante, para que os gradientes sejam 0:25:01.610,0:25:06.380 -always going against the direction of the arrows each arrow has a matrix inside +sempre indo contra a direção das setas cada seta tem uma matriz dentro 0:25:06.380,0:25:11.510 -right and again this matrix will affect how these gradients propagate and that's +certo e novamente essa matriz afetará como esses gradientes se propagam e isso é 0:25:11.510,0:25:18.590 -why you can see here although we have a very bright input that one like gets +por que você pode ver aqui, embora tenhamos uma entrada muito brilhante que um gosta de obter 0:25:18.590,0:25:23.720 -lost through oh well if you have like a gradient coming down here the gradient +perdido por oh bem se você tem como um gradiente descendo aqui o gradiente 0:25:23.720,0:25:30.410 -gets you know killed over time okay so how do we fix that to fix this one we simply +é, você sabe, morto ao longo do tempo ok então como podemos consertar isso para consertar este nós simplesmente 0:25:30.410,0:25:40.420 -remove the matrices in this horizontal operation does it make sense no yes no +remova as matrizes nesta operação horizontal faz sentido não sim não 0:25:40.420,0:25:47.630 -the problem is that the next hidden state will have you know its own input +o problema é que o próximo estado oculto fará com que você conheça sua própria entrada 0:25:47.630,0:25:52.910 -memory coming from the previous step through a matrix multiplication now this +memória vindo do passo anterior através de uma multiplicação de matrizes agora isso 0:25:52.910,0:25:58.760 -matrix multiplication will affect what's gonna be the gradient that comes in the +a multiplicação de matrizes afetará o que vai ser o gradiente que vem no 0:25:58.760,0:26:02.630 -other direction okay so whenever you have an output here you +outra direção ok então sempre que você tiver uma saída aqui você 0:26:02.630,0:26:06.740 -have a final loss now you have the gradients that are gonna be going against the +tem uma perda final agora você tem os gradientes que vão contra as 0:26:06.740,0:26:12.050 -arrows up to the input the problem is that this gradient
which is going +setas até a entrada o problema é que esse gradiente que vai 0:26:12.050,0:26:16.910 -through the in the opposite direction of these arrows will be multiplied by the +na direção oposta dessas setas será multiplicado pelo 0:26:16.910,0:26:22.460 -matrix right the transpose of the matrix and again these matrices will affect +matriz direita a transposição da matriz e novamente essas matrizes afetarão 0:26:22.460,0:26:26.030 -what is the overall norm of this gradient right and it will be all +qual é a norma geral desse gradiente certo e será tudo 0:26:26.030,0:26:28.310 -killing it you have vanishing gradient or you're +matando você tem gradiente de fuga ou você está 0:26:28.310,0:26:32.690 -gonna have exploding the gradient which is going to be whenever is going to be +vai ter explodir o gradiente que vai ser sempre que vai ser 0:26:32.690,0:26:37.880 -getting amplified right so in order to be avoiding that we have to avoid so you +sendo amplificado certo para evitar o que temos que evitar para que você 0:26:37.880,0:26:41.960 -can see this is a very deep network so recurrently our network where the first +podemos ver que esta é uma rede muito profunda, então recorrentemente nossa rede onde o primeiro 0:26:41.960,0:26:45.320 -deep networks back in the night is actually and the word +redes profundas de volta à noite é na verdade e a palavra 0:26:45.320,0:26:49.850 -depth was actually in time which and of course they were facing the same issues +profundidade estava realmente no tempo que e, claro, eles estavam enfrentando os mesmos problemas 0:26:49.850,0:26:54.350 -we face with deep learning in modern day days where ever we were still like +enfrentamos com o aprendizado profundo nos dias modernos, onde ainda estávamos como 0:26:54.350,0:26:58.450 -stacking several layers we were observing that the gradients get lost as +empilhando várias camadas, observamos que os gradientes se perdem à medida que 0:26:58.450,0:27:05.750 -depth right so how do we solve gradient getting lost through the depth in a +profundidade certa, então como resolvemos o gradiente se perdendo na profundidade em um 0:27:05.750,0:27:08.770 -current days skipping constant connection right the +dias atuais pulando a conexão constante direito o 0:27:11.270,0:27:15.530 -receiver connections we use and similarly here we can use skip +conexões do receptor que usamos e da mesma forma aqui podemos usar pular 0:27:15.530,0:27:21.860 -connections as well when we go down well up in in time okay so let's see how this +conexões também quando descemos bem no tempo certo, então vamos ver como isso 0:27:21.860,0:27:30.500 -works yeah so the problem is that the +funciona sim, então o problema é que o 0:27:30.500,0:27:34.250 -gradients are only going in the backward paths right back +gradientes estão apenas indo nos caminhos para trás de volta 0:27:34.250,0:27:38.990 -[Music] well the gradient has to go the same way +[Música] bem, o gradiente tem que seguir o mesmo caminho 0:27:38.990,0:27:42.680 -it went forward by the opposite direction right I mean you're computing +foi para a frente na direção oposta, quero dizer, você está computando 0:27:42.680,0:27:46.970 -chain rule so if you have a function of a function of a function then you just +regra da cadeia, então se você tem uma função de uma função de uma função, então você apenas 0:27:46.970,0:27:52.220 -use those functions to go back right the point is that whenever you have these +usar essas funções para voltar à direita o ponto é que sempre que você tiver esses 
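The effect of the gradient going through the same matrix at every backward step can be seen with two toy matrices: when the largest singular value is below 1 the norm collapses exponentially (vanishing gradient), and when it is above 1 it blows up (exploding gradient). The numbers below are arbitrary:

```python
import torch

g = torch.ones(10)                       # stand-in for a gradient vector
W_small = 0.5 * torch.eye(10)            # largest singular value 0.5
W_large = 1.5 * torch.eye(10)            # largest singular value 1.5

v, e = g.clone(), g.clone()
for _ in range(50):                      # 50 backward steps through "the same matrix"
    v = W_small.t() @ v                  # vanishing: norm shrinks like 0.5**t
    e = W_large.t() @ e                  # exploding: norm grows like 1.5**t

print(v.norm().item(), e.norm().item())  # roughly 3e-15 versus 2e+9
```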
0:27:52.220,0:27:55.790 -gradients coming back they will not have to go through matrices therefore also +gradientes voltando eles não terão que passar por matrizes, portanto, também 0:27:55.790,0:28:01.250 -the forward part has not doesn't have to go through the matrices meaning that the +a parte direta não precisa passar pelas matrizes, o que significa que a 0:28:01.250,0:28:07.310 -memory cannot go through matrix multiplication if you don't want to have +a memória não pode passar pela multiplicação de matrizes se você não quiser ter 0:28:07.310,0:28:11.770 -this effect when you perform back propagation okay +este efeito quando você executa a propagação de volta ok 0:28:14.050,0:28:19.420 -yeah it's gonna be worth much better working I show you in the next slide +sim vai valer a pena trabalhar muito melhor eu mostro no próximo slide 0:28:19.420,0:28:22.539 -[Music] show you next slide +[Música] mostra o próximo slide 0:28:27.740,0:28:32.270 -so how do we fix this problem well instead of using one recurrent neural +então, como corrigimos bem esse problema em vez de usar um recurso neural recorrente 0:28:32.270,0:28:36.650 -network we're gonna using for recurrent neural network okay so the first +rede que vamos usar para rede neural recorrente ok, então o primeiro 0:28:36.650,0:28:41.510 -RNN on the first network is gonna be the one that goes +RNN na primeira rede vai ser a que vai 0:28:41.510,0:28:46.370 -from the input to this intermediate state then I have other three networks +da entrada para este estado intermediário, então eu tenho outras três redes 0:28:46.370,0:28:51.410 -and each of those are represented by these three symbols 1 2 & 3. +e cada um deles é representado por esses três símbolos 1 2 e 3. 0:28:51.410,0:28:56.870 -okay think about this as our open mouth and it's like a closed mouth okay like +ok pense nisso como nossa boca aberta e é como uma boca fechada ok tipo 0:28:56.870,0:29:04.580 -the emoji okay so if you use this kind of for net for recurrent neural network +o emoji ok então se você usar esse tipo de rede para rede neural recorrente 0:29:04.580,0:29:09.740 -be regular Network you gotta have for example from the input I send things +seja regular Rede você tem que ter por exemplo a partir da entrada que eu envio coisas 0:29:09.740,0:29:14.390 -through in the open mouth therefore it gets here I have a closed mouth here so +passa na boca aberta então chega aqui eu tenho a boca fechada aqui então 0:29:14.390,0:29:18.920 -nothing goes forward then I'm gonna have this open mouth here such that the +nada vai pra frente então eu vou ficar de boca aberta aqui pra que o 0:29:18.920,0:29:23.600 -history goes forward so the history gets sent forward without going through a +a história avança para que a história seja enviada para frente sem passar por um 0:29:23.600,0:29:29.120 -neural network matrix multiplication it just gets through our open mouth and +multiplicação de matrizes de redes neurais, ele apenas passa pela nossa boca aberta e 0:29:29.120,0:29:34.670 -all the other inputs find a closed mouth so the hidden state will not change upon +todas as outras entradas encontram uma boca fechada para que o estado oculto não mude 0:29:34.670,0:29:40.820 -new inputs okay and then here you're gonna have a open mouth here such that +novas entradas ok e então aqui você vai ter uma boca aberta aqui de tal forma que 0:29:40.820,0:29:44.960 -you can get the final output here then the open mouth keeps going here such +você pode obter a saída final aqui, então a boca aberta continua 
aqui como 0:29:44.960,0:29:48.560 -that you have another output there and then finally you get the last closed +que você tem outra saída lá e, finalmente, você obtém a última fechada 0:29:48.560,0:29:54.620 -mouth at the last one now if you perform back prop you will have the gradients +boca no último agora se você executar back prop você terá os gradientes 0:29:54.620,0:29:58.880 -flowing through the open mouth and you don't get any kind of matrix +fluindo pela boca aberta e você não obtém nenhum tipo de matriz 0:29:58.880,0:30:04.400 -multiplication so now let's figure out how these open mouths are represented +multiplicação, então agora vamos descobrir como essas bocas abertas são representadas 0:30:04.400,0:30:10.010 -how are they instantiated in like in in terms of mathematics is it clear design +como eles são instanciados em termos de matemática é um design claro 0:30:10.010,0:30:13.130 -right so now we are using open and closed mouths and each of those mouths +agora estamos usando bocas abertas e fechadas e cada uma dessas bocas 0:30:13.130,0:30:17.880 -is plus the the first guy here that connects the input to the hidden are +é mais o primeiro cara aqui que conecta a entrada ao oculto são 0:30:17.880,0:30:25.580 -brn ends so these on here that is a gated recurrent network it's simply for +brn termina então isso aqui que é uma rede recorrente fechada é simplesmente para 0:30:25.580,0:30:32.060 -normal recurrent neural network combined in a clever way such that you have +rede neural recorrente normal combinada de maneira inteligente, de modo que você tenha 0:30:32.060,0:30:37.920 -multiplicative interaction and not matrix interaction is it clear so far +interação multiplicativa e não interação de matrizes está claro até agora 0:30:37.920,0:30:42.000 -this is like intuition I haven't shown you how all right so let's figure out +isso é como a intuição eu não te mostrei como tudo bem então vamos descobrir 0:30:42.000,0:30:48.570 -who made this and how it works okay so we're gonna see now those long short +quem fez isso e como funciona bem, então vamos ver agora aqueles longos 0:30:48.570,0:30:55.530 -term memory or gated recurrent neural networks so I'm sorry okay that was the +memória de termo ou redes neurais recorrentes fechadas, então me desculpe, tudo bem, esse foi o 0:30:55.530,0:30:59.730 -dude okay this is the guy who actually invented this stuff actually him and his +cara ok esse é o cara que realmente inventou essas coisas na verdade ele e seus 0:30:59.730,0:31:07.620 -students back some in 1997 and we were drinking here together okay all right so +alunos de volta alguns em 1997 e estávamos bebendo aqui juntos, tudo bem, então 0:31:07.620,0:31:14.010 -that is the question of a recurrent neural network and on the top left are +essa é a questão de uma rede neural recorrente e no canto superior esquerdo estão 0:31:14.010,0:31:18.000 -you gonna see in the diagram so I just make a very compact version of this +você verá no diagrama, então eu apenas faço uma versão muito compacta disso 0:31:18.000,0:31:23.310 -recurrent neural network here is going to be the collection of equations that +rede neural recorrente aqui será a coleção de equações que 0:31:23.310,0:31:27.840 -are expressed in a long short term memory they look a little bit dense so I +são expressos em uma memória de longo prazo, eles parecem um pouco densos, então eu 0:31:27.840,0:31:32.970 -just draw it for you here okay let's actually goes through how this stuff +apenas desenhe para você aqui ok, vamos realmente ver como 
essas coisas 0:31:32.970,0:31:36.320 -works so I'm gonna be drawing an interactive +funciona então eu vou desenhar um interativo 0:31:36.320,0:31:40.500 -animation here so you have your input gate here which is going to be an affine +animação aqui para que você tenha seu portão de entrada aqui, que será um afim 0:31:40.500,0:31:43.380 -transformation so all of these are recurrent Network write the same +transformação para que todos sejam recorrentes A rede escreve o mesmo 0:31:43.380,0:31:49.920 -equation I show you here so this input transformation will be multiplying my C +equação que eu mostro aqui para que essa transformação de entrada esteja multiplicando meu C 0:31:49.920,0:31:55.440 -tilde which is my candidate gate here I have a don't forget gate which is +til que é o meu portão candidato aqui eu tenho um portão não se esqueça que é 0:31:55.440,0:32:01.920 -multiplying my previous value of my cell memory and then my Poppa stylist maybe +multiplicando meu valor anterior da minha memória celular e então meu estilista Poppa talvez 0:32:01.920,0:32:08.100 -don't forget previous plus input ii i'm gonna show you now how it works then i +não se esqueça da entrada anterior mais ii vou mostrar agora como funciona então eu 0:32:08.100,0:32:12.600 -have my final hidden representations to be multiplication element wise +tenho minhas representações ocultas finais para serem elementos de multiplicação 0:32:12.600,0:32:17.850 -multiplication between my output gate and my you know whatever hyperbolic +multiplicação entre meu portão de saída e meu você sabe o que quer que seja hiperbólico 0:32:17.850,0:32:22.740 -tangent version of the cell such that things are bounded and then I have +versão tangente da célula de tal forma que as coisas são limitadas e então eu tenho 0:32:22.740,0:32:26.880 -finally my C tilde which is my candidate gate is simply +finalmente meu til C, que é meu portão candidato, é simplesmente 0:32:26.880,0:32:31.110 -Anette right so you have one recurrent network one that modulates the output +Anette certo então você tem uma rede recorrente uma que modula a saída 0:32:31.110,0:32:35.730 -one that modulates this is don't forget gate and this is the input gate +um que modula este é o portão não esqueça e este é o portão de entrada 0:32:35.730,0:32:40.050 -so all this interaction between the memory and the gates is a multiplicative +então toda essa interação entre a memória e os portões é um multiplicativo 0:32:40.050,0:32:44.490 -interaction and this forget input and don't forget the input and output are +interação e isso esqueça a entrada e não esqueça que a entrada e a saída são 0:32:44.490,0:32:48.780 -all sigmoids and therefore they are going from 0 to 1 so I can multiply by a +todos os sigmóides e, portanto, eles vão de 0 a 1 para que eu possa multiplicar por um 0:32:48.780,0:32:53.340 -0 you have a closed mouth or you can multiply by 1 if it's open mouth right +0 você está de boca fechada ou pode multiplicar por 1 se estiver de boca aberta né 0:32:53.340,0:33:00.120 -if you think about being having our internal linear volume which is below +se você pensa em ter nosso volume linear interno que está abaixo 0:33:00.120,0:33:06.120 -minus 5 or above plus 5 okay such that you using the you use the gate in the +menos 5 ou acima mais 5 ok, de modo que você usa o portão no 0:33:06.120,0:33:11.940 -saturated area or 0 or 1 right you know the sigmoid so let's see how this stuff +área saturada ou 0 ou 1 certo você conhece o sigmóide então vamos ver como essas coisas 
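The equations just walked through amount to four affine transformations whose outputs interact with the cell memory only element-wise (multiplicatively). A minimal single-step sketch written directly from those equations; in practice one would simply use nn.LSTM, and the shapes and names here are assumptions:

```python
import torch

def lstm_step(x, h_prev, c_prev, W, b):
    # W: (4*hidden, input+hidden), b: (4*hidden) -- the four affine transformations
    z = W @ torch.cat([x, h_prev]) + b
    i, f, o, g = z.chunk(4)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)   # gates in (0, 1)
    c_tilde = torch.tanh(g)                  # candidate memory
    c = f * c_prev + i * c_tilde             # element-wise write / "don't forget"
    h = o * torch.tanh(c)                    # output gate decides what leaves the cell
    return h, c

d_in, d_h = 8, 16
W, b = torch.randn(4 * d_h, d_in + d_h), torch.zeros(4 * d_h)
h, c = torch.zeros(d_h), torch.zeros(d_h)
h, c = lstm_step(torch.randn(d_in), h, c, W, b)

# The "open / closed mouth": a gate saturated below -5 or above +5 acts as a 0/1 switch.
print(torch.sigmoid(torch.tensor([-5.0, 5.0])))   # ~[0.007, 0.993]
```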
0:33:11.940,0:33:16.260 -works this is the output let's turn off the +funciona esta é a saída vamos desligar o 0:33:16.260,0:33:20.450 -output how do I do turn off the output I simply put a 0 +output como faço para desligar a saída eu simplesmente coloco um 0 0:33:20.450,0:33:26.310 -inside so let's say I have a purple internal representation see I put a 0 +dentro então digamos que eu tenha uma representação interna roxa veja eu coloquei um 0 0:33:26.310,0:33:29.730 -there in the output gate the output is going to be multiplying a 0 with +lá no portão de saída a saída vai estar multiplicando um 0 com 0:33:29.730,0:33:36.300 -something you get 0 okay then let's say I have a green one I have one then I +algo que você recebe 0 ok então vamos dizer que eu tenho um verde eu tenho um então eu 0:33:36.300,0:33:40.830 -multiply one with the purple I get purple and then finally I get the same +multiplique um com o roxo eu fico roxo e então finalmente eu recebo o mesmo 0:33:40.830,0:33:46.170 -value similarly I can control the memory and I can for example we set it in this +valor da mesma forma eu posso controlar a memória e posso, por exemplo, configurá-lo neste 0:33:46.170,0:33:51.240 -case I'm gonna be I have my internal memory see this is purple and then I +caso eu vou ser eu tenho minha memória interna ver isso é roxo e então eu 0:33:51.240,0:33:57.450 -have here my previous guy which is gonna be blue I guess I have a zero here and +tenho aqui meu cara anterior que vai ser azul eu acho que tenho um zero aqui e 0:33:57.450,0:34:01.500 -therefore the multiplication gives me a zero there I have here a zero so +então a multiplicação me dá um zero aí eu tenho aqui um zero então 0:34:01.500,0:34:05.190 -multiplication is gonna be giving a zero at some two zeros and I get a zero +a multiplicação vai dar um zero em cerca de dois zeros e eu recebo um zero 0:34:05.190,0:34:09.690 -inside of memory so I just erase the memory and you get the zero there +dentro da memória, então eu apenas apago a memória e você obtém o zero lá 0:34:09.690,0:34:15.210 -otherwise I can keep the memory I still do the internal thing I did a new one +caso contrário eu posso manter a memória eu ainda faço a coisa interna eu fiz uma nova 0:34:15.210,0:34:19.919 -but I keep a wonder such that the multiplication gets blue the Sun gets +mas guardo uma maravilha tal que a multiplicação fica azul o sol fica 0:34:19.919,0:34:25.649 -blue and then I keep sending out my bloom finally I can write such that I +azul e então eu continuo enviando minha flor finalmente eu posso escrever de tal forma que eu 0:34:25.649,0:34:31.110 -can get a 1 in the input gate the multiplication gets purple then the I +pode obter um 1 no portão de entrada a multiplicação fica roxa então o I 0:34:31.110,0:34:35.010 -set a zero in the don't forget such that the +defina um zero no não esqueça de tal forma que o 0:34:35.010,0:34:40.679 -we forget and then multiplication gives me zero I some do I get purple and then +nós esquecemos e então a multiplicação me dá zero eu alguns eu fico roxo e então 0:34:40.679,0:34:45.780 -I get the final purple output okay so here we control how to send how to write +Eu recebo a saída roxa final, então aqui nós controlamos como enviar como escrever 0:34:45.780,0:34:50.850 -in memory how to reset the memory and how to output something okay so we have +na memória como redefinir a memória e como produzir algo bem, então temos 0:34:50.850,0:35:04.770 -all different operation this looks like a computer - and in an yeah it is +todas as 
operações diferentes, isso parece um computador - e sim, é 0:35:04.770,0:35:08.700 -assumed in this case to show you like how the logic works as we are like +assumido neste caso para mostrar como a lógica funciona como nós somos 0:35:08.700,0:35:14.250 -having a value inside the sigmoid has been or below minus 5 or being above +ter um valor dentro do sigmóide foi ou abaixo de menos 5 ou está acima 0:35:14.250,0:35:27.780 -plus 5 such that we are working as a switch 0 1 switch okay the network can +mais 5 tal que estamos trabalhando como switch 0 1 switch ok a rede pode 0:35:27.780,0:35:32.790 -choose to use this kind of operation to me make sense I believe this is the +optar por usar este tipo de operação para mim faz sentido acredito que este é o 0:35:32.790,0:35:37.110 -rationale behind how this network has been put together the network can decide +a lógica por trás de como esta rede foi montada, a rede pode decidir 0:35:37.110,0:35:42.690 -to do anything it wants usually they do whatever they want but this seems like +para fazer o que quiser, geralmente eles fazem o que querem, mas isso parece 0:35:42.690,0:35:46.800 -they can work at least if they've had to saturate the gates it looks like things +eles podem trabalhar pelo menos se eles tiveram que saturar os portões parece que as coisas 0:35:46.800,0:35:51.930 -can work pretty well so in the remaining 15 minutes of kind of I'm gonna be +pode funcionar muito bem, então nos 15 minutos restantes do tipo eu estarei 0:35:51.930,0:35:56.880 -showing you two notebooks I kind of went a little bit faster because again there +mostrando dois cadernos eu meio que fui um pouco mais rápido porque novamente lá 0:35:56.880,0:36:04.220 -is much more to be seen here in the notebooks so yeah +é muito mais para ser visto aqui nos notebooks então sim 0:36:10.140,0:36:17.440 -so this the the actual weight the actual gradient you care here is gonna be the +então este é o peso real, o gradiente real que você se importa aqui será o 0:36:17.440,0:36:21.970 -gradient with respect to previous C's right the thing you care is gonna be +gradiente em relação ao C anterior certo, o que você se importa será 0:36:21.970,0:36:25.000 -basically the partial derivative of the current seen with respect to previous +basicamente a derivada parcial da corrente vista em relação à anterior 0:36:25.000,0:36:30.160 -C's such that you if you have the original initial C here and you have +C é tal que você se você tem a inicial C original aqui e você tem 0:36:30.160,0:36:35.140 -multiple C over time you want to change something in the original C you still +vários C ao longo do tempo você quer mudar algo no C original você ainda 0:36:35.140,0:36:39.130 -have the gradient coming down all the way until the first C which comes down +tem o gradiente descendo até o primeiro C que desce 0:36:39.130,0:36:43.740 -to getting gradients through that matrix Wc here right so if you want to change +para obter gradientes através dessa matriz Wc aqui, então se você quiser mudar 0:36:46.660,0:36:52.089 -those weights here you just go through the chain of multiplications that are +esses pesos aqui você apenas passa pela cadeia de multiplicações que são 0:36:52.089,0:36:56.890 -not involving any matrix multiplication as such that you when you get the +não envolvendo nenhuma multiplicação de matrizes como tal que você quando você obtém o 0:36:56.890,0:37:00.490 -gradient it still gets multiplied by one all the time and it gets down to +gradiente ainda é multiplicado por um o tempo todo e se reduz a 
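The point about the gradient along the cell-state path can be written out (treating the gates as constants, as in the lecture's simplification): the factor multiplying the gradient at each step backwards is the forget gate itself, so a forget gate equal to 1 neither kills nor amplifies it.

```latex
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
\quad\Longrightarrow\quad
\frac{\partial c_t}{\partial c_{t-1}} = \operatorname{diag}(f_t),
\qquad
f_t = 1 \;\Rightarrow\; \frac{\partial c_t}{\partial c_{t-1}} = I .
```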
0:37:00.490,0:37:05.760 -whatever we want to do okay did I answer your question +tudo o que queremos fazer ok eu respondi sua pergunta 0:37:09.150,0:37:16.660 -so the matrices will change the amplitude of your gradient right so if +então as matrizes vão mudar a amplitude do seu gradiente, então se 0:37:16.660,0:37:22.000 -you have like these largest eigenvalue being you know 0.0001 every time you +você tem como esses maiores autovalores sendo que você conhece 0,0001 toda vez que você 0:37:22.000,0:37:26.079 -multiply you get the norm of this vector getting killed right so you have like an +multiplique, você obtém a norma desse vetor sendo morto, então você tem como um 0:37:26.079,0:37:31.569 -exponential decay in this case if my forget gate is actually always equal to +decaimento exponencial neste caso se meu portão de esquecimento for realmente sempre igual a 0:37:31.569,0:37:37.510 -1 then you get c = c-t. What is the partial +1 então você obtém c = ct. Qual é a parcial 0:37:37.510,0:37:43.299 -derivative of c[t]/c[t-1]? +derivada de c[t]/c[t-1]? 0:37:43.299,0:37:48.579 -1 right so the parts of the relative that is the +1 certo então as partes do parente que é o 0:37:48.579,0:37:52.390 -thing that you actually multiply every time there's gonna be 1 so output +coisa que você realmente multiplica toda vez que haverá 1, então a saída 0:37:52.390,0:37:57.609 -gradient output gradients can be input gradients right yeah i'll pavillions +gradientes de saída de gradiente podem ser gradientes de entrada, sim, eu vou pavilhões 0:37:57.609,0:38:01.510 -gonna be implicit because it would apply the output gradient by the derivative of +será implícito porque aplicaria o gradiente de saída pela derivada de 0:38:01.510,0:38:05.599 -this module right if the this module is e1 then the thing that is +este módulo certo se este módulo é e1 então a coisa que é 0:38:05.599,0:38:14.660 -here keeps going that is the rationale behind this now this is just for drawing +aqui continua essa é a lógica por trás disso agora isso é só para desenhar 0:38:14.660,0:38:24.710 -purposes I assumed it's like a switch okay such that I can make things you +propósitos, eu assumi que é como um interruptor, tudo bem, para que eu possa fazer as coisas que você 0:38:24.710,0:38:29.089 -know you have a switch on and off to show like how it should be working maybe +sei que você tem um interruptor ligado e desligado para mostrar como deveria estar funcionando, talvez 0:38:29.089,0:38:46.579 -doesn't work like that but still it works it can work this way right yeah so +não funciona assim, mas ainda funciona, pode funcionar dessa maneira, sim, então 0:38:46.579,0:38:50.089 -that's the implementation of pro question is gonna be simply you just pad +essa é a implementação da pergunta profissional será simplesmente você apenas pad 0:38:50.089,0:38:55.069 -all the other sync when sees with zeros before the sequence so if you have +todos os outros sincronizam quando vê com zeros antes da sequência, então se você tiver 0:38:55.069,0:38:59.920 -several several sequences yes several sequences that are of a different length +várias várias sequências sim várias sequências de comprimento diferente 0:38:59.920,0:39:03.619 -you just put them all aligned to the right +você acabou de colocá-los todos alinhados à direita 0:39:03.619,0:39:08.960 -and then you put some zeros here okay such that you always have in the last +e então você coloca alguns zeros aqui ok tal que você sempre tem no último 0:39:08.960,0:39:14.599 -column the latest element if you 
put two zeros here it's gonna be a mess in right +coluna o elemento mais recente se você colocar dois zeros aqui vai ser uma bagunça certo 0:39:14.599,0:39:17.299 -in the code if you put the zeros in the in the beginning you just stop doing +no código se você colocar os zeros no começo você simplesmente para de fazer 0:39:17.299,0:39:21.319 -back propagation when you hit the last symbol right so you start from here you +propagação de volta quando você acertar o último símbolo certo, então você começa a partir daqui 0:39:21.319,0:39:25.460 -go back here so you go forward then you go back prop and stop whenever you +volte aqui para ir em frente, então você volta prop e pare sempre que você 0:39:25.460,0:39:29.599 -actually reach the end of your sequence if you pad on the other side you get a +realmente chegar ao final de sua seqüência se você pad do outro lado você obtém um 0:39:29.599,0:39:34.730 -bunch of drop there in the next ten minutes so you're gonna be seen two +monte de queda lá nos próximos dez minutos, então você vai ser visto dois 0:39:34.730,0:39:45.049 -notebooks if you don't have other questions okay wow you're so quiet okay +cadernos se você não tiver outras perguntas ok uau você está tão quieto ok 0:39:45.049,0:39:49.970 -so we're gonna be going now for sequence classification alright so in this case +então vamos agora para a classificação de sequências bem, então neste caso 0:39:49.970,0:39:54.589 -I'm gonna be I just really stuff loud out loud the goal is to classify a +Eu vou ser eu realmente falo alto o objetivo é classificar um 0:39:54.589,0:40:00.259 -sequence of elements sequence elements and targets are represented locally +sequência de elementos elementos de sequência e alvos são representados localmente 0:40:00.259,0:40:05.660 -input vectors with only one nonzero bit so it's a one hot encoding the sequence +vetores de entrada com apenas um bit diferente de zero, então é um hot codificando a sequência 0:40:05.660,0:40:10.770 -starts with a B for beginning and end with a E and otherwise consists of a +começa com um B para começar e terminar com um E e, caso contrário, consiste em um 0:40:10.770,0:40:16.370 -randomly chosen symbols from a set {a, b, c, d} which are some kind of noise +símbolos escolhidos aleatoriamente de um conjunto {a, b, c, d} que são algum tipo de ruído 0:40:16.370,0:40:22.380 -expect for two elements in position t1 and t2 this position can be either or X +espere para dois elementos na posição t1 e t2 esta posição pode ser ou X 0:40:22.380,0:40:29.460 -or Y in for the hard difficulty level you have for example that the sequence +ou Y para o nível de dificuldade difícil que você tem, por exemplo, que a sequência 0:40:29.460,0:40:35.220 -length length is chose randomly between 100 and 110 10 t1 is randomly chosen +comprimento comprimento é escolhido aleatoriamente entre 100 e 110 10 t1 é escolhido aleatoriamente 0:40:35.220,0:40:40.530 -between 10 and 20 Tinto is randomly chosen between 50 and 60 there are four +entre 10 e 20 Tinto é escolhido aleatoriamente entre 50 e 60 há quatro 0:40:40.530,0:40:47.010 -sequences classes Q, R, S and U which depends on the temporal order of x and y so if +classes de sequências Q, R, S e U que depende da ordem temporal de x e y então se 0:40:47.010,0:40:53.520 -you have X,X you can be getting a Q. X,Y you get an R. Y,X you get an S +você tem X,X você pode estar recebendo um Q. X,Y você ganha um R. Y,X você ganha um S 0:40:53.520,0:40:57.750 -and Y,Y get U. 
You so we're going to be doing a sequence classification based on +e Y,Y obtém U. Então, vamos fazer uma classificação de sequência com base em 0:40:57.750,0:41:03.720 -the X and y or whatever those to import to these kind of triggers okay +o X e y ou o que quer que importe para esses tipos de gatilhos, ok 0:41:03.720,0:41:08.370 -and in the middle in the middle you can have a,b,c,d in random positions like you +e no meio no meio você pode ter a,b,c,d em posições aleatórias como você 0:41:08.370,0:41:12.810 -know randomly generated is it clear so far so we do cast a classification of +sei gerado aleatoriamente está claro até agora, então lançamos uma classificação de 0:41:12.810,0:41:23.180 -sequences where you may have these X,X X,Y Y,X ou Y,Y. So in this case +seqüências onde você pode ter esses X,XX,YY,X ou Y,Y. Então neste caso 0:41:23.210,0:41:29.460 -I'm showing you first the first input so the return type is a tuple of sequence +Estou mostrando primeiro a primeira entrada para que o tipo de retorno seja uma tupla de sequência 0:41:29.460,0:41:36.780 -of two which is going to be what is the output of this example generator and so +de dois que vai ser qual é a saída deste gerador de exemplo e assim 0:41:36.780,0:41:43.050 -let's see what is what is this thing here so this is my data I'm going to be +vamos ver o que é isso aqui então esses são meus dados que eu vou ser 0:41:43.050,0:41:48.030 -feeding to the network so I have 1, 2, 3, 4, 5, 6, 7, 8 +alimentando a rede, então eu tenho 1, 2, 3, 4, 5, 6, 7, 8 0:41:48.030,0:41:54.180 -different symbols here in a row every time why there are eight we +símbolos diferentes aqui em uma linha cada vez que há oito nós 0:41:54.180,0:42:02.970 -have X and Y and a, b, c and d beginning and end. So we have one hot out of you +têm X e Y e a, b, c e d início e fim. Então nós temos um quente fora de você 0:42:02.970,0:42:08.400 -know eight characters and then i have a sequence of rows which are my sequence +sei oito caracteres e então eu tenho uma sequência de linhas que são minha sequência 0:42:08.400,0:42:12.980 -of symbols okay in this case you can see here i have a beginning with all zeros +de símbolos ok neste caso você pode ver aqui eu tenho um começo com todos os zeros 0:42:12.980,0:42:19.260 -why is all zeros padding right so in this case the sequence was shorter than +por que todos os zeros estão corretos, então, neste caso, a sequência foi menor que 0:42:19.260,0:42:21.329 -the expect the maximum sequence in the bed +a esperar a sequência máxima na cama 0:42:21.329,0:42:29.279 -and then the first first sequence has an extra zero item at the beginning in them +e então a primeira primeira sequência tem um item zero extra no início nelas 0:42:29.279,0:42:34.859 -you're gonna have like in this case the second item is of the two a pole to pole +você vai ter como neste caso o segundo item é dos dois um pólo a pólo 0:42:34.859,0:42:41.160 -is the corresponding best class for example I have a batch size of 32 and +é a melhor classe correspondente, por exemplo, tenho um tamanho de lote de 32 e 0:42:41.160,0:42:51.930 -then I'm gonna have an output size of 4. Why 4 ? Q, R, S and U. +então eu vou ter um tamanho de saída de 4. Por que 4 ? Q, R, S e U. 
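A sketch of the batch layout being described: each of the 8 symbols {B, E, X, Y, a, b, c, d} is one-hot encoded, shorter sequences are left-padded with all-zero rows so the last real symbol always sits in the last position, and the target is one of the 4 classes Q, R, S, U. The helper below is illustrative, not the notebook's generator:

```python
import torch

SYMBOLS = {s: i for i, s in enumerate("BEXYabcd")}   # 8 symbols -> one-hot size 8
CLASSES = {c: i for i, c in enumerate("QRSU")}       # 4 target classes

def encode_batch(sequences, labels):
    # Left-pad with all-zero rows so the last real symbol always sits at the end.
    max_len = max(len(s) for s in sequences)
    x = torch.zeros(len(sequences), max_len, len(SYMBOLS))
    for i, seq in enumerate(sequences):
        for t, ch in enumerate(seq):
            x[i, max_len - len(seq) + t, SYMBOLS[ch]] = 1.0
    y = torch.tensor([CLASSES[c] for c in labels])
    return x, y

x, y = encode_batch(["BbXcXcbE", "BXaYE"], ["Q", "R"])  # X,X -> Q and X,Y -> R
print(x.shape, y)                                       # torch.Size([2, 8, 8]) tensor([0, 1])
```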
0:42:51.930,0:42:57.450 -so I have 4 a 4 dimensional target vector and I have a sequence of 8 +então eu tenho 4 um vetor alvo de 4 dimensões e tenho uma sequência de 8 0:42:57.450,0:43:04.499 -dimensional vectors as input okay so let's see how this sequence looks like +vetores dimensionais como entrada ok, então vamos ver como essa sequência se parece 0:43:04.499,0:43:12.779 -in this case is gonna be BbXcXcbE. So X,X let's see X X X X is Q +neste caso será BbXcXcbE. Então X,X vamos ver que XXXX é Q 0:43:12.779,0:43:18.569 -right so we have our Q sequence and that's why the final target is a Q the 1 +certo, então temos nossa sequência Q e é por isso que o alvo final é um Q o 1 0:43:18.569,0:43:25.019 -0 0 0 and then you're gonna see B B X C so the second item and the second last +0 0 0 e então você verá BBXC então o segundo item e o penúltimo 0:43:25.019,0:43:30.390 -is gonna be B lowercase B you can see here the second item and the second last +vai ser B minúsculo B você pode ver aqui o segundo item e o penúltimo 0:43:30.390,0:43:36.390 -item is going to be a be okay all right so let's now create a recurrent Network +o item vai ficar bem, então vamos agora criar uma rede recorrente 0:43:36.390,0:43:41.249 -in a very quick way so here I can simply say my recurrent network is going to be +de uma maneira muito rápida, então aqui posso simplesmente dizer que minha rede recorrente será 0:43:41.249,0:43:47.369 -torch and an RNN and I'm gonna be using a reader network really non-linearity +tocha e um RNN e vou usar uma rede de leitura realmente não linear 0:43:47.369,0:43:52.709 -and then I have my final linear layer in the other case I'm gonna be using a led +e então eu tenho minha camada linear final no outro caso eu vou usar um led 0:43:52.709,0:43:57.119 -STM and then I'm gonna have a final inner layer so I just execute these guys +STM e então eu vou ter uma camada interna final, então eu apenas executo esses caras 0:43:57.119,0:44:07.920 -I have my training loop and I'm gonna be training for 10 books so in the training +Eu tenho meu loop de treinamento e vou treinar para 10 livros, então no treinamento 0:44:07.920,0:44:13.259 -group you can be always looking for those five different steps first step is +grupo você pode estar sempre procurando por esses cinco passos diferentes, o primeiro passo é 0:44:13.259,0:44:18.900 -gonna be get the data inside the model right so that's step number one what is +vamos obter os dados dentro do modelo correto, então esse é o passo número um, o que é 0:44:18.900,0:44:30.669 -step number two there are five steps we remember hello +passo número dois há cinco passos que lembramos olá 0:44:30.669,0:44:35.089 -you feel that you feed the network if you feed the network with some data then +você sente que alimenta a rede se alimenta a rede com alguns dados, então 0:44:35.089,0:44:41.539 -what do you do you compute the loss okay then we have compute step to compute the +o que você faz você calcula a perda ok, então temos o passo de cálculo para calcular o 0:44:41.539,0:44:52.549 -loss fantastic number three is zero the cash right then number four which is +perda fantástica número três é zero o dinheiro certo então o número quatro que é 0:44:52.549,0:45:09.699 -computing the off yes lost dog backwards lost not backward don't compute the +computando o off sim cachorro perdido para trás perdido não para trás não computar o 0:45:09.699,0:45:16.449 -partial derivative of the loss with respect to the network's parameters yeah +derivada parcial da perda em relação aos 
parâmetros da rede sim 0:45:16.449,0:45:27.380 -here backward finally number five which is step in opposite direction of the +aqui para trás, finalmente, número cinco, que é um passo na direção oposta do 0:45:27.380,0:45:31.819 -gradient okay all right those are the five steps you always want to see in any +gradiente tudo bem esses são os cinco passos que você sempre quer ver em qualquer 0:45:31.819,0:45:37.909 -training blueprint if someone is missing then you're [ __ ] up okay so we try now +plano de treinamento se alguém estiver faltando, então você está [ __ ] bem, então tentamos agora 0:45:37.909,0:45:42.469 -the RNN and the LSTM and you get something looks like this +o RNN e o LSTM e você obtém algo parecido com isso 0:45:42.469,0:45:55.929 -so our NN goes up to 50% in accuracy and then the LSTM got 100% okay oh okay +então nosso NN vai até 50% em precisão e então o LSTM ficou 100% ok oh ok 0:45:56.439,0:46:06.019 -first of all how many weights does this LSTM have compared to the RNN four +em primeiro lugar, quantos pesos este LSTM tem em comparação com o RNN quatro 0:46:06.019,0:46:11.059 -times more weights right so it's not a fair comparison I would say because LSTM +vezes mais pesos certo, então não é uma comparação justa, eu diria porque LSTM 0:46:11.059,0:46:16.819 -is simply for rnns combined somehow right so this is a two layer neural +é simplesmente para rnns combinados de alguma forma certo, então este é um neural de duas camadas 0:46:16.819,0:46:20.659 -network whereas the other one is at one layer right always both ever like it has +rede enquanto o outro está em uma camada certa sempre ambos sempre como se tivesse 0:46:20.659,0:46:25.009 -one hidden layer they are an end if Alice TM we can think about having two +uma camada oculta eles são um fim se Alice TM podemos pensar em ter duas 0:46:25.009,0:46:33.199 -hidden so again one layer two layers well one hidden to lead in one set of +escondido então novamente uma camada duas camadas bem uma escondida para levar em um conjunto de 0:46:33.199,0:46:37.610 -parameters four sets of the same numbers like okay not fair okay anyway +parâmetros quatro conjuntos dos mesmos números como ok não justo ok de qualquer maneira 0:46:37.610,0:46:43.610 -let's go with hundred iterations okay so now I just go with 100 iterations and I +vamos com cem iterações ok, então agora eu vou com 100 iterações e eu 0:46:43.610,0:46:49.490 -show you how if they work or not and also when I be just clicking things such +mostrar como funcionam ou não e também quando estou apenas clicando em coisas como 0:46:49.490,0:46:56.000 -that we have time to go through stuff okay now my computer's going to be +que temos tempo para fazer as coisas ok agora meu computador vai ser 0:46:56.000,0:47:02.990 -complaining all right so again what are the five types of operations like five +reclamando tudo bem então novamente quais são os cinco tipos de operações como cinco 0:47:02.990,0:47:06.860 -okay now is already done sorry I was going to do okay so this is +ok agora já está feito desculpe eu ia fazer tudo bem então é isso 0:47:06.860,0:47:16.280 -the RNN right RNN and finally actually gave to 100% okay so iron and it just +o RNN certo RNN e finalmente deu a 100% ok então ferro e só 0:47:16.280,0:47:20.030 -let it more time like a little bit more longer training actually works the other +deixe mais tempo como um pouco mais de treinamento mais longo realmente funciona o outro 0:47:20.030,0:47:26.060 -one okay and here you can see that we got 100% in twenty eight bucks okay 
the +um ok e aqui você pode ver que temos 100% em vinte e oito dólares ok o 0:47:26.060,0:47:30.650 -other case we got 2,100 percent in roughly twice as long +outro caso, obtivemos 2.100% em aproximadamente o dobro do tempo 0:47:30.650,0:47:35.690 -twice longer at a time okay so let's first see how they perform here so I +duas vezes mais de cada vez ok, então vamos primeiro ver como eles se comportam aqui, então eu 0:47:35.690,0:47:42.200 -have this sequence BcYdYdaE which is a U sequence and then we ask the network +temos essa sequência BcYdYdaE que é uma sequência U e então perguntamos a rede 0:47:42.200,0:47:46.760 -and he actually meant for actually like labels it as you okay so below we're +e ele realmente quis dizer como rótulos como você está bem, então abaixo estamos 0:47:46.760,0:47:51.140 -gonna be seeing something very cute so in this case we were using sequences +veremos algo muito fofo, então neste caso estávamos usando sequências 0:47:51.140,0:47:56.870 -that are very very very very small right so even the RNN is able to train on +que são muito, muito, muito pequenos, então até o RNN é capaz de treinar 0:47:56.870,0:48:02.390 -these small sequences so what is the point of using a LSTM well we can first +essas pequenas sequências, então qual é o ponto de usar um LSTM bem, podemos primeiro 0:48:02.390,0:48:07.430 -of all increase the difficulty of the training part and we're gonna see that +de tudo aumentar a dificuldade da parte do treinamento e vamos ver isso 0:48:07.430,0:48:13.280 -the RNN can be miserably failing whereas the LSTM keeps working in this +o RNN pode estar falhando miseravelmente enquanto o LSTM continua trabalhando neste 0:48:13.280,0:48:19.790 -visualization part below okay I train a network now Alice and LSTM now with the +parte de visualização abaixo ok eu treino uma rede agora Alice e LSTM agora com o 0:48:19.790,0:48:26.000 -moderate level which has eighty symbols rather than eight or ten ten symbols so +nível moderado que tem oitenta símbolos em vez de oito ou dez dez símbolos, então 0:48:26.000,0:48:31.430 -you can see here how this model actually managed to succeed at the end although +você pode ver aqui como esse modelo realmente conseguiu ter sucesso no final, embora 0:48:31.430,0:48:38.870 -there is like a very big spike and I'm gonna be now drawing the value of the +há um pico muito grande e agora vou desenhar o valor do 0:48:38.870,0:48:43.970 -cell state over time okay so I'm going to be input in our sequence of eighty +estado da célula ao longo do tempo ok, então eu vou entrar em nossa sequência de oitenta 0:48:43.970,0:48:49.090 -symbols and I'm gonna be showing you what is the value of the hidden state +símbolos e eu vou mostrar a você qual é o valor do estado oculto 0:48:49.090,0:48:53.330 -hidden State so in this case I'm gonna be showing you +estado oculto, então, neste caso, eu vou mostrar a você 0:48:53.330,0:48:56.910 -[Music] hidden hold on +[Música] escondido espera 0:48:56.910,0:49:01.140 -yeah I'm gonna be showing I'm gonna send my input through a hyperbolic tangent +sim, vou mostrar que vou enviar minha entrada através de uma tangente hiperbólica 0:49:01.140,0:49:06.029 -such that if you're below minus 2.5 I'm gonna be mapping to minus 1 if you're +de tal forma que se você estiver abaixo de menos 2,5 eu vou mapear para menos 1 se você estiver 0:49:06.029,0:49:12.329 -above 2.5 you get mapped to plus 1 more or less and so let's see how this stuff +acima de 2,5 você é mapeado para mais 1 mais ou menos e então vamos ver como essas 
coisas 0:49:12.329,0:49:18.029 -looks so in this case here you can see that this specific hidden layer picked +parece que, neste caso, você pode ver que essa camada oculta específica foi selecionada 0:49:18.029,0:49:27.720 -on the X here and then it became red until you got the other X right so this +no X aqui e então ficou vermelho até você acertar o outro X então isso 0:49:27.720,0:49:33.710 -is visualizing the internal state of the LSD and so you can see that in specific +está visualizando o estado interno do LSD e assim você pode ver isso em 0:49:33.710,0:49:39.599 -unit because in this case I use hidden representation like hidden dimension of +unidade porque neste caso eu uso representação oculta como dimensão oculta de 0:49:39.599,0:49:47.700 -10 and so in this case the 1 2 3 4 5 the fifth hidden unit of the cell lay the +10 e assim neste caso o 1 2 3 4 5 a quinta unidade oculta da célula 0:49:47.700,0:49:52.829 -5th cell actually is trigger by observing the first X and then it goes +A 5ª célula na verdade é acionada observando o primeiro X e depois vai 0:49:52.829,0:49:58.410 -quiet after seen the other acts this allows me to basically you know take +quieto depois de ver os outros atos, isso me permite basicamente você saber 0:49:58.410,0:50:07.440 -care of I mean recognize if the sequence is U, P, R or S. Okay does it make sense okay +cuidar de quero dizer reconhecer se a sequência é U, P, R ou S. Ok, faz sentido ok 0:50:07.440,0:50:14.519 -oh this one more notebook I'm gonna be showing just quickly which is the 09-echo_data +ah mais esse notebook que vou mostrar rapidinho que é o 09-echo_data 0:50:14.519,0:50:22.410 -in this case I'm gonna be in South corner I'm gonna have a network echo in +neste caso eu vou estar no canto sul vou ter um eco de rede em 0:50:22.410,0:50:27.059 -whatever I'm saying so if I say something I asked a network to say if I +o que quer que eu esteja dizendo, se eu disser algo, pedi a uma rede para dizer se eu 0:50:27.059,0:50:30.960 -say something I asked my neighbor to say if I say something I ask ok Anderson +fala alguma coisa eu pedi pro meu vizinho falar se eu falar alguma coisa eu peço ok Anderson 0:50:30.960,0:50:42.150 -right ok so in this case here and I'll be inputting this is the first sequence +certo ok então neste caso aqui e eu estarei inserindo esta é a primeira sequência 0:50:42.150,0:50:50.579 -is going to be 0 1 1 1 1 0 and you'll have the same one here 0 1 1 1 1 0 and I +vai ser 0 1 1 1 1 0 e você terá o mesmo aqui 0 1 1 1 1 0 e eu 0:50:50.579,0:50:57.259 -have 1 0 1 1 0 1 etc right so in this case if you want to output something +tem 1 0 1 1 0 1 etc certo, então, neste caso, se você quiser produzir algo 0:50:57.259,0:51:00.900 -after some right this in this case is three time +depois de algum certo isso neste caso é três vezes 0:51:00.900,0:51:06.809 -step after you have to have some kind of short-term memory where you keep in mind +passo depois você tem que ter algum tipo de memória de curto prazo onde você tenha em mente 0:51:06.809,0:51:11.780 -what I just said where you keep in mind what I just said where you keep in mind +o que eu acabei de dizer onde você tem em mente o que eu acabei de dizer onde você tem em mente 0:51:11.780,0:51:16.890 -[Music] what I just said yeah that's correct so +[Música] o que acabei de dizer, sim, está correto, então 0:51:16.890,0:51:22.099 -you know pirating actually requires having some kind of working memory +você sabe que a pirataria realmente requer algum tipo de memória de trabalho 
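
Putting the pieces from the last few captions together — an `nn.RNN` with a ReLU non-linearity (or an `nn.LSTM`) followed by a final linear layer, trained with the five steps listed above — a minimal sketch could look like the following. The class name, the optimiser choice and the `data_loader` are assumptions for illustration, not the notebook's exact code:

```python
import torch
from torch import nn

class SimpleSequenceClassifier(nn.Module):
    """Illustrative wrapper: a recurrent layer followed by a final linear read-out."""
    def __init__(self, input_size=8, hidden_size=10, output_size=4, use_lstm=False):
        super().__init__()
        if use_lstm:
            self.rnn = nn.LSTM(input_size, hidden_size, batch_first=True)
        else:
            self.rnn = nn.RNN(input_size, hidden_size, nonlinearity='relu', batch_first=True)
        self.linear = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        out, _ = self.rnn(x)            # out: (batch, seq_len, hidden_size)
        return self.linear(out[:, -1])  # classify from the last time step (the E marker)

model = SimpleSequenceClassifier(use_lstm=True)
criterion = nn.CrossEntropyLoss()
optimiser = torch.optim.RMSprop(model.parameters(), lr=1e-3)

for x, y in data_loader:        # assumed loader yielding (batch, seq, 8) inputs and class indices
    y_pred = model(x)           # 1. feed the data through the model
    loss = criterion(y_pred, y) # 2. compute the loss
    optimiser.zero_grad()       # 3. zero the gradient cache
    loss.backward()             # 4. compute partial derivatives of the loss w.r.t. the parameters
    optimiser.step()            # 5. step in the opposite direction of the gradient
```

As noted above, the head-to-head accuracy comparison is not entirely fair: for the same hidden size, an LSTM carries roughly four times as many recurrent parameters as a plain RNN (one weight set per gate plus the cell candidate), so matching parameter counts, or simply training the RNN longer, narrows the gap on these short sequences.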
0:51:22.099,0:51:27.569 -whereas the other one the language model which it was prompted prompted to say +enquanto o outro o modelo de linguagem que foi levado a dizer 0:51:27.569,0:51:33.539 -something that I haven't already said right so that was a different kind of +algo que eu ainda não disse direito, então esse foi um tipo diferente de 0:51:33.539,0:51:38.700 -task you actually had to predict what is the most likely next word in keynote you +tarefa que você realmente teve que prever qual é a próxima palavra mais provável no keynote que você 0:51:38.700,0:51:42.329 -cannot be always right right but this one you can always be right you know +não pode estar sempre certo, mas este você sempre pode estar certo, você sabe 0:51:42.329,0:51:49.079 -this is there is no random stuff anyhow so I have my first batch here and then +isso é que não há coisas aleatórias de qualquer maneira, então eu tenho meu primeiro lote aqui e depois 0:51:49.079,0:51:53.549 -the sec the white patch which is the same similar thing which is shifted over +a sec a mancha branca que é a mesma coisa semelhante que é deslocada 0:51:53.549,0:52:01.319 -time and then we have we have to chunk this long long long sequence so before I +tempo e então temos que fragmentar essa sequência longa, longa, então antes de eu 0:52:01.319,0:52:05.250 -was sending a whole sequence inside the network and I was enforcing the final +estava enviando uma sequência inteira dentro da rede e eu estava aplicando o final 0:52:05.250,0:52:09.569 -target to be something right in this case I had to chunk if the sequence goes +target para ser algo certo, neste caso, eu tive que separar se a sequência for 0:52:09.569,0:52:13.319 -this direction I had to chunk my long sequence in little chunks and then you +nessa direção eu tive que dividir minha longa sequência em pequenos pedaços e então você 0:52:13.319,0:52:18.869 -have to fill the first chunk keep trace of whatever is the hidden state send a +tem que preencher o primeiro pedaço manter o rastro de qualquer que seja o estado oculto enviar um 0:52:18.869,0:52:23.549 -new chunk where you feed and initially as the initial hidden state the output +novo pedaço onde você alimenta e inicialmente como o estado oculto inicial a saída 0:52:23.549,0:52:28.319 -of this chant right so you feed this chunk you have a final hidden state then +deste canto certo então você alimenta este pedaço você tem um estado oculto final então 0:52:28.319,0:52:33.960 -you feed this chunk and as you put you have to put these two as input to the +você alimenta esse pedaço e como você coloca você tem que colocar esses dois como entrada para o 0:52:33.960,0:52:38.430 -internal memory right now you feed the next chunk where you put this one as +memória interna agora você alimenta o próximo pedaço onde você coloca este como 0:52:38.430,0:52:44.670 -input as to the internal state and you we are going to be comparing here RNN +entrada quanto ao estado interno e você vamos comparar aqui RNN 0:52:44.670,0:52:57.059 -with analyst TMS I think so at the end here you can see that okay we managed to +com o analista TMS eu acho que no final aqui você pode ver que tudo bem nós conseguimos 0:52:57.059,0:53:02.789 -actually get we are an n/a accuracy that goes 100 100 percent then if you are +na verdade, somos uma precisão n/a que vai de 100 a 100 por cento, então se você estiver 0:53:02.789,0:53:08.220 -starting now to mess with the size of the memory chunk with a memory interval +começando agora a mexer com o tamanho do pedaço de memória com um 
intervalo de memória 0:53:08.220,0:53:11.619 -you can be seen with the LSTM you can keep this memory +você pode ser visto com o LSTM você pode manter esta memória 0:53:11.619,0:53:16.399 -for a long time as long as you have enough capacity the RNN after you reach +por um longo tempo, desde que você tenha capacidade suficiente do RNN depois de atingir 0:53:16.399,0:53:22.880 -some kind of length you start forgetting what happened in the past and it was +algum tipo de comprimento você começa a esquecer o que aconteceu no passado e foi 0:53:22.880,0:53:29.809 -pretty much everything for today so stay warm wash your hands and I'll see you +praticamente tudo por hoje então fique aquecido lave as mãos e eu vou te ver 0:53:29.809,0:53:34.929 -next week bye bye +semana que vem tchau \ No newline at end of file
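
Finally, the chunking scheme described just before the wrap-up — split the long sequence into pieces, feed one piece at a time, and pass the final hidden state of one chunk in as the initial hidden state of the next — is truncated back-propagation through time. Below is a minimal sketch for the echo task; the shapes, the 3-step shift and the hyper-parameters are illustrative assumptions, not the 09-echo_data notebook's actual code:

```python
import torch
from torch import nn

rnn = nn.RNN(input_size=1, hidden_size=4, batch_first=True)
readout = nn.Linear(4, 1)
criterion = nn.BCEWithLogitsLoss()
optimiser = torch.optim.SGD(list(rnn.parameters()) + list(readout.parameters()), lr=1e-2)

x = torch.randint(0, 2, (1, 1000, 1)).float()  # a long random bit sequence (batch size 1)
y = torch.roll(x, shifts=3, dims=1)            # echo target: the same bits, 3 steps later
                                               # (wrap-around at the start is ignored in this sketch)
chunk_size = 20
h = None                                       # initial hidden state
for t in range(0, x.size(1), chunk_size):
    x_chunk = x[:, t:t + chunk_size]
    y_chunk = y[:, t:t + chunk_size]
    out, h = rnn(x_chunk, h)                   # start from the previous chunk's final state
    loss = criterion(readout(out), y_chunk)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    h = h.detach()                             # keep the memory, but cut the graph here
```

The `detach()` call is what carries the short-term memory across chunks while stopping the computational graph (and hence back-propagation) from growing over the entire sequence; swapping `nn.RNN` for `nn.LSTM` only changes the hidden state into an `(h, c)` tuple, each element of which needs detaching.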