implement_RNN (vector-to-sequence): two weights

명징직조지훈 2023. 3. 2. 23:54

Previous post (2023.03.02, teach-meaning.tistory.com): implement_RNN (vector-to-sequence, delta3)


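The previous version used three weights: one for the input, one for the previous output, and one for the output value.

If the output is instead computed by multiplying the input by its weight and the previous output by its weight, summing the two products, and applying the activation function,

a(w_x * x + w_y * y_prev + bias), then the separate weight on the output is no longer needed.

This post implements that form (the code below leaves out the bias term).

In this case each time step receives the previous time step's output as an input.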

    def delta_to_sequence_cal_rnn_test(self, input, target):

      # Store the target value
      self.target = target

      # Store the input data
      self.input = input

      # Reset the per-time-step caches used later by the backward pass
      self.node_input = []
      self.node_output = []

      # The weights must be created to match the size of the input data.
      # Assume each time step takes a (3,1) input and predicts a (1,1) result.

      # Weight applied to the input
      # To get a (1,1) matrix product from a (3,1) input, a (1,3) weight matrix is needed
      self.input_w = np.random.rand(target.shape[0], input.shape[1])
      
      # Weight applied to the previous output
      # A (1,1) weight matrix that maps the (1,1) node output to a (1,1) result
      self.before_w = np.random.rand(target.shape[0], target.shape[0])

      # Output of the previous node; zero at the first step. Shape is (1,1)
      before = np.zeros((target.shape[0], 1))
      
      # input has shape (n, 3); the loop runs once per time step, n times in total
      for i in range(input.shape[0]):
        # Column vector holding this time step's input
        input_data = input[i].reshape(input.shape[1],-1)

        # RNN node input: previous node output @ its weight + input data @ its weight
        node_input = (self.before_w @ before) + (self.input_w @ input_data)
        self.node_input.append(node_input)

        # Node output: apply the activation function
        node_output = self.activation.sigmoid(node_input)
        self.node_output.append(node_output)

        # This time step's output becomes the next node's input
        before = node_output

      self.predict = node_output

      return self.predict

What changed: result_w is removed, the activation output is now both the node's output and the next node's input, and before_w is applied to that value.
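The two-weight recurrence can also be tried outside the class. The following is a minimal standalone sketch, not the class code itself: the free functions `sigmoid` and `forward` and the random test data are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x_seq, input_w, before_w):
    """Run the two-weight recurrence over x_seq of shape (n, features)."""
    before = np.zeros((before_w.shape[0], 1))  # previous output starts at zero
    outputs = []
    for x in x_seq:
        # previous output @ before_w + current input @ input_w, then activation
        node_input = before_w @ before + input_w @ x.reshape(-1, 1)
        before = sigmoid(node_input)
        outputs.append(before)
    return outputs

rng = np.random.default_rng(0)
x_seq = rng.random((5, 3))      # 5 time steps, 3 features each
input_w = rng.random((1, 3))    # maps a (3,1) input to (1,1)
before_w = rng.random((1, 1))   # maps the (1,1) previous output to (1,1)

outputs = forward(x_seq, input_w, before_w)
print(len(outputs), outputs[-1].shape)
```

No third weight appears anywhere: the last element of `outputs` is already the prediction.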

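This also changes how the delta values are computed.

Previously the output node had to be handled separately; now, apart from the step that combines the last node's prediction with the cost function, every node goes through the same computation.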

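    def cal_delta(self):
      # Start from an empty list so repeated calls do not accumulate old values
      self.delta = []

      # Change in the cost function with respect to the last node
      delta = (self.cost.diff_error_squared_sum() * self.activation.sigmoid_diff(self.predict))
      self.delta.append(delta)

      # Walk backward over the remaining nodes so that n delta values are stored in total
      for i in range(self.input.shape[0] - 2, -1, -1):
        # Change in the cost function with respect to this node's output
        delta = self.before_w.T @ delta
        # Multiply by the activation derivative at node i; the result is the change
        # in the cost function with respect to this node
        # (sigmoid_diff receives the node output, as in the first step above)
        delta = delta * self.activation.sigmoid_diff(self.node_output[i])
        # Store the delta of each node
        self.delta.append(delta)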

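In the vector-to-sequence RNN, the change in the cost function with respect to the last node can be obtained from the last prediction, the derivative of the cost function, and the derivative of the activation function.

To compute the delta values of the earlier nodes, the delta just computed is multiplied by before_w, the weight connecting the nodes, which gives the change in the cost function with respect to the previous node's output;

multiplying that by the derivative of the activation function gives the change in the cost function with respect to that node.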
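Written as equations, the recursion described above looks roughly as follows. The symbols are assumptions introduced here for illustration: $x_t$ is the input at step $t$, $z_t$ the node input, $y_t$ the node output, $W_x$ and $W_y$ stand for input_w and before_w, $\sigma$ is the sigmoid, and the cost $E$ is taken to depend only on the last output $y_T$.

```latex
\begin{aligned}
z_t &= W_y\, y_{t-1} + W_x\, x_t, \qquad y_t = \sigma(z_t), \qquad y_0 = 0 \\
\delta_T &= \frac{\partial E}{\partial y_T} \odot \sigma'(z_T) \\
\delta_t &= \left(W_y^{\top} \delta_{t+1}\right) \odot \sigma'(z_t), \qquad t = T-1, \dots, 1 \\
\sigma'(z_t) &= y_t \odot (1 - y_t)
\end{aligned}
```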

rnn.delta
>>>
[array([[0.22325928]]),
 array([[0.02529628]]),
 array([[0.00286618]]),
 array([[0.00032475]]),
 array([[3.67958077e-05]]),
 array([[4.16913105e-06]]),
 array([[4.7238136e-07]]),
 array([[5.35229396e-08]]),
 array([[6.06439057e-09]]),
 array([[6.87122816e-10]]),
 array([[7.78541156e-11]]),
 array([[8.8212226e-12]]),
 array([[9.99484324e-13]]),
 array([[1.13246084e-13]]),
 array([[1.28312923e-14]]),
 array([[1.45384332e-15]]),
 array([[1.64727008e-16]]),
 array([[1.86643133e-17]]),
 array([[2.11475092e-18]]),
 array([[2.39610821e-19]]),
 array([[2.71489873e-20]]),
 array([[3.07610276e-21]]),
 array([[3.48536323e-22]]),
 array([[3.94907381e-23]]),
 array([[4.47447883e-24]]),
 array([[5.06978641e-25]]),
 array([[5.74429677e-26]]),
 array([[6.50854744e-27]]),
 array([[7.37447794e-28]]),
 array([[8.35561626e-29]]),
 array([[9.46729025e-30]]),
 array([[1.07268671e-30]]),
 array([[1.21540244e-31]]),
 array([[1.37710581e-32]]),
 array([[1.56032302e-33]]),
 array([[1.76791639e-34]]),
 array([[2.00312904e-35]]),
 array([[2.26963559e-36]]),
 array([[2.57159954e-37]]),
 array([[2.91373831e-38]]),
 array([[3.30139699e-39]]),
 array([[3.74063176e-40]]),
 array([[4.23830457e-41]]),
 array([[4.80219032e-42]]),
 array([[5.44109832e-43]]),
 array([[6.16500992e-44]]),
 array([[6.98523442e-45]]),
 array([[7.9145858e-46]]),
 array([[8.96758284e-47]]),
 array([[1.0160676e-47]]),
 array([[1.15125044e-48]])]

These are the computed delta values. Here too, repeated multiplication by numbers smaller than 1 makes their magnitude shrink very quickly.
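The printed values shrink by a nearly constant factor per step, which can be checked directly. The sketch below only reuses numbers copied from the output above; the variable names are illustrative. For reference, the sigmoid derivative never exceeds 0.25, so in the (1,1) case each backward step scales delta by at most 0.25 times the magnitude of before_w.

```python
import math

# First two and last delta values copied from the rnn.delta output above
first, second, last = 0.22325928, 0.02529628, 1.15125044e-48
steps = 50  # the list has 51 entries, so 50 multiplications separate first and last

ratio = second / first                    # shrink factor of a single step
predicted_last = first * ratio ** steps   # what pure geometric decay would give

print(f"ratio per step: {ratio:.4f}")
print(f"predicted last delta: {predicted_last:.3e}")
print(math.isclose(predicted_last, last, rel_tol=1e-3))
```

A factor of about 0.11 per step means a node ten steps away from the output already contributes roughly ten orders of magnitude less than the last node.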

From these delta values, the weight change for before_w and input_w can be computed at the node of every time step; the weights are then updated with the sum of all of these changes.
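Summing the per-step weight changes can be sketched and checked against a finite-difference estimate. This is a standalone illustration under stated assumptions, not the class code: the functions `forward`, `cost`, and `gradients`, the random data, and the target value 0.3 are made up for the example, and the cost is assumed to be half the squared error on the last output, which is consistent with the cost values printed later in this post.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x_seq, input_w, before_w):
    before = np.zeros((before_w.shape[0], 1))
    outputs = []
    for x in x_seq:
        before = sigmoid(before_w @ before + input_w @ x.reshape(-1, 1))
        outputs.append(before)
    return outputs

def cost(x_seq, target, input_w, before_w):
    # Assumed cost: half the squared error on the last output only
    y_last = forward(x_seq, input_w, before_w)[-1]
    return 0.5 * float(np.sum((y_last - target) ** 2))

def gradients(x_seq, target, input_w, before_w):
    outputs = forward(x_seq, input_w, before_w)
    grad_input_w = np.zeros_like(input_w)
    grad_before_w = np.zeros_like(before_w)
    # delta at the last node: dE/dy * sigmoid'(z), with sigmoid'(z) = y * (1 - y)
    delta = (outputs[-1] - target) * outputs[-1] * (1 - outputs[-1])
    for t in range(len(x_seq) - 1, -1, -1):
        # previous output is zero for the first node
        prev = outputs[t - 1] if t > 0 else np.zeros_like(outputs[0])
        grad_input_w += delta @ x_seq[t].reshape(1, -1)   # delta_t times x_t
        grad_before_w += delta @ prev.T                   # delta_t times y_(t-1)
        # move delta one node back: through before_w, then through the activation
        delta = (before_w.T @ delta) * prev * (1 - prev)
    return grad_input_w, grad_before_w

rng = np.random.default_rng(0)
x_seq = rng.random((6, 3))
target = np.array([[0.3]])
input_w = rng.random((1, 3))
before_w = rng.random((1, 1))

grad_input_w, grad_before_w = gradients(x_seq, target, input_w, before_w)

# Central finite difference on the single entry of before_w
eps = 1e-5
plus, minus = before_w.copy(), before_w.copy()
plus[0, 0] += eps
minus[0, 0] -= eps
numeric = (cost(x_seq, target, input_w, plus) - cost(x_seq, target, input_w, minus)) / (2 * eps)
print(grad_before_w[0, 0], numeric)
```

If the two printed numbers agree, the summed per-step terms are the derivative of the cost with respect to before_w.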

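    def update_weight(self, learning_rate):
      # The deltas were computed from the last node backward; reverse them
      # so that self.delta[i] belongs to time step i
      self.delta = self.delta[::-1]

      # Weight change for before_w: sum of delta times the previous node's output
      result = 0

      for i in range(self.input.shape[0]):
        # The first node has no predecessor, so its previous output is zero
        before = self.node_output[i - 1] if i > 0 else np.zeros_like(self.node_output[0])
        result = result + self.delta[i] @ before.T

      self.before_weight_update = result

      # Weight change for input_w: sum of delta times the input data
      result = 0

      for i in range(self.input.shape[0]):
        result = result + self.delta[i] @ self.input[i].reshape(1, -1)

      self.input_weight_update = result

      # Gradient descent step with the summed changes
      # (delta is taken to be the derivative of the cost, so the step is subtracted)
      self.before_w = self.before_w - learning_rate * self.before_weight_update
      self.input_w = self.input_w - learning_rate * self.input_weight_update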

rnn.input_weight_update
>>>
array([[-0.00344549]])

rnn.before_weight_update
>>>
array([[0.00565245]])

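The results show that the farther a node is from the output, the smaller its effect on the weight update becomes.

(Should this be compensated for?)

Next, check how the cost function changes with the learning rate and the number of iterations.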

    def iterations(self, iterations, learning_rate):
      for j in range(iterations):
        # Reset the per-time-step caches
        self.node_input = []
        self.node_output = []
        # Output of the previous node; zero at the first step. Shape is (1,1)
        before = np.zeros((self.target.shape[0], 1))
        for i in range(self.input.shape[0]):
          # Column vector holding this time step's input
          input_data = self.input[i].reshape(self.input.shape[1],-1)

          # RNN node input: previous node output @ its weight + input data @ its weight
          node_input = (self.before_w @ before) + (self.input_w @ input_data)
          self.node_input.append(node_input)

          # Node output: apply the activation function
          node_output = self.activation.sigmoid(node_input)
          self.node_output.append(node_output)

          # This output is the next node's recurrent input
          before = node_output

        self.predict = before
        self.cal_delta()
        self.update_weight(learning_rate)
        print(self.cost_cal(), self.predict, self.target)

rnn.iterations(1000, 10)
>>>
0.4054894435206465 [[0.54564968]] [[-0.35489398]]
0.3959451604343717 [[0.53498821]] [[-0.35489398]]
0.38942254407455557 [[0.52762802]] [[-0.35489398]]
0.3846538080898872 [[0.52220785]] [[-0.35489398]]
0.380960783523163 [[0.5179872]] [[-0.35489398]]
0.37798792680455673 [[0.51457474]] [[-0.35489398]]
0.37552706738648656 [[0.51173981]] [[-0.35489398]]
0.37344648449372475 [[0.50933571]] [[-0.35489398]]
0.3716579668784316 [[0.50726374]] [[-0.35489398]]
0.37009980898983896 [[0.50545456]] [[-0.35489398]]
0.36872732160902205 [[0.50385781]] [[-0.35489398]]
...

0.11701916001626193 [[0.12888109]] [[-0.35489398]]
0.11697235729702807 [[0.12878433]] [[-0.35489398]]
0.11692565075515272 [[0.12868776]] [[-0.35489398]]
0.11687904009086031 [[0.12859136]] [[-0.35489398]]
0.11683252500560432 [[0.12849515]] [[-0.35489398]]
0.11678610520206151 [[0.12839911]] [[-0.35489398]]

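The value of the cost function can be seen to decrease.

Even with fewer iterations and a smaller learning rate, the cost function decreases faster than before, which appears to be because the weight changes from all time steps are summed into each update.

However, the cost settles at a certain level and stops decreasing; to address this, it seems an accelerated gradient descent method for SGD needs to be implemented. Note also that the target here, -0.35489398, is negative while a sigmoid output always lies between 0 and 1, so the prediction can at best approach 0 and the cost, which the printed values show to be half the squared error, cannot fall below about 0.063.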
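A minimal sketch of one common accelerated variant, SGD with momentum, is shown below. It is a generic illustration rather than code from this series: the function name `momentum_step`, the coefficient `beta`, and the toy quadratic cost are all assumptions made for the example.

```python
import numpy as np

def momentum_step(weight, grad, velocity, learning_rate, beta=0.9):
    """One SGD-with-momentum update; returns the new weight and velocity."""
    # The velocity keeps a decaying sum of past gradient steps
    velocity = beta * velocity - learning_rate * grad
    return weight + velocity, velocity

# Toy cost f(w) = 0.5 * (w - 3)^2, whose gradient is (w - 3)
w = np.array([[0.0]])
v = np.zeros_like(w)
for _ in range(300):
    w, v = momentum_step(w, w - 3.0, v, learning_rate=0.1)
print(w)
```

In the class from this post, one velocity per weight (one for input_w and one for before_w) would have to be kept between calls to update_weight.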