Diary of Chanjun 데이터 분석가의 다이어리

시계열 데이터 - ARIMA

학습목적

시계열 데이터를 다루는 법과 시계열 예측을 하기 위한 여러가지 모델을 사용해보고 특성을 이해한다.
이를 위해서 데이콘에서 진행 중인 대회인 전력사용량 예측 AI 경진대회의 데이터를 사용하고, 베이스라인 코드를 따라하고 또 나만의 코드로 만들어 결과를 제출하여 순위도 확인해볼 것이다.

이 장에서는 ARIMA 모형만 다루고 EDA를 다루지 않습니다.

시계열 데이터란?

일정 시간 간격으로 배치된 데이터

시계열 분석의 목표

이런 시계열이 어떤 법칙에서 생성되는지 기본적인 질문을 이해하는 것

시계열 예측

주어진 시계열을 보고 수학적인 모델을 만들어서 미래에 일어날 것들을 예측하는 것 (X가 시계열 Y가 예측값)
일반적으로 이런 방법들은 금융시장에서의 주가 예측 등에서 많이 쓰인다.

출처 : https://ko.wikipedia.org/wiki/%EC%8B%9C%EA%B3%84%EC%97%B4


관련 단어 :

Oops

출처 : https://otexts.com/fppkr/stationarity.html


ARIMA(Autoregressvie integrated MovingAverage)

정의 :

import os
import sys
import warnings
from tqdm import tqdm

import itertools
import numpy as np
import pandas as pd

import matplotlib as mpl
import matplotlib.pyplot as plt
import plotnine as p9
import seaborn as sns

import scipy
import stats
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller, kpss

from sklearn.metrics import mean_absolute_error
%matplotlib inline
warnings.filterwarnings("ignore")
mpl.rcParams['axes.unicode_minus'] = False
# fm._rebuild()
plt.rcParams["font.family"] = 'NanumMyeongjo'
plt.rcParams["figure.figsize"] = (10,10)
y = 1
coff = 0.3
const = 0
ar_process = []
for i in range(300):
    error = np.random.randn()
    y = const + y * coff + error
    ar_process.append(y)
print(const / (1 - coff))
plt.plot(ar_process)
0.0





[<matplotlib.lines.Line2D at 0x27c015223a0>]

png

Oops

ma_process = []
error = np.random.randn()
for i in range(300):
    ex_error = error
    error = np.random.randn()
    y = const + coff * ex_error + error
    ma_process.append(y)
plt.plot(ma_process)
[<matplotlib.lines.Line2D at 0x27c015709d0>]

png

Oops

train = pd.read_csv("data/dacon/energy/train.csv", encoding = "cp949")
train.head()
num date_time 전력사용량(kWh) 기온(°C) 풍속(m/s) 습도(%) 강수량(mm) 일조(hr) 비전기냉방설비운영 태양광보유
0 1 2020-06-01 00 8179.056 17.6 2.5 92.0 0.8 0.0 0.0 0.0
1 1 2020-06-01 01 8135.640 17.7 2.9 91.0 0.3 0.0 0.0 0.0
2 1 2020-06-01 02 8107.128 17.5 3.2 91.0 0.0 0.0 0.0 0.0
3 1 2020-06-01 03 8048.808 17.1 3.2 91.0 0.0 0.0 0.0 0.0
4 1 2020-06-01 04 8043.624 17.0 3.3 92.0 0.0 0.0 0.0 0.0
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122400 entries, 0 to 122399
Data columns (total 10 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   num         122400 non-null  int64  
 1   date_time   122400 non-null  object 
 2   전력사용량(kWh)  122400 non-null  float64
 3   기온(°C)      122400 non-null  float64
 4   풍속(m/s)     122400 non-null  float64
 5   습도(%)       122400 non-null  float64
 6   강수량(mm)     122400 non-null  float64
 7   일조(hr)      122400 non-null  float64
 8   비전기냉방설비운영   122400 non-null  float64
 9   태양광보유       122400 non-null  float64
dtypes: float64(8), int64(1), object(1)
memory usage: 9.3+ MB
test = pd.read_csv("data/dacon/energy/test.csv", encoding = "cp949")
test.head()
num date_time 기온(°C) 풍속(m/s) 습도(%) 강수량(mm, 6시간) 일조(hr, 3시간) 비전기냉방설비운영 태양광보유
0 1 2020-08-25 00 27.8 1.5 74.0 0.0 0.0 NaN NaN
1 1 2020-08-25 01 NaN NaN NaN NaN NaN NaN NaN
2 1 2020-08-25 02 NaN NaN NaN NaN NaN NaN NaN
3 1 2020-08-25 03 27.3 1.1 78.0 NaN 0.0 NaN NaN
4 1 2020-08-25 04 NaN NaN NaN NaN NaN NaN NaN
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10080 entries, 0 to 10079
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   num           10080 non-null  int64  
 1   date_time     10080 non-null  object 
 2   기온(°C)        3360 non-null   float64
 3   풍속(m/s)       3360 non-null   float64
 4   습도(%)         3360 non-null   float64
 5   강수량(mm, 6시간)  1680 non-null   float64
 6   일조(hr, 3시간)   3360 non-null   float64
 7   비전기냉방설비운영     2296 non-null   float64
 8   태양광보유         1624 non-null   float64
dtypes: float64(7), int64(1), object(1)
memory usage: 708.9+ KB
print(train.num.nunique())
print(test.num.nunique())
print(pd.concat([train.num.value_counts().sort_index(), test.num.value_counts()], axis = 1).head())
60
60
    num  num
1  2040  168
2  2040  168
3  2040  168
4  2040  168
5  2040  168
train.date_time = pd.to_datetime(train.date_time)
test.date_time = pd.to_datetime(test.date_time)
train.date_time.describe()
count                  122400
unique                   2040
top       2020-08-13 07:00:00
freq                       60
first     2020-06-01 00:00:00
last      2020-08-24 23:00:00
Name: date_time, dtype: object
test.date_time.describe()
count                   10080
unique                    168
top       2020-08-28 18:00:00
freq                       60
first     2020-08-25 00:00:00
last      2020-08-31 23:00:00
Name: date_time, dtype: object
print(2040 * 60)
print(168 * 60)
122400
10080
train.iloc[:,2:].head()
전력사용량(kWh) 기온(°C) 풍속(m/s) 습도(%) 강수량(mm) 일조(hr) 비전기냉방설비운영 태양광보유
0 8179.056 17.6 2.5 92.0 0.8 0.0 0.0 0.0
1 8135.640 17.7 2.9 91.0 0.3 0.0 0.0 0.0
2 8107.128 17.5 3.2 91.0 0.0 0.0 0.0 0.0
3 8048.808 17.1 3.2 91.0 0.0 0.0 0.0 0.0
4 8043.624 17.0 3.3 92.0 0.0 0.0 0.0 0.0
y_col = "전력사용량(kWh)"
idxs = [1, 2, 3, 4]
(
    p9.ggplot() +
    p9.geom_line(data = train[train.num.isin(idxs)].assign(num = lambda x : x.num.astype(str)), mapping = p9.aes(x = "date_time", y = y_col, color = "num", group = "num")) +
    p9.theme_bw() +
    p9.theme(text = p9.element_text(family = "NanumMyeongjo"))
)

png

<ggplot: (170728743361)>

Oops

시계열분해

출처 : https://otexts.com/fppkr/components.html

시계열분해
TEST NAME 판단 요소 판단 기준
adf Trend 추세가 제거되면 정상이라고 판단
kpss Seosonality 계절성이 제거되면 정상이라고 판단

출처 : https://skyeong.net/285
출처 : https://sosoeasy.tistory.com/392

def print_adfuller(inputSeries):
    result = adfuller(inputSeries)
    print('ADF Statistic: %f' % result[0])
    print('p-value: %f' % result[1])
def print_kpss(inputSeries):
    result = kpss(inputSeries)
    print('KPSS Statistic: %f' % result[0])
    print('p-value: %f' % result[1])

num이 1번인 건물의 시계열분해를 해보도록 하겠습니다.

decompose_df = train[train.num == 1][["date_time", y_col]].set_index("date_time")
decompse_result = seasonal_decompose(decompose_df[y_col])
decompse_result.plot()
plt.show()

png

Oops

print_adfuller(decompose_df)
ADF Statistic: -5.900664
p-value: 0.000000
print_kpss(decompose_df)
KPSS Statistic: 1.791702
p-value: 0.010000

num이 1번인 건물의 차분을 1로 하여 시계열분해를 해보도록 하겠습니다.


decompose_df_d1 = decompose_df - decompose_df.shift(1)
decompse_result_d1 = seasonal_decompose(decompose_df_d1[y_col][1:])

decompse_result_d1.plot()
plt.show()

png

Oops

seosonality_chk = decompse_result_d1.seasonal.reset_index().assign(day = lambda x : x.date_time.dt.day, hour = lambda x : x.date_time.dt.hour).groupby(by = "seasonal").hour.unique()
seosonality_chk = seosonality_chk.reset_index()
seosonality_chk.iloc[seosonality_chk.apply(lambda x : x.hour[0], axis = 1).sort_values().index]
seasonal hour
10 -7.484606 [0]
1 -23.279445 [1]
9 -8.214249 [2]
4 -15.593606 [3]
5 -14.263374 [4]
0 -25.089410 [5]
8 -8.289303 [6]
22 28.361108 [7]
18 12.748197 [8]
23 41.420590 [9]
21 22.865965 [10]
17 8.701412 [11]
14 0.666853 [12]
19 13.516894 [13]
20 22.731447 [14]
16 6.116965 [15]
12 0.333501 [16]
6 -10.214499 [17]
3 -17.286678 [18]
15 1.598751 [19]
7 -9.401928 [20]
11 -0.015088 [21]
13 0.566055 [22]
2 -20.495553 [23]
print_adfuller(decompose_df_d1[1:])
ADF Statistic: -8.993809
p-value: 0.000000
print_kpss(decompose_df_d1[1:])
KPSS Statistic: 0.048369
p-value: 0.100000

num이 1번인 건물의 차분을 2로 하여 시계열분해를 해보도록 하겠습니다.

decompose_df_d2 = decompose_df - (2*decompose_df.shift(1)) + decompose_df.shift(2)
decompse_result_d2 = seasonal_decompose(decompose_df_d2[y_col][2:])
decompse_result_d2.plot()
plt.show()

png

Oops

seosonality_chk = decompse_result_d2.seasonal.reset_index().assign(day = lambda x : x.date_time.dt.day, hour = lambda x : x.date_time.dt.hour).groupby(by = "seasonal").hour.unique()
seosonality_chk = seosonality_chk.reset_index()
seosonality_chk.iloc[seosonality_chk.apply(lambda x : x.hour[0], axis = 1).sort_values().index]
seasonal hour
18 13.001128 [0]
3 -15.804657 [1]
19 15.055378 [2]
10 -7.389175 [3]
14 1.320414 [4]
7 -10.835854 [5]
20 16.790289 [6]
23 36.640593 [7]
4 -15.622729 [8]
22 28.662575 [9]
1 -18.564443 [10]
5 -14.174372 [11]
9 -7.681234 [12]
17 12.712712 [13]
15 9.204736 [14]
2 -16.624300 [15]
12 -5.793282 [16]
8 -10.557818 [17]
11 -7.081997 [18]
21 18.875611 [19]
6 -11.010497 [20]
16 9.377021 [21]
13 0.571325 [22]
0 -21.071425 [23]
print_adfuller(decompose_df_d2[2:])
ADF Statistic: -25.306054
p-value: 0.000000
print_kpss(decompose_df_d2[2:])
KPSS Statistic: 0.012631
p-value: 0.100000
AR, MA의 모수 추정

출처 : https://tjansry354.tistory.com/14
출처 : https://kerpect.tistory.com/161
출처 : https://byeongkijeong.github.io/ARIMA-with-Python/

plt.rcParams["figure.figsize"] = (5, 3)

num이 1인 데이터의 plot과 acf, pacf 그래프를 그려보았다.

idx = 1
series = train[train.num == idx][y_col]
series.plot()
plot_acf(series)
plot_pacf(series)
plt.show()

png

png

png

Oops
Oops
Oops


차분을 1차 하여 num이 1인 데이터의 plot과 acf, pacf 그래프를 그려보았다.

series_d1 = series - series.shift(1)
series_d1.plot()
plot_acf(series_d1[1:])
plot_pacf(series_d1[1:])
plt.show()

png

png

png

Oops
Oops
Oops


차분을 2차 하여 num이 1인 데이터의 plot과 acf, pacf 그래프를 그려보았다.

series_d2 = series - (2*series.shift(1)) + series.shift(2)
series_d2.plot()
plot_acf(series_d2[2:])
plot_pacf(series_d2[2:])
plt.show()

png

png

png

Oops
Oops
Oops

cut = int(len(series) * 0.8)
train = series[:cut]
val = series[cut:]
pdq_list = list(itertools.product(range(3), range(3), range(3)))
pd_lists = list()
for pdq in tqdm(pdq_list[1:]) :
    arima = ARIMA(train, order = pdq)
    arima = arima.fit()
    predict = arima.predict(start = 1, end = len(val))
    aicc = arima.aicc
    aic = arima.aic
    bic = arima.bic
    mae = mean_absolute_error(val, predict)
    pd_lists = pd_lists + [[pdq, aicc, aic, bic, mae]]
100%|███████████████████████████████████████████████████| 26/26 [00:08<00:00,  3.11it/s]
arima_metric = pd.DataFrame(pd_lists, columns = ["pdq", "aicc", "aic", "bic", "mae"])
arima_metric.head()
pdq aicc aic bic mae
0 (0, 0, 1) 18897.962018 18897.947276 18914.139960 149.379200
1 (0, 0, 2) 18062.369967 18062.345382 18083.935628 156.448449
2 (0, 1, 0) 16422.809294 16422.806839 16428.203787 173.148794
3 (0, 1, 1) 16382.330861 16382.323490 16393.117387 173.129768
4 (0, 1, 2) 16364.348722 16364.333971 16380.524817 173.133983
sns.heatmap(arima_metric.iloc[:, 1:].corr(), annot = True, fmt = ".2f", cmap = "Blues")
<AxesSubplot:>

png

Oops


print(arima_metric[arima_metric["aicc"] == arima_metric["aicc"].max()])
print(arima_metric[arima_metric["aㅅicc"] == arima_metric["aicc"].min()])
         pdq          aicc           aic          bic       mae
0  (0, 0, 1)  18897.962018  18897.947276  18914.13996  149.3792
          pdq         aicc           aic           bic         mae
19  (2, 0, 2)  16344.79814  16344.746448  16377.131817  170.946399
print(arima_metric[arima_metric["aic"] == arima_metric["aic"].max()])
print(arima_metric[arima_metric["aic"] == arima_metric["aic"].min()])
         pdq          aicc           aic          bic       mae
0  (0, 0, 1)  18897.962018  18897.947276  18914.13996  149.3792
          pdq         aicc           aic           bic         mae
19  (2, 0, 2)  16344.79814  16344.746448  16377.131817  170.946399
print(arima_metric[arima_metric["bic"] == arima_metric["bic"].max()])
print(arima_metric[arima_metric["bic"] == arima_metric["bic"].min()])
         pdq          aicc           aic          bic       mae
0  (0, 0, 1)  18897.962018  18897.947276  18914.13996  149.3792
          pdq         aicc           aic           bic         mae
19  (2, 0, 2)  16344.79814  16344.746448  16377.131817  170.946399
print(arima_metric[arima_metric["mae"] == arima_metric["mae"].max()])
print(arima_metric[arima_metric["mae"] == arima_metric["mae"].min()])
         pdq          aicc           aic           bic         mae
5  (0, 2, 0)  17802.388635  17802.386178  17807.782513  187.044582
         pdq          aicc           aic          bic       mae
0  (0, 0, 1)  18897.962018  18897.947276  18914.13996  149.3792

arima = ARIMA(train, order = (2, 0, 2))
arima = arima.fit()
predict = arima.predict(start = 1, end = len(val))

predict.index = val.index

(
    p9.ggplot() +
    p9.geom_line(data = val.reset_index(), mapping = p9.aes(x = "index", y = y_col), group = 1, color = "green") +
    p9.geom_line(data = predict.reset_index(), mapping = p9.aes(x = "index", y = "predicted_mean"), group = 1, color = "red") +
    p9.theme(text = p9.element_text(family = "NanumMyeongjo"))
)

png

<ggplot: (170731823863)>

Oops


arima = ARIMA(train, order = (0, 0, 1))
arima = arima.fit()
predict = arima.predict(start = 1, end = len(val))

predict.index = val.index

(
    p9.ggplot() +
    p9.geom_line(data = val.reset_index(), mapping = p9.aes(x = "index", y = y_col), group = 1, color = "green") +
    p9.geom_line(data = predict.reset_index(), mapping = p9.aes(x = "index", y = "predicted_mean"), group = 1, color = "red") +
    p9.theme(text = p9.element_text(family = "NanumMyeongjo"))
)

png

<ggplot: (170730380746)>

Oops


arima = ARIMA(train, order = (0, 2, 0))
arima = arima.fit()
predict = arima.predict(start = 1, end = len(val))

predict.index = val.index

(
    p9.ggplot() +
    p9.geom_line(data = val.reset_index(), mapping = p9.aes(x = "index", y = y_col), group = 1, color = "green") +
    p9.geom_line(data = predict.reset_index(), mapping = p9.aes(x = "index", y = "predicted_mean"), group = 1, color = "red") +
    p9.theme(text = p9.element_text(family = "NanumMyeongjo"))
)

png

<ggplot: (170728889266)>

Oops



code : https://github.com/Chanjun-kim/Chanjun-kim.github.io/blob/main/_posts/2021-06-08-TimeSeries1.md

참고 자료 :