Loading and Inspecting the Data

1.1 Downloading the Data

Mount Google Drive in Colab and point to the CSV file stored there.

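# Mount Google Drive so files under MyDrive are visible to the Colab runtime
from google.colab import drive
drive.mount('/content/drive')

# Path to the dataset inside the mounted drive
filename = '/content/drive/MyDrive/fetal_health.csv'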

1.2 Inspecting the Data Structure

Use the DataFrame's head() method to view the first five rows.

import pandas as pd  # needed for pd.read_csv
data = pd.read_csv(filename)
data.head()

Key features

baseline value : Baseline fetal heart rate (FHR)
accelerations : Number of accelerations per second
fetal_movement : Number of fetal movements per second
uterine_contractions : Number of uterine contractions per second
light_decelerations : Number of light decelerations (LDs) per second
severe_decelerations : Number of severe decelerations (SDs) per second
prolongued_decelerations : Number of prolonged decelerations (PDs) per second
abnormal_short_term_variability : Percentage of time with abnormal short-term variability
fetal_health : Fetal health class: 1 - Normal, 2 - Suspect, 3 - Pathological

The info() method prints a concise summary of the data. It is especially useful for checking the total number of rows, each feature's data type, and the number of non-null values.

data.info()
>>>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2126 entries, 0 to 2125
Data columns (total 22 columns):
 #   Column                                                  Non-Null Count  Dtype  
---  ------                                                  --------------  -----  
 0   baseline value                                          2126 non-null   float64
 1   accelerations                                           2126 non-null   float64
 2   fetal_movement                                          2126 non-null   float64
 3   uterine_contractions                                    2126 non-null   float64
 4   light_decelerations                                     2126 non-null   float64
 5   severe_decelerations                                    2126 non-null   float64
 6   prolongued_decelerations                                2126 non-null   float64
 7   abnormal_short_term_variability                         2126 non-null   float64
 8   mean_value_of_short_term_variability                    2126 non-null   float64
 9   percentage_of_time_with_abnormal_long_term_variability  2126 non-null   float64
 10  mean_value_of_long_term_variability                     2126 non-null   float64
 11  histogram_width                                         2126 non-null   float64
 12  histogram_min                                           2126 non-null   float64
 13  histogram_max                                           2126 non-null   float64
 14  histogram_number_of_peaks                               2126 non-null   float64
 15  histogram_number_of_zeroes                              2126 non-null   float64
 16  histogram_mode                                          2126 non-null   float64
 17  histogram_mean                                          2126 non-null   float64
 18  histogram_median                                        2126 non-null   float64
 19  histogram_variance                                      2126 non-null   float64
 20  histogram_tendency                                      2126 non-null   float64
 21  fetal_health                                            2126 non-null   float64
dtypes: float64(22)
memory usage: 365.5 KB

The dataset contains 2,126 samples. Every feature is float64, and every column has 2,126 non-null values, so there are no missing values.
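The same check can be made explicit instead of reading it off the info() output. The sketch below uses a small hypothetical frame (toy, with one deliberately missing value) rather than the real dataset, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "baseline value": [120.0, 132.0, np.nan],
    "accelerations": [0.0, 0.006, 0.003],
})

# Count missing values per column; all zeros would mean nothing is missing
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Single boolean answer: does the frame contain any missing value at all?
has_missing = bool(toy.isnull().values.any())
print(has_missing)
```

On the real data, the equivalent call would be data.isnull().sum(), which should report zero for every column given the info() output above.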

The value_counts() method counts the rows in each category of a categorical column. Applying it to the target, fetal_health, shows how many samples fall into each fetal health class.

data["fetal_health"].value_counts()
>>>
1.0    1655
2.0     295
3.0     176
Name: fetal_health, dtype: int64
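The raw counts suggest the classes are imbalanced. As a quick sketch, the counts printed above can be turned into proportions; the Series here is rebuilt by hand from that output so the example runs without the CSV file.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({1.0: 1655, 2.0: 295, 3.0: 176}, name="fetal_health")

# Share of each class in the whole dataset
proportions = counts / counts.sum()
print(proportions.round(3))
```

On the real DataFrame, data["fetal_health"].value_counts(normalize=True) gives the same proportions directly.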

The describe() method prints summary statistics for the numeric features. Plotting a histogram of each numeric feature is another way to inspect them.

data.describe()
>>>
...

import matplotlib.pyplot as plt
data.hist(bins=50, figsize=(20,15))
plt.show()

The histograms show how the values of each feature are distributed.

1.3 Separating the Target Data

The goal is to predict fetal_health from the other features, so it must be split off from the dataset, producing two objects: train_data and target_data.

target_data = data[['fetal_health']]

train_data = data.drop(['fetal_health'], axis=1, inplace=False)

print(len(data.columns), len(train_data.columns), len(target_data.columns))
>>>
22 21 1

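data[['fetal_health']] extracts a single column from the original data; the double brackets return a DataFrame rather than a Series.
drop removes the given column: axis=1 selects columns, and inplace=False returns a modified copy and leaves data unchanged.
The print call compares the column counts of the original data with those of train_data and target_data.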
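The difference between single and double brackets is easy to verify on a small example. The frame below (toy) is hypothetical and only borrows two column names from the dataset.

```python
import pandas as pd

# Hypothetical two-column frame; values are made up for illustration
toy = pd.DataFrame({
    "histogram_mean": [137.0, 136.0, 135.0],
    "fetal_health": [2.0, 1.0, 1.0],
})

as_series = toy["fetal_health"]     # single brackets -> Series (1-D)
as_frame = toy[["fetal_health"]]    # double brackets -> DataFrame (2-D)

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)

# drop() with inplace=False (the default) returns a new frame and leaves toy intact
features = toy.drop(["fetal_health"], axis=1)
print(list(features.columns), list(toy.columns))
```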

1.4 Examining Correlations

The standard correlation coefficient (Pearson's r) measures the linear correlation between pairs of features.

corr_matrix = train_data.corr()

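The coefficient ranges from -1 to 1. A value near 1 indicates a strong positive correlation (the two values tend to increase together), a value near -1 indicates a strong negative correlation (one tends to decrease as the other increases), and a value near 0 indicates no linear relationship.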
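To make the range concrete, the following sketch computes Pearson's r by hand on hypothetical data and compares it with corr(). The helper pearson_r and the columns x, y_up, and y_down are illustrative names, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y_up rises with x, y_down falls as x rises
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y_up": [2.0, 4.0, 6.0, 8.0, 10.0],
    "y_down": [10.0, 8.0, 6.0, 4.0, 2.0],
})

def pearson_r(a, b):
    """Pearson's r: covariance divided by the product of the standard deviations."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    return float((a_centered * b_centered).sum()
                 / np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum()))

r_up = pearson_r(toy["x"], toy["y_up"])      # perfectly increasing line
r_down = pearson_r(toy["x"], toy["y_down"])  # perfectly decreasing line
print(r_up, r_down)

# pandas' corr() uses the Pearson method by default and should agree
corr_matrix = toy.corr()
print(corr_matrix.loc["x", "y_up"], corr_matrix.loc["x", "y_down"])
```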

Another way to check for correlation is to draw scatter plots between pairs of numeric features.

from pandas.plotting import scatter_matrix

attributes = ["baseline value", "histogram_mode", "histogram_mean", "histogram_median"]
scatter_matrix(train_data[attributes], figsize=(12, 8))
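One possible way to choose which features to plot is to rank the correlation matrix against a single feature and pick the strongest entries. This is an assumption about workflow, not necessarily how the attributes above were selected, and the toy values below are made up.

```python
import pandas as pd

# Hypothetical values; only the column names come from the dataset
toy = pd.DataFrame({
    "baseline value": [120.0, 125.0, 130.0, 135.0, 140.0],
    "histogram_mean": [118.0, 126.0, 129.0, 137.0, 139.0],
    "accelerations": [0.006, 0.001, 0.004, 0.000, 0.003],
})

corr_matrix = toy.corr()

# Correlation of every other feature with "baseline value", strongest first
ranked = (
    corr_matrix["baseline value"]
    .drop("baseline value")          # the self-correlation is always 1
    .sort_values(ascending=False)
)
print(ranked)
```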

2 Splitting Off a Test Set

Split the data into a training set and a test set.

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(train_data, target_data, test_size = 0.2, random_state=42, stratify=target_data)

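train_test_split separates the data into training and test sets. Its arguments, in order:
train_data : feature data
target_data : label data
test_size : fraction of the data held out for the test set (0.2 = 20%)
random_state : random seed, so the split is reproducible
stratify : keeps the class proportions of target_data the same in both sets, which guards against a skewed split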
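The effect of stratify can be checked on a small imbalanced example. The frames features and labels below are hypothetical stand-ins for train_data and target_data, with an 80/20 class split chosen so the expected counts are easy to read.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 80 samples of class 1.0, 20 of class 2.0
features = pd.DataFrame({"feature": range(100)})
labels = pd.DataFrame({"fetal_health": [1.0] * 80 + [2.0] * 20})

x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

# With stratify, the class shares in both splits should match the full data
print(y_train["fetal_health"].value_counts(normalize=True))
print(y_test["fetal_health"].value_counts(normalize=True))
```

Without stratify, a random split of a small or imbalanced dataset can leave a minority class under-represented in the test set, which makes evaluation less reliable.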

3 Feature Scaling

Feature scaling puts the features on a comparable range. The two common methods are min-max scaling and standardization.

Implement a transformation pipeline for the numeric features.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
        ('std_scaler', StandardScaler())
    ])

train_scaler = num_pipeline.fit_transform(x_train)
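When the test set is scaled later, the usual practice is to reuse the pipeline fitted on the training data and call transform only, so that no statistics from the test set leak into preprocessing. A minimal sketch with hypothetical arrays in place of the real x_train and x_test:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for the real training and test features
x_train = np.array([[100.0], [120.0], [140.0], [160.0]])
x_test = np.array([[130.0], [170.0]])

num_pipeline = Pipeline([
    ("std_scaler", StandardScaler()),
])

# Fit on the training data only: learns the training mean and std
train_scaled = num_pipeline.fit_transform(x_train)

# Reuse the learned mean and std on the test data; no refitting
test_scaled = num_pipeline.transform(x_test)
print(test_scaled.ravel())
```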
