sklearn : _preprocess_data (preprocessing and mean removal)

def _preprocess_data(
    X,
    y,
    *,
    fit_intercept,
    copy=True,
    copy_y=True,
    sample_weight=None,
    check_input=True,
):
    """Common data preprocessing for fitting linear models.

    This helper is in charge of the following steps:

    - Ensure that `sample_weight` is an array or `None`.
    - If `check_input=True`, perform standard input validation of `X`, `y`.
    - Perform copies if requested to avoid side-effects in case of inplace
      modifications of the input.

    Then, if `fit_intercept=True` this preprocessing centers both `X` and `y` as
    follows:
        - if `X` is dense, center the data and
        store the mean vector in `X_offset`.
        - if `X` is sparse, store the mean in `X_offset`
        without centering `X`. The centering is expected to be handled by the
        linear solver where appropriate.
        - in either case, always center `y` and store the mean in `y_offset`.
        - both `X_offset` and `y_offset` are always weighted by `sample_weight`
          if not set to `None`.

    If `fit_intercept=False`, no centering is performed and `X_offset`, `y_offset`
    are set to zero.

    Returns
    -------
    X_out : {ndarray, sparse matrix} of shape (n_samples, n_features)
        If copy=True a copy of the input X is triggered, otherwise operations are
        inplace.
        If input X is dense, then X_out is centered.
    y_out : {ndarray, sparse matrix} of shape (n_samples,) or (n_samples, n_targets)
        Centered version of y. Possibly performed inplace on input y depending
        on the copy_y parameter.
    X_offset : ndarray of shape (n_features,)
        The mean per column of input X.
    y_offset : float or ndarray of shape (n_features,)
    X_scale : ndarray of shape (n_features,)
        Always an array of ones. TODO: refactor the code base to make it
        possible to remove this unused variable.
    """
    xp, _, device_ = get_namespace_and_device(X, y, sample_weight)
    n_samples, n_features = X.shape
    X_is_sparse = sp.issparse(X)

    if isinstance(sample_weight, numbers.Number):
        sample_weight = None
    if sample_weight is not None:
        sample_weight = xp.asarray(sample_weight)

    if check_input:
        X = check_array(
            X, copy=copy, accept_sparse=["csr", "csc"], dtype=supported_float_dtypes(xp)
        )
        y = check_array(y, dtype=X.dtype, copy=copy_y, ensure_2d=False)
    else:
        y = xp.astype(y, X.dtype, copy=copy_y)
        if copy:
            if X_is_sparse:
                X = X.copy()
            else:
                X = _asarray_with_order(X, order="K", copy=True, xp=xp)

    dtype_ = X.dtype

    if fit_intercept:
        if X_is_sparse:
            X_offset, X_var = mean_variance_axis(X, axis=0, weights=sample_weight)
        else:
            X_offset = _average(X, axis=0, weights=sample_weight, xp=xp)

            X_offset = xp.astype(X_offset, X.dtype, copy=False)
            X -= X_offset

        y_offset = _average(y, axis=0, weights=sample_weight, xp=xp)
        y -= y_offset
    else:
        X_offset = xp.zeros(n_features, dtype=X.dtype, device=device_)
        if y.ndim == 1:
            y_offset = xp.asarray(0.0, dtype=dtype_, device=device_)
        else:
            y_offset = xp.zeros(y.shape[1], dtype=dtype_, device=device_)

    # XXX: X_scale is no longer needed. It is an historic artifact from the
    # time where linear model exposed the normalize parameter.
    X_scale = xp.ones(n_features, dtype=X.dtype, device=device_)
    return X, y, X_offset, y_offset, X_scale

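_preprocess_data is used to preprocess the input data X and the target values y before a linear model is fitted.

Its main jobs are mean removal (centering), applying sample weights, copying data, and validating the input.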

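Parameters

X : input data, an ndarray or sparse matrix of shape (n_samples, n_features)
y : target values, an ndarray of shape (n_samples,) or (n_samples, n_targets)
fit_intercept (bool) : if True, the mean is subtracted from X (dense X only) and from y, so that the intercept can be recovered after fitting.
copy : if True, work on a copy of X instead of modifying it in place
copy_y : the same for y
sample_weight : per-sample weights; when given, they are used to compute the weighted means that are subtracted.
check_input : if True, run input validation on X and y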

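Return values

X_out : the preprocessed input data; a copy if copy=True, and centered if X is dense and fit_intercept=True
y_out : the same for y (centered when fit_intercept=True, a copy if copy_y=True)
X_offset : the mean of each feature (column) of X
y_offset : the mean of y (one value per target)
X_scale : always an array of ones, kept only for historical reasons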

xp, _, device_ = get_namespace_and_device(X, y, sample_weight)
- Returns the namespace and device information for the inputs X, y, and sample_weight.
- xp : the namespace used to operate on the arrays, i.e. an array library such as numpy or cupy.
- device_ : the device on which the data lives.
  - Together they tell the function which array library to use and whether the data is processed on the CPU or the GPU.
n_samples, n_features = X.shape : reads the shape of the input data X
X_is_sparse = sp.issparse(X) : checks whether X is a sparse matrix
if isinstance(sample_weight, numbers.Number): sample_weight = None
- Checks whether sample_weight is a single number; a single number means every sample gets the same weight.
- Nothing has to be weighted differently per sample in that case, so sample_weight is set to None.
if sample_weight is not None: sample_weight = xp.asarray(sample_weight)
- If it is not None (i.e. it was given as an array-like), it is converted with xp.asarray
- into an array of the selected namespace (numpy, cupy, ...).
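
The sample_weight handling above can be sketched in isolation as follows. This is only an illustration: normalize_sample_weight is a hypothetical helper name, not part of scikit-learn, and NumPy is assumed as the namespace.

```python
import numbers

import numpy as np


def normalize_sample_weight(sample_weight, xp=np):
    """Hypothetical helper mirroring the sample_weight lines of _preprocess_data."""
    # A single number weights every sample equally, so it is treated as "no weights".
    if isinstance(sample_weight, numbers.Number):
        return None
    # Anything else that is not None is converted to an array of the namespace.
    if sample_weight is not None:
        return xp.asarray(sample_weight)
    return None


print(normalize_sample_weight(2.0))        # None
print(normalize_sample_weight([1, 2, 3]))  # [1 2 3]
```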

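Selecting the namespace and device

This is the step that determines which library and which hardware are used to process the data.

It directly affects computation speed and memory usage.

Namespace

The namespace is the module of a particular array library used by the code. Selecting a namespace means deciding which library carries out the array operations.

numpy : the general-purpose library for CPU-based computation
cupy : a NumPy-compatible library that runs computations on the GPU

Device

The physical hardware on which the computation runs

CPU (Central Processing Unit)
- Suited to sequential execution
- Uses one or a small number of cores and handles complex control flow well
GPU (Graphics Processing Unit)
- A device designed for parallel processing
- Uses thousands of small cores to run simple operations massively in parallel

The appropriate namespace and device are inferred from the input arrays, and all subsequent operations are carried out with them.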
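
The idea of writing code against a namespace xp instead of a fixed library can be sketched like this. column_means is a hypothetical function for illustration, and only NumPy is exercised here; passing cupy as xp should work the same way, assuming cupy is installed and a GPU is available.

```python
import numpy as np


def column_means(X, xp):
    """Hypothetical helper: per-column means using whichever namespace is passed in."""
    X = xp.asarray(X)
    return xp.mean(X, axis=0)


X = [[1.0, 10.0], [3.0, 30.0]]
means = column_means(X, xp=np)  # runs on the CPU with NumPy
print(means)  # [ 2. 20.]
```

The function body never mentions numpy directly, which is the same pattern _preprocess_data follows with xp.asarray, xp.zeros, and xp.ones.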

    if check_input:
        X = check_array(
            X, copy=copy, accept_sparse=["csr", "csc"], dtype=supported_float_dtypes(xp)
        )
        y = check_array(y, dtype=X.dtype, copy=copy_y, ensure_2d=False)
    else:
        y = xp.astype(y, X.dtype, copy=copy_y)
        if copy:
            if X_is_sparse:
                X = X.copy()
            else:
                X = _asarray_with_order(X, order="K", copy=True, xp=xp)

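This block validates the input data X and the target values y, and copies or converts the data where needed.

if check_input:
- If True, X and y are validated.
- The check confirms that X and y have a format, shape, and data type the model can work with.
X = check_array
- Validates X
- For sparse matrices, only specific formats (csr, csc) are accepted
- Sets the data type to a supported floating-point type
y = check_array
- Validates y
- Makes the data type match that of X
- Creates a copy if copy_y=True
- y is usually a 1-D array, so ensure_2d=False is passed
else: (validation is skipped)
- y is only cast to the dtype of X, and X is copied when copy=True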
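
Why the copy flags matter can be seen in a small NumPy sketch. center_columns is a hypothetical helper, not the scikit-learn code; it only imitates the in-place subtraction `X -= X_offset` that happens later in the function.

```python
import numpy as np


def center_columns(X, copy=True):
    """Hypothetical helper: subtract column means, optionally in place."""
    if copy:
        X = X.copy()
    X -= X.mean(axis=0)  # in-place subtraction, like `X -= X_offset`
    return X


X = np.array([[1.0, 2.0], [3.0, 6.0]])

X_safe = center_columns(X, copy=True)
print(X[0])  # [1. 2.]  -> the caller's array is untouched

X_inplace = center_columns(X, copy=False)
print(X[0])  # [-1. -2.]  -> the caller's array was modified
```

With copy=False the work is done without extra memory, but the caller must accept that the original array is overwritten.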

    dtype_ = X.dtype

    if fit_intercept:
        if X_is_sparse:
            X_offset, X_var = mean_variance_axis(X, axis=0, weights=sample_weight)
        else:
            X_offset = _average(X, axis=0, weights=sample_weight, xp=xp)

            X_offset = xp.astype(X_offset, X.dtype, copy=False)
            X -= X_offset

        y_offset = _average(y, axis=0, weights=sample_weight, xp=xp)
        y -= y_offset
    else:
        X_offset = xp.zeros(n_features, dtype=X.dtype, device=device_)
        if y.ndim == 1:
            y_offset = xp.asarray(0.0, dtype=dtype_, device=device_)
        else:
            y_offset = xp.zeros(y.shape[1], dtype=dtype_, device=device_)

    # XXX: X_scale is no longer needed. It is an historic artifact from the
    # time where linear model exposed the normalize parameter.
    X_scale = xp.ones(n_features, dtype=X.dtype, device=device_)
    return X, y, X_offset, y_offset, X_scale

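dtype_ = X.dtype
- Reads and stores the data type of X
Depending on fit_intercept, the means are computed and removed
- fit_intercept=True and dense X : the (weighted) column means are stored in X_offset and subtracted from X
- fit_intercept=True and sparse X : the means are only computed and stored in X_offset; X itself is not centered
- in both cases the (weighted) mean of y is stored in y_offset and subtracted from y
- fit_intercept=False : X_offset and y_offset are set to zeros
The preprocessing results are returned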
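
The dense, weighted case can be reproduced with plain NumPy. This is a minimal sketch with made-up numbers; np.average stands in for scikit-learn's internal _average helper, which is an assumption about equivalent behaviour for NumPy inputs.

```python
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 12.0]])
y = np.array([1.0, 2.0, 6.0])
w = np.array([1.0, 1.0, 2.0])  # the last sample counts twice

# Weighted means: sum(w_i * x_i) / sum(w_i), computed per column.
X_offset = np.average(X, axis=0, weights=w)
y_offset = np.average(y, axis=0, weights=w)

Xc = X - X_offset
yc = y - y_offset

print(X_offset)  # [3.5 7.5]
print(y_offset)  # 3.75
```

After centering, the weighted mean of every column of Xc (and of yc) is zero, which is what lets the solver fit the coefficients without an explicit intercept column.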

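Why the mean is removed

Centering matters for linear models that include an intercept, and is used for the following reasons.

Stability of model fitting
- Shifting every feature to mean 0 improves numerical stability during fitting, so the weight of each feature can be estimated more accurately.
Simpler interpretation
- The intercept becomes more meaningful. On centered data it is the prediction at the average feature values (the mean of y), rather than the prediction at the point where every raw feature equals 0, which may lie far outside the data.
Less correlation with the intercept
- Centering does not change the correlation between the features themselves, but it removes the dependence between each feature and the constant (intercept) term. This makes the weights easier to interpret and reduces the multicollinearity that involves the intercept.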
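
The link between centering and the intercept can be checked numerically. The sketch below uses synthetic data and np.linalg.lstsq rather than any scikit-learn estimator: it fits the coefficients on centered data, recovers the intercept as y_offset - X_offset @ coef, and compares the result with a fit that uses an explicit column of ones.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=5.0, size=(200, 2))
y = 3.0 + X @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=200)

# Fit on centered data: no intercept column is needed.
X_offset, y_offset = X.mean(axis=0), y.mean()
coef, *_ = np.linalg.lstsq(X - X_offset, y - y_offset, rcond=None)

# Recover the intercept in the original (uncentered) coordinates.
intercept = y_offset - X_offset @ coef

# Reference fit with an explicit column of ones for the intercept.
A = np.column_stack([np.ones(len(X)), X])
ref, *_ = np.linalg.lstsq(A, y, rcond=None)

print(np.allclose(coef, ref[1:]), np.isclose(intercept, ref[0]))
```

Both routes give the same model; centering simply moves the intercept out of the least-squares problem.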

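Multicollinearity

This is the problem that arises when two or more independent variables are so strongly correlated that they can be predicted from one another.

When one independent variable can be predicted well by a linear combination of the others, the results of a regression model that includes it become unstable or hard to interpret.

It makes it difficult to estimate the individual effect of each variable accurately.

It complicates the interpretation of the coefficients.

It can also degrade the model's predictive performance.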
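
One common way to see multicollinearity is the condition number of the design matrix. The sketch below uses synthetic data (all variable names are illustrative) to compare two independent features with two nearly identical ones, and also checks that shifting the features leaves their Pearson correlation unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2_independent = rng.normal(size=500)
x2_collinear = x1 + rng.normal(scale=1e-3, size=500)  # almost a copy of x1

X_ok = np.column_stack([x1, x2_independent])
X_bad = np.column_stack([x1, x2_collinear])

# A large condition number means small changes in the data can cause
# large changes in the fitted coefficients.
cond_ok = np.linalg.cond(X_ok)
cond_bad = np.linalg.cond(X_bad)
print(cond_ok < cond_bad)  # True

# Pearson correlation between features is invariant to adding a constant.
corr_shifted = np.corrcoef(X_bad + 100.0, rowvar=False)
corr_centered = np.corrcoef(X_bad, rowvar=False)
print(np.allclose(corr_shifted, corr_centered))  # True
```

This kind of feature-to-feature collinearity is not fixed by centering; it usually calls for dropping or combining variables, or for regularization.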
