[pandas] Handling missing values: isnull, dropna, fillna
Merware
2023. 5. 11. 21:28
- A missing value means that the data point itself is absent.
import pandas as pd
import numpy as np
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))
df
"""
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5
3  NaN  3.0 NaN  4
"""
Checking for missing values
# isnull()
df.isnull().sum()
"""
A    3
B    1
C    4
D    0
dtype: int64
"""
# info
df.info()
"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       1 non-null      float64
 1   B       3 non-null      float64
 2   C       0 non-null      float64
 3   D       4 non-null      int64
dtypes: float64(3), int64(1)
memory usage: 256.0 bytes
"""
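Beyond per-column counts, the same boolean mask can answer a few other questions. The following is a minimal sketch on the toy frame from this post; the variable names (`same_mask`, `missing_ratio`, `missing_per_row`, `cols_with_nan`) are made up for illustration.

```python
import numpy as np
import pandas as pd

# Same toy frame as in the post
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))

# isna() performs the same check as isnull() under a different name
same_mask = df.isna().equals(df.isnull())

# Share of missing values per column (mean of a boolean mask)
missing_ratio = df.isna().mean()

# Number of missing values in each row
missing_per_row = df.isna().sum(axis=1)

# Names of the columns that contain at least one missing value
cols_with_nan = df.columns[df.isna().any()].tolist()

print(same_mask)
print(missing_ratio)
print(missing_per_row)
print(cols_with_nan)
```

The ratio is often easier to read than the raw count when columns have many rows.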
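Dropping missing values
- DataFrame.dropna(): drop every row that contains a missing value
- DataFrame.dropna(axis=1): drop every column that contains a missing value
# Drop every row that contains a missing value
# (column C is entirely NaN, so every row is dropped and only the header remains)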
df.dropna()
"""
A B C D
"""
# Drop every column that contains a missing value
df.dropna(axis=1)
"""
   D
0  0
1  1
2  5
3  4
"""
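With the defaults, dropna() is quite aggressive: on this frame it removes every row, or every column except D. As a rough sketch, the `how`, `subset`, and `thresh` parameters of dropna() give finer control; the result names below are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))

# how='all': drop a row only when every value in it is missing
rows_all = df.dropna(how='all')

# axis=1 with how='all': drop only the columns that are entirely missing
cols_all = df.dropna(axis=1, how='all')

# subset: look for missing values only in the listed columns
rows_subset = df.dropna(subset=['B'])

# thresh: keep rows that have at least this many non-missing values
rows_thresh = df.dropna(thresh=3)

print(rows_all.shape)               # (4, 4)
print(cols_all.columns.tolist())    # ['A', 'B', 'D']
print(rows_subset.index.tolist())   # [0, 1, 3]
print(rows_thresh.index.tolist())   # [1]
```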
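Filling in missing values
- Fill with a specific value: DataFrame.fillna(value)
- Fill with the previous value: DataFrame.fillna(method='ffill')
- Fill with the next value: DataFrame.fillna(method='bfill')
- Fill with a value specified per column: DataFrame.fillna({'column1': value1, 'column2': value2, ...})
Filling with a specific value
# Fill with 0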
df.fillna(0)
"""
     A    B    C  D
0  0.0  2.0  0.0  0
1  3.0  4.0  0.0  1
2  0.0  0.0  0.0  5
3  0.0  3.0  0.0  4
"""
df  # fillna() returned a new DataFrame; the original is unchanged
"""
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5
3  NaN  3.0 NaN  4
"""
# Fill with the mean (each column is filled with its own mean).
# Column C is entirely NaN, so it has no mean and stays NaN.
df.fillna(df.mean())
"""
     A    B   C  D
0  3.0  2.0 NaN  0
1  3.0  4.0 NaN  1
2  3.0  3.0 NaN  5
3  3.0  3.0 NaN  4
"""
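The mean fill above leaves column C untouched, because the mean of an all-NaN column is itself NaN. One possible way to handle that, sketched here under the assumption that 0 is an acceptable fallback value, is to chain a second fillna(). The median is shown as an alternative statistic for a single column. The result names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))

# Column-wise mean fill, as in the post
filled = df.fillna(df.mean())

# Column C still has all four values missing after the mean fill
still_missing = filled.isna().sum()

# Chain a fallback value (0 is an assumption) for columns that have no mean
filled_fallback = df.fillna(df.mean()).fillna(0)

# The median works the same way; here only column B is filled
b_median_filled = df['B'].fillna(df['B'].median())

print(still_missing)
print(filled_fallback)
print(b_median_filled.tolist())   # [2.0, 4.0, 3.0, 3.0]
```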
Filling with the previous value
df.fillna(method='ffill')
"""
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  3.0  4.0 NaN  5
3  3.0  3.0 NaN  4
"""
Filling with the next value
df.fillna(method='bfill')
"""
     A    B   C  D
0  3.0  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  3.0 NaN  5
3  NaN  3.0 NaN  4
"""
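Newer pandas releases emit a deprecation warning for the `method` argument of fillna(), so the same fills can also be written with the dedicated DataFrame.ffill() and DataFrame.bfill() methods. The sketch below also shows chaining both directions and the `limit` parameter; the result names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))

# Forward fill: same result as fillna(method='ffill')
forward = df.ffill()

# Backward fill: same result as fillna(method='bfill')
backward = df.bfill()

# Forward fill leaves a leading NaN (row 0 of A); a backward fill afterwards closes it
both = df.ffill().bfill()

# limit=1: fill at most one consecutive missing value per gap
limited = df.ffill(limit=1)

print(forward)
print(backward)
print(both)
print(limited)
```

Column C stays NaN in every variant, since there is no neighbouring value to copy from.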
Filling with a value specified per column
df
"""
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5
3  NaN  3.0 NaN  4
"""
df.fillna({'A':0,'B':1,'C':2,'D':3})
"""
     A    B    C  D
0  0.0  2.0  2.0  0
1  3.0  4.0  2.0  1
2  0.0  1.0  2.0  5
3  0.0  3.0  2.0  4
"""
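The dictionary does not have to list every column, and its values can be computed rather than hard-coded. As a small sketch (the choice of mean for A and median for B is only an example, and `fill_values` and `partial` are made-up names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))

# Per-column fill values computed from the data itself
fill_values = {'A': df['A'].mean(), 'B': df['B'].median()}

# Columns that are not keys of the dictionary (C and D here) are left as they are
partial = df.fillna(fill_values)

print(partial)
```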
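Statistics on data that contains missing values
- Missing values are treated as if they were not there: they are skipped when a statistic is computed.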
df
"""
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5
3  NaN  3.0 NaN  4
"""
df['A'].mean()
# 3.0
df['B'].mean()
# 3.0
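To make the skipping explicit: the mean is the sum of the non-missing values divided by their count, and passing `skipna=False` turns the skipping off. A short sketch on the same frame, with illustrative variable names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))

# Default behaviour: NaN values are skipped
mean_default = df['A'].mean()

# skipna=False: any NaN in the column makes the result NaN
mean_strict = df['A'].mean(skipna=False)

# count() returns the number of non-missing values
count_a = df['A'].count()

# The default mean of B is sum / count over the non-missing values: (2 + 4 + 3) / 3
manual_mean_b = df['B'].sum() / df['B'].count()

print(mean_default)    # 3.0
print(mean_strict)     # nan
print(count_a)         # 1
print(manual_mean_b)   # 3.0
```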
Handling missing values in the scores data
df = pd.read_csv('data/scores.csv')
df.head()
"""
      name    kor    eng   math
0    Aiden  100.0   90.0   95.0
1  Charles   90.0   80.0   75.0
2   Danial   95.0  100.0  100.0
3     Evan  100.0  100.0  100.0
4    Henry    NaN   35.0   60.0
"""
# Check for missing values
df.isnull().sum()
"""
name    0
kor     3
eng     2
math    1
dtype: int64
"""
df.info()
"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   name    30 non-null     object
 1   kor     27 non-null     float64
 2   eng     28 non-null     float64
 3   math    29 non-null     float64
dtypes: float64(3), object(1)
memory usage: 1.1+ KB
"""
df
"""
        name    kor    eng   math
0      Aiden  100.0   90.0   95.0
1    Charles   90.0   80.0   75.0
2     Danial   95.0  100.0  100.0
3       Evan  100.0  100.0  100.0
4      Henry    NaN   35.0   60.0
5        Ian   90.0  100.0   90.0
6      James   70.0   75.0   65.0
7     Julian   80.0   90.0   55.0
8     Justin   50.0   60.0  100.0
9      Kevin  100.0  100.0   90.0
10       Leo   90.0   95.0   70.0
11    Oliver   70.0   75.0   65.0
12     Peter  100.0   95.0  100.0
13       Amy   90.0   75.0   90.0
14     Chloe   95.0  100.0   95.0
15     Danna  100.0  100.0  100.0
16     Ellen    NaN   60.0    NaN
17      Emma   70.0   65.0   70.0
18  Jennifer   80.0   55.0   80.0
19      Kate   50.0    NaN   50.0
20     Linda  100.0   90.0  100.0
21    Olivia   90.0   70.0   90.0
22      Rose   70.0   65.0   70.0
23     Sofia  100.0  100.0  100.0
24   Tiffany   90.0    NaN   90.0
25   Vanessa   95.0   70.0   95.0
26   Viviana  100.0   80.0  100.0
27    Vikkie    NaN   50.0  100.0
28    Winnie   70.0  100.0   70.0
29      Zuly   80.0   90.0   95.0
"""
# Fill the missing values with 0 (inplace=True modifies df itself)
df.fillna(0, inplace=True)
df
"""
name kor eng math
0 Aiden 100.0 90.0 95.0
1 Charles 90.0 80.0 75.0
2 Danial 95.0 100.0 100.0
3 Evan 100.0 100.0 100.0
4 Henry 0.0 35.0 60.0
5 Ian 90.0 100.0 90.0
6 James 70.0 75.0 65.0
7 Julian 80.0 90.0 55.0
8 Justin 50.0 60.0 100.0
9 Kevin 100.0 100.0 90.0
10 Leo 90.0 95.0 70.0
11 Oliver 70.0 75.0 65.0
12 Peter 100.0 95.0 100.0
13 Amy 90.0 75.0 90.0
14 Chloe 95.0 100.0 95.0
15 Danna 100.0 100.0 100.0
16 Ellen 0.0 60.0 0.0
17 Emma 70.0 65.0 70.0
18 Jennifer 80.0 55.0 80.0
19 Kate 50.0 0.0 50.0
20 Linda 100.0 90.0 100.0
21 Olivia 90.0 70.0 90.0
22 Rose 70.0 65.0 70.0
23 Sofia 100.0 100.0 100.0
24 Tiffany 90.0 0.0 90.0
25 Vanessa 95.0 70.0 95.0
26 Viviana 100.0 80.0 100.0
27 Vikkie 0.0 50.0 100.0
28 Winnie 70.0 100.0 70.0
29 Zuly 80.0 90.0 95.0
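Filling with 0 changes later statistics, because the zeros now count as real scores. The sketch below illustrates this on a four-row excerpt of the rows shown above, typed inline since data/scores.csv is not available here. It also shows a non-inplace fill and a per-column mean fill as one possible alternative. The names `csv_text`, `scores`, `zero_filled`, `numeric_cols`, and `mean_filled` are made up for this example.

```python
import io
import pandas as pd

# Four rows copied from the table above (an excerpt, not the full file)
csv_text = """name,kor,eng,math
Aiden,100,90,95
Henry,,35,60
Ellen,,60,
Kate,50,,50
"""
scores = pd.read_csv(io.StringIO(csv_text))

# Mean of kor while the missing values are skipped: (100 + 50) / 2
mean_before = scores['kor'].mean()

# Without inplace=True, fillna() returns a new frame and leaves scores unchanged
zero_filled = scores.fillna(0)

# The two zeros now count as scores: (100 + 0 + 0 + 50) / 4
mean_after = zero_filled['kor'].mean()

# Alternative: fill each numeric column with its own mean
numeric_cols = ['kor', 'eng', 'math']
mean_filled = scores.copy()
mean_filled[numeric_cols] = scores[numeric_cols].fillna(scores[numeric_cols].mean())

print(mean_before)   # 75.0
print(mean_after)    # 37.5
print(mean_filled)
```

Whether 0 or the column mean is the better choice depends on what a missing score means in the data, for example an absent student versus an unrecorded result.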