
[pandas] Handling missing values: isnull, dropna, fillna

Merware 2023. 5. 11. 21:28
  • A missing value means that the data itself does not exist. pandas displays it as NaN.
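Because NaN never compares equal to anything, including itself, an equality check cannot find missing values. The short sketch below shows why isnull() is used instead. The Series `s` is a made-up example, not data from this post.

```python
import numpy as np
import pandas as pd

# Hypothetical Series for illustration: one real value, one np.nan, one None
s = pd.Series([1.0, np.nan, None])

# NaN is not equal to itself, so "== np.nan" never matches anything
nan_equals_nan = (np.nan == np.nan)
equality_hits = int((s == np.nan).sum())

# isnull() (alias: isna()) detects both np.nan and None
null_mask = s.isnull().tolist()

print(nan_equals_nan)
print(equality_hits)
print(null_mask)
```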
import pandas as pd
import numpy as np
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))
df
"""
	A	B	C	D
0	NaN	2.0	NaN	0
1	3.0	4.0	NaN	1
2	NaN	NaN	NaN	5
3	NaN	3.0	NaN	4
"""

 

Checking for missing values

# isnull() marks every missing cell as True; sum() then counts them per column
df.isnull().sum()

"""
A    3
B    1
C    4
D    0
dtype: int64
"""
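Besides the per-column count, the same boolean mask can answer a few related questions. The sketch below rebuilds the toy frame from this post. The variable names (`missing_ratio`, `missing_per_row`, `total_missing`) are only illustrative.

```python
import numpy as np
import pandas as pd

# Same toy frame as in the post
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))

# Fraction of missing values per column (the mean of a boolean mask)
missing_ratio = df.isnull().mean()

# Number of missing values per row
missing_per_row = df.isnull().sum(axis=1)

# Total number of missing cells in the whole frame
total_missing = int(df.isnull().sum().sum())

print(missing_ratio)
print(missing_per_row)
print(total_missing)
```

A column like C, where the ratio is 1.0, carries no information at all, which is worth knowing before deciding between dropping and filling.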
# info() reports the non-null count and dtype of each column
df.info()

"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       1 non-null      float64
 1   B       3 non-null      float64
 2   C       0 non-null      float64
 3   D       4 non-null      int64  
dtypes: float64(3), int64(1)
memory usage: 256.0 bytes
"""

 

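Dropping missing values

  • DataFrame.dropna(): drops every row that contains at least one missing value
  • DataFrame.dropna(axis=1): drops every column that contains at least one missing value
# Drop every row that contains a missing value
# (every row has a NaN in column C, so the result is empty)
df.dropna()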

"""
A	B	C	D
"""
# Drop every column that contains a missing value
df.dropna(axis=1)

"""
	D
0	0
1	1
2	5
3	4
"""
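With the default settings, dropna() wipes out almost everything in this toy frame. The how, thresh, and subset parameters of dropna() give finer control. The sketch below is one way to try them on the same data. The result variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Same toy frame as in the post
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))

# how='all' drops a row only when every value in it is missing
rows_all = df.dropna(how='all')

# On the column axis, how='all' removes only the all-NaN column C
cols_all = df.dropna(axis=1, how='all')

# thresh=3 keeps rows that have at least 3 non-missing values
at_least_3 = df.dropna(thresh=3)

# subset restricts the check to the listed columns
b_present = df.dropna(subset=['B'])

print(rows_all)
print(cols_all)
print(at_least_3)
print(b_present)
```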

 

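Filling missing values

  • Fill with a specific value: DataFrame.fillna(value)
  • Fill with the previous value: DataFrame.fillna(method='ffill') (deprecated since pandas 2.1; DataFrame.ffill() does the same)
  • Fill with the next value: DataFrame.fillna(method='bfill') (deprecated since pandas 2.1; DataFrame.bfill() does the same)
  • Fill with a value specified per column: DataFrame.fillna({'column1': value1, 'column2': value2, ...})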

Filling with a specific value

# Fill with 0
df.fillna(0)

"""
	A	B	C	D
0	0.0	2.0	0.0	0
1	3.0	4.0	0.0	1
2	0.0	0.0	0.0	5
3	0.0	3.0	0.0	4
"""
# fillna() returns a new DataFrame, so the original df is unchanged
df

"""
	A	B	C	D
0	NaN	2.0	NaN	0
1	3.0	4.0	NaN	1
2	NaN	NaN	NaN	5
3	NaN	3.0	NaN	4
"""
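# Fill with the mean (each column is filled with its own mean)
# Column C stays NaN because the mean of an all-NaN column is itself NaN
df.fillna(df.mean())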

"""
	A	B	C	D
0	3.0	2.0	NaN	0
1	3.0	4.0	NaN	1
2	3.0	3.0	NaN	5
3	3.0	3.0	NaN	4
"""
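Column C is still NaN in the output above, because a column with no values has no mean to fill with. One possible workaround, sketched below, is to drop the all-NaN columns first and then fill what remains. Whether dropping C is acceptable depends on the data, so treat this as an illustration rather than a rule.

```python
import numpy as np
import pandas as pd

# Same toy frame as in the post
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))

# The mean of the all-NaN column C is NaN, so fillna has nothing to put there
col_means = df.mean()

# Drop columns that are entirely missing, then fill the rest with column means
usable = df.dropna(axis=1, how='all')
filled = usable.fillna(usable.mean())

print(col_means)
print(filled)
```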

 

Filling with the previous value

# Forward fill; the older spelling df.fillna(method='ffill') is deprecated since pandas 2.1
df.ffill()

"""
	A	B	C	D
0	NaN	2.0	NaN	0
1	3.0	4.0	NaN	1
2	3.0	4.0	NaN	5
3	3.0	3.0	NaN	4
"""

 

Filling with the next value

# Backward fill; the older spelling df.fillna(method='bfill') is deprecated since pandas 2.1
df.bfill()

"""
	A	B	C	D
0	3.0	2.0	NaN	0
1	3.0	4.0	NaN	1
2	NaN	3.0	NaN	5
3	NaN	3.0	NaN	4
"""
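Forward fill leaves the leading NaN in column A and backward fill leaves the trailing ones, as the two outputs above show. The sketch below chains the two and also tries the limit parameter, which caps how many consecutive NaNs get filled. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Same toy frame as in the post
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))

# Forward fill first, then backward fill whatever is still missing at the top
both = df.ffill().bfill()

# limit=1 fills at most one consecutive NaN after each valid value
limited = df.ffill(limit=1)

print(both)
print(limited)
```

Column C remains NaN in both results, since there is no neighboring value to copy from.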

 

Filling with a value specified per column

df

"""
	A	B	C	D
0	NaN	2.0	NaN	0
1	3.0	4.0	NaN	1
2	NaN	NaN	NaN	5
3	NaN	3.0	NaN	4
"""
df.fillna({'A':0,'B':1,'C':2,'D':3})

"""
	A	B	C	D
0	0.0	2.0	2.0	0
1	3.0	4.0	2.0	1
2	0.0	1.0	2.0	5
3	0.0	3.0	2.0	4
"""

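The values in the dictionary do not have to be constants. As a small sketch, each column can get its own statistic. The choice of mean for A, median for B, and 0 for C is arbitrary and only for illustration.

```python
import numpy as np
import pandas as pd

# Same toy frame as in the post
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))

# Illustrative per-column choices: mean for A, median for B, constant 0 for C
fill_values = {'A': df['A'].mean(), 'B': df['B'].median(), 'C': 0}

# Columns that are not in the dictionary (here D) are left as they are
filled = df.fillna(fill_values)

print(filled)
```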
 

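Statistics on data containing missing values

  • Missing values are treated as if the data were not there: aggregations such as mean() skip them by default.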
df

"""
	A	B	C	D
0	NaN	2.0	NaN	0
1	3.0	4.0	NaN	1
2	NaN	NaN	NaN	5
3	NaN	3.0	NaN	4
"""

df['A'].mean()
# 3.0

df['B'].mean()
# 3.0
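The skipping behavior comes from the skipna parameter, which defaults to True. The sketch below compares the default with skipna=False and with a zero-filled column, to show that "ignore the missing values" and "treat them as 0" give different means.

```python
import numpy as np
import pandas as pd

# Same toy frame as in the post
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))

# Default: NaN is skipped, so only the single valid value in A is averaged
mean_skip = df['A'].mean()

# skipna=False lets NaN propagate into the result
mean_noskip = df['A'].mean(skipna=False)

# count() counts non-missing values only; len() counts every row
valid_count = int(df['A'].count())
row_count = len(df['A'])

# Filling with 0 first changes the answer: 3 / 4 instead of 3 / 1
mean_zero_filled = df['A'].fillna(0).mean()

print(mean_skip, mean_noskip, valid_count, row_count, mean_zero_filled)
```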

 

Handling missing values in the scores data

df = pd.read_csv('data/scores.csv')
df.head()

"""
	name	kor	eng	math
0	Aiden	100.0	90.0	95.0
1	Charles	90.0	80.0	75.0
2	Danial	95.0	100.0	100.0
3	Evan	100.0	100.0	100.0
4	Henry	NaN	35.0	60.0
"""
# Check for missing values
df.isnull().sum()

"""
name    0
kor     3
eng     2
math    1
dtype: int64
"""
df.info()

"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   name    30 non-null     object 
 1   kor     27 non-null     float64
 2   eng     28 non-null     float64
 3   math    29 non-null     float64
dtypes: float64(3), object(1)
memory usage: 1.1+ KB
"""
df

"""
	name	kor	eng	math
0	Aiden	100.0	90.0	95.0
1	Charles	90.0	80.0	75.0
2	Danial	95.0	100.0	100.0
3	Evan	100.0	100.0	100.0
4	Henry	NaN	35.0	60.0
5	Ian	90.0	100.0	90.0
6	James	70.0	75.0	65.0
7	Julian	80.0	90.0	55.0
8	Justin	50.0	60.0	100.0
9	Kevin	100.0	100.0	90.0
10	Leo	90.0	95.0	70.0
11	Oliver	70.0	75.0	65.0
12	Peter	100.0	95.0	100.0
13	Amy	90.0	75.0	90.0
14	Chloe	95.0	100.0	95.0
15	Danna	100.0	100.0	100.0
16	Ellen	NaN	60.0	NaN
17	Emma	70.0	65.0	70.0
18	Jennifer	80.0	55.0	80.0
19	Kate	50.0	NaN	50.0
20	Linda	100.0	90.0	100.0
21	Olivia	90.0	70.0	90.0
22	Rose	70.0	65.0	70.0
23	Sofia	100.0	100.0	100.0
24	Tiffany	90.0	NaN	90.0
25	Vanessa	95.0	70.0	95.0
26	Viviana	100.0	80.0	100.0
27	Vikkie	NaN	50.0	100.0
28	Winnie	70.0	100.0	70.0
29	Zuly	80.0	90.0	95.0
"""
# Fill missing values with 0 (inplace=True modifies df itself)
df.fillna(0, inplace=True)

df

"""
	name	kor	eng	math
0	Aiden	100.0	90.0	95.0
1	Charles	90.0	80.0	75.0
2	Danial	95.0	100.0	100.0
3	Evan	100.0	100.0	100.0
4	Henry	0.0	35.0	60.0
5	Ian	90.0	100.0	90.0
6	James	70.0	75.0	65.0
7	Julian	80.0	90.0	55.0
8	Justin	50.0	60.0	100.0
9	Kevin	100.0	100.0	90.0
10	Leo	90.0	95.0	70.0
11	Oliver	70.0	75.0	65.0
12	Peter	100.0	95.0	100.0
13	Amy	90.0	75.0	90.0
14	Chloe	95.0	100.0	95.0
15	Danna	100.0	100.0	100.0
16	Ellen	0.0	60.0	0.0
17	Emma	70.0	65.0	70.0
18	Jennifer	80.0	55.0	80.0
19	Kate	50.0	0.0	50.0
20	Linda	100.0	90.0	100.0
21	Olivia	90.0	70.0	90.0
22	Rose	70.0	65.0	70.0
23	Sofia	100.0	100.0	100.0
24	Tiffany	90.0	0.0	90.0
25	Vanessa	95.0	70.0	95.0
26	Viviana	100.0	80.0	100.0
27	Vikkie	0.0	50.0	100.0
28	Winnie	70.0	100.0	70.0
29	Zuly	80.0	90.0	95.0
"""
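Filling scores with 0 treats a missing score as an actual zero, which pulls the subject averages down. The sketch below compares zero fill with mean fill. Since data/scores.csv is not available here, it uses a small hand-made frame built from four rows shown above (Aiden, Henry, Ellen, Kate). The `scores` and `subjects` names are made up for this example.

```python
import numpy as np
import pandas as pd

# Hand-made sample using four rows from the scores table in this post
scores = pd.DataFrame({
    'name': ['Aiden', 'Henry', 'Ellen', 'Kate'],
    'kor': [100.0, np.nan, np.nan, 50.0],
    'eng': [90.0, 35.0, 60.0, np.nan],
    'math': [95.0, 60.0, np.nan, 50.0],
})
subjects = ['kor', 'eng', 'math']

# Mean with missing values skipped (the pandas default)
kor_mean_skipped = scores['kor'].mean()

# Zero fill: the two missing kor scores now count as 0
zero_filled = scores[subjects].fillna(0)
kor_mean_zero = zero_filled['kor'].mean()

# Mean fill: each subject's missing scores get that subject's mean
mean_filled = scores[subjects].fillna(scores[subjects].mean())
kor_mean_meanfill = mean_filled['kor'].mean()

print(kor_mean_skipped, kor_mean_zero, kor_mean_meanfill)
```

Which fill is appropriate depends on what a missing score means. If the student did not take the exam and should get 0, zero fill is right. If the score simply was not recorded, zero fill distorts the average.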