본 장에서는 “따라 하며 배우는 데이터 과학” 3장의 “데이터 취득과 데이터 가공: SQL과 dplyr” 내용을 파이썬으로 익혀보도록 하자.

주피터 노트북 버전은 다음 페이지에 있다: http://nbviewer.jupyter.org/github/jaimyoung/ipds-kr/blob/master/notebooks/python-data-processing.ipynb

%matplotlib inline
import pandas as pd
import numpy as np

1. gapminder 자료 읽어들이기

# gapminder 자료를 다운로드하고 판다스 데이터프레임으로 읽어들이자
gapminder = pd.read_csv("data/gapminder.tsv", sep="\t")
# This also works:
# gapminder = pd.read_csv("https://raw.githubusercontent.com/jennybc/gapminder/master/data-raw/07_gap-merged-with-continent.tsv", sep="\t")
gapminder.head()
country continent year lifeExp pop gdpPercap
0 Afghanistan Asia 1952 28.801 8425333 779.445314
1 Afghanistan Asia 1957 30.332 9240934 820.853030
2 Afghanistan Asia 1962 31.997 10267083 853.100710
3 Afghanistan Asia 1967 34.020 11537966 836.197138
4 Afghanistan Asia 1972 36.088 13079460 739.981106
gapminder.tail()
country continent year lifeExp pop gdpPercap
3308 Zimbabwe Africa 1987 62.351 9216418 706.157306
3309 Zimbabwe Africa 1992 60.377 10704340 693.420786
3310 Zimbabwe Africa 1997 46.809 11404948 792.449960
3311 Zimbabwe Africa 2002 39.989 11926563 672.038623
3312 Zimbabwe Africa 2007 43.487 12311143 469.709298
gapminder.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3313 entries, 0 to 3312
Data columns (total 6 columns):
country      3313 non-null object
continent    3313 non-null object
year         3313 non-null int64
lifeExp      3313 non-null float64
pop          3313 non-null int64
gdpPercap    3313 non-null float64
dtypes: float64(2), int64(2), object(2)
memory usage: 155.4+ KB
gapminder.describe()
year lifeExp pop gdpPercap
count 3313.000000 3313.000000 3.313000e+03 3313.000000
mean 1980.293088 65.240457 3.177325e+07 11313.820704
std 16.931879 11.772424 1.045019e+08 11369.008362
min 1950.000000 23.599000 5.941200e+04 241.165877
25% 1967.000000 58.333000 2.680018e+06 2505.291423
50% 1982.000000 69.610000 7.559776e+06 7825.823398
75% 1996.000000 73.657000 1.961054e+07 17355.747100
max 2007.000000 82.670000 1.318683e+09 113523.132900

데이터를 처리하는 핵심 동사

1. 행을 선택하기

# gapminder 데이터에서 한국 데이터, 2007년 데이터, 한국 2007년 데이터를 추출하는 명령은 다음과 같다.
# R dplyr 에서는 filter(gapminder, country=='Korea, Rep.' & year==2007)
gapminder.query("country=='Korea, Rep.' & year==2007")
country continent year lifeExp pop gdpPercap
1651 Korea, Rep. Asia 2007 78.623 49044790 23348.13973

2. 행(관측치)를 정렬하기

#  gapminder 데이터를 year, country 변수순으로 정렬하려면,
# R dplyr 에서는 gapminder %>% arrange(year, country)
gapminder.sort_values(by=["year", "country"]).head()
country continent year lifeExp pop gdpPercap
72 Australia Oceania 1950 69.020 8267337 10031.121380
128 Austria Europe 1950 64.880 6935100 5733.098114
251 Belgium Europe 1950 66.350 8639369 7990.465840
308 Belize Americas 1950 57.673 65797 1731.132728
395 Bulgaria Europe 1950 61.390 7250500 2131.563149

3. 열(변수)를 선택하기

#  gapminder 데이터에서 pop, gdpPercap 변수만 선택.
# R dplyr 에서는 gapminder %>% select(pop, gdpPercap)
gapminder[["pop", "gdpPercap"]].head()
pop gdpPercap
0 8425333 779.445314
1 9240934 820.853030
2 10267083 853.100710
3 11537966 836.197138
4 13079460 739.981106

4. 변수 변환하기

#  gapminder 데이터에서 기존의 변수들을 변환한 결과를 기존 변수나 새 변수에 할당한다. 
# R dplyr 에서는 
# gapminder %>%
#    mutate(total_gdp = pop * gdpPercap,
#    le_gdp_ratio = lifeExp / gdpPercap, lgrk = le_gdp_ratio * 100)
# 1. 파이썬에서는 각 변수 할당에 새로운 assign() 함수를 사용해야 한다.
# 2. x.pop 은 내부의 pop() 함수와 충돌을 일으키므로 x['pop']으로 표현했다.
gapminder.\
    assign(total_gdp = lambda x: (x['pop'] * x['gdpPercap'])).\
    assign(le_gdp_ratio = lambda x: (x['lifeExp'] / x['gdpPercap'])).\
    assign(lgrk = lambda x: x['le_gdp_ratio'] * 100).\
    head()
country continent year lifeExp pop gdpPercap total_gdp le_gdp_ratio lgrk
0 Afghanistan Asia 1952 28.801 8425333 779.445314 6.567086e+09 0.036951 3.695064
1 Afghanistan Asia 1957 30.332 9240934 820.853030 7.585449e+09 0.036952 3.695180
2 Afghanistan Asia 1962 31.997 10267083 853.100710 8.758856e+09 0.037507 3.750671
3 Afghanistan Asia 1967 34.020 11537966 836.197138 9.648014e+09 0.040684 4.068419
4 Afghanistan Asia 1972 36.088 13079460 739.981106 9.678553e+09 0.048769 4.876881

5. 요약 통계량 계산하기

#  gapminder 데이터에서 기존의 변수들을 변환한 결과를 기존 변수나 새 변수에 할당한다. 
# R dplyr 에서는 
# gapminder %>% summarize(n_obs = n( ),
#   n_countries = n_distinct(country),
#   n_years = n_distinct(year),
#   med_gdpc = median(gdpPercap),
#   max_gdppc = max(gdpPercap))
gapminder.aggregate(['mean', 'median'])
year lifeExp pop gdpPercap
mean 1980.293088 65.240457 3.177325e+07 11313.820704
median 1982.000000 69.610000 7.559776e+06 7825.823398

6. 랜덤 샘플을 위한 sample()

np.random.seed(12345)
gapminder.sample(n=10)
country continent year lifeExp pop gdpPercap
808 Denmark Europe 1954 71.350 4406000 10272.200280
2738 Spain Europe 1962 69.690 31158061 5693.843879
3174 United States Americas 1963 70.040 189242000 16646.257900
593 Chile Americas 1957 56.074 7048426 4315.622723
1662 Kuwait Asia 2002 76.904 2111561 35110.105660
3113 Uganda Africa 1977 50.350 11457758 843.733137
314 Belize Americas 1987 71.879 176347 3844.931896
1488 Italy Europe 1964 70.400 51600200 8821.596841
300 Belgium Europe 1999 77.800 10235655 28937.151400
854 Denmark Europe 2000 76.900 5337416 32016.753010
gapminder.sample(frac=.01).shape
(33, 6)

7. 고유한 행을 찾아내는 distinct( )

# R dplyr 는
# gapminder %>% select(country) %>% distinct()
# gapminder %>% select(year) %>% distinct()

gapminder.country.unique(), gapminder.year.unique()
(array(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina',
        'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas',
        'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize',
        'Benin', 'Bhutan', 'Bolivia', 'Bosnia and Herzegovina', 'Botswana',
        'Brazil', 'Brunei', 'Bulgaria', 'Burkina Faso', 'Burundi',
        'Cambodia', 'Cameroon', 'Canada', 'Cape Verde',
        'Central African Republic', 'Chad', 'Chile', 'China', 'Colombia',
        'Comoros', 'Congo, Dem. Rep.', 'Congo, Rep.', 'Costa Rica',
        "Cote d'Ivoire", 'Croatia', 'Cuba', 'Cyprus', 'Czech Republic',
        'Denmark', 'Djibouti', 'Dominican Republic', 'Ecuador', 'Egypt',
        'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia',
        'Ethiopia', 'Fiji', 'Finland', 'France', 'French Guiana',
        'French Polynesia', 'Gabon', 'Gambia', 'Georgia', 'Germany',
        'Ghana', 'Greece', 'Grenada', 'Guadeloupe', 'Guatemala', 'Guinea',
        'Guinea-Bissau', 'Guyana', 'Haiti', 'Honduras', 'Hong Kong, China',
        'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran', 'Iraq',
        'Ireland', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jordan',
        'Kazakhstan', 'Kenya', 'Korea, Dem. Rep.', 'Korea, Rep.', 'Kuwait',
        'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Lithuania',
        'Luxembourg', 'Macao, China', 'Madagascar', 'Malawi', 'Malaysia',
        'Maldives', 'Mali', 'Malta', 'Martinique', 'Mauritania',
        'Mauritius', 'Mexico', 'Micronesia, Fed. Sts.', 'Moldova',
        'Mongolia', 'Montenegro', 'Morocco', 'Mozambique', 'Myanmar',
        'Namibia', 'Nepal', 'Netherlands', 'Netherlands Antilles',
        'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria',
        'Norway', 'Oman', 'Pakistan', 'Panama', 'Papua New Guinea',
        'Paraguay', 'Peru', 'Philippines', 'Poland', 'Portugal',
        'Puerto Rico', 'Qatar', 'Reunion', 'Romania', 'Russia', 'Rwanda',
        'Samoa', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal',
        'Serbia', 'Sierra Leone', 'Singapore', 'Slovak Republic',
        'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'Spain',
        'Sri Lanka', 'Sudan', 'Suriname', 'Swaziland', 'Sweden',
        'Switzerland', 'Syria', 'Taiwan', 'Tajikistan', 'Tanzania',
        'Thailand', 'Timor-Leste', 'Togo', 'Tonga', 'Trinidad and Tobago',
        'Tunisia', 'Turkey', 'Turkmenistan', 'Uganda', 'Ukraine',
        'United Arab Emirates', 'United Kingdom', 'United States',
        'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela', 'Vietnam',
        'West Bank and Gaza', 'Yemen, Rep.', 'Zambia', 'Zimbabwe'], dtype=object),
 array([1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002,
        2007, 1950, 1951, 1953, 1954, 1955, 1956, 1958, 1959, 1960, 1961,
        1963, 1964, 1965, 1966, 1968, 1969, 1970, 1971, 1973, 1974, 1975,
        1976, 1978, 1979, 1980, 1981, 1983, 1984, 1985, 1986, 1988, 1989,
        1990, 1991, 1993, 1994, 1995, 1996, 1998, 1999, 2000, 2001, 2003,
        2004, 2005, 2006]))
gapminder.drop_duplicates(['country', 'year']).head()
country continent year lifeExp pop gdpPercap
0 Afghanistan Asia 1952 28.801 8425333 779.445314
1 Afghanistan Asia 1957 30.332 9240934 820.853030
2 Afghanistan Asia 1962 31.997 10267083 853.100710
3 Afghanistan Asia 1967 34.020 11537966 836.197138
4 Afghanistan Asia 1972 36.088 13079460 739.981106

group_by() 를 이용한 그룹 연산

# R dplyr 는
# gapminder %>%
# filter(year == 2007) %>% 
# group_by(continent) %>% 
# summarize(median(lifeExp))

gapminder.\
    query('year == 2007').\
    groupby('continent').\
    agg({'lifeExp':'median'})
lifeExp
continent
Africa 52.9470
Americas 73.4950
Asia 71.9930
Europe 78.6085
FSU 69.0270
Oceania 71.4540

조인 연산자; inner, left, right, full(outer) join

R dplyr 예는 다음과 같다.

(df1 <- data_frame(x = c(1, 2), y = 2:1))
(df2 <- data_frame(x = c(1, 3), a = 10, b = "a"))
df1 %>% inner_join(df2)
df1 %>% left_join(df2)
df1 %>% right_join(df2)
df1 %>% full_join(df2)

파이썬 판다스에서는 DataFrame.merge 함수로 처리하면 된다.

df1 = pd.DataFrame(data={'x':range(2), 'y':range(2, 0, -1)})
df1
x y
0 0 2
1 1 1
df2 = pd.DataFrame(data={'x':[1,3], 'a':10, 'b':"a"})
df2
a b x
0 10 a 1
1 10 a 3
df1.merge(df2, how="inner")
x y a b
0 1 1 10 a
df1.merge(df2, how="left")
x y a b
0 0 2 NaN NaN
1 1 1 10.0 a
df1.merge(df2, how="inner")
x y a b
0 1 1 10 a
df1.merge(df2, how="right")
x y a b
0 1 1.0 10 a
1 3 NaN 10 a
df1.merge(df2, how="outer")
x y a b
0 0 2.0 NaN NaN
1 1 1.0 10.0 a
2 3 NaN 10.0 a