Chapter 2: Essential DataFrame Operations
Pass a list of column names to select multiple columns from a DataFrame:
>>> import pandas as pd
>>> import numpy as np
>>> movies = pd.read_csv("data/movie.csv")
>>> movie_actor_director = movies[
... [
... "actor_1_name",
... "actor_2_name",
... "actor_3_name",
... "director_name",
... ]
... ]
>>> movie_actor_director.head()
actor_1_name actor_2_name actor_3_name director_name
0 CCH Pounder Joel Dav... Wes Studi James Ca...
1 Johnny Depp Orlando ... Jack Dav... Gore Ver...
2 Christop... Rory Kin... Stephani... Sam Mendes
3 Tom Hardy Christia... Joseph G... Christop...
4 Doug Walker Rob Walker NaN Doug Walker
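When selecting a single column, indexing with a one-element list and indexing with a bare column name return different types.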
>>> type(movies[["director_name"]])
<class 'pandas.core.frame.DataFrame'> # a DataFrame
>>> type(movies["director_name"])
<class 'pandas.core.series.Series'> # a Series
You can also select columns with .loc; the same list-versus-name distinction applies.
>>> type(movies.loc[:, ["director_name"]])
<class 'pandas.core.frame.DataFrame'>
>>> type(movies.loc[:, "director_name"])
<class 'pandas.core.series.Series'>
Storing the column names in a list beforehand makes the code easier to read.
>>> cols = [
... "actor_1_name",
... "actor_2_name",
... "actor_3_name",
... "director_name",
... ]
>>> movie_actor_director = movies[cols]
Omitting the list raises a KeyError, because pandas treats the comma-separated names as a single tuple key.
>>> movies[
... "actor_1_name",
... "actor_2_name",
... "actor_3_name",
... "director_name",
... ]
Traceback (most recent call last):
...
KeyError: ('actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name')
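To see why the error occurs, note that Python turns `df[a, b]` into `df[(a, b)]` before pandas ever sees it. The sketch below uses a hypothetical one-row frame (`toy` is illustrative only) to catch the error and then perform the intended selection with a list.

```python
import pandas as pd

# Hypothetical toy data; column names borrowed from the movie dataset
toy = pd.DataFrame(
    {
        "actor_1_name": ["CCH Pounder"],
        "director_name": ["James Cameron"],
    }
)

raised = False
try:
    # Equivalent to toy[("actor_1_name", "director_name")]:
    # pandas looks for ONE column labelled by that tuple.
    toy["actor_1_name", "director_name"]
except KeyError as err:
    raised = True
    print("KeyError:", err)

# Wrapping the names in a list asks for two separate columns instead
selected = toy[["actor_1_name", "director_name"]]
print(list(selected.columns))
```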
Shorten the column names, then count how many columns there are of each data type:
>>> movies = pd.read_csv("data/movie.csv")
>>> def shorten(col):
... return (
... str(col)
... .replace("facebook_likes", "fb")
... .replace("_for_reviews", "")
... )
>>> movies = movies.rename(columns=shorten)
>>> movies.dtypes.value_counts()
float64 13
int64 3
object 12
dtype: int64
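The same rename-then-count pattern can be tried on a small frame. This is a sketch on hypothetical data (the `toy` frame and its values are made up); it relies only on `DataFrame.rename` accepting a function for `columns` and on `dtypes.value_counts()`.

```python
import pandas as pd

# Hypothetical toy data using a few of the original long column names
toy = pd.DataFrame(
    {
        "director_facebook_likes": [0.0, 563.0],
        "num_critic_for_reviews": [723.0, 302.0],
        "movie_facebook_likes": [33000, 0],
        "color": ["Color", "Color"],
    }
)


def shorten(col):
    """Shorten one column label; str() guards against non-string labels."""
    return str(col).replace("facebook_likes", "fb").replace("_for_reviews", "")


# rename() calls shorten once per column label and returns a new DataFrame
renamed = toy.rename(columns=shorten)
print(list(renamed.columns))
# ['director_fb', 'num_critic', 'movie_fb', 'color']

# One row per dtype, with the number of columns of that dtype
dtype_counts = renamed.dtypes.value_counts()
print(dtype_counts)
```

Because `rename` returns a new object, `toy` keeps its original long names unless the result is assigned back.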
Use the .select_dtypes method to select only the integer columns:
>>> movies.select_dtypes(include="int").head()
num_voted_users cast_total_fb movie_fb
0 886204 4834 33000
1 471220 48350 0
2 275868 11700 85000
3 1144337 106759 164000
4 8 143 0
Select all numeric columns:
>>> movies.select_dtypes(include="number").head()
num_critic duration ... aspect_ratio movie_fb
0 723.0 178.0 ... 1.78 33000
1 302.0 169.0 ... 2.35 0
2 602.0 148.0 ... 2.35 85000
3 813.0 164.0 ... 2.35 164000
4 NaN NaN ... NaN 0
Select the integer and string (object) columns:
>>> movies.select_dtypes(include=["int", "object"]).head()
color director_name ... content_rating movie_fb
0 Color James Cameron ... PG-13 33000
1 Color Gore Verbinski ... PG-13 0
2 Color Sam Mendes ... PG-13 85000
3 Color Christopher Nolan ... PG-13 164000
4 NaN Doug Walker ... NaN 0
Select all columns that are not floats:
>>> movies.select_dtypes(exclude="float").head()
color director_name ... content_rating movie_fb
0 Color James Ca... ... PG-13 33000
1 Color Gore Ver... ... PG-13 0
2 Color Sam Mendes ... PG-13 85000
3 Color Christop... ... PG-13 164000
4 NaN Doug Walker ... NaN 0
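The `include` and `exclude` selections above can be checked on a frame small enough to read at a glance. The sketch below uses hypothetical data (`toy` is made up); it spells the integer width out as "int64", an assumption made here so the result does not depend on how a bare "int" is resolved on a given platform or pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one string, one float, one 64-bit integer column
toy = pd.DataFrame(
    {
        "color": ["Color", None],
        "duration": [178.0, np.nan],
        "num_voted_users": np.array([886204, 8], dtype="int64"),
    }
)

numeric = toy.select_dtypes(include="number")  # floats and integers
non_float = toy.select_dtypes(exclude="float")  # everything but floats
ints = toy.select_dtypes(include="int64")      # explicit integer width

print(list(numeric.columns))    # ['duration', 'num_voted_users']
print(list(non_float.columns))  # ['color', 'num_voted_users']
print(list(ints.columns))       # ['num_voted_users']
```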
Use the .filter method to select every column whose name contains fb:
>>> movies.filter(like="fb").head()
director_fb actor_3_fb ... actor_2_fb movie_fb
0 0.0 855.0 ... 936.0 33000
1 563.0 1000.0 ... 5000.0 0
2 0.0 161.0 ... 393.0 85000
3 22000.0 23000.0 ... 23000.0 164000
4 131.0 NaN ... 12.0 0
The items parameter takes a list of exact column names to select:
>>> cols = [
... "actor_1_name",
... "actor_2_name",
... "actor_3_name",
... "director_name",
... ]
>>> movies.filter(items=cols).head()
actor_1_name ... director_name
0 CCH Pounder ... James Cameron
1 Johnny Depp ... Gore Verbinski
2 Christoph Waltz ... Sam Mendes
3 Tom Hardy ... Christopher Nolan
4 Doug Walker ... Doug Walker
The regex parameter matches column names against a regular expression; the code below selects every column whose name contains a digit:
>>> movies.filter(regex=r"\d").head()
actor_3_fb actor_2_name ... actor_3_name actor_2_fb
0 855.0 Joel Dav... ... Wes Studi 936.0
1 1000.0 Orlando ... ... Jack Dav... 5000.0
2 161.0 Rory Kin... ... Stephani... 393.0
3 23000.0 Christia... ... Joseph G... 23000.0
4 NaN Rob Walker ... NaN 12.0
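The three `.filter` parameters can be compared side by side on a small frame. This is a sketch on hypothetical data (`toy` and the label "not_a_column" are made up); it also illustrates one difference from list indexing: `items` quietly skips names that do not exist instead of raising a KeyError.

```python
import pandas as pd

# Hypothetical toy data with already-shortened column names
toy = pd.DataFrame(
    {
        "director_name": ["James Cameron"],
        "director_fb": [0.0],
        "actor_1_name": ["CCH Pounder"],
        "actor_1_fb": [1000.0],
        "movie_fb": [33000],
    }
)

by_like = toy.filter(like="fb")    # substring match on the label
by_regex = toy.filter(regex=r"\d")  # labels containing a digit
# items keeps exact matches and ignores labels that are absent
by_items = toy.filter(items=["director_name", "not_a_column"])

print(list(by_like.columns))   # ['director_fb', 'actor_1_fb', 'movie_fb']
print(list(by_regex.columns))  # ['actor_1_name', 'actor_1_fb']
print(list(by_items.columns))  # ['director_name']
```

Only one of `items`, `like`, and `regex` can be passed in a single call.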
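Principles for ordering columns, as followed in the example below: classify each column as categorical or continuous, group related columns together, and place the categorical groups before the continuous ones.
Here is an example. First read the data and shorten the column names: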
>>> movies = pd.read_csv("data/movie.csv")
>>> def shorten(col):
... return col.replace("facebook_likes", "fb").replace(
... "_for_reviews", ""
... )
>>> movies = movies.rename(columns=shorten)
Sort the following column names into groups:
>>> movies.columns
Index(['color', 'director_name', 'num_critic', 'duration', 'director_fb',
'actor_3_fb', 'actor_2_name', 'actor_1_fb', 'gross', 'genres',
'actor_1_name', 'movie_title', 'num_voted_users', 'cast_total_fb',
'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
'movie_imdb_link', 'num_user', 'language', 'country', 'content_rating',
'budget', 'title_year', 'actor_2_fb', 'imdb_score', 'aspect_ratio',
'movie_fb'],
dtype='object')
>>> cat_core = [
... "movie_title",
... "title_year",
... "content_rating",
... "genres",
... ]
>>> cat_people = [
... "director_name",
... "actor_1_name",
... "actor_2_name",
... "actor_3_name",
... ]
>>> cat_other = [
... "color",
... "country",
... "language",
... "plot_keywords",
... "movie_imdb_link",
... ]
>>> cont_fb = [
... "director_fb",
... "actor_1_fb",
... "actor_2_fb",
... "actor_3_fb",
... "cast_total_fb",
... "movie_fb",
... ]
>>> cont_finance = ["budget", "gross"]
>>> cont_num_reviews = [
... "num_voted_users",
... "num_user",
... "num_critic",
... ]
>>> cont_other = [
... "imdb_score",
... "duration",
... "aspect_ratio",
... "facenumber_in_poster",
... ]
Concatenate all the lists above into the final column order, and confirm that no column has been left out:
>>> new_col_order = (
... cat_core
... + cat_people
... + cat_other
... + cont_fb
... + cont_finance
... + cont_num_reviews
... + cont_other
... )
>>> set(movies.columns) == set(new_col_order)
True
Index movies with the new column list to get a DataFrame whose columns are in the new order:
>>> movies[new_col_order].head()
movie_title title_year ... aspect_ratio facenumber_in_poster
0 Avatar 2009.0 ... 1.78 0.0
1 Pirates ... 2007.0 ... 2.35 0.0
2 Spectre 2015.0 ... 2.35 1.0
3 The Dark... 2012.0 ... 2.35 0.0
4 Star War... NaN ... NaN 0.0
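The set comparison above confirms that no column is missing or misspelled, but it would not notice a name listed twice. The sketch below, on a hypothetical four-column frame (`toy`, `cat_cols`, and `cont_cols` are illustrative names and values), adds a length check as one possible safeguard before reordering.

```python
import pandas as pd

# Hypothetical toy data with columns in an arbitrary order
toy = pd.DataFrame(
    {
        "gross": [2.0],
        "movie_title": ["Avatar"],
        "budget": [1.0],
        "director_name": ["James Cameron"],
    }
)

cat_cols = ["movie_title", "director_name"]  # categorical group first
cont_cols = ["budget", "gross"]              # continuous group second
new_order = cat_cols + cont_cols

# Same names on both sides: nothing missing, nothing misspelled
same_names = set(new_order) == set(toy.columns)
# Same count: no name was listed twice (a set would hide that)
same_count = len(new_order) == len(toy.columns)
print(same_names, same_count)

reordered = toy[new_order]
print(list(reordered.columns))
# ['movie_title', 'director_name', 'budget', 'gross']
```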
Inspect the dataset's attributes: shape, size, and ndim.
>>> movies = pd.read_csv("data/movie.csv")
>>> movies.shape
(4916, 28)
>>> movies.size
137648
>>> movies.ndim
2
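The three attributes are related: `size` is the product of the two entries in `shape` (above, 4916 × 28 = 137648), and `ndim` is the length of `shape`. A minimal sketch on a hypothetical frame (`toy` is made up) checks these relationships.

```python
import pandas as pd

# Hypothetical toy data: 3 rows, 2 columns, one missing value
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, None, 6.0]})

n_rows, n_cols = toy.shape
print(toy.shape)  # (3, 2)
print(toy.size)   # 6, missing values are counted too
print(toy.ndim)   # 2

# size is rows * columns; ndim is the number of axes
print(toy.size == n_rows * n_cols)
print(toy.ndim == len(toy.shape))

# A single column is a Series, which has one axis
print(toy["a"].ndim)  # 1
```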
The .count method returns the number of non-missing values in each column:
>>> movies.count()
color 4897
director_name 4814
num_critic_for_reviews 4867
duration 4901
director_facebook_likes 4814
...
title_year 4810
actor_2_facebook_likes 4903
imdb_score 4916
aspect_ratio 4590
movie_facebook_likes 4916
Length: 28, dtype: int64
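Since `.count` reports non-missing values, subtracting it from the row count gives the number of missing values per column. The sketch below, on hypothetical data (`toy` is made up), compares that against `isna().sum()`, which counts the missing values directly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in both columns
toy = pd.DataFrame(
    {
        "color": ["Color", None, "Color"],
        "duration": [178.0, np.nan, np.nan],
    }
)

non_missing = toy.count()   # per-column count of non-missing values
missing = toy.isna().sum()  # per-column count of missing values

print(non_missing.to_dict())  # {'color': 2, 'duration': 1}
print(missing.to_dict())      # {'color': 1, 'duration': 2}

# The two counts always add up to the number of rows
print(((non_missing + missing) == len(toy)).all())
```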