This post introduces three common data type operations in Pandas: to_numeric, to_datetime, and select_dtypes.
import pandas as pd
import numpy as np
Official documentation: https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html
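pandas.to_numeric(arg,             # scalar, list, tuple, 1-d array, or Series
                  errors='raise',  # 'ignore', 'raise', or 'coerce'; the default is 'raise'
                  downcast=None)
The three values of errors:
- 'raise': invalid parsing raises an exception
- 'coerce': invalid parsing is set to NaN
- 'ignore': invalid parsing returns the input unchanged (deprecated since pandas 2.2)
Using downcast:
- 'integer' or 'signed': the smallest signed integer dtype (minimum np.int8)
- 'unsigned': the smallest unsigned integer dtype (minimum np.uint8)
- 'float': the smallest float dtype (minimum np.float32)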
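To make the downcast rule concrete, here is a minimal sketch. It assumes three made-up integer Series (the names small, medium, and negative are illustrative only) and shows that to_numeric picks the smallest dtype whose range holds every value, and leaves the data unchanged when no candidate dtype fits.

```python
import pandas as pd

# Hypothetical sample data chosen to straddle the int8 and uint8 limits.
small = pd.Series([1, 2, 100])      # every value fits in int8 (-128..127)
medium = pd.Series([1, 2, 300])     # 300 exceeds int8 and uint8, fits in 16 bits
negative = pd.Series([-1, 2, 300])  # a negative value rules out unsigned dtypes

small_int = pd.to_numeric(small, downcast="integer")
medium_int = pd.to_numeric(medium, downcast="integer")
medium_unsigned = pd.to_numeric(medium, downcast="unsigned")
negative_unsigned = pd.to_numeric(negative, downcast="unsigned")

print(small_int.dtype)          # int8
print(medium_int.dtype)         # int16
print(medium_unsigned.dtype)    # uint16
print(negative_unsigned.dtype)  # int64, no downcast was possible
```

The last case is worth noting: a downcast that cannot be applied does not raise, it simply returns the data with its original dtype.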
s = pd.Series(["2.0", '1', -3, 5.0])  # number-like values
s
0    2.0
1      1
2     -3
3    5.0
dtype: object
The Series has dtype object because it mixes strings and numbers. Convert it to a numeric dtype:
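# 1. By default the result is float64
pd.to_numeric(s)
0    2.0
1    1.0
2   -3.0
3    5.0
dtype: float64
# 2. Downcast to the smallest integer dtype
pd.to_numeric(s, downcast="integer")
0    2
1    1
2   -3
3    5
dtype: int8
# 3. Downcast to the smallest signed integer dtype (same as "integer")
pd.to_numeric(s, downcast="signed")
0    2
1    1
2   -3
3    5
dtype: int8
# 4. Downcast to unsigned: -3 is negative, so no downcast happens and the result stays float64
pd.to_numeric(s, downcast="unsigned")
0    2.0
1    1.0
2   -3.0
3    5.0
dtype: float64
# 5. Downcast to the smallest float dtype
pd.to_numeric(s, downcast="float")
0    2.0
1    1.0
2   -3.0
3    5.0
dtype: float32
s1 = pd.Series(["2.0", 'pandas', -3, 5.0])  # numbers plus a non-numeric string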
s1
0       2.0
1    pandas
2        -3
3       5.0
dtype: object
# pd.to_numeric(s1)  # raises an exception by default
# Ignore invalid values: the input is returned unchanged
pd.to_numeric(s1, errors="ignore")
0       2.0
1    pandas
2        -3
3       5.0
dtype: object
# pd.to_numeric(s1, errors="raise")  # invalid parsing raises an exception
# Invalid parsing is set to NaN
pd.to_numeric(s1, errors="coerce")
0    2.0
1    NaN
2   -3.0
3    5.0
dtype: float64
# Invalid parsing is set to NaN, and the result is downcast to float32
pd.to_numeric(s1, errors="coerce", downcast="float")
0    2.0
1    NaN
2   -3.0
3    5.0
dtype: float32
# Invalid parsing is set to NaN, then replaced with 0
pd.to_numeric(s1, errors="coerce").fillna(0)
0    2.0
1    0.0
2   -3.0
3    5.0
dtype: float64
s2 = pd.Series([1,2.0,3.0], dtype="float64")
s2
0    1.0
1    2.0
2    3.0
dtype: float64
s3 = pd.to_numeric(s2, downcast="float")
s3
0    1.0
1    2.0
2    3.0
dtype: float32
s4 = pd.to_numeric(s2, downcast="integer")
s4
0    1
1    2
2    3
dtype: int8
One benefit of type conversion is lower memory use. Compare the memory taken by the same data under the three numeric dtypes above:
# memory_usage() returns bytes and includes the index
print("memory of float64: ", s2.memory_usage())
print("memory of float32: ", s3.memory_usage())
print("memory of int8: ", s4.memory_usage())
memory of float64:  152
memory of float32:  140
memory of int8:  131
Another conversion method: astype
s2
0    1.0
1    2.0
2    3.0
dtype: float64
s2.astype("float32")
0    1.0
1    2.0
2    3.0
dtype: float32
s2.astype("int64")
0    1
1    2
2    3
dtype: int64
s2.astype("int32")
0    1
1    2
2    3
dtype: int32
s2.astype("category")
0    1.0
1    2.0
2    3.0
dtype: category
Categories (3, float64): [1.0, 2.0, 3.0]
pandas.to_datetime(arg,
                   errors='raise',
                   dayfirst=False,
                   yearfirst=False,
                   utc=None,
                   format=None,
                   exact=True,
                   unit=None,
                   infer_datetime_format=False,
                   origin='unix',
                   cache=True)
df = pd.DataFrame({"Year":[2022,2021,2022],
                   "Month":[1,3,5],
                   "Day":["10","12","28"]
                   })
df
| | Year | Month | Day |
|---|---|---|---|
| 0 | 2022 | 1 | 10 |
| 1 | 2021 | 3 | 12 |
| 2 | 2022 | 5 | 28 |
df.dtypes
Year      int64
Month     int64
Day      object
dtype: object
Adding the columns together directly raises an error, because strings and numbers cannot be added.
# df["Date"] = df["Year"] + df["Month"] + df["Day"]
df["Date"] = pd.to_datetime(df)
df
| | Year | Month | Day | Date |
|---|---|---|---|---|
| 0 | 2022 | 1 | 10 | 2022-01-10 |
| 1 | 2021 | 3 | 12 | 2021-03-12 |
| 2 | 2022 | 5 | 28 | 2022-05-28 |
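Assembling dates from columns has two practical wrinkles that a short sketch can illustrate. The frame below is hypothetical: it carries an extra Note column and one impossible date (February 30). The sketch assumes that to_datetime rejects columns it does not recognize as date parts, so only the date columns are passed, and that errors="coerce" turns a row that cannot form a valid date into NaT.

```python
import pandas as pd

# Hypothetical frame: the second row describes February 30, which does not exist.
frame = pd.DataFrame({
    "Year": [2022, 2022],
    "Month": [1, 2],
    "Day": [10, 30],
    "Note": ["ok", "bad day"],  # extra column unrelated to the date
})

# Pass only the date parts; the extra Note column would otherwise be rejected.
date_parts = frame[["Year", "Month", "Day"]]

# errors="coerce" replaces rows that cannot form a valid date with NaT.
dates = pd.to_datetime(date_parts, errors="coerce")
print(dates)
print(dates.isna().tolist())  # [False, True]
```

This also matters for the running example: once the Date column has been added to df, select the Year, Month, and Day columns explicitly before calling pd.to_datetime on the frame again.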
df.dtypes
Year              int64
Month             int64
Day              object
Date     datetime64[ns]
dtype: object
pd.to_datetime("10/2/21")  # default: month first
Timestamp('2021-10-02 00:00:00')
pd.to_datetime("10-2-21")  # default: month first
Timestamp('2021-10-02 00:00:00')
pd.to_datetime("10/2/21",dayfirst=True)
Timestamp('2021-02-10 00:00:00')
pd.to_datetime("10/2/21",yearfirst=True)
Timestamp('2010-02-21 00:00:00')
pd.to_datetime("22-01-21",dayfirst=True)
Timestamp('2021-01-22 00:00:00')
pd.to_datetime("22-01-21",yearfirst=True)
Timestamp('2022-01-21 00:00:00')
pd.to_datetime('20220107', format='%Y%m%d', errors='ignore')
Timestamp('2022-01-07 00:00:00')
pd.to_datetime('20220107112347', errors='ignore')
Timestamp('2022-01-07 11:23:47')
pd.to_datetime('20220107112233', format='%Y%m%d%H%M%S')
Timestamp('2022-01-07 11:22:33')
Selecting data by dtype
df.dtypes
Year              int64
Month             int64
Day              object
Date     datetime64[ns]
dtype: object
df.select_dtypes(include=["int"])
| | Year | Month |
|---|---|---|
| 0 | 2022 | 1 |
| 1 | 2021 | 3 |
| 2 | 2022 | 5 |
df.select_dtypes(include=["object"])
| | Day |
|---|---|
| 0 | 10 |
| 1 | 12 |
| 2 | 28 |
df.select_dtypes(include=["O"])  # same result as above
| | Day |
|---|---|
| 0 | 10 |
| 1 | 12 |
| 2 | 28 |
# Exclude object columns
df.select_dtypes(exclude=["object"])
| | Year | Month | Date |
|---|---|---|---|
| 0 | 2022 | 1 | 2022-01-10 |
| 1 | 2021 | 3 | 2021-03-12 |
| 2 | 2022 | 5 | 2022-05-28 |
# Exclude both object and int columns
df.select_dtypes(exclude=["object","int"])
| | Date |
|---|---|
| 0 | 2022-01-10 |
| 1 | 2021-03-12 |
| 2 | 2022-05-28 |
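Beyond concrete dtype names, select_dtypes also accepts broader selectors. The sketch below assumes a hypothetical frame with an added float column (Price is not part of the example above) and uses the "number" and "datetime" selectors to pick columns by family rather than by exact dtype.

```python
import pandas as pd

# Hypothetical frame mixing integer, float, string, and datetime columns.
frame = pd.DataFrame({
    "Year": [2022, 2021],
    "Price": [1.5, 2.5],
    "Day": ["10", "12"],
    "Date": pd.to_datetime(["2022-01-10", "2021-03-12"]),
})

# "number" matches every numeric dtype (integers and floats alike).
numeric_cols = frame.select_dtypes(include="number").columns.tolist()

# "datetime" matches datetime64 columns.
datetime_cols = frame.select_dtypes(include="datetime").columns.tolist()

# Excluding "number" keeps everything that is not numeric.
non_numeric_cols = frame.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)      # ['Year', 'Price']
print(datetime_cols)     # ['Date']
print(non_numeric_cols)  # ['Day', 'Date']
```

Selecting by family is convenient before a bulk conversion, for example applying a downcast only to the numeric columns.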