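Using the category data type in pandas

Creating a category

Creating from a Series

Passing dtype="category" when creating a Series produces categorical data. A categorical has two parts: the categories themselves (the set of allowed values) and a flag saying whether those categories are ordered:

In [1]: s = pd.Series(["a", "b", "c", "a"], dtype="category")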

In [2]: s
Out[2]: 
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']

A Series inside a DataFrame can be converted to category:

In [3]: df = pd.DataFrame({"A": ["a", "b", "c", "a"]})

In [4]: df["B"] = df["A"].astype("category")

In [5]: df["B"]
Out[5]: 
0    a
1    b
2    c
3    a
Name: B, dtype: category
Categories (3, object): ['a', 'b', 'c']

You can also build a pandas.Categorical first and pass it to Series. Values that are not listed in categories (here "a") become NaN:

In [10]: raw_cat = pd.Categorical(
   ....:     ["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=False
   ....: )
   ....: 

In [11]: s = pd.Series(raw_cat)

In [12]: s
Out[12]: 
0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (3, object): ['b', 'c', 'd']
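The NaN values above follow from how a categorical is stored: each value is an integer code pointing into the categories, and a value with no matching category gets the code -1. A minimal sketch of inspecting this through s.cat.codes (the names codes and missing are only illustrative):

```python
import pandas as pd

# "a" is not among the declared categories, so it cannot be represented.
raw_cat = pd.Categorical(["a", "b", "c", "a"], categories=["b", "c", "d"])
s = pd.Series(raw_cat)

# Integer codes index into s.cat.categories; -1 marks a missing value.
codes = s.cat.codes.tolist()
missing = s.isna().tolist()

print(codes)    # [-1, 0, 1, -1]
print(missing)  # [True, False, False, True]
```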

Creating from a DataFrame

dtype="category" can also be passed when creating a DataFrame:

In [17]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")}, dtype="category")

In [18]: df.dtypes
Out[18]: 
A    category
B    category
dtype: object

Columns A and B of the DataFrame are each categorical:

In [19]: df["A"]
Out[19]: 
0    a
1    b
2    c
3    a
Name: A, dtype: category
Categories (3, object): ['a', 'b', 'c']

In [20]: df["B"]
Out[20]: 
0    b
1    c
2    c
3    d
Name: B, dtype: category
Categories (3, object): ['b', 'c', 'd']

Alternatively, df.astype("category") converts every column of the DataFrame to category:

In [21]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

In [22]: df_cat = df.astype("category")

In [23]: df_cat.dtypes
Out[23]: 
A    category
B    category
dtype: object
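When only some columns should become categorical, astype also accepts a mapping from column name to dtype. A minimal sketch, where the names partial, a_is_cat and b_is_cat are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

# Convert only column A; column B keeps its original dtype.
partial = df.astype({"A": "category"})

a_is_cat = isinstance(partial["A"].dtype, pd.CategoricalDtype)
b_is_cat = isinstance(partial["B"].dtype, pd.CategoricalDtype)
print(a_is_cat, b_is_cat)  # True False
```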

Controlling creation

By default, a category created with dtype='category' uses these defaults:

1. The categories are inferred from the data.

2. The categories are unordered.

Both defaults can be overridden by explicitly creating a CategoricalDtype:

In [26]: from pandas.api.types import CategoricalDtype

In [27]: s = pd.Series(["a", "b", "c", "a"])

In [28]: cat_type = CategoricalDtype(categories=["b", "c", "d"], ordered=True)

In [29]: s_cat = s.astype(cat_type)

In [30]: s_cat
Out[30]: 
0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (3, object): ['b' < 'c' < 'd']

The same CategoricalDtype can be applied to a DataFrame, in which case every column gets the same categories:

In [31]: from pandas.api.types import CategoricalDtype

In [32]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

In [33]: cat_type = CategoricalDtype(categories=list("abcd"), ordered=True)

In [34]: df_cat = df.astype(cat_type)

In [35]: df_cat["A"]
Out[35]: 
0    a
1    b
2    c
3    a
Name: A, dtype: category
Categories (4, object): ['a' < 'b' < 'c' < 'd']

In [36]: df_cat["B"]
Out[36]: 
0    b
1    c
2    c
3    d
Name: B, dtype: category
Categories (4, object): ['a' < 'b' < 'c' < 'd']
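The practical difference between astype("category") and an explicit CategoricalDtype on a DataFrame is that the former infers categories per column, while the latter gives all columns one shared dtype. A small sketch comparing the two (the names inferred and shared are illustrative):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

# Categories inferred separately: A gets a, b, c and B gets b, c, d.
inferred = df.astype("category")

# One explicit dtype applied to every column.
shared = df.astype(CategoricalDtype(categories=list("abcd"), ordered=True))

inferred_same = inferred["A"].dtype == inferred["B"].dtype
shared_same = shared["A"].dtype == shared["B"].dtype
print(inferred_same, shared_same)  # False True
```

Sharing one dtype matters later on, because ordered comparisons between two categoricals require identical categories.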

Converting back to the original type

Series.astype(original_dtype) or np.asarray(categorical) converts a category back to the original type:

In [39]: s = pd.Series(["a", "b", "c", "a"])

In [40]: s
Out[40]: 
0    a
1    b
2    c
3    a
dtype: object

In [41]: s2 = s.astype("category")

In [42]: s2
Out[42]: 
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']

In [43]: s2.astype(str)
Out[43]: 
0    a
1    b
2    c
3    a
dtype: object

In [44]: np.asarray(s2)
Out[44]: array(['a', 'b', 'c', 'a'], dtype=object)

Working with categories

Getting the category attributes

Categorical data has two attributes, categories and ordered, available as s.cat.categories and s.cat.ordered:

In [57]: s = pd.Series(["a", "b", "c", "a"], dtype="category")

In [58]: s.cat.categories
Out[58]: Index(['a', 'b', 'c'], dtype='object')

In [59]: s.cat.ordered
Out[59]: False

The order of the categories can be set explicitly at creation time:

In [60]: s = pd.Series(pd.Categorical(["a", "b", "c", "a"], categories=["c", "b", "a"]))

In [61]: s.cat.categories
Out[61]: Index(['c', 'b', 'a'], dtype='object')

In [62]: s.cat.ordered
Out[62]: False

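Renaming categories

In older pandas versions, assigning to s.cat.categories renames the categories (this assignment was deprecated in pandas 1.5 and no longer works in pandas 2.0 and later):

In [67]: s = pd.Series(["a", "b", "c", "a"], dtype="category")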

In [68]: s
Out[68]: 
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']

In [69]: s.cat.categories = ["Group %s" % g for g in s.cat.categories]

In [70]: s
Out[70]: 
0    Group a
1    Group b
2    Group c
3    Group a
dtype: category
Categories (3, object): ['Group a', 'Group b', 'Group c']

rename_categories achieves the same result and returns a new Series:

In [71]: s = s.cat.rename_categories([1, 2, 3])

In [72]: s
Out[72]: 
0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [1, 2, 3]

Or pass a dict-like object that maps old names to new ones:

# You can also pass a dict-like object to map the renaming
In [73]: s = s.cat.rename_categories({1: "x", 2: "y", 3: "z"})

In [74]: s
Out[74]: 
0    x
1    y
2    z
3    x
dtype: category
Categories (3, object): ['x', 'y', 'z']
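The "Group ..." renaming from the first example can also be written with rename_categories, which is the route to take where direct assignment to s.cat.categories is unavailable. A minimal sketch (the name grouped is illustrative):

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# rename_categories returns a new Series and leaves `s` unchanged.
grouped = s.cat.rename_categories([f"Group {g}" for g in s.cat.categories])

print(grouped.tolist())        # ['Group a', 'Group b', 'Group c', 'Group a']
print(list(s.cat.categories))  # ['a', 'b', 'c']
```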

Adding categories with add_categories

add_categories appends new categories:

In [77]: s = s.cat.add_categories([4])

In [78]: s.cat.categories
Out[78]: Index(['x', 'y', 'z', 4], dtype='object')

In [79]: s
Out[79]: 
0    x
1    y
2    z
3    x
dtype: category
Categories (4, object): ['x', 'y', 'z', 4]

Removing categories with remove_categories

In [80]: s = s.cat.remove_categories([4])

In [81]: s
Out[81]: 
0    x
1    y
2    z
3    x
dtype: category
Categories (3, object): ['x', 'y', 'z']

Removing unused categories

In [82]: s = pd.Series(pd.Categorical(["a", "b", "a"], categories=["a", "b", "c", "d"]))

In [83]: s
Out[83]: 
0    a
1    b
2    a
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

In [84]: s.cat.remove_unused_categories()
Out[84]: 
0    a
1    b
2    a
dtype: category
Categories (2, object): ['a', 'b']

Resetting categories

set_categories() adds and removes categories in one step; values whose category is removed become NaN:

In [85]: s = pd.Series(["one", "two", "four", "-"], dtype="category")

In [86]: s
Out[86]: 
0     one
1     two
2    four
3       -
dtype: category
Categories (4, object): ['-', 'four', 'one', 'two']

In [87]: s = s.cat.set_categories(["one", "two", "three", "four"])

In [88]: s
Out[88]: 
0     one
1     two
2    four
3     NaN
dtype: category
Categories (4, object): ['one', 'two', 'three', 'four']

Sorting categories

If a category is created with ordered=True, the category order is meaningful: sort_values() sorts by it, and min() and max() are available:

In [91]: s = pd.Series(["a", "b", "c", "a"]).astype(CategoricalDtype(ordered=True))

In [92]: s.sort_values(inplace=True)

In [93]: s
Out[93]: 
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): ['a' < 'b' < 'c']

In [94]: s.min(), s.max()
Out[94]: ('a', 'c')

Use as_ordered() or as_unordered() to mark the categories as ordered or unordered:

In [95]: s.cat.as_ordered()
Out[95]: 
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): ['a' < 'b' < 'c']

In [96]: s.cat.as_unordered()
Out[96]: 
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']
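The ordered flag decides whether min() and max() are defined at all. The sketch below, with illustrative variable names, shows the failure on an unordered categorical and the fix through as_ordered():

```python
import pandas as pd

unordered = pd.Series(["a", "b", "c", "a"], dtype="category")

# min()/max() have no meaning without an order, so pandas raises TypeError.
try:
    unordered.min()
    min_failed = False
except TypeError:
    min_failed = True

# as_ordered() returns a new Series whose categories are ordered a < b < c.
ordered = unordered.cat.as_ordered()
lowest, highest = ordered.min(), ordered.max()

print(min_failed, lowest, highest)  # True a c
```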

Reordering

Categorical.reorder_categories() reorders the existing categories; the new list must contain exactly the same categories as the old one:

In [103]: s = pd.Series([1, 2, 3, 1], dtype="category")

In [104]: s = s.cat.reorder_categories([2, 3, 1], ordered=True)

In [105]: s
Out[105]: 
0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [2 < 3 < 1]
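reorder_categories only permutes the existing categories. Dropping or adding one at the same time is a job for set_categories instead. A sketch of the difference, assuming the same Series as above (the names reorder_error and reset are illustrative):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 1], dtype="category")

# Category 1 is missing from the new list, so reorder_categories refuses.
try:
    s.cat.reorder_categories([2, 3], ordered=True)
    reorder_error = None
except ValueError as exc:
    reorder_error = str(exc)

# set_categories accepts the shorter list; values equal to 1 become NaN.
reset = s.cat.set_categories([2, 3], ordered=True)

print(reorder_error is not None)  # True
print(reset.isna().tolist())      # [True, False, False, True]
```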

Sorting by multiple columns

sort_values can sort by several columns; a categorical column is sorted by its category order:

In [109]: dfs = pd.DataFrame(
   .....:     {
   .....:         "A": pd.Categorical(
   .....:             list("bbeebbaa"),
   .....:             categories=["e", "a", "b"],
   .....:             ordered=True,
   .....:         ),
   .....:         "B": [1, 2, 1, 2, 2, 1, 2, 1],
   .....:     }
   .....: )
   .....: 

In [110]: dfs.sort_values(by=["A", "B"])
Out[110]: 
   A  B
2  e  1
3  e  2
7  a  1
6  a  2
0  b  1
5  b  1
1  b  2
4  b  2

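Comparison operations

Equality comparisons (== and !=) work on any categorical. The ordering comparisons (>, >=, <, and <=) additionally require ordered=True. Categorical data can be compared with a scalar, or with another categorical that has the same categories and the same order.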

In [113]: cat = pd.Series([1, 2, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))

In [114]: cat_base = pd.Series([2, 2, 2]).astype(CategoricalDtype([3, 2, 1], ordered=True))

In [115]: cat_base2 = pd.Series([2, 2, 2]).astype(CategoricalDtype(ordered=True))

In [119]: cat > cat_base
Out[119]: 
0     True
1    False
2    False
dtype: bool

In [120]: cat > 2
Out[120]: 
0     True
1    False
2    False
dtype: bool
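The cat_base2 Series defined above has different categories (only 2, inferred from its data), so comparing it with cat is rejected. A minimal sketch of that failure, alongside a scalar equality check that always works (the names raised and equal_to_two are illustrative):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

cat = pd.Series([1, 2, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))
cat_base2 = pd.Series([2, 2, 2]).astype(CategoricalDtype(ordered=True))

# The two categoricals do not share the same categories, so pandas raises.
try:
    cat > cat_base2
    raised = False
except TypeError:
    raised = True

# Equality against a scalar is always allowed.
equal_to_two = (cat == 2).tolist()

print(raised)        # True
print(equal_to_two)  # [False, True, False]
```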

Other operations

Categorical data is still stored in a Series, so most Series operations work on it, for example Series.min(), Series.max() and Series.mode().

value_counts (categories that never occur are reported with a count of 0):

In [131]: s = pd.Series(pd.Categorical(["a", "b", "c", "c"], categories=["c", "a", "b", "d"]))

In [132]: s.value_counts()
Out[132]: 
c    2
a    1
b    1
d    0
dtype: int64

DataFrame.sum():

In [133]: columns = pd.Categorical(
   .....:     ["One", "One", "Two"], categories=["One", "Two", "Three"], ordered=True
   .....: )
   .....: 

In [134]: df = pd.DataFrame(
   .....:     data=[[1, 2, 3], [4, 5, 6]],
   .....:     columns=pd.MultiIndex.from_arrays([["A", "B", "B"], columns]),
   .....: )
   .....: 

In [135]: df.sum(axis=1, level=1)
Out[135]: 
   One  Two  Three
0    3    3      0
1    9    6      0
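The level argument of DataFrame.sum used above was deprecated in pandas 1.3 and is no longer accepted in pandas 2.0 and later. One way to get the same table on newer versions is to transpose, group by the column level, and transpose back. This is a sketch of that workaround, not the only option; the name result is illustrative:

```python
import pandas as pd

columns = pd.Categorical(
    ["One", "One", "Two"], categories=["One", "Two", "Three"], ordered=True
)
df = pd.DataFrame(
    data=[[1, 2, 3], [4, 5, 6]],
    columns=pd.MultiIndex.from_arrays([["A", "B", "B"], columns]),
)

# Group the transposed frame by the categorical level, then transpose back.
# observed=False keeps the unused category "Three" in the result.
result = df.T.groupby(level=1, observed=False).sum().T

print(result["One"].tolist())    # [3, 9]
print(result["Two"].tolist())    # [3, 6]
print(result["Three"].tolist())  # [0, 0]
```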

Groupby:

In [136]: cats = pd.Categorical(
   .....:     ["a", "b", "b", "b", "c", "c", "c"], categories=["a", "b", "c", "d"]
   .....: )
   .....: 

In [137]: df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]})

In [138]: df.groupby("cats").mean()
Out[138]: 
      values
cats        
a        1.0
b        2.0
c        4.0
d        NaN

In [139]: cats2 = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])

In [140]: df2 = pd.DataFrame(
   .....:     {
   .....:         "cats": cats2,
   .....:         "B": ["c", "d", "c", "d"],
   .....:         "values": [1, 2, 3, 4],
   .....:     }
   .....: )
   .....: 

In [141]: df2.groupby(["cats", "B"]).mean()
Out[141]: 
        values
cats B        
a    c     1.0
     d     2.0
b    c     3.0
     d     4.0
c    c     NaN
     d     NaN
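The NaN rows in the groupby results above come from categories that never occur in the data. Whether they appear is controlled by the observed parameter of groupby. Since its default has changed across pandas versions, passing it explicitly is the safer choice. A sketch with illustrative variable names:

```python
import pandas as pd

cats = pd.Categorical(
    ["a", "b", "b", "b", "c", "c", "c"], categories=["a", "b", "c", "d"]
)
df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]})

# observed=False: one group per declared category, including the unused "d".
all_groups = df.groupby("cats", observed=False)["values"].mean()

# observed=True: only categories that actually occur in the data.
seen_groups = df.groupby("cats", observed=True)["values"].mean()

print(list(all_groups.index))   # ['a', 'b', 'c', 'd']
print(list(seen_groups.index))  # ['a', 'b', 'c']
```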

Pivot tables:

In [142]: raw_cat = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])

In [143]: df = pd.DataFrame({"A": raw_cat, "B": ["c", "d", "c", "d"], "values": [1, 2, 3, 4]})

In [144]: pd.pivot_table(df, values="values", index=["A", "B"])
Out[144]: 
     values
A B        
a c       1
  d       2
b c       3
  d       4