Commands for checking data after loading CSV or Excel files into pandas (personal notes)

Loading a file into a DataFrame

Overview of the DataFrame

df,

df.shape,

df['column_name'].value_counts(dropna=False)

(Environment) Windows 8.1, Python 3.6.5, pandas 0.23.0, NumPy 1.14.3

1. Loading a file into a DataFrame

First, load the data (a CSV file). The test data is available here: https://archive.ics.uci.edu/ml/datasets/iris

In [1]:

import pandas as pd
w_path = 'C:/Users/N/'
df     = pd.read_csv(w_path + 'iris.csv', encoding='shift-jis', skiprows=0)
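To try the same `read_csv` call without the file on disk, here is a minimal sketch. The in-memory bytes and the two sample rows are stand-ins (an assumption for illustration) for iris.csv, and `sample_df` is a hypothetical name; the column names follow the output shown in this memo.

```python
import io
import pandas as pd

# Stand-in for iris.csv: a header and two rows encoded as Shift-JIS bytes
# (illustrative data, not the full file).
csv_bytes = (
    "SepalLength,SepalWidth,PetalLength,PetalWidth,Class\n"
    "5.1,3.5,1.4,0.2,setosa\n"
    "4.9,3.0,1.4,0.2,setosa\n"
).encode("shift-jis")

# read_csv accepts a file-like object as well as a path.
sample_df = pd.read_csv(io.BytesIO(csv_bytes), encoding="shift-jis", skiprows=0)
print(sample_df.shape)
```

With `skiprows=0` nothing is skipped, so the first line is used as the header row.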

For an Excel file:

In [2]:

# encoding is omitted here: it is not a documented read_excel parameter and .xlsx needs none
df = pd.read_excel(w_path + 'iris.xlsx', sheet_name='Sheet1', skiprows=0, dtype='object')

2. Overview of the DataFrame

df    Display the DataFrame

In [3]:

df

	SepalLength	SepalWidth	PetalLength	PetalWidth	Class
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa
5	5.4	3.9	1.7	0.4	setosa
…	…	…	…	…	…
148	6.2	3.4	5.4	2.3	virginica
149	5.9	3.0	5.1	1.8	virginica

150 rows × 5 columns

df.shape    Number of rows and columns

In [4]:

df.shape

Out[4]:

(150, 5)

df.dtypes    Data type of each column

In [5]:

df.dtypes

Out[5]:

SepalLength    float64
SepalWidth     float64
PetalLength    float64
PetalWidth     float64
Class           object
dtype: object
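Both attributes are handy in code as well as at the prompt. A small sketch, assuming a two-row frame in the iris layout (`small_df` and the variable names are hypothetical): `shape` unpacks as a plain tuple, and `select_dtypes` picks columns by the types that `dtypes` reports.

```python
import pandas as pd

# Two rows in the same layout as the iris data (illustrative subset).
small_df = pd.DataFrame({
    "SepalLength": [5.1, 4.9],
    "SepalWidth": [3.5, 3.0],
    "PetalLength": [1.4, 1.4],
    "PetalWidth": [0.2, 0.2],
    "Class": ["setosa", "setosa"],
})

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = small_df.shape

# Names of the columns whose dtype is float64.
float_cols = small_df.select_dtypes(include="float64").columns.tolist()
print(n_rows, n_cols, float_cols)
```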

np.max(df)    Maximum value of each column

In [6]:

import numpy as np
np.max(df)

Out[6]:

SepalLength          7.9
SepalWidth           4.4
PetalLength          6.9
PetalWidth           2.5
Class          virginica
dtype: object
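pandas also offers this reduction directly as `DataFrame.max()`, which avoids the NumPy import. A minimal sketch, assuming a made-up three-row frame (`max_df` is a hypothetical name and the rows are illustrative); `numeric_only=True` leaves out the string column.

```python
import pandas as pd

# Illustrative rows only; not taken from the real file.
max_df = pd.DataFrame({
    "SepalLength": [5.1, 4.9, 7.9],
    "PetalWidth": [0.2, 0.2, 2.5],
    "Class": ["setosa", "setosa", "virginica"],
})

# Column-wise maximum, restricted to numeric columns.
col_max = max_df.max(numeric_only=True)
print(col_max)
```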

df.describe()    Summary statistics

In [7]:

df.describe()

Out[7]:

	SepalLength	SepalWidth	PetalLength	PetalWidth
count	150.000000	150.000000	150.000000	150.000000
mean	5.843333	3.054000	3.758667	1.198667
std	0.828066	0.433594	1.764420	0.763161
min	4.300000	2.000000	1.000000	0.100000
25%	5.100000	2.800000	1.600000	0.300000
50%	5.800000	3.000000	4.350000	1.300000
75%	6.400000	3.300000	5.100000	1.800000
max	7.900000	4.400000	6.900000	2.500000
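Note that the output above covers only the numeric columns; `Class` is left out by default. A sketch of one way to include it, assuming a made-up four-row frame (`desc_df` is a hypothetical name): `include="all"` adds `unique`, `top` and `freq` rows for non-numeric columns.

```python
import pandas as pd

# Illustrative rows only.
desc_df = pd.DataFrame({
    "SepalLength": [5.1, 4.9, 6.2, 5.9],
    "Class": ["setosa", "setosa", "virginica", "virginica"],
})

# include="all" summarizes every column; statistics that do not apply
# to a column are shown as NaN.
summary = desc_df.describe(include="all")
print(summary)
```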

3. Column details

df.info()    Non-null count and data type of each column

In [8]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
SepalLength    150 non-null float64
SepalWidth     150 non-null float64
PetalLength    150 non-null float64
PetalWidth     150 non-null float64
Class          150 non-null object
dtypes: float64(4), object(1)
memory usage: 5.9+ KB

df.isnull().any()    Whether each column contains nulls

The original iris.csv has no missing values, so two values in the SepalLength column were changed to missing for this example.

In [9]:

df.isnull().any()

Out[9]:

SepalLength     True
SepalWidth     False
PetalLength    False
PetalWidth     False
Class          False
dtype: bool
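`any()` only says whether a column has nulls. To see how many there are and which rows hold them, a sketch along these lines may help, assuming a small made-up frame with two missing values (`na_df` and the variable names are hypothetical).

```python
import numpy as np
import pandas as pd

# Illustrative frame with two missing SepalLength values.
na_df = pd.DataFrame({
    "SepalLength": [5.1, np.nan, 4.7, np.nan],
    "SepalWidth": [3.5, 3.0, 3.2, 3.1],
})

# Number of nulls per column (True counts as 1 when summed).
missing_per_column = na_df.isnull().sum()

# Rows that contain at least one null in any column.
rows_with_missing = na_df[na_df.isnull().any(axis=1)]
print(missing_per_column)
print(rows_with_missing)
```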

df['column_name'].count()    Count of non-null values

In [10]:

df['SepalLength'].count()

Out[10]:

148

df['column_name'].value_counts(dropna=False)    Count of each distinct value

In [11]:

df['SepalLength'].value_counts(dropna=False)

Out[11]:

 5.0    10
 6.3     9
 6.7     8
・・・
 6.6     2
NaN      2
 7.4     1
 5.3     1
Name: SepalLength, dtype: int64
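The effect of `dropna=False` is easiest to see side by side with the default. A minimal sketch, assuming a short made-up Series (`lengths` and the values are illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative values: two distinct numbers plus three missing entries.
lengths = pd.Series([5.0, 5.0, 6.3, np.nan, np.nan, np.nan], name="SepalLength")

# dropna=False gives NaN its own row in the result.
with_nan = lengths.value_counts(dropna=False)

# The default (dropna=True) leaves NaN out entirely.
without_nan = lengths.value_counts()
print(with_nan)
print(without_nan)
```

The counts with `dropna=False` add up to the length of the Series, which makes it a quick check that no rows went unaccounted for.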
