Data Inspection – AICorr.com


Analysing data

This is a data inspection tutorial.

Before analysing or modifying data, it is worth inspecting it first. Data inspection in Pandas means examining the structure, content, and basic statistics of a DataFrame. This helps you understand the data and spot potential issues or patterns early, such as missing values or unexpected data types. Pandas offers several common techniques for this. Let’s dive into them.

First, let’s create a random DataFrame (5 columns and 10 rows). For this, we will use the combination of Pandas and NumPy. If you are not familiar with NumPy, please refer to our NumPy tutorials. Because the values are random, your output will differ from the one shown below.

import pandas as pd
import numpy as np

# Sample DataFrame with random data
data = {
    'A': np.random.rand(10),
    'B': np.random.randint(0, 100, 10),
    'C': np.random.choice(['X', 'Y', 'Z'], 10),
    'D': np.random.randn(10),
    'E': np.random.uniform(1, 10, 10)
}

df = pd.DataFrame(data)
print(df)
          A   B  C         D         E
0  0.977025  23  Z  0.032601  2.115461
1  0.700341  86  X  0.540359  8.837162
2  0.919997  11  Z  0.597639  2.506773
3  0.340381  74  X  0.335716  2.205103
4  0.006617  77  X  1.214410  8.628857
5  0.861336  38  Z -0.051425  1.277405
6  0.975448  52  Y -0.868918  7.873193
7  0.037024  42  Y -0.660432  2.030910
8  0.935821  10  Y -1.793465  5.806170
9  0.506367  14  Z  0.540431  3.938597
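
Since the data above is random, every run prints different values. If you want output you can reproduce, one option is to seed a NumPy random generator. The following is a minimal sketch, not part of the original example: the helper name `make_frame` and the seed value 42 are illustrative assumptions, and the seeded values will not match the table shown above.

```python
import numpy as np
import pandas as pd


def make_frame(seed=42):
    """Build the same 10x5 sample frame from a seeded generator (hypothetical helper)."""
    rng = np.random.default_rng(seed)  # the seed value is an arbitrary choice
    return pd.DataFrame({
        "A": rng.random(10),                   # floats in [0, 1)
        "B": rng.integers(0, 100, 10),         # integers in [0, 100)
        "C": rng.choice(["X", "Y", "Z"], 10),  # random category labels
        "D": rng.standard_normal(10),          # standard normal floats
        "E": rng.uniform(1, 10, 10),           # floats in [1, 10)
    })


df_seeded = make_frame()

# Two frames built from the same seed hold identical values
print(df_seeded.equals(make_frame()))
```

Calling `make_frame()` with a different seed gives a different, but equally repeatable, DataFrame.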

Basic inspection

Now, we can start with some basic inspection methods.

  • head() – displays the first few rows (5 by default)
  • tail() – displays the last few rows (5 by default)
  • sample() – displays randomly selected rows
  • dtypes – shows each column’s data type
  • columns – shows all column names
  • shape – returns the dimensions as a (rows, columns) tuple
  • size – returns the total number of elements (rows × columns)

# View the first 3 rows (5 by default)
print(df.head(3))

# View the last 3 rows (5 by default)
print(df.tail(3))

# Sample 3 rows
print(df.sample(n=3))

# Get the column names
print(df.columns)

# Get data types
print(df.dtypes)

# Get dimensionality
print(df.shape)

# Get number of elements
print(df.size)
# head()
          A   B  C         D         E
0  0.977025  23  Z  0.032601  2.115461
1  0.700341  86  X  0.540359  8.837162
2  0.919997  11  Z  0.597639  2.506773

# tail()
          A   B  C         D         E
7  0.037024  42  Y -0.660432  2.030910
8  0.935821  10  Y -1.793465  5.806170
9  0.506367  14  Z  0.540431  3.938597

# sample()
          A   B  C         D         E
7  0.037024  42  Y -0.660432  2.030910
6  0.975448  52  Y -0.868918  7.873193
4  0.006617  77  X  1.214410  8.628857

# columns
Index(['A', 'B', 'C', 'D', 'E'], dtype='object')

# dtypes
A    float64
B      int32
C     object
D    float64
E    float64
dtype: object

# shape
(10, 5)

# size
50
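
The attributes above can also be used in code rather than just printed. Here is a small sketch under stated assumptions: it uses a tiny fixed DataFrame as a stand-in for the random one so the results are predictable, and the `random_state` value is an arbitrary choice. It shows how `shape` unpacks into rows and columns, how `size` relates to them, how `sample()` can be made repeatable, and how `select_dtypes()` builds on the information in `dtypes`.

```python
import numpy as np
import pandas as pd

# Small fixed frame used as a stand-in for the random one in the tutorial
df = pd.DataFrame({
    "A": np.arange(10, dtype=float),
    "B": np.arange(10),
    "C": list("XYZXYZXYZX"),
})

# shape is a (rows, columns) tuple, so it can be unpacked
rows, cols = df.shape
print(rows, cols)              # 10 3

# size is the total number of cells, i.e. rows * columns
print(df.size == rows * cols)  # True

# Passing random_state makes sample() return the same rows each time
first = df.sample(n=3, random_state=0)
second = df.sample(n=3, random_state=0)
print(first.equals(second))    # True

# Keep only the numeric columns, based on their data types
numeric = df.select_dtypes(include="number")
print(list(numeric.columns))   # ['A', 'B']
```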

Further inspection

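  • value_counts() – counts how often each distinct value occurs in a column
  • nunique() – counts the number of distinct values in each column
  • isnull() – flags missing (NULL/NaN) values with True
  • info() – prints a summary of the index, columns, non-null counts, data types, and memory usage
  • describe() – generates descriptive statistics for the numeric columns (by default)
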
# Get value counts for a specific column
print(df['A'].value_counts())

# Check for unique values
print(df.nunique())

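# Show NULL (missing) values
print(df.isnull())

# Details about the object; info() prints its report directly and
# returns None, so wrapping it in print() also prints a trailing "None"
print(df.info())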

# Descriptive statistics about the object
print(df.describe())
# value_counts()
A
0.977025    1
0.700341    1
0.919997    1
0.340381    1
0.006617    1
0.861336    1
0.975448    1
0.037024    1
0.935821    1
0.506367    1
Name: count, dtype: int64

# nunique()
A    10
B    10
C     3
D    10
E    10
dtype: int64

# isnull()
       A      B      C      D      E
0  False  False  False  False  False
1  False  False  False  False  False
2  False  False  False  False  False
3  False  False  False  False  False
4  False  False  False  False  False
5  False  False  False  False  False
6  False  False  False  False  False
7  False  False  False  False  False
8  False  False  False  False  False
9  False  False  False  False  False

# info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       10 non-null     float64
 1   B       10 non-null     int32  
 2   C       10 non-null     object 
 3   D       10 non-null     float64
 4   E       10 non-null     float64
dtypes: float64(3), int32(1), object(1)
memory usage: 488.0+ bytes
None

# describe()
               A          B          D          E
count  10.000000  10.000000  10.000000  10.000000
mean    0.626036  42.700000  -0.011308   4.521963
std     0.382257  28.724941   0.878057   2.993704
min     0.006617  10.000000  -1.793465   1.277405
25%     0.381878  16.250000  -0.508180   2.137872
50%     0.780839  40.000000   0.184159   3.222685
75%     0.931865  68.500000   0.540413   7.356437
max     0.977025  86.000000   1.214410   8.837162
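
The sample data contains no missing values and column A holds only unique floats, so isnull() and value_counts() have little to show there. The sketch below uses a small hand-made DataFrame with deliberate gaps (an assumption for illustration, not the tutorial's data) to show three common variations: summing isnull() per column, counting values in a categorical column, and asking describe() to include non-numeric columns.

```python
import numpy as np
import pandas as pd

# Hand-made frame with deliberate gaps, for illustration only
df = pd.DataFrame({
    "A": [0.5, np.nan, 1.5, 2.0],
    "C": ["X", "Y", "X", None],
})

# True counts as 1, so summing isnull() gives missing values per column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# value_counts() is most informative on categorical columns;
# missing values are left out unless dropna=False is passed
counts = df["C"].value_counts()
with_missing = df["C"].value_counts(dropna=False)
print(counts)
print(with_missing)

# include="all" adds non-numeric columns (count, unique, top, freq)
summary = df.describe(include="all")
print(summary)
```

On a large DataFrame, `df.isnull().sum()` is usually easier to read than the full True/False table that `df.isnull()` returns.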

This is an original data inspection educational material created by aicorr.com.

Next: Data Manipulation
