
Data Inspection – AICorr.com

AICorr

7 April 2024


Analysing data

This is a data inspection tutorial.

When aiming to analyse or modify data, it is always beneficial to inspect it first. Data inspection in Pandas involves examining the structure, content, and basic statistics of a DataFrame to gain insights into the data. As a result, you can better understand your data and identify potential issues or patterns. Pandas offers several common techniques for inspecting information. Let’s dive into them.

First, let’s create a DataFrame of random data (5 columns and 10 rows). For this, we will use Pandas together with NumPy. Because the values are random, your numbers will differ from the output shown here. If you are not familiar with NumPy, please refer to our NumPy tutorials.

import pandas as pd
import numpy as np

# Sample DataFrame with random data
data = {
    'A': np.random.rand(10),
    'B': np.random.randint(0, 100, 10),
    'C': np.random.choice(['X', 'Y', 'Z'], 10),
    'D': np.random.randn(10),
    'E': np.random.uniform(1, 10, 10)
}

df = pd.DataFrame(data)
print(df)

          A   B  C         D         E
0  0.977025  23  Z  0.032601  2.115461
1  0.700341  86  X  0.540359  8.837162
2  0.919997  11  Z  0.597639  2.506773
3  0.340381  74  X  0.335716  2.205103
4  0.006617  77  X  1.214410  8.628857
5  0.861336  38  Z -0.051425  1.277405
6  0.975448  52  Y -0.868918  7.873193
7  0.037024  42  Y -0.660432  2.030910
8  0.935821  10  Y -1.793465  5.806170
9  0.506367  14  Z  0.540431  3.938597
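
Because the values above are drawn at random, every run produces a different table. If you want numbers that repeat from run to run, one option is NumPy's `default_rng` with a fixed seed. The following is a minimal sketch, not part of the original example: the helper name `make_frame`, the variable `df_seeded`, and the seed value 42 are arbitrary choices made for illustration.

```python
import numpy as np
import pandas as pd

def make_frame(seed=42):
    """Build the same 10-row, 5-column layout from a seeded generator.

    Hypothetical helper for illustration; the seed value is arbitrary.
    """
    rng = np.random.default_rng(seed)
    data = {
        'A': rng.random(10),                   # floats in [0, 1)
        'B': rng.integers(0, 100, 10),         # integers in [0, 100)
        'C': rng.choice(['X', 'Y', 'Z'], 10),  # random category labels
        'D': rng.standard_normal(10),          # standard normal floats
        'E': rng.uniform(1, 10, 10),           # floats in [1, 10)
    }
    return pd.DataFrame(data)

df_seeded = make_frame()
print(df_seeded.shape)
```

Calling `make_frame()` twice with the same seed should give identical DataFrames, which makes it easier to compare your output with someone else's.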

Basic inspection

Now, we can start with some basic inspection methods.

head() – displays first few rows
tail() – displays last few rows
sample() – displays random samples
dtypes – shows each column’s data type
columns – shows all column names
shape – shows the dimensions of an object as a (rows, columns) tuple
size – shows the total number of elements in an object (rows × columns)

# View the first 3 rows (5 by default)
print(df.head(3))

# View the last 3 rows (5 by default)
print(df.tail(3))

# Sample 3 rows
print(df.sample(n=3))

# Get the column names
print(df.columns)

# Get data types
print(df.dtypes)

# Get dimensions (rows, columns)
print(df.shape)

# Get number of elements
print(df.size)

# head()
          A   B  C         D         E
0  0.977025  23  Z  0.032601  2.115461
1  0.700341  86  X  0.540359  8.837162
2  0.919997  11  Z  0.597639  2.506773

# tail()
          A   B  C         D         E
7  0.037024  42  Y -0.660432  2.030910
8  0.935821  10  Y -1.793465  5.806170
9  0.506367  14  Z  0.540431  3.938597

# sample()
          A   B  C         D         E
7  0.037024  42  Y -0.660432  2.030910
6  0.975448  52  Y -0.868918  7.873193
4  0.006617  77  X  1.214410  8.628857

# columns
Index(['A', 'B', 'C', 'D', 'E'], dtype='object')

# dtypes
A    float64
B      int32
C     object
D    float64
E    float64
dtype: object

# shape
(10, 5)

# size
50
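
To tie these attributes together: `size` is the number of rows multiplied by the number of columns, and `sample()` accepts a `random_state` argument when you want the same rows on every run. A minimal sketch on a small stand-in DataFrame (the name `demo` and its values are made up for illustration):

```python
import pandas as pd

# Small stand-in DataFrame; the values are illustrative only
demo = pd.DataFrame({
    'A': [0.1, 0.2, 0.3, 0.4],
    'C': ['X', 'Y', 'Z', 'X'],
})

# shape is a (rows, columns) tuple, so it can be unpacked
rows, cols = demo.shape
print(rows, cols)                 # 4 2

# size is the total number of cells
print(demo.size == rows * cols)   # True

# A fixed random_state makes sample() return the same rows each time
first = demo.sample(n=2, random_state=0)
second = demo.sample(n=2, random_state=0)
print(first.equals(second))       # True

# columns is an Index; convert it to a plain list if needed
print(list(demo.columns))         # ['A', 'C']
```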

Further inspection

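value_counts() – counts how often each distinct value occurs in a column
nunique() – counts the number of unique values in each column
isnull() – flags missing (NULL/NaN) values with True
info() – prints a summary of an object (index, columns, non-null counts, data types, memory usage)
describe() – generates descriptive statistics for the numeric columns of an object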

# Get value counts for a specific column
print(df['A'].value_counts())

# Check for unique values
print(df.nunique())

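# Flag missing (NULL/NaN) values, cell by cell
print(df.isnull())

# Details about the object
# Note: info() prints its report itself and returns None,
# so wrapping it in print() adds a trailing "None" to the output
print(df.info())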

# Descriptive statistics about the object
print(df.describe())

# value_counts()
A
0.977025    1
0.700341    1
0.919997    1
0.340381    1
0.006617    1
0.861336    1
0.975448    1
0.037024    1
0.935821    1
0.506367    1
Name: count, dtype: int64

# nunique()
A    10
B    10
C     3
D    10
E    10
dtype: int64

# isnull()
       A      B      C      D      E
0  False  False  False  False  False
1  False  False  False  False  False
2  False  False  False  False  False
3  False  False  False  False  False
4  False  False  False  False  False
5  False  False  False  False  False
6  False  False  False  False  False
7  False  False  False  False  False
8  False  False  False  False  False
9  False  False  False  False  False

# info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       10 non-null     float64
 1   B       10 non-null     int32  
 2   C       10 non-null     object 
 3   D       10 non-null     float64
 4   E       10 non-null     float64
dtypes: float64(3), int32(1), object(1)
memory usage: 488.0+ bytes
None

# describe()
               A          B          D          E
count  10.000000  10.000000  10.000000  10.000000
mean    0.626036  42.700000  -0.011308   4.521963
std     0.382257  28.724941   0.878057   2.993704
min     0.006617  10.000000  -1.793465   1.277405
25%     0.381878  16.250000  -0.508180   2.137872
50%     0.780839  40.000000   0.184159   3.222685
75%     0.931865  68.500000   0.540413   7.356437
max     0.977025  86.000000   1.214410   8.837162
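
The boolean table from `isnull()` becomes hard to read on larger data, `value_counts()` is most informative on a categorical column, and `describe()` skips non-numeric columns by default. A minimal sketch of three common follow-ups, using a small stand-in DataFrame with one missing value inserted on purpose (the name `demo` and its values are illustrative assumptions, not taken from the table above):

```python
import numpy as np
import pandas as pd

# Stand-in DataFrame with one deliberately missing value in column D
demo = pd.DataFrame({
    'B': [23, 86, 11, 74],
    'C': ['Z', 'X', 'Z', 'X'],
    'D': [0.03, np.nan, 0.60, 0.34],
})

# Sum the True flags to get the number of missing values per column
missing_per_column = demo.isnull().sum()
print(missing_per_column)

# Frequency of each category in a text column
category_counts = demo['C'].value_counts()
print(category_counts)

# include='all' adds non-numeric columns to the summary
print(demo.describe(include='all'))
```

Statistics that do not apply to a column's type (for example, the mean of a text column) appear as NaN in the combined summary.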

This is an original data inspection educational material created by aicorr.com.

Next: Data Manipulation
