Understand your data.

Aditya Kumar
5 min read · Feb 20, 2023


As we take our first steps towards becoming data scientists, the first data we get is mostly in CSV (comma-separated values) format. CSV and JSON are among the most common data formats provided, but as the volume of data increases, CSV becomes less popular. Let's load data from CSV and JSON into a data frame and visualize it.

1. Data in CSV format: If our data is in CSV format, we can read it using pandas (read_csv).
Fig 1: Reading data from CSV

2. Data in JSON format: JSON can be read using pandas in the same way as CSV files; read_json is used to read JSON files.

Fig 2: Reading data from JSON files
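
Since the screenshots are not reproduced here, the two reads can be sketched roughly as follows. The sample rows, the column names (sepalLength, sepalWidth, petalLength, petalWidth, species) and the variable names are assumptions made for illustration; in practice you would pass a file path such as "iris.csv" instead of the in-memory text.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample data standing in for real files on disk.
csv_text = """sepalLength,sepalWidth,petalLength,petalWidth,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""

json_text = """[
  {"sepalLength": 5.1, "sepalWidth": 3.5, "petalLength": 1.4, "petalWidth": 0.2, "species": "setosa"},
  {"sepalLength": 7.0, "sepalWidth": 3.2, "petalLength": 4.7, "petalWidth": 1.4, "species": "versicolor"},
  {"sepalLength": 6.3, "sepalWidth": 3.3, "petalLength": 6.0, "petalWidth": 2.5, "species": "virginica"}
]"""

# read_csv and read_json accept a path or a file-like object.
csv_data = pd.read_csv(StringIO(csv_text))
json_data = pd.read_json(StringIO(json_text))

print(csv_data)
print(json_data)
```

Both calls return an ordinary data frame, so every step that follows works the same way regardless of the source format.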

Once the data is loaded, we can take some basic steps to understand it.

Fig 3: Displaying top 10 rows

data.head() displays the first 5 rows of the data frame by default. We can pass a number, as in head(number), to display as many rows as we wish. Often, though, the first few rows won't help you understand the data; in that case, we can display a set of random rows, for example with sample().

Fig 4: Random rows from the data frame
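
A minimal sketch of both ways of peeking at the rows, assuming a small hand-typed data frame named data in the shape of the Iris data (the values and column names are illustrative, not the author's file):

```python
import pandas as pd

# Hypothetical sample rows; the article's real data comes from a file.
data = pd.DataFrame({
    "sepalLength": [5.1, 5.4, 4.6, 7.0, 6.4, 6.9, 6.3, 5.8, 7.1],
    "sepalWidth": [3.5, 3.9, 3.4, 3.2, 3.2, 3.1, 3.3, 2.7, 3.0],
    "petalLength": [1.4, 1.7, 1.4, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9],
    "petalWidth": [0.2, 0.4, 0.3, 1.4, 1.5, 1.5, 2.5, 1.9, 2.1],
    "species": ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
})

print(data.head())    # first 5 rows (the default)
print(data.head(3))   # first 3 rows

# 3 rows picked at random; random_state only makes the pick repeatable.
print(data.sample(3, random_state=0))
```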

We can get a brief summary of the data using info().

Fig 5: Info details about the data frame

info() lists the columns, their data types, the number of non-null values in each, and how much memory the data frame consumes.
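
As a rough illustration, assuming a tiny hand-typed data frame named data (hypothetical values), the summary can be printed like this. The memory_usage call is an optional extra for a per-column breakdown.

```python
import pandas as pd

# Hypothetical sample rows used only to show the shape of the output.
data = pd.DataFrame({
    "sepalLength": [5.1, 7.0, 6.3],
    "petalLength": [1.4, 4.7, 6.0],
    "species": ["setosa", "versicolor", "virginica"],
})

# Prints column names, non-null counts, dtypes and total memory use.
data.info()

# Per-column memory in bytes; deep=True also measures the string contents.
print(data.memory_usage(deep=True))
```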

We can check if our dataset has any missing values by using df.isnull().sum(). In our dataset, as shown below, we don't have any missing values.

Fig 6: Count of missing values per column
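
To show what a non-zero result looks like, here is a small sketch with one value deliberately left missing. The data frame is hypothetical; the article's own dataset has no missing values.

```python
import pandas as pd

# Hypothetical rows; the None in sepalWidth becomes NaN (a missing value).
data = pd.DataFrame({
    "sepalLength": [5.1, 7.0, 6.3],
    "sepalWidth": [3.5, None, 3.3],
    "species": ["setosa", "versicolor", "virginica"],
})

# isnull() marks each cell True/False; sum() counts the True cells per column.
missing = data.isnull().sum()
print(missing)
```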

We can also check whether our dataset has any duplicate rows. To get reliable results, we need to make sure every row is unique. We can count the duplicates using df.duplicated().sum().

Fig 7: Check the number of duplicate rows

Since we have duplicate rows, we need to remove them from our data frame, for example with drop_duplicates(). In this case, we create a new data frame named "newdata" that does not contain any duplicate rows.

Fig 8: New dataframe without duplicates
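
The check and the cleanup can be sketched together as follows, assuming a hypothetical data frame in which one row is repeated. Only the name newdata is taken from the text; everything else is illustrative.

```python
import pandas as pd

# Hypothetical rows: the third row repeats the first one exactly.
data = pd.DataFrame({
    "sepalLength": [5.1, 7.0, 5.1, 6.3],
    "petalLength": [1.4, 4.7, 1.4, 6.0],
    "species": ["setosa", "versicolor", "setosa", "virginica"],
})

# duplicated() flags every repeat of an earlier row; sum() counts them.
n_duplicates = data.duplicated().sum()
print(n_duplicates)

# Keep the first copy of each row and renumber the index.
newdata = data.drop_duplicates().reset_index(drop=True)
print(newdata)
```

drop_duplicates returns a new data frame and leaves the original untouched, which is why the result is assigned to a new name.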

Whenever we have data, we can check whether the columns are related to each other. If a column shows no relationship with the others, we can consider removing it from our dataset. We can check this using corr(), which works only on numeric data. In the example below, sepalWidth has the weakest correlation with the other columns, while the other three are strongly correlated with each other.

Fig 9: Correlation among data
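
A minimal sketch of the correlation check, assuming hand-typed sample rows and the column names used in this article. Selecting the numeric columns first is one way to keep the text column out of the calculation.

```python
import pandas as pd

# Hypothetical sample rows in the shape of the Iris data.
data = pd.DataFrame({
    "sepalLength": [5.1, 5.4, 4.6, 7.0, 6.4, 6.9, 6.3, 5.8, 7.1],
    "sepalWidth": [3.5, 3.9, 3.4, 3.2, 3.2, 3.1, 3.3, 2.7, 3.0],
    "petalLength": [1.4, 1.7, 1.4, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9],
    "petalWidth": [0.2, 0.4, 0.3, 1.4, 1.5, 1.5, 2.5, 1.9, 2.1],
    "species": ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
correlation = data.select_dtypes("number").corr()
print(correlation)
```

Values close to 1 or -1 indicate a strong linear relationship, while values near 0 indicate a weak one.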

Now that we have studied the data as a whole, let's look more closely at individual columns. In our case, we have one categorical column, species. Let's find out how many distinct values this column contains and the count of each value.

Fig 10: Count of categorical data
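
One way to get these counts, both as numbers and as a bar chart, is sketched below. The data frame is a hypothetical stand-in for the real data. In a notebook the figure renders inline; in a script, add plt.show() at the end.

```python
import pandas as pd
import seaborn as sns

# Hypothetical sample: three rows per species.
data = pd.DataFrame({
    "species": ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
})

# Exact counts per distinct value.
counts = data["species"].value_counts()
print(counts)

# The same information as one bar per species.
ax = sns.countplot(data=data, x="species")
ax.set_title("Rows per species")
```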

The seaborn library allows us to do that. In our case, we have used countplot to get the details, and we can see that the count is almost the same for all three species. We can get the exact percentage share of each value using a pie chart. The pie chart below shows that our species are equally distributed.

Fig 11: Percentage of column values
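
A pie chart of the shares could be drawn along these lines, assuming a hypothetical data frame with three rows per species. The autopct format string prints each share as a percentage with one decimal place.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample: three rows per species.
data = pd.DataFrame({
    "species": ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
})

counts = data["species"].value_counts()

# Percentage share of each species, computed directly for reference.
shares = data["species"].value_counts(normalize=True) * 100
print(shares)

fig, ax = plt.subplots()
ax.pie(counts, labels=counts.index, autopct="%1.1f%%")
ax.set_title("Share of each species")
```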

In the case above, the column had only 3 distinct values. When a column holds a continuous range of values instead, we can plot a histogram. Let's take sepalLength as an example, since we are not sure what the minimum or maximum value of this column is. Let's plot the histogram using the matplotlib library.

Fig 12: Histogram of the sepalLength column

Looking at the graph above, we can estimate that the minimum sepalLength is around 4 and that it ranges up to 7.8, and that values around 5 occur most frequently in this column.
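
A histogram like this can be sketched as follows, assuming a few hand-typed sepalLength values (so the bars will not match the figure described above). The bin count of 5 is an arbitrary choice for such a small sample.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample values for a single numeric column.
data = pd.DataFrame({
    "sepalLength": [5.1, 5.4, 4.6, 7.0, 6.4, 6.9, 6.3, 5.8, 7.1],
})

fig, ax = plt.subplots()

# hist returns the count in each bin and the bin edges.
counts, edges, _ = ax.hist(data["sepalLength"], bins=5)
ax.set_xlabel("sepalLength")
ax.set_ylabel("frequency")

# The exact minimum and maximum, rather than reading them off the plot.
print(data["sepalLength"].min(), data["sepalLength"].max())
```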

We can also check the relationship between two columns using a scatter plot. For example, let's find out how sepalLength is related to petalLength.

Fig 13: Relation plotted between petalLength and sepalLength

Suppose we want to see the same relationship broken down by species, i.e. which species has a stronger relationship between petalLength and sepalLength. We can color the points by species.

Fig 14: Relation between petalLength and sepalLength with respect to species

The graph above shows that the setosa species has smaller petals and sepals, while the other two species have larger ones.
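
Both scatter plots described above can be sketched side by side, assuming seaborn's scatterplot and a hypothetical sample of rows. The hue argument is what colors the points by species.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical sample rows in the shape of the Iris data.
data = pd.DataFrame({
    "sepalLength": [5.1, 5.4, 4.6, 7.0, 6.4, 6.9, 6.3, 5.8, 7.1],
    "petalLength": [1.4, 1.7, 1.4, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9],
    "species": ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Left: the plain relationship between the two columns.
sns.scatterplot(data=data, x="sepalLength", y="petalLength", ax=axes[0])

# Right: the same points, colored by species.
sns.scatterplot(data=data, x="sepalLength", y="petalLength",
                hue="species", ax=axes[1])
```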

We can also plot the relationship between the species and any one other column of the data frame, i.e. between one categorical and one numerical column.

Fig 15: Relation between numerical and categorical values
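
The plot type used in the original figure is not stated, so the sketch below assumes a box plot, which is one common choice for comparing a numerical column across categories. The sample rows are hypothetical.

```python
import pandas as pd
import seaborn as sns

# Hypothetical sample rows: one numerical and one categorical column.
data = pd.DataFrame({
    "sepalLength": [5.1, 5.4, 4.6, 7.0, 6.4, 6.9, 6.3, 5.8, 7.1],
    "species": ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
})

# One box per species, summarizing the spread of sepalLength.
ax = sns.boxplot(data=data, x="species", y="sepalLength")

# The median of each group, which the line inside each box represents.
medians = data.groupby("species")["sepalLength"].median()
print(medians)
```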

Suppose we want a single figure that shows the relationship between every pair of numerical columns. This can be achieved using pairplot from the seaborn library.

Fig 16: Relation among all the columns.

We can draw the same graph with the categorical column passed as the hue parameter, so that each species gets its own color.

Fig 17: Relation among numerical columns considering one categorical value.

Using the libraries above, we can easily visualize and understand our data.

The above information has been gathered from various sources and mostly from the YouTube channel https://www.youtube.com/@campusx-official.

Hope you find it useful.


Written by Aditya Kumar

Data Scientist with 6 years of experience. To find out more connect with me on https://www.linkedin.com/in/adityakumar529/
