Pandas in brief

Jaishri Rai
2 min read · May 25, 2023

This article summarises the Pandas library in Python. It briefly covers what the library is used for and the methods and functions involved, so it can serve as a quick revision reference.

Pandas is a powerful open-source data manipulation and analysis library for Python. It provides data structures and functions that allow for efficient handling, cleaning, analyzing, and visualization of structured data. Some objectives that Pandas can fulfill include:

  1. Data Cleaning and Preparation: Pandas provides functions to handle missing data, perform data imputation, handle duplicates, and transform data to a desired format.
  2. Data Manipulation: Pandas allows for filtering, sorting, grouping, and aggregating data. It supports operations such as merging, joining, and reshaping datasets.
  3. Data Analysis: Pandas provides a wide range of statistical and analytical functions, including descriptive statistics, correlation analysis, data visualization, time series analysis, and more.
  4. Data Visualization: Pandas integrates well with other libraries like Matplotlib and Seaborn to create informative plots, charts, and graphs to visualize data.
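
As a small illustration of the cleaning objective, here is a minimal sketch that drops duplicate rows and fills a missing value with the column mean. The column names and values are made up for the example, and mean imputation is just one possible choice.

```python
import pandas as pd

# Hypothetical data: one duplicated row and one missing age
df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Ravi", "Meena"],
    "age": [30.0, 25.0, 25.0, float("nan")],
})

# Remove exact duplicate rows (keeps the first occurrence)
deduped = df.drop_duplicates()

# Fill the missing age with the mean of the remaining ages
cleaned = deduped.fillna({"age": deduped["age"].mean()})

print(cleaned)
```

Both calls return new DataFrames, so the original `df` is left untouched.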

Quick Revision Points:

> import pandas as pd -> Import Library

> df = pd.read_csv(filepath) -> load a CSV file into a DataFrame

> df.head() -> show the first 5 rows of the DataFrame

> df.columns -> show the column names

> df.head().transpose() -> transpose the first 5 rows (swap rows and columns)

> df.shape -> number of rows and columns, as a (rows, columns) tuple

> df.info() -> show columns, non-null counts, and data types

> df[0:5] -> show the first five rows (positions 0 to 4). Slicing is based on row position.
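
The inspection calls above can be tried together in one short sketch. To keep it self-contained, the CSV content is an inline string read through `io.StringIO`; the columns and values are hypothetical, and in practice you would pass a file path to `pd.read_csv`.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file
csv_text = """district,village,people_id,age
North,Alpha,1,34
North,Alpha,2,29
North,Beta,3,41
South,Gamma,4,37
South,Gamma,5,52
South,Delta,6,23
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())              # first 5 rows
print(df.columns.tolist())    # column names as a plain list
print(df.head().transpose())  # first 5 rows with rows and columns swapped
print(df.shape)               # (number of rows, number of columns)
df.info()                     # column dtypes and non-null counts
print(df[0:5])                # rows at positions 0 to 4
```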

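> df['column_name'][0:5] -> first five values of a single column

> df[['column_name1', 'column_name2']][0:5] -> first five rows of the selected columns

> df.iloc[1:6, 3:10] -> rows at positions 1 to 5 and columns at positions 3 to 9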
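
A minimal sketch of these selection patterns, using a made-up four-column DataFrame (so the `iloc` column range is narrower than in the line above):

```python
import pandas as pd

# Hypothetical DataFrame with 10 rows and 4 columns
df = pd.DataFrame({
    "a": range(10),
    "b": range(10, 20),
    "c": range(20, 30),
    "d": range(30, 40),
})

one_col = df["a"][0:5]           # Series: first five values of column "a"
two_cols = df[["a", "b"]][0:5]   # DataFrame: first five rows of two columns
block = df.iloc[1:6, 1:3]        # rows at positions 1-5, columns at positions 1-2

print(one_col)
print(two_cols)
print(block)
```

Note that `iloc` is purely positional and, like ordinary Python slicing, excludes the end position.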

> count(): Count the non-null values in each column or row.

> value_counts(): Count the occurrences of unique values in a column.

> sum(): Calculate the sum of values in each column or row.

> mean(): Calculate the mean (average) of values in each column or row.

> std(): Calculate the standard deviation of values in each column or row.

> corr(): Calculate the pairwise correlation between columns in a DataFrame.

> cov(): Calculate the pairwise covariance between columns in a DataFrame.

> min(), max(): Find the minimum and maximum values in each column or row.

> median(): Calculate the median of values in each column or row.
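
The summary functions above can be compared side by side on a tiny numeric DataFrame. This is only a sketch with made-up values; `axis=1` is what switches a calculation from per-column to per-row.

```python
import pandas as pd

# Hypothetical numeric data; y is exactly 2 * x
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 4.0, 6.0, 8.0]})

print(df.count())    # non-null values per column
print(df.sum())      # column totals
print(df.mean())     # column means
print(df.std())      # sample standard deviation (ddof=1 by default)
print(df.min())      # column minimums
print(df.max())      # column maximums
print(df.median())   # column medians
print(df.corr())     # pairwise correlation matrix
print(df.cov())      # pairwise covariance matrix

row_means = df.mean(axis=1)  # axis=1 computes per row instead of per column
print(row_means)
```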

> unique(): Get the unique values in a column.

> nunique(): Count the number of unique values in a column.

> crosstab(): Cross-tabulation counts how often each combination of values from two columns occurs.
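
Here is a short sketch of the counting helpers on categorical data. The `district` and `gender` columns are hypothetical.

```python
import pandas as pd

# Hypothetical categorical data
df = pd.DataFrame({
    "district": ["North", "North", "South", "South", "South"],
    "gender": ["F", "M", "F", "F", "M"],
})

counts = df["district"].value_counts()   # occurrences of each district
uniques = df["district"].unique()        # distinct values, in order of appearance
n_unique = df["district"].nunique()      # number of distinct values

# Rows are districts, columns are genders, cells are occurrence counts
table = pd.crosstab(df["district"], df["gender"])

print(counts)
print(uniques, n_unique)
print(table)
```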

> Creating new columns: df['new_column'] = df['col1'] + df['col2']

> Sorting values: df[['col1', 'col2']].sort_values('col1', ascending=False)[0:5]
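
A minimal sketch of both operations on made-up data. The slice is `[0:2]` here only because the example has three rows.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({"col1": [3, 1, 2], "col2": [10, 20, 30]})

# Element-wise sum of two columns stored as a new column
df["new_column"] = df["col1"] + df["col2"]

# Sort by col1 in descending order and keep the top two rows
top = df[["col1", "col2"]].sort_values("col1", ascending=False)[0:2]

print(df)
print(top)
```

`sort_values` keeps the original index labels, so the sorted result shows them out of order unless you call `reset_index(drop=True)`.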

> Groupby: Group records by column values, then apply aggregations such as mean, max, min, or sum: df.groupby('col1') -> group by col1

Groupby also works with multiple columns. For example, to count the number of people in each village of each district: df.groupby(['district', 'village'])['people_id'].count()
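
The grouping pattern can be sketched as follows. The `district`, `village`, and `people_id` columns follow the example above; the `age` column and all values are assumptions added for illustration.

```python
import pandas as pd

# Hypothetical records: one row per person
df = pd.DataFrame({
    "district": ["North", "North", "North", "South"],
    "village": ["Alpha", "Alpha", "Beta", "Gamma"],
    "people_id": [1, 2, 3, 4],
    "age": [30, 40, 50, 20],
})

# Single-column grouping with one aggregation
mean_age = df.groupby("district")["age"].mean()

# Multi-column grouping: people per (district, village) pair
people = df.groupby(["district", "village"])["people_id"].count()

# Several aggregations at once
summary = df.groupby("district")["age"].agg(["min", "max", "mean"])

print(mean_age)
print(people)
print(summary)
```

Grouping by several columns produces a result with a MultiIndex; calling `reset_index()` on it turns the group keys back into ordinary columns.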

> Renaming columns: df.rename(columns={'col1': 'new_col1', 'col2': 'new_col2'}, inplace=True)
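
A brief sketch of the two ways to rename, on made-up data: without `inplace=True`, `rename` returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

# Returns a renamed copy; df itself keeps its old column names
renamed = df.rename(columns={"col1": "new_col1", "col2": "new_col2"})
print(list(df.columns))
print(list(renamed.columns))

# Modifies df directly and returns None
df.rename(columns={"col1": "new_col1"}, inplace=True)
print(list(df.columns))
```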

> Joins:
# Create two DataFrames
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})
df2 = pd.DataFrame({'A': [2, 3, 4], 'C': ['x', 'y', 'z']})
# Inner join on column A: keeps only the rows whose A value appears in both
inner_join = df1.join(df2.set_index('A'), on='A', how='inner')
print(inner_join)
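
For comparison, the same two DataFrames can be combined with `pd.merge`, which joins on a shared column directly without setting an index first. This sketch also shows how the `how` argument changes which rows survive.

```python
import pandas as pd

# Same two DataFrames as in the join example
df1 = pd.DataFrame({"A": [1, 2, 3], "B": ["a", "b", "c"]})
df2 = pd.DataFrame({"A": [2, 3, 4], "C": ["x", "y", "z"]})

inner = pd.merge(df1, df2, on="A", how="inner")  # A values present in both
left = pd.merge(df1, df2, on="A", how="left")    # every row of df1
outer = pd.merge(df1, df2, on="A", how="outer")  # every A value from either side

print(inner)
print(left)
print(outer)
```

Rows without a match on the other side get NaN in the columns that come from it.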
