Pandas is a powerful, open-source Python library that provides flexible data structures designed to make working with structured (tabular, multidimensional, potentially heterogeneous) data both easy and intuitive. It is one of the primary tools data analysts use for data cleaning and preparation.
import pandas as pd
# Reading a CSV file
df = pd.read_csv('file_path.csv')
# Reading an Excel file
df = pd.read_excel('file_path.xlsx')
# First five rows
df.head()
# Last five rows
df.tail()
# Information about the DataFrame
df.info()
# Selecting a single column
df['column_name']
# Selecting multiple columns
df[['col1', 'col2']]
# Row selection using loc and iloc
df.loc[0] # Selects the row whose index label is 0
df.iloc[0] # Selects the first row by integer position
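The difference between `loc` and `iloc` only shows up when the index is not the default 0, 1, 2, ... range. The following is a minimal sketch; the DataFrame, its `score` column, and the shuffled index are made up for illustration.

```python
import pandas as pd

# Hypothetical data with a non-default integer index
df = pd.DataFrame({"score": [90, 75, 60]}, index=[2, 0, 1])

by_label = df.loc[0]      # row whose index label is 0 (second row here)
by_position = df.iloc[0]  # row at integer position 0 (first row here)

print(by_label["score"])     # 75
print(by_position["score"])  # 90
```

With a default index the two calls return the same row, which is why they are easy to confuse.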
# Drop rows with any NA values
df.dropna()
# Fill NA values with a specified value (to fill with a mean, compute it first and pass it in)
df.fillna(value=0)
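`fillna` does not accept `'mean'` as an option, so filling with a column mean means computing the mean and passing it in as the value. A small sketch, assuming a hypothetical numeric column `col1` and text column `col2`:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column
df = pd.DataFrame({"col1": [1.0, np.nan, 3.0], "col2": ["a", None, "c"]})

# Fill a numeric column with its own mean (missing values are skipped by mean())
df["col1"] = df["col1"].fillna(df["col1"].mean())

# Fill a text column with a placeholder string
df["col2"] = df["col2"].fillna("unknown")

print(df)
```

Filling per column like this keeps each column's type intact, which a single `df.fillna(0)` may not do for text columns.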
# Applying functions to a column
df['col1'].apply(lambda x: x*2)
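For simple arithmetic, a vectorized expression usually gives the same result as `apply` and is typically faster on large columns. A minimal comparison, using a made-up three-row column:

```python
import pandas as pd

# Hypothetical numeric column
df = pd.DataFrame({"col1": [1, 2, 3]})

doubled_apply = df["col1"].apply(lambda x: x * 2)  # calls the lambda once per element
doubled_vectorized = df["col1"] * 2                # operates on the whole column at once

print(doubled_apply.equals(doubled_vectorized))
```

`apply` remains useful when the per-element logic cannot be written as a column-wise expression.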
# Renaming columns
df.rename(columns={'old_name': 'new_name'}, inplace=True)
# Filtering rows based on conditions
filtered_df = df[df['col1'] > 50]
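Filters can combine several conditions. The sketch below assumes a hypothetical text column `col2` alongside `col1`; the values are invented.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({"col1": [10, 60, 80, 95], "col2": ["x", "y", "x", "y"]})

# Wrap each condition in parentheses; use & for AND and | for OR
filtered_df = df[(df["col1"] > 50) & (df["col2"] == "x")]

# isin() keeps rows whose value appears in a list
subset = df[df["col1"].isin([10, 95])]

print(filtered_df)
print(subset)
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators in Python.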
# Grouping by a column and calculating the mean of the other numeric columns
df.groupby('col1').mean(numeric_only=True)
# Multiple aggregations ('mean' requires the remaining columns to be numeric)
df.groupby('col1').agg(['mean', 'sum', 'count'])
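When only some columns should be aggregated, named aggregation lets you choose the column and function for each output column. This is a sketch on invented data; `amount`, `note`, and the output names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical data: one grouping column, one numeric column, one text column
df = pd.DataFrame({
    "col1": ["a", "a", "b"],
    "amount": [10, 20, 5],
    "note": ["x", "y", "z"],
})

# Each keyword is an output column: (input column, aggregation function)
summary = df.groupby("col1").agg(
    mean_amount=("amount", "mean"),
    total_amount=("amount", "sum"),
    n_rows=("amount", "count"),
)

print(summary)
```

Because the text column `note` is never referenced, it does not interfere with the numeric aggregations.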
# Concatenating DataFrames
df_new = pd.concat([df1, df2])
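By default `pd.concat` keeps each input's original index, so stacked frames can end up with repeated labels. A minimal sketch with two made-up frames:

```python
import pandas as pd

# Hypothetical frames with the same column
df1 = pd.DataFrame({"a": [1, 2]})
df2 = pd.DataFrame({"a": [3, 4]})

stacked = pd.concat([df1, df2])                        # index labels repeat
renumbered = pd.concat([df1, df2], ignore_index=True)  # fresh 0..n-1 index

print(stacked.index.tolist())     # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Passing `ignore_index=True` is usually what you want when the original row labels carry no meaning.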
# Merging on a specific column
merged_df = pd.merge(df1, df2, on='common_column')
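`pd.merge` performs an inner join by default, so rows without a match on either side are dropped. The sketch below, on invented frames, contrasts that with an outer join; `left_val` and `right_val` are hypothetical column names.

```python
import pandas as pd

# Hypothetical frames sharing a key column
df1 = pd.DataFrame({"common_column": [1, 2, 3], "left_val": ["a", "b", "c"]})
df2 = pd.DataFrame({"common_column": [2, 3, 4], "right_val": ["x", "y", "z"]})

# Default how="inner": only keys present in both frames survive
inner = pd.merge(df1, df2, on="common_column")

# how="outer" keeps every key; indicator=True adds a _merge column
# recording whether each row came from the left, the right, or both
outer = pd.merge(df1, df2, on="common_column", how="outer", indicator=True)

print(inner)
print(outer)
```

Checking the `_merge` column is a quick way to spot keys that failed to match before committing to an inner join.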
# Writing to CSV
df.to_csv('output.csv', index=False)
# Writing to Excel
df.to_excel('output.xlsx', index=False)
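The `index=False` argument matters when the file is read back later. The sketch below illustrates this with an in-memory buffer instead of a file on disk; the data is made up.

```python
import io

import pandas as pd

# Hypothetical data
df = pd.DataFrame({"col1": [1, 2], "col2": ["a", "b"]})

# Without index=False the index is written as an extra, unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# With index=False only the data columns are written
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
round_trip = pd.read_csv(without_index)

print(reloaded_with_index.columns.tolist())  # ['Unnamed: 0', 'col1', 'col2']
print(round_trip.columns.tolist())           # ['col1', 'col2']
```

If the index does hold meaningful labels, it can instead be kept on write and restored with `index_col=0` when reading.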
Pandas is a fundamental tool in a data analyst’s toolbox, providing a vast array of data manipulation capabilities. Whether you’re cleaning, transforming, aggregating, or visualizing data, Pandas can significantly streamline the process, making data preparation and analysis efficient and straightforward.