Topic 2: Using Pandas for Data Manipulation

1. Introduction to Pandas

Pandas is a powerful, open-source Python library that provides flexible data structures designed to make working with structured (tabular, multidimensional, potentially heterogeneous) data both easy and intuitive. It is one of the primary tools data analysts use for data cleaning and preparation.

2. Core Components: DataFrame and Series

  • DataFrame: A two-dimensional, size-mutable, and heterogeneous tabular data structure with labeled axes (rows and columns). Think of it like an Excel spreadsheet or SQL table.
  • Series: A one-dimensional labeled array that can hold any data type, including objects.
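The relationship between the two structures can be sketched with a small example. The column names and values below are made up for illustration; the point is only that each column of a DataFrame is itself a Series sharing the same row index.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cho"],
    "age": [34, 28, 45],
})

# Selecting one column of a DataFrame gives back a Series
ages = df["age"]

print(type(df).__name__)    # DataFrame
print(type(ages).__name__)  # Series
print(df.shape)             # (3, 2) -> 3 rows, 2 columns
print(ages.index.tolist())  # [0, 1, 2] -> default integer labels
```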

3. Basic Pandas Operations

a. Reading Data

    import pandas as pd

    # Reading a CSV file
    df = pd.read_csv('file_path.csv')

    # Reading an Excel file
    df = pd.read_excel('file_path.xlsx')
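To try read_csv without a file on disk, one option is to pass it an in-memory text buffer. This is a minimal sketch; the CSV content is invented for the example.

```python
import io
import pandas as pd

# Hypothetical CSV content held in a string instead of a file
csv_text = "name,age\nAna,34\nBen,28\n"

# read_csv accepts any file-like object, not only a path
df = pd.read_csv(io.StringIO(csv_text))

print(df.columns.tolist())  # ['name', 'age']
print(len(df))              # 2
```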

b. Viewing Data

    # First five rows
    df.head()

    # Last five rows
    df.tail()

    # Information about the DataFrame (column names, dtypes, non-null counts)
    df.info()
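Both head and tail take an optional row count, which is handy when five rows is too many or too few. A short sketch on a made-up ten-row frame:

```python
import pandas as pd

# Hypothetical ten-row frame, for illustration only
df = pd.DataFrame({"x": range(10), "y": list("abcdefghij")})

first_three = df.head(3)  # rows 0, 1, 2
last_two = df.tail(2)     # rows 8, 9

# Prints a summary (dtypes, non-null counts, memory usage) to stdout
df.info()
```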

c. Selecting Data

    # Selecting a single column (returns a Series)
    df['column_name']

    # Selecting multiple columns (returns a DataFrame)
    df[['col1', 'col2']]

    # Row selection using loc and iloc
    df.loc[0]   # Selects the row whose index label is 0
    df.iloc[0]  # Selects the first row by integer position
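With the default index (0, 1, 2, ...), loc[0] and iloc[0] happen to return the same row, which hides the difference between them. The sketch below uses a deliberately shuffled index (an assumption made purely for illustration) to show that loc looks up a label while iloc counts positions.

```python
import pandas as pd

# Hypothetical data with index labels deliberately out of order
df = pd.DataFrame(
    {"score": [90, 75, 60]},
    index=[2, 0, 1],
)

by_label = df.loc[0]      # row whose index label is 0
by_position = df.iloc[0]  # first row in the frame, whatever its label

print(by_label["score"])     # 75
print(by_position["score"])  # 90
```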

4. Data Cleaning and Manipulation

a. Handling Missing Data

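    # Drop rows with any NA values (returns a new DataFrame; df is unchanged)
    df.dropna()

    # Fill NA values with a specified value
    df.fillna(value=0)

    # Fill NA values with a computed statistic, such as each column's mean
    df.fillna(df.mean(numeric_only=True))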
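The effect of each approach is easier to compare on a tiny frame. This is a sketch on invented data; which strategy is appropriate depends on the dataset.

```python
import pandas as pd

nan = float("nan")

# Hypothetical frame with one missing value per column
df = pd.DataFrame({"a": [1.0, nan, 3.0], "b": [nan, 5.0, 6.0]})

dropped = df.dropna()               # keeps only the row with no missing values
zero_filled = df.fillna(0)          # replaces every NA with 0
mean_filled = df.fillna(df.mean())  # replaces each NA with its column mean

print(dropped.index.tolist())    # [2]
print(mean_filled.loc[1, "a"])   # 2.0
print(mean_filled.loc[0, "b"])   # 5.5
print(int(df.isna().sum().sum()))  # 2 -> the original frame is unchanged
```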

b. Data Transformation

    # Applying a function to a column (apply returns a new Series, so assign the result)
    df['col1'] = df['col1'].apply(lambda x: x * 2)

    # Renaming columns
    df.rename(columns={'old_name': 'new_name'}, inplace=True)
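For simple arithmetic, a vectorized expression gives the same result as apply and is usually faster on large columns. The sketch below, on invented data, shows both side by side, along with rename used without inplace=True by assigning the returned frame.

```python
import pandas as pd

# Hypothetical single-column frame
df = pd.DataFrame({"old_name": [1, 2, 3]})

# rename returns a new DataFrame; assigning it back is an alternative to inplace=True
df = df.rename(columns={"old_name": "new_name"})

# Element-wise function call versus vectorized arithmetic
df["doubled"] = df["new_name"].apply(lambda x: x * 2)
df["doubled_vec"] = df["new_name"] * 2

print(df["doubled"].tolist())                  # [2, 4, 6]
print(df["doubled"].equals(df["doubled_vec"]))  # True
```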

c. Filtering

    # Filtering rows based on conditions
    filtered_df = df[df['col1'] > 50]
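Conditions can be combined with & (and) and | (or). Each condition needs its own parentheses, because these operators bind more tightly than comparisons. A sketch on invented data, where col2 is a hypothetical second column:

```python
import pandas as pd

# Hypothetical data; col2 is added only for this example
df = pd.DataFrame({
    "col1": [10, 60, 80, 55],
    "col2": ["a", "b", "a", "c"],
})

both = df[(df["col1"] > 50) & (df["col2"] == "a")]    # both conditions hold
either = df[(df["col1"] > 70) | (df["col2"] == "c")]  # at least one holds
in_set = df[df["col2"].isin(["a", "b"])]              # membership test

print(both.index.tolist())    # [2]
print(either.index.tolist())  # [2, 3]
print(in_set.index.tolist())  # [0, 1, 2]
```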

d. Grouping and Aggregation

    # Grouping by a column and calculating the mean of the other numeric columns
    df.groupby('col1').mean(numeric_only=True)

    # Multiple aggregations (applied to every non-grouping column, so these should be numeric)
    df.groupby('col1').agg(['mean', 'sum', 'count'])
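When only specific columns need aggregating, named aggregation keeps the output columns flat and readable. The following is a minimal sketch; the value column and the output names are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical data: a grouping column and one numeric column
df = pd.DataFrame({
    "col1": ["x", "x", "y"],
    "value": [10, 20, 30],
})

# Each keyword is an output column: (source column, aggregation)
summary = df.groupby("col1").agg(
    mean_value=("value", "mean"),
    total=("value", "sum"),
    n=("value", "count"),
)

print(summary.loc["x", "mean_value"])  # 15.0
print(summary.loc["y", "total"])       # 30
```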

e. Merging, Joining, and Concatenating

    # Concatenating DataFrames (stacks rows by default)
    df_new = pd.concat([df1, df2])

    # Merging on a specific column (inner join by default)
    merged_df = pd.merge(df1, df2, on='common_column')
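The how parameter of merge controls which keys survive the join, and ignore_index=True gives concat a fresh 0..n-1 index. A sketch with two small invented frames, where left_val and right_val are hypothetical column names:

```python
import pandas as pd

# Hypothetical frames sharing the key column "common_column"
df1 = pd.DataFrame({"common_column": [1, 2, 3], "left_val": ["a", "b", "c"]})
df2 = pd.DataFrame({"common_column": [2, 3, 4], "right_val": ["x", "y", "z"]})

# Inner join (the default): only keys present in both frames
inner = pd.merge(df1, df2, on="common_column")

# Left join: every key from df1; unmatched rows get NaN in right_val
left = pd.merge(df1, df2, on="common_column", how="left")

# Row-wise concatenation with a renumbered index
stacked = pd.concat([df1, df1], ignore_index=True)

print(inner["common_column"].tolist())     # [2, 3]
print(int(left["right_val"].isna().sum())) # 1
print(stacked.index.tolist())              # [0, 1, 2, 3, 4, 5]
```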

5. Exporting Data

    # Writing to CSV
    df.to_csv('output.csv', index=False)

    # Writing to Excel
    df.to_excel('output.xlsx', index=False)
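A quick way to check an export is to write it and read it straight back. The sketch below does this with an in-memory buffer rather than a real file, on invented data; with index=False the row labels are not written, so the restored frame matches the original.

```python
import io
import pandas as pd

# Hypothetical data to export
df = pd.DataFrame({"name": ["Ana", "Ben"], "age": [34, 28]})

# Write the CSV into an in-memory text buffer instead of a file
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Read it back and compare with the original
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text.splitlines()[0])  # name,age
print(restored.equals(df))       # True
```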

6. Conclusion

Pandas is a fundamental tool in a data analyst’s toolbox. Whether you’re reading, cleaning, transforming, filtering, aggregating, or merging data, Pandas handles each step with a few concise calls, which makes data preparation and analysis more efficient.