Python | Pandas Working With Text Data

Python’s Pandas library is a powerful tool for data manipulation and analysis, and it provides robust support for working with text data. Whether you’re dealing with messy data that needs cleaning or performing complex text processing tasks, Pandas offers a variety of functions to simplify your workflow. This blog will guide you through the essentials of handling text data using Pandas.

Introduction to Pandas

Pandas is an open-source data analysis and manipulation library built on top of the Python programming language. It’s widely used in data science, finance, economics, and statistics due to its ease of use and powerful capabilities. Pandas primarily uses two data structures: Series and DataFrame. A Series is a one-dimensional array-like object, and a DataFrame is a two-dimensional table with rows and columns.

Loading Text Data

Before diving into text manipulation, let’s start by loading a sample dataset. Pandas can read data from various sources, such as CSV files, Excel workbooks, and SQL databases. For this example, we’ll use a CSV file.

Python
import pandas as pd

# Load the CSV file into a DataFrame
df = pd.read_csv('sample_data.csv')
print(df)

Output:

Markdown
            text
0     Hi there! 
1   How are you?
2   My age is 25
3  I love python
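
If you do not have sample_data.csv on hand, a minimal sketch like the one below rebuilds an equivalent DataFrame from an in-memory string. The CSV contents are an assumption, reconstructed from the printed output above, and the first row is assumed to carry a leading and a trailing space.

```python
import io

import pandas as pd

# Assumed file contents, reconstructed from the printed output above.
# The first data row is assumed to start and end with a space.
csv_text = "text\n Hi there! \nHow are you?\nMy age is 25\nI love python\n"

# read_csv accepts any file-like object, so StringIO stands in for the file
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```

This keeps the rest of the examples runnable without any file on disk.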

Basic Text Operations

i) Accessing Text Data

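Assume we have a DataFrame df with a column named text containing textual data. You can access this column like any other DataFrame column. Note that the first value, ' Hi there! ', appears to begin and end with a space; the right-aligned output hides the leading one, but it affects several results later on.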

Python
# Accessing the 'text' column
text_data = df['text']
print(text_data)

Output:

Markdown
0       Hi there! 
1     How are you?
2     My age is 25
3    I love python
Name: text, dtype: object

ii) String Methods

Pandas provides a variety of string methods accessible via the .str accessor. These methods perform vectorized string operations on Series.

Python
# Convert to lowercase
df['text_lower'] = df['text'].str.lower()

# Replace a substring
df['text_replaced'] = df['text'].str.replace('Hi', 'Hello')

# Remove leading and trailing whitespace
df['text_stripped'] = df['text'].str.strip()

print(df)

Output:

Markdown
            text     text_lower   text_replaced  text_stripped
0     Hi there!      hi there!    Hello there!       Hi there!
1   How are you?   how are you?    How are you?   How are you?
2   My age is 25   my age is 25    My age is 25   My age is 25
3  I love python  i love python   I love python  I love python
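
Because each .str method returns a new Series, calls can be chained, and the boolean results of checks such as startswith() can be used as filters. The following is a small sketch of that idea; the Series s and the variable names are illustrative, not part of the original dataset code.

```python
import pandas as pd

# Illustrative data mirroring the sample column
s = pd.Series([' Hi there! ', 'How are you?', 'My age is 25', 'I love python'])

# Each .str call returns a new Series, so calls can be chained
normalized = s.str.strip().str.lower()

# Vectorized checks return a boolean Series that can be used as a mask
starts_with_h = normalized.str.startswith('h')
print(normalized[starts_with_h])
```

Stripping before other operations avoids surprises caused by stray whitespace.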

iii) Extracting Substrings

You can extract substrings using the .str accessor. This is useful when you need to split text data into meaningful components.

Python
# Extract first 5 characters
df['text_substr'] = df['text'].str[:5]

# Extract the first run of digits with a regular expression;
# expand=False returns a Series rather than a one-column DataFrame
df['text_extracted'] = df['text'].str.extract(r'(\d+)', expand=False)

Output:

Markdown
            text text_substr text_extracted
0     Hi there!         Hi t            NaN
1   How are you?       How a            NaN
2   My age is 25       My ag             25
3  I love python       I lov            NaN
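
When a pattern has named groups, str.extract() uses the group names as column names, and str.extractall() returns every match instead of only the first. A minimal sketch, assuming a hypothetical third string that contains more than one number:

```python
import pandas as pd

# The last string is a made-up example containing two numbers
s = pd.Series(['My age is 25', 'I love python', 'Order 12 shipped in 3 days'])

# A named group becomes the column name of the resulting DataFrame
first_number = s.str.extract(r'(?P<number>\d+)')

# extractall returns every match, indexed by (original row, match number)
all_numbers = s.str.extractall(r'(?P<number>\d+)')

print(first_number)
print(all_numbers)
```

Rows without a match are missing in the extract() result and simply absent from the extractall() result.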

iv) Splitting Strings

str.split() breaks each string into a list of pieces. With expand=True the pieces go into separate columns, and n limits the number of splits.

Python
# Split by whitespace
df['text_split'] = df['text'].str.split()

# Split into multiple columns
df[['first', 'second']] = df['text'].str.split(expand=True, n=1)

Output:

Markdown
            text         text_split first       second
0     Hi there!        [Hi, there!]    Hi      there! 
1   How are you?   [How, are, you?]   How     are you?
2   My age is 25  [My, age, is, 25]    My    age is 25
3  I love python  [I, love, python]     I  love python
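
Splitting on an explicit delimiter works the same way. The sketch below uses hypothetical comma-separated records (not from the dataset above) to show that rows with fewer pieces are padded with missing values when expand=True.

```python
import pandas as pd

# Hypothetical comma-separated records; the second row has no city
s = pd.Series(['alice,30,NY', 'bob,25', 'carol,41,LA'])

# expand=True spreads the pieces across columns
parts = s.str.split(',', expand=True)
parts.columns = ['name', 'age', 'city']
print(parts)
```

Note that every resulting column still holds strings, so age would need an explicit conversion before numeric work.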

Advanced Text Operations

i) Finding and Counting

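You can find and count occurrences of a substring within each element of the Series. str.find() returns the index of the first match, or -1 when there is none; str.count() returns the number of matches and treats its pattern as a regular expression. Row 0 returns 1 rather than 0 from str.find() because its value starts with a space.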

Python
# Find positions of a substring
df['text_find'] = df['text'].str.find('Hi')

# Count occurrences of a substring
df['text_count'] = df['text'].str.count('Hi')

Output:

Markdown
            text   text_find  text_count
0     Hi there!            1           1
1   How are you?          -1           0
2   My age is 25          -1           0
3  I love python          -1           0
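
Since str.count() interprets its argument as a regular expression, characters such as ? or . need escaping when you mean them literally. A short sketch, using a made-up second string for illustration:

```python
import re

import pandas as pd

# The middle string is a hypothetical example with several question marks
s = pd.Series(['How are you?', 'Really?? Why?', 'I love python'])

# '?' is a regex quantifier, so escape it to count literal question marks
question_marks = s.str.count(re.escape('?'))

# str.find searches for a literal substring, so no escaping is needed
first_position = s.str.find('?')

print(question_marks)
print(first_position)
```

Using re.escape() is a safe default whenever the search text comes from user input or data.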

ii) Handling Missing Values

Text data often contains missing values. fillna() replaces them with a placeholder, and dropna() removes the affected rows.

Python
data = {'text': [' Hi there! ', 'How are you?', 'My age is 25', None]}
df = pd.DataFrame(data)

# Fill missing values with a placeholder
df['text'] = df['text'].fillna('missing')
print(df)

Output:

Markdown
           text
0    Hi there! 
1  How are you?
2  My age is 25
3       missing

In the last row, the ‘None’ value is replaced with ‘missing’ using fillna().

To drop the rows with missing values instead:

Python
data = {'text': [' Hi there! ', 'How are you?', 'My age is 25', None]}
df = pd.DataFrame(data)

# Drop rows with missing values
df = df.dropna(subset=['text'])

Output:

Markdown
           text
0    Hi there! 
1  How are you?
2  My age is 25

In the above example, the last row is removed because it contains a None value.
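
If you leave missing values in place, most .str methods simply pass them through, and methods that return booleans accept an na argument to decide how they are treated. The following sketch illustrates this on a small Series; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series([' Hi there! ', 'How are you?', None])

# The missing entry stays missing after a string operation
upper = s.str.upper()

# na=False treats missing entries as "no match" so the mask is purely boolean
mask = s.str.contains('Hi', na=False)

print(upper)
print(s[mask])
```

This means filling or dropping is only required when a later step cannot cope with missing entries.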

iii) Applying Custom Functions

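For more complex transformations, you can apply custom functions to each element in the Series with apply(). The output below assumes df is the original four-row DataFrame with no missing values; the function as written would raise a TypeError on None.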

Python
# Define a custom function
def custom_function(text):
    return text[::-1]  # Reverse the string

# Apply the custom function
df['text_custom'] = df['text'].apply(custom_function)

Output:

Markdown
            text    text_custom
0     Hi there!      !ereht iH 
1   How are you?   ?uoy era woH
2   My age is 25   52 si ega yM
3  I love python  nohtyp evol I
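
A custom function can guard against missing values itself. The sketch below shows one way to do that, alongside the vectorized slice .str[::-1], which gives the same reversal without apply(). The function name reverse_text is illustrative.

```python
import pandas as pd

s = pd.Series(['Hi there!', 'I love python', None])

def reverse_text(text):
    # Pass missing values through unchanged instead of raising a TypeError
    if pd.isna(text):
        return text
    return text[::-1]

# Element-wise application of the custom function
reversed_apply = s.apply(reverse_text)

# Vectorized equivalent: slicing through the .str accessor
reversed_vectorized = s.str[::-1]

print(reversed_apply)
print(reversed_vectorized)
```

When a built-in .str method covers the task, it is usually the simpler choice; apply() is best kept for logic that has no vectorized counterpart.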

iv) Combining and Joining Strings

Consider a DataFrame that initially looks like this:

Markdown
            text  text2         text_split
0     Hi there!   text1       [Hi, there!]
1   How are you?  text2   [How, are, you?]
2   My age is 25  text3  [My, age, is, 25]
3  I love python  text4  [I, love, python]

Strings can be concatenated or joined as follows:

Python
# Concatenate with another column (here, 'text2')
df['combined'] = df['text'] + ' ' + df['text2']

# Join list elements into a single string
df['text_joined'] = df['text_split'].str.join(' ')

Output:

Markdown
            text  text2             combined         text_split    text_joined
0     Hi there!   text1     Hi there!  text1       [Hi, there!]      Hi there!
1   How are you?  text2   How are you? text2   [How, are, you?]   How are you?
2   My age is 25  text3   My age is 25 text3  [My, age, is, 25]   My age is 25
3  I love python  text4  I love python text4  [I, love, python]  I love python
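
Concatenation with + yields a missing result whenever either side is missing. As a sketch of an alternative, str.cat() accepts a separator and an na_rep placeholder; the small DataFrame below is illustrative and includes one missing value to show the difference.

```python
import pandas as pd

# Illustrative data with one missing value in 'text'
df = pd.DataFrame({
    'text': ['Hi there!', 'How are you?', None],
    'text2': ['text1', 'text2', 'text3'],
})

# With +, a missing value on either side makes the whole result missing
plus_result = df['text'] + ' ' + df['text2']

# str.cat can substitute a placeholder for missing values via na_rep
cat_result = df['text'].str.cat(df['text2'], sep=' ', na_rep='missing')

print(plus_result)
print(cat_result)
```

Which behaviour is preferable depends on whether a partially filled string is meaningful in your data.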

Practical Example: Cleaning Text Data

Let’s walk through a practical example of cleaning a text column in a DataFrame.

Python
# Sample DataFrame
data = {'text': [' This is a SAMPLE text! ', 'Another text, with punctuation.', 'Text with numbers 1234']}
df = pd.DataFrame(data)

# 1. Convert to lowercase
df['text_cleaned'] = df['text'].str.lower()

# 2. Remove punctuation (raw strings avoid invalid-escape warnings)
df['text_cleaned'] = df['text_cleaned'].str.replace(r'[^\w\s]', '', regex=True)

# 3. Remove numbers
df['text_cleaned'] = df['text_cleaned'].str.replace(r'\d+', '', regex=True)

# 4. Remove leading/trailing whitespace
df['text_cleaned'] = df['text_cleaned'].str.strip()

# Display the cleaned text
print(df[['text', 'text_cleaned']])

The final DataFrame looks like this:

Markdown
                              text                   text_cleaned
0          This is a SAMPLE text!           this is a sample text
1  Another text, with punctuation.  another text with punctuation
2           Text with numbers 1234              text with numbers
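
The same steps can be bundled into a reusable function so they can be applied to any text column. This is a minimal sketch: the name clean_text is hypothetical, and the whitespace-collapsing step is an addition not present in the walkthrough above.

```python
import pandas as pd

def clean_text(series):
    """Lowercase, drop punctuation and digits, then tidy whitespace."""
    return (
        series.str.lower()
        .str.replace(r'[^\w\s]', '', regex=True)  # remove punctuation
        .str.replace(r'\d+', '', regex=True)      # remove numbers
        .str.replace(r'\s+', ' ', regex=True)     # added step: collapse repeated whitespace
        .str.strip()                              # trim leading/trailing whitespace
    )

data = {'text': [' This is a SAMPLE text! ', 'Another text, with punctuation.', 'Text with numbers 1234']}
df = pd.DataFrame(data)
df['text_cleaned'] = clean_text(df['text'])
print(df[['text', 'text_cleaned']])
```

Chaining keeps the intermediate results out of the DataFrame and makes the order of the steps easy to read and adjust.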

Conclusion

Pandas provides a comprehensive suite of tools for working with text data. Whether you need to clean, transform, or analyze text, the functions and methods available in Pandas make these tasks straightforward and efficient. By mastering these techniques, you can handle various text processing challenges and prepare your data for further analysis or machine learning tasks.

Happy coding!
