Pandas DataFrame sample() Method – Explained with Examples

The sample() method in Pandas returns a random selection of rows or columns from a DataFrame. This is useful for tasks such as creating training and testing datasets, drawing random samples for analysis, or simply exploring a subset of your data. In this blog post, we’ll delve into the sample() method, explaining its parameters and demonstrating its use through clear examples.

Syntax of sample() method:
Python
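DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None, ignore_index=False)
Parameters:
  • n: The number of items to sample. Cannot be combined with frac; if neither n nor frac is given, one item is returned (default is None).
  • frac: The fraction of items to sample, e.g. 0.4 for 40%. Cannot be combined with n (default is None).
  • replace: Whether to sample with replacement, so that the same item can be drawn more than once (default is False).
  • weights: The sampling weights, one per item; they are normalized to sum to 1 if needed. With the default of None, every item is equally likely.
  • random_state: Seed for the random number generator (default is None).
  • axis: The axis to sample along (0 or 'index' for rows, 1 or 'columns' for columns). The default of None samples rows for a DataFrame.
  • ignore_index: If True, the result is relabeled 0, 1, …, n - 1 instead of keeping the original index labels (default is False; available in pandas 1.3 and later).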
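The weights parameter is not covered by the examples that follow, so here is a minimal sketch of how it might be used. The row_weights values are made up for illustration; the only assumption is that a weight of 0 means a row is never drawn. The second half shows that passing both n and frac is rejected.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['a', 'b', 'c', 'd', 'e']})

# Illustrative weights, one per row (hypothetical values).
# pandas normalizes them to sum to 1; rows with weight 0 are never drawn.
row_weights = [0, 0, 1, 1, 2]

weighted = df.sample(n=2, weights=row_weights, random_state=0)
print(weighted)  # only rows with index 2, 3 or 4 can appear

# n and frac are mutually exclusive: passing both raises a ValueError.
both_error = None
try:
    df.sample(n=2, frac=0.5)
except ValueError as err:
    both_error = err
    print(f"ValueError: {err}")
```

Here the last row is twice as likely to be picked on each draw as rows 2 and 3, and the first two rows are excluded entirely.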
Basic Usage

Let’s start with a basic example to understand how the sample() method works. We’ll create a simple DataFrame to work with:

Python
import pandas as pd

# Creating a sample DataFrame
data = {'A': [1, 2, 3, 4, 5], 'B': ['a', 'b', 'c', 'd', 'e']}
df = pd.DataFrame(data)
print(df)

Output:

   A  B
0  1  a
1  2  b
2  3  c
3  4  d
4  5  e
1. Sampling with a Specific Number of Rows

You can use the n parameter to specify the number of rows you want to sample. For example, to sample 3 rows:

Python
sampled_df = df.sample(n=3)
print(sampled_df)

This returns 3 rows chosen at random, so the exact rows will differ from run to run.

Output (may vary):

   A  B
2  3  c
1  2  b
3  4  d
2. Sampling with a Fraction of Rows

Alternatively, you can use the frac parameter to specify the fraction of rows to sample. For example, to sample 40% of the rows:

Python
sampled_df = df.sample(frac=0.4)
print(sampled_df)

This returns a random sample of 40% of the rows. The DataFrame has 5 rows, so the result contains 2 of them.

Output (may vary):

   A  B
2  3  c
4  5  e
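To make the relationship between frac and the number of returned rows concrete, the sketch below counts the rows for a few fractions of the same 5-row DataFrame. It also shows frac=1.0, which returns every row in random order and is a common way to shuffle a DataFrame. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['a', 'b', 'c', 'd', 'e']})

# Number of rows returned for several fractions of a 5-row DataFrame
sizes = {}
for frac in (0.2, 0.4, 0.6, 1.0):
    sizes[frac] = len(df.sample(frac=frac))
    print(frac, sizes[frac])

# frac=1.0 keeps every row but in random order, i.e. a shuffle
shuffled = df.sample(frac=1.0, random_state=0)
print(shuffled)
```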
3. Sampling with Replacement

By default, the sample() method samples without replacement. If you want to sample with replacement, you can set the replace parameter to True:

Python
sampled_df = df.sample(n=3, replace=True)
print(sampled_df)

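Because each row goes back into the pool after it is drawn, the same row can appear more than once in the sampled DataFrame. Duplicates are possible but not guaranteed, as the run below shows.

Output (may vary):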

   A  B
3  4  d
4  5  e
2  3  c
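One practical consequence of replace=True is that you can ask for more rows than the DataFrame contains. The following sketch, using the same 5-row DataFrame, draws 8 rows with replacement and then shows that the same request without replacement is rejected.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['a', 'b', 'c', 'd', 'e']})

# With replacement, n may exceed the number of rows
oversampled = df.sample(n=8, replace=True, random_state=0)
print(len(oversampled))                        # 8
print(oversampled.index.duplicated().any())    # True: 8 draws from 5 rows must repeat

# Without replacement, asking for more rows than exist raises a ValueError
too_large_error = None
try:
    df.sample(n=8)
except ValueError as err:
    too_large_error = err
    print(f"ValueError: {err}")
```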
4. Sampling Columns

To sample columns instead of rows, set the axis parameter to 1:

Python
sampled_columns = df.sample(n=1, axis=1)
print(sampled_columns)

This returns 1 column chosen at random, with all of its rows.

Output (may vary):

   A
0  1
1  2
2  3
3  4
4  5
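With only two columns, column sampling is hard to see in action. The sketch below uses a hypothetical four-column DataFrame named wide (not part of the original example) and passes axis='columns', which is equivalent to axis=1.

```python
import pandas as pd

# Hypothetical wider DataFrame, used only for this illustration
wide = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6], 'D': [7, 8]})

# Pick 2 of the 4 columns at random; every row is kept
two_cols = wide.sample(n=2, axis='columns', random_state=0)
print(two_cols)
print(two_cols.shape)  # (2, 2)
```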
5. Setting a Random Seed

To ensure reproducibility, you can set a random seed using the random_state parameter:

Python
sampled_df = df.sample(n=3, random_state=42)
print(sampled_df)

This will always produce the same sample when run multiple times with the same random seed.

Output:

   A  B
1  2  b
4  5  e
2  3  c
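A quick way to check this reproducibility claim yourself is to draw the sample twice with the same seed and compare the results. This is a minimal sketch; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['a', 'b', 'c', 'd', 'e']})

# Two draws with the same seed select the same rows in the same order
first = df.sample(n=3, random_state=42)
second = df.sample(n=3, random_state=42)
print(first.equals(second))  # True

# Without a seed, each call is free to return a different selection
unseeded = df.sample(n=3)
print(unseeded)
```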

Additional Examples

Example 1: Sampling a Specific Number of Rows
Python
sampled_df = df.sample(n=2)
print(sampled_df)

Output (may vary):

   A  B
1  2  b
4  5  e

This example randomly selects 2 rows from the DataFrame. The selected rows, which can vary each time the method is run, are shown in the output.

Example 2: Sampling a Fraction of Rows
Python
sampled_df = df.sample(frac=0.6)
print(sampled_df)

Output (may vary):

   A  B
3  4  d
0  1  a
1  2  b

This example randomly samples 60% of the rows from the DataFrame. The selected rows will vary, providing a subset based on the specified fraction.

Example 3: Sampling with Replacement
Python
sampled_df = df.sample(n=4, replace=True)
print(sampled_df)

Output (may vary):

   A  B
4  5  e
0  1  a
2  3  c
1  2  b

In this example, 4 rows are randomly sampled with replacement, meaning the same row can appear multiple times in the output. This particular run happened to return four distinct rows.

Example 4: Sampling Columns
Python
sampled_columns = df.sample(n=1, axis=1)
print(sampled_columns)

Output (may vary):

   B
0  a
1  b
2  c
3  d
4  e

This example randomly selects 1 column from the DataFrame, displaying only the data from the chosen column.

Example 5: Setting a Random Seed
Python
sampled_df = df.sample(n=3, random_state=1)
print(sampled_df)

Output:

   A  B
2  3  c
1  2  b
4  5  e

This example samples 3 rows from the DataFrame with a fixed random seed, ensuring the same rows are selected each time the code is run, providing reproducible results.
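A fixed seed is especially handy for the training and testing use case mentioned in the introduction. One possible approach, sketched below, samples a fraction of the rows for training and uses drop() to keep the rest for testing. It assumes the DataFrame's index labels are unique; the 80/20 ratio and the names train and test are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['a', 'b', 'c', 'd', 'e']})

# Sample 80% of the rows for training, reproducibly
train = df.sample(frac=0.8, random_state=42)

# The remaining rows form the test set (relies on unique index labels)
test = df.drop(train.index)

print(len(train), len(test))  # 4 1
```

Dedicated helpers such as scikit-learn's train_test_split offer more options, but this pattern needs nothing beyond pandas.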

Conclusion

The sample() method is a versatile and useful function in Pandas for randomly sampling rows or columns from a DataFrame. Whether you need to create training and testing datasets, perform random sampling for analysis, or just explore a subset of your data, the sample() method provides a straightforward and flexible approach. By understanding and utilizing its various parameters, you can tailor the sampling process to meet your specific needs.
