The sample() method in Pandas is a powerful tool that allows you to randomly select rows or columns from a DataFrame. This can be particularly useful for tasks such as creating training and testing datasets, performing random sampling for analysis, or simply exploring a subset of your data. In this blog post, we’ll delve into the sample() method, explaining its various parameters and demonstrating its use through clear examples.
Syntax of sample() method:
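DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None, ignore_index=False)
Parameters:
- n: The number of items to sample (default is None; if frac is also None, one item is returned). Cannot be combined with frac.
- frac: The fraction of items to sample (default is None). Cannot be combined with n.
- replace: Whether to sample with replacement or not (default is False).
- weights: The sampling weights (default is None, which gives every item the same probability).
- random_state: Seed for the random number generator (default is None).
- axis: The axis to sample along (0 for rows, 1 for columns; default is None, which means rows for a DataFrame).
- ignore_index: Whether to relabel the result 0, 1, ..., n - 1 instead of keeping the original index labels (default is False; available since pandas 1.3).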
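The weights parameter is the one option not exercised in the examples in this post, so here is a minimal sketch of it. The DataFrame and the weight values are made up for illustration. Rows with weight 0 can never be drawn, and pandas normalizes the weights itself if they do not sum to 1.

```python
import pandas as pd

# Illustrative data, not taken from any real dataset
df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['a', 'b', 'c', 'd', 'e']})

# One weight per row: rows 0 and 1 are excluded, row 4 is twice as likely
# to be drawn as row 2 or row 3
row_weights = [0, 0, 1, 1, 2]

weighted = df.sample(n=2, weights=row_weights, random_state=0)
print(weighted)  # both rows come from index labels 2, 3 and 4
```

The weights can also be given as the name of a column in the DataFrame when sampling rows.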
Basic Usage
Let’s start with a basic example to understand how the sample() method works. We’ll create a simple DataFrame to work with:
import pandas as pd
# Creating a sample DataFrame
data = {'A': [1, 2, 3, 4, 5], 'B': ['a', 'b', 'c', 'd', 'e']}
df = pd.DataFrame(data)
print(df)
Output:
A B
0 1 a
1 2 b
2 3 c
3 4 d
4 5 e
1. Sampling with a Specific Number of Rows
You can use the n parameter to specify the number of rows you want to sample. For example, to sample 3 rows:
sampled_df = df.sample(n=3)
print(sampled_df)
This returns 3 randomly chosen rows, so the exact rows differ from run to run. Each row keeps its original index label.
Output (may vary):
A B
2 3 c
1 2 b
3 4 d
2. Sampling with a Fraction of Rows
Alternatively, you can use the frac parameter to specify the fraction of rows to sample. For example, to sample 40% of the rows:
sampled_df = df.sample(frac=0.4)
print(sampled_df)
This returns a random 40% of the rows, which is 2 of the 5 rows here.
Output (may vary):
A B
2 3 c
4 5 e
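To make the relationship between frac and the row count concrete, here is a small sketch on the same five-row data. The number of rows returned is frac times the number of rows, rounded to a whole number. It also shows a common idiom, frac=1, which returns every row in random order. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['a', 'b', 'c', 'd', 'e']})

# 0.4 * 5 rows = 2 rows
two_rows = df.sample(frac=0.4, random_state=0)
print(len(two_rows))  # 2

# frac=1 keeps every row but shuffles the order
shuffled = df.sample(frac=1, random_state=0)
print(len(shuffled))  # 5
print(sorted(shuffled.index))  # [0, 1, 2, 3, 4]
```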
3. Sampling with Replacement
By default, the sample() method samples without replacement. If you want to sample with replacement, you can set the replace parameter to True:
sampled_df = df.sample(n=3, replace=True)
print(sampled_df)
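Because every draw is made from the full DataFrame, the same row can appear more than once in the result, although it does not have to (the run shown here has no duplicates).
Output (may vary):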
A B
3 4 d
4 5 e
2 3 c
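One practical consequence of replace is that it decides whether you can ask for more rows than the DataFrame holds. The sketch below, using the same five-row data, requests 8 rows both ways. The flag name raised is only there for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['a', 'b', 'c', 'd', 'e']})

# Without replacement, asking for more rows than exist raises ValueError
raised = False
try:
    df.sample(n=8)
except ValueError as err:
    raised = True
    print("replace=False failed:", err)

# With replacement the same request works, and some rows must repeat
oversampled = df.sample(n=8, replace=True, random_state=0)
print(len(oversampled))  # 8
print(oversampled.index.duplicated().any())  # True: 8 draws from only 5 rows
```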
4. Sampling Columns
To sample columns instead of rows, set the axis parameter to 1:
sampled_columns = df.sample(n=1, axis=1)
print(sampled_columns)
This returns 1 randomly chosen column with all of its rows.
Output (may vary):
A
0 1
1 2
2 3
3 4
4 5
5. Setting a Random Seed
To ensure reproducibility, you can set a random seed using the random_state parameter:
sampled_df = df.sample(n=3, random_state=42)
print(sampled_df)
This will always produce the same sample when run multiple times with the same random seed.
Output:
A B
1 2 b
4 5 e
2 3 c
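A quick way to convince yourself of this is to draw twice with the same seed and compare the results. This is a minimal check on the same five-row data, with illustrative variable names.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['a', 'b', 'c', 'd', 'e']})

# Two independent calls with the same seed
first = df.sample(n=3, random_state=42)
second = df.sample(n=3, random_state=42)

# Same rows, in the same order
print(first.equals(second))  # True
```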
Additional Examples
Example 1: Sampling a Specific Number of Rows
sampled_df = df.sample(n=2)
print(sampled_df)
Output (may vary):
A B
1 2 b
4 5 e
This example randomly selects 2 rows from the DataFrame. The selected rows, which can vary each time the method is run, are shown in the output.
Example 2: Sampling a Fraction of Rows
sampled_df = df.sample(frac=0.6)
print(sampled_df)
Output (may vary):
A B
3 4 d
0 1 a
1 2 b
This example randomly samples 60% of the rows from the DataFrame. The selected rows will vary, providing a subset based on the specified fraction.
Example 3: Sampling with Replacement
sampled_df = df.sample(n=4, replace=True)
print(sampled_df)
Output (may vary):
A B
4 5 e
0 1 a
2 3 c
1 2 b
In this example, 4 rows are randomly sampled with replacement, meaning the same row can appear multiple times in the output.
Example 4: Sampling Columns
sampled_columns = df.sample(n=1, axis=1)
print(sampled_columns)
Output (may vary):
B
0 a
1 b
2 c
3 d
4 e
This example randomly selects 1 column from the DataFrame, displaying only the data from the chosen column.
Example 5: Setting a Random Seed
sampled_df = df.sample(n=3, random_state=1)
print(sampled_df)
Output:
A B
2 3 c
1 2 b
4 5 e
This example samples 3 rows from the DataFrame with a fixed random seed, ensuring the same rows are selected each time the code is run, providing reproducible results.
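A fixed seed is also what makes a train/test split repeatable. The sketch below is one simple way to build such a split with sample() and drop(). It assumes the index labels are unique, and the 80/20 ratio and the names train and test are illustrative choices rather than anything prescribed by Pandas.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['a', 'b', 'c', 'd', 'e']})

# Draw 80% of the rows for training (4 of the 5 rows here)
train = df.sample(frac=0.8, random_state=0)

# Everything that was not drawn becomes the test set
# (relies on the index labels being unique)
test = df.drop(train.index)

print(len(train), len(test))  # 4 1
```

For stratified splits or more control over the split, a dedicated tool such as scikit-learn's train_test_split is usually a better fit.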
Conclusion
The sample() method is a versatile and useful function in Pandas for randomly sampling rows or columns from a DataFrame. Whether you need to create training and testing datasets, perform random sampling for analysis, or just explore a subset of your data, the sample() method provides a straightforward and flexible approach. By understanding and utilizing its various parameters, you can tailor the sampling process to meet your specific needs.