Sampling is a crucial technique in data analysis and machine learning, allowing you to work with manageable subsets of large datasets, perform statistical analysis, and validate models. While the basic sample() method in Pandas provides essential functionality, more advanced sampling techniques can be employed to achieve specific goals. In this post, we'll delve into four of them: stratified sampling, systematic sampling, weighted sampling, and reservoir sampling.
1. Stratified Sampling
Stratified sampling ensures that the sample represents different subgroups (strata) within the dataset. This is particularly useful when the population has distinct subgroups and you want each subgroup to be proportionally represented in the sample.
Example: Stratified Sampling
Consider a dataset with two columns: Category and Value. We want to sample 30% of the data while maintaining the proportion of each category.
import pandas as pd
# Sample data: three rows in each of three categories
data = {
    'Category': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'Value': [10, 20, 30, 40, 50, 60, 70, 80, 90]
}
df = pd.DataFrame(data)
# Stratified sampling: draw 30% of the rows within each category
stratified_sample = df.groupby('Category', group_keys=False).apply(lambda x: x.sample(frac=0.3))
print(stratified_sample)
Output (example; no random_state is set, so the rows vary between runs):
  Category  Value
0        A     10
4        B     50
7        C     80
The example uses groupby and apply in Pandas to draw the same fraction from every category, so each category keeps roughly its share of the data in the sample. With only three rows per category, frac=0.3 rounds to one row per group.
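On pandas 1.1 and later, the same per-group draw can also be written with the groupby object's own sample method, which also sidesteps the deprecation warning that some recent pandas releases emit when apply touches the grouping column. The sketch below is illustrative only: the 100-row dataset and the seed value are assumptions, chosen so that the proportions are easy to see.

```python
import pandas as pd

# Hypothetical dataset: 60 rows of A, 30 of B, 10 of C.
df = pd.DataFrame({
    "Category": ["A"] * 60 + ["B"] * 30 + ["C"] * 10,
    "Value": range(100),
})

# Draw 30% of each category; random_state makes the draw repeatable.
stratified_sample = df.groupby("Category").sample(frac=0.3, random_state=42)

# Compare category shares in the full data and in the sample.
print(df["Category"].value_counts(normalize=True))
print(stratified_sample["Category"].value_counts(normalize=True))
```

Because every group is sampled at the same fraction, the category shares in the sample should match those in the full data, up to rounding within each group.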
2. Systematic Sampling
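Systematic sampling involves selecting every k-th element from a list or dataset. This method is straightforward, and when the list is in random order (and ideally the starting point is chosen at random) it behaves much like a simple random sample. If the data has a periodic pattern that lines up with k, the sample can be biased.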
Example: Systematic Sampling
Suppose we have a DataFrame with 100 rows and we want to sample every 10th row.
import numpy as np
import pandas as pd
# Sample data
data = {'Value': np.arange(1, 101)}
df = pd.DataFrame(data)
# Systematic sampling
k = 10
systematic_sample = df.iloc[::k, :]
print(systematic_sample)
Output:
    Value
0       1
10     11
20     21
30     31
40     41
50     51
60     61
70     71
80     81
90     91
This example shows how to select every k-th row from a DataFrame using slicing. With k=10, the slice starts at row 0 and takes every 10th row, giving 10 rows from a dataset of 100.
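The slice above always starts at row 0. A common variant picks the starting row at random from the first k rows, so that every row has the same chance of being selected. Here is a minimal sketch of that idea; the helper name systematic_sample and its seed parameter are hypothetical, introduced only for this illustration.

```python
import numpy as np
import pandas as pd

def systematic_sample(frame, k, seed=None):
    """Return every k-th row of frame, starting at a random offset in [0, k)."""
    rng = np.random.default_rng(seed)
    start = int(rng.integers(0, k))  # random starting row, upper bound excluded
    return frame.iloc[start::k]

df = pd.DataFrame({"Value": np.arange(1, 101)})
sample = systematic_sample(df, k=10, seed=0)
print(sample)
```

With 100 rows and k=10, this still returns 10 rows spaced 10 apart; only the offset changes between seeds.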
3. Weighted Sampling
Weighted sampling allows for different probabilities of selection for each element in the dataset. This is useful when some elements are more important or more frequent than others.
Example: Weighted Sampling
Consider a dataset where some rows have higher weights, indicating a higher probability of being selected.
# Sample data
data = {
    'Item': ['A', 'B', 'C', 'D', 'E'],
    'Weight': [0.1, 0.2, 0.3, 0.4, 0.5]
}
df = pd.DataFrame(data)
# Weighted sampling
weighted_sample = df.sample(n=3, weights='Weight', random_state=1)
print(weighted_sample)
Output (illustrative; the exact rows depend on the seed and on the pandas/NumPy versions in use):
  Item  Weight
4    E     0.5
3    D     0.4
2    C     0.3
In this example, rows are sampled based on their specified weights using the sample method with the weights parameter, so rows with larger weights are more likely to be selected. The weights here sum to 1.5 rather than 1; Pandas normalizes them to sum to 1 before sampling.
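One way to see the effect of the weights is to draw many rows with replacement and compare how often each item appears with its normalized weight. The sketch below does that; the draw count of 20,000 and the seed are arbitrary assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Item": ["A", "B", "C", "D", "E"],
    "Weight": [0.1, 0.2, 0.3, 0.4, 0.5],
})

# Selection probabilities after normalization (the weights sum to 1.5 here).
expected = df.set_index("Item")["Weight"] / df["Weight"].sum()

# Draw with replacement so every draw uses the same probabilities.
draws = df.sample(n=20_000, weights="Weight", replace=True, random_state=0)
observed = draws["Item"].value_counts(normalize=True).sort_index()

print(pd.DataFrame({"expected": expected, "observed": observed}))
```

The observed frequencies should land close to the expected ones. Note that this check uses replace=True; when sampling without replacement, as in the example above, the chance of a row appearing in the final sample is no longer simply its normalized weight.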
4. Reservoir Sampling
Reservoir sampling is used to randomly select a fixed-size sample from a large or unknown-size dataset, typically in a single pass. This method is efficient for streaming data.
Example: Reservoir Sampling
Here, we implement a simple version of reservoir sampling to select 5 samples from a stream of data.
import random

def reservoir_sampling(stream, sample_size):
    sample = []
    for i, item in enumerate(stream):
        if i < sample_size:
            # Fill the reservoir with the first sample_size items
            sample.append(item)
        else:
            # Pick a slot in [0, i]; replace only if it falls inside the reservoir
            j = random.randint(0, i)
            if j < sample_size:
                sample[j] = item
    return sample

# Stream data
stream_data = range(1, 101)
sample_size = 5
reservoir_sample = reservoir_sampling(stream_data, sample_size)
print(reservoir_sample)
Output (example):
[8, 26, 59, 83, 99]
The example implements reservoir sampling to randomly select a fixed-size sample from a stream of data in a single pass. This is useful for efficiently handling large or streaming datasets.
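To check that this scheme (commonly known as Algorithm R) gives every item the same chance of ending up in the sample, one can run it many times and count how often each item is kept. The sketch below does this with a generator, so the stream length is not known in advance. The rng parameter, the number_stream helper, and the trial count are assumptions added for this illustration.

```python
import random
from collections import Counter

def reservoir_sampling(stream, sample_size, rng=None):
    """Keep sample_size items from an iterable of unknown length, in one pass."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < sample_size:
            sample.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)    # inclusive on both ends
            if j < sample_size:
                sample[j] = item     # happens with probability sample_size / (i + 1)
    return sample

def number_stream(n):
    """Hypothetical stream: a generator yielding 1..n."""
    for value in range(1, n + 1):
        yield value

rng = random.Random(0)  # seeded so the experiment is repeatable
counts = Counter()
trials = 5_000
for _ in range(trials):
    counts.update(reservoir_sampling(number_stream(20), 5, rng))

# Each of the 20 items should appear in roughly 5/20 = 25% of the samples.
frequencies = {item: counts[item] / trials for item in range(1, 21)}
print(min(frequencies.values()), max(frequencies.values()))
```

If the stream holds fewer items than sample_size, the function simply returns all of them.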
Conclusion
Advanced sampling methods, in Pandas and in plain Python, offer flexibility and precision for various data analysis tasks. Stratified sampling ensures representation of subgroups, systematic sampling provides simplicity and structure, weighted sampling accounts for different selection probabilities, and reservoir sampling is efficient for streaming data. By understanding when each technique applies, you can draw samples that better reflect your data and make your analysis and machine learning results more reliable.
Experiment with these methods to see how they can be applied to your specific use cases. Happy sampling!