Advanced Sampling Methods in Pandas | Explanazon

Sampling is a core technique in data analysis and machine learning: it lets you work with manageable subsets of large datasets, perform statistical analysis, and validate models. The basic sample() method in Pandas covers simple random sampling, but other sampling schemes are better suited to specific goals. In this post, we'll look at four of them: stratified sampling, systematic sampling, weighted sampling, and reservoir sampling.

1. Stratified Sampling

Stratified sampling ensures that the sample represents different subgroups (strata) within the dataset. This is particularly useful when the population has distinct subgroups and you want each subgroup to be proportionally represented in the sample.

Example: Stratified Sampling

Consider a dataset with two columns: Category and Value. We want to sample 30% of the data while maintaining the proportion of each category.

Python
import pandas as pd

# Sample data
data = {
    'Category': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'Value': [10, 20, 30, 40, 50, 60, 70, 80, 90]
}
df = pd.DataFrame(data)

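# Stratified sampling: draw 30% of the rows within each category
# (GroupBy.sample requires pandas 1.1 or later)
stratified_sample = df.groupby('Category').sample(frac=0.3)
print(stratified_sample)

Output (example; the selected rows vary between runs because no random_state is set):

  Category  Value
0        A     10
4        B     50
7        C     80

The example keeps the proportion of each category in the sample by grouping with groupby and then sampling the same fraction within every group. With only three rows per category, frac=0.3 rounds to one row each; note that very small groups can round down to zero rows, so check group sizes before choosing the fraction.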
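To check that proportions really are preserved, it helps to compare category shares before and after sampling on a dataset with unequal group sizes. The following is a minimal sketch, assuming pandas 1.1 or later (for GroupBy.sample); the 60/30/10 split and the names df_big, sample_big, and shares are made up for illustration.

```python
import pandas as pd

# Hypothetical imbalanced data: 60 rows of A, 30 of B, 10 of C
df_big = pd.DataFrame({
    "Category": ["A"] * 60 + ["B"] * 30 + ["C"] * 10,
    "Value": range(100),
})

# Draw 30% of the rows within each category
sample_big = df_big.groupby("Category").sample(frac=0.3, random_state=0)

# Compare each category's share in the full data and in the sample
shares = pd.DataFrame({
    "population": df_big["Category"].value_counts(normalize=True),
    "sample": sample_big["Category"].value_counts(normalize=True),
}).sort_index()
print(shares)
```

Because the same fraction is taken from every group, the two columns should agree up to rounding of the per-group row counts.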

2. Systematic Sampling

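Systematic sampling selects every k-th element from a list or dataset, usually after choosing a random starting point among the first k elements. It is simple to implement, but it behaves like a simple random sample only when the order of the list is unrelated to the values being measured; a periodic pattern in the data can bias the result.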

Example: Systematic Sampling

Suppose we have a DataFrame with 100 rows and we want to sample every 10th row.

Python
import numpy as np

# Sample data
data = {'Value': np.arange(1, 101)}
df = pd.DataFrame(data)

# Systematic sampling
k = 10
systematic_sample = df.iloc[::k, :]
print(systematic_sample)

Output:

    Value
0       1
10     11
20     21
30     31
40     41
50     51
60     61
70     71
80     81
90     91

This example shows how to select every k-th row from a DataFrame using positional slicing with iloc. With k=10 and a fixed start at row 0, it returns 10 of the 100 rows.
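Always starting at row 0 means the same rows are picked every time. A common variant chooses the starting offset at random from the first k rows. The sketch below illustrates that idea; the names df_sys, start, and systematic_random_start are illustrative, and the seed is fixed only to make the run repeatable.

```python
import numpy as np
import pandas as pd

df_sys = pd.DataFrame({"Value": np.arange(1, 101)})

k = 10
rng = np.random.default_rng(42)

# Pick a random starting offset in [0, k), then take every k-th row from there
start = int(rng.integers(0, k))
systematic_random_start = df_sys.iloc[start::k]

print(f"start offset: {start}")
print(systematic_random_start)
```

Every row now has the same 1-in-k chance of being selected, although rows are still chosen in fixed steps rather than independently.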

3. Weighted Sampling

Weighted sampling allows for different probabilities of selection for each element in the dataset. This is useful when some elements are more important or more frequent than others.

Example: Weighted Sampling

Consider a dataset where some rows have higher weights, indicating a higher probability of being selected.

Python
# Sample data
data = {
    'Item': ['A', 'B', 'C', 'D', 'E'],
    'Weight': [0.1, 0.2, 0.3, 0.4, 0.5]
}
df = pd.DataFrame(data)

# Weighted sampling
weighted_sample = df.sample(n=3, weights='Weight', random_state=1)
print(weighted_sample)

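Output (illustrative; which rows are drawn depends on the random number generator, and lower-weight rows can also be selected):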

  Item  Weight
4    E     0.5
3    D     0.4
2    C     0.3

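In this example, rows are sampled according to their weights by passing the column name to the weights parameter of sample. The weights here sum to 1.5, so Pandas normalizes them to sum to 1 before drawing; row E, for instance, has a 0.5 / 1.5 = 1/3 chance of being picked first. Because sample draws without replacement by default, each row appears at most once.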
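One way to see the effect of the weights is to draw many rows with replacement and compare the observed frequencies with the normalized weights. This is a rough sketch; the draw count of 30,000 and the names items, expected, observed, and comparison are arbitrary choices for illustration.

```python
import pandas as pd

items = pd.DataFrame({
    "Item": ["A", "B", "C", "D", "E"],
    "Weight": [0.1, 0.2, 0.3, 0.4, 0.5],
})

# Selection probabilities after normalising the weights to sum to 1
expected = items.set_index("Item")["Weight"] / items["Weight"].sum()

# Many independent draws (with replacement) so frequencies can be compared
draws = items.sample(n=30_000, weights="Weight", replace=True, random_state=0)
observed = draws["Item"].value_counts(normalize=True).sort_index()

comparison = pd.DataFrame({"expected": expected, "observed": observed})
print(comparison.round(3))
```

The observed column should sit close to the expected one, with small differences due to random variation.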

4. Reservoir Sampling

Reservoir sampling is used to randomly select a fixed-size sample from a large or unknown-size dataset, typically in a single pass. This method is efficient for streaming data.

Example: Reservoir Sampling

Here, we implement a simple version of reservoir sampling to select 5 samples from a stream of data.

Python
import random

def reservoir_sampling(stream, sample_size):
    sample = []
    for i, item in enumerate(stream):
        if i < sample_size:
            sample.append(item)
        else:
            j = random.randint(0, i)
            if j < sample_size:
                sample[j] = item
    return sample

# Stream data
stream_data = range(1, 101)
sample_size = 5
reservoir_sample = reservoir_sampling(stream_data, sample_size)
print(reservoir_sample)

Output (example):

[8, 26, 59, 83, 99]

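The example implements reservoir sampling to select a fixed-size sample from a stream in a single pass. The first sample_size items fill the reservoir; after that, the item at position i (counting from 0) replaces a random slot with probability sample_size / (i + 1). As a result, every item in a stream of n items ends up in the sample with the same probability, sample_size / n, without the stream length being known in advance.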
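The same idea can be applied to a file that is too large to load at once by reading it in chunks. The sketch below is one possible way to do this, not a built-in Pandas feature: reservoir_sample_csv is a hypothetical helper, the in-memory CSV stands in for a large file, and the chunk size and seed are arbitrary.

```python
import io
import random

import pandas as pd


def reservoir_sample_csv(csv_source, sample_size, chunksize=20, seed=None):
    """Keep a uniform random sample of rows while reading a CSV in chunks."""
    rng = random.Random(seed)
    reservoir = []   # sampled rows, stored as plain tuples
    columns = None
    seen = 0         # number of rows processed so far
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        columns = chunk.columns
        for row in chunk.itertuples(index=False, name=None):
            if seen < sample_size:
                # Fill the reservoir with the first sample_size rows
                reservoir.append(row)
            else:
                # Replace a random slot with probability sample_size / (seen + 1)
                j = rng.randint(0, seen)
                if j < sample_size:
                    reservoir[j] = row
            seen += 1
    return pd.DataFrame(reservoir, columns=columns)


# Stand-in for a large file: a 100-row CSV held in memory
csv_text = "Value\n" + "\n".join(str(v) for v in range(1, 101))
sample_df = reservoir_sample_csv(io.StringIO(csv_text), sample_size=5, seed=0)
print(sample_df)
```

Only one chunk and the reservoir are held in memory at any time, so memory use depends on chunksize and sample_size rather than on the size of the file.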

Conclusion

These sampling methods give you more control than a plain call to sample(). Stratified sampling preserves the representation of subgroups, systematic sampling offers a simple, evenly spaced selection, weighted sampling lets selection probabilities differ between rows, and reservoir sampling handles streams or datasets too large to hold in memory. Choosing the method that matches the structure of your data makes your analysis and machine learning results more reliable.

Experiment with these methods to see how they can be applied to your specific use cases. Happy sampling!
